LightGBMClassifier suffers a great loss in quality in Single Dataset Mode when chunkSize is too small #1478
Comments
@fonhorst wow, that's so weird! There must be a major bug somewhere. This parameter is only for batching/copying data. It shouldn't have any impact on metrics at all. It can only have some impact on execution time if the chunk size is a small value and there is a lot of data, since it would increase the number of copies done.
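For intuition, here is a minimal, hypothetical sketch (not SynapseML's actual copy code) of what chunked copying means: the chunk size should only change how many copy calls are made, not the values that end up in the destination.

```python
# Minimal, hypothetical sketch (not SynapseML's copy code): chunking a copy
# changes how many copy calls happen, not which values get copied.
def copy_in_chunks(src, dst, chunk_size):
    """Copy `src` into `dst`, chunk_size elements at a time."""
    for start in range(0, len(src), chunk_size):
        end = min(start + chunk_size, len(src))
        dst[start:end] = src[start:end]  # one bulk copy per chunk

values = list(range(10))
out = [None] * len(values)
copy_in_chunks(values, out, chunk_size=3)
assert out == values  # same result for any chunk_size >= 1
```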
I think another posted issue may be related to the problem I described. Check #1404.
@fonhorst where can I get the dataset you used? Is it the same bankruptcy dataset as this one in our overview: I'm trying to run this line:
I tried this on a cluster with 8 workers, 4 cores each, and 1 driver, also 4 cores. For chunkSize=10k, my AUC was 0.7055860805860806. It looks like I can reproduce this issue right now, will take a deeper look into it.
Interestingly, when I turn off single dataset mode, the chunk size can be 100, 1k or 10k, but I still get the same results.
The dataset was from the SynapseML example. Since I'm not running on Azure, I took the csv file from here: Sorry for the lightautoml import; it is not really necessary for the example. Here is the updated version. "Interestingly, when I turn off single dataset mode, the chunk size can be 100, 1k or 10k, but I still get the same results."
@fonhorst I was able to reproduce the issue locally, but interestingly only on this particular dataset - on a different dataset I did not see this issue. I found something strange. I added debug here: And the last chunk count seems to be higher than I would expect, and the numbers don't quite make sense to me. I'll continue to investigate. Run1: Run2:
Note: the first 4 are from labels and the last 4 are from the dataset (the debug output is mixed across lines since it's from 4 threads writing at the same time).
@fonhorst I was able to fix the issue locally, the problem was that chunkSize was much larger for the features array (specifically, it was numCols * chunkSize, instead of just chunkSize). I will send a PR soon. Thank you for your patience.
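For readers following along, here is a hypothetical, simplified illustration (in Python, not the actual Scala/native code) of how a mismatch between the chunk size used to write a chunked buffer and the chunk size assumed when reading it back can drop or misalign values:

```python
# Hypothetical, simplified illustration (not the actual SynapseML Scala/native
# code): if a chunked buffer is written with one chunk size but read back
# assuming another, values are dropped or land in the wrong positions.
def to_chunks(values, chunk_size):
    """Write a flat array into fixed-size chunks, zero-padding the last one."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        chunk = values[i:i + chunk_size]
        chunk = chunk + [0.0] * (chunk_size - len(chunk))
        chunks.append(chunk)
    return chunks

def from_chunks(chunks, chunk_size, count):
    """Read `count` values back, assuming `chunk_size` valid values per chunk."""
    flat = []
    for chunk in chunks:
        flat.extend(chunk[:chunk_size])
    return flat[:count]

features = [float(i) for i in range(10)]
written = to_chunks(features, chunk_size=8)              # writer uses an inflated chunk size
restored = from_chunks(written, chunk_size=2, count=10)  # reader assumes the small chunk size
print(restored == features)  # False: most of the original values never come back
```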
@imatiach-msft Thank you very much for your swift response!
You can try the build at: Maven Coordinates Maven Resolver
Closing the issue, as the PR to fix it has been merged: The fix will be in the next release after the current 0.9.5 (I'm assuming 0.9.6).
Hi @imatiach-msft, I just want to understand this bug further.
From what I understand, the features array had a larger chunkSize because numCols was multiplied into the chunkSize input; how does this affect the model quality? Does this mean that the label and features arrays do not align when copying them to the workers of the cluster? Sorry for the newbie question.
@Vonatzki yes, there was a bug in the code that copies the data over from Java to the native LightGBM layer. If chunkSize was set to a low value, some of the values were not copied correctly, hence the drop in performance metrics. This affects the current SynapseML 0.9.5 release when useSingleDatasetMode=True, which is on by default since 0.9.5. The issue will be fixed in the next release, and it is already fixed on current master.
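Until a release containing the fix is out, two workarounds consistent with the observations in this thread are to disable single dataset mode or to set chunkSize to the known row count. A minimal sketch (assuming `train_df` is the training DataFrame):

```python
from synapse.ml.lightgbm import LightGBMClassifier

# Workaround sketches based on this thread (assumes `train_df` is the training
# DataFrame); neither is needed once the fixed release is available.

# Option 1: turn off single dataset mode -- results were observed to be
# identical for chunkSize = 100, 1k, or 10k in this configuration.
clf_no_single_dataset = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="Bankrupt?",
    useSingleDatasetMode=False,
)

# Option 2: keep single dataset mode but set chunkSize to the known row count,
# as the chunkSize docstring recommends when the dataset size is known.
clf_large_chunks = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="Bankrupt?",
    useSingleDatasetMode=True,
    chunkSize=train_df.count(),
)
```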
Please see the PR fix here, with a longer description of the issue and the code changes:
Thank you for your response and the snapshot build fix! It helped me a lot!
Describe the bug
If the chunkSize parameter of LightGBMClassifier is less than dataset_size / (cores_per_exec * num_of_execs), the resulting model degrades to very poor prediction quality. See the table below with metrics on the test part of the data for different cores/chunkSize combinations.
To Reproduce
Check the code in the attachment lightgbm_chunk_size_problem_upd.zip
It happens when I use single dataset mode (useSingleDataset=True).
The full set of settings used for the company_bankruptcy_prediction dataset:
objective="binary",
featuresCol="features",
labelCol="Bankrupt?",
useSingleDatasetMode=True,
numThreads=max(1, num_cores - 1),
chunkSize=chunk_size,
isProvideTrainingMetric=True,
verbosity=10,
isUnbalance=True
The train part size: 5812
For each run, the dataset was repartitioned with '.repartition(num_cores)' to get an equal number of records in each partition; a sketch of the whole setup is shown below.
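For reference, a hedged sketch of the reproduction setup described above (the exact script is in the attached zip; names such as `train_df`, `test_df`, `num_cores`, and `chunk_size` are placeholders):

```python
from synapse.ml.lightgbm import LightGBMClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Sketch of the setup described above; train_df, test_df, num_cores, and
# chunk_size are placeholders -- the exact code is in the attached zip.
train_df = train_df.repartition(num_cores)  # equal-sized partitions per core

classifier = LightGBMClassifier(
    objective="binary",
    featuresCol="features",
    labelCol="Bankrupt?",
    useSingleDatasetMode=True,
    numThreads=max(1, num_cores - 1),
    chunkSize=chunk_size,
    isProvideTrainingMetric=True,
    verbosity=10,
    isUnbalance=True,
)

model = classifier.fit(train_df)
predictions = model.transform(test_df)
auc = BinaryClassificationEvaluator(
    labelCol="Bankrupt?",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC",
).evaluate(predictions)
print(auc)  # drops sharply once chunk_size falls below the rows-per-core count
```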
Expected behavior
Prediction quality should stay the same for different values of the chunkSize parameter. At the very least, if this is the desired behavior, the documentation should explain the impact on quality in more detail.
Info (please complete the following information):
Additional context
LightGBMClassifier's numThreads was set to num_cores - 1, following the recommendation in #1316.
The doc says:
"Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy.If dataset size is known beforehand, set to the number of rows in the dataset."
It can be seen that a model run with a chunkSize less than dataset_size / num_cores breaks completely, resulting in no predictive quality at all, but once chunkSize becomes larger than dataset_size / num_cores (these were runs in local mode) everything is fine. The boundary seems to be quite sharp: for 4 cores the model breaks going from chunkSize 1500 down to 1400, consistent with 5812 / 4 ≈ 1453 rows per core.
When the model breaks, I see frequent appearances of the following record in the logs:
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
Given all of the above, I have several questions:
I also observe the same behavior for a much larger dataset with hundreds of thousands of rows.
AB#1748520