LightGBMClassifier suffers a great loss in quality in Single Dataset Mode if running with not enough chunkSize #1478

Closed
fonhorst opened this issue Apr 14, 2022 · 19 comments

@fonhorst

fonhorst commented Apr 14, 2022

Describe the bug
If the chunkSize parameter of LightGBMClassifier is less than dataset_size / (cores_per_exec * num_of_execs), the resulting model's prediction quality degrades severely. See the table below with predictions on the test part of the data for different numbers of cores and chunkSize values.
[Image: Screenshot from 2022-04-14 17-12-00]

To Reproduce
See the code in the attachment lightgbm_chunk_size_problem_upd.zip.

It happens when I use single dataset mode (useSingleDatasetMode=True).
The full set of settings used for the company_bancruptacy_prediction dataset:
objective="binary",
featuresCol="features",
labelCol="Bankrupt?",
useSingleDatasetMode=True,
numThreads=max(1, num_cores - 1),
chunkSize=chunk_size,
isProvideTrainingMetric=True,
verbosity=10,
isUnbalance=True

The train part size: 5812
For each run, the dataset was repartitioned with '.repartition(num_cores)' so that each partition holds an equal number of records.

Expected behavior
Prediction quality should stay the same for different values of the chunkSize parameter. If this is the intended behavior, the documentation should at least explain the impact on quality in more detail.

Info (please complete the following information):

  • SynapseML Version: 0.9.5
  • Spark Version 3.2.0
  • Spark Platform: local mode, on-premise Kubernetes cluster

Additional context

LightGBMClassifier's numThreads was set to num_cores - 1, following the recommendation in #1316.

The doc says:
"Advanced parameter to specify the chunk size for copying Java data to native. If set too high, memory may be wasted, but if set too low, performance may be reduced during data copy. If dataset size is known beforehand, set to the number of rows in the dataset."

A model run with a chunkSize less than dataset_size / num_cores breaks completely, with no predictive quality at all. But once chunkSize becomes larger than dataset_size / num_cores (these runs were in local mode), everything is fine. The boundary seems to be quite sharp: for 4 cores, the model breaks when going from 1500 to 1400.
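
The arithmetic behind this boundary can be sketched as follows. This is only an illustration of the reporter's empirical observation, not a documented rule of the library; the helper names `rows_per_partition` and `chunk_size_is_affected` are hypothetical.

```python
import math


def rows_per_partition(num_rows: int, num_partitions: int) -> int:
    """Largest partition size after an even repartition of num_rows."""
    return math.ceil(num_rows / num_partitions)


def chunk_size_is_affected(chunk_size: int, num_rows: int, num_partitions: int) -> bool:
    """True if chunk_size is below the per-partition row count,
    the regime in which the quality drop was reported."""
    return chunk_size < rows_per_partition(num_rows, num_partitions)


train_rows = 5812  # train part size from the report
cores = 4          # local-mode run with 4 cores

print(rows_per_partition(train_rows, cores))            # 1453
print(chunk_size_is_affected(1500, train_rows, cores))  # False
print(chunk_size_is_affected(1400, train_rows, cores))  # True
```

With 5812 rows on 4 cores, each partition holds 1453 rows, which sits between the working value (1500) and the failing value (1400).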

When the model breaks, the following records appear frequently in the logs:
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf

Given all of the above, I have several questions:

  1. Why does it break completely?
  2. Does that mean I cannot process the dataset if it doesn't fit into the memory of all executors combined (instead of just processing it more slowly)?
  3. Maybe there is a correlation with other settings that I have missed?

I also observe the same behavior for a much larger dataset with hundreds of thousands of rows.

AB#1748520

@imatiach-msft
Contributor

imatiach-msft commented Apr 14, 2022

@fonhorst wow, that's so weird! There must be a major bug somewhere. This parameter is only for batching/copying data. It shouldn't have any impact on metrics at all. It can only have some impact on execution time if the chunk size is a small value and there is a lot of data, since it would increase the number of copies done.
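
The expectation stated here, that chunking only batches the copy and cannot change the data, can be illustrated with a toy model. This is a minimal sketch in plain Python, not the SynapseML implementation; `copy_in_chunks` and `coalesce` are hypothetical names.

```python
from typing import List


def copy_in_chunks(values: List[float], chunk_size: int) -> List[List[float]]:
    """Split values into consecutive chunks of at most chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def coalesce(chunks: List[List[float]]) -> List[float]:
    """Concatenate the chunks back into one flat list, preserving order."""
    flat: List[float] = []
    for chunk in chunks:
        flat.extend(chunk)
    return flat


data = [float(i) for i in range(5812)]
for chunk_size in (100, 1000, 10000):
    chunks = copy_in_chunks(data, chunk_size)
    # A correct chunked copy reproduces the input exactly;
    # only the number of copies changes with the chunk size.
    assert coalesce(chunks) == data
    print(chunk_size, len(chunks))
```

In a correct implementation a smaller chunk size only increases the number of chunks, which is why a change in metrics points to a bug.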

@fonhorst
Author

I think another reported issue may be related to the problem I described; see #1404.

@imatiach-msft
Contributor

@fonhorst where can I get the dataset you used? Is it the same bankruptcy dataset as the one in our overview:
https://github.com/microsoft/SynapseML/blob/master/website/versioned_docs/version-0.9.5/features/lightgbm/LightGBM%20-%20Overview.md

I'm trying to run this line:
df = (
    spark.read.format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load("/opt/spark_data/company_bancruptacy_prediction.csv")
    .repartition(num_cores)
    .cache()
)
Did you just use the same small dataset, or are you running on much larger distributed data?

@imatiach-msft
Contributor

"LightGBMClassifier's numThreads was set num_cores - 1 following what is recommended #1316"
I noticed you were doing this in the script. However, we do this automatically now, as part of PR #1282 , so it's not needed.

@imatiach-msft
Contributor

I don't have lightautoml installed. I'll just skip it for now in your script:

image

@imatiach-msft
Contributor

I tried this on a cluster with 8 workers, 4 cores each, and 1 driver, also 4 cores.

For chunksize=10k, my AUC was 0.7055860805860806
For chunksize=1k, my AUC was 0.5897054334554336

It looks like I can reproduce this issue right now, will take a deeper look into it.

@imatiach-msft
Contributor

Interestingly, when I turn off single dataset mode, the chunk size can be 100, 1k or 10k, but I still get the same results.

@fonhorst
Author

The dataset is from the SynapseML example:
https://github.com/microsoft/SynapseML/blob/master/website/versioned_docs/version-0.9.5/features/lightgbm/LightGBM%20-%20Overview.md

Since I'm not running on Azure, I took the CSV file from here:
https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction

Sorry for the lightautoml import. It is not really necessary for the example. Here is the updated version.
lightgbm_chunk_size_problem_upd.zip

"Interestingly, when I turn off single dataset mode, the chunk size can be 100, 1k or 10k, but I still get the same results."
I can confirm the same. There is no such problem in this mode. But training is faster with useSingleDatasetMode=True, which is why I use it.

@imatiach-msft
Contributor

@fonhorst I was able to reproduce the issue locally, but interestingly only on this particular dataset. On a different dataset, I did not see this issue. I found something strange. I added debug output here:

https://github.com/microsoft/SynapseML/blob/master/lightgbm/src/main/scala/com/microsoft/azure/synapse/ml/lightgbm/dataset/DatasetAggregator.scala#L37

And the last chunk count seems to be higher than what I believe I would expect, and the numbers don't quite make sense to me. I'll continue to investigate.

Run1:
chunk count: 0
chunk count: 0
chunk size: 10000
chunk size: 10000
chunk count: 0
chunk size: 10000
last chunk count: 1358
last chunk count: 1373
chunk count: 0
chunk size: 10000
last chunk count: 1384
last chunk count: 1409
chunk count: 0
chunk size: 10000
last chunk count: 130435
chunk count: 0
chunk size: 10000
last chunk count: 131480
chunk count: 0
chunk size: 10000
last chunk count: 133855
chunk count: 0
chunk size: 10000
last chunk count: 129010

Run2:
chunk count: 1
chunk count: 1
chunk size: 1000
chunk count: 1
chunk size: 1000
chunk count: 1
chunk size: 1000
chunk size: 1000
last chunk count: 373
last chunk count: 409
chunk count: 1
chunk size: 1000
last chunk count: 384
chunk count: 1
chunk size: 1000
last chunk count: 36480
last chunk count: 38855
chunk count: 1
chunk size: 1000
last chunk count: 35435
last chunk count: 358
chunk count: 1
chunk size: 1000
last chunk count: 34010

@imatiach-msft
Contributor

Note that the first 4 are from the labels and the last 4 are from the dataset (the debug output is interleaved across lines because 4 threads write at the same time).

@imatiach-msft
Contributor

@fonhorst I was able to fix the issue locally, the problem was that chunkSize was much larger for the features array (specifically, it was numCols * chunkSize, instead of just chunkSize). I will send a PR soon. Thank you for your patience.
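
The debug output from Run1 and Run2 appears consistent with this explanation. The sketch below assumes 95 feature columns (a value inferred from the logged numbers, since 130435 / 1373 = 95; it is not stated in the thread) and assumes that "chunk count" is the number of completely filled chunks. The helper `chunk_fill` is hypothetical.

```python
NUM_COLS = 95  # assumption: inferred from the logs, not stated in the thread


def chunk_fill(num_values: int, chunk_capacity: int) -> tuple:
    """Return (full_chunks, last_chunk_count) after appending num_values
    items into chunks that each hold chunk_capacity items."""
    return divmod(num_values, chunk_capacity)


partition_rows = [1358, 1373, 1384, 1409]  # label counts logged in Run1

for chunk_size in (10000, 1000):
    print(f"chunk size: {chunk_size}")
    for rows in partition_rows:
        # Labels: one value per row, chunk capacity = chunkSize.
        labels = chunk_fill(rows, chunk_size)
        # Features: NUM_COLS values per row, with the capacity described
        # in the comment above (numCols * chunkSize).
        features = chunk_fill(rows * NUM_COLS, NUM_COLS * chunk_size)
        print(f"  rows={rows} labels={labels} features={features}")
```

Under these assumptions the feature values reproduce the logged "last chunk count" entries: 129010, 130435, 131480 and 133855 for chunk size 10000, and 34010, 35435, 36480 and 38855 for chunk size 1000.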

@fonhorst
Author

@imatiach-msft Thank you very much for your swift response!

@imatiach-msft
Contributor

@fonhorst the issue should be resolved with the PR:
#1490
thank you for discovering this problem, for the great repro steps, and for your patience!

@imatiach-msft
Contributor

You can try the build at:

Maven Coordinates
com.microsoft.azure:synapseml_2.12:0.9.5-92-76c32ccf-SNAPSHOT

Maven Resolver
https://mmlspark.azureedge.net/maven
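
One way to point a Spark job at this snapshot build is through the standard Spark properties `spark.jars.packages` and `spark.jars.repositories`. The sketch below only assembles those settings; the helper names are hypothetical, and whether the snapshot is still hosted at that resolver is not something the thread confirms.

```python
# Coordinates and resolver exactly as given in the comment above.
SNAPSHOT_COORDS = "com.microsoft.azure:synapseml_2.12:0.9.5-92-76c32ccf-SNAPSHOT"
RESOLVER = "https://mmlspark.azureedge.net/maven"


def snapshot_spark_conf() -> dict:
    """Spark properties that make Spark resolve the snapshot package."""
    return {
        "spark.jars.packages": SNAPSHOT_COORDS,
        "spark.jars.repositories": RESOLVER,
    }


def as_submit_args(conf: dict) -> list:
    """Render the properties as spark-submit style --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(as_submit_args(snapshot_spark_conf())))
```

In a PySpark session the same key/value pairs can be passed one by one to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.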

@imatiach-msft
Contributor

imatiach-msft commented Apr 25, 2022

closing the issue as PR has been merged to fix this issue:
#1490

The fix will be in the next release after the current 0.9.5 (I'm assuming 0.9.6)

@Vonatzki

Vonatzki commented Jul 8, 2022

Hi @imatiach-msft , I just want to understand this bug further.

the problem was that chunkSize was much larger for the features array (specifically, it was numCols * chunkSize, instead of just chunkSize)

From what I understand, the features array had a larger effective chunk size because the chunkSize input was multiplied by numCols. How does this affect model quality? Does this mean that the label and features arrays do not align when they are copied to the workers of the cluster?

Sorry for the newbie question.

@imatiach-msft
Contributor

@Vonatzki yes, there was a bug in the code that copied the data from Java to the native LightGBM layer. If chunkSize was set to a low value, some of the values were not copied correctly, which results in a drop in performance metrics. This affects SynapseML 0.9.5, currently the newest version, when useSingleDatasetMode=True, which is on by default since 0.9.5. The issue is already fixed on current master and will be fixed in the next release.

@imatiach-msft
Contributor

please see the PR fix here with a longer description of the issue and the code changes:
#1490

@Vonatzki

Vonatzki commented Jul 9, 2022

Thank you for your response and the snapshot build fix! Helped me a lot!
