HDBScan performance issue with large dataset #645
Comments
Since you mentioned that the script runs successfully in Jupyter Notebook, the problem could be memory usage. It seems execution is not stable when run from your script. To optimize, I would suggest making sure you have enough memory and CPU resources to handle the process. You could also leverage GPU acceleration.
Hi @Bokang-ctrl
Thanks
Hi @divya-agrawal3103. Apologies for only getting back to you now. To answer your questions: I would recommend using PCA for dimensionality reduction, which will reduce the number of features and make the model more effective. Try different scaling techniques (RobustScaler, StandardScaler, and MinMaxScaler) and check which one gives the best results. Try tuning your parameters; see the attached picture for how I tuned mine. I'm sure there are other approaches, but these are the ones I can think of. For spilling to disk, I asked ChatGPT, and this is what the response was:
Hi Team,
We are currently running the HDBSCAN algorithm on a large and diverse dataset using one of our products to execute the script in Python. Below is the script we are using along with the input data:
Sample file: sample.csv
We have performed preprocessing steps including OneHotEncoding, Scaling, and Dimensionality Reduction.
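For readers following along, the preprocessing described above can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the script from this issue: the column names, the toy data, the choice of `StandardScaler`, and the number of PCA components are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_preprocessor(categorical_cols, numeric_cols, n_components, scaler=None):
    """One-hot encode categoricals, scale numerics, then reduce with PCA.

    `scaler` can be any scikit-learn scaler (for example RobustScaler or
    MinMaxScaler); StandardScaler is used when none is given.
    """
    encode_and_scale = ColumnTransformer(
        transformers=[
            ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
            ("scale", scaler if scaler is not None else StandardScaler(), numeric_cols),
        ],
        sparse_threshold=0.0,  # always return a dense array, which PCA requires
    )
    return Pipeline([
        ("prep", encode_and_scale),
        ("pca", PCA(n_components=n_components)),
    ])


# Hypothetical toy data standing in for the real input file.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "category": rng.choice(["a", "b", "c"], size=200),
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
})

pipeline = build_preprocessor(["category"], ["x1", "x2"], n_components=3)
reduced = pipeline.fit_transform(frame)
print(reduced.shape)
```

Passing a different scaler through the `scaler` argument makes it easy to compare scaling choices without touching the rest of the pipeline.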
The script executes in approximately 8 minutes.
However, switching the algorithm from "prims_kdtree" to "best", "boruvka_kdtree", or "boruvka_balltree" results in a failure within a few minutes with the error message:
Note: When executing the script using Jupyter Notebook, we obtain results for "best", "boruvka_kdtree", "boruvka_balltree", "prims_balltree", and "prims_kdtree" algorithms within a reasonable time.
Could you please help us with the following questions?
Your insights and guidance would be greatly appreciated.