fix: DML run times out if a big dataset has many feature columns (workaround for Synapse Spark optimizer issue) #1903
Conversation
Hey @dylanw-oss 👋! We use semantic commit messages to streamline the release process. Examples of commit messages with semantic prefixes:
To test your commit locally, please follow our guide on building from source.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report
@@ Coverage Diff @@
## master #1903 +/- ##
==========================================
+ Coverage 86.77% 86.83% +0.05%
==========================================
Files 301 301
Lines 15587 15596 +9
Branches 803 815 +12
==========================================
+ Hits 13526 13543 +17
+ Misses 2061 2053 -8
... and 3 files with indirect coverage changes
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
What changes are proposed in this pull request?
When applying DML on the WExp project, where the data has >1M records and >5 categorical features, the DML run timed out even with a large cluster.
We found that the Synapse version of the Spark optimizer cannot handle the resulting complex query plan. Splitting the DML pipeline and caching each stage's result fixes the timeout.
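The idea behind the workaround is to keep the query plan shallow by materializing intermediate results between pipeline stages. Below is a rough plain-Python analogy, not the actual SynapseML/PySpark change: the lazy generators stand in for Spark's lazy DataFrame transformations, and `list(...)` stands in for calling `.cache()` plus an action on an intermediate DataFrame. The stage functions are hypothetical.

```python
# Plain-Python sketch of the workaround (assumption: illustrative only).
# Composing every stage lazily yields one deep "plan" that must be evaluated
# all at once; materializing after each stage keeps every plan shallow.

def compose_lazily(data, stages):
    """Chain all stages as lazy generators -- one big 'plan'."""
    result = iter(data)
    for stage in stages:
        result = stage(result)    # nothing executes yet
    return list(result)           # the whole chain is evaluated here

def materialize_per_stage(data, stages):
    """Force each stage's output before the next -- many small 'plans'."""
    result = list(data)
    for stage in stages:
        result = list(stage(iter(result)))  # 'cache' the intermediate result
    return result

# Hypothetical transforms standing in for DML pipeline stages.
stages = [
    lambda rows: (r * 2 for r in rows),
    lambda rows: (r + 1 for r in rows),
]

print(compose_lazily([1, 2, 3], stages))         # [3, 5, 7]
print(materialize_per_stage([1, 2, 3], stages))  # [3, 5, 7] -- same result, shallower chain
```

In Spark, the corresponding pattern is typically `df.cache()` on the intermediate DataFrame followed by an action such as `df.count()` to force materialization before the next stage, so the optimizer never sees the full end-to-end plan.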
How is this patch tested?
Tested with internal project data.