fix: DML run times out if a big dataset has many feature columns (workaround for Synapse Spark optimizer issue) #1903
Conversation
Hey @dylanw-oss 👋! We use semantic commit messages to streamline the release process. Examples of commit messages with semantic prefixes:
To test your commit locally, please follow our guide on building from source.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report
@@ Coverage Diff @@
## master #1903 +/- ##
==========================================
+ Coverage 86.77% 86.83% +0.05%
==========================================
Files 301 301
Lines 15587 15596 +9
Branches 803 815 +12
==========================================
+ Hits 13526 13543 +17
+ Misses 2061 2053 -8
... and 3 files with indirect coverage changes
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
What changes are proposed in this pull request?
When applying DML on the WExp project, where the data has >1M records and >5 categorical features, the DML run timed out even with a large cluster.
We found that the Synapse version of the Spark optimizer cannot handle the resulting complex query plan. Splitting the DML pipeline and caching each stage's result fixes the timeout.
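The idea behind the workaround is to keep the query plan shallow by materializing intermediate results between pipeline stages. Below is a rough plain-Python analogy, not the actual SynapseML/PySpark change: the lazy generators stand in for Spark's lazy DataFrame transformations, and `list(...)` stands in for calling `.cache()` plus an action on an intermediate DataFrame. The stage functions are hypothetical.

```python
# Plain-Python sketch of the workaround (assumption: illustrative only).
# Composing every stage lazily yields one deep "plan" that must be evaluated
# all at once; materializing after each stage keeps every plan shallow.

def compose_lazily(data, stages):
    """Chain all stages as lazy generators -- one big 'plan'."""
    result = iter(data)
    for stage in stages:
        result = stage(result)    # nothing executes yet
    return list(result)           # the whole chain is evaluated here

def materialize_per_stage(data, stages):
    """Force each stage's output before the next -- many small 'plans'."""
    result = list(data)
    for stage in stages:
        result = list(stage(iter(result)))  # 'cache' the intermediate result
    return result

# Hypothetical transforms standing in for DML pipeline stages.
stages = [
    lambda rows: (r * 2 for r in rows),
    lambda rows: (r + 1 for r in rows),
]

print(compose_lazily([1, 2, 3], stages))         # [3, 5, 7]
print(materialize_per_stage([1, 2, 3], stages))  # [3, 5, 7] -- same result, shallower chain
```

In Spark, the corresponding pattern is typically `df.cache()` on the intermediate DataFrame followed by an action such as `df.count()` to force materialization before the next stage, so the optimizer never sees the full end-to-end plan.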
How is this patch tested?
Tested with internal project data.