Create dataset util to form repeatable train/vali/test split #2159
Conversation
for more information, see https://pre-commit.ci
raise ValueError("%s is not a column in the dataframe" % (stratify_colname))

do_stratify_split = True
if "split" in df_input.columns:
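For context on what the stratified split under discussion does, here is a minimal self-contained sketch (hypothetical helper name, plain Python rather than the PR's pandas code): group row indices by the stratification label, shuffle each group with a fixed seed, and slice each group into train/validation/test proportions so every class keeps roughly the same mix in all three splits.

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, probs=(0.7, 0.1, 0.2), seed=42):
    """Assign each row a split id (0=train, 1=validation, 2=test),
    stratified so each label keeps roughly the same class proportions.
    Sketch only -- not the PR's actual implementation."""
    by_label = defaultdict(list)
    for idx, lbl in enumerate(labels):
        by_label[lbl].append(idx)
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    split = [0] * len(labels)
    for lbl, idxs in by_label.items():
        rng.shuffle(idxs)
        n = len(idxs)
        n_train = round(probs[0] * n)
        n_vali = round(probs[1] * n)
        for i in idxs[n_train:n_train + n_vali]:
            split[i] = 1               # validation
        for i in idxs[n_train + n_vali:]:
            split[i] = 2               # test
    return split
```

Because the shuffle uses a seeded `random.Random`, calling the function twice on the same labels yields the identical assignment, which is the repeatability property the PR is after.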
What do you think about moving this code to set up global_preprocessing_parameters and simply calling the new split_dataset() function that we (@tgaddair) will soon be adding? A couple of concrete benefits: split_dataset() will also work with Dask dataframes, and there would potentially be less code duplication. WDYT?
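For illustration only, here is one hypothetical shape the parameters for such a config-driven split_dataset() call might take. Every key name below is an assumption for the sketch, not the landed Ludwig API:

```python
# Hypothetical split parameters; all key names here are illustrative
# assumptions, not the actual split_dataset() / Ludwig config schema.
split_params = {
    "type": "stratify",                 # assumed: stratified vs. random split
    "column": "label",                  # assumed: output feature to stratify on
    "probabilities": [0.7, 0.1, 0.2],   # train / validation / test shares
    "random_seed": 42,                  # fixed seed for a repeatable split
}
```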
Thanks for this idea. If the same functionality in this method can be cleanly implemented using global_preprocessing_parameters/split_dataset(), then that sounds like the way to go. At this point, I don't have enough context to judge whether that is the case. Maybe a good strategy is for me to wait until the @tgaddair PR is finalized/landed, then assess how/whether that code can achieve the same results as this PR, and respond.
Hey @amholler, thanks for putting this together. I agree with @justinxzhao that it makes sense to use the new splitter API here. To make that easier, I added your changes for stratified splitting to the PR, along with determinism tests, so the end result should hopefully be very similar.
The PR's API is finalized at this point; I'm just cleaning up some implementation details raised in comments.
Looking at this PR and the dataset splitting PR more closely, it looks like an in-place replacement using the new dataset splitting API would be rather complex and no more concise than what's already here.
This function flexibly handles the cases where the split column already exists and where a validation or test set may be absent, which the dataset splitting API does not. It does not necessarily work for Dask dataframes, but that is a separate goal from unblocking users who refer to the Ludwig AutoML examples.
My revised thinking is that we should check this in as is so that we can start using it in ludwig/experiments.
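As a rough illustration of the flexibility described above (a sketch with assumed names, not the PR's code): honor an existing split column when one is present, and allow a zero validation or test share when the dataset should only be divided into some of the three splits.

```python
import random

def split_rows(rows, split_col=None, probs=(0.7, 0.1, 0.2), seed=42):
    """Sketch of a flexible split assignment (hypothetical names):
    - if a split column already exists, reuse it as-is;
    - otherwise assign 0/1/2 (train/vali/test) at random with a fixed
      seed, skipping any split whose probability is zero."""
    if split_col is not None:
        return list(split_col)          # pre-split dataset: honor it
    rng = random.Random(seed)
    choices = [k for k, p in enumerate(probs) if p > 0]
    weights = [p for p in probs if p > 0]
    return rng.choices(choices, weights=weights, k=len(rows))
```

With `probs=(0.8, 0.0, 0.2)` no row is ever assigned to validation, which mirrors the "may or may not be a validation or test set" behavior mentioned above.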
@amholler the split refactor PR has landed.
Forming a repeatable train/validation/test split for datasets that are not pre-split into all three is a common operation that is currently handled by custom code in the experiments repo. Forming those splits with output-feature stratification is also desirable for imbalanced datasets.
This PR provides a utility method for handling this, as discussed in issue #2136. It generalizes the custom methods currently in the experiments repo, and was tested with the new unit tests in this PR and on selected experiments-repo datasets.
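One common way to make such a split repeatable even as rows are added or reordered is to hash a stable row key into a split bucket. This is a minimal sketch of that technique (not the PR's implementation, which assigns splits per-dataframe with a fixed seed):

```python
import hashlib

def hash_split(key, probs=(0.7, 0.1, 0.2)):
    """Deterministically map a stable row key to a split id
    (0=train, 1=validation, 2=test). The same key always lands in the
    same split, independent of row order or dataset growth. Sketch only."""
    # Hash the key into a uniform-ish value in [0, 1).
    bucket = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % 1000 / 1000
    cum = 0.0
    for split_id, p in enumerate(probs):
        cum += p
        if bucket < cum:
            return split_id
    return len(probs) - 1               # guard against float rounding
```

The trade-off versus seeded shuffling is that hashing gives per-row stability across dataset versions, but only approximate (not exact) split proportions.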