Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create dataset util to form repeatable train/vali/test split #2159

Merged
merged 8 commits into from
Jul 5, 2022

Conversation

amholler
Copy link
Collaborator

@amholler amholler commented Jun 17, 2022

Forming a repeatable train/validation/test split for datasets that are not pre-split into all 3 is
a common operation that is currently handled by custom code in the experiments repo.
Also, forming those splits w/ output feature stratification is desirable for imbalanced datasets.

This PR provides a utility method for handling this, as discussed in issue 2136.
The PR generalizes the custom methods for doing this that are currently in the experiments repo.

The PR was tested on the new unit tests in this PR and on selected experiment repo datasets.

@github-actions
Copy link

github-actions bot commented Jun 17, 2022

Unit Test Results

       6 files  ±0         6 suites  ±0   2h 1m 35s ⏱️ - 18m 23s
2 828 tests +1  2 794 ✔️ +1    34 💤 ±0  0 ±0 
8 484 runs  +3  8 378 ✔️ +3  106 💤 ±0  0 ±0 

Results for commit 4ad8197. ± Comparison against base commit bed01a5.

♻️ This comment has been updated with latest results.

tests/ludwig/utils/test_dataset_utils.py Outdated Show resolved Hide resolved
ludwig/utils/dataset_utils.py Outdated Show resolved Hide resolved
ludwig/utils/dataset_utils.py Outdated Show resolved Hide resolved
raise ValueError("%s is not a column in the dataframe" % (stratify_colname))

do_stratify_split = True
if "split" in df_input.columns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about moving this code to set up global_preprocessing_parameters and simply call new split_dataset() function that we (@tgaddair) will soon be adding?

https://github.com/ludwig-ai/ludwig/pull/2132/files#diff-aa425387c78ff4aa6f62ce6c78e20367590378ff31b9d66b49515e50e090736aR203

A couple of concrete benefits are that the split_dataset() function will also work with dask dataframes, and there would be potentially less code duplication. WDYT?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this idea. If the same functionality in this method can be cleanly implemented using
global_preprocessing_parameters/split_dataset(), then it sounds like the way to go. At this point,
I don't have enough context to understand if that is the case. Maybe a good strategy is for me to wait
until the @tgaddair PR is finalized/landed and then I can assess how/whether that code can be used
to achieve the same results as this PR, and then respond.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @amholler, thanks for putting this together. I agree with @justinxzhao that it makes sense to use the new splitter API here. To make that easier, I added your changes to stratified splitting into the PR and added determinism tests, so hopefully it should be very similar in terms of the end result.

The PR is finalized in terms of API at this point, just cleaning up some implementation details raised in comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this PR and the dataset splitting PR more closely, it looks like it would be rather complex to do an in-place replacement that uses the new dataset splitting API that would be more concise than what's already here.

This function flexibly handles cases where the split column already exists and when there may or may not be a validation or test set, which the dataset splitting API does not do. It does not necessarily work for dask dataframes, but this is also a separate goal from unblocking users referring to Ludwig automl examples.

My revised thinking is that we should check this in as is so that we can start using it in ludwig/experiments.

@tgaddair
Copy link
Collaborator

@amholler the split refactor PR has landed.

@justinxzhao justinxzhao merged commit cf63600 into ludwig-ai:master Jul 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants