Create dataset util to form repeatable train/vali/test split #2159
Conversation
for more information, see https://pre-commit.ci
raise ValueError("%s is not a column in the dataframe" % (stratify_colname))

do_stratify_split = True
if "split" in df_input.columns:
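For context on what the stratified split under discussion does, here is a minimal self-contained sketch (hypothetical helper name, plain Python rather than the PR's pandas code): group row indices by the stratification label, shuffle each group with a fixed seed, and slice each group into train/validation/test proportions so every class keeps roughly the same mix in all three splits.

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, probs=(0.7, 0.1, 0.2), seed=42):
    """Assign each row a split id (0=train, 1=validation, 2=test),
    stratified so each label keeps roughly the same class proportions.
    Sketch only -- not the PR's actual implementation."""
    by_label = defaultdict(list)
    for idx, lbl in enumerate(labels):
        by_label[lbl].append(idx)
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    split = [0] * len(labels)
    for lbl, idxs in by_label.items():
        rng.shuffle(idxs)
        n = len(idxs)
        n_train = round(probs[0] * n)
        n_vali = round(probs[1] * n)
        for i in idxs[n_train:n_train + n_vali]:
            split[i] = 1               # validation
        for i in idxs[n_train + n_vali:]:
            split[i] = 2               # test
    return split
```

Because the shuffle uses a seeded `random.Random`, calling the function twice on the same labels yields the identical assignment, which is the repeatability property the PR is after.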
What do you think about moving this code to set up global_preprocessing_parameters and simply calling the new split_dataset() function that we (@tgaddair) will soon be adding? A couple of concrete benefits: split_dataset() will also work with Dask dataframes, and there would potentially be less code duplication. WDYT?
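For illustration only, here is one hypothetical shape the parameters for such a config-driven split_dataset() call might take. Every key name below is an assumption for the sketch, not the landed Ludwig API:

```python
# Hypothetical split parameters; all key names here are illustrative
# assumptions, not the actual split_dataset() / Ludwig config schema.
split_params = {
    "type": "stratify",                 # assumed: stratified vs. random split
    "column": "label",                  # assumed: output feature to stratify on
    "probabilities": [0.7, 0.1, 0.2],   # train / validation / test shares
    "random_seed": 42,                  # fixed seed for a repeatable split
}
```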
Thanks for this idea. If the same functionality in this method can be cleanly implemented using global_preprocessing_parameters/split_dataset(), then that sounds like the way to go. At this point, I don't have enough context to judge whether that is the case. Maybe a good strategy is for me to wait until the @tgaddair PR is finalized/landed, then assess how/whether that code can achieve the same results as this PR, and respond.
Hey @amholler, thanks for putting this together. I agree with @justinxzhao that it makes sense to use the new splitter API here. To make that easier, I added your changes for stratified splitting to the PR, along with determinism tests, so the end result should hopefully be very similar.
The PR's API is finalized at this point; I'm just cleaning up some implementation details raised in comments.
Looking at this PR and the dataset splitting PR more closely, it looks like an in-place replacement using the new dataset splitting API would be rather complex and no more concise than what's already here.
This function flexibly handles the cases where the split column already exists and where a validation or test set may be absent, which the dataset splitting API does not. It does not necessarily work for Dask dataframes, but that is a separate goal from unblocking users who refer to the Ludwig AutoML examples.
My revised thinking is that we should check this in as is so that we can start using it in ludwig/experiments.
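As a rough illustration of the flexibility described above (a sketch with assumed names, not the PR's code): honor an existing split column when one is present, and allow a zero validation or test share when the dataset should only be divided into some of the three splits.

```python
import random

def split_rows(rows, split_col=None, probs=(0.7, 0.1, 0.2), seed=42):
    """Sketch of a flexible split assignment (hypothetical names):
    - if a split column already exists, reuse it as-is;
    - otherwise assign 0/1/2 (train/vali/test) at random with a fixed
      seed, skipping any split whose probability is zero."""
    if split_col is not None:
        return list(split_col)          # pre-split dataset: honor it
    rng = random.Random(seed)
    choices = [k for k, p in enumerate(probs) if p > 0]
    weights = [p for p in probs if p > 0]
    return rng.choices(choices, weights=weights, k=len(rows))
```

With `probs=(0.8, 0.0, 0.2)` no row is ever assigned to validation, which mirrors the "may or may not be a validation or test set" behavior mentioned above.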
@amholler the split refactor PR has landed.
Forming a repeatable train/validation/test split for datasets that are not pre-split into all three is a common operation that is currently handled by custom code in the experiments repo. Forming those splits with output-feature stratification is also desirable for imbalanced datasets.
This PR provides a utility method for handling this, as discussed in issue #2136. It generalizes the custom methods currently in the experiments repo, and was tested with the new unit tests in this PR and on selected experiments-repo datasets.
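One common way to make such a split repeatable even as rows are added or reordered is to hash a stable row key into a split bucket. This is a minimal sketch of that technique (not the PR's implementation, which assigns splits per-dataframe with a fixed seed):

```python
import hashlib

def hash_split(key, probs=(0.7, 0.1, 0.2)):
    """Deterministically map a stable row key to a split id
    (0=train, 1=validation, 2=test). The same key always lands in the
    same split, independent of row order or dataset growth. Sketch only."""
    # Hash the key into a uniform-ish value in [0, 1).
    bucket = int(hashlib.md5(str(key).encode()).hexdigest(), 16) % 1000 / 1000
    cum = 0.0
    for split_id, p in enumerate(probs):
        cum += p
        if bucket < cum:
            return split_id
    return len(probs) - 1               # guard against float rounding
```

The trade-off versus seeded shuffling is that hashing gives per-row stability across dataset versions, but only approximate (not exact) split proportions.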