Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restructured split config and added datetime splitting #2132

Merged
merged 59 commits into from
Jun 27, 2022
Merged

Conversation

tgaddair
Copy link
Collaborator

Closes #2129.

@github-actions
Copy link

github-actions bot commented Jun 11, 2022

Unit Test Results

       6 files  ±  0         6 suites  ±0   2h 15m 8s ⏱️ - 2m 23s
2 868 tests +13  2 834 ✔️ +13    34 💤 ±0  0 ±0 
8 604 runs  +39  8 498 ✔️ +39  106 💤 ±0  0 ±0 

Results for commit 94f26e9. ± Comparison against base commit edc7c26.

♻️ This comment has been updated with latest results.

@tgaddair tgaddair added the v0.6 label Jun 13, 2022
@tgaddair tgaddair marked this pull request as ready for review June 13, 2022 21:03
@tgaddair tgaddair requested review from justinxzhao and jppgks June 13, 2022 21:03
Copy link
Contributor

@jppgks jppgks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great improvement and much cleaner compared to the existing code, thanks Travis!

2 comments:

  1. Assuming split_index can only be one of 3 values (0, 1 or 2), should we introduce an enum for it instead of working with integers?
  2. Suggestion, potentially out of scope here: add JSON schema for split

ludwig/data/dataframe/dask.py Show resolved Hide resolved
ludwig/data/split.py Outdated Show resolved Hide resolved
ludwig/visualize.py Show resolved Hide resolved
ludwig/data/dataframe/base.py Outdated Show resolved Hide resolved

@split_registry.register("fixed")
class FixedSplitter(Splitter):
def __init__(self, column: str = SPLIT, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point (perhaps in separate PR), what do you think about extending this to enable users to specify custom values for fixed splits?

For example, I can imagine users with an explicit split column, but the values in the column are like "TRAIN", "TEST" instead of 0, 1, and 2.

(default)

split:
  type: fixed
  column: custom_split_column
  train_values: [0]
  validation_values: [1]
  test_values: [2]

(custom)

split:
  type: fixed
  column: custom_split_column
  train_values: [train]
  validation_values: [test]
  test_values: [heldout]

Related: #1823

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we should definitely do this. Can be in a follow-up PR right after.

@tgaddair tgaddair added the feature New feature or request label Jun 15, 2022
@justinxzhao
Copy link
Contributor

Latest commits hopefully address #2174, CC: @jppgks

@tgaddair tgaddair merged commit a587181 into master Jun 27, 2022
@tgaddair tgaddair deleted the ref-split branch June 27, 2022 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFC: Reorganize split configuration to support time-based split strategy
3 participants