Add sample_size as a global preprocessing parameter #3650
Conversation
Unit Test Results: 6 files ±0, 6 suites ±0, 21m 26s ⏱️ (-21m 0s). Results for commit 2de5849; comparison against base commit ee92f7d. This pull request removes 19 tests.
♻️ This comment has been updated with latest results.
ludwig/data/preprocessing.py
Outdated
if sample_cap < len(dataset_df):
    dataset_df = dataset_df.sample(n=sample_cap, random_state=random_seed)
else:
    logger.info("sample_cap is larger than dataset size, ignoring sample_cap")
Perhaps logger.warning?
ludwig/data/preprocessing.py
Outdated
sample_cap = global_preprocessing_parameters["sample_cap"]
if sample_cap:
    if sample_ratio < 1.0:
        raise ValueError("sample_cap cannot be used when sample_ratio < 1.0")
Wondering if we can push this up into a schema validation check, i.e., if preprocessing sample_ratio is specified and it is < 1 and sample_cap is also specified, then raise a ConfigValidationError?
+1. If we can implement this as an auxiliary validation, that would allow the config to fail as early as possible.
Minor comments, but generally LGTM
Thanks! I like the change overall.
ludwig/data/preprocessing.py
Outdated
@@ -1211,6 +1211,15 @@ def build_dataset(
        logger.debug(f"sample {sample_ratio} of data")
        dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)

    sample_cap = global_preprocessing_parameters["sample_cap"]
Could you refactor this section out into a separate function?
dataset_df = get_sampled_dataset_df(dataset_df, sample_ratio, sample_cap)
Yep. Done!
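For reference, a minimal sketch of what the extracted helper might look like, folding in the other review suggestions (logger.warning, computing len() only once). The body here is an assumption for illustration, not the exact code merged in this PR:

```python
import logging

logger = logging.getLogger(__name__)


def get_sampled_dataset_df(dataset_df, sample_ratio, sample_cap, random_seed):
    """Sample by ratio first, then cap the result at an absolute row count."""
    if sample_ratio < 1.0:
        logger.debug(f"sample {sample_ratio} of data")
        dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)

    if sample_cap:
        df_len = len(dataset_df)  # len() is expensive on Dask, so compute it once
        if sample_cap < df_len:
            # Dask only supports 'frac', not 'n', so express the cap as a fraction.
            dataset_df = dataset_df.sample(frac=sample_cap / df_len, random_state=random_seed)
        else:
            logger.warning("sample_cap is larger than dataset size, ignoring sample_cap")
    return dataset_df
```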
@arnavgarg1 @justinxzhao Is this what you guys were looking for? I've tested this locally and it seems to have the right functionality.
Nice! Last request from me is to add a simple test to https://github.com/ludwig-ai/ludwig/blob/master/tests/ludwig/config_validation/test_checks.py since we're adding code to checks.py
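For what it's worth, a test along those lines might look roughly like this (a sketch: the import paths follow Ludwig's usual layout but are assumptions here, as are the feature specs):

```python
import pytest

from ludwig.error import ConfigValidationError
from ludwig.schema.model_types.base import ModelConfig


def test_sample_ratio_and_cap_incompatible():
    config = {
        "input_features": [{"name": "text_in", "type": "text"}],
        "output_features": [{"name": "label", "type": "category"}],
        # Both a ratio < 1 and an absolute cap: the check should reject this.
        "preprocessing": {"sample_ratio": 0.5, "sample_cap": 1000},
    }
    with pytest.raises(ConfigValidationError):
        ModelConfig.from_dict(config)
```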
ludwig/data/preprocessing.py
Outdated
dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)

if sample_cap:
    if sample_cap < len(dataset_df):
len(dataset_df) is a very expensive op for Dask dataframes (this is why we have an explicit check above to skip calling it when df_engine.partitioned), so calling it twice in quick succession is definitely not ideal. Let's do this instead:

df_len = len(dataset_df)
if sample_cap < df_len:
    # Cannot use 'n' parameter when using dask DataFrames -- only 'frac' is supported
    sample_ratio = sample_cap / df_len
    dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)
if sample_cap < len(dataset_df):
    # Cannot use 'n' parameter when using dask DataFrames -- only 'frac' is supported
    sample_ratio = sample_cap / len(dataset_df)
    dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)
Note that for Dask this will not be exact, but that's probably okay.
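The inexactness comes from Dask sampling each partition separately, so the realized row count only approximates sample_cap rather than matching it exactly. A small illustration (assuming dask is installed; the sizes are made up):

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(10_000)})
ddf = dd.from_pandas(pdf, npartitions=8)

sample_cap = 1_000
frac = sample_cap / len(ddf)  # len() triggers a full count on Dask

# Each partition is sampled at rate `frac` independently, so the total
# comes out close to sample_cap but not necessarily equal to it.
sampled = ddf.sample(frac=frac, random_state=42)
print(len(sampled))
```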
Not a huge fan of the name sample_cap personally. Can we call it something like sample_size instead?
ludwig/config_validation/checks.py
Outdated
def check_sample_ratio_and_cap_compatible(config: "ModelConfig") -> None:
    sample_ratio = config.preprocessing.sample_ratio
    sample_cap = config.preprocessing.sample_cap
    if sample_cap and sample_ratio < 1.0:
Edge case, but this would allow something like:

sample_cap: 0
sample_ratio: 0.5

So it would be more correct to say:

if sample_cap is not None and sample_ratio < 1.0:
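Folding this fix into the check suggested earlier, the validation might end up roughly like the following. This is a sketch assumed to live in ludwig/config_validation/checks.py, where the register_config_check decorator and ConfigValidationError are available; the message text is illustrative:

```python
@register_config_check
def check_sample_ratio_and_cap_compatible(config: "ModelConfig") -> None:
    """Reject configs that set both sample_ratio < 1 and a sample cap."""
    sample_ratio = config.preprocessing.sample_ratio
    sample_cap = config.preprocessing.sample_cap
    # Check 'is not None' rather than truthiness so that sample_cap = 0 is caught too.
    if sample_cap is not None and sample_ratio < 1.0:
        raise ConfigValidationError("sample_cap cannot be used when sample_ratio < 1.0")
```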
- 1000
expected_impact: 2
suggested_values: Depends on data size
ui_display_name: Sample Cap
nit: Sample Size.
count = len(train_set) + len(val_set) + len(test_set)
assert sample_size == count

# Check that sample cap is disabled when doing preprocessing for prediction
nit: sample size
Adds sample_size as a global preprocessing parameter, allowing users to specify exactly how many samples they want to train on instead of having to calculate the sample_ratio. Adds two integration tests to verify sample_size works as intended.
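As a usage illustration, a config exercising the new parameter might look like this (a sketch; the feature names and dataset path are invented):

```python
from ludwig.api import LudwigModel

config = {
    "input_features": [{"name": "review", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
    # Train on (at most) 1000 rows, without computing a sample_ratio by hand.
    "preprocessing": {"sample_size": 1000},
}

model = LudwigModel(config)
train_stats, _, output_dir = model.train(dataset="reviews.csv")
```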