Add sample_size as a global preprocessing parameter #3650
Conversation
Unit Test Results: 6 files ±0, 6 suites ±0, 21m 26s ⏱️ (-21m 0s). Results for commit 2de5849; comparison against base commit ee92f7d. This pull request removes 19 tests.
♻️ This comment has been updated with latest results.
ludwig/data/preprocessing.py
Outdated
if sample_cap < len(dataset_df):
    dataset_df = dataset_df.sample(n=sample_cap, random_state=random_seed)
else:
    logger.info("sample_cap is larger than dataset size, ignoring sample_cap")
Perhaps logger.warning?
ludwig/data/preprocessing.py
Outdated
sample_cap = global_preprocessing_parameters["sample_cap"]
if sample_cap:
    if sample_ratio < 1.0:
        raise ValueError("sample_cap cannot be used when sample_ratio < 1.0")
Wondering if we can push this up into a schema validation check, i.e., if preprocessing sample_ratio is specified and it is < 1 and sample_cap is also specified, then raise a ConfigValidationError?
+1. If we can implement this as an auxiliary validation, that would allow the config to fail as early as possible.
Minor comments, but generally LGTM
Thanks! I like the change overall.
ludwig/data/preprocessing.py
Outdated
@@ -1211,6 +1211,15 @@ def build_dataset(
        logger.debug(f"sample {sample_ratio} of data")
        dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)

    sample_cap = global_preprocessing_parameters["sample_cap"]
Could you refactor this section out into a separate function?
dataset_df = get_sampled_dataset_df(dataset_df, sample_ratio, sample_cap)
Yep. Done!
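For reference, a minimal sketch of what the extracted helper might look like, folding in the other review suggestions (logger.warning, computing len() only once). The body here is an assumption for illustration, not the exact code merged in this PR:

```python
import logging

logger = logging.getLogger(__name__)


def get_sampled_dataset_df(dataset_df, sample_ratio, sample_cap, random_seed):
    """Sample by ratio first, then cap the result at an absolute row count."""
    if sample_ratio < 1.0:
        logger.debug(f"sample {sample_ratio} of data")
        dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)

    if sample_cap:
        df_len = len(dataset_df)  # len() is expensive on Dask, so compute it once
        if sample_cap < df_len:
            # Dask only supports 'frac', not 'n', so express the cap as a fraction.
            dataset_df = dataset_df.sample(frac=sample_cap / df_len, random_state=random_seed)
        else:
            logger.warning("sample_cap is larger than dataset size, ignoring sample_cap")
    return dataset_df
```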
@arnavgarg1 @justinxzhao Is this what you guys were looking for? I've tested this locally and it seems to have the right functionality.
Nice! Last request from me is to add a simple test to https://github.com/ludwig-ai/ludwig/blob/master/tests/ludwig/config_validation/test_checks.py since we're adding code to checks.py
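For what it's worth, a test along those lines might look roughly like this (a sketch: the import paths follow Ludwig's usual layout but are assumptions here, as are the feature specs):

```python
import pytest

from ludwig.error import ConfigValidationError
from ludwig.schema.model_types.base import ModelConfig


def test_sample_ratio_and_cap_incompatible():
    config = {
        "input_features": [{"name": "text_in", "type": "text"}],
        "output_features": [{"name": "label", "type": "category"}],
        # Both a ratio < 1 and an absolute cap: the check should reject this.
        "preprocessing": {"sample_ratio": 0.5, "sample_cap": 1000},
    }
    with pytest.raises(ConfigValidationError):
        ModelConfig.from_dict(config)
```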
ludwig/data/preprocessing.py
Outdated
dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)

if sample_cap:
    if sample_cap < len(dataset_df):
len(dataset_df) is a very expensive op for Dask dataframes (this is why we have an explicit check above to skip calling it when df_engine.partitioned), so calling it twice in quick succession is definitely not ideal. Let's do this instead:

df_len = len(dataset_df)
if sample_cap < df_len:
    # Cannot use 'n' parameter when using dask DataFrames -- only 'frac' is supported
    sample_ratio = sample_cap / df_len
    dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)
if sample_cap < len(dataset_df):
    # Cannot use 'n' parameter when using dask DataFrames -- only 'frac' is supported
    sample_ratio = sample_cap / len(dataset_df)
    dataset_df = dataset_df.sample(frac=sample_ratio, random_state=random_seed)
Note that for Dask this will not be exact, but that's probably okay.
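The inexactness comes from Dask sampling each partition separately, so the realized row count only approximates sample_cap rather than matching it exactly. A small illustration (assuming dask is installed; the sizes are made up):

```python
import dask.dataframe as dd
import pandas as pd

pdf = pd.DataFrame({"x": range(10_000)})
ddf = dd.from_pandas(pdf, npartitions=8)

sample_cap = 1_000
frac = sample_cap / len(ddf)  # len() triggers a full count on Dask

# Each partition is sampled at rate `frac` independently, so the total
# comes out close to sample_cap but not necessarily equal to it.
sampled = ddf.sample(frac=frac, random_state=42)
print(len(sampled))
```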
Not a huge fan of the name sample_cap personally. Can we call it something like sample_size instead?
ludwig/config_validation/checks.py
Outdated
def check_sample_ratio_and_cap_compatible(config: "ModelConfig") -> None:
    sample_ratio = config.preprocessing.sample_ratio
    sample_cap = config.preprocessing.sample_cap
    if sample_cap and sample_ratio < 1.0:
Edge case, but this would allow something like:

sample_cap: 0
sample_ratio: 0.5

So it would be more correct to say:

if sample_cap is not None and sample_ratio < 1.0:
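Folding this fix into the check suggested earlier, the validation might end up roughly like the following. This is a sketch assumed to live in ludwig/config_validation/checks.py, where the register_config_check decorator and ConfigValidationError are available; the message text is illustrative:

```python
@register_config_check
def check_sample_ratio_and_cap_compatible(config: "ModelConfig") -> None:
    """Reject configs that set both sample_ratio < 1 and a sample cap."""
    sample_ratio = config.preprocessing.sample_ratio
    sample_cap = config.preprocessing.sample_cap
    # Check 'is not None' rather than truthiness so that sample_cap = 0 is caught too.
    if sample_cap is not None and sample_ratio < 1.0:
        raise ConfigValidationError("sample_cap cannot be used when sample_ratio < 1.0")
```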
- 1000
expected_impact: 2
suggested_values: Depends on data size
ui_display_name: Sample Cap
nit: Sample Size.
count = len(train_set) + len(val_set) + len(test_set)
assert sample_size == count

# Check that sample cap is disabled when doing preprocessing for prediction
nit: sample size
Adds sample_size as a global preprocessing parameter, allowing users to specify exactly how many samples they want to train on instead of having to calculate the sample_ratio. Adds two integration tests to verify sample_size works as intended.
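As a usage illustration, a config exercising the new parameter might look like this (a sketch; the feature names and dataset path are invented):

```python
from ludwig.api import LudwigModel

config = {
    "input_features": [{"name": "review", "type": "text"}],
    "output_features": [{"name": "sentiment", "type": "category"}],
    # Train on (at most) 1000 rows, without computing a sample_ratio by hand.
    "preprocessing": {"sample_size": 1000},
}

model = LudwigModel(config)
train_stats, _, output_dir = model.train(dataset="reviews.csv")
```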