Use ray dataset and drop type casting in binary_feature prediction post processing for speedup #2293
Conversation
@magdyksaleh I'm going to suggest a different approach here, since I suspect the issue we're seeing with binary_feature will affect other output feature types as well.

The key idea is that the postprocess_predictions functions for all output features only apply per-row transformations; nothing in this step requires computing column-level metadata. As such, instead of passing Dask DataFrames to the postprocess functions, we can run one top-level map_batches call that applies the transformations to batches as pandas DataFrames.
Concretely:
- Change the postprocess function in postprocessing.py to something like this (note the per-feature results are accumulated and returned after the loop, not inside it):

  ```python
  def posproc_preds(df):
      for of_name, output_feature in output_features.items():
          df = output_feature.postprocess_predictions(
              df,
              training_set_metadata[of_name],
          )
      return df

  predictions = backend.df_engine.map_batches(predictions, posproc_preds)
  ```
- Change every postprocess_predictions function by removing output_directory and backend from the signature.
- Change all the calls to backend.df_engine.map_objects in postprocess_predictions to simply call df.map, since it's just a pandas DataFrame at that point (see the sketch after this list).
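As a rough sketch of where steps 2 and 3 lead (not the actual Ludwig implementation; the column naming convention and the bool2str metadata key are assumptions for illustration), a feature-level postprocess_predictions could end up looking like:

```python
def postprocess_predictions(self, result, metadata):
    # Inside map_batches, `result` is a plain pandas DataFrame, so Series.map
    # replaces the previous backend.df_engine.map_objects indirection.
    predictions_col = f"{self.feature_name}_predictions"
    bool2str = metadata.get("bool2str")  # assumed metadata key, for illustration
    if bool2str is not None and predictions_col in result:
        result[predictions_col] = result[predictions_col].map(
            lambda pred: bool2str[int(pred)]
        )
    return result
```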
ludwig/data/dataframe/base.py (Outdated)

```
@@ -46,6 +46,10 @@ def map_objects(self, series, map_fn, meta=None):
    def map_partitions(self, series, map_fn, meta=None):
        raise NotImplementedError()

    @abstractmethod
    def try_map_batches(self, series, map_fn, batch_format="pandas", meta=None):
```
I would rename this to just map_batches, and maybe drop the batch_format param, since it would be difficult to make it work with other backends.
I really like this suggestion! I'll try it out now. Thanks Travis
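To make the suggestion above concrete, here is a minimal sketch of how the renamed hook could be implemented per engine (the concrete engine classes and the use of map_partitions for Dask are illustrative assumptions, not necessarily what the PR ends up doing):

```python
from ludwig.data.dataframe.base import DataFrameEngine  # abstract base shown above


class PandasEngine(DataFrameEngine):
    def map_batches(self, df, map_fn):
        # The whole DataFrame is already in memory, so it is a single batch.
        return map_fn(df)


class DaskEngine(DataFrameEngine):
    def map_batches(self, df, map_fn):
        # Each Dask partition is a pandas DataFrame, so applying the same
        # pandas function per partition gives batch-wise semantics.
        return df.map_partitions(map_fn)
```

On the Ray backend, the same hook could instead route each batch through a Ray Dataset, which is what the PR title refers to.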
ludwig/data/postprocessing.py (Outdated)

```
df = output_feature.postprocess_predictions(
    df,
    training_set_metadata[of_name],
    output_directory=output_directory,
```
Remove output_directory and backend.
ludwig/data/dataframe/base.py (Outdated)

```
@@ -46,6 +46,10 @@ def map_objects(self, series, map_fn, meta=None):
    def map_partitions(self, series, map_fn, meta=None):
        raise NotImplementedError()

    @abstractmethod
    def map_batches(self, series, map_fn):
```
The first argument should be a dataframe, not a series.
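That is, the abstract declaration would read something like (a sketch of the requested tweak):

```python
@abstractmethod
def map_batches(self, df, map_fn):
    raise NotImplementedError()
```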
LGTM! Please also double-check performance on higgs with this new approach before landing.
This is ready to merge pending CI passing.
Code Pull Requests
Update the binary feature post-processing to use Ray Datasets to speed up the overall evaluation time.
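For reference, a minimal sketch of the Ray Datasets round trip this refers to (variable names are illustrative, and the real integration sits behind the df_engine abstraction discussed above; the batch_format argument here is Ray's own API, separate from the Ludwig method signature):

```python
import ray.data

# Convert the distributed predictions DataFrame (Dask-on-Ray) to a Ray Dataset,
# apply the per-row postprocessing to pandas-formatted batches, then convert back.
ds = ray.data.from_dask(predictions)
ds = ds.map_batches(posproc_preds, batch_format="pandas")
predictions = ds.to_dask()
```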
Documentation Pull Requests
Note that the documentation HTML files are in docs/ while the Markdown sources are in mkdocs/docs. If you are proposing a modification to the documentation, you should change only the Markdown files. api.md is automatically generated from the docstrings in the code, so if you want to change something in that file, first modify the ludwig/api.py docstrings, then run mkdocs/code_docs_autogen.py, which will create mkdocs/docs/api.md.