
Use ray dataset and drop type casting in binary_feature prediction post processing for speedup #2293

Merged: 18 commits merged into master from speed-up-eval-stage-ray on Jul 26, 2022

Conversation


@magdyksaleh magdyksaleh commented Jul 20, 2022

Code Pull Requests

Update the binary feature post-processing to use Ray Datasets to speed up overall prediction time.
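For context, the Ray Datasets primitive this change relies on applies a function to pandas-format batches of a distributed dataset. A minimal illustration of that pattern (not the PR's actual wiring; the column names and the non-Ray fallback branch are invented for the sketch):

```python
import pandas as pd

def cast_predictions(batch: pd.DataFrame) -> pd.DataFrame:
    # Per-row post-processing: threshold probabilities into bool predictions.
    batch = batch.copy()
    batch["predictions"] = batch["probabilities"] >= 0.5
    return batch

df = pd.DataFrame({"probabilities": [0.1, 0.7, 0.9]})

try:
    import ray
    # Distributed path: Ray feeds cast_predictions pandas batches.
    ds = ray.data.from_pandas(df)
    out = ds.map_batches(cast_predictions, batch_format="pandas").to_pandas()
except ImportError:
    # Same per-batch logic without Ray, for environments where it isn't installed.
    out = cast_predictions(df)

print(out["predictions"].tolist())  # [False, True, True]
```

Because the transformation is purely per-row, the batched and single-dataframe paths produce identical results; that equivalence is what makes the speedup safe.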

Documentation Pull Requests

Note that the documentation HTML files are in docs/ while the Markdown sources are in mkdocs/docs.

If you are proposing a modification to the documentation you should change only the Markdown files.

api.md is automatically generated from the docstrings in the code, so if you want to change something in that file, first modify the ludwig/api.py docstring, then run mkdocs/code_docs_autogen.py, which will create mkdocs/docs/api.md.


github-actions bot commented Jul 20, 2022

Unit Test Results

5 files (+2) · 5 suites (+2) · ⏱️ 2h 4m 21s (+49m 15s)
2 941 tests (+5): 2 892 ✔️ passed (+5), 49 💤 skipped (±0), 0 failed (±0)
8 698 runs (+2 951): 8 536 ✔️ passed (+2 883), 162 💤 skipped (+68), 0 failed (±0)

Results for commit c06b7be. ± Comparison against base commit 06594a5.

♻️ This comment has been updated with latest results.

Merge branch 'speed-up-eval-stage-ray' of github.com:ludwig-ai/ludwig into speed-up-eval-stage-ray
@magdyksaleh magdyksaleh requested a review from ShreyaR July 21, 2022 17:04
@tgaddair (Collaborator) left a comment

@magdyksaleh I'm going to suggest a different approach here, since I suspect the issue we're seeing here with binary_feature will be an issue for other output feature types as well.

The key idea is that the postprocess_predictions functions for all output features only apply per-row transformations. No transformation in this step requires computing column-level metadata. As such, instead of passing Dask DFs to the postprocess functions, we can run one top-level map_batches call that applies the transformations to batches as pandas dataframes.

Concretely:

1. Change the postprocess function in postprocessing.py to something like this:

       def posproc_preds(df):
           for of_name, output_feature in output_features.items():
               df = output_feature.postprocess_predictions(
                   df,
                   training_set_metadata[of_name],
               )
           return df

       predictions = backend.df_engine.map_batches(predictions, posproc_preds)

2. Change every postprocess_predictions function by removing output_directory and backend from the signature.

3. Change all the calls to backend.df_engine.map_objects in postprocess_predictions to simply call df.map, since it's now just a pandas DataFrame.
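The batched flow in step 1 can be exercised with plain pandas. Everything below (the feature class, column names, label sets, threshold) is a hypothetical stand-in for the real Ludwig objects, but it shows why df must be threaded through the loop rather than returned on the first iteration:

```python
import pandas as pd

# Hypothetical stand-in for a Ludwig output feature's per-row post-processing.
class FakeBinaryFeature:
    def __init__(self, column, labels):
        self.column = column
        self.labels = labels

    def postprocess_predictions(self, df, metadata):
        # Per-row only: map a probability column to its string labels.
        df = df.copy()
        df[self.column] = df[self.column].map(lambda p: self.labels[int(p >= 0.5)])
        return df

output_features = {
    "survived": FakeBinaryFeature("survived", ["no", "yes"]),
    "churned": FakeBinaryFeature("churned", ["stay", "leave"]),
}
training_set_metadata = {name: {} for name in output_features}  # unused in this sketch

def posproc_preds(df):
    # Thread df through every feature; an early return inside the loop
    # would silently drop all transformations after the first feature.
    for of_name, output_feature in output_features.items():
        df = output_feature.postprocess_predictions(
            df,
            training_set_metadata[of_name],
        )
    return df

batch = pd.DataFrame({"survived": [0.9, 0.2], "churned": [0.1, 0.8]})
out = posproc_preds(batch)
print(out.values.tolist())  # [['yes', 'stay'], ['no', 'leave']]
```

With two output features, the second feature's column only gets transformed because each iteration rebinds df; this is the function map_batches would apply to each pandas batch.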

@@ -46,6 +46,10 @@ def map_objects(self, series, map_fn, meta=None):
def map_partitions(self, series, map_fn, meta=None):
raise NotImplementedError()

@abstractmethod
def try_map_batches(self, series, map_fn, batch_format="pandas", meta=None):

I would rename to just map_batches, and maybe drop the batch_format param, since it would be difficult to make it work with other backends.
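A minimal sketch of what the renamed hook could look like across engines (the class names other than map_batches are assumptions, not Ludwig's real ones): the pandas engine treats the whole dataframe as a single batch, while a distributed engine would delegate to its native batched map, which is why no batch_format parameter is needed at this level.

```python
from abc import ABC, abstractmethod

import pandas as pd

class DataFrameEngine(ABC):
    @abstractmethod
    def map_batches(self, df, map_fn):
        """Apply map_fn to df in pandas-format batches."""

class PandasEngine(DataFrameEngine):
    def map_batches(self, df, map_fn):
        # In-memory pandas: the whole dataframe is one batch.
        return map_fn(df)

engine = PandasEngine()
out = engine.map_batches(
    pd.DataFrame({"x": [1, 2, 3]}),
    lambda batch: batch.assign(y=batch["x"] * 2),
)
print(out["y"].tolist())  # [2, 4, 6]
```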

@magdyksaleh (Contributor, Author)

@magdyksaleh I'm going to suggest a different approach here, since I suspect the issue we're seeing here with binary_feature will be an issue for other output feature types as well.

The key idea is that the postprocess_predictions functions for all output features only apply per-row transformations. No transformation in this step requires computing column-level metadata. As such, instead of passing Dask DFs to the postprocess functions, we can run one top-level map_batches call that applies the transformations to batches as pandas dataframes.

Concretely:

1. Change the `postprocess` function in `postprocessing.py` to something like this:
def posproc_preds(df):
    for of_name, output_feature in output_features.items():
        df = output_feature.postprocess_predictions(
            df,
            training_set_metadata[of_name],
        )
    return df

predictions = backend.df_engine.map_batches(predictions, posproc_preds)
2. Change every `postprocess_predictions` function by removing `output_directory` and `backend` from the signature.

3. Change all the calls to `backend.df_engine.map_objects` in `postprocess_predictions` to simply call `df.map`, since it's just a Pandas DF.

I really like this suggestion! I'll try it out now. Thanks, Travis!

df = output_feature.postprocess_predictions(
df,
training_set_metadata[of_name],
output_directory=output_directory,

Remove output_directory and backend

@@ -46,6 +46,10 @@ def map_objects(self, series, map_fn, meta=None):
def map_partitions(self, series, map_fn, meta=None):
raise NotImplementedError()

@abstractmethod
def map_batches(self, series, map_fn):

The first parameter should be a dataframe, not a series.

@tgaddair (Collaborator) left a comment

LGTM! Please also double-check performance on higgs with this new approach before landing.

@magdyksaleh magdyksaleh marked this pull request as ready for review July 23, 2022 09:02
@magdyksaleh (Contributor, Author)

This is ready to merge, pending CI passing.

@magdyksaleh magdyksaleh merged commit cc0a8e2 into master Jul 26, 2022
@magdyksaleh magdyksaleh deleted the speed-up-eval-stage-ray branch July 26, 2022 18:29