Use ray dataset and drop type casting in binary_feature prediction post processing for speedup #2293
Conversation
@magdyksaleh I'm going to suggest a different approach here, since I suspect the issue we're seeing with binary_feature will affect other output feature types as well.

The key idea is that the postprocess_predictions functions for all output features only apply per-row transformations; nothing in this step requires computing column-level metadata. As such, instead of passing Dask DataFrames to the postprocess functions, we can run one top-level map_batches call that applies the transformations to batches as pandas DataFrames.
Concretely:
- Change the postprocess function in postprocessing.py to something like this (note the per-feature results are accumulated and returned after the loop, not inside it):

  ```python
  def posproc_preds(df):
      for of_name, output_feature in output_features.items():
          df = output_feature.postprocess_predictions(
              df,
              training_set_metadata[of_name],
          )
      return df

  predictions = backend.df_engine.map_batches(predictions, posproc_preds)
  ```
- Change every postprocess_predictions function by removing output_directory and backend from the signature.
- Change all the calls to backend.df_engine.map_objects in postprocess_predictions to simply call df.map, since it's just a pandas DataFrame at that point (see the sketch after this list).
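As a rough sketch of where steps 2 and 3 lead (not the actual Ludwig implementation; the column naming convention and the bool2str metadata key are assumptions for illustration), a feature-level postprocess_predictions could end up looking like:

```python
def postprocess_predictions(self, result, metadata):
    # Inside map_batches, `result` is a plain pandas DataFrame, so Series.map
    # replaces the previous backend.df_engine.map_objects indirection.
    predictions_col = f"{self.feature_name}_predictions"
    bool2str = metadata.get("bool2str")  # assumed metadata key, for illustration
    if bool2str is not None and predictions_col in result:
        result[predictions_col] = result[predictions_col].map(
            lambda pred: bool2str[int(pred)]
        )
    return result
```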
ludwig/data/dataframe/base.py (Outdated)

```
@@ -46,6 +46,10 @@ def map_objects(self, series, map_fn, meta=None):
    def map_partitions(self, series, map_fn, meta=None):
        raise NotImplementedError()

    @abstractmethod
    def try_map_batches(self, series, map_fn, batch_format="pandas", meta=None):
```
I would rename this to just map_batches, and maybe drop the batch_format param, since it would be difficult to make it work with other backends.
I really like this suggestion! I'll try it out now. Thanks Travis
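To make the suggestion above concrete, here is a minimal sketch of how the renamed hook could be implemented per engine (the concrete engine classes and the use of map_partitions for Dask are illustrative assumptions, not necessarily what the PR ends up doing):

```python
from ludwig.data.dataframe.base import DataFrameEngine  # abstract base shown above


class PandasEngine(DataFrameEngine):
    def map_batches(self, df, map_fn):
        # The whole DataFrame is already in memory, so it is a single batch.
        return map_fn(df)


class DaskEngine(DataFrameEngine):
    def map_batches(self, df, map_fn):
        # Each Dask partition is a pandas DataFrame, so applying the same
        # pandas function per partition gives batch-wise semantics.
        return df.map_partitions(map_fn)
```

On the Ray backend, the same hook could instead route each batch through a Ray Dataset, which is what the PR title refers to.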
ludwig/data/postprocessing.py (Outdated)

```
df = output_feature.postprocess_predictions(
    df,
    training_set_metadata[of_name],
    output_directory=output_directory,
```
Remove output_directory and backend.
ludwig/data/dataframe/base.py (Outdated)

```
@@ -46,6 +46,10 @@ def map_objects(self, series, map_fn, meta=None):
    def map_partitions(self, series, map_fn, meta=None):
        raise NotImplementedError()

    @abstractmethod
    def map_batches(self, series, map_fn):
```
The first argument should be a dataframe, not a series.
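That is, the abstract declaration would read something like (a sketch of the requested tweak):

```python
@abstractmethod
def map_batches(self, df, map_fn):
    raise NotImplementedError()
```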
LGTM! Please also double-check performance on higgs with this new approach before landing.
This is ready to merge pending CI passing.
Code Pull Requests
Update the binary feature post-processing to use Ray Datasets to speed up the overall evaluation time.
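For reference, a minimal sketch of the Ray Datasets round trip this refers to (variable names are illustrative, and the real integration sits behind the df_engine abstraction discussed above; the batch_format argument here is Ray's own API, separate from the Ludwig method signature):

```python
import ray.data

# Convert the distributed predictions DataFrame (Dask-on-Ray) to a Ray Dataset,
# apply the per-row postprocessing to pandas-formatted batches, then convert back.
ds = ray.data.from_dask(predictions)
ds = ds.map_batches(posproc_preds, batch_format="pandas")
predictions = ds.to_dask()
```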
Documentation Pull Requests
Note that the documentation HTML files are in docs/ while the Markdown sources are in mkdocs/docs. If you are proposing a modification to the documentation, you should change only the Markdown files. api.md is automatically generated from the docstrings in the code, so if you want to change something in that file, first modify the ludwig/api.py docstrings, then run mkdocs/code_docs_autogen.py, which will create mkdocs/docs/api.md.