Operator Performance Regression on CPU #15429
Comments
Yes, #15430 is also caused by the op perf regression. I will try to reproduce the case, focusing on inference first. |
@roywei I've collected some performance data of
I ran each benchmark 5 times, with 10 warmup iterations + 100 runs each time. I'm not sure whether you set the env variables when running the benchmark. I suggest re-running the ops several times to see whether it's a real degradation. As for Table1
Table 2
|
Thanks @ciyongch, setting the environment variables did reduce the variance. With the current data, Dot and Dropout are not a big concern now. Relu's regression is something we have to accept, as otherwise it could lead to NaNs and bugs. |
@roywei I looked at the table: apart from a few ops with big variances, most of the degraded ops listed in the table fall into mshadow. I did some further analysis and found that the performance regression is caused by the commit of
Here's some perf data collected on my local machine (SKX-8180); you can try it on your platform :)
|
cc @apeforest |
Thanks @ciyongch for diving deep and helping to identify the root cause of the issues. Can you please help me understand why we see variance for only a few operators when the KMP and OMP environment variables are not set? Our users don't set these environment variables, so do we need to worry that these operators will perform badly and nondeterministically without that setting? |
@sandeep-krishnamurthy Since most of MXNet's ops are parallelized via OpenMP on CPU, from a performance perspective, binding CPU cores reduces the degradation caused by cache misses and gives better thread scheduling. |
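As a rough illustration of the core binding described above, the sketch below sets OpenMP-related environment variables from Python before the framework is imported. OMP_NUM_THREADS and KMP_AFFINITY are the standard OpenMP and Intel OpenMP runtime variables; the helper name openmp_binding_env, the affinity value, and the core count are assumptions for illustration, not the settings used for the numbers in this thread.

```python
import os

def openmp_binding_env(num_physical_cores):
    """Return env vars that pin OpenMP threads to cores (illustrative values)."""
    if num_physical_cores < 1:
        raise ValueError("num_physical_cores must be >= 1")
    return {
        # One OpenMP thread per physical core (assumed policy).
        "OMP_NUM_THREADS": str(num_physical_cores),
        # Intel OpenMP runtime affinity setting; this value is an assumption.
        "KMP_AFFINITY": "granularity=fine,noduplicates,compact,1,0",
    }

# Set these before importing the library that starts the OpenMP runtime
# (e.g. mxnet); otherwise they may have no effect.
# 36 is a placeholder; use the physical core count of your own machine.
env = openmp_binding_env(36)
os.environ.update(env)
```

The same variables can equally be exported in the shell before launching the benchmark; doing it in Python only keeps the setting next to the script that depends on it.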
I agree that what impacts the actual user experience is the final speed of model inference/training, as operators get fused and other performance improvement techniques are applied to the overall model. Currently we don't have enough data to say whether these op regressions will impact actual model speed. Regarding the op regressions, we are focusing on root-causing the regression in broadcast ops; the remaining ops should not block the 1.5.0 release. We found that regardless of the flag (int32/64), there is around a 15% regression on broadcast ops, on both the mxnet-mkl and mxnet pip packages, between 1.4.1 and 1.5.0. I'm still root-causing it.
|
I have run a search for perf regression on
|
@sandeep-krishnamurthy @Zha0q1
|
another regression from PR #14661 |
Update: all benchmark results before this comment may be inaccurate, as the profiler code differs between 1.4.1 and 1.5 and the opperf script uses the profiler. There is not much regression (even accounting for variance) between 1.4.1 and 1.5.0 on broadcast ops if using python
Thanks to @reminisce for helping out and providing the scripts.
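For readers who want to time an operator without going through the profiler, a minimal sketch using Python's timeit module might look like the following. This is not the script referred to in this thread; the helper name time_op and the stand-in workload are assumptions, and the warmup/run counts simply mirror the 10 warmup + 100 runs, 5 times setup mentioned earlier.

```python
import statistics
import timeit

def time_op(fn, warmup=10, runs=100, repeats=5):
    """Return (mean, stdev) of per-call time in seconds over `repeats` rounds."""
    for _ in range(warmup):
        fn()
    # timeit.repeat returns the total time of `runs` calls for each round.
    totals = timeit.repeat(fn, number=runs, repeat=repeats)
    per_call = [t / runs for t in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)

# Stand-in workload so the sketch runs without MXNet installed.
# With MXNet, fn would call the operator and then block until it finishes,
# e.g. lambda: (mx.nd.broadcast_add(a, b), mx.nd.waitall())
def workload():
    return sum(i * i for i in range(1000))

mean_s, stdev_s = time_op(workload)
print("mean %.3f us, stdev %.3f us" % (mean_s * 1e6, stdev_s * 1e6))
```

Because MXNet executes operators asynchronously, the timed callable needs to wait for the result; otherwise only the cost of enqueueing the op is measured.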
|
#15240 may have introduced extra runtime in the profiler in MXNet 1.5. Using timeit module,
Results using script above
|
There is also no significant regression on BatchNorm op between 1.4.1 and 1.5.0
script:
|
Conclusion
|
@roywei does it mean the operator profiling results with current profiler module is not accurate? |
@ciyongch The profiler was implemented in 1.5, and there are changes in the profiler code which were not in the 1.4.1 release: #15240. This could have made a difference in the measured runtime, as the runtime of each operator is very sensitive. Going forward, we should enhance the profiler module and make it available in CI so we can catch performance degradation in time. |
@apeforest Thanks for the info, as we saw the current perf degradation only happened for some ops or some certain shapes with v1.5, any suggestion to do profiling for a real model? Could we still rely on the profiling result with latest MXNet code base (with #15240) so far? |
Follow up on dev list discussion:
https://lists.apache.org/thread.html/154ef1e4010671e7375c7a7cbedb413d5a4a3677321488440fb32a3a@%3Cdev.mxnet.apache.org%3E
We have found some operators to have performance regression using the operator benchmark module here:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
@sandeep-krishnamurthy has helped to run the benchmark and this is the training mode result:
https://gist.github.com/sandeep-krishnamurthy/e0a2be893c8c4d484390c9c8813bdf50
The above result uses training mode (autograd.record()) and measures both forward and backward time. As most users use CPU for inference only, to further investigate the impact on inference I have run the scripts using inference mode
Please find the inference and training mode results here:
https://docs.google.com/spreadsheets/d/1_eezNWbrBAm3s3i6G1m0Rd3YYdTEnmKlYtn4klqdyN0/edit?usp=sharing
I have calculated the regression percentage and sorted them, thanks to @aaronmarkham for providing the first version.
Although there is variance in the perf numbers between runs, we observe that the following commonly used operators are consistently slower.
We need to look into them and fix them once they are root-caused.
Some op regressions seem to happen only on the mxnet-mkl version (refer to the 4th sheet of the Google sheet)
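The regression percentage mentioned above can be computed and sorted roughly as in the sketch below. The function name regression_pct, the op names, and the timings are placeholders for illustration only; they are not measurements from the spreadsheet.

```python
def regression_pct(old_time, new_time):
    """Percent change in runtime; positive means the new version is slower."""
    return (new_time - old_time) / old_time * 100.0

# Placeholder timings in milliseconds (old, new); NOT data from this issue.
timings = {
    "op_a": (1.00, 1.15),
    "op_b": (2.00, 1.90),
    "op_c": (0.50, 0.75),
}

# Sort so the largest slowdown comes first.
ranked = sorted(
    ((name, regression_pct(old, new)) for name, (old, new) in timings.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, pct in ranked:
    print("%s: %+.1f%%" % (name, pct))
```

Given the run-to-run variance noted above, it is probably safer to compute this on the mean of several runs per version rather than on a single measurement.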
Environment:
AWS C5.18xLarge
Deep Learning Base AMI (Ubuntu) Version 18.1
Python 3.6
MXNet versions:
Note: nightly 20190627 contains the latest commit in v.1.5.x
Scripts:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf
Notes: you need to modify the scripts a bit to run
False at this line and change run_backward to False in all files under:
https://github.com/apache/incubator-mxnet/tree/master/benchmark/opperf/nd_operations
for example here.