Bulked op segments to allow Variable nodes #14200
Conversation
…LK_EXEC_MAX_NODE_TRAIN_{FWD,BWD}.
Just to clarify: this is bulking support for symbolic execution. Bulking in Gluon was handled in (already merged) PR #13890
@mxnet-jenkins add[Backend, pr-awaiting-review]
I measured the perf gains of this PR under two scenarios. The first scenario was a run across 8 Volta GPUs of a mixed-precision NHWC Resnet50 v1b (also with Horovod and DALI in NVIDIA's MXNet container). To simulate upstream MXNet with its current bulking, I set the new PR's bulking to 2. I reasoned that a residual unit has a forward sequence Conv-BN-Relu-Conv-BN-Add-Relu, which would form 4 segments (due to the Variable inputs to the Convs and BNs) amongst 7 nodes, for an average of 1.75 nodes/segment. The speed-up measured from MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=2 to MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=15 was:
A second scenario was a run across 8 Volta GPUs of a mixed precision NCHW Inception-v3 (with DALI in NVIDIA's MXNet container). The speed-up measured from MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=2 to MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN=15 was:
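For context, a minimal sketch of how such an A/B comparison can be driven by toggling only the bulking limit between runs; the training script name and flags below are placeholders (assumptions), not the actual commands used for these measurements:

```python
# Hypothetical A/B driver: the training script and its flags are placeholders.
import os
import subprocess

def run_with_bulk_limit(max_nodes):
    env = dict(os.environ)
    # Only the bulking limit differs between the two runs (2 vs. 15).
    env["MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN"] = str(max_nodes)
    subprocess.run(
        ["python", "train_imagenet.py", "--gpus", "0,1,2,3,4,5,6,7"],  # placeholder command
        env=env, check=True)

run_with_bulk_limit(2)   # approximates upstream bulking, where Variables end segments
run_with_bulk_limit(15)  # this PR's bulking, with Variables not counted toward the limit
```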
The main change of this PR is in graph_executor.cc, in an area often touched by @piiswrong @eric-haibin-lin @KellenSunderland. Not sure if any of them would like to weigh in.
Thanks for the improvement! A few questions.
@@ -53,10 +53,12 @@ struct CachedOpConfig : public dmlc::Parameter<CachedOpConfig> {
      .set_default(2)
      .describe("Maximum number of operators that can be inlined.");
    DMLC_DECLARE_FIELD(forward_bulk_size)
      .set_default(dmlc::GetEnv("MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN", 15))
I noticed that env vars like MXNET_EXEC_BULK_EXEC_TRAIN/MXNET_EXEC_BULK_EXEC_INFER=0 are not respected by the cached_op. Would you have time to kindly fix it for cached op?
https://github.com/apache/incubator-mxnet/blob/54fd288c7a4bf59d37f793c26ef9a98ed40b0c40/src/imperative/cached_op.cc#L593-L596
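For illustration, a minimal Python sketch of the gating being asked for here (the actual fix lives in the C++ CachedOp code; the helper name and the fallback values are assumptions):

```python
import os

def effective_bulk_size(requested, is_training):
    """Hypothetical helper: honor the on/off switches before using the requested size."""
    switch = "MXNET_EXEC_BULK_EXEC_TRAIN" if is_training else "MXNET_EXEC_BULK_EXEC_INFER"
    if int(os.environ.get(switch, "1")) == 0:
        return 1  # bulking disabled: each op effectively runs as its own segment
    return requested

# With MXNET_EXEC_BULK_EXEC_TRAIN=0 set, a CachedOp configured with
# forward_bulk_size=15 should behave as if no bulking were requested.
print(effective_bulk_size(15, is_training=True))
```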
Please review the latest commits, which address this valid concern about consistency. I also consolidated all op-bulking env var references to a central place and added timing-based tests for the perf impact of bulking. I'm happy with the PR now (assuming it passes CI). Anyone else you want to pull into the review @eric-haibin-lin?
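As an illustration of the timing-based approach (not the exact test code added in this PR), here is a sketch that times a hybridized Gluon net in a spawned process so the bulking env var takes effect before MXNet is imported:

```python
import multiprocessing as mp
import os
import time

def _time_fwd_bwd(bulk_size, queue):
    # The env var must be set before mxnet is imported, hence the spawned process.
    os.environ["MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN"] = str(bulk_size)
    import mxnet as mx
    from mxnet import autograd, gluon

    net = gluon.nn.HybridSequential()
    with net.name_scope():
        for _ in range(50):
            net.add(gluon.nn.Dense(32, activation='relu'))
    ctx = mx.gpu(0)
    net.initialize(ctx=ctx)
    net.hybridize(static_alloc=True, static_shape=True)
    x = mx.nd.ones((32, 32), ctx=ctx)

    # Warm up, then time many small fwd/bwd passes; launch overhead dominates
    # when segments are small, so larger bulking should reduce wall time.
    for _ in range(5):
        with autograd.record():
            y = net(x)
        y.backward()
    mx.nd.waitall()
    start = time.time()
    for _ in range(100):
        with autograd.record():
            y = net(x)
        y.backward()
    mx.nd.waitall()
    queue.put(time.time() - start)

def timed_run(bulk_size):
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=_time_fwd_bwd, args=(bulk_size, q))
    p.start()
    p.join()
    return q.get()

if __name__ == "__main__":
    print("bulk_size=1: %.3fs, bulk_size=15: %.3fs" % (timed_run(1), timed_run(15)))
```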
@mxnet-label-bot add[Backend]
@eric-haibin-lin Thanks for your comments thus far. The PR is back in a good state for you to complete your review, although it appears testing on Windows has been bypassed. Happy to respond to further comments from you or from others you may wish to pull into the review.
One minor comment. Otherwise looks good. cc @szha
* Bulked op seg size to ignore Variable nodes, limited by MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_{FWD,BWD}.
* Document new env variables. Unify operation with imperative.
* Add timing-based tests of symbol and gluon op bulking.
* Rename test_in_separate_process -> run_in_spawned_process.
* Remove redundant util test_operator_gpu.py:_test_in_separate_process().
* Consolidate references to env vars that set op-bulking policy.
* Test for effect of MXNET_EXEC_BULK_EXEC_TRAIN=0.
* Fix python2 print() issue.
* Trigger CI.
* Consolidate similar op bulking routines.
* Trigger CI.
* Trigger CI.
* Add instrumentation to debug failing CI.
Description
Background: Operators are bulked into segments that define synchronization points between the CPU-based worker threads and the (often GPU-based) operator execution. While the default segment size limit is 15, in practice segments are much smaller due to the current restriction that 'Variable' nodes (e.g. weight inputs to Convolution) terminate segment formation. This restriction is unnecessary and can be removed to increase segment size and improve performance.
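To make the restriction concrete, here is a small illustrative snippet (not from the PR) that builds a typical conv block and lists the Variable inputs interleaved with its compute nodes:

```python
import mxnet as mx

data = mx.sym.Variable('data')
conv = mx.sym.Convolution(data=data, num_filter=64, kernel=(3, 3), pad=(1, 1), name='conv')
bn   = mx.sym.BatchNorm(data=conv, name='bn')
act  = mx.sym.Activation(data=bn, act_type='relu', name='relu')

# Besides 'data', the graph contains Variable nodes for the Convolution weight/bias
# and the BatchNorm gamma/beta, all feeding the three compute nodes. Under the old
# rule, each such Variable ended the current bulked segment.
print(act.list_arguments())
# e.g. ['data', 'conv_weight', 'conv_bias', 'bn_gamma', 'bn_beta']
```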
This PR changes the operator bulking code to allow Variable nodes to be part of a segment without counting them toward the limit, which is currently set by the environment variable MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN (default = 15). Larger segments are desirable because they improve both CPU and GPU efficiency. However, practical considerations keep full forward- and backward-pass bulking from being optimal. For example, in a multi-GPU training scenario, gradients produced in the backward pass must wait until the end of a segment before gradient reduction can start. Because the optimum may differ between the forward and backward passes, this PR adds two additional environment variable 'knobs' (their precedence is sketched after the list below):
MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD - sets the maximum size of bulked operator segments in the forward pass. If unset, MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN is used; if that is also unset, the default of 15.
MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD - sets the maximum size of bulked operator segments in the backward pass. If unset, MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN is used; if that is also unset, the default of 15.
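A minimal sketch of this precedence (mirroring the description above, not the C++ implementation); note that these variables must be set before MXNet reads them, e.g. in the launching shell or before import:

```python
import os

def resolved_bulk_limit(direction):
    """direction is 'FWD' or 'BWD'; falls back to the generic limit, then to 15."""
    generic = os.environ.get("MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN", "15")
    return int(os.environ.get("MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_" + direction, generic))

# Example: bulk the forward pass aggressively, but keep backward segments shorter
# so per-segment gradients become available to gradient reduction sooner.
os.environ["MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_FWD"] = "15"
os.environ["MXNET_EXEC_BULK_EXEC_MAX_NODE_TRAIN_BWD"] = "8"
print(resolved_bulk_limit("FWD"), resolved_bulk_limit("BWD"))  # -> 15 8
```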
While this PR is being reviewed, I will make a further post detailing the resulting performance gains. A commit adding documentation for the new environment variables is also due.
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.
Changes
Comments