Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Fix mkldnn backend when using naive engine #15089

Merged
merged 5 commits into from
Jun 1, 2019

Conversation

ZhennanQin
Copy link
Contributor

@ZhennanQin ZhennanQin commented May 29, 2019

Description

This can fix #15078 and resolve all mkldnn related failures mentioned in #15005.

@pengzhao-intel @TaoLv @eric-haibin-lin @zheng-da

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

Copy link
Contributor

@pengzhao-intel pengzhao-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zheng-da
Copy link
Contributor

Could you explain why this modification can fix the problem?

@ZhennanQin
Copy link
Contributor Author

ZhennanQin commented May 30, 2019

@zheng-da Sure. The root cause is, PushAsync works differently between threaded_engine and naive_engine.
For threaded_engine, PushAsync really works asynchronously, which will push MKLDNNDataReorderAsync to the tail of engine task queue. This is the expected behavior.
But for naive_engine, PushAsync will execute weight reorder immediately, just as a synchronous operation. I'm not sure if this is expected, but this isn't as asynchronous as function name indicates.

Now let's look at conv weight case,

weight_mem = GetWeights(weight, fwd->fwd_pd.weights_primitive_desc(), param.conv_param.num_group);

At this time, weight is still in default format, so a reorder is registered inside mkldnn stream, but not submitted. And then,

weight.MKLDNNDataReorderAsync(fwd->fwd_pd.weights_primitive_desc());

When naive_engine is activated, This will inplace reorder weight into mkldnn format immediately. So weight is reordered twice before mkldnn conv kernel launch, causing incorrect result.

The fix is quite simple, change the sequence of above code, to ensure

weight.MKLDNNDataReorderAsync(fwd->fwd_pd.weights_primitive_desc());

happens prior to

weight_mem = GetWeights(weight, fwd->fwd_pd.weights_primitive_desc(), param.conv_param.num_group);

So for GetWeights, the weight is already reordered, no extra reorder will be registered in mkldnn stream.

Do we have plan to adjust naive engine PushAsync behavior to be as same as threaded_engine? I'm worried about this will cause trouble again for other developer.

@abhinavs95
Copy link
Contributor

@mxnet-label-bot add [MKLDNN]

@pengzhao-intel
Copy link
Contributor

@zheng-da @eric-haibin-lin please help take a review :)

Copy link
Member

@eric-haibin-lin eric-haibin-lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Shall we add some documentation in the code, too?

@szha
Copy link
Member

szha commented May 31, 2019

@ZhennanQin NaiveEngine's PushAsync was intentionally synchronous. Other code should not make any assumption about execution order when using async interface of any engine.

@szha
Copy link
Member

szha commented May 31, 2019

In addition to the docs, should we add test for it?

@eric-haibin-lin
Copy link
Member

We need a separate test which set naive engine env var before mxnet is imported

@ZhennanQin
Copy link
Contributor Author

@szha @eric-haibin-lin We'd better to create naive engine CI for all UT, not only for some mkldnn tests. Naive engine should have the ability to pass all UT.

@pengzhao-intel
Copy link
Contributor

I am going to merge this PR to fix the problem inside MKLDNN implementation under the naive engine.
We will submit the separated PR to enable naive engine in CI @juliusshufan

@pengzhao-intel pengzhao-intel merged commit cdb7b72 into apache:master Jun 1, 2019
@xinyu-intel xinyu-intel mentioned this pull request Jun 3, 2019
7 tasks
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Fix mkldnn backend when using naive engine

* Rerun CI

* Rerun CI

* Rerun CI

* Add comment
@ZhennanQin ZhennanQin deleted the fix_naive branch September 16, 2019 07:00
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Naive engine produce incorrect result on MKLDNN backend
7 participants