
Improve sparse adagrad update #9651

Merged: 10 commits merged into apache:master on Mar 3, 2018

Conversation

@eric-haibin-lin (Member) commented on Jan 31, 2018

Description

  • The weight decay term in Adagrad is incorrect. See "Incorrect weight_decay implementation in AdaGrad" (#9363) for more details.
  • The sparse update is not fast enough for the language model I'm training. This PR moves the logic to the C++ backend so that we don't suffer from blocking calls such as RowSparseNDArray.indices and RowSparseNDArray.data (see the sketch below).
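
For context, here is a minimal, illustrative C++ sketch of the per-row update a row-sparse AdaGrad step performs; the names (SparseAdagradUpdate, grad_idx, state, ...) are hypothetical and not the operator's actual identifiers.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch: apply AdaGrad only to the rows present in the row-sparse gradient.
void SparseAdagradUpdate(std::vector<float>* weight,            // dense weights, shape [rows, cols]
                         std::vector<float>* state,             // accumulated squared gradients
                         const std::vector<float>& grad,        // packed non-zero gradient rows
                         const std::vector<int64_t>& grad_idx,  // row ids of the non-zero rows
                         int64_t cols, float lr, float rescale_grad,
                         float clip_gradient, float epsilon) {
  for (size_t i = 0; i < grad_idx.size(); ++i) {
    const int64_t row = grad_idx[i];
    for (int64_t j = 0; j < cols; ++j) {
      float g = rescale_grad * grad[i * cols + j];
      if (clip_gradient > 0.0f) {                                 // optional gradient clipping
        g = std::max(std::min(g, clip_gradient), -clip_gradient);
      }
      const int64_t k = row * cols + j;
      (*state)[k] += g * g;                                       // history += grad^2
      (*weight)[k] -= lr * g / std::sqrt((*state)[k] + epsilon);  // weight -= lr * grad / sqrt(history + eps)
    }
  }
}

Performing the whole loop in one backend kernel avoids the frontend round-trips needed to read the gradient's indices and data.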

@ZiyueHuang @piiswrong @sxjscience @cjolivier01 @szha
Commit author is mistaken. Nvm.

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@ZiyueHuang (Member) left a comment


Looks good. Just some minor comments.

}
const DType grad_squared = grad_rescaled * grad_rescaled;
state_data[data_j] += grad_squared;
const DType div = grad_rescaled / math::sqrt(state_data[data_j] + epsilon);

nit: use mshadow_op::square_root::Map? It seems that the other kernels use mshadow_op instead of calling math functions directly.
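
As an illustration of that suggestion (reusing the names from the diff hunk above; not the PR's final code), the division would become:

// Hypothetical rewrite of the sqrt call using mshadow_op::square_root::Map.
const DType div = grad_rescaled /
                  mshadow_op::square_root::Map(state_data[data_j] + epsilon);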

state_data[data_j] += grad_squared;
const DType div = grad_rescaled / math::sqrt(state_data[data_j] + epsilon);
// No need to use KERNEL_ASSIGN, as we already checked req is kWriteInplace
out_data[data_j] = weight_data[data_j] + div * -lr;

nit: write this as weight_data[data_j] - div * lr?

'rescale_grad': self.rescale_grad}
if self.clip_gradient:
    kwargs['clip_gradient'] = self.clip_gradient
sparse.adagrad_update(weight, grad, history, out=weight, lr=lr, wd=wd, **kwargs)

Since wd is still passed to the operator, we can just check whether wd is 0 or not in the backend.

@eric-haibin-lin (Member, Author)

Done
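
For illustration, such a backend check might look roughly like the following (hypothetical; the merged code may phrase it differently):

// Reject non-zero weight decay inside the operator instead of raising in Python.
CHECK_EQ(param.wd, 0.0f)
    << "sparse adagrad_update does not support weight decay (wd must be 0)";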

float wd;
DMLC_DECLARE_PARAMETER(AdagradParam) {
  DMLC_DECLARE_FIELD(lr)
    .describe("Learning rate");

indentation?

@eric-haibin-lin (Member, Author)

It's indented. What's the concern?


Never mind, it's correct.

@cjolivier01 (Member)

Is this ready to go in?

@eric-haibin-lin changed the title from "Fix weight decay in Adagrad and improve sparse adagrad update" to "Improve sparse adagrad update" on Mar 1, 2018
@eric-haibin-lin (Member, Author)

Hi everyone,

Considering that the wd term is implemented inconsistently in many other optimizers (see #9881), I removed the changes for wd in this PR so that the scope of changes is limited. Let's discuss how to fix the behavior of wd systematically in a separate issue/PR.

@eric-haibin-lin added the Sparse label and removed the Bug label on Mar 1, 2018
@eric-haibin-lin merged commit fc9e70b into apache:master on Mar 3, 2018
eric-haibin-lin added a commit to eric-haibin-lin/mxnet that referenced this pull request Mar 4, 2018
* fix adagrad

* add test

* fix lint

* CR comments

* remove raise in python

* enhance unit test

* revert wd changes

* revert dense op changes
jinhuang415 pushed a commit to jinhuang415/incubator-mxnet that referenced this pull request Mar 30, 2018
rahul003 pushed a commit to rahul003/mxnet that referenced this pull request Jun 4, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
@eric-haibin-lin deleted the adagrad-fix branch on September 18, 2018