Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

test_ImageRecordIter_seed_augmentation flaky test fix #12485

Merged

Conversation

perdasilva
Copy link
Contributor

@perdasilva perdasilva commented Sep 8, 2018

Description

Fixes flakiness with ImageRecordIter augmentation seed tests (#11359) by moving the seed_aug parameter and RNG seeding out of the default augmenter. We now seed the RNG before each image augmentation (if seed_aug parameter is set). This guarantees reproducibility while still enabling multithreaded augmentation (thus fixing the flakiness).
Additionally, it makes the test more robust by testing the whole iterator and not just the first image.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@perdasilva perdasilva changed the title [WIP]test_ImageRecordIter_seed_augmentation flaky test fix [WIP] test_ImageRecordIter_seed_augmentation flaky test fix Sep 9, 2018
@perdasilva perdasilva force-pushed the imagerecorditer_flaky_test_fix_11359 branch from 0deffe9 to 6aa8d75 Compare September 11, 2018 20:08
@kalyc
Copy link
Contributor

kalyc commented Sep 14, 2018

Thanks for your contribution @perdasilva

@mxnet-label-bot[pr-work-in-progress]

@marcoabreu marcoabreu added the pr-work-in-progress PR is still work in progress label Sep 14, 2018
@sandeep-krishnamurthy
Copy link
Contributor

@perdasilva - Did you run the tests multiple times to verify it is not flaky anymore?

@perdasilva perdasilva force-pushed the imagerecorditer_flaky_test_fix_11359 branch from 6aa8d75 to 97ae4a9 Compare September 24, 2018 07:58
@perdasilva
Copy link
Contributor Author

perdasilva commented Sep 25, 2018

@sandeep-krishnamurthy sorry for the delay in the response. I've been absent lately because life got in the way. The issue was the multi-threading, I've just done a rebase from master and it seems the number of threads has been reduced to 1 (which fixes the flakyness). I need to look at the history of the file to see what's changed since I was last looking at this problem.

EDIT: The problem is still there. I think something might have changed in the build system in the mean time and MP wasn't enabled. I've rebuilt from scratch with MP and it's still flaky.

@perdasilva
Copy link
Contributor Author

I have sketched out a fix to the flakiness here, although its a bit of a hack at the moment.
The reason for the flakiness has to do with the parallel processing and the random number generators used for the augmenters.

One random number generator is created per preprocessing thread (https://github.com/apache/incubator-mxnet/blob/master/src/io/iter_image_recordio_2.cc#L162). While the image records are retrieved in the same order (https://github.com/apache/incubator-mxnet/blob/master/src/io/iter_image_recordio_2.cc#L508) and always stored in the same index in the output array (https://github.com/apache/incubator-mxnet/blob/master/src/io/iter_image_recordio_2.cc#L567) - they are not guaranteed to be processed by the same thread every time. This matter because the default augmenter seeds the thread's random number generator (https://github.com/apache/incubator-mxnet/blob/master/src/io/image_aug_default.cc#L255) the first time it gets called (https://github.com/apache/incubator-mxnet/blob/master/src/io/iter_image_recordio_2.cc#L563).

This means that for different runs, the same record can be processed in a different iteration of the usage of the random number generator by the augmenters, thus leading to different random numbers being generated, which lead to different changes to the image -> flakyness.

To remedy the issue, I've changed to code such that the augmenter seed is a parameter to the image parser. I then use this seed and the image record index to seed the random number generator (outside of the default augmenter) before any changes are made to the image. Since the records are always retrieved in the same order, the same record will always have the same generator seed, independent of the number of threads used -> same random numbers being generated -> reproducible.

It's a bit of a hack, so I would appreciate some input from the code owner (@anirudh2290 ) and the community.

@perdasilva perdasilva force-pushed the imagerecorditer_flaky_test_fix_11359 branch 2 times, most recently from c6152f9 to 1550da3 Compare October 1, 2018 07:50
@sandeep-krishnamurthy
Copy link
Contributor

@haojin2 - Can you provide some inputs here? Thanks.

@Roshrini
Copy link
Member

@haojin2 @apeforest Can you please review this PR? Thanks

@ankkhedia
Copy link
Contributor

@haojin2 @apeforest ping for reviewing the PR!

LOG(INFO) << "tid: " << tid << " idx: " << idx << " index: " << rec.image_index();
}
if (param_.seed_aug.has_value()) {
prnds_[tid]->seed(idx + param_.seed_aug.value());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand why seeds need to be reset all the time. The logic is a bit obscure - the random generators are in the augmentor and get passed to the iterator.
Anyway, what you could try to do (if not refactoring) is to seed all generators in DefaultImageAugmenter::Init here or in a separate method (since the random generator is not passed in to Init). I guess the interface ImageAugmenter would require a separate seed method then that could be called sequentially (instead of openmp parallelized).

Copy link
Contributor Author

@perdasilva perdasilva Nov 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two requirements here: that setting the seed will yield reproducible results and that parallelization should be used to speed up image augmentation. We need to reset the seed for each image because there is no guarantee that the same image will be processed by the same thread, or that even that it will be the ith image processed by that thread, across independent runs or even different hardware (the code figures out the number of threads to use for processing). Meaning the order of draws from the random number generator to augment the same image is not guaranteed. Therefore, resetting the random number generator at the start of processing an image it the only way (at least that I could think of) to guarantee reproducibility when setting a fixed seed. That is, to guarantee that the same random distortions will be applied to the same image across independent runs and different hardware configurations. I hope this make it a little clearer. It's not the easiest topic to discuss in written form lol.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your assessment - it's pretty obscure and it took me a while to figure out exactly what was going on.

@perdasilva perdasilva force-pushed the imagerecorditer_flaky_test_fix_11359 branch from 1550da3 to 12ffd99 Compare November 5, 2018 08:00
@@ -165,6 +167,8 @@ struct ImageRecParserParam : public dmlc::Parameter<ImageRecParserParam> {
.describe("The data shuffle buffer size in MB. Only valid if shuffle is true.");
DMLC_DECLARE_FIELD(shuffle_chunk_seed).set_default(0)
.describe("The random seed for shuffling");
DMLC_DECLARE_FIELD(seed_aug).set_default(dmlc::optional<int>())
Copy link
Member

@anirudhacharya anirudhacharya Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I am missing something here, but the PR says it fixes a flaky test.

Are we adding this new field just to fix the flakiness of a particular test? It does not seem right. we can't change the functionality, just so that it will be easier for us to run tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair call. However, the problem, I think, is architectural.
The seed_aug parameter is passed in to the default image augmenter (https://github.com/apache/incubator-mxnet/blob/master/src/io/image_aug_default.cc#L101). But, the actual random number generator is external, but seeded internally (https://github.com/apache/incubator-mxnet/blob/master/src/io/image_aug_default.cc#L254).

As I've explained above, I can't see how to meet the requirements of processing the images in parallel and ensuring that the output is reproducible (by setting the augmentation seed).

My fix for the flakiness (without violating the aforementioned requirements), is to ensure that the RNG used to process the image is seeded before processing the image within that thread it is processed in.

To be clear, none of the changes made were to facilitate running the tests, but to actually correct the underlying problem. I agree that the PR is a little rough around the edges atm, but what I'm trying to do is reach out to the community to understand what would be the best way to actually fix the underlying problem.

@@ -427,7 +427,7 @@ def check_CSVIter_synthetic(dtype='float32'):
for dtype in ['int32', 'int64', 'float32']:
check_CSVIter_synthetic(dtype=dtype)

@unittest.skip("Flaky test: https://github.com/apache/incubator-mxnet/issues/11359")
# @unittest.skip("Flaky test: https://github.com/apache/incubator-mxnet/issues/11359")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove rather than comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Although again, I need a review of my current approach to see if its the right one. If I get validation from that, I can then prepare the PR for proper review...
Many of nits I'll fix before removing [WIP]

@kalyc
Copy link
Contributor

kalyc commented Nov 14, 2018

@perdasilva thanks for your contributions. We are looking forward to merging your PR. Could you please address comments above to proceed further?

@perdasilva perdasilva force-pushed the imagerecorditer_flaky_test_fix_11359 branch 2 times, most recently from 728586b to 6984ba6 Compare November 15, 2018 12:03
@perdasilva perdasilva requested a review from szha as a code owner November 15, 2018 12:03
@perdasilva perdasilva force-pushed the imagerecorditer_flaky_test_fix_11359 branch 2 times, most recently from ac885eb to 5dcea9d Compare November 15, 2018 15:11
@stu1130
Copy link
Contributor

stu1130 commented Nov 21, 2018

@perdasilva is it ready for review?

@perdasilva perdasilva force-pushed the imagerecorditer_flaky_test_fix_11359 branch from 5dcea9d to aa3d167 Compare November 22, 2018 15:33
@perdasilva
Copy link
Contributor Author

@stu1130 I will polish it tomorrow and submit it for review. I'm sorry for the delay.

@perdasilva perdasilva force-pushed the imagerecorditer_flaky_test_fix_11359 branch from aa3d167 to c92ccba Compare November 23, 2018 06:12
@perdasilva perdasilva changed the title [WIP] test_ImageRecordIter_seed_augmentation flaky test fix test_ImageRecordIter_seed_augmentation flaky test fix Nov 23, 2018
@perdasilva
Copy link
Contributor Author

@stu1130 ready for review

Copy link
Contributor

@lebeg lebeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good

def assert_dataiter_equals(dataiter1, dataiter2):
for batch1, batch2 in zip_longest(dataiter1, dataiter2):
# ensure iterators are of same length
assert(batch1 and batch2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this assert doing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a function for asserting that two data iterators are producing the same output. The data iterator iterates over batches (lists of arrays). So, basically it runs through the iterators to get the batches, then iterates through the batches and ensures that they are equal.

for batch1, batch2 in zip_longest(dataiter1, dataiter2):

# try to ensure iterators are of same length
assert(batch1 and batch2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this ensuring same length?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that (as far as I could tell), there was no easy way of getting the number of batches in an iterator. It doesn't have a size or length method. So, I'm using the zip_longest (https://docs.python.org/3.0/library/itertools.html#itertools.zip_longest) to iterate through the iterators together. If they are don't have the same length, a fill value (None by default) will be returned in place of an item (or batch in this case). Therefore, if one of batch1 or batch2 is none, the assertion will be triggered.


# check whether seed_aug changes the iterator behavior
dataiter = mx.io.ImageRecordIter(
dataiter1 = mx.io.ImageRecordIter(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is getting tested here that is not being tested in the first test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was already here when I got here =) I think it was meant as a sanity check to ensure that the seed aug parameter is not influencing the iterator (other than in the case where the iterator has augmentations applied to its images).

if sys.version_info >= (3,0):
from itertools import zip_longest
else:
from itertools import izip_longest as zip_longest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion for handling different python versions

try:
    from itertools import izip_longest as zip_longest
except:
    from itertools import zip_longest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated it with your suggestion. Much nicer.

@perdasilva perdasilva force-pushed the imagerecorditer_flaky_test_fix_11359 branch from c92ccba to 451291b Compare November 29, 2018 08:25
@perdasilva
Copy link
Contributor Author

@anirudhacharya all good now?

@perdasilva perdasilva force-pushed the imagerecorditer_flaky_test_fix_11359 branch 3 times, most recently from 39117c5 to 88d2d6d Compare December 7, 2018 06:43
@anirudhacharya
Copy link
Member

LGTM.
@anirudh2290 @apeforest for review/merge

@nswamy nswamy added the Test label Dec 8, 2018
@apeforest
Copy link
Contributor

apeforest commented Dec 10, 2018

@ChaiBapchya @stu1130 Could you please review this change?

Copy link
Contributor

@stu1130 stu1130 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Have you run the test_ImageRecordIter_seed_augmentation at least 10000 times to see if it's still flaky?

@perdasilva
Copy link
Contributor Author

@stu1130 - not 10000x. I will do that tomorrow and confirm ^^

@perdasilva
Copy link
Contributor Author

@stu1130 - ok each test takes around 10 seconds - might take a tad bit longer than expected...

@perdasilva
Copy link
Contributor Author

@stu1130 - I can confirm, all good and flake free, please merge at your convenience

Copy link
Contributor

@ChaiBapchya ChaiBapchya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@stu1130
Copy link
Contributor

stu1130 commented Dec 13, 2018

@anirudh2290 could you take a look at this and merge it if it looks good

@perdasilva
Copy link
Contributor Author

@anirudh2290 any chance you could merge this? =)

@Roshrini
Copy link
Member

Roshrini commented Jan 2, 2019

@marcoabreu @sandeep-krishnamurthy Can someone merge this PR?
@mxnet-label-bot Update [pr-awaiting-merge]

@marcoabreu marcoabreu added pr-awaiting-merge Review and CI is complete. Ready to Merge and removed Test pr-awaiting-review PR is waiting for code review labels Jan 2, 2019
@perdasilva perdasilva force-pushed the imagerecorditer_flaky_test_fix_11359 branch from 88d2d6d to 1861e85 Compare January 9, 2019 07:11
@stu1130
Copy link
Contributor

stu1130 commented Jan 16, 2019

@sandeep-krishnamurthy @nswamy could someone merge this PR?

@marcoabreu marcoabreu merged commit d07187b into apache:master Jan 18, 2019
@perdasilva perdasilva deleted the imagerecorditer_flaky_test_fix_11359 branch January 18, 2019 18:47
jessr92 pushed a commit to jessr92/incubator-mxnet that referenced this pull request Jan 27, 2019
* Moves seed_aug parameter to ImageRecParserParam and re-seeds RNG before each augmentation to guarantee reproducibilit

* Update image record iterator tests to check the whole iterator not only first image
stephenrawls pushed a commit to stephenrawls/incubator-mxnet that referenced this pull request Feb 16, 2019
* Moves seed_aug parameter to ImageRecParserParam and re-seeds RNG before each augmentation to guarantee reproducibilit

* Update image record iterator tests to check the whole iterator not only first image
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Moves seed_aug parameter to ImageRecParserParam and re-seeds RNG before each augmentation to guarantee reproducibilit

* Update image record iterator tests to check the whole iterator not only first image
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-merge Review and CI is complete. Ready to Merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.