Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate flaky sequential/test-async-wrap-getasyncid #18190

Closed
joyeecheung opened this issue Jan 16, 2018 · 21 comments
Closed

Investigate flaky sequential/test-async-wrap-getasyncid #18190

joyeecheung opened this issue Jan 16, 2018 · 21 comments
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.

Comments

@joyeecheung
Copy link
Member

joyeecheung commented Jan 16, 2018

  • Version: master
  • Platform: windows, linux, arm(?)
  • Subsystem: async_wrap

Opening a separate issue for the flake reported in #13020 (comment) , I have not seem it for a while but tonight it has come back in different unrelated PRs.

Sample:

https://ci.nodejs.org/job/node-test-commit-linux/15590/nodes=alpine37-container-x64/console
https://ci.nodejs.org/job/node-test-binary-windows/14242/COMPILED_BY=vs2017,RUNNER=win10,RUN_SUBSET=3/console

not ok 1953 sequential/test-async-wrap-getasyncid
  ---
  duration_ms: 0.356
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (/home/iojs/build/workspace/node-test-commit-linux/nodes/alpine37-container-x64/test/common/index.js:497:10)
        at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux/nodes/alpine37-container-x64/test/sequential/test-async-wrap-getasyncid.js:19:35)
        at Module._compile (module.js:660:30)
        at Object.Module._extensions..js (module.js:671:10)
        at Module.load (module.js:577:32)
        at tryModuleLoad (module.js:517:12)
        at Function.Module._load (module.js:509:3)
        at Function.Module.runMain (module.js:701:10)
        at startup (bootstrap_node.js:194:16)
  ...
@joyeecheung joyeecheung added test Issues and PRs related to the tests. async_wrap flaky-test Issues and PRs related to the tests with unstable failures on the CI. labels Jan 16, 2018
@tniessen
Copy link
Member

Also here (Alpine) and here (Windows 10).

@BridgeAR
Copy link
Member

Closing as duplicate of #13020

@joyeecheung
Copy link
Member Author

joyeecheung commented Jan 18, 2018

@BridgeAR This issue appears after the fix of #13020 landed, see #13020 (comment) and #13020 (comment)

@joyeecheung joyeecheung reopened this Jan 18, 2018
@joyeecheung
Copy link
Member Author

@BridgeAR I'll close the original issue. Have not seen a CRASHED in this test in these two months so we probably don't need to open a separate issue for #13020 (comment)

@joyeecheung joyeecheung added the windows Issues and PRs related to the Windows platform. label Jan 18, 2018
@joyeecheung
Copy link
Member Author

This is similar to #16210, both beforeExit being emitted twice

@apapirovski
Copy link
Member

@joyeecheung I have a strong feeling this might be related to eca73a2 or 791975d — this only started failing after those commits landed and at a quick glance, the test does use node_file.cc in a number of tests.

@joyeecheung
Copy link
Member Author

joyeecheung commented Jan 18, 2018

@apapirovski The last CI run of eca73a2 was https://ci.nodejs.org/job/node-test-pull-request/12540/, which did not contain this flake. The CI run before 791975d landed was https://ci.nodejs.org/job/node-test-pull-request/12562/ (for another PR ) and there was already this flake...still investigating before those CI runs expired

@joyeecheung
Copy link
Member Author

I think the flake first appeared in https://ci.nodejs.org/job/node-test-pull-request/12562/ for #18186 (has not yet landed), and it was rebasing onto 316b5ef

@joyeecheung
Copy link
Member Author

joyeecheung commented Jan 18, 2018

Wait...I think I've seen one of the tests in 316b5ef failed on my mac this week (by looking at my out/Release/node shell history, I ran out/Release/node /Users/joyee/projects/node/test/addons-napi/test_typedarray/test.js because it failed). I was never able to reproduce it after make test-addons-napi-clean, so I thought it's probably something related to stale addon tests

@joyeecheung
Copy link
Member Author

joyeecheung commented Jan 18, 2018

@apapirovski
Copy link
Member

Ok, I know this is pushing it... but is there any way that https://ci.nodejs.org/job/node-test-pull-request/12562/ was wrong about what it was rebased on? It's the only outlier that doesn't point to the problem being with #17914 — the timestamps even align.

The other commit literally has nothing to do with the code being tested... :(

@joyeecheung
Copy link
Member Author

The git-rebase log of https://ci.nodejs.org/job/node-test-pull-request/12562/

+ git rev-parse HEAD
85a4230b32d9efb33733a623b7348b090bf1268d
+ git rev-parse 316b5efd6b02379a490ce62c31a13122729ff1dd
316b5efd6b02379a490ce62c31a13122729ff1dd
+ git rebase 316b5efd6b02379a490ce62c31a13122729ff1dd
First, rewinding head to replay your work on top of it...
Applying: util: implement util.getErrorName()
Auto packing the repository in background for optimum performance.
See "git help gc" for manual housekeeping.
error: The last gc run reported the following. Please correct the root cause
and remove .git/gc.log.
Automatic cleanup will not be performed until the file is removed.

error: pack-objects died of signal 15
error: failed to run repack

[git-rebase] $ /bin/sh -xe /tmp/jenkins3416429728392910523.sh
+ git checkout -B jenkins-node-test-commit-windows-fanned-14926
Switched to a new branch 'jenkins-node-test-commit-windows-fanned-14926'
+ grep -q ^github.com /home/iojs/.ssh/known_hosts
+ git push [email protected]:janeasystems/node_binary_tmp.git +jenkins-node-test-commit-windows-fanned-14926
To [email protected]:janeasystems/node_binary_tmp.git
 * [new branch]      jenkins-node-test-commit-windows-fanned-14926 -> jenkins-node-test-commit-windows-fanned-14926

Also there wasn't this flake in the CI run of #18037

@joyeecheung
Copy link
Member Author

We should do some stress tests on 316b5ef and 791975d

@apapirovski
Copy link
Member

Ok... a shot in the dark: what if we have two different bugs here. Because the main issue occurring now is on alpine but that one issue above is on Windows. So perhaps #17914 introduced a more frequent failure on alpine but we also have some other failure that happens very infrequently.

@joyeecheung
Copy link
Member Author

There was also #13020 (comment) which has the same error stack trace and it's on arm, although it has not appeared for a long time so I am not sure if that's related or not

@apapirovski
Copy link
Member

apapirovski commented Jan 18, 2018

I suppose we could also consider #18144 as the culprit, even though its CI didn't have any issues. *shrug*

@joyeecheung joyeecheung mentioned this issue Jan 18, 2018
4 tasks
@joyeecheung
Copy link
Member Author

No idea how to fire sequential stress tests on sha's...opened #18238 cause it's easier to run CIs

@joyeecheung
Copy link
Member Author

It showed up again...this was rebased onto 8803b69

https://ci.nodejs.org/job/node-test-commit-linux/15734/nodes=ubuntu1710-x64/console

not ok 1932 sequential/test-async-wrap-getasyncid
  ---
  duration_ms: 0.340
  severity: fail
  stack: |-
    Mismatched <anonymous> function calls. Expected exactly 1, actual 2.
        at Object.exports.mustCall (/home/iojs/build/workspace/node-test-commit-linux/nodes/ubuntu1710-x64/test/common/index.js:497:10)
        at Object.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux/nodes/ubuntu1710-x64/test/sequential/test-async-wrap-getasyncid.js:20:35)
        at Module._compile (module.js:660:30)
        at Object.Module._extensions..js (module.js:671:10)
        at Module.load (module.js:577:32)
        at tryModuleLoad (module.js:517:12)
        at Function.Module._load (module.js:509:3)
        at Function.Module.runMain (module.js:701:10)
        at startup (bootstrap_node.js:194:16)
  ...

@joyeecheung joyeecheung reopened this Jan 22, 2018
@apapirovski
Copy link
Member

FWIW I've done some preliminary debugging, the loop is being kept alive because there's a uv_close being called on some handle within the main event loop. Should be a good starting point if anyone wants to dig in.

@apapirovski
Copy link
Member

apapirovski commented Jan 22, 2018

Edit: Ignore me for now... I can't even tell if this is actually cares_wrap or not...

@BridgeAR BridgeAR reopened this Jan 26, 2018
addaleax pushed a commit to addaleax/node that referenced this issue Feb 27, 2018
Since the `DestroyAsyncIdsCallback` in Node.js can be scheduled as
a result of GC, that means that it can accidentally keep the event
loop open when it shouldn't.

Replace `SetImmediate` with the newly introduced `SetUnrefImmediate`
and in addition introduce RunBeforeExit callbacks, of which
`DestroyAsyncIdsCallback` is now the first. These callbacks will run
before the `beforeExit` event is emitted (which will now only be
emitted if the event loop is still not active).

PR-URL: nodejs#18241
Fixes: nodejs#18190
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Andreas Madsen <[email protected]>
MayaLekova pushed a commit to MayaLekova/node that referenced this issue May 8, 2018
Since the `DestroyAsyncIdsCallback` in Node.js can be scheduled as
a result of GC, that means that it can accidentally keep the event
loop open when it shouldn't.

Replace `SetImmediate` with the newly introduced `SetUnrefImmediate`
and in addition introduce RunBeforeExit callbacks, of which
`DestroyAsyncIdsCallback` is now the first. These callbacks will run
before the `beforeExit` event is emitted (which will now only be
emitted if the event loop is still not active).

PR-URL: nodejs#18241
Fixes: nodejs#18190
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: James M Snell <[email protected]>
Reviewed-By: Andreas Madsen <[email protected]>
@apapirovski
Copy link
Member

If anyone wants to help out, please have a look at the now closed #18307 and suggest better ways of accomplishing this. Or if you believe the original PR was good enough, let me know.

apapirovski added a commit that referenced this issue Jun 10, 2018
Instead of relying on garbage collection to close the handle,
manage its state more explicitly.

PR-URL: #21093
Fixes: #18190
Refs: #18307
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
apapirovski added a commit that referenced this issue Jun 10, 2018
PR-URL: #21093
Fixes: #18190
Refs: #18307
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
targos pushed a commit that referenced this issue Jun 10, 2018
Instead of relying on garbage collection to close the timer handle,
manage its state more explicitly.

PR-URL: #21093
Fixes: #18190
Refs: #18307
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
targos pushed a commit that referenced this issue Jun 10, 2018
Instead of relying on garbage collection to close the handle,
manage its state more explicitly.

PR-URL: #21093
Fixes: #18190
Refs: #18307
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
targos pushed a commit that referenced this issue Jun 10, 2018
PR-URL: #21093
Fixes: #18190
Refs: #18307
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
targos pushed a commit that referenced this issue Jun 13, 2018
Instead of relying on garbage collection to close the timer handle,
manage its state more explicitly.

PR-URL: #21093
Fixes: #18190
Refs: #18307
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
targos pushed a commit that referenced this issue Jun 13, 2018
Instead of relying on garbage collection to close the handle,
manage its state more explicitly.

PR-URL: #21093
Fixes: #18190
Refs: #18307
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
targos pushed a commit that referenced this issue Jun 13, 2018
PR-URL: #21093
Fixes: #18190
Refs: #18307
Reviewed-By: Jeremiah Senkpiel <[email protected]>
Reviewed-By: Anna Henningsen <[email protected]>
Reviewed-By: Ben Noordhuis <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Colin Ihrig <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI. test Issues and PRs related to the tests. windows Issues and PRs related to the Windows platform.
Projects
None yet
4 participants