Skip to content
This repository has been archived by the owner on Oct 7, 2020. It is now read-only.

[Converge] test: make cluster tests more time tolerant #38

Closed
jasnell opened this issue May 22, 2015 · 15 comments
Closed

[Converge] test: make cluster tests more time tolerant #38

jasnell opened this issue May 22, 2015 · 15 comments

Comments

@jasnell
Copy link
Member

jasnell commented May 22, 2015

See: nodejs/node-v0.x-archive@f3f4e28

/cc @mhdawson

Original commit message:

simple tests test-cluster-master-error.js, test-cluster-master-kill.js
fails in AIX with assertion failure indicating that the workers were
alive even after the master terminated. A 200ms leeway is provided for
the workers to actually terminate, but the isAlive check returns
true in both the cases.

In AIX, the workers were actually terminating, but they took more time
- as much as 800ms (normal) to 1000ms (in rare cases).

Based on a C test we ran, it is found that the exit routines in AIX
is a bit more longer than that in Linux. There are a number of cleanup
activities performed in exit() system call, and depending on when the
signal handlers are shutdown in that sequence, the process will be
deemed as dead or alive, from another process's perspective.

process.kill(pid) is used in the test case to check the liveliness of
the worker, and when the kill() call is issued, even if the target
process is in it's exit sequences, if the signal handlers are not shut
down, it will respond to external signals, causing those calls to pass.

This fix extends the additional timeout for all platforms
@Fishrock123
Copy link
Contributor

I think we now have better ways of specifying platform-specific timeouts. cc @silverwind

@silverwind
Copy link
Contributor

Yes, there's two ways to achieve different timeouts per platform. First for individual timeout values inside tests with common.platformTimeout:

https://github.com/nodejs/io.js/blob/39dde3222e4733fc1b59c45e392d9ff1a88ae4cc/test/common.js#L183-L191

And for the overall test timeout (30s):

https://github.com/nodejs/io.js/blob/39dde3222e4733fc1b59c45e392d9ff1a88ae4cc/tools/test.py#L731-L734

@jasnell
Copy link
Member Author

jasnell commented May 27, 2015

@mhdawson ... can you take a look to see if the io.js changes mentioned by @silverwind address the original issue. If they do, then we can likely close this.

@Fishrock123
Copy link
Contributor

@jasnell @mhdawson I checked and they don't seem to, however it would be better to use platform-specific timeouts that we have for these kinds of things than just raising the timeout.

@Fishrock123
Copy link
Contributor

Hmmm, this seems to be a special-case for AIX's exit() though, so I'm not sure this can be applied as a common thing?

Thoughts? We could just take the patch, but it also means the test will take over a second on all machines. :/

@mhdawson
Copy link
Member

mhdawson commented Jul 7, 2015

I agree that we don't necessarily just want to up the general timeout for AIX.

If I remember correctly I think we had started with a special case in the code for AIX (either for this PR or some others related to timeouts), but in the review it was suggested we just apply to all platforms to avoid the extra platform checks.

@Fishrock123
Copy link
Contributor

@mhdawson your commit log above does have some detailed info on it, yes.

In io.js we've preferred to checking the platform it seems.

@rvagg
Copy link
Member

rvagg commented Aug 24, 2015

@mhdawson can I leave this for you to decide whether it needs to go in to nodejs/node prior to 4.0? I don't think it's that important cause it's only a test case but I'll leave it to you.

@jasnell
Copy link
Member Author

jasnell commented Aug 24, 2015

just fyi, @mhdawson is out on vacation for this week. I agree with your
assessment tho, so long as the current CI is passing, this shouldn't be a
priority for 4.0

On Mon, Aug 24, 2015 at 12:21 AM, Rod Vagg [email protected] wrote:

@mhdawson https://github.com/mhdawson can I leave this for you to
decide whether it needs to go in to nodejs/node prior to 4.0? I don't think
it's that important cause it's only a test case but I'll leave it to you.


Reply to this email directly or view it on GitHub
#38 (comment)
.

@mhdawson
Copy link
Member

mhdawson commented Sep 1, 2015

I'd like to be able to get it into the LTS stream, but provided I'll be able to do that later on I don't think its a blocker for 4.0. Based on the the discussion it sounds like I should just create a PR which makes the timeout longer only for AIX, right ?

@Fishrock123
Copy link
Contributor

@mhdawson Yes. I'd probably suggest adding AIX to common.platformTimeout() and then using that everywhere. (Unless this is super special case and AIX doesn't need it anywhere else, and nothing else needs it here.)

@mhdawson
Copy link
Member

mhdawson commented Sep 3, 2015

@Fishrock123 We have extended timeouts but not to the extent that I'd want to make all of the timeouts longer for AIX lat this point. I'd prefer to just as an AIX check to this test. If we end up having to change a significant number then I'd plan for us to change common.platformTimeout() as you suggest. Does this sound reasonable to you ?

@Fishrock123
Copy link
Contributor

@mhdawson Yep, that sounds fine. :)

@mhdawson
Copy link
Member

@Fishrock123, @jasnell PR created on master: nodejs/node#2891 could one of you review for me.

@mhdawson
Copy link
Member

Landed on master closing this issue nodejs/node@2853f98

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants