aix72-ppc64 is broken #2872
Comments
According to https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix72-ppc64/buildTimeTrend all of the AIX 7.2 hosts are affected. It appears that the Jenkins agent is being disconnected and reconnected, but the build is still continuing (there are processes still running even though Jenkins says the hosts are not busy). The aix-cleanup job has been run (manually by @Trott but also again on the timer) which cleared up the leftover processes but subsequent AIX builds still failed. I've tried running |
I'm not aware of anything changing on the machines recently that could be a suspect. There was a Jenkins update last week but we've had successful builds since that happened. Some Jenkins plugins were updated yesterday but again we were seeing failures before and after that happened. I've also checked we're on the latest Jenkins agent jar.
Would it be inappropriate to remove AIX from CI until this problem is resolved?
🤔 So https://ci.nodejs.org/job/node-test-commit-aix/40033/nodes=aix72-ppc64/console failed:
which appears to correspond to this from the Jenkins log on the CI server
|
1645022502096 - 1645022382095 = 120001. This appears to correspond to what I see in https://ci.nodejs.org/systemInfo -- that the following system properties have been set
These are being set in |
I haven't found anything in this repository suggesting whether we deliberately set the ping values. They're not set on ci-release. I'm going to see what happens if we let the ping settings fall back to their defaults.
Still pings out with defaults (240s, 1645032626467-1645032386352 = 240115) 😞:
|
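As a quick check of the arithmetic above, here is a minimal sketch that converts the two pairs of timestamps into seconds. It assumes the values are epoch milliseconds (which the 120 s and 240 s figures suggest); the function name `elapsed_seconds` is hypothetical and not part of Jenkins.

```python
def elapsed_seconds(start_ms: int, end_ms: int) -> float:
    """Return the gap between two timestamps, assumed to be epoch milliseconds."""
    return (end_ms - start_ms) / 1000


# Timestamps quoted in the comments above.
with_configured_ping = elapsed_seconds(1645022382095, 1645022502096)
with_default_ping = elapsed_seconds(1645032386352, 1645032626467)

print(f"configured ping: {with_configured_ping:.3f} s")
print(f"default ping:    {with_default_ping:.3f} s")
```

Both gaps land just over a round number of seconds (120 and 240), which is consistent with the disconnects being triggered by a ping timeout expiring rather than by an arbitrary network drop.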
I'm going to try temporarily disabling the ping timeout ( |
Thanks for investigating this @richardlau!
More builds have gone through (slowly) and succeeded. Going to step away for a bit as it's mid-evening here. FTR the |
So far we've seen no further disconnects, and of the 17 builds since the ping timeout was disabled, 16 passed and the one failure was a test failing. Planned next steps are to reintroduce the timeout at a larger value (say an hour or half an hour) to see whether the pings complete but take a long time, or aren't completing at all. I'll also ask OSUOSL whether anything has changed recently that might explain why the failures suddenly became consistent (every build) rather than intermittent.
I've set the ping timeout to 3600 (i.e. 1 hour) and restarted the agent on test-osuosl-aix72-ppc64_be-3. I'll run a couple of test builds to see if we get disconnects. If we do not, I may reduce it to half an hour.
@richardlau thanks for all of your hard work on this one.
7 passing builds with ping timeout of 3600. Trying 1800 (half an hour).
No ping timeouts at 1800 so far.
No ping timeouts at 900 so far.
AFAIK we haven't been seeing a recurrence of the issue on AIX in a while (or at least not often enough to be noticeable). We have, however, seen the ping timeout issue come up on the Pi's for Maybe we should just pick a large but reasonable value for the timeouts and run with that.
Theorycrafting time! On arm this is scaled by 3: and then for tests in pummel or benchmark again by 6: So the worst case is 36 (2*3*6) minutes. So let's set the Jenkins ping timeout greater than that. I'm inclined to go for an hour (3600 sec).
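The reasoning above can be sketched as a small calculation. This is only an illustration: the 2-minute base timeout is inferred from the "2*3*6" expression in the comment, and the constant and function names are hypothetical, not taken from the Node.js test runner or Jenkins.

```python
# Assumed inputs, taken from the "2*3*6" expression in the comment above.
BASE_TIMEOUT_MIN = 2   # assumed base per-test timeout, in minutes
ARM_SCALE = 3          # scaling applied on arm
SLOW_SUITE_SCALE = 6   # further scaling for pummel/benchmark tests


def worst_case_test_timeout_sec(base_min: int = BASE_TIMEOUT_MIN,
                                arch_scale: int = ARM_SCALE,
                                suite_scale: int = SLOW_SUITE_SCALE) -> int:
    """Longest a single test may run before the runner times it out, in seconds."""
    return base_min * arch_scale * suite_scale * 60


def ping_timeout_is_safe(ping_timeout_sec: int, worst_case_sec: int) -> bool:
    """A ping timeout is 'safe' here if it outlasts the slowest allowed test."""
    return ping_timeout_sec > worst_case_sec


worst_case = worst_case_test_timeout_sec()
for candidate in (900, 1800, 3600):
    verdict = "ok" if ping_timeout_is_safe(candidate, worst_case) else "too short"
    print(f"ping timeout {candidate:>4} s vs worst case {worst_case} s: {verdict}")
```

Under these assumptions the worst case is 36 minutes (2160 s), so of the values tried in this thread only 3600 s leaves headroom; 900 s and 1800 s would be shorter than the slowest permitted test on arm.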
I've updated the timeout to 3600 sec (1 hour). Let's see how it goes 🤞.
@richardlau Thanks so much for working on this. Are you aware of any related failures (or other problems) since the last change? If not, can this be closed at this point?
I am not aware of anything since the change.
The last successful build was more than a day ago. Since then, every build has been failing.