
aix72-ppc64 is broken #2872

Closed
tniessen opened this issue Feb 16, 2022 · 22 comments
Comments

@tniessen
Member

The last successful build was more than a day ago. Since then, every build has been failing.


@richardlau
Member

richardlau commented Feb 16, 2022

According to https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix72-ppc64/buildTimeTrend all of the AIX 7.2 hosts are affected.

It appears that the Jenkins agent is being disconnected and reconnected, but the build is still continuing (there are processes still running even though Jenkins says the hosts are not busy). The aix-cleanup job has been run (manually by @Trott but also again on the timer) which cleared up the leftover processes but subsequent AIX builds still failed.

I've tried running a v16.x-staging build through (https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix72-ppc64/40029/), but that has also failed. This suggests that the failures are unrelated to any recent Node.js core commits landing.

@richardlau
Member

I'm not aware of any recent change on the machines that could be a suspect. There was a Jenkins update last week, but we've had successful builds since then. Some Jenkins plugins were updated yesterday, but again we were seeing failures both before and after that. I've also checked that we're on the latest Jenkins agent jar.

@Trott
Member

Trott commented Feb 16, 2022

Would it be inappropriate to remove AIX from CI until this problem is resolved?

@richardlau
Member

🤔

So https://ci.nodejs.org/job/node-test-commit-aix/40033/nodes=aix72-ppc64/console failed:

14:41:41 ../deps/openssFATAL: command execution failed
14:41:41 java.nio.channels.ClosedChannelException
14:41:42 	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:238)
14:41:42 	at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:221)
14:41:42 	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:825)
14:41:42 	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:288)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:179)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:281)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:501)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:246)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:198)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:211)
14:41:42 	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:793)
14:41:42 	at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:172)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:340)
14:41:42 	at hudson.remoting.Channel.close(Channel.java:1501)
14:41:42 	at hudson.remoting.Channel.close(Channel.java:1454)
14:41:42 	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:894)
14:41:42 	at hudson.slaves.SlaveComputer.access$100(SlaveComputer.java:108)
14:41:42 	at hudson.slaves.SlaveComputer$2.run(SlaveComputer.java:774)
14:41:42 	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
14:41:42 	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
14:41:42 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
14:41:42 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
14:41:42 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
14:41:42 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
14:41:42 	at java.lang.Thread.run(Thread.java:748)
14:41:42 Caused: java.io.IOException: Backing channel 'JNLP4-connect connection from p8-aix2-nodejs04.osuosl.org/140.211.9.101:57172' is disconnected.
14:41:42 	at hudson.remoting.RemoteInvocationHandler.channelOrFail(RemoteInvocationHandler.java:216)
14:41:42 	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:286)
14:41:42 	at com.sun.proxy.$Proxy72.isAlive(Unknown Source)
14:41:42 	at hudson.Launcher$RemoteLauncher$ProcImpl.isAlive(Launcher.java:1213)
14:41:42 	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:1205)
14:41:42 	at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:194)
14:41:42 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:144)
14:41:42 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:91)
14:41:42 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
14:41:42 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:806)
14:41:42 	at hudson.model.Build$BuildExecution.build(Build.java:198)
14:41:42 	at hudson.model.Build$BuildExecution.doRun(Build.java:163)
14:41:42 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:514)
14:41:42 	at hudson.model.Run.execute(Run.java:1888)
14:41:42 	at hudson.matrix.MatrixRun.run(MatrixRun.java:153)
14:41:42 	at hudson.model.ResourceController.execute(ResourceController.java:99)
14:41:42 	at hudson.model.Executor.run(Executor.java:432)
14:41:42 FATAL: Unable to delete script file /tmp/jenkins1268877559770919254.sh
14:41:42 java.nio.channels.ClosedChannelException
14:41:42 	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer.onReadClosed(ChannelApplicationLayer.java:238)
14:41:42 	at org.jenkinsci.remoting.protocol.ApplicationLayer.onRecvClosed(ApplicationLayer.java:221)
14:41:42 	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecvClosed(ProtocolStack.java:825)
14:41:42 	at org.jenkinsci.remoting.protocol.FilterLayer.onRecvClosed(FilterLayer.java:288)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecvClosed(SSLEngineFilterLayer.java:179)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.switchToNoSecure(SSLEngineFilterLayer.java:281)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processWrite(SSLEngineFilterLayer.java:501)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processQueuedWrites(SSLEngineFilterLayer.java:246)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doSend(SSLEngineFilterLayer.java:198)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.doCloseSend(SSLEngineFilterLayer.java:211)
14:41:42 	at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.doCloseSend(ProtocolStack.java:793)
14:41:42 	at org.jenkinsci.remoting.protocol.ApplicationLayer.doCloseWrite(ApplicationLayer.java:172)
14:41:42 	at org.jenkinsci.remoting.protocol.impl.ChannelApplicationLayer$ByteBufferCommandTransport.closeWrite(ChannelApplicationLayer.java:340)
14:41:42 	at hudson.remoting.Channel.close(Channel.java:1501)
14:41:42 	at hudson.remoting.Channel.close(Channel.java:1454)
14:41:42 	at hudson.slaves.SlaveComputer.closeChannel(SlaveComputer.java:894)
14:41:42 	at hudson.slaves.SlaveComputer.access$100(SlaveComputer.java:108)
14:41:42 	at hudson.slaves.SlaveComputer$2.run(SlaveComputer.java:774)
14:41:42 	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
14:41:42 	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
14:41:42 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
14:41:42 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
14:41:42 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
14:41:42 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
14:41:42 	at java.lang.Thread.run(Thread.java:748)
14:41:42 Caused: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@283bc0e6:JNLP4-connect connection from p8-aix2-nodejs04.osuosl.org/140.211.9.101:57172": Remote call on JNLP4-connect connection from p8-aix2-nodejs04.osuosl.org/140.211.9.101:57172 failed. The channel is closing down or has closed down
14:41:42 	at hudson.remoting.Channel.call(Channel.java:994)
14:41:42 	at hudson.FilePath.act(FilePath.java:1165)
14:41:42 	at hudson.FilePath.act(FilePath.java:1154)
14:41:42 	at hudson.FilePath.delete(FilePath.java:1681)
14:41:42 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:162)
14:41:42 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:91)
14:41:42 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
14:41:42 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:806)
14:41:42 	at hudson.model.Build$BuildExecution.build(Build.java:198)
14:41:42 	at hudson.model.Build$BuildExecution.doRun(Build.java:163)
14:41:42 	at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:514)
14:41:42 	at hudson.model.Run.execute(Run.java:1888)
14:41:42 	at hudson.matrix.MatrixRun.run(MatrixRun.java:153)
14:41:42 	at hudson.model.ResourceController.execute(ResourceController.java:99)
14:41:42 	at hudson.model.Executor.run(Executor.java:432)
14:41:42 Build step 'Execute shell' marked build as failure

which appears to correspond to this from the Jenkins log on the CI server

2022-02-16 14:41:42.096+0000 [id=1390]  INFO    hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from p8-aix2-nodejs04.osuosl.org/140.211.9.101:57172.
java.util.concurrent.TimeoutException: Ping started at 1645022382095 hasn't completed by 1645022502096
        at hudson.remoting.PingThread.ping(PingThread.java:134)
        at hudson.remoting.PingThread.run(PingThread.java:90)
2022-02-16 14:41:42.097+0000 [id=29556] INFO    j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#3092] for test-osuosl-aix72-ppc64_be-4 terminated: java.nio.channels.ClosedChannelException

@richardlau
Member

java.util.concurrent.TimeoutException: Ping started at 1645022382095 hasn't completed by 1645022502096

1645022502096 - 1645022382095 = 120001

This appears to correspond to what I see in https://ci.nodejs.org/systemInfo -- that the following system properties have been set

Name                                              Value
hudson.slaves.ChannelPinger.pingIntervalSeconds   120
hudson.slaves.ChannelPinger.pingTimeoutSeconds    120

These are being set in /etc/default/jenkins. I do not know why they are smaller than the documented defaults: https://www.jenkins.io/doc/book/managing/system-properties/
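
A quick sanity check of those numbers (a minimal sketch in Python; the epoch-millisecond values are copied from the Jenkins log quoted above):

```python
from datetime import datetime, timezone

# Epoch-millisecond timestamps copied from the "Ping started ... hasn't completed" log line.
ping_started = 1645022382095
deadline_hit = 1645022502096

print(deadline_hit - ping_started)  # 120001 ms, i.e. just over the 120 s pingTimeoutSeconds
print(datetime.fromtimestamp(deadline_hit / 1000, tz=timezone.utc))
# ~2022-02-16 14:41:42.096+00:00 -- matches the "Ping failed" entry in the Jenkins log
```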

@richardlau
Member

I haven't found anything in this repository suggesting that we deliberately set the ping values. They're not set on ci-release.

I'm going to see what happens if we let the ping settings fall back to their default values.

@richardlau
Member

Still times out on the ping with the defaults (240 s: 1645032626467 - 1645032386352 = 240115 ms) 😞:

2022-02-16 17:30:26.469+0000 [id=1520]  INFO    hudson.slaves.ChannelPinger$1#onDead: Ping failed. Terminating the channel JNLP4-connect connection from p8-aix2-nodejs03.osuosl.org/140.211.9.131:53493.
java.util.concurrent.TimeoutException: Ping started at 1645032386352 hasn't completed by 1645032626467
        at hudson.remoting.PingThread.ping(PingThread.java:134)
        at hudson.remoting.PingThread.run(PingThread.java:90)
2022-02-16 17:30:26.483+0000 [id=2126]  INFO    j.s.DefaultJnlpSlaveReceiver#channelClosed: Computer.threadPoolForRemoting [#417] for test-osuosl-aix72-ppc64_be-3 terminated: java.nio.channels.ClosedChannelException

@richardlau
Member

I'm going to try temporarily disabling the ping timeout (Jenkins.instance.injector.getInstance(hudson.slaves.ChannelPinger.class).@pingTimeoutSeconds = -1):
https://www.jenkins.io/doc/book/system-administration/monitoring/

@richardlau
Member

With the ping timeout disabled I've had two builds complete (one from master, one from v16.x-staging). Both builds were slow (1 hour 43 mins and 1 hour 30 mins, which may be another symptom of whatever is causing the pings to fail to complete within the timeouts when they were set), but at least they completed.

@tniessen
Member Author

Thanks for investigating this @richardlau!

@richardlau
Member

More builds have gone through (slowly) and succeeded. Going to step away for a bit as it's mid-evening here.

FTR the Jenkins.instance.injector.getInstance(hudson.slaves.ChannelPinger.class).@pingTimeoutSeconds = -1 snippet from https://www.jenkins.io/doc/book/system-administration/monitoring/#PingThread-Disablingpingthread is entered into the Jenkins script console (https://ci.nodejs.org/script; you must be a Jenkins admin to access it), and the agent must be restarted (either manually, via the ansible/playbooks/jenkins/worker/restart-agent.yml playbook, or via ansible.nodejs.org) for the change to take effect. Omitting the = -1 part (i.e. Jenkins.instance.injector.getInstance(hudson.slaves.ChannelPinger.class).@pingTimeoutSeconds) will print the current value.

@richardlau
Member

So far we've seen no further disconnects. Of the 17 builds since the ping timeout was disabled, 16 passed, and the one failure was an actual test failing.

Planned next steps are to reintroduce the timeout at a larger value (say an hour or half an hour) to see whether the pings are completing but taking a long time, or just aren't completing at all, and to ask OSUOSL whether anything has changed recently that might explain why the failures we were seeing suddenly became consistent (every build) rather than intermittent.

@richardlau
Member

I've set the ping timeout to 3600 (i.e. 1 hour) and restarted the agent on test-osuosl-aix72-ppc64_be-3. I'll run a couple of test builds to see if we get disconnects. If we do not, I may reduce it to half an hour.

@mhdawson
Member

@richardlau thanks for all of your hard work on this one.

@richardlau
Member

7 passing builds with ping timeout of 3600. Trying 1800 (half an hour).

@richardlau
Member

No ping timeouts at 1800 so far.
Trying 900 (15 mins).

@richardlau
Member

No ping timeouts at 900 so far.
Trying 300 (5 mins).

@richardlau
Member

AFAIK we haven't seen a recurrence of the issue on AIX in a while (or at least not often enough to be noticeable). We have, however, seen the ping timeout issue come up on the Pis for a PR: nodejs/node#42409 (comment)

Maybe we should just pick a large but reasonable value for the timeouts and run with that.

@richardlau
Member

richardlau commented Apr 7, 2022

Theorycrafting time!
For the Node.js test runner, the default timeout for a test is two minutes:
https://github.com/nodejs/node/blob/6374f6b29ce8e98ba54f240ff68dbe622c497d61/tools/test.py#L1338-L1339

On arm this is scaled by 3:
https://github.com/nodejs/node/blob/6374f6b29ce8e98ba54f240ff68dbe622c497d61/tools/test.py#L898-L902

and then for tests in pummel or benchmark again by 6:
https://github.com/nodejs/node/blob/6374f6b29ce8e98ba54f240ff68dbe622c497d61/tools/test.py#L945-L949

So the worst case is 36 (2*3*6) minutes. Let's set the Jenkins ping timeout greater than that; I'm inclined to go for an hour (3600 sec).
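
The same arithmetic as a small sketch (the base timeout and multipliers are the values quoted from tools/test.py above; 3600 s is the ping timeout being proposed):

```python
# Values quoted from tools/test.py (see the links above).
base_timeout_min = 2         # default per-test timeout in minutes
slow_arch_scale = 3          # scaling applied on arm
pummel_benchmark_scale = 6   # extra scaling for pummel/benchmark tests

worst_case_min = base_timeout_min * slow_arch_scale * pummel_benchmark_scale
print(worst_case_min)               # 36 minutes
print(worst_case_min * 60 <= 3600)  # True: fits within a 1-hour Jenkins ping timeout
```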

@richardlau
Member

I've updated the timeout to 3600 sec (1 hour). Let's see how it goes 🤞.

@tniessen
Member Author

@richardlau Thanks so much for working on this. Are you aware of any related failures (or other problems) since the last change? If not, can this be closed at this point?

@richardlau
Member

I am not aware of anything since the change.
