node-test-commit-arm is broken #3044

Closed
tniessen opened this issue Oct 2, 2022 · 18 comments

Comments

@tniessen
Member

tniessen commented Oct 2, 2022

It seems that node-test-commit-arm jobs never complete. This affects all PR CIs:

[screenshot: build time trend showing recent builds taking days]

Unfortunately, the "Auto start CI" GitHub action is unaware of the issue and just silently removes the request-ci label while queuing even more jobs.
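
Once the arm queue is healthy again, CI can presumably be re-requested on an affected PR simply by re-adding the label by hand, e.g. via the gh CLI (the PR number below is a placeholder):

# re-add the request-ci label so the "Auto start CI" action picks the PR up again
gh pr edit <pr-number> --repo nodejs/node --add-label request-ci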

@mhdawson
Member

mhdawson commented Oct 2, 2022

I have brought one of the arm machines back online and it seems to be processing jobs.

@richardlau
Member

richardlau commented Oct 2, 2022

FWIW it looks like the bottleneck was ubuntu2004-armv7l. In addition to the machine that was offline, the container on the other machine appears to have gone down on 15 Sept:

root@test-equinix-ubuntu2004-docker-arm64-3:~# systemctl status jenkins-test-equinix-ubuntu2004_container-armv7l-2.service
● jenkins-test-equinix-ubuntu2004_container-armv7l-2.service - Jenkins Slave in Docker for test-equinix-ubuntu2004_container-armv7l-2
     Loaded: loaded (/lib/systemd/system/jenkins-test-equinix-ubuntu2004_container-armv7l-2.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2022-07-26 12:27:04 UTC; 2 months 7 days ago
   Main PID: 3048135 (docker)
      Tasks: 113 (limit: 616176)
     Memory: 38.3M
     CGroup: /system.slice/jenkins-test-equinix-ubuntu2004_container-armv7l-2.service
             └─3048135 /usr/bin/docker run --init --rm -v /home/iojs/test-equinix-ubuntu2004_container-armv7l-2/:/home/iojs -v /home/iojs/.ccache/:/home/io>

Sep 15 00:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: Sep 15, 2022 12:46:45 AM hudson.slaves.ChannelPinger$1 onDead
Sep 15 00:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.or>
Sep 15 00:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: java.util.concurrent.TimeoutException: Ping started at 1663199205593 hasn't complet>
Sep 15 00:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]:         at hudson.remoting.PingThread.ping(PingThread.java:134)
Sep 15 00:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]:         at hudson.remoting.PingThread.run(PingThread.java:90)
Sep 15 01:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: Sep 15, 2022 1:46:45 AM hudson.slaves.ChannelPinger$1 onDead
Sep 15 01:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.or>
Sep 15 01:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]: java.util.concurrent.TimeoutException: Ping started at 1663202805597 hasn't complet>
Sep 15 01:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]:         at hudson.remoting.PingThread.ping(PingThread.java:134)
Sep 15 01:46:45 test-equinix-ubuntu2004-docker-arm64-3 docker[3048135]:         at hudson.remoting.PingThread.run(PingThread.java:90)

I've restarted the container, so we now have the two back online.
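
For reference, the restart is presumably just a matter of bouncing the systemd unit shown in the output above:

# restart the dockerised armv7l agent and confirm it came back
systemctl restart jenkins-test-equinix-ubuntu2004_container-armv7l-2.service
systemctl status jenkins-test-equinix-ubuntu2004_container-armv7l-2.service --no-pager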

@richardlau
Member

> I have brought one of the arm machines back online and it seems to be processing jobs.

I'm guessing this was another occurrence of #2894?

@tniessen
Member Author

tniessen commented Oct 2, 2022

Thank you @mhdawson and @richardlau!

@sxa
Member

sxa commented Oct 3, 2022

> the container on the other machine appears to have gone down on 15 Sept:

Hmmm that's interesting if it was just the container that failed.

@richardlau
Member

Looks like all ubuntu2004-armv7l containers are offline:
[screenshot: Jenkins agent list showing the ubuntu2004-armv7l containers offline]

@richardlau
Member

root@test-equinix-ubuntu2004-docker-arm64-3:~# systemctl status jenkins-test-equinix-ubuntu2004_container-armv7l-2.service
● jenkins-test-equinix-ubuntu2004_container-armv7l-2.service - Jenkins Slave in Docker for test-equinix-ubuntu2004_container-armv7l-2
     Loaded: loaded (/lib/systemd/system/jenkins-test-equinix-ubuntu2004_container-armv7l-2.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2022-10-02 11:55:32 UTC; 1 day 20h ago
   Main PID: 2538938 (docker)
      Tasks: 104 (limit: 616176)
     Memory: 28.9M
     CGroup: /system.slice/jenkins-test-equinix-ubuntu2004_container-armv7l-2.service
             └─2538938 /usr/bin/docker run --init --rm -v /home/iojs/test-equinix-ubuntu2004_container-armv7l-2/:/home/iojs -v /home/iojs/.ccache/:/home/io>

Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 9:40:34 AM hudson.slaves.ChannelPinger$1 onDead
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913.
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.util.concurrent.TimeoutException: Ping started at 1664786434284 hasn't completed by 1664790034284
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.ping(PingThread.java:132)
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.run(PingThread.java:88)
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 10:40:34 AM hudson.slaves.ChannelPinger$1 onDead
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913.
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.util.concurrent.TimeoutException: Ping started at 1664790034286 hasn't completed by 1664793634287
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.ping(PingThread.java:132)
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.run(PingThread.java:88)

@richardlau
Member

root@test-equinix-ubuntu2004-docker-arm64-3:~# journalctl -u jenkins-test-equinix-ubuntu2004_container-armv7l-2.service
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: WARNING: [JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913]
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.lang.NullPointerException
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.util.DirectByteBufferPool.acquire(DirectByteBuffe>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.IOHub.acquire(IOHub.java:164)
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.ProtocolStack.acquire(ProtocolStack.java>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SS>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngi>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.j>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:13>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetw>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetwo>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.jav>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:126)
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.lang.Thread.run(Thread.java:748)
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 7:40:10 AM org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: SEVERE: [JNLP4-connect connection to ci.nodejs.org/107.170.240.62:41913] Reader thr>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.lang.NullPointerException
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.util.DirectByteBufferPool.acquire(DirectByteBuffe>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.IOHub.acquire(IOHub.java:164)
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.ProtocolStack.acquire(ProtocolStack.java>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.processRead(SS>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.SSLEngineFilterLayer.onRecv(SSLEngi>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.ProtocolStack$Ptr.onRecv(ProtocolStack.j>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.NetworkLayer.onRead(NetworkLayer.java:13>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer.access$2200(BIONetw>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at org.jenkinsci.remoting.protocol.impl.BIONetworkLayer$Reader.run(BIONetwo>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.jav>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.ja>
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:126)
Oct 03 07:40:10 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at java.lang.Thread.run(Thread.java:748)
Oct 03 08:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 8:40:34 AM hudson.slaves.ChannelPinger$1 onDead
Oct 03 08:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.or>
Oct 03 08:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.util.concurrent.TimeoutException: Ping started at 1664782834280 hasn't complet>
Oct 03 08:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.ping(PingThread.java:132)
Oct 03 08:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.run(PingThread.java:88)
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 9:40:34 AM hudson.slaves.ChannelPinger$1 onDead
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.or>
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.util.concurrent.TimeoutException: Ping started at 1664786434284 hasn't complet>
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.ping(PingThread.java:132)
Oct 03 09:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.run(PingThread.java:88)
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: Oct 03, 2022 10:40:34 AM hudson.slaves.ChannelPinger$1 onDead
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: INFO: Ping failed. Terminating the channel JNLP4-connect connection to ci.nodejs.or>
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]: java.util.concurrent.TimeoutException: Ping started at 1664790034286 hasn't complet>
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.ping(PingThread.java:132)
Oct 03 10:40:34 test-equinix-ubuntu2004-docker-arm64-3 docker[2538938]:         at hudson.remoting.PingThread.run(PingThread.java:88)

@richardlau
Member

I've restarted the ubuntu2004-armv7l container on test-equinix-ubuntu2004-docker-arm64-3.
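
To spot a recurrence early, the unit's journal can be followed live (same service name as in the earlier output):

# tail the agent's journal and watch for further NullPointerExceptions / ping timeouts
journalctl -u jenkins-test-equinix-ubuntu2004_container-armv7l-2.service -f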

@sxa
Member

sxa commented Oct 4, 2022

Hmmm - so this is ONLY affecting the armv7l container on the host?

@richardlau
Member

AFAICT at the moment it's just the ubuntu2004-armv7l containers. The debian10-armv7l containers were still working.

@richardlau
Member

It looks like the NullPointerException has happened again and I'm not sure if the agent on that container is now broken again (https://ci.nodejs.org/job/node-test-commit-arm/nodes=ubuntu2004-armv7l/44015/console looks stuck).

I've restarted the container on the other Altra -- will see if the problem reoccurs while watching (virtually) the rest of this morning's NodeConf EU talks. Might try updating the containers this afternoon to Java 11/17 (currently on 8).

@tniessen
Member Author

tniessen commented Oct 5, 2022

I'm seeing this immediately after resuming CI:

fatal: Couldn't find remote ref refs/heads/jenkins-node-test-commit-arm-fanned-feaab78d411c319cbbe153db7d04e9b46b2b48eb-binary-pi1p/cc-armv7

(Example build)

@richardlau
Member

@tniessen jenkins-node-test-commit-arm-fanned-feaab78d411c319cbbe153db7d04e9b46b2b48eb-binary-pi1p/cc-armv7 was deleted by https://ci.nodejs.org/job/git-delete-branches/58878/ because the https://ci.nodejs.org/job/node-test-commit-arm-fanned/27642/console job from https://ci.nodejs.org/job/node-test-pull-request/46975/ succeeded.

@tniessen
Member Author

tniessen commented Oct 5, 2022

@richardlau I am not sure I understand what I did wrong; this is happening in two PRs and I am pretty sure I only resumed/restarted failing builds. I guess I'll start new builds and see if that works.

@richardlau
Member

@tniessen Arm builds were backlogged heavily over the weekend due to the machines being offline and then catching up the already queued jobs. nodejs/node#44849 appears to have at least two builds which ended up racing each other:

The arm build from https://ci.nodejs.org/job/node-test-pull-request/46975/ succeeded and caused the branch in the binary repo to be deleted while https://ci.nodejs.org/job/node-test-pull-request/46992/ was still in progress, causing the arm build there to fail. You've been resuming builds from https://ci.nodejs.org/job/node-test-pull-request/46992/, so the resumed builds think they need to resume the failed arm build, but the branch no longer exists.

I could easily see this happening in other PRs during the backlog if builds had started for the same commit while earlier builds for that commit were still waiting for the offline arm machines to come back online.
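
Before resuming a build it's possible to check whether the branch still exists in the binary repo, e.g. (the repo URL below is a placeholder):

# exits non-zero if git-delete-branches has already cleaned up the branch
git ls-remote --exit-code <binary-repo-url> refs/heads/jenkins-node-test-commit-arm-fanned-feaab78d411c319cbbe153db7d04e9b46b2b48eb-binary-pi1p/cc-armv7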

@tniessen
Member Author

tniessen commented Oct 5, 2022

Ah, thanks for the clarification @richardlau! That makes a lot of sense :)

@richardlau
Member

> I've restarted the container on the other Altra -- will see if the problem reoccurs while watching (virtually) the rest of this morning's NodeConf EU talks. Might try updating the containers this afternoon to Java 11/17 (currently on 8).

Haven't seen the problem again. FWIW I've updated the containers to run on Java 17 now.
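
If anyone wants to double-check a machine, the Java level inside a running agent container can be confirmed with something like (container name is a placeholder):

# print the JVM version the Jenkins agent is running under
docker exec <container-name> java -version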
