
Avoid continual thread replacement in JarCacheSupport #717

Merged
merged 1 commit into jenkinsci:master from eagain on Jan 9, 2024

Conversation

@basil (Member) commented Jan 8, 2024

For a while now, users have been complaining about jobs failing with an (agent-side) stack trace such as:

java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at hudson.remoting.AtmostOneThreadExecutor.execute(AtmostOneThreadExecutor.java:104)
	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
	at org.jenkinsci.remoting.util.ExecutorServiceUtils.submitAsync(ExecutorServiceUtils.java:58)
	at hudson.remoting.JarCacheSupport.resolve(JarCacheSupport.java:77)
	at hudson.remoting.ResourceImageInJar._resolveJarURL(ResourceImageInJar.java:93)
	at hudson.remoting.ResourceImageInJar.resolve(ResourceImageInJar.java:45)
	at hudson.remoting.RemoteClassLoader.loadRemoteClass(RemoteClassLoader.java:302)
	at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:280)
	at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:232)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at com.sun.proxy.$Proxy9.<clinit>(Unknown Source)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:739)
	at hudson.remoting.RemoteInvocationHandler.wrap(RemoteInvocationHandler.java:168)
	at hudson.remoting.Channel.export(Channel.java:812)
	at hudson.remoting.Channel.export(Channel.java:775)
	at org.jenkinsci.plugins.gitclient.LegacyCompatibleGitAPIImpl.writeReplace(LegacyCompatibleGitAPIImpl.java:204)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1244)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at hudson.remoting.UserRequest._serialize(UserRequest.java:263)
	at hudson.remoting.UserRequest.serialize(UserRequest.java:272)
	at hudson.remoting.UserRequest.perform(UserRequest.java:222)
	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
	at hudson.remoting.Request$2.run(Request.java:376)
	at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121)
	at java.lang.Thread.run(Thread.java:748)

Examples of this include https://issues.jenkins.io/browse/JENKINS-65873?focusedId=429290&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-429290 and #709 (comment).

Previous work on this problem has included drastically decreasing the number of threads required during normal operation (#559), as well as the backport of https://bugs.openjdk.org/browse/JDK-8268773 to JDK 11. When this failure occurs, the job fails with an error on the controller, which generates complaints from users; those complaints continue to the present.

The error java.lang.OutOfMemoryError: unable to create new native thread originates from an EAGAIN returned by pthread_create(3). The cause of this failure is not well understood. https://bugs.openjdk.org/browse/JDK-8268773 retries up to three times when such a failure is encountered, but that is just a workaround as well. More generally, the Linux threads implementation is fragile when large numbers of threads are created and destroyed in quick succession. While this should ideally be fixed, it behooves us to limit our exposure by avoiding unnecessary thread churn wherever possible.

Remoting creates threads in a handful of places:

  • hudson.remoting.Engine#executor, which has no upper bound but caches threads for up to 60 seconds. In practice this is enough to avoid large bursts of thread creation.
  • hudson.remoting.RemoteInvocationHandler.Unexporter#svc, which uses AtmostOneThreadExecutor. In practice GC is an infrequent enough operation that there have been no issues reported about this call site. Also, if an error occurred here, it would not cause the job to fail immediately, but would rather just cause the GC operation to fail (thus leaking memory).
  • hudson.remoting.JarCacheSupport#downloader, which also uses AtmostOneThreadExecutor, and which frequently appears in stack traces of complaints because it causes the entire job to fail.

The last of these is the focus of this PR, as it runs dozens of times during routine operation (particularly when using JGit or other plugins that do heavyweight remote class loading). By design it only creates a new thread once the preceding one has terminated, but in practice the accounting is a bit sloppy and there is a small window of time during which the old thread is still shutting down while a new one may be started. Whether or not this is the cause of the user complaints is unclear, but what is clear is that we are repeatedly calling pthread_create(3), in some cases dozens of times, far more than anywhere else in Remoting, which exposes us to the Linux fragility mentioned above, especially because these threads are created and destroyed so close together in time.
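For context, the at-most-one-thread pattern behaves roughly like the following simplified sketch (an illustration of the general approach, not the actual hudson.remoting.AtmostOneThreadExecutor source): the worker thread is started lazily when work arrives and exits as soon as its queue drains, so every fresh burst of class loading pays for another pthread_create(3).

	import java.util.ArrayDeque;
	import java.util.Queue;
	import java.util.concurrent.Executor;

	// Simplified illustration only; not the actual hudson.remoting.AtmostOneThreadExecutor.
	class AtMostOneThreadSketch implements Executor {
	    private final Queue<Runnable> tasks = new ArrayDeque<>();
	    private Thread worker; // non-null while a worker thread is believed to be alive

	    @Override
	    public synchronized void execute(Runnable task) {
	        tasks.add(task);
	        if (worker == null) {
	            // No live worker (or the old one already cleared itself while still
	            // unwinding): spawn a fresh thread for this burst of work.
	            worker = new Thread(this::drain, "at-most-one-thread-sketch");
	            worker.start();
	        }
	    }

	    private void drain() {
	        while (true) {
	            Runnable next;
	            synchronized (this) {
	                next = tasks.poll();
	                if (next == null) {
	                    // Queue drained: clear the bookkeeping and let the thread die.
	                    worker = null;
	                    return;
	                }
	            }
	            next.run();
	        }
	    }
	}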

To avoid exposure to this fragility, this PR caches the thread for up to a minute after creation, reusing it if it is present. This avoids the constant churn in which threads are destroyed and then created again almost immediately. Instead, one thread is created and reused for all class loading operations (which should take well under a minute) and then destroyed once it is no longer needed. If it is needed later (for example, when running a different job on the agent that requires a different set of classes), it is spun up again. When the system is in a stable state, with no new classes being loaded for over a minute, the behavior is the same as before.
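A minimal sketch of that caching behavior, assuming a plain java.util.concurrent.ThreadPoolExecutor (the class and method names below are illustrative, not the exact code merged here):

	import java.util.concurrent.LinkedBlockingQueue;
	import java.util.concurrent.ThreadPoolExecutor;
	import java.util.concurrent.TimeUnit;

	class CachedDownloaderSketch {
	    // Hypothetical factory for a single worker thread that is reused across
	    // class-loading tasks and torn down only after 60 seconds of idleness,
	    // preserving the at-most-one-thread property while avoiding rapid
	    // create/destroy churn.
	    static ThreadPoolExecutor newDownloader() {
	        ThreadPoolExecutor downloader = new ThreadPoolExecutor(
	                1,                            // corePoolSize: at most one worker
	                1,                            // maximumPoolSize: never more than one
	                60L, TimeUnit.SECONDS,        // keep an idle worker alive for a minute
	                new LinkedBlockingQueue<>()); // queue further tasks while the worker is busy
	        // Let the lone core thread time out as well, so an idle cache holds no
	        // thread at all once the keep-alive expires.
	        downloader.allowCoreThreadTimeOut(true);
	        return downloader;
	    }
	}

With such a configuration, the first submission creates the single worker, later submissions reuse it, and the worker exits on its own after a minute without work.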

I considered introducing a new permanent thread for each connection, but this seemed risky on the controller side for controllers with hundreds or thousands of agents. In contrast, having the thread expire after one minute is closer to the existing behavior, reducing risk while still avoiding exposure to the fragility described above. It also more closely matches the thread caching behavior that already exists in Engine, which has not caused any problems for us.

Testing done

Before this PR, I measured the number of threads created after connecting the agent and after running a job with JGit. About 50 threads were created in total (not all at the same time), about 34 of them by JarCacheSupport, often in rapid succession.

After this PR, I connected the agent and verified that a single JarCacheSupport thread was created. I kept checking jstack and saw that the thread hung around for about 60 seconds, after which it terminated (as expected) and disappeared from jstack. Then I ran a job that used JGit, which needs to load more classes, and saw that a single JarCacheSupport thread was created again, that the job succeeded, and that the thread hung around for about 65 seconds, after which it terminated (as expected) and disappeared from jstack.

This demonstrates that we are not shy about creating threads when we need to, but also that we are no longer cavalier about creating threads gratuitously in short succession. We went from creating about 50 threads to only about 35: two of them were JarCacheSupport threads (as described above), three were GC threads, and the rest were Engine threads (which already have a cache, as described above, so the upper bound there is the parallelism bound of the application).

Closes #709

@basil added the bug (For changelog: Fixes a bug.) label on Jan 8, 2024
@timja (Member) left a comment

Thanks!

@MarkEWaite (Contributor) left a comment

Thanks. Looks good to me based on my reading of the ThreadPoolExecutor Javadoc.

@basil basil merged commit b15dcf7 into jenkinsci:master Jan 9, 2024
13 checks passed
@basil basil deleted the eagain branch January 9, 2024 16:29