
Avoid continual thread replacement in JarCacheSupport #717

Merged
merged 1 commit into jenkinsci:master from eagain on Jan 9, 2024

Conversation

@basil (Member) commented Jan 8, 2024

For a while now, users have been complaining about jobs failing with an (agent-side) stack trace such as:

java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at hudson.remoting.AtmostOneThreadExecutor.execute(AtmostOneThreadExecutor.java:104)
	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
	at org.jenkinsci.remoting.util.ExecutorServiceUtils.submitAsync(ExecutorServiceUtils.java:58)
	at hudson.remoting.JarCacheSupport.resolve(JarCacheSupport.java:77)
	at hudson.remoting.ResourceImageInJar._resolveJarURL(ResourceImageInJar.java:93)
	at hudson.remoting.ResourceImageInJar.resolve(ResourceImageInJar.java:45)
	at hudson.remoting.RemoteClassLoader.loadRemoteClass(RemoteClassLoader.java:302)
	at hudson.remoting.RemoteClassLoader.loadWithMultiClassLoader(RemoteClassLoader.java:280)
	at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:232)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:264)
	at com.sun.proxy.$Proxy9.<clinit>(Unknown Source)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.reflect.Proxy.newProxyInstance(Proxy.java:739)
	at hudson.remoting.RemoteInvocationHandler.wrap(RemoteInvocationHandler.java:168)
	at hudson.remoting.Channel.export(Channel.java:812)
	at hudson.remoting.Channel.export(Channel.java:775)
	at org.jenkinsci.plugins.gitclient.LegacyCompatibleGitAPIImpl.writeReplace(LegacyCompatibleGitAPIImpl.java:204)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1244)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at hudson.remoting.UserRequest._serialize(UserRequest.java:263)
	at hudson.remoting.UserRequest.serialize(UserRequest.java:272)
	at hudson.remoting.UserRequest.perform(UserRequest.java:222)
	at hudson.remoting.UserRequest.perform(UserRequest.java:54)
	at hudson.remoting.Request$2.run(Request.java:376)
	at hudson.remoting.InterceptingExecutorService.lambda$wrap$0(InterceptingExecutorService.java:78)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at hudson.remoting.Engine$1.lambda$newThread$0(Engine.java:121)
	at java.lang.Thread.run(Thread.java:748)

Examples of this include https://issues.jenkins.io/browse/JENKINS-65873?focusedId=429290&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-429290 and #709 (comment).

Previous work on this problem has included drastically decreasing the number of threads required during normal operation (#559), as well as the backport of https://bugs.openjdk.org/browse/JDK-8268773 to JDK 11. When this failure occurs, the job fails with an error on the controller, which generates complaints from users; those complaints continue to the present.

The error java.lang.OutOfMemoryError: unable to create new native thread originates from an EAGAIN returned by pthread_create(3). The cause of this failure is not well understood. https://bugs.openjdk.org/browse/JDK-8268773 retries up to three times when such a failure is encountered, but that is just a workaround as well. More generally, the Linux threads implementation is fragile when large numbers of threads are created and destroyed in quick succession. While this should ideally be fixed, it behooves us to limit our exposure by avoiding unnecessary thread churn wherever possible.

Remoting creates threads in a handful of places:

  • hudson.remoting.Engine#executor, which has no upper bound but caches threads for up to 60 seconds. In practice this is enough to avoid large bursts of thread creation.
  • hudson.remoting.RemoteInvocationHandler.Unexporter#svc, which uses AtmostOneThreadExecutor. In practice GC is an infrequent enough operation that there have been no issues reported about this call site. Also, if an error occurred here, it would not cause the job to fail immediately, but would rather just cause the GC operation to fail (thus leaking memory).
  • hudson.remoting.JarCacheSupport#downloader, which also uses AtmostOneThreadExecutor, and which frequently appears in stack traces of complaints because it causes the entire job to fail.

The last of these is the focus of this PR, as it runs dozens of times during routine operation (particularly when using JGit or other plugins that do heavyweight remote class loading). By design it only creates a new thread once the preceding one has terminated, but in practice the accounting is a bit sloppy and there is a small window of time during which the old thread is still shutting down while a new one may be started. Whether or not this is the cause of the user complaints is unclear, but what is clear is that we are repeatedly calling pthread_create(3), in some cases dozens of times, far more than anywhere else in Remoting, which exposes us to the Linux fragility mentioned above, especially because these threads are created and destroyed so close together in time.
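For context, the at-most-one-thread pattern behaves roughly like the following simplified sketch (an illustration of the general approach, not the actual hudson.remoting.AtmostOneThreadExecutor source): the worker thread is started lazily when work arrives and exits as soon as its queue drains, so every fresh burst of class loading pays for another pthread_create(3).

	import java.util.ArrayDeque;
	import java.util.Queue;
	import java.util.concurrent.Executor;

	// Simplified illustration only; not the actual hudson.remoting.AtmostOneThreadExecutor.
	class AtMostOneThreadSketch implements Executor {
	    private final Queue<Runnable> tasks = new ArrayDeque<>();
	    private Thread worker; // non-null while a worker thread is believed to be alive

	    @Override
	    public synchronized void execute(Runnable task) {
	        tasks.add(task);
	        if (worker == null) {
	            // No live worker (or the old one already cleared itself while still
	            // unwinding): spawn a fresh thread for this burst of work.
	            worker = new Thread(this::drain, "at-most-one-thread-sketch");
	            worker.start();
	        }
	    }

	    private void drain() {
	        while (true) {
	            Runnable next;
	            synchronized (this) {
	                next = tasks.poll();
	                if (next == null) {
	                    // Queue drained: clear the bookkeeping and let the thread die.
	                    worker = null;
	                    return;
	                }
	            }
	            next.run();
	        }
	    }
	}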

To avoid exposure to this fragility, this PR caches the thread for up to a minute after creation, reusing it if it is present. This avoids the constant churn in which threads are destroyed and then created again almost immediately. Instead, one thread is created and reused for all class loading operations (which should take well under a minute) and then destroyed once it is no longer needed. If it is needed later (for example, when running a different job on the agent that requires a different set of classes), it is spun up again. When the system is in a stable state, with no new classes being loaded for over a minute, the behavior is the same as before.
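A minimal sketch of that caching behavior, assuming a plain java.util.concurrent.ThreadPoolExecutor (the class and method names below are illustrative, not the exact code merged here):

	import java.util.concurrent.LinkedBlockingQueue;
	import java.util.concurrent.ThreadPoolExecutor;
	import java.util.concurrent.TimeUnit;

	class CachedDownloaderSketch {
	    // Hypothetical factory for a single worker thread that is reused across
	    // class-loading tasks and torn down only after 60 seconds of idleness,
	    // preserving the at-most-one-thread property while avoiding rapid
	    // create/destroy churn.
	    static ThreadPoolExecutor newDownloader() {
	        ThreadPoolExecutor downloader = new ThreadPoolExecutor(
	                1,                            // corePoolSize: at most one worker
	                1,                            // maximumPoolSize: never more than one
	                60L, TimeUnit.SECONDS,        // keep an idle worker alive for a minute
	                new LinkedBlockingQueue<>()); // queue further tasks while the worker is busy
	        // Let the lone core thread time out as well, so an idle cache holds no
	        // thread at all once the keep-alive expires.
	        downloader.allowCoreThreadTimeOut(true);
	        return downloader;
	    }
	}

With such a configuration, the first submission creates the single worker, later submissions reuse it, and the worker exits on its own after a minute without work.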

I considered introducing a new permanent thread for each connection, but this seemed risky on the controller side for controllers with hundreds or thousands of agents. In contrast, having the thread expire after one minute is closer to the existing behavior, reducing risk while still avoiding exposure to the fragility described above. It also more closely matches the thread caching behavior that already exists in Engine, which has not caused any problems for us.

Testing done

Before this PR, I measured the number of threads created after connecting the agent and after running a job with JGit. About 50 threads were created in total (not all at the same time), about 34 of them by JarCacheSupport, often in rapid succession.

After this PR, I connected the agent and verified that a single JarCacheSupport thread was created. I kept checking jstack and saw that the thread hung around for about 60 seconds, after which it terminated (as expected) and disappeared from jstack. Then I ran a job that used JGit, which needs to load more classes, and saw that a single JarCacheSupport thread was created again, that the job succeeded, and that the thread hung around for about 65 seconds, after which it terminated (as expected) and disappeared from jstack.

This demonstrates that we are not shy about creating threads when we need to, but also that we are no longer cavalier about creating threads gratuitously in short succession. We went from creating about 50 threads to only about 35: two of them were JarCacheSupport threads (as described above), three were GC threads, and the rest were Engine threads (which already have a cache, as described above, so the upper bound there is the parallelism bound of the application).

Closes #709

@basil added the bug (For changelog: Fixes a bug.) label on Jan 8, 2024
@timja (Member) left a comment

Thanks!

@MarkEWaite (Contributor) left a comment

Thanks. Looks good to me based on my reading of the ThreadPoolExecutor Javadoc.

@basil basil merged commit b15dcf7 into jenkinsci:master Jan 9, 2024
13 checks passed
@basil basil deleted the eagain branch January 9, 2024 16:29