putIfAbsent() regression? #506
Comments
Can you write a simple JMH benchmark to debug against? I want to make sure there is no confusion about how it's being used. Offhand I'm surprised as well.
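For reference, a minimal JMH harness along these lines might look like the sketch below. It assumes the bounded, 1-day-expiry cache described in this issue; the class name, key choices, and thread count are illustrative rather than taken from the reporter's code.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

@State(Scope.Benchmark)
public class PutIfAbsentBenchmark {
  private Cache<Integer, Integer> cache;

  @Setup
  public void setup() {
    // Mirrors the configuration described in the issue: bounded to 1M entries,
    // expiring one day after write.
    cache = Caffeine.newBuilder()
        .maximumSize(1_000_000)
        .expireAfterWrite(1, TimeUnit.DAYS)
        .build();
  }

  @Benchmark @Threads(16)
  public Integer putIfAbsent_sameKey() {
    // Worst case: every thread races on the same key, so the losers take the
    // contended path being discussed in this issue.
    return cache.asMap().putIfAbsent(1, 1);
  }

  @Benchmark @Threads(16)
  public Integer putIfAbsent_spreadKeys() {
    // Typical case: contention is spread across many keys.
    int key = ThreadLocalRandom.current().nextInt(100_000);
    return cache.asMap().putIfAbsent(key, key);
  }
}
```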
Will take a look at that.
Taking a fresh look and nothing jumps out at me. The looping around … The lock may be held for a long time if there is some long-running user computation occurring, either via a compute / load or a writer. If that is not returning promptly then it blocks further updates to that entry. A mistake can be if an atomic operation on the entry performs a write back into the map, which can cause livelocks in some JDK releases. If not that, then maybe adding timeout safeguards. Searching the other stack traces might hint at the cause. Of course, if there is a regression and you can reproduce it, I'll jump right onto it. Does downgrading to 2.7 help?
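As an illustration of the write-back mistake mentioned above (a hypothetical snippet, not code from this issue): writing into the same map from inside an atomic per-entry operation re-enters the hash table while the entry's lock is held, which ConcurrentHashMap's contract forbids and which can block or livelock depending on the JDK.

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class WriteBackPitfall {
  void demo() {
    Cache<String, String> cache = Caffeine.newBuilder().maximumSize(1_000).build();

    cache.asMap().compute("a", (key, oldValue) -> {
      // BAD: writing back into the same map while the entry's lock is held.
      // ConcurrentHashMap documents that a mapping function must not update
      // other mappings; doing so can block or livelock on some JDK releases.
      cache.put("b", "derived-from-a");
      return "computed";
    });
  }
}
```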
All line numbers are against 2.8.8. Trying to understand how it can contend on the node at line 2030 (we are also looking at whether 2.7 truly helps with this). If multiple threads hit line 1987 with prior being null, and putIfAbsent() on line 2005 returns a non-null value, is there a reason it cannot also check onlyIfAbsent at line 2009 as an optimization, instead of having to synchronize at line 2030 (or am I not reading the code properly)?
Yeah, that does seem to have been overlooked. I think you're right that the optimization could be added there too.
Per our own analysis, it seems like we are calling putIfAbsent() more in our release that also took on 2.8.8. Perhaps that is what we are seeing: contention when prior != null because multiple threads call putIfAbsent for the same key at once, which leads to the synchronization on prior.
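A hypothetical caller-side sketch of that pattern: many threads issuing putIfAbsent for the same key at once, where one thread wins and every other thread sees a non-null prior entry. The key, value, and thread count here are made up for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class HotKeyPutIfAbsent {
  public static void main(String[] args) {
    Cache<String, String> cache = Caffeine.newBuilder()
        .maximumSize(1_000_000)
        .build();

    ExecutorService pool = Executors.newFixedThreadPool(32);
    for (int i = 0; i < 32; i++) {
      pool.execute(() -> {
        // One thread wins the race and inserts; every other thread gets a
        // non-null prior entry and, in 2.8.8, synchronizes on that node even
        // though for putIfAbsent it could simply return the existing value.
        cache.asMap().putIfAbsent("hot-key", "value");
      });
    }
    pool.shutdown();
  }
}
```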
Let me know how you want to proceed to confirm this: either downgrade to 2.7 to see if it's an issue (as we suspect) or upgrade with a patch. If you are amenable to it, I'd prefer to use a pre-release build to confirm with, in case we have to continue digging. I can get a patch in tonight (PST). What do you want to do?
Thanks for the fast response. We are currently working on creating the JMH benchmark as well as trying to reproduce the issue consistently on our side. Once we can reproduce it consistently, it would be great if we could move forward with a patch build.
I pushed the patch to … This is available on jitpack (via commit hash) or Sonatype's snapshot repository (as 2.9.1-SNAPSHOT).
Thanks, giving it a spin.
I forgot to copy into the …
Just to share here as well, but it seems like 2.8.8 is already faster (there is no regression): https://github.com/dbaghdasarya/jmh-benchmark
Oh, good. Thanks for doing that. Does the 2.9.1 snapshot help in your case?
We have been testing it at this point; so far it seems to be helping. Thanks!
Great! By the way, your benchmark has a shared …
Please let me know when you have a conclusion on whether the patch is enough or if you are still experiencing performance problems. I'll cut a release for you once it's clear that we've resolved the issues.
I am going to hold off on a release, given that this optimization covers a very narrow race, your benchmark didn't find a regression, and you observed the slowdown was likely due to usage changes. Since it doesn't sound critical anymore, I'll wait until enough other changes pile up, unless otherwise pressed.
Sorry, we're still going through some internal testing, but we are releasing with the snapshot build for now. We will circle back once we have confirmation that it helps.
@panghy @ben-manes
This should only occur when the write rate greatly exceeds how quickly the cache can evict. There is a write buffer to capture work to replay against the eviction policy, so usually a caller is only penalized by the cost to perform the hash table update. When the write buffer is full (it can grow up to ~128 * cpus), the cache blocks on the eviction lock to assist in draining. This deprioritizes the calling threads and lets the one holding the lock get more cpu time, which drains a batch of work. Typically that results in higher throughput, and we are protecting against runaway growth. Are you doing something very expensive that keeps the cache eviction from running, or have you otherwise starved the thread? By default the FJP#commonPool() is used as the executor. You might try a same-thread (Runnable::run) executor.
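A sketch of that suggestion, assuming the cache shape described elsewhere in this issue: supplying a same-thread executor so maintenance runs on the calling thread instead of ForkJoinPool.commonPool().

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class SameThreadExecutorConfig {
  Cache<String, String> cache = Caffeine.newBuilder()
      .maximumSize(1_000_000)
      .expireAfterWrite(Duration.ofDays(1))
      // Run maintenance (eviction, removal listeners) on the calling thread
      // rather than ForkJoinPool.commonPool(), to rule out a starved executor.
      .executor(Runnable::run)
      .build();
}
```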
Hm, when I looked at the stack trace that was captured, it seemed like all threads were blocked waiting for that lock and no progress was being made at all (all threads were parked). That seemed odd and pointed not to a contention issue but to a lock-release issue. That, or we are doing something else wrong that somehow causes the lock to be held and not released.
That is odd. Are you on an old Linux kernel? There is a nasty missing-wakeup bug that Linus tried to hide, which effectively broke the world.
Doesn't seem like it; we are on 5.4.0 with Java 14. It's just a bit concerning that it started happening with the 2.9.1-SNAPSHOT build.
Also still trying to look into whether the latest patch is helping us, seeing:
while other threads are blocked at:
The putIfAbsent call still seems to be able to trigger contention.
Still trying to piece together the possibilities as to why this could happen (expiration could be it, but this is a cache with a 1-day expire-after-write and a size limit of 1M).
We could switch to a LoadingCache to make things a bit simpler for this particular code path, but it's odd how we can contend on …
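If switching, a LoadingCache for this shape of cache might look roughly like the following; loadValue is a stand-in for whatever the application does today to produce the value.

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

class LoadingCacheSketch {
  LoadingCache<String, String> cache = Caffeine.newBuilder()
      .maximumSize(1_000_000)
      .expireAfterWrite(Duration.ofDays(1))
      // The loader runs at most once per key per miss, under the entry's lock,
      // so callers no longer race with an explicit putIfAbsent.
      .build(this::loadValue);

  private String loadValue(String key) {
    // Hypothetical stand-in for the application's value computation.
    return "value-for-" + key;
  }
}
```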
L2070 is afterRead, so I am assuming you are cutting the stack trace heavily. Can you show the rest of the calls? It is appending the read to a ring buffer and should tryLock to schedule the drain.
Hm, those are the only frames we have from jstack. We are on the …
This is confusing; I don't see why it's causing so much trouble. Are you sure there are no other places where that lock is being held? We do so in a few places, such as within a compute. Can you try with an executor(Runnable::run)? That's for the prior stack trace where it seemed like the FJP was stalled while holding the lock.
For the first question, it does seem like it's all blocking on computeIfAbsent:
For the other issue where the FJP was stalled, we now have a better idea of what happened: we really did swarm the system with requests, which ultimately saturated it.
Did you mean "does not"? I only see the putIfAbsent calls there.
Sorry, I meant to say that it does seem like it's blocking on the putIfAbsent calls.
Strange. As you can see in that code, it's doing very tiny manipulations. The worst case is a retry loop, but that should go to a fast-path exit. The isAlive check should only fail if the read entry was removed from the hash table (the retired or dead transitions occur afterwards), so it should not loop indefinitely. If expired, it should resurrect the entry with a new value. Nothing seems obviously wrong. Can you try a computeIfAbsent? This is a head scratcher.
My best guess is that there is a thread you are not seeing which holds onto the lock. That might be the FJP trying to evict and not progressing; the direct executor would alleviate that somewhat. Otherwise this most closely resembles the lost-wakeup bug, which is a kernel problem and not user code. I agree you should be beyond that by now.
Will try computeIfAbsent(), thanks for the help!
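For an ad-hoc call site, the computeIfAbsent suggestion can go through either Cache.get(key, mappingFunction) or the map view; a minimal sketch, with createValue as a hypothetical stand-in for the application's value computation:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class ComputeIfAbsentSketch {
  private final Cache<String, String> cache = Caffeine.newBuilder()
      .maximumSize(1_000_000)
      .build();

  String getOrCreate(String key) {
    // Cache.get(key, fn) computes the value atomically on a miss, so only one
    // thread runs the function and the others wait for its result.
    return cache.get(key, k -> createValue(k));
  }

  String getOrCreateViaMapView(String key) {
    // Equivalent form on the ConcurrentMap view.
    return cache.asMap().computeIfAbsent(key, k -> createValue(k));
  }

  private String createValue(String key) {
    // Hypothetical value computation for illustration.
    return "value-for-" + key;
  }
}
```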
Update:
It's a bit weird that computeIfAbsent fixes the issue. Is there something else that we can add to further debug this? Metrics, etc.
That at least is very good news. I don't know either why it would make such a big difference.
I am not sure how to reproduce it in order to debug, or what metrics would help...
We recently upgraded from 2.7.0 to 2.8.8 and noticed that our putIfAbsent code-paths are contending more heavily than before (at least that's our observation):
We are aware of the commit 614fe60 that should have improved putIfAbsent but we are somehow seeing the opposite. This is a bounded (1M) synchronous cache with a 1-day expiration.