Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

putIfAbsent() regression? #506

Comments

@panghy
Copy link

panghy commented Feb 22, 2021

We recently upgraded from 2.7.0 to 2.8.8 and noticed that our putIfAbsent code-paths are contending more heavily than before (at least that's our observation):

   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2030)
	- waiting to lock <0x00007f00ba087828> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1965)

We are aware of the commit 614fe60 that should have improved putIfAbsent but we are somehow seeing the opposite. This is a bounded (1M) synchronous cache with a 1-day expiration.

@ben-manes
Copy link
Owner

Can you write a simple jmh benchmark to debug against? I want to make sure there is no confusion of how it’s being used. Offhand I’m surprised as well.

@panghy
Copy link
Author

panghy commented Feb 22, 2021

Will take a look at that

@ben-manes
Copy link
Owner

Taking a fresh look and nothing jumps out at me.

The looping around isAlive() check under the lock should be benign because all calls to transition the entry (retired, dead) occur just before the map compute finishes to remove the entry. That means it may have a stale lookup and spin temporarily, but the map should be updated promptly so that the loop doesn't find an entry.

The lock may be held for a long time if there is some long-running user computation occuring, either via a compute / load or writer. If that is not returning promptly then it blocks further updates to that entry. A mistake can be if an atomic operation on the entry performs a write back into the map, which can cause livelocks in some JDK releases. If not that, then maybe adding timeout safeguards. Searching the other stack traces might hint to the cause.

Of course if there is a regression and you can reproduce it, I'll jump right onto it. Does downgrading to 2.7 help?

@panghy
Copy link
Author

panghy commented Feb 22, 2021

All line numbers are 2.8.8.

Trying to understand how it can contend on the node at line 2030 (we are also looking at whether 2.7 truly helps with this). Is there a reason if multiple threads hit line 1987 and prior is null that putIfAbsent() on line 2005 returns a non-null value and it cannot also optimize and check for onlyIfAbsent at line 2009 instead of having to synchronize at line 2030 (or am I not reading the code properly?)

@ben-manes
Copy link
Owner

yeah, that does seem to have been overlooked. I think you're right that the optimization could be added there too.

@panghy
Copy link
Author

panghy commented Feb 22, 2021

Per our own analysis, it seems like we are calling putIfAbsent() more in our release that also took on 2.8.8. Perhaps we are seeing this, i.e. contention on prior != null when multiple threads are trying to call putIfAbsent for the same key at once which leads to the synchronization on prior.

@ben-manes
Copy link
Owner

Let me know how you want to proceed to confirm this. Either downgrade to 2.7 to see if its an issue (as we suspect) or upgrade with a patch. If you are amenable to it, I'd prefer to use a pre-release build to confirm with in case we have to continue digging. I can get a patch in tonight (PST). What do you want to do?

@kentwang929
Copy link

Thanks for the fast response. We are currently working on creating the jmh benchmark as well as trying to reproduce the issues consistently on our side. Once we can reproduce it consistently, it would be great if we can move forward with a patch build.

@ben-manes
Copy link
Owner

I pushed the patch to v2.dev. I think it will be hard to test or benchmark, but a nice catch. I didn't see anything cleaner than copying the 10 lines of code.

This is available on jitpack (via commit hash) or Sonatype's snapshot repository (as 2.9.1-SNAPSHOT).

@panghy
Copy link
Author

panghy commented Feb 23, 2021

Thanks, giving it a spin.

@ben-manes
Copy link
Owner

I forgot to copy into the notifyWriter block as well, if by chance you are using a CacheWriter. If so, you might need to take latest commit.

@panghy
Copy link
Author

panghy commented Feb 24, 2021

Just to share here as well but it seems like 2.8.8 is faster already (there is no regression): https://github.com/dbaghdasarya/jmh-benchmark

@ben-manes
Copy link
Owner

Oh, good. Thanks for doing that. Does the 2.9.1 snapshot help in your case?

@panghy
Copy link
Author

panghy commented Feb 24, 2021

We have been testing it at this point, so far it seems to be helping. Thanks!

@ben-manes
Copy link
Owner

ben-manes commented Feb 24, 2021

great!

By the way, your benchmark has a shared Random used by threaded tests. This uses a lock internally, so that will cause synchronous access. You can use ThreadLocalRandom or create unique data sets by thread scoping (usage example). It may be that the perf difference is noise since JIT'ing, background tasks, etc. can cause skew. I look for large differences and do not put much weight on small changes.

@ben-manes
Copy link
Owner

Please let me know when you have a conclusion of whether the patch is enough or if you are still experiencing performance problems. I’ll cut a release for you once it’s clear that we’ve resolved the issues.

@ben-manes
Copy link
Owner

I am going to hold off on a release due to this optimization being for a very narrow race, that your benchmark didn't find a regression, and that you observed it was likely due to usage changes. Since it doesn't sound critical anymore, I'll wait until enough other changes pile up, unless otherwise pressed.

@panghy
Copy link
Author

panghy commented Mar 8, 2021

sorry, we're still going through some internal testing but we are releasing with the snapshot build for now, will circle back once we have confirmation that it helps

@haosong
Copy link

haosong commented Mar 10, 2021

@panghy @ben-manes
We're currently using the version of a89e7a6 and seeing similar issues again: around 200 threads all having parking to wait for <0x00007fba19558e90>:

java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@14/Native Method)
 - parking to wait for  <0x00007fba19558e90> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
 at java.util.concurrent.locks.LockSupport.park(java.base@14/LockSupport.java:211)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@14/AbstractQueuedSynchronizer.java:714)
 at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@14/AbstractQueuedSynchronizer.java:937)
 at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@14/ReentrantLock.java:153)
 at java.util.concurrent.locks.ReentrantLock.lock(java.base@14/ReentrantLock.java:322)
 at com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(BoundedLocalCache.java:1451)
 at com.github.benmanes.caffeine.cache.BoundedLocalCache.afterWrite(BoundedLocalCache.java:1372)
 at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2451)
 at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2367)
 at com.github.benmanes.caffeine.cache.LocalAsyncCache.get(LocalAsyncCache.java:91)
 at com.github.benmanes.caffeine.cache.LocalAsyncCache.get(LocalAsyncCache.java:82)
 at com.github.benmanes.caffeine.cache.LocalAsyncLoadingCache.get(LocalAsyncLoadingCache.java:79)```

@ben-manes
Copy link
Owner

This should only occur when the write rate greatly exceeds how quickly the cache can evict. There is a write buffer to capture work to replay against the eviction policy, so that usually a caller is only penalized by the cost to perform the hash table update. When the writer buffer is full (can grow up to ~128 * cpus), then the cache blocks on the eviction lock to assist in draining. This deprioritizes calling threads and lets the one holding the lock get more cpu time, which drains a batch of work. Typically that results in higher throughput and we are protecting against runaway growth.

Are you doing something very expensive that keeps the cache eviction from running or have otherwise starved the thread? By default the FJP#commonPool() is used as the executor. You might try a same-thread (Runnable::run) if that pool is overloaded. You can play with Stresser if you want to see how this all behaves.

@panghy
Copy link
Author

panghy commented Mar 15, 2021

Hm, when I looked at the stack trace that was captured, it seemed like all threads were blocked waiting for that lock and no progress was being made at all (all threads were parked). That seemed odd and did not point to contention issues but a lock release issue. That or we are doing something else wrong that can cause the lock to be held and not released some-how.

@ben-manes
Copy link
Owner

That is odd. Are you on an old Linux kernel? There is a nasty missing wake up bug that Linus tried to hide, which effectively broke the world.

#203 (comment)

@panghy
Copy link
Author

panghy commented Mar 15, 2021

Doesn't seem like it, we are on 5.4.0 with Java 14. Just a bit concerning that it started happening with the 2.9.1-SNAPSHOT build.

@panghy
Copy link
Author

panghy commented Mar 15, 2021

Also still trying to look into whether the latest patch is helping us, seeing:

   java.lang.Thread.State: RUNNABLE
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- locked <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)

while other threads are blocked at:

   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5660107488> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)

The putIfAbsent call still seems to be able to trigger contention.

@panghy
Copy link
Author

panghy commented Mar 15, 2021

Still trying to piece together the possibilities as to why this could happen (expiration could be it but this is a cache that has a 1 day expire after write and a size limit of 1M).

  1. getIfPresent() is called for a key
  2. returns if a value is present for the key
  3. putIfAbsent() is called against the key with the value

We could switch to a LoadingCache to make things a bit simpler for this particular code path but it's odd how we can contend on prior.

@ben-manes
Copy link
Owner

L2070 is afterRead, so I am assuming you are cutting the stack trace heavily. Can you show the rest of the calls? It is appending the read to a ring buffer and should tryLock to schedule the drain.

@panghy
Copy link
Author

panghy commented Mar 15, 2021

Hm, those are the only frames we have from jstack. We are on the a89e7a6164 of caffeine (2.9.1-SNAPSHOT). If afterRead is called, shouldn't it return and never synchronize on prior?

@ben-manes
Copy link
Owner

This is confusing. I don’t see why it’s causing so much trouble. Are you sure there are no other places that lock is being held? We do so in a few places, such as within a compute.

Can you try with a executor(Runnable::run)? That’s for the prior stacktrace where it seemed like the FJP was stalled while holding the lock.

@panghy
Copy link
Author

panghy commented Mar 15, 2021

For the first question, it does seem like it's all blocking on computeIfAbsent:

	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: RUNNABLE
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- locked <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)
--
--
   java.lang.Thread.State: BLOCKED (on object monitor)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.put(BoundedLocalCache.java:2070)
	- waiting to lock <0x00007f5e12290dc0> (a com.github.benmanes.caffeine.cache.PSWMS)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.putIfAbsent(BoundedLocalCache.java:1994)

@panghy
Copy link
Author

panghy commented Mar 15, 2021

For the other issue where the FJP was stalled, we now have a better idea of the issue, we really did swarm the system with requests that ultimately saturated the system.

@ben-manes
Copy link
Owner

For the first question, it does seem like it's all blocking on computeIfAbsent:

Did you mean “does not”? I only see the putIfAbsent calls there.

@panghy
Copy link
Author

panghy commented Mar 15, 2021

Sorry, I meant to say that it does seem like it's blocking on putIfAbsent calls.

@ben-manes
Copy link
Owner

ben-manes commented Mar 15, 2021

Strange. As you see in that code, it’s doing very tiny manipulations. The worst case is a retry loop, but that should go to a fast path exit.

The isAlive check should only fail if the read entry was removed from the hash table (retired or dead transitions occur afterwards). So it should not loop indefinitely. If expired, it should resurrect the entry with a new value. Nothing seems obviously wrong.

can you try a computeIfAbsent? This is a head scratcher.

@ben-manes
Copy link
Owner

My best guess is that there is a thread you are not seeing which holds onto the lock. That might be the FJP trying to evict and not progressing. The direct executor would alleviate that somewhat.

Otherwise this most closely resembles the lost wake up bug which is a kernel problem and not user code. I agree you should be beyond that by now.

@panghy
Copy link
Author

panghy commented Mar 15, 2021

Will try computeIfAbsent(), thanks for the help!

@kentwang929
Copy link

Update:
The rollout of the new changes seems to address the thread-lock issue.
2 things that we have done

  • switched from putIfAbsent to computeIfAbsent
  • switch to using MoreExecutors.directExecutor() in the cache

It's a bit weird that computeIfAbsent fixes the issues. Is there something else that we can add to further debug this? Metrics...etc.

@ben-manes
Copy link
Owner

That at least is very good news. I don't know either why it would make such a big difference.

computeIfAbsent performs a Map.get and returns eagerly if present and valid, or else falls back to a Map.compute. That locks within the hash table and performs under that lambda. Typically that is preferred because then the value is only loaded when absent to avoid cache stampedes where the value is computed multiple times as threads race.

putIfAbsent tries to avoid the map compute and lambda as an optimization, which improved throughputs when I tried both variants. In your case it should do that Map.get and exit early if valid, or else fallback to a per-entry lock. This does suffer from a retry loop on a stale read, but the next iteration shouldn't need to retry; as alive -> retired (removed from map) -> dead (removed from lru). I am still unsure why your experience hit the entry lock so frequently rather than the fast path read.

I am not sure how to reproduce to debug it or what metrics would help...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment