
asyncLoad occasionally called with Object due to race between refreshAfterWrite and expireAfterAccess #715

Closed
wiggzz opened this issue May 24, 2022 · 14 comments

Comments

@wiggzz

wiggzz commented May 24, 2022

We're seeing some ClassCastExceptions (in our code) after our cache loader's asyncLoad is invoked with an Object instead of the expected key type K. The cache is configured with both refreshAfterWrite and expireAfterAccess, and I suspect expireAfterAccess eventually causes the cache Node to be evicted and die, which sets the Node's key to DEAD_STRONG_KEY (new Object()). If the node is being refreshed at the same time, asyncLoad is called with the dead key.

I'll take a look if I can reproduce in a unit test, but wanted to open this issue to see if you had seen this before or if we are using it wrong.

Our cache is roughly configured like this:

Caffeine.newBuilder()
        .refreshAfterWrite(10, TimeUnit.MINUTES)
        .expireAfterAccess(15, TimeUnit.MINUTES)
        .maximumSize(325_000)
        .recordStats(/* custom StatsCounter, not doing anything really special */ )
        .scheduler(forScheduledExecutorService(/* scheduled executor for the application */))
        .buildAsync(/* our loader, which does coalescing similar to the `CoalescingBulkLoader` */);
@ben-manes
Owner

ben-manes commented May 24, 2022

Thanks! If you can put a unit test together that would be really great. It hasn't been reported before, but your description makes sense.

@ben-manes
Owner

The logic to trigger the refresh reads the key but doesn't check afterwards that Node.isAlive() is true. The bug makes sense; testing could be hard due to the races.

if (((now - writeTime) > refreshAfterWriteNanos()) && (keyReference != null)
&& ((key = node.getKey()) != null) && ((oldValue = node.getValue()) != null)
&& ((writeTime & 1L) == 0L) && !(refreshes = refreshes()).containsKey(keyReference)
&& node.casWriteTime(writeTime, refreshWriteTime)) {
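To make the suspected interleaving concrete, here is a minimal self-contained model (all names invented; this is not Caffeine's internal code) of a refresh path that captures a node's key without re-checking liveness, so a concurrent eviction can hand asyncLoad the sentinel Object:

```java
/** Minimal model of the suspected race; names here are invented, not Caffeine's. */
public class SentinelRaceDemo {
    /** Sentinel installed when a node dies, mirroring the DEAD_STRONG_KEY idea. */
    static final Object DEAD_KEY = new Object();

    static final class Node {
        volatile Object key;
        Node(Object key) { this.key = key; }
        boolean isAlive() { return key != DEAD_KEY; }
        void die() { key = DEAD_KEY; }
    }

    /** A refresh path that captures the key but never re-checks liveness afterwards. */
    static Object captureKeyForRefresh(Node node, Runnable concurrentEviction) {
        // The eviction races with this read; here we force the unlucky interleaving.
        concurrentEviction.run();
        Object key = node.key;  // non-null, so a null check alone still passes
        return key;             // but it is the sentinel, not a K
    }

    public static void main(String[] args) {
        Node node = new Node("user-42");
        Object key = captureKeyForRefresh(node, node::die);
        // An asyncLoad expecting String keys would now throw:
        try {
            String k = (String) key;
            System.out.println("refreshing " + k);
        } catch (ClassCastException e) {
            System.out.println("asyncLoad saw the dead sentinel, not a String");
        }
    }
}
```

The sentinel is deliberately non-null, which is why the `key != null` guard in the real condition is not enough on its own.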

@wiggzz
Author

wiggzz commented May 24, 2022

The logic to trigger the refresh reads the key but doesn't check afterwards that Node.isAlive() is true. The bug makes sense, testing could be hard due to races.

if (((now - writeTime) > refreshAfterWriteNanos()) && (keyReference != null)
&& ((key = node.getKey()) != null) && ((oldValue = node.getValue()) != null)
&& ((writeTime & 1L) == 0L) && !(refreshes = refreshes()).containsKey(keyReference)
&& node.casWriteTime(writeTime, refreshWriteTime)) {

Yep, that is exactly what I was thinking. And agreed, testing is tricky: we need the node to be evicted while the read is happening. Could we hook into the StatsCounter to tick the ticker and then call cleanUp() during a read?

@wiggzz
Author

wiggzz commented May 24, 2022

Actually, this theory can't be 100% correct, because the value is set to null in the same place that the key is set to DEAD_STRONG_KEY. So the refresh logic shouldn't process the dead node, since the value should be null... 🤔

@wiggzz
Author

wiggzz commented May 24, 2022

Ah - but it could be RETIRED_STRONG_KEY. Thinking out loud here.

@ben-manes
Owner

Right. Since those are two different fields, one could be set before the other and allow the node to pass the visibility check. Those nulls are mostly for weak/soft reference collection, not liveness.

ben-manes added a commit that referenced this issue May 25, 2022
When a read triggers a refresh, it captures the key and value into local
variables and performs validation checks. This should have included
verifying that the entry is still alive (vs retired, dead). Otherwise,
the key may be an internal sentinel and the refresh will fail. This has
no harmful effects due to being logged and swallowed.
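One plausible shape of the fix described in that commit message (a sketch against the condition quoted earlier in the thread, not the actual patch) is to re-check liveness after the key and value have been captured:

```java
// Sketch only; the real change lives in Caffeine's internals.
if (((now - writeTime) > refreshAfterWriteNanos()) && (keyReference != null)
    && ((key = node.getKey()) != null) && ((oldValue = node.getValue()) != null)
    && node.isAlive() // re-check after capture, so a retired/dead sentinel is skipped
    && ((writeTime & 1L) == 0L) && !(refreshes = refreshes()).containsKey(keyReference)
    && node.casWriteTime(writeTime, refreshWriteTime)) {
```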
@ben-manes
Owner

I was able to write a unit test using your idea of stats as a hook. I also made the sentinel keys proper classes so that it will be easier to debug in the future. Would you be able to canary 3.1.1-SNAPSHOT? If not, I can release and we can verify as a follow-up.

Thankfully this should be a benign bug since the ClassCastException is logged and swallowed, and the entry is discarded by its removal. Other than log noise, it shouldn't cause any problems.

ben-manes added a commit that referenced this issue May 25, 2022
@wiggzz
Author

wiggzz commented May 25, 2022

Interesting. In our case the error is not being fully swallowed by Caffeine because we coalesce the refreshes with the normal loads. Once a refresh is requested for an invalid key, it is added to a batch, and every other key requested alongside it then also fails with the ClassCastException because the batch is polluted.

So in our case it is more than noise, although it's probably not having much of a customer impact since it is fairly infrequent.
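The amplification can be illustrated with a toy coalescing loader (hypothetical code, only loosely modeled on the CoalescingBulkLoader example mentioned above): a single sentinel key that slips into a batch fails every future queued alongside it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Toy coalescing loader showing how one bad key poisons the whole batch. */
public class CoalescingDemo {
    private final List<Object> pending = new ArrayList<>();
    private final Map<Object, CompletableFuture<String>> futures = new HashMap<>();

    /** Queues a key into the current batch and returns its future. */
    public CompletableFuture<String> load(Object key) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.add(key);
        futures.put(key, future);
        return future;
    }

    /** Flushes the batch; one non-String key fails every future in it. */
    public void flush() {
        try {
            Map<Object, String> results = new HashMap<>();
            for (Object key : pending) {
                String k = (String) key; // throws if a sentinel Object slipped in
                results.put(key, "value-for-" + k);
            }
            results.forEach((k, v) -> futures.get(k).complete(v));
        } catch (ClassCastException e) {
            // The whole batch is polluted: every caller sees the failure.
            futures.values().forEach(f -> f.completeExceptionally(e));
        }
        pending.clear();
        futures.clear();
    }
}
```

Here even the callers that supplied valid keys observe the exception, which matches the "more than noise" behavior described above.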

I will see if we can canary this. If not, I will definitely follow up if it still happens after we upgrade.

@ben-manes
Owner

I will see if we can canary this. If not, I will definitely follow up if it still happens after we upgrade.

Any updates? I'm all set on my side

@wiggzz
Author

wiggzz commented May 26, 2022

I will see if we can canary this. If not, I will definitely follow up if it still happens after we upgrade.

Any updates? I'm all set on my side

I wasn't able to get to this today and will be on leave for a week, so feel free to cut the release and I will follow up when I get a chance.

@ben-manes
Owner

thanks, will do!

@ben-manes
Owner

Released 3.1.1

@wiggzz
Author

wiggzz commented Jun 17, 2022

@ben-manes Hey sorry for the delayed follow up on this. We finally got this out into production and it's fixed the issue. Thanks for the quick resolution!

@ben-manes
Owner

awesome 🥳
