Removal listener not called #859
When I add a println to your test's removal listener then I see it being called, but the test fails claiming no calls were made.
An issue with the test is that it does not hold a strong reference to the cache. The scheduled cleanup task holds a weak reference to it, so that if the task outlives the cache's use then it can no-op and the cache can be garbage collected. It doesn't make much sense for a cache that the application discarded to be kept alive by pending expiration work; a proper shutdown or a long-lived instance would be more reasonable application logic. So that seems to be halting the accumulation and explains part of it, at least.
Thank you for your investigation, and you are right. The reproducer has the problem that the cache can be garbage collected. I fixed that issue and now the test is green.
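As an editorial aside, here is a minimal sketch of the idea behind that fix: the test keeps a strong reference to the cache until the notification has been awaited, so the cache cannot be garbage collected while expiration is still pending. The class name and the trailing cleanUp() call are illustrative and not part of the original reproducer.

import java.time.Duration;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Scheduler;

public class StrongReferenceExample {
  public static void main(String[] args) throws InterruptedException {
    var latch = new CountDownLatch(1);

    // The local variable keeps a strong reference to the cache for the whole test, so it
    // cannot be collected while the scheduled cleanup task (which only holds a weak
    // reference to the cache) is still pending.
    Cache<Object, Object> cache = Caffeine.newBuilder()
        .removalListener((key, value, cause) -> latch.countDown())
        .expireAfterWrite(Duration.ofMillis(5))
        .scheduler(Scheduler.systemScheduler())
        .build();
    cache.put("key", "value");

    // Wait for the expiration notification.
    boolean notified = latch.await(5, TimeUnit.SECONDS);

    // Using the cache after the wait ensures the reference provably outlives it; otherwise
    // the JVM may treat the cache as unreachable once its last use has passed.
    cache.cleanUp();
    System.out.println("removal listener notified: " + notified);
  }
}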
When I rewrite your test to capture all of the run state, I can reproduce an occasional delay (see the test below).

As scheduling is documented as best-effort, from that perspective this is acceptable, as there will be benign orchestration races which do not result in invalid state or operations. However, if you are relying on strict timing of events for business logic and otherwise leave the cache idle, then the divergence of goals could cause you problems. In that case it might make sense to use a stricter timing subsystem rather than rely on the cache's fuzziness. There may be some tweaks where we can do a bit better, but I am not sure yet if that would fully resolve this scenario or simply make it less likely.

In a cache every read is a write to update shared state, such as reordering items on an LRU list. This is why traditional caches do not scale well: every access requires obtaining an exclusive lock to update the policy metadata. The approach that Caffeine uses is to record these events into buffers and replay them against the policy under a tryLock, so the maintenance work is performed asynchronously and amortized across calls.

This orchestration means there can be races. Here the maintenance work is running: it has flushed the write buffer and is busy expiring all of the entries that the policy is aware of. A concurrent write adds the data entry and a write event and updates a flag, but does not schedule a new maintenance task as one is already running. When the maintenance finishes and releases the lock, the new pending work is left unscheduled. If you use the default executor (ForkJoinPool.commonPool()) the maintenance is rescheduled immediately, but with a custom executor the cache currently waits for later activity to trigger it.

Does that make sense? Do you have any suggestions for how this could be improved, etc.?

Optimistic rescheduling: caffeine/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java, lines 1629 to 1644 at commit 7548084
Test:

import java.time.Duration;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.junit.jupiter.api.RepeatedTest;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Scheduler;

public class CaffeineReproducer {
  private static final Object VALUE = new Object();
  private static final int NUMBER_OF_RUNS = 100_000;
  private static final int NUMBER_OF_KEYS = 10;

  private final ExecutorService cacheExecutor = Executors.newFixedThreadPool(200);
  private final ExecutorService testExecutor = Executors.newCachedThreadPool();
  private final ConcurrentHashMap<Integer, TestRun> runs = new ConcurrentHashMap<>();

  @RepeatedTest(1)
  void testRunRemovalListener() throws InterruptedException {
    System.out.printf("%n%nSTART%n%n");
    for (int i = 1; i <= NUMBER_OF_RUNS; i++) {
      runs.put(i, runTest(i));
    }
    Thread.currentThread().setName(getClass().getSimpleName());
    for (int i = 1; i <= NUMBER_OF_RUNS; i++) {
      System.out.printf("#%s - WAIT%n", i);
      var run = runs.get(i);
      boolean finished = run.latch.await(5, TimeUnit.SECONDS);
      if (!finished) {
        System.out.printf("debug for inspection");
        run.cache.cleanUp();
      }
    }
    System.out.printf("%n%nEND%n%n");
  }

  private TestRun runTest(final int run) {
    var list = Collections.synchronizedList(new ArrayList<>());
    var latch = new CountDownLatch(NUMBER_OF_KEYS);
    Cache<Object, Object> cache = Caffeine.newBuilder()
        .removalListener((key, value, cause) -> {
          //System.out.printf("#%s - %s: %s=%s%n", run, cause, key, value);
          latch.countDown();
          list.add(value);
        })
        .expireAfterWrite(Duration.ofMillis(5))
        .scheduler(Scheduler.systemScheduler())
        .executor(cacheExecutor)
        .build();
    for (int i = 0; i < NUMBER_OF_KEYS; i++) {
      var key = i;
      testExecutor.execute(() -> cache.put(key, VALUE));
    }
    return new TestRun(cache, list, latch);
  }

  class TestRun {
    Cache<Object, Object> cache;
    List<Object> notifications;
    CountDownLatch latch;

    TestRun(Cache<Object, Object> cache, List<Object> notifications, CountDownLatch latch) {
      this.notifications = notifications;
      this.cache = cache;
      this.latch = latch;
    }
  }
}
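To make the orchestration described above easier to follow, here is a deliberately simplified, hypothetical sketch of the general buffer-and-drain idea; it is not Caffeine's implementation (BoundedLocalCache uses ring buffers, a richer state machine, and the configured executor). It shows the race in question: a write that arrives while a drain is in progress leaves its event buffered, and nothing reschedules the drain once it finishes. The fix discussed in this issue addresses that by rescheduling via the configured scheduler when work remains required.

import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class BufferAndDrainSketch {
  static final int IDLE = 0, REQUIRED = 1, PROCESSING = 2;

  private final ConcurrentLinkedQueue<Runnable> writeBuffer = new ConcurrentLinkedQueue<>();
  private final AtomicInteger drainStatus = new AtomicInteger(IDLE);
  private final ReentrantLock evictionLock = new ReentrantLock();

  // Called on every write: buffer the event and request maintenance if none is pending.
  void afterWrite(Runnable event) {
    writeBuffer.add(event);
    if (drainStatus.compareAndSet(IDLE, REQUIRED)) {
      scheduleDrain();
    }
    // If the status was already PROCESSING, the event stays in the buffer and nothing
    // here reschedules the drain; this is the benign race described above.
  }

  // The maintenance task: replay buffered events against the policy under a tryLock.
  void drain() {
    if (evictionLock.tryLock()) {
      try {
        drainStatus.set(PROCESSING);
        for (Runnable event; (event = writeBuffer.poll()) != null; ) {
          event.run();
        }
        drainStatus.set(IDLE);
      } finally {
        evictionLock.unlock();
      }
    }
  }

  void scheduleDrain() {
    // Submit drain() to an executor; elided in this sketch.
  }
}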
I think that the optimistic rescheduling could use the provided scheduler:

if (drainStatusOpaque() == REQUIRED) {
  if (executor == ForkJoinPool.commonPool()) {
    scheduleDrainBuffers();
    return;
  }
  var pacer = pacer();
  if ((pacer == null) || (pacer.future != null)) {
    return;
  }
  synchronized (pacer) {
    if (pacer.future == null) {
      pacer.schedule(executor, drainBuffersTask,
          expirationTicker().read(), Pacer.TOLERANCE);
    }
  }
}
Good that you found the possible cause of the problem and also provided a proposal for a fix. Your explanation and the updated test make total sense. Your finding also shows me that my workaround does not work; it just changes the cleanup timing a little.

I thought a little about what you wrote regarding business logic in the callbacks. I totally agree that there cannot be any assumptions about when a callback will be called, but it should at least be called. This is not only important for business logic, but also in other cases, like cleaning up resources. I also agree that the most common case is ongoing interaction with the cache, but it can always happen that, for whatever reason, the interaction stops. In that case, nothing should linger in the cache when a scheduler is configured.

I really appreciate your time on this and your detailed explanations.
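For reference, a minimal sketch of the periodic cleanUp() workaround mentioned above and in the issue description, assuming an application-owned ScheduledExecutorService; the one-second period comes from the description and the class and variable names are illustrative.

import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class PeriodicCleanUpWorkaround {
  public static void main(String[] args) {
    Cache<Object, Object> cache = Caffeine.newBuilder()
        .removalListener((key, value, cause) ->
            System.out.printf("%s: %s=%s%n", cause, key, value))
        .expireAfterWrite(Duration.ofMillis(5))
        .build();

    // Instead of configuring a Scheduler, trigger the maintenance work explicitly every
    // second so pending expirations are flushed even when the cache is otherwise idle.
    var cleaner = Executors.newSingleThreadScheduledExecutor();
    cleaner.scheduleAtFixedRate(cache::cleanUp, 1, 1, TimeUnit.SECONDS);

    cache.put("key", "value");
    // ... application runs; shut the executor down on application shutdown:
    // cleaner.shutdown();
  }
}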
Your workaround is fine because it resolves the issue continuously. My fix would do the same, except that by inspecting the internals it can schedule only when necessary instead of (blindly) periodically.
Right, and the listener is always called; it's just that various tradeoffs result in there being no strict guarantee about when. That can cause an unexpected delay, as in your findings, but not a missed invocation.
I agree. Hopefully, I'll get a chance to work on the unit tests this weekend and release a fix with our proposed changes. Thanks for the bug report, reproducer, and patience!
A benign race can occur when a maintenance cycle is running, writers insert new entries, and all of the entries that the policy is aware of have expired so they are removed. In this case no clean up was scheduled by the expiration policy, the maintenance status is left as "required", and entries in the write buffer are waiting to be applied. If the cache is idle then those entries will expire and the removal listener will not be notified promptly. As a scheduler is configured to communicate that a prompt notification is desired, the cache should more aggressively use it to ensure this case is handled.
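For completeness, a sketch of the two ways a scheduler can be supplied through Caffeine's public API so that prompt notification is possible; Scheduler.systemScheduler() and Scheduler.forScheduledExecutorService(...) are existing factory methods, while the expiration durations and names here are illustrative.

import java.time.Duration;
import java.util.concurrent.Executors;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Scheduler;

public class SchedulerConfigurationSketch {
  public static void main(String[] args) {
    // Option 1: the dedicated system-wide scheduling thread.
    Cache<Object, Object> withSystemScheduler = Caffeine.newBuilder()
        .expireAfterWrite(Duration.ofMinutes(5))
        .scheduler(Scheduler.systemScheduler())
        .build();

    // Option 2: an application-owned ScheduledExecutorService.
    var scheduledExecutor = Executors.newSingleThreadScheduledExecutor();
    Cache<Object, Object> withOwnScheduler = Caffeine.newBuilder()
        .expireAfterWrite(Duration.ofMinutes(5))
        .scheduler(Scheduler.forScheduledExecutorService(scheduledExecutor))
        .build();
  }
}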
Released in 3.1.3
Will this change not be implemented for pre-v3 (Java 8) Caffeine?
Correct, it's not planned for 2.x. This is a quality-of-life improvement that is non-critical, and the old behavior was allowed by the API contract. Java 8 reached EOL for public security fixes, so those users are already accepting a very limited amount of support and have a high risk tolerance.
Thanks for the fast reply! I apologize for reviving this old thread.
My recollection was that calling cleanUp() manually would flush the pending work and resolve the problem.
I think the cleanup function is being called, but the listener is just not being notified. I will do some testing and see if I can get some properly reproducible results.
First, thank you for this really great library!
Now to my issue:
I have the problem that the removal listener is sometimes not executed in low throughput situations.
I use the most recent version, 3.1.2. As a workaround, I created a scheduled task that executes cleanUp() every second instead of setting a scheduler.

Here is a reproducer: