
Extend the memo feature with optional parameters for expiring entries #1424

Merged: 14 commits into master on Jun 12, 2020

Conversation

@DavyLandman (Member)

This PR does two things:

  • Memo entries now expire after 1 hour of not being used.
  • The memo tag can be configured with function-specific expiration strategies.

```
import util::Memo; // new module that you have to import to get these parameters

@memo=expireAfter(minutes=10) // expires entries after they haven't been used for 10 minutes
int calc(int y) = y * 2;

@memo=maximumSize(4000) // once the table has more than 4000 entries, we start clearing it to make room (in a random order)
int calc2(int y) = y * 3;

@memo={maximumSize(4000), expireAfter(minutes=10)} // the strategies can be combined
int calc3(int y) = y * 4;
```

The only other reason memo caches are cleared is running out of memory.

Future improvements:

  • Implement a more LRU-like structure so that maximumSize can also take the usage of entries into account.
  • Currently, retiring entries happens every 5 seconds; especially with maximumSize this can lag quite a bit behind, but I have no good idea how to schedule it more aggressively without adding significant call overhead. (A simplified sketch of this bookkeeping follows below.)
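
For readers who want a mental model of the bookkeeping described above, here is a simplified, hypothetical sketch (made-up names, not the actual interpreter code):

```java
import java.lang.ref.SoftReference;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the memo bookkeeping (made-up names): values are held
// in SoftReferences, so the GC can clear them under memory pressure, and every
// entry carries a last-used timestamp that a periodic sweep compares against
// the expiration window.
class MemoTable<K, V> {
    private static final class Entry<T> {
        final SoftReference<T> value;
        volatile long lastUsed; // refreshed on every lookup

        Entry(T value, long now) {
            this.value = new SoftReference<>(value);
            this.lastUsed = now;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long expireAfterNanos;

    MemoTable(long expireAfterNanos) {
        this.expireAfterNanos = expireAfterNanos;
    }

    V lookup(K key) {
        Entry<V> e = entries.get(key);
        if (e == null) {
            return null;
        }
        e.lastUsed = System.nanoTime();
        return e.value.get(); // may be null if the GC already cleared it
    }

    void store(K key, V value) {
        entries.put(key, new Entry<>(value, System.nanoTime()));
    }

    // Runs every 5 seconds: entries that were not used within the expiration
    // window, or whose value was already collected, are removed.
    void startSweeper(ScheduledExecutorService scheduler) {
        scheduler.scheduleWithFixedDelay(() -> {
            long now = System.nanoTime();
            for (Iterator<Map.Entry<K, Entry<V>>> it = entries.entrySet().iterator(); it.hasNext(); ) {
                Entry<V> e = it.next().getValue();
                if (e.value.get() == null || now - e.lastUsed > expireAfterNanos) {
                    it.remove();
                }
            }
        }, 5, 5, TimeUnit.SECONDS);
    }
}
```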

Since I know at least @joukestoel and @rodinaarssen have talked about this at some point, will you take a look before I merge into master?

Use:

```
@memo{(
   "expireSeconds": 20
)}

// or

@memo{(
   "expireMinutes": 5
)}

// these can be combined:

@memo{(
   "expireMinutes": 5,
   "maxEntries": 10000
)}
```

By default, entries expire after 1 hour and there is no maximum number of entries.
```
@memo{
    expire(
        seconds = 10,
        minutes = 20, // only one of the time-based options should be set
        hours = 2,
        entries = 4000
    )
}
```
@jurgenvinju (Member) left a comment


Very cool and handy. It should help with memory efficiency.

@joukestoel (Member) left a comment


Looking good to me!
A nice added bonus is that Prelude now gets a standard way to call sleep, so I don't have to include separate code for that in my own Rascal projects anymore 😬

The reason we discussed memoization in the past was the need for an explicit way to clear the memo caches programmatically; I needed that to run some benchmarks of my Rascal code.
I don't see anything in this PR that addresses that, but (as discussed yesterday) I presume the previously added code to clear the caches is still there.

@DavyLandman (Member, Author)

> Nice added bonus is that Prelude now gets a standard way to call sleep so I don't have to include separate code for that in my own Rascal projects anymore 😬

Yeah, I'm not happy about that, but at least it's not a library function ;)

@rodinaarssen (Member) left a comment


Looks good!

@DavyLandman (Member, Author) commented Jun 5, 2020

Thanks for the checks.

I realized how we can use Caffeine for this; it makes the cleanup function a bit simpler, but it also makes things slower (it's a great library, but it does a lot that we don't need). What do you think? Less of our own code in exchange for a bit more call overhead?

@joukestoel (Member)

Hmmm, I would prefer less call overhead in this case, since this is code that gets hit very often. So my vote would be to leave it as is for now. The more runtime speed we can get, the better imho :)

@jurgenvinju (Member) commented Jun 6, 2020 via email

Javadoc excerpt from the change under review:

```
* A monotonic ticker that is fast to access frequently and is updated roughly once every second.
*
* For more reasons why we want a fast time field, see this discussion:
* http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html
```
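
For illustration, a coarse ticker like the one described in this Javadoc can be built from a single volatile field that a daemon thread refreshes about once per second (a hedged sketch, not necessarily the class under review):

```java
// Sketch of a coarse monotonic ticker: reads are a cheap volatile load, and a
// daemon thread refreshes the value roughly once per second.
final class CoarseTicker {
    private static volatile long current = System.nanoTime();

    static {
        Thread updater = new Thread(() -> {
            while (true) {
                current = System.nanoTime();
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    return;
                }
            }
        }, "coarse-ticker");
        updater.setDaemon(true);
        updater.start();
    }

    private CoarseTicker() {}

    static long read() {
        return current;
    }
}
```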

What an incredibly thorough and nice blog post! TIL :)

@ben-manes

Caffeine has an optional scheduler setting that cleans up automatically based on the next expiration time. This is paced to run at most every second to avoid excessive runs due to being off by nanos / millis. The sorted order is kept by using a timing wheel, which is an amortized O(1) priority queue.
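
(For readers unfamiliar with that setting, the configuration looks roughly like this; a generic example using Caffeine's public builder API, not code from this PR:)

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Scheduler;

class CaffeineSchedulerExample {
    // With a Scheduler configured, Caffeine proactively removes expired entries
    // close to their expiration time instead of waiting for the next cache access.
    static Cache<String, Integer> build() {
        return Caffeine.newBuilder()
                .expireAfterAccess(Duration.ofMinutes(10))
                .maximumSize(4_000)
                .scheduler(Scheduler.systemScheduler())
                .build();
    }
}
```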

For eviction we wrap the value with list pointers to use the O(1) lru trick. Then max size is a poll of the list head and quick reorderings. However making it concurrent is where all the complexity lies.
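
(The single-threaded textbook form of that list-pointer trick is what an access-ordered LinkedHashMap provides; a minimal illustration only, not Caffeine's concurrent implementation:)

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal illustration of the O(1) LRU idea: an access-ordered LinkedHashMap
// moves an entry to the tail of its internal list on every get, so the head is
// always the least recently used entry and eviction is a constant-time removal.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, /* accessOrder = */ true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```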

The allocation of the LookupKey per call can be removed by using a thread local. Unfortunately that doesn't play well with GC'ing ClassLoaders, so as a general library we can't make assumptions about our users' usage.

I’m not sure if you can borrow much of the code, but I hope that offers optimization ideas.

@DavyLandman (Member, Author)

Hi @ben-manes,

Always a pleasure to know you'll read along when we mention Caffeine. As you might know, we've replaced most of the caches in our code base with Caffeine ones; even a few instances of String::intern() were replaced by Caffeine 🙇.

Using Caffeine

I truly want to consider Caffeine instead of our own wrapper around ConcurrentMap and a cleanup thread, so let me give some more context in case it wasn't clear from the discussion & Javadoc:

  • We need value equality on keys, while also storing them in SoftReferences. I was working on a version of LookupKey that was simply put into a SoftReference before being stored in Caffeine, overriding that SoftReference's hashCode and equals. (I understand why you designed Caffeine to use reference equality when you ask for soft keys.) This would also mean I still need to take care of invalidating entries in the cache in some background runner (or on every put). A rough sketch of this idea follows the list.
  • Size-based eviction will be used far less commonly than LRU.
  • While the feature in Rascal is called memo, in reality we do not really care about memoization (as in, atomic putIfAbsent).
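
A rough sketch of that first bullet, with hypothetical names (SoftMemoKey is made up for illustration, not a class in this PR):

```java
import java.lang.ref.SoftReference;
import java.util.Arrays;

// Hypothetical sketch: a soft reference that caches the referent's hash code
// and compares by value equality, so it can be used as a map key while the
// argument tuple itself stays softly reachable.
final class SoftMemoKey extends SoftReference<Object[]> {
    private final int hash;

    SoftMemoKey(Object[] args) {
        super(args);
        this.hash = Arrays.deepHashCode(args);
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (!(obj instanceof SoftMemoKey)) {
            return false;
        }
        Object[] mine = get();
        Object[] theirs = ((SoftMemoKey) obj).get();
        // A cleared key can only ever equal itself; such entries get removed
        // by the background cleanup (or on the next put).
        return mine != null && theirs != null && Arrays.deepEquals(mine, theirs);
    }
}
```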

Regarding:

> but it does make it slower (it's a great library but it also does a lot that we don't need). What do you think? Less of our own code for a bit more call overhead?

I was reading the code, and it seems there is quite some bookkeeping in BoundedLocalCache around every put; that was the "overhead" I was discussing. It also calls expirationTicker().read() for every put, and since taking a timestamp can be an expensive operation, would it be better if we supply our own coarse ticker? (I was actually thinking of contributing a coarse variant of nanoTime to Caffeine, if you are interested.)

I was trying to balance this put overhead against just using the ConcurrentMap that you also end up using. But please elaborate if I'm overlooking something.

Optimization Suggestions

  • The ThreadLocal version of LookupKey sounds like a nice plan; do you know whether in practice this outperforms short-lived objects that are relatively cheap for the GC to collect? (And indeed, there is the overhead of clearing references.) A sketch of the idea follows below.
  • Our current eviction implementation is a bit rough; there is an optimization for the time-based one, but every once in a while we do a full iteration over the map. In the future we should definitely try to apply the TinyLFU approach.
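
A sketch of the thread-local lookup-key idea (hypothetical names; as noted elsewhere in this thread, this is only safe when you control the threads and class loading yourself):

```java
import java.util.Arrays;

// Hypothetical sketch of the ThreadLocal optimization: reuse one mutable
// lookup key per thread instead of allocating a fresh object for every cache
// access. (A general-purpose library cannot do this safely because the
// ThreadLocal can pin a ClassLoader; an application that controls its own
// threads and class loading can.)
final class LookupKeys {
    static final class LookupKey {
        Object[] args;
        int hash;

        LookupKey prepare(Object[] args) {
            this.args = args;
            this.hash = Arrays.deepHashCode(args);
            return this;
        }
        // equals/hashCode omitted in this sketch; they must be compatible with
        // the key class that is actually stored in the map.
    }

    private static final ThreadLocal<LookupKey> PER_THREAD =
            ThreadLocal.withInitial(LookupKey::new);

    // Borrow the per-thread key for a lookup; it must never be stored in the map.
    static LookupKey borrow(Object[] args) {
        return PER_THREAD.get().prepare(args);
    }
}
```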

@ben-manes

Thanks @DavyLandman,

You should definitely use the right tool for the job. Sometimes that's Caffeine, but certainly not always. So please take what I say only as an experience report, not an attempt to influence whether you use Caffeine or not.

> We need to have value equality on keys, while also storing them in SoftReferences

Yeah, I forget the details of why in Guava we decided to only support identity equality for weak/soft references. I believe it was because MapMaker had an equivalence option that was rarely used, and only used incorrectly inside Google's code base. When using identity equality, soft keys are the same as weak keys, since having no strong reference makes the entry unretrievable.

I wouldn't be against adding it (ben-manes/caffeine#344), but I am unclear on the tradeoffs, and the builder names have to be obvious & succinct (which seems harder than the code changes!).

> I was reading the code, and it seems there is quite some bookkeeping in BoundedLocalCache around every put; that was the "overhead" I was discussing.

Yes, though these are mostly minor calls like if-conditions that don't apply, so much of it is nanoseconds. Past benchmarks indicated that entry lock contention is the primary overhead, which leads us to have similar throughput to the underlying ConcurrentHashMap. A lock-free hash table of course removes that, but has its own gotchas.

Some of the code complexity is there to avoid calling put when possible, since that always locks; a get is used instead if the entry is present. A putIfAbsent will eagerly return if possible in the latest release; earlier releases had forgotten that, since it is on computeIfAbsent as well and that's the more common usage.

> It also calls expirationTicker().read(); for every put, and since time stamp can be an expensive operation, would it be better if we supply our own coarse ticker (I was actually thinking of contributing a coarse variant of the nanoTime to Caffeine if you are interested)?

Right, so this is why it's nice to have Ticker be pluggable. I don't think offering a coarse version directly is super valuable, as it is fairly trivial code and could lead to unnecessary usage when developers blindly apply optimizations. However, it is reasonable in your code if identified as a bottleneck. The cache will work fine with a custom ticker like yours, and I tried to avoid redundant calls to the ticker to minimize the lookup overhead.
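
(Wiring a custom coarse ticker into Caffeine's builder looks roughly like this; again a generic example, not code from this PR:)

```java
import java.time.Duration;

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Ticker;

class CustomTickerExample {
    // Ticker is a functional interface with a single long read() method, so any
    // coarse time source can be plugged in.
    static Cache<String, Integer> build(Ticker coarseTicker) {
        return Caffeine.newBuilder()
                .expireAfterAccess(Duration.ofMinutes(10))
                .ticker(coarseTicker)
                .build();
    }
}
```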

> The ThreadLocal version of LookupKey sounds like a nice plan; do you know whether in practice this outperforms short-lived objects that are relatively cheap for the GC to collect?

At one point I had observed escape analysis kicking in to avoid the allocation, but it is too flaky to rely on. The idea originated in the linked discussion (ben-manes/caffeine#294), where an amp-agent seemed to run into allocation issues due to its high call frequency and needs to be very low overhead. They were able to use my idea of a thread local, but they also point out why I couldn't apply that trick safely, so it has remained backlogged.

A high rate of short-lived allocations will impact the GC, so being garbage-free on read is desirable for a cache. Since the JVM uses bump-pointer allocation, it is fast until threads contend trying to get new chunks and trigger pacing to allow the GC to catch up. Allocations can show up as a bottleneck in very hot paths where escape analysis doesn't kick in, e.g. when crossing a thread boundary via a concurrent queue. In this case I don't consider it too concerning, so just take note of possible optimizations if you do observe it later on.

> (and indeed, there is the overhead of clearing references)

You probably want to call Reference.clear() explicitly when discarding an entry, to avoid the cost of making the GC figure it out.
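
For example, when a retired entry is removed from the map (a hypothetical sketch):

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;

final class DiscardExample {
    private final ConcurrentHashMap<Object, SoftReference<Object>> entries = new ConcurrentHashMap<>();

    // When retiring an entry, clear the soft reference explicitly instead of
    // leaving it to the garbage collector to discover that the referent is no
    // longer reachable through the cache.
    void discard(Object key) {
        SoftReference<Object> removed = entries.remove(key);
        if (removed != null) {
            removed.clear();
        }
    }
}
```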

> Our current eviction implementation is a bit rough..

Overall I am quite happy with how things turned out in Caffeine, given the constraints of a general-purpose library. A targeted design like yours can leverage more informed decisions and make nice optimizations that a general solution cannot. Of course, remember to benchmark, since that's almost always forgotten and many times those nice gains unfortunately don't materialize in practice.

@DavyLandman (Member, Author)

Based on our discussion, I've simplified the code quite a bit: there is now a single class for lookups, which is either wrapped in a SoftReference when putting, or used raw when looking up.

@DavyLandman (Member, Author)

> Yes, though these are mostly minor calls like if-conditions that don't apply, so much of it is nanoseconds. Past benchmarks indicated that entry lock contention is the primary overhead, which leads us to have similar throughput to the underlying ConcurrentHashMap. A lock-free hash table of course removes that, but has its own gotchas.

Even the postWrite task architecture stuff?

> A high rate of short-lived allocations will impact the GC, so being garbage-free on read is desirable for a cache. Since the JVM uses bump-pointer allocation, it is fast until threads contend trying to get new chunks and trigger pacing to allow the GC to catch up. Allocations can show up as a bottleneck in very hot paths where escape analysis doesn't kick in, e.g. when crossing a thread boundary via a concurrent queue. In this case I don't consider it too concerning, so just take note of possible optimizations if you do observe it later on.

I'm putting the ThreadLocal in for now; I'll do some profiling & measurements in the future.

> Of course remember to benchmark, since that's almost always forgotten as many times those nice gains unfortunately don't materialize in practice.

When time permits: always :)

@DavyLandman merged commit 6dc2ced into master on Jun 12, 2020