
Custom key equality for weakKeys (equals-based lookup) #344

Closed
Maaartinus opened this issue Sep 6, 2019 · 26 comments

Comments

@Maaartinus

I'm not sure at all about this feature, but it's something I wanted a lot quite a while ago. Opening this issue to discuss it, as requested.

In the case of weakKeys(), identity-based lookup makes sense more often than using equals, but sometimes equals is better. Imagine you cache some pure function and the keys get re-created: with equals you may be lucky and get a hit, while with identity it's always a miss. All workarounds mean more work and more overhead.

As Ben wrote, since there's already a wrapper for a weak key, replacing it with a different wrapper (which uses equals rather than identity) should add no overhead.

Allowing equals-based lookup would also make softKeys() usable, as the only reason against them no longer applies. OTOH, soft references don't work very well and it's unclear if such a cache is worth it.

> I think due to codegen it could increase the JAR size for all the variations, unless there is an elegant alternative that I'm missing?

I'm afraid you'd need another version of WeakKeyReference and LookupKeyReference and a few new NodeFactorys, and/or to pay some tiny runtime overhead.

> I'm not against the feature, though not sure what the cache builder setting would be, e.g. weakKeysByEquality()?

Maybe forceUsingEqualsForKeys? But that's even longer.

> I know offering the full-baked keyEquivalence feature isn't compatible

This is unfortunately true, and it makes the idea less appealing than in Guava's case, which could do it easily using its custom map.

@ben-manes
Owner

I think the most natural argument for weakKeys by equality is interning, e.g. Guava's Interners.newWeakInterner(). This can be a useful optimization to reduce the memory waste caused by duplication of immutable values, e.g. strings. It comes up during deserialization or an ORM's object hydration, when a lot of bloat is expected, so one can intelligently add special-case interning.
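
A minimal sketch of that interning pattern using Guava's interner; the Hydration class and canonicalize method are illustrative names only, assuming a deserialization step that yields many duplicate strings:

import com.google.common.collect.Interner;
import com.google.common.collect.Interners;
import java.util.List;

class Hydration {
  // Canonicalizes duplicate immutable values (e.g. repeated strings from a
  // result set) so that equal instances share a single object on the heap.
  // The weak interner lets a canonical string be reclaimed once it is unused.
  private static final Interner<String> STRINGS = Interners.newWeakInterner();

  static List<String> canonicalize(List<String> raw) {
    return raw.stream().map(STRINGS::intern).toList();
  }
}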

I'm hard pressed to think of other cases where I might have wanted weakKey equality in my own code. The identity is perfect when scoping the lifecycle to a single consumer, such as a per-session cache of a shared resource. Any other usage I can think of, especially when considering soft references, is better handled by an explicit size bounding.

I think naming of this feature in the builder is my biggest concern.
(@kluever and team might have good opinions here)

@ben-manes
Owner

Thinking about this anew, I believe we can do this cleanly without additional codegen bloat or poor builder names, and with minimal complexity.

In MapMaker of yore, there were settings keyEquivalence and valueEquivalence which were pluggable instances of Guava's Equivalence. In that case the hash table was forked to make the equality checks directly. Supporting weak references with equals would then be weakKeys + keyEquivalence(Equivalence.equals()). That sounds pretty good for an API to me.

In our case we have a LookupKeyReference that wraps the key for identity comparison. The generated NodeFactory knows its type, so all lookups go through nodeFactory.newLookupKey(key), which either returns the key directly or a wrapped object. We could pass keyEquivalence as a parameter to delegate to, letting it provide the proper lookup object. A custom equivalence would surely need a wrapper that delegates to it (e.g. for byte[] keys), but our own equals would never need to be wrapped. This way the general case would stay identical and efficient.

The builder would default to its current behavior, but allow overriding via a keyEquivalence setting. This would then be a natural extension, clearer, and compatible. We could offer a similar valueEquivalence, which is likely trivial, but we'd need to review the usages a bit (e.g. replace and compute). At worst, if left out, it wouldn't feel like much of a gap and would rarely be needed.

It could also make a stronger argument for offering #294 as an optimization to avoid the lookup key wrapper. It might be something to bake into our Equivalence type or another builder setting, depending on how we fleshed out the APIs.

What do you think?

@ben-manes
Owner

ben-manes commented Jun 10, 2020

hmm, I suppose this wouldn't work nicely for a custom key equivalence, which is why I originally dismissed it. It might still work for equals and reference, though.

The identity case has to be wrapped for both lookup and storing. Therefore WeakKeyReference and LookupKeyReference both extend InternalReference. Since equality must be symmetric, the wrappers must use the same strategy. This would lead us to the following cases,

  1. Strong keys
    1. equality => no wrapper
    2. identity => identity wrapper
    3. custom => equivalence wrapper
  2. Weak keys
    1. equality => weak reference + equality wrapper
    2. identity => weak reference + identity wrapper
    3. custom => weak reference + equivalence wrapper

The weak reference has to wrap the key for GC'ing, so a wrapper is always required. We'd prefer not to store an extra Equivalence field on every key, unless we determine that object alignment makes it free. So we'd take that penalty only for a custom Equivalence strategy and drop it in the common case.

I think that would still work and remains simpler by not needing to codegen different Node classes, but rather making newLookupKey and newReferenceKey switch on the keyEquivalence argument.
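
As a rough illustration of cases 2.1 and 2.2 above, here is a hedged sketch of what the two weak-key wrapper flavors could look like; IdentityWeakKey and EqualsWeakKey are hypothetical names, and Caffeine's generated WeakKeyReference/LookupKeyReference types differ in detail:

import java.lang.ref.WeakReference;

// Hypothetical weak-key wrappers: both capture the hash up front so the entry
// can still be located (and purged) after the referent has been collected.
final class IdentityWeakKey<K> extends WeakReference<K> {
  private final int hash;

  IdentityWeakKey(K key) {
    super(key);
    this.hash = System.identityHashCode(key);
  }

  @Override public int hashCode() {
    return hash;
  }

  @Override public boolean equals(Object o) {
    if (o == this) {
      return true; // a cleared reference still equals itself
    }
    return (o instanceof IdentityWeakKey<?> other)
        && (get() != null) && (get() == other.get()); // identity comparison
  }
}

final class EqualsWeakKey<K> extends WeakReference<K> {
  private final int hash;

  EqualsWeakKey(K key) {
    super(key);
    this.hash = key.hashCode();
  }

  @Override public int hashCode() {
    return hash;
  }

  @Override public boolean equals(Object o) {
    if (o == this) {
      return true;
    }
    K self = get();
    return (o instanceof EqualsWeakKey<?> other)
        && (self != null) && self.equals(other.get()); // equals comparison
  }
}

A lookup-only (non-weak) variant of each wrapper would follow the same contract, so that probe keys and stored keys compare symmetrically.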

@DavyLandman

Looks good, I only wanted to point out that soft keys will add another set of rows.

@ben-manes
Owner

Right, thanks. We would probably ignore soft keys + identity and use its weak key instead, since it has identical application semantics while also being more GC hygienic.

We can also probably trim the generated class hierarchy by combining the weak/soft variants, using Reference field types and passing a factory to construct the right type (e.g. see FD#setValue).

@DavyLandman

> Right, thanks. We would probably ignore soft keys + identity and use its weak key instead, since it has identical application semantics while also being more GC hygienic.

Sorry, I don't understand: why can a soft reference be replaced by a weak reference for identity? Or do you mean during lookup?

@ben-manes
Owner

A weak reference is collected when there are no strong references remaining; this can happen immediately.

A soft key is collected when there are no strong references and GC pressure requires freeing memory. Until it is collected, it can be resurrected to a strong reference.

For identity lookups there is no way to have the same instance to look up with unless you hold a strong reference to the key. If you lack a strong reference, the entry cannot be retrieved again. This makes it externally identical to the weakKeys setting.

Potentially you might have a soft reference to the key elsewhere to resurrect from, and thereby still be able to access it in the cache. That redundancy isn't realistic in practice, so I don't think treating it as a weak key is a meaningful violation.
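
To illustrate the identity-lookup point, a small hedged sketch (IdentityLookupDemo is just an illustrative class name):

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

class IdentityLookupDemo {
  public static void main(String[] args) {
    // With weakKeys() the cache compares keys by identity (==), so an equal
    // but distinct key instance never matches; only the original object hits.
    Cache<String, Integer> cache = Caffeine.newBuilder().weakKeys().build();

    String key = new String("config"); // deliberately a fresh instance
    cache.put(key, 42);

    System.out.println(cache.getIfPresent(key));                  // 42: same instance
    System.out.println(cache.getIfPresent(new String("config"))); // null: equal, different identity
  }
}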

@ben-manes ben-manes changed the title Consider providing weakKeys with equals-based lookup Custom key equality for weakKeys (equals-based lookup) Feb 21, 2021
@ben-manes
Owner

I am having a hard time justifying this feature. The technical aspects are clear, but the use-case is not. That leads back to Guava's rationale for why it was removed: across Google's large code base there was not a valid use-case. I'll summarize what I understand so far.

Weak interner

This is the use-case that Guava observed throughout Google's code base, which caused the feature to remain package-private (see Interners.newWeakInterner()). The intent is to discard the original instance in favor of the equivalent shared (canonical) one, thereby reducing the memory footprint and allowing for fast identity checks. If this feature were supported, an interner could be written as:

<E> Function<E, E> newWeakInterner() {
  Cache<E, Boolean> cache = Caffeine.newBuilder()
      .keyEquivalence(Equivalence.equals())
      .weakKeys()
      .build();
  return e -> {
    for (;;) {
      var entry = cache.policy().getEntryIfPresentQuietly(e);
      if (entry != null) {
        return entry.getKey();
      }
      var value = cache.asMap().putIfAbsent(e, Boolean.TRUE);
      if (value == null) {
        return e;
      }
    }
  };
}

Custom equality

The primary example for this is that Java's arrays use identity-based equality, e.g. (new int[] {1}).equals(new int[] {1}) is false. The more idiomatic approach is to use a Java Collection, e.g. a List, or one of its primitive variations like those from fastutil. The same holds when replacing the equality contract of a custom type, e.g. the infamous java.net.URL. In both cases the user can wrap the key in their own type, which is what the cache would otherwise do on their behalf.
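
For instance, a hedged sketch of wrapping an int[] key yourself; IntArrayKey is an illustrative name, not a library type:

import java.util.Arrays;

// Gives an int[] value-based equality and hashing, so it can serve as a cache
// key without relying on the array's identity-based equals/hashCode.
record IntArrayKey(int[] values) {
  IntArrayKey {
    values = values.clone(); // defensive copy keeps the key effectively immutable
  }

  @Override public boolean equals(Object o) {
    return (o instanceof IntArrayKey other) && Arrays.equals(values, other.values);
  }

  @Override public int hashCode() {
    return Arrays.hashCode(values);
  }
}

The cache then sees an ordinary equals-based key, e.g. cache.get(new IntArrayKey(ints), k -> expensiveLoad(k)).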

The problem

So while interning is a desirable case to support, it allows for a lot of misuse. Guava's larger scope allowed it to hide this feature and solve the singular problem behind a nice interface, whose implementation complexity is hidden from the user. Since many might not write an interner correctly, it does seem better that they made it into a first-class concept.

@kevinb9n's rationale summarizes the user's intent and misunderstanding well. They would expect that using any equivalent key would find the entry and signal a strong desire to retain it, and that only once all references to every equivalent key are reclaimed would it become prudent to remove. That is exactly where the confusion lies: the weak reference only applies to the instance used by the mapping. The fact that the mapping is popular or that equivalent instances abound has no relevance to the garbage collector. Once the instance itself has no strong references it is eligible for collection, which results in the cache discarding the value earlier than the user intended.

Therefore, to get the desired behavior one should use both an interner and a cache. These serve different purposes, and combining those responsibilities would make for confusing logic. The interner explicitly reads as replacing the element with a canonical instance that should be used, while the cache explicitly provides a policy for how long the data should be held in memory. One can easily understand the intent: cache.get(interner.intern(key), k -> /* expensive load */).

What to do?

If there are no other cases then we should be very wary of exposing a footgun. We could avoid this by following Guava's example and providing our own Interner type backed by Caffeine instead of MapMaker. That is not completely unreasonable as a cache-like use-case. However, this library isn't meant to advocate against using Guava - it's still wonderful - so in practice using their Interner seems perfectly fine. The reasons not to don't seem to be about performance, but about minimizing dependencies for those who want to avoid the disk footprint. That's a practical ask, but quite niche.

Hence I have not been able to justify implementing this feature and would lean towards closing this issue rather than duplicating Guava's Interner.

@DavyLandman

I agree that WeakReferences require careful handling.

In the systems I've worked on, and where I wanted a WeakReference key, it was often something like interning (aka maximal sharing), or quicker clearing of a cache. So with that in mind, three observations:

  1. Yes, as soon as no strong reference remains, the entry would be invalidated. But that is fine and intended if you go for WeakReferences. I agree that in some cases a SoftReference and a reasonable expiration configuration would be just as good, but in high-memory-pressure systems (or systems where the computed values evolve quickly), a WeakReference can help avoid retaining objects that nobody needs anymore. I understand that this all depends on the kind of application: in a server-side Java app, WeakReferences will be cleared much sooner and more frequently than in a desktop/CLI app.
  2. I think the example discussed by @kevinb9n is complicated by the boxed values, as they are a prime case where users are rarely responsible for allocating (and worrying about references to) them.
  3. A third scenario: if the key retains a large amount of memory, I want to clear it as soon as nobody else is using it; I don't want a cache to keep that memory alive. But I also can't be sure that mine is the only reference that will be used to query it (so I need equals, not identity).

Hope this helps a bit.

@ben-manes
Owner

Thanks @DavyLandman. Do you think that those cases would fit the interner + weak keyed cache approach? If so, then it should make the intent and behavior much clearer. The interner is responsible for returning a canonical key and the cache is responsible for evicting that key's heavy-weight value once it is no longer strongly referenced.

If the argument reduces to a preference for avoiding Guava or a micro-optimization, then I don't think this feature carries its conceptual weight.

@DavyLandman

I've had two scenarios where I wanted weak keys with equals contract:

  • interning
  • heavy key that maps to a different value

I think from an aesthetic point of view, it makes sense to allow equality on WeakReferences.

@ben-manes
Owner

Shouldn't the heavy key be interned first, or else it runs into the misunderstanding that Kevin highlighted?

@DavyLandman

Not necessarily, especially if the key can come from multiple places. So you have some fleeting accesses (that reconstruct the key) and some longer-lived ones that keep the key around via some form of interning.

@ben-manes
Owner

Sure, but after reconstruction there is a canonical key that can be used and held by the cache. The longer-lived accesses that interned it help keep it alive, while the short ones only need it transiently for the lookup. It seems to me that using an interner to get the cache key before the lookup has similar results.

@DavyLandman

DavyLandman commented Apr 4, 2022

Ah, true, that's completely possible: at the cost of an extra lookup you could chain them one after the other.

Do you want to propose a special case for the interning logic?

@ben-manes
Owner

Yes, exactly. Guava kept this feature internal and instead offers an Interner interface, with Interners.newWeakInterner() providing it. Then one performs cache.get(interner.intern(key)), which clearly communicates the intent.

If that satisfies the requirements, then I don't think we need to offer our own Interner type and can close this issue for now.

@DavyLandman

DavyLandman commented Apr 4, 2022

I don't want to take over @Maaartinus's issue.

But I do want to note that it's a pity we need Guava as an extra dependency. Most of our intern caches we've built using Caffeine.

@ben-manes
Owner

If we did this then maybe we could make it into a subtle feature. Even though I do think a proper interface like Guava's is very desirable, it also seems a bit beyond the scope here and could lead to feature-creep asks. We could do something subtle like Function<E, E> interner = Caffeine.newWeakInterner() while still directing users to Guava as the default, or telling them to use a method reference to that API. Otherwise, adding our own Interner interface isn't the end of the world, just a bit of an ad hoc addition.

@DavyLandman

My bad, I was assuming most of the wiring for a proper "weak" interner was already present in your internal classes.

@ben-manes
Owner

It isn't far from it, but some work is needed. A full abstraction for pluggable key/value Equivalence types would have required more generalization; it wasn't done that way initially to avoid the few extra fields and wrappers that it would impose. If only a weak interner with private hooks is needed, then it might be a very small change.

ben-manes added a commit that referenced this issue Apr 10, 2022
A weak keyed cache uses identity equivalence, as this is the only sensible
approach when caching a value. If Objects.equals were used then the user
would not be inclined to hold a reference to the canonical key, expecting
instead that any equivalent key would cause the mapping to be retained. As
that is an impossible expectation, the cache would seem to discard entries
prematurely. Therefore we disallow this case to hint that users should be
more thoughtful about their desired behavior.

An interner is a Set-based cache where the key is the only item of interest.
This allows usages to resolve to a canonical instance, discard duplicates
as determined by Object.equals, and thereby reduce memory usage. A weak
interner allows the garbage collector to discard the canonical instance
when all usages have been reclaimed. This special case of a weak keyed
cache with Object.equals behavior is now supported, but hidden through
an interface to make the behavior explicit and avoid misuses.

For cases that want to cache from a canonical weak key to a value, the
two types should be used in conjunction. Instead of incorrectly trying
to combine them into a single cache, intern the key and use that for
the cache lookup.
@ben-manes
Owner

I brought over Guava's Interner type to document the behavior. You can use it like,

Interner<E> interner = Interner.newWeakInterner();
E canonical = interner.intern(sample);

The strong implementation wraps a ConcurrentHashMap<E, E>, equivalent to computeIfAbsent(key, identity()).

The weak implementation uses a weak-keyed Cache<E, Boolean> with a specialized entry type that performs equality comparisons and doesn't store the value (as it is unused).

Neither implementation offers other bounding options, as that doesn't make sense for this use-case. For caching where bounding the values is useful, as discussed above one can combine a Cache<K, V> with an Interner<K> to maintain a cache of the canonical mappings.
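
Putting the two together, a hedged sketch of that combination; CanonicalCache, loader, and the maximumSize bound are illustrative, not a prescribed API:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Interner;
import java.util.function.Function;

class CanonicalCache<K, V> {
  // The interner returns a canonical key that lives only while callers hold a
  // strong reference to it; the weak-keyed, size-bounded cache then maps that
  // canonical key to its value and drops the entry once the key is reclaimed.
  private final Interner<K> interner = Interner.newWeakInterner();
  private final Cache<K, V> cache = Caffeine.newBuilder()
      .weakKeys()
      .maximumSize(10_000)
      .build();

  V get(K key, Function<K, V> loader) {
    K canonical = interner.intern(key);
    return cache.get(canonical, loader);
  }
}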

@DavyLandman

Sounds good, thanks for taking the effort 👍🏼

@ben-manes
Owner

@tydhot I added a wiki page for this feature. If you would kindly translate it, that would be appreciated!
(Since it's unreleased, I temporarily added a version hint.)

@tydhot
Collaborator

tydhot commented Apr 11, 2022

@ben-manes no problem!:)

@tydhot
Collaborator

tydhot commented Apr 23, 2022

https://github.com/ben-manes/caffeine/wiki/Interner-zh-CN done:) @ben-manes

@ben-manes
Owner

Released in 3.1.0
