Scaling up without state transfer

The goal is to be able to add nodes to the cluster so that they own new entries, without taking ownership of any of the old entries.

Simplifying assumptions:

  • Only one node is being added at a time.

  • Single owner

  • No transactions

  • When scaling down, data is simply lost

The basic idea is that all keys are written on the newest member. The location of all inserted keys is kept in a replicated cache (the anchor cache) and used for further reads/updates/removals.

It’s important to note that clients don’t have access to the anchor cache, so clients access servers in a round-robin fashion.

Implementation

The implementation will live in a module named anchored-keys.

Since we can't map keys to segments deterministically, we don't need a distributed cache. Instead, we can use an invalidation cache, and the ModuleLifecycle implementation can replace the invalidation interceptor with a custom interceptor.
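
A minimal sketch of that interceptor replacement, assuming hypothetical AnchoredKeysInterceptor and AnchoredKeysConfiguration classes (the real module may register additional components and wire dependencies differently):

import org.infinispan.configuration.cache.Configuration;
import org.infinispan.factories.ComponentRegistry;
import org.infinispan.interceptors.AsyncInterceptorChain;
import org.infinispan.interceptors.impl.InvalidationInterceptor;
import org.infinispan.lifecycle.ModuleLifecycle;

public class AnchoredKeysModuleLifecycle implements ModuleLifecycle {
   @Override
   public void cacheStarting(ComponentRegistry cr, Configuration configuration, String cacheName) {
      // Only touch caches that enable the (hypothetical) anchored-keys module configuration
      if (configuration.module(AnchoredKeysConfiguration.class) == null)
         return;

      // Swap the stock invalidation interceptor for the anchored-keys interceptor.
      // Component registration and dependency injection are omitted from this sketch.
      AsyncInterceptorChain chain = cr.getComponent(AsyncInterceptorChain.class);
      chain.replaceInterceptor(new AnchoredKeysInterceptor(), InvalidationInterceptor.class);
   }
}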

The interceptor will first look up the affected key in the anchor cache. If the key doesn't exist there, reads can assume the value is missing. The first write to a key must insert the current writer's location into the anchor cache and then forward the command to the current writer for the actual write; later operations use the recorded location.

When a key is removed, the interceptor must also remove the key from the anchor cache. When a node leaves, the interceptor must remove all the keys mapped to that node from the anchor cache.
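
A minimal sketch of that read/write/remove logic, assuming an injected anchor cache, a hypothetical forwarding helper, and that the newest member is simply the last node in the members list (the real interceptor would handle more command types, async invocation stages, and topology changes):

import java.util.List;
import org.infinispan.Cache;
import org.infinispan.commands.ReplicableCommand;
import org.infinispan.commands.read.GetKeyValueCommand;
import org.infinispan.commands.write.PutKeyValueCommand;
import org.infinispan.commands.write.RemoveCommand;
import org.infinispan.context.InvocationContext;
import org.infinispan.interceptors.DDAsyncInterceptor;
import org.infinispan.remoting.rpc.RpcManager;
import org.infinispan.remoting.transport.Address;

public class AnchoredKeysInterceptor extends DDAsyncInterceptor {
   // Replicated cache mapping each key to the address of the node holding its value.
   // How these fields are injected is omitted from the sketch.
   private Cache<Object, Address> anchorCache;
   private RpcManager rpcManager;

   @Override
   public Object visitGetKeyValueCommand(InvocationContext ctx, GetKeyValueCommand command) throws Throwable {
      Address location = anchorCache.get(command.getKey());
      if (location == null)
         return null;                          // not in the anchor cache => the value is missing
      if (location.equals(rpcManager.getAddress()))
         return invokeNext(ctx, command);      // stored locally, continue down the chain
      return forwardTo(location, command);     // stored elsewhere, fetch it from that node
   }

   @Override
   public Object visitPutKeyValueCommand(InvocationContext ctx, PutKeyValueCommand command) throws Throwable {
      // Existing keys keep their recorded location; only new keys are anchored to the current writer
      Address location = anchorCache.get(command.getKey());
      if (location == null) {
         location = currentWriter();
         anchorCache.put(command.getKey(), location);
      }
      if (location.equals(rpcManager.getAddress()))
         return invokeNext(ctx, command);
      return forwardTo(location, command);
   }

   @Override
   public Object visitRemoveCommand(InvocationContext ctx, RemoveCommand command) throws Throwable {
      // Keep the anchor cache consistent: drop the location together with the value
      Address location = anchorCache.remove(command.getKey());
      if (location == null || location.equals(rpcManager.getAddress()))
         return invokeNext(ctx, command);
      return forwardTo(location, command);
   }

   private Address currentWriter() {
      // Assumption of this sketch: the newest member is the last one in the members list
      List<Address> members = rpcManager.getMembers();
      return members.get(members.size() - 1);
   }

   private Object forwardTo(Address target, ReplicableCommand command) {
      // Hypothetical helper: a real implementation would send the command to the target
      // node via RpcManager and return/await the response asynchronously
      throw new UnsupportedOperationException("sketch only");
   }
}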

Configuration

The configuration consists of a custom anchored-keys element with a single attribute, enabled.

<invalidation-cache name="name">
    <anchored-keys enabled="true"/>
</invalidation-cache>
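
A programmatic equivalent might look like the snippet below; AnchoredKeysConfigurationBuilder and its enabled() method are assumptions of this sketch, not a confirmed API:

ConfigurationBuilder builder = new ConfigurationBuilder();
builder.clustering().cacheMode(CacheMode.INVALIDATION_SYNC);
// Hypothetical module builder mirroring the <anchored-keys enabled="true"/> element
builder.addModule(AnchoredKeysConfigurationBuilder.class).enabled(true);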

Performance considerations

Latency / RPCs

The client doesn't know the owner, so any read or write has an (N-1)/N probability of requiring a unicast RPC from the processing server to the owner.

In addition to this, the first write to a key also requires a replicated cache write, which means a multicast RPC plus an (N-1)/N probability of another unicast RPC.
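
As a rough worked example using the formulas above: in a cluster of N = 4 nodes, a read or update reaches the owner directly only 1 time in 4, so it costs on average (N-1)/N = 0.75 extra unicast RPCs. The first write to a key costs the same 0.75 forwarding RPCs plus one multicast RPC for the anchor cache update and, with probability 0.75, one more unicast RPC.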

If FailoverRequestBalancingStrategy knew whether the next request was a read or a write, we could make it always send write requests to the last server. However, that would only work if updates and removals were rare; otherwise the last server could become overloaded.

Memory overhead

The anchor cache contains copies of all the keys and their locations, plus the overhead of the cache itself.

The overhead is lowest for off-heap storage: 21 bytes in the entry itself plus 8 bytes in the table, assuming no eviction or expiration. The location is another 20 bytes, assuming we keep the serialized owner’s address.

Note: We could reduce the location size to at most 8 bytes by encoding the location as an integer.

Addresses are interned, so an address already uses only 4 bytes with OBJECT storage and -XX:+UseCompressedOops. But the overhead of the ConcurrentHashMap-based on-heap cache is much bigger, at least 32 bytes from CHM.Node, 24 bytes from ImmortalCacheEntry, and 4 bytes in the table.
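
As a rough worked example using the numbers above: with off-heap storage each anchor entry costs about 21 + 8 + 20 = 49 bytes plus the serialized key, so one million entries with 10-byte keys take roughly 59 MB, replicated on every node. With on-heap OBJECT storage the per-entry overhead is at least 32 + 24 + 4 = 60 bytes, plus the key object and the 4-byte interned address reference.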

Note: because replicated writes are sent to the primary owner, which forwards them to all the other members, the keys cannot be de-duplicated.

State transfer

There will be no state transfer for the main cache, but the anchor cache still needs to transfer all the keys and the location information.

Assuming that the values are much bigger than the keys and the per-entry cache overhead, the anchor cache's state transfer should also be much faster than state transfer for the main cache would be.

The initial state transfer should not block a joiner from starting, because the joiner can ask an existing node for the location information.

Alternative implementations

Key generator

If the keys could be generated by Infinispan, e.g. in a server task, we could generate keys that map to a segment owned by the current writer, and we wouldn't need to keep track of key locations at all.
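
A minimal sketch of such a generator, assuming the server task can obtain the key-to-segment mapping and the set of segments owned by the current writer (both are assumptions here; the helper names are hypothetical):

import java.util.UUID;
import java.util.function.IntPredicate;
import java.util.function.ToIntFunction;

public final class AnchoredKeyGenerator {
   // Keep generating random keys until one maps to a segment owned by the current writer
   public static String generateKey(ToIntFunction<String> segmentOf, IntPredicate ownedByWriter) {
      String key;
      do {
         key = UUID.randomUUID().toString();
      } while (!ownedByWriter.test(segmentOf.applyAsInt(key)));
      return key;
   }
}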

Cluster loader

Instead of keeping track of the key locations in a replicated cache, an anchored cache could send a broadcast request to all the members.

Read hits would be slowed down a bit because of the additional get requests, but the difference would be small, since the originator only has to wait for the first non-null response. Read misses, on the other hand, would be much slower, because the originator would have to wait for responses from all the members before telling the client that the key does not exist in the cache.

Writes would be faster, because they would only need a unicast RPC to the current writer, without the replicated cache write.

The number of RPCs might be reduced by maintaining on each node a set of Bloom filters, one for each member.
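
A minimal sketch of how the per-member Bloom filters could prune the broadcast; the BloomFilter type below is a stand-in, not an existing Infinispan class:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.infinispan.remoting.transport.Address;

public final class ReadTargetSelector {
   // Stand-in for any Bloom filter implementation: false positives are possible,
   // false negatives are not, so members whose filter rejects the key can be skipped safely
   public interface BloomFilter<K> {
      boolean mightContain(K key);
   }

   public static <K> List<Address> readTargets(K key, Map<Address, BloomFilter<K>> filtersByMember) {
      List<Address> targets = new ArrayList<>();
      for (Map.Entry<Address, BloomFilter<K>> entry : filtersByMember.entrySet()) {
         if (entry.getValue().mightContain(key)) {
            targets.add(entry.getKey());
         }
      }
      return targets;
   }
}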

Single cache

Store the data and the location information in a single REPLICATED_SYNC cache. There is a single segment, and a custom ConsistentHashFactory makes the newest member the primary owner of that segment.
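
A minimal sketch of the ownership rule only, not of the full ConsistentHashFactory contract: with a single segment, the newest member (the last node in the membership list) becomes the primary owner and the remaining members act as backups.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.infinispan.remoting.transport.Address;

public final class NewestMemberOwnership {
   // Owner list for the single segment: primary owner first (the newest member),
   // then the remaining members as backups
   public static List<Address> segmentOwners(List<Address> members) {
      List<Address> owners = new ArrayList<>(members);
      Collections.reverse(owners);
      return owners;
   }
}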

The distribution/triangle interceptor would be replaced with a custom interceptor that replaces the value in backup write commands with a reference to the primary owner. For reads, it would check whether the value in the context is a reference to another node and, if so, make a remote get request to that node.

The state provider would also be replaced to send the location information to the joiners instead of the actual values.