
Index Affinity proposal

This is a memory dump of the late night discussion between Emmanuel and Sanne at the Berlin clustering meeting of 2015.

The goal of this proposal is to reduce the number of RPCs generated by indexing an entry or a set of entries: since each of these RPCs is synchronous across multiple servers, the latency cost of a single indexing operation is high. Also, because of the requirements of the index structure, these operations can’t run in parallel. This proposal will also show how to increase parallelism.

Some terms

Indexing Backend

It’s the component which receives an index changeset and actually uses an IndexWriter instance from Lucene to write changes to the index. Only one IndexWriter instance can write to a specific index at the same time: the index needs to be protected by a global (cluster wide?) lock.

Not to be confused with the process of converting the data entry into the index changeset.

Index Directory

Represents the mapping from Lucene’s IndexWriter output into entries which are written into a dedicated Infinispan Cache. It doesn’t have to be a dedicated Cache, but that’s usually desirable so that it can be tuned separately from the caches containing the application data.

The name comes from Lucene’s API org.apache.lucene.store.Directory.
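
For reference, this is roughly how such a Directory is obtained today from the infinispan-lucene-directory module. A minimal sketch: the cache names follow the usual defaults, while the index name and chunk size are illustrative values.

import org.apache.lucene.store.Directory;
import org.infinispan.Cache;
import org.infinispan.lucene.directory.DirectoryBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

// Minimal sketch: a Lucene Directory backed by Infinispan caches.
// Cache names follow the usual defaults; index name and chunk size are illustrative.
public class IndexDirectoryExample {

   public static Directory openDirectory() {
      EmbeddedCacheManager cacheManager = new DefaultCacheManager();
      Cache<?, ?> metadataCache = cacheManager.getCache("LuceneIndexesMetadata");
      Cache<?, ?> chunksCache = cacheManager.getCache("LuceneIndexesData");
      Cache<?, ?> locksCache = cacheManager.getCache("LuceneIndexesLocking");
      return DirectoryBuilder
            .newDirectoryInstance(metadataCache, chunksCache, locksCache, "myIndex")
            .chunkSize(1024 * 1024) // each Lucene segment file is split in chunks of this size
            .create();
   }
}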

Segment

There is some ambiguity around the term segment. The Infinispan core team uses it to refer to a range of keys on the hash space which is stored together "on the same segment" or "by the same owner". In the context of Lucene, one needs to keep in mind that the physical storage of the index is structured in "segment" files. These are different things, but the point of this proposal is to relate and jointly store these two things, so be prepared for some ambiguity!

Current situation

The index is global for a given entity type. For maximum efficiency, an index must have a single writer (called master). By reusing a single writer for multiple sequential operations, the distributed lock which any writer must hold doesn’t need to be re-acquired for each single operation. Also, Apache Lucene is more efficient if you can reuse the same IndexWriter rather than open a new one for each operation.

Assuming an index of n Lucene segments, the index is stored into a distributed or replicated Cache, composed of the following entries:

  • a global commit id entry header

  • a global commit id entry content

  • n Lucene segment file headers

  • n * m Lucene segment chunks (chunked as Lucene segments can be big)

When a new entry is indexed, it is necessary to:

  • create a new Lucene segment (this involves m + 1 puts)

  • update the global commit id (2 puts)

And it gets worse: approximately every n writes, all segments get deleted and re-compacted into a more efficient format (n = 10 by default: making it higher improves write speed but slows down reads). That’s called a merge operation in Lucene terminology, and it adds a significant number of RPCs.
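
To put a number on it (purely as an illustration): if a new Lucene segment is chunked into m = 4 pieces, indexing a single entry already costs (4 + 1) + 2 = 7 put operations, each a synchronous RPC towards the respective primary owner, and a merge then rewrites the surviving segments, repeating that cost for every segment it touches.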

The entries being written as part of the segments are immutable in practice: the keys will never be reused for writing, only for reading back that same data.

Deletions and Compaction

It’s worth keeping in mind that delete operations happen as well during a merge: these are less interesting to this design proposal though, as delete operations happen asynchronously from the primary process. Let’s remember though that deletes require distributed readlocks to be assigned, so there is a storm of background RPCs triggered by a merge, and when too many such operations are enqueued the delete process degenerates into a synchronous process as well. Still, the process is built on async RPCs, so these should not choke the primary Infinispan RPC handling threadpool.

A segment which is composed of a single chunk (a small enough segment) won’t need any distributed readlock.

Batched writes and Transactions

The Infinispan Query and Hibernate Search engines already compact any number of writes from the same context (batch or committed transaction) into a single changeset command. So for most of this design, we’ll be able to treat any use case with multiple keys being written as if it were a single key being written.

The notable exception to this is sharding: when a sharding strategy is enabled, the Hibernate Search engine actually splits the work queue into a queue per shard. But again, the multiple updates which are needed to flush all changes on a specific shard (index) can be thought of as a single write operation on that index.

Sharding the index

The Hibernate Search engine can apply various smart or dumb sharding strategies to split the indexing workload across multiple masters, but how those master nodes are elected is mostly a user problem, as the engine doesn’t deal with the complexities of a distributed system.

Infinispan Query builds on these extension points to provide an "automatic master" election service when the org.infinispan.query.indexmanager.InfinispanIndexManager is enabled, but it focuses on safety rather than optimisation and will always pick the cluster coordinator. The reason it was designed this way is that Infinispan doesn’t provide a reliable centralized lock service (one is in the works, but this paragraph is meant to explain the current situation).

This implies that sharding the index could have various interesting effects, as described in the Hibernate Search guide, but no matter how the work is split, all the load from the various shards is performed on the same cluster node.

Possible tuning options (today)

  • Make n larger to reduce the frequency of merges. Play with the merge_factor and max_merge_docs options from the Lucene tuning properties

  • Make m smaller to reduce the number of chunks written for each segment

  • Make m smaller so that a larger ratio of segments can benefit from the "single chunk" status

  • Align the chunk size, the JGroups fragmentation settings and the network packet sizes to avoid fragmentation on top of our chunking

  • Plug in your own IndexManager implementation which could use a reliable (or external) lock service; a configuration sketch follows below
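
As a rough sketch of what those knobs look like when set programmatically through the Infinispan Query indexing properties: the property names mirror the Lucene tuning options mentioned above but are illustrative and version-dependent, and the custom IndexManager class name is hypothetical.

import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.cache.Index;

// Minimal sketch of today's tuning knobs. Property names are illustrative
// and version-dependent; my.custom.AffinityAwareIndexManager is hypothetical.
public class IndexTuningExample {

   public static Configuration tunedConfiguration() {
      ConfigurationBuilder builder = new ConfigurationBuilder();
      builder.indexing()
            .index(Index.ALL)
            // make 'n' larger: merges happen less frequently
            .addProperty("default.indexwriter.merge_factor", "30")
            .addProperty("default.indexwriter.max_merge_docs", "20000")
            // a larger chunk size means a smaller 'm' (fewer chunks per Lucene segment file)
            .addProperty("default.chunk_size", String.valueOf(16 * 1024 * 1024))
            // plug in a custom IndexManager which could use a reliable (or external) lock service
            .addProperty("default.indexmanager", "my.custom.AffinityAwareIndexManager");
      return builder.build();
   }
}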

Summary of current indexing process

For each indexed entry, several writes happen on the index Directory, thus generating many put operations on the backing Cache. As with any write on Infinispan, each of these put operations gets delegated to the primary owner of the key being written, according to the hashing function.

Since there are many keys being written, and these have no form of affinity, it’s very likely that many nodes will be contacted to coordinate these writes, and each and every one of these puts will also trigger a set of synchronous RPCs to replicate the change from the primary owner to each of its backup owners.

Note
Storing the index in DIST vs REPL

Note that the best performing architecture for running queries implies using a REPLICATED Cache to store the indexes, so in this context writing to the backup owners means writing to all other nodes in the cluster with synchronous RPCs.

Proposition

The key idea:

  • for a given Infinispan data segment S, all its keys will be indexed in a Lucene index shard named I~s~

  • all entries composing index I~s~ are colocated into the same physical node owning data segment S

In addition we aim at storing each index in a REPL Cache to maximise performance at query time, but this is not a strong requirement, so one could use DIST if the index size is too large to be fully replicated.
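
A minimal sketch of the intended key-to-shard mapping, assuming access to the cache’s ConsistentHash (accessor names follow the Infinispan distribution API of this era and may differ across versions; the "index." shard naming scheme is just an assumption for illustration).

import org.infinispan.AdvancedCache;
import org.infinispan.distribution.ch.ConsistentHash;

// Minimal sketch: derive the index shard name I_s from the Infinispan data
// segment S owning a key. The "index." + segment naming scheme is illustrative.
public final class SegmentShardResolver {

   private final AdvancedCache<?, ?> dataCache;

   public SegmentShardResolver(AdvancedCache<?, ?> dataCache) {
      this.dataCache = dataCache;
   }

   /** Returns the shard identifier I_s of the data segment S which owns this key. */
   public String shardFor(Object key) {
      ConsistentHash ch = dataCache.getDistributionManager().getWriteConsistentHash();
      int segment = ch.getSegment(key);
      return "index." + segment;
   }
}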

Multi master

The first consequence of this design is that, for each data segment S, there will be an index manager I~s~.

That implies there will be a separate index writer for each I~s~, and since Infinispan supposedly spreads out each of these shards across all physical compute resources fairly (equally or according to configured weight), the consequence is that this spreads out indexing load fairly as well.

  • ✓ Parallel index writing

10,000 shards

This is the main drawback of this proposal, and some of the following paragraphs aim at reducing/compensating for this limitation.

The problem is that when applying sharding to a Lucene index, it’s not a good idea to have too many shards when it comes to query operations. We have experience with dozens of shards, not thousands.

Warning
Verify impact of many shards on Query performance at scale

This needs to be tested, but before someone jumps on it: the current design of Hibernate Search implies the number of indexes isn’t truly scalable, so that should be fixed first to give this a fair trial:

  • ❏ Don’t initialize writing backends eagerly

  • ❏ See if we should share Threads/Executors across backends (if still needed after previous point)

Caution
The number of shards affects not just the functional performance of queries, but also the quality of results of full-text queries.
Note
Number of segments by default in Infinispan

The current default number of segments in Infinispan is in the two-digit range. That suggests this problem might not be too worrying, but we want to make sure it works even for (significantly) higher numbers of segments. Also, the current default seems questionable.

What we don’t aim at

We’re not aiming at strong guarantees of affinity. There is an optimal case in which each write is resolved fully locally, and there is a suboptimal case in which we need to resolve the writes by delegating to a remote master node via RPCs, which then generates multiple RPCs to perform the storage (essentially as in the current design).

Both of these operating modes result in a correctly stored index, so the system can freely transition from one mode to the other, effectively ignoring race conditions with cluster state transfer and the assignment of segments to physical nodes. The goal is of course to maximise the likelihood of each indexing operation being performed in the optimal mode, especially when in steady state.

Backend behaviour in Optimal mode

In this scenario, the indexing backend I~s~ is located on the same node which generates the index events by intercepting changes on S. Obviously this allows skipping serialization of the changeset, and no RPCs are needed at all.

To make sure that the indexing changes are generated on the same node on which indexing backend I~s~ is located, it would be enough to have them generated on the primary owner for S, let’s call it O~s~. In the Infinispan model, this same primary owner O~s~ happens to be the perfect place to intercept the changes, as it also is the node managing locking and consistency.

  • ❏ Verify that indexing operations are created on the O~s~ for each S (it might currently happen on the originator?)

Caution
Dan mentioned that in some cases two nodes might be writing the same entry on the same key (so not violating any data consistency requirement) while each of these nodes thinks that the other one is the primary owner. We’d be missing indexing events in such a case.
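
As a sketch of how a backend could detect, per segment, whether it is currently running in optimal mode (the ConsistentHash accessors shown are assumptions based on the Infinispan distribution API; this is not the actual election logic).

import org.infinispan.Cache;
import org.infinispan.distribution.ch.ConsistentHash;
import org.infinispan.remoting.transport.Address;

// Illustration only: the local node can index data segment S in optimal mode
// while it is the primary owner O_s of that segment.
public final class OptimalModeCheck {

   public static boolean isOptimalFor(Cache<?, ?> cache, int segment) {
      ConsistentHash ch = cache.getAdvancedCache().getDistributionManager().getWriteConsistentHash();
      Address primaryOwner = ch.locatePrimaryOwnerForSegment(segment);
      Address self = cache.getCacheManager().getAddress();
      return self.equals(primaryOwner);
   }
}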

Backend behaviour in Suboptimal mode

Essentially this is the current implementation.

In this case, the work to be sent to the backend could originate on any node. But only one node is the master indexer for any given workset, so the likelihood of the "lucky" event that the indexing changeset only needs to be forwarded to the local node is inversely proportional to the size of the cluster.

Index Directory in Optimal mode

In this mode the indexing backend I~s~ will configure its Directory instance to generate tagged keys. These tags will force the hashing function to store the entries in S.
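
To make the tagging idea concrete, here is a rough illustration of what a tagged index key and its segment mapping might look like; neither class exists today, and the exact extension point for the key-to-segment mapping (for example a pluggable KeyPartitioner, if the Infinispan version offers one) is left open.

// Illustration only: an index chunk key carrying the data segment S it must
// be co-located with. A pluggable key-to-segment mapping would then return
// the embedded segment instead of hashing the key.
public final class AffinityTaggedChunkKey {

   private final String indexName;
   private final String fileName;
   private final int chunkId;
   private final int affinitySegment; // the data segment S to co-locate with

   public AffinityTaggedChunkKey(String indexName, String fileName, int chunkId, int affinitySegment) {
      this.indexName = indexName;
      this.fileName = fileName;
      this.chunkId = chunkId;
      this.affinitySegment = affinitySegment;
   }

   public int getAffinitySegment() {
      return affinitySegment;
   }

   // equals(), hashCode() and an Externalizer are omitted for brevity
}

// Hypothetical key-to-segment mapping honouring the tag; untagged keys fall
// back to a placeholder default (the real default would be the configured hash).
final class AffinityKeyToSegmentMapping {

   public int getSegment(Object key, int numSegments) {
      if (key instanceof AffinityTaggedChunkKey) {
         return ((AffinityTaggedChunkKey) key).getAffinitySegment();
      }
      return (key.hashCode() & Integer.MAX_VALUE) % numSegments;
   }
}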

When operating in optimal mode, the node running the indexing backend I~s~ is also O~s~: therefore all of the many writes which the Directory generates are local.

There still is a source of RPCs: the primary owner needs to replicate each of its writes to the backup owners (which means everyone else when using REPL). But as we’ll see in a separate design page (TBD), we can actually rework some details to get the same reliability guarantees from ASYNC_REPL as we would get from REPL.

If that intuition turns out correct (to be explained and validated), that implies we reduce the many blocking indexing RPCs down to zero. If not, it means this proposal just halves the number of RPCs. Let’s continue this design document assuming we’ll need REPL to be synchronous; there still is a very strong benefit: all lock acquisitions on entries are local too.

Note
What when the locality assumptions fail?

The locality assumptions we’re relying on are inherently fragile, as they could cease to hold at any point in time and there is no locking strategy to prevent that. The point is that we really don’t care: if such a race happens, the operation will degenerate into the suboptimal execution. For "in flight" writes, we rely on Infinispan core to handle it via the NBST design.

Delete operations in optimal Directory

Remember the swarm of async RPCs generated by each delete statement? They all benefit from locality, especially the readlocks.

Index Directory in Suboptimal mode

See the current design: each index changeset generates m + 3 put operations, encompassing many different nodes.

Combine Suboptimal and Optimal modes into a "pick your pain" tunable space

So let’s be realistic/pessimistic and assume that we can’t efficiently run queries on many shards, when the number of shards needs to rise to match a high number of segments. Let’s also assume that the optimal number of segments (virtual nodes) used by Infinispan is probably higher than the current default of 60.

In such an evil world, one would need to choose a low number of segments to optimise query performance, but a large number of segments to optimise for proper data distribution across nodes, or even just to be able to assign a segment to each server when there are many.

This conflict of interests could be addressed by introducing a scale factor: an integer N which could be set by the user so that

segments = `N` * index shards

Let’s assume you pick N = 2. That would mean that 50% of your writes will be indexed in optimal mode, while the other 50% would be indexed in suboptimal mode.

It’s easy to see that the benefit degrades quickly as N grows beyond 2, since only 1 out of N writes is handled in optimal mode; still, even just this option doubles write speed compared to the alternative (assuming changing the number of shards is not an option).

This is not ideal, but the good news is that there’s not much to implement to get this feature since all of the above described components need to be able to gracefully degrade into suboptimal mode to handle cluster topology changes.
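
A tiny sketch of the ratio, assuming (purely for illustration) that every N contiguous data segments map onto one index shard; the 1/N fraction of optimal-mode writes follows directly.

// Illustration of the scale factor N: segments = N * index shards.
// Grouping N contiguous segments per shard is just one possible mapping.
public final class ScaleFactorMapping {

   /** Which index shard a data segment maps to, when every N contiguous segments share a shard. */
   public static int shardForSegment(int segment, int n) {
      return segment / n;
   }

   /** Each shard's index can be co-located with only one of its N segments: 1 out of N writes is optimal. */
   public static double optimalWriteFraction(int n) {
      return 1.0d / n;
   }
}

With N = 2 this gives the 50% figure above; with N = 1 every write is optimal, but the shard count then equals the segment count.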

Important
The benefits of this approach would be considerably higher if we could revisit the hashing function in such a way to maximise co-location for contiguous segments.
  • ❏ Explore alternative hash distribution functions to maximise co-location for contiguous segments

How we run queries

Hibernate Search already supports sharding and is able to perform a Query on multiple indexes, presenting them as a single virtual index. So what’s missing is to plug in a sharding implementation which maps an index changeset for S to I~s~.

To implement this mapping, the sharding implementation will need a reference to the Infinispan hash function.

Note
Question

Technically the Infinispan hash function will have just processed the same key. Could the key → S mapping not be saved in the context and reused by the sharding policy?
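
A rough sketch of that missing piece, assuming Hibernate Search’s dynamic sharding SPI (ShardIdentifierProvider) as the integration point; the class below reuses the segment resolution shown earlier and deliberately simplifies the method signature rather than reproducing the real SPI.

import java.io.Serializable;

// Illustration only: route the index changeset of an entry to the shard I_s
// of the data segment S owning its cache key. In a real integration this
// would be wired through Hibernate Search's sharding SPI instead of being
// called directly; SegmentShardResolver is the sketch shown earlier.
public final class SegmentAffinityShardingPolicy {

   private final SegmentShardResolver resolver;

   public SegmentAffinityShardingPolicy(SegmentShardResolver resolver) {
      this.resolver = resolver;
   }

   public String shardIdentifierFor(Serializable cacheKey) {
      return resolver.shardFor(cacheKey);
   }
}

Wiring this in would then presumably be a matter of pointing the index’s sharding_strategy property at such a provider, following Hibernate Search’s dynamic sharding configuration.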

Replicating the index on all nodes makes each node able to execute a query with local data. Not all of the index needs to remain in memory.

Lucene is very efficient at keeping a small portion of the index in memory and relying on disk access for the rest. To take advantage of this, we’d need the JGroups NIO.2 patches to efficiently build a hybrid index which is based on Infinispan replication but stored in some external memory-mapped space. Details to be discussed in a separate context, but the point relevant to this proposal is that we can essentially store an unbounded-size index in REPL mode.

Compared to the ClusteredQuery design in Infinispan Query, this approach doesn’t need to serialize Lucene Query instances, nor serialize intermediate state or run complex distributed sorting protocols. Also, a ClusteredQuery needs to be broadcast to all nodes (which all get to consume some resources), while in this model the Query can be performed locally.

Other benefits / open questions

Master election

In this model, there is no master election problem. There’s a straight 1:1 function: for any segment S, its primary owner O~s~ also hosts I~s~.

Caution
How does this play out with transient multi-owner states? Or no-owner states?

Conclusion

We should replicate to the replicas asynchronously; this will speed things up significantly with only a small loss of guarantees. Implementing this proposal offers a ratio cursor that can be moved by the user. The degenerate case of ratio = number of Infinispan segments means we have a single index. This is the current implementation: low locality rate, no parallelism on writers, great query performance. At the other end of the spectrum, ratio = 1 means best-case writes but likely slower queries.

The ratio implementation is roughly done. What is missing is the ability to pass to Hibernate Search the sharding strategy implementing key-to-index affinity.