This repository has been archived by the owner on May 27, 2020. It is now read-only.

Cassandra secondary Lucene index rebuild is very slow #416

Open
Dante2204 opened this issue Nov 28, 2019 · 0 comments

Hi all,

I am using
Apache-cassandra version: 3.11.2
cassandra-lucene-index-plugin jar version: 3.11.2

In my Cassandra cluster, one column family has a Lucene-based secondary index:
CREATE CUSTOM INDEX secondary_idx ON keyspace.tablename () USING 'com.stratio.cassandra.lucene.Index' WITH OPTIONS = {'schema' : '{ fields: { id: { type : "uuid"}, valid_from: { type : "date", pattern : "yyyy-MM-dd"}, valid_till : { type : "date", pattern : "yyyy-MM-dd"}, name : { type: "string"}} }'};
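
Because the schema is a JSON document nested inside a CQL string, the quoting is easy to get wrong. As a minimal sketch (not part of the plugin), the statement can be assembled in Python so that the schema is serialized by `json.dumps`. The helper name `build_index_statement` and its `extra_options` parameter are hypothetical; the code only builds a string and does not connect to a cluster.

```python
import json


def build_index_statement(keyspace, table, index_name, fields, extra_options=None):
    """Return a CREATE CUSTOM INDEX statement as text (hypothetical helper)."""
    # The schema is a JSON document stored inside a CQL string value.
    options = {"schema": json.dumps({"fields": fields})}
    # Further index options can be merged in; every value must be a string.
    options.update(extra_options or {})
    # Render the CQL map literal, doubling any single quotes inside values.
    rendered = ",\n  ".join(
        "'{}': '{}'".format(key, value.replace("'", "''"))
        for key, value in options.items()
    )
    # CQL map literals are written with curly braces.
    return (
        "CREATE CUSTOM INDEX {} ON {}.{} ()\n"
        "USING 'com.stratio.cassandra.lucene.Index'\n"
        "WITH OPTIONS = {{\n  {}\n}};"
    ).format(index_name, keyspace, table, rendered)


# Field definitions copied from the index described in this post.
fields = {
    "id": {"type": "uuid"},
    "valid_from": {"type": "date", "pattern": "yyyy-MM-dd"},
    "valid_till": {"type": "date", "pattern": "yyyy-MM-dd"},
    "name": {"type": "string"},
}

statement = build_index_statement("keyspace", "tablename", "secondary_idx", fields)
print(statement)
```

Additional options would be passed through `extra_options` as a dictionary of string keys and string values.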

I have approximately 100 million rows in this column family (its folder size on disk is ~180 GB per node, and nodetool status reports a load of ~130 GB per node). This data is expected to grow considerably in the future.

The issues I am having are:

  1. Creating a new secondary index on this loaded cluster takes more than 7 hours. I tried adding custom options when creating the index, such as 'indexing_queues_size': '400', 'ram_buffer_mb': '513', 'partitioner': '{type : "token", partitions: 4}', and 'index_threading': '8', but they made no considerable difference. Is there anything else I can tune to reduce the time needed to build a new Lucene secondary index on a loaded cluster?

  2. For backup and restore, I back up the /lucene folder in my column family and, when restoring, copy it back as is from a different location. To restore, I start Cassandra on the host and run: nodetool repair -pr -seq -local -- keyspace. If I then rebuild the secondary index with nodetool rebuild_index -- keyspace tablename secondary_idx (so that it is synced with any new data), it takes a long time (~12 hours, with significant variation) every time I trigger it, even when no new data has been ingested into the cluster. I also tried SASI as the secondary index. It builds much faster than the Lucene index, but rebuild_index takes the same amount of time on that index as well. Is there a faster way to run rebuild_index on a Lucene index, given that I have already restored the Lucene files?
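
To put the two durations in perspective, here is a rough back-of-the-envelope estimate of the implied indexing rate. It assumes the ~100 million rows are all processed in a single run and uses 7 hours and 12 hours as representative durations. The node count and replication factor are not stated in this post, so the figures are aggregate rates, not per-node rates.

```python
ROWS = 100_000_000   # approximate row count of the column family
CREATE_HOURS = 7     # reported lower bound for creating the index
REBUILD_HOURS = 12   # reported typical duration of rebuild_index


def rows_per_second(rows, hours):
    """Average number of rows processed per second over the whole run."""
    return rows / (hours * 3600)


create_rate = rows_per_second(ROWS, CREATE_HOURS)
rebuild_rate = rows_per_second(ROWS, REBUILD_HOURS)

print(f"create:  ~{create_rate:,.0f} rows/s")   # create:  ~3,968 rows/s
print(f"rebuild: ~{rebuild_rate:,.0f} rows/s")  # rebuild: ~2,315 rows/s
```

Under these assumptions, the rebuild runs at roughly 60% of the rate of the initial build, even though no new data has been ingested.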

Thanks in advance
