Stratio's Cassandra Lucene Index

Overview
Indexing
Searching
Geographical elements
Complex data types
Query builder
Spark and Hadoop
JMX interface
Performance tips

Overview

Stratio’s Cassandra Lucene Index, derived from Stratio Cassandra, is a plugin for Apache Cassandra that extends its index functionality to provide near-real-time search similar to Elasticsearch or Solr, including full-text search capabilities and free multivariable, geospatial and bitemporal search. This is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio’s Cassandra indexes are one of the core modules on which Stratio’s BigData platform is based.

Index relevance searches allow you to retrieve the n most relevant results satisfying a search. The coordinator node sends the search to each node in the cluster, each node returns its n best results, and the coordinator then combines these partial results and gives you the n best of them, avoiding a full scan. You can also base the sorting on a combination of fields.
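The merge step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's code: the function name merge_top_n, the (score, row_id) tuples and the sample scores are all hypothetical.

```python
import heapq

def merge_top_n(partial_results, n):
    """Combine per-node top-n lists of (score, row_id) into a global top n."""
    # Flatten the partial results and keep the n highest-scoring hits.
    all_hits = (hit for node_hits in partial_results for hit in node_hits)
    return heapq.nlargest(n, all_hits, key=lambda hit: hit[0])

# Hypothetical hits returned by three nodes, each already limited to n = 2.
node_a = [(0.9, "row1"), (0.4, "row7")]
node_b = [(0.8, "row3"), (0.7, "row5")]
node_c = [(0.2, "row9")]

top = merge_top_n([node_a, node_b, node_c], n=2)
print(top)
```

Because every node returns at most n hits, the coordinator never has to look at more than n rows per node, which is what avoids the full scan.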

Any cell in the tables can be indexed, including those in the primary key as well as collections. Wide rows are also supported. You can scan token/key ranges, apply additional CQL3 clauses and page on the filtered results.

Index filtered searches are a powerful help when analyzing the data stored in Cassandra with MapReduce frameworks such as Apache Hadoop or, even better, Apache Spark. Adding Lucene filters to the job's input can dramatically reduce the amount of data to be processed, avoiding a full scan.

This project is not intended to replace Apache Cassandra denormalized tables, inverted indexes, and/or secondary indexes. It is just a tool to perform certain kinds of queries that are really hard to address using Apache Cassandra's out-of-the-box features, filling the gap between real-time and analytics.

Features

Stratio’s Cassandra Lucene Index and its integration with Lucene search technology provide:

Full text search (language-aware analysis, wildcard, fuzzy, regexp)
Boolean search (and, or, not)
Sorting by relevance, column value, and distance
Geospatial indexing (points, lines, polygons and their multiparts)
Geospatial transformations (bounding box, buffer, centroid, convex hull, union, difference, intersection)
Geospatial operations (intersects, contains, is within)
Bitemporal search (valid and transaction time durations)
CQL complex types (list, set, map, tuple and UDT)
CQL user defined functions (UDF)
CQL paging, even with sorted searches
Columns with TTL
Third-party CQL-based drivers compatibility
Spark and Hadoop compatibility

Not yet supported:

Thrift API
Legacy compact storage option
Indexing counter columns
Indexing static columns
Partitioners other than Murmur3
Per partition limit

Architecture

Indexing is achieved through a Lucene based implementation of Apache Cassandra secondary indexes. Cassandra's secondary indexes are local indexes, meaning that each node of the cluster indexes its own data. As usual in Cassandra, each node can act as search coordinator. The coordinator node sends the searches to all the involved nodes, and then it post-processes the returned rows to return the required ones. This post-processing is particularly important in sorted searches.

Regarding the Cassandra-Lucene mapping, each node has a single Lucene index per indexed table, and each logical CQL row is mapped to a Lucene document. These documents are composed of the user-defined fields, the primary key and the partitioner's token. Indexing is done synchronously at the storage layer, so each row upsert implies a document upsert. This adds an extra cost to write operations, which is the price of the provided search features. Since indexing is done below the distribution layer, replication has already been achieved when the rows reach the index.

Requirements

Cassandra (identified by the first three numbers of the plugin version)
Java >= 1.8 (OpenJDK and Sun have been tested)
Maven >= 3.0

Installation

Stratio’s Cassandra Lucene Index is distributed as a plugin for Apache Cassandra. Thus, you just need to build a JAR containing the plugin and add it to Cassandra’s classpath:

Clone the project: git clone http://github.com/Stratio/cassandra-lucene-index
Change to the downloaded directory: cd cassandra-lucene-index
Check out a plugin version suitable for your Apache Cassandra version: git checkout A.B.C.X
Build the plugin with Maven: mvn clean package
Copy the generated JAR to the lib folder of your compatible Cassandra installation: cp plugin/target/cassandra-lucene-index-plugin-*.jar <CASSANDRA_HOME>/lib/
Start/restart Cassandra as usual.

Specific Cassandra Lucene index versions target specific Apache Cassandra versions. So, cassandra-lucene-index A.B.C.X is intended to be used with Apache Cassandra A.B.C, e.g. cassandra-lucene-index:3.0.7.1 for cassandra:3.0.7. Please note that production-ready releases are version tags (e.g. 3.0.6.3); don't use the branch-X or master branches in production.
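As a quick illustration of the versioning rule above, the following Python sketch derives the targeted Cassandra version from a plugin version. The helper name cassandra_version_for is hypothetical and is not part of the project.

```python
def cassandra_version_for(plugin_version):
    """Return the Cassandra version A.B.C targeted by plugin version A.B.C.X."""
    parts = plugin_version.split(".")
    # A plugin version must have exactly four numeric components.
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected A.B.C.X, got {plugin_version!r}")
    # The first three components identify the Cassandra release.
    return ".".join(parts[:3])

print(cassandra_version_for("3.0.7.1"))
```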

Alternatively, patching can also be done with this Maven profile, specifying the path of your Cassandra installation. This task also deletes previous versions of the plugin's JAR in the CASSANDRA_HOME/lib/ directory:

mvn clean package -Ppatch -Dcassandra_home=<CASSANDRA_HOME>

If you don’t have an installed version of Cassandra, there is also an alternative profile to let Maven download and patch the proper version of Apache Cassandra:

mvn clean package -Pdownload_and_patch -Dcassandra_home=<CASSANDRA_HOME>

Now you can run Cassandra and do some tests using the Cassandra Query Language:

<CASSANDRA_HOME>/bin/cassandra -f
<CASSANDRA_HOME>/bin/cqlsh

Lucene’s index files will be stored in the same directories as Cassandra’s. The default data directory is /var/lib/cassandra/data, and each index is placed next to the SSTables of its indexed column family.

Remember that if you use geo shape search you need to include the JTS jar.

For more details about Apache Cassandra please see its documentation.

Upgrade

If you want to upgrade your Cassandra cluster to a newer version, you must follow the official DataStax upgrade instructions.

The rule for Lucene secondary indexes is to drop them while running the older version, upgrade Cassandra and the Lucene index JAR, and create them again while running the newer version.

If you have a huge amount of data in your cluster, this could be an expensive task. We have tested it, and the following compatibility matrix states between which versions it is not necessary to delete the index:

From\ To	3.0.3.0	3.0.3.1	3.0.4.0	3.0.4.1	3.0.5.0	3.0.5.1	3.0.5.2	3.0.6.0	3.0.6.1	3.0.6.2	3.0.7.0	3.0.7.1	3.0.7.2	3.0.8.0	3.0.8.1	3.0.8.2	3.0.8.3	3.0.9.0	3.0.9.1	3.0.9.2	3.0.10.0	3.0.10.1	3.0.10.2	3.0.10.3	3.0.10.4	3.0.11.0	3.0.12.0	3.0.13.0	3.0.14.0
2.x	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.3.0	--	YES	YES	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.3.1	--	--	YES	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.4.0	--	--	--	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.4.1	--	--	--	--	YES	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.5.0	--	--	--	--	--	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.5.1	--	--	--	--	--	--	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.5.2	--	--	--	--	--	--	--	YES	YES	YES	YES	YES	YES	YES			NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.6.0	--	--	--	--	--	--	--	--	YES	YES	YES	YES	YES	YES			NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.6.1	--	--	--	--	--	--	--	--	--	YES	YES	YES	YES	YES			NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.6.2	--	--	--	--	--	--	--	--	--	--	YES	YES	YES	YES			NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.7.0	--	--	--	--	--	--	--	--	--	--	--	YES	YES	YES			NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.7.1	--	--	--	--	--	--	--	--	--	--	--	--	YES	YES			NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.7.2	--	--	--	--	--	--	--	--	--	--	--	--	--	YES			NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.8.0	--	--	--	--	--	--	--	--	--	--	--	--	--	--			NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.8.1	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.8.2	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO	NO
3.0.8.3	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	YES	YES	YES	YES	NO	NO	NO	NO	NO	NO	NO
3.0.9.0	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	YES	YES	YES	NO	NO	NO	NO	NO	NO	NO
3.0.9.1	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	YES	YES	NO	NO	NO	NO	NO	NO	NO
3.0.9.2	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	YES	NO	NO	NO	NO	NO	NO	NO
3.0.10.0	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	NO	NO	NO	NO	NO	NO	NO
3.0.10.1	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	NO	NO	NO	NO	NO	NO	NO
3.0.10.2	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	YES	YES	YES	YES	NO
3.0.10.3	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	YES	YES	YES	NO
3.0.10.4	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	YES	YES	NO
3.0.11.0	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	YES	NO
3.0.12.0	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	YES	NO
3.0.13.0	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	--	NO

(1): Compatible only if you are not using geospatial mappers.

(2): Compatible only if you are not using snowball analyzers.

Alternative syntaxes

There are two alternative syntaxes for managing indexes. Prior to Cassandra 3.0, indexes had to be linked to a dummy column due to CQL syntax limitations:

CREATE TABLE test(pk int PRIMARY KEY, rc text);
ALTER TABLE test ADD lucene text; -- Dummy column

CREATE CUSTOM INDEX idx ON test(lucene) -- Index is linked to the dummy column
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {'schema': '{fields: {rc: {type: "text"}}}'};

This column wasn't intended to store anything; it was just a trick to embed Lucene syntax into CQL syntax, so custom search predicates could be directed to this dummy column:

SELECT * FROM test WHERE lucene = '{...}';

As a collateral benefit, this column was used to return the score assigned by the Lucene query to each of the rows.

However, Cassandra 3.0 introduced a secondary index API redesign including explicit syntactical support for custom per-row indexes using their own query language. This new syntax didn't require the dummy column anymore:

CREATE TABLE test(pk int PRIMARY KEY, rc text);

CREATE CUSTOM INDEX idx ON test() -- Index is directly linked to the table, without dummy column
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {'schema': '{fields: {rc: {type: "text"}}}'};

Instead, we can address custom search expressions directly to the index using the new 'expr' operator:

SELECT * FROM test WHERE expr(idx, '{...}');

As you can see, this new syntax is far clearer than the previous one. However, the old syntax is still supported for compatibility reasons, given that several client applications do not support the new syntax yet. The most notable case is DataStax's connector for Apache Spark, which doesn't allow 'expr' queries and fails to manage tables with new-style indexes even if the Spark operation doesn't use the index at all. So, unfortunately, you must continue using the old dummy column approach if you are going to use the Spark connector or any other incompatible software.

Another possible reason for using the old syntax is that it uses the fake column to show the scores assigned by Lucene's scoring formula to each of the matched rows. This score is used internally for sorting and selecting the matched rows according to some user-defined search criteria. Although it is mostly intended for internal use, showing this value could be useful in some specific cases.

Last but not least, it is important to note that you can address searches with the new syntax to indexes created with the old fake column approach:

CREATE TABLE test(pk int PRIMARY KEY, rc text);
ALTER TABLE test ADD lucene text; -- Dummy column

CREATE CUSTOM INDEX idx ON test(lucene) -- Index is linked to the dummy column
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {'schema': '{fields: {rc: {type: "text"}}}'};

SELECT * FROM test WHERE expr(idx,'{...}');

This offers a good balance between the advantages of both syntaxes.

Cassandra only allows one per-row index per table, whereas there is no limit on the number of per-column indexes that a table can have. So, an additional benefit of creating indexes over dummy columns is that you can have multiple Lucene indexes per table, since they are considered per-column indexes.

All the examples in this document use the new syntax, but all of them can be written in the old way.

Example

We will create the following table to store tweets:

CREATE KEYSPACE demo
WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE demo;
CREATE TABLE tweets (
   id INT PRIMARY KEY,
   user TEXT,
   body TEXT,
   time TIMESTAMP,
   latitude FLOAT,
   longitude FLOAT
);

Now you can create a custom Lucene index on it with the following statement:

CREATE CUSTOM INDEX tweets_index ON tweets ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         id: {type: "integer"},
         user: {type: "string"},
         body: {type: "text", analyzer: "english"},
         time: {type: "date", pattern: "yyyy/MM/dd"},
         place: {type: "geo_point", latitude: "latitude", longitude: "longitude"}
      }
   }'
};

This will index all the columns in the table with the specified types, and it will be refreshed once per second. Alternatively, you can explicitly refresh all the index shards with an empty search with consistency ALL:

CONSISTENCY ALL
SELECT * FROM tweets WHERE expr(tweets_index, '{refresh:true}');
CONSISTENCY QUORUM

Now, to search for tweets within a certain date range:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"}
}');

The same search can be performed forcing an explicit refresh of the involved index shards:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
   refresh: true
}') limit 100;

Now, to search for the 100 most relevant tweets where the body field contains the phrase “big data gives organizations” within the aforementioned date range:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1}
}') LIMIT 100;

To refine the search to get only the tweets written by users whose names start with "a":

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: [
      {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
      {type: "prefix", field: "user", value: "a"}
   ],
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1}
}') LIMIT 100;

To get the 100 most recent filtered results you can use the sort option:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: [
      {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
      {type: "prefix", field: "user", value: "a"}
   ],
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
   sort: {field: "time", reverse: true}
}') limit 100;

The previous search can be restricted to tweets created close to a geographical position:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: [
      {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
      {type: "prefix", field: "user", value: "a"},
      {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "1km"}
   ],
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
   sort: {field: "time", reverse: true}
}') limit 100;

It is also possible to sort the results by distance to a geographical position:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: [
      {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
      {type: "prefix", field: "user", value: "a"},
      {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "1km"}
   ],
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
   sort: [
      {field: "time", reverse: true},
      {field: "place", type: "geo_distance", latitude: 40.3930, longitude: -3.7328}
   ]
}') limit 100;

Last but not least, you can route any search to a certain token range or partition, in such a way that only a subset of the cluster nodes will be hit, saving precious resources:

SELECT * FROM tweets WHERE expr(tweets_index, '{
   filter: [
      {type: "range", field: "time", lower: "2014/04/25", upper: "2014/05/01"},
      {type: "prefix", field: "user", value: "a"},
      {type: "geo_distance", field: "place", latitude: 40.3930, longitude: -3.7328, max_distance: "1km"}
   ],
   query: {type: "phrase", field: "body", value: "big data gives organizations", slop: 1},
   sort: [
      {field: "time", reverse: true},
      {field: "place", type: "geo_distance", latitude: 40.3930, longitude: -3.7328}
   ]
}') AND TOKEN(id) >= TOKEN(0) AND TOKEN(id) < TOKEN(10000000) limit 100;
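When searches are generated by client code, it can be convenient to build the search expression from a data structure instead of concatenating strings by hand. The Python sketch below shows one possible approach. It assumes that the index accepts standard JSON with quoted keys in addition to the relaxed syntax used in the examples above, and the helper lucene_select is hypothetical, not part of any driver.

```python
import json

def lucene_select(table, index, search):
    """Build a CQL SELECT with an expr() clause from a Python dict."""
    # Assumption: strict JSON (quoted keys) is accepted by the index.
    expression = json.dumps(search)
    # Single quotes inside a CQL string literal are escaped by doubling them.
    escaped = expression.replace("'", "''")
    return f"SELECT * FROM {table} WHERE expr({index}, '{escaped}')"

search = {
    "filter": {"type": "range", "field": "time",
               "lower": "2014/04/25", "upper": "2014/05/01"},
    "sort": {"field": "time", "reverse": True},
}
statement = lucene_select("tweets", "tweets_index", search) + " LIMIT 100;"
print(statement)
```

Note that the table and index names are inserted verbatim, so they should come from trusted code rather than from user input.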

Indexing

Lucene indexes are an extension of the Cassandra secondary indexes. As such, they are created through the CQL CREATE CUSTOM INDEX statement, specifying the fully qualified class name and a list of configuration options that are described in this section.

Syntax:

CREATE CUSTOM INDEX (IF NOT EXISTS)? <index_name>
                                  ON <table_name> ()
                               USING 'com.stratio.cassandra.lucene.Index'
                        WITH OPTIONS = <options>

where <options> is a JSON object:

<options>:= {
   'schema': '<schema_definition>'
   (, 'refresh_seconds': '<int_value>')?
   (, 'ram_buffer_mb': '<int_value>')?
   (, 'max_merge_mb': '<int_value>')?
   (, 'max_cached_mb': '<int_value>')?
   (, 'indexing_threads': '<int_value>')?
   (, 'indexing_queues_size': '<int_value>')?
   (, 'directory_path': '<string_value>')?
   (, 'excluded_data_centers': '<string_value>')?
   (, 'partitioner': '<partitioner_definition>')?
   (, 'sparse': '<boolean_value>')?
};

All options take a value enclosed in single quotes:

refresh_seconds: number of seconds before auto-refreshing the index reader. It is the max time taken for writes to be searchable without forcing an index refresh. Defaults to '60'.
ram_buffer_mb: size of the write buffer. Its content will be committed to disk when full. Defaults to '64'.
max_merge_mb: defaults to '5'.
max_cached_mb: defaults to '30'.
indexing_threads: number of asynchronous indexing threads. '0' means synchronous indexing. Defaults to the number of processors available to the JVM.
indexing_queues_size: max number of queued documents per asynchronous indexing thread. Defaults to '50'.
directory_path: the path of the directory where the Lucene index will be stored.
excluded_data_centers: the comma-separated list of the data centers to be excluded. The index will be created on these data centers, but all the write operations will be silently ignored.
partitioner: the optional index partitioner. Index partitioning is useful to speed up some searches to the detriment of others, depending on the implementation. It is also useful to overcome Lucene's hard limit of 2147483519 documents per index.
sparse: if true, the update to the index is omitted unless the CQL statement includes a column that could affect the index. By default it is false, and any insert or update will trigger an index modification, even if the CQL statement does not contain any column relevant to the index. The cost of this optimization is an extra comparison performed each time a row must be indexed. This flag helps reduce Lucene calls when rows are partially updated and the columns that affect the index are updated less frequently than the rest of the row.
schema: see below

<schema_definition>:= {
   fields: { <mapper_definition> (, <mapper_definition>)* }
   (, analyzers: { <analyzer_definition> (, <analyzer_definition>)* })?
   (, default_analyzer: "<analyzer_name>")?
}

Where default_analyzer defaults to 'org.apache.lucene.analysis.standard.StandardAnalyzer'.

<analyzer_definition>:= <analyzer_name>: {
   type: "<analyzer_type>" (, <option>: "<value>")*
}

<mapper_definition>:= <mapper_name>: {
   type: "<mapper_type>" (, <option>: "<value>")*
}

There are three configuration levels for the directory where indexes are written. You can configure literal paths in your static partitioner, you can configure the global 'directory_path' option, or you can rely on the data directories configured in Cassandra.

SCLI (Stratio's Cassandra Lucene Index) uses the custom static partitioner paths, the directory_path option and the directories configured in Cassandra, in that order.

When Cassandra is configured with more than one entry in data_file_directories, it locks those directories during flushing. Lucene indexes use file locking as well. So, if Cassandra is configured with multiple data_file_directories, you cannot use a child directory of them to store Lucene indexes.

Partitioners

Lucene indexes can be partitioned on a per-node basis. This means that the local index in each node can be split into multiple smaller fragments. Index partitioning is useful to speed up some searches to the detriment of others, depending on the implementation. It is also useful to overcome Lucene's hard limit of 2147483519 documents per local index, which becomes a per-partition limit.

Partitioning is disabled by default, and it can be activated by specifying a partitioner implementation in the index creation statement.

Please note that the index creation statement specifies the values of several Lucene memory-related attributes, such as max_merge_mb or ram_buffer_mb. These attributes are applied to each local Lucene index or partition, so the amount of memory should be multiplied by the number of partitions.
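The multiplication mentioned above can be made concrete with a small Python sketch. The function name per_node_memory_mb is hypothetical. The default values are the ones listed for ram_buffer_mb, max_merge_mb and max_cached_mb in the options of this section, and the result is only a rough per-attribute estimate, not a measurement of what Lucene actually allocates.

```python
def per_node_memory_mb(partitions, ram_buffer_mb=64, max_merge_mb=5, max_cached_mb=30):
    """Scale each memory-related index option by the number of partitions."""
    # Each partition is an independent Lucene index with its own settings.
    return {
        "ram_buffer_mb": ram_buffer_mb * partitions,
        "max_merge_mb": max_merge_mb * partitions,
        "max_cached_mb": max_cached_mb * partitions,
    }

estimate = per_node_memory_mb(partitions=4)
print(estimate)
```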

None partitioner

A partitioner with no action, equivalent to not defining a partitioner. This is the default implementation.

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'schema': '{...}',
   'partitioner': '{type: "none"}'
};

Token partitioner

A partitioner based on the partition key token. Partitioning on token guarantees good load balancing between partitions while speeding up partition-directed searches to the detriment of token range search performance. It allows partition-directed queries to run efficiently in nodes indexing more than 2147483519 rows. However, token range searches in nodes with more than 2147483519 rows will fail. The number of partitions per node should be specified.

CREATE TABLE tweets (
   user TEXT,
   month INT,
   date TIMESTAMP,
   id INT,
   body TEXT,
   PRIMARY KEY ((user, month), date, id)
);

CREATE CUSTOM INDEX idx ON tweets()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'schema': '{...}',
   'partitioner': '{type: "token", partitions: 4}'
};

SELECT * FROM tweets WHERE expr(idx, '{...}') AND user = 'jsmith' AND month = 5; -- Fetches 1 node, 1 partition

SELECT * FROM tweets WHERE expr(idx, '{...}') AND user = 'jsmith' ALLOW FILTERING; -- Fetches all nodes, all partitions

SELECT * FROM tweets WHERE expr(idx, '{...}'); -- Fetches all nodes, all partitions

Column partitioner

A partitioner based on a column of the partition key. Rows will be stored in an index partition determined by the hash of the specified partition key column. Both partition-directed and token range searches containing a CQL equality filter over the selected partition key column will be routed to a single partition, increasing performance. However, token range searches without filters over the partitioning column will be routed to all the partitions, with a slightly lower performance.

Load balancing depends on the cardinality and distribution of the values of the partitioning column. Both high cardinalities and uniform distributions will provide better load balancing between partitions.

CREATE TABLE tweets (
   user TEXT,
   month INT,
   date TIMESTAMP,
   id INT,
   body TEXT,
   PRIMARY KEY ((user, month), date, id)
);

CREATE CUSTOM INDEX idx ON tweets()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'schema': '{...}',
   'partitioner': '{type: "column", partitions: 4, column:"user"}'
};

SELECT * FROM tweets WHERE expr(idx, '{...}') AND user = 'jsmith' AND month = 5; -- Fetches 1 node, 1 partition

SELECT * FROM tweets WHERE expr(idx, '{...}') AND user = 'jsmith' ALLOW FILTERING; -- Fetches all nodes, 1 partition

SELECT * FROM tweets WHERE expr(idx, '{...}'); -- Fetches all nodes, all partitions
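The routing idea behind the column partitioner can be illustrated with the Python sketch below. It is not the plugin's implementation: the function partition_for is hypothetical and zlib.crc32 is only a stand-in for whatever hash the index really applies to the column value.

```python
import zlib

def partition_for(column_value, partitions):
    """Pick an index partition from the hash of the partitioning column."""
    # crc32 is an illustrative stand-in, not the plugin's actual hash function.
    return zlib.crc32(column_value.encode("utf-8")) % partitions

# Hypothetical (user, month) partition keys, partitioned on the user column.
rows = [("jsmith", 5), ("jsmith", 6), ("adoe", 5)]
assignments = {row: partition_for(row[0], 4) for row in rows}
print(assignments)
```

All rows sharing the same user value land in the same index partition, which is why an equality filter on that column lets a search touch a single partition per node.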

Virtual node partitioner

A virtual node based partitioner. Rows will be stored in an index partition determined by the hash of the virtual node token range number. Partition-directed and specific virtual node token range searches will be routed to a single partition, increasing performance. However, unbounded token range searches will be routed to all the partitions, with a slightly lower performance.

Load balancing depends on the distribution of virtual node token ranges. The more virtual nodes there are, the more evenly tokens are distributed between them (the number of tokens falling inside each virtual node becomes more similar), and the better the load balancing.

CREATE TABLE tweets (
   user TEXT,
   month INT,
   date TIMESTAMP,
   id INT,
   body TEXT,
   PRIMARY KEY ((user, month), date, id)
);

CREATE CUSTOM INDEX idx ON tweets()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'schema': '{...}',
   'partitioner': '{type: "vnode", vnodes_per_partition: 4}'
};

SELECT * FROM tweets WHERE expr(idx, '{...}') AND user = 'jsmith' AND month = 5; -- Fetches 1 node, 1 partition

SELECT * FROM tweets WHERE expr(idx, '{...}') AND user = 'jsmith' ALLOW FILTERING; -- Fetches all nodes, all partitions

SELECT * FROM tweets WHERE expr(idx, '{...}')
    AND token(user, month) >= -2918332558536081408 AND token(user, month) < -2882303761517117440; -- Fetches 1 node, 1 partition

    where [-2918332558536081408, -2882303761517117440) is the token range assigned to one virtual node

SELECT * FROM tweets WHERE expr(idx, '{...}'); -- Fetches all nodes, all partitions

Analyzers

Analyzer definition options depend on the analyzer type. Details and default values are listed in the table below.

Analyzer type	Option	Value type	Default value
classpath	class	string	null
snowball	language	string	null
snowball	stopwords	string	null

The analyzers defined in this section can be referenced by mappers. Additionally, there are prebuilt analyzers for:

Analyzer name	Analyzer full package name
standard	org.apache.lucene.analysis.standard.StandardAnalyzer
keyword	org.apache.lucene.analysis.core.KeywordAnalyzer
stop	org.apache.lucene.analysis.core.StopAnalyzer
whitespace	org.apache.lucene.analysis.core.WhitespaceAnalyzer
simple	org.apache.lucene.analysis.core.SimpleAnalyzer
classic	org.apache.lucene.analysis.standard.ClassicAnalyzer
arabic	org.apache.lucene.analysis.ar.ArabicAnalyzer
armenian	org.apache.lucene.analysis.hy.ArmenianAnalyzer
basque	org.apache.lucene.analysis.eu.BasqueAnalyzer
brazilian	org.apache.lucene.analysis.br.BrazilianAnalyzer
bulgarian	org.apache.lucene.analysis.bg.BulgarianAnalyzer
catalan	org.apache.lucene.analysis.ca.CatalanAnalyzer
cjk	org.apache.lucene.analysis.cjk.CJKAnalyzer
czech	org.apache.lucene.analysis.cz.CzechAnalyzer
dutch	org.apache.lucene.analysis.nl.DutchAnalyzer
danish	org.apache.lucene.analysis.da.DanishAnalyzer
english	org.apache.lucene.analysis.en.EnglishAnalyzer
finnish	org.apache.lucene.analysis.fi.FinnishAnalyzer
french	org.apache.lucene.analysis.fr.FrenchAnalyzer
galician	org.apache.lucene.analysis.gl.GalicianAnalyzer
german	org.apache.lucene.analysis.de.GermanAnalyzer
greek	org.apache.lucene.analysis.el.GreekAnalyzer
hindi	org.apache.lucene.analysis.hi.HindiAnalyzer
hungarian	org.apache.lucene.analysis.hu.HungarianAnalyzer
indonesian	org.apache.lucene.analysis.id.IndonesianAnalyzer
irish	org.apache.lucene.analysis.ga.IrishAnalyzer
italian	org.apache.lucene.analysis.it.ItalianAnalyzer
latvian	org.apache.lucene.analysis.lv.LatvianAnalyzer
norwegian	org.apache.lucene.analysis.no.NorwegianAnalyzer
persian	org.apache.lucene.analysis.fa.PersianAnalyzer
portuguese	org.apache.lucene.analysis.pt.PortugueseAnalyzer
romanian	org.apache.lucene.analysis.ro.RomanianAnalyzer
russian	org.apache.lucene.analysis.ru.RussianAnalyzer
sorani	org.apache.lucene.analysis.ckb.SoraniAnalyzer
spanish	org.apache.lucene.analysis.es.SpanishAnalyzer
swedish	org.apache.lucene.analysis.sv.SwedishAnalyzer
turkish	org.apache.lucene.analysis.tr.TurkishAnalyzer
thai	org.apache.lucene.analysis.th.ThaiAnalyzer

Classpath analyzer

An analyzer that instantiates a Lucene analyzer present in the classpath.

Example:

CREATE CUSTOM INDEX census_index on census()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      analyzers: {
         an_analyzer: {
            type: "classpath",
            class: "org.apache.lucene.analysis.en.EnglishAnalyzer"
         }
      }
   }'
};

Snowball analyzer

An analyzer using a Snowball filter (SnowballFilter); see http://snowball.tartarus.org/ for details about Snowball.

Example:

CREATE CUSTOM INDEX census_index on census()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      analyzers: {
         an_analyzer: {
            type: "snowball",
            language: "English",
            stopwords: "a,an,the,this,that"
         }
      }
   }'
};

Supported languages: English, French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hungarian and Turkish.

Mappers

Field mapping definition options specify how the CQL rows will be mapped to Lucene documents. Several mappers can be applied to the same CQL column or columns. Details and default values are listed in the table below.

Mapper type	Option	Value type	Default value	Mandatory
bigdec	validated	boolean	false	No
	column	string	mapper_name of the schema	No
	integer_digits	integer	32	No
	decimal_digits	integer	32	No
bigint	validated	boolean	false	No
	column	string	mapper_name of the schema	No
	digits	integer	32	No
bitemporal	validated	boolean	false	No
	vt_from	string		Yes
	vt_to	string		Yes
	tt_from	string		Yes
	tt_to	string		Yes
	pattern	string	yyyy/MM/dd HH:mm:ss.SSS Z	No
	now_value	object	Long.MAX_VALUE	No
blob	validated	boolean	false	No
blob	column	string	mapper_name of the schema	No
boolean	validated	boolean	false	No
boolean	column	string	mapper_name of the schema	No
date	validated	boolean	false	No
	column	string	mapper_name of the schema	No
	pattern	string	yyyy/MM/dd HH:mm:ss.SSS Z	No
date_range	validated	boolean	false	No
	from	string		Yes
	to	string		Yes
	pattern	string	yyyy/MM/dd HH:mm:ss.SSS Z	No
double	validated	boolean	false	No
	column	string	mapper_name of the schema	No
	boost	integer	0.1f	No
float	validated	boolean	false	No
	column	string	mapper_name of the schema	No
	boost	integer	0.1f	No
geo_point	validated	boolean	false	No
	latitude	string		Yes
	longitude	string		Yes
	max_levels	integer	11	No
geo_shape	validated	boolean	false	No
	column	string	mapper_name of the schema	No
	max_levels	integer	5	No
	transformations	array		No
inet	validated	boolean	false	No
	column	string	mapper_name of the schema	No
integer	validated	boolean	false	No
	column	string	mapper_name of the schema	No
	boost	float	0.1f	No
long	validated	boolean	false	No
	column	string	mapper_name of the schema	No
	boost	float	0.1f	No
string	validated	boolean	false	No
	column	string	mapper_name of the schema	No
	case_sensitive	boolean	true	No
text	validated	boolean	false	No
	column	string	mapper_name of the schema	No
	analyzer	string	default_analyzer of the schema	No
uuid	validated	boolean	false	No
	column	string	mapper_name of the schema	No

All mappers have a validated option indicating whether the mapped column values must be validated at CQL level before performing the distributed write operation. If this option is set, the coordinator node will throw an error on writes containing values that can't be mapped, causing the whole write operation to fail and notifying the client of the cause. If validation is not set, which is the default, writes to Cassandra will never fail because of the index. Instead, each failing column value will be silently discarded, and the error message will just be logged in the nodes involved. This option is useful to avoid writes containing values that can't be searched afterwards, and it can also be used as a generic data validation layer. Note that mappers affecting several columns at a time, such as date_range, geo_point and bitemporal, need all the involved columns to perform validation, so partial column updates are not allowed when validation is active.
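
As an illustration of the validated option, the following sketch assumes a hypothetical table test with an int primary key id and a text column creation; neither the table layout nor the values come from the original examples. With validated set to true, the second write would be expected to fail at the coordinator, because its value cannot be parsed with the mapper pattern.

```sql
-- Hypothetical table used only for this illustration
CREATE TABLE test (id int PRIMARY KEY, creation text);

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         creation: {type: "date", validated: true, pattern: "yyyy/MM/dd"}
      }
   }'
};

-- Matches the pattern, so the write should be accepted and indexed
INSERT INTO test (id, creation) VALUES (1, '2014/03/11');

-- Does not match the pattern; with validated: true this write is expected to be rejected
INSERT INTO test (id, creation) VALUES (2, 'not a date');
```

With validated left at its default of false, both writes would succeed in Cassandra, and the second value would simply be left out of the index and logged.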

Cassandra allows only one custom per-row index per table, and it does not allow any modify operation on indexes. So, to modify an index it needs to be deleted first and created again. Alternatively, if you are using the classic dummy-column syntax, the index will be considered per-column, so you would be able to create a second index with the new schema, wait until the new index is completely built, and then delete the old index.
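
A minimal sketch of the delete-and-recreate approach, assuming the session already uses the keyspace holding a table test with a date-like column creation (both names are illustrative, and the changed pattern is only an example of a schema modification):

```sql
-- Remove the existing index; searches on it will not work until it is created again
DROP INDEX IF EXISTS test_idx;

-- Create the index again with the modified schema (here, a different date pattern)
CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         creation: {type: "date", pattern: "yyyy/MM/dd HH:mm"}
      }
   }'
};
```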

Big decimal mapper

Maps arbitrary precision signed decimal values.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the big decimal to be indexed.
integer_digits (default = 32): the max number of decimal digits for the integer part.
decimal_digits (default = 32): the max number of decimal digits for the decimal part.

Supported CQL types:

ascii, bigint, decimal, double, float, int, smallint, text, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX census_index on census()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         bigdecimal: {
            type: "bigdec",
            integer_digits: 2,
            decimal_digits: 2,
            validated: true,
            column: "column_name"
         }
      }
   }'
};

Big integer mapper

Maps arbitrary precision signed integer values.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the big integer to be indexed.
digits (default = 32): the max number of decimal digits.

Supported CQL types:

ascii, bigint, int, smallint, text, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         biginteger: {
            type: "bigint",
            digits: 10,
            validated: true,
            column: "column_name"
         }
      }
   }'
};

Bitemporal mapper

Maps four columns containing the four dates defining a bitemporal fact. The mapped columns shouldn't be collections.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
vt_from (mandatory): the name of the column storing the beginning of the valid date range.
vt_to (mandatory): the name of the column storing the end of the valid date range.
tt_from (mandatory): the name of the column storing the beginning of the transaction date range.
tt_to (mandatory): the name of the column storing the end of the transaction date range.
now_value (default = Long.MAX_VALUE): a date representing now.
pattern (default = yyyy/MM/dd HH:mm:ss.SSS Z): the date pattern for parsing Cassandra non-date columns and creating Lucene fields. Note that it can be used to index dates with reduced precision.

Supported CQL types:

ascii, bigint, date, int, text, timestamp, timeuuid, uuid, varchar, varint

Example:

CREATE CUSTOM INDEX census_index on census()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         bitemporal: {
            type: "bitemporal",
            vt_from: "vt_from",
            vt_to: "vt_to",
            tt_from: "tt_from",
            tt_to: "tt_to",
            validated: true,
            pattern: "yyyy/MM/dd HH:mm:ss.SSS",
            now_value: "3000/01/01 00:00:00.000"
         }
      }
   }'
};
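
To make the four mapped columns concrete, here is a hedged sketch of a table layout that the index above could be built on. The table definition, the name and city columns and the inserted values are assumptions made for illustration; only the four date column names, the pattern and the now_value are taken from the example. The table would have to exist before the index is created.

```sql
-- Hypothetical layout for the census table; the four date columns are plain text
CREATE TABLE census (
   name text,
   city text,
   vt_from text,
   vt_to text,
   tt_from text,
   tt_to text,
   PRIMARY KEY (name, vt_from, tt_from)
);

-- A fact valid from 2015/01/01 and recorded on 2015/01/05.
-- Both ranges are left open by storing the configured now_value as their end
-- (this use of now_value for open-ended ranges is an assumption).
INSERT INTO census (name, city, vt_from, vt_to, tt_from, tt_to)
VALUES ('John', 'Madrid',
        '2015/01/01 00:00:00.000', '3000/01/01 00:00:00.000',
        '2015/01/05 00:00:00.000', '3000/01/01 00:00:00.000');
```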

Blob mapper

Maps a blob value.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the blob to be indexed.

Supported CQL types:

ascii, blob, text, varchar

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         blob: {
            type: "bytes",
            column: "column_name"
         }
      }
   }'
};

Boolean mapper

Maps a boolean value.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the boolean value to be indexed.

Supported CQL types:

ascii, boolean, text, varchar

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         bool: {
            type: "boolean",
            validated: true,
            column: "column_name"
         }
      }
   }'
};

Date mapper

Maps dates using either a pattern, a UNIX timestamp or a time UUID.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the date to be indexed.
pattern (default = yyyy/MM/dd HH:mm:ss.SSS Z): the date pattern for parsing Cassandra non-date columns and creating Lucene fields. Note that it can be used to index dates with reduced precision.

Supported CQL types:

ascii, bigint, date, int, text, timestamp, timeuuid, uuid, varchar, varint

Example: Index the column creation with a precision of minutes using the date format pattern yyyy/MM/dd HH:mm:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         creation: {
            type: "date",
            pattern: "yyyy/MM/dd HH:mm"
         }
      }
   }'
};
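
Once indexed with minute precision, the column could be searched with bounds written in the same pattern. The sketch below assumes that the range search type is selected with type: "range" and that its bounds are parsed with the mapper's pattern; the dates themselves are illustrative.

```sql
-- Rows of table test created during 2014, using the minute-precision pattern
SELECT * FROM test WHERE expr(test_idx, '{
   filter: {
      type: "range",
      field: "creation",
      lower: "2014/01/01 00:00",
      upper: "2014/12/31 23:59",
      include_lower: true,
      include_upper: true
   }
}');
```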

Date range mapper

Maps a time duration/period defined by a start date and a stop date. The mapped columns shouldn't be collections.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
from (mandatory): the name of the column storing the start date of the time duration to be indexed.
to (mandatory): the name of the column storing the stop date of the time duration to be indexed.
pattern (default = yyyy/MM/dd HH:mm:ss.SSS Z): the date pattern for parsing Cassandra non-date columns and creating Lucene fields. Note that it can be used to index dates with reduced precision.

Supported CQL types:

ascii, bigint, date, int, text, timestamp, timeuuid, uuid, varchar, varint

Example 1: Index the time period defined by the columns start and stop, using the default date pattern:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         duration: {
            type: "date_range",
            from: "start",
            to: "stop"
         }
      }
   }'
};

Example 2: Index the time period defined by the columns start and stop, validating values, and using a precision of minutes:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         duration: {
            type: "date_range",
            validated: true,
            from: "start",
            to: "stop",
            pattern: "yyyy/MM/dd HH:mm"
         }
      }
   }'
};
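
A field mapped this way could then be queried with a date range search. The following sketch assumes the search type is selected with type: "date_range" and uses the from, to and operation options listed for that search; the dates are illustrative and follow the minute-precision pattern of Example 2.

```sql
-- Rows whose start/stop period intersects the year 2014
SELECT * FROM test WHERE expr(test_idx, '{
   filter: {
      type: "date_range",
      field: "duration",
      from: "2014/01/01 00:00",
      to: "2014/12/31 23:59",
      operation: "intersects"
   }
}');
```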

Double mapper

Maps a 64-bit decimal number.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the double to be indexed.
boost (default = 0.1f): Lucene's index-time boosting factor.

Supported CQL types:

ascii, bigint, decimal, double, float, int, smallint, text, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         double: {
            type: "double",
            boost: 2.0,
            validated: true,
            column: "column_name"
         }
      }
   }'
};

Float mapper

Maps a 32-bit decimal number.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the float to be indexed.
boost (default = 0.1f): Lucene's index-time boosting factor.

Supported CQL types:

ascii, bigint, decimal, double, float, int, smallint, text, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         float: {
            type: "float",
            boost: 2.0,
            validated: true,
            column: "column_name"
         }
      }
   }'
};

Geo point mapper

Maps a geospatial location (point) defined by two columns containing a latitude and a longitude. Indexing is based on a composite spatial strategy that stores points in a doc values field and also indexes them into a geohash recursive prefix tree with a certain precision level. The low-accuracy prefix tree is used to quickly find results, maybe producing some false positives, and the doc values field is used to discard these false positives. The mapped columns shouldn't be collections.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
latitude (mandatory): the name of the column storing the latitude of the point to be indexed.
longitude (mandatory): the name of the column storing the longitude of the point to be indexed.
max_levels (default = 11): the maximum number of levels in the underlying geohash search tree. False positives will be discarded using stored doc values, so this doesn't mean a loss of precision. Higher values will produce fewer false positives to be post-filtered, at the expense of creating more terms in the search index.

Supported CQL types:

ascii, bigint, decimal, double, float, int, smallint, text, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         geo_point: {
            type: "geo_point",
            validated: true,
            latitude: "lat",
            longitude: "long",
            max_levels: 15
         }
      }
   }'
};
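
As a usage sketch, the statements below show a table shape that would fit the mapper above and a distance search against the resulting field. The table definition and the coordinates are assumptions made for illustration, and the table would need to exist before the index is created. The search assumes the type identifier "geo_distance" together with the latitude, longitude and max_distance options of the geo distance search.

```sql
-- Hypothetical table; lat and long match the column names used by the mapper above
CREATE TABLE test (id int PRIMARY KEY, lat double, long double);
INSERT INTO test (id, lat, long) VALUES (1, 40.4168, -3.7038);

-- Rows located within one kilometre of the given point
SELECT * FROM test WHERE expr(test_idx, '{
   filter: {
      type: "geo_distance",
      field: "geo_point",
      latitude: 40.42,
      longitude: -3.70,
      max_distance: "1km"
   }
}');
```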

Geo shape mapper

Maps a geographical shape stored in a text column with Well Known Text (WKT) format. The supported WKT shapes are point, linestring, polygon, multipoint, multilinestring and multipolygon.

It is possible to specify a sequence of geometrical transformations to be applied to the shape before indexing it. It could be used for indexing only the centroid of the shape, or a buffer around it, etc.

Indexing is based on a composite spatial strategy that stores shapes in a doc values field and also indexes them into a geohash recursive prefix tree with a certain precision level. The low-accuracy prefix tree is used to quickly find results, maybe producing some false positives, and the doc values field is used to discard these false positives.

This mapper depends on Java Topology Suite (JTS). This library can't be distributed together with this project due to license compatibility problems, but you can add it by putting jts-core-1.14.0.jar into your Cassandra installation lib directory.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the shape to be indexed in WKT format.
max_levels (default = 5): the maximum number of levels in the underlying geohash search tree. False positives will be discarded using stored doc values, so this doesn't mean a loss of precision. Higher values will produce fewer false positives to be post-filtered, at the expense of creating more terms in the search index.
transformations (optional): sequence of geometrical transformations to be applied to each shape before indexing it.

Supported CQL types:

ascii, text, varchar

Example 1:

CREATE TABLE IF NOT EXISTS test (
   id int,
   shape text,
   lucene text,
   PRIMARY KEY (id)
);

INSERT INTO test(id, shape) VALUES (1, 'POINT(-0.13 51.50)');
INSERT INTO test(id, shape) VALUES (2, 'LINESTRING(-0.25 51.52, -0.08 51.39, -0.02 51.42)');
INSERT INTO test(id, shape) VALUES (3, 'POLYGON((-0.07 51.63, 0.03 51.54, 0.05 51.65, -0.07 51.63))');
INSERT INTO test(id, shape) VALUES (4, 'MULTIPOINT(-0.65 52.60, -1.00 51.76, -0.65 52.60)');
INSERT INTO test(id, shape) VALUES (5, 'MULTILINESTRING((-0.43 51.56, -0.33 51.35, -0.13 51.35),
                                                        (-0.25 51.56, -0.14 51.48))');
INSERT INTO test(id, shape) VALUES (6, 'MULTIPOLYGON(((-0.51 51.58, -0.18 51.14, 0.49 51.73, -0.51 51.58),
                                                      (-0.25 51.54, -0.12 51.32, 0.16 51.59, -0.25 51.54)))');

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         shape: {
            type: "geo_shape",
            max_levels: 15
         }
      }
   }'
};

Example 2: Index only the centroid of the WKT shape contained in the indexed column:

CREATE TABLE IF NOT EXISTS cities (
   name text,
   shape text,
   lucene text,
   PRIMARY KEY (name)
);

INSERT INTO cities(name, shape) VALUES ('birmingham', 'POLYGON((-2.25 52.63, -2.26 52.49, -2.13 52.36, -1.80 52.34, -1.57 52.54, -1.89 52.67, -2.25 52.63))');
INSERT INTO cities(name, shape) VALUES ('london', 'POLYGON((-0.55 51.50, -0.13 51.19, 0.21 51.35, 0.30 51.62, -0.02 51.75, -0.34 51.69, -0.55 51.50))');

CREATE CUSTOM INDEX cities_index on cities()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         shape: {
            type: "geo_shape",
            max_levels: 15,
            transformations: [{type: "centroid"}]
         }
      }
   }'
};

Example 3: Index a buffer 50 kilometres around the area of a city:

CREATE TABLE IF NOT EXISTS cities (
   name text,
   shape text,
   lucene text,
   PRIMARY KEY (name)
);

INSERT INTO cities(name, shape) VALUES ('birmingham', 'POLYGON((-2.25 52.63, -2.26 52.49, -2.13 52.36, -1.80 52.34, -1.57 52.54, -1.89 52.67, -2.25 52.63))');
INSERT INTO cities(name, shape) VALUES ('london', 'POLYGON((-0.55 51.50, -0.13 51.19, 0.21 51.35, 0.30 51.62, -0.02 51.75, -0.34 51.69, -0.55 51.50))');

CREATE CUSTOM INDEX cities_index on cities()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         shape: {
            type: "geo_shape",
            max_levels: 15,
            transformations: [{type: "buffer", max_distance: "50km"}]
         }
      }
   }'
};

Example 4: Index a buffer 50 kilometres around the borders of a country:

CREATE TABLE IF NOT EXISTS borders (
   country text,
   shape text,
   PRIMARY KEY (country)
);

INSERT INTO borders(country, shape) VALUES ('france', 'LINESTRING(-1.8037198483943 43.463094234466, -1.3642667233943 43.331258296966 ... )');
INSERT INTO borders(country, shape) VALUES ('portugal', 'LINESTRING(-8.8789151608943 41.925008296966, -8.2636807858943 42.100789546966 ... )');

CREATE CUSTOM INDEX borders_index on borders()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         shape: {
            type: "geo_shape",
            max_levels: 15,
            transformations: [{type: "buffer", max_distance: "50km"}]
         }
      }
   }'
};

Example 5: Index the convex hull of the WKT shape contained in the indexed column:

CREATE TABLE IF NOT EXISTS blocks (
   id bigint PRIMARY KEY,
   shape text
);

INSERT INTO blocks(id, shape) VALUES (341, 'MULTIPOLYGON(((-86.693279 32.390691, -86.693185 32.391494, -86.691590 32.391362, -86.691621 32.391095 ... )))');

CREATE CUSTOM INDEX blocks_index on blocks()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         shape: {
            type: "geo_shape",
            max_levels: 15,
            transformations: [{type: "convex_hull"}]
         }
      }
   }'
};

Example 6: Index the bounding box of the WKT shape contained in the indexed column:

CREATE TABLE IF NOT EXISTS blocks (
   id bigint PRIMARY KEY,
   shape text
);

INSERT INTO blocks(id, shape) VALUES (341, 'MULTIPOLYGON(((-86.693279 32.390691, -86.693185 32.391494, -86.691590 32.391362 ... )))');

CREATE CUSTOM INDEX blocks_index on blocks()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         shape: {
            type: "geo_shape",
            max_levels: 15,
            transformations: [{type: "bbox"}]
         }
      }
   }'
};

Inet mapper

Maps an IP address. Both IPv4 and IPv6 are supported.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the IP address to be indexed.

Supported CQL types:

ascii, inet, text, varchar

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         inet: {
            type: "inet",
            validated: true,
            column: "column_name"
         }
      }
   }'
};

Integer mapper

Maps a 32-bit integer number.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the integer to be indexed.
boost (default = 0.1f): Lucene's index-time boosting factor.

Supported CQL types:

ascii, bigint, date, decimal, double, float, int, smallint, text, timestamp, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         integer: {
            type: "integer",
            validated: true,
            column: "column_name",
            boost: 2.0
         }
      }
   }'
};

Long mapper

Maps a 64-bit integer number.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the long to be indexed.
boost (default = 0.1f): Lucene's index-time boosting factor.

Supported CQL types:

ascii, bigint, date, decimal, double, float, int, smallint, text, timestamp, tinyint, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         long: {
            type: "long",
            validated: true,
            column: "column_name",
            boost: 2.0
         }
      }
   }'
};

String mapper

Maps a not-analyzed text value.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the string to be indexed.
case_sensitive (default = true): if the text will be indexed preserving its casing.

Supported CQL types:

ascii, bigint, boolean, decimal, double, float, inet, int, smallint, text, timeuuid, tinyint, uuid, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         string: {
            type: "string",
            validated: true,
            column: "column_name",
            case_sensitive: false
         }
      }
   }'
};

Text mapper

Maps a language-aware text value analyzed according to the specified analyzer.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the text to be indexed.
analyzer (default = default_analyzer of the schema): the name of the text analyzer to be used. In addition to the analyzers defined in the analyzers section of the schema, there are prebuilt analyzers for Arabic, Bulgarian, Brazilian, Catalan, Sorani, Czech, Danish, German, Greek, English, Spanish, Basque, Persian, Finnish, French, Irish, Galician, Hindi, Hungarian, Armenian, Indonesian, Italian, Latvian, Dutch, Norwegian, Portuguese, Romanian, Russian, Swedish, Thai and Turkish.

Supported CQL types:

ascii, bigint, boolean, decimal, double, float, inet, int, smallint, text, timeuuid, tinyint, uuid, varchar, varint

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      analyzers: {
         my_custom_analyzer: {
            type: "snowball",
            language: "Spanish",
            stopwords: "el,la,lo,los,las,a,ante,bajo,cabe,con,contra"
         }
      },
      fields: {
         spanish_text: {
            type: "text",
            validated: true,
            column: "message_body",
            analyzer: "my_custom_analyzer"
         },
         english_text: {
            type: "text",
            column: "message_body",
            analyzer: "English"
         }
      }
   }'
};
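
Because the same column is mapped twice, each field can be searched with its own analysis. The sketch below runs a phrase search against the Spanish-analyzed field; it assumes the type identifier "phrase" with the value and slop options of the phrase search, and the searched text is purely illustrative.

```sql
-- Phrase search on the field analyzed with my_custom_analyzer,
-- allowing one intermediate term between the words
SELECT * FROM test WHERE expr(test_idx, '{
   query: {
      type: "phrase",
      field: "spanish_text",
      value: "camisa manga larga",
      slop: 1
   }
}');
```

Pointing the same search at english_text instead would apply the prebuilt English analyzer to the same message_body values.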

UUID mapper

Maps a UUID value.

Parameters:

validated (default = false): if mapping errors should make CQL writes fail, instead of just logging the error.
column (default = name of the mapper): the name of the column storing the UUID to be indexed.

Supported CQL types:

ascii, text, timeuuid, uuid, varchar

Example:

CREATE CUSTOM INDEX test_idx ON test()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '1',
   'schema': '{
      fields: {
         id: {
            type: "uuid",
            validated: true,
            column: "column_name"
         }
      }
   }'
};

Example

The code below, together with the code for creating the corresponding keyspace and table, is available in a CQL script that can be sourced from the Cassandra shell: test-users-create.cql.

CREATE CUSTOM INDEX IF NOT EXISTS users_index
ON test.users ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
   'refresh_seconds': '60',
   'ram_buffer_mb': '64',
   'max_merge_mb': '5',
   'max_cached_mb': '30',
   'excluded_data_centers': 'dc2,dc3',
   'partitioner': '{type: "token", partitions: 4}',
   'schema': '{
      analyzers: {
         my_custom_analyzer: {
            type: "snowball",
            language: "Spanish",
            stopwords: "el,la,lo,los,las,a,ante,bajo,cabe,con,contra"
         }
      },
      default_analyzer: "english",
      fields: {
         name: {type: "string"},
         gender: {type: "string", validated: true},
         animal: {type: "string"},
         age: {type: "integer"},
         food: {type: "string"},
         number: {type: "integer"},
         bool: {type: "boolean"},
         date: {type: "date", validated: true, pattern: "yyyy/MM/dd"},
         duration: {type: "date_range", from: "start_date", to: "stop_date"},
         place: {type: "geo_point", latitude: "latitude", longitude: "longitude"},
         mapz: {type: "string"},
         setz: {type: "string"},
         listz: {type: "string"},
         phrase: {type: "text", analyzer: "my_custom_analyzer"}
      }
   }'
};

Searching

Lucene indexes are queried using a custom JSON syntax defining the kind of search to be done.

Syntax:

SELECT ( <fields> | * ) FROM <table_name> WHERE expr(<index_name>, '{
   (filter: ( <filter> )* )?
   (, query: ( <query>  )* )?
   (, sort: ( <sort>   )* )?
   (, refresh: ( true | false ) )?
}');

where <filter> and <query> are JSON objects:

<filter>:= {type: <type> (, <option>: ( <value> | <value_list> ) )* }
<query>:= {type: <type> (, <option>: ( <value> | <value_list> ) )* }

and <sort> is another JSON object:

<sort>:= <simple_sort_field> | <geo_distance_sort_field>
<simple_sort_field>:= {
   field: <field>
   (, type: "simple" )?
   (, reverse: <reverse> )?
}
<geo_distance_sort_field>:= {
   type: "geo_distance",
   field: <field>,
   latitude: <Double>,
   longitude: <Double>
   (, reverse: <reverse> )?
}

When searching by filter, without any query or sort defined, the results are returned in Cassandra's natural order, which is defined by the partitioner and the column name comparator. When searching by query, results are returned sorted by descending relevance. The sort option is used to specify the order in which the indexed rows will be traversed. When simple_sort_field sorting is used, the query scoring is delayed.
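
The difference between a plain filter and a sorted filter can be sketched against the users_index example defined earlier. The type identifier "match" and the reading of reverse: true as descending order are assumptions; the field names come from that example schema and the searched value is illustrative.

```sql
-- Filter only: matching rows come back in Cassandra's natural order
SELECT * FROM test.users WHERE expr(users_index, '{
   filter: {type: "match", field: "gender", value: "female"}
}');

-- Same filter, with rows traversed by age in reverse order
SELECT * FROM test.users WHERE expr(users_index, '{
   filter: {type: "match", field: "gender", value: "female"},
   sort: {field: "age", reverse: true}
}') LIMIT 10;
```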

geo_distance_sort_field is used to sort rows by their minimum distance to a given point. Its field option indicates the geo point mapper to be used.
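
A hedged sketch of a distance sort, reusing the place geo point field of the users_index example; the coordinates are illustrative:

```sql
-- The ten indexed rows closest to the given point
SELECT * FROM test.users WHERE expr(users_index, '{
   filter: {type: "all"},
   sort: {type: "geo_distance", field: "place", latitude: 40.4168, longitude: -3.7038}
}') LIMIT 10;
```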

Relevance queries must touch all the nodes in the ring in order to find the globally best results, so you should prefer filters over queries when neither relevance nor sorting is needed.

The refresh boolean option indicates whether the search must commit pending writes and refresh the Lucene IndexSearcher before being performed. This way a search with refresh set to true will view the most recent changes done to the index, independently of the index auto-refresh time. Please note that it is a costly operation, so you should not use it unless it is strictly necessary. The default value is false. You can explicitly refresh all the index shards with an empty search with consistency ALL, and then return to your desired consistency level:

CONSISTENCY ALL
SELECT * FROM <table> WHERE expr(<index_name>, '{refresh:true}');
CONSISTENCY QUORUM

This way the subsequent searches will view all the writes done before this operation, without needing to wait for the index auto refresh. It is useful to perform this operation before searching after a bulk data load.

Types of search and their options are summarized in the table below. Details for each of them are available in individual sections and the examples can be downloaded as a CQL script: extended-search-examples.cql.

In addition to the options described in the table, all search types have a “boost” option that acts as a weight on the resulting score.

Search type	Option	Value type	Default value	Mandatory
All
Bitemporal	field	string		Yes
	vt_from	string/long	0L	No
	vt_to	string/long	Long.MAX_VALUE	No
	tt_from	string/long	0L	No
	tt_to	string/long	Long.MAX_VALUE	No
Boolean	must	search		No
	should	search		No
	not	search		No
Contains	field	string		Yes
	values	array		Yes
	doc_values	boolean	false	No
Date range	field	string		Yes
	from	string/long	0	No
	to	string/long	Long.MAX_VALUE	No
	operation	string	intersects	No
Fuzzy	field	string		Yes
	value	string		Yes
	max_edits	integer	2	No
	prefix_length	integer	0	No
	max_expansions	integer	50	No
	transpositions	boolean	true	No
Geo bounding box	field	string		Yes
	min_latitude	double		Yes
	max_latitude	double		Yes
	min_longitude	double		Yes
	max_longitude	double		Yes
Geo distance	field	string		Yes
	latitude	double		Yes
	longitude	double		Yes
	max_distance	string		Yes
	min_distance	string		No
Geo shape	field	string		Yes
	shape	string (WKT)		Yes
	operation	string	is_within	No
Match	field	string		Yes
	value	any		Yes
	doc_values	boolean	false	No
None
Phrase	field	string		Yes
	value	string		Yes
	slop	integer	0	No
Prefix	field	string		Yes
	value	string		Yes
Range	field	string		Yes
	lower	any		No
	upper	any		No
	include_lower	boolean	false	No
	include_upper	boolean	false	No
	doc_values	boolean	false	No
Regexp	field	string		Yes
	value	string		Yes
Wildcard	field	string		Yes
	value	string		Yes
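
To show how the options in the table combine, the following sketch nests several searches inside a boolean search on the users_index example and weights one of them with boost. It assumes lowercase type identifiers ("boolean", "wildcard", "match", "phrase") and that must and should accept arrays of searches; the searched values are illustrative.

```sql
-- Names starting with J are required; the two optional clauses only affect the score
SELECT * FROM test.users WHERE expr(users_index, '{
   query: {
      type: "boolean",
      must: [{type: "wildcard", field: "name", value: "J*"}],
      should: [{type: "match", field: "food", value: "chips", boost: 2.0},
               {type: "phrase", field: "phrase", value: "camisa manga larga", slop: 1}]
   }
}');
```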

All search

Search for all the indexed rows.

Syntax:

SELECT ( <fields> | * ) FROM <table> WHERE expr(<index_name>, '{
   (filter | query): {type: "all"}
}');

Example: search for all the indexed rows:

SELECT * FROM users WHERE expr(users_index, '{
   filter: {type: "all"}
}');

Values	Unit
mm, millimetres	millimetre
cm, centimetres	centimetre
dm, decimetres	decimetre
m, metres	metre
dam, decametres	decametre
hm, hectometres	hectometre
km, kilometres	kilometre
ft, foots	foot
yd, yards	yard
in, inches	inch
mi, miles	mile
M, NM, mil, nautical_miles	nautical mile

Name	Type	Notes
NumDeletedDocs	Attribute	Total number of deleted documents in the index.
NumDocs	Attribute	Total number of documents in the index.
Commit	Operation	Commits all the pending index changes to disk.
Refresh	Operation	Reopens all the readers and searchers to provide a recent view of the index.
forceMerge	Operation	Optimizes the index forcing merge segments leaving the specified number of segments. It also includes a boolean parameter to block until all merging completes.
forceMergeDeletes	Operation	Optimizes the index forcing merge segments containing deletions, leaving the specified number of segments. It also includes a boolean parameter to block until all merging completes.
