---
title: Technology
layout: simple_page
sectionid: technology
---

Apache Druid (incubating) is an open source distributed data store. Druid’s core design combines ideas from OLAP/analytic databases, timeseries databases, and search systems to create a unified system for a broad range of use cases. Druid merges key characteristics of each of these three systems into its ingestion layer, storage format, querying layer, and core architecture.

Key features of Druid include:

**Column-oriented storage**

Druid stores and compresses each column individually, and reads only the columns needed for a particular query, which supports fast scans, rankings, and groupBys.

**Native search indexes**

Druid creates inverted indexes for string values for fast search and filter.

**Streaming and batch ingest**

Out-of-the-box connectors for Apache Kafka, HDFS, AWS S3, stream processors, and more.

**Flexible schemas**

Druid gracefully handles evolving schemas and nested data.

**Time-optimized partitioning**

Druid intelligently partitions data by time, so time-based queries are significantly faster than in traditional databases.

**SQL support**

In addition to its native JSON-based query language, Druid speaks SQL over either HTTP or JDBC (see the example after this list).

**Horizontal scalability**

Druid has been used in production to ingest millions of events/sec, retain years of data, and provide sub-second queries.

**Easy operation**

Scale up or down by just adding or removing servers, and Druid automatically rebalances. Fault-tolerant architecture routes around server failures.

**Integration**

Druid is complementary to many open source data technologies in the Apache Software Foundation including Apache Kafka, Apache Hadoop, Apache Flink, and more.

Druid typically sits between a storage or processing layer and the end user, and acts as a query layer to serve analytic workloads.
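
As a sketch of the SQL interface mentioned above, a query can be sent as a JSON document over HTTP to the Broker’s `/druid/v2/sql/` endpoint (the `wikipedia` datasource and its columns here are hypothetical):

```json
{
  "query": "SELECT channel, COUNT(*) AS edits FROM wikipedia WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY GROUP BY channel ORDER BY edits DESC LIMIT 5"
}
```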

## Ingestion

Druid supports both streaming and batch ingestion. Druid connects to a source of raw data, typically a message bus such as Apache Kafka (for streaming data loads), or a distributed filesystem such as HDFS (for batch data loads).

Druid converts raw data stored in a source to a more read-optimized format (called a Druid “segment”) in a process called “indexing”.
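
For streaming ingestion from Kafka, for example, indexing is driven by a supervisor spec submitted to Druid. The following is a heavily trimmed sketch (the datasource name, topic, columns, and broker address are hypothetical; consult the docs for the full spec format used by your version):

```json
{
  "type": "kafka",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["channel", "page", "user"] },
      "granularitySpec": { "segmentGranularity": "HOUR", "queryGranularity": "MINUTE" }
    },
    "ioConfig": {
      "topic": "wikipedia-edits",
      "inputFormat": { "type": "json" },
      "consumerProperties": { "bootstrap.servers": "kafka01:9092" }
    }
  }
}
```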

For more information, please visit our docs page.

## Storage

Like many analytic data stores, Druid stores data in columns. Depending on the type of column (string, number, etc), different compression and encoding methods are applied. Druid also builds different types of indexes based on the column type.

Similar to search systems, Druid builds inverted indexes for string columns for fast search and filter. Similar to timeseries databases, Druid intelligently partitions data by time to enable fast time-oriented queries.

Unlike many traditional systems, Druid can optionally pre-aggregate data as it is ingested. This pre-aggregation step is known as rollup, and can lead to dramatic storage savings.
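
A hypothetical illustration of rollup: with query granularity set to one minute and a sum metric over `bytes`, three raw events for the same page within the same minute collapse into a single stored row.

Raw input:

| timestamp            | page  | bytes |
|----------------------|-------|-------|
| 2019-01-01T10:01:12Z | Druid | 100   |
| 2019-01-01T10:01:33Z | Druid | 150   |
| 2019-01-01T10:01:50Z | Druid | 50    |

Stored after rollup:

| __time               | page  | count | sum_bytes |
|----------------------|-------|-------|-----------|
| 2019-01-01T10:01:00Z | Druid | 3     | 300       |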

For more information, please visit our docs page.

## Querying

Druid supports querying data through JSON-over-HTTP and SQL. In addition to standard SQL operators, Druid supports unique operators that leverage its suite of approximate algorithms to provide rapid counting, ranking, and quantiles.
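
For example, a single query can use Druid SQL’s approximate functions (the `events` datasource and its columns are hypothetical; `APPROX_QUANTILE` requires the approximate-histograms extension to be loaded):

```sql
-- Approximate distinct count and 95th-percentile latency over the last hour
SELECT
  APPROX_COUNT_DISTINCT(user_id) AS unique_users,
  APPROX_QUANTILE(latency_ms, 0.95) AS p95_latency
FROM events
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
```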

For more information, please visit our docs page.

## Architecture

Druid can be thought of as a disassembled database. Each core process in Druid (ingestion, querying, and coordination) can be separately or jointly deployed on commodity hardware.

Druid explicitly names every main process so that the operator can fine-tune each one based on the use case and workload. For example, an operator can dedicate more resources to Druid’s ingestion process while giving fewer resources to Druid’s query process if the workload requires it.

Druid processes can independently fail without impacting the operations of other processes.
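
For example, the Druid documentation describes a common clustered layout that groups processes onto three server types:

- Master server: runs the Coordinator and Overlord processes, which manage data availability and ingestion tasks.
- Query server: runs the Broker (and optionally Router) processes, which accept queries and merge results.
- Data server: runs the Historical and MiddleManager processes, which store queryable data and execute ingestion work.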

For more information, please visit our docs page.

## Operations

Druid is designed to power applications that need to be up 24 hours a day, 7 days a week. As such, Druid includes several features to ensure uptime and prevent data loss.

**Data replication**

All data in Druid is replicated a configurable number of times, so single server failures have no impact on queries (see the load-rule sketch after this list).

**Independent processes**

Druid explicitly names all of its main processes, and each process can be fine-tuned based on use case. Processes can fail independently without impacting other processes. For example, if the ingestion process fails, no new data is loaded into the system, but existing data remains queryable.

**Automatic data backup**

Druid automatically backs up all indexed data to a filesystem such as HDFS. You can lose your entire Druid cluster and quickly restore it from this backed-up data.

**Rolling updates**

You can update a Druid cluster with no downtime and no impact to end users through rolling updates. All Druid releases are backwards compatible with the previous version.
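
Replication counts, for example, are controlled through Coordinator load rules. A minimal sketch of a rule asking for two replicas of every segment on the default tier (abbreviated; see the docs for the full retention-rule format):

```json
[
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 2 }
  }
]
```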

For more information, please visit our docs page.