Releases: apache/druid

Druid 0.23.0

23 Jun 03:31

Apache Druid 0.23.0 contains over 450 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 81 contributors. See the complete set of changes for additional details.

# New Features

# Query engine

# Grouping on arrays without exploding the arrays

You can now group on a multi-value dimension as an array. For a datasource named "test":

{"timestamp": "2011-01-12T00:00:00.000Z", "tags": ["t1","t2","t3"]}  #row1
{"timestamp": "2011-01-13T00:00:00.000Z", "tags": ["t3","t4","t5"]}  #row2
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": ["t5","t6","t7"]}  #row3
{"timestamp": "2011-01-14T00:00:00.000Z", "tags": []}                #row4

The following query:

{
  "queryType": "groupBy",
  "dataSource": "test",
  "intervals": [
    "1970-01-01T00:00:00.000Z/3000-01-01T00:00:00.000Z"
  ],
  "granularity": {
    "type": "all"
  },
  "virtualColumns" : [ {
    "type" : "expression",
    "name" : "v0",
    "expression" : "mv_to_array(\"tags\")",
    "outputType" : "ARRAY<STRING>"
  } ],
  "dimensions": [
    {
      "type": "default",
      "dimension": "v0",
      "outputName": "tags",
      "outputType":"ARRAY<STRING>"
    }
  ],
  "aggregations": [
    {
      "type": "count",
      "name": "count"
    }
  ]
}

Returns the following:

[
 {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "[]"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "["t1","t2","t3"]"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 1,
      "tags": "["t3","t4","t5"]"
    }
  },
  {
    "timestamp": "1970-01-01T00:00:00.000Z",
    "event": {
      "count": 2,
      "tags": "["t5","t6","t7"]"
    }
  }
]

(#12078)
(#12253)
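
To make the difference between grouping on whole arrays and the traditional "exploded" multi-value grouping concrete, here is a minimal Python sketch of the semantics only. It does not call Druid. The rows and the helper names group_as_arrays and group_exploded are hypothetical, and rows with an empty array are simply skipped in the exploded variant for brevity.

```python
from collections import Counter

# Hypothetical rows, each holding the values of one multi-value "tags" dimension.
ROWS = [["a", "b"], ["a", "b"], ["b"], []]


def group_as_arrays(rows):
    """Count rows per whole array, as when grouping on mv_to_array("tags")."""
    return Counter(tuple(tags) for tags in rows)


def group_exploded(rows):
    """Count rows per individual value, as when grouping on "tags" directly.

    Simplification: rows with an empty array contribute nothing here.
    """
    return Counter(tag for tags in rows for tag in tags)


if __name__ == "__main__":
    print(dict(group_as_arrays(ROWS)))
    print(dict(group_exploded(ROWS)))
```

Array grouping keeps one group per distinct array (including the empty array), while exploded grouping produces one group per distinct value.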

# Specify a column other than __time column for row comparison in first/last aggregators

You can pass a time column to the *first/*last aggregators by using the LATEST_BY and EARLIEST_BY SQL functions. This supports cases where the time is stored in a column other than "__time". You can also specify another logical time column.
(#11949)
(#12145)
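
As a rough sketch of how these functions might be used, the Python snippet below builds a request body for the Druid SQL API. The table name "events" and the columns "price" and "updated_at" are hypothetical, and identifier quoting is kept naive on purpose.

```python
import json


def latest_by_payload(table, value_col, time_col):
    """Build a SQL API request body that uses LATEST_BY and EARLIEST_BY.

    All identifiers are hypothetical placeholders supplied by the caller.
    """
    sql = (
        f'SELECT LATEST_BY("{value_col}", "{time_col}") AS latest_value, '
        f'EARLIEST_BY("{value_col}", "{time_col}") AS earliest_value '
        f'FROM "{table}"'
    )
    return {"query": sql}


if __name__ == "__main__":
    # The body would be POSTed to the SQL endpoint (/druid/v2/sql).
    print(json.dumps(latest_by_payload("events", "price", "updated_at"), indent=2))
```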

# Improvements to querying user experience

This release includes several improvements for querying:

  • Added the SQL query ID to the response header for failed SQL queries to aid in locating the error messages (#11756)
  • Added input type validation for DataSketches HLL (#12131)
  • Improved JDBC logging (#11676)
  • Added SQL functions MV_FILTER_ONLY and MV_FILTER_NONE to filter rows of multi-value string dimensions to include only the supplied list of values or none of them respectively (#11650)
  • Added ARRAY_CONCAT_AGG to aggregate array inputs together into a single array (#12226)
  • Added the ability to authorize the usage of query context parameters (#12396)
  • Improved query IDs to make it easier to link queries and sub-queries for end-to-end query visibility (#11809)
  • Added a safe divide function to protect against division by 0 (#11904)
  • You can now add a query context to the internally generated SegmentMetadata query (#11429)
  • Added support for Druid complex types to the native expression processing system to make all Druid data usable within expressions (#11853, #12016)
  • You can control the size of the on-heap segment-level dictionary via druid.query.groupBy.maxSelectorDictionarySize when grouping on string or array-valued expressions that do not have pre-existing dictionaries.
  • You have better protection against filter explosion during CNF conversion (#12314) (#12324)
  • You can get the complete native query when explaining a SQL query by setting useNativeQueryExplain to true in the query context (#11908)
  • You can have the Broker ignore real-time nodes or specific Historical tiers (#11766) (#11732)
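
For the useNativeQueryExplain item above, a request body might look like the following Python sketch. The table name "wikipedia" is hypothetical; only the context key comes from these notes.

```python
import json


def explain_payload(sql, native=True):
    """Wrap a SQL statement in EXPLAIN PLAN FOR and set the explain context flag."""
    return {
        "query": f"EXPLAIN PLAN FOR {sql}",
        # Context flag named in the release notes.
        "context": {"useNativeQueryExplain": native},
    }


if __name__ == "__main__":
    print(json.dumps(explain_payload('SELECT COUNT(*) FROM "wikipedia"'), indent=2))
```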

# Streaming Ingestion

# Kafka input format for parsing headers and key

We've introduced a Kafka input format so you can ingest header data in addition to the message contents. For example:

  • the event key field
  • event headers
  • the Kafka event timestamp
  • the Kafka event value that stores the payload.

(#11630)

# Kinesis ingestion - Improvements

We have made the following improvements to Kinesis ingestion:

  • Re-sharding can slow down ingestion because it creates many intermediate empty shards. These shards get assigned to tasks, which causes an imbalance in load assignment. You can set skipIgnorableShards to true in the Kinesis ingestion tuning config to ignore such shards. (#12235)
  • Currently, Kinesis ingestion uses DescribeStream to fetch the list of shards. This call is deprecated and slower. In this release, you can switch to the newer listShards API by setting useListShards to true in the Kinesis ingestion tuning config. (#12161)
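
A tuning config that turns on both options could be assembled as in this Python sketch. Only the two flag names come from these notes; the helper name is hypothetical and the rest of the supervisor spec is omitted.

```python
import json


def kinesis_tuning_config(skip_ignorable_shards=True, use_list_shards=True):
    """Return a tuningConfig fragment for a Kinesis supervisor spec."""
    return {
        "type": "kinesis",
        "skipIgnorableShards": skip_ignorable_shards,  # ignore empty intermediate shards
        "useListShards": use_list_shards,              # use listShards instead of DescribeStream
    }


if __name__ == "__main__":
    print(json.dumps(kinesis_tuning_config(), indent=2))
```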

# Native Batch Ingestion

# Multi-dimension range partitioning

Multi-dimension range partitioning allows users to partition their data on the ranges of any number of dimensions. It builds on the concepts behind "single-dim" partitioning and is now arguably the preferred secondary partitioning scheme, both for query performance and storage efficiency.
(#11848)
(#11973)
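
The idea of partitioning on ranges of several dimensions can be illustrated with tuple comparison. The Python sketch below is a conceptual model, not Druid's implementation; the boundary values and the dimensions (country, city) are hypothetical.

```python
from bisect import bisect_right

# Hypothetical partition boundaries over two dimensions: (country, city).
# Partition i holds keys that are >= BOUNDARIES[i - 1] and < BOUNDARIES[i].
BOUNDARIES = [("DE", "Munich"), ("US", "Boston")]


def partition_for(key, boundaries=BOUNDARIES):
    """Return the index of the range partition that contains the tuple `key`."""
    # Tuples compare dimension by dimension, from left to right.
    return bisect_right(boundaries, key)


if __name__ == "__main__":
    for key in [("DE", "Berlin"), ("FR", "Paris"), ("US", "Chicago")]:
        print(key, "-> partition", partition_for(key))
```

Because the second dimension takes part in the comparison, a single heavily populated value of the first dimension can still be split across several partitions.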

# Improved replace data behavior

In previous versions of Druid, if you ingested data with the dropExisting flag to replace data, Druid would retain the existing data for a time chunk if there was no new data to replace it. Now, if you set dropExisting to true in your ioConfig and ingest data for a time range that includes a time chunk with no data, Druid uses a tombstone to overshadow the existing data in the empty time chunk.
(#12137)
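
The effect of tombstones can be modeled with a small Python simulation. This is a sketch of the visible outcome only: the function name, the day-sized time chunks, and the segment labels are all hypothetical.

```python
def visible_after_replace(existing, new_data, chunks, drop_existing):
    """Return what each time chunk serves after a replace-style ingestion.

    `existing` and `new_data` map a time chunk to a segment label.
    """
    result = {}
    for chunk in chunks:
        if chunk in new_data:
            result[chunk] = new_data[chunk]   # new data overshadows old data
        elif drop_existing and chunk in existing:
            result[chunk] = "TOMBSTONE"       # empty chunk now hides the old data
        elif chunk in existing:
            result[chunk] = existing[chunk]   # old data stays visible
    return result


if __name__ == "__main__":
    existing = {"2022-01-01": "old-1", "2022-01-02": "old-2"}
    new_data = {"2022-01-01": "new-1"}
    chunks = ["2022-01-01", "2022-01-02"]
    print(visible_after_replace(existing, new_data, chunks, drop_existing=True))
```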

This release includes several improvements for native batch ingestion:

  • Druid now emits a new metric when a batch task finishes waiting for segment availability. (#11090)
  • Added segmentAvailabilityWaitTimeMs, the duration in milliseconds that a task waited for its segments to be handed off to Historical nodes, to IngestionStatsAndErrorsTaskReportData (#11090)
  • Added functionality to preserve existing metrics during ingestion (#12185)
  • Parallel native batch task can now provide task reports for the sequential and single phase mode (e.g., used with dynamic partitioning) as well as single phase mode subtasks (#11688)
  • Added support for RowStats in druid/indexer/v1/task/{task_id}/reports API for multi-phase parallel indexing task (#12280)
  • Fixed the OOM failures in the dimension distribution phase of parallel indexing (#12331)
  • Added support to handle null dimension values while creating partition boundaries (#11973)

# Improvements to ingestion in general

This release includes several improvements for ingestion in general:

  • Removed the template modifier from IncrementalIndex<AggregatorType> because it is no longer required
  • You can now use JsonPath functions in JsonPath expressions during ingestion (#11722)
  • Druid no longer creates a materialized list of segment files and eliminated looping over the files to reduce OOM issues (#11903)
  • Added an intermediate-persist IndexSpec to the main "merge" method in IndexMerger (#11940)
  • Granularity.granularitiesFinerThan now returns ALL if you pass in ALL (#12003)
  • Added a configuration parameter for appending tasks to allow them to use a SHARED lock (#12041)
  • SchemaRegistryBasedAvroBytesDecoder now throws a ParseException instead of RE when it fails to retrieve a schema (#12080)
  • Added includeAllDimensions to dimensionsSpec to put all explicit dimensions first in InputRow and subsequently any other dimensions found in input data (#12276)
  • Added the ability to store null columns in segments (#12279)

# Compaction

This release includes several improvements for compaction:

  • Automatic compaction now supports complex dimensions (#11924)
  • Automatic compaction now supports overlapping segment in...

druid-0.22.1

11 Dec 09:24

Apache Druid 0.22.1 is a bug fix release that fixes some security issues. See the complete set of changes for additional details.

# Bug fixes

#12051 Update log4j to 2.15.0 to address CVE-2021-44228
#11787 JsonConfigurator no longer logs sensitive properties
#11786 Update axios to 0.21.4 to address CVE-2021-3749
#11844 Update netty4 to 4.1.68 to address CVE-2021-37136 and CVE-2021-37137

# Credits

Thanks to everyone who contributed to this release!

@abhishekagarwal87
@andreacyc
@clintropolis
@gianm
@jihoonson
@kfaraz
@xvrl

druid-0.22.0

22 Sep 22:24

Apache Druid 0.22.0 contains over 400 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 73 contributors. See the complete set of changes for additional details.

# New features

# Query engine

# Support for multiple distinct aggregators in same query

Druid now supports multiple exact DISTINCT counts in the same query by using the grouping aggregator typically used with grouping sets. This applies only to exact counts, that is, when druid.sql.planner.useApproximateCountDistinct is false. Enable it by setting druid.sql.planner.useGroupingSetForExactDistinct to true.

#11014

# SQL ARRAY_AGG and STRING_AGG aggregator functions

The ARRAY_AGG aggregation function has been added, to allow accumulating values or distinct values of a column into a single array result. This release also adds STRING_AGG, which is similar to ARRAY_AGG, except it joins the array values into a single string with a supplied 'delimiter' and it ignores null values. Both of these functions accept a maximum size parameter to control maximum result size, and will fail if this value is exceeded. See SQL documentation for additional details.

#11157
#11241
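
The difference between the two functions can be modeled in a few lines of Python. This sketch mirrors only the behavior described above (accumulating values or distinct values, and skipping nulls when joining); the maximum size parameter and the result for all-null input are not modeled.

```python
def array_agg(values, distinct=False):
    """Collect values into one list, optionally keeping only first occurrences."""
    result = []
    for value in values:
        if distinct and value in result:
            continue
        result.append(value)
    return result


def string_agg(values, delimiter):
    """Join the non-null values into one string, skipping None entries."""
    return delimiter.join(str(v) for v in values if v is not None)


if __name__ == "__main__":
    print(array_agg(["a", "b", "a"], distinct=True))
    print(string_agg(["a", None, "b"], ","))
```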

# Bitwise math function expressions and aggregators

Several new SQL functions for performing 'bitwise' math (along with corresponding native expressions) have been added, including BITWISE_AND, BITWISE_OR, BITWISE_XOR and so on. Additionally, aggregation functions BIT_AND, BIT_OR, and BIT_XOR have been added to accumulate values in a column with the corresponding bitwise function. For complete details see SQL documentation.

#10605
#10823
#11280
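
What the three aggregators accumulate can be sketched in Python with a reduction over a column of integers. The helper names are hypothetical and null handling is not modeled.

```python
from functools import reduce
from operator import and_, or_, xor


def bit_and(values):
    """Accumulate a column with bitwise AND."""
    return reduce(and_, values)


def bit_or(values):
    """Accumulate a column with bitwise OR."""
    return reduce(or_, values)


def bit_xor(values):
    """Accumulate a column with bitwise XOR."""
    return reduce(xor, values)


if __name__ == "__main__":
    column = [12, 10, 15]
    print(bit_and(column), bit_or(column), bit_xor(column))
```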

# Human readable number format functions

Three new SQL and native expression number format functions have been added in Druid 0.22.0, HUMAN_READABLE_BINARY_BYTE_FORMAT, HUMAN_READABLE_DECIMAL_BYTE_FORMAT, and HUMAN_READABLE_DECIMAL_FORMAT, which allow transforming results into a more friendly consumption format for query results. For more information see SQL documentation.

#10584
#10635

# Expression aggregator

Druid 0.22.0 adds a new 'native' JSON query expression aggregator function, that lets you use Druid native expressions to perform "fold" (alternatively known as "reduce") operations to accumulate some value on any number of input columns. This adds significant flexibility to what can be done in a Druid aggregator, similar in a lot of ways to what was possible with the Javascript aggregator, but in a much safer, sandboxed manner.

Expressions now being able to perform a "fold" on input columns also really rounds out the abilities of native expressions in addition to the previously possible "map" (expression virtual columns), "filter" (expression filters) and post-transform (expression post-aggregators) functions.

Since this uses expressions, performance is not yet optimal, and it is not directly documented yet, but it is the underlying technology behind the SQL ARRAY_AGG, STRING_AGG, and bitwise aggregator functions also added in this release.

#11104
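
The "fold" idea can be sketched in Python as a reduction per segment followed by a merge of the partial results. This is a conceptual model only: the helper names are hypothetical, and the two-step shape (fold, then combine) is an assumption about how partial results are merged, not a description of Druid's internals.

```python
from functools import reduce


def fold_rows(rows, initial, fold):
    """Fold the rows of one segment into a partial result."""
    return reduce(fold, rows, initial)


def combine_partials(partials, initial, combine):
    """Merge per-segment partial results into the final value."""
    return reduce(combine, partials, initial)


def sum_of_products(acc, row):
    """Example fold over two input columns: acc + x * y."""
    x, y = row
    return acc + x * y


def add(acc, partial):
    """Example combine step: add two partial sums."""
    return acc + partial


if __name__ == "__main__":
    segment_a = [(1, 2), (3, 4)]
    segment_b = [(5, 6)]
    partials = [fold_rows(s, 0, sum_of_products) for s in (segment_a, segment_b)]
    print(combine_partials(partials, 0, add))
```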

# SQL query routing improvements

Druid 0.22 adds some new facilities to provide extension writers with enhanced control over how queries are routed between Druid routers and brokers. The first adds a new manual broker selection strategy to the Druid router, which allows a query to manually specify which Druid brokers a query should be sent to based on a query context parameter brokerService to any broker pool defined in druid.router.tierToBrokerMap (this corresponds to the 'service name' of the broker set, druid.service).

The second new feature allows the Druid router to parse and examine SQL queries so that broker selection strategies can also function for SQL queries. This can be enabled by setting druid.router.sql.enable to true. This does not affect JDBC queries, which use a different mechanism to facilitate "sticky" connections to a single broker.

#11566
#11495
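
The manual broker selection described above can be approximated by the following Python sketch. The tier and service names are hypothetical, and falling back to a default broker when the requested service is unknown is an assumption made for illustration.

```python
def select_broker(query_context, tier_to_broker_map, default_broker):
    """Pick a broker service from the brokerService query context parameter.

    Assumption: an unknown or missing brokerService falls back to default_broker.
    """
    requested = query_context.get("brokerService")
    if requested in tier_to_broker_map.values():
        return requested
    return default_broker


if __name__ == "__main__":
    tiers = {"hot": "druid/broker-hot", "_default_tier": "druid/broker"}
    print(select_broker({"brokerService": "druid/broker-hot"}, tiers, "druid/broker"))
```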

# Avatica protobuf JDBC Support

Druid now supports using Avatica Protobuf JDBC connections, such as for use with the Avatica Golang Driver, and has a separate endpoint from the JSON JDBC uri.

String url = "jdbc:avatica:remote:url=http://localhost:8082/druid/v2/sql/avatica-protobuf/;serialization=protobuf";

#10543

# Improved query error logging

Query exceptions have been changed from WARN level to ERROR level to include additional information in the logs to help troubleshoot query failures. Additionally, a new query context flag, enableQueryDebugging has been added that will include stack traces in these query error logs, to provide even more information without the need to enable logs at the DEBUG level.

#11519

# Streaming Ingestion

# Task autoscaling for Kafka and Kinesis streaming ingestion

Druid 0.22.0 now offers experimental support for dynamic Kafka and Kinesis task scaling. The included strategies are driven by periodic measurement of stream lag (which is based on message count for Kafka, and difference of age between the message iterator and the oldest message for Kinesis), and will adjust the number of tasks based on the amount of 'lag' and several configuration parameters. See Kafka and Kinesis documentation for complete information.

#10524
#10985

# Avro and Protobuf streaming InputFormat and Confluent Schema Registry Support

Druid streaming ingestion now has support for Avro and Protobuf in the updated InputFormat specification format, which replaces the deprecated firehose/parser specification used by legacy Druid streaming formats. Alongside this, comes support for obtaining schemas for these formats from Confluent Schema Registry. See data formats documentation for further information.

#11040
#11018
#10314
#10839

# Kafka ingestion support for specifying group.id

Druid Kafka streaming ingestion now optionally supports specifying group.id on the connections Druid tasks make to the Kafka brokers. This is useful for accessing clusters which require this be set as part of authorization, and can be specified in the consumerProperties section of the Kafka supervisor spec. See Kafka ingestion documentation for more details.

#11147

# Native Batch Ingestion

# Support for using deep storage for intermediary shuffle data

Druid native 'perfect rollup' 2-phase ingestion tasks now support using deep storage as a shuffle location, as an alternative to local disks on middle-managers or indexers. To use this feature, set druid.processing.intermediaryData.storage.type to deepstore, which uses the configured deep storage type.

Note: With the "deepstore" type, data is stored in the shuffle-data directory under the configured deep storage path. Automatic cleanup of this directory is not supported yet, but you can set up cloud storage lifecycle rules to clean up data at the shuffle-data prefix location.

#11507

# Improved native batch ingestion task memory usage

Druid native batch ingestion has received a new configuration option, druid.indexer.task.batchProcessingMode which introduces two new operating modes that should allow batch ingestion to operate with a smaller and more predictable heap memory usage footprint. The CLOSED_SEGMENTS_SINKS mode is the most aggressive, and should have the smallest memory footprint, and works by eliminating in memory tracking and mmap of intermediary segments produced during segment creation, but isn't super well tested at this point so considered experimental...


druid-0.21.1

10 Jun 23:14

Apache Druid 0.21.1 is a bug fix release that fixes a few regressions with the 0.21 release. The first is an issue with the published Docker image, which causes containers to fail to start due to volume permission issues, described in #11166 and fixed in #11167. This release also fixes an issue caused by a bug in the upgraded Jetty version which was released in 0.21, described in #11206 and fixed in #11207. Finally, a web console regression related to field validation has been fixed in #11228.

# Bug fixes

#11167 fix docker volume permissions
#11207 Upgrade jetty version
#11228 Web console: Fix required field treatment
#11299 Fix permission problems in docker

# Credits

Thanks to everyone who contributed to this release!

@a2l007
@clintropolis
@FrankChen021
@maytasm
@vogievetsky

druid-0.21.0

28 Apr 00:26

Apache Druid 0.21.0 contains around 120 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 36 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.

# New features

# Operation

# Service discovery and leader election based on Kubernetes

The new Kubernetes extension supports service discovery and leader election based on Kubernetes. This extension works in conjunction with the HTTP-based server view (druid.serverview.type=http) and task management (druid.indexer.runner.type=httpRemote) to allow you to run a Druid cluster with zero ZooKeeper dependencies. This extension is still experimental. See Kubernetes extension for more details.

#10544
#9507
#10537

# New dynamic coordinator configuration to limit the number of segments when finding a candidate segment for segment balancing

You can set the percentOfSegmentsToConsiderPerMove to limit the number of segments considered when picking a candidate segment to move. The candidates are searched up to maxSegmentsToMove * 2 times. This new configuration prevents Druid from iterating through all available segments to speed up the segment balancing process, especially if you have lots of available segments in your cluster. See Coordinator dynamic configuration for more details.

#10284

# status and selfDiscovered endpoints for Indexers

The Indexer now supports status and selfDiscovered endpoints. See Processor information APIs for details.

#10679

# Querying

# New grouping aggregator function

You can use the new grouping aggregator SQL function with GROUPING SETS or CUBE to indicate which grouping dimensions are included in the current grouping set. See Aggregation functions for more details.

#10518

# Improved missing argument handling in expressions and functions

Expression processing can now be vectorized when inputs are missing, for example when a column does not exist. When an argument is missing in an expression, Druid can now infer the proper type of result based on non-null arguments. For instance, for longColumn + nonExistentColumn, nonExistentColumn is treated as (long) 0 instead of (double) 0.0. Finally, in default null handling mode, math functions can produce output properly by treating missing arguments as zeros.

#10499

# Allow zero period for TIMESTAMPADD

TIMESTAMPADD function now allows zero period. This functionality is required for some BI tools such as Tableau.

#10550

# Ingestion

# Native parallel ingestion no longer requires explicit intervals

Parallel task no longer requires you to set explicit intervals in granularitySpec. If intervals are missing, the parallel task executes an extra step for input sampling which collects the intervals to index.

#10592
#10647

# Old Kafka version support

Druid now supports Apache Kafka older than 0.11. To read from an old version of Kafka, set the isolation.level to read_uncommitted in consumerProperties. Only version 0.10.2.1 has been tested up until this release. See Kafka supervisor configurations for details.

#10551

# Multi-phase segment merge for native batch ingestion

A new tuningConfig, maxColumnsToMerge, controls how many segments can be merged at the same time in the task. This configuration can be useful to avoid high memory pressure during the merge. See tuningConfig for native batch ingestion for more details.

#10689

# Native re-ingestion is less memory intensive

Parallel tasks now sort segments by ID before assigning them to subtasks. This sorting minimizes the number of time chunks for each subtask to handle. As a result, each subtask is expected to use less memory, especially when a single Parallel task is issued to re-ingest segments covering a long time period.

#10646

# Web console

# Updated and improved web console styles

The new web console styles make better use of the Druid brand colors and standardize paddings and margins throughout. The icon and background colors are now derived from the Druid logo.

#10515

# Partitioning information is available in the web console

The web console now shows datasource partitioning information on the new Segment granularity and Partitioning columns.

Segment granularity column in the Datasources tab

Partitioning column in the Segments tab

#10533

# The column order in the Schema table matches the dimensionsSpec

The Schema table now reflects the dimension ordering in the dimensionsSpec.

#10588

# Metrics

# Coordinator duty runtime metrics

The Coordinator performs several 'duty' tasks, such as segment balancing and loading new segments. Two new metrics help you analyze how fast the Coordinator is executing these duties.

  • coordinator/time: the time for an individual duty to execute
  • coordinator/global/time: the time for the whole duties runnable to execute

#10603

# Query timeout metric

A new metric provides the number of timed out queries. Previously timed out queries were treated as interrupted and included in the query/interrupted/count (see Changed HTTP status codes for query errors for more details).

  • query/timeout/count: the number of timed out queries during the emission period

#10567

# Shuffle metrics for batch ingestion

Two new metrics provide shuffle statistics for MiddleManagers and Indexers. These metrics have the supervisorTaskId as their dimension.

  • ingest/shuffle/bytes: number of bytes shuffled per emission period
  • ingest/shuffle/requests: number of shuffle requests per emission period

To enable the shuffle metrics, add org.apache.druid.indexing.worker.shuffle.ShuffleMonitor in druid.monitoring.monitors. See Shuffle metrics for more details.

#10359

# New clock-drift safe metrics monitor scheduler

The default metrics monitor scheduler is implemented based on ScheduledThreadPoolExecutor which is prone to unbounded clock drift. A new monitor scheduler, ClockDriftSafeMonitorScheduler, overcomes this limitation. To use the new scheduler, set druid.monitoring.schedulerClassName to org.apache.druid.java.util.metrics.ClockDriftSafeMonitorScheduler in the runtime.properties file.

#10448
#10732

# Others

# New extension for a password p...


druid-0.20.2

29 Mar 19:00

Apache Druid 0.20.2 introduces new configurations to address CVE-2021-26919: Authenticated users can execute arbitrary code from malicious MySQL database systems. We recommend enabling the new configurations below to mitigate vulnerable JDBC connection properties. These configurations apply to all JDBC connections for ingestion and lookups, but not to the metadata store. See security configurations for more details.

  • druid.access.jdbc.enforceAllowedProperties: When true, Druid applies druid.access.jdbc.allowedProperties to JDBC connections starting with jdbc:postgresql: or jdbc:mysql:. When false, Druid allows any kind of JDBC connections without JDBC property validation. This config is set to false by default to not break rolling upgrade. This config is deprecated now and can be removed in a future release. The allow list will be always enforced in that case.
  • druid.access.jdbc.allowedProperties: Defines a list of allowed JDBC properties. Druid always enforces the list for all JDBC connections starting with jdbc:postgresql: or jdbc:mysql: if druid.access.jdbc.enforceAllowedProperties is set to true. This option is tested against MySQL connector 5.1.48 and PostgreSQL connector 42.2.14. Other connector versions might not work.
  • druid.access.jdbc.allowUnknownJdbcUrlFormat: When false, Druid only accepts JDBC connections starting with jdbc:postgresql: or jdbc:mysql:. When true, Druid allows JDBC connections to any kind of database, but only enforces druid.access.jdbc.allowedProperties for PostgreSQL and MySQL.
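
How the three settings interact can be summarized by the Python sketch below. It restates the rules listed above as a single decision function; the function name, the argument names, and the sample property names are hypothetical, and parsing properties out of a real JDBC URL is left out.

```python
KNOWN_PREFIXES = ("jdbc:postgresql:", "jdbc:mysql:")


def is_connection_allowed(url, properties, *, enforce_allowed_properties,
                          allowed_properties, allow_unknown_jdbc_url_format):
    """Decide whether a JDBC connection passes the checks described above."""
    if url.startswith(KNOWN_PREFIXES):
        if not enforce_allowed_properties:
            return True  # no property validation at all
        # Every property used by the connection must be on the allow list.
        return set(properties) <= set(allowed_properties)
    # Any other URL format is accepted only if unknown formats are allowed,
    # and its properties are not validated.
    return allow_unknown_jdbc_url_format


if __name__ == "__main__":
    print(is_connection_allowed(
        "jdbc:mysql://db:3306/lookups", ["useSSL", "autoDeserialize"],
        enforce_allowed_properties=True,
        allowed_properties=["useSSL"],
        allow_unknown_jdbc_url_format=False,
    ))
```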

druid-0.20.1

29 Jan 17:34

Apache Druid 0.20.1 is a bug fix release that addresses CVE-2021-25646: Authenticated users can override system configurations in their requests which allows them to execute arbitrary code.

# Known issues

# Incorrect Druid version in docker-compose.yml

The Druid version is specified as 0.20.0 in the docker-compose.yml file. We recommend updating the version to 0.20.1 before you run a Druid cluster using docker compose.

druid-0.20.0

17 Oct 01:08

Apache Druid 0.20.0 contains around 160 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 36 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.

# New Features

# Ingestion

# Combining InputSource

A new combining InputSource has been added, allowing the user to combine multiple input sources during ingestion. Please see https://druid.apache.org/docs/0.20.0/ingestion/native-batch.html#combining-input-source for more details.

#10387

# Automatically determine numShards for parallel ingestion hash partitioning

When hash partitioning is used in parallel batch ingestion, it is no longer necessary to specify numShards in the partition spec. Druid can now automatically determine a number of shards by scanning the data in a new ingestion phase that determines the cardinalities of the partitioning key.

#10419

# Subtask file count limits for parallel batch ingestion

The size-based splitHintSpec now supports a new maxNumFiles parameter, which limits how many files can be assigned to individual subtasks in parallel batch ingestion.

The segment-based splitHintSpec used for reingesting data from existing Druid segments also has a new maxNumSegments parameter which functions similarly.

Please see https://druid.apache.org/docs/0.20.0/ingestion/native-batch.html#split-hint-spec for more details.

#10243
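
The effect of a file count limit on split planning can be illustrated with a greedy Python sketch. The greedy packing strategy and the names plan_splits, max_split_size and max_num_files are assumptions for illustration and are not taken from Druid's implementation.

```python
def plan_splits(file_sizes, max_split_size, max_num_files):
    """Greedily pack files into splits bounded by total size and by file count."""
    splits, current, current_size = [], [], 0
    for size in file_sizes:
        too_big = bool(current) and current_size + size > max_split_size
        too_many = len(current) >= max_num_files
        if too_big or too_many:
            splits.append(current)        # close the current split
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


if __name__ == "__main__":
    # Five small files: the size limit alone would keep them in one split.
    print(plan_splits([10, 10, 10, 10, 10], max_split_size=100, max_num_files=2))
```

Without the file count limit, many tiny files would all land in one subtask; with it, they are spread across several.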

# Task slot usage metrics

New task slot usage metrics have been added. Please see the entries for the taskSlot metrics at https://druid.apache.org/docs/0.20.0/operations/metrics.html#indexing-service for more details.

#10379

# Compaction

# Support for all partitioning schemes for auto-compaction

A partitioning spec can now be defined for auto-compaction, allowing users to repartition their data at compaction time. Please see the documentation for the new partitionsSpec property in the compaction tuningConfig for more details:

https://druid.apache.org/docs/0.20.0/configuration/index.html#compaction-tuningconfig

#10307

# Auto-compaction status API

A new coordinator API which shows the status of auto-compaction for a datasource has been added. The new API shows whether auto-compaction is enabled for a datasource, and a summary of how far compaction has progressed.

The web console has also been updated to show this information:

https://user-images.githubusercontent.com/177816/94326243-9d07e780-ff57-11ea-9f80-256fa08580f0.png

Please see https://druid.apache.org/docs/latest/operations/api-reference.html#compaction-status for details on the new API, and https://druid.apache.org/docs/latest/operations/metrics.html#coordination for information on new related compaction metrics.

#10371
#10438

# Querying

# Query segment pruning with hash partitioning

Druid now supports query-time segment pruning (excluding certain segments as read candidates for a query) for hash partitioned segments. This optimization applies when all of the partitionDimensions specified in the hash partition spec during ingestion time are present in the filter set of a query, and the filters in the query filter on discrete values of the partitionDimensions (e.g., selector filters). Segment pruning with hash partitioning is not supported with non-discrete filters such as bound filters.

For existing users with existing segments, you will need to reingest those segments to take advantage of this new feature, as the segment pruning requires a partitionFunction to be stored together with the segments, which does not exist in segments created by older versions of Druid. It is not necessary to specify the partitionFunction explicitly, as the default is the same partition function that was used in prior versions of Druid.

Note that segments created with a default partitionDimensions value (partition by all dimensions + the time column) cannot be pruned in this manner; the segments need to be created with an explicit partitionDimensions.

#9810
#10288
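
Why discrete filters on every partition dimension allow pruning can be seen in this Python sketch. It is a conceptual model: zlib.crc32 stands in for the real partition function, and the helper names and sample values are hypothetical.

```python
import zlib


def shard_for(partition_values, num_shards):
    """Map the values of the partition dimensions to a shard number.

    crc32 is a stand-in hash used only for illustration.
    """
    key = "\x00".join(partition_values).encode("utf-8")
    return zlib.crc32(key) % num_shards


def shards_to_read(filter_values, num_shards):
    """With an exact-match filter on every partition dimension, one shard suffices."""
    return {shard_for(filter_values, num_shards)}


def shards_without_pruning(num_shards):
    """Without pruning (for example with a bound filter), every shard is a candidate."""
    return set(range(num_shards))


if __name__ == "__main__":
    print(len(shards_to_read(["US", "web"], 8)), "of", len(shards_without_pruning(8)))
```

The same hash must be known at query time, which is why the partition function has to be stored with the segments.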

# Vectorization

To enable vectorization features, please set the druid.query.default.context.vectorizeVirtualColumns property to true or set the vectorize property in the query context. Please see https://druid.apache.org/docs/0.20.0/querying/query-context.html#vectorization-parameters for more information.

# Vectorization support for expression virtual columns

Expression virtual columns now have vectorization support (depending on the expressions being used), which results in a 3-5x performance improvement in some cases.

Please see https://druid.apache.org/docs/0.20.0/misc/math-expr.html#vectorization-support for details on the specific expressions that support vectorization.

#10388
#10401
#10432

# More vectorization support for aggregators

Vectorization support has been added for several aggregation types: numeric min/max aggregators, variance aggregators, ANY aggregators, and aggregators from the druid-histogram extension.

#10260 - numeric min/max
#10304 - histogram
#10338 - ANY
#10390 - variance

We've observed about a 1.3x to 1.8x performance improvement in some cases with vectorization enabled for the min, max, and ANY aggregator, and about 1.04x to 1.07x with the histogram aggregator.

# offset parameter for GroupBy and Scan queries

It is now possible to set an offset parameter for GroupBy and Scan queries, which tells Druid to skip a number of rows when returning results. Please see https://druid.apache.org/docs/0.20.0/querying/limitspec.html and https://druid.apache.org/docs/0.20.0/querying/scan-query.html for details.

#10235
#10233
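
As a sketch of how the offset might be used for paging, the Python snippet below builds a Scan query with limit and offset and models the resulting row window with a list slice. The datasource name and interval are hypothetical; consult the linked documentation for the authoritative query shape.

```python
import json


def scan_query(datasource, interval, limit, offset):
    """Build a Scan query that skips `offset` rows and returns up to `limit` rows."""
    return {
        "queryType": "scan",
        "dataSource": datasource,
        "intervals": [interval],
        "limit": limit,
        "offset": offset,
    }


def page(rows, limit, offset):
    """Model of the window the query returns: skip `offset` rows, keep `limit` rows."""
    return rows[offset:offset + limit]


if __name__ == "__main__":
    print(json.dumps(scan_query("wikipedia", "2016-06-27/2016-06-28", 3, 4), indent=2))
    print(page(list(range(10)), limit=3, offset=4))
```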

# OFFSET clause for SQL queries

Druid SQL queries now support an OFFSET clause. Please see https://druid.apache.org/docs/0.20.0/querying/sql.html#offset for details.

#10279

# Substring search operators

Druid has added new substring search operators in its expression language and for SQL queries.

Please see documentation for CONTAINS_STRING and ICONTAINS_STRING string functions for Druid SQL (https://druid.apache.org/docs/0.20.0/querying/sql.html#string-functions) and documentation for contains_string and icontains_string for the Druid expression language (https://druid.apache.org/docs/0.20.0/misc/math-expr.html#string-functions).

We've observed about a 2.5x performance improvement in some cases by using these functions instead of STRPOS.

#10350

# UNION ALL operator for SQL queries

Druid SQL queries now support the UNION ALL operator, which fuses the results of multiple queries together. Please see https://druid.apache.org/docs/0.20.0/querying/sql.html#union-all for details on what query shapes are supported by this operator.

#10324

# Cluster-wide default query context settings

It is now possible to set cluster-wide default query context properties by adding a configuration of the form druid.query.override.default.context.*, with * replaced by the property name.
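
To make the property form concrete, this sketch renders runtime.properties lines from a dict of defaults. The prefix is copied from the text above and should be verified against the configuration reference for your version; the helper and the example values are hypothetical, while `timeout` and `priority` are existing query context parameters.

```python
# Prefix as given in these release notes; verify it for your Druid version.
PREFIX = "druid.query.override.default.context."


def render_default_context(defaults):
    """Render one runtime.properties line per query context key."""
    return [f"{PREFIX}{key}={value}" for key, value in sorted(defaults.items())]


# Hypothetical cluster-wide defaults, for illustration only.
lines = render_default_context({"timeout": 60000, "priority": 0})
for line in lines:
    print(line)
```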

#10208

# Other features

# Improved retention rules UI

The retention rules UI in the web console has been improved. It now provides suggestions and basic validation in the period dropdown, shows the cluster default rules, and makes editing the default rules more accessible.

#10226

# Redis cache extension enhancements

The Redis cache extension now supports Redis Cluster, selecting which database is used, connecting to password-protected servers, and period-style configurations for the `exp...

druid-0.19.0

21 Jul 10:33

Apache Druid 0.19.0 contains around 200 new features, bug fixes, performance enhancements, documentation improvements, and additional test coverage from 51 contributors. Refer to the complete list of changes and everything tagged to the milestone for further details.

# New Features

# GroupBy and Timeseries vectorized query engines enabled by default

Vectorized query engines for GroupBy and Timeseries queries were introduced in Druid 0.16 as an opt-in feature. Since then we have extensively tested these engines and feel that the time has come for these improvements to find a wider audience. Note that not all of the query engine is vectorized at this time, but with this change any query that is eligible for vectorization will be vectorized. If you encounter any problems, you can still disable this feature by setting druid.query.vectorize to false.
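
Besides the cluster-wide property, vectorization can also be switched off for a single query through the `vectorize` query context parameter, which, as far as I know, accepts "true", "false", and "force". The sketch below sets it on a native Timeseries query; the helper and the datasource name are hypothetical.

```python
import json

VALID_MODES = ("true", "false", "force")


def with_vectorize(query, mode):
    """Return a copy of a native query with the `vectorize` context key set."""
    if mode not in VALID_MODES:
        raise ValueError(f"vectorize must be one of {VALID_MODES}")
    context = dict(query.get("context", {}))
    context["vectorize"] = mode
    return {**query, "context": context}


# Hypothetical Timeseries query, for illustration only.
base_query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "intervals": ["2020-01-01/2020-02-01"],
    "granularity": "hour",
    "aggregations": [{"type": "count", "name": "rows"}],
}

non_vectorized = with_vectorize(base_query, "false")
print(json.dumps(non_vectorized, indent=2))
```

Turning it off per query first can help confirm whether a problem is related to vectorization before changing the cluster-wide setting.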

#10065

# Druid native batch support for Apache Avro Object Container Files

New in Druid 0.19.0, native batch indexing now supports Apache Avro Object Container Format encoded files, allowing batch ingestion of Avro data without needing an external Hadoop cluster. Check out the docs for more details.

#9671

# Updated Druid native batch support for SQL databases

A 'SqlInputSource' has been added in Druid 0.19.0 to work with the new native batch ingestion specifications first introduced in Druid 0.17, deprecating the 'SqlFirehose'. Like the 'SqlFirehose', it currently supports MySQL and PostgreSQL, using the drivers from those extensions. This is a relatively low-level ingestion task, and the operator must take care to ensure that the correct data is ingested, either by crafting queries so that no duplicate data is ingested for appends, or by ensuring that the entire set of data to be replaced is queried when overwriting. See the docs for more operational details.

#9449

# Apache Ranger based authorization

A new extension in Druid 0.19.0 adds an Authorizer which implements access control for Druid, backed by Apache Ranger. Please see the extension documentation (https://druid.apache.org/docs/0.19.0/development/extensions-core/druid-ranger-security.html) and Authentication and Authorization for more information on the basic facilities this extension provides.

#9579

# Alibaba Object Storage Service support

A new 'contrib' extension has been added for Alibaba Cloud Object Storage Service (OSS), which can be used both as deep storage and as a batch ingestion input source. Since this is a 'contrib' extension, it is not packaged by default in the binary distribution. Please see community extensions for more details on how to use it in your cluster.

#9898

# Ingestion worker autoscaling for Google Compute Engine

Another 'contrib' extension new in 0.19.0 adds ingestion worker autoscaling for Google Compute Engine, which allows a Druid Overlord to provision or terminate worker instances (MiddleManagers or Indexers) whenever there are pending tasks or idle workers. Unlike the Amazon Web Services ingestion autoscaling extension, which provisions and terminates instances directly without using an Auto Scaling Group, the GCE autoscaler uses Managed Instance Groups to more closely align with how operators are likely to provision their clusters in GCE. Like other 'contrib' extensions, it is not packaged by default in the binary distribution. Please see community extensions for more details on how to use it in your cluster.

#8987

# REGEXP_LIKE

A new REGEXP_LIKE function has been added to Druid SQL and native expressions. It behaves similarly to LIKE, except that it uses regular expressions for the pattern.
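
The difference between the two can be sketched with a plain-Python model. This is only an approximation: Druid evaluates Java regular expressions, Python's `re` module is used here as a stand-in, and LIKE escape characters are not modeled. To my understanding, a REGEXP_LIKE pattern may match anywhere inside the value unless it is anchored with ^ and $.

```python
import re


def like(value, pattern):
    """Model SQL LIKE: % matches any run of characters, _ matches exactly one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def regexp_like(value, pattern):
    """Model REGEXP_LIKE: true if the regex matches anywhere in the value."""
    return re.search(pattern, value) is not None


print(like("druid-0.19.0", "druid-%"))                         # prefix match with LIKE
print(regexp_like("druid-0.19.0", r"^druid-\d+\.\d+\.\d+$"))   # anchored regex
print(like("apache druid", "druid"))                           # LIKE must match the whole value
print(regexp_like("apache druid", "druid"))                    # regex may match a substring
```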

#9893

# Web console lookup management improvements

The Druid 0.19 web console also includes some useful improvements to the lookup table management interface. Creating and editing lookups is now done with a form that accepts user input, rather than a raw text editor for entering the JSON spec.

Additionally, clicking the magnifying glass icon next to a lookup will now allow displaying the first 5000 values of that lookup.

#9549
#9587

# New Coordinator per datasource 'loadstatus' API

A new Coordinator API makes it easier to determine whether the latest published segments are available for querying. It is similar to the existing Coordinator 'loadstatus' API, but is datasource specific, may specify an interval, and can optionally live refresh the metadata store snapshot to get the latest up-to-date information. Note that operators should still exercise caution when using this API to query large numbers of segments, especially if forcing a metadata refresh, as it can potentially be a 'heavy' call on large clusters.
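
As a hedged sketch, the helper below builds the request URL for this API. The path and the `forceMetadataRefresh` and `interval` parameter names are taken from my recollection of the Coordinator API reference and should be checked against the docs for your version; the Coordinator address and datasource name are hypothetical.

```python
from urllib.parse import quote, urlencode


def loadstatus_url(coordinator, datasource, interval=None, force_refresh=False):
    """Build the per-datasource loadstatus URL (path and params are assumptions)."""
    params = {"forceMetadataRefresh": str(force_refresh).lower()}
    if interval is not None:
        params["interval"] = interval
    path = f"/druid/coordinator/v1/datasources/{quote(datasource, safe='')}/loadstatus"
    return f"{coordinator}{path}?{urlencode(params)}"


# Hypothetical Coordinator address and datasource, for illustration only.
url = loadstatus_url("http://localhost:8081", "wikipedia", interval="2020-01-01/2020-02-01")
print(url)
```

Leaving `force_refresh` at False avoids the metadata store poll that the text above warns can be heavy on large clusters.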

#9965

# Native batch append support for range and hash partitioning

Part bug fix, part new feature: Druid native batch (once again) supports appending new data to existing time chunks when those time chunks were partitioned with the 'hash' or 'range' partitioning algorithms. Note that the appended segments currently only support 'dynamic' partitioning, and that after rolling back to an older version these appended segments will not be recognized by Druid. To roll back to a previous version, first compact these appended segments with the rest of the time chunk so that it has a homogeneous partitioning scheme.

#10033

# Bug fixes

Druid 0.19.0 contains 65 bug fixes; you can see the complete list here.

# Fix for batch ingested 'dynamic' partitioned segments not becoming queryable atomically

Druid 0.19.0 fixes an important query correctness issue, where 'dynamic' partitioned segments produced by a batch ingestion task were not tracking the overall number of partitions. This had the implication that when these segments came online, they did not do so as a complete set, but rather as individual segments, meaning that there would be periods of swapping where results could be queried from an incomplete partition set within a time chunk.

#10025

# Fix to allow 'hash' and 'range' partitioned segments with empty buckets to now be queryable

Prior to 0.19.0, Druid had a bug with 'hash' or 'range' partitioning: if data skew left any of the buckets 'empty' after ingestion, the partitions would never be recognized as 'complete' and so would never become queryable. Druid 0.19.0 fixes this issue by adjusting the schema of the partitioning spec. These changes to the JSON format should be backwards compatible; however, rolling back to a previous version will again make these segments unqueryable.

#10012

# Incorrect balancer behavior

A bug in Druid versions prior to 0.19.0 allowed (incorrect) Coordinator operation when druid.server.maxSize was not set. If none of the Historicals had this value set, segments would still load and would be balanced across the cluster effectively at random, regardless of which balancer strategy was actually configured. This bug has been fixed, but as a result druid.server.maxSize must now be set to the sum of the segment cache location sizes on each Historical, or else it will not load segments.
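
The required value is simple arithmetic, sketched below under the assumption that the segment cache locations are configured through `druid.segmentCache.locations` as a JSON array of objects with a `maxSize` field. The paths and sizes are hypothetical, and the helper is not part of Druid.

```python
import json


def required_max_size(segment_cache_locations_json):
    """Sum the maxSize of every configured segment cache location."""
    locations = json.loads(segment_cache_locations_json)
    return sum(int(location["maxSize"]) for location in locations)


# Hypothetical value of druid.segmentCache.locations on one Historical.
locations = (
    '[{"path": "/mnt/a/segments", "maxSize": 300000000000},'
    ' {"path": "/mnt/b/segments", "maxSize": 200000000000}]'
)

max_size = required_max_size(locations)
print(f"druid.server.maxSize={max_size}")
```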

#10070

# Upgrading to Druid 0.19.0

Please be aware of the f...

druid-0.18.1

14 May 17:32

Apache Druid 0.18.1 is a bug fix release that fixes a streaming ingestion failure with Avro, an ingestion performance issue, an upgrade issue with HLLSketch, and more. The complete list of bug fixes can be found at https://github.com/apache/druid/pulls?q=is%3Apr+milestone%3A0.18.1+label%3ABug+is%3Aclosed.

# Bug fixes

  • #9823 rolls back the new Kinesis lag metrics, as they can stall the Kinesis supervisor indefinitely with a large number of shards.
  • #9734 fixes the streaming ingestion failure that occurs when you use a data format other than CSV or JSON.
  • #9812 fixes filtering on boolean values during transformation.
  • #9723 fixes slow ingestion performance due to frequent flushes on the local file system.
  • #9751 reverts the version of datasketches-java from 1.2.0 to 1.1.0 to work around an upgrade failure with HLLSketch.
  • #9698 fixes a bug in inline subqueries with multi-value dimensions.
  • #9761 fixes a bug in CloseableIterator which can potentially lead to resource leaks in the data loader.

# Known issues

Incorrect result of nested groupBy query on Join of subqueries

A nested groupBy query can return incorrect results when it is on top of a Join of subqueries and the inner and outer groupBys have different filters. See #9866 for more details.

# Credits

Thanks to everyone who contributed to this release!

@clintropolis
@gianm
@jihoonson
@maytasm
@suneet-s
@viongpanzi
@whutjs