Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

quantile (99%) aggregation with druid and hive #472

Open
datametrics opened this issue Jul 22, 2018 · 0 comments
Open

quantile (99%) aggregation with druid and hive #472

datametrics opened this issue Jul 22, 2018 · 0 comments

Comments

@datametrics
Copy link

Hi all together,

i've got a question concerning data ingestion from inside hive.
I use a query similar to this one for indexing a result view in Druid.

use LONGFIELD_DWH;
DROP TABLE IF EXISTS TestCube;
EXPLAIN CREATE TABLE TestCube
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
"druid.segment.granularity" = "MONTH",
"druid.query.granularity" = "DAY")
AS
SELECT
cast(DimDate.Name as timestamp with local time zone) as __time,
cast(DimInstrument.Name as string) InsName,
FactTable.TestMeasure * p.Quantity as TestMeasure,
cast(DimInstrument.InstrumentType as string) InstrumentType,
cast(FactTable.IdScenario as string) Scenario,
cast(p.Portfolio as string) Portfolio,
p.Quantity as Quantity
FROM FactTable
JOIN DimDate on (FactTable.IdDate = DimDate.Id)
JOIN DimInstrument on (FactTable.IdInstrument = DimInstrument.Id)
JOIN DimScenario on (FactTable.IdScenario = DimScenario.Id)
JOIN (select pos.Name, pos.IdInstrument, pos.IdPortfolio,pf.Name as Portfolio,
pos.Quantity from FactPosition pos
join DimInstrument ins on ins.Id = pos.IdInstrument
join DimPortfolio pf on pf.Id = pos.IdPortfolio) p
on DimInstrument.Id = p.IdInstrument;

That works without any issues. The datasource gets published in druid and is visible
in superset after querying datasource metadata. A segment is created for every day
in the facttable.

As part of my analysis i would like to build a postaggregated quantile measure like
described on the project documentation page:

At first i create a new metric in superset "HISTOGRAM" of type "approxHistogramFold" with this JSON:

{
"type" : "approxHistogramFold",
"name" : "HISTOGRAM",
"fieldName" : "sum__testmeasure",
"resolution" : 500,
"numBuckets" : 500,
"lowerLimit" : 0.0
}

Next, i create another metric in superset "Response_99" (type = "postagg")that should display the
fifth-worst (i.e. the 99% quantile) item from the list of 500 ordered testmeasure items
from the approxhistogram measure (JSON):

{
"type": "quantile",
"name": "HISTOGRAM",
"fieldName": "HISTOGRAM",
"probability": "0.99"
}

That measure combination runs but the output is "na". As far as I understand the documentation the field
of the first aggregator for "testmeasure" has to be ingested with a specific metric-spec / ingestion
aggregator similar to this:

http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html

Does anyone know how to incorporate this into the ingestion query issued in HIVE?
If not, does anyone see how to mimic such quantile measures maybe without "approxHistogramFold" and
probably in a simple javascript postaggregation?

Thanks in advance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant