Context
With the Kafka architecture, scaling ingesters is easier and is expected to happen more often. Series change owners when ingesters are added. For 10 minutes (the value of -ingester.active-series-metrics-idle-timeout) both the new ingester and some old ingesters will count a series as active. This means that for this time a tenant would appear to have an increased number of active series when in reality they haven't.
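To make the double counting concrete, here's a toy Go sketch. It assumes a series whose ownership moves at t=0 and that the old ingester sees no further samples for it; activeOnOldIngester is a made-up helper for illustration, not Mimir code:

```go
package main

import (
	"fmt"
	"time"
)

const idleTimeout = 10 * time.Minute // -ingester.active-series-metrics-idle-timeout

// activeOnOldIngester reports whether the old owner still counts the series
// as active t after ownership moved (it sees no new samples for it).
func activeOnOldIngester(sinceMove time.Duration) bool {
	return sinceMove < idleTimeout
}

func main() {
	for _, t := range []time.Duration{0, 5 * time.Minute, 11 * time.Minute} {
		total := 1 // the new owner counts the series as active
		if activeOnOldIngester(t) {
			total++ // the old owner still counts it too
		}
		fmt.Printf("t=%v: tenant appears to have %d active series (really 1)\n", t, total)
	}
}
```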
Proposal
Use owned series when calculating active series stats. This means the ingester will reevaluate what it considers active series every time the ring changes. This will show as changes in the values of the following metrics:
- cortex_ingester_active_series
- cortex_ingester_active_series_custom_tracker
- cortex_ingester_active_native_histogram_series
- cortex_ingester_active_native_histogram_series_custom_tracker
- cortex_ingester_active_native_histogram_buckets_custom_tracker
as well as in the /api/v1/cardinality/active_series endpoint.
Proposed implementation
- Extend Head.ForEachSecondaryHash() in mimir-prometheus to return the ref of the series (storage.SeriesRef) as well as the secondary hash.
- Add ActiveSeries.Delete(r storage.SeriesRef), which behaves very similarly to ActiveSeries.purge().
- Call ActiveSeries.Delete for every series which isn't owned anymore; the call would happen somewhere around userTSDB.recomputeOwnedSeries.
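A minimal Go sketch of how these pieces could fit together. SeriesRef, ActiveSeries, forEachSecondaryHash, and the owned check are all simplified stand-ins for the real Mimir/mimir-prometheus APIs, not their actual signatures:

```go
package main

import "fmt"

// SeriesRef stands in for storage.SeriesRef.
type SeriesRef uint64

// ActiveSeries is a simplified stand-in for the ingester's active series tracker.
type ActiveSeries struct {
	entries map[SeriesRef]int64 // ref -> last-seen timestamp (nanos)
}

// Delete removes a single series from the tracker; it mirrors what purge()
// does for idle series, but keyed by series ref instead of idle time.
func (a *ActiveSeries) Delete(ref SeriesRef) {
	delete(a.entries, ref)
}

// forEachSecondaryHash mimics the proposed Head.ForEachSecondaryHash() change:
// the callback receives the series ref alongside the secondary hash.
func forEachSecondaryHash(head map[SeriesRef]uint32, fn func(ref SeriesRef, hash uint32)) {
	for ref, hash := range head {
		fn(ref, hash)
	}
}

func main() {
	head := map[SeriesRef]uint32{1: 100, 2: 200, 3: 300}
	active := &ActiveSeries{entries: map[SeriesRef]int64{1: 0, 2: 0, 3: 0}}

	// owned stands in for the ring-based ownership check on the series'
	// secondary hash; here the series with hash 200 is no longer owned.
	owned := func(hash uint32) bool { return hash != 200 }

	// Roughly what userTSDB.recomputeOwnedSeries would do after a ring change:
	// drop every series we no longer own from the active series tracker.
	forEachSecondaryHash(head, func(ref SeriesRef, hash uint32) {
		if !owned(hash) {
			active.Delete(ref)
		}
	})

	fmt.Println("active series left:", len(active.entries)) // prints 2
}
```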
There was some discussion in Slack with @pstibrany and @pr00se. I'm leaving the gist of it here. There were two concerns:
1. The proposed implementation exceeds the concerns of the owned series service. The alternative was to let the active series service reach back out to the userTSDB and check whether a series is owned each time it does a purge, but that creates extra indirection and overhead on every series purge (see the sketch below).
2. A concern about the performance of doing the active series check during the owned series update: doing too much work in band.
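For context on concern 1, here is a hypothetical shape of that rejected alternative, reusing the SeriesRef and ActiveSeries stand-ins from the sketch above. The ownershipChecker interface is invented for illustration; it is not an existing Mimir API:

```go
// ownershipChecker is a hypothetical callback into the TSDB's ownership state.
type ownershipChecker interface {
	OwnsSeries(ref SeriesRef) bool
}

// purge drops series that are idle or no longer owned. The extra OwnsSeries
// call on every visited series is the indirection/overhead concern above.
func (a *ActiveSeries) purge(nowNanos, idleTimeoutNanos int64, owner ownershipChecker) {
	for ref, lastSeen := range a.entries {
		if nowNanos-lastSeen > idleTimeoutNanos || !owner.OwnsSeries(ref) {
			delete(a.entries, ref)
		}
	}
}
```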
I ran an experiment, triggering node scaling three times. The results:
- Head compactions and new users don't trigger the active series recalculation, so the overhead is 0 most of the time. For the past 24h there were no triggers of that logic because we didn't change limits or add ingesters.
- Details: the reason is that the active series code is triggered only when a series is in the head but is no longer owned. After a compaction, all of the series in the recomputeOwnedSeries loop are in the head, but all of them are also owned, so there is no need to update the active series.
- Changing shuffle shards or adding more ingesters does trigger it (I manually triggered this three times by adding ingesters):
  - Absolute CPU and latency aren't affected, especially write latency.
  - Total duration of the recalculation is higher: 500-700ms vs 300-500ms. See the screenshot below for the CPU profile of zone-a from the first recalculation.
Early compaction
Early compaction works on the premise that if active series is way lower than in-memory series, then compacting the head up to 20 minutes ago will actually remove some series from memory (those that are not active anymore).
However, if active series removes "disowned" series too, the above assumption will no longer be true. Early compaction may trigger because the active series count is low, but compacting the head up to 20 minutes ago may not remove any series.
That's not necessarily bad; we should just make sure that we don't attempt to run early compaction too often in such a case.
After the rollout to zone-a, early compaction happened roughly twice as often in zone-a as in the other zones, but overall zone-a is still at lower levels of early compaction than before.
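One possible safeguard along the lines suggested above is to rate limit early compaction attempts. This is only a sketch: the earlyCompactor type, the 50% threshold, and the cooldown interval are illustrative assumptions, not Mimir's actual logic or configuration:

```go
package main

import (
	"fmt"
	"time"
)

// earlyCompactor rate-limits early compaction attempts.
type earlyCompactor struct {
	lastAttempt time.Time
	minInterval time.Duration // cooldown between attempts
}

// shouldCompact applies the premise discussed above: early compaction is only
// worth attempting when active series is way lower than in-memory series, and
// the cooldown keeps a persistently low count from causing repeated attempts.
func (e *earlyCompactor) shouldCompact(now time.Time, inMemorySeries, activeSeries int) bool {
	if now.Sub(e.lastAttempt) < e.minInterval {
		return false // still cooling down from the previous attempt
	}
	// "way lower": here, active series under half of in-memory series.
	if activeSeries >= inMemorySeries/2 {
		return false
	}
	e.lastAttempt = now
	return true
}

func main() {
	e := &earlyCompactor{minInterval: 15 * time.Minute}
	now := time.Now()
	fmt.Println(e.shouldCompact(now, 1_000_000, 300_000))                    // true: attempt compaction
	fmt.Println(e.shouldCompact(now.Add(5*time.Minute), 1_000_000, 300_000)) // false: still in cooldown
}
```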