ingester: include only owned series in active series stats #7976

Closed
dimitarvdimitrov opened this issue Apr 25, 2024 · 2 comments · Fixed by #8084
Labels: component/ingester, sigyn (naming for the kafka-based Mimir architecture while it's still WIP)

Comments

@dimitarvdimitrov (Contributor)

Context

With the Kafka architecture, scaling ingesters is easier and is expected to happen more often. Series change owners when ingesters are added. For 10 minutes (the value of -ingester.active-series-metrics-idle-timeout) both the new ingester and some old ingesters count a series as active. During this window a tenant appears to have more active series than they actually do.
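
For illustration only, here is a minimal sketch (hypothetical names, not Mimir's actual code) of idle-timeout based active series tracking. It shows why an old ingester keeps counting a series as active for the length of the idle timeout after it stops receiving that series' samples:

```go
package ingester

import (
	"sync"
	"time"
)

// activeSeriesTracker is a hypothetical, simplified idle-timeout tracker:
// a series counts as active until it has received no samples for idleTimeout
// (e.g. the 10m default of -ingester.active-series-metrics-idle-timeout).
type activeSeriesTracker struct {
	mu          sync.Mutex
	lastUpdate  map[uint64]time.Time // series ref -> last sample time
	idleTimeout time.Duration
}

func newActiveSeriesTracker(idleTimeout time.Duration) *activeSeriesTracker {
	return &activeSeriesTracker{
		lastUpdate:  map[uint64]time.Time{},
		idleTimeout: idleTimeout,
	}
}

// updateSeries is called for every ingested sample of the series.
func (t *activeSeriesTracker) updateSeries(ref uint64, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastUpdate[ref] = now
}

// purge drops series that have been idle longer than idleTimeout and returns
// how many series are still counted as active. After a scale-up, the old
// ingester keeps counting a moved series until this timeout expires.
func (t *activeSeriesTracker) purge(now time.Time) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	for ref, last := range t.lastUpdate {
		if now.Sub(last) > t.idleTimeout {
			delete(t.lastUpdate, ref)
		}
	}
	return len(t.lastUpdate)
}
```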

Proposal

Use only owned series when calculating active series stats. This means the ingester will reevaluate what it considers active series every time the ring changes (a rough sketch of the idea follows the metrics list below). This will show up as changes in the values of the following metrics:

  • cortex_ingester_active_series
  • cortex_ingester_active_series_custom_tracker
  • cortex_ingester_active_native_histogram_series
  • cortex_ingester_active_native_histogram_series_custom_tracker
  • cortex_ingester_active_native_histogram_buckets_custom_tracker
  • response on the /api/v1/cardinality/active_series endpoint
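
As a rough illustration of what "owned" could mean mechanically (a sketch under assumed names and types, not the actual owned-series service): a series is owned if the ring token derived from its labels falls into one of the token ranges the ingester owns, and on a ring change the active series state is purged of series whose token is no longer owned.

```go
package ingester

// tokenRange is a hypothetical half-open [from, to) range of ring tokens
// owned by this ingester.
type tokenRange struct{ from, to uint32 }

// ownedSeriesChecker is an illustrative ownership check: a series is owned
// if the ring token of its labels falls into one of the ingester's token
// ranges. Mimir's real owned-series service is more involved.
type ownedSeriesChecker struct {
	ranges []tokenRange
}

func (c *ownedSeriesChecker) owns(seriesToken uint32) bool {
	for _, r := range c.ranges {
		if seriesToken >= r.from && seriesToken < r.to {
			return true
		}
	}
	return false
}

// purgeNotOwned drops series the ingester no longer owns from the active
// series map, so that cortex_ingester_active_series and the related metrics
// only count owned series after a ring change.
func purgeNotOwned(active map[uint64]uint32, c *ownedSeriesChecker) {
	for ref, token := range active {
		if !c.owns(token) {
			delete(active, ref)
		}
	}
}
```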

Proposed implementation

@dimitarvdimitrov (Contributor, Author)

there was some discussion in Slack with @pstibrany and @pr00se; I'm leaving the gist of it here. There were two concerns:

  1. the proposed implementation exceeds the concerns of the owned series service; the alternative was to let the active series service reach back out to the userTSDB and check whether a series is owned each time it does a purge
     • that creates extra indirection and overhead on every series purge
  2. a concern with the performance of doing the active series check during the owned series update, i.e. doing too much work in-band

     • I ran an experiment triggering node scaling 3 times; the results are:
       • head compactions and new users don't trigger the active series calculation, so the overhead is 0 most of the time
         • so for the past 24h there were no triggers of that logic, because we didn't change limits or add ingesters
         • details: the active series code is triggered only when a series is in the head but is no longer owned. After a compaction all of the series in the recomputeOwnedSeries loop are in the head, but all of them are also owned, so there is no need to update the active series (a hedged sketch of this trigger condition follows the screenshots below)
       • changing shuffle shards or adding more ingesters does trigger it (I manually triggered this three times by adding ingesters):
         • absolute CPU and latency aren't affected, especially write latency
         • the total duration of the recalculation is higher: 500-700ms vs 300-500ms; see the screenshots below for the CPU profile of zone-a from the first recalculation
(Flame)Graphs

[Three CPU profile screenshots, 2024-05-02]
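
To make the trigger condition from concern 2 concrete, here is a hedged sketch (hypothetical names, reusing the ownedSeriesChecker and purgeNotOwned types from the sketch above; not the actual recomputeOwnedSeries code). The active series update only runs when the owned-series recomputation finds head series that are no longer owned, which is why compactions and new users add essentially no overhead:

```go
// recomputeOwnedSeries is an illustrative sketch: walk the head series,
// count the ones still owned, and only touch the active series state when at
// least one head series turned out to be no longer owned. After a head
// compaction every remaining series is still owned, so the active series
// purge is skipped; it only runs after ring or shuffle-shard changes.
func recomputeOwnedSeries(headTokens map[uint64]uint32, c *ownedSeriesChecker, active map[uint64]uint32) (ownedCount int) {
	foundNotOwned := false
	for _, token := range headTokens {
		if c.owns(token) {
			ownedCount++
		} else {
			foundNotOwned = true
		}
	}
	if foundNotOwned {
		// Only pay the extra cost of updating active series in this case;
		// this is why compactions and new users add ~zero overhead.
		purgeNotOwned(active, c)
	}
	return ownedCount
}
```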

@dimitarvdimitrov (Contributor, Author)

from @pstibrany:

Early compaction
Early compaction works on the premise that if active series are way lower than in-memory series, then compacting the head up to 20 minutes ago will actually remove some series from memory (those that are not active anymore).

However, if active series drop “disowned” series too, the above assumption will no longer be true. Early compaction may trigger because active series are low, but compacting the head up to 20 minutes ago may not remove any series.

That’s not necessarily bad; we should just make sure that we don’t attempt to run early compaction too often in such a case.
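
A minimal sketch of that concern, with a hypothetical threshold and guard (not Mimir's actual early-compaction logic): the trigger compares active series against in-memory series, and a minimum interval between attempts protects against repeatedly firing a compaction that frees nothing when active series dropped only because series were disowned.

```go
package ingester

import "time"

// shouldEarlyCompact is an illustrative sketch of the early-compaction
// decision. Premise: if active series are much lower than in-memory series,
// compacting the head up to ~20 minutes ago should free memory. Because
// active series can now also drop when series are disowned (without leaving
// the head), a minimum interval between attempts guards against running the
// compaction too often when it would not remove any series.
func shouldEarlyCompact(inMemorySeries, activeSeries int, lastAttempt, now time.Time, minInterval time.Duration) bool {
	if now.Sub(lastAttempt) < minInterval {
		return false // don't attempt early compaction too often
	}
	// Hypothetical threshold: only compact early when active series are
	// well below in-memory series.
	return float64(activeSeries) < 0.5*float64(inMemorySeries)
}
```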

after the rollout to zone-a, early compaction happened about twice as often in zone-a as in the other zones; but overall zone-a is still at lower levels of early compaction than before
[Screenshot: early compactions per zone, 2024-05-03]
