ingester: include only owned series in active series stats #7976

Closed
dimitarvdimitrov opened this issue Apr 25, 2024 · 2 comments · Fixed by #8084
Labels: component/ingester, sigyn (naming for the kafka-based Mimir architecture while it's still WIP)

Comments

@dimitarvdimitrov (Contributor)

Context

With the Kafka architecture, scaling ingesters is easier and is expected to happen more often. Series change owners when ingesters are added. For 10 minutes (the value of -ingester.active-series-metrics-idle-timeout) both the new ingester and some old ingesters count a series as active. During this window a tenant appears to have more active series than they actually do.
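
For illustration only, here is a minimal sketch (hypothetical names, not Mimir's actual code) of idle-timeout based active series tracking. It shows why an old ingester keeps counting a series as active for the length of the idle timeout after it stops receiving that series' samples:

```go
package ingester

import (
	"sync"
	"time"
)

// activeSeriesTracker is a hypothetical, simplified idle-timeout tracker:
// a series counts as active until it has received no samples for idleTimeout
// (e.g. the 10m default of -ingester.active-series-metrics-idle-timeout).
type activeSeriesTracker struct {
	mu          sync.Mutex
	lastUpdate  map[uint64]time.Time // series ref -> last sample time
	idleTimeout time.Duration
}

func newActiveSeriesTracker(idleTimeout time.Duration) *activeSeriesTracker {
	return &activeSeriesTracker{
		lastUpdate:  map[uint64]time.Time{},
		idleTimeout: idleTimeout,
	}
}

// updateSeries is called for every ingested sample of the series.
func (t *activeSeriesTracker) updateSeries(ref uint64, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.lastUpdate[ref] = now
}

// purge drops series that have been idle longer than idleTimeout and returns
// how many series are still counted as active. After a scale-up, the old
// ingester keeps counting a moved series until this timeout expires.
func (t *activeSeriesTracker) purge(now time.Time) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	for ref, last := range t.lastUpdate {
		if now.Sub(last) > t.idleTimeout {
			delete(t.lastUpdate, ref)
		}
	}
	return len(t.lastUpdate)
}
```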

Proposal

Use only owned series when calculating active series stats. This means the ingester will reevaluate what it considers active series every time the ring changes (a rough sketch of the idea follows the metrics list below). This will show up as changes in the values of the following metrics:

  • cortex_ingester_active_series
  • cortex_ingester_active_series_custom_tracker
  • cortex_ingester_active_native_histogram_series
  • cortex_ingester_active_native_histogram_series_custom_tracker
  • cortex_ingester_active_native_histogram_buckets_custom_tracker
  • response on the /api/v1/cardinality/active_series endpoint
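
As a rough illustration of what "owned" could mean mechanically (a sketch under assumed names and types, not the actual owned-series service): a series is owned if the ring token derived from its labels falls into one of the token ranges the ingester owns, and on a ring change the active series state is purged of series whose token is no longer owned.

```go
package ingester

// tokenRange is a hypothetical half-open [from, to) range of ring tokens
// owned by this ingester.
type tokenRange struct{ from, to uint32 }

// ownedSeriesChecker is an illustrative ownership check: a series is owned
// if the ring token of its labels falls into one of the ingester's token
// ranges. Mimir's real owned-series service is more involved.
type ownedSeriesChecker struct {
	ranges []tokenRange
}

func (c *ownedSeriesChecker) owns(seriesToken uint32) bool {
	for _, r := range c.ranges {
		if seriesToken >= r.from && seriesToken < r.to {
			return true
		}
	}
	return false
}

// purgeNotOwned drops series the ingester no longer owns from the active
// series map, so that cortex_ingester_active_series and the related metrics
// only count owned series after a ring change.
func purgeNotOwned(active map[uint64]uint32, c *ownedSeriesChecker) {
	for ref, token := range active {
		if !c.owns(token) {
			delete(active, ref)
		}
	}
}
```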

Proposed implementation

@dimitarvdimitrov (Contributor, Author)

there was some discussion in Slack with @pstibrany and @pr00se; I'm leaving the gist of it here. There were two concerns:

  1. the proposed implementation exceeds the concerns of the owned series service; the alternative was to let the active series service reach back out to the userTSDB and check whether a series is owned each time it does a purge
     • that creates extra indirection and overhead on every series purge
  2. a concern with the performance of doing the active series check during the owned series update, i.e. doing too much work in-band

     • I ran an experiment triggering node scaling 3 times; the results are:
       • head compactions and new users don't trigger the active series calculation, so the overhead is 0 most of the time
         • so for the past 24h there were no triggers of that logic, because we didn't change limits or add ingesters
         • details: the active series code is triggered only when a series is in the head but is no longer owned. After a compaction all of the series in the recomputeOwnedSeries loop are in the head, but all of them are also owned, so there is no need to update the active series (a hedged sketch of this trigger condition follows the screenshots below)
       • changing shuffle shards or adding more ingesters does trigger it (I manually triggered this three times by adding ingesters):
         • absolute CPU and latency aren't affected, especially write latency
         • the total duration of the recalculation is higher: 500-700ms vs 300-500ms; see the screenshots below for the CPU profile of zone-a from the first recalculation
(Flame)Graphs

[Three CPU profile screenshots, 2024-05-02]
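
To make the trigger condition from concern 2 concrete, here is a hedged sketch (hypothetical names, reusing the ownedSeriesChecker and purgeNotOwned types from the sketch above; not the actual recomputeOwnedSeries code). The active series update only runs when the owned-series recomputation finds head series that are no longer owned, which is why compactions and new users add essentially no overhead:

```go
// recomputeOwnedSeries is an illustrative sketch: walk the head series,
// count the ones still owned, and only touch the active series state when at
// least one head series turned out to be no longer owned. After a head
// compaction every remaining series is still owned, so the active series
// purge is skipped; it only runs after ring or shuffle-shard changes.
func recomputeOwnedSeries(headTokens map[uint64]uint32, c *ownedSeriesChecker, active map[uint64]uint32) (ownedCount int) {
	foundNotOwned := false
	for _, token := range headTokens {
		if c.owns(token) {
			ownedCount++
		} else {
			foundNotOwned = true
		}
	}
	if foundNotOwned {
		// Only pay the extra cost of updating active series in this case;
		// this is why compactions and new users add ~zero overhead.
		purgeNotOwned(active, c)
	}
	return ownedCount
}
```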

@dimitarvdimitrov (Contributor, Author)

from @pstibrany:

Early compaction
Early compaction works on the premise that if active series are way lower than in-memory series, then compacting the head up to 20 minutes ago will actually remove some series from memory (those that are not active anymore).

However, if active series drop “disowned” series too, the above assumption will no longer be true. Early compaction may trigger because active series are low, but compacting the head up to 20 minutes ago may not remove any series.

That’s not necessarily bad; we should just make sure that we don’t attempt to run early compaction too often in such a case.
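
A minimal sketch of that concern, with a hypothetical threshold and guard (not Mimir's actual early-compaction logic): the trigger compares active series against in-memory series, and a minimum interval between attempts protects against repeatedly firing a compaction that frees nothing when active series dropped only because series were disowned.

```go
package ingester

import "time"

// shouldEarlyCompact is an illustrative sketch of the early-compaction
// decision. Premise: if active series are much lower than in-memory series,
// compacting the head up to ~20 minutes ago should free memory. Because
// active series can now also drop when series are disowned (without leaving
// the head), a minimum interval between attempts guards against running the
// compaction too often when it would not remove any series.
func shouldEarlyCompact(inMemorySeries, activeSeries int, lastAttempt, now time.Time, minInterval time.Duration) bool {
	if now.Sub(lastAttempt) < minInterval {
		return false // don't attempt early compaction too often
	}
	// Hypothetical threshold: only compact early when active series are
	// well below in-memory series.
	return float64(activeSeries) < 0.5*float64(inMemorySeries)
}
```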

after the rollout to zone-a, early compaction happened about twice as often in zone-a as in the other zones; but overall zone-a is still at lower levels of early compaction than before
[Screenshot: early compactions per zone, 2024-05-03]
