
Add ability for prometheus & thanos sidecar to flush on graceful shutdown #6540

Open · Nashluffy opened this issue Apr 20, 2024 · 3 comments

Nashluffy commented Apr 20, 2024

Component(s)

Prometheus

What is missing? Please describe.

We run several short-lived clusters (sometimes only an hour old). When using the Thanos sidecar approach, scaling down a Prometheus replica (either permanently or by removing shards) loses all chunks still in the head, since they have not yet been compacted into a block and uploaded.

Several issues have touched on this:
#4967
prometheus/prometheus#12261
thanos-io/thanos#1849

It would be great to have native support in prometheus-operator for flushing and uploading what's in the head (likely requiring changes to other components as well).

Unfortunately there's no TSDB API for "flushing" the head, but you can create a snapshot of the TSDB, which persists head data as a block, and then move any new blocks from that snapshot into the top-level data dir.

The Thanos sidecar can then perform its own "flushing" by uploading blocks one last time.

prometheus-operator feels like the most natural place to orchestrate this, but I'm open to discussion!

```yaml
kind: Prometheus
spec:
  thanos:
    flushOnShutdown: true
```

Describe alternatives you've considered.

I'm currently achieving this with a separate container that uses a preStop hook (sketched below) to:

  1. call the snapshot endpoint of Prometheus,
  2. move the new blocks from that snapshot dir into the top-level data dir, and
  3. run `thanos tools bucket upload-blocks`.

The snapshot adds little extra storage, as blocks that already exist on disk appear in the snapshot only as symlinks to the actual block.
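For concreteness, here is a minimal sketch of what that container could look like, assuming Prometheus runs with `--web.enable-admin-api`, the TSDB volume is mounted at `/prometheus`, and an object storage config is mounted at `/etc/thanos/objstore.yml` (the container name, image tag, and paths are all illustrative, not exactly what we run):

```yaml
# Pod spec fragment; names, paths, and the image tag are assumptions.
# Requires Prometheus to run with --web.enable-admin-api.
containers:
  - name: flush-on-shutdown
    image: quay.io/thanos/thanos:v0.34.1  # any release with `tools bucket upload-blocks`
    volumeMounts:
      - name: prometheus-data             # the volume backing the Prometheus TSDB
        mountPath: /prometheus
      - name: objstore-config
        mountPath: /etc/thanos
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            - |
              # 1. Snapshot the TSDB; this persists the in-memory head as a block.
              snap=$(wget -qO- --post-data='' http://localhost:9090/api/v1/admin/tsdb/snapshot \
                | sed -n 's/.*"name":"\([^"]*\)".*/\1/p')
              # 2. Move blocks that exist only in the snapshot into the data dir;
              #    blocks already on disk show up as symlinks and are skipped.
              for b in "/prometheus/snapshots/$snap"/*; do
                [ -e "/prometheus/$(basename "$b")" ] || mv "$b" /prometheus/
              done
              # 3. Upload any blocks the sidecar hasn't shipped yet.
              thanos tools bucket upload-blocks \
                --path=/prometheus \
                --objstore.config-file=/etc/thanos/objstore.yml
```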

We previously used a Thanos Receive setup, which avoided this problem altogether, but it was wildly more expensive and carried a lot of operational overhead.

Environment Information.

Kubernetes Version: 1.27
Prometheus-Operator Version: 0.73

Nashluffy added the kind/feature and needs-triage labels on Apr 20, 2024
nicolastakashi added the area/sharding label and removed needs-triage on Apr 20, 2024

nicolastakashi (Contributor) commented

Hey @Nashluffy, thanks for this new issue. 😄
Yes, the described steps would work and sound pretty nice.
I'd like something less hacky that doesn't rely on lifecycle hooks, though.
I just opened a new issue on the Thanos project; let's see what people think about it:
thanos-io/thanos#7295

Nashluffy (Author) commented

Thanks! I'll keep the prometheus-operator discussion here.

Just another point: I think a call to the flush endpoint should be part of the Prometheus finalizer as well, not just part of scaling down shards. This would capture my use case, as we don't use shards.

ArthurSens (Member) commented

This seems aligned with one of the ideas we had for graceful shutdown (see https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/proposals/202310-shard-autoscaling.md#snapshot--upload-on-shutdown).

I think we could extend the proposed API to also offer this alternative as a shutdown option. Of course, that requires me to get back to my PR and finish it 😅
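For illustration, the extension might look something like the sketch below; the field name and value are hypothetical placeholders, not anything agreed on in the proposal:

```yaml
kind: Prometheus
spec:
  # Hypothetical field and value, for illustration only; the real shape
  # would come out of the shard-autoscaling proposal discussion.
  onShutdown: FlushAndUpload
```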
