Terminated state results in unhealthy ingesters #5673

Open
mkieweg opened this issue Nov 23, 2023 · 0 comments

Describe the bug
When ingesters are shut down, they enter the Terminated state. Memberlist considers this state unexpected, so the heartbeat fails and the instance is marked as unhealthy. Recovering from this requires manual intervention, which effectively breaks autoscaling.

To Reproduce
Steps to reproduce the behavior:

  1. Start Cortex v1.15.3 using Helm chart v2.1.0
  2. Use HPA to scale down Cortex ingesters

Expected behavior
Ingesters should scale down and remove themselves from the ring without errors.

Environment:

  • Infrastructure: EKS
  • Deployment tool: Helm chart v2.1.0

Additional Context

Logs

{"caller":"logging.go:76","level":"debug","msg":"GET //ingester/shutdown (301) 73.436µs","traceID":"1bf635dc8c6c3d4e","ts":"2023-11-21T16:41:39.79265651Z"}
{"caller":"lifecycler.go:498","level":"info","msg":"lifecycler loop() exited gracefully","ring":"ingester","ts":"2023-11-21T16:41:39.8043733Z"}
{"caller":"lifecycler.go:811","level":"info","msg":"changing instance state from","new_state":"LEAVING","old_state":"ACTIVE","ring":"ingester","ts":"2023-11-21T16:41:39.804427334Z"}
{"caller":"ingester.go:2586","level":"info","msg":"starting to flush and ship TSDB blocks","ts":"2023-11-21T16:41:39.804546549Z"}
{"caller":"compact.go:519","duration":"234.25592ms","level":"info","maxt":1700582400000,"mint":1700581137870,"msg":"write block","org_id":"fake","ts":"2023-11-21T16:41:40.038875302Z","ulid":"01HFSC4H6WJD5XV7H90F0P6D4V"}
{"block":"01HEQEWTXD8ZKSDDDE9071TP70","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.042351899Z"}
{"block":"01HEQY3KTDAJA0TJHPRCZ0MBQN","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.046284574Z"}
{"block":"01HEQHJBED85MPHZTEXSAS9SYD","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.049584673Z"}
{"block":"01HEQEWW1KRNEBKS9K4Y42RVMT","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.052457795Z"}
{"caller":"truncateMemory","duration":"52.163691ms","level":"info","msg":"Head GC completed","org_id":"fake","ts":"2023-11-21T16:41:40.104683711Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=127.0.0.6:54087","ts":"2023-11-21T16:41:40.10990833Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: cortex-store-gateway-1-1a5d9a43 (timeout reached)","ts":"2023-11-21T16:41:40.892540819Z"}
{"caller":"grpc_logging.go:46","duration":"76.461µs","level":"debug","method":"/grpc.health.v1.Health/Check","msg":"gRPC (success)","ts":"2023-11-21T16:41:40.927996371Z"}
{"caller":"compact.go:519","duration":"1.423570173s","level":"info","maxt":1700584899375,"mint":1700582400000,"msg":"write block","org_id":"fake","ts":"2023-11-21T16:41:41.528432979Z","ulid":"01HFSC4HG89P99GJVSEBSTFP1K"}
{"caller":"truncateMemory","duration":"202.667137ms","level":"info","msg":"Head GC completed","org_id":"fake","ts":"2023-11-21T16:41:41.732417054Z"}
{"caller":"checkpoint.go:100","from_segment":578,"level":"info","mint":1700584899375,"msg":"Creating checkpoint","org_id":"fake","to_segment":579,"ts":"2023-11-21T16:41:41.732951452Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=127.0.0.6:58933","ts":"2023-11-21T16:41:41.979575777Z"}
{"caller":"head.go:1240","duration":"1.523683363s","first":578,"last":579,"level":"info","msg":"WAL checkpoint complete","org_id":"fake","ts":"2023-11-21T16:41:43.256181134Z"}
{"caller":"ingester.go:2368","compactReason":"forced","level":"debug","msg":"TSDB blocks compaction completed successfully","ts":"2023-11-21T16:41:43.256293661Z","user":"fake"}
{"caller":"shipper.go:334","id":"01HFSC4H6WJD5XV7H90F0P6D4V","level":"info","msg":"upload new block","org_id":"fake","ts":"2023-11-21T16:41:43.301936682Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4H6WJD5XV7H90F0P6D4V/chunks/000001","from":"/data/tsdb/fake/thanos/upload/01HFSC4H6WJD5XV7H90F0P6D4V/chunks/000001","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.333067008Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4H6WJD5XV7H90F0P6D4V/index","from":"/data/tsdb/fake/thanos/upload/01HFSC4H6WJD5XV7H90F0P6D4V/index","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.427698215Z"}
{"caller":"shipper.go:334","id":"01HFSC4HG89P99GJVSEBSTFP1K","level":"info","msg":"upload new block","org_id":"fake","ts":"2023-11-21T16:41:43.500269397Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4HG89P99GJVSEBSTFP1K/chunks/000001","from":"/data/tsdb/fake/thanos/upload/01HFSC4HG89P99GJVSEBSTFP1K/chunks/000001","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.660061181Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4HG89P99GJVSEBSTFP1K/index","from":"/data/tsdb/fake/thanos/upload/01HFSC4HG89P99GJVSEBSTFP1K/index","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.856623646Z"}
{"caller":"memberlist_logger.go:74","level":"warn","msg":"Was able to connect to cortex-store-gateway-1-1a5d9a43 but other probes failed, network may be misconfigured","ts":"2023-11-21T16:41:43.890882572Z"}
{"caller":"ingester.go:2279","level":"debug","msg":"shipper successfully synchronized TSDB blocks with storage","ts":"2023-11-21T16:41:43.984722874Z","uploaded":2,"user":"fake"}
{"caller":"ingester.go:2595","level":"info","msg":"finished flushing and shipping TSDB blocks","ts":"2023-11-21T16:41:43.984859001Z"}
{"caller":"lifecycler.go:871","final_sleep":"30s","level":"info","msg":"lifecycler entering final sleep before shutdown","ts":"2023-11-21T16:41:43.985246801Z"}
{"caller":"signals.go:55","level":"info","msg":"=== received SIGINT/SIGTERM ===\n*** exiting","ts":"2023-11-21T16:41:44.816310571Z"}
{"caller":"module_service.go:96","level":"info","module":"ingester-service","msg":"module stopped","ts":"2023-11-21T16:41:44.816429019Z"}
{"caller":"module_service.go:86","level":"debug","module":"server","msg":"stopping","ts":"2023-11-21T16:41:44.816563052Z"}
{"caller":"module_service.go:109","level":"debug","module":"runtime-config","msg":"module waiting for","ts":"2023-11-21T16:41:44.816598457Z","waiting_for":"ingester-service"}
{"caller":"module_service.go:86","level":"debug","module":"runtime-config","msg":"stopping","ts":"2023-11-21T16:41:44.816632226Z"}
{"caller":"module_service.go:96","level":"info","module":"runtime-config","msg":"module stopped","ts":"2023-11-21T16:41:44.816643075Z"}
{"caller":"module_service.go:109","level":"debug","module":"memberlist-kv","msg":"module waiting for","ts":"2023-11-21T16:41:44.816657622Z","waiting_for":"ingester-service"}
{"caller":"module_service.go:86","level":"debug","module":"memberlist-kv","msg":"stopping","ts":"2023-11-21T16:41:44.816672603Z"}
{"caller":"memberlist_client.go:612","level":"info","msg":"leaving memberlist cluster","ts":"2023-11-21T16:41:44.816698917Z"}
{"caller":"module_service.go:96","level":"info","module":"memberlist-kv","msg":"module stopped","ts":"2023-11-21T16:41:45.841625286Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: cortex-distributor-7d7d5b59b8-9t7ks-7824768a (timeout reached)","ts":"2023-11-21T16:41:45.89149416Z"}
{"caller":"memberlist_logger.go:74","level":"info","msg":"Suspect cortex-distributor-7d7d5b59b8-9t7ks-7824768a has failed, no acks received","ts":"2023-11-21T16:41:48.891631962Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:49.804559785Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:54.804679488Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:59.805094041Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:42:04.805275687Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:42:09.805392347Z"}
{"caller":"lifecycler.go:877","level":"debug","msg":"unregistering instance from ring","ring":"ingester","ts":"2023-11-21T16:42:13.986349184Z"}
{"caller":"ingester.go:772","err":"failed to unregister from the KV store, ring: ingester: unexpected state: Terminated","level":"warn","msg":"failed to stop ingester lifecycler","ts":"2023-11-21T16:42:13.986629129Z"}
{"caller":"logging.go:76","level":"debug","msg":"GET /ingester/shutdown (204) 34.185574054s","traceID":"1a7457db41a31f14","ts":"2023-11-21T16:42:13.989845983Z"}
{"caller":"server_service.go:50","level":"info","msg":"server stopped","ts":"2023-11-21T16:42:14.148840428Z"}
{"caller":"module_service.go:96","level":"info","module":"server","msg":"module stopped","ts":"2023-11-21T16:42:14.148922944Z"}
{"caller":"cortex.go:423","level":"info","msg":"Cortex stopped","ts":"2023-11-21T16:42:14.148952283Z"}
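
To make the failing entries easier to spot in a longer capture, here is a minimal sketch that filters JSON log lines for the "unexpected state: Terminated" error. The function name `terminated_errors` and the `SAMPLE` variable are hypothetical helpers for illustration, not part of Cortex; the sample lines are copied from the logs above.

```python
import json

# Three entries copied verbatim from the logs above.
SAMPLE = """\
{"caller":"lifecycler.go:811","level":"info","msg":"changing instance state from","new_state":"LEAVING","old_state":"ACTIVE","ring":"ingester","ts":"2023-11-21T16:41:39.804427334Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:49.804559785Z"}
{"caller":"ingester.go:772","err":"failed to unregister from the KV store, ring: ingester: unexpected state: Terminated","level":"warn","msg":"failed to stop ingester lifecycler","ts":"2023-11-21T16:42:13.986629129Z"}
"""


def terminated_errors(log_text):
    """Return (ts, caller, err) for each entry whose err mentions the Terminated state."""
    hits = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log line
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        err = entry.get("err", "")
        if "unexpected state: Terminated" in err:
            hits.append((entry.get("ts"), entry.get("caller"), err))
    return hits


if __name__ == "__main__":
    for ts, caller, err in terminated_errors(SAMPLE):
        print(ts, caller, err)
```

Run against the full log above, this should pick out the repeated lifecycler.go:538 heartbeat failures and the final ingester.go:772 unregister failure.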