Describe the bug
When ingesters shut down, they enter the Terminated state. Memberlist considers this state unexpected, so the heartbeat fails and the instance is marked as unhealthy. This requires manual intervention and thus effectively breaks autoscaling.
To Reproduce
Steps to reproduce the behavior:
Start Cortex v1.15.3 using Helm chart v2.1.0
Use HPA to scale down Cortex ingesters
Expected behavior
Ingesters should scale down and remove themselves from the ring without errors
Environment:
Infrastructure: EKS
Deployment tool: Helm chart v2.1.0
Additional Context
Logs
{"caller":"logging.go:76","level":"debug","msg":"GET //ingester/shutdown (301) 73.436µs","traceID":"1bf635dc8c6c3d4e","ts":"2023-11-21T16:41:39.79265651Z"}
{"caller":"lifecycler.go:498","level":"info","msg":"lifecycler loop() exited gracefully","ring":"ingester","ts":"2023-11-21T16:41:39.8043733Z"}
{"caller":"lifecycler.go:811","level":"info","msg":"changing instance state from","new_state":"LEAVING","old_state":"ACTIVE","ring":"ingester","ts":"2023-11-21T16:41:39.804427334Z"}
{"caller":"ingester.go:2586","level":"info","msg":"starting to flush and ship TSDB blocks","ts":"2023-11-21T16:41:39.804546549Z"}
{"caller":"compact.go:519","duration":"234.25592ms","level":"info","maxt":1700582400000,"mint":1700581137870,"msg":"write block","org_id":"fake","ts":"2023-11-21T16:41:40.038875302Z","ulid":"01HFSC4H6WJD5XV7H90F0P6D4V"}
{"block":"01HEQEWTXD8ZKSDDDE9071TP70","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.042351899Z"}
{"block":"01HEQY3KTDAJA0TJHPRCZ0MBQN","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.046284574Z"}
{"block":"01HEQHJBED85MPHZTEXSAS9SYD","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.049584673Z"}
{"block":"01HEQEWW1KRNEBKS9K4Y42RVMT","caller":"db.go:1550","level":"info","msg":"Deleting obsolete block","org_id":"fake","ts":"2023-11-21T16:41:40.052457795Z"}
{"caller":"truncateMemory","duration":"52.163691ms","level":"info","msg":"Head GC completed","org_id":"fake","ts":"2023-11-21T16:41:40.104683711Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=127.0.0.6:54087","ts":"2023-11-21T16:41:40.10990833Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: cortex-store-gateway-1-1a5d9a43 (timeout reached)","ts":"2023-11-21T16:41:40.892540819Z"}
{"caller":"grpc_logging.go:46","duration":"76.461µs","level":"debug","method":"/grpc.health.v1.Health/Check","msg":"gRPC (success)","ts":"2023-11-21T16:41:40.927996371Z"}
{"caller":"compact.go:519","duration":"1.423570173s","level":"info","maxt":1700584899375,"mint":1700582400000,"msg":"write block","org_id":"fake","ts":"2023-11-21T16:41:41.528432979Z","ulid":"01HFSC4HG89P99GJVSEBSTFP1K"}
{"caller":"truncateMemory","duration":"202.667137ms","level":"info","msg":"Head GC completed","org_id":"fake","ts":"2023-11-21T16:41:41.732417054Z"}
{"caller":"checkpoint.go:100","from_segment":578,"level":"info","mint":1700584899375,"msg":"Creating checkpoint","org_id":"fake","to_segment":579,"ts":"2023-11-21T16:41:41.732951452Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Stream connection from=127.0.0.6:58933","ts":"2023-11-21T16:41:41.979575777Z"}
{"caller":"head.go:1240","duration":"1.523683363s","first":578,"last":579,"level":"info","msg":"WAL checkpoint complete","org_id":"fake","ts":"2023-11-21T16:41:43.256181134Z"}
{"caller":"ingester.go:2368","compactReason":"forced","level":"debug","msg":"TSDB blocks compaction completed successfully","ts":"2023-11-21T16:41:43.256293661Z","user":"fake"}
{"caller":"shipper.go:334","id":"01HFSC4H6WJD5XV7H90F0P6D4V","level":"info","msg":"upload new block","org_id":"fake","ts":"2023-11-21T16:41:43.301936682Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4H6WJD5XV7H90F0P6D4V/chunks/000001","from":"/data/tsdb/fake/thanos/upload/01HFSC4H6WJD5XV7H90F0P6D4V/chunks/000001","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.333067008Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4H6WJD5XV7H90F0P6D4V/index","from":"/data/tsdb/fake/thanos/upload/01HFSC4H6WJD5XV7H90F0P6D4V/index","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.427698215Z"}
{"caller":"shipper.go:334","id":"01HFSC4HG89P99GJVSEBSTFP1K","level":"info","msg":"upload new block","org_id":"fake","ts":"2023-11-21T16:41:43.500269397Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4HG89P99GJVSEBSTFP1K/chunks/000001","from":"/data/tsdb/fake/thanos/upload/01HFSC4HG89P99GJVSEBSTFP1K/chunks/000001","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.660061181Z"}
{"bucket":"tracing: cortex-cortex-stg-us-west-2","caller":"objstore.go:288","dst":"01HFSC4HG89P99GJVSEBSTFP1K/index","from":"/data/tsdb/fake/thanos/upload/01HFSC4HG89P99GJVSEBSTFP1K/index","level":"debug","msg":"uploaded file","org_id":"fake","ts":"2023-11-21T16:41:43.856623646Z"}
{"caller":"memberlist_logger.go:74","level":"warn","msg":"Was able to connect to cortex-store-gateway-1-1a5d9a43 but other probes failed, network may be misconfigured","ts":"2023-11-21T16:41:43.890882572Z"}
{"caller":"ingester.go:2279","level":"debug","msg":"shipper successfully synchronized TSDB blocks with storage","ts":"2023-11-21T16:41:43.984722874Z","uploaded":2,"user":"fake"}
{"caller":"ingester.go:2595","level":"info","msg":"finished flushing and shipping TSDB blocks","ts":"2023-11-21T16:41:43.984859001Z"}
{"caller":"lifecycler.go:871","final_sleep":"30s","level":"info","msg":"lifecycler entering final sleep before shutdown","ts":"2023-11-21T16:41:43.985246801Z"}
{"caller":"signals.go:55","level":"info","msg":"=== received SIGINT/SIGTERM ===\n*** exiting","ts":"2023-11-21T16:41:44.816310571Z"}
{"caller":"module_service.go:96","level":"info","module":"ingester-service","msg":"module stopped","ts":"2023-11-21T16:41:44.816429019Z"}
{"caller":"module_service.go:86","level":"debug","module":"server","msg":"stopping","ts":"2023-11-21T16:41:44.816563052Z"}
{"caller":"module_service.go:109","level":"debug","module":"runtime-config","msg":"module waiting for","ts":"2023-11-21T16:41:44.816598457Z","waiting_for":"ingester-service"}
{"caller":"module_service.go:86","level":"debug","module":"runtime-config","msg":"stopping","ts":"2023-11-21T16:41:44.816632226Z"}
{"caller":"module_service.go:96","level":"info","module":"runtime-config","msg":"module stopped","ts":"2023-11-21T16:41:44.816643075Z"}
{"caller":"module_service.go:109","level":"debug","module":"memberlist-kv","msg":"module waiting for","ts":"2023-11-21T16:41:44.816657622Z","waiting_for":"ingester-service"}
{"caller":"module_service.go:86","level":"debug","module":"memberlist-kv","msg":"stopping","ts":"2023-11-21T16:41:44.816672603Z"}
{"caller":"memberlist_client.go:612","level":"info","msg":"leaving memberlist cluster","ts":"2023-11-21T16:41:44.816698917Z"}
{"caller":"module_service.go:96","level":"info","module":"memberlist-kv","msg":"module stopped","ts":"2023-11-21T16:41:45.841625286Z"}
{"caller":"memberlist_logger.go:74","level":"debug","msg":"Failed ping: cortex-distributor-7d7d5b59b8-9t7ks-7824768a (timeout reached)","ts":"2023-11-21T16:41:45.89149416Z"}
{"caller":"memberlist_logger.go:74","level":"info","msg":"Suspect cortex-distributor-7d7d5b59b8-9t7ks-7824768a has failed, no acks received","ts":"2023-11-21T16:41:48.891631962Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:49.804559785Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:54.804679488Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:41:59.805094041Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:42:04.805275687Z"}
{"caller":"lifecycler.go:538","err":"unexpected state: Terminated","level":"error","msg":"failed to write to the KV store, sleeping","ring":"ingester","ts":"2023-11-21T16:42:09.805392347Z"}
{"caller":"lifecycler.go:877","level":"debug","msg":"unregistering instance from ring","ring":"ingester","ts":"2023-11-21T16:42:13.986349184Z"}
{"caller":"ingester.go:772","err":"failed to unregister from the KV store, ring: ingester: unexpected state: Terminated","level":"warn","msg":"failed to stop ingester lifecycler","ts":"2023-11-21T16:42:13.986629129Z"}
{"caller":"logging.go:76","level":"debug","msg":"GET /ingester/shutdown (204) 34.185574054s","traceID":"1a7457db41a31f14","ts":"2023-11-21T16:42:13.989845983Z"}
{"caller":"server_service.go:50","level":"info","msg":"server stopped","ts":"2023-11-21T16:42:14.148840428Z"}
{"caller":"module_service.go:96","level":"info","module":"server","msg":"module stopped","ts":"2023-11-21T16:42:14.148922944Z"}
{"caller":"cortex.go:423","level":"info","msg":"Cortex stopped","ts":"2023-11-21T16:42:14.148952283Z"}