Unrecoverable JetStream cluster #3906

Closed
2 tasks done
matkam opened this issue Feb 24, 2023 · 5 comments
Comments


matkam commented Feb 24, 2023

Defect

First reported on Slack: https://natsio.slack.com/archives/C069GSYFP/p1676329368394969


/ # nats-server -DV
[84] 2023/02/24 04:17:32.753080 [INF] Starting nats-server
[84] 2023/02/24 04:17:32.753117 [INF]   Version:  2.9.14
[84] 2023/02/24 04:17:32.753119 [INF]   Git:      [74ae59a]
[84] 2023/02/24 04:17:32.753120 [DBG]   Go build: go1.19.5
[84] 2023/02/24 04:17:32.753122 [INF]   Name:     NCL3ETWJKRUBPYCG5IT2NQC7DKKVQUZFJDZANXLDOBJNIHQ56OKPGMHN
[84] 2023/02/24 04:17:32.753128 [INF]   ID:       NCL3ETWJKRUBPYCG5IT2NQC7DKKVQUZFJDZANXLDOBJNIHQ56OKPGMHN
[84] 2023/02/24 04:17:32.753141 [DBG] Created system account: "$SYS"
[84] 2023/02/24 04:17:32.754627 [FTL] Error listening on port: 0.0.0.0:4222, "listen tcp 0.0.0.0:4222: bind: address already in use"

Versions of nats-server and affected client libraries used:

NATS server: v2.9.11 and up (tested up to v2.9.14); v2.9.10 is the last known-good release.
Helm chart: 0.19.9

OS/Container environment:

Kubernetes, deployed with helm. values.yaml configuration:

nats:
  image:
    tag: 2.9.14-alpine

  serverNamePrefix: "${region}-"

  healthcheck:
    startup:
      initialDelaySeconds: 180

  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet

  resources:
    requests:
      cpu: 3500m
      memory: 7250Mi

  limits:
    maxPings: 4
    pingInterval: 1m

  logging:
    debug: false
    trace: false

  jetstream:
    enabled: true
    memStorage:
      enabled: true
    fileStorage:
      enabled: false

cluster:
  enabled: true
  name: ${name}
  replicas: 3

gateway:
  enabled: true
  name: ${name}
  ...

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
                - nats
        topologyKey: "kubernetes.io/hostname"

In our environment, we have a NATS super cluster, with 3 NATS servers per data center. We have multiple streams, each configured with 3 replicas and 1 minute MaxAge/TTL.
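
For reference, a minimal sketch of how a stream with that shape could be created with the nats.go client. The connection URL and subject space are placeholders; the stream name matches the testStream seen in the warnings further down, and the storage/replica/MaxAge settings follow the description above:

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    // Placeholder URL for one of the cluster endpoints.
    nc, err := nats.Connect("nats://nats.nats.svc:4222")
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // R3 in-memory stream with a 1 minute MaxAge, matching the setup above.
    _, err = js.AddStream(&nats.StreamConfig{
        Name:     "testStream",
        Subjects: []string{"testStream.>"}, // assumed subject space
        Storage:  nats.MemoryStorage,
        Replicas: 3,
        MaxAge:   time.Minute,
    })
    if err != nil {
        log.Fatal(err)
    }
}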

Steps or code to reproduce the issue:

We've been able to consistently reproduce the unrecoverable state by using some test code.

  1. Apply this tester deployment in one of the K8s clusters where NATS is running with JetStream enabled.
  2. Scale up the deployment to ~5 replicas.
  3. Wait a few minutes (less than 5) while checking whether nats stream ls still responds (a minimal programmatic check is sketched after this list).
  4. If JetStream is still responsive, force-kill one of the NATS servers: kubectl delete pod nats-x --grace-period=0 --force
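
For step 3, a minimal programmatic version of the responsiveness check (roughly what nats stream ls exercises), assuming a reachable cluster URL:

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://nats.nats.svc:4222") // placeholder URL
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    // Short per-request timeout so a hung JetStream meta leader surfaces
    // as an error instead of blocking the check.
    js, err := nc.JetStream(nats.MaxWait(2 * time.Second))
    if err != nil {
        log.Fatal(err)
    }

    for range time.Tick(5 * time.Second) {
        if _, err := js.AccountInfo(); err != nil {
            log.Printf("JetStream unresponsive: %v", err)
            continue
        }
        log.Printf("JetStream OK")
    }
}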

Expected result:

JetStream continues to work, with a few errors about a single node being down.

Actual result:

JetStream is completely down, and cannot be recovered without destroying and recreating the entire super cluster.

The NATS servers print this warning constantly:

[WRN] Consumer assignment for '$G > testStream:14 > 2AAcUVbn' not cleaned up, retrying

Screenshot from our NATS Grafana dashboard: [image]

wallyqs (Member) commented Feb 24, 2023

@matkam What are the resources for the nodes in the test? Is the nats-server capped at 2 GB of memory? Are you seeing any MemoryPressure events in kubectl get events? What I see is an increasing number of slow consumers; if these are on the routes (rid), that might be why there is no full mesh and thus why JetStream is unavailable.
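
One way to check whether the slow consumers are on the route connections, assuming the server's HTTP monitoring port (8222 by default) is reachable, is to watch the routes and slow_consumers counters in /varz on each pod; a rough sketch:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// Subset of the /varz monitoring payload.
type varz struct {
    Routes        int   `json:"routes"`
    SlowConsumers int64 `json:"slow_consumers"`
}

func main() {
    // Placeholder address: port-forward to each nats pod or use its pod IP.
    resp, err := http.Get("http://localhost:8222/varz")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var v varz
    if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
        log.Fatal(err)
    }

    // In a 3-node cluster each server should normally report 2 routes;
    // fewer routes plus a climbing slow_consumers count suggests the
    // route connections themselves are being cut.
    fmt.Printf("routes=%d slow_consumers=%d\n", v.Routes, v.SlowConsumers)
}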

matkam (Author) commented Feb 24, 2023

@wallyqs Each NATS pod has almost the entire 4 vCPU, 8GB memory VM to itself. I've updated the helm values with our resource requests and affinity settings. There are no MemoryPressure events.

It makes sense that there is no full mesh, but what has changed since v2.9.10 that prevents the mesh from recovering?

derekcollison (Member) commented

We will get it fixed before 2.9.15 is released. Thanks for the info and the test case, much appreciated.

esemeniuc commented

Is this fixed now that 2.9.15 is released?

wallyqs (Member) commented Mar 4, 2023

@esemeniuc Most of it was addressed in #3922, which handles this scenario much better (50 memory streams, each with 150 ephemeral push consumers and no flow control). There are still some follow-ups for when this condition is hit in k8s, such as the startup probe causing extra restarts in some scenarios during recovery. A workaround is to extend the startup period in k8s as shown below; we'll likely revisit the heuristics around the startup probe soon to avoid this:

nats:
  healthcheck:
    startup:
      enabled: true
      failureThreshold: 300  # extend the startup window so recovery is not interrupted by probe-driven restarts

matkam closed this as completed Mar 23, 2023
bruth removed the 🐞 bug label Aug 18, 2023