Unrecoverable JetStream cluster #3906

Closed
2 tasks done
matkam opened this issue Feb 24, 2023 · 5 comments
Comments


matkam commented Feb 24, 2023

Defect

First reported on Slack: https://natsio.slack.com/archives/C069GSYFP/p1676329368394969


/ # nats-server -DV
[84] 2023/02/24 04:17:32.753080 [INF] Starting nats-server
[84] 2023/02/24 04:17:32.753117 [INF]   Version:  2.9.14
[84] 2023/02/24 04:17:32.753119 [INF]   Git:      [74ae59a]
[84] 2023/02/24 04:17:32.753120 [DBG]   Go build: go1.19.5
[84] 2023/02/24 04:17:32.753122 [INF]   Name:     NCL3ETWJKRUBPYCG5IT2NQC7DKKVQUZFJDZANXLDOBJNIHQ56OKPGMHN
[84] 2023/02/24 04:17:32.753128 [INF]   ID:       NCL3ETWJKRUBPYCG5IT2NQC7DKKVQUZFJDZANXLDOBJNIHQ56OKPGMHN
[84] 2023/02/24 04:17:32.753141 [DBG] Created system account: "$SYS"
[84] 2023/02/24 04:17:32.754627 [FTL] Error listening on port: 0.0.0.0:4222, "listen tcp 0.0.0.0:4222: bind: address already in use"

Versions of nats-server and affected client libraries used:

NATS server: v2.9.11 and up (tested up to v2.9.14); v2.9.10 is the last known-good release.
Helm chart: 0.19.9

OS/Container environment:

Kubernetes, deployed with helm. values.yaml configuration:

nats:
  image:
    tag: 2.9.14-alpine

  serverNamePrefix: "${region}-"

  healthcheck:
    startup:
      initialDelaySeconds: 180

  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet

  resources:
    requests:
      cpu: 3500m
      memory: 7250Mi

  limits:
    maxPings: 4
    pingInterval: 1m

  logging:
    debug: false
    trace: false

  jetstream:
    enabled: true
    memStorage:
      enabled: true
    fileStorage:
      enabled: false

cluster:
  enabled: true
  name: ${name}
  replicas: 3

gateway:
  enabled: true
  name: ${name}
  ...

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
                - nats
        topologyKey: "kubernetes.io/hostname"

In our environment, we have a NATS super cluster, with 3 NATS servers per data center. We have multiple streams, each configured with 3 replicas and 1 minute MaxAge/TTL.
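
For reference, a minimal sketch of how a stream with that shape could be created with the nats.go client. The connection URL and subject space are placeholders; the stream name matches the testStream seen in the warnings further down, and the storage/replica/MaxAge settings follow the description above:

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    // Placeholder URL for one of the cluster endpoints.
    nc, err := nats.Connect("nats://nats.nats.svc:4222")
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // R3 in-memory stream with a 1 minute MaxAge, matching the setup above.
    _, err = js.AddStream(&nats.StreamConfig{
        Name:     "testStream",
        Subjects: []string{"testStream.>"}, // assumed subject space
        Storage:  nats.MemoryStorage,
        Replicas: 3,
        MaxAge:   time.Minute,
    })
    if err != nil {
        log.Fatal(err)
    }
}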

Steps or code to reproduce the issue:

We've been able to consistently reproduce the unrecoverable state by using some test code.

  1. Apply this tester deployment in one of the K8s clusters where NATS is running with JetStream enabled.
  2. Scale up the deployment to ~5 replicas.
  3. Wait a few minutes (less than 5) while checking whether nats stream ls still responds (a minimal programmatic check is sketched after this list).
  4. If JetStream is still responsive, force-kill one of the NATS servers: kubectl delete pod nats-x --grace-period=0 --force
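
For step 3, a minimal programmatic version of the responsiveness check (roughly what nats stream ls exercises), assuming a reachable cluster URL:

package main

import (
    "log"
    "time"

    "github.com/nats-io/nats.go"
)

func main() {
    nc, err := nats.Connect("nats://nats.nats.svc:4222") // placeholder URL
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    // Short per-request timeout so a hung JetStream meta leader surfaces
    // as an error instead of blocking the check.
    js, err := nc.JetStream(nats.MaxWait(2 * time.Second))
    if err != nil {
        log.Fatal(err)
    }

    for range time.Tick(5 * time.Second) {
        if _, err := js.AccountInfo(); err != nil {
            log.Printf("JetStream unresponsive: %v", err)
            continue
        }
        log.Printf("JetStream OK")
    }
}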

Expected result:

JetStream continues to work, with a few errors about a single node being down.

Actual result:

JetStream is completely down, and cannot be recovered without destroying and recreating the entire super cluster.

The NATS servers print this warning constantly:

[WRN] Consumer assignment for '$G > testStream:14 > 2AAcUVbn' not cleaned up, retrying

Screenshot from our NATS Grafana dashboard: [image]

wallyqs (Member) commented Feb 24, 2023

@matkam What are the resources for the nodes in the test? Is the nats-server capped at 2 GB of memory? Are you seeing any MemoryPressure events in kubectl get events? What I see is an increasing number of slow consumers; if these are on the routes (rid), that might be why there is no full mesh and thus why JetStream is unavailable.
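
One way to check whether the slow consumers are on the route connections, assuming the server's HTTP monitoring port (8222 by default) is reachable, is to watch the routes and slow_consumers counters in /varz on each pod; a rough sketch:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// Subset of the /varz monitoring payload.
type varz struct {
    Routes        int   `json:"routes"`
    SlowConsumers int64 `json:"slow_consumers"`
}

func main() {
    // Placeholder address: port-forward to each nats pod or use its pod IP.
    resp, err := http.Get("http://localhost:8222/varz")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var v varz
    if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
        log.Fatal(err)
    }

    // In a 3-node cluster each server should normally report 2 routes;
    // fewer routes plus a climbing slow_consumers count suggests the
    // route connections themselves are being cut.
    fmt.Printf("routes=%d slow_consumers=%d\n", v.Routes, v.SlowConsumers)
}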

matkam (Author) commented Feb 24, 2023

@wallyqs Each NATS pod has almost the entire 4 vCPU, 8GB memory VM to itself. I've updated the helm values with our resource requests and affinity settings. There are no MemoryPressure events.

It makes sense that there is no full mesh, but what has changed since v2.9.10 that prevents the mesh from recovering?

derekcollison (Member) commented

We will get it fixed before 2.9.15 is released. Thanks for the info and the test case, much appreciated.

esemeniuc commented

Is this fixed now that 2.9.15 is released?

wallyqs (Member) commented Mar 4, 2023

@esemeniuc Most of it was addressed in #3922, which handles this scenario much better (50 memory streams, each with 150 ephemeral push consumers and no flow control). There are still some follow-ups for when this condition is hit in k8s, such as the startup probe causing extra restarts in some scenarios during recovery. A workaround is to extend the startup period in k8s as shown below; we'll likely revisit the heuristics around the startup probe soon to avoid this:

nats:
  healthcheck:
    startup:
      enabled: true
      failureThreshold: 300  # extend the startup window so recovery is not interrupted by probe-driven restarts

matkam closed this as completed Mar 23, 2023
bruth removed the 🐞 bug label Aug 18, 2023