
[BUG] Operator cannot reliably bootstrap a cluster #811

Open
lpeter91 opened this issue May 13, 2024 · 2 comments
Labels: bug (Something isn't working), untriaged (Issues that have not yet been triaged)


@lpeter91

What is the bug?

The operator sometimes fails to correctly bootstrap/initialize a new cluster; instead, it settles into a yellow state with shards stuck in the unassigned and initializing states.

How can one reproduce the bug?

Note that this doesn't always happen, so you might have to try multiple times; however, it happens for me more often than not:

Apply the minimal example below. It's basically the first example from the docs, with the now-mandatory TLS configuration added and Dashboards removed. Wait until the bootstrapping finishes.

apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: my-first-cluster
  namespace: default
spec:
  general:
    serviceName: my-first-cluster
    version: 2.13.0
  security:
    tls:
      transport:
        generate: true
        perNode: true
      http:
        generate: true
  nodePools:
    - component: nodes
      replicas: 3
      diskSize: "5Gi"
      nodeSelector:
      resources:
         requests:
            memory: "2Gi"
            cpu: "500m"
         limits:
            memory: "2Gi"
            cpu: "500m"
      roles:
        - "cluster_manager"
        - "data"

When the setup process finishes, the bootstrap pod is removed. Around this time, the operator sometimes decides to log the event "Starting to rolling restart" and recreates the first node (pod). If this happens, the cluster sometimes ends up in a yellow state that the operator does not resolve. If at this point I manually delete the cluster_manager pod (usually the second node), it is recreated and the issue seems to resolve itself.
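The stuck state and the manual workaround described above can be reproduced roughly like this (a sketch; the port-forward and admin credentials are assumptions about your setup, not part of the original report):

# Expose the cluster's HTTP port locally (the service name comes from spec.general.serviceName)
kubectl port-forward svc/my-first-cluster 9200:9200

# The cluster stays yellow with unassigned/initializing shards
curl -sk -u admin:<admin-password> "https://localhost:9200/_cluster/health?pretty"
curl -sk -u admin:<admin-password> "https://localhost:9200/_cat/shards?v"

# Find the elected cluster manager, then delete that pod; it is recreated and the cluster recovers
# (in the report's case this was usually the second node, nodes-1)
curl -sk -u admin:<admin-password> "https://localhost:9200/_cat/cluster_manager?v"
kubectl delete pod my-first-cluster-nodes-1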

What is the expected behavior?

A cluster that reaches a green state after setup, preferably without unnecessary restarts.

What is your host/environment?

minikube v1.33.0 on Opensuse-Tumbleweed 20240511 w/ docker driver

I'm currently evaluating the operator locally. This might be part of the problem, as it forces me to run 3 nodes on a single machine. (However, the machine does have sufficient resources to accommodate the nodes. The issue was also reproduced on a MacBook, albeit also with minikube.)
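For what it's worth, a minikube profile sized for the manifest above would look something like this (the exact sizing is my own assumption, not taken from the report; the three nodes request 2Gi each):

# Roughly 6Gi for the nodes plus headroom for the operator and system pods
minikube start --driver=docker --cpus=4 --memory=8192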

Do you have any additional context?

See the attached files; the commands used to collect the API outputs are sketched after the list. Some logs are probably missing since a pod was recreated.
kubectl_describe.txt
operator.log
node-2.log
node-1.log
node-0.log
allocation_explain.json
cat_shards.txt
cat_nodes.txt
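The API outputs and logs above can be collected roughly like this (a sketch; same port-forward and credential assumptions as earlier, and <operator-pod> stands for whatever the operator pod is called in your install):

curl -sk -u admin:<admin-password> "https://localhost:9200/_cluster/allocation/explain?pretty" > allocation_explain.json
curl -sk -u admin:<admin-password> "https://localhost:9200/_cat/shards?v" > cat_shards.txt
curl -sk -u admin:<admin-password> "https://localhost:9200/_cat/nodes?v" > cat_nodes.txt
kubectl describe pods my-first-cluster-nodes-0 my-first-cluster-nodes-1 my-first-cluster-nodes-2 > kubectl_describe.txt
kubectl logs my-first-cluster-nodes-0 > node-0.log
kubectl logs <operator-pod> > operator.log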

lpeter91 added the bug and untriaged labels on May 13, 2024
@dtaivpp
Contributor

dtaivpp commented May 14, 2024

Going to be honest here: not having enough resources to host the cluster is probably where you are running into issues. OpenSearch gets really unstable when there is not enough memory. I've personally experienced this as well when running OpenSearch in Docker containers.

These logs are concerning, but it's hard to say they are unrelated to OOM-type issues.

Node 0 Log:
[2024-05-13T17:35:44,103][WARN ][o.o.s.SecurityAnalyticsPlugin] [my-first-cluster-nodes-0] Failed to initialize LogType config index and builtin log types

Node 1 Log:

[2024-05-13T17:35:42,014][INFO ][o.o.i.i.MetadataService  ] [my-first-cluster-nodes-1] ISM config index not exist, so we cancel the metadata migration job.
[2024-05-13T17:35:43,296][ERROR][o.o.s.l.LogTypeService   ] [my-first-cluster-nodes-1] Custom LogType Bulk Index had failures:
 
[2024-05-13T17:35:43,296][ERROR][o.o.s.l.LogTypeService   ] [my-first-cluster-nodes-1] Custom LogType Bulk Index had failures:
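One way to rule out memory pressure is to check whether any of the OpenSearch containers were OOM-killed and restarted (a sketch; the pod name follows the manifest above):

# A non-empty reason (e.g. OOMKilled) means the container was previously killed and restarted
kubectl get pod my-first-cluster-nodes-0 -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
kubectl describe pod my-first-cluster-nodes-0 | grep -A 3 'Last State'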

@lpeter91
Author

Now I can confirm that this issue also happens on an actual production Kubernetes cluster with plenty of resources. The operator erroneously decides to do a rolling restart and fails to complete it, leaving the cluster in a yellow state. It seems like a concurrency issue, as it doesn't always happen.
