
[BUG] Upgrade from OpenSearch 2.8.0 to 2.11.1 does not complete and cluster remains in a yellow state #779

Open
nilushancosta opened this issue Apr 5, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@nilushancosta
Contributor

nilushancosta commented Apr 5, 2024

What is the bug?

I tested upgrading OpenSearch from 2.8.0 to 2.11.1 using the OpenSearch Operator. I deployed the operator and a 3-node OpenSearchCluster using Helm. Once the upgrade started, one of the OpenSearch pods was terminated and a new one came up to replace it (running the new 2.11.1 version). However, the other two pods remained on the old version and the upgrade did not continue.

Output of kubectl get OpenSearchCluster

NAME                 HEALTH   NODES   VERSION   PHASE     AGE
opensearch-cluster   yellow   3       2.8.0     RUNNING   44m

Part of the output of kubectl describe OpenSearchCluster

Status:
  Available Nodes:  3
  Components Status:
    Component:  Upgrader
    Conditions:
      cluster is not green and drain nodes is enabled
    Description:  masters
    Status:       Upgrading
    Component:    Upgrader
    Conditions:
      cluster is not green and drain nodes is enabled
    Description:  masters
    Status:       Upgrading
  Health:         yellow
  Initialized:    true
  Phase:          RUNNING
  Version:        2.8.0
Events:
  Type    Reason    Age                From                     Message
  ----    ------    ----               ----                     -------
  Normal  Security  66m                containerset-controller  Starting to securityconfig update job
  Normal  Upgrade   27m (x2 over 27m)  containerset-controller  Starting upgrade of node pool 'masters'

The operator-controller-manager container (in the operator pod) keeps printing these logs:
{"level":"info","ts":"2024-04-05T08:11:15.094Z","msg":"Reconciling OpenSearchCluster","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e","cluster":{"name":"opensearch-cluster","namespace":"test-ns"}}
{"level":"info","ts":"2024-04-05T08:11:15.379Z","msg":"Generating certificates","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e","interface":"transport"}
{"level":"info","ts":"2024-04-05T08:11:15.379Z","msg":"Generating certificates","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e","interface":"http"}
{"level":"info","ts":"2024-04-05T08:11:15.380Z","msg":"Not passed any SecurityconfigSecret","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e"}
{"level":"info","ts":"2024-04-05T08:11:15.385Z","msg":"ServiceMonitor crd not found, skipping deletion","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e"}
{"level":"info","ts":"2024-04-05T08:11:15.684Z","msg":"Cluster is not ready for next pod to restart because cluster is not green and drain nodes is enabled","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e","reconciler":"upgrade"}

I know that the compatibility chart does not list 2.11.1. Since newer minor versions not listed there are generally said to work as well, I wanted to try this in a test setup. Is this a known issue that will be fixed when 2.11.1 is officially supported?

How can one reproduce the bug?

  1. Using Helm, install the OpenSearch operator and then install an OpenSearch cluster with 2.8.0 as the version in opensearchCluster.general.version
  2. Change opensearchCluster.general.version from 2.8.0 to 2.11.1 and run a helm upgrade (example commands are sketched after this list)
  3. One OpenSearch pod gets terminated and a new one is created and reaches the Running state (with version 2.11.1)
  4. The upgrade does not proceed any further
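For reference, the commands I mean look roughly like this (release names, the values file path, and the chart repo URL are my assumptions; the chart names match the versions listed below):

# Add the chart repo (assumed URL), then install the operator and a 2.8.0 cluster
helm repo add opensearch-operator https://opensearch-project.github.io/opensearch-k8s-operator/
helm install opensearch-operator opensearch-operator/opensearch-operator -n test-ns
helm install opensearch-cluster opensearch-operator/opensearch-cluster -n test-ns -f values.yaml   # general.version: 2.8.0

# Bump opensearchCluster.general.version to 2.11.1 in values.yaml, then:
helm upgrade opensearch-cluster opensearch-operator/opensearch-cluster -n test-ns -f values.yaml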

What is the expected behavior?

I also tested an upgrade from 2.7.0 to 2.8.0 on the same setup. There the upgrade completed and the Health of the OpenSearchCluster resource became green. During the upgrade, pods were terminated one at a time and new pods with version 2.8.0 came up, after which a rolling restart took place. I expected the same behaviour for the 2.8.0 to 2.11.1 upgrade.

What is your host/environment?

I tried this on Rancher Desktop on macOS. Kubernetes version: v1.25.11

Helm chart versions
opensearch-operator/opensearch-operator: 2.5.1
opensearch-operator/opensearch-cluster: 2.5.1

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.

Do you have any additional context?

opensearch-operator/opensearch-operator was deployed with default values.

The values file used with the opensearch-operator/opensearch-cluster Helm chart is as follows:

opensearchCluster:
  enabled: true
#  bootstrap:
#    Configure settings for the bootstrap pod
  general:
    httpPort: "9200"
    version: 2.11.1
    serviceName: "opensearch"
    drainDataNodes: true
    setVMMaxMapCount: true
#    securityContext:
#       Specify container security context for OpenSearch pods
#    podSecurityContext:
#       Specify pod security context for OpenSearch pods
  dashboards:
    enable: false
    replicas: 1
    version: 2.3.0
#        securityContext:
#           Specify container security context for OSD pods
#        podSecurityContext:
#           Specify pod security context for OSD pods
    resources:
      requests:
        memory: "1Gi"
        cpu: "200m"
      limits:
        memory: "1Gi"
        cpu: "400m"
  nodePools:
    - component: masters
      diskSize: "2Gi"
      replicas: 3
      roles:
        - "master"
        - "data"
      resources:
        requests:
          memory: "2Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "500m"
  security:
    tls:
      transport:
        generate: true
      http:
        generate: true
nilushancosta added the bug (Something isn't working) and untriaged (Issues that have not yet been triaged) labels on Apr 5, 2024
@nilushancosta
Contributor Author

I made a GET API call to _cluster/allocation/explain and, as per the response, there is a problem allocating a shard because the other nodes are running an older version. Could this be why the cluster is in a yellow state?
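
The call, roughly (port-forwarding the cluster service named in serviceName; the admin credentials are placeholders):

kubectl -n test-ns port-forward svc/opensearch 9200:9200 &
curl -sk -u admin:<admin-password> "https://localhost:9200/_cluster/allocation/explain?pretty"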

{
    "index": ".opensearch-sap-log-types-config",
    "shard": 0,
    "primary": false,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "REPLICA_ADDED",
        "at": "2024-04-05T08:08:30.191Z",
        "last_allocation_status": "no_attempt"
    },
    "can_allocate": "no",
    "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
    "node_allocation_decisions": [
        {
            "node_id": "4vw4uCqKTcCfWfY1Ml8H7A",
            "node_name": "opensearch-cluster-masters-2",
            "transport_address": "10.42.0.238:9300",
            "node_attributes": {
                "shard_indexing_pressure_enabled": "true"
            },
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "node_version",
                    "decision": "NO",
                    "explanation": "cannot allocate replica shard to a node with version [2.8.0] since this is older than the primary version [2.11.1]"
                }
            ]
        },
        {
            "node_id": "NsAtfEdeQXiZz55clSVc7Q",
            "node_name": "opensearch-cluster-masters-0",
            "transport_address": "10.42.0.239:9300",
            "node_attributes": {
                "shard_indexing_pressure_enabled": "true"
            },
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "same_shard",
                    "decision": "NO",
                    "explanation": "a copy of this shard is already allocated to this node [[.opensearch-sap-log-types-config][0], node[NsAtfEdeQXiZz55clSVc7Q], [P], s[STARTED], a[id=NfN2AKPhRiayU2Cei3pX3Q]]"
                }
            ]
        },
        {
            "node_id": "jVC68dgBR3C5oKixg4EnFg",
            "node_name": "opensearch-cluster-masters-1",
            "transport_address": "10.42.0.236:9300",
            "node_attributes": {
                "shard_indexing_pressure_enabled": "true"
            },
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "node_version",
                    "decision": "NO",
                    "explanation": "cannot allocate replica shard to a node with version [2.8.0] since this is older than the primary version [2.11.1]"
                }
            ]
        }
    ]
}

@nilushancosta
Contributor Author

nilushancosta commented Apr 8, 2024

I did some more debugging and these are my findings.

When I deploy an OpenSearchCluster with version 2.11.1, the following indices get created
[Screenshot: list of indices on a fresh 2.11.1 cluster]

When I deploy one with version 2.8.0, the following indices get created
[Screenshot: list of indices on a fresh 2.8.0 cluster]

The .opensearch-sap-log-types-config index is the extra index in version 2.11.1, and it requires 3 shards (1 primary and 2 replicas).
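
A quick way to confirm that it is this index's replicas that are unassigned (same placeholder endpoint and credentials as in my earlier comment):

curl -sk -u admin:<admin-password> \
  "https://localhost:9200/_cat/shards/.opensearch-sap-log-types-config?v&h=index,shard,prirep,state,node"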

I had a look at the operator codebase to see how the .opensearch-observability index (which also requires 3 shards) is handled. If the cluster goes into a yellow state because of this index, the operator allows the upgrade to continue -

// continueRestartWithYellowHealth allows upgrades and rolling restarts to continue when the cluster is yellow

As per the allocation explain API response, it is this .opensearch-sap-log-types-config index that has the allocation failure, but I could not find a similar skip rule for it.
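
The per-index health view shows the same thing, i.e. which index is keeping the cluster yellow (placeholder endpoint and credentials again):

curl -sk -u admin:<admin-password> "https://localhost:9200/_cat/indices?v&health=yellow"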

I also found opensearch-project/opensearch-build#4285, which discusses .opensearch-sap-log-types-config causing the cluster to go into a yellow state.

@nilushancosta
Contributor Author

WORKAROUND

.opensearch-sap-log-types-config and .plugins-ml-config are created by the bundled Security Analytics and ML Commons plugins. As a workaround, I was able to remove these two plugins by building a Docker image without them and deploying OpenSearch from that image. Once I did that, the upgrade from 2.8.0 to 2.11.1 completed.

FROM opensearchproject/opensearch:2.11.1

RUN /usr/share/opensearch/bin/opensearch-plugin remove opensearch-security-analytics \
    && /usr/share/opensearch/bin/opensearch-plugin remove opensearch-ml
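
Roughly, the build and push look like this (registry and tag are placeholders):

docker build -t <registry>/opensearch-custom:2.11.1 .
docker push <registry>/opensearch-custom:2.11.1

The cluster then has to be pointed at this image instead of the default one before running the upgrade; in my values file that is the image override under opensearchCluster.general (assuming the chart passes it through to the cluster spec).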

@prudhvigodithi
Collaborator

[Triage]
Hey, this looks to me like an issue with the plugin rather than an issue with the operator. Please check opensearch-project/opensearch-build#4285 (comment).
Adding @swoehrl-mw @salyh

prudhvigodithi removed the untriaged (Issues that have not yet been triaged) label on May 13, 2024