
[BUG] Upgrade from OpenSearch 2.8.0 to 2.11.1 does not complete and cluster remains in a yellow state #779

Open
nilushancosta opened this issue Apr 5, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@nilushancosta
Contributor

nilushancosta commented Apr 5, 2024

What is the bug?

I tested upgrading OpenSearch from 2.8.0 to 2.11.1 using the OpenSearch Operator. I deployed the operator and a 3-node OpenSearchCluster using Helm. Once the upgrade started, one of the OpenSearch pods was terminated and a new one came up to replace it (running the new 2.11.1 version). However, the other two pods remained on the old version and the upgrade did not continue.

Output of kubectl get OpenSearchCluster

NAME                 HEALTH   NODES   VERSION   PHASE     AGE
opensearch-cluster   yellow   3       2.8.0     RUNNING   44m

Part of the output of kubectl describe OpenSearchCluster

Status:
  Available Nodes:  3
  Components Status:
    Component:  Upgrader
    Conditions:
      cluster is not green and drain nodes is enabled
    Description:  masters
    Status:       Upgrading
    Component:    Upgrader
    Conditions:
      cluster is not green and drain nodes is enabled
    Description:  masters
    Status:       Upgrading
  Health:         yellow
  Initialized:    true
  Phase:          RUNNING
  Version:        2.8.0
Events:
  Type    Reason    Age                From                     Message
  ----    ------    ----               ----                     -------
  Normal  Security  66m                containerset-controller  Starting to securityconfig update job
  Normal  Upgrade   27m (x2 over 27m)  containerset-controller  Starting upgrade of node pool 'masters'

The operator-controller-manager container (in the operator pod) keeps printing these logs:
{"level":"info","ts":"2024-04-05T08:11:15.094Z","msg":"Reconciling OpenSearchCluster","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e","cluster":{"name":"opensearch-cluster","namespace":"test-ns"}}
{"level":"info","ts":"2024-04-05T08:11:15.379Z","msg":"Generating certificates","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e","interface":"transport"}
{"level":"info","ts":"2024-04-05T08:11:15.379Z","msg":"Generating certificates","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e","interface":"http"}
{"level":"info","ts":"2024-04-05T08:11:15.380Z","msg":"Not passed any SecurityconfigSecret","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e"}
{"level":"info","ts":"2024-04-05T08:11:15.385Z","msg":"ServiceMonitor crd not found, skipping deletion","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e"}
{"level":"info","ts":"2024-04-05T08:11:15.684Z","msg":"Cluster is not ready for next pod to restart because cluster is not green and drain nodes is enabled","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"opensearch-cluster","namespace":"test-ns"},"namespace":"test-ns","name":"opensearch-cluster","reconcileID":"921912f4-acff-497a-afc5-4bb73e51383e","reconciler":"upgrade"}

I know that the compatibility chart does not list 2.11.1. Since newer minor versions not listed there are generally said to work as well, I wanted to try this in a test setup. Is this a known issue that will be fixed when 2.11.1 is officially supported?

How can one reproduce the bug?

  1. Using Helm, install the OpenSearch operator and then install an OpenSearch cluster with 2.8.0 as the version in opensearchCluster.general.version
  2. Change opensearchCluster.general.version from 2.8.0 to 2.11.1 and run a helm upgrade (example commands are sketched after this list)
  3. One OpenSearch pod gets terminated and a new one is created and reaches the Running state (with version 2.11.1)
  4. The upgrade does not proceed any further
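For reference, the commands I mean look roughly like this (release names, the values file path, and the chart repo URL are my assumptions; the chart names match the versions listed below):

# Add the chart repo (assumed URL), then install the operator and a 2.8.0 cluster
helm repo add opensearch-operator https://opensearch-project.github.io/opensearch-k8s-operator/
helm install opensearch-operator opensearch-operator/opensearch-operator -n test-ns
helm install opensearch-cluster opensearch-operator/opensearch-cluster -n test-ns -f values.yaml   # general.version: 2.8.0

# Bump opensearchCluster.general.version to 2.11.1 in values.yaml, then:
helm upgrade opensearch-cluster opensearch-operator/opensearch-cluster -n test-ns -f values.yaml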

What is the expected behavior?

I also tested an upgrade from 2.7.0 to 2.8.0 on the same setup. There the upgrade completed and the Health of the OpenSearchCluster resource became green. During the upgrade, pods were terminated one at a time and new pods with version 2.8.0 came up, after which a rolling restart took place. I expected the same behaviour for the 2.8.0 to 2.11.1 upgrade.

What is your host/environment?

I tried this on Rancher Desktop on macOS. Kubernetes version: v1.25.11

Helm chart versions
opensearch-operator/opensearch-operator: 2.5.1
opensearch-operator/opensearch-cluster: 2.5.1

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.

Do you have any additional context?

opensearch-operator/opensearch-operator was deployed with default values.

The values file used with the opensearch-operator/opensearch-cluster Helm chart is as follows:

opensearchCluster:
  enabled: true
#  bootstrap:
#    Configure settings for the bootstrap pod
  general:
    httpPort: "9200"
    version: 2.11.1
    serviceName: "opensearch"
    drainDataNodes: true
    setVMMaxMapCount: true
#    securityContext:
#       Specify container security context for OpenSearch pods
#    podSecurityContext:
#       Specify pod security context for OpenSearch pods
  dashboards:
    enable: false
    replicas: 1
    version: 2.3.0
#        securityContext:
#           Specify container security context for OSD pods
#        podSecurityContext:
#           Specify pod security context for OSD pods
    resources:
      requests:
        memory: "1Gi"
        cpu: "200m"
      limits:
        memory: "1Gi"
        cpu: "400m"
  nodePools:
    - component: masters
      diskSize: "2Gi"
      replicas: 3
      roles:
        - "master"
        - "data"
      resources:
        requests:
          memory: "2Gi"
          cpu: "500m"
        limits:
          memory: "2Gi"
          cpu: "500m"
  security:
    tls:
      transport:
        generate: true
      http:
        generate: true
nilushancosta added the bug (Something isn't working) and untriaged (Issues that have not yet been triaged) labels on Apr 5, 2024
@nilushancosta
Contributor Author

I made a GET API call to _cluster/allocation/explain and, as per the response, there is a problem allocating a shard because the other nodes are running an older version. Could this be why the cluster is in a yellow state?
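
The call, roughly (port-forwarding the cluster service named in serviceName; the admin credentials are placeholders):

kubectl -n test-ns port-forward svc/opensearch 9200:9200 &
curl -sk -u admin:<admin-password> "https://localhost:9200/_cluster/allocation/explain?pretty"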

{
    "index": ".opensearch-sap-log-types-config",
    "shard": 0,
    "primary": false,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "REPLICA_ADDED",
        "at": "2024-04-05T08:08:30.191Z",
        "last_allocation_status": "no_attempt"
    },
    "can_allocate": "no",
    "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
    "node_allocation_decisions": [
        {
            "node_id": "4vw4uCqKTcCfWfY1Ml8H7A",
            "node_name": "opensearch-cluster-masters-2",
            "transport_address": "10.42.0.238:9300",
            "node_attributes": {
                "shard_indexing_pressure_enabled": "true"
            },
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "node_version",
                    "decision": "NO",
                    "explanation": "cannot allocate replica shard to a node with version [2.8.0] since this is older than the primary version [2.11.1]"
                }
            ]
        },
        {
            "node_id": "NsAtfEdeQXiZz55clSVc7Q",
            "node_name": "opensearch-cluster-masters-0",
            "transport_address": "10.42.0.239:9300",
            "node_attributes": {
                "shard_indexing_pressure_enabled": "true"
            },
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "same_shard",
                    "decision": "NO",
                    "explanation": "a copy of this shard is already allocated to this node [[.opensearch-sap-log-types-config][0], node[NsAtfEdeQXiZz55clSVc7Q], [P], s[STARTED], a[id=NfN2AKPhRiayU2Cei3pX3Q]]"
                }
            ]
        },
        {
            "node_id": "jVC68dgBR3C5oKixg4EnFg",
            "node_name": "opensearch-cluster-masters-1",
            "transport_address": "10.42.0.236:9300",
            "node_attributes": {
                "shard_indexing_pressure_enabled": "true"
            },
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "node_version",
                    "decision": "NO",
                    "explanation": "cannot allocate replica shard to a node with version [2.8.0] since this is older than the primary version [2.11.1]"
                }
            ]
        }
    ]
}

@nilushancosta
Contributor Author

nilushancosta commented Apr 8, 2024

I did some more debugging and these are my findings.

When I deploy an OpenSearchCluster with version 2.11.1, the following indices get created
[Screenshot: list of indices on a fresh 2.11.1 cluster]

When I deploy one with version 2.8.0, the following indices get created
[Screenshot: list of indices on a fresh 2.8.0 cluster]

The .opensearch-sap-log-types-config index is the extra index in version 2.11.1, and it requires 3 shards (1 primary and 2 replicas).
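
A quick way to confirm that it is this index's replicas that are unassigned (same placeholder endpoint and credentials as in my earlier comment):

curl -sk -u admin:<admin-password> \
  "https://localhost:9200/_cat/shards/.opensearch-sap-log-types-config?v&h=index,shard,prirep,state,node"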

I had a look at the operator codebase to see how the .opensearch-observability index (which also requires 3 shards) is handled. If the cluster goes into a yellow state because of this index, the operator allows the upgrade to continue -

// continueRestartWithYellowHealth allows upgrades and rolling restarts to continue when the cluster is yellow

As per the allocation explain API response, it is this .opensearch-sap-log-types-config index that has the allocation failure, but I could not find a similar skip rule for it.
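
The per-index health view shows the same thing, i.e. which index is keeping the cluster yellow (placeholder endpoint and credentials again):

curl -sk -u admin:<admin-password> "https://localhost:9200/_cat/indices?v&health=yellow"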

I also found opensearch-project/opensearch-build#4285, which discusses .opensearch-sap-log-types-config causing the cluster to go into a yellow state.

@nilushancosta
Contributor Author

WORKAROUND

.opensearch-sap-log-types-config and .plugins-ml-config are created by the bundled Security Analytics and ML Commons plugins. As a workaround, I was able to remove these two plugins by building a Docker image without them and deploying OpenSearch from that image. Once I did that, the upgrade from 2.8.0 to 2.11.1 completed.

FROM opensearchproject/opensearch:2.11.1

RUN /usr/share/opensearch/bin/opensearch-plugin remove opensearch-security-analytics \
    && /usr/share/opensearch/bin/opensearch-plugin remove opensearch-ml
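
Roughly, the build and push look like this (registry and tag are placeholders):

docker build -t <registry>/opensearch-custom:2.11.1 .
docker push <registry>/opensearch-custom:2.11.1

The cluster then has to be pointed at this image instead of the default one before running the upgrade; in my values file that is the image override under opensearchCluster.general (assuming the chart passes it through to the cluster spec).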

@prudhvigodithi
Collaborator

[Triage]
Hey, this looks to me like an issue with the plugin rather than an issue with the operator. Please check opensearch-project/opensearch-build#4285 (comment).
Adding @swoehrl-mw @salyh

prudhvigodithi removed the untriaged (Issues that have not yet been triaged) label on May 13, 2024