Network partition caused resources to be deleted from state without warning #1013

Closed
lblackstone opened this issue Mar 2, 2020 · 4 comments · Fixed by #1379
Labels: impact/reliability (Something that feels unreliable or flaky)

lblackstone (Member) commented Mar 2, 2020

Problem description

A user reported that they ran an update which deleted resources from Pulumi's state, but left the actual resources on the Kubernetes cluster. This was apparently due to a network partition that occurred between the preview and update steps, so no prior warning was given.

Although the state was recoverable from the previous update, this was a bad user experience, and we should reconsider the way we handle state for unreachable clusters.

Related: #491 #881

Errors & Logs

See #881 (comment)

Updating (development):
     Type                                                              Name                                             Status      Info
     pulumi:pulumi:Stack                                               azure-kubernetes-cluster-development                         
     └─ osimis:AzureKubernetesCluster                                  osimis-lify-k8s                                              
 -      ├─ kubernetes:helm.sh:Chart                                    ingress-gloo                                     deleted     
 -      │  ├─ kubernetes:core:ServiceAccount                           ingress-gloo-qq119b7f/discovery                  deleted     1 warning
 -      │  ├─ kubernetes:core:ServiceAccount                           ingress-gloo-qq119b7f/gloo                       deleted     1 warning
 -      │  ├─ kubernetes:core:ConfigMap                                ingress-gloo-qq119b7f/ingress-envoy-config       deleted     1 warning
 -      │  ├─ kubernetes:core:ConfigMap                                ingress-gloo-qq119b7f/gloo-usage                 deleted     1 warning
 -      │  ├─ kubernetes:core:Service                                  ingress-gloo-qq119b7f/ingress-proxy              deleted     1 warning
 -      │  ├─ kubernetes:rbac.authorization.k8s.io:ClusterRoleBinding  gloo-role-binding-ingress-ingress-gloo-qq119b7f  deleted     1 warning
 -      │  ├─ kubernetes:apps:Deployment                               ingress-gloo-qq119b7f/discovery                  deleted     1 warning
 -      │  ├─ kubernetes:rbac.authorization.k8s.io:ClusterRole         gloo-role-ingress                                deleted     1 warning
 -      │  ├─ kubernetes:apps:Deployment                               ingress-gloo-qq119b7f/ingress                    deleted     1 warning
 -      │  ├─ kubernetes:core:Service                                  ingress-gloo-qq119b7f/gloo                       deleted     1 warning
 -      │  ├─ kubernetes:apps:Deployment                               ingress-gloo-qq119b7f/ingress-proxy              deleted     1 warning
 -      │  └─ kubernetes:apps:Deployment                               ingress-gloo-qq119b7f/gloo                       deleted     1 warning
 -      ├─ kubernetes:core:Namespace                                   ingress-gloo                                     deleted     1 warning
 -      └─ kubernetes:cert-manager.io:ClusterIssuer                    nginx-cluster-issuer                             deleted     1 warning
 
Diagnostics:
  kubernetes:apps:Deployment (ingress-gloo-qq119b7f/discovery):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:ConfigMap (ingress-gloo-qq119b7f/gloo-usage):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:Namespace (ingress-gloo):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:ServiceAccount (ingress-gloo-qq119b7f/gloo):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:ConfigMap (ingress-gloo-qq119b7f/ingress-envoy-config):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:cert-manager.io:ClusterIssuer (nginx-cluster-issuer):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:ServiceAccount (ingress-gloo-qq119b7f/discovery):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:rbac.authorization.k8s.io:ClusterRoleBinding (gloo-role-binding-ingress-ingress-gloo-qq119b7f):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:rbac.authorization.k8s.io:ClusterRole (gloo-role-ingress):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:apps:Deployment (ingress-gloo-qq119b7f/gloo):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:apps:Deployment (ingress-gloo-qq119b7f/ingress):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:Service (ingress-gloo-qq119b7f/gloo):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:apps:Deployment (ingress-gloo-qq119b7f/ingress-proxy):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:Service (ingress-gloo-qq119b7f/ingress-proxy):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
Outputs:
   <outputs redacted>
Resources:
    - 15 deleted
    69 unchanged

Reproducing the issue

  1. Create a stack with resources deployed to Kubernetes
  2. Remove a resource from the program and run pulumi up
  3. After preview, but before confirming the update, make the cluster inaccessible (change kubeconfig or similar out of band)
  4. Apply the update
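A minimal shell walkthrough of these steps (a sketch; it assumes the stack's kubeconfig lives at the default ~/.kube/config, and only the pulumi commands themselves come from the report above):

```
# 1. Deploy a stack that manages at least one Kubernetes resource
pulumi up --yes

# 2. Remove one of those resources from the program and start an update,
#    stopping at the interactive preview (do not confirm yet)
pulumi up

# 3. Out of band, make the cluster unreachable, e.g. by editing
#    ~/.kube/config so the API server address no longer resolves
#    (any method that breaks connectivity between preview and update works)

# 4. Confirm the update. The provider only warns that the cluster is
#    unreachable, and the resource is dropped from the state even though
#    it still exists on the cluster.
```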

Suggestions for a fix

The current behavior was intended to unblock users who had inadvertently deleted their Kubernetes cluster before cleaning up the resources deployed to it. When the cluster is already unreachable at preview time, a descriptive warning is shown before the resources are deleted from the state; in this case the cluster became unreachable only after the preview, so no warning was shown.

Rather than deleting resources from the state by default, it would be better to require an explicit force-delete option for the users who genuinely need to repair invalid state.
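As a rough illustration of what that could look like from the CLI (purely hypothetical; neither the error text nor the override shown here existed at the time this issue was filed):

```
# Proposed default: refuse to drop state for resources on an unreachable cluster
pulumi up
# => error: configured Kubernetes cluster is unreachable; not deleting
#    resources from the state (hypothetical message, shown for illustration)

# Hypothetical explicit opt-in for users whose cluster really is gone and who
# need to clear the corresponding resources out of the state
PULUMI_K8S_DELETE_UNREACHABLE=true pulumi up
```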

@lblackstone lblackstone self-assigned this Mar 2, 2020
lblackstone (Member, Author) commented:

@ringods

maxromanovsky commented:

This just happened to me. IMHO it's a huge issue for a tool that is supposed to track the state of infrastructure and that explicitly declares:

Pulumi will complete any pending operations currently in progress and then exit and report the error or failure.

Crossing fingers and hoping for a fix!

@lblackstone lblackstone added this to the 0.33 milestone Mar 10, 2020
@lblackstone lblackstone modified the milestones: 0.33, 0.34 Mar 18, 2020
@leezen leezen removed this from the 0.34 milestone Apr 7, 2020
WoLfulus commented Sep 30, 2020

This is still happening. All my resources were deleted because the provider could not connect to the cluster.

I literally only ran a pulumi refresh, but since it emits a large number of warnings about Kubernetes deprecations, I missed the cluster-unreachable warnings among them and applied the changes.

I reverted the stack state and did a refresh on only the KubernetesCluster resource, which fetched a new kubeconfig, but for some reason the provider doesn't use that kubeconfig. (Note that I'm doing the status.apply() call just like the published examples that were supposed to cover this scenario.)

After that, everything is broken because we can't even run commands due to Pulumi's state integrity checks.

Any updates on this one? Or at least a workaround?

I also tried doing a refresh on the Provider; that didn't work either.

lblackstone (Member, Author) commented:

You should be able to revert to any previous checkpoint state with the following:
pulumi stack export --version=<previous-version-number> > out
followed by
pulumi stack import --file=out

That should get your stack back into a good state so you can resume updates as normal.
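Put together, the recovery sequence looks roughly like this (a sketch; pulumi stack history lists the version numbers to pick from):

```
# Find the version of the last checkpoint that still contained the resources
pulumi stack history

# Export that checkpoint, then re-import it as the current state
pulumi stack export --version=<previous-version-number> > out
pulumi stack import --file=out
```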

@leezen leezen added this to the current milestone Nov 2, 2020
@leezen leezen added the impact/reliability Something that feels unreliable or flaky label Nov 9, 2020
@leezen leezen modified the milestones: current, 0.47 Nov 18, 2020