Network partition caused resources to be deleted from state without warning #1013

Closed
lblackstone opened this issue Mar 2, 2020 · 4 comments · Fixed by #1379
Labels: impact/reliability (Something that feels unreliable or flaky)

lblackstone (Member) commented Mar 2, 2020

Problem description

A user reported that they ran an update which deleted resources from Pulumi's state, but left the actual resources on the Kubernetes cluster. This was apparently due to a network partition that occurred between the preview and update steps, so no prior warning was given.

Although the state was recoverable from the previous update, this was a bad user experience, and we should reconsider the way we handle state for unreachable clusters.

Related: #491 #881

Errors & Logs

See #881 (comment)

Updating (development):
     Type                                                              Name                                             Status      Info
     pulumi:pulumi:Stack                                               azure-kubernetes-cluster-development                         
     └─ osimis:AzureKubernetesCluster                                  osimis-lify-k8s                                              
 -      ├─ kubernetes:helm.sh:Chart                                    ingress-gloo                                     deleted     
 -      │  ├─ kubernetes:core:ServiceAccount                           ingress-gloo-qq119b7f/discovery                  deleted     1 warning
 -      │  ├─ kubernetes:core:ServiceAccount                           ingress-gloo-qq119b7f/gloo                       deleted     1 warning
 -      │  ├─ kubernetes:core:ConfigMap                                ingress-gloo-qq119b7f/ingress-envoy-config       deleted     1 warning
 -      │  ├─ kubernetes:core:ConfigMap                                ingress-gloo-qq119b7f/gloo-usage                 deleted     1 warning
 -      │  ├─ kubernetes:core:Service                                  ingress-gloo-qq119b7f/ingress-proxy              deleted     1 warning
 -      │  ├─ kubernetes:rbac.authorization.k8s.io:ClusterRoleBinding  gloo-role-binding-ingress-ingress-gloo-qq119b7f  deleted     1 warning
 -      │  ├─ kubernetes:apps:Deployment                               ingress-gloo-qq119b7f/discovery                  deleted     1 warning
 -      │  ├─ kubernetes:rbac.authorization.k8s.io:ClusterRole         gloo-role-ingress                                deleted     1 warning
 -      │  ├─ kubernetes:apps:Deployment                               ingress-gloo-qq119b7f/ingress                    deleted     1 warning
 -      │  ├─ kubernetes:core:Service                                  ingress-gloo-qq119b7f/gloo                       deleted     1 warning
 -      │  ├─ kubernetes:apps:Deployment                               ingress-gloo-qq119b7f/ingress-proxy              deleted     1 warning
 -      │  └─ kubernetes:apps:Deployment                               ingress-gloo-qq119b7f/gloo                       deleted     1 warning
 -      ├─ kubernetes:core:Namespace                                   ingress-gloo                                     deleted     1 warning
 -      └─ kubernetes:cert-manager.io:ClusterIssuer                    nginx-cluster-issuer                             deleted     1 warning
 
Diagnostics:
  kubernetes:apps:Deployment (ingress-gloo-qq119b7f/discovery):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:ConfigMap (ingress-gloo-qq119b7f/gloo-usage):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:Namespace (ingress-gloo):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:ServiceAccount (ingress-gloo-qq119b7f/gloo):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:ConfigMap (ingress-gloo-qq119b7f/ingress-envoy-config):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:cert-manager.io:ClusterIssuer (nginx-cluster-issuer):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:ServiceAccount (ingress-gloo-qq119b7f/discovery):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:rbac.authorization.k8s.io:ClusterRoleBinding (gloo-role-binding-ingress-ingress-gloo-qq119b7f):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:rbac.authorization.k8s.io:ClusterRole (gloo-role-ingress):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:apps:Deployment (ingress-gloo-qq119b7f/gloo):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:apps:Deployment (ingress-gloo-qq119b7f/ingress):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:Service (ingress-gloo-qq119b7f/gloo):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:apps:Deployment (ingress-gloo-qq119b7f/ingress-proxy):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
  kubernetes:core:Service (ingress-gloo-qq119b7f/ingress-proxy):
    warning: configured Kubernetes cluster is unreachable: unable to load schema information from the API server: Get https://<mycluster>.azmk8s.io:443/openapi/v2?timeout=32s: net/http: TLS handshake timeout
 
Outputs:
   <outputs redacted>
Resources:
    - 15 deleted
    69 unchanged

Reproducing the issue

  1. Create a stack with resources deployed to Kubernetes
  2. Remove a resource from the program and run pulumi up
  3. After preview, but before confirming the update, make the cluster inaccessible (change kubeconfig or similar out of band)
  4. Apply the update
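A minimal shell walkthrough of these steps (a sketch; it assumes the stack's kubeconfig lives at the default ~/.kube/config, and only the pulumi commands themselves come from the report above):

```
# 1. Deploy a stack that manages at least one Kubernetes resource
pulumi up --yes

# 2. Remove one of those resources from the program and start an update,
#    stopping at the interactive preview (do not confirm yet)
pulumi up

# 3. Out of band, make the cluster unreachable, e.g. by editing
#    ~/.kube/config so the API server address no longer resolves
#    (any method that breaks connectivity between preview and update works)

# 4. Confirm the update. The provider only warns that the cluster is
#    unreachable, and the resource is dropped from the state even though
#    it still exists on the cluster.
```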

Suggestions for a fix

The current behavior was intended to unblock users who had inadvertently deleted their Kubernetes cluster before cleaning up the resources deployed to it. When the cluster is already unreachable at preview time, a descriptive warning is shown before the resources are deleted from the state; in this case the cluster became unreachable only after the preview, so no warning was shown.

Rather than deleting resources from the state by default, it would be better to require an explicit force-delete option for the users who genuinely need to repair invalid state.
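As a rough illustration of what that could look like from the CLI (purely hypothetical; neither the error text nor the override shown here existed at the time this issue was filed):

```
# Proposed default: refuse to drop state for resources on an unreachable cluster
pulumi up
# => error: configured Kubernetes cluster is unreachable; not deleting
#    resources from the state (hypothetical message, shown for illustration)

# Hypothetical explicit opt-in for users whose cluster really is gone and who
# need to clear the corresponding resources out of the state
PULUMI_K8S_DELETE_UNREACHABLE=true pulumi up
```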

@lblackstone lblackstone self-assigned this Mar 2, 2020
lblackstone (Member, Author) commented:

@ringods

maxromanovsky commented:

This just happened to me. IMHO it's a huge issue for a tool that is supposed to track the state of infrastructure and that explicitly declares:

Pulumi will complete any pending operations currently in progress and then exit and report the error or failure.

Crossing fingers and hoping for a fix!

@lblackstone lblackstone added this to the 0.33 milestone Mar 10, 2020
@lblackstone lblackstone modified the milestones: 0.33, 0.34 Mar 18, 2020
@leezen leezen removed this from the 0.34 milestone Apr 7, 2020
WoLfulus commented Sep 30, 2020

This is still happening. All my resources were deleted because the provider could not connect to the cluster.

I literally only ran a pulumi refresh, but since it emits a large number of warnings about Kubernetes deprecations, I missed the cluster-unreachable warnings among them and applied the changes.

I reverted the stack state and did a refresh on only the KubernetesCluster resource, which fetched a new kubeconfig, but for some reason the provider doesn't use that kubeconfig. (Note that I'm doing the status.apply() call just like the published examples that were supposed to cover this scenario.)

After that, everything is broken because we can't even run commands due to Pulumi's state integrity checks.

Any updates on this one? Or at least a workaround?

I also tried doing a refresh on the Provider; that didn't work either.

lblackstone (Member, Author) commented:

You should be able to revert to any previous checkpoint state with the following:
pulumi stack export --version=<previous-version-number> > out
followed by
pulumi stack import --file=out

That should get your stack back into a good state so you can resume updates as normal.
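Put together, the recovery sequence looks roughly like this (a sketch; pulumi stack history lists the version numbers to pick from):

```
# Find the version of the last checkpoint that still contained the resources
pulumi stack history

# Export that checkpoint, then re-import it as the current state
pulumi stack export --version=<previous-version-number> > out
pulumi stack import --file=out
```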

@leezen leezen added this to the current milestone Nov 2, 2020
@leezen leezen added the impact/reliability Something that feels unreliable or flaky label Nov 9, 2020
@leezen leezen modified the milestones: current, 0.47 Nov 18, 2020