Kubernetes client rate-limiting #2942

Open
EronWright opened this issue Apr 11, 2024 · 2 comments
Labels
area/inner-dev-loop, area/tools, impact/performance, kind/bug, kind/engineering

Comments

EronWright (Contributor) commented Apr 11, 2024

What happened?

I was running the Kubernetes provider in a debugger and attaching to it with PULUMI_DEBUG_PROVIDERS. I reused the same provider process across numerous deployments, and eventually the provider transitioned to a failure state, apparently due to client-side rate limiting. Restarting the provider process fixed the problem.

I'm filing this issue because, although my specific case is exotic, there may be a deeper scalability problem in the provider related to rate-limiting in the Kubernetes client.
See kubernetes/kubernetes#111880 for more background.

Diagnostics:
  kubernetes:apps/v1:Deployment (deployment):
    error: update of resource "urn:pulumi:dev::issue-xyz::kubernetes:apps/v1:Deployment::deployment" failed 
    because the Kubernetes API server reported that it failed to fully initialize or become live: 
    client rate limiter Wait returned an error: context canceled

  pulumi:pulumi:Stack (issue-xyz-dev):
    error: update failed

Here's the update made just prior to the first rate-limit error; I had deliberately used an invalid image, nginxfoo.

Diagnostics:
  kubernetes:apps/v1:Deployment (deployment):
    warning: Refreshed resource is in an unhealthy state:
    * Resource 'mydeployment' was created but failed to initialize
    * Minimum number of Pods to consider the application live was not attained
    * [Pod eron/mydeployment-65df56c569-dnqzh]: containers with unready status: [nginx]
    error: update of resource "urn:pulumi:dev::issue-2455::kubernetes:apps/v1:Deployment::deployment" failed because the Kubernetes API server reported that it failed to fully initialize or become live: Resource operation was cancelled for "mydeployment"

Example

name: issue-2942
runtime: yaml
description: A minimal Kubernetes Pulumi YAML program
config:
  pulumi:tags:
    value:
      pulumi:template: kubernetes-yaml
outputs:
  name: ${deployment.metadata.name}
resources:
  deployment:
    properties:
      metadata:
        name: mydeployment
      spec:
        replicas: 1
        selector:
          matchLabels: ${appLabels}
        template:
          metadata:
            labels: ${appLabels}
          spec:
            containers:
            - image: nginx
              name: nginx
              env:
              - name: DEMO_GREETING
                value: "16"
    type: kubernetes:apps/v1:Deployment
variables:
  appLabels:
    app: nginx

N/A

Output of pulumi about

CLI          
Version      3.108.1
Go Version   go1.22.0
Go Compiler  gc

Plugins
NAME        VERSION
kubernetes  unknown
yaml        unknown

Host     
OS       darwin
Version  14.4.1
Arch     arm64

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

EronWright added the impact/performance, kind/bug, and needs-triage labels on Apr 11, 2024
EronWright (Contributor, Author) commented Apr 11, 2024

Here's what happened in my case: the provider was sent a Cancel RPC, which canceled the provider's internal context. On subsequent requests, the Kubernetes client's rate limiter is the first code path to hit the canceled context.

Two possible follow-ups:

  1. Double-check the QPS settings.
  2. Teach the provider to reset the cancellation signal when it receives a Configure RPC (see the sketch below).
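
For (2), here's a minimal sketch of what re-arming the signal could look like; the type and field names are illustrative rather than the provider's actual code:

package provider

import (
    "context"
    "sync"
)

// Illustrative sketch only, not the real provider implementation.
type kubeProvider struct {
    mu     sync.Mutex
    ctx    context.Context    // context threaded into Kubernetes client calls
    cancel context.CancelFunc
}

// Cancel cancels in-flight operations by canceling the shared context. After
// this, any client call that waits on the rate limiter with this context
// fails with "context canceled".
func (k *kubeProvider) Cancel() {
    k.mu.Lock()
    defer k.mu.Unlock()
    if k.cancel != nil {
        k.cancel()
    }
}

// Configure re-arms the cancellation signal so that a provider process that
// was previously canceled can serve subsequent deployments again.
func (k *kubeProvider) Configure() {
    k.mu.Lock()
    defer k.mu.Unlock()
    k.ctx, k.cancel = context.WithCancel(context.Background())
}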

The low-level throttling code is here:

https://github.com/kubernetes/client-go/blob/46588f2726fa3e25b1704d6418190f424f95a990/rest/request.go#L986-L991
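
Paraphrasing that logic (not a verbatim copy): every request first waits on the client-side rate limiter with the request's context, so a context that is already canceled comes back wrapped exactly as in the diagnostics above.

package provider

import (
    "context"
    "fmt"

    "k8s.io/client-go/util/flowcontrol"
)

// tryThrottle paraphrases the linked client-go behavior. Wait blocks until the
// client-side limiter grants a token, but returns the context's error
// immediately if the context has already been canceled; the caller then wraps
// it into "client rate limiter Wait returned an error: context canceled".
func tryThrottle(ctx context.Context, limiter flowcontrol.RateLimiter) error {
    if limiter == nil {
        return nil
    }
    if err := limiter.Wait(ctx); err != nil {
        return fmt.Errorf("client rate limiter Wait returned an error: %w", err)
    }
    return nil
}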

blampe added the area/tools, kind/engineering, and area/inner-dev-loop labels and removed the needs-triage label on Apr 12, 2024
blampe (Contributor) commented Apr 12, 2024

Is there another alternative where we generously bump the QPS ceiling if running under debug? A quick workaround like that might be prudent if this is impacting the debug loop but not end-users.
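
A rough sketch of that workaround, assuming the provider builds a client-go rest.Config somewhere; the PULUMI_DEBUG_PROVIDERS check and the numbers are only illustrative:

package provider

import (
    "os"

    "k8s.io/client-go/rest"
)

// bumpThrottleForDebug raises the client-side rate limit when the provider is
// being run under the debug workflow. client-go's defaults are QPS=5 and
// Burst=10.
func bumpThrottleForDebug(cfg *rest.Config) {
    if os.Getenv("PULUMI_DEBUG_PROVIDERS") != "" {
        cfg.QPS = 100   // sustained requests per second once the burst bucket drains
        cfg.Burst = 200 // headroom for bursts of await/refresh traffic
    }
}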

Related #1748
