
Client.Timeout exceeded (30s) on validation webhooks when updating Ingress objects #11255

Open
torvald opened this issue Apr 12, 2024 · 17 comments
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

torvald commented Apr 12, 2024

What happened:

We continue to hit the (max) timeout on our validation webhook when applying ingress manifests.

failed to call webhook: Post "https://ingress-nginx-controller-admission.ingress.svc:443/networking/v1/ingresses?timeout=30s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Validation time is consistently high, around the 20s mark, and general load or several ingress applies in quick succession can push it to 30s, at which point deploy pipelines start to fail.

[Image: graph of validation time, from the metric exposed by nginx itself, over 24 hours earlier this week.]
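
For anyone who wants to eyeball the same numbers, the controller exposes its admission timings on the metrics port (10254 in this setup); a rough sketch, using the deployment and namespace names from this cluster:

$ kubectl -n ingress port-forward deployment/nginx-ingress-controller 10254:10254 &
$ curl -s localhost:10254/metrics | grep -i admission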

This is me adding a label, to illustrate one simple update:

torvald@surdeig ~ $ time kubectl patch ing <ingress> --type='json' -p='[{"op": "add", "path": "/metadata/labels/testing", "value": "testing"}]'
ingress.networking.k8s.io/<ingress> patched

real	0m17.724s
user	0m0.396s
sys	0m0.057s

This is in a medium-sized cluster:

  • ~130 nodes
  • 270 ingresses
  • 3 nginx pods, each with 8 GB RAM (request/limit) and 5 CPUs (request)
  • ~1000 rps at peak (see graph below)
  • 9.9 MB nginx config file (296k lines, 187 server_names, 4778 locations; counted roughly as sketched below)
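
One way to pull numbers like those in the last bullet out of a running controller pod (a sketch; the pod name is a placeholder):

$ POD=nginx-ingress-controller-<hash>   # any controller pod
$ kubectl -n ingress exec "$POD" -- cat /etc/nginx/nginx.conf | wc -c                    # size in bytes
$ kubectl -n ingress exec "$POD" -- cat /etc/nginx/nginx.conf | wc -l                    # lines
$ kubectl -n ingress exec "$POD" -- cat /etc/nginx/nginx.conf | grep -c " server_name "  # server_name directives
$ kubectl -n ingress exec "$POD" -- cat /etc/nginx/nginx.conf | grep -c " location "     # location directives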

Request rate
image
Over the same time period as above.

Performance of pods
[Image: controller pod CPU and memory usage over the same time period as above.]
To comment on this, it looks and feels quite bearable. Spikes in CPU are assumed to be nginx reloads and validation runs.

90-day trends:
[Image: number of ingresses over the last 90 days.]

[Image: validation webhook duration over the last 90 days.]
This mostly supports organic growth of sorts, except for the quick change marked in the picture above; that has been tracked down to 10 ingresses (serving the same host) that went from 1 host to 3, so the collection of ~60 paths over 1 host became ~180 paths over 3 hosts.

See an example of such an ingress after the change.

What you expected to happen:

I've seen people mention far better performance than 20-30s on their validation webhooks in other issues around here, and that with larger clusters and larger nginx config files. So my expectation would be in the 1-5s range.

This PR will probably help us in cases where multiple ingresses are applied at the same time, but one or a few single applies should probably not take 20s?

NGINX Ingress controller version

nginx/1.21.6, release v1.9.5
torvald@surdeig ~ $ kubectl exec -it nginx-ingress-controller-5d66477fb7-jttwl -- /nginx-ingress-controller --version  
Defaulted container "nginx-ingress-controller" out of: nginx-ingress-controller, opentelemetry (init), sysctl (init), geoip-database-download (init)
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.9.5
  Build:         f503c4bb5fa7d857ad29e94970eb550c2bc00b7c
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.21.6
-------------------------------------------------------------------------------

Kubernetes version (use kubectl version):

torvald@surdeig ~ $ kubectl version --short
Client Version: v1.25.0
Kustomize Version: v4.5.7
Server Version: v1.27.10-gke.1055000

Environment:

  • Cloud provider or hardware configuration: GCP, managed GKE; e2-custom-16-32768
  • OS (e.g. from /etc/os-release): Container-Optimized OS with containerd (cos_containerd)
  • Kernel (e.g. uname -a): 5.15.133+
  • How was the ingress-nginx-controller installed: It probably originated from a helm chart once, but everything has evolved in our own git repo since then. I'll attach the relevant files.
NAME                                                           READY   STATUS    RESTARTS   AGE     IP             NODE                                              NOMINATED NODE   READINESS GATES
pod/nginx-ingress-controller-5d66477fb7-8qtfs                  1/1     Running   0          20h     10.4.117.224   gke-k8s-prod-k8s-prod-standard-v8-611652ec-5lt5   <none>           <none>
pod/nginx-ingress-controller-5d66477fb7-jttwl                  1/1     Running   0          3h9m    10.4.71.52     gke-k8s-prod-k8s-prod-standard-v8-68b36906-wgdl   <none>           <none>
pod/nginx-ingress-controller-5d66477fb7-wlw6t                  1/1     Running   0          20h     10.4.1.143     gke-k8s-prod-k8s-prod-standard-v8-7c2e0d29-s7vz   <none>           <none>

NAME                                                               TYPE           CLUSTER-IP    EXTERNAL-IP           PORT(S)                      AGE      SELECTOR
service/ingress-nginx-controller-admission                         ClusterIP      10.6.12.178   <none>                443/TCP                      2y198d   app=nginx-ingress,component=controller
service/ingress-nginx-controller-collector-metrics                 ClusterIP      10.6.4.251    <none>                8888/TCP                     574d     app=nginx-ingress,component=controller
service/ingress-nginx-controller-metrics                           ClusterIP      10.6.6.244    <none>                10254/TCP                    2y198d   app=nginx-ingress,component=controller
service/nginx-ingress-controller                                   LoadBalancer   10.6.8.95     <redacted>          80:31151/TCP,443:30321/TCP   2y198d   app=nginx-ingress,component=controller

NAME                                            READY   UP-TO-DATE   AVAILABLE   AGE      CONTAINERS                      IMAGES                                                                                      SELECTOR
deployment.apps/nginx-ingress-controller        3/3     3            3           2y198d   nginx-ingress-controller        registry.k8s.io/ingress-nginx/controller:v1.9.5                                             app=nginx-ingress,component=controller

NAME                                                       DESIRED   CURRENT   READY   AGE      CONTAINERS                        IMAGES                                                                                                                         SELECTOR
replicaset.apps/nginx-ingress-controller-5d66477fb7        3         3         3       20h      nginx-ingress-controller          registry.k8s.io/ingress-nginx/controller:v1.9.5                                                                                app=nginx-ingress,component=controller,pod-template-hash=5d66477fb7
  • Current state of ingress object, if applicable:

See an example of an ingress, the same as mentioned above in the «What happened» section.

How to reproduce this issue:

I think reproducing this at will would be unfeasible, but I'm happy to assist with more details.

@torvald torvald added the kind/bug Categorizes issue or PR as related to a bug. label Apr 12, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Apr 12, 2024
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan (Contributor)

Is it possible for you to get logs and events from those 20-30 seconds?

@longwuyuan (Contributor)

/remove-kind bug
/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Apr 13, 2024
torvald commented Apr 15, 2024

torvald@surdeig ~ $ date; time kubectl patch ing tech-radar-auth --type='json' -p='[{"op": "add", "path": "/metadata/labels/testing", "value": "testing"}]'; date;
Mon Apr 15 09:29:26 AM CEST 2024
ingress.networking.k8s.io/tech-radar-auth patched

real	0m19.327s
user	0m0.350s
sys	0m0.067s
Mon Apr 15 09:29:45 AM CEST 2024

Excluding lines containing:

  • "remote_addr" (pattern not to include pure request logs and client IP addresses)
  • "Endpoint"
  • "body is buffered"
  • "Error getting SSL certificate"
  • "Unexpected error validating SSL certificate"
  • "does not contain a Common Name"

I'm left with these for the time period mentioned above + 15 seconds (09:29:26 - 09:29:59):

2024-04-15 09:29:45.477	admission.go:149] processed ingress via admission controller {testedIngressLength:269 testedIngressTime:18.855s renderingIngressLength:269 renderingIngressTime:0.002s admissionTime:9.5MBs testedConfigurationSize:18.857}
2024-04-15 09:29:45.478	main.go:107] "successfully validated configuration, accepting" ingress="infra/tech-radar-auth"
2024-04-15 09:29:45.486	event.go:298] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"infra", Name:"tech-radar-auth", UID:"e635a88c-e72e-401f-8a9a-afa536b06c11", APIVersion:"networking.k8s.io/v1", ResourceVersion:"3624067364", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
2024-04-15 09:29:45.486	event.go:298] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"infra", Name:"tech-radar-auth", UID:"e635a88c-e72e-401f-8a9a-afa536b06c11", APIVersion:"networking.k8s.io/v1", ResourceVersion:"3624067364", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
2024-04-15 09:29:45.487	event.go:298] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"infra", Name:"tech-radar-auth", UID:"e635a88c-e72e-401f-8a9a-afa536b06c11", APIVersion:"networking.k8s.io/v1", ResourceVersion:"3624067364", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync

But for completeness sake, here is the full dump (redacted for business info, and all request logs are excluded): nginx-logs-ing-update-2024-04-15 09 33 02.414.txt
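
For the record, the filtering above corresponds roughly to a grep -v chain like this (a sketch; the label selector comes from the service selectors listed earlier in this issue):

$ kubectl -n ingress logs -l app=nginx-ingress,component=controller --since=10m --tail=-1 \
    | grep -v -e 'remote_addr' -e 'Endpoint' -e 'body is buffered' \
              -e 'Error getting SSL certificate' \
              -e 'Unexpected error validating SSL certificate' \
              -e 'does not contain a Common Name'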

@longwuyuan (Contributor)

  • Does the node on which the controller pods are scheduled see any resource crunch?
  • Have you configured limits and resources on the pods?

There are lots of SSL errors, and if you are submitting lots of new Ingress JSON to the api-server so that the controller pod needs to reconcile & sync while there is a resource crunch, then that could explain the stats.

torvald commented Apr 16, 2024

  • Does the node on which the controller pods are scheduled see any resource crunch?

No, we have fairly low utilization on all nodes. These are the three nodes at the same point in time (marked with the arrows) as the previously mentioned test. CPU to the left, memory to the right.

[Images: CPU and memory utilization for the three nodes hosting the controller pods, at the marked point in time.]

  • Have you configured limits and resources on the pods?

Yes, full deployment manifest here.

        resources:
          requests:
            cpu: 5000m
            memory: 8000Mi
          limits:
            memory: 8000Mi

There are lots of SSL errors, and if you are submitting lots of new Ingress JSON to the api-server so that the controller pod needs to reconcile & sync while there is a resource crunch, then that could explain the stats.

Events that make nginx sync data can be seen below; my test from earlier is marked and falls within a period of low «sync volume».
[Image: nginx sync events over time, with the earlier test marked.]

Yes, there is an unfortunate amount of SSL errors. Zooming out, these are rather constant, and I don't feel they should affect the validation of my config to the point that it takes 20s. By the same argument, there should also be periods with little resource crunch where we would see validation durations of only a few seconds, but we don't.
[Image: SSL error rate over time, roughly constant.]

I can try to eliminate the SSL errors before going further, to rule out this suspicion.

@longwuyuan (Contributor)

I want to give enough time to this so we can get the root cause. If you ping me on slack we could see how to do that.

torvald commented Apr 17, 2024

We had a video chat and ran through some config and settings to validate the «normal stuff». We SSHed into the node of the master nginx pod to look for clues in the dmesg logs, especially around connection tracking (conntrack). Nothing of significance, but we'll try to inspect the webhook with tcpdump to get a better grip on where the time is lost.

torvald commented Apr 18, 2024

We have a dev environment with 60% of the same ingresses, and we see high-ish numbers there as well, in the 6-8s range.

I tried packet dumping on our nginx pods there, and it seems like all the time is spent inside nginx.

torvald@surdeig ~ $ date; time kubectl -n infra patch ing interno-public --type='json' -p='[{"op": "add", "path": "/metadata/labels/testing", "value": "testing"}]'; date

Wed Apr 17 03:45:05 PM CEST 2024
ingress.networking.k8s.io/interno-public patched

real	0m7.567s
user	0m0.413s
sys	0m0.047s

Wed Apr 17 03:45:13 PM CEST 2024

The TCP and TLS handshakes happen immediately, and then there is a 7s holdup waiting for a response from nginx on the admission webhook port, 8443.

[Image: packet capture showing the handshake completing immediately, followed by a ~7s wait for the webhook response on port 8443.]

This tells me this is not a network problem.
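
For anyone who wants to repeat this, the capture can be done roughly like this (a sketch; the pod name and file paths are placeholders, and the controller image may not ship tcpdump, in which case capturing on the node or via an ephemeral debug container is an alternative):

# Capture traffic on the admission webhook port from inside the controller pod:
$ kubectl -n ingress exec -it nginx-ingress-controller-<hash> -- \
    tcpdump -i any -w /tmp/admission.pcap 'tcp port 8443'
# In another terminal, patch an ingress to trigger a validation, then stop the capture
# and pull the file out for inspection:
$ kubectl cp ingress/nginx-ingress-controller-<hash>:/tmp/admission.pcap ./admission.pcap
$ tcpdump -r admission.pcap -tttt | less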

@longwuyuan (Contributor)

Thanks.

This data is proof that when you say "network", you really mean that the network does not have the port blocked and that there is no IP/TCP blocking via a packet filter.

But as I hinted at on the video call, the truly useful information here is to know whether a test like telnet ingress-nginx-controller-admission 443, or telnet $controller-leader-podname 8443, or the same test with the *nix tool netcat (nc), shows the connection working or jammed/wedged for 7+ seconds. This tests whether a busy/choked network is causing the delay, rather than whether connectivity is blocked (for those few seconds).
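
A sketch of that timing test, run from any pod or node that can reach the controller (the service name and ports are the ones from this cluster; nc's -z flag only checks that the connection opens, without sending data):

$ time nc -zv ingress-nginx-controller-admission.ingress.svc 443
$ time nc -zv <controller-leader-pod-ip> 8443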

torvald commented Apr 18, 2024

Before going deeper down the network route, I came across a scent of what I think is the smoking gun: adding lots of paths seems to correlate well with the validation times we are seeing. More path items within one of our ingresses produce longer validation times. If paths are spread out over multiple ingresses, it seems to be okay, but cramming many paths into one ingress seems to make validation time explode.

Our dev cluster has 180 ingresses configured. Many of them have multiple hosts and paths, so in the nginx.conf there is a total of 1434 nginx location directives:

$ kubectl -n ingress exec nginx-ingress-controller-7686cb664f-5lnnc -- cat /etc/nginx/nginx.conf | grep " location " | wc -l
1434

If I pick one of our ingresses, not even a special one, just one of them, edit a label and apply it, I see the usual 7s wait in our dev cluster (20s in prod, as reported earlier).

$ time kubectl apply -f /tmp/ingress.yaml
ingress.networking.k8s.io/app-api-ingress configured

real	0m6.884s
user	0m0.410s
sys	0m0.156s

Here I go from 2 paths:

$ cat /tmp/ingress.yaml | grep path:
      - path: /api/
      - path: /api/external

To 21 paths:

$ cat /tmp/ingress-with-many-paths.yaml   | grep path:
      - path: /api/
      - path: /api/external/
      - path: /api/internal/
      - path: /foo/(ja|nei)(-[^/]+)?/okse1/
      - path: /foo/(ja|nei)(-[^/]+)?/okse2/
      - path: /foo/(ja|nei)(-[^/]+)?/okse3/
      - path: /foo/(ja|nei)(-[^/]+)?/okse/
      - path: /foo/(ja|nei)(-[^/]+)?/banan/
      - path: /foo/(ja|nei)(-[^/]+)?/sjallabais/
      - path: /fisk/api/
      - path: /fisk/api/external/
      - path: /fisk/api/internal/
      - path: /fisk/foo/(ja|nei)(-[^/]+)?/okse1/
      - path: /fisk/foo/(ja|nei)(-[^/]+)?/okse2/
      - path: /fisk/foo/(ja|nei)(-[^/]+)?/okse3/
      - path: /fisk/foo/(ja|nei)(-[^/]+)?/okse/
      - path: /fisk/foo/(ja|nei)(-[^/]+)?/banan/
      - path: /fisk/foo/(ja|nei)(-[^/]+)?/sjallabais/
      - path: /kake/foo/(ja|nei)(-[^/]+)?/okse2/
      - path: /kake/foo/(ja|nei)(-[^/]+)?/okse3/
      - path: /kake/foo/(ja|nei)(-[^/]+)?/okse/

The validation webhook becomes super slow:

$ time kubectl apply -f /tmp/ingress-with-many-paths.yaml 
ingress.networking.k8s.io/app-api-ingress configured

real	0m25.353s
user	0m0.380s
sys	0m0.116s

Adding more paths makes it go over Kubernetes' admission webhook max timeout of 30s and kubectl fails.
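
The 30s ceiling comes from the ValidatingWebhookConfiguration, and 30s is also the maximum value the API allows for timeoutSeconds, so it cannot simply be raised. A sketch of how to inspect it (the webhook configuration name is a placeholder; list them first to find yours):

$ kubectl get validatingwebhookconfigurations
$ kubectl get validatingwebhookconfiguration <name> -o jsonpath='{.webhooks[*].timeoutSeconds}{"\n"}'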

With the object above applied, you can see how other ingress updates against the same nginx server are also affected, presumably because the validation webhook makes nginx render all ingress objects and validate the full config each time (rightfully so):

$ time kubectl -n infra patch ing <random ingress in cluster> --type='json' -p='[{"op": "add", "path": "/metadata/labels/testing", "value": "just-a-small-change"}]'
ingress.networking.k8s.io/<random ingress in cluster> patched

real	0m27.728s
user	0m0.385s
sys	0m0.051s

In total, this does not generate much more config; it is just a ~2% increase in nginx location directives:

$ kubectl -n ingress exec nginx-ingress-controller-7686cb664f-5lnnc -- cat /etc/nginx/nginx.conf | grep " location " | wc -l
Defaulted container "nginx-ingress-controller" out of: nginx-ingress-controller, opentelemetry (init), sysctl (init), geoip-database-download (init)
1472

The full ingress manifests, and the diff between them.
ingress.yaml.txt
ingress-with-many-paths.yaml.txt
diff.txt

I can try to put together steps to reproduce this locally.

@longwuyuan (Contributor)

Awesome data.
Hope you get to the bottom of this.

torvald commented Apr 25, 2024

I've figured out where this comes from!

The added latency on the admission controller webhook is triggered by the combination of the annotation

  • nginx.ingress.kubernetes.io/enable-owasp-core-rules: "true", and
  • multiple paths in one ingress (10+).
    More paths seem to have an exponential effect on processing duration.

In our production cluster we run 270 ingresses; 4 of these had this annotation, and most of those also had a good handful of paths. Removing the annotation mitigated our problem. I've also reproduced this locally, if you are interested in the config, commands and manifests.
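
If others want to check their own clusters, a rough way to list the ingresses carrying this annotation (a sketch, assuming jq is available):

$ kubectl get ingress -A -o json | jq -r '
    .items[]
    | select(.metadata.annotations["nginx.ingress.kubernetes.io/enable-owasp-core-rules"] == "true")
    | "\(.metadata.namespace)/\(.metadata.name)"'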

Moving this (and the corresponding modsecurity flag) to the main configmap works fine, though; it does not add any additional latency. We did not do this, however, as we have a WAF in front of our nginx-ingress that does the same job.

  enable-modsecurity: "true"
  enable-owasp-modsecurity-crs: "true"
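
A sketch of what that move would look like in practice (ingress names, namespaces and the ConfigMap name are placeholders; the trailing dash on kubectl annotate removes an annotation):

# Drop the per-ingress annotations that trigger the slow validation:
$ kubectl -n <ns> annotate ingress <name> \
    nginx.ingress.kubernetes.io/enable-modsecurity- \
    nginx.ingress.kubernetes.io/enable-owasp-core-rules-
# ...and enable the same thing controller-wide via the main ConfigMap instead:
$ kubectl -n ingress patch configmap <controller-configmap> --type merge \
    -p '{"data":{"enable-modsecurity":"true","enable-owasp-modsecurity-crs":"true"}}'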

[Image: validation webhook duration after the change.]

Profit. 🌮

@longwuyuan (Contributor)

Thank you very much for the update. Awesome work. Love this kind of result :-) Cheers.

If this is no longer an issue, kindly consider closing it. Once again, awesome work and thanks a lot.

torvald commented Apr 26, 2024

I do think I'd classify this as a regression bug; while more config obviously adds validation time, the validation time we've seen here is unreasonably high compared to the amount of config that actually gets added. I'm thinking I can try to profile where the time is spent, and maybe that can be used as input for future optimizations. I'll close this if that does not yield any good results, or if no one else picks it up within a reasonable time.

Thanks for your help and guidance, @longwuyuan.

@longwuyuan (Contributor)

There is painful & intensive work in progress that relates to these kinds of circumstances.

One such effort is a PR trying to split the data plane away from the control plane. That paradigm shift introduces improved security, performance and, hopefully, easier management.

If you provide the data you hinted at, it can only help the developers' decisions. The action items arising from that data are not something I can comment on, because the complications are too much to summarise in writing.

Thank you very much for your info

@github-actions (bot)

This is stale, but we won't close it automatically; just bear in mind the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any questions or requests to prioritize this, please reach out on #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label May 27, 2024