
Context Canceled on all authentication attempts for 10 minutes after an API node has shutdown #184

Open
Timothy-Dougherty opened this issue Feb 16, 2023 · 14 comments

@Timothy-Dougherty

Timothy-Dougherty commented Feb 16, 2023

Hi, we're observing authentication requests fail consistently for ~10 minutes after an API node has been shut down.

We run multiple API nodes and bring one down at a time for updates. All other API nodes remain accessible via our load balancer, and we have confirmed that the URL used by the plugin operates correctly during this time.

However, the Vault plugin seems to hold onto a bad connection for 10 minutes and fails every authentication attempt during that window; after this period it operates correctly.

This is the response we see:
"https://{kube api url}/tokenreviews\": context canceled"

We're using v1.12.3 of Vault

@tvoran
Member

tvoran commented Feb 16, 2023

Hi @Timothy-Dougherty, this sounds suspiciously similar to #169, so you might try reloading the plugin after the leadership API node change just to see if that resolves it faster than 10 minutes.

Can you turn on trace level logging and post the logs here from a period where authentication attempts are failing?

@tomhjp
Contributor

tomhjp commented Feb 17, 2023

In addition to the above, could you also share the config you are setting up the k8s auth method with? e.g. vault write auth/kubernetes/config .... I'm especially interested in the kubernetes_host parameter and how it relates to the API nodes' addresses and the load balancer address. Could you give more details about the load balancer itself too, please? It would be very useful for trying to repro this issue.

We do use a single HTTP client for all tokenreview requests. My expectation would be that its connection gets shut down when the API node shuts down, but it sounds like that connection is staying alive longer for some reason.
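For illustration only, here's a minimal sketch of that reuse pattern (this is not the plugin's actual code, and the host is a hypothetical stand-in for kubernetes_host): a single shared client pools keep-alive connections in its transport and reuses them for every tokenreview POST, so a connection that still points at a deregistered node can keep getting picked up.

```go
// Simplified sketch, not the plugin's actual code: one shared client whose
// transport pools keep-alive connections and reuses them for every request.
package main

import (
	"net/http"
	"strings"
	"time"
)

// A single client shared by all tokenreview requests. Its transport keeps
// idle connections open and reuses them, so a connection that still points
// at a deregistered API node can keep being picked up.
var sharedClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConns:    100,
		IdleConnTimeout: 90 * time.Second,
	},
	Timeout: 30 * time.Second,
}

func main() {
	// Hypothetical host for illustration; the real value comes from kubernetes_host.
	body := strings.NewReader(`{"kind":"TokenReview"}`)
	resp, err := sharedClient.Post("https://k8s-api.example.com/apis/authentication.k8s.io/v1/tokenreviews", "application/json", body)
	if err == nil {
		resp.Body.Close()
	}
}
```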

@Timothy-Dougherty
Author

Thanks @tvoran & @tomhjp .

It'll take me a bit to get the trace logs and try the reload as I need to coordinate with another team to do so, hopefully in the next day or so.

As for the config and load balancer: we use an AWS Network Load Balancer which connects to the EC2 instances of the API nodes. We have 3 API nodes, which get replaced one after the other. The kubernetes_host value is a DNS record that resolves to the network load balancer.

@Timothy-Dougherty
Author

It's worth noting that our NLB wasn't using an attribute called "Terminate connections on deregistration", which closes connections after a target has been deregistered. After turning that setting on and using a very low deregistration delay, the period of auth failure dropped from 10 minutes to < 1 minute.

Perhaps this is expected behaviour? However, I would have assumed the plugin should drop connections from its cache that have timed out (maybe not immediately, but at least quicker than 10 minutes).
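As a rough illustration of what I mean (a sketch only, not a patch against the plugin), a client can bound how long a cached connection survives by periodically flushing its idle connection pool, forcing the next request to dial the load balancer again:

```go
// Sketch only, not a patch against the plugin: periodically flush the idle
// connection pool so the next request has to dial the load balancer again.
package main

import (
	"net/http"
	"time"
)

func closeIdleLoop(t *http.Transport, every time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Drops pooled keep-alive connections, including any that still
			// point at a deregistered API node.
			t.CloseIdleConnections()
		case <-stop:
			return
		}
	}
}

func main() {
	transport := &http.Transport{}
	client := &http.Client{Transport: transport, Timeout: 30 * time.Second}
	stop := make(chan struct{})
	go closeIdleLoop(transport, time.Minute, stop)
	_ = client // tokenreview requests would go through this client
	close(stop)
}
```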

@tomhjp
Contributor

tomhjp commented Feb 21, 2023

Thanks, that's really helpful. I don't have a ton of direct experience with AWS NLBs, but it sounds like we're forming a good idea of what's going on.

From here:

Each individual TCP connection is routed to a single target for the life of the connection.

From here:

To ensure that existing connections are closed, you can do one of the following: enable the target group attribute for connection termination, ensure that the instance is unhealthy before you deregister it, or periodically close client connections.

So it sounds like you've tried out option 1 currently. I'm not clear what the exact disconnection behaviour is for option 2. And option 3 sounds like it's probably left to the responsibility of the client (Vault in this case). During the failure period, I'd be interested to know if Vault sees a 502, 503, 504, or something else, and I think the debug logs should tell us. Perhaps we could trigger a reconnect/close the connection on one or some of those error codes.
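As a sketch of that last idea (hypothetical, not something Vault does today), the transport could be wrapped so that a 502/503/504 response drops pooled connections and forces a fresh dial on the next attempt:

```go
// Hypothetical sketch, not current Vault behaviour: close pooled connections
// whenever the load balancer answers with a gateway-type error, so the next
// attempt dials a fresh connection.
package main

import "net/http"

type reconnectOnGatewayError struct {
	base *http.Transport
}

func (rt *reconnectOnGatewayError) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := rt.base.RoundTrip(req)
	if resp != nil {
		switch resp.StatusCode {
		case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
			// 502/503/504: drop cached connections and force a re-dial.
			rt.base.CloseIdleConnections()
		}
	}
	return resp, err
}

func main() {
	client := &http.Client{Transport: &reconnectOnGatewayError{base: &http.Transport{}}}
	_ = client // tokenreview requests would go through this client
}
```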

EDIT: Also for my own reference, AWS LB TargetGroupAttribute reference docs here.

@Timothy-Dougherty
Author

Timothy-Dougherty commented Feb 22, 2023

Sorry to say the trace logs didn't seem to give much more information.

There were trace logs reporting that the certificate pool didn't need an update, but the logs related to the auth failures were the same ones I've mentioned above:

Feb 21 22:05:29 ip-10-10-10-10 vault[983]: 2023-02-21T22:05:29.912Z [DEBUG] auth.kubernetes.auth_kubernetes_47797564: login unauthorized: err="Post \"https://k8s-api.sandbox.com:8443/apis/authentication.k8s.io/v1/tokenreviews\": context canceled"

Feb 21 22:05:29 ip-10-10-10-10 vault[983]: 2023-02-21T22:05:29.982Z [DEBUG] auth.kubernetes.auth_kubernetes_47797564: login unauthorized: err="Post \"https://k8s-api.sandbox.com:8443/apis/authentication.k8s.io/v1/tokenreviews\": context canceled"

@Timothy-Dougherty
Author

It would have been nice to see this in the trace logs, but our API servers send an HTTP/2 GOAWAY when they are in the process of shutting down. If the plugin handled those frames and closed the connections that received them, it would also resolve this issue.
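For what it's worth, here's a rough sketch of a related mitigation, assuming the client configures its HTTP/2 transport via golang.org/x/net/http2 (this is not the plugin's current code): the transport's ping-based health check closes connections that stop responding, which would also cover servers that disappear without a clean GOAWAY.

```go
// Rough sketch assuming golang.org/x/net/http2 is available to the client;
// not the plugin's current code. ReadIdleTimeout makes the HTTP/2 transport
// ping a quiet connection and close it if the ping isn't answered.
package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t1 := &http.Transport{}
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		log.Fatal(err)
	}
	t2.ReadIdleTimeout = 30 * time.Second // ping if no frames are read for 30s
	t2.PingTimeout = 10 * time.Second     // close the conn if the ping gets no reply

	client := &http.Client{Transport: t1, Timeout: 30 * time.Second}
	_ = client // tokenreview requests would go through this client
}
```

ReadIdleTimeout only kicks in while the connection has no traffic, so it wouldn't rescue requests already in flight, but it would bound how long a dead connection stays in the pool.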

@tomhjp
Contributor

tomhjp commented Feb 23, 2023

That's weird. Go should close the client connection if it receives a GOAWAY frame. It's actually a relatively recent addition to the x/net library: golang/net@d0c6ba3, but it has been available since 0.1.0, and Vault 1.12.3 was built with 0.4.0.

@Timothy-Dougherty
Author

Timothy-Dougherty commented Feb 26, 2023

Hmm, I'll admit my understanding of how Go manages its dependencies is not great, but I can see the plugin
https://github.com/hashicorp/vault-plugin-auth-kubernetes/blob/main/go.mod#L61
indirectly references v0.0.0-20220906165146-f3363e06e74c, which I think is a few months behind the 0.4.0 version:
https://pkg.go.dev/golang.org/x/net?tab=versions
I don't know if that matters at all if Vault is built with 0.4.0.

@sidewinder12s

sidewinder12s commented May 9, 2023

When you got this error, did you also see a huge spike in goroutines?

We're seeing something eerily similar to these errors. We're using EKS, so we also don't control the load balancer config or management, though I believe AWS uses NLBs under the hood: https://aws.github.io/aws-eks-best-practices/reliability/docs/controlplane/#eks-architecture

As far as we can tell, nothing is happening with our control planes and we're getting this error out of nowhere. Debug logging just surfaces the context canceled errors and nothing else. Once these errors start kicking off, we start seeing failures across almost all of our EKS clusters for some time, and then it stops.

@alexandersimpson17

I am seeing a very similar issue. More info below:

Context Canceled on All Authentication Randomly

Vault version: 1.13.1

Running Vault in EKS 1.23 (been seeing the issue since 1.22)

Once auth starts to fail it fails across all clusters in multiple AWS accounts.

Fails for 15 minutes and 30 seconds (almost exactly - at least for the last 2 occurrences, haven’t checked the others). Recovers and then the failures repeat 2-4 times (also for the same 15 min and 30 secs).

This occurs very randomly - as much as twice a week or once every 2 weeks.

Failures look like:
{"@Level":"debug","@message":"login unauthorized","@module":"auth.kubernetes.<redacted","@timestamp":"2023-05-31T04:56:11.240263Z","err":"Post "eks.amazonaws.com/apis/authentication.k8s.io/v1/tokenreviews": context canceled"}

Audit logs show that failures are only with auth/kubernetes but don’t show anything else of substance.

Trace logs have just been enabled (the incident hasn't occurred since), but from the code I don't know if they will show much more info.

While the trigger is likely an API server having scaled or shut down, we have many accounts/clusters and limited visibility with EKS, so it is hard to determine whether that is the exact trigger.

Any help would be appreciated!

@Timothy-Dougherty
Author

@sidewinder12s this dropped off my radar as the "Terminate connections on deregistration" setting on the NLB was enough to resolve our issue.

@alexandersimpson17 Vault 1.13.1 still referenced v0.15.0 of this plugin:
https://github.com/hashicorp/vault/blob/4472e4a3fbcc984b7e3dc48f5a8283f3efe6f282/go.mod#L120

The x/net versions were bumped in v0.15.1:
0068915#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6

Again, my understanding of Go dependencies is not great, but it may be worth trying a later version of Vault that references v0.15.1 or later of this plugin and seeing if that fixes your issue.

@sidewinder12s

Thanks for the update. Since we're using EKS we're flying a little blind, but we're going to give Vault 1.14 a try since that seemed to include the updated plugin version (we had agreed with your assessment of the Go deps).

@alexandersimpson17

This issue is still occurring for me (currently on 1.15).

I was looking at this plugin and noticed it is using cleanhttp.

Cleanhttp hasn't been updated in almost 3 years and uses Go 1.13.

I was wondering whether anyone thinks that changing the plugin to just use the latest version of net/http could improve things?
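For context, my rough understanding is that cleanhttp's pooled transport is just a plain *http.Transport from the standard library with sensible defaults, so the HTTP/1.1 and HTTP/2 behaviour still comes from whatever Go and x/net versions Vault was built with rather than from cleanhttp itself. A paraphrased sketch (not cleanhttp's exact source):

```go
// Paraphrased sketch of what cleanhttp's pooled transport roughly looks like
// (not its exact source): a standard-library *http.Transport with pooled
// defaults and its own connection pool.
package main

import (
	"net"
	"net/http"
	"runtime"
	"time"
)

func defaultPooledTransport() *http.Transport {
	return &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		ForceAttemptHTTP2:     true,
		MaxIdleConns:          100,
		MaxIdleConnsPerHost:   runtime.GOMAXPROCS(0) + 1,
		IdleConnTimeout:       90 * time.Second,
		TLSHandshakeTimeout:   10 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
	}
}

func main() {
	client := &http.Client{Transport: defaultPooledTransport()}
	_ = client
}
```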
