
Context Canceled on all authentication attempts for 10 minutes after an API node has shutdown #184

Open
Timothy-Dougherty opened this issue Feb 16, 2023 · 14 comments

@Timothy-Dougherty

Timothy-Dougherty commented Feb 16, 2023

Hi, we're observing authentication requests fail consistently for ~10 minutes after an API node has been shut down.

We run multiple API nodes and bring one down at a time for updates. All other API nodes remain accessible via our load balancer, and we have confirmed that the URL used by the plugin operates correctly during this time.

However, the Vault plugin seems to hold onto a bad connection for 10 minutes and fails every authentication attempt during that window; after this period it operates correctly.

This is the response we see:
"https://{kube api url}/tokenreviews\": context canceled"

We're using v1.12.3 of Vault

@tvoran
Member

tvoran commented Feb 16, 2023

Hi @Timothy-Dougherty, this sounds suspiciously similar to #169, so you might try reloading the plugin after the leadership API node change just to see if that resolves it faster than 10 minutes.

Can you turn on trace level logging and post the logs here from a period where authentication attempts are failing?

@tomhjp
Contributor

tomhjp commented Feb 17, 2023

In addition to the above, could you also share the config you are setting up the k8s auth method with? e.g. vault write auth/kubernetes/config .... I'm especially interested in the kubernetes_host parameter and how it relates to the API nodes' addresses and the load balancer address. Could you give more details about the load balancer itself too, please? It would be very useful for trying to repro this issue.

We do use a single HTTP client for all tokenreview requests. My expectation would be that its connection gets shut down when the API node shuts down, but it sounds like that connection is staying alive longer for some reason.
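For illustration only, here's a minimal sketch of that reuse pattern (this is not the plugin's actual code, and the host is a hypothetical stand-in for kubernetes_host): a single shared client pools keep-alive connections in its transport and reuses them for every tokenreview POST, so a connection that still points at a deregistered node can keep getting picked up.

```go
// Simplified sketch, not the plugin's actual code: one shared client whose
// transport pools keep-alive connections and reuses them for every request.
package main

import (
	"net/http"
	"strings"
	"time"
)

// A single client shared by all tokenreview requests. Its transport keeps
// idle connections open and reuses them, so a connection that still points
// at a deregistered API node can keep being picked up.
var sharedClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConns:    100,
		IdleConnTimeout: 90 * time.Second,
	},
	Timeout: 30 * time.Second,
}

func main() {
	// Hypothetical host for illustration; the real value comes from kubernetes_host.
	body := strings.NewReader(`{"kind":"TokenReview"}`)
	resp, err := sharedClient.Post("https://k8s-api.example.com/apis/authentication.k8s.io/v1/tokenreviews", "application/json", body)
	if err == nil {
		resp.Body.Close()
	}
}
```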

@Timothy-Dougherty
Author

Thanks @tvoran & @tomhjp .

It'll take me a bit to get the trace logs and try the reload as I need to coordinate with another team to do so, hopefully in the next day or so.

As for the config and load balancer: we use an AWS Network Load Balancer which connects to the EC2 instances of the API nodes. We have 3 API nodes, which get replaced one after the other. The kubernetes_host value is a DNS record that resolves to the network load balancer.

@Timothy-Dougherty
Author

It's worth noting that our NLB wasn't using an attribute called "Terminate connections on deregistration", which closes connections after a target has been deregistered. After turning that setting on and using a very low deregistration delay, the period of auth failure dropped from 10 minutes to < 1 minute.

Perhaps this is expected behaviour? However, I would have assumed the plugin should drop connections from its cache that have timed out (maybe not immediately, but at least quicker than 10 minutes).
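As a rough illustration of what I mean (a sketch only, not a patch against the plugin), a client can bound how long a cached connection survives by periodically flushing its idle connection pool, forcing the next request to dial the load balancer again:

```go
// Sketch only, not a patch against the plugin: periodically flush the idle
// connection pool so the next request has to dial the load balancer again.
package main

import (
	"net/http"
	"time"
)

func closeIdleLoop(t *http.Transport, every time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Drops pooled keep-alive connections, including any that still
			// point at a deregistered API node.
			t.CloseIdleConnections()
		case <-stop:
			return
		}
	}
}

func main() {
	transport := &http.Transport{}
	client := &http.Client{Transport: transport, Timeout: 30 * time.Second}
	stop := make(chan struct{})
	go closeIdleLoop(transport, time.Minute, stop)
	_ = client // tokenreview requests would go through this client
	close(stop)
}
```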

@tomhjp
Contributor

tomhjp commented Feb 21, 2023

Thanks, that's really helpful. I don't have a ton of direct experience with AWS NLBs, but it sounds like we're forming a good idea of what's going on.

From here:

Each individual TCP connection is routed to a single target for the life of the connection.

From here:

To ensure that existing connections are closed, you can do one of the following: enable the target group attribute for connection termination, ensure that the instance is unhealthy before you deregister it, or periodically close client connections.

So it sounds like you've tried out option 1 currently. I'm not clear what the exact disconnection behaviour is for option 2. And option 3 sounds like it's probably left to the responsibility of the client (Vault in this case). During the failure period, I'd be interested to know if Vault sees a 502, 503, 504, or something else, and I think the debug logs should tell us. Perhaps we could trigger a reconnect/close the connection on one or some of those error codes.
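As a sketch of that last idea (hypothetical, not something Vault does today), the transport could be wrapped so that a 502/503/504 response drops pooled connections and forces a fresh dial on the next attempt:

```go
// Hypothetical sketch, not current Vault behaviour: close pooled connections
// whenever the load balancer answers with a gateway-type error, so the next
// attempt dials a fresh connection.
package main

import "net/http"

type reconnectOnGatewayError struct {
	base *http.Transport
}

func (rt *reconnectOnGatewayError) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := rt.base.RoundTrip(req)
	if resp != nil {
		switch resp.StatusCode {
		case http.StatusBadGateway, http.StatusServiceUnavailable, http.StatusGatewayTimeout:
			// 502/503/504: drop cached connections and force a re-dial.
			rt.base.CloseIdleConnections()
		}
	}
	return resp, err
}

func main() {
	client := &http.Client{Transport: &reconnectOnGatewayError{base: &http.Transport{}}}
	_ = client // tokenreview requests would go through this client
}
```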

EDIT: Also for my own reference, AWS LB TargetGroupAttribute reference docs here.

@Timothy-Dougherty
Author

Timothy-Dougherty commented Feb 22, 2023

Sorry to say the trace logs didn't seem to give much more information.

There were trace logs reporting that the certificate pool didn't need an update, but the logs related to the auth failures were the same ones I've mentioned above:

Feb 21 22:05:29 ip-10-10-10-10 vault[983]: 2023-02-21T22:05:29.912Z [DEBUG] auth.kubernetes.auth_kubernetes_47797564: login unauthorized: err="Post \"https://k8s-api.sandbox.com:8443/apis/authentication.k8s.io/v1/tokenreviews\": context canceled"

Feb 21 22:05:29 ip-10-10-10-10 vault[983]: 2023-02-21T22:05:29.982Z [DEBUG] auth.kubernetes.auth_kubernetes_47797564: login unauthorized: err="Post \"https://k8s-api.sandbox.com:8443/apis/authentication.k8s.io/v1/tokenreviews\": context canceled"

@Timothy-Dougherty
Author

It would have been nice to see this in the trace logs, but our API servers send an HTTP/2 GOAWAY when they are in the process of shutting down. If the plugin handled those frames and closed the connections that received them, it would also resolve this issue.
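For what it's worth, here's a rough sketch of a related mitigation, assuming the client configures its HTTP/2 transport via golang.org/x/net/http2 (this is not the plugin's current code): the transport's ping-based health check closes connections that stop responding, which would also cover servers that disappear without a clean GOAWAY.

```go
// Rough sketch assuming golang.org/x/net/http2 is available to the client;
// not the plugin's current code. ReadIdleTimeout makes the HTTP/2 transport
// ping a quiet connection and close it if the ping isn't answered.
package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func main() {
	t1 := &http.Transport{}
	t2, err := http2.ConfigureTransports(t1)
	if err != nil {
		log.Fatal(err)
	}
	t2.ReadIdleTimeout = 30 * time.Second // ping if no frames are read for 30s
	t2.PingTimeout = 10 * time.Second     // close the conn if the ping gets no reply

	client := &http.Client{Transport: t1, Timeout: 30 * time.Second}
	_ = client // tokenreview requests would go through this client
}
```

ReadIdleTimeout only kicks in while the connection has no traffic, so it wouldn't rescue requests already in flight, but it would bound how long a dead connection stays in the pool.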

@tomhjp
Contributor

tomhjp commented Feb 23, 2023

That's weird. Go should close the client connection if it receives a GOAWAY frame. It's actually a relatively recent addition to the x/net library: golang/net@d0c6ba3, but it has been available since 0.1.0, and Vault 1.12.3 was built with 0.4.0.

@Timothy-Dougherty
Author

Timothy-Dougherty commented Feb 26, 2023

Hmm, I'll admit my understanding of how Go manages its dependencies is not great, but I can see the plugin
https://github.com/hashicorp/vault-plugin-auth-kubernetes/blob/main/go.mod#L61
indirectly references v0.0.0-20220906165146-f3363e06e74c, which I think is a few months behind the 0.4.0 version:
https://pkg.go.dev/golang.org/x/net?tab=versions
I don't know if that matters at all if Vault is built with 0.4.0.

@sidewinder12s

sidewinder12s commented May 9, 2023

When you got this error, did you also see a huge spike in goroutines?

We're seeing something eerily similar to these errors. We're using EKS, so we also don't control the load balancer config or management, though I believe AWS uses NLBs under the hood: https://aws.github.io/aws-eks-best-practices/reliability/docs/controlplane/#eks-architecture

As far as we can tell, nothing is happening with our control planes and we're getting this error out of nowhere. Debug logging just surfaces the context canceled errors and nothing else. Once these errors start kicking off, we start seeing failures across almost all of our EKS clusters for some time, and then it stops.

@alexandersimpson17

I am seeing a very similar issue. More info below:

Context Canceled on All Authentication Randomly

Vault version: 1.13.1

Running Vault in EKS 1.23 (been seeing the issue since 1.22)

Once auth starts to fail it fails across all clusters in multiple AWS accounts.

Fails for 15 minutes and 30 seconds (almost exactly - at least for the last 2 occurrences, haven’t checked the others). Recovers and then the failures repeat 2-4 times (also for the same 15 min and 30 secs).

This occurs very randomly - as much as twice a week or once every 2 weeks.

Failures look like:
{"@Level":"debug","@message":"login unauthorized","@module":"auth.kubernetes.<redacted","@timestamp":"2023-05-31T04:56:11.240263Z","err":"Post "eks.amazonaws.com/apis/authentication.k8s.io/v1/tokenreviews": context canceled"}

Audit logs show that failures are only with auth/kubernetes but don’t show anything else of substance.

Trace logs have just been enabled (the incident hasn't occurred since), but from the code I don't know if they will show much more info.

While the trigger is likely an API server having scaled or shut down, we have many accounts/clusters and limited visibility with EKS, so it is hard to determine whether that is the exact trigger.

Any help would be appreciated!

@Timothy-Dougherty
Author

@sidewinder12s this dropped off my radar as the "Terminate connections on deregistration" setting on the NLB was enough to resolve our issue.

@alexandersimpson17 Vault 1.13.1 still referenced v0.15.0 of this plugin:
https://github.com/hashicorp/vault/blob/4472e4a3fbcc984b7e3dc48f5a8283f3efe6f282/go.mod#L120

The x/net versions were bumped in v0.15.1:
0068915#diff-33ef32bf6c23acb95f5902d7097b7a1d5128ca061167ec0716715b0b9eeaa5f6

Again, my understanding of Go dependencies is not great, but it may be worth trying a later version of Vault that references v0.15.1 or later of this plugin and seeing if that fixes your issue.

@sidewinder12s

Thanks for the update. Since we're using EKS we're flying a little blind, but we're going to give Vault 1.14 a try since that seemed to include the updated plugin version (we had agreed with your assessment of the Go deps).

@alexandersimpson17

This issue is still occurring for me (currently on 1.15).

I was looking at this plugin and noticed it is using cleanhttp.

Cleanhttp hasn't been updated in almost 3 years and uses Go 1.13.

I was wondering whether anyone thinks that changing the plugin to just use the latest version of net/http could improve things?
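For context, my rough understanding is that cleanhttp's pooled transport is just a plain *http.Transport from the standard library with sensible defaults, so the HTTP/1.1 and HTTP/2 behaviour still comes from whatever Go and x/net versions Vault was built with rather than from cleanhttp itself. A paraphrased sketch (not cleanhttp's exact source):

```go
// Paraphrased sketch of what cleanhttp's pooled transport roughly looks like
// (not its exact source): a standard-library *http.Transport with pooled
// defaults and its own connection pool.
package main

import (
	"net"
	"net/http"
	"runtime"
	"time"
)

func defaultPooledTransport() *http.Transport {
	return &http.Transport{
		Proxy: http.ProxyFromEnvironment,
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
		ForceAttemptHTTP2:     true,
		MaxIdleConns:          100,
		MaxIdleConnsPerHost:   runtime.GOMAXPROCS(0) + 1,
		IdleConnTimeout:       90 * time.Second,
		TLSHandshakeTimeout:   10 * time.Second,
		ExpectContinueTimeout: 1 * time.Second,
	}
}

func main() {
	client := &http.Client{Transport: defaultPooledTransport()}
	_ = client
}
```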
