Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCP Reset breaks connection without a retry #1020

Closed
dohnto opened this issue Apr 6, 2022 · 9 comments
Closed

TCP Reset breaks connection without a retry #1020

dohnto opened this issue Apr 6, 2022 · 9 comments

Comments

@dohnto
Copy link
Contributor

dohnto commented Apr 6, 2022

My setup

We run a Grafana which queries Prometheus. Inbetween Grafana and Prometheus, there is a Google Cloud Load Balancer L7 which terminates SSL:

Grafana --> GCLB --> Prometheus

Issue

Grafana user opens a dashboard and Grafana opens a long running TCP connection. Rarely we observe a TCP Reset sent by the Load Balancer, which Grafana displays to user as 5xx error. The request is not retried.

Background

Grafana utilizes client_golang library and in particular its httpapi.QueryRange method which relies on apiClientImpl.DoGetFallback. This method tries to perform the query using POST request and in case of specific issues (405) it fallbacks to GET.

Prometheus' client prefers to use POST over GET, but this causes inconsistencies as POST is not consider idempotent.

Quote from net/http

Transport only retries a request upon encountering a network error if the request is idempotent and either has no body or has its Request.GetBody defined. HTTP requests are considered idempotent if they have HTTP methods GET, HEAD, OPTIONS, or TRACE; or if their Header map contains an "Idempotency-Key" or "X-Idempotency-Key" entry. If the idempotency key value is a zero-length slice, the request is treated as idempotent but the header is not sent on the wire.

More information about this retry feature can be found in https://go-review.googlesource.com/c/go/+/3210/9//COMMIT_MSG .
I found it really confusing, that when the implementation of apiClientImpl.DoGetFallback fallbacks to GET request it might recover from some connectivity issues while with default POST request, it will just error.

Expectations

For me ideally I would prefer making the POST request idempotent (as I believe they are; using one of the Idempotency-* header?). I am not sure if this would be acceptable as this changes the behavior of the client a bit.

Reproducer

None yet.

@dohnto
Copy link
Contributor Author

dohnto commented Apr 7, 2022

This is the related code which decides, whether a request is idempotent https://github.com/golang/go/blob/eca0d44cec58951fb716e540dcc21c0f527686d5/src/net/http/request.go#L1420-L1433

dohnto added a commit to dohnto/client_golang that referenced this issue Apr 7, 2022
Address prometheus#1020.

Signed-off-by: Tomáš Dohnálek <dohnto@gmail.com>
kakkoyun pushed a commit that referenced this issue Apr 13, 2022
* Make Query requests idempotent

Address #1020.

Signed-off-by: Tomáš Dohnálek <dohnto@gmail.com>

* Use empty header

Signed-off-by: Tomáš Dohnálek <dohnto@gmail.com>

* Document issue with original documentation

Signed-off-by: Tomáš Dohnálek <dohnto@gmail.com>
@kakkoyun
Copy link
Member

@dohnto Could you please confirm this is fixed with #1022 ?

@dohnto
Copy link
Contributor Author

dohnto commented Apr 13, 2022

@kakkoyun Give me some time to verify this in production. I am trying to communicate with our teams, whether it would be possible to include this library from master branch rather than a release.

@dohnto
Copy link
Contributor Author

dohnto commented May 4, 2022

I believe the fix is working. We deployed a custom build of the Grafana with client_golang from main branch.

The y axe shows RST occurences resulting into 5xx error in Grafana and blue annotation is a point in time when we deployed the version with client_golang from main branch.

Screenshot from 2022-05-04 21-14-27

@kakkoyun Is there any timeframe to create a release so we could utilize it in production using a regular build process? Thanks

@dohnto dohnto closed this as completed May 4, 2022
@kakkoyun
Copy link
Member

kakkoyun commented May 5, 2022

There will be a release soon, we need to take care of several issues first. Unfortunately, there's no fixed release cadence for the project.

dohnto added a commit to grafana/grafana that referenced this issue May 23, 2022
Use latest released version of client_golang library with
prometheus/client_golang#1020
included. This affected Grafana quering a Prometheus datasource
behind a flaky load balancer (reseting long running connections) and resulted
in user visible error in Grafana. With this change included, Grafana will retry
the query.
@skeilson
Copy link

skeilson commented Nov 3, 2022

@kakkoyun - any perspective on when this will be released? I see it was not included in the most recent prometheus/client_golang release (Nov 2).

@kakkoyun
Copy link
Member

kakkoyun commented Nov 7, 2022

@skeilson There will be a release. Hopefully this week.

@skeilson
Copy link

@kakkoyun any chance of a timeframe for this to be released? I see v.1.14.0 was released on Nov 8th and it does not seem to include this PR.

@JayEkin
Copy link

JayEkin commented Nov 22, 2022

@kakkoyun

Do we have an ETA on the next release where the fix is included?
We are still seeing v.1.14.0

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants