Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

promhttp: count hard request failures in roundtripper metrics #943

Conversation

EdSchouten
Copy link

Right now InstrumentRoundTripper{Counter,Duration} don't increase any
metrics in case of hard request failures (e.g., TCP, TLS failures). This
makes these functions unattractive to use, as those failures are
typically among the most interesting ones to alert on.

This change extends these functions to count such requests properly. To
distinguish them from requests that did yield an actual response, we
leave the "code" label empty. This seems the correct thing to do,
because hard request failures don't yield a code from the remote side.

Right now InstrumentRoundTripper{Counter,Duration} don't increase any
metrics in case of hard request failures (e.g., TCP, TLS failures). This
makes these functions unattractive to use, as those failures are
typically among the most interesting ones to alert on.

This change extends these functions to count such requests properly. To
distinguish them from requests that did yield an actual response, we
leave the "code" label empty. This seems the correct thing to do,
because hard request failures don't yield a code from the remote side.

Signed-off-by: Ed Schouten <eschouten@apple.com>
@EdSchouten EdSchouten force-pushed the eschouten/20211130-client-errors branch from 5ba3bfd to ca86423 Compare December 2, 2021 10:32
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution!

Hm, we somehow never needed that as such error would usually surface in other means.
Think about the abstraction of round tripper. We have res, err := r.RoundTrip(...). If we have no error we have kind of successful roundtrip from point of view of this method, thus we increment that request was made with such code.

If error was returned, technically no request was made. So if our function observes results, there was no result too:

InstrumentRoundTripperCounter is a middleware that wraps the provided
// http.RoundTripper to observe the request result with the provided CounterVec.

If no result was returned error will be returned. Then users handle errors, usually by incrementing some metric due. If we increment our metric here we risk the anti-pattern of handling error multiple times, especially for existing users of our library.

It really depends on the semantics the user have on this. I am afraid this change might surprise many. Is there away to provide optionality for this feature? 🤔

cc @kakkoyun @beorn7

@@ -64,6 +84,8 @@ func InstrumentRoundTripperCounter(counter *prometheus.CounterVec, next http.Rou
resp, err := next.RoundTrip(r)
if err == nil {
counter.With(labels(code, method, r.Method, resp.StatusCode)).Inc()
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just return resp, err here and avoid using else (opinionated nit).

@@ -93,6 +115,8 @@ func InstrumentRoundTripperDuration(obs prometheus.ObserverVec, next http.RoundT
resp, err := next.RoundTrip(r)
if err == nil {
obs.With(labels(code, method, r.Method, resp.StatusCode)).Observe(time.Since(start).Seconds())
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just return resp, err here and avoid using else (opinionated nit).

// for the case where a HTTP request failed without getting a server
// response. The "code" label is left empty to distinguish from cases
// where a server response was obtained.
func labelsForError(code, method bool, reqMethod string) prometheus.Labels {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not reusing label and injecting custom code (e.g -1) that would be a sentinel for us to put empty label ""? Less deduplicate functionality.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"code" suppose to be a valid HTTP code after the latest #962 change. I totally agree that we need to make it more sentinel.

Copy link
Member

@kakkoyun kakkoyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the need behind this. And I think it would be nice to have a facility to track underlying transport layer errors. However, there are several things we need to account here:

  1. First of all, as of know RoundTripper description and API suggests it only meant to used for HTTP, so having errors from underlying layers could be confusing. As we already discussed we assume metrics would have HTTP status code and method labels, not having them or overloading their usage would be confusing.
  2. As Bartek suggested this is changing existing behaviour and this could confuse people. Even changing and implicit behaviour could be problematic for the dependants of the library.

Those being said, I think if we steer the direction of the design we can achieve the same goal.
My suggestion would be to provide wrapper instrumented transport layers in client_golang. Such as instrumentedTCPTransport. So that we can have transport specific metrics and more over users can opt-in to usages of those.

What do you think?

// for the case where a HTTP request failed without getting a server
// response. The "code" label is left empty to distinguish from cases
// where a server response was obtained.
func labelsForError(code, method bool, reqMethod string) prometheus.Labels {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"code" suppose to be a valid HTTP code after the latest #962 change. I totally agree that we need to make it more sentinel.

@kakkoyun
Copy link
Member

@EdSchouten Do you want to continue working on this?

@stale
Copy link

stale bot commented Jun 4, 2022

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale label Jun 4, 2022
@stale stale bot closed this Jun 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants