Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

small updates to Telemetry #833

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

robuye
Copy link
Contributor

@robuye robuye commented Mar 18, 2024

Hi there, I added some telemetry around webhook handler and changed metadata included in all API requests. It's all just a few small changes to make monitoring for stripity-stripe users awesome out of the box 馃檭

Webhooks are often important part of integration with Stripe so having telemetry in there is good idea imo.

API requests metadata contains some low cardinality fields that can be useful when visualising these metrics in other tools. For example using stripe.request.stop event we can infer metrics for RED method and also have visibility into individual Stripe API endpoints.

There are no breaking changes in here, only telemetry metadata is different (url field is gone). There is assumption that event object always contains type property, which is always a case to my knowledge and Stripe docs.

Here is example how it looks in PhoenixLiveDashboard:

image

In order to enable this visualisation I only needed to add few lines to telemetry.ex:

      # Stripe
      summary("stripe.request.stop.duration",
        tags: [:endpoint, :status],
        unit: {:native, :millisecond}
      ),
      summary("stripe.webhook.stop.duration",
        tags: [:event],
        unit: {:native, :millisecond}
      ),

Other systems such as prom_ex, telemetry_ui, otel, etc can be easily integrated too.

Hope you like it, happy to take in any and all feedback.

@robuye robuye requested a review from a team as a code owner March 18, 2024 13:45
@yordis
Copy link
Member

yordis commented Mar 18, 2024

CI is failing btw

@robuye
Copy link
Contributor Author

robuye commented Mar 18, 2024

CI is failing btw

Same error happens when I reverted all my changes, it seems to be related to something else. I will try to debug it further.

@robuye
Copy link
Contributor Author

robuye commented Mar 18, 2024

Okay, all tests are passing now, I think I fixed original problem by setting SHELL env. I also noticed mix format is updating 2 files so I fixed that too and added format checking to CI job.

Event data: #{inspect(event)}
"""
end
:telemetry.span(~w[stripe webhook]a, %{}, fn ->
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this be done at the WebServer Telemetry level? That was the feedback given from the OTEL folks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think yes, Phoenix and Plug already add their own events around requests, but these are very generic and we can do it better here by annotating with Stripe event type. Having this together with other part gives us more complete view of Stripe subsystem, or that's how I see it from monitoring perspective.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am with you! I see the point of adding event.type since it is domain-specific.

But wouldn't the WebServer handle the handler_status based on the response?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw I'm actually working on integrating OTEL and added Telemetry here to create spans and traces similar to how it's done in phoenix in otel-erlang-contrib. I hope to have spans that include stripe api calls by name, I think that will be super helpful when browsing traces, but would be difficult to do using generic telemetry events from phoenix or other lib.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not opposed to anything, but since these are architectural components, we are better off doing research and making sure we are following "standards" (OTEL practices for now) so we all benefit

I came across particular telemetry requirements that required the Operator to change their mindset and workflow to accomplish what they wanted; I am not saying that is the case here! I am just asking questions to make sure we are doing the right thing. I hope you understand.

This is a tricky domain; everyone wants their way of observing things...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But wouldn't the WebServer handle the handler_status based on the response?

Well, sort of. WebServer will report it in its own telemetry as completely different event so in a sense yes, but it's not always available. I think OTEL traces will be able to connect to parent span and present this info but other systems, for example Prometheus, won't.

Adding handler_status to this event makes it possible to monitor error rates by Stripe event type and use RED method to get useful insights with little effort, that's the main reason why I decided to add it here.

Alternatively we could add exactly same tracing to handler module that WebhookHandler is calling. It will work the same but this is additional effort to build custom observability versus having it provided ootb by stripity-stripe.

I personally prefer ootb experience because otherwise it's difficult to get in and do it right on your own.

I am not opposed to anything, but since these are architectural components, we are better off doing research and making sure we are following "standards" (OTEL practices for now) so we all benefit
[...]

Yup, understood, happy to have this conversation and to look for best solution together, thanks for taking the time to discuss it too!

lib/stripe/api.ex Show resolved Hide resolved
Added telemetry support to webhook handler and changed metadata included
in API requests.

API requests metadata contains some low cardinality fields that can be
useful when visualising these metrics in other tools. For example using
'stripe.request.stop' event we can infer metrics for RED method and have
visibility into all Stripe API endpoints.

'url' passed in to 'do_perform_request_and_retry' contains query params
which can contain sensitive data such as emails so it is URI-parsed and
all params are dropped.

Webhooks are often important part of integration with Stripe and having
telemetry in place for that is good idea too imo.
Maybe fix problem with erlexec failing to start:

```
SHELL environment variable not set! [c_src/exec.cpp:609]

Notice: 15:48:19.772 [notice] Application erlexec exited: :exec_app.start(:normal, []) returned an error: shutdown: failed to start child: :exec
    ** (EXIT) {:port_exited_with_status, 4}
```
Run mix compile before tests so the output is separate.

Trace will make it easier to understand if some tests were excluded and
such.
Some tests are failing because query params are in wrong order.

My fix is to ignore order and just checking all were set with correct
value.
Copy link
Member

@yordis yordis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, just few minors adjustment, primarily around HTTP metadata and aligned a bit more with common OTEL lingos.

@@ -376,14 +376,22 @@ defmodule Stripe.API do
end

defp do_perform_request_and_retry(method, url, headers, body, opts, {:attempts, attempts}) do
telemetry_meta = %{
endpoint: URI.parse(url).path,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
endpoint: URI.parse(url).path,
http_url: URI.parse(url),

Pass the entire URL, let processors to filter the info, HTTP OTEL packages would do something around the line of https://github.com/open-telemetry/opentelemetry-erlang-contrib/blob/3f13bc8b2042071b723d0ea8c7b118fd10f42d26/instrumentation/opentelemetry_req/lib/opentelemetry_req.ex#L143-L148

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay added and I renamed original field to stripe_api_endpoint. I can remove it because you're right that it can be extracted from http_url via processor, I kept it for better out-ot-the-box experience.

lib/stripe/api.ex Outdated Show resolved Hide resolved
endpoint: URI.parse(url).path,
method: method,
attempt: attempts,
stripe_api_version: headers["Stripe-Version"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
stripe_api_version: headers["Stripe-Version"],
http_headers: headers,

lib/stripe/api.ex Outdated Show resolved Hide resolved
lib/stripe/api.ex Outdated Show resolved Hide resolved
case http_module().request(method, url, Map.to_list(headers), body, opts) do
{:ok, status, _, _} = resp ->
{resp, %{status: status}}
{resp, %{telemetry_meta | status: status}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
{resp, %{telemetry_meta | status: status}}
{resp, %{telemetry_meta | http_status_code: status, http_response_content_length: http_response_content_length }}

If content-length is present add http_response_content_length

Updated naming and changed where metadata is constructed in order to
keep original function short and focused.

URL is stripped of query params to avoid leaking of sensitive data such
as emails to logging systems.
@robuye
Copy link
Contributor Author

robuye commented Mar 20, 2024

Hii, I updated names per your feedback, but not completely.

  1. I didn't include all headers because they can contain sensitive information such as cookies or api tokens
  2. I strip query params from url so things like email don't end up in logs
  3. I didn't include response content length because it requires parsing :hackney headers using low level API, this would make code more coupled to hackney adapter and it didn't feel like something you'd want

I also moved setting metadata one level higher so the original function is less complicated. Just experimenting and trying to fit in with current code patterns.

Lemme know what you think, if it looks good this way I will squash this commit with the original.

@robuye
Copy link
Contributor Author

robuye commented Apr 11, 2024

Hey, I'm having second thoughts about the usefulness of this PR. Everything I added here can be easily instrumented client-side too, just by wrapping calls to stripity_stripe with :telemetry.span. I'm not sure it's worth the new complexity this code brings in, do you think this it is worth going forward with? If not I will close the PR, no problem here.

@yordis
Copy link
Member

yordis commented Apr 11, 2024

I am OK taking these changes; I left a few comments the last time, primarily around the name of the tags and the value of them.


opts
|> Keyword.put(:telemetry_metadata, %{
stripe_api_endpoint: parsed_uri.path,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be extracted from http_url if somebody really wants to map the data to something else

@robuye
Copy link
Contributor Author

robuye commented Apr 15, 2024

Okay, I think I addressed earlier feedback in code or in GH and also left some followup questions for you, I'm not sure you noticed it 馃槄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants