OpenTelemetry with logger exporter and Prometheus #237
Conversation
(force-pushed from c976e66 to 443d7db)
I have rewritten the description a lot and added a video, feel free to read it again. Here is the patch which implements pgx OpenTelemetry (I pulled it out of this PR because it needs some work on the pgx side): https://gist.github.com/lzap/d6e6cc619170d0cddeb57fc502fd0791
I am resolving conflicts now.
(force-pushed from df5ed65 to 1fba88d)
I'm ❤️ing this!
After reading the code I've got only one nit.
Though what I do not understand (mostly because I am not too familiar with telemetry) is where the metrics are now collected (without Jaeger, I mean) and how this ties in with our /metrics endpoint 🤔
Could we read these metrics somehow, or do they just go to logs without Jaeger?
internal/middleware/request_id.go (outdated)

```go
var RequestIDHeader = "X-Request-Id"

func RequestID(next http.Handler) http.Handler {
	// ...
```
we might want to rename this middleware now, right?
Sure, added a commit.
It's confusing, let me explain. OpenTelemetry is just an attempt to create a common API and protocol for three things: tracing, metrics, and logging.
The added value is the correlation that can be added to all three sources. Unfortunately, only the tracing part is currently stable with stable implementations; the metrics API is v1, but the Prometheus exporter is still considered experimental. Therefore, in my patch I only use the OpenTelemetry (otel) API for tracing. Later on, we can replace our Prometheus metrics (monitoring) with the OpenTelemetry metrics API; that should be an easy task, since the otel API is heavily inspired by Prometheus and will feel very familiar. For this reason, I am keeping the "old" way of gathering metrics, which is completely independent and does not contain any correlation information. So to answer your question: yes, metrics are still exposed via the /metrics endpoint and can be scraped exactly as before; they do not go through Jaeger or the logs.
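In other words, nothing changes for scraping. A minimal sketch of how the classic pull endpoint stays independent of the tracing pipeline, assuming the standard promhttp handler mounted on chi (the port and layout are illustrative, not this repo's actual main):

```go
package main

import (
	"log"
	"net/http"

	"github.com/go-chi/chi/v5"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	r := chi.NewRouter()
	// Prometheus keeps scraping the classic pull endpoint; the
	// OpenTelemetry tracing pipeline runs independently alongside it.
	r.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8000", r))
}
```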
Added new commit.
Since this has been reviewed, tests are passing, and I need to start working on some changes on top of this PR, I am merging. Thanks!
This patch adds preliminary support for OpenTelemetry Tracing via the W3C Trace Context header. Initially, I thought I would simply add a correlation header that would be propagated into logs and into other backend services, but I was made aware that full OpenTelemetry support is planned for the whole platform.
Therefore it makes sense to use the Trace ID from W3C Trace Context instead of our own Request ID; it serves the very same purpose (a unique string that makes it possible to correlate logs). The patch drops Request ID as well as Request Number (it was not very useful, I think) and replaces them with Trace ID. Parsing and generation are provided by the OpenTelemetry SDK itself.
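For reference, the W3C Trace Context header has the form version, trace ID, parent span ID, and flags; the 32-hex-digit second field is the Trace ID that ends up in the logs:

```
traceparent: {version}-{trace-id}-{parent-id}-{trace-flags}
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
```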
There is a new middleware, telemetry.Middleware, inserted into our stack that parses or creates a Trace and Span ID and puts it into the request context. The context can then be retrieved anywhere in the codebase to create a new span and measure the performance of any block or function (a sketch follows below). This way we can measure anything in our stack, and it can be correlated with other services as well to get a full picture of the performance of a whole transaction. The patch does not add spans in a mass way; there is only one in the Image Builder Ready function, because I am worried about conflicts (all clients were heavily modified in my other PR). Let's add them later.
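As a minimal sketch of the pattern (the tracer name and function are illustrative, not the PR's actual code):

```go
package clients

import (
	"context"

	"go.opentelemetry.io/otel"
)

// Ready sketches instrumenting a single function: it starts a child
// span from the Trace/Span ID that telemetry.Middleware stored in the
// request context, and ends it when the function returns.
func Ready(ctx context.Context) error {
	ctx, span := otel.Tracer("provisioning").Start(ctx, "ImageBuilder.Ready")
	defer span.End()

	// ... do the work, passing ctx on so child spans nest correctly ...
	_ = ctx
	return nil
}
```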
There is also a new OpenTelemetryDoer, used in both the ImageBuilder and Sources clients; it automatically gives us spans (measurements) of how long all requests to backend services took. So we can tell how much time was spent waiting for a backend service and how much was spent in our own code. Although it looks like the OpenTelemetry HTTP client automatically does the very same thing, the only difference being that it does not record whether the client was Image Builder or Sources, that should be obvious from the parent span (in the Jaeger UI). So I will probably drop this Doer.
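The PR's actual Doer is not reproduced here; as a rough sketch, wrapping the client transport with the otelhttp instrumentation achieves the same effect (type and constructor names are illustrative):

```go
package clients

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// OpenTelemetryDoer satisfies the usual Doer interface
// (Do(*http.Request) (*http.Response, error)) and delegates to a
// client whose transport records a span for every outgoing request.
type OpenTelemetryDoer struct {
	client *http.Client
}

func NewOpenTelemetryDoer() *OpenTelemetryDoer {
	return &OpenTelemetryDoer{
		client: &http.Client{
			// otelhttp wraps the base transport and creates a child
			// span from the request context for each call.
			Transport: otelhttp.NewTransport(http.DefaultTransport),
		},
	}
}

func (d *OpenTelemetryDoer) Do(req *http.Request) (*http.Response, error) {
	return d.client.Do(req)
}
```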
Now this can all be nicely displayed via the Jaeger UI, but that is not available on the platform yet, so I created a temporary exporter that writes the most important data through our zerolog logger at the TRACE level (sample output is shown in the video linked below).
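A minimal sketch of such an exporter, assuming the otel SDK's SpanExporter interface and zerolog (the PR's actual implementation may differ):

```go
package telemetry

import (
	"context"

	"github.com/rs/zerolog"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// ZerologExporter writes finished spans to a zerolog logger at the
// TRACE level, carrying the IDs needed for log correlation.
type ZerologExporter struct {
	Logger zerolog.Logger
}

// ExportSpans implements sdktrace.SpanExporter.
func (e *ZerologExporter) ExportSpans(ctx context.Context, spans []sdktrace.ReadOnlySpan) error {
	for _, s := range spans {
		e.Logger.Trace().
			Str("trace_id", s.SpanContext().TraceID().String()).
			Str("span_id", s.SpanContext().SpanID().String()).
			Dur("duration", s.EndTime().Sub(s.StartTime())).
			Msg(s.Name())
	}
	return nil
}

// Shutdown implements sdktrace.SpanExporter.
func (e *ZerologExporter) Shutdown(ctx context.Context) error {
	return nil
}
```

Such an exporter would be registered on the tracer provider, e.g. via sdktrace.NewTracerProvider(sdktrace.WithSyncer(&ZerologExporter{...})).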
Telemetry can be enabled or disabled completely in settings, and exporting telemetry data to the logger can be turned on via settings too. The code is ready to be extended with a Jaeger exporter once we have that service available.
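The exact keys are defined by the project's settings package; purely as a hypothetical illustration, the relevant local.yaml fragment could look like this (key names are illustrative, not the actual schema):

```yaml
# Hypothetical key names, not the actual schema.
telemetry:
  enabled: true          # master switch for tracing
  logger_exporter: true  # write spans to zerolog at TRACE level
logging:
  level: debug           # log level is now a string, not an int
```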
Also, I looked into PostgreSQL instrumentation: there is a library which creates spans automatically for each SQL operation, so we would get important timing information for all queries. Unfortunately, this is a very new feature only implemented in pgx/v5, which was released recently, and there are two problems on the pgx side which must be resolved before this can be added.
I am going to implement this in a separate PR; it is a relatively small change.

While I was working on this, I slightly improved the log level configuration (it is now a string, not an int). After this is merged, we all must update our local.yaml files or the app will likely fail to boot.

On top of that, I realized that our Prometheus metrics do not aggregate request paths to chi routes, so for a path like /reservation/{ID} the app generates a unique label for each database ID. This is not correct, so the patch also drops the old way of doing Prometheus metrics and introduces a new middleware that aggregates the path label from chi routes (a sketch follows at the end of this description). I wanted to use a dependency, but all the projects were dead and did not work (we use chi/v5, which they no longer supported). So I copy-pasted the code, dropped a license note into the source file, and updated it so it works fine.

Overall, I think I achieved the main goal: having a correlation ID (Trace ID) that can be passed across services. It sticks to a widely accepted W3C standard, and everything is ready for expansion towards a full-blown OpenTelemetry tracing solution with Jaeger.
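The sketch promised above: a route-pattern-aware metrics middleware, assuming chi/v5 and the Prometheus client (metric and function names are illustrative, not the PR's actual code):

```go
package middleware

import (
	"net/http"
	"time"

	"github.com/go-chi/chi/v5"
	"github.com/prometheus/client_golang/prometheus"
)

var httpDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "http_request_duration_seconds",
	Help: "HTTP request duration, labeled by chi route pattern.",
}, []string{"path"})

func init() {
	prometheus.MustRegister(httpDuration)
}

// PatternMetrics observes request durations under the chi route
// pattern (e.g. "/reservation/{ID}") rather than the raw URL path,
// so database IDs do not create unbounded label cardinality.
func PatternMetrics(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r) // the route pattern is known only after routing
		path := r.URL.Path
		if rctx := chi.RouteContext(r.Context()); rctx != nil {
			path = rctx.RoutePattern()
		}
		httpDuration.WithLabelValues(path).Observe(time.Since(start).Seconds())
	})
}
```

The key point is reading chi's RoutePattern after routing has happened, so /reservation/42, /reservation/43, and so on all collapse into the single label /reservation/{ID}.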
This was created as part of the platform hackathon 2022. Here is my hackathon presentation:
https://www.youtube.com/watch?v=5yoradDDu7g