
OpenTelemetry with logger exporter and Prometheus #237

Merged
merged 3 commits into RHEnVision:main from otel-initial on Sep 30, 2022

Conversation

lzap
Member

@lzap lzap commented Sep 22, 2022

This patch adds preliminary support for OpenTelemetry Tracing via the W3C Trace Context header. Initially, I thought I would simply add a correlation header that would be propagated into logs and into other backend services, but I was made aware that full OpenTelemetry support is planned for the whole platform.

Therefore it makes sense to use the Trace ID from the W3C Trace Context instead of our own Request ID; it serves the very same purpose (a unique string that allows logs to be correlated). The patch drops the Request ID as well as the Request Number (which was not very useful, I think) and replaces both with the Trace ID. Parsing and generation are provided by the OpenTelemetry SDK itself.
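For reference, the W3C Trace Context travels in the traceparent header: a version, the Trace ID, the parent Span ID and flags, separated by dashes. Borrowing the IDs from the log excerpt below, an incoming header would look like this (01 meaning the trace is sampled):

	traceparent: 00-a58e792bb7754c7132e9812374cf0243-0b9434ba2311ac5f-01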

There is a new middleware telemetry.Middleware inserted into our stack that parses or creates a Trace and Span ID and puts them into the context. The context can then be used anywhere in the codebase to create a new span and measure the performance of any block or function:

func AFunction(ctx context.Context) {
	ctx, span := otel.Tracer(TraceName).Start(ctx, "AFunction")
	defer span.End()

	// ... measured code goes here, passing ctx on so nested calls can create child spans
}
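And this is roughly where the middleware sits in the router. A minimal sketch only; the import path of the telemetry package and the handler are placeholders, the real wiring in the repository differs:

import (
	"net/http"

	"github.com/go-chi/chi/v5"

	"example.com/app/internal/telemetry" // placeholder import path
)

func NewRouter(readyHandler http.HandlerFunc) http.Handler {
	r := chi.NewRouter()
	// Parses the incoming traceparent header or starts a new trace, and
	// stores the trace/span IDs in the request context for handlers and logs.
	r.Use(telemetry.Middleware)
	r.Get("/api/provisioning/v1/ready/{SRV}", readyHandler)
	return r
}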

This way we can measure anything in our stack, and it can be correlated with other services as well to get a full picture of the performance of a whole transaction. The patch does not add spans everywhere; it only adds one in the Image Builder Ready function, because I am worried about conflicts - all clients were heavily modified in my other PR. Let's add them later.

There is also a new OpenTelemetryDoer used in both the ImageBuilder and Sources clients; it automatically gives us a span (measurement) for every request to a backend service. So we can tell how much time was spent waiting for a backend service and how much time was spent in our own code. It looks like the OpenTelemetry HTTP client instrumentation does the very same thing automatically; the only difference is that it does not say whether the client was Image Builder or Sources, but that should be obvious from the parent span (in the Jaeger UI). So I will probably drop this Doer.
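A minimal sketch of the Doer idea, assuming the clients accept anything with a Do method; names other than OpenTelemetryDoer and TraceName are illustrative, the actual implementation in the patch may differ:

import (
	"net/http"

	"go.opentelemetry.io/otel"
)

// HttpRequestDoer stands for whatever client interface the generated clients accept.
type HttpRequestDoer interface {
	Do(req *http.Request) (*http.Response, error)
}

// OpenTelemetryDoer wraps another doer and records one span per request,
// named like "ImageBuilder-GET" (see the log excerpt below).
type OpenTelemetryDoer struct {
	name string          // "ImageBuilder" or "Sources"
	doer HttpRequestDoer // the real HTTP client
}

func (d OpenTelemetryDoer) Do(req *http.Request) (*http.Response, error) {
	// TraceName is the tracer name already used elsewhere in the codebase.
	ctx, span := otel.Tracer(TraceName).Start(req.Context(), d.name+"-"+req.Method)
	defer span.End()

	resp, err := d.doer.Do(req.WithContext(ctx))
	if err != nil {
		span.RecordError(err)
	}
	return resp, err
}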

Now this can all be nicely displayed via the Jaeger UI, but that is not available on the platform yet, so I created a temporary exporter that writes the most important data through our zerolog logger at the TRACE level (a sketch of the exporter idea follows the log breakdown below). It looks like this:

11:19AM TRC otel: HTTP GET duration=2418.531833 hostname=mone.home.lan http.flavor=1.1 http.host=console.stage.redhat.com http.method=GET http.scheme=http http.url=https://console.stage.redhat.com/api/image-builder... otel=true span_id=8fb56982443c1ae0 span_id_parent=a9c1040b9686ba5d trace_id=a58e792bb7754c7132e9812374cf0243
11:19AM TRC {"readiness":"ready"}
 account_id=1 account_number=13 bytes_in=0 content_length=22 hostname=mone.home.lan method=GET org_id=000013 remote_ip=127.0.0.1:50180 response_trace=true status="200 OK" status_code=200 trace_id=a58e792bb7754c7132e9812374cf0243 url=/api/provisioning/v1/ready/ib
11:19AM TRC otel: ImageBuilder-GET duration=2418.952625 hostname=mone.home.lan otel=true span_id=a9c1040b9686ba5d span_id_parent=72638bd6a72a23ed status_code=Ok trace_id=a58e792bb7754c7132e9812374cf0243
11:19AM TRC otel: Ready duration=2419.063708 hostname=mone.home.lan otel=true span_id=72638bd6a72a23ed span_id_parent=0b9434ba2311ac5f trace_id=a58e792bb7754c7132e9812374cf0243
11:19AM INF Completed GET request /api/provisioning/v1/ready/ib in 2.491s ms with 200 bytes_in=0 bytes_out=0 hostname=mone.home.lan latency_ms=2490.723792 method=GET remote_ip=127.0.0.1:50180 status=200 trace_id=a58e792bb7754c7132e9812374cf0243 url=/api/provisioning/v1/ready/ib
11:19AM TRC otel: /api/provisioning/v1/ready/{SRV} duration=2490.834875 hostname=mone.home.lan http.flavor=1.1 http.host=localhost:8000 http.method=GET http.route=/api/provisioning/v1/ready/{SRV} http.scheme=http http.server_name=provisioning-backend http.target=/api/provisioning/v1/ready/ib http.user_agent="Apache-HttpClient/4.5.13 (Java/17.0.3)" net.host.name=localhost net.peer.ip=127.0.0.1 net.transport=ip_tcp otel=true span_id=0b9434ba2311ac5f trace_id=a58e792bb7754c7132e9812374cf0243

This is how you can read this:

  • OpenTelemetry span called "HTTP GET" created by otel HTTP client automatically
  • OpenTelemetry span called "ImageBuilder-GET" created by "Doer" (this can be dropped I think)
  • OpenTelemetry span called "Ready" created in our function called Ready
  • Completed request log line
  • OpenTelemetry span called "/api/provisioning/v1/ready/{SRV}" created by the main middleware (the root span)
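
The exporter itself is small. Roughly speaking, it implements the tracing SDK's SpanExporter interface and forwards finished spans to zerolog; the field names here mirror the log lines above, the actual implementation in the patch may differ:

import (
	"context"

	"github.com/rs/zerolog/log"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// ZerologExporter writes finished spans as TRACE-level zerolog events.
type ZerologExporter struct{}

func (e *ZerologExporter) ExportSpans(ctx context.Context, spans []sdktrace.ReadOnlySpan) error {
	for _, s := range spans {
		log.Trace().
			Str("trace_id", s.SpanContext().TraceID().String()).
			Str("span_id", s.SpanContext().SpanID().String()).
			Str("span_id_parent", s.Parent().SpanID().String()).
			// duration in milliseconds, as in the excerpt above
			Float64("duration", float64(s.EndTime().Sub(s.StartTime()).Microseconds())/1000.0).
			Bool("otel", true).
			Msg("otel: " + s.Name())
	}
	return nil
}

func (e *ZerologExporter) Shutdown(ctx context.Context) error {
	return nil
}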

Telemetry can be completely enabled or disabled in the settings. Exporting telemetry data into the logger can be turned on via the settings too. The code is ready to be extended with a Jaeger exporter once we have that service available.
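Putting it together, the startup wiring could look roughly like this; the config struct and field names are placeholders for whatever the settings actually expose:

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// TelemetryConfig is a placeholder for the real settings structure.
type TelemetryConfig struct {
	Enabled        bool
	LoggerExporter bool
}

func InitializeTelemetry(cfg TelemetryConfig) {
	if !cfg.Enabled {
		return
	}
	opts := []sdktrace.TracerProviderOption{}
	if cfg.LoggerExporter {
		// export finished spans through zerolog (see the exporter sketch above)
		opts = append(opts, sdktrace.WithBatcher(&ZerologExporter{}))
	}
	// a Jaeger exporter can be appended here once the service is available
	otel.SetTracerProvider(sdktrace.NewTracerProvider(opts...))
	// accept and emit the W3C Trace Context (traceparent) header
	otel.SetTextMapPropagator(propagation.TraceContext{})
}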

I also looked into PostgreSQL instrumentation; there is a library that creates spans automatically for each SQL operation, so we would get important timing information for all queries. Unfortunately, this is a very new feature only implemented in pgx/v5, which was released recently, and there are two problems that must be resolved before this can be added.

I am going to implement this in a separate PR; it is a relatively small change. While working on this, I slightly improved the log level configuration (it is now a string, not an int). After this is merged, we all must update our local.yaml files or the app will likely fail to boot.

On top of that, I realized that our Prometheus metrics do not aggregate request paths to chi routes, so for a path like /reservation/{ID} the app generates a unique label for every database ID. This is not correct, so the patch also drops the old way of doing Prometheus metrics and introduces a new middleware that builds the path label from the chi route pattern. I wanted to use a dependency, but all the projects were dead and did not work (we use chi/v5, which they were no longer compatible with). So I copy-pasted the code, added a license note in the source file, and updated it so it works fine.
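The idea of the new middleware, roughly (metric and label names are illustrative, not the exact ones from the patch): label HTTP metrics with the chi route pattern, read after routing has happened, so /reservation/1 and /reservation/2 both land under /reservation/{ID}:

import (
	"net/http"
	"strconv"

	"github.com/go-chi/chi/v5"
	"github.com/go-chi/chi/v5/middleware"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var httpRequests = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "http_requests_total",
	Help: "HTTP requests partitioned by method, chi route pattern and status.",
}, []string{"method", "path", "status"})

func MetricsMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ww := middleware.NewWrapResponseWriter(w, r.ProtoMajor)
		next.ServeHTTP(ww, r)

		// RoutePattern is only complete after the request has been routed,
		// hence it is read after next.ServeHTTP returns.
		pattern := chi.RouteContext(r.Context()).RoutePattern() // e.g. /reservation/{ID}
		httpRequests.WithLabelValues(r.Method, pattern, strconv.Itoa(ww.Status())).Inc()
	})
}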

Overall, I think I achieved the main goal - having a correlation ID (the Trace ID) that can be passed across services. It sticks with a widely accepted W3C standard, and everything is ready for expansion towards a full-blown OpenTelemetry tracing solution with Jaeger.

This was created as part of the platform hackathon 2022. Here is my hackathon presentation:

https://www.youtube.com/watch?v=5yoradDDu7g

@lzap lzap force-pushed the otel-initial branch 2 times, most recently from c976e66 to 443d7db on September 23, 2022 09:57
@lzap
Member Author

lzap commented Sep 23, 2022

I have rewritten the description a lot and added a video, feel free to read it again.

Here is the patch that implements pgx OpenTelemetry; I pulled it out of this PR because it needs some work on the pgx side: https://gist.github.com/lzap/d6e6cc619170d0cddeb57fc502fd0791

I am resolving conflicts now.

@lzap lzap force-pushed the otel-initial branch 7 times, most recently from df5ed65 to 1fba88d on September 23, 2022 15:55
Member

@ezr-ondrej ezr-ondrej left a comment


I'm ❤️ing this!
After reading the code I've got only one nit.

Tho what I do not understand (mostly just because I am not too familiar with telemetry) is where the metrics are now collected (without Jaeger, I mean) and how this ties in with our /metrics endpoint 🤔
Could we read these metrics somehow, or do they just go to logs without jaeger?

)

var RequestIDHeader = "X-Request-Id"

func RequestID(next http.Handler) http.Handler {
Member


we might want to rename this middleware now, right?

Member Author


Sure, added a commit.

@lzap
Member Author

lzap commented Sep 30, 2022

It's confusing, let me explain. OpenTelemetry is just an attempt to create a common API and protocol for three things:

  • telemetry (prometheus, ...)
  • tracing (jaeger, ...)
  • logging (zerolog, ...)

The added value is the correlation that can be added to all three sources. Unfortunately, only the tracing part is currently stable with stable implementations; the telemetry API is v1, but the Prometheus exporter is considered experimental. Therefore, in my patch I only use the OpenTelemetry (otel) API to do tracing.

Later on, we can replace our Prometheus metrics (monitoring) with the OpenTelemetry telemetry API; it should be an easy task since the otel API is heavily inspired by Prometheus, so it will feel very familiar. For this reason, I am keeping the "old" way of gathering metrics, which is completely independent and does not contain any correlation information.

So to answer your question: yes, metrics (via /metrics) are currently completely independent from tracing. Metrics are collected by the Prometheus client library (in memory) until they are scraped by the monitoring infrastructure. I probably should not have worked on this in this PR, as it is a somewhat separate problem, but I took the opportunity to fix it too - we had a serious issue where the code would create a new metric for each database ID, leading to an excessive number of metrics. This could even take our infrastructure down; metrics are now based on chi routes.

@lzap
Member Author

lzap commented Sep 30, 2022

Added a new commit.

@lzap
Member Author

lzap commented Sep 30, 2022

Since this has been reviewed, the tests are passing, and I need to start working on some changes on top of this PR, I am merging. Thanks!

@lzap lzap merged commit 3815012 into RHEnVision:main Sep 30, 2022
@lzap lzap deleted the otel-initial branch September 30, 2022 12:00