Reduce cardinality in Dapr metrics and add more information to API logs #6919
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #6919      +/-   ##
==========================================
+ Coverage   65.05%   65.07%   +0.02%
==========================================
  Files         231      230       -1
  Lines       20854    20998     +144
==========================================
+ Hits        13566    13664      +98
- Misses       6152     6192      +40
- Partials     1136     1142      +6
```

☔ View full report in Codecov by Sentry.
I did not review the entire thing since I think there is a more fundamental decision to be made here. Do we accept breaking changes to metrics? I believe those are part of our contract and people depend on data being published in those.
```diff
@@ -228,7 +228,7 @@ func (h *Channel) invokeMethodV1(ctx context.Context, req *invokev1.InvokeMethod
 	}()

 	// Emit metric when request is sent
-	diag.DefaultHTTPMonitoring.ClientRequestStarted(ctx, channelReq.Method, req.Message().Method, int64(len(req.Message().Data.GetValue())))
+	diag.DefaultHTTPMonitoring.ClientRequestStarted(ctx, int64(len(req.Message().Data.GetValue())))
```
These are breaking changes since it removes information others might already be depending on for their dashboards and alerts. Is there a way to make this configurable?
We should remain consistent with industry-standard golden metrics: latency and status codes should be reported on response metrics. Size, I think, we can safely remove.
Latency and status codes are included. Size is included when available, but if it can be omitted it would allow for a lot of simplifications.
Are latency and status codes included by default? I want to clarify due to this:

> including things that are not going to be included in metrics by default anymore:
> * Response status code
> * Response size (for HTTP only & if possible)
> * Latency
I wrote that a while ago so I had to go double-check :)
Yes, they are still included (size is included only if available):
What I think I meant with that is that they are not included anymore with the same granularity. Users can rely on the API logs to get more specific data points, such as per-endpoint (rather than just per target app)
Is there a way to bring back the original granular metrics if needed? That is, can this change be the new default with the old behavior behind an opt-in flag, or the other way around?
I can look into that, but before we invest engineering effort, is there any demand for that? The current behavior is causing massive memory consumption (hundreds of MBs) and live-site incidents due to pods OOM'ing (we actually had users opening support tickets with Azure too... and it seems Diagrid heard from users having issues as well). It is also costing users a lot of money when metrics are ingested into analytics tools. To the point that it may even be considered a bug?
Per the conversation on Discord, we can move forward with this PR without the flag to add the old behavior back as opt-in, as there's already enough "substance" here. When this is merged I'll open an issue to discuss that need (and see if folks from the community have opinions) and how it could be implemented (e.g. should it be a permanent option in the Configuration CRD, a feature flag, or...?). That issue will be completed before 1.13 (as a release blocker, but we have lots of time).
@ItalyPaleAle is there an issue open to track the addition of a flag to bring back the original granular metrics?
Any idea where this issue or flag was implemented? It is mentioned as a release blocker, and operations teams might want to keep using their old dashboards/alerts that rely on these labels.
Yes, it's a sort of breaking change. In the issue and in conversations I've had with @yaron2, this was deemed acceptable because the current situation is not sustainable. On the HTTP server, a new "bucket" is created for each path. In RESTful APIs, it's common to have things like IDs in paths, so cardinality gets REALLY high. That not only reduces the usefulness of the metrics (too much cardinality makes it harder to see aggregate data), but it has caused live-site incidents for users due to pods being killed with OOM: we have observed daprd using hundreds of MBs of memory for the metrics subsystem before being OOM'd. Paths can also contain PII, which causes issues with things like GDPR compliance. On the other side, the gRPC server doesn't exhibit this behavior at all, as it just reports the method name and nothing else, so there were already some unusual behavioral differences between the two.
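The cardinality problem described above comes from using raw request paths as metric label values. One common mitigation (collapsing identifier-like path segments into a placeholder) could be sketched as follows; the helper name and regex are hypothetical illustrations, not Dapr's actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// idSegment matches path segments that look like identifiers (plain numbers,
// UUIDs, long hex strings) — the usual source of unbounded label values.
var idSegment = regexp.MustCompile(`^([0-9]+|[0-9a-fA-F-]{16,})$`)

// normalizePath collapses identifier-like segments into a placeholder so a
// metric label derived from the path has bounded cardinality.
func normalizePath(path string) string {
	parts := strings.Split(path, "/")
	for i, p := range parts {
		if idSegment.MatchString(p) {
			parts[i] = "{id}"
		}
	}
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(normalizePath("/v1.0/state/mystore/8412")) // prints /v1.0/state/mystore/{id}
}
```

With this kind of normalization, a million distinct order IDs produce one time series instead of a million, though as noted in the PR the chosen fix was to drop per-path labels entirely rather than guess at templates.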
```go
publicR := s.getRouter()
s.useContextSetup(publicR)
s.useTracing(publicR)
s.useMetrics(publicR)
```
This has been kept for backwards-compat, and it was checked by an E2E test, but should the public API server (used by K8s healthchecks primarily) still use metrics and tracing?
Not sure if it's needed to collect metrics on that. And it may be odd to have metrics for the "regular" HTTP server intermixed with metrics from the public HTTP server
I have not completely reviewed the PR yet. But left some initial comments.
```diff
@@ -600,7 +598,12 @@ message InvokeActorResponse {
   bytes data = 1;
 }

-// GetMetadataResponse is a message that is returned on GetMetadata rpc call
+// GetMetadataRequest is the message for the GetMetadata request.
+message GetMetadataRequest {
```
@ItalyPaleAle: Reviewers: this needed to have a message ending in "Request". This change is backwards-compatible on the wire (i.e. v1.11 SDKs can talk to this without issues), but it may require a name change in the SDKs if the protos are updated.
#6729 (comment)
I agree that on the wire it might be the same, but this will be a breaking change for users who consume the protos directly, and it must be communicated as such.
Yes, but:
- This will not break apps that are currently running, which is the most important thing. Users may need to change a struct name if they import the protos directly and are updating the protos.
- Using the gRPC protos directly is not an officially supported scenario.
> Using the gRPC protos directly is not an officially supported scenario.

Still something that users might use. Again, my point is that it must be communicated as a breaking change. We can note that running apps will not be affected; for now it's only a struct name change for new proto imports.
This is a breaking change, however this will be breaking at compilation time and not at runtime. I recommend we announce the breaking change with no need to create a v2 of this method.
> This is a breaking change, however this will be breaking at compilation time and not at runtime. I recommend we announce the breaking change with no need to create a v2 of this method.
This is a fair assessment.
To recap, and to be fully clear: in practical effect, this change is similar to what happened with #6945 (landing in 1.13 too), which removed a method from the gRPC proto (though that was done for different reasons).
- No impact on apps currently running
- Dapr SDKs maintainers will need to rename the structs/objects when they upgrade the protos
- Users of the SDKs should not see any direct impact.
- AFAIK the SDKs always abstract the gRPC objects, so users who just depend on Dapr SDKs shouldn't need any code change if/when they upgrade the SDKs.
- At runtime, old SDKs will continue to work alongside new SDKs.
- For users who import the protos directly:
- No change needed if they are using existing protos
- If they upgrade the protos, they will need to rename the structs/objects
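The reason the break is compile-time only can be seen in how protobuf works: the wire format encodes only field numbers and wire types, while message names exist solely in generated code. A hedged illustration (the field shown here is hypothetical, not necessarily the real proto contents):

```proto
// Illustrative only: renaming or adding a message changes generated struct
// names, but the serialized bytes never contain the message name, so peers
// that agree on the field numbers remain compatible.
message GetMetadataRequest {}   // new message: affects codegen, not the wire

message GetMetadataResponse {
  string id = 1;                // on the wire this is just "field 1, length-delimited"
}
```

This is why v1.11 SDKs keep working at runtime while SDK maintainers only need a rename when they regenerate from the updated protos.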
Force-pushed from 458ff08 to d2fec67.
First pass through, just some syntax and tidying up suggestions
Since it is difficult to know at this point if some labels need to be reinstated by default, I suggest we merge this PR so we can audit the new metrics carefully and then reinstate whatever is needed (by rolling forward, not by reverting the PR). As mentioned, an option will be introduced later to enable the more granular metrics we had until now, and that option might be used to choose whether we want non-granular or granular metrics by default.
Yes. We need the nightly build to work too so we can test with an immutable artifact.
I was thinking about this and it is an opportunity for us to increase test coverage. Before merging this PR, I think there should be a PR that adds functional tests for metrics from the current implementation. Then this PR can be merged with higher guarantees of not adding a regression. This was done by Josh when he added upgrade/downgrade tests and integration tests to increase confidence in the refactoring and changes he wanted to merge.
What test would you like to see?
EDIT: Scratch the above. I think there's something I can do with ITs.
Fixes dapr#6723. Includes:

1. Drastically reduce the cardinality of metrics, especially those emitted by the HTTP server:
   - Also removes the possibility of PII being included in the metrics endpoint.
2. Add more information to API logs for both HTTP and gRPC, including things that are no longer included in metrics by default:
   - Response status code
   - Response size (for HTTP only & if possible)
   - Latency
3. When `obfuscateURLs` is enabled for API logging, API logs now include a more descriptive name for the method rather than the path that was matched in the router. The descriptive names map to the names of the methods in gRPC, for example `SaveState` in place of `POST /state/{storeName}/{name}`. Since gRPC API logs are always and only "obfuscated" (the params are in the body, not in the "URL"), this makes HTTP and gRPC API logs more consistent too.
4. Refactors how tracing and API logging (and the API token auth middleware) get data from the handler/router:
   - The new approach is a lot more declarative and less based on heuristics (such as re-parsing the path from the URL).
   - It also reduces the number of maps allocated and used in each request, which previously contained duplicate information generated in multiple parts of the code.
5. For both HTTP and gRPC, the way metadata is added to a tracing span has changed and is now more declarative:
   - Previously, span metadata was added in `pkg/diagnostics`, separately from the endpoints. Only a VERY SMALL number of APIs had proper tracing configured (as a matter of fact, see the number of "TODO"s in this PR), in large part because this lived in a separate package. The approach also relied a lot on heuristics.
   - For HTTP, the `Endpoint` struct now contains a property with a function that adds attributes to a span for tracing purposes. This lives right next to the handler.
   - For gRPC, messages defined in the protos whose name ends in "Request" must now define an `AppendSpanAttribute` method (this is enforced by a unit test).
6. Updates API allowlisting/denylisting for HTTP to:
   - Make sure the same constants can be used for both HTTP and gRPC, especially versions. Versions were in the format "v1.0-alpha1" for HTTP and "v1alpha1" for gRPC; this PR changes the HTTP versions to "v1alpha1" too ("v1.0-alpha1" is preserved for backwards compatibility).
   - Improve performance of the HTTP allowlist/denylist, especially when using the "new" format (versions like "v1alpha1"): checking the allowlist is now a simple map lookup rather than an iteration over the entire allowlist.

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Force-pushed from 5907996 to 2012ffd.
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@artursouza I've added an IT that validates that metrics are sent correctly. For tracing, we have an existing E2E test; it's not going to be possible to create an IT for traces, as that requires a service (Zipkin). Metrics can be scraped in a "pull" way, but traces are pushed to Zipkin, so they require a running service. Should that cover the tests?
Signed-off-by: Alessandro (Ale) Segala <43508+ItalyPaleAle@users.noreply.github.com>
/test-sdk-all

/test-sdks-all
Dapr SDK Python test (commit ref: 7da569d): ✅ Python SDK tests passed
Dapr SDK Java test (commit ref: 7da569d): ❌ Java SDK tests failed. Please check the logs for details on the error.
Dapr SDK Go test (commit ref: 7da569d): ✅ Go SDK tests passed
Dapr SDK JS test (commit ref: 7da569d): ❌ JS SDK tests failed. Please check the logs for details on the error.
Is it still a network issue?
Looks like the tests are broken. Here's a run from master: https://github.com/dapr/dapr/actions/runs/6620638088
Fixes #6723
As requested by the maintainers, re-opening #6729 now that the merge window for 1.13 is open. It also complies with the "rule" of having refactoring PRs merged early in the release cycle.
The main goal of this PR is to fix #6723 and do that in a "tidy" way.
The `Endpoint` struct used by the HTTP server was expanded to include more fields, so things are done more declaratively in the `pkg/http` package, alongside the route definition. The main refactoring done above is the reason why the PR is on the larger side, as the changes are all interrelated.

Details of changes

* When `obfuscateURLs` is enabled for API logging, API logs now include a more descriptive name for the method rather than the path that was matched in the router. The descriptive names map to the names of the methods in gRPC, for example `SaveState` in place of `POST /state/{storeName}/{name}`. Since gRPC API logs are always and only "obfuscated" (the params are in the body, not in the "URL"), this makes HTTP and gRPC API logs more consistent too.
* Previously, span metadata was added in `pkg/diagnostics`, separately from the endpoints. Only a VERY SMALL number of APIs had proper tracing configured (as a matter of fact, see the number of "TODO"s in this PR), in large part because this lived in a separate package. The approach also relied a lot on heuristics.
* For HTTP, the `Endpoint` struct now contains a property with a function that adds attributes to a span for tracing purposes. This lives right next to the handler.
* For gRPC, messages defined in the protos whose name ends in "Request" must now define an `AppendSpanAttribute` method (this is enforced by a unit test).