Reduce cardinality in Dapr metrics and add more information to API logs #6919
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #6919      +/-   ##
==========================================
+ Coverage   65.05%   65.07%   +0.02%
==========================================
  Files         231      230       -1
  Lines       20854    20998     +144
==========================================
+ Hits        13566    13664      +98
- Misses       6152     6192      +40
- Partials     1136     1142      +6
```

☔ View full report in Codecov by Sentry.
I did not review the entire thing since I think there is a more fundamental decision to be made here. Do we accept breaking changes to metrics? I believe those are part of our contract and people depend on data being published in those.
```diff
@@ -228,7 +228,7 @@ func (h *Channel) invokeMethodV1(ctx context.Context, req *invokev1.InvokeMethod
 	}()

 	// Emit metric when request is sent
-	diag.DefaultHTTPMonitoring.ClientRequestStarted(ctx, channelReq.Method, req.Message().Method, int64(len(req.Message().Data.GetValue())))
+	diag.DefaultHTTPMonitoring.ClientRequestStarted(ctx, int64(len(req.Message().Data.GetValue())))
```
These are breaking changes since it removes information others might already be depending on for their dashboards and alerts. Is there a way to make this configurable?
We should remain consistent with industry-standard golden metrics: latency and status codes should be reported on response metrics. Size, I think, we can safely remove.
Latency and status codes are included. Size is included when available, but if it can be omitted it would allow for a lot of simplifications.
Are latency and status codes included by default? I want to clarify due to this:

> including things that are not going to be included in metrics by default anymore:
> * Response status code
> * Response size (for HTTP only & if possible)
> * Latency
I wrote that a while ago so I had to go double-check :)
Yes, they are still included (size is included only if available):
What I think I meant with that is that they are not included anymore with the same granularity. Users can rely on the API logs to get more specific data points, such as per-endpoint (rather than just per target app)
Is there a way to bring back the original granular metrics if needed? That is, can this change be the new default with the old behavior behind an opt-in flag, or the other way around?
I can look into that, but before we invest engineering effort, is there any demand for that? The current behavior is causing massive memory consumption (hundreds of MBs) and live-site incidents due to pods OOM'ing (we actually had users opening support tickets with Azure too... and it seems Diagrid heard from users having issues as well). It is also costing users a lot of money when metrics are ingested into analytics tools. To the point that it may even be considered a bug?
Per the conversation on Discord, we can move forward with this PR without the flag to add the old behavior back as opt-in, as there's already enough "substance" here. When this is merged I'll open an issue to discuss that need (and see if folks from the community have opinions) and how it could be implemented (e.g. should it be a permanent option in the Configuration CRD, a feature flag, or...?). That issue will be completed before 1.13 (as a release blocker, but we have lots of time).
@ItalyPaleAle is there an issue open to track the addition of a flag to bring back the original granular metrics?
Any idea where this issue or flag was implemented? It is mentioned as a release blocker, and operations teams might want to keep using their old dashboards/alerts that rely on these labels.
Yes, it's a sort of breaking change. In the issue and in conversations I've had with @yaron2, this was deemed acceptable because the current situation is not sustainable. On the HTTP server, a new "bucket" is created for each path. In RESTful APIs, it's common to have things like IDs in paths, so cardinality gets REALLY high. That not only reduces the usefulness of the metrics (too much cardinality makes it harder to see aggregate data), but it has caused live-site incidents for users due to pods being killed with OOM: we have observed daprd using hundreds of MBs of memory for the metrics subsystem before being OOM'd. Paths can also contain PII, which causes issues with things like GDPR compliance. On the other side, the gRPC server doesn't exhibit this behavior at all, as it just reports the method name and nothing else, so there were already some unusual behavioral differences between the two.
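The cardinality problem described above comes from using raw request paths as metric label values. One common mitigation (collapsing identifier-like path segments into a placeholder) could be sketched as follows; the helper name and regex are hypothetical illustrations, not Dapr's actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// idSegment matches path segments that look like identifiers (plain numbers,
// UUIDs, long hex strings) — the usual source of unbounded label values.
var idSegment = regexp.MustCompile(`^([0-9]+|[0-9a-fA-F-]{16,})$`)

// normalizePath collapses identifier-like segments into a placeholder so a
// metric label derived from the path has bounded cardinality.
func normalizePath(path string) string {
	parts := strings.Split(path, "/")
	for i, p := range parts {
		if idSegment.MatchString(p) {
			parts[i] = "{id}"
		}
	}
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(normalizePath("/v1.0/state/mystore/8412")) // prints /v1.0/state/mystore/{id}
}
```

With this kind of normalization, a million distinct order IDs produce one time series instead of a million, though as noted in the PR the chosen fix was to drop per-path labels entirely rather than guess at templates.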
```go
publicR := s.getRouter()
s.useContextSetup(publicR)
s.useTracing(publicR)
s.useMetrics(publicR)
```
This has been kept for backwards-compat, and it was checked by an E2E test, but should the public API server (used by K8s healthchecks primarily) still use metrics and tracing?
Not sure if it's needed to collect metrics on that. And it may be odd to have metrics for the "regular" HTTP server intermixed with metrics from the public HTTP server
I have not completely reviewed the PR yet. But left some initial comments.
```diff
@@ -600,7 +598,12 @@ message InvokeActorResponse {
   bytes data = 1;
 }

-// GetMetadataResponse is a message that is returned on GetMetadata rpc call
+// GetMetadataRequest is the message for the GetMetadata request.
+message GetMetadataRequest {
```
@ItalyPaleAle: Reviewers: this needed to have a message ending in "Request". This change is backwards-compatible on the wire (i.e. v1.11 SDKs can talk to this without issues), but it may require a name change in the SDKs if the protos are updated.
#6729 (comment)
I agree that on the wire it might be the same, but this will be a breaking change for users who consume the protos directly, and it must be communicated as such.
Yes, but:
- This will not break apps that are currently running, which is the most important thing. Users may need to change a struct name if they import the protos directly and are updating the protos.
- Using the gRPC protos directly is not an officially supported scenario.
> Using the gRPC protos directly is not an officially supported scenario.

Still something that users might use. Again, my point is that it must be communicated as a breaking change. We can note that running apps will not be affected; for now it's only a struct name change for new proto imports.
This is a breaking change, however this will be breaking at compilation time and not at runtime. I recommend we announce the breaking change with no need to create a v2 of this method.
> This is a breaking change, however this will be breaking at compilation time and not at runtime. I recommend we announce the breaking change with no need to create a v2 of this method.
This is a fair assessment.
To recap, and to be fully clear: in practical effect, this change is similar to what happened with #6945 (landing in 1.13 too), which removed a method from the gRPC proto (though that was done for different reasons).
- No impact on apps currently running
- Dapr SDKs maintainers will need to rename the structs/objects when they upgrade the protos
- Users of the SDKs should not see any direct impact.
- AFAIK the SDKs always abstract the gRPC objects, so users who just depend on Dapr SDKs shouldn't need any code change if/when they upgrade the SDKs.
- At runtime, old SDKs will continue to work alongside new SDKs.
- For users who import the protos directly:
- No change needed if they are using existing protos
- If they upgrade the protos, they will need to rename the structs/objects
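The reason the break is compile-time only can be seen in how protobuf works: the wire format encodes only field numbers and wire types, while message names exist solely in generated code. A hedged illustration (the field shown here is hypothetical, not necessarily the real proto contents):

```proto
// Illustrative only: renaming or adding a message changes generated struct
// names, but the serialized bytes never contain the message name, so peers
// that agree on the field numbers remain compatible.
message GetMetadataRequest {}   // new message: affects codegen, not the wire

message GetMetadataResponse {
  string id = 1;                // on the wire this is just "field 1, length-delimited"
}
```

This is why v1.11 SDKs keep working at runtime while SDK maintainers only need a rename when they regenerate from the updated protos.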
Force-pushed from 458ff08 to d2fec67.
First pass through, just some syntax and tidying up suggestions
Since it is difficult to know at this point if some labels need to be reinstated by default, I suggest we merge this PR so we can audit the new metrics carefully and then reinstate whatever is needed (by rolling forward, not by reverting the PR). As mentioned, an option will be introduced later to enable the more granular metrics we had until now, and that option might be used to choose whether we want non-granular or granular metrics by default.
Yes. We need the nightly build to work too so we can test with an immutable artifact.
I was thinking about this and it is an opportunity for us to increase test coverage. Before merging this PR, I think there should be a PR that adds functional tests for metrics from the current implementation. Then this PR can be merged with higher guarantees of not adding a regression. This was done by Josh when he added upgrade/downgrade tests and integration tests to increase confidence in the refactoring and changes he wanted to merge.
What test would you like to see?
EDIT: Scratch the above. I think there's something I can do with ITs.
Fixes dapr#6723. Includes:

1. Drastically reduce the cardinality of metrics, especially those emitted by the HTTP server:
   - Also removes the possibility of PII being included in the metrics endpoint.
2. Add more information to API logs for both HTTP and gRPC, including things that are no longer included in metrics by default:
   - Response status code
   - Response size (for HTTP only & if possible)
   - Latency
3. When `obfuscateURLs` is enabled for API logging, API logs now include a more descriptive name for the method rather than the path that was matched in the router. The descriptive names map to the names of the methods in gRPC, for example `SaveState` in place of `POST /state/{storeName}/{name}`. Since gRPC API logs are always and only "obfuscated" (the params are in the body, not in the "URL"), this makes HTTP and gRPC API logs more consistent too.
4. Refactors how tracing and API logging (and the API token auth middleware) get data from the handler/router:
   - The new approach is a lot more declarative and less based on heuristics (such as re-parsing the path from the URL).
   - It also reduces the number of maps allocated and used in each request, which previously contained duplicate information generated in multiple parts of the code.
5. For both HTTP and gRPC, the way metadata is added to a tracing span has changed and is now more declarative:
   - Previously, span metadata was added in `pkg/diagnostics`, separately from the endpoints. Only a VERY SMALL number of APIs had proper tracing configured (as a matter of fact, see the number of "TODO"s in this PR), in large part because this lived in a separate package. The approach also relied a lot on heuristics.
   - For HTTP, the `Endpoint` struct now contains a property with a function that adds attributes to a span for tracing purposes. This lives right next to the handler.
   - For gRPC, messages defined in the protos whose name ends in "Request" must now define an `AppendSpanAttribute` method (this is enforced by a unit test).
6. Updates API allowlisting/denylisting for HTTP to:
   - Make sure the same constants can be used for both HTTP and gRPC, especially versions. Versions were in the format "v1.0-alpha1" for HTTP and "v1alpha1" for gRPC; this PR changes the HTTP versions to "v1alpha1" too ("v1.0-alpha1" is preserved for backwards compatibility).
   - Improve performance of the HTTP allowlist/denylist, especially when using the "new" format (versions like "v1alpha1"): checking the allowlist is now a simple map lookup rather than an iteration over the entire allowlist.

Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Force-pushed from 5907996 to 2012ffd.
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
Signed-off-by: ItalyPaleAle <43508+ItalyPaleAle@users.noreply.github.com>
@artursouza I've added an IT that validates that metrics are sent correctly. For tracing, we have an existing E2E test; it's not going to be possible to create an IT for traces, as that requires a service (Zipkin). Metrics can be scraped in a "pull" way, but traces are pushed to Zipkin, so they require a running service. Should that cover the tests?
Signed-off-by: Alessandro (Ale) Segala <43508+ItalyPaleAle@users.noreply.github.com>
/test-sdk-all

/test-sdks-all
Dapr SDK Python test (commit ref: 7da569d): ✅ Python SDK tests passed
Dapr SDK Java test (commit ref: 7da569d): ❌ Java SDK tests failed. Please check the logs for details on the error.
Dapr SDK Go test (commit ref: 7da569d): ✅ Go SDK tests passed
Dapr SDK JS test (commit ref: 7da569d): ❌ JS SDK tests failed. Please check the logs for details on the error.
Is it still a network issue?
Looks like the tests are broken. Here's a run from master: https://github.com/dapr/dapr/actions/runs/6620638088
Fixes #6723
As requested by the maintainers, re-opening #6729 now that the merge window for 1.13 is open. It also complies with the "rule" of having refactoring PRs merged early in the release cycle.
The main goal of this PR is to fix #6723 and do that in a "tidy" way.
The `Endpoint` struct used by the HTTP server was expanded to include more fields, so things are done more declaratively in the `pkg/http` package, alongside the route definition. The main refactoring done above is the reason why the PR is on the larger side, as the changes are all interrelated.

Details of changes

* When `obfuscateURLs` is enabled for API logging, API logs now include a more descriptive name for the method rather than the path that was matched in the router. The descriptive names map to the names of the methods in gRPC, for example `SaveState` in place of `POST /state/{storeName}/{name}`. Since gRPC API logs are always and only "obfuscated" (the params are in the body, not in the "URL"), this makes HTTP and gRPC API logs more consistent too.
* Previously, span metadata was added in `pkg/diagnostics`, separately from the endpoints. Only a VERY SMALL number of APIs had proper tracing configured (as a matter of fact, see the number of "TODO"s in this PR), in large part because this lived in a separate package. The approach also relied a lot on heuristics.
* For HTTP, the `Endpoint` struct now contains a property with a function that adds attributes to a span for tracing purposes. This lives right next to the handler.
* For gRPC, messages defined in the protos whose name ends in "Request" must now define an `AppendSpanAttribute` method (this is enforced by a unit test).