
Reduce cardinality in Dapr metrics and add more information to API logs #6919

Merged: 14 commits into dapr:master from metrics-cardinality on Oct 24, 2023

Conversation

ItalyPaleAle (Contributor)

Fixes #6723

As requested by the maintainers, re-opening #6729 now that the merge window for 1.13 is open. It also complies with the "rule" of having refactoring PRs merged early in the release cycle.

The main goal of this PR is to fix #6723 and do that in a "tidy" way.

  • Things are currently "scattered" a bit all over the place. For example, code that extracts metrics needs to have specific rules defined for each endpoint in pkg/diagnostics, and that means that very few building blocks have "proper" metrics.
  • The main refactoring part is changing the Endpoint struct used by the HTTP server to include more fields, so things are done more declaratively in the pkg/http package, alongside the rule definition.
  • For example, this PR would have made "workflows HTTP - obfuscate metrics" (#6904) redundant, as it would have included that change already, but without the need to define specific rules for the building block in pkg/diagnostics.

The main refactoring done above is the reason why the PR is on the larger side, as the changes are all interrelated.

Details of changes

  1. Drastically reduce cardinality of metrics, especially those emitted by the HTTP server:
    • Also removes the possibility of PII being included in the metrics endpoint.
  2. Add more information to API logs for both HTTP and gRPC, including things that are not going to be included in metrics by default anymore:
    • Response status code
    • Response size (for HTTP only & if possible)
    • Latency
  3. When obfuscateURLs is enabled for API logging, API logs now include a more descriptive name for the method rather than the path that was matched in the router. The descriptive names map to the names of the methods in gRPC, for example SaveState in place of POST /state/{storeName}/{name}. Since gRPC API logs are always "obfuscated" (because the params are in the body and not in the "URL"), this makes HTTP and gRPC API logs more consistent too.
  4. Refactors how tracing and API logging (and the API token auth middleware) get the data from the handler/router:
    • The new approach is a lot more declarative, and less based on heuristics (such as parsing the path from the URL again)
    • The new approach also reduces the number of maps that are allocated and used in each request, which generally contained duplicate information generated in multiple parts of the code
  5. For both HTTP and gRPC, the way metadata is added to a tracing span has changed and is now more declarative (see the first sketch after this list):
    • When looking at how tracing spans were composed, the metadata was added in pkg/diagnostics, separately from the endpoints. Only a VERY SMALL number of APIs had proper tracing configured (as a matter of fact, see the number of "TODO"s in this PR), in large part because this was in a separate package. The approach also relied a lot on heuristics.
    • For HTTP, now the Endpoint struct contains a property to add a function that adds properties to a span for tracing purposes. This lives right next to the handler.
    • For gRPC, messages defined in the protos whose name ends in "Request" now must define an AppendSpanAttribute method (this is enforced by a unit test)
      • Note that the SDKs importing the updated protos will need to define new types. However, what matters is that the messages are wire-compatible, so clients using old protos (old SDKs) won't see any change.
  6. Update API Allowlisting/Denylisting for HTTP to:
    • Make sure that the same constants can be used for both HTTP and gRPC, especially versions. Right now, versions are in the format "v1.0-alpha1" for HTTP and "v1alpha1" for gRPC. This PR changes the HTTP versions to the "v1alpha1" format too ("v1.0-alpha1" is preserved for backwards compatibility)
    • Improves performance of the HTTP Allowlist/Denylist, especially when using the "new" format (e.g. versions with "v1alpha1"): checking the allowlist is now a simple map lookup rather than an iteration over the entire allowlist (see the second sketch after this list).
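
To illustrate points 3–5, here is a minimal Go sketch of what such a declarative endpoint definition could look like. All names here (the Endpoint fields, SpanAttributes, APILogName) are hypothetical, for illustration only, and are not necessarily the identifiers used in this PR:

```go
package endpoints

import (
	nethttp "net/http"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// Endpoint describes a route declaratively: the handler, the descriptive
// method name used in obfuscated API logs (mirroring the gRPC method name),
// and a function that appends attributes to the tracing span. Everything is
// defined in one place, next to the handler, instead of in pkg/diagnostics.
type Endpoint struct {
	Methods []string
	Route   string
	Version string

	Handler nethttp.HandlerFunc

	// Name reported in API logs when obfuscateURLs is enabled,
	// e.g. "SaveState" instead of "POST /state/{storeName}/{name}".
	APILogName string

	// Adds metadata to the tracing span for this endpoint.
	SpanAttributes func(r *nethttp.Request, span trace.Span)
}

var saveState = Endpoint{
	Methods:    []string{nethttp.MethodPost},
	Route:      "state/{storeName}/{name}",
	Version:    "v1",
	APILogName: "SaveState",
	Handler:    func(w nethttp.ResponseWriter, r *nethttp.Request) { /* ... */ },
	SpanAttributes: func(r *nethttp.Request, span trace.Span) {
		// Only low-cardinality, non-PII attributes go on the span.
		// (r.PathValue requires Go 1.22+.)
		span.SetAttributes(attribute.String("db.name", r.PathValue("storeName")))
	},
}
```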
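And a sketch of the allowlist change in point 6, under the same caveat that the names are illustrative: membership becomes a single map lookup on a key built from the normalized version, rather than a scan of every rule:

```go
package endpoints

import "strings"

// APIAccessRule is a hypothetical stand-in for an allowlist/denylist entry.
type APIAccessRule struct {
	Name     string // e.g. "state"
	Version  string // "v1alpha1", or legacy HTTP "v1.0-alpha1"
	Protocol string // "http" or "grpc"
}

// buildIndex normalizes legacy HTTP versions ("v1.0-alpha1" -> "v1alpha1")
// so the same constants work for both HTTP and gRPC, then indexes the rules
// into a set for O(1) lookups.
func buildIndex(rules []APIAccessRule) map[string]struct{} {
	idx := make(map[string]struct{}, len(rules))
	for _, r := range rules {
		v := strings.ReplaceAll(r.Version, "1.0-", "1")
		idx[r.Protocol+"|"+r.Name+"|"+v] = struct{}{}
	}
	return idx
}

// isAllowed is a simple map lookup instead of iterating the entire allowlist.
func isAllowed(idx map[string]struct{}, protocol, name, version string) bool {
	_, ok := idx[protocol+"|"+name+"|"+version]
	return ok
}
```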

@ItalyPaleAle ItalyPaleAle requested review from a team as code owners September 13, 2023 22:03
@codecov

codecov bot commented Sep 13, 2023

Codecov Report

Attention: 201 lines in your changes are missing coverage. Please review.

Comparison is base (52e6d18) 65.05% compared to head (4d31a47) 65.07%.
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6919      +/-   ##
==========================================
+ Coverage   65.05%   65.07%   +0.02%     
==========================================
  Files         231      230       -1     
  Lines       20854    20998     +144     
==========================================
+ Hits        13566    13664      +98     
- Misses       6152     6192      +40     
- Partials     1136     1142       +6     
| Files | Coverage Δ |
|---|---|
| pkg/config/configuration.go | 61.43% <100.00%> (+3.64%) ⬆️ |
| pkg/grpc/universalapi/api_metadata.go | 92.75% <ø> (ø) |
| pkg/grpc/universalapi/api_shutdown.go | 100.00% <ø> (ø) |
| pkg/http/api_crypto.go | 87.62% <100.00%> (+1.11%) ⬆️ |
| pkg/http/api_healthz.go | 85.36% <100.00%> (+3.54%) ⬆️ |
| pkg/http/api_lock.go | 96.15% <100.00%> (+1.70%) ⬆️ |
| pkg/http/api_metadata.go | 97.14% <100.00%> (+0.36%) ⬆️ |
| pkg/http/api_shutdown.go | 100.00% <100.00%> (ø) |
| pkg/http/api_workflow.go | 96.84% <100.00%> (+1.31%) ⬆️ |
| pkg/http/middlewares.go | 88.09% <100.00%> (ø) |

... and 20 more files, plus 4 files with indirect coverage changes.


@artursouza artursouza left a comment (Member)

I did not review the entire thing since I think there is a more fundamental decision to be made here. Do we accept breaking changes to metrics? I believe those are part of our contract and people depend on data being published in those.

@@ -228,7 +228,7 @@ func (h *Channel) invokeMethodV1(ctx context.Context, req *invokev1.InvokeMethod
}()

// Emit metric when request is sent
-diag.DefaultHTTPMonitoring.ClientRequestStarted(ctx, channelReq.Method, req.Message().Method, int64(len(req.Message().Data.GetValue())))
+diag.DefaultHTTPMonitoring.ClientRequestStarted(ctx, int64(len(req.Message().Data.GetValue())))
Member

These are breaking changes since it removes information others might already be depending on for their dashboards and alerts. Is there a way to make this configurable?

Member

We should remain consistent with industry-standard golden metrics, meaning latency and status codes should be reported on response metrics. Size, I think, we can remove safely.

Contributor Author

Latency and status codes are included. Size is included when available, but if it can be omitted it would allow for a lot of simplifications.

Member

Are latency and status codes included by default? I want to clarify due to this:

> including things that are not going to be included in metrics by default anymore:
>
> * Response status code
> * Response size (for HTTP only & if possible)
> * Latency

Contributor Author

I wrote that a while ago so I had to go double-check :)

Yes, they are still included (size is included only if available):

https://github.com/ItalyPaleAle/dapr/blob/38b09613667e88e5fe84f48f9f0188b2ce090219/pkg/diagnostics/http_monitoring.go#L208-L240

What I meant is that they are no longer included with the same granularity. Users can rely on the API logs to get more specific data points, such as per-endpoint (rather than just per target app).

Contributor

Is there a way to include back the original granular metrics if needed? I.e. can either this change be the new normal and the old way be behind an opt-in flag or the other way around?

Contributor Author

I can look into that, but before we invest engineering effort, is there any demand for that? The current behavior is causing massive memory consumption (hundreds of MBs), causing live-site incidents due to pods OOM'ing (we actually had users opening support tickets with Azure too... and it seems Diagrid heard from users having issues as well). It is also costing users a lot of money when metrics are ingested in analytics tools. To the point it may even be considered a bug?

Contributor Author

Per conversation on Discord, we can move forward on this PR without the flag to add the old behavior back as opt-in, as there's already enough "substance" here. When this is merged I'll open an issue to discuss that need (see if folks from the community have opinions) and also how that can be implemented (e.g. should it be a permanent option in the Configuration CRD, a feature flag, or...?). That issue will be completed before 1.13 (as a release blocker, but we have lots of time)

Contributor

@ItalyPaleAle is there an issue open to track the addition of a flag to include the original granular metrics?

@filintod filintod (Contributor) commented Jan 19, 2024

Any idea where this issue or flag is implemented? It is mentioned as a release blocker, and operations teams might want to keep using their old dashboards/alerts that rely on these labels.

@ItalyPaleAle (Contributor Author)

> I did not review the entire thing since I think there is a more fundamental decision to be made here. Do we accept breaking changes to metrics? I believe those are part of our contract and people depend on data being published in those.

Yes, it’s a sort of breaking change. In the issue and conversations I’ve had with @yaron2 this was deemed acceptable because the current situation is not sustainable.

On the HTTP server, there’s a new “bucket” created for each path. In RESTful APIs, it’s common to have things like IDs in paths, so there’s REALLY high cardinality. That not only reduces the usefulness of the metrics (too much cardinality makes it harder to see aggregate data), but it has caused live-site incidents for users due to pods being killed with OOM: we have observed daprd using hundreds of MBs of memory due to the metrics subsystem before being OOM’d. Paths can also contain PII, so this causes issues with things like GDPR compliance.

The other side is that the gRPC server doesn’t expose this behavior at all, as it just shows the method name and nothing else. So there were already some unusual behavioral differences there.
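
To make the cardinality problem concrete, here is an illustrative before/after pair of Prometheus series. The metric and label names are made up for this example, not taken from the PR:

```
# Before: one series per concrete path (unbounded cardinality, possible PII)
http_server_request_count{app_id="orders", method="GET", path="/v1.0/state/mystore/user-12345", status="200"} 1
http_server_request_count{app_id="orders", method="GET", path="/v1.0/state/mystore/user-67890", status="200"} 1

# After: only low-cardinality labels remain, and the two requests aggregate
http_server_request_count{app_id="orders", status="200"} 2
```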

@ItalyPaleAle ItalyPaleAle added the autoupdate DaprBot will keep the Pull Request up to date with master branch label Sep 14, 2023
Comment on lines 169 to 172
publicR := s.getRouter()
s.useContextSetup(publicR)
s.useTracing(publicR)
s.useMetrics(publicR)
Contributor Author

This has been kept for backwards-compat, and it was checked by an E2E test, but should the public API server (used by K8s healthchecks primarily) still use metrics and tracing?

Not sure if it's needed to collect metrics on that. And it may be odd to have metrics for the "regular" HTTP server intermixed with metrics from the public HTTP server

@mukundansundar mukundansundar left a comment (Contributor)

I have not completely reviewed the PR yet, but left some initial comments.

@@ -600,7 +598,12 @@ message InvokeActorResponse {
bytes data = 1;
}

-// GetMetadataResponse is a message that is returned on GetMetadata rpc call
+// GetMetadataRequest is the message for the GetMetadata request.
+message GetMetadataRequest {
Contributor

> @ItalyPaleAle: Reviewers: this needed to have a message ending in "Request". This change is backwards compatible on the wire (i.e. v1.11 SDKs can talk to this without issues), but it may require a change in the name in the SDK if the protos are updated. (#6729 (comment))

I agree that on the wire it might be the same, but this will require a breaking change for users who use the proto directly, and it must be communicated as such.

Contributor Author

Yes, but:

  1. This will not break apps that are currently running, which is the most important thing. Users may need to change a struct name if they import the protos directly and are updating the protos.
  2. Using the gRPC protos directly is not an officially supported scenario.

Contributor

> Using the gRPC protos directly is not an officially supported scenario.

Still, it is something that users might use. Again, my point is that it must be communicated as a breaking change. We can expand on it, saying that running apps will not be affected; as of now, only a struct rename is needed for new proto imports.

Member

This is a breaking change; however, it breaks at compilation time, not at runtime. I recommend we announce the breaking change with no need to create a v2 of this method.

Contributor Author

> This is a breaking change; however, it breaks at compilation time, not at runtime. I recommend we announce the breaking change with no need to create a v2 of this method.

This is a fair assessment.

To recap, and to be fully clear: in practical effect this change is similar to #6945 (also landing in 1.13), which removed a method from the gRPC proto (though that was done for different reasons). Renames like this are wire-compatible because the protobuf wire format encodes only field numbers and wire types, never message names.

  • No impact on apps currently running
  • Dapr SDKs maintainers will need to rename the structs/objects when they upgrade the protos
  • Users of the SDKs should not see any direct impact.
    • AFAIK the SDKs always abstract the gRPC objects, so users who just depend on Dapr SDKs shouldn't need any code change if/when they upgrade the SDKs.
    • At runtime, old SDKs will continue to work alongside new SDKs.
  • For users who import the protos directly:
    • No change needed if they are using existing protos
    • If they upgrade the protos, they will need to rename the structs/objects


(Resolved review thread on pkg/config/configuration.go)
@ItalyPaleAle force-pushed the metrics-cardinality branch 2 times, most recently from 458ff08 to d2fec67, on September 25, 2023 at 17:44
@RyanLettieri RyanLettieri left a comment (Contributor)

First pass through, just some syntax and tidying up suggestions

(Resolved review threads on pkg/diagnostics/http_tracing.go, pkg/grpc/server.go, and pkg/http/api_lock.go (outdated))
@yaron2 (Member) commented Oct 19, 2023

Since it is difficult to know at this point if some labels need to be reinstated by default, I suggest we merge this PR so we can audit the new metrics carefully and then reinstate whatever is needed (by rolling forward, not reverting the PR).

As mentioned, an option will be introduced later to enable the more granular metrics we had until now, and that option might be used to choose if we want non-granular or granular metrics by default.

cc @mukundansundar @artursouza @daixiang0

@artursouza (Member)

> Since it is difficult to know at this point if some labels need to be reinstated by default, I suggest we merge this PR so we can audit the new metrics carefully and then reinstate whatever is needed (by rolling forward, not reverting the PR).
>
> As mentioned, an option will be introduced later to enable the more granular metrics we had until now, and that option might be used to choose if we want non-granular or granular metrics by default.
>
> cc @mukundansundar @artursouza @daixiang0

Yes. We need the nightly build to work too so we can test with an immutable artifact.

@artursouza (Member)

I was thinking about this and it is an opportunity for us to increase test coverage. Before merging this PR, I think there should be a PR that adds functional tests for metrics from the current implementation. Then this PR can be merged with higher guarantees of not adding a regression.

This was done by Josh when he added upgrade/downgrade tests and integration tests to increase confidence in the refactoring and changes he wanted to merge.

@ItalyPaleAle (Contributor Author) commented Oct 20, 2023

> Before merging this PR, I think there should be a PR that adds functional tests for metrics from the current implementation.

What test would you like to see?

We currently have E2E tests that check that metrics are sent to Prometheus. Because of the dependency on an external service (Prometheus), testing metrics is probably not well suited for an IT (integration test). If you have suggestions on additional E2E tests, let me know.

EDIT: Scratch the above. I think there's something I can do with ITs.
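
As a sketch of what such an IT could look like, scraping the metrics endpoint in a "pull" way and asserting that no per-request path label is present. The port, endpoint path, and metric name below are assumptions for illustration, not the actual test added to the PR:

```go
package integration

import (
	"net/http"
	"testing"

	"github.com/prometheus/common/expfmt"
)

func TestMetricsHaveLowCardinality(t *testing.T) {
	// Assumes daprd exposes Prometheus text-format metrics on :9090/metrics.
	resp, err := http.Get("http://127.0.0.1:9090/metrics")
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		t.Fatal(err)
	}

	// Hypothetical metric name; the real name would come from pkg/diagnostics.
	mf, ok := families["dapr_http_server_request_count"]
	if !ok {
		t.Skip("metric family not found; name is illustrative")
	}
	for _, m := range mf.GetMetric() {
		for _, lp := range m.GetLabel() {
			if lp.GetName() == "path" {
				t.Errorf("unexpected high-cardinality label path=%q", lp.GetValue())
			}
		}
	}
}
```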

ItalyPaleAle added commits; the commit messages repeat the PR description above (Fixes dapr#6723).
@ItalyPaleAle (Contributor Author)

@artursouza I've added an IT that validates metrics are sent correctly.

For tracing, we have an existing E2E test; it's not going to be possible to create an IT for traces, as that requires a service (Zipkin). For metrics we can scrape them in a "pull" way, but traces are pushed to Zipkin, so they require a running service.

Should that cover the tests?

ItalyPaleAle and others added 2 commits October 20, 2023 12:19
@artursouza (Member)

/test-sdk-all

@artursouza (Member)

/test-sdks-all

@dapr-bot (Collaborator) commented Oct 20, 2023

Dapr SDK Python test

🔗 Link to Action run

Commit ref: 7da569d

✅ Python SDK tests passed

@dapr-bot (Collaborator) commented Oct 20, 2023

Dapr SDK Java test

🔗 Link to Action run

Commit ref: 7da569d

❌ Java SDK tests failed

Please check the logs for details on the error.

@dapr-bot (Collaborator) commented Oct 20, 2023

Dapr SDK Go test

🔗 Link to Action run

Commit ref: 7da569d

✅ Go SDK tests passed

@dapr-bot (Collaborator) commented Oct 20, 2023

Dapr SDK JS test

🔗 Link to Action run

Commit ref: 7da569d

❌ JS SDK tests failed

Please check the logs for details on the error.

@daixiang0 (Member)

> Dapr SDK JS test
>
> 🔗 Link to Action run
>
> Commit ref: 7da569d
>
> ❌ JS SDK tests failed
>
> Please check the logs for details on the error.

Is it still a network issue?

@ItalyPaleAle (Contributor Author) commented Oct 24, 2023

> Dapr SDK JS test
>
> 🔗 Link to Action run
> Commit ref: 7da569d
>
> ❌ JS SDK tests failed
>
> Please check the logs for details on the error.
>
> Is it still a network issue?

Looks like the tests are broken. Here's a run from master: https://github.com/dapr/dapr/actions/runs/6620638088

@JoshVanL JoshVanL mentioned this pull request Oct 24, 2023
@yaron2 yaron2 merged commit 4d4c8b9 into dapr:master Oct 24, 2023
29 of 31 checks passed
@JoshVanL JoshVanL added this to the v1.13 milestone Nov 27, 2023
@JoshVanL JoshVanL added the P0 label Nov 27, 2023
@mukundansundar mukundansundar added the breaking-change This is a breaking change label Dec 12, 2023
Labels: autoupdate (DaprBot will keep the Pull Request up to date with master branch), breaking-change (This is a breaking change), P0

Successfully merging this pull request may close these issues:

Reduce cardinality in Dapr metrics and add more information to API logs (#6723)