-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce cardinality in Dapr metrics and add more information to API logs #6919
Changes from all commits
2012ffd
fbb3624
e95974e
80ab29a
7da569d
d8761dd
3fb6b3d
4538ed2
07d74da
30bfca0
62899a1
50d6074
2803b4b
4d31a47
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -228,7 +228,7 @@ | |
}() | ||
|
||
// Emit metric when request is sent | ||
diag.DefaultHTTPMonitoring.ClientRequestStarted(ctx, channelReq.Method, req.Message().Method, int64(len(req.Message().Data.GetValue()))) | ||
diag.DefaultHTTPMonitoring.ClientRequestStarted(ctx, int64(len(req.Message().Data.GetValue()))) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These are breaking changes since it removes information others might already be depending on for their dashboards and alerts. Is there a way to make this configurable? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should remain consistent with industry standard golden metrics - meaning, latency and status codes should be reported on response metrics. Size I think we can remove safely There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Latency and status codes are included. Size is included when available, but if it can be omitted it would allow for a lot of simplifications. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are latency and status codes included by default? I want to clarify due to this:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wrote that a while ago so I had to go double-check :) Yes, they are still included (size is included only if available): What I think I meant with that is that they are not included anymore with the same granularity. Users can rely on the API logs to get more specific data points, such as per-endpoint (rather than just per target app) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a way to include back the original granular metrics if needed? I.e. can either this change be the new normal and the old way be behind an opt-in flag or the other way around? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I can look into that, but before we invest engineering effort, is there any demand for that? The current behavior is causing massive memory consumption (hundreds of MBs), causing live site incidents due to pods OOM'ing (we actually had users opening support tickets with Azure too... and seems Diagrid heard users having issues too). It is also costing users a lot of money when metrics are ingested in analytics tools. To the point it may even be considered a bug? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Per conversation on Discord, we can move forward on this PR without the flag to add the old behavior back as opt-in, as there's already enough "substance" here. When this is merged I'll open an issue to discuss that need (see if folks from the community have opinions) and also how that can be implemented (e.g. should it be a permanent option in the Configuration CRD, a feature flag, or...?). That issue will be completed before 1.13 (as a release blocker, but we have lots of time) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ItalyPaleAle is there an issue open to track the addition of flag to include the original granular metrics. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. any idea where this issue or flag is implemented? it is mentioned as a release blocker and operation teams might want to keep using their old dashboards/alerts that rely on these labels |
||
startRequest := time.Now() | ||
|
||
var resp *http.Response | ||
|
@@ -270,17 +270,17 @@ | |
} | ||
|
||
if err != nil { | ||
diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, channelReq.Method, req.Message().GetMethod(), strconv.Itoa(http.StatusInternalServerError), contentLength, elapsedMs) | ||
diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, strconv.Itoa(http.StatusInternalServerError), contentLength, elapsedMs) | ||
return nil, err | ||
} | ||
|
||
rsp, err := h.parseChannelResponse(req, resp) | ||
if err != nil { | ||
diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, channelReq.Method, req.Message().GetMethod(), strconv.Itoa(http.StatusInternalServerError), contentLength, elapsedMs) | ||
diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, strconv.Itoa(http.StatusInternalServerError), contentLength, elapsedMs) | ||
return nil, err | ||
} | ||
|
||
diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, channelReq.Method, req.Message().GetMethod(), strconv.Itoa(int(rsp.Status().Code)), contentLength, elapsedMs) | ||
diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, strconv.Itoa(int(rsp.Status().Code)), contentLength, elapsedMs) | ||
|
||
return rsp, nil | ||
} | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,68 @@ | ||
/* | ||
Copyright 2023 The Dapr Authors | ||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
http://www.apache.org/licenses/LICENSE-2.0 | ||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
||
package consts | ||
|
||
import ( | ||
semconv "go.opentelemetry.io/otel/semconv/v1.10.0" | ||
) | ||
|
||
const ( | ||
// DaprInternalSpanAttrPrefix is the internal span attribution prefix. | ||
// Middleware will not populate it if the span key starts with this prefix. | ||
DaprInternalSpanAttrPrefix = "__dapr." | ||
// DaprAPISpanNameInternal is the internal attribution, but not populated to span attribution. | ||
DaprAPISpanNameInternal = DaprInternalSpanAttrPrefix + "spanname" | ||
|
||
// Span attribute keys | ||
// Reference trace semantics https://github.com/open-telemetry/opentelemetry-specification/tree/master/specification/trace/semantic_conventions | ||
DBSystemSpanAttributeKey = string(semconv.DBSystemKey) | ||
DBNameSpanAttributeKey = string(semconv.DBNameKey) | ||
DBStatementSpanAttributeKey = string(semconv.DBStatementKey) | ||
DBConnectionStringSpanAttributeKey = string(semconv.DBConnectionStringKey) | ||
MessagingSystemSpanAttributeKey = string(semconv.MessagingSystemKey) | ||
MessagingDestinationSpanAttributeKey = string(semconv.MessagingDestinationKey) | ||
MessagingDestinationKindSpanAttributeKey = string(semconv.MessagingDestinationKindKey) | ||
GrpcServiceSpanAttributeKey = string(semconv.RPCServiceKey) | ||
NetPeerNameSpanAttributeKey = string(semconv.NetPeerNameKey) | ||
|
||
DaprAPISpanAttributeKey = "dapr.api" | ||
yaron2 marked this conversation as resolved.
Show resolved
Hide resolved
|
||
DaprAPIStatusCodeSpanAttributeKey = "dapr.status_code" | ||
DaprAPIProtocolSpanAttributeKey = "dapr.protocol" | ||
DaprAPIInvokeMethod = "dapr.invoke_method" | ||
DaprAPIActorTypeID = "dapr.actor" | ||
|
||
DaprAPIHTTPSpanAttrValue = "http" | ||
DaprAPIGRPCSpanAttrValue = "grpc" | ||
|
||
StateBuildingBlockType = "state" | ||
SecretBuildingBlockType = "secrets" | ||
BindingBuildingBlockType = "bindings" | ||
PubsubBuildingBlockType = "pubsub" | ||
|
||
DaprGRPCServiceInvocationService = "ServiceInvocation" | ||
DaprGRPCDaprService = "Dapr" | ||
|
||
// Keys used in the context's metadata for streaming calls | ||
// Note: these keys must always be all-lowercase | ||
DaprCallLocalStreamMethodKey = "__dapr_calllocalstream_method" | ||
) | ||
|
||
// MessagingDestinationTopicKind is effectively const, but isn't a const from upstream. | ||
var MessagingDestinationTopicKind = semconv.MessagingDestinationKindTopic.Value.AsString() | ||
|
||
// GrpcAppendSpanAttributesFn is the interface that applies to gRPC requests that add span attributes. | ||
type GrpcAppendSpanAttributesFn interface { | ||
// AppendSpanAttributes appends attributes to the map used for the span in tracing for the gRPC method. | ||
AppendSpanAttributes(rpcMethod string, m map[string]string) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that on the wire it might be the same. But this will require users that use the proto directly to have a breaking change. And must be communicated as such.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still something that users might use. Again, my point being here is that it must be communicated as a breaking change. We can expand on it saying that running apps will not be affected, just struct change as of now for new proto imports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a breaking change, however this will be breaking at compilation time and not at runtime. I recommend we announce the breaking change with no need to create a v2 of this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fair assessment.
To recap, and to be fully clear, from a practical effect this change is similar to what happened with #6945 (landing in 1.13 too) which removed a method from the gRPC proto (of course, that was done for different reasons).