Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce cardinality in Dapr metrics and add more information to API logs #6919

Merged
merged 14 commits into from
Oct 24, 2023
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
18 changes: 13 additions & 5 deletions dapr/proto/runtime/v1/dapr.proto
Original file line number Diff line number Diff line change
Expand Up @@ -119,7 +119,7 @@ service Dapr {
rpc DecryptAlpha1(stream DecryptRequest) returns (stream DecryptResponse);

// Gets metadata of the sidecar
rpc GetMetadata (google.protobuf.Empty) returns (GetMetadataResponse) {}
rpc GetMetadata (GetMetadataRequest) returns (GetMetadataResponse) {}

// Sets value in extended metadata of the sidecar
rpc SetMetadata (SetMetadataRequest) returns (google.protobuf.Empty) {}
Expand Down Expand Up @@ -187,7 +187,7 @@ service Dapr {
// Raise an event to a running workflow instance
rpc RaiseEventWorkflowBeta1 (RaiseEventWorkflowRequest) returns (google.protobuf.Empty) {}
// Shutdown the sidecar
rpc Shutdown (google.protobuf.Empty) returns (google.protobuf.Empty) {}
rpc Shutdown (ShutdownRequest) returns (google.protobuf.Empty) {}
}

// InvokeServiceRequest represents the request message for Service invocation.
Expand Down Expand Up @@ -404,15 +404,13 @@ message BulkPublishResponse {

// BulkPublishResponseFailedEntry is the message containing the entryID and error of a failed event in BulkPublishEvent call
message BulkPublishResponseFailedEntry {

// The response scoped unique ID referring to this message
string entry_id = 1;

// The error message if any on failure
string error = 2;
}


// InvokeBindingRequest is the message to send data to output bindings
message InvokeBindingRequest {
// The name of the output binding to invoke.
Expand Down Expand Up @@ -589,7 +587,12 @@ message InvokeActorResponse {
bytes data = 1;
}

// GetMetadataResponse is a message that is returned on GetMetadata rpc call
// GetMetadataRequest is the message for the GetMetadata request.
message GetMetadataRequest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ItalyPaleAle : Reviewers: this needed to have a message ending in "Request". This change is backwards compatible in the wire (i.e. v1.11 SDKs can talk to this without issues), but it may require a change in the name in the SDK if the protos are updated.
#6729 (comment)

I agree that on the wire it might be the same. But this will require users that use the proto directly to have a breaking change. And must be communicated as such.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but:

  1. This will not break apps that are currently running, which is the most important thing. Users may need to change a struct name if they import the protos directly and are updating the protos.
  2. Using the gRPC protos directly is not an officially supported scenario.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the gRPC protos directly is not an officially supported scenario.

Still something that users might use. Again, my point being here is that it must be communicated as a breaking change. We can expand on it saying that running apps will not be affected, just struct change as of now for new proto imports.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change, however this will be breaking at compilation time and not at runtime. I recommend we announce the breaking change with no need to create a v2 of this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change, however this will be breaking at compilation time and not at runtime. I recommend we announce the breaking change with no need to create a v2 of this method.

This is a fair assessment.

To recap, and to be fully clear, from a practical effect this change is similar to what happened with #6945 (landing in 1.13 too) which removed a method from the gRPC proto (of course, that was done for different reasons).

  • No impact on apps currently running
  • Dapr SDKs maintainers will need to rename the structs/objects when they upgrade the protos
  • Users of the SDKs should not see any direct impact.
    • AFAIK the SDKs always abstract the gRPC objects, so users who just depend on Dapr SDKs shouldn't need any code change if/when they upgrade the SDKs.
    • At runtime, old SDKs will continue to work alongside new SDKs.
  • For users who import the protos directly:
    • No change needed if they are using existing protos
    • If they upgrade the protos, they will need to rename the structs/objects

// Empty
}

// GetMetadataResponse is a message that is returned on GetMetadata rpc call.
message GetMetadataResponse {
string id = 1;
repeated ActiveActorsCount active_actors_count = 2 [json_name = "actors"];
Expand Down Expand Up @@ -1077,3 +1080,8 @@ message PurgeWorkflowRequest {
// Name of the workflow component.
string workflow_component = 2 [json_name = "workflowComponent"];
}

// ShutdownRequest is the request for Shutdown.
message ShutdownRequest {
// Empty
}
12 changes: 6 additions & 6 deletions pkg/acl/acl.go
Original file line number Diff line number Diff line change
Expand Up @@ -172,7 +172,7 @@ func ApplyAccessControlPolicies(ctx context.Context, operation string, httpVerb
}

action, actionPolicy := isOperationAllowedByAccessControlPolicy(spiffeID, operation, httpVerb, isHTTP, acl)
emitACLMetrics(spiffeID, actionPolicy, operation, httpVerb.String(), action)
emitACLMetrics(spiffeID, actionPolicy, action)

if !action {
errMessage = fmt.Sprintf("access control policy has denied access to id: %s operation: %s verb: %s", spiffeID.URL(), operation, httpVerb)
Expand All @@ -182,20 +182,20 @@ func ApplyAccessControlPolicies(ctx context.Context, operation string, httpVerb
return action, errMessage
}

func emitACLMetrics(spiffeID *spiffe.Parsed, actionPolicy, operation, verb string, action bool) {
func emitACLMetrics(spiffeID *spiffe.Parsed, actionPolicy string, action bool) {
if action {
switch actionPolicy {
case config.ActionPolicyApp:
diag.DefaultMonitoring.RequestAllowedByAppAction(spiffeID, operation, verb, action)
diag.DefaultMonitoring.RequestAllowedByAppAction(spiffeID)
case config.ActionPolicyGlobal:
diag.DefaultMonitoring.RequestAllowedByGlobalAction(spiffeID, operation, verb, action)
diag.DefaultMonitoring.RequestAllowedByGlobalAction(spiffeID)
}
} else {
switch actionPolicy {
case config.ActionPolicyApp:
diag.DefaultMonitoring.RequestBlockedByAppAction(spiffeID, operation, verb, action)
diag.DefaultMonitoring.RequestBlockedByAppAction(spiffeID)
case config.ActionPolicyGlobal:
diag.DefaultMonitoring.RequestBlockedByGlobalAction(spiffeID, operation, verb, action)
diag.DefaultMonitoring.RequestBlockedByGlobalAction(spiffeID)
}
}
}
Expand Down
8 changes: 4 additions & 4 deletions pkg/channel/http/http_channel.go
Original file line number Diff line number Diff line change
Expand Up @@ -228,7 +228,7 @@ func (h *Channel) invokeMethodV1(ctx context.Context, req *invokev1.InvokeMethod
}()

// Emit metric when request is sent
diag.DefaultHTTPMonitoring.ClientRequestStarted(ctx, channelReq.Method, req.Message().Method, int64(len(req.Message().Data.GetValue())))
diag.DefaultHTTPMonitoring.ClientRequestStarted(ctx, int64(len(req.Message().Data.GetValue())))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are breaking changes since it removes information others might already be depending on for their dashboards and alerts. Is there a way to make this configurable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remain consistent with industry standard golden metrics - meaning, latency and status codes should be reported on response metrics. Size I think we can remove safely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latency and status codes are included. Size is included when available, but if it can be omitted it would allow for a lot of simplifications.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are latency and status codes included by default? I want to clarify due to this:

including things that are not going to be included in metrics by default anymore:
* Response status code
* Response size (for HTTP only & if possible)
* Latency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote that a while ago so I had to go double-check :)

Yes, they are still included (size is included only if available):

https://github.com/ItalyPaleAle/dapr/blob/38b09613667e88e5fe84f48f9f0188b2ce090219/pkg/diagnostics/http_monitoring.go#L208-L240

What I think I meant with that is that they are not included anymore with the same granularity. Users can rely on the API logs to get more specific data points, such as per-endpoint (rather than just per target app)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to include back the original granular metrics if needed? I.e. can either this change be the new normal and the old way be behind an opt-in flag or the other way around?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can look into that, but before we invest engineering effort, is there any demand for that? The current behavior is causing massive memory consumption (hundreds of MBs), causing live site incidents due to pods OOM'ing (we actually had users opening support tickets with Azure too... and seems Diagrid heard users having issues too). It is also costing users a lot of money when metrics are ingested in analytics tools. To the point it may even be considered a bug?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per conversation on Discord, we can move forward on this PR without the flag to add the old behavior back as opt-in, as there's already enough "substance" here. When this is merged I'll open an issue to discuss that need (see if folks from the community have opinions) and also how that can be implemented (e.g. should it be a permanent option in the Configuration CRD, a feature flag, or...?). That issue will be completed before 1.13 (as a release blocker, but we have lots of time)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ItalyPaleAle is there an issue open to track the addition of flag to include the original granular metrics.

Copy link
Contributor

@filintod filintod Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any idea where this issue or flag is implemented? it is mentioned as a release blocker and operation teams might want to keep using their old dashboards/alerts that rely on these labels

startRequest := time.Now()

var resp *http.Response
Expand Down Expand Up @@ -270,17 +270,17 @@ func (h *Channel) invokeMethodV1(ctx context.Context, req *invokev1.InvokeMethod
}

if err != nil {
diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, channelReq.Method, req.Message().GetMethod(), strconv.Itoa(http.StatusInternalServerError), contentLength, elapsedMs)
diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, strconv.Itoa(http.StatusInternalServerError), contentLength, elapsedMs)
return nil, err
}

rsp, err := h.parseChannelResponse(req, resp)
if err != nil {
diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, channelReq.Method, req.Message().GetMethod(), strconv.Itoa(http.StatusInternalServerError), contentLength, elapsedMs)
diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, strconv.Itoa(http.StatusInternalServerError), contentLength, elapsedMs)
return nil, err
}

diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, channelReq.Method, req.Message().GetMethod(), strconv.Itoa(int(rsp.Status().Code)), contentLength, elapsedMs)
diag.DefaultHTTPMonitoring.ClientRequestCompleted(ctx, strconv.Itoa(int(rsp.Status().Code)), contentLength, elapsedMs)

return rsp, nil
}
Expand Down
14 changes: 7 additions & 7 deletions pkg/config/configuration.go
Original file line number Diff line number Diff line change
Expand Up @@ -156,18 +156,18 @@ const (
APIAccessRuleProtocolGRPC APIAccessRuleProtocol = "grpc"
)

// GetRulesByProtocol returns a list of APIAccessRule objects filtered by protocol
func (r APIAccessRules) GetRulesByProtocol(protocol APIAccessRuleProtocol) []APIAccessRule {
res := make([]APIAccessRule, len(r))
n := 0
// GetRulesByProtocol returns a list of APIAccessRule objects for a protocol
// The result is a map where the key is in the format "<version>/<endpoint>"
func (r APIAccessRules) GetRulesByProtocol(protocol APIAccessRuleProtocol) map[string]struct{} {
res := make(map[string]struct{}, len(r))
for _, v := range r {
//nolint:gocritic
if strings.ToLower(string(v.Protocol)) == string(protocol) {
res[n] = v
n++
key := v.Version + "/" + v.Name
res[key] = struct{}{}
ItalyPaleAle marked this conversation as resolved.
Show resolved Hide resolved
}
}
return res[:n]
return res
}

type HandlerSpec struct {
Expand Down
24 changes: 24 additions & 0 deletions pkg/config/configuration_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ import (

"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"golang.org/x/exp/maps"

"github.com/dapr/dapr/pkg/buildinfo"
"github.com/dapr/kit/ptr"
Expand Down Expand Up @@ -481,6 +482,29 @@ func TestSetTracingSpecFromEnv(t *testing.T) {
require.True(t, conf.Spec.TracingSpec.Otel.GetIsSecure())
}

func TestAPIAccessRules(t *testing.T) {
config := &Configuration{
Spec: ConfigurationSpec{
APISpec: &APISpec{
Allowed: APIAccessRules{
APIAccessRule{Name: "foo", Version: "v1", Protocol: "http"},
APIAccessRule{Name: "MyMethod", Version: "v1alpha1", Protocol: "grpc"},
},
Denied: APIAccessRules{
APIAccessRule{Name: "bar", Version: "v1", Protocol: "http"},
},
},
},
}

apiSpec := config.Spec.APISpec

assert.Equal(t, []string{"v1/foo"}, maps.Keys(apiSpec.Allowed.GetRulesByProtocol(APIAccessRuleProtocolHTTP)))
assert.Equal(t, []string{"v1alpha1/MyMethod"}, maps.Keys(apiSpec.Allowed.GetRulesByProtocol(APIAccessRuleProtocolGRPC)))
assert.Equal(t, []string{"v1/bar"}, maps.Keys(apiSpec.Denied.GetRulesByProtocol(APIAccessRuleProtocolHTTP)))
assert.Empty(t, maps.Keys(apiSpec.Denied.GetRulesByProtocol(APIAccessRuleProtocolGRPC)))
}

func TestSortMetrics(t *testing.T) {
t.Run("metrics overrides metric - enabled false", func(t *testing.T) {
config := &Configuration{
Expand Down
68 changes: 68 additions & 0 deletions pkg/diagnostics/consts/consts.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
/*
Copyright 2023 The Dapr Authors
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package consts

import (
semconv "go.opentelemetry.io/otel/semconv/v1.10.0"
)

const (
// DaprInternalSpanAttrPrefix is the internal span attribution prefix.
// Middleware will not populate it if the span key starts with this prefix.
DaprInternalSpanAttrPrefix = "__dapr."
// DaprAPISpanNameInternal is the internal attribution, but not populated to span attribution.
DaprAPISpanNameInternal = DaprInternalSpanAttrPrefix + "spanname"

// Span attribute keys
// Reference trace semantics https://github.com/open-telemetry/opentelemetry-specification/tree/master/specification/trace/semantic_conventions
DBSystemSpanAttributeKey = string(semconv.DBSystemKey)
DBNameSpanAttributeKey = string(semconv.DBNameKey)
DBStatementSpanAttributeKey = string(semconv.DBStatementKey)
DBConnectionStringSpanAttributeKey = string(semconv.DBConnectionStringKey)
MessagingSystemSpanAttributeKey = string(semconv.MessagingSystemKey)
MessagingDestinationSpanAttributeKey = string(semconv.MessagingDestinationKey)
MessagingDestinationKindSpanAttributeKey = string(semconv.MessagingDestinationKindKey)
GrpcServiceSpanAttributeKey = string(semconv.RPCServiceKey)
NetPeerNameSpanAttributeKey = string(semconv.NetPeerNameKey)

DaprAPISpanAttributeKey = "dapr.api"
yaron2 marked this conversation as resolved.
Show resolved Hide resolved
DaprAPIStatusCodeSpanAttributeKey = "dapr.status_code"
DaprAPIProtocolSpanAttributeKey = "dapr.protocol"
DaprAPIInvokeMethod = "dapr.invoke_method"
DaprAPIActorTypeID = "dapr.actor"

DaprAPIHTTPSpanAttrValue = "http"
DaprAPIGRPCSpanAttrValue = "grpc"

StateBuildingBlockType = "state"
SecretBuildingBlockType = "secrets"
BindingBuildingBlockType = "bindings"
PubsubBuildingBlockType = "pubsub"

DaprGRPCServiceInvocationService = "ServiceInvocation"
DaprGRPCDaprService = "Dapr"

// Keys used in the context's metadata for streaming calls
// Note: these keys must always be all-lowercase
DaprCallLocalStreamMethodKey = "__dapr_calllocalstream_method"
)

// MessagingDestinationTopicKind is effectively const, but isn't a const from upstream.
var MessagingDestinationTopicKind = semconv.MessagingDestinationKindTopic.Value.AsString()

// GrpcAppendSpanAttributesFn is the interface that applies to gRPC requests that add span attributes.
type GrpcAppendSpanAttributesFn interface {
// AppendSpanAttributes appends attributes to the map used for the span in tracing for the gRPC method.
AppendSpanAttributes(rpcMethod string, m map[string]string)
}