
OpenTelemetry support #356

Closed
saschagrunert opened this issue Nov 1, 2022 · 30 comments
Labels
kind/feature, lifecycle/rotten

Comments

@saschagrunert
Member

/kind feature

Describe the solution you'd like

Logs are helpful data attached to OpenTelemetry traces. klog could support that by adding log messages to the current tracing span if one exists.

Anything else you would like to add:

There is an existing k/k KEP for the kubelet tracing part: kubernetes/enhancements#2831

Refers to kubernetes/kubernetes#113414
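
As a rough sketch of that idea (assuming the OpenTelemetry Go API from go.opentelemetry.io/otel; this is not an existing klog feature), a log message could be mirrored onto the current span as a span event:

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// logToSpan is a hypothetical helper: it attaches a log message to the
// span stored in ctx, if there is a recording span at all.
func logToSpan(ctx context.Context, msg string) {
	span := trace.SpanFromContext(ctx)
	if !span.IsRecording() {
		return // no sampled span in this context, nothing to do
	}
	span.AddEvent("log", trace.WithAttributes(attribute.String("message", msg)))
}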

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 1, 2022
@pohly

pohly commented Nov 1, 2022

Is this meant to be used together with JSON or text output or instead of it?

@saschagrunert
Member Author

Is this meant to be used together with JSON or text output or instead of it?

I'd assume that it works in parallel with them. For example, if tracing is enabled, we pass the text message to the tracer alongside the other output(s).

@pohly

pohly commented Nov 1, 2022

Makes sense, in particular when the tracer doesn't record all log output.

An implementation of this will depend on contextual logging because non-contextual calls (klog.Info, klog.InfoS) have no way of determining the call context and thus the current tracing span.

But contextual logging alone is not enough because those calls are also not passed a context. Whoever adds a span to the context would also have to install a logger for it, which would then get picked up by contextual logging calls in that call chain. Since https://github.com/sallyom/enhancements-1/blob/master/keps/sig-instrumentation/2831-kubelet-tracing/README.md describes how the code gets instrumented, it should probably be extended to also cover the logging aspect.
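
As a sketch of that pattern (assuming the OpenTelemetry Go tracer and the current klog contextual API; this is not code from the KEP), the instrumentation that starts a span could install an enriched logger in the same context, so that downstream contextual logging calls pick it up:

import (
	"context"

	"go.opentelemetry.io/otel"
	"k8s.io/klog/v2"
)

// startSpanWithLogger is a hypothetical helper that combines both steps.
// The tracer name "kubelet" is just an example.
func startSpanWithLogger(ctx context.Context, op string) (context.Context, func()) {
	ctx, span := otel.Tracer("kubelet").Start(ctx, op)
	sc := span.SpanContext()
	logger := klog.LoggerWithValues(klog.FromContext(ctx),
		"trace_id", sc.TraceID().String(), "span_id", sc.SpanID().String())
	ctx = klog.NewContext(ctx, logger)
	return ctx, func() { span.End() }
}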

@saschagrunert
Member Author

Makes sense, in particular when the tracer doesn't record all log output.

An implementation of this will depend on contextual logging because non-contextual calls (klog.Info, klog.InfoS) have no way of determining the call context and thus the current tracing span.

It depends on how we make it usable. If we have a global tracer set, then we could create spans from that as well.

@dashpole

dashpole commented Nov 1, 2022

klog could support that by adding log messages to the current tracing span if one exists.

An alternative, which I would prefer, would be to attach the trace ID and span ID to logs as part of the structured output (as trace_id and span_id), which can then be used to link to/from traces.

It depends on how we make it usable. If we have a global tracer set, then we could create spans from that as well.

You still need a context if you want to link to other spans. The global TracerProvider just lets you export spans from anywhere.

Thus far, we (sig-instrumentation) haven't touched on the subject of log or metric correlation, but we should probably extend our contributor guidelines to add tracing guidelines plus log and metric correlation guidelines for other components. @dgrisonnet @logicalhan

@pohly

pohly commented Nov 1, 2022

An alternative, which I would prefer, would be to attach the trace ID and span ID to logs as part of the structured output (as trace_id and span_id), which can then be used to link to/from traces.

So no logging through OpenTelemetry?

That already works. The code that adds trace and span ID would then have to look like this:

traceID, spanID := <generate IDs>
ctx = context.WithValue(ctx, ... trace ID...)
ctx = context.WithValue(ctx, ... span ID...)
logger := klog.FromContext(ctx)
logger = klog.LoggerWithValues(logger, "trace_id", traceID, "span_id", spanID /* Kubernetes guidelines are to not use underscores in keys, but https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/logs#trace-context-in-legacy-formats explicitly asks for it */)
ctx = klog.NewContext(ctx, logger)

@dashpole

dashpole commented Nov 1, 2022

So no logging through OpenTelemetry?

OpenTelemetry is working on a set of logging libraries, but they are still a ways off (https://opentelemetry.io/docs/instrumentation/go/#status-and-releases). We could discuss adopting them, but I'm not sure what the benefits would be.

I'm just saying we shouldn't add all logs as span events. Logging is its own thing and has well-defined ways to correlate with tracing, which we should probably follow.

What I really want is for logger := klog.FromContext(ctx) to always give me a logger with trace ID and span ID attached, if they exist in the context.
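
A minimal sketch of what that lookup could do, written here as a local helper rather than a change to klog itself (it assumes OpenTelemetry's exported accessor trace.SpanContextFromContext):

import (
	"context"

	"go.opentelemetry.io/otel/trace"
	"k8s.io/klog/v2"
)

// loggerFromContext is a hypothetical wrapper around klog.FromContext.
func loggerFromContext(ctx context.Context) klog.Logger {
	logger := klog.FromContext(ctx)
	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
		logger = logger.WithValues(
			"trace_id", sc.TraceID().String(),
			"span_id", sc.SpanID().String())
	}
	return logger
}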

@pohly

pohly commented Nov 1, 2022

What I really want is for logger := klog.FromContext(ctx) to always give me a logger with trace ID and span ID attached, if they exist in the context.

There are two ways of doing this:

  • teach klog.FromContext about trace and span IDs
  • teach the code which adds these IDs how to add a logger with the values to the context (my proposal above)

The problems with extending klog.FromContext are:

  • Some code might use logr.FromContext directly because it has no klog dependency. Then it will not log the IDs.
  • A solution that works for arbitrary loggers would make retrieving a logger from the context more expensive because it must search for IDs and modify the logger.
  • A special solution for the two loggers typically used by Kubernetes could defer that work by just remembering the context and looking up the IDs if (and only if) needed.

The future contextual logging API in Golang will add some more options, but that is just being implemented and will be experimental. I am not sure whether we want to wait for it. It will support adding the current context to the logger when retrieving it.
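
For reference, here is what that pattern looks like with the API that later shipped as log/slog in Go 1.21 (a sketch only, again assuming OpenTelemetry's exported accessor): the handler receives the context of every log call and can pull the span information out of it.

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

// traceHandler is a hypothetical slog.Handler wrapper that adds
// trace_id/span_id from the per-call context.
type traceHandler struct{ inner slog.Handler }

func (h traceHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.inner.Enabled(ctx, l)
}

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
		r.AddAttrs(
			slog.String("trace_id", sc.TraceID().String()),
			slog.String("span_id", sc.SpanID().String()))
	}
	return h.inner.Handle(ctx, r)
}

func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{h.inner.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{h.inner.WithGroup(name)}
}

// Usage: slog.New(traceHandler{slog.NewJSONHandler(os.Stdout, nil)}).InfoContext(ctx, "something")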

@pohly

pohly commented Nov 1, 2022

We also have to consider whether it is acceptable to make klog depend on OpenTelemetry. If any code in klog is meant to retrieve these IDs, it will have to have such a dependency, right?

@pohly

pohly commented Nov 2, 2022

It should be possible to just configure klog to extract certain values by giving it the keys. Then it won't depend on OpenTelemetry. It may even be possible to move this all into go-logr and do it so that it works for all logger backends - let me try that out...

@dashpole

dashpole commented Nov 2, 2022

As a note, OTel's interfaces, including context, are kept separate from the rest of OTel's codebase. As a result, it has nearly no dependencies. But it is still a consideration.

@sallyom

sallyom commented Nov 10, 2022

What I really want is for logger := klog.FromContext(ctx) to always give me a logger with trace ID and span ID attached, if they exist in the context.

I agree that adding span ID and trace ID to logs would be useful.
This is the example I've seen:

// assuming logrus and the OpenTelemetry trace package:
// import (
// 	log "github.com/sirupsen/logrus"
// 	oteltrace "go.opentelemetry.io/otel/trace"
// )

func logWithContext(span oteltrace.Span) log.Fields {
	return log.Fields{
		"span_id":  span.SpanContext().SpanID().String(),
		"trace_id": span.SpanContext().TraceID().String(),
	}
}

log.WithFields(logWithContext(span)).WithField("response", msg).Info("Counter")

@pohly

pohly commented Nov 17, 2022

It should be possible to just configure klog to extract certain values by giving it the keys.

I've implemented that in PR #358. It is fairly complex because wrapping a LogSink (as done in that PR) is tricky.

I'll try next what it would look like if logr.Logger itself did the extra work. Then wrapping a LogSink will not be necessary and it will also work for code which just does logr.FromContext (assuming that there is a logger installed which was configured to log from the context).

@pohly

pohly commented Nov 18, 2022

I think the best solution is with this changed in logr: go-logr/logr#158

Once that is merged, #359 can also be merged.

@dashpole

As a note, those won't work well with OpenTelemetry. The context keys for span context and for baggage are not exported, so we would need to support accessor functions or something similar to be able to make use of context-based functionality.

@pohly

pohly commented Nov 21, 2022

That's surprising. I could have sworn that OpenTelemetry was mentioned as a use case for logging from context in the upstream Go discussion around a future builtin contextual logging API. But perhaps it was just tracing in general and people will hit this snag later when they really try to use it (like we do now).

So would it be possible to get access to these context keys? If not, then we are stuck with the solution from #356 (comment), which is more expensive.

@pohly

pohly commented Nov 21, 2022

One possible alternative would be to allow passing a func(ctx context.Context) interface{} to the logger to retrieve some value (or values - it could be a struct, slice or map) instead of letting the logger directly retrieve the value. For OpenTelemetry, that function then could call SpanFromContext.

But what is it that should get logged for such a span? #356 (comment) mentions SpanID and TraceID, but I don't see such methods for the SpanContext, only some generic accessor for key/value string pairs.

Should all of those pairs get logged? Using which keys at the level of a log entry? We must be relatively certain that there is no accidental overlap with keys passed to WithValues or Info/Error.
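
A sketch of that alternative (a hypothetical hook, not an API that logr or klog provide today): the logger backend would be configured with a callback and invoke it once per log entry.

import (
	"context"

	"go.opentelemetry.io/otel/trace"
)

// ContextValuesFunc is the hypothetical hook: it turns a context into
// extra key/value pairs for a log entry, or nil.
type ContextValuesFunc func(ctx context.Context) []interface{}

// otelValues is what such a callback could look like for OpenTelemetry.
func otelValues(ctx context.Context) []interface{} {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		return nil
	}
	return []interface{}{
		"trace_id", sc.TraceID().String(),
		"span_id", sc.SpanID().String(),
	}
}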

@dashpole

dashpole commented Nov 21, 2022

I think it was a general discussion.

It might've been a reference to how OpenCensus works, which is that all attributes ("tags" in OC) on spans/metrics are from context. Example.

OpenTelemetry's SpanContext provides accessors for TraceID and SpanID (OpenTracing's SpanContext, which is what you linked, does not appear to provide explicit accessors). I DO NOT think we should add all SpanContext-based k/v pairs from either TraceState (w3c tracestate) or Baggage (w3c baggage). It would be nice if there was a way to opt in values from baggage, though.
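
For illustration (a sketch only; the baggage key "user_id" is a made-up example), the explicit accessors plus an opt-in for a single baggage member could look like this:

import (
	"context"

	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/trace"
)

// correlationFields extracts only the explicitly chosen values.
func correlationFields(ctx context.Context) map[string]string {
	fields := map[string]string{}
	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
		fields["trace_id"] = sc.TraceID().String()
		fields["span_id"] = sc.SpanID().String()
	}
	// Opt in to a single baggage member instead of dumping all of them.
	if m := baggage.FromContext(ctx).Member("user_id"); m.Value() != "" {
		fields["user_id"] = m.Value()
	}
	return fields
}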

@pohly

pohly commented Nov 21, 2022

So if we log span ID and trace ID, what should that look like?

{"msg": "something", "tracing": {"spanID": "xyz", "traceID": "abc"}}

Or

{"msg": "something", "tracing.spanID": "xyz", "tracing.traceID": "abc"}

@dashpole

It should look like:

{"msg": "something", "trace_id": "xyz", "span_id": "abc"}

See https://github.com/open-telemetry/opentelemetry-specification/tree/a87fb8c5a60f942d08c67c1582034eb8b4322009/specification/logs#trace-context-in-legacy-formats

@thockin
Member

thockin commented Nov 21, 2022

One possible alternative would be to allow passing a func(ctx context.Context) interface{} to the logger to retrieve some value

I want to be sensitive to performance, too - having to call an arbitrary number of functions once is not a big deal, but doing that on every log call seems unfortunate, unless we're clear about how we can cache those.

@pohly

pohly commented Nov 22, 2022

Logging as "trace_id": "xyz", "span_id": "abc" in combination with a single function for lookup would be a good fit for something like PseudoStruct as discussed over in #357. We could do two functions, but I suspect that this is going to be less efficient and more complicated (need to supply the key names separately, multiple calls instead of one).

Calling ctx.Value through some function is adding some more overhead, but the baseline (ctx.Value itself, formatting of the log entry) is already pretty big, so relative to that it's probably not that much more.

@thockin I saw that you pre-render key/value pairs in structr's WithValues call. That makes sense if there are then multiple log calls that use that cached information, but it's probably worse for just a single call (need to store the cache somewhere) and definitely hurts when there is no log call at all.

When instrumenting the scheduler with contextual logging, a big concern was the slowdown at low log levels because those are typically used when performance is crucial. But I haven't checked how often there would have been a WithValues call and no log call - I'll probably have to come back to it because the PR stalled.

Perhaps some on-demand caching could give us the best of both worlds (only do expensive work if needed, but then only once). It's just not easy to implement because such a cache would need to be shared between different Logger or LogSink instances.
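
A rough sketch of such on-demand caching (just the cache itself, not the Logger/LogSink plumbing; sharing a pointer to it between derived loggers is the hard part mentioned above):

import (
	"context"
	"sync"

	"go.opentelemetry.io/otel/trace"
)

// lazyTraceValues computes the trace key/value pairs at most once,
// and only when a log entry is actually emitted.
type lazyTraceValues struct {
	ctx  context.Context
	once sync.Once
	kvs  []interface{}
}

func (l *lazyTraceValues) get() []interface{} {
	l.once.Do(func() {
		if sc := trace.SpanContextFromContext(l.ctx); sc.IsValid() {
			l.kvs = []interface{}{
				"trace_id", sc.TraceID().String(),
				"span_id", sc.SpanID().String(),
			}
		}
	})
	return l.kvs
}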

@pohly

pohly commented Nov 23, 2022

I have implemented this in the same PRs which also implement #357.

With that modified Kubernetes and a local-up cluster, I get:

$ grep "Immediate dispatch" /tmp/kube-apiserver.log 
I1124 08:44:15.439987   46581 queueset.go:716] "Immediate dispatch" trace_id="2f53c6cc100aaa44d8d9303fcc75f3a1" span_id="3d99c8f69661e4f1" QS="catch-all" t="2022-11-24 08:44:15.439803964" R="0.00000000ss" request="catch-all" description1=&{IsResourceRequest:true Path:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/127.0.0.1 Verb:get APIPrefix:apis APIGroup:coordination.k8s.io APIVersion:v1 Namespace:kube-node-lease Resource:leases Subresource: Name:127.0.0.1 Parts:[leases 127.0.0.1]} description2=&{Name:system:node:127.0.0.1 UID: Groups:[system:nodes system:authenticated] Extra:map[]} qsExecuting=1
$ cat /nvme/gopath/src/k8s.io/apiserver-tracing.conf
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
# default value
#endpoint: localhost:4317
samplingRatePerMillion: 1000000

$ APISERVER_TRACING_CONFIG_FILE=/nvme/gopath/src/k8s.io/apiserver-tracing.conf FEATURE_GATES=APIServerTracing=true,ContextualLogging=true KUBELET_RESOLV_CONF="/etc/resolv-9999.conf" DNS_ADDON="coredns" CGROUP_DRIVER=systemd CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock LOG_LEVEL=6 ENABLE_CSI_SNAPSHOTTER=false API_SECURE_PORT=6444 ALLOW_PRIVILEGED=1 PATH=/nvme/gopath/src/k8s.io/kubernetes/third_party/etcd:$PATH ./hack/local-up-cluster.sh -O

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 24, 2023
@saschagrunert
Member Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 24, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 22, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) Feb 17, 2024
@k8s-ci-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
