
OpenTelemetry support #356

Closed
saschagrunert opened this issue Nov 1, 2022 · 30 comments
Labels
kind/feature, lifecycle/rotten

Comments

@saschagrunert
Member

/kind feature

Describe the solution you'd like

Logs are helpful data attached to OpenTelemetry traces. klog could support that by adding log messages to the current tracing span if one exists.

Anything else you would like to add:

There is an existing k/k KEP for the kubelet tracing part: kubernetes/enhancements#2831

Refers to kubernetes/kubernetes#113414
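
As a rough sketch of that idea (assuming the OpenTelemetry Go API from go.opentelemetry.io/otel; this is not an existing klog feature), a log message could be mirrored onto the current span as a span event:

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// logToSpan is a hypothetical helper: it attaches a log message to the
// span stored in ctx, if there is a recording span at all.
func logToSpan(ctx context.Context, msg string) {
	span := trace.SpanFromContext(ctx)
	if !span.IsRecording() {
		return // no sampled span in this context, nothing to do
	}
	span.AddEvent("log", trace.WithAttributes(attribute.String("message", msg)))
}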

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 1, 2022
@pohly

pohly commented Nov 1, 2022

Is this meant to be used together with JSON or text output or instead of it?

@saschagrunert
Member Author

Is this meant to be used together with JSON or text output or instead of it?

I'd assume that it works in parallel with them. For example, if tracing is enabled, we pass the text message to the tracer alongside the other output(s).

@pohly

pohly commented Nov 1, 2022

Makes sense, in particular when the tracer doesn't record all log output.

An implementation of this will depend on contextual logging because non-contextual calls (klog.Info, klog.InfoS) have no way of determining the call context and thus the current tracing span.

But contextual logging alone is not enough because those calls are also not passed a context. Whoever adds a span to the context would also have to install a logger for it, which would then get picked up by contextual logging calls in that call chain. Since https://github.com/sallyom/enhancements-1/blob/master/keps/sig-instrumentation/2831-kubelet-tracing/README.md describes how the code gets instrumented, it should probably be extended to also cover the logging aspect.
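
As a sketch of that pattern (assuming the OpenTelemetry Go tracer and the current klog contextual API; this is not code from the KEP), the instrumentation that starts a span could install an enriched logger in the same context, so that downstream contextual logging calls pick it up:

import (
	"context"

	"go.opentelemetry.io/otel"
	"k8s.io/klog/v2"
)

// startSpanWithLogger is a hypothetical helper that combines both steps.
// The tracer name "kubelet" is just an example.
func startSpanWithLogger(ctx context.Context, op string) (context.Context, func()) {
	ctx, span := otel.Tracer("kubelet").Start(ctx, op)
	sc := span.SpanContext()
	logger := klog.LoggerWithValues(klog.FromContext(ctx),
		"trace_id", sc.TraceID().String(), "span_id", sc.SpanID().String())
	ctx = klog.NewContext(ctx, logger)
	return ctx, func() { span.End() }
}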

@saschagrunert
Member Author

Makes sense, in particular when the tracer doesn't record all log output.

An implementation of this will depend on contextual logging because non-contextual calls (klog.Info, klog.InfoS) have no way of determining the call context and thus the current tracing span.

It depends on how we make it usable. If we have a global tracer set, then we could create spans from that as well.

@dashpole

dashpole commented Nov 1, 2022

klog could support that by adding log messages to the current tracing span if one exists.

An alternative, which I would prefer, would be to attach the trace ID and span ID to logs as part of the structured output (as trace_id and span_id), which can then be used to link to/from traces.

It depends on how we make it usable. If we have a global tracer set, then we could create spans from that as well.

You still need a context if you want to link to other spans. The global TracerProvider just lets you export spans from anywhere.

Thus far, we (sig-instrumentation) haven't touched on the subject of log or metric correlation, but we should probably extend our contributor guidelines to add tracing guidelines plus log and metric correlation guidelines for other components. @dgrisonnet @logicalhan

@pohly

pohly commented Nov 1, 2022

An alternative, which I would prefer, would be to attach the trace ID and span ID to logs as part of the structured output (as trace_id and span_id), which can then be used to link to/from traces.

So no logging through OpenTelemetry?

That already works. The code that adds trace and span ID would then have to look like this:

traceID, spanID := <generate IDs>
ctx = context.WithValue(ctx, ... trace ID...)
ctx = context.WithValue(ctx, ... span ID...)
logger := klog.FromContext(ctx)
logger = klog.LoggerWithValues(logger, "trace_id", traceID, "span_id", spanID /* Kubernetes guidelines are to not use underscores in keys, but https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/logs#trace-context-in-legacy-formats explicitly asks for it */)
ctx = klog.NewContext(ctx, logger)

@dashpole

dashpole commented Nov 1, 2022

So no logging through OpenTelemetry?

OpenTelemetry is working on a set of logging libraries, but they are still a ways off (https://opentelemetry.io/docs/instrumentation/go/#status-and-releases). We could discuss adopting them, but I'm not sure what the benefits would be.

I'm just saying we shouldn't add all logs as span events. Logging is its own thing and has well-defined ways to correlate with tracing, which we should probably follow.

What I really want is for logger := klog.FromContext(ctx) to always give me a logger with trace ID and span ID attached, if they exist in the context.
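
A minimal sketch of what that lookup could do, written here as a local helper rather than a change to klog itself (it assumes OpenTelemetry's exported accessor trace.SpanContextFromContext):

import (
	"context"

	"go.opentelemetry.io/otel/trace"
	"k8s.io/klog/v2"
)

// loggerFromContext is a hypothetical wrapper around klog.FromContext.
func loggerFromContext(ctx context.Context) klog.Logger {
	logger := klog.FromContext(ctx)
	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
		logger = logger.WithValues(
			"trace_id", sc.TraceID().String(),
			"span_id", sc.SpanID().String())
	}
	return logger
}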

@pohly

pohly commented Nov 1, 2022

What I really want is for logger := klog.FromContext(ctx) to always give me a logger with trace ID and span ID attached, if they exist in the context.

There are two ways of doing this:

  • teach klog.FromContext about trace and span IDs
  • teach the code which adds these IDs how to add a logger with the values to the context (my proposal above)

The problems with extending klog.FromContext are:

  • Some code might use logr.FromContext directly because it has no klog dependency. Then it will not log the IDs.
  • A solution that works for arbitrary loggers would make retrieving a logger from the context more expensive because it must search for IDs and modify the logger.
  • A special solution for the two loggers typically used by Kubernetes could defer that work by just remembering the context and looking up the IDs if (and only if) needed.

The future contextual logging API in Golang will add some more options, but that is just being implemented and will be experimental. I am not sure whether we want to wait for it. It will support adding the current context to the logger when retrieving it.
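
For reference, here is what that pattern looks like with the API that later shipped as log/slog in Go 1.21 (a sketch only, again assuming OpenTelemetry's exported accessor): the handler receives the context of every log call and can pull the span information out of it.

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

// traceHandler is a hypothetical slog.Handler wrapper that adds
// trace_id/span_id from the per-call context.
type traceHandler struct{ inner slog.Handler }

func (h traceHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.inner.Enabled(ctx, l)
}

func (h traceHandler) Handle(ctx context.Context, r slog.Record) error {
	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
		r.AddAttrs(
			slog.String("trace_id", sc.TraceID().String()),
			slog.String("span_id", sc.SpanID().String()))
	}
	return h.inner.Handle(ctx, r)
}

func (h traceHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return traceHandler{h.inner.WithAttrs(attrs)}
}

func (h traceHandler) WithGroup(name string) slog.Handler {
	return traceHandler{h.inner.WithGroup(name)}
}

// Usage: slog.New(traceHandler{slog.NewJSONHandler(os.Stdout, nil)}).InfoContext(ctx, "something")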

@pohly

pohly commented Nov 1, 2022

We also have to consider whether it is acceptable to make klog depend on OpenTelemetry. If any code in klog is meant to retrieve these IDs, it will have to have such a dependency, right?

@pohly

pohly commented Nov 2, 2022

It should be possible to just configure klog to extract certain values by giving it the keys. Then it won't depend on OpenTelemetry. It may even be possible to move this all into go-logr and do it so that it works for all logger backends - let me try that out...

@dashpole

dashpole commented Nov 2, 2022

As a note, OTel's interfaces, including context, are kept separate from the rest of OTel's codebase. As a result, it has nearly no dependencies. But it is still a consideration.

@sallyom

sallyom commented Nov 10, 2022

What I really want is for logger := klog.FromContext(ctx) to always give me a logger with trace ID and span ID attached, if they exist in the context.

I agree that adding span ID and trace ID to logs would be useful.
This is the example I've seen:

// assuming logrus and the OpenTelemetry trace package:
// import (
// 	log "github.com/sirupsen/logrus"
// 	oteltrace "go.opentelemetry.io/otel/trace"
// )

func logWithContext(span oteltrace.Span) log.Fields {
	return log.Fields{
		"span_id":  span.SpanContext().SpanID().String(),
		"trace_id": span.SpanContext().TraceID().String(),
	}
}

log.WithFields(logWithContext(span)).WithField("response", msg).Info("Counter")

@pohly

pohly commented Nov 17, 2022

It should be possible to just configure klog to extract certain values by giving it the keys.

I've implemented that in PR #358. It is fairly complex because wrapping a LogSink (as done in that PR) is tricky.

I'll try next what it would look like if logr.Logger itself did the extra work. Then wrapping a LogSink will not be necessary and it will also work for code which just does logr.FromContext (assuming that there is a logger installed which was configured to log from the context).

@pohly

pohly commented Nov 18, 2022

I think the best solution is with this changed in logr: go-logr/logr#158

Once that is merged, #359 can also be merged.

@dashpole

As a note, those won't work well with OpenTelemetry. The context keys for span context and for baggage are not exported, so we would need to support accessor functions or something similar to be able to make use of context-based functionality.

@pohly

pohly commented Nov 21, 2022

That's surprising. I could have sworn that OpenTelemetry was mentioned as a use case for logging from context in the upstream Go discussion around a future builtin contextual logging API. But perhaps it was just tracing in general and people will hit this snag later when they really try to use it (like we do now).

So would it be possible to get access to these context keys? If not, then we are stuck with the solution from #356 (comment), which is more expensive.

@pohly

pohly commented Nov 21, 2022

One possible alternative would be to allow passing a func(ctx context.Context) interface{} to the logger to retrieve some value (or values - it could be a struct, slice or map) instead of letting the logger directly retrieve the value. For OpenTelemetry, that function then could call SpanFromContext.

But what is it that should get logged for such a span? #356 (comment) mentions SpanID and TraceID, but I don't see such methods for the SpanContext, only some generic accessor for key/value string pairs.

Should all of those pairs get logged? Using which keys at the level of a log entry? We must be relatively certain that there is no accidental overlap with keys passed to WithValues or Info/Error.
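
A sketch of that alternative (a hypothetical hook, not an API that logr or klog provide today): the logger backend would be configured with a callback and invoke it once per log entry.

import (
	"context"

	"go.opentelemetry.io/otel/trace"
)

// ContextValuesFunc is the hypothetical hook: it turns a context into
// extra key/value pairs for a log entry, or nil.
type ContextValuesFunc func(ctx context.Context) []interface{}

// otelValues is what such a callback could look like for OpenTelemetry.
func otelValues(ctx context.Context) []interface{} {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		return nil
	}
	return []interface{}{
		"trace_id", sc.TraceID().String(),
		"span_id", sc.SpanID().String(),
	}
}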

@dashpole

dashpole commented Nov 21, 2022

I think it was a general discussion.

It might've been a reference to how OpenCensus works, which is that all attributes ("tags" in OC) on spans/metrics are from context. Example.

OpenTelemetry's SpanContext provides accessors for TraceID and SpanID (OpenTracing's SpanContext, which is what you linked, does not appear to provide explicit accessors). I DO NOT think we should add all SpanContext-based k/v pairs from either TraceState (w3c tracestate) or Baggage (w3c baggage). It would be nice if there was a way to opt in values from baggage, though.
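
For illustration (a sketch only; the baggage key "user_id" is a made-up example), the explicit accessors plus an opt-in for a single baggage member could look like this:

import (
	"context"

	"go.opentelemetry.io/otel/baggage"
	"go.opentelemetry.io/otel/trace"
)

// correlationFields extracts only the explicitly chosen values.
func correlationFields(ctx context.Context) map[string]string {
	fields := map[string]string{}
	if sc := trace.SpanContextFromContext(ctx); sc.IsValid() {
		fields["trace_id"] = sc.TraceID().String()
		fields["span_id"] = sc.SpanID().String()
	}
	// Opt in to a single baggage member instead of dumping all of them.
	if m := baggage.FromContext(ctx).Member("user_id"); m.Value() != "" {
		fields["user_id"] = m.Value()
	}
	return fields
}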

@pohly

pohly commented Nov 21, 2022

So if we log span ID and trace ID, what should that look like?

{"msg": "something", "tracing": {"spanID": "xyz", "traceID": "abc"}}

Or

{"msg": "something", "tracing.spanID": "xyz", "tracing.traceID": "abc"}

@dashpole

It should look like:

{"msg": "something", "trace_id": "xyz", "span_id": "abc"}

See https://github.com/open-telemetry/opentelemetry-specification/tree/a87fb8c5a60f942d08c67c1582034eb8b4322009/specification/logs#trace-context-in-legacy-formats

@thockin
Member

thockin commented Nov 21, 2022

One possible alternative would be to allow passing a func(ctx context.Context) interface{} to the logger to retrieve some value

I want to be sensitive to performance, too - having to call an arbitrary number of functions once is not a big deal, but doing that on every log call seems unfortunate, unless we're clear about how we can cache those.

@pohly

pohly commented Nov 22, 2022

Logging as "trace_id": "xyz", "span_id": "abc" in combination with a single function for lookup would be a good fit for something like PseudoStruct as discussed over in #357. We could do two functions, but I suspect that this is going to be less efficient and more complicated (need to supply the key names separately, multiple calls instead of one).

Calling ctx.Value through some function is adding some more overhead, but the baseline (ctx.Value itself, formatting of the log entry) is already pretty big, so relative to that it's probably not that much more.

@thockin I saw that you pre-render key/value pairs in structr's WithValues call. That makes sense if there are then multiple log calls that use that cached information, but it's probably worse for just a single call (need to store the cache somewhere) and definitely hurts when there is no log call at all.

When instrumenting the scheduler with contextual logging, a big concern was the slowdown at low log levels because those are typically used when performance is crucial. But I haven't checked how often there would have been a WithValues call and no log call - I'll probably have to come back to it because the PR stalled.

Perhaps some on-demand caching could give us the best of both worlds (only do expensive work if needed, but then only once). It's just not easy to implement because such a cache would need to be shared between different Logger or LogSink instances.
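
A rough sketch of such on-demand caching (just the cache itself, not the Logger/LogSink plumbing; sharing a pointer to it between derived loggers is the hard part mentioned above):

import (
	"context"
	"sync"

	"go.opentelemetry.io/otel/trace"
)

// lazyTraceValues computes the trace key/value pairs at most once,
// and only when a log entry is actually emitted.
type lazyTraceValues struct {
	ctx  context.Context
	once sync.Once
	kvs  []interface{}
}

func (l *lazyTraceValues) get() []interface{} {
	l.once.Do(func() {
		if sc := trace.SpanContextFromContext(l.ctx); sc.IsValid() {
			l.kvs = []interface{}{
				"trace_id", sc.TraceID().String(),
				"span_id", sc.SpanID().String(),
			}
		}
	})
	return l.kvs
}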

@pohly

pohly commented Nov 23, 2022

I have implemented this in the same PRs which also implement #357.

With that modified Kubernetes and a local-up cluster, I get:

$ grep "Immediate dispatch" /tmp/kube-apiserver.log 
I1124 08:44:15.439987   46581 queueset.go:716] "Immediate dispatch" trace_id="2f53c6cc100aaa44d8d9303fcc75f3a1" span_id="3d99c8f69661e4f1" QS="catch-all" t="2022-11-24 08:44:15.439803964" R="0.00000000ss" request="catch-all" description1=&{IsResourceRequest:true Path:/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/127.0.0.1 Verb:get APIPrefix:apis APIGroup:coordination.k8s.io APIVersion:v1 Namespace:kube-node-lease Resource:leases Subresource: Name:127.0.0.1 Parts:[leases 127.0.0.1]} description2=&{Name:system:node:127.0.0.1 UID: Groups:[system:nodes system:authenticated] Extra:map[]} qsExecuting=1
$ cat /nvme/gopath/src/k8s.io/apiserver-tracing.conf
apiVersion: apiserver.config.k8s.io/v1alpha1
kind: TracingConfiguration
# default value
#endpoint: localhost:4317
samplingRatePerMillion: 1000000

$ APISERVER_TRACING_CONFIG_FILE=/nvme/gopath/src/k8s.io/apiserver-tracing.conf FEATURE_GATES=APIServerTracing=true,ContextualLogging=true KUBELET_RESOLV_CONF="/etc/resolv-9999.conf" DNS_ADDON="coredns" CGROUP_DRIVER=systemd CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/crio/crio.sock LOG_LEVEL=6 ENABLE_CSI_SNAPSHOTTER=false API_SECURE_PORT=6444 ALLOW_PRIVILEGED=1 PATH=/nvme/gopath/src/k8s.io/kubernetes/third_party/etcd:$PATH ./hack/local-up-cluster.sh -O

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 24, 2023
@saschagrunert
Member Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 24, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 22, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) Feb 17, 2024
@k8s-ci-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
