
grpc metrics exporter client starts and keeps logging context deadline exceeded after stably running for about 2.5 hours #2217

Open
tim-mwangi opened this issue Sep 2, 2021 · 4 comments
Labels: area:metrics (Part of OpenTelemetry Metrics) · bug (Something isn't working) · pkg:exporter:otlp (Related to the OTLP exporter package)

tim-mwangi commented Sep 2, 2021

Description

The gRPC metrics exporter client keeps logging context deadline exceeded after running stably for about 2.5 hours. The server it reports to, an OpenTelemetry Collector, is still running, but the client disconnects and is unable to reconnect for some reason. If I start another process using the client and connect to the collector's OTLP gRPC receiver endpoint, it is able to connect. This happens while running in an Ubuntu-based container on Kubernetes (both the collector and the client run there); I could not reproduce it on macOS (darwin).

I switched to the HTTP metrics exporter client and did not observe this problem. I was previously using v0.18.0 of opentelemetry-go and did not observe this issue either.

Environment

  • OS: Ubuntu 20.04 based docker container
  • Env: k8s
  • Architecture: x86
  • Go Version: 1.16
  • opentelemetry-go version: v1.0.0-RC2 (upgraded from v0.18.0)
  • opentelemetry-go metric version: v0.22.0 (upgraded from v0.18.0)
  • opentelemetry collector version: v0.33.0 (upgraded from v0.29.0)

Collector log with timestamp for when the receiver started:

2021-08-29T22:00:25.175Z	info	otlpreceiver/otlp.go:75	Starting GRPC server on endpoint 0.0.0.0:4317	{"kind": "receiver", "name": "otlp"}

The metrics exporter client's "context deadline exceeded" logs start appearing after about 2.5 hours:

2021/08/30 00:24:10 context deadline exceeded
2021/08/30 00:24:21 context deadline exceeded
2021/08/30 00:24:21 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:23 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:25 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:40 context deadline exceeded
2021/08/30 00:24:41 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:55 context deadline exceeded
2021/08/30 00:24:55 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:15 context deadline exceeded
2021/08/30 00:25:25 context deadline exceeded
2021/08/30 00:25:25 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:27 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:29 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:40 context deadline exceeded
2021/08/30 00:25:41 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:52 context deadline exceeded

They do not stop until the process running the client is killed.

Steps To Reproduce

  1. Use the configuration below (or something similar) to start an OpenTelemetry Collector (v0.33.0):
# opentelemetry collector v0.33.0
extensions:
  health_check:
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 4
      http:
  opencensus:

  # Collect own metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: "my-collector"
          scrape_interval: 10s
          static_configs:
            - targets: ["0.0.0.0:8888"]
  jaeger:
    protocols:
      grpc:
      thrift_binary:
      thrift_compact:
      thrift_http:
  zipkin:

processors:
  batch:
    timeout: 200ms
    send_batch_size: 8192
    send_batch_max_size: 10000
  # Some other custom processors

exporters:
  otlp:
    endpoint: "myendpoint:443"
    compression: gzip
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: traceableai

service:
  pipelines:
    traces:
      receivers: [otlp, opencensus, zipkin, jaeger]
      processors: [batch] # Some other custom trace processors were here as well
      exporters: [otlp]
    metrics:
      receivers: [otlp, opencensus, prometheus]
      processors: [batch]
      exporters: [otlp, prometheus] # 2 exporters. Now testing with 1 - otlp - see if it crashes

  extensions: [health_check, pprof, zpages]
  2. Use the code snippet below to set up and start a simple metrics client that records a single metric every 5 seconds. The OTLP metric packages are at v0.22.0:
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	//"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	"go.opentelemetry.io/otel/metric"
	otelmetricglobal "go.opentelemetry.io/otel/metric/global"
	controller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
	processor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
	"go.opentelemetry.io/otel/sdk/metric/selector/simple"
	"go.uber.org/zap"
)

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	// Set up grpc client
	metricsClient := otlpmetricgrpc.NewClient(
		otlpmetricgrpc.WithInsecure(),
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
	)
	// Set up http client. This one works
	// Uncomment to switch
	// metricsClient := otlpmetrichttp.NewClient(
	// 	otlpmetrichttp.WithInsecure(),
	// 	otlpmetrichttp.WithEndpoint("localhost:4318"),
	// )
	metricsExporter := otlpmetric.NewUnstarted(metricsClient)

	// Set up controller
	ctlr := controller.New(
		processor.New(
			simple.NewWithExactDistribution(),
			metricsExporter,
		),
		controller.WithExporter(metricsExporter),
		controller.WithCollectPeriod(2*time.Second),
	)
	// Set meter provider
	otelmetricglobal.SetMeterProvider(ctlr.MeterProvider())

	// Start controller and exporter
	handleErr(ctlr.Start(context.Background()), "failed to start controller", logger)
	handleErr(metricsClient.Start(context.Background()), "failed to start exporter", logger)

	// defer cleanup when done
	defer func() {
		// Push any last metric events to the exporter.
		handleErr(ctlr.Stop(context.Background()), "failed to stop controller", logger)
		handleErr(metricsExporter.Shutdown(context.Background()), "failed to stop metrics exporter", logger)
	}()

	// Start a ticker to send a heartbeat metric every 5 seconds or send some other metric of your choice
	ticker := time.NewTicker(5 * time.Second)
	heartBeatTickerDone := make(chan struct{}, 1)
	defer ticker.Stop()

	metricMeter := createMeter()
	valueRecorder := metric.Must(metricMeter).NewInt64ValueRecorder("heartbeat")
	for {
		select {
		case <-heartBeatTickerDone:
			return
		case <-ticker.C:
			logger.Info("Creating metric")
			valueRecorder.Record(context.Background(), 1)
		}
	}
}

func handleErr(err error, message string, logger *zap.Logger) {
	if err != nil {
		logger.Error(message, zap.Error(err))
	}
}

func createMeter() metric.Meter {
	return otelmetricglobal.Meter("test-grpc-metric-exporter")
}
  3. See the errors above after running for about 2.5 hours.

Expected behavior

It should continue to happily report metrics to the otel collector for however long it runs.

@tim-mwangi tim-mwangi added the bug Something isn't working label Sep 2, 2021
@MrAlias MrAlias added the area:metrics Part of OpenTelemetry Metrics label Sep 2, 2021
@MrAlias MrAlias added this to Needs triage in Bugs via automation Sep 2, 2021
@MrAlias MrAlias added the pkg:exporter:otlp Related to the OTLP exporter package label Sep 2, 2021
tim-mwangi (Author) commented

By the way, this happens when I run the container on GKE Kubernetes, but I cannot reproduce it on docker-desktop Kubernetes, which is quite strange.

@MrAlias MrAlias moved this from Needs triage to Low priority in Bugs Oct 15, 2021

disq commented Nov 18, 2021

This is also reproducible by using otlptracegrpc.New(ctx, otlptracegrpc.WithTimeout(some short timeout)) and not having the collector running. After a span ends, the exporter tries to connect, fails, and the log line is printed.
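A minimal sketch of that reproduction, assuming the v1.x otlptracegrpc API; the endpoint, timeout value, and tracer name here are illustrative, and nothing is expected to be listening on the port:

package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Short export timeout, pointed at an endpoint where no collector is running.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithInsecure(),
		otlptracegrpc.WithEndpoint("localhost:4317"), // assumed: nothing listening here
		otlptracegrpc.WithTimeout(500*time.Millisecond),
	)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Ending a span queues it for export; with no collector reachable, the
	// export RPC fails and "context deadline exceeded" reaches the error handler.
	_, span := otel.Tracer("repro").Start(ctx, "test-span")
	span.End()

	time.Sleep(10 * time.Second) // give the batch span processor time to attempt the export
}

Once the batch span processor flushes, the failed export surfaces the same "context deadline exceeded" log line as in the metrics report above.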


MrAlias commented Dec 6, 2021

The connection handling for the otlpmetricgrpc client was recently refactored (#2425) to use the gRPC ClientConn to handle connections. This bug needs to be verified to still be relevant with those changes.
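One way to check whether the behaviour survives the refactor is to hand the exporter an externally managed *grpc.ClientConn and log its connectivity state while the reproduction runs. This is only a sketch, assuming the post-#2425 otlpmetricgrpc API (NewClient plus a WithGRPCConn option); the endpoint is illustrative:

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"google.golang.org/grpc"
)

func main() {
	// Dial the collector ourselves so the exporter reuses this ClientConn.
	conn, err := grpc.Dial("localhost:4317", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Hand the connection to the exporter client; wire this client into the
	// controller/exporter setup from the reproduction program above.
	client := otlpmetricgrpc.NewClient(otlpmetricgrpc.WithGRPCConn(conn))
	_ = client

	// Log every connectivity state change so it can be correlated with the
	// "context deadline exceeded" lines around the 2.5 hour mark.
	for {
		state := conn.GetState()
		log.Printf("grpc ClientConn state: %v", state)
		if !conn.WaitForStateChange(context.Background(), state) {
			return
		}
	}
}

If the ClientConn stays READY while exports still time out, that would point at the export path rather than connection management.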


hardproblems commented Jan 29, 2022

This is still a bug for otlpmetricgrpc too. The reason is that the current implementation of exporter.Export / client.UploadMetrics by default sends all metrics in a collection/snapshot in one ExportMetricsServiceRequest. If the number of metrics is too large to finish sending within the push timeout, or if the request size exceeds the server's gRPC max receive message size, the request fails with a timeout error. The solution may be batching large payloads into multiple requests or streaming gRPC.
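Not a fix for the oversized-request case, but two knobs in the packages used above can at least widen the window while batching or streaming is designed: the client's per-RPC export timeout and the controller's push timeout. A hedged sketch based on the v0.22-era reproduction program; the 30-second values are arbitrary, and the collector's max_recv_msg_size_mib from the config above still caps the request size on the server side:

package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	controller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
	processor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
	"go.opentelemetry.io/otel/sdk/metric/selector/simple"
)

func main() {
	// Same pipeline as the reproduction program, with explicit timeouts.
	metricsClient := otlpmetricgrpc.NewClient(
		otlpmetricgrpc.WithInsecure(),
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
		otlpmetricgrpc.WithTimeout(30*time.Second), // per-RPC export timeout
	)
	metricsExporter := otlpmetric.NewUnstarted(metricsClient)

	ctlr := controller.New(
		processor.New(simple.NewWithExactDistribution(), metricsExporter),
		controller.WithExporter(metricsExporter),
		controller.WithCollectPeriod(2*time.Second),
		controller.WithPushTimeout(30*time.Second), // controller-side push timeout
	)

	_ = ctlr.Start(context.Background())
	_ = metricsClient.Start(context.Background())
	defer func() { _ = ctlr.Stop(context.Background()) }()

	// Record metrics as in the program above; a snapshot that exceeds the
	// server's max receive size will keep failing regardless of timeouts.
	select {}
}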
