
grpc metrics exporter client starts and keeps logging context deadline exceeded after stably running for about 2.5 hours #2217

Open
tim-mwangi opened this issue Sep 2, 2021 · 4 comments
Labels: area:metrics (Part of OpenTelemetry Metrics) · bug (Something isn't working) · pkg:exporter:otlp (Related to the OTLP exporter package)

tim-mwangi commented Sep 2, 2021

Description

The gRPC metrics exporter client keeps logging context deadline exceeded after running stably for about 2.5 hours. The server it reports to, an OpenTelemetry Collector, is still running, but the client disconnects and is unable to reconnect for some reason. If I start another process using the client and connect to the collector's OTLP gRPC receiver endpoint, it is able to connect. This happens while running in an Ubuntu-based container on Kubernetes (both the collector and the client run there); I could not reproduce it on macOS (darwin).

I switched to the HTTP metrics exporter client and did not observe this problem. I was previously using v0.18.0 of opentelemetry-go and did not observe this issue either.

Environment

  • OS: Ubuntu 20.04 based docker container
  • Env: k8s
  • Architecture: x86
  • Go Version: 1.16
  • opentelemetry-go version: v1.0.0-RC2 (upgraded from v0.18.0)
  • opentelemetry-go metric version: v0.22.0 (upgraded from v0.18.0)
  • opentelemetry collector version: v0.33.0 (upgraded from v0.29.0)

Collector log with timestamp for when the receiver started:

2021-08-29T22:00:25.175Z	info	otlpreceiver/otlp.go:75	Starting GRPC server on endpoint 0.0.0.0:4317	{"kind": "receiver", "name": "otlp"}

The metrics exporter client's "context deadline exceeded" logs start appearing after about 2.5 hours:

2021/08/30 00:24:10 context deadline exceeded
2021/08/30 00:24:21 context deadline exceeded
2021/08/30 00:24:21 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:23 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:25 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:40 context deadline exceeded
2021/08/30 00:24:41 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:24:55 context deadline exceeded
2021/08/30 00:24:55 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:15 context deadline exceeded
2021/08/30 00:25:25 context deadline exceeded
2021/08/30 00:25:25 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:27 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:29 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:40 context deadline exceeded
2021/08/30 00:25:41 metrics exporter is disconnected from the server localhost:4317: context deadline exceeded
2021/08/30 00:25:52 context deadline exceeded

They do not stop until the process running the client is killed.

Steps To Reproduce

  1. Use the configuration below (or something similar) to start an OpenTelemetry Collector (v0.33.0):
# opentelemetry collector v0.33.0
extensions:
  health_check:
  pprof:
    endpoint: 0.0.0.0:1777
  zpages:
    endpoint: 0.0.0.0:55679

receivers:
  otlp:
    protocols:
      grpc:
        max_recv_msg_size_mib: 4
      http:
  opencensus:

  # Collect own metrics
  prometheus:
    config:
      scrape_configs:
        - job_name: "my-collector"
          scrape_interval: 10s
          static_configs:
            - targets: ["0.0.0.0:8888"]
  jaeger:
    protocols:
      grpc:
      thrift_binary:
      thrift_compact:
      thrift_http:
  zipkin:

processors:
  batch:
    timeout: 200ms
    send_batch_size: 8192
    send_batch_max_size: 10000
  # Some other custom processors

exporters:
  otlp:
    endpoint: "myendpoint:443"
    compression: gzip
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: traceableai

service:
  pipelines:
    traces:
      receivers: [otlp, opencensus, zipkin, jaeger]
      processors: [batch] # Some other custom trace processors were here as well
      exporters: [otlp]
    metrics:
      receivers: [otlp, opencensus, prometheus]
      processors: [batch]
      exporters: [otlp, prometheus] # 2 exporters. Now testing with 1 - otlp - see if it crashes

  extensions: [health_check, pprof, zpages]
  2. Use the code snippet below to set up and start a simple metrics client that records a single metric every 5 seconds. The OTLP metric packages are at v0.22.0:
package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	//"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp"
	"go.opentelemetry.io/otel/metric"
	otelmetricglobal "go.opentelemetry.io/otel/metric/global"
	controller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
	processor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
	"go.opentelemetry.io/otel/sdk/metric/selector/simple"
	"go.uber.org/zap"
)

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	// Set up grpc client
	metricsClient := otlpmetricgrpc.NewClient(
		otlpmetricgrpc.WithInsecure(),
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
	)
	// Set up http client. This one works
	// Uncomment to switch
	// metricsClient := otlpmetrichttp.NewClient(
	// 	otlpmetrichttp.WithInsecure(),
	// 	otlpmetrichttp.WithEndpoint("localhost:4318"),
	// )
	metricsExporter := otlpmetric.NewUnstarted(metricsClient)

	// Set up controller
	ctlr := controller.New(
		processor.New(
			simple.NewWithExactDistribution(),
			metricsExporter,
		),
		controller.WithExporter(metricsExporter),
		controller.WithCollectPeriod(2*time.Second),
	)
	// Set meter provider
	otelmetricglobal.SetMeterProvider(ctlr.MeterProvider())

	// Start controller and exporter
	handleErr(ctlr.Start(context.Background()), "failed to start controller", logger)
	handleErr(metricsClient.Start(context.Background()), "failed to start exporter", logger)

	// defer cleanup when done
	defer func() {
		// Push any last metric events to the exporter.
		handleErr(ctlr.Stop(context.Background()), "failed to stop controller", logger)
		handleErr(metricsExporter.Shutdown(context.Background()), "failed to stop metrics exporter", logger)
	}()

	// Start a ticker to send a heartbeat metric every 5 seconds or send some other metric of your choice
	ticker := time.NewTicker(5 * time.Second)
	heartBeatTickerDone := make(chan struct{}, 1)
	defer ticker.Stop()

	metricMeter := createMeter()
	valueRecorder := metric.Must(metricMeter).NewInt64ValueRecorder("heartbeat")
	for {
		select {
		case <-heartBeatTickerDone:
			return
		case <-ticker.C:
			logger.Info("Creating metric")
			valueRecorder.Record(context.Background(), 1)
		}
	}
}

func handleErr(err error, message string, logger *zap.Logger) {
	if err != nil {
		logger.Error(message, zap.Error(err))
	}
}

func createMeter() metric.Meter {
	return otelmetricglobal.Meter("test-grpc-metric-exporter")
}
  3. See the errors above after running for about 2.5 hours.

Expected behavior

It should continue to happily report metrics to the otel collector for however long it runs.

@tim-mwangi tim-mwangi added the bug Something isn't working label Sep 2, 2021
@MrAlias MrAlias added the area:metrics Part of OpenTelemetry Metrics label Sep 2, 2021
@MrAlias MrAlias added this to Needs triage in Bugs via automation Sep 2, 2021
@MrAlias MrAlias added the pkg:exporter:otlp Related to the OTLP exporter package label Sep 2, 2021
tim-mwangi (Author) commented

By the way, this happens when I run the container on GKE Kubernetes, but I cannot reproduce it on docker-desktop Kubernetes, which is quite strange.

@MrAlias MrAlias moved this from Needs triage to Low priority in Bugs Oct 15, 2021

disq commented Nov 18, 2021

This is also reproducible by using otlptracegrpc.New(ctx, otlptracegrpc.WithTimeout(some short timeout)) and not having the collector running. After a span ends, the exporter tries to connect, fails, and the log line is printed.
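A minimal sketch of that reproduction, assuming the v1.x otlptracegrpc API; the endpoint, timeout value, and tracer name here are illustrative, and nothing is expected to be listening on the port:

package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// Short export timeout, pointed at an endpoint where no collector is running.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithInsecure(),
		otlptracegrpc.WithEndpoint("localhost:4317"), // assumed: nothing listening here
		otlptracegrpc.WithTimeout(500*time.Millisecond),
	)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Ending a span queues it for export; with no collector reachable, the
	// export RPC fails and "context deadline exceeded" reaches the error handler.
	_, span := otel.Tracer("repro").Start(ctx, "test-span")
	span.End()

	time.Sleep(10 * time.Second) // give the batch span processor time to attempt the export
}

Once the batch span processor flushes, the failed export surfaces the same "context deadline exceeded" log line as in the metrics report above.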


MrAlias commented Dec 6, 2021

The connection handling for the otlpmetricgrpc client was recently refactored (#2425) to use the gRPC ClientConn to handle connections. This bug needs to be verified to still be relevant with those changes.
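One way to check whether the behaviour survives the refactor is to hand the exporter an externally managed *grpc.ClientConn and log its connectivity state while the reproduction runs. This is only a sketch, assuming the post-#2425 otlpmetricgrpc API (NewClient plus a WithGRPCConn option); the endpoint is illustrative:

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	"google.golang.org/grpc"
)

func main() {
	// Dial the collector ourselves so the exporter reuses this ClientConn.
	conn, err := grpc.Dial("localhost:4317", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Hand the connection to the exporter client; wire this client into the
	// controller/exporter setup from the reproduction program above.
	client := otlpmetricgrpc.NewClient(otlpmetricgrpc.WithGRPCConn(conn))
	_ = client

	// Log every connectivity state change so it can be correlated with the
	// "context deadline exceeded" lines around the 2.5 hour mark.
	for {
		state := conn.GetState()
		log.Printf("grpc ClientConn state: %v", state)
		if !conn.WaitForStateChange(context.Background(), state) {
			return
		}
	}
}

If the ClientConn stays READY while exports still time out, that would point at the export path rather than connection management.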


hardproblems commented Jan 29, 2022

This is still a bug for otlpmetricgrpc too. The reason is that the current implementation of exporter.Export / client.UploadMetrics by default sends all metrics in a collection/snapshot in one ExportMetricsServiceRequest. If the number of metrics is too large to finish sending within the push timeout, or if the request size exceeds the server's gRPC max receive message size, the request fails with a timeout error. The solution may be batching large payloads into multiple requests or streaming gRPC.
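Not a fix for the oversized-request case, but two knobs in the packages used above can at least widen the window while batching or streaming is designed: the client's per-RPC export timeout and the controller's push timeout. A hedged sketch based on the v0.22-era reproduction program; the 30-second values are arbitrary, and the collector's max_recv_msg_size_mib from the config above still caps the request size on the server side:

package main

import (
	"context"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	controller "go.opentelemetry.io/otel/sdk/metric/controller/basic"
	processor "go.opentelemetry.io/otel/sdk/metric/processor/basic"
	"go.opentelemetry.io/otel/sdk/metric/selector/simple"
)

func main() {
	// Same pipeline as the reproduction program, with explicit timeouts.
	metricsClient := otlpmetricgrpc.NewClient(
		otlpmetricgrpc.WithInsecure(),
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
		otlpmetricgrpc.WithTimeout(30*time.Second), // per-RPC export timeout
	)
	metricsExporter := otlpmetric.NewUnstarted(metricsClient)

	ctlr := controller.New(
		processor.New(simple.NewWithExactDistribution(), metricsExporter),
		controller.WithExporter(metricsExporter),
		controller.WithCollectPeriod(2*time.Second),
		controller.WithPushTimeout(30*time.Second), // controller-side push timeout
	)

	_ = ctlr.Start(context.Background())
	_ = metricsClient.Start(context.Background())
	defer func() { _ = ctlr.Stop(context.Background()) }()

	// Record metrics as in the program above; a snapshot that exceeds the
	// server's max receive size will keep failing regardless of timeouts.
	select {}
}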
