
Collector crashing with invalid memory address or nil pointer dereference on collector exposing a GRPC endpoint behind an AWS ALB #9296

Closed
tcoville-coveo opened this issue Jan 16, 2024 · 20 comments · Fixed by #9312
Labels
bug Something isn't working priority:p1 High

Comments

@tcoville-coveo

tcoville-coveo commented Jan 16, 2024

Describe the bug
The collector crashes continuously, starting a few seconds after startup, after upgrading to 0.92.0.

Steps to reproduce
I have several collectors running with the same version, but only one has the issue. The only difference I can see is that it is used as a gateway behind an AWS ALB. The ALB gRPC health check polls the / path of the receiver, and the failures seem to correlate with the probe frequency. I also noted that if I remove the ingress the errors stop (along with the traffic, so I'm not sure this proves anything...).
I also tried downgrading the relay collector to 0.91.0 and sending the traces "as-is" to a second collector running 0.92.0 with the same config but without the ingress; I don't get the error there, which tends to confirm the ALB probe theory...

What did you expect to see?
The collector should at least survive the error and log a meaningful message

What did you see instead?

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x1235929]

goroutine 167 [running]:
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*config).handleRPC(0xc002a0e9c0, {0x902ac80, 0xc002aaa510}, {0x8ff2fd0?, 0xc0020fba88?})
	go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.46.1/stats_handler.go:144 +0xa9
go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*serverHandler).HandleRPC(0xc00232a340?, {0x902ac80?, 0xc002aaa510?}, {0x8ff2fd0?, 0xc0020fba88?})
	go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.46.1/stats_handler.go:88 +0x2a
google.golang.org/grpc/internal/transport.(*http2Server).WriteStatus(0xc00232a340, 0xc002a0c480, 0xc002b3c340)
	google.golang.org/grpc@v1.60.1/internal/transport/http2_server.go:1071 +0xaf2
google.golang.org/grpc.(*Server).handleStream(0xc002858c00, {0x905cfe0, 0xc00232a340}, 0xc002a0c480)
	google.golang.org/grpc@v1.60.1/server.go:1749 +0x575
google.golang.org/grpc.(*Server).serveStreams.func2.1()
	google.golang.org/grpc@v1.60.1/server.go:1016 +0x59
created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 166
	google.golang.org/grpc@v1.60.1/server.go:1027 +0x115

What version did you use?
otel-operator: 0.91.0, collector image: 0.92.0

What config did you use?
collector:

"exporters":
  "logging": null
  "otlp/central":
    "endpoint": "https://xxxxxxxxxxxxxx"
"extensions":
  "health_check": {}
  "zpages": {}
"processors":
  "batch":
    "send_batch_max_size": 15000
    "send_batch_size": 7500
    "timeout": "200ms"
  "memory_limiter":
    "check_interval": "1s"
    "limit_percentage": 75
    "spike_limit_percentage": 15
  "transform":
    "metric_statements":
    - "context": "metric"
      "statements":
      - "set(resource.attributes[\"meta.collector\"], \"relay-collector\")"
    "trace_statements":
    - "context": "spanevent"
      "statements":
      - "set(resource.attributes[\"meta.collector\"], \"relay-collector\")"
    - "context": "span"
      "statements":
      - "set(resource.attributes[\"meta.collector\"], \"relay-collector\")"
"receivers":
  "otlp":
    "protocols":
      "grpc":
        "endpoint": "0.0.0.0:4317"
        "max_recv_msg_size_mib": 15
        "tls":
          "cert_file": "/certificate/tls.crt"
          "key_file": "/certificate/tls.key"
"service":
  "extensions":
  - "health_check"
  - "zpages"
  "pipelines":
    "traces":
      "exporters":
      - "otlp/central"
      "processors":
      - "memory_limiter"
      - "transform"
      - "batch"
      "receivers":
      - "otlp"
  "telemetry":
    "logs":
      "encoding": "json"
      "level": "info"
    "metrics":
      "address": ":8888"

ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: relay-collector
  namespace: '${config.namespace}'
  annotations:
    "alb.ingress.kubernetes.io/backend-protocol": "HTTPS"
    "alb.ingress.kubernetes.io/backend-protocol-version": "GRPC"
    "alb.ingress.kubernetes.io/target-type": "ip"
    "alb.ingress.kubernetes.io/healthcheck-path": "/"
    "alb.ingress.kubernetes.io/healthcheck-protocol": "HTTPS"
    "alb.ingress.kubernetes.io/healthcheck-interval-seconds": "5"
    "alb.ingress.kubernetes.io/healthcheck-timeout-seconds": "4"
    "alb.ingress.kubernetes.io/unhealthy-threshold-count": "3"
    "alb.ingress.kubernetes.io/success-codes": "0-99"
    "alb.ingress.kubernetes.io/ssl-policy": "ELBSecurityPolicy-TLS-1-2-2017-01"
    "alb.ingress.kubernetes.io/inbound-cidrs": "${config.inbound_cidrs}"
    "alb.ingress.kubernetes.io/certificate-arn": "${config.certificate}"
    "alb.ingress.kubernetes.io/listen-ports": "[{\"HTTPS\": 443 }]"
    "alb.ingress.kubernetes.io/scheme": "internal"
spec:
  ingressClassName: albv2
  rules:
    - host: ${config.hostname}
      http:
        paths:
          - path: /*
            pathType: ImplementationSpecific
            backend:
              service:
                name: relay-collector
                port:
                  name: otlp-grpc

Environment
otel-contrib image

Additional context
I did not manage to reproduce this locally on my machine; the issue only seems to happen on the cluster.

@tcoville-coveo tcoville-coveo added the bug Something isn't working label Jan 16, 2024
@neeej

neeej commented Jan 16, 2024

I also have issues with 0.92.0 behind an AWS ALB, which worked fine in all previous versions I've run (including 0.91.0); mine just fails the health checks and I don't understand why.

But I do seem to get the same error message as you, when I restart the service (via systemd).

Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: panic: runtime error: invalid memory address or nil pointer dereference
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xd37fa8]
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: goroutine 318 [running]:
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*config).>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc.(*serverHa>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc/internal/transport.(*http2Server).WriteStatus(0x4002b22340, 0x4>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         google.golang.org/grpc@v1.60.1/internal/transport/http2_server.go:1071 +0x8dc
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc.(*Server).handleStream(0x4001fb7600, {0x852a3a0, 0x4002b22340},>
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         google.golang.org/grpc@v1.60.1/server.go:1749 +0x474
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: google.golang.org/grpc.(*Server).serveStreams.func2.1()
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         google.golang.org/grpc@v1.60.1/server.go:1016 +0x5c
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]: created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 95
Jan 16 19:22:53 opentelemetry-collector-338 otelcol-contrib[116191]:         google.golang.org/grpc@v1.60.1/server.go:1027 +0x138
Jan 16 19:22:53 opentelemetry-collector-338 systemd[1]: otelcol-contrib.service: Main process exited, code=exited, status=2/INVALIDARGUMENT

I'm running this on al2023 (but I have seen the same issue on amzn2 using x86 as well)
cat /etc/os-release

NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
SUPPORT_END="2028-03-01"

uname -a

Linux opentelemetry-collector-338 6.1.56-82.125.amzn2023.aarch64 #1 SMP Tue Oct 10 17:03:29 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux

config:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 5s
        sending_queue:
          queue_size: 10000
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelsampler.etc
        timeout: 3s

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 30
  batch:
    send_batch_max_size: 12288
    timeout: 5s

extensions:
  health_check:
  zpages:

service:
  extensions: [zpages, health_check]
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    metrics:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]
    logs:
      receivers: [otlp]
      exporters: [loadbalancing]
      processors: [memory_limiter, batch]

@dmitryax
Member

dmitryax commented Jan 16, 2024

Thanks for reporting! This might be a regression in the OTel Go gRPC instrumentation. Looking into it.

cc @open-telemetry/go-instrumentation-approvers

@mx-psi mx-psi added the priority:p1 High label Jan 17, 2024
@mx-psi
Member

mx-psi commented Jan 17, 2024

@mx-psi
Member

mx-psi commented Jan 17, 2024

The diff between v1.59.0 and v1.60.0 includes two PRs that are potentially related to the instrumentation code: grpc/grpc-go/pull/6716 and grpc/grpc-go/pull/6750. 6716 looks more suspicious.

@mx-psi
Member

mx-psi commented Jan 17, 2024

I don't know why the crash happens, but open-telemetry/opentelemetry-go-contrib/pull/4825 would prevent it and it follows existing practice in grpc-go, so I think this would at least make the issue less serious.
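
For illustration, here is a minimal, self-contained Go sketch of the kind of nil-check safeguard that PR adds. It is not the upstream diff: guardedHandler, rpcState and rpcStateKey are hypothetical stand-ins for otelgrpc's internal serverHandler/gRPCContext plumbing; only the guard pattern itself reflects the fix.

package main

import (
	"context"
	"fmt"

	"google.golang.org/grpc/stats"
)

// rpcState is a stand-in for otelgrpc's per-RPC state that TagRPC is
// expected to stash in the context before HandleRPC runs.
type rpcState struct{ method string }

type rpcStateKey struct{}

// guardedHandler sketches the safeguard: HandleRPC tolerates a context that
// never went through TagRPC instead of dereferencing a nil pointer.
type guardedHandler struct{}

func (guardedHandler) TagRPC(ctx context.Context, info *stats.RPCTagInfo) context.Context {
	return context.WithValue(ctx, rpcStateKey{}, &rpcState{method: info.FullMethodName})
}

func (guardedHandler) HandleRPC(ctx context.Context, rs stats.RPCStats) {
	state, _ := ctx.Value(rpcStateKey{}).(*rpcState)
	if state == nil {
		// The server rejected the RPC (e.g. malformed method name "/") before
		// TagRPC ran, so there is nothing to record; skip instead of panicking.
		return
	}
	fmt.Printf("recording %T for %s\n", rs, state.method)
}

func (guardedHandler) TagConn(ctx context.Context, _ *stats.ConnTagInfo) context.Context {
	return ctx
}

func (guardedHandler) HandleConn(context.Context, stats.ConnStats) {}

func main() {
	var h stats.Handler = guardedHandler{}
	// Simulate the crash scenario: HandleRPC invoked with an untagged context.
	// With the guard this is a no-op rather than a nil pointer dereference.
	h.HandleRPC(context.Background(), &stats.End{})
}

The point is only the early return; the real handler records metrics and span events where the print statement sits.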

@codeboten
Contributor

codeboten commented Jan 17, 2024

@tcoville-coveo @neeej can you confirm that restarting the collector w/ the following feature gate prevents the issue from occurring:

--feature-gates -telemetry.useOtelForInternalMetrics

I'm wondering if this issue is caused by the change to using OTel for internal metrics, but it might not be.

@tcoville-coveo
Author

tcoville-coveo commented Jan 17, 2024

@codeboten I tried what you suggested, but I didn't notice any difference; the container is still crashing with the same error.
I didn't see any mention of the flag in the logs. How can I make sure that it has actually been applied?
I managed to ping zpages before the container restart, and I can confirm the flag is disabled:
[screenshot: zpages feature-gates page showing the gate disabled]

FYI, I deploy via the operator, so I added the following config in the CRD:

  args:
    feature-gates: "-telemetry.useOtelForInternalMetrics"

which results in the pod args being generated as:

apiVersion: v1
kind: Pod
[...]
spec:
  containers:
  - name: otc-container
    args:
    - --config=/conf/collector.yaml
    - --feature-gates=-telemetry.useOtelForInternalMetrics
[...]

@neeej

neeej commented Jan 18, 2024

I also get the same error on restarts with that setting.

otelcol+    5870  2.5  3.3 1455156 133792 ?      Ssl  06:31   0:00 /usr/bin/otelcol-contrib --config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.useOtelForInternalMetrics

@mx-psi
Member

mx-psi commented Jan 18, 2024

Filed grpc/grpc-go/issues/6928

MadVikingGod pushed a commit to open-telemetry/opentelemetry-go-contrib that referenced this issue Jan 18, 2024
…eRPC receives a gRPCContext (#4825)

* [instrumentation/google.golang.org/grpc/otelgrpc] Add safeguard to HandleRPC

To help prevent crashes such as open-telemetry/opentelemetry-collector/issues/9296, check if the context is nil before proceeding.

This is inspired by: https://github.com/grpc/grpc-go/blob/ddd377f19841eae70862559c854d957d61b3b692/stats/opencensus/opencensus.go#L204-L207

* Add changelog entry

* Create and handle error

* Make check a bit more clear

* Update instrumentation/google.golang.org/grpc/otelgrpc/stats_handler.go

Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>

* Remove unused code

* additional check for gctx

---------

Co-authored-by: Anthony Mirabella <a9@aneurysm9.com>
codeboten pushed a commit to codeboten/opentelemetry-collector that referenced this issue Jan 18, 2024
Potentially fix open-telemetry#9296

Signed-off-by: Alex Boten <aboten@lightstep.com>
codeboten pushed a commit that referenced this issue Jan 18, 2024
Potentially fix #9296

---------

Signed-off-by: Alex Boten <aboten@lightstep.com>
@dmitryax dmitryax reopened this Jan 18, 2024
@dmitryax
Member

I think we should not close this issue until grpc/grpc-go#6928 is resolved, or we should file another one. It sounds potentially pretty impactful. Maybe it's even worth adding a release-blocker label. @codeboten WDYT?

@zasweq

zasweq commented Jan 18, 2024

Hello, I replied in the gRPC-Go issue, but is there a way to reproduce this? I wrote a test locally and triaged this repo's stats handler code, and it all seems to be in good order (my test had no problem accessing context values in the server-side trailers case from WriteStatus). Here to help; hopefully we can get to the bottom of this.

@codeboten
Contributor

@dmitryax I think it's fine to keep it open until it's been confirmed to be fixed.

@codeboten
Contributor

codeboten commented Jan 18, 2024

@tcoville-coveo @neeej the latest dev build of contrib is available; would you be able to test it out and see if the original crash occurs? No worries if not; we just haven't been able to reproduce it on our end. The Docker tag is:

otel/opentelemetry-collector-contrib-dev:bcc7163f7a8a7c2611fe74ba23f998a7fc0889a2

@tcoville-coveo
Author

@codeboten YES! The issue is fixed with the dev build 🥳 🎉
Thank you so much to all of you for prioritizing this and fixing the issue so quickly!

@codeboten
Contributor

Thanks @mx-psi for getting to the bottom of this so quickly! @dmitryax @mx-psi did you want to leave this open until the release or mark it closed?

@mx-psi
Member

mx-psi commented Jan 19, 2024

Thank you so much to all of you for prioritizing this and fixing the issue so quickly!

Thanks to you both for providing us with feedback and detailed information about this as well :)

While the issue is fixed at the Collector level, it would be great to also fix it at the grpc-go level. It would be extremely helpful if you could help us reproduce this locally, @tcoville-coveo @neeej. To this end, I think the following would be helpful:

  • More details about your setup: you both have an AWS ALB, but could you share more details about your setup there?
  • Minimizing your Collector configuration while still reproducing the panic. Some things to try and remove here:
    • Removing processors, extensions and pipelines from your configuration
    • Disabling particular features (e.g. telemetry::metrics, max_recv_msg_size_mib...)

Anything else you can think of that would help is also welcome :)

did you want to leave this open until the release or mark it closed?

I think this issue can be marked as solved. I would still want to keep the discussion going to help the folks at grpc-go and prevent further similar issues.

@neeej

neeej commented Jan 19, 2024

I have an AWS ALB configured with a listener on HTTPS:4317 (and a certificate attached to it).
That ALB forwards to an AWS target group (TG) defined with port HTTP:4317 and gRPC as the protocol.

The TG has a health check configured to accept gRPC codes 0-99 as valid.
The AMIs launched for this via an ASG (connected to the TG) are prebuilt on al2023 with otelcol-contrib installed, and are then configured on instance boot (this is so I can scale in/out).

I use a form of this as my AWS CF template, if that might help:
https://pastebin.pl/view/15d2c892

I looked a bit further at my logs, and it does crash the same way as for @tcoville-coveo.
It's just systemd that keeps it running, so it looks like it runs as it should, except that it restarts all the time.

So, I stripped my config to this, and I still see the same issue:

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: "https://otlp.domain:4317"

service:
  extensions: []
  pipelines:
    traces/otlp:
      receivers: [otlp]
      exporters: [otlp]
      processors: []

I also tried installing the otelcol rpm instead and used the above config; that also gave me the same result.

I tried running otelcol-contrib once with each of the options below, and got the same result each time:

OTELCOL_OPTIONS="--config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.disableHighCardinalityMetrics"
OTELCOL_OPTIONS="--config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.useOtelForInternalMetrics"
OTELCOL_OPTIONS="--config=/etc/otelcol-contrib/config.yaml --feature-gates=-telemetry.useOtelWithSDKConfigurationForInternalTelemetry"

Then I installed https://github.com/rmedvedev/grpcdump and got the below output from it:

TEST [root@opentelemetry-collector v338 eu-west-1: grpcdump]$ go run cmd/grpcdump/main.go -i ens5 -p 4317
INFO[0000] Starting sniff ethernet packets at interface ens5 on port 4317
172.27.182.41:16706 -> 172.27.182.110:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.8:38552 -> 172.27.182.110:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
ERRO[0003] Error in try to read http2: Error reading frame: unexpected EOF
172.27.182.41:16716 -> 172.27.182.110:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
ERRO[0006] Error in try to read http2: Error reading frame: unexpected EOF
172.27.182.41:3594 -> 172.27.182.110:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.110:4317 -> 172.27.182.41:3594  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
ERRO[0022] Error in try to read http2: Error reading frame: unexpected EOF

Compared to the output I get from an instance running 0.91.0 (with the config from my first post in this issue), it looks very different:

INFO[0000] Starting sniff ethernet packets at interface ens5 on port 4317
172.27.182.8:59638 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.8:59638  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.41:49472 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.41:49472  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.8:59640 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.8:59640  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.41:49488 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.41:49488  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.125:4317 -> 172.27.182.41:49494   map[:status:200 content-type:application/grpc grpc-message: grpc-status:0]
172.27.182.8:43914 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]
172.27.182.125:4317 -> 172.27.182.8:43914  <nil> map[:status:200 content-type:application/grpc grpc-message:malformed method name: "/" grpc-status:12]
172.27.182.41:5922 -> 172.27.182.125:4317 /  map[:method:POST :path:/ :scheme:http accept-encoding:gzip, compressed content-length:5 content-type:application/grpc host:AWSALB te:trailers user-agent:ELB-HealthChecker/2.0]

Let me know if I should test something more.
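
Purely as a hypothetical starting point for a local repro (it has not been confirmed to trigger the panic), the health-check request shape captured above could be replayed with a small Go client; the 127.0.0.1:4317 address, the plaintext (non-TLS) receiver, and the header values are assumptions taken from the grpcdump output:

package main

import (
	"bytes"
	"crypto/tls"
	"fmt"
	"io"
	"net"
	"net/http"

	"golang.org/x/net/http2"
)

func main() {
	// 5-byte gRPC frame: uncompressed flag plus a zero-length message,
	// matching the content-length:5 seen in the capture.
	body := bytes.NewReader([]byte{0, 0, 0, 0, 0})

	// h2c client (HTTP/2 without TLS), since the target group speaks plain
	// gRPC to the instance in this setup.
	client := &http.Client{Transport: &http2.Transport{
		AllowHTTP: true,
		DialTLS: func(network, addr string, _ *tls.Config) (net.Conn, error) {
			return net.Dial(network, addr)
		},
	}}

	// Mimic the ELB health checker: POST to "/" with gRPC headers.
	req, err := http.NewRequest(http.MethodPost, "http://127.0.0.1:4317/", body)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/grpc")
	req.Header.Set("TE", "trailers")
	req.Header.Set("User-Agent", "ELB-HealthChecker/2.0")

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain the body so trailers are populated

	fmt.Println("http status:", resp.Status)
	fmt.Println("grpc-status:", resp.Header.Get("Grpc-Status"), resp.Trailer.Get("Grpc-Status"))
	fmt.Println("grpc-message:", resp.Header.Get("Grpc-Message"), resp.Trailer.Get("Grpc-Message"))
}

Against 0.91.0 this should just get the grpc-status 12 "malformed method name" reply shown above; if the ALB probe theory is right, repeatedly running it against a 0.92.0 receiver might reproduce the crash.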

@mx-psi
Member

mx-psi commented Jan 22, 2024

Thanks @neeej, this is helpful! cc @zasweq in case the log messages give you any hint

TylerHelmuth pushed a commit to TylerHelmuth/opentelemetry-collector that referenced this issue Jan 23, 2024
Potentially fix open-telemetry#9296

---------

Signed-off-by: Alex Boten <aboten@lightstep.com>
@tcoville-coveo
Author

I confirm that the issue is fixed for me on 0.93.0 🎉

@mx-psi
Member

mx-psi commented Jan 26, 2024

I am going to close this as done, further discussion can happen over at grpc/grpc-go#6928. Thanks again everybody :)

@mx-psi mx-psi closed this as completed Jan 26, 2024