Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNMP panic #15200

Open
llamafilm opened this issue Apr 22, 2024 · 6 comments
Open

SNMP panic #15200

llamafilm opened this issue Apr 22, 2024 · 6 comments
Labels
bug unexpected problem or unintended behavior upstream bug or issues that rely on dependency fixes

Comments

@llamafilm
Copy link
Contributor

llamafilm commented Apr 22, 2024

Relevant telegraf.conf

[agent]
  snmp_translator = "gosmi"

Logs from Telegraf

Starting Telegraf...
telegraf.service: Can't open PID file /run/telegraf.pid (yet?) after start: Operation not permitted
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/inputs_snmp_trap.conf
2024-04-21T14:50:06Z W! DeprecationWarning: Option "timeout" of plugin "inputs.snmp_trap" deprecated since version 1.20.0 and will be removed in 2.0.0: unused option
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/netbox-cloud1.conf
2024-04-21T14:50:06Z W! DeprecationWarning: Option "fieldpass" of plugin "inputs.prometheus" deprecated since version 1.29.0 and will be removed in 2.0.0: use 'fieldinclude' instead
2024-04-21T14:50:06Z W! DeprecationWarning: Option "fieldpass" of plugin "inputs.prometheus" deprecated since version 1.29.0 and will be removed in 2.0.0: use 'fieldinclude' instead
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/outputs_grafanacloud.conf
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/outputs_loki.conf
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/processors.conf
2024-04-21T14:50:06Z I! Loading config: /apps/aragorn/telegraf.d/telegraf.conf
2024-04-21T14:50:06Z I! Starting Telegraf 1.29.5-66b924ec (customized) brought to you by InfluxData the makers of InfluxDB
2024-04-21T14:50:06Z I! Available plugins: 17 inputs, 0 aggregators, 7 processors, 5 parsers, 3 outputs, 0 secret-stores
2024-04-21T14:50:06Z I! Loaded inputs: cpu disk diskio exec (2x) internal mem net net_response netstat ping (66x) prometheus (8x) snmp (19x) snmp_trap swap system systemd_units
2024-04-21T14:50:06Z I! Loaded aggregators:
2024-04-21T14:50:06Z I! Loaded processors: converter (4x) enum (3x) regex (3x) rename (2x) scale (2x) starlark strings
2024-04-21T14:50:06Z I! Loaded secretstores:
2024-04-21T14:50:06Z I! Loaded outputs: http loki
2024-04-21T14:50:06Z I! Tags enabled: host=cloud1
2024-04-21T14:50:06Z W! Deprecated inputs: 0 and 1 options
2024-04-21T14:50:06Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"cloud1", Flush Interval:1m30s
2024-04-21T14:50:06Z D! [agent] Initializing plugins
2024-04-21T14:50:06Z I! Config watcher started
Started Telegraf.
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:06Z I! Config watcher started
2024-04-21T14:50:07Z I! [inputs.prometheus::9] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::36] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::47] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::62] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::78] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::79] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::5844] Using the label selector:  and field selector:
2024-04-21T14:50:07Z I! [inputs.prometheus::5915] Using the label selector:  and field selector:
2024-04-21T14:50:07Z W! DeprecationWarning: Value "false" for option "ignore_protocol_stats" of plugin "inputs.net" deprecated since version 1.27.3 and will be removed in 1.36.0: use the 'inputs.nstat' plugin instead
2024-04-21T14:50:07Z I! [processors.regex] tags: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] tags: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z I! [processors.regex] fields: Using explicit mode...
2024-04-21T14:50:07Z D! [agent] Connecting outputs
2024-04-21T14:50:07Z D! [agent] Attempting connection to [outputs.http::grafana-cloud]
2024-04-21T14:50:07Z D! [agent] Successfully connected to outputs.http::grafana-cloud
2024-04-21T14:50:07Z D! [agent] Attempting connection to [outputs.loki]
2024-04-21T14:50:07Z D! [agent] Successfully connected to outputs.loki
2024-04-21T14:50:07Z D! [agent] Starting service inputs
2024-04-21T14:50:07Z I! [inputs.snmp_trap] Listening on udp://:162
2024-04-21T14:50:07Z E! [inputs.snmp::70] Error in plugin: agent 10.92.177.47: performing get on field Cat1600SerialNumber: error reading from socket: read udp 100.65.16.139:11296->10.92.177.47:161: recvfrom: connection refused
2024-04-21T14:50:07Z E! [inputs.snmp::71] Error in plugin: agent 10.92.177.48: performing get on field Cat1600SerialNumber: error reading from socket: read udp 100.65.16.139:36407->10.92.177.48:161: recvfrom: connection refused
2024-04-21T14:50:07Z D! [outputs.http::grafana-cloud] Wrote batch of 1000 metrics in 237.499482ms
2024-04-21T14:50:07Z D! [outputs.http::grafana-cloud] Buffer fullness: 304 / 1000000 metrics
2024-04-21T14:50:07Z D! [processors.starlark] Removed string field: serverSoftwareVersion from snmp_adder_aim
panic: interface conversion: interface {} is nil, not []uint8
goroutine 642 [running]:
github.com/sleepinggenius2/gosmi/models.GetEnumBitsFormatted({0x0?, 0x0?}, 0x90?, 0xc00221da20?)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format_bits.go:31 +0x598
github.com/sleepinggenius2/gosmi/models.Type.FormatValue({0xb, 0x1, {0x0, 0x0}, 0xc002112340, {0x0, 0x0}, {0x22a2cb2, 0x4}, {0x3939620, ...}, ...}, ...)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format.go:163 +0x23c
github.com/sleepinggenius2/gosmi/models.Node.FormatValue(...)
        /home/builds/go/pkg/mod/github.com/sleepinggenius2/gosmi@v0.4.4/models/format.go:127
github.com/influxdata/telegraf/plugins/inputs/snmp.(*gosmiTranslator).SnmpFormatEnum(0xc0004c6037?, {0xc002143230?, 0x265ed10?}, {0x0, 0x0}, 0x0)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/gosmi.go:139 +0x338
github.com/influxdata/telegraf/plugins/inputs/snmp.fieldConvert({0x267f628, 0x3939620}, {0xc0004c6037, 0x4}, {{0x0, 0x0}, {0xc002143230, 0x23}, 0x5})
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:771 +0xb27
github.com/influxdata/telegraf/plugins/inputs/snmp.Table.Build({{0x22a29b6, 0x4}, {0x0, 0x0, 0x0}, 0x0, {0xc0001fb208, 0x10, 0x10}, {0x0, ...}, ...}, ...)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:439 +0x685
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).gatherTable(0xc00047db00, {0x2699be0, 0xc0003b1e80}, {0x268e2d0, 0xc002396f20}, {{0x22a29b6, 0x4}, {0x0, 0x0, 0x0}, ...}, ...)
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:355 +0xa7
github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather.func1(0xc00257cfd0?, {0xc0004b0de1, 0xc})
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:337 +0x20b
created by github.com/influxdata/telegraf/plugins/inputs/snmp.(*Snmp).Gather in goroutine 641
        /data/agent/workspace/MSE-aragorn-publish/build/telegraf/plugins/inputs/snmp/snmp.go:323 +0x66
telegraf.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
telegraf.service: Failed with result 'exit-code'.
telegraf.service: Consumed 1.379s CPU time.
telegraf.service: Scheduled restart job, restart counter is at 2.
Stopped Telegraf.
telegraf.service: Consumed 1.379s CPU time.

System info

Telegraf 1.29.5-66b924ec, Ubuntu 22.04.4

Docker

No response

Steps to reproduce

Unknown

Expected behavior

no crash

Actual behavior

Telegraf has been running for several days under systemd, and this weekend it crashed. Systemd tried to restart it several times, and it kept crashing repeatedly. This log snippet from journald shows a full cycle, beginning after the first crash, until it crashes again. My telegraf config is several thousand lines long, so I'm not sure which part is relevant here. I have dozens of different SNMP devices with different input configs and processors.

There was a power outage Saturday morning, about 24 hours before this crash occurred, so it's likely some of the SNMP devices were in a bad state, but I can't reproduce it. This morning after restarting the service it's working fine.

Additional info

I built this telegraf binary using the custom builder to reduce the input and output plugins. But I did not customize anything else. So it's weird that the log references lines that don't exist like snmp.go:323.

@llamafilm llamafilm added the bug unexpected problem or unintended behavior label Apr 22, 2024
@srebhan
Copy link
Contributor

srebhan commented Apr 22, 2024

@llamafilm could you please try to reproduce this with latest master and maybe only the SNMP (and a file output) plugin?!? We shifted code for SNMP quite a bit between v1.29 and v1.30...

@Hipska
Copy link
Contributor

Hipska commented Apr 23, 2024

You need to look at the correct version of the source code, this is snmp.go:323: https://github.com/influxdata/telegraf/blob/v1.29.5/plugins/inputs/snmp/snmp.go#L323

Going though the stack trace, the panic actually happens here:
https://github.com/sleepinggenius2/gosmi/blob/v0.4.4/models/format_bits.go#L31
Which should be replaced by

	octets := v.Bytes()

@powersj
Copy link
Contributor

powersj commented Apr 23, 2024

Thanks for calling that out. I guess the question then is if Telegraf should make a change as well? If the value is nil, should Telegraf even be calling the format value function?

@Hipska
Copy link
Contributor

Hipska commented Apr 23, 2024

I would try to get that fixed upstream and see what the maintainers say.

@powersj
Copy link
Contributor

powersj commented Apr 24, 2024

I have put up issue sleepinggenius2/gosmi#44 and a PR sleepinggenius2/gosmi#45

Happy to have reviews or comments to those. I did not realize this library had not had a lot of updates in a while, so let's see if we get a response.

@powersj powersj added the upstream bug or issues that rely on dependency fixes label Apr 24, 2024
@Hipska
Copy link
Contributor

Hipska commented Apr 24, 2024

It appears like the maintainer didn't do much anymore lately. Let's see indeed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior upstream bug or issues that rely on dependency fixes
Projects
None yet
Development

No branches or pull requests

4 participants