feat(inputs.execd): allow failures on cmd start #14244

ajw1980 · 2023-11-03T15:04:47Z

Relevant telegraf.conf

[[inputs.execd]]
  command = ["/opt/adm/sbin/status.py"]
  signal = "none"
  restart_delay = "10s"
  data_format = "influx"

Logs from Telegraf

Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "Traceback (most recent call last):"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 248, in <module>"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    main()"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 240, in main"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    xrst.poll_status()"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 49, in poll_status"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    self.get_systemd_status()"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 163, in get_systemd_status"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    pipe = Popen(systemctl_cmd, stdout=PIPE, stderr=DEVNULL, universal_newlines=True)"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 709, in __init__"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session)"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 1275, in _execute_child"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session, preexec_fn)"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "OSError: [Errno 12] Cannot allocate memory"
Nov  3 09:38:43 machine telegraf[24194]: 2023-11-03T14:38:43Z E! [inputs.execd] Process /opt/adm/sbin/status.py exited: exit status 1
Nov  3 09:38:43 machine telegraf[24194]: 2023-11-03T14:38:43Z I! [inputs.execd] Restarting in 10s...
Nov  3 09:38:53 machine telegraf[24194]: 2023-11-03T14:38:53Z I! [inputs.execd] Starting process: /opt/adm/sbin/status.py []
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "Traceback (most recent call last):"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 248, in <module>"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    main()"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 240, in main"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    xrst.poll_status()"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 49, in poll_status"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    self.get_systemd_status()"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 163, in get_systemd_status"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    pipe = Popen(systemctl_cmd, stdout=PIPE, stderr=DEVNULL, universal_newlines=True)"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 709, in __init__"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session)"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 1275, in _execute_child"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session, preexec_fn)"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "OSError: [Errno 12] Cannot allocate memory"
Nov  3 09:39:43 machine telegraf[24194]: 2023-11-03T14:39:43Z E! [inputs.execd] Process /opt/adm/sbin/status.py exited: exit status 1
Nov  3 09:39:43 machine telegraf[24194]: 2023-11-03T14:39:43Z I! [inputs.execd] Restarting in 10s...
Nov  3 09:39:54 machine telegraf[24194]: 2023-11-03T14:39:53Z I! [inputs.execd] Starting process: /opt/adm/sbin/status.py []
Nov  3 09:39:54 machine telegraf[24194]: 2023-11-03T14:39:54Z E! [inputs.execd] Process quit with message: error starting process: fork/exec /opt/adm/sbin/status.py: cannot allocate memory

System info

telegraf 1.21.3 fedora

Docker

No response

Steps to reproduce

Create an execd input.
Have the system fail in a way that prevents processes from starting properly (for example, out of memory).

Expected behavior

execd process should always be restarted

Actual behavior

execd process was not restarted.

Additional info

In certain instances where a system problem prevents processes from starting, an execd plugin process will not get restarted. In this case the machine ran out of memory and the execd process stopped. It seems that if the process starts and then exits, telegraf will restart it, but if telegraf fails to even start the process, it will no longer be restarted.


powersj · 2023-11-03T15:23:09Z

> execd process should always be restarted

From the code, we will continuously try to restart, except on errors from running the cmd start. If telegraf cannot start an input plugin, or in this case, start the execd command that you want us to, then telegraf will fail. This is the expected behavior in general, as it makes little sense to continue running if we cannot start a plugin that you expect to provide data.

However, we have other feature requests to enable settings on a per-plugin basis that would allow ignoring errors on startup, and we can do that here as well.

srebhan · 2024-04-10T16:23:39Z

Trying to reproduce the issue really gives me a headache. It seems like the startup only fails in cases where the OS hard-terminates the executed process. Such events are severe, like out-of-memory or maybe segfaults. I don't think we should handle those cases, as the kernel rightfully terminated the process, maybe even in an uncontrolled way (as in the OOM case).

ajw1980 · 2024-04-12T13:38:41Z

Yeah, it would most likely be memory issues on the system that cause this situation. Maybe there should be an option to just shut down telegraf with an error state if execd fails? This would at least make it more obvious that something failed, and on systemd hosts the service would get restarted if configured to do so.

srebhan · 2024-04-30T12:02:38Z

@ajw1980 please test the binary in PR #15271, available as soon as CI finishes the tests, and let me know if this fixes your issue. The new option is called stop_on_error and it needs to be set to true for your use case.

ajw1980 · 2024-05-03T20:21:37Z

I downloaded it and added the config option. It seems like this only stops the execd plugin, not telegraf itself, right?

srebhan · 2024-05-06T07:31:26Z

Exactly. Once Telegraf has started, it is impossible to stop it completely as things are now.

srebhan · 2024-05-08T18:47:15Z

@ajw1980 are you good with the fix?

ajw1980 · 2024-05-09T16:37:18Z

That option doesn't really address this issue. The execd input will already not relaunch the command if there is some sort of system (out of memory) error.

powersj · 2024-05-16T16:37:09Z

@ajw1980,

Unfortunately, there is no way to kill Telegraf on a failure when calling the script. With @srebhan's PR we could kill the plugin, but not all of Telegraf.

We are left with taking that PR as an additional option or closing this as won't fix unless some other solution comes up.

ajw1980 added the bug label Nov 3, 2023

powersj added the feature request, help wanted, and size/m labels and removed the bug label Nov 3, 2023

powersj changed the title from "execd plugin doesn't restart if process quits" to "feat(inputs.execd): allow failures on cmd start" Nov 3, 2023

srebhan added the waiting for response label Apr 10, 2024

telegraf-tiger bot removed the waiting for response label Apr 12, 2024

srebhan mentioned this issue Apr 30, 2024

feat(inputs.execd): Add option to not restart program on error #15271

Merged

srebhan added the waiting for response label May 8, 2024

srebhan self-assigned this May 8, 2024

telegraf-tiger bot removed the waiting for response label May 9, 2024

powersj added the waiting for response label May 16, 2024

DStrand1 closed this as completed in #15271 May 23, 2024
