feat(inputs.execd): allow failures on cmd start #14244

Closed
ajw1980 opened this issue Nov 3, 2023 · 9 comments · Fixed by #15271

Assignees: srebhan
Labels: feature request, help wanted, size/m (2-4 day effort), waiting for response

Comments

ajw1980 commented Nov 3, 2023

Relevant telegraf.conf

[[inputs.execd]]
  command = ["/opt/adm/sbin/status.py"]
  signal = "none"
  restart_delay = "10s"
  data_format = "influx"

Logs from Telegraf

Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "Traceback (most recent call last):"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 248, in <module>"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    main()"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 240, in main"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    xrst.poll_status()"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 49, in poll_status"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    self.get_systemd_status()"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 163, in get_systemd_status"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    pipe = Popen(systemctl_cmd, stdout=PIPE, stderr=DEVNULL, universal_newlines=True)"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 709, in __init__"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session)"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 1275, in _execute_child"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session, preexec_fn)"
Nov  3 09:38:42 machine telegraf[24194]: 2023-11-03T14:38:42Z E! [inputs.execd] stderr: "OSError: [Errno 12] Cannot allocate memory"
Nov  3 09:38:43 machine telegraf[24194]: 2023-11-03T14:38:43Z E! [inputs.execd] Process /opt/adm/sbin/status.py exited: exit status 1
Nov  3 09:38:43 machine telegraf[24194]: 2023-11-03T14:38:43Z I! [inputs.execd] Restarting in 10s...
Nov  3 09:38:53 machine telegraf[24194]: 2023-11-03T14:38:53Z I! [inputs.execd] Starting process: /opt/adm/sbin/status.py []
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "Traceback (most recent call last):"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 248, in <module>"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    main()"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 240, in main"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    xrst.poll_status()"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 49, in poll_status"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    self.get_systemd_status()"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/opt/adm/sbin/status.py\", line 163, in get_systemd_status"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    pipe = Popen(systemctl_cmd, stdout=PIPE, stderr=DEVNULL, universal_newlines=True)"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 709, in __init__"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session)"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "  File \"/usr/lib64/python3.6/subprocess.py\", line 1275, in _execute_child"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "    restore_signals, start_new_session, preexec_fn)"
Nov  3 09:39:42 machine telegraf[24194]: 2023-11-03T14:39:42Z E! [inputs.execd] stderr: "OSError: [Errno 12] Cannot allocate memory"
Nov  3 09:39:43 machine telegraf[24194]: 2023-11-03T14:39:43Z E! [inputs.execd] Process /opt/adm/sbin/status.py exited: exit status 1
Nov  3 09:39:43 machine telegraf[24194]: 2023-11-03T14:39:43Z I! [inputs.execd] Restarting in 10s...
Nov  3 09:39:54 machine telegraf[24194]: 2023-11-03T14:39:53Z I! [inputs.execd] Starting process: /opt/adm/sbin/status.py []
Nov  3 09:39:54 machine telegraf[24194]: 2023-11-03T14:39:54Z E! [inputs.execd] Process quit with message: error starting process: fork/exec /opt/adm/sbin/status.py: cannot allocate memory

System info

telegraf 1.21.3 fedora

Docker

No response

Steps to reproduce

Create an execd input.
Have the system fail in a way that prevents processes from starting properly (e.g. out of memory).

Expected behavior

execd process should always be restarted

Actual behavior

execd process was not restarted.

Additional info

In certain instances where a system problem prevents processes from starting, an execd plugin process will not get restarted. In this case the machine ran out of memory and the execd process stopped. It would seem that if the process starts and then exits, Telegraf will restart it, but if Telegraf fails to even start the process, it will no longer be restarted.

ajw1980 added the bug label Nov 3, 2023

powersj (Contributor) commented Nov 3, 2023

> execd process should always be restarted

From the code, we continuously try to restart the process, except when starting the command itself returns an error. If Telegraf cannot start an input plugin or, as in this case, the execd command you configured, then Telegraf will fail. This is the expected behavior in general, as it makes little sense to continue running if we cannot start a plugin that you expect to provide data.

However, we have other feature requests to add per-plugin settings that would allow ignoring errors on startup, and we can do that here as well.
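
To illustrate the asymmetry described above, here is a minimal Python sketch of such a supervision loop. It is not Telegraf's actual code (Telegraf is written in Go); the `supervise` function, its parameters, and the event strings are hypothetical and only mimic the log messages in the report. A child that starts and then exits is restarted after the delay, while an `OSError` raised while starting the child ends the loop for good.

```python
import subprocess
import time
from typing import Callable, List


def supervise(
    start: Callable[[], "subprocess.Popen"],
    restart_delay: float,
    max_runs: int,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run a child process repeatedly and return a log of what happened.

    Hypothetical illustration only: `start` stands in for the fork/exec
    of the configured command, `max_runs` bounds the loop for the demo.
    """
    events: List[str] = []
    for _ in range(max_runs):
        try:
            proc = start()  # may raise OSError, e.g. ENOMEM on fork/exec
        except OSError as err:
            # Start failure: the loop ends here and nothing is restarted.
            events.append(f"quit: error starting process: {err}")
            return events
        code = proc.wait()  # the child did start and has now exited
        events.append(f"exited: exit status {code}")
        events.append(f"restarting in {restart_delay}s")
        sleep(restart_delay)
    return events
```

With `restart_delay=10` and a `start` callable that raises `OSError` on its third call, the returned events are two exit/restart pairs followed by a single terminal `quit` entry, which is the same shape as the log in the report.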

powersj added the feature request, help wanted, and size/m labels and removed the bug label Nov 3, 2023
powersj changed the title from "execd plugin doesn't restart if process quits" to "feat(inputs.execd): allow failures on cmd start" Nov 3, 2023

srebhan (Contributor) commented Apr 10, 2024

Trying to reproduce the issue really gives me a headache. It seems like the startup only fails in cases where the OS hard-terminates the executed process. Such events are severe, like out-of-memory conditions or maybe segfaults. I don't think we should handle those cases, as the kernel rightfully terminated the process, maybe even in an uncontrolled way (as in the OOM case).

srebhan added the waiting for response label Apr 10, 2024

ajw1980 (Author) commented Apr 12, 2024

Yeah, it would most likely be memory issues on the system that cause this situation. Maybe there should be an option to just shut down Telegraf with an error state if execd fails? This would at least make it more obvious that something failed, and on systemd hosts the service would get restarted if configured to do so.
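
Without any change to Telegraf, one way to make this failure more visible from the outside is to watch the log for the terminal message shown in the report. The following is a minimal sketch under the assumption that the message format stays as in the log excerpt above (taken from a single Telegraf 1.21.3 log); the pattern and the `find_start_failure` function are hypothetical helpers, not part of Telegraf.

```python
import re
from typing import Iterable, Optional

# Matches the terminal message from the log excerpt in the report.
# Assumption: the wording is stable across Telegraf versions.
START_FAILURE = re.compile(
    r"E! \[inputs\.execd\] Process quit with message: "
    r"error starting process: (?P<reason>.+)$"
)


def find_start_failure(lines: Iterable[str]) -> Optional[str]:
    """Return the reason from the first start-failure line, or None."""
    for line in lines:
        match = START_FAILURE.search(line.rstrip("\n"))
        if match:
            return match.group("reason")
    return None
```

Fed with the Telegraf log (for example from standard input), a non-`None` result could trigger an alert or a service restart. Ordinary "exited: exit status 1" lines are ignored, since Telegraf already restarts the process in that case.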

telegraf-tiger (bot) removed the waiting for response label Apr 12, 2024

srebhan (Contributor) commented Apr 30, 2024

@ajw1980 please test the binary in PR #15271, available as soon as CI has finished the tests, and let me know if this fixes your issue. The new option is called stop_on_error and needs to be set to true for your use case.

ajw1980 (Author) commented May 3, 2024

I downloaded it and added the config option. It seems like this only stops the execd plugin, not Telegraf itself, right?

srebhan (Contributor) commented May 6, 2024

Exactly. Once Telegraf has started, it is impossible to stop it completely as things stand now.

srebhan (Contributor) commented May 8, 2024

@ajw1980 are you good with the fix?

srebhan added the waiting for response label May 8, 2024
srebhan self-assigned this May 8, 2024

ajw1980 (Author) commented May 9, 2024

That option doesn't really address this issue. The execd input already does not relaunch the command if there is some sort of system error (such as out of memory).

telegraf-tiger (bot) removed the waiting for response label May 9, 2024

powersj (Contributor) commented May 16, 2024

@ajw1980,

Unfortunately, there is no way to kill Telegraf in response to a failure when calling the script. With @srebhan's PR we could kill the plugin, but not all of Telegraf.

We are left with taking that PR as an additional option or closing this as won't fix unless some other solution comes up.

powersj added the waiting for response label May 16, 2024