smartctl_exporter does not notice if previous smartctl did not finish (hangs) #56

Open
NiceGuyIT opened this issue Aug 8, 2022 · 2 comments

Comments

@NiceGuyIT
Member

I ran into an incident recently where smartctl hung due to a bad disk. smartctl_exporter continued to spawn new smartctl processes to monitor the disk even though the previous run did not finish. smartctl_exporter should detect if the previous process finished before starting a new process. I don't know how many processes were spawned before I was aware of the situation. I'm guessing several hundred.

Here are the processes. I noticed that the PPID of the orphaned smartctl processes is 1, not the PID of smartctl_exporter.

$ ps -ef | grep smartctl
root       343     1  0 20:33 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root       369     1  0 20:33 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root       374     1  0 20:33 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root       687     1  0 20:35 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root       702     1  0 20:35 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
...
root      5483     1  0 21:16 ?        00:00:00 /usr/local/bin/smartctl_exporter
root      5513  5483  0 21:16 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root      5607  5260  0 21:17 pts/2    00:00:00 smartctl -a /dev/sdc
root      6014  4750  0 21:20 pts/1    00:00:00 grep --color=auto smartctl
root     29674     1  0 20:12 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root     29675     1  0 20:12 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root     29682     1  0 20:12 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root     29684     1  0 20:13 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
...
root     32744     1  0 20:32 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root     32746     1  0 20:32 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc
root     32751     1  0 20:32 ?        00:00:00 /usr/sbin/smartctl --json --xall /dev/sdc

It wasn't until I tried to stop the smartctl_exporter service that systemd tried to clean up the processes. Unfortunately, systemd could not kill the processes either.

Jul 04 21:14:05 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:14:05 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:14:39 sns5 smartctl_exporter[1721]: [Warning] S.M.A.R.T. output reading error: exit status 4
Jul 04 21:14:39 sns5 smartctl_exporter[1721]: [Warning] Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure
Jul 04 21:14:43 sns5 systemd[1]: Stopping smartctl exporter service...
Jul 04 21:15:08 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:15:08 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:15:28 sns5 systemd-logind[984]: New session 12131 of user suseuser.
Jul 04 21:15:28 sns5 systemd[1]: Started Session 12131 of user suseuser.
Jul 04 21:15:28 sns5 sshd[5042]: pam_unix(sshd:session): session opened for user suseuser by (uid=0)
Jul 04 21:15:43 sns5 sudo[5259]: suseuser : TTY=pts/2 ; PWD=/root ; USER=root ; COMMAND=/bin/bash
Jul 04 21:15:43 sns5 sudo[5259]: pam_unix(sudo-i:session): session opened for user root by suseuser(uid=5000)
Jul 04 21:16:11 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:16:11 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: State 'final-sigterm' timed out. Killing.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29674 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29675 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29682 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29684 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29688 (smartctl) with signal SIGKILL.
...
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 4741 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 4962 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Failed with result 'timeout'.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29674 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29675 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29682 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29684 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29688 (smartctl) remains running after unit stopped.
...
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4542 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4547 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4741 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4962 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: Stopped smartctl exporter service.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29674 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29675 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29682 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29684 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29688 (smartctl) in control group while starting unit. Ignoring.

What would I like to see happen?

It would be nice if smartctl_exporter checked whether the previous smartctl process for a device had exited before spawning a new one. Nothing can be done about smartctl hanging on a bad disk, but we can prevent smartctl_exporter from making things worse.
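
To illustrate the request, here is a minimal Go sketch of a per-device "in flight" guard. It is not taken from the smartctl_exporter code base: the `inFlight` type, its methods, and the `probe` callback are hypothetical names, and the callback stands in for wherever the exporter runs `smartctl --json --xall <device>`. The logs above show that even SIGKILL did not remove the hung processes, so a timeout alone would probably not be enough; the sketch only makes sure a new run is skipped while the previous one has not returned.

```go
package main

import (
	"fmt"
	"sync"
)

// inFlight tracks which devices currently have a probe outstanding.
// Hypothetical type for illustration; not part of smartctl_exporter.
type inFlight struct {
	mu      sync.Mutex
	devices map[string]bool
}

func newInFlight() *inFlight {
	return &inFlight{devices: make(map[string]bool)}
}

// tryAcquire marks device as busy and returns true, or returns false
// if a previous probe for the same device has not finished yet.
func (f *inFlight) tryAcquire(device string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.devices[device] {
		return false
	}
	f.devices[device] = true
	return true
}

// release marks device as idle again.
func (f *inFlight) release(device string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.devices, device)
}

// scrape runs probe for device unless the previous probe is still running.
// In a real exporter, probe would be the code that executes smartctl.
func (f *inFlight) scrape(device string, probe func(string) error) (skipped bool, err error) {
	if !f.tryAcquire(device) {
		return true, nil
	}
	defer f.release(device)
	return false, probe(device)
}

func main() {
	guard := newInFlight()
	started := make(chan struct{})
	unblock := make(chan struct{})
	done := make(chan struct{})

	// The first scrape simulates a smartctl call that hangs.
	go func() {
		defer close(done)
		guard.scrape("/dev/sdc", func(string) error {
			close(started)
			<-unblock
			return nil
		})
	}()
	<-started

	// A second scrape of the same device is skipped instead of piling up.
	skipped, _ := guard.scrape("/dev/sdc", func(string) error { return nil })
	fmt.Println("second scrape skipped:", skipped)

	// Once the first probe returns, scraping works again.
	close(unblock)
	<-done
	skipped, _ = guard.scrape("/dev/sdc", func(string) error { return nil })
	fmt.Println("third scrape skipped:", skipped)
}
```

With a guard like this, a disk that hangs smartctl would hold at most one stuck process per device rather than one per scrape, and other devices would be unaffected because the lock is only held while updating the map.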

@k0ste
Contributor

k0ste commented Aug 8, 2022

It would also be nice to blacklist a device that hangs smartctl once some threshold is reached, and to expose a metric for it, for example smartctl_device_blacklist set to 1.
This would prevent issues like the one @NiceGuyIT described, let the exporter continue monitoring healthy disks, and allow the blacklisted state to be handled via the metric.
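
A rough Go sketch of that idea follows, assuming the threshold counts consecutive hung runs per device. Everything except the metric name `smartctl_device_blacklist` is hypothetical: the `blacklist` type, its methods, the threshold value, and the `device` label are illustrative only, and the sketch prints the metric in text form rather than using a metrics library.

```go
package main

import (
	"fmt"
	"sync"
)

// blacklist counts consecutive hung probes per device.
// Hypothetical type for illustration; not part of smartctl_exporter.
type blacklist struct {
	mu        sync.Mutex
	threshold int
	hangs     map[string]int
}

func newBlacklist(threshold int) *blacklist {
	return &blacklist{threshold: threshold, hangs: make(map[string]int)}
}

// recordHang notes that a probe for device did not finish in time.
func (b *blacklist) recordHang(device string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.hangs[device]++
}

// recordSuccess resets the counter, so only consecutive hangs count.
func (b *blacklist) recordSuccess(device string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.hangs, device)
}

// blocked reports whether device has reached the hang threshold.
func (b *blacklist) blocked(device string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.hangs[device] >= b.threshold
}

// metricValue returns 1 for a blacklisted device and 0 otherwise.
func (b *blacklist) metricValue(device string) float64 {
	if b.blocked(device) {
		return 1
	}
	return 0
}

func main() {
	bl := newBlacklist(3) // assumed threshold: three consecutive hangs

	// Simulate three hung probes on one disk and a healthy second disk.
	for i := 0; i < 3; i++ {
		bl.recordHang("/dev/sdc")
	}
	bl.recordSuccess("/dev/sdd")

	for _, dev := range []string{"/dev/sdc", "/dev/sdd"} {
		if bl.blocked(dev) {
			fmt.Println("skipping blacklisted device", dev)
		}
		fmt.Printf("smartctl_device_blacklist{device=%q} %g\n", dev, bl.metricValue(dev))
	}
}
```

The exporter would check `blocked` before starting a probe and skip the device when it returns true, so healthy disks keep being scraped while the blacklisted one only reports the metric.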

@jantman

jantman commented Mar 22, 2023

FWIW, this feels like a pretty critical bug to me...
