I ran into an incident recently where smartctl hung due to a bad disk. smartctl_exporter continued to spawn new smartctl processes to monitor the disk even though the previous run did not finish. smartctl_exporter should detect if the previous process finished before starting a new process. I don't know how many processes were spawned before I was aware of the situation. I'm guessing several hundred.
Here are the processes. I noticed the PPID is 1, not the PID of smartctl_exporter.
It wasn't until I tried to stop the smartctl_exporter service that systemd tried to clean up the processes. Unfortunately, systemd could not kill the processes either.
Jul 04 21:14:05 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:14:05 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:14:39 sns5 smartctl_exporter[1721]: [Warning] S.M.A.R.T. output reading error: exit status 4
Jul 04 21:14:39 sns5 smartctl_exporter[1721]: [Warning] Some SMART or other ATA command to the disk failed, or there was a checksum error in a SMART data structure
Jul 04 21:14:43 sns5 systemd[1]: Stopping smartctl exporter service...
Jul 04 21:15:08 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:15:08 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:15:28 sns5 systemd-logind[984]: New session 12131 of user suseuser.
Jul 04 21:15:28 sns5 systemd[1]: Started Session 12131 of user suseuser.
Jul 04 21:15:28 sns5 sshd[5042]: pam_unix(sshd:session): session opened for user suseuser by (uid=0)
Jul 04 21:15:43 sns5 sudo[5259]: suseuser : TTY=pts/2 ; PWD=/root ; USER=root ; COMMAND=/bin/bash
Jul 04 21:15:43 sns5 sudo[5259]: pam_unix(sudo-i:session): session opened for user root by suseuser(uid=5000)
Jul 04 21:16:11 sns5 kernel: ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jul 04 21:16:11 sns5 kernel: ata5.00: configured for UDMA/33
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: State 'final-sigterm' timed out. Killing.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29674 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29675 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29682 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29684 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 29688 (smartctl) with signal SIGKILL.
...
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 4741 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Killing process 4962 (smartctl) with signal SIGKILL.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Failed with result 'timeout'.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29674 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29675 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29682 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29684 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 29688 (smartctl) remains running after unit stopped.
...
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4542 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4547 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4741 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Unit process 4962 (smartctl) remains running after unit stopped.
Jul 04 21:16:14 sns5 systemd[1]: Stopped smartctl exporter service.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29674 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29675 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29682 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29684 (smartctl) in control group while starting unit. Ignoring.
Jul 04 21:16:14 sns5 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Jul 04 21:16:14 sns5 systemd[1]: smartctl_exporter.service: Found left-over process 29688 (smartctl) in control group while starting unit. Ignoring.
What would I like to see happen?
It would be nice if smartctl_exporter checked if the previous process exited before spawning a new smartctl process. Nothing can be done for smartctl hanging on a bad disk but we can prevent smartctl_exporter from making things worse.
It would also be nice to blacklist a device that hangs smartctl once some threshold is reached, and to set a metric, for example smartctl_device_blacklist, to 1.
This would prevent issues like the one @NiceGuyIT described, let the exporter continue monitoring the healthy disks, and let the bad disk be handled via the blacklist metric.
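The threshold idea could be sketched roughly as below. Everything here is hypothetical: the `blacklist` type, its method names, the `device` label, and the hand-written exposition line are assumptions for illustration, and only the metric name smartctl_device_blacklist comes from the suggestion above. A real exporter would more likely register a gauge through its Prometheus client library.

```go
package main

import (
	"fmt"
	"sync"
)

// blacklist (hypothetical) counts failed or hung scans per device and
// reports a device as blocked once the count reaches the threshold.
type blacklist struct {
	mu        sync.Mutex
	threshold int
	failures  map[string]int
}

func newBlacklist(threshold int) *blacklist {
	return &blacklist{threshold: threshold, failures: make(map[string]int)}
}

// recordFailure notes one more failed or hung scan for device.
func (b *blacklist) recordFailure(device string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[device]++
}

// recordSuccess clears the counter, which also lifts the block.
func (b *blacklist) recordSuccess(device string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.failures, device)
}

// blocked reports whether device should be skipped on the next scrape.
func (b *blacklist) blocked(device string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.failures[device] >= b.threshold
}

// metricLine renders the state as one line of Prometheus text format:
// 1 when the device is blacklisted, 0 otherwise.
func (b *blacklist) metricLine(device string) string {
	value := 0
	if b.blocked(device) {
		value = 1
	}
	return fmt.Sprintf("smartctl_device_blacklist{device=%q} %d", device, value)
}
```

The scrape loop would check `blocked(device)` before starting smartctl and call `recordFailure` when a scan times out or is still in flight, so healthy disks keep being monitored while the bad one is skipped and flagged.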