Device open failed, device did not return an IDENTIFY DEVICE structure, #91

Lusitaniae · 2022-10-27T04:55:29Z

Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=readjson.go:69 level=warn msg="S.M.A.R.T. output reading" err="exit status 2"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=readjson.go:122 level=error msg="Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=main.go:57 level=error msg="Error collecting SMART data" err="smartctl returned bad data for device /dev/sdb"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=main.go:57 level=error msg="Error collecting SMART data" err="Device /dev/bus/0 unavialable"
Oct 27 04:45:22 host smart-exporter[19557]: ts=2022-10-27T04:45:22.306Z caller=main.go:57 level=error msg="Error collecting SMART data" err="Device /dev/bus/0 unavialable"

/usr/local/bin/smartctl_exporter  --version
smartctl_exporter, version 0.9.0 (branch: HEAD, revision: 0f32489b4018a21747109a33d7297c1ed85e10ab)
  build user:       root@f07a6d7b35c8
  build date:       20221020-16:19:31
  go version:       go1.18.7
  platform:         linux/amd64

Constantly seeing NVMe drives fail due to heavy load.

Usually I see something like the output below in dmesg.

But it seems smartctl_exporter doesn't pick up any of this? (It could be the smart tool itself, too.)

At the very least, shouldn't it report some kind of error if it can't scan the drive?

(Are metrics not reset to 0 when the exporter can't scan the drive again?)

[Wed Oct 26 06:18:54 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:18:59 2022] nvme nvme0: I/O 718 QID 11 timeout, aborting
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:00 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 529 QID 34 timeout, aborting
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 530 QID 34 timeout, aborting
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 544 QID 34 timeout, aborting
[Wed Oct 26 06:19:31 2022] nvme nvme0: I/O 545 QID 34 timeout, aborting
...
[Wed Oct 26 06:20:17 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:20:19 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:20:19 2022] nvme nvme0: Abort status: 0x0
[Wed Oct 26 06:20:19 2022] blk_update_request: I/O error, dev nvme0n1, sector 1875858760 op 0x1:(WRITE) flags 0x1800 phys_seg 1 prio class 0
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): log I/O error -5
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): xfs_do_force_shutdown(0x2) called from line 1250 of file fs/xfs/xfs_log.c. Return address = 00000000dbc93c6d
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): Log I/O Error Detected. Shutting down filesystem
[Wed Oct 26 06:20:19 2022] XFS (nvme0n1p1): Please unmount the filesystem and rectify the problem(s)
[Wed Oct 26 06:20:19 2022] nvme nvme0: Abort status: 0x0
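The timeout lines above follow a regular pattern, so they can be counted independently of SMART data. A minimal sketch, assuming the dmesg output is available as text; `count_nvme_timeouts` is a hypothetical helper and not part of smartctl_exporter:

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   nvme nvme0: I/O 718 QID 11 timeout, aborting
TIMEOUT_RE = re.compile(r"nvme (nvme\d+): I/O \d+ QID \d+ timeout, aborting")


def count_nvme_timeouts(dmesg_text):
    """Return a Counter mapping controller name to number of I/O timeouts."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the output of `dmesg -T` would give a per-controller timeout count that could be alerted on separately from the exporter's metrics.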

curl localhost:9633/metrics -s | grep crit | grep -v "#"
critical_warning{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
critical_warning{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0
smartctl_device_critical_warning{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
smartctl_device_critical_warning{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0

curl localhost:9633/metrics -s | grep err | grep -v "#"
media_errors{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
media_errors{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0
smartctl_device_media_errors{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 0
smartctl_device_media_errors{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 0
smartctl_device_num_err_log_entries{device="/dev/nvme0",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="6150A061TCD8"} 113
smartctl_device_num_err_log_entries{device="/dev/nvme1",model_family="",model_name="Dell Ent NVMe CM6 RI 1.92TB",serial_number="51B0A02UTCD8"} 209

Also small nitpicks:

typo: unavialable
Should there be a metric exposing the smartctl_exporter version?


robryk · 2023-01-10T12:52:59Z

I encountered the same behaviour when smartctl_exporter was running as a user that couldn't open the device:

the errors were logged to stderr,
there was no indication of the errors in metrics (just as if I never asked it to scan that device).

Regardless of the reason for errors, I think it's a bug that they are not explicitly reported in metrics.
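One way to make such failures visible would be a per-device status gauge. The sketch below renders one in the Prometheus text exposition format. The metric name `smartctl_device_scrape_success` and the function `render_scrape_status` are assumptions made for illustration; nothing in this thread says the exporter provides them.

```python
def render_scrape_status(results):
    """Render a hypothetical per-device scrape status gauge.

    results maps a device path to an error string, or to None when the
    last smartctl run for that device succeeded.
    """
    lines = [
        "# HELP smartctl_device_scrape_success Hypothetical: 1 if the last "
        "smartctl run for the device succeeded, 0 otherwise.",
        "# TYPE smartctl_device_scrape_success gauge",
    ]
    for device in sorted(results):
        value = 0 if results[device] else 1
        lines.append(
            f'smartctl_device_scrape_success{{device="{device}"}} {value}'
        )
    return "\n".join(lines) + "\n"
```

With a gauge like this, a device that cannot be opened would show up as a 0 sample rather than silently disappearing from the output.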

NiceGuyIT · 2023-08-26T19:43:04Z

Hi @Lusitaniae. "Device open failed, device did not return an IDENTIFY DEVICE structure, or device is in a low-power mode" comes from smartctl itself, as documented in its exit codes. When that happens, smartctl_exporter logs an error and does not produce any metrics for that device. As @robryk mentioned, this can also happen if you run the exporter as a user that doesn't have permission to open the device.
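For reference, the smartctl exit status is a bitmask, so "exit status 2" means only bit 1 is set. A minimal sketch of decoding it, with bit meanings paraphrased from the EXIT STATUS section of the smartctl man page; `decode_smartctl_exit` is a hypothetical helper, not code from the exporter:

```python
from typing import List

# Bit number -> meaning, paraphrased from the smartctl man page.
SMARTCTL_EXIT_BITS = {
    0: "Command line did not parse",
    1: ("Device open failed, device did not return an IDENTIFY DEVICE "
        "structure, or device is in a low-power mode"),
    2: ("Some SMART or other ATA command to the disk failed, or there was "
        "a checksum error in a SMART data structure"),
    3: "SMART status check returned DISK FAILING",
    4: "Prefail attributes found at or below threshold",
    5: ("SMART status is DISK OK, but some attributes were at or below "
        "threshold at some time in the past"),
    6: "The device error log contains records of errors",
    7: "The device self-test log contains records of errors",
}


def decode_smartctl_exit(status: int) -> List[str]:
    """Return the meaning of every bit set in a smartctl exit status."""
    return [
        message
        for bit, message in sorted(SMARTCTL_EXIT_BITS.items())
        if status & (1 << bit)
    ]
```

Calling `decode_smartctl_exit(2)` returns a single entry, the bit 1 message quoted in the log above.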

The two nitpicks have been fixed.

> there was no indication of the errors in metrics (just as if I never asked it to scan that device).
>
> Regardless of the reason for errors, I think it's a bug that they are not explicitly reported in metrics.

Hey @robryk, if you believe this is a bug, please open a separate issue to address it. One could argue environmental errors, such as permission errors, should not be included in the exporter output.

nazar-pc · 2023-10-25T09:25:39Z

I have devices in low power/standby mode, but smartctl -a and other commands still return data just fine.

The cause seems to be the --nocheck=standby argument, which results in exit status 2 and that error message. Why is it necessary? I suspect it is there to avoid waking up HDDs, but there are two problems:

SSDs should still be fine, so the exporter could remember which drives are SSDs and still query them.
This shouldn't break reporting for other devices, but I have 2 of these low power/standby SSDs (cheap Chinese ones) and 3 Samsung SSDs that are no longer being reported because of it.
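The first point could be sketched roughly as follows. This is an illustration under stated assumptions, not the exporter's actual logic: `always_query` is a hypothetical exception list of devices known to be SSDs, and both function names are made up. Only the smartctl flags themselves (`--json`, `--all`, `--nocheck=standby`) are real.

```python
def should_skip_if_standby(device, always_query):
    """Keep the standby check unless the device is in the exception list."""
    return device not in always_query


def build_smartctl_args(device, skip_if_standby):
    """Build a smartctl command line, optionally with --nocheck=standby."""
    args = ["smartctl", "--json", "--all"]
    if skip_if_standby:
        # With this flag smartctl exits early instead of waking a
        # device that reports being in standby.
        args.append("--nocheck=standby")
    args.append(device)
    return args


def commands_for(devices, always_query):
    """Return one smartctl command line per device."""
    return [
        build_smartctl_args(dev, should_skip_if_standby(dev, always_query))
        for dev in devices
    ]
```

Here a device listed in `always_query` would be queried without the standby check, while spinning disks would keep the current behaviour.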

nazar-pc · 2023-10-26T09:29:32Z

#61 requested --nocheck=standby and it was implemented in #74. I think it would be helpful to support exceptions for that option, because I have SSDs that report that they are sleeping even though they clearly have no moving parts.

robryk mentioned this issue Jan 10, 2023

smartctl_exporter ignores nvme devices by default NixOS/nixpkgs#210041

Open

onedr0p mentioned this issue Mar 19, 2023

Deploy smartctl_exporter on router onedr0p/home-ops#4895

Closed

robryk mentioned this issue Nov 15, 2023

systemd.go: Added systemd health metric prometheus-community/systemd_exporter#113

Open
