Failed to collect raid controller device S.M.A.R.T data #89

sdragon83 · 2022-10-19T05:39:51Z

I tried to collect data from a server with a raid controller through smartctl exporter.

However, an error occurred as below.

How can I collect S.M.A.R.T data on raid controller devices?

tomazb · 2022-11-10T15:57:09Z

Yes, this is the real reason why you need such a service in the first place - to monitor devices that are not easily visible inside the operating system.

marpears · 2022-11-14T16:43:02Z

If the device type could be retrieved and passed into the function readSMARTctl, it could be used with the --device flag, which would be a safer way of scanning all device types. For example:

smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --device megaraid,0 /dev/bus/1

josefzahner · 2022-12-02T09:44:29Z

@marpears I can read the device info with smartctl including the device option, but NOT with smartctl_exporter...

$ smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --device cciss,1 /dev/sdb
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      3
    ],
    "svn_revision": "5338",
    "platform_info": "x86_64-linux-3.10.0-957.27.2.el7.x86_64",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--json",
...

but this doesn't work:

$ smartctl_exporter --smartctl.device='cciss,1 /dev/sdb'
ts=2022-12-02T09:40:45.718Z caller=main.go:90 level=info msg="Starting smartctl_exporter" version="(version=0.9.1, branch=HEAD, revision=a58c632ea8fa0f4f10a9ac9e941e610a7bb2efc1)"
ts=2022-12-02T09:40:45.718Z caller=main.go:91 level=info msg="Build context" build_context="(go=go1.19.3, user=root@fa2a9a938fb5, date=20221106-21:46:18)"
ts=2022-12-02T09:40:45.735Z caller=main.go:112 level=warn msg="Device unavailable" name="cciss,1 /dev/sdb"
ts=2022-12-02T09:40:45.735Z caller=main.go:119 level=info msg="No devices specified, trying to load them automatically"
ts=2022-12-02T09:40:45.735Z caller=main.go:124 level=error msg="No devices found"

lahwaacz · 2022-12-09T15:44:21Z

@josefzahner The --smartctl.device flag in smartctl_exporter does not translate to the --device flag of smartctl. The exporter expects just the /dev/ node path. Also note that --device cciss,1 /dev/sdb is three distinct arguments on the command line, so you can't pass all of that to --smartctl.device.
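
To illustrate the point about distinct arguments, here is a minimal Python sketch (not the exporter's code, which is written in Go). The helper name `build_smartctl_argv` is hypothetical; the flags are the ones from the smartctl command quoted earlier in this thread.

```python
import shlex

def build_smartctl_argv(device_path, device_type=None):
    """Return the argv list for one smartctl call (hypothetical helper)."""
    argv = [
        "smartctl", "--json", "--info", "--health", "--attributes",
        "--tolerance=verypermissive", "--nocheck=standby", "--format=brief",
    ]
    if device_type:
        # The flag and its value are two separate arguments.
        argv += ["--device", device_type]
    # The /dev node is always its own, final argument.
    argv.append(device_path)
    return argv

# What the failing invocation amounts to: one string used as the node path.
wrong = build_smartctl_argv("cciss,1 /dev/sdb")
# What smartctl needs: type and path as separate elements.
right = build_smartctl_argv("/dev/sdb", "cciss,1")

print(shlex.join(right))
```

In `wrong`, the whole string `cciss,1 /dev/sdb` ends up as a single device name, which matches the "Device unavailable" warning in the log above.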

kfox1111 · 2022-12-12T21:51:55Z

How does one configure cciss,1? I need to do it on some of my nodes and have not found a way yet.

anthonyeleven · 2023-02-27T19:01:18Z

This is a gating factor for me too. I've added comments to the above issue and linked PR.

jakubgs · 2023-03-20T11:37:14Z

This is also an issue for me. I guess a proper solution would involve adding a separate flag to provide extra flags for smartctl.

anthonyeleven · 2023-03-20T19:10:27Z

The tool should discover such HBAs and do so automagically at per-device granularity, since there can and will be a mixed population of direct-attach, passthrough, and hidden-by-VD drives on various systems and especially within a given system.

smartmon.sh for example does this:


for device in ${device_list}; do
  disk="$(echo ${device} | cut -f1 -d'|')"
  type="$(echo ${device} | cut -f2 -d'|')"
  active=1
  echo "smartctl_run{disk=\"${disk}\",type=\"${type}\"}" "$(TZ=UTC date '+%s')"
  # Check if the device is in a low-power mode
  $SMARTCTL -n standby -d "${type}" "${disk}" > /dev/null || active=0
  echo "device_active{disk=\"${disk}\",type=\"${type}\"}" "${active}"
  # Skip further metrics to prevent the disk from spinning up
  test ${active} -eq 0 && continue
  # Get the SMART information and health
  $SMARTCTL  -i -H -d "${type}" "${disk}" | parse_smartctl_info "${disk}" "${type}"
  # Get the SMART attributes
  case ${type} in
  sat) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_attributes "${disk}" "${type}" ;;
  sat+megaraid*) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_attributes "${disk}" "${type}" ;;
  scsi) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_scsi_attributes "${disk}" "${type}" ;;
  nvme) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_nvme_attributes "${disk}" "${type}" ;;
  megaraid*) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_scsi_attributes "${disk}" "${type}" ;;
  *)
    echo "disk type is not sat, scsi or megaraid but ${type}"
    exit
    ;;
  esac
done | format_output
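
The `case` statement in the loop above picks an attribute parser from the device type. A rough Python sketch of the same dispatch, assuming shell-style patterns are matched in the order they appear in the script; `parser_for` is a hypothetical name and the returned strings are just the parser function names from smartmon.sh:

```python
from fnmatch import fnmatchcase

# (pattern, parser name) pairs, in the same order as the case statement.
PARSER_BY_TYPE = [
    ("sat", "parse_smartctl_attributes"),
    ("sat+megaraid*", "parse_smartctl_attributes"),
    ("scsi", "parse_smartctl_scsi_attributes"),
    ("nvme", "parse_smartctl_nvme_attributes"),
    ("megaraid*", "parse_smartctl_scsi_attributes"),
]

def parser_for(device_type):
    """Return the parser name for a device type, or None if unsupported."""
    for pattern, parser in PARSER_BY_TYPE:
        if fnmatchcase(device_type, pattern):
            return parser
    return None

for device_type in ("sat", "megaraid,3", "sat+megaraid,0", "cciss,1"):
    print(device_type, parser_for(device_type))
```

Returning `None` for an unknown type lets a caller skip that one device. The shell version instead calls `exit` in the `*)` branch, which ends the loop at the first unknown type.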


Mind you, I *despise* RoC HBAs and would just as soon never have one, or to set passthrough/JBOD on legacy systems, but walking into an existing deployment of thousands I don't have the luxury of greenfield.

anthonyeleven · 2023-03-29T15:39:57Z

@jakubgs It's more than just extra flags, it's discovery too.

/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
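
The lines above follow the layout that `smartctl --scan` prints (path, `-d`, type, then a `#` comment), so each one already carries both values a per-device query needs. A small Python sketch, assuming exactly that layout; `parse_scan_line` is a hypothetical helper, not part of smartctl_exporter:

```python
def parse_scan_line(line):
    """Split one scan line into (device_path, device_type), or None."""
    # Drop the trailing "# ..." comment, then tokenize on whitespace.
    fields = line.split("#", 1)[0].split()
    if len(fields) == 3 and fields[1] == "-d":
        return fields[0], fields[2]
    return None

sample = """\
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
"""

devices = [d for d in map(parse_scan_line, sample.splitlines()) if d]
for path, dev_type in devices:
    print(path, dev_type)
```

Note that `/dev/bus/0` appears twice, so the path alone cannot identify a drive; the (path, type) pair has to be the key.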

I no longer have HP HBAs, but it would be polite for however this is done to be architected in such a way that they could be supported later.

I hope to sunset RoC VDs through attrition, but that will take years :-/

kfox1111 · 2023-07-14T21:10:02Z

Any way to do this yet?

anthonyeleven · 2023-07-14T23:37:55Z

I’d do it myself if I had the coding skills. It really is a fatal flaw. Mind you HBA RAID is itself a fatal flaw but Dell’s BOSS-N1 is too useful, though one has to invoke ‘mvcli’ to get status.

jakubgs · 2023-10-05T19:09:37Z

I did a bit of research into this and found out that these devices can be found with smartctl by using -d scsi:

 > smartctl --json --scan | jq -c '.devices[] | { name, protocol }'         
jq: error (at <stdin>:21): Cannot iterate over null (null)

 > smartctl --json --scan --device scsi | jq -c '.devices[] | { name, protocol }'
{"name":"/dev/sda","protocol":"SCSI"}
{"name":"/dev/sdb","protocol":"SCSI"}
{"name":"/dev/sdc","protocol":"SCSI"}
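
The jq error in the first command suggests that the `devices` key is simply absent when the plain scan finds nothing. A short Python sketch of reading scan output while tolerating that case; `scanned_devices` is a hypothetical helper, and both JSON samples are trimmed stand-ins rather than full smartctl output:

```python
import json

def scanned_devices(scan_json):
    """Return the device list from scan output, or [] when none were found."""
    # "devices" may be missing entirely, so avoid indexing it directly.
    return json.loads(scan_json).get("devices") or []

# Trimmed stand-in for a scan that found nothing.
empty_scan = '{"json_format_version": [1, 0]}'
# Trimmed stand-in for the --device scsi scan shown above.
scsi_scan = json.dumps({"devices": [
    {"name": "/dev/sda", "protocol": "SCSI"},
    {"name": "/dev/sdb", "protocol": "SCSI"},
    {"name": "/dev/sdc", "protocol": "SCSI"},
]})

print(scanned_devices(empty_scan))
print([dev["name"] for dev in scanned_devices(scsi_scan)])
```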

But there might be an even better way to identify those devices, and that is lsblk:

 > lsblk --json -O | jq -c '.blockdevices[] | { path, hctl, subsystems }'
{"path":"/dev/sda","hctl":"0:1:0:0","subsystems":"block:scsi:pci"}
{"path":"/dev/sdb","hctl":"0:1:0:1","subsystems":"block:scsi:pci"}
{"path":"/dev/sdc","hctl":"0:1:0:2","subsystems":"block:scsi:pci"}

As we can see, the hctl field informs us what number to use for --device cciss,N and subsystems informs us that scsi is being used, which together can be a pretty reliable heuristic for detecting an HBA.

And different host without HBA:

 > lsblk --json -O | jq -c '.blockdevices[] | { path, hctl, subsystems }'
{"path":"/dev/nvme0n1","hctl":null,"subsystems":"block:nvme:pci"}
{"path":"/dev/nvme1n1","hctl":null,"subsystems":"block:nvme:pci"}
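
The heuristic can be sketched in Python as below. This is only an illustration under stated assumptions: the thread does not say which hctl field maps to N, so the sketch takes the last one (the LUN) because that matches the /dev/sdb and cciss,1 pairing used earlier. `hba_candidates` is a hypothetical name, and the result is a list of candidates to probe, not a confirmed device list.

```python
import json

def hba_candidates(lsblk_json):
    """Yield (path, n) for block devices that look SCSI-attached."""
    for dev in json.loads(lsblk_json).get("blockdevices", []):
        hctl = dev.get("hctl")
        subsystems = (dev.get("subsystems") or "").split(":")
        if hctl and "scsi" in subsystems:
            # ASSUMPTION: the last H:C:T:L field is the cciss index N.
            yield dev["path"], int(hctl.split(":")[-1])

hba_host = json.dumps({"blockdevices": [
    {"path": "/dev/sda", "hctl": "0:1:0:0", "subsystems": "block:scsi:pci"},
    {"path": "/dev/sdb", "hctl": "0:1:0:1", "subsystems": "block:scsi:pci"},
    {"path": "/dev/sdc", "hctl": "0:1:0:2", "subsystems": "block:scsi:pci"},
]})
nvme_host = json.dumps({"blockdevices": [
    {"path": "/dev/nvme0n1", "hctl": None, "subsystems": "block:nvme:pci"},
    {"path": "/dev/nvme1n1", "hctl": None, "subsystems": "block:nvme:pci"},
]})

for path, n in hba_candidates(hba_host):
    print(f"--device cciss,{n} {path}")
print(list(hba_candidates(nvme_host)))
```

Each candidate would still need to be confirmed by an actual smartctl call with the guessed type before it is trusted.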

I don't know what the maintainers would think about using a tool other than smartctl for discovery, but lsblk is a pretty standard tool available on most systems, and we could still fall back to smartctl if it is unavailable.

I'm going to read through the code a bit to see how difficult this would be.

jakubgs · 2023-10-05T19:15:26Z

Main issue as far as I can tell is that even if you discover the devices, often you won't get much info from them:

{
  "json_format_version": [1, 0],
  "smartctl": {
    "version": [7, 2],
    "svn_revision": "5155",
    "platform_info": "x86_64-linux-5.15.0-79-generic",
    "build_info": "(local build)",
    "argv": ["smartctl", "-A", "--device", "cciss,1", "/dev/sdb", "--json"],
    "exit_status": 0
  },
  "device": {
    "name": "/dev/sdb",
    "info_name": "/dev/sdb [cciss_disk_01] [SCSI]",
    "type": "cciss",
    "protocol": "SCSI"
  },
  "temperature": {
    "current": 21,
    "drive_trip": 70
  },
  "power_on_time": {
    "hours": 47138,
    "minutes": 5
  },
  "scsi_grown_defect_list": 0
}

Temperature and power-on time... not great.
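
Even a sparse report like this can be consumed if the reader treats every field as optional. A minimal Python sketch, assuming the key names seen in the output above; `extract_basic_metrics` and the result keys are hypothetical:

```python
import json

def extract_basic_metrics(report):
    """Return only the metrics present in one smartctl JSON report."""
    metrics = {}
    temperature = report.get("temperature", {})
    if "current" in temperature:
        metrics["temperature_celsius"] = temperature["current"]
    power_on = report.get("power_on_time", {})
    if "hours" in power_on:
        metrics["power_on_hours"] = power_on["hours"]
    if "scsi_grown_defect_list" in report:
        metrics["grown_defects"] = report["scsi_grown_defect_list"]
    return metrics

# Trimmed copy of the report above.
sample = json.loads("""
{
  "device": {"name": "/dev/sdb", "type": "cciss", "protocol": "SCSI"},
  "temperature": {"current": 21, "drive_trip": 70},
  "power_on_time": {"hours": 47138, "minutes": 5},
  "scsi_grown_defect_list": 0
}
""")
print(extract_basic_metrics(sample))
```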

anthonyeleven · 2023-10-05T19:19:47Z

Better than nothing, but yeah. I haven't had an HP HBA to work with for years, but re the scsi factor above, is the subject drive SAS? I would not be surprised if this would not surface SATA (but it might).

anthonyeleven · 2023-10-05T19:20:55Z

I'm increasingly leaning toward having a protege write a SMART harvester from scratch in Python, which would make it easier to normalize the vagaries of data that smartctl gives us. Then redirect the output into a file and let node_exporter's textfile collector snarf it up.
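
As a rough sketch of the output side of such a harvester: node_exporter's textfile collector reads `*.prom` files in Prometheus text format from a configured directory. The metric names and the helpers `render_prom` and `write_atomically` below are hypothetical, and the values are the ones from the cciss report earlier in the thread.

```python
import os
import tempfile

def render_prom(samples):
    """Render (name, labels, value) tuples as Prometheus text-format lines."""
    lines = []
    for name, labels, value in samples:
        # Label values are assumed to contain no quotes, backslashes or newlines.
        label_str = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

def write_atomically(path, text):
    """Write via a temp file and rename so a scrape never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file as 0600
    os.replace(tmp_path, path)

labels = {"disk": "/dev/sdb", "type": "cciss,1"}
text = render_prom([
    ("smart_temperature_celsius", labels, 21),
    ("smart_power_on_hours", labels, 47138),
])
print(text, end="")
```

A cron job or timer would then call `write_atomically` with a `.prom` path inside whatever directory the textfile collector is configured to read.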

jakubgs · 2023-10-05T19:24:05Z

Personally I'd rather fix what we have working than start from scratch. I'm busy enough dealing with what I already have working and don't have the time to reinvent wheels. Even if I get just temperature and power-on hours, that's better than a deployed SMART exporter just failing at startup and Prometheus returning alerts for the downed service.

jakubgs · 2023-10-05T19:25:59Z

But your point about SATA/SAS is well made. I will have to check how that is done on my servers.

josefzahner mentioned this issue Dec 16, 2022

Non standard device accessors such as -d cciss,N do not work #26

Open

NiceGuyIT mentioned this issue Aug 26, 2023

Deal with Seagate Mixing Other Data with Error Counts #108

Open

zxzharmlesszxz mentioned this issue Mar 1, 2024

Added determining device type and use it at scrape data #205

Merged
