Failed to collect raid controller device S.M.A.R.T data #89

Open
sdragon83 opened this issue Oct 19, 2022 · 17 comments

Comments

@sdragon83

I tried to collect data from a server with a raid controller through smartctl exporter.

However, an error occurred as below.

How can I collect S.M.A.R.T data on raid controller devices?

[image: screenshot of the error]

@tomazb

tomazb commented Nov 10, 2022

Yes, this is the real reason why you need such a service in the first place - to monitor devices that are not easily visible inside the operating system.

@marpears

marpears commented Nov 14, 2022

If the device type could be retrieved and passed into the function readSMARTctl, it could be used with the --device flag, which would be a safer way to scan all device types. For example:

smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --device megaraid,0 /dev/bus/1
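
A minimal sketch of that idea, written in Python for brevity (the exporter itself is a Go program). `build_smartctl_args` and its parameters are hypothetical names, not part of smartctl_exporter; the sketch only shows how an optional device type would become the `--device` argument of the command above:

```python
from typing import List, Optional

# Flags copied from the smartctl command line quoted above.
BASE_FLAGS = [
    "--json", "--info", "--health", "--attributes",
    "--tolerance=verypermissive", "--nocheck=standby", "--format=brief",
]


def build_smartctl_args(device: str, device_type: Optional[str] = None) -> List[str]:
    """Return the argv for one smartctl call (hypothetical helper)."""
    args = ["smartctl", *BASE_FLAGS]
    if device_type:
        # The flag and its value are two separate argv elements.
        args += ["--device", device_type]
    args.append(device)
    return args
```

Calling `build_smartctl_args("/dev/bus/1", "megaraid,0")` yields the command shown above; without a device type the `--device` flag is simply omitted.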

@josefzahner

josefzahner commented Dec 2, 2022

@marpears I can read the device info with smartctl including the device option, but NOT with smartctl_exporter...

$ smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --device cciss,1 /dev/sdb
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      3
    ],
    "svn_revision": "5338",
    "platform_info": "x86_64-linux-3.10.0-957.27.2.el7.x86_64",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--json",
...

but this doesn't work:

$ smartctl_exporter --smartctl.device='cciss,1 /dev/sdb'
ts=2022-12-02T09:40:45.718Z caller=main.go:90 level=info msg="Starting smartctl_exporter" version="(version=0.9.1, branch=HEAD, revision=a58c632ea8fa0f4f10a9ac9e941e610a7bb2efc1)"
ts=2022-12-02T09:40:45.718Z caller=main.go:91 level=info msg="Build context" build_context="(go=go1.19.3, user=root@fa2a9a938fb5, date=20221106-21:46:18)"
ts=2022-12-02T09:40:45.735Z caller=main.go:112 level=warn msg="Device unavailable" name="cciss,1 /dev/sdb"
ts=2022-12-02T09:40:45.735Z caller=main.go:119 level=info msg="No devices specified, trying to load them automatically"
ts=2022-12-02T09:40:45.735Z caller=main.go:124 level=error msg="No devices found"

@lahwaacz
Contributor

lahwaacz commented Dec 9, 2022

@josefzahner The --smartctl.device flag in smartctl_exporter does not translate to the --device flag of smartctl. The exporter expects just the /dev/ node path. Also note that --device cciss,1 /dev/sdb is 3 distinct arguments on the command line; you can't pass all of that to --smartctl.device.
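
To illustrate the point about arguments, here is a small Python sketch (not exporter code) that uses the standard `shlex` module to show how a POSIX shell splits the two command lines from the previous comment:

```python
import shlex

# Unquoted: the shell hands smartctl three separate arguments.
unquoted = shlex.split("--device cciss,1 /dev/sdb")

# Quoted, as in the failing exporter invocation: one single argument,
# whose whole value ends up as the device name.
quoted = shlex.split("--smartctl.device='cciss,1 /dev/sdb'")

print(unquoted)  # ['--device', 'cciss,1', '/dev/sdb']
print(quoted)    # ['--smartctl.device=cciss,1 /dev/sdb']
```

This matches the log above, where the exporter reports the device name as "cciss,1 /dev/sdb".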

@kfox1111

How does one configure cciss,1? I need to do it on some of my nodes and have not found a way yet.

@anthonyeleven

This is a gating factor for me too. I've added comments to the above issue and linked PR.

@jakubgs

jakubgs commented Mar 20, 2023

This is also an issue for me. I guess a proper solution would involve adding a separate flag to provide extra flags for smartctl.

@anthonyeleven

The tool should discover such HBAs and do so automagically at per-device granularity, since there can and will be a mixed population of direct-attach, passthrough, and hidden-by-VD drives on various systems and especially within a given system.

smartmon.sh for example does this:


for device in ${device_list}; do
  disk="$(echo ${device} | cut -f1 -d'|')"
  type="$(echo ${device} | cut -f2 -d'|')"
  active=1
  echo "smartctl_run{disk=\"${disk}\",type=\"${type}\"}" "$(TZ=UTC date '+%s')"
  # Check if the device is in a low-power mode
  $SMARTCTL -n standby -d "${type}" "${disk}" > /dev/null || active=0
  echo "device_active{disk=\"${disk}\",type=\"${type}\"}" "${active}"
  # Skip further metrics to prevent the disk from spinning up
  test ${active} -eq 0 && continue
  # Get the SMART information and health
  $SMARTCTL  -i -H -d "${type}" "${disk}" | parse_smartctl_info "${disk}" "${type}"
  # Get the SMART attributes
  case ${type} in
  sat) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_attributes "${disk}" "${type}" ;;
  sat+megaraid*) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_attributes "${disk}" "${type}" ;;
  scsi) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_scsi_attributes "${disk}" "${type}" ;;
  nvme) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_nvme_attributes "${disk}" "${type}" ;;
  megaraid*) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_scsi_attributes "${disk}" "${type}" ;;
  *)
    echo "disk type is not sat, scsi or megaraid but ${type}"
    exit
    ;;
  esac
done | format_output
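
The per-type dispatch in the `case` statement above could be sketched in Python as follows. This is only an illustration of the pattern matching: the parser labels (`ata`, `scsi`, `nvme`) are made up for this sketch and stand in for the shell functions `parse_smartctl_attributes`, `parse_smartctl_scsi_attributes`, and `parse_smartctl_nvme_attributes`.

```python
from fnmatch import fnmatchcase

# Patterns in the same order as the case statement above; first match wins.
# The second element is a label for this sketch, not a real function.
ATTRIBUTE_PARSERS = [
    ("sat", "ata"),
    ("sat+megaraid*", "ata"),
    ("scsi", "scsi"),
    ("nvme", "nvme"),
    ("megaraid*", "scsi"),
]


def parser_for(device_type: str) -> str:
    """Pick the attribute parser for a smartctl device type."""
    for pattern, parser in ATTRIBUTE_PARSERS:
        if fnmatchcase(device_type, pattern):
            return parser
    # Mirrors the catch-all branch of the shell loop.
    raise ValueError(f"unsupported device type: {device_type}")
```

Note that in the loop as quoted, a type such as `cciss,1` would fall through to the catch-all branch, so HP controllers would need an extra pattern.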


Mind you, I *despise* RoC HBAs and would just as soon never have one, or at least set passthrough/JBOD on legacy systems, but walking into an existing deployment of thousands, I don't have the luxury of greenfield.


@anthonyeleven

@jakubgs It's more than just extra flags, it's discovery too.

/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
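
Assuming the listing above is the plain-text output of a smartctl scan (the command itself is not shown in the comment), a discovery step could turn each line into a (path, type) pair. `parse_scan_output` is a hypothetical helper name for this sketch:

```python
from typing import List, Tuple


def parse_scan_output(text: str) -> List[Tuple[str, str]]:
    """Parse lines like '/dev/bus/0 -d megaraid,0 # comment' into (path, type)."""
    devices = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        fields = line.split()
        # Keep only lines of the exact form: <path> -d <type>
        if len(fields) == 3 and fields[1] == "-d":
            devices.append((fields[0], fields[2]))
    return devices
```

Each pair carries what a per-device smartctl call needs: the device node and the value for `-d`.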

I no longer have HP HBAs, but it would be polite for however this is done to be architected in such a way that they could be supported later.

I hope to sunset RoC VDs through attrition, but that will take years :-/

@kfox1111

Any way to do this yet?

@anthonyeleven

anthonyeleven commented Jul 14, 2023 via email

@jakubgs

jakubgs commented Oct 5, 2023

I did a bit of research into this and found out that these devices can be found with smartctl by using -d scsi:

 > smartctl --json --scan | jq -c '.devices[] | { name, protocol }'         
jq: error (at <stdin>:21): Cannot iterate over null (null)

 > smartctl --json --scan --device scsi | jq -c '.devices[] | { name, protocol }'
{"name":"/dev/sda","protocol":"SCSI"}
{"name":"/dev/sdb","protocol":"SCSI"}
{"name":"/dev/sdc","protocol":"SCSI"}
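
The jq error in the first command suggests that the scan output contained no `devices` array at all. Any code consuming this JSON would have to tolerate that case. A minimal Python sketch, with `scan_devices` as a hypothetical helper name:

```python
import json
from typing import List, Tuple


def scan_devices(scan_json: str) -> List[Tuple[str, str]]:
    """Return (name, protocol) pairs; a missing or null "devices" key yields []."""
    report = json.loads(scan_json)
    return [
        (dev["name"], dev.get("protocol", ""))
        for dev in report.get("devices") or []
    ]
```

With this, an empty scan result becomes an empty list rather than an error, and the caller can decide whether to retry the scan with an explicit device type.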

But there might be an even better way to identify those devices, and that is lsblk:

 > lsblk --json -O | jq -c '.blockdevices[] | { path, hctl, subsystems }'
{"path":"/dev/sda","hctl":"0:1:0:0","subsystems":"block:scsi:pci"}
{"path":"/dev/sdb","hctl":"0:1:0:1","subsystems":"block:scsi:pci"}
{"path":"/dev/sdc","hctl":"0:1:0:2","subsystems":"block:scsi:pci"}

As we can see, the hctl field tells us what number to use for --device cciss,N and subsystems tells us that scsi is being used, which together can be a pretty reliable heuristic for detecting an HBA.

And a different host without an HBA:

 > lsblk --json -O | jq -c '.blockdevices[] | { path, hctl, subsystems }'
{"path":"/dev/nvme0n1","hctl":null,"subsystems":"block:nvme:pci"}
{"path":"/dev/nvme1n1","hctl":null,"subsystems":"block:nvme:pci"}
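
A rough Python sketch of this heuristic, run against lsblk JSON like the two samples above. Two things here are assumptions that the thread does not confirm: that N in `cciss,N` is the last (LUN) field of the H:C:T:L value, and that `cciss` is the right type for every match. `cciss_candidates` is a hypothetical helper name.

```python
import json
from typing import List, Tuple


def cciss_candidates(lsblk_json: str) -> List[Tuple[str, str]]:
    """Return (path, device_type) guesses for disks that look HBA-attached."""
    candidates = []
    for dev in json.loads(lsblk_json).get("blockdevices", []):
        hctl = dev.get("hctl")
        subsystems = (dev.get("subsystems") or "").split(":")
        if not hctl or "scsi" not in subsystems:
            continue  # e.g. NVMe, where hctl is null
        # ASSUMPTION: N in cciss,N is the last field of H:C:T:L.
        lun = hctl.split(":")[-1]
        candidates.append((dev["path"], f"cciss,{lun}"))
    return candidates
```

Because this check only looks at hctl and the scsi subsystem, it may also match ordinary disks that sit behind the kernel's SCSI layer, so it is a starting point for discovery rather than a complete test.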

I don't know what maintainers would think about using a tool other than smartctl for discovery, but this is a pretty standard tool available on most systems, and we could still fall back to smartctl if it is unavailable.

I'm going to read the code a bit to see how difficult this would be.

@jakubgs
Copy link

jakubgs commented Oct 5, 2023

Main issue as far as I can tell is that even if you discover the devices, often you won't get much info from them:

{
  "json_format_version": [1, 0],
  "smartctl": {
    "version": [7, 2],
    "svn_revision": "5155",
    "platform_info": "x86_64-linux-5.15.0-79-generic",
    "build_info": "(local build)",
    "argv": ["smartctl", "-A", "--device", "cciss,1", "/dev/sdb", "--json"],
    "exit_status": 0
  },
  "device": {
    "name": "/dev/sdb",
    "info_name": "/dev/sdb [cciss_disk_01] [SCSI]",
    "type": "cciss",
    "protocol": "SCSI"
  },
  "temperature": {
    "current": 21,
    "drive_trip": 70
  },
  "power_on_time": {
    "hours": 47138,
    "minutes": 5
  },
  "scsi_grown_defect_list": 0
}

Temperature and power-on time... not great.
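
Even a sparse report like this can still be turned into a few metrics if the consumer treats every field as optional. A Python sketch of that approach; the metric names and `available_metrics` are invented for this example and are not the exporter's actual names:

```python
from typing import Any, Dict


def available_metrics(report: Dict[str, Any]) -> Dict[str, float]:
    """Collect whichever of the fields shown above are present in the report."""
    metrics = {}
    temperature = report.get("temperature", {}).get("current")
    if temperature is not None:
        metrics["temperature_celsius"] = temperature
    hours = report.get("power_on_time", {}).get("hours")
    if hours is not None:
        metrics["power_on_hours"] = hours
    defects = report.get("scsi_grown_defect_list")
    if defects is not None:
        metrics["grown_defects"] = defects
    return metrics
```

A drive that reports nothing useful simply produces an empty dictionary instead of a failure.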

@anthonyeleven

Better than nothing, but yeah. I haven't had an HP HBA to work with for years, but re the scsi factor above, is the subject drive SAS? I would not be surprised if this would not surface SATA (but it might).

@anthonyeleven

anthonyeleven commented Oct 5, 2023

I'm increasingly leaning toward having a protege write a SMART harvester from scratch in Python, which would make it easier to normalize the vagaries of data that smartctl gives us. Then redirect the output into a file and let node_exporter's textfile collector snarf it up.
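
A minimal sketch of the output side of such a harvester, assuming the values have already been collected. The `smart_` metric prefix, the label names, and both function names are made up for this example. The file is written under a temporary name and then renamed, so the textfile collector should never see a half-written file:

```python
import os
import tempfile
from typing import Dict


def render_metrics(device: str, device_type: str, values: Dict[str, float]) -> str:
    """Render values in the Prometheus text exposition format."""
    labels = f'device="{device}",type="{device_type}"'
    lines = [
        f"smart_{name}{{{labels}}} {value}"
        for name, value in sorted(values.items())
    ]
    return "\n".join(lines) + "\n"


def write_textfile(path: str, content: str) -> None:
    """Write the file atomically by renaming a temporary file into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by owner only
    os.replace(tmp_path, path)
```

The target path would be a `.prom` file inside the directory that node_exporter's textfile collector is configured to read.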

@jakubgs

jakubgs commented Oct 5, 2023

Personally I'd rather fix what we have working than start from scratch. I'm busy enough dealing with what I already have working and don't have the time to reinvent wheels. Even if I get just temperature and power-on hours, that's better than the deployed SMART exporter failing at startup and Prometheus returning alerts for the downed service.

@jakubgs

jakubgs commented Oct 5, 2023

But your point about SATA/SAS is well made. I will have to check how that is done on my servers.
