Exploit the mining abstraction to normalize metrics #111

anthonyeleven · 2023-02-27T19:44:25Z

The exporter currently appears to expose two classes of metrics:

* Transcribed but not interpreted `smartctl_device_attribute` metrics
* Mined metrics, e.g. `smartctl_device_percentage_used`

The mining function paradigm has considerable potential beyond the way it is currently used. For example:

```go
func (smart *SMARTctl) minePercentageUsed() {
	smart.ch <- prometheus.MustNewConstMetric(
		metricDevicePercentageUsed,
		prometheus.CounterValue,
		smart.json.Get("nvme_smart_health_information_log.percentage_used").Float(),
		smart.device.device,
	)
}
```

This function today only exposes data for NVMe devices.
* The metrics for other device types are misleading
* SAS/SATA devices are not mined. This function could abstract the varying format and presence of values such as wear and temperature, which are reported differently across SMART attributes and SAS/NVMe passthrough from `smartctl`.
* Some devices report wear _used_, some report wear _remaining_. A mining / wrapper function has the potential to transparently harmonize this.
* Some devices report well-known counters on unusual SMART ID numbers
* `smartctl` attribute labels are arbitrary:  they are defined in `drivedb.h` and are not consistent.  For example, entries for drive self-reported wear have at least six names:
  * Media_Wearout_Indicator
  * Wear_Leveling_Count
  * Wear_Level_Used
  * Percent_Lifetime_Remain
  * Reallocated_Sector_Ct
  * SSD_Life_Left

The `minePercentageUsed` function could easily abstract / normalize across names, polarity, and `smartctl` output formats.
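
As a rough illustration of normalizing across names and polarity, here is a minimal sketch in plain Go. The `wearAttributes` table, the `wearPolarity` type, and `normalizePercentageUsed` are hypothetical names, not part of the exporter, and the polarity assigned to each attribute name is an assumption that would need checking against `drivedb.h` and real drives.

```go
package main

import "fmt"

// wearPolarity records whether a reported wear value counts up
// (percentage used) or down (percentage remaining).
type wearPolarity int

const (
	wearUsed wearPolarity = iota
	wearRemaining
)

// wearAttributes is a hypothetical lookup table. The polarity given for
// each name is an assumption for illustration only.
var wearAttributes = map[string]wearPolarity{
	"Wear_Level_Used":         wearUsed,
	"Percent_Lifetime_Remain": wearRemaining,
	"SSD_Life_Left":           wearRemaining,
}

// normalizePercentageUsed converts a reported wear value into
// "percentage used". ok is false when the name is not in the table.
func normalizePercentageUsed(name string, value float64) (used float64, ok bool) {
	polarity, found := wearAttributes[name]
	if !found {
		return 0, false
	}
	if polarity == wearRemaining {
		return 100 - value, true
	}
	return value, true
}

func main() {
	samples := []struct {
		name  string
		value float64
	}{
		{"Wear_Level_Used", 12},
		{"SSD_Life_Left", 97},
		{"Some_Unknown_Attribute", 5},
	}
	for _, s := range samples {
		if used, ok := normalizePercentageUsed(s.name, s.value); ok {
			fmt.Printf("%s -> %.0f%% used\n", s.name, used)
		} else {
			fmt.Printf("%s -> not a known wear attribute\n", s.name)
		}
	}
}
```

A table like this keeps the per-model quirks in one place, so the emitted metric name and meaning stay the same regardless of which label a drive happens to use.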

* Similarly, `Airflow_Temperature_Cel`, `Temperature_Celsius`, `Temperature_Internal`, and `Drive_Temperature` are example `smartctl` labels for drive temperature that should be abstracted by a mining function across drive models and interfaces.
* The same applies to `CRC_Error_Count`, `UDMA_CRC_Error_Count`, etc.
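
A temperature mining function along these lines might look like the sketch below. It uses only `encoding/json` rather than the exporter's own JSON handling, and `mineTemperature`, `smartctlOutput`, and the preference order in `temperatureLabels` are all hypothetical. It also assumes `ata_smart_attributes.table` is an array and that the leading integer of `raw.string` is the temperature in Celsius, which will not hold for every drive.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// smartctlOutput models only the fields this sketch reads from
// `smartctl --json` output.
type smartctlOutput struct {
	Temperature *struct {
		Current float64 `json:"current"`
	} `json:"temperature"`
	ATA struct {
		Table []struct {
			Name string `json:"name"`
			Raw  struct {
				String string `json:"string"`
			} `json:"raw"`
		} `json:"table"`
	} `json:"ata_smart_attributes"`
}

// temperatureLabels is a hypothetical preference order over the labels
// mentioned in this issue.
var temperatureLabels = []string{
	"Temperature_Celsius",
	"Temperature_Internal",
	"Drive_Temperature",
	"Airflow_Temperature_Cel",
}

// mineTemperature prefers the top-level temperature.current field and
// falls back to the first known attribute label it can parse.
func mineTemperature(doc []byte) (float64, bool) {
	var out smartctlOutput
	if err := json.Unmarshal(doc, &out); err != nil {
		return 0, false
	}
	if out.Temperature != nil {
		return out.Temperature.Current, true
	}
	for _, label := range temperatureLabels {
		for _, attr := range out.ATA.Table {
			if attr.Name != label {
				continue
			}
			// Assumption: raw.string starts with the Celsius reading.
			var celsius int
			if _, err := fmt.Sscanf(attr.Raw.String, "%d", &celsius); err == nil {
				return float64(celsius), true
			}
		}
	}
	return 0, false
}

func main() {
	doc := []byte(`{"ata_smart_attributes":{"table":[
		{"name":"Airflow_Temperature_Cel","raw":{"string":"28 (Min/Max 23/28)"}}]}}`)
	if t, ok := mineTemperature(doc); ok {
		fmt.Println("temperature:", t)
	}
}
```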


NiceGuyIT · 2023-08-26T21:40:45Z

Hi @anthonyeleven, your request was a little hard to understand without examples. I went digging and found an example!

All of the output below is from the same drive. While `Airflow_Temperature_Cel` reports a different value than `Temperature_Celsius`, there is a top-level `.temperature` object whose `current` field appears to match `Temperature_Celsius`.

smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error /dev/sdg |
yq -o json '.ata_smart_attributes.table | with_entries(select(.[].name == "Airflow_Temperature_Cel"))'

{
  "13": {
    "id": 190,
    "name": "Airflow_Temperature_Cel",
    "value": 72,
    "worst": 48,
    "thresh": 45,
    "when_failed": "",
    "flags": {
      "value": 34,
      "string": "-O---K ",
      "prefailure": false,
      "updated_online": true,
      "performance": false,
      "error_rate": false,
      "event_count": false,
      "auto_keep": true
    },
    "raw": {
      "value": 471269404,
      "string": "28 (Min/Max 23/28)"
    }
  }
}

smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error /dev/sdg |
yq -o json '.ata_smart_attributes.table | with_entries(select(.[].name == "Temperature_Celsius"))'

{
  "14": {
    "id": 194,
    "name": "Temperature_Celsius",
    "value": 28,
    "worst": 52,
    "thresh": 0,
    "when_failed": "",
    "flags": {
      "value": 34,
      "string": "-O---K ",
      "prefailure": false,
      "updated_online": true,
      "performance": false,
      "error_rate": false,
      "event_count": false,
      "auto_keep": true
    },
    "raw": {
      "value": 34359738396,
      "string": "28 (0 8 0 0 0)"
    }
  }
}

smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error /dev/sdg | yq -o json '.temperature'

{
  "current": 28
}

$ xh --body :19633/metrics | rg sdg | rg temp
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="190",attribute_name="Airflow_Temperature_Cel",attribute_value_type="raw",device="sdg"} 4.71269404e+08
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="190",attribute_name="Airflow_Temperature_Cel",attribute_value_type="thresh",device="sdg"} 45
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="190",attribute_name="Airflow_Temperature_Cel",attribute_value_type="value",device="sdg"} 72
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="190",attribute_name="Airflow_Temperature_Cel",attribute_value_type="worst",device="sdg"} 48
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="194",attribute_name="Temperature_Celsius",attribute_value_type="raw",device="sdg"} 3.4359738396e+10
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="194",attribute_name="Temperature_Celsius",attribute_value_type="thresh",device="sdg"} 0
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="194",attribute_name="Temperature_Celsius",attribute_value_type="value",device="sdg"} 28
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="194",attribute_name="Temperature_Celsius",attribute_value_type="worst",device="sdg"} 52
smartctl_device_temperature{device="sdg",temperature_type="current"} 28
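
The large `raw` values exported above look unrelated, but on this drive they appear to be packed bytes: the lowest byte of each is 28, and bytes 2 and 3 of the attribute 190 raw value are 23 and 28, matching the `Min/Max 23/28` in its raw string. The sketch below only demonstrates that arithmetic. `rawByte` is a hypothetical helper, and this byte layout is an observation from this one sample, not something guaranteed across drives.

```go
package main

import "fmt"

// rawByte returns byte i (0 = least significant) of a SMART raw value.
func rawByte(raw uint64, i uint) uint8 {
	return uint8(raw >> (8 * i))
}

func main() {
	airflow := uint64(471269404)   // raw value of attribute 190 above (0x1C17001C)
	celsius := uint64(34359738396) // raw value of attribute 194 above (0x80000001C)

	// Assumption: byte 0 is the current temperature, bytes 2 and 3 are min and max.
	fmt.Println(rawByte(airflow, 0), rawByte(airflow, 2), rawByte(airflow, 3)) // prints "28 23 28"
	fmt.Println(rawByte(celsius, 0))                                           // prints "28"
}
```

This is one reason the transcribed `attribute_value_type="raw"` series are hard to query directly: the useful number has to be unpacked before it is comparable across drives.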

anthonyeleven · 2023-08-26T21:55:43Z

That's one example.

To rephrase:

SMART is not implemented very consistently. Some SSDs report, for example, lifetime remaining, while others report lifetime used. Some report a given attribute under a different numeric ID than others. So a tool that reports based on numeric attribute IDs will benefit from a bit of "correction" before emitting the Prometheus exposition format.

smartmon-ix.txt

smartmontools' drivedb.h is kinda the wild west: the text attribute names are not at all consistent and are kinda arbitrary. So any tool that relies on the attribute names, unless it does some normalization, will report different timeseries for different drives.

The whole point of an exporter like this is to emit metrics that can be queried across a whole fleet, without having to craft the query differently for different drive models.

I've attached a hacked-up script that I'm currently using via node_exporter's textfile collector. It's gnarly and I don't like that it spits out metrics I'll never care about, including thresholds and worsts.

If we can ever get a usable HBA RAID enhancement merged, I'll contribute this sort of thing. I'm stuck with thousands of @#$@#%^!! RoC HBA VDs so a collector that can't handle them doesn't do anything for me.
