Exploit the mining abstraction to normalize metrics #111

Open
anthonyeleven opened this issue Feb 27, 2023 · 2 comments

anthonyeleven commented Feb 27, 2023

The exporter currently appears to expose two classes of metrics:

  1. Transcribed but not interpreted `smartctl_device_attribute` metrics
  2. Mined metrics, e.g. `smartctl_device_percentage_used`

The mining function paradigm has considerable potential beyond the way it is currently used. For example, here is the current `minePercentageUsed` function:

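```go
func (smart *SMARTctl) minePercentageUsed() {
	// Reads only the NVMe health log; other device types have no such field.
	smart.ch <- prometheus.MustNewConstMetric(
		metricDevicePercentageUsed,
		prometheus.CounterValue,
		smart.json.Get("nvme_smart_health_information_log.percentage_used").Float(),
		smart.device.device,
	)
}
```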

This function currently exposes data only for NVMe devices.
* The metrics for other device types are misleading.
* SAS/SATA devices are not mined. This function could abstract over the varying format and presence of values such as wear and temperature, which are reported differently across SMART attributes and SAS/NVMe passthrough from `smartctl`.
* Some devices report wear _used_, some report wear _remaining_. A mining / wrapper function could transparently harmonize this.
* Some devices report well-known counters on unusual SMART ID numbers.
* `smartctl` attribute labels are arbitrary: they are defined in `drivedb.h` and are not consistent. For example, entries for drive self-reported wear have at least six names:
  * `Media_Wearout_Indicator`
  * `Wear_Leveling_Count`
  * `Wear_Level_Used`
  * `Percent_Lifetime_Remain`
  * `Reallocated_Sector_Ct`
  * `SSD_Life_Left`

  The `minePercentageUsed` function could easily abstract / normalize across names, polarity, and `smartctl` output formats.
* Similarly, `Airflow_Temperature_Cel`, `Temperature_Celsius`, `Temperature_Internal`, and `Drive_Temperature` are example `smartctl` labels for drive temperature that should be abstracted by a mining function across drive models and interfaces.
* The same applies to `CRC_Error_Count`, `UDMA_CRC_Error_Count`, etc.
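
To make the name / polarity idea concrete, here is a rough sketch of what such a normalization could look like. It is not the exporter's code: `wearAttributes`, `normalizePercentageUsed`, and the polarity assigned to each name are assumptions for illustration and would need to be checked per drive model against `drivedb.h`. `Reallocated_Sector_Ct` is left out because it is not obviously a percentage.

```go
package main

import "fmt"

// wearPolarity records whether a reading counts up (percent used)
// or down (percent remaining).
type wearPolarity int

const (
	wearUsed wearPolarity = iota
	wearRemaining
)

// wearAttributes maps smartctl attribute names to an ASSUMED polarity.
// These assignments are illustrative and must be verified per drive model.
var wearAttributes = map[string]wearPolarity{
	"Media_Wearout_Indicator": wearRemaining,
	"Wear_Leveling_Count":     wearRemaining,
	"Wear_Level_Used":         wearUsed,
	"Percent_Lifetime_Remain": wearRemaining,
	"SSD_Life_Left":           wearRemaining,
}

// normalizePercentageUsed converts a wear reading into "percent used".
// It returns false when the name is not a known wear attribute.
func normalizePercentageUsed(name string, value float64) (float64, bool) {
	polarity, ok := wearAttributes[name]
	if !ok {
		return 0, false
	}
	if polarity == wearRemaining {
		// Flip "remaining" readings so every device reports "used".
		value = 100 - value
	}
	return value, true
}

func main() {
	readings := []struct {
		name  string
		value float64
	}{
		{"SSD_Life_Left", 90},
		{"Wear_Level_Used", 10},
		{"Temperature_Celsius", 28},
	}
	for _, r := range readings {
		used, ok := normalizePercentageUsed(r.name, r.value)
		fmt.Println(r.name, used, ok)
	}
}
```

With a table like this, a single `smartctl_device_percentage_used` series could be emitted for SATA/SAS devices as well, regardless of which label a given drive happens to use.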
@NiceGuyIT
Member

Hi @anthonyeleven, your request was a little hard to understand without examples. I went digging and found an example!

This is for the same drive. While `Airflow_Temperature_Cel` differs from `Temperature_Celsius`, there is a top-level `.temperature` field whose value appears to match `Temperature_Celsius`.

smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error /dev/sdg |
yq -o json '.ata_smart_attributes.table | with_entries(select(.[].name == "Airflow_Temperature_Cel"))'
{
  "13": {
    "id": 190,
    "name": "Airflow_Temperature_Cel",
    "value": 72,
    "worst": 48,
    "thresh": 45,
    "when_failed": "",
    "flags": {
      "value": 34,
      "string": "-O---K ",
      "prefailure": false,
      "updated_online": true,
      "performance": false,
      "error_rate": false,
      "event_count": false,
      "auto_keep": true
    },
    "raw": {
      "value": 471269404,
      "string": "28 (Min/Max 23/28)"
    }
  }
}

smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error /dev/sdg |
yq -o json '.ata_smart_attributes.table | with_entries(select(.[].name == "Temperature_Celsius"))'
{
  "14": {
    "id": 194,
    "name": "Temperature_Celsius",
    "value": 28,
    "worst": 52,
    "thresh": 0,
    "when_failed": "",
    "flags": {
      "value": 34,
      "string": "-O---K ",
      "prefailure": false,
      "updated_online": true,
      "performance": false,
      "error_rate": false,
      "event_count": false,
      "auto_keep": true
    },
    "raw": {
      "value": 34359738396,
      "string": "28 (0 8 0 0 0)"
    }
  }
}

smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --log=error /dev/sdg | yq -o json '.temperature'
{
  "current": 28
}

$ xh --body :19633/metrics | rg sdg | rg temp
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="190",attribute_name="Airflow_Temperature_Cel",attribute_value_type="raw",device="sdg"} 4.71269404e+08
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="190",attribute_name="Airflow_Temperature_Cel",attribute_value_type="thresh",device="sdg"} 45
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="190",attribute_name="Airflow_Temperature_Cel",attribute_value_type="value",device="sdg"} 72
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="190",attribute_name="Airflow_Temperature_Cel",attribute_value_type="worst",device="sdg"} 48
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="194",attribute_name="Temperature_Celsius",attribute_value_type="raw",device="sdg"} 3.4359738396e+10
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="194",attribute_name="Temperature_Celsius",attribute_value_type="thresh",device="sdg"} 0
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="194",attribute_name="Temperature_Celsius",attribute_value_type="value",device="sdg"} 28
smartctl_device_attribute{attribute_flags_long="updated_online,auto_keep",attribute_flags_short="-O---K",attribute_id="194",attribute_name="Temperature_Celsius",attribute_value_type="worst",device="sdg"} 52
smartctl_device_temperature{device="sdg",temperature_type="current"} 28
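
The `raw` series above (4.71269404e+08 and 3.4359738396e+10) look odd because the raw value packs several fields into one integer. For the two attributes in this output, the lowest byte carries the current temperature, which matches the `28` in the `raw.string` fields. A minimal sketch of that decoding follows; `currentTemperatureFromRaw` is a made-up helper, and the byte layout is only what this one drive shows, so other models may pack the value differently.

```go
package main

import "fmt"

// currentTemperatureFromRaw returns the lowest byte of a raw SMART value.
// ASSUMPTION: the current temperature lives in that byte, as it does for
// the two attributes shown in this thread. This is not guaranteed elsewhere.
func currentTemperatureFromRaw(raw uint64) uint64 {
	return raw & 0xFF
}

func main() {
	// Raw values copied from the smartctl JSON output above.
	fmt.Println(currentTemperatureFromRaw(471269404))   // Airflow_Temperature_Cel
	fmt.Println(currentTemperatureFromRaw(34359738396)) // Temperature_Celsius
}
```

Both calls print 28, the same value as `smartctl_device_temperature`.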

@anthonyeleven
Author

That's one example.

To rephrase:

SMART is not implemented very consistently. Some SSDs report, for example, lifetime remaining, while others report lifetime used. Some report a given attribute under a different numeric ID than others. So a tool that reports based on numeric attribute IDs will benefit from a bit of "correction" before emitting the Prometheus exposition format.
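
As a sketch of the ID "correction" I mean, assuming a default table plus per-model overrides: IDs 190 and 194 come from the output earlier in this thread, while `EXAMPLE-MODEL-A`, its ID 231 entry, and the names `defaultIDs`, `modelOverrides`, and `metricForID` are all hypothetical placeholders.

```go
package main

import "fmt"

// defaultIDs maps SMART attribute IDs to canonical metric names.
// IDs 190 and 194 are taken from the smartctl output in this thread.
var defaultIDs = map[int]string{
	190: "airflow_temperature_celsius",
	194: "temperature_celsius",
}

// modelOverrides holds per-model exceptions to defaultIDs.
// HYPOTHETICAL: the model name and ID below are placeholders, not real data.
var modelOverrides = map[string]map[int]string{
	"EXAMPLE-MODEL-A": {231: "temperature_celsius"},
}

// metricForID resolves an attribute ID to a canonical metric name,
// consulting the model-specific overrides before the defaults.
func metricForID(model string, id int) (string, bool) {
	if overrides, ok := modelOverrides[model]; ok {
		if name, ok := overrides[id]; ok {
			return name, true
		}
	}
	name, ok := defaultIDs[id]
	return name, ok
}

func main() {
	fmt.Println(metricForID("EXAMPLE-MODEL-A", 231))
	fmt.Println(metricForID("some-other-model", 194))
	fmt.Println(metricForID("some-other-model", 231))
}
```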

smartmon-ix.txt

smartmontools' drivedb.h is kinda the wild west: the text attribute names are not at all consistent and are kinda arbitrary. So any tool that relies on the attribute names, unless it does some normalization, will report different timeseries for different drives.

The whole point of an exporter like this is to emit metrics that can be queried across a whole fleet, without having to craft the query differently for different drive models.
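
One way to get a single queryable series per device, sketched under assumptions: try the known temperature labels in a fixed order and emit only the first one present. The preference order and the names `temperaturePreference` and `pickTemperature` are mine, not the exporter's, and the input is assumed to hold already-decoded degrees Celsius.

```go
package main

import "fmt"

// temperaturePreference lists attribute names in the order they are tried.
// ASSUMPTION: this ordering is illustrative, not derived from any standard.
var temperaturePreference = []string{
	"Temperature_Celsius",
	"Drive_Temperature",
	"Temperature_Internal",
	"Airflow_Temperature_Cel",
}

// pickTemperature returns one reading per device along with the attribute
// it came from, so a drive exposing several labels yields a single series.
func pickTemperature(attrs map[string]float64) (value float64, source string, ok bool) {
	for _, name := range temperaturePreference {
		if v, found := attrs[name]; found {
			return v, name, true
		}
	}
	return 0, "", false
}

func main() {
	// Decoded temperatures for the sdg drive discussed above.
	sdg := map[string]float64{
		"Airflow_Temperature_Cel": 28,
		"Temperature_Celsius":     28,
	}
	if v, src, ok := pickTemperature(sdg); ok {
		fmt.Printf("temperature=%v source=%s\n", v, src)
	}
}
```

Keeping the source attribute around (for example as a label) would still let someone see which SMART label a given drive used, without having to change the fleet-wide query.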

I've attached a hacked-up script that I'm currently using via node_exporter's textfile collector. It's gnarly and I don't like that it spits out metrics I'll never care about, including thresholds and worsts.

If we can ever get a usable HBA RAID enhancement merged, I'll contribute this sort of thing. I'm stuck with thousands of @#$@#%^!! RoC HBA VDs so a collector that can't handle them doesn't do anything for me.
