Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics do not follow metric naming best practices about units #121

Open
ruuda opened this issue May 14, 2023 · 1 comment
Open

Metrics do not follow metric naming best practices about units #121

ruuda opened this issue May 14, 2023 · 1 comment

Comments

@ruuda
Copy link

ruuda commented May 14, 2023

Prometheus’ metric and label naming best practices recommend:

A metric name should have a suffix describing the unit, in plural form. Note that an accumulating count has total as a suffix, in addition to the unit if applicable.

  • http_request_duration_seconds
  • node_memory_usage_bytes
  • http_requests_total (for a unit-less accumulating count)
  • process_cpu_seconds_total (for an accumulating count with unit)
  • foobar_build_info (for a pseudo-metric that provides metadata about the running binary)
  • data_pipeline_last_record_processed_timestamp_seconds (for a timestamp that tracks the time of the latest record processed in a data processing pipeline)

Furthermore, it has a table with base units that specifies:

Temperature – Celsius – Celsius is preferred over Kelvin for practical reasons. Kelvin is acceptable as a base unit in special cases like color temperature or where temperature has to be absolute.

Currently smartctl_exporter does not follow these best practices. A few metrics that can be improved are:

  • smartctl_device_block_size could be smartctl_device_block_size_bytes
  • smartctl_device_interface_speed could be smartctl_device_interface_speed_bps or smartctl_device_interface_speed_bits_per_second
  • smartctl_device_rotation_rate could be smartctl_device_rotation_rate_rpm (I suppose a really strict interpretation of the guidelines would be to convert the rotations per minute to radians per second or Hertz, but since RPM is the standard measure of rotation rate, I would keep it at that)
  • smartctl_device_temperature could be smartctl_device_temperature_celsius (thanks for at least putting the unit in the documentation by the way!)

Renaming these metrics would be a breaking change, but I think it is worth considering, because when adoption continues to grow, the aggregate upside to new users will be greater than the downside to current users. A compromise might be to expose the renamed metrics under both the old and new name for one or more releases.

@tekert
Copy link
Contributor

tekert commented Jul 23, 2023

I was puzzled by this too so i refreshed my memory from the smart standard.
The thing is S.M.A.R.T unit types are not specified by the standard.

from the man pages:

Each Attribute has a "Raw" value, printed under the heading "RAW_VALUE", and a "Normalized" value printed under the heading "VALUE". [Note: smartctl prints these values in base-10.] In the example just given, the "Raw Value" for Attribute 12 would be the actual number of times that the disk has been power-cycled, for example 365 if the disk has been turned on once per day for exactly one year. Each vendor uses their own algorithm to convert this "Raw" value to a "Normalized" value in the range from 1 to 254. Please keep in mind that smartctl only reports the different Attribute types, values, and thresholds as read from the device. It does not carry out the conversion between "Raw" and "Normalized" values: this is done by the disk's firmware.

The conversion from Raw value to a quantity with physical units is not specified by the SMART standard. In most cases, the values printed by smartctl are sensible. For example the temperature Attribute generally has its raw value equal to the temperature in Celsius. However in some cases vendors use unusual conventions. For example the Hitachi disk on my laptop reports its power-on hours in minutes, not hours. Some IBM disks track three temperatures rather than one, in their raw values. And so on.
....
So to summarize: the Raw Attribute values are the ones that might have a real physical interpretation, such as "Temperature Celsius", "Hours", or "Start-Stop Cycles". Each manufacturer converts these, using their detailed knowledge of the disk's operations and failure modes, to Normalized Attribute values in the range 1-254. The current and worst (lowest measured) of these Normalized Attribute values are stored on the disk, along with a Threshold value that the manufacturer has determined will indicate that the disk is going to fail, or that it has exceeded its design age or aging limit. smartctl does not calculate any of the Attribute values, thresholds, or types, it merely reports them from the SMART data on the device.


EDIT (for extra info):
smartctl_explorer parses the output of:
.\smartctl --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief <device> --json

You can strip the --json flag to check human readable values.
It seems the explorer will parse json names "as is" for the metric names.

Example: json for an SATA disk may report:

 {
        "id": 194,
        "name": "Temperature_Celsius",
        "value": 111,
...
        "flags": {
          "value": 34,
          ...

The type is just a name.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants