Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigate cpu_usage() from returning NaN on Macs #1236

Merged
merged 3 commits into from Mar 28, 2024

Conversation

Hoverbear
Copy link
Contributor

Describe the bug

Using the current latest checkout of this repo (or any 0.30.x version) on a Mac can result in cpu.cpu_usage() reporting a NaN. This is primarily caused by this code:

in_use = get_in_use(cpu_info, offset).saturating_sub(get_in_use(old_cpu_info, offset));
total = in_use.saturating_add(
get_idle(cpu_info, offset).saturating_sub(get_idle(old_cpu_info, offset)) as _,
);
}
in_use as f32 / total as f32 * 100.

If refresh_cpu() is called in rapid succession the collected data can be the same, resulting in total being 0, since in_use is divided by total, this creates a divide by zero situation and can return a NaN.

It can be worked around by waiting sysinfo::MINIMUM_CPU_UPDATE_INTERVAL between calls to refresh_cpu(), as the data will (hopefully) no longer be the same.

Some platforms, such as Linux, do this interval check internally:

// If the last CPU usage update is too close (less than `MINIMUM_CPU_UPDATE_INTERVAL`),
// we don't want to update CPUs times.
if need_cpu_usage_update {

This PR takes the same strategy as Linux and enforces an interval internally.

I can confirm that as a result of this change, cargo test passess successfully on my aarch64 Mac.

To Reproduce

I used a aarch64 Mac, but it should be reproducable on any Mac. It was also observed on during local Rust builds with build.metrics enabled.

In order to reproduce this outside of the Rust repo, I did a bit of light refactoring of compute_usage_of_cpu:

// src/unix/apple/cpu.rs
pub(crate) fn compute_usage_of_cpu(proc_: &Cpu, cpu_info: *mut i32, offset: isize) -> f32 {
    let old_cpu_info = proc_.inner.data().cpu_info.0;
    let in_use;
    let total;

    // In case we are initializing cpus, there is no "old value" yet.
    if old_cpu_info == cpu_info {
        in_use = get_in_use(cpu_info, offset);
        eprintln !("in_use {in_use}");
        let idle = get_idle(cpu_info, offset);
        total = in_use.saturating_add(idle as _);
    } else {
        in_use = get_in_use(cpu_info, offset).saturating_sub(get_in_use(old_cpu_info, offset));
        eprintln !("in_use {in_use}");
        let idle = get_idle(cpu_info, offset);
        let old_idle = get_idle(old_cpu_info, offset);
        eprintln!("idle {idle}, old_idle {old_idle}");
        total = in_use.saturating_add(
            idle.saturating_sub(old_idle) as _,
        );
    }
    eprintln !("total {total}");
    in_use as f32 / total as f32 * 100.
}

Then I updated the the test_cpu test to rapidly call refresh_cpu() twice, and assert that none of the usage was NaN:

// tests/cpu.rs
#[test]
fn test_cpu() {
    let mut s = sysinfo::System::new();
    assert!(s.cpus().is_empty());

    if !sysinfo::IS_SUPPORTED_SYSTEM {
        return;
    }

    s.refresh_cpu();
    assert!(!s.cpus().is_empty());
    // std::thread::sleep(std::time::Duration::from_millis(200));
    s.refresh_cpu();

    assert!(s.cpus().iter().any(|c| !c.cpu_usage().is_nan()));

    let mut s = sysinfo::System::new_all();
    assert!(!s.cpus().is_empty());

    assert!(!s.cpus()[0].brand().chars().any(|c| c == '\0'));

    if !cfg!(target_os = "freebsd") {
        // This information is currently not retrieved on freebsd...
        assert!(s.cpus().iter().any(|c| !c.brand().is_empty()));
    }
    assert!(s.cpus().iter().any(|c| !c.vendor_id().is_empty()));
}

With that done, I was able to reproduce the issue via cargo test test_cpu -- --nocapture:

~/git/GuillameGomez/sysinfo> cargo test test_cpu -- --nocapture
# snipped

     Running tests/cpu.rs (target/debug/deps/cpu-02526d4c3ab91678)

running 1 test
in_use 5690012
total 24318088
in_use 5149645
total 24419660
in_use 3795005
total 24553864
in_use 2914872
total 24646964
in_use 2465127
total 24707921
in_use 2097229
total 24736675
in_use 878747
total 24836498
in_use 882113
total 24842076
in_use 874508
total 24843371
in_use 883108
total 24843382
in_use 886023
total 24844247
in_use 870234
total 24845131
in_use 0
idle 18628076, old_idle 18628076
total 0
in_use 0
idle 19270015, old_idle 19270015
total 0
in_use 0
idle 20758859, old_idle 20758859
total 0
in_use 0
idle 21732092, old_idle 21732092
total 0
in_use 0
idle 22242794, old_idle 22242794
total 0
in_use 0
idle 22639446, old_idle 22639446
total 0
in_use 0
idle 23957751, old_idle 23957751
total 0
in_use 0
idle 23959963, old_idle 23959963
total 0
in_use 0
idle 23968863, old_idle 23968863
total 0
in_use 0
idle 23960274, old_idle 23960274
total 0
in_use 0
idle 23958224, old_idle 23958224
total 0
in_use 0
idle 23974897, old_idle 23974897
total 0
thread 'test_cpu' panicked at tests/cpu.rs:19:5:
assertion failed: s.cpus().iter().any(|c| !c.cpu_usage().is_nan())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test test_cpu ... FAILED

failures:

failures:
    test_cpu

I confirmed that adding the delay resolves the issue.

@Hoverbear
Copy link
Contributor Author

I believe the clippy failure to be unrelated to this PR, please let me know if I should take action to resolve it.

@GuillaumeGomez
Copy link
Owner

Hi @Hoverbear! o/

First of all: thanks a lot for the PR, the detailed explanations of the issue and for adding test. It's really awesome. :)

Just one nit and I merge.

As for the clippy lint, it's not from your code, just last nightly clippy version. Can you add a #[allow()] attribute for it where it is triggered please? If you don't want to it's fine, I'll send a fix for it.

@Hoverbear
Copy link
Contributor Author

So good to bump into you again, my friend. :) Really appreciate this crate and your work on it!

@Hoverbear Hoverbear changed the title Prevent cpu_usage() from returning NaN on Macs Mitigate cpu_usage() from returning NaN on Macs Mar 28, 2024
@Hoverbear
Copy link
Contributor Author

I renamed this issue, I believe cpu_usage() may still be able to return NaN after this patch, it will simply be much more uncommon.

If the measurements of the total = in_use + (idle - old_idle) result in 0 (for example, when the measurement was identical to the last measurement) I believe a NaN could still be returned.

@GuillaumeGomez
Copy link
Owner

Could you not simply add a check and if it's 0, just return 0?

@GuillaumeGomez
Copy link
Owner

Forgot to precise: in addition to your add (which I think is a very good idea).

@Hoverbear
Copy link
Contributor Author

Hoverbear commented Mar 28, 2024

Hmmmm... I believe that reporting zero would be incorrect, no? It's possible that say, in_use + new_idle - old_idle is zero but that actual usage the same as the last usage?

Let's say:

let old_in_use = 200;
let new_in_use = 200;
let old_idle = 4000;
let new_idle = 4000;
// Results in
let in_use = (new_in_use - old_in_use); // 0
let total = (new_in_use - old_in_use) + (new_idle - old_idle); // 0

Then we'd go divide by total and get a NaN again! Shucks. Returning zero here isn't correct either, since the usage is indeed not zero.

Therefore, to be safe here, we can check to see if the measurement is the same, and return the just the straight measurement instead of the difference if so.

let new_in_use = get_in_use(cpu_info, offset);
let old_in_use = get_in_use(old_cpu_info, offset);

let new_idle = get_idle(cpu_info, offset);
let old_idle = get_idle(old_cpu_info, offset);

// Check the old measurement and new one differ before trying to calculate the difference
// Avoid `total` becoming a zero since `in_use / 0 == NaN`
if new_in_use == old_in_use && new_idle == old_idle {
    in_use = new_in_use.saturating_sub(old_in_use);
    total = new_in_use.saturating_add(new_idle as _);
} else {
    in_use = new_in_use.saturating_sub(old_in_use);
    total = in_use.saturating_add(new_idle.saturating_sub(old_idle) as _);
}

Then, even still, it's possible to get a NaN! Assuming that:

let old_in_use = 400;  // It's bigger!
let new_in_use = 200;  // It's smaller!
let old_idle = 4000;
let new_idle = 4000;   // Unchanged because... CPU Frequency scaled(?)
// Results in
let in_use = (new_in_use - old_in_use); // 0
let total = (new_in_use - old_in_use) + (new_idle - old_idle); // 0

Heck! In this case we now need to do even more work, and how do we provide something accurate? 😵

Instead, I suggest we set in_use and idle, then calculate in_use / (idle + in_use)? I believe that would be more frequently correct...

let in_use;
let idle;

// In case we are initializing cpus, there is no "old value" yet.
if old_cpu_info == cpu_info {
    in_use = get_in_use(cpu_info, offset);
    idle = get_idle(cpu_info, offset);
} else {
    let new_in_use = get_in_use(cpu_info, offset);
    let old_in_use = get_in_use(old_cpu_info, offset);

    let new_idle = get_idle(cpu_info, offset);
    let old_idle = get_idle(old_cpu_info, offset);

    in_use = new_in_use.saturating_sub(old_in_use);
    idle = new_idle.saturating_sub(old_idle) as _;
}
let total = in_use.saturating_add(idle as _);
let usage = (in_use as f32 / total as f32) * 100.;
if usage.is_nan() {
    // If divided by zero, avoid returning a NaN
    0.
} else {
    usage
}

Please let me know if my brain became momentarily vacant and there would be a more appropriate fix?

Here's a log from a run using that code:

~/git/GuillameGomez/sysinfo> cargo run --example simple                                                                                                                                                                                                                  2024-03-28 06:53:03 AM
    Finished dev [unoptimized + debuginfo] target(s) in 0.01s
     Running `target/debug/examples/simple`
Getting system information...
Done.
To get the commands' list, enter 'help'.
> refresh_cpu
Refreshing CPUs...
Done.
> cpus
number of physical cores: 12
total CPU usage: 3.2543619%
Cpu { name: "1", CPU usage: 12.703101, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "2", CPU usage: 10.895523, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "3", CPU usage: 5.8479533, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "4", CPU usage: 3.9358602, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "5", CPU usage: 2.1802325, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "6", CPU usage: 1.1678832, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "7", CPU usage: 0.5805515, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "8", CPU usage: 0.14492755, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "9", CPU usage: 0.5805515, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "10", CPU usage: 0.8708273, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "11", CPU usage: 0.0, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "12", CPU usage: 0.14492755, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
> refresh_cpu
Refreshing CPUs...
Done.
> cpus
number of physical cores: 12
total CPU usage: 3.3380797%
Cpu { name: "1", CPU usage: 13.250883, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "2", CPU usage: 10.229277, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "3", CPU usage: 8.391608, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "4", CPU usage: 2.7972028, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "5", CPU usage: 1.9130435, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "6", CPU usage: 0.8695652, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "7", CPU usage: 1.0416667, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "8", CPU usage: 0.0, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "9", CPU usage: 0.34843206, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "10", CPU usage: 1.2152778, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "11", CPU usage: 0.0, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }
Cpu { name: "12", CPU usage: 0.0, frequency: 4056, vendor ID: "Apple", brand: "Apple M3 Pro" }

@GuillaumeGomez
Copy link
Owner

I agree with you. New code looks better too. Thanks a lot for this!

@GuillaumeGomez GuillaumeGomez merged commit 93c1c15 into GuillaumeGomez:master Mar 28, 2024
66 of 67 checks passed
@Hoverbear Hoverbear deleted the hoverbear/sysinfo branch March 28, 2024 23:20
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Apr 2, 2024
…o-0.30.8, r=Nilstrieb

Update sysinfo to 0.30.8

Fixes a Mac specific issue when using `metrics = true` in `config.toml`.

```config.toml
# Collect information and statistics about the current build and writes it to
# disk. Enabling this or not has no impact on the resulting build output. The
# schema of the file generated by the build metrics feature is unstable, and
# this is not intended to be used during local development.
metrics = true
```

During repeated builds, as the generated `metrics.json` grew, eventually `refresh_cpu()` would be called in quick enough succession (specifically: under 200ms) that a divide by zero would occur, leading to a `NaN` which would not be serialized, then when the `metrics.json` was re-read it would fail to parse.

That error looks like this (collected from Ferrocene's CI):

```
   Compiling rustdoc-tool v0.0.0 (/Users/distiller/project/src/tools/rustdoc)
    Finished release [optimized] target(s) in 38.37s
thread 'main' panicked at src/utils/metrics.rs:180:21:
serde_json::from_slice::<JsonRoot>(&contents) failed with invalid type: null, expected f64 at line 1 column 9598
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Build completed unsuccessfully in 0:00:40

Exited with code exit status 1
```

Related: GuillaumeGomez/sysinfo#1236
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Apr 2, 2024
Rollup merge of rust-lang#123378 - ferrocene:hoverbear/bump-sysinfo-to-0.30.8, r=Nilstrieb

Update sysinfo to 0.30.8

Fixes a Mac specific issue when using `metrics = true` in `config.toml`.

```config.toml
# Collect information and statistics about the current build and writes it to
# disk. Enabling this or not has no impact on the resulting build output. The
# schema of the file generated by the build metrics feature is unstable, and
# this is not intended to be used during local development.
metrics = true
```

During repeated builds, as the generated `metrics.json` grew, eventually `refresh_cpu()` would be called in quick enough succession (specifically: under 200ms) that a divide by zero would occur, leading to a `NaN` which would not be serialized, then when the `metrics.json` was re-read it would fail to parse.

That error looks like this (collected from Ferrocene's CI):

```
   Compiling rustdoc-tool v0.0.0 (/Users/distiller/project/src/tools/rustdoc)
    Finished release [optimized] target(s) in 38.37s
thread 'main' panicked at src/utils/metrics.rs:180:21:
serde_json::from_slice::<JsonRoot>(&contents) failed with invalid type: null, expected f64 at line 1 column 9598
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Build completed unsuccessfully in 0:00:40

Exited with code exit status 1
```

Related: GuillaumeGomez/sysinfo#1236
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants