Rayon uses a lot of CPU when there's not a lot of work to do #642

Closed
jrmuizel opened this issue Mar 5, 2019 · 22 comments

@jrmuizel

jrmuizel commented Mar 5, 2019

The following program uses about 30% of the CPU on 4-core (8 HT) Linux and Mac machines:

fn main() {
    loop {
        std::thread::sleep(std::time::Duration::from_millis(10));
        rayon::spawn(|| {});
    }
}

Reducing the sleep duration to 1ms pushes the CPU usage up to 200%.

@cuviper
Member

cuviper commented Mar 5, 2019

See also #576 and #614.

@jrmuizel
Author

jrmuizel commented Mar 5, 2019

A profile on macOS shows most of the time being spent in sched_yield.

@jrmuizel
Author

jrmuizel commented Mar 5, 2019

Also, a similar program written using the dispatch crate uses about 3-4% CPU at the 1 ms duration, vs. the 200% that Rayon uses.
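
For reference, such a dispatch-based version would presumably look roughly like this (a sketch only, assuming the dispatch crate's Queue::global / exec_async bindings to GCD; the actual program isn't shown in this thread):

// Hypothetical sketch of the dispatch-based comparison (macOS only); not the original code.
use dispatch::{Queue, QueuePriority};

fn main() {
    let queue = Queue::global(QueuePriority::Default);
    loop {
        std::thread::sleep(std::time::Duration::from_millis(1));
        // GCD parks its idle worker threads in the kernel, so submitting an
        // empty task costs very little CPU compared to a busy-spinning pool.
        queue.exec_async(|| {});
    }
}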

@jrmuizel
Author

@lukewagner here's an example of some of the bad behavior.

@jrmuizel
Author

Rewriting this using crossbeam-channel produces an executable that only uses 4% CPU:

use crossbeam_channel::unbounded;
use std::thread;

fn main() {
    let (s, r) = unbounded();

    // Eight workers block in recv() until a message arrives.
    for _ in 0..8 {
        let r = r.clone();
        thread::spawn(move || loop {
            let _m: i32 = r.recv().unwrap();
        });
    }

    loop {
        std::thread::sleep(std::time::Duration::from_millis(1));
        s.send(0).unwrap();
    }
}

Perhaps rayon should use the same wakeup infrastructure that crossbeam-channel uses?
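
For context, crossbeam-channel blocks idle receivers with a thread-parking mechanism (similar in spirit to std::thread::park/unpark), so a waiting worker costs essentially no CPU and each send pays only one cheap wakeup. A rough sketch of that idea using the std primitives rather than crossbeam's own parker:

use std::thread;
use std::time::Duration;

fn main() {
    let worker = thread::spawn(|| loop {
        // With nothing queued, the thread parks and consumes no CPU.
        thread::park();
        // ...drain a work queue here...
    });
    loop {
        thread::sleep(Duration::from_millis(1));
        // Waking the parked thread is a single cheap syscall.
        worker.thread().unpark();
    }
}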

@ishitatsuyuki

From https://github.com/rayon-rs/rayon/blob/master/rayon-core/src/sleep/README.md, it seems Rayon will repeatedly yield to the operating system before a thread goes to sleep. How does this work with respect to syscall overhead? I think yielding to the scheduler is almost as expensive as blocking on a condition variable?

@cuviper
Member

cuviper commented Aug 15, 2019

I think yielding to the scheduler is almost as expensive as blocking on a condition variable?

The net performance of blocking has to include the cost of the syscall to wake it up too. Even if the raw cost were the same, there's also a latency difference in fully blocking.

The life demo added in #603 shows this trade-off pretty well. Currently I get:

$ cargo run --release -- life bench
    Finished release [optimized] target(s) in 0.05s
     Running `/home/jistone/rust/rayon/target/release/rayon-demo life bench`
  serial:  114948948 ns
parallel:   32975762 ns -> 3.49x speedup
par_bridge:  631383430 ns -> 0.18x speedup

$ cargo run --release -- life play
    Finished release [optimized] target(s) in 0.05s
     Running `/home/jistone/rust/rayon/target/release/rayon-demo life play`
  serial: 60.00 fps
    cpu usage: 16.2%
parallel: 60.00 fps
  cpu usage: 45.1%
par_bridge: 60.00 fps
  cpu usage: 366.8%

If I change ROUNDS_UNTIL_SLEEPY and ROUNDS_UNTIL_ASLEEP to 1:

$ cargo run --release -- life bench
    Finished release [optimized] target(s) in 0.05s
     Running `/home/jistone/rust/rayon/target/release/rayon-demo life bench`
  serial:  111999829 ns
parallel:   40216218 ns -> 2.78x speedup
par_bridge:  625470343 ns -> 0.18x speedup

$ cargo run --release -- life play
    Finished release [optimized] target(s) in 0.05s
     Running `/home/jistone/rust/rayon/target/release/rayon-demo life play`
  serial: 60.00 fps
    cpu usage: 15.9%
parallel: 60.00 fps
  cpu usage: 33.8%
par_bridge: 60.00 fps
  cpu usage: 350.2%

So not as much speedup, but lower CPU usage.
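
For context, the idle loop that those constants control is roughly shaped like the sketch below (simplified, not Rayon's actual code; the threshold values are placeholders). Lowering the thresholds makes idle workers block sooner, which cuts CPU use at the cost of wakeup latency, and that's the speedup difference above.

use std::sync::{Condvar, Mutex};

// Placeholder thresholds standing in for ROUNDS_UNTIL_SLEEPY / ROUNDS_UNTIL_ASLEEP.
const ROUNDS_UNTIL_SLEEPY: u32 = 32;
const ROUNDS_UNTIL_ASLEEP: u32 = 64;

// Simplified worker idle loop: spin, then yield, then fully block.
fn idle_loop(try_steal: impl Fn() -> bool, has_work: &Mutex<bool>, wakeup: &Condvar) {
    let mut rounds = 0;
    loop {
        if try_steal() {
            rounds = 0; // found a job, so start counting from "hot" again
            continue;
        }
        rounds += 1;
        if rounds < ROUNDS_UNTIL_SLEEPY {
            std::hint::spin_loop(); // busy-spin: best latency, worst CPU use
        } else if rounds < ROUNDS_UNTIL_ASLEEP {
            std::thread::yield_now(); // the sched_yield that shows up in profiles
        } else {
            // Fully asleep: block on a condvar until a producer signals new work.
            let mut ready = has_work.lock().unwrap();
            while !*ready {
                ready = wakeup.wait(ready).unwrap();
            }
            *ready = false;
            rounds = 0;
        }
    }
}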

@ishitatsuyuki

A few links to what might be making the problem worse:

@nikomatsakis
Member

I've been working on a rewrite of the sleep system that I think will address this problem. I opened #691 with an immediate refactoring, but I'm working on a more extensive rewrite. I hope to open an RFC in a few days with a sketch of the algorithm plus a few alternatives I'd like to evaluate.

The gist of it is this:

  • In the new system, we almost never wake up more than one thread at a time (the exception is when terminating, which wakes all threads so they can quit).
  • When producing new work, we can try to avoid waking threads at all if we judge there to be idle threads that are currently awake.
  • When signalling latches, we only wake up the one thread that is actually blocked on the latch (if any).

We'll have to benchmark it, of course, and we'll probably want to tune things.
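
A very rough sketch of that wake-up discipline, just to illustrate the intent (hypothetical bookkeeping, not the RFC's actual data structures):

use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Condvar, Mutex};

struct SleepState {
    idle_but_awake: AtomicUsize, // workers spinning in the idle loop
    sleeping: Mutex<usize>,      // workers blocked on the condvar
    wakeup: Condvar,
}

impl SleepState {
    // Called when new work is pushed or a latch is set.
    fn notify_new_work(&self) {
        // If someone is already awake and looking for work, they'll pick this
        // up on their next steal attempt; no syscall needed at all.
        if self.idle_but_awake.load(Ordering::SeqCst) > 0 {
            return;
        }
        // Otherwise wake exactly one sleeper (notify_one, never notify_all,
        // except at pool termination).
        if *self.sleeping.lock().unwrap() > 0 {
            self.wakeup.notify_one();
        }
    }
}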

@nikomatsakis
Member

Opened: rayon-rs/rfcs#5 with a fairly complete description of what I have in mind. I've not implemented all of that yet, but I'm getting there.

@nikomatsakis
Member

Using the algorithm from the RFC seems to reduce this example to taking negligible CPU time:

> time -p /home/nmatsakis/versioned/rayon/target/release/rayon-demo noop --iters 1000
real 10.15
user 0.06
sys 0.04
> time -p /home/nmatsakis/versioned/rayon/target/release/rayon-demo noop --iters 1000 --sleep 1
real 1.09
user 0.05
sys 0.03

@djg

djg commented Sep 10, 2019

I've been experimenting with this case where work is spawned every millisecond:

fn main() {
    loop {
        std::thread::sleep(std::time::Duration::from_millis(1));
        rayon::spawn(|| {});
    }
}

With the current release of rayon, all my HTs are pegged:

[Screenshot: CPU usage with the current Rayon release]

With the latch-target-thread branch, CPU usage is massively reduced:

[Screenshot: CPU usage with the latch-target-thread branch]

Excited to see where this goes.

bors bot added a commit that referenced this issue Sep 23, 2019
695: a few tweaks to the demos r=cuviper a=nikomatsakis

A few changes that I am making on my branch that I would like applied to master too.

This renames some benchmarks to make them uniquely identifiable and also adds a test to measure #642. 

Co-authored-by: Niko Matsakis <niko@alum.mit.edu>
@jrmuizel
Author

Here's a screenshot from a GPUview profile of WebRender using Rayon.

It shows two CPUs spending 2 ms spinning, just starting and stopping the threads.

[Screenshot: GPUview profile of WebRender's Rayon worker threads]

@cuviper
Member

cuviper commented Apr 13, 2020

@jrmuizel would it be possible for you to gather a similar profile with #746?

@jrmuizel
Author

Yeah. I'm trying that now.

@jrmuizel
Author

Here's a similar situation with #746
[Screenshot: GPUview profile with #746 applied]

It looks like we're spinning for less time, which is good.

I haven't read through #746, but wouldn't it be better for us to wait on something instead of having 2 cores jumping between 4 threads spinning on yield()?

@cuviper
Member

cuviper commented Apr 14, 2020

We do ultimately wait on condvars, but there's some spinning on yield() first, in hopes of reducing latency. You can see some of that effect in my earlier comment, where I cut down the number of spinning rounds. Still, there may be room to fine-tune this further in #746.

@hniksic
Contributor

hniksic commented Sep 2, 2020

I think I might have encountered this issue. It involves work that is easy to parallelize (essentially order-insensitive stream processing using par_bridge() with almost no mutable shared state) running on a 64-CPU server. On my laptop the work parallelizes nicely to 4 cores and runs approximately 4x faster than sequentially. But on the 64-CPU server the work doesn't run anywhere near 64x faster than the sequential version on the same machine, even though all the CPUs are at 100% the entire time. Experimenting with RAYON_NUM_THREADS shows that performance peaks at around 8-16 threads and gets worse with 32 and 64.

I was unable to reproduce the issue with simpler minimal examples, where Rayon seemed to distribute the load nicely. So I rewrote the code to use a crossbeam_channel to distribute the work to 64 manually created threads instead. I noticed two things: first, the timing was a little better (the program finished faster than with Rayon), but not spectacularly so. Second, and more important, the individual CPUs, as reported by htop, stayed under 50% the whole time.

I suspect that the issue is that the work-producing thread (the one that reads the stream) is limited by IO and is not providing work fast enough to saturate 64 CPUs. The workload will never utilize all 64 CPUs, and that's fine. However, with Rayon it might be providing just enough work to hit this bug and make all cores 100% busy, with so much time spent busy-waiting that it actually prevents them from doing useful work. Reducing the number of threads fixes the issue because the bug doesn't appear when there is enough work per thread. The bug doesn't appear on my laptop for the same reason: it doesn't have enough CPU horsepower to outpace the IO bandwidth of the reader thread.
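
For the record, the thread-count cap doesn't have to come from RAYON_NUM_THREADS; it can also be set in code (the 8 below is just the sweet spot from the experiments above, not a recommendation):

fn main() -> Result<(), rayon::ThreadPoolBuildError> {
    // Cap the global pool before any parallel work runs.
    rayon::ThreadPoolBuilder::new().num_threads(8).build_global()?;
    // ...run the par_bridge() pipeline as before...
    Ok(())
}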

Sadly, upgrading Rayon to 1.4.0, which as I understand it should include #746, didn't seem to make a difference. Is there some way to test whether my program triggers the issue reported here?

(All of the above experiments are performed with release builds.)

@cuviper
Member

cuviper commented Sep 2, 2020

I'm going to close this issue now that 1.4.0 is published.

@hniksic could you open a new issue for your case? There are definitely some scalability issues with par_bridge() that we should discuss separately.

@cuviper cuviper closed this as completed Sep 2, 2020
@hniksic
Contributor

hniksic commented Sep 2, 2020

@cuviper I was loath to open an issue since I couldn't provide a minimal example that reproduces the bug, though I did try. If you think the above description would be useful enough as an issue of its own, I'll gladly file one.

@cuviper
Member

cuviper commented Sep 2, 2020

@hniksic Issues are not a scarce resource. 🙂 We can at least discuss the problem in the abstract, and there might be other people that want to chime in, who may have something more reproducible.

@hniksic
Contributor

hniksic commented Sep 2, 2020

@cuviper Thanks! I've now submitted #795.
