
improve the sleeping thread algorithm #5

Merged
merged 6 commits on May 2, 2023

Conversation

nikomatsakis
Member

Rayon's existing approach to putting threads to sleep can lead to
excessive CPU usage (see e.g. rayon-rs/rayon#642). In the current
algorithm, threads can gradually put themselves to sleep if they don't
find work to do. They do this one at a time. But as soon as any
work arrives (or -- in fact -- even any work completes) all threads
awaken.

This RFC proposes an alternative algorithm, and explores some of the
design options and tradeoffs available. It does not claim to be
exhaustive and feedback is most certainly desired on alternative
approaches!

@ishitatsuyuki

It isn't very clear what the sleepy state does now -- does it continuously yield to the OS scheduler like it currently does, or does it just hold a lock and go to sleep?

The yield operation has felt like a major source of non-determinism and I'm looking forward to getting rid of it.

@nikomatsakis
Member Author

By the way, an implementation of this algorithm is available in my latch-target-thread branch. The results so far are...not encouraging. It seems to perform terribly and not to help with CPU usage. =) However, I only finished hacking it up this morning and I haven't tuned or examined it at all. So it may well be buggy or maybe even incomplete.

Regarding the RFC:

  • I plan to include some more material on how current algorithm works
    • also some of the key differences
  • as well as some discussion of other possible variants
  • and (maybe) comparisons to other schedulers

One thing I think would also be very useful is to try and clarify our benchmarks. I have been using the rayon-demo benchmarks, but other suggestions would be welcome. In the past, we've had some FF folks do experimentation, but it'd be good to get some other "real world" benchmarks.

@nikomatsakis
Member Author

nikomatsakis commented Sep 10, 2019

Update:

It seems to perform terribly and not to help with CPU usage. =)

I made some improvements and this is no longer the case. Performance now seems to be comparable to the existing scheduler, perhaps somewhat slower. CPU usage is indeed improved. I've been testing primarily on the life benchmark and not particularly rigorously, so consider that a tentative conclusion.

If you take a look at the branch, you'll also see that I implemented a number of variations. I'd like to start doing some "more rigorous" benchmarking to try and decide between them. There are a number of things we can tune; I'll try to update the RFC with a bit more detail on those factors -- although preliminary tinkering hasn't shown much impact.

Here are some examples:

For life bench --size 1024, I observed typical speedups with my branch of approximately 6.05x, whereas typical speedups with master were around 6.45x. This is on my 14x2 core machine. In both cases the speedup could go as high as 12x and sometimes as low as 5x. It'd be nice to know what makes the difference there. =)

Running with life bench without the --size parameter yielded speedups of around 1.5x in both cases, with jumps up to 3x or down to 1x.

In terms of CPU usage:

| command | branch | master |
| --- | --- | --- |
| life play | ~185% | ~500% |
| life play --size 1024 | ~865% | ~1010% |

@lnicola

lnicola commented Sep 10, 2019

Not sure if it makes a difference, but can you try disabling Turbo Boost? I imagine there might be some interactions between the CPU frequency and the spinning threads.

@nikomatsakis
Member Author

@lnicola any advice on how to do that on Fedora Linux?

@lnicola

lnicola commented Sep 10, 2019

Check if you're using intel_pstate (I think it's the default on most distros):

```
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_driver
```

If so, then:

```
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```

If you're not using intel_pstate, either switch to it, or try another method.

@nikomatsakis
Member Author

nikomatsakis commented Sep 10, 2019

Fascinating. Disabling turbo boost yielded the following results. These are the results of:

```
for n in 1 2 3; do cargo run --release -- life bench --size 1024 --skip-bridge ; done
```

I ran the command twice and I am reporting the second round of results. (The first round was basically comparable.)

| what | 1 | 2 | 3 | average |
| --- | --- | --- | --- | --- |
| branch | 7.27x | 9.13x | 8.17x | 8.19x |
| master | 7.7x | 8.34x | 8.43x | 8.15x |

So basically identical.

@jrmuizel

How does this compare with what crossbeam-channel does? @stjepang

@nikomatsakis
Member Author

The algorithm as described is flawed -- as I wrote, the 'new work' notifications are not guaranteed to be observed, but this means that it is possible for a job to be injected from outside the pool without actually waking any threads. As a result, all the threads can be sleeping and they'll never be awoken. This isn't a problem with jobs from within the pool because at least one thread (the injector) is awake. This seems correctable in a variety of ways, but it's too late for me to think about the best way to do it at this moment. =)

@nikomatsakis
Member Author

nikomatsakis commented Sep 12, 2019

OK, I did some experimenting this morning. Really, more than I should have, as I should be doing some rustc work right now. I came up with two approaches to solve the problem.

The first is to keep a counter of all "injected jobs". We would read the value of this counter upon entering the idle loop and then check that it has not changed before going to sleep. If you are careful with respect to the ordering of events, and you use seq-cst guarantees, you can then ensure that you would have observed either a new job in the queue or the counter changing.

The downside of this approach is that (a) injected jobs are kind of special and -- more importantly -- (b) it requires an atomic increment per injected job even if no threads are sleepy. This seems bad in the steady state. It feels like a valid use case would be injecting jobs from outside the pool over time. (I'd eventually like to make threads outside the pool able to be more active participants, as well, and this seems like it would work against that.)

So I experimented with another approach. I brought back the "sleepy worker" concept from the existing sleeping pool. This means that workers fall asleep one at a time, coordinating via a per-thread-pool atomic. In particular, before actually going to sleep, they must first become "the sleepy worker" (of which there is only 1). This requires an atomic compare-exchange. When they then try to go to sleep, they release the "sleepy worker" state with another exchange. When we inject a new job, meanwhile, we can check if there is a sleepy worker and pre-emptively clear the flag. This means that when going to sleep, a sleepy worker will notice that they are no longer sleepy, and hence they will go back to searching for work again.

Experimentally, both of these approaches prevent the deadlock I was seeing, though I'd like to document them more carefully. I compared the performance on a handful of benchmarks and found that overall the sleepy worker performs best.

You can view a plot of my measurements here -- I'll try to put them in some more consumable form later. Each point is the output of a single cargo bench run (I did approximately 3 per branch, though 6 for master). The Y axis is ns/iter, and hence "lower is better". The columns at the bottom are commits:

  • 8677bcb -- injection counter
  • 6231444 -- master
  • 22c0380 -- sleepy worker

You can see that sleepy worker performs better than injection counter, and master seems to generally perform best of all (though the difference is small). The biggest change is the quicksort-par-bench benchmark.

@cuviper
Member

cuviper commented Sep 12, 2019

The algorithm as described is flawed -- as I wrote, the 'new work' notifications are not guaranteed to be observed, but this means that it is possible for a job to be injected from outside the pool without actually waking any threads. As a result, all the threads can be sleeping and they'll never be awoken. This isn't a problem with jobs from within the pool because at least one thread (the injector) is awake.

It's still a performance problem if that happens within, because we won't use all of the threads that we could. Not as bad as a hang making no progress though. Anyway, the "sleepy worker" solves both aspects, right? How is the idle CPU usage with this back in place?

@nikomatsakis
Member Author

@cuviper

It's still a performance problem if that happens within, because we won't use all of the threads that we could.

Correct. I argue in the RFC why this is unlikely to be a major problem, but it's still something to be minimized.

Anyway, the "sleepy worker" solves both aspects, right?

Partially correct. It does apply equally to both cases, but it's still possible for us to have fewer threads active than the "optimum". For example, if two new jobs appear very close to one another, and we have only one idle thread, both of them might conclude that no threads need to be re-awoken -- but the idle thread will only be able to handle one of those jobs. However, we do now guarantee that we will never have zero active threads when there are waiting jobs.

How is the idle CPU usage with this back in place?

I've not tested, I'll take a look. I wouldn't expect a major impact but you never know.

@nikomatsakis
Member Author

nikomatsakis commented Sep 13, 2019

Reran the CPU usage benchmarks (with turbo boost disabled, this time). You can see a difference from before, but it is still markedly improved over master, at least for smaller sizes.

| command | branch | master |
| --- | --- | --- |
| life play | 282% | 546% |
| life play --size 1024 | 1151% | 1264% |

It's also worth pointing out that I haven't tried tuning the "number of rounds" values -- currently I have 16 rounds until sleepy, then 1 more round until asleep. Probably worth experimenting.

Oh, and the no-op command from rayon-rs/rayon#642 still uses very little CPU overall (though slightly more):

```
> time -p /home/nmatsakis/versioned/rayon/target/release/rayon-demo noop --iters 1000 --sleep 1
real 1.12
user 0.05
sys 0.05
> time -p /home/nmatsakis/versioned/rayon/target/release/rayon-demo noop --iters 1000 --sleep 10
real 10.15
user 0.05
sys 0.05
```

@nikomatsakis
Member Author

nikomatsakis commented Sep 13, 2019

So I experimented with another approach. I brought back the "sleepy worker" concept from the existing sleeping pool. This means that workers fall asleep one at a time, coordinating via a per-thread-pool atomic.

Also, we could in principle allow workers to fall asleep more than one at a time with the same basic mechanism, but it would require us to use e.g. 1 bit per worker, which would put a hard cap on the number of workers. I didn't like the sound of that.

It's also worth pointing out that I haven't tried tuning the "number of rounds" values -- currently I have 16 rounds until sleepy, then 1 more round until asleep. Probably worth experimenting.

It occurs to me that some kind of pseudo-random counter that tries to pick a random number of rounds might help to "stagger" threads and lessen the chance of them blocking on one another.

@nikomatsakis
Member Author

I tried playing with different number of rounds. The results surprised me.

| Rounds until sleepy | Add'l rounds until sleeping | CPU usage for life play |
| --- | --- | --- |
| 8 | 1 | 1440% |
| 8 | 8 | 1553% |
| 16 | 1 | 282% |
| 32 | 1 | 255% |
| 64 | 1 | 233% |
| 128 | 1 | 318% |

OK, off to bed. =)

@nikomatsakis
Member Author

nikomatsakis commented Sep 13, 2019

I had an idea last night but I've not had time to try it out. I thought I'd jot it down before I forget, since I don't have time to try it out this morning.

The goal would be to:

  • allow multiple threads to go to sleep at once and avoid the "missed job" problem,

but without requiring a write in the "steady state" of jobs being injected regularly.

Basically, you have two counters:

  • SLEEPY
  • JOBS

They both start out at zero.

  • Whenever someone gets sleepy, they increment SLEEPY, and they remember the value S that they set it to.
  • Whenever someone publishes work, they read the counters. If JOBS == SLEEPY, there is nothing to do. Otherwise, they CAS to make JOBS equal to SLEEPY.
  • Whenever that sleepy thread tries to go to sleep, they check if JOBS is >= their value S. If so, they cancel their attempt to fall asleep.

Ideally, you would make these two 32-bit counters in a 64-bit word, so they can be easily read and manipulated atomically.

One complication is how you handle rollover. Particularly w/ 32-bit counters, I think that is a real possibility. But it seems like it should be possible somehow. I'm imagining that a SLEEPY increment that would roll over has to also adjust the JOBS counter back to zero, and then that the workers going to sleep need to check if SLEEPY < S -- that indicates rollover and they should probably just start over and go sleepy again. (Technically, of course, they might delay so long that SLEEPY gets re-incremented back up to equal S, but that seems like a scenario we can discount as being a truly pathological scheduler. If we were really worried, we'd have a separate 64-bit "epoch" counter they could look at, I guess, that also gets incremented on rollover?)

Anyway, as I said, I came up with this last night while drifting off to sleep, so maybe there's a flaw. Presuming it works, though, I think it should avoid steady state writes because -- basically -- if all threads are busy, then nobody is getting sleepy, and there is no need to increment JOBS, you just push the work onto the queue.

(Something about this is bothering me; it feels like there is such overlap between the idle/sleeping job counters and these counters, but I don't yet see how to consolidate them.)

@nikomatsakis
Member Author

nikomatsakis commented Sep 15, 2019

OK, so I got some more time to mess with this today. First of all, I implemented the algorithm described in the previous comment, with one slight variation which I edited into the comment: when announcing new jobs, we always CAS to make JOBS equal to SLEEPY. This forces all SLEEPY workers to cycle around one more time. This seems to be pretty important for overall performance, empirically.

While this algorithm remains my personal favorite (*), it doesn't really perform particularly differently from the rest. In particular, the quick_sort_par_bench benchmark remains ~50% slower than with the existing sleep algorithm. So I went in to take a closer look.

The problem seems to be precisely this matter of not keeping enough threads awake. The sleep algorithm on master, when monitored with perf stat, keeps 20.85 cpus active and takes around 8,000,000 ns/iter. The algorithm implemented here (well, I haven't kept the RFC up to date, but more-or-less) keeps only 8.7 cpus active, and takes around 12,000,000 ns/iter. I've tried a number of variations of this algorithm, all oriented at keeping more CPUs active (e.g., waking more CPUs, having them idle longer, etc). There is basically a linear correlation between CPUs active and ns/iter, as you can see in this google spreadsheet.

A few updates:

I've looked into the logs of what's happening. As best I can tell, with the algorithm as described here, we suffer at least sometimes from the expected race of "many jobs being published but only 1 idle worker". The jobs all expect that idle worker to service them, but it can only handle 1 job.

I went and re-read the Go scheduler comment and noticed one idea I had not considered before, which is to have the last idle thread, when it finds work, wake a replacement. I implemented this but it didn't seem to help much (still, it seems smart, so I kept it).

I also stared at the "logs" that Rayon can generate. It's hard, though, to know how much to trust them, because they just dump out with eprintln and so are obviously very intrusive. From those logs, it seemed like it often happened that we had 1 or 2 idle threads and ~7 sleeping threads. This might explain why the Go tweak didn't help much: the last idle thread doesn't shut down? But I'm not sure if this is real.

One thing I am considering is trying to implement a better logging mechanism. The idea would be to have each thread kind of log events in a lightweight fashion (perhaps pushing to a thread-local vector), dump them out, and then later try to reconstruct the overall state (how many threads idle, sleeping, how many jobs lingering in queues). I've not thought too much about this but it seems like it'd be a super useful tool. It also seems like it might overlap a lot with what @wagnerf42 has proposed in #4, which could be good. It also seems like a fair amount of work. =)

Barring future improvements, though, we have a few choices. We could land the new scheduler roughly as is and accept that quick_sort_par_bench needs improvement. We could land one of the more "aggressive" variants (e.g., one that wakes all threads), and try to work from there (they still improve CPU usage, although not as much). Or we can just keep experimenting.

One other thing I should probably do is to try and update that measurement spreadsheet with other kinds of data, such as the benchmark results for different variants of this RFC.


(*) Update: What are the major things I've explored thus far? There are two axes.

First, how to detect idle workers? With a counter or a heuristic? I think a counter is probably better. It seems simpler, performs roughly the same, and it allows us to detect things like "when the last idle thread finds work".

Second, how to avoid deadlock? I tried three variants:

  • Increment a jobs counter for every job.
    • Downside: atomic write on every new job, at least from outside the thread-pool
    • Downside: doesn't help with thread-local internal jobs, which we've seen to be a problem
  • One sleepy worker at a time.
    • Downside: slower for workers to sleep, which hurts CPU time
    • Upside: common case is just a seqcst load for new jobs
    • Upside: treats internal and external jobs the same
  • Separate JOBS and SLEEPY counters
    • Upside: any number of workers can go to sleep in parallel
    • Upside: common case is just a seqcst load for new jobs
    • Upside: treats internal and external jobs the same

So you can see why the latest variant is my favorite, I guess.

@nikomatsakis
Member Author

OK, I did a bit more digging. I added a new-and-improved logging mechanism that lets us (a) measure without interfering with wall-clock times by deferring to a separate logging thread and (b) reproduce the state of the rayon workers at each point in time. Right now it produces a CSV file with the number of sleeping/idle/notified threads, number of pending jobs, along with the state of each worker. I'd like to connect this to a nifty chart to let us visualize what is going on.

Studying the data led to two tweaks to improve what seemed to be failure modes. Unfortunately, these didn't appear to improve performance on the quicksort-par-bench, but they still seem like good ideas:

I realized we can combine the "local queue is empty" heuristic with precise counters. So now, if we see that the local queue is non-empty, and there are sleeping threads, then we always try to wake a sleeping thread -- no matter if there are idle threads. The premise is "well those idle threads didn't seem to be consuming the things in my queue". This helps in particular to deal with the races where there may be a lot of jobs pushed but only a small number of idle threads, leading us to wake too few workers.

The other change I made is to tweak how the notification mechanism works. Since we have a per-worker-thread boolean state, we can now set it to false when a thread is notified (not when it actually awakens). We can also subtract from the number of sleeping threads at that time. This fixes a problem I observed where many new jobs arrive and each re-notifies the same sleeping worker, because it hadn't yet awoken.

At this point, when I look over the data, everything seems to be roughly working "as it should". When new jobs come, we start to wake up workers, etc. We do see some temporary buildup of large numbers of jobs waiting to be stolen (sometimes up to 10 or 12), but it seems like that is largely a result of threads not waking as fast as one might like.

In any case, got to run for this morning, maybe I'll hook the CSV up to gnuplot and try to get a nice figure that visualizes what's going on. I imagine that might help in spotting any other anti-patterns.

@nikomatsakis
Member Author

nikomatsakis commented Sep 22, 2019

OK, a few updates:

I realized that my "Separate JOBS and SLEEPY counters" implementation was pretty bogus and prone to deadlock -- I've replaced it with a new, cleaner impl that (so far) seems to work fine. In the process (and crucially) it combines all the counters (i.e., also those tracking the number of idle/sleepy threads). I'm still testing this but it seems to be working now.

Finally, though, while I still have a few more things to try, there is a distinct possibility I won't be able to recover the perf on quick_sort_par_bench. I still think we should land the new scheduler, but I haven't had a chance to discuss with @cuviper. For one thing, I think it'd be great to have the work on master and maybe other people will be able to play around with it.

In terms of next steps:

  • I still have one more experiment I want to try. 😉
  • I need to update the RFC to the latest status.
  • I need to start breaking up my branch into steps that can be meaningfully reviewed. Right now it starts out very clean but then quickly goes into a series of back and forth experiments. It also contains some independent strands of work (e.g., building the logging infrastructure). I think the thing to do is probably to "squash" all the experimental commits to the final, current protocol, as well as landing the logging infra separately.

@wagnerf42

hi, sorry if I'm asking dumb questions here. I don't really know where to start with this.
I have a rather simple algorithm for amortizing overheads and I'm not sure why it would not work.
Here is the idea:
each thread has a counter c counting the number of consecutive unsuccessful steal attempts.
Every time you steal, if you succeed the counter is reset to 0;
if you fail you add 1 to the counter and wait a little bit before your next attempt.
How and how long you wait depends on your counter value:
if c is small you waste time looping for an approximate duration of d^c (d is a constant);
if c is larger you sleep for the same approximate duration.

the good points are:

  • you are initially very reactive
  • the number of useless steal requests decreases exponentially fast
  • the time wasted waiting (sleeping or looping) is amortized with respect to the previous failed attempts: for example, if d = 2 and you waited 10s, then five of those 10s were actually useful, so no more than half the waiting time is wasted.

We used this in a slightly different context, so I'm not sure it would work here, but I don't see why it would not.

@nikomatsakis
Member Author

@wagnerf42 certainly worth a try!

@lnicola

lnicola commented Oct 9, 2019

I haven't looked at the code, but this might serve as inspiration: https://github.com/dotnet/coreclr/blob/master/src/System.Private.CoreLib/shared/System/Threading/ThreadPool.cs.

@sagar-solana

sagar-solana commented Oct 14, 2019

@nikomatsakis do you have a target release in which you plan to improve this? Just wondering as there hasn't been much movement since the initial diagnosis.

@tdaede

tdaede commented Nov 5, 2019

The latch-target-thread branch produces an enormous speedup for rav1e's threading, on my 2990wx using 56 tile threads.

Before: INFO rav1e::stats > encoded 200 frames, 6.404 fps, 1052.40 Kb/s

After: INFO rav1e::stats > encoded 200 frames, 9.243 fps, 1052.40 Kb/s

@nikomatsakis
Member Author

Update:

Sorry for the radio silence. I got overwhelmed for a while. But I've come back to this work in the last week or so. I've got some good news, though not perfect.

The biggest concern when @cuviper and I talked last was the fact that certain benchmarks -- notably our parallel sort routine, but also the quick_sort_par_bench -- slowed down by about 50% on latch-target-thread. Basically everything else I've measured is either unaffected or does better.

I spent some time investigating what is causing this slowdown and what we can do about it. Along the way, I did find one bug in the handling of idle threads. Fixing that reduces the slowdown on quick_sort_par_bench to 25%. Unfortunately, par_sort_big remains noticeably affected (2.3m ns/iter vs 1.5m ns/iter). I've actually just pushed that bug fix, so @tdaede you may want to re-run those benchmarks, I'd be curious to see the result.

I also backported the "event logging" framework that I added on the latch-target-thread branch to master. This allows me to generate the same sort of event traces on both branches, which I can then compare. These traces give me a complete picture of how many threads are sleeping, notified, idle, etc, as well as how many jobs are pending in each thread's queue. Generating these traces (experimentally) doesn't affect the total benchmark time, so I believe they are representative. I also wrote a tool (rayon-plot) that will analyze the traces and generate SVGs from them, which has allowed me to view them. Unfortunately, the SVGs are kind of huge...

I spent the last few days closely studying quick_sort_par_bench. I formed and tested various hypotheses. My conclusion thus far is that the slowdown results simply from the fact that there are fewer idle threads hanging around, waiting to steal tasks. As a result, the latency to steal tasks is slightly increased. The latency on master is 2.18 events. The latency on latch-target-thread now is 3.32 events, and the latency before my fix was 5.15. So you can see that my fix, which removed about 50% of the slowdown, also removed about 50% of this latency.

I'm not sure how much we can do about this slowdown. It feels somewhat inherent: any change that leaves fewer idle threads hanging around will also increase latency. One thing I tinkered with was modifying the routines to try and make idle threads spend a larger percentage of their time searching for tasks to steal, but I didn't have any success with that yet.

I've not yet taken a detailed look at the par_sort_big test; that is probably next.

@nikomatsakis
Member Author

I'm pulling this into a separate comment.

Looking for help

My time is pretty limited -- is anybody interested in collaborating on this work? I'm enjoying it, but I'm also thinking it would go faster if somebody else wanted to help a bit. If so, ping me (on Discord, Zulip, or even gitter, although I don't notice pings there quite as well).

Next steps

  • I would like to land this event logging mechanism on master which I backported, and move rayon-plot into the rayon org (or maybe even the rayon repo). It is very useful. But I thought that before doing so, it would make sense to compare against @wagnerf42's RFC (added logs rfc #4) and see how much they overlap. I also thought an RFC describing how it works would make sense, and not be too much work.
  • I would like to experiment with @wagnerf42's suggestion of having threads independently sleep for increasingly long amounts of time. It's certainly appealingly simple. Having the event logging mechanism would help with this.
  • I need to rebase and squash my branch, which is ridiculously long, and then see if I can identify things to start pulling onto master.

@wagnerf42

well, I can re-run the sort benchmark and take a look at the logging part that's for sure.
for the sleeping part I'm not too sure I really understand current code. I'd need to dive deeper.

@wagnerf42

well, the latch branch is slightly faster on my desktop (5.37 vs 5.3).
However, one of the runs segfaulted and I did not manage to reproduce it:
error: process didn't exit successfully: /home/wagnerf/code/latch/rayon/target/release/deps/rayon_demo-a001faa26da46eda par_sort_big --bench (signal: 11, SIGSEGV: invalid memory reference).
I'll try to run it again with more cores to stress it a bit more.

@nikomatsakis
Member Author

nikomatsakis commented Nov 5, 2019

@wagnerf42

well, the latch branch is slightly faster on my desktop (5.37 vs 5.3).

Huh, interesting! How many cores etc is your desktop?

I've not tested with different values of RAYON_RS_CPUS here.

however one of the run segfaulted and I did not manage to reproduce it :

uh oh.

for the sleeping part I'm not too sure I really understand current code. I'd need to dive deeper.

I am reminded that I have to update this RFC, which is woefully out of date in some particulars.

@nikomatsakis mentioned this pull request Nov 12, 2019
@wagnerf42

hi niko,
I can confirm the segfault; I got it also on another machine. Sadly it was in the middle of a script, so I don't know which branch. Also, your rayon-plot repository is empty. Maybe you forgot to push?

@tdaede

tdaede commented Nov 25, 2019

I didn't save the parameters I used for the previous test, but here's a newer one:
Release rayon: 19.818 fps
c4a49e34106d9a19c38e98aa5d242cd0bda67005: 29.976 fps
bc73d198699ebec7180a2840682db3a2e82e7b90: 29.820 fps

On this CPU I would consider the difference between the last two to be noise.

bors bot added a commit to rayon-rs/rayon that referenced this pull request Aug 13, 2020
746: new scheduler from RFC 5 r=cuviper a=nikomatsakis

Implementation of the scheduler described in rayon-rs/rfcs#5 -- modulo the fact that the RFC is mildly out of date. There is a [walkthrough video available](https://youtu.be/HvmQsE5M4cY).

To Do List:

* [x] Fix the cargo lock
* [x] Address use of `AtomicU64`
* [x] Document the handling of rollover and wakeups and convince ourselves it's sound
* [ ] Adopt and document the [proposed scheme for the job event counter](#746 (comment))
* [ ] Review RFC and list out the places where it differs from the branch

Co-authored-by: Niko Matsakis <niko@alum.mit.edu>
Co-authored-by: Josh Stone <cuviper@gmail.com>
@cuviper
Member

cuviper commented Aug 24, 2020

The implementation has merged in rayon#746 -- I guess we should fix any inconsistencies here and merge as well...

bors bot added a commit to rayon-rs/rayon that referenced this pull request Aug 24, 2020
793: Release rayon 1.4.0 / rayon-core 1.8.0 r=cuviper a=cuviper

- Implemented a new thread scheduler, [RFC 5], which uses targeted wakeups for
  new work and for notifications of completed stolen work, reducing wasteful
  CPU usage in idle threads.
- Implemented `IntoParallelIterator for Range<char>` and `RangeInclusive<char>`
  with the same iteration semantics as Rust 1.45.
- Relaxed the lifetime requirements of the initial `scope` closure.

[RFC 5]: rayon-rs/rfcs#5


Co-authored-by: Josh Stone <cuviper@gmail.com>
@Boscop

Boscop commented May 2, 2023

I'm wondering, what's the current status of this? :)

@cuviper
Member

cuviper commented May 2, 2023

Ah, this shipped in rayon-rs/rayon#793.

@cuviper
Member

cuviper commented May 2, 2023

The last commit was WIP with an incomplete section, but I don't really expect to revisit and complete that at this point, so I just removed it. I'll merge this as-is.
