# Rayon FAQ

This file is for general questions that don't fit into the README or crate docs.

## How many threads will Rayon spawn?

By default, Rayon uses the same number of threads as the number of CPUs
available. Note that on systems with hyperthreading enabled this equals the
number of logical cores and not the physical ones.

If you want to alter the number of threads spawned, you can set the
environment variable `RAYON_NUM_THREADS` to the desired number of threads or
use the
[`ThreadPoolBuilder::build_global`](https://docs.rs/rayon/*/rayon/struct.ThreadPoolBuilder.html#method.build_global)
method.
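
For example, here's a minimal sketch of the builder approach (the count of 8 is
just an arbitrary example, not a recommendation):

```rust
use rayon::prelude::*;

fn main() {
    // Configure the global pool before any Rayon code runs; this can
    // only be done once and fails if the pool already exists.
    rayon::ThreadPoolBuilder::new()
        .num_threads(8)
        .build_global()
        .expect("the global thread pool was already initialized");

    // Parallel iterators now run on those 8 worker threads.
    let sum: i64 = (0..1_000_000i64).into_par_iter().sum();
    println!("{}", sum);
}
```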

## How does Rayon balance work between threads?

Behind the scenes, Rayon uses a technique called **work stealing** to try and
dynamically ascertain how much parallelism is available and exploit it. The idea
is very simple: we always have a pool of worker threads available, waiting for
some work to do. When you call `join` the first time, we shift over into that
pool of threads. But if you call `join(a, b)` from a worker thread W, then W
will place `b` into its work queue, advertising that this is work that other
worker threads might help out with. W will then start executing `a`.

While W is busy with `a`, other threads might come along and take `b` from its
queue. That is called *stealing* `b`. Once `a` is done, W checks whether `b` was
stolen by another thread and, if not, executes `b` itself. If W runs out of jobs
in its own queue, it will look through the other threads' queues and try to
steal work from them.

This technique is not new. It was first introduced by the [Cilk project][cilk],
done at MIT in the late nineties. The name Rayon is an homage to that work.

[cilk]: http://supertech.csail.mit.edu/cilk/
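
To make the `join(a, b)` description concrete, here's a small sketch of a
divide-and-conquer sum; the sequential cutoff of 1024 is arbitrary:

```rust
// A divide-and-conquer sum built on `join`. The second closure goes
// into this worker's queue, where an idle thread may steal it while
// this thread recurses into the left half.
fn parallel_sum(slice: &[i32]) -> i32 {
    if slice.len() <= 1024 {
        return slice.iter().sum(); // small inputs: sequential is cheaper
    }
    let (left, right) = slice.split_at(slice.len() / 2);
    let (a, b) = rayon::join(|| parallel_sum(left), || parallel_sum(right));
    a + b
}
```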

## What should I do if I use `Rc`, `Cell`, `RefCell` or other non-Send-and-Sync types?

There are a number of non-threadsafe types in the Rust standard library, and if
your code is using them, you will not be able to combine it with Rayon.
Similarly, even if you don't have such types, but you try to have multiple
closures mutating the same state, you will get compilation errors; for example,
this function won't work, because both closures access `slice`:

```rust
/// Increment all values in slice.
fn increment_all(slice: &mut [i32]) {
    // ERROR: both closures capture `slice` mutably at the same time,
    // so this does not compile.
    rayon::join(|| slice.iter_mut().for_each(|v| *v += 1),
                || slice.iter_mut().for_each(|v| *v += 1));
}
```
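
For this specific function, one fix is to give each element a single exclusive
borrower, for example via a parallel iterator; a minimal sketch:

```rust
use rayon::prelude::*;

/// Increment all values in slice, in parallel.
fn increment_all(slice: &mut [i32]) {
    // Each element has exactly one mutable borrower at a time.
    slice.par_iter_mut().for_each(|v| *v += 1);
}
```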

The correct way to resolve such errors will depend on the case. Some cases are
easy: for example, uses of [`Rc`] can typically be replaced with [`Arc`], which
is basically equivalent, but thread-safe.
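
For instance, a sketch of what that swap buys you:

```rust
use std::sync::Arc;

// `Arc` has the same clone-and-share API as `Rc`, but its reference
// count is atomic, so it can be shared across Rayon's worker threads.
let shared = Arc::new(vec![1, 2, 3]);
let (sum, len) = rayon::join(|| shared.iter().sum::<i32>(), || shared.len());
assert_eq!((sum, len), (6, 3));
```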

Code that uses `Cell` or `RefCell`, however, can be somewhat more complicated.
If you can refactor your code to avoid those types, that is often the best way
forward, but otherwise, you can map them to their parallel equivalents:

- `Cell` -- replacement: `AtomicUsize`, `AtomicBool`, etc
- `RefCell` -- replacement: `RwLock`, or perhaps `Mutex`

However, you have to be wary! The parallel versions of these types have
different atomicity guarantees. For example, with a `Cell`, you can increment a
counter like so:

```rust
let value = counter.get();
counter.set(value + 1);
```

But when you use the equivalent `AtomicUsize` methods, you are actually
introducing a potential race condition (not a data race, technically, but it can
be an awfully fine distinction):

```rust
let value = tscounter.load(Ordering::SeqCst);
tscounter.store(value + 1, Ordering::SeqCst);
```

You can already see that the `AtomicUsize` API is a bit more complex, as it
requires you to specify an
[ordering](https://doc.rust-lang.org/std/sync/atomic/enum.Ordering.html). (I
won't go into the details on ordering here, but suffice to say that if you don't
know what an ordering is, and probably even if you do, you should use
`Ordering::SeqCst`.) The danger in this parallel version of the counter is that
other threads might be running at the same time and they could cause our counter
to get out of sync. For example, if we have two threads, then they might both
execute the "load" before either has a chance to execute the "store":

```
Thread 1                                      Thread 2
let value = tscounter.load(Ordering::SeqCst);
// value = X                                  let value = tscounter.load(Ordering::SeqCst);
                                              // value = X
tscounter.store(value+1);                     tscounter.store(value+1);
// tscounter = X+1                            // tscounter = X+1
```

Now even though we've had two increments, we'll only increase the counter by
one! Even though we've got no data race, this is still probably not the result
we wanted. The problem here is that the `Cell` API doesn't make clear the scope
of a "transaction" -- that is, the set of reads/writes that should occur
atomically. In this case, we probably wanted the get/set to occur together.

In fact, when using the `Atomic` types, you very rarely want a plain `load` or
plain `store`. You probably want the more complex operations. A counter, for
example, would use `fetch_add` to atomically load and increment the value in one
step. Compare-and-swap is another popular building block.
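
For the counter above, that looks like:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

let counter = AtomicUsize::new(0);
// Load-and-increment as one atomic step: no other thread can slip in
// between the read and the write.
let previous = counter.fetch_add(1, Ordering::SeqCst);
assert_eq!(previous, 0);
```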

A similar problem can arise when converting `RefCell` to `RwLock`, but it is
somewhat less likely, because the `RefCell` API does in fact have a notion of a
transaction: the scope of the handle returned by `borrow` or `borrow_mut`. So if
you convert each call to `borrow` to `read` (and `borrow_mut` to `write`),
things will mostly work fine in a parallel setting, but there can still be
changes in behavior. Consider using a `handle: RefCell<Vec<i32>>` like:

```rust
let len = handle.borrow().len();
for i in 0 .. len {
    // A fresh borrow on every iteration.
    let data = handle.borrow()[i];
    println!("{}", data);
}
```

In sequential code, we know that this loop is safe. But if we convert this to
parallel code with an `RwLock`, we do not: this is because another thread could
come along and do `handle.write().unwrap().pop()`, and thus change the length of
the vector. In fact, even in *sequential* code, using very small borrow sections
like this is an anti-pattern: you ought to be enclosing the entire transaction
together, like so:

```rust
let vec = handle.borrow();
// One borrow held for the whole loop.
for data in vec.iter() {
    println!("{}", data);
}
```
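
The same one-guard discipline carries over to the parallel version; here's a
sketch with `RwLock`, where this `handle` is a stand-in mirroring the one
above:

```rust
use std::sync::RwLock;

let handle: RwLock<Vec<i32>> = RwLock::new(vec![1, 2, 3]);
// One read guard held for the whole transaction; writers on other
// threads block until it is dropped, so the length cannot change
// mid-loop.
let vec = handle.read().unwrap();
for data in vec.iter() {
    println!("{}", data);
}
```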

There are several reasons to prefer one borrow over many. The most obvious is
that it is more efficient, since each borrow has to perform some safety checks.
But it's also more reliable: suppose we modified the loop above to not just
print things out, but also call into a helper function:

```rust
let vec = handle.borrow();
// Still one borrow for the whole loop.
for data in vec.iter() {
    helper(...);
}
```

And now suppose, independently, this helper fn evolved and had to pop something
off of the vector:

```rust
fn helper(...) {
    handle.borrow_mut().pop();
}
```

Under the old model, where we did lots of small borrows, this would yield
precisely the same error that we saw in parallel land using an `RwLock`: the
length would be out of sync and our indexing would fail (note that in neither
case would there be an actual *data race* and hence there would never be
undefined behavior). But now that we use a single borrow, we'll see a borrow
error instead, which is much easier to diagnose, since it occurs at the point of
the `borrow_mut`, rather than downstream. Similarly, if we move to an `RwLock`,
we'll find that the code either deadlocks (if the write is on the same thread as
the read) or, if the write is on another thread, works just fine. Both of these
are preferable to random failures in my experience.
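
The deadlock case is easy to sketch: `std`'s `RwLock` documents that taking the
write lock while the same thread still holds a read guard may deadlock or
panic:

```rust
use std::sync::RwLock;

let handle = RwLock::new(vec![1, 2, 3]);
let guard = handle.read().unwrap();
// Taking the write lock here, while `guard` is still alive on this
// thread, may deadlock or panic:
// let mut w = handle.write().unwrap();
drop(guard);
```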

## But wait, isn't Rust supposed to free me from this kind of thinking?

You might think that Rust is supposed to mean that you don't have to think about
atomicity at all. In fact, if you avoid interior mutability (`Cell` and
`RefCell` in a sequential setting, or `AtomicUsize`, `RwLock`, `Mutex`, et al.
in parallel code), then this is true: the type system will basically guarantee
that you don't have to think about atomicity at all. But often there are times
when you WANT threads to interleave in the ways I showed above.

Consider for example when you are conducting a search in parallel, say to find
the shortest route. To avoid fruitless search, you might want to keep a cell
with the shortest route you've found thus far. This way, when you are searching
down some path that's already longer than this shortest route, you can just stop
and avoid wasted effort. In sequential land, you might model this "best result"
as a shared value like `Rc<Cell<usize>>` (here the `usize` represents the
length of the best path found so far); in parallel land, you'd use an
`Arc<AtomicUsize>`.

```rust
fn search(path: &Path, cost_so_far: usize, best_cost: &AtomicUsize) {
    // If this path is already at least as long as the best route found
    // by ANY thread so far, give up on it early.
    if cost_so_far >= best_cost.load(Ordering::SeqCst) {
        return;
    }
    ...
}
```

Now in this case, we really WANT to see results from other threads interjected
into our execution!
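
A final note on updating `best_cost`: a separate `load` and `store` would
reintroduce the counter race from earlier and could lose a shorter route found
by another thread. `fetch_min` does the comparison and store in one atomic
step; a sketch:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Record a newly found route cost. `fetch_min` compares and stores in
// one atomic step, so two threads finishing at once cannot overwrite
// each other's shorter route.
fn record_cost(best_cost: &AtomicUsize, cost: usize) {
    best_cost.fetch_min(cost, Ordering::SeqCst);
}
```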
