Replies: 2 comments
-
This could be interesting to try. I did have a very old branch using the `Mutex` approach: https://github.com/hymm/bevy/blob/system-future/crates/bevy_ecs/src/schedule/executor_parallel.rs. It ended up being 5-10% slower, so I didn't pursue it much further. This is orthogonal to the approach outlined above. But what I want to try for reducing overhead is to prespawn the system tasks and reuse them by sending a reference to the world through a channel every frame. That should be a nice speedup, but it might be tricky to make 100% safe.
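A rough sketch of what "prespawn and reuse" might look like, assuming `async_channel` as the per-frame channel; `WorldPtr` and `run_system` are illustrative stand-ins, not Bevy APIs, and the `unsafe impl Send` is exactly the "tricky to make 100% safe" part:

```rust
// Hypothetical wrapper for the world reference sent to each task every frame.
struct WorldPtr(*mut ());
// SAFETY (assumed, not proven): the executor only sends this while the world
// outlives the frame and the schedule prevents conflicting access.
unsafe impl Send for WorldPtr {}

// One long-lived task per system, spawned once and reused across frames:
// the per-frame cost is a channel recv, not a fresh task spawn.
async fn prespawned_system_task(
    id: usize,
    frames: async_channel::Receiver<WorldPtr>,
    done: async_channel::Sender<usize>,
) {
    while let Ok(world) = frames.recv().await {
        // Run this task's system against the world for this frame.
        unsafe { run_system(id, world.0) };
        let _ = done.send(id).await;
    }
}

// Hypothetical stand-in: run system `id` against the world behind `world`.
unsafe fn run_system(_id: usize, _world: *mut ()) {}
```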
-
Just wanted a place to publicly jot down an idea for reducing the overhead of the multithreaded system executor. It's probably not currently actionable, and it's unclear whether it's possible at all, hence why this isn't an issue.
The current multithreaded executor's flow is dictated by two lock-free queues/channels: the task executor's global injection queue and the system completion channel. The executor starts systems by spawning tasks and awaits their completion on the completion channel, which each task pushes onto when it finishes running. This last step of sending a completion message over an async channel carries the overhead of potentially multiple atomic fences and the potential delay of waking up an OS thread. This currently manifests as multiple 10-70+us segments where we're just waiting on the OS to wake a thread, which can quickly add up when many systems complete in sequence.
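A minimal, self-contained sketch of that completion-channel pattern, using `async_executor` and `async_channel` as stand-ins for Bevy's task pool and channel (all names here are illustrative, not Bevy's actual types):

```rust
fn main() {
    let executor = async_executor::Executor::new();
    let (done_tx, done_rx) = async_channel::unbounded::<usize>();

    // "Spawn" three systems; each pushes its id onto the completion channel.
    for id in 0..3 {
        let done_tx = done_tx.clone();
        executor
            .spawn(async move {
                // ... the system would run here ...
                // This send is where the atomic fences / thread wakeup land.
                let _ = done_tx.send(id).await;
            })
            .detach();
    }
    drop(done_tx);

    // The executor side awaits completions; each recv may involve waking an
    // OS thread, which is the 10-70+us gap described above.
    futures_lite::future::block_on(executor.run(async {
        while let Ok(id) = done_rx.recv().await {
            println!("system {id} finished");
        }
    }));
}
```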
The rough idea I'm proposing is to opportunistically cut out that overhead by moving the multithreaded executor from its own task to running synchronously in the system task itself, right after the system has completed. This should avoid the OS thread wakeup and channel overheads entirely.
However, this potentially raises an issue: multiple systems may complete at the same time and then all attempt to run the executor in their own tasks. This naturally calls for synchronization primitives like `Mutex`, but that likely puts us back at square one in terms of OS and synchronization overhead. It's potentially addressable by using a single `AtomicBool` as a lock-like mechanism, acquired with a compare-and-swap such as `AtomicBool::compare_exchange_weak`.
If done correctly, there should be at least one active system or executor at any given time, and we may opportunistically only need to pay the atomic contention of one CAS per system, instead of the potentially multiple atomic operations involved in pushing to the queue.
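A minimal sketch of that lock-like mechanism, under the assumption that completed systems first publish their completion and then race one CAS to decide who runs the executor inline (`record_completion` and `run_executor_inline` are hypothetical stand-ins; the strong `compare_exchange` is used here so a spurious failure can't leave a free executor unclaimed):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static EXECUTOR_ACTIVE: AtomicBool = AtomicBool::new(false);

// Hypothetical hook run synchronously at the end of each system's task.
fn on_system_complete(id: usize) {
    // Publish this system's completion before racing for the "lock".
    record_completion(id);

    // One CAS decides which of the simultaneously-completing systems runs
    // the executor inline; the losers simply return and their tasks end.
    if EXECUTOR_ACTIVE
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
    {
        // Schedule any newly-ready systems, with no channel send and no
        // OS thread wakeup on this path.
        run_executor_inline();
        EXECUTOR_ACTIVE.store(false, Ordering::Release);
        // Caveat: a completion that lands between the inline run and the
        // store above would lose its CAS and go unprocessed; a real design
        // would need a recheck loop here to avoid that lost wakeup.
    }
}

fn record_completion(_id: usize) { /* hypothetical: mark system finished */ }
fn run_executor_inline() { /* hypothetical: one pass of the executor body */ }
```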