Replies: 2 comments
-
This could be interesting to try. I did have a very old branch using the `Mutex` approach: https://github.com/hymm/bevy/blob/system-future/crates/bevy_ecs/src/schedule/executor_parallel.rs. It ended up being 5-10% slower, so I didn't pursue it much further. This is orthogonal to the approach outlined above. But what I want to try for reducing overhead is to prespawn the system tasks and reuse them by sending a reference to the world through a channel every frame. That should be a nice speedup, but it might be tricky to make 100% safe.
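A rough sketch of what "prespawn and reuse" might look like, assuming `async_channel` as the per-frame channel; `WorldPtr` and `run_system` are illustrative stand-ins, not Bevy APIs, and the `unsafe impl Send` is exactly the "tricky to make 100% safe" part:

```rust
// Hypothetical wrapper for the world reference sent to each task every frame.
struct WorldPtr(*mut ());
// SAFETY (assumed, not proven): the executor only sends this while the world
// outlives the frame and the schedule prevents conflicting access.
unsafe impl Send for WorldPtr {}

// One long-lived task per system, spawned once and reused across frames:
// the per-frame cost is a channel recv, not a fresh task spawn.
async fn prespawned_system_task(
    id: usize,
    frames: async_channel::Receiver<WorldPtr>,
    done: async_channel::Sender<usize>,
) {
    while let Ok(world) = frames.recv().await {
        // Run this task's system against the world for this frame.
        unsafe { run_system(id, world.0) };
        let _ = done.send(id).await;
    }
}

// Hypothetical stand-in: run system `id` against the world behind `world`.
unsafe fn run_system(_id: usize, _world: *mut ()) {}
```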
-
Just wanted a place to publicly jot down an idea for reducing the overhead of the multithreaded system executor. It's probably not currently actionable, and it's unclear whether it's possible at all, hence why this isn't an issue.
The current multithreaded executor's flow is dictated by two lock-free queues/channels: the task executor's global injection queue and the system completion channel. The executor starts systems by spawning tasks and awaits their completion on the completion channel, which each task pushes onto when it finishes running. This last step of sending a completion message over an async channel carries the overhead of potentially multiple atomic fences and the potential delay of waking up an OS thread. This currently manifests as multiple 10-70+us segments where we're just waiting on the OS to wake a thread, which can quickly add up when many systems complete in sequence.
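A minimal, self-contained sketch of that completion-channel pattern, using `async_executor` and `async_channel` as stand-ins for Bevy's task pool and channel (all names here are illustrative, not Bevy's actual types):

```rust
fn main() {
    let executor = async_executor::Executor::new();
    let (done_tx, done_rx) = async_channel::unbounded::<usize>();

    // "Spawn" three systems; each pushes its id onto the completion channel.
    for id in 0..3 {
        let done_tx = done_tx.clone();
        executor
            .spawn(async move {
                // ... the system would run here ...
                // This send is where the atomic fences / thread wakeup land.
                let _ = done_tx.send(id).await;
            })
            .detach();
    }
    drop(done_tx);

    // The executor side awaits completions; each recv may involve waking an
    // OS thread, which is the 10-70+us gap described above.
    futures_lite::future::block_on(executor.run(async {
        while let Ok(id) = done_rx.recv().await {
            println!("system {id} finished");
        }
    }));
}
```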
The rough idea I'm proposing is to opportunistically cut out that overhead by moving the multithreaded executor from its own task to running synchronously in the system task itself, right after the system has completed. This should avoid the OS thread wakeup and channel overheads entirely.
However, this potentially raises an issue: multiple systems may complete at the same time and then all attempt to run the executor in their own tasks. This naturally calls for synchronization primitives like `Mutex`, but that likely puts us back at square one in terms of OS and synchronization overhead. It's potentially addressable by using a single `AtomicBool` as a lock-like mechanism, acquired with a compare-and-swap such as `AtomicBool::compare_exchange_weak`.
If done correctly, there should be at least one active system or executor at any given time, and we may opportunistically only need to pay the atomic contention of one CAS per system, instead of the potentially multiple atomic operations involved in pushing to the queue.
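A minimal sketch of that lock-like mechanism, under the assumption that completed systems first publish their completion and then race one CAS to decide who runs the executor inline (`record_completion` and `run_executor_inline` are hypothetical stand-ins; the strong `compare_exchange` is used here so a spurious failure can't leave a free executor unclaimed):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

static EXECUTOR_ACTIVE: AtomicBool = AtomicBool::new(false);

// Hypothetical hook run synchronously at the end of each system's task.
fn on_system_complete(id: usize) {
    // Publish this system's completion before racing for the "lock".
    record_completion(id);

    // One CAS decides which of the simultaneously-completing systems runs
    // the executor inline; the losers simply return and their tasks end.
    if EXECUTOR_ACTIVE
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
    {
        // Schedule any newly-ready systems, with no channel send and no
        // OS thread wakeup on this path.
        run_executor_inline();
        EXECUTOR_ACTIVE.store(false, Ordering::Release);
        // Caveat: a completion that lands between the inline run and the
        // store above would lose its CAS and go unprocessed; a real design
        // would need a recheck loop here to avoid that lost wakeup.
    }
}

fn record_completion(_id: usize) { /* hypothetical: mark system finished */ }
fn run_executor_inline() { /* hypothetical: one pass of the executor body */ }
```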