
Performance regression with non-power-of-2 number of threads #968

Open
cfallin opened this issue Aug 17, 2022 · 6 comments
Comments

cfallin commented Aug 17, 2022

Over in bytecodealliance/wasmtime#4327 we found a ~20% performance regression on a workload when a non-power-of-2 number of threads was specified, both on my 12-core system (Ryzen 3900X) and on other systems with 3, 6, ... threads manually specified. On my own system, manually specifying a power-of-two number of threads (e.g., 8) makes the regression disappear.

@alexcrichton produced a diff in that PR that isolates the issue; it appears that it may have something to do with scopes. I am not familiar enough with the design of rayon or its terminology to describe in more detail what these changes mean.

Does this issue seem familiar at all? Could there be some dependence on a power-of-two number of threads for, e.g., even work distribution?

cuviper (Member) commented Aug 18, 2022

Well, I don't think the pool itself would care about that, but it might be a problem with job-splitting on parallel iterators. Basically, they work by even 50-50 splits until there are enough jobs to fill the pool, but that effectively overshoots up to the next power of two. Plus, when jobs do get stolen, we split further in an attempt at "adaptive" splitting for imbalanced workloads.

So basically, a pool of 12 threads will split just as much as a pool of 16 threads would. And with smaller jobs, they may finish sooner and work-steal more, which adaptively splits more, potentially cascading poorly. I'm guessing about all of this, but maybe that could stack up to your 20% effect, especially if you have things like map_with that are sensitive to the intermediate item group size in a job.

I've known that rayon's job-splitting is not "ideal", but it's trying to strike a general balance. In the past I found that having more control knobs on this ends up slowing everyone down, but there's at least with_min_len to put a limit on splitting. I also have #857 to curtail some of the adaptive effect, if you care to try that out.
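
For reference, with_min_len goes directly on an indexed parallel iterator ahead of the rest of the chain; a minimal sketch, with an arbitrary threshold:

    use rayon::prelude::*;

    fn sum_of_squares(data: &[u64]) -> u64 {
        data.par_iter()
            // Keep at least 1024 items per job so splitting stops earlier;
            // the right threshold is workload-dependent and this value is
            // only illustrative.
            .with_min_len(1024)
            .map(|x| x * x)
            .sum()
    }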

@alexcrichton

To expand a bit more on our problem: what we're using rayon for is to compile all the functions in a wasm module in parallel. We basically build a big list of all the functions and then use rayon's par_iter().map().collect() to execute the compilation. The behavior I'm seeing is that if I wrap the par_iter() over all the functions in a closure called calculate, I see runtime differences in this code:

        let funcs = if std::env::var("SLOW").is_ok() {
            // Wrap the call in rayon::join, forcing calculate() to run
            // entirely inside the thread pool (the slow path being tested).
            let (a, _) = rayon::join(|| calculate(), || ());
            a
        } else {
            // Call calculate() directly from the current thread.
            calculate()
        };
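
For context, calculate() has roughly this shape; the type names and error type below are illustrative stand-ins rather than the actual wasmtime code:

    use rayon::prelude::*;

    // Illustrative stand-ins; the real types live in wasmtime's compiler.
    struct Func;
    struct CompiledFunc;
    fn compile_function(_f: &Func) -> Result<CompiledFunc, String> {
        Ok(CompiledFunc)
    }

    fn compile_all(funcs: &[Func]) -> Result<Vec<CompiledFunc>, String> {
        // Build the big list of functions up front, then compile them all
        // in parallel and collect into a single Result.
        funcs.par_iter().map(compile_function).collect()
    }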

Specifically, on an Intel CPU (I can't seem to reproduce this on arm64) with RAYON_NUM_THREADS=6, the compilation time for a module is ~1.7s without SLOW and jumps to ~2.2s with SLOW=1. The slowdown is consistent across many runs and is at least reproducible locally for me. If you're interested in digging in, I can share details about reproduction.

Otherwise, I tested out #857 and it unfortunately didn't have any effect here: SLOW=1 showed the same slowdown relative to not using it.

Apart from this one manual call to rayon::join, the only other usage of rayon is this method, which is called once from the calculate closure (not shown here and not in the main branch of wasmtime, just in my local fork where I'm hacking on this to investigate the behavior).

cuviper (Member) commented Aug 18, 2022

That's wacky, especially that it would happen on Intel and not arm64.

The only thing I can see, which has nothing to do with being power-of-two, is that rayon::join will force the entire calculate() into the threadpool, while the bare version can do some work in your current thread. Even if you have no explicit work, collecting Result<Vec<_>, E> does have some final cleanup to get a contiguous Vec on success, or drop partial values on error.

Does the same slowdown happen if you use rayon::scope? No spawn is needed for this test; I'm just looking at another way to force the work into the pool.
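
Something along these lines, mirroring the earlier SLOW snippet (just a sketch; calculate is the same closure as above):

    let funcs = if std::env::var("SLOW").is_ok() {
        // rayon::scope runs its closure on a pool thread and returns the
        // closure's result, so this also forces the work into the pool
        // without any explicit spawns.
        rayon::scope(|_scope| calculate())
    } else {
        calculate()
    };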

@alexcrichton

Unfortunately, yes: even with rayon::scope, where it just runs calculate within the scope closure and sets a local variable to the result, the slowdown persists.

Walther commented Aug 20, 2022

Sorry in advance for the drive-by comment: I wonder if issue #795 is related?

cuviper (Member) commented Aug 22, 2022

@Walther that issue is specific to the way that par_bridge spins while idle, so I think it's unrelated.
