
Parallel mode hangs when invoked from Rayon global thread pool #517

Closed
Pr0methean opened this issue Jun 12, 2023 · 25 comments · Fixed by #524

Comments

@Pr0methean
Contributor

My app needs to generate and optimize multiple images in parallel, so it uses the Rayon global thread pool. However, the time each image takes varies unpredictably, so utilization toward the end of a run would be higher if I could also use oxipng's parallel mode. Unfortunately, oxipng deadlocks when I invoke it from within a Rayon task in the global thread pool. Here's the app where I'm experiencing this: https://app.circleci.com/pipelines/github/Pr0methean/OcHd-RustBuild?branch=oxipng-parallel

@Pr0methean
Contributor Author

I've tried wrapping the optimize_from_memory call in ThreadPool::install with an explicitly-created thread pool, but that leads to starvation of the global pool and out-of-memory errors. One workaround would be for oxipng to make the RowFilter interface public, so that I could create a separate task for each filter algorithm to be tried.
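
Roughly what I tried (simplified; the pool size and the helper name here are placeholders, not my actual code):

use oxipng::{optimize_from_memory, Options, PngResult};

fn optimize_in_separate_pool(data: &[u8], options: &Options) -> PngResult<Vec<u8>> {
    // Dedicated pool, so oxipng's internal rayon jobs don't have to share
    // threads with the global pool that schedules my image tasks.
    // (In practice the pool would be built once and reused.)
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4) // arbitrary size for illustration
        .build()
        .expect("failed to build thread pool");

    // Run the whole optimization inside that pool.
    pool.install(|| optimize_from_memory(data, options))
}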

@andrews05
Collaborator

Is this on master or v8? Could you try some older versions perhaps?
Do you know the exact point that it deadlocks?

@Pr0methean
Contributor Author

Master.

@andrews05
Collaborator

andrews05 commented Jun 13, 2023

Maybe this is relevant? rayon-rs/rayon#592
Oxipng explicitly avoids using mutexes in this way, but perhaps your code does?

Out of interest, are you using the new raw API?

@Pr0methean
Contributor Author

That bug is unlikely to be relevant, since the PNG-writing task that invokes Oxipng calls into_inner() on its Mutex before running.

I'm not using the raw API but will look into it.

@andrews05
Collaborator

As I'm not exactly an expert in Rayon, I'm afraid there's not much else I can do to help, sorry. Perhaps @kornelski may have some ideas?

@wait-what

wait-what commented Jun 18, 2023

Very simple to reproduce, using the code below as an example. My CPU has 12 threads. This works perfectly well with 11, but as soon as I uncomment the 12, none of them finish, there is no CPU activity, and the program just hangs. On an 8-thread CPU it works with 7 but hangs with 8, and so on.

use rayon::prelude::*;
use oxipng::{optimize_from_memory, Options, Deflaters};

fn main() {
    let buffer = include_bytes!("sample.png").as_slice();

    let options = Options {
        deflate: Deflaters::Libdeflater { compression: 12 },
        ..Default::default()
    };

    vec![
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
        // 12,
    ].par_iter().for_each(|x| {
        println!("Optimizing {x}");
    optimize_from_memory(buffer, &options).unwrap();
        println!("Done {x}");
    });
}

@andrews05
Collaborator

@wait-what Thanks for the minimal test case, that's quite helpful!
It seems the problem is with the evaluator: trials are scheduled in the background, and when it later waits for them to finish there are no free threads to execute them, so it deadlocks.
I'm not really sure what the correct way to resolve this is, but here's one possible solution: andrews05@059e22b
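
In rough terms, the idea is this pattern (simplified; the helper name recv_with_yield is made up for illustration, not the literal code):

use std::sync::mpsc::{Receiver, TryRecvError};

// Wait for an evaluation result without blocking the rayon worker thread:
// run other locally queued rayon jobs on this thread until a result arrives.
fn recv_with_yield<T>(rx: &Receiver<T>) -> Option<T> {
    loop {
        match rx.try_recv() {
            Ok(value) => return Some(value),
            Err(TryRecvError::Disconnected) => return None,
            Err(TryRecvError::Empty) => match rayon::yield_local() {
                // Ran one pending job on this thread; poll the channel again.
                Some(rayon::Yield::Executed) => continue,
                // Nothing left to run locally (or we're not on a rayon
                // thread): fall back to an ordinary blocking receive.
                _ => return rx.recv().ok(),
            },
        }
    }
}
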
Would you and @Pr0methean mind checking this out to see if it works for you?

@wait-what

wait-what commented Jun 19, 2023

With oxipng = { git = "https://github.com/andrews05/oxipng", branch = "parallel" } and the same code as above, the issue seems to be fixed. Thank you!

A while loop like that seems like it might not be the most efficient solution, but it does seem plenty fast.

@kornelski
Contributor

kornelski commented Jun 20, 2023

I think the yield loop should be safe, because it's not busy-waiting, it's stealing work to run on the current thread.

I'm not sure if it solves the problem 100%, because I don't know if it's guaranteed that rayon will find a job to execute at the moment of entering the while loop. If some timing issue made the evaluation job enqueue a bit later, I suspect the while loop may terminate, and then still get stuck waiting for the channel.

When get_best_candidate is called in a rayon context, it creates a situation where one rayon task is waiting for another rayon task to complete, and that is the equivalent of the mutex problem that rayon can't solve.

I see a couple of solutions to this:

  1. Have an explicit separate rayon threadpool just for the try_image tasks, so that it's always guaranteed there's a thread available for them.

  2. In try_image, keep an explicit queue of images left to evaluate. Instead of spawning a rayon task for a specific image, do this:

    1. Add the image to a queue (could be a Mutex<Vec<(Arc<PngImage>, params)>>)
    2. Spawn a rayon task that removes an image from the queue and evaluates it

    Then in get_best_candidate, also try to take images from the queue and evaluate them on the current thread. This way, if the spawned tasks never run, they never take images off the queue, and get_best_candidate finishes the job itself. It's about the same as the while rayon::yield loop, but only for the image tasks, and guaranteed to empty that particular queue. (Rough sketch below.)
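
Something like this, with hypothetical names (Candidate, Evaluation, evaluate, Evaluator) standing in for oxipng's real types; a real implementation would also need to wait for evaluations that a spawned task has already started, which I've left out here:

use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for Arc<PngImage> + filter params and a trial result.
struct Candidate;
struct Evaluation;

fn evaluate(_c: Candidate) -> Evaluation {
    // ... trial filtering/compression would happen here ...
    Evaluation
}

#[derive(Default)]
struct Evaluator {
    queue: Arc<Mutex<Vec<Candidate>>>,
    results: Arc<Mutex<Vec<Evaluation>>>,
}

impl Evaluator {
    // Analogue of try_image: enqueue the candidate, then ask rayon to process
    // "one item from the queue" rather than this specific image.
    fn try_image(&self, candidate: Candidate) {
        self.queue.lock().unwrap().push(candidate);
        let queue = Arc::clone(&self.queue);
        let results = Arc::clone(&self.results);
        rayon::spawn(move || {
            if let Some(c) = queue.lock().unwrap().pop() {
                let eval = evaluate(c);
                results.lock().unwrap().push(eval);
            }
        });
    }

    // Analogue of get_best_candidate: drain whatever is still queued on the
    // current thread, so the work gets done even if the spawned jobs never run.
    fn get_best_candidate(&self) -> Vec<Evaluation> {
        loop {
            // Hold the queue lock only long enough to pop one candidate.
            let next = self.queue.lock().unwrap().pop();
            match next {
                Some(c) => {
                    let eval = evaluate(c);
                    self.results.lock().unwrap().push(eval);
                }
                None => break,
            }
        }
        std::mem::take(&mut *self.results.lock().unwrap())
    }
}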

@andrews05
Collaborator

Thanks for the analysis, @kornelski!

> I'm not sure if it solves the problem 100%, because I don't know if it's guaranteed that rayon will find a job to execute at the moment of entering the while loop. If some timing issue made the evaluation job enqueue a bit later, I suspect the while loop may terminate, and then still get stuck waiting for the channel.

Are you sure this is possible though? I would have thought the job would be guaranteed to be queued before rayon::spawn returns. I mean, my personal preference would be to stick with the simplest solution unless we know for sure that it may not always work 😅

@andrews05
Collaborator

@Pr0methean Have you had a chance to try out the branch above?

@TPS

TPS commented Jun 22, 2023

Is this work useful toward #169/#275, perchance?

@andrews05
Collaborator

@TPS Not really, sorry. Although since #169 was opened we've added new filter types, so it will use up to 10 threads now at -o 6.

@Pr0methean
Contributor Author

@andrews05 It works; thanks!

@andrews05
Collaborator

Great, I've opened a PR for it.

@andrews05
Collaborator

@Pr0methean I was just thinking about the stack overflow you mentioned; are you able to get a stack trace for that?

@Pr0methean
Contributor Author

Pr0methean commented Jun 25, 2023

Yes.
stack_trace.txt
The problem occurs because yield_local is effectively recursive. It could probably be solved by having one thread not yield.
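
To illustrate (a hypothetical call stack, not the actual frames from the attached trace): each yield_local can pick up the next queued image job from the same thread and run it on top of the current frame, so the stack depth grows with the number of queued images:

optimize_from_memory(image_1)
  rayon::yield_local()            // waiting for trial results
    optimize_from_memory(image_2) // yield picked up another image job
      rayon::yield_local()
        optimize_from_memory(image_3)
          ...                     // one nested set of frames per queued image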

@Pr0methean
Contributor Author

Opened rayon-rs/rayon#1064.

@andrews05
Collaborator

Oh I see, so yield_local is picking up a task from your own program (the next image to operate on), rather than an evaluation task, is that right?

@Pr0methean
Contributor Author

Pr0methean commented Jun 26, 2023

That's right. I've tried having the image task itself call yield_local() before starting, but this is slower and gives lower CPU utilization than without the Oxipng parallelization. I've also tried wrapping the Oxipng call in install() with a second thread pool, but then I get deadlock again.

@andrews05
Collaborator

I guess kornelski's solution would have worked better then. Whoops 😅
It is a lot more complex though, so hopefully rayon accepts your patch.

@Pr0methean
Contributor Author

Update: it looks like this only affects jobs running in a FIFO scope, but it's unfortunate that I have to choose between my two methods of increasing utilization (parallelizing oxipng vs starting the longest jobs first).
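
For context, by "FIFO scope" I mean the jobs are submitted roughly like this (simplified; names are placeholders, not my actual code):

// Submit the image jobs in FIFO order, so the longest-running images can be
// started first; each job eventually calls oxipng's optimize_from_memory.
fn run_all(images: Vec<Vec<u8>>) {
    rayon::scope_fifo(|s| {
        for image in images {
            s.spawn_fifo(move |_| {
                let _ = oxipng::optimize_from_memory(&image, &oxipng::Options::default());
            });
        }
    });
}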

@andrews05
Collaborator

I believe I have a simple solution. Can you try this out? andrews05@1e46122

@Pr0methean
Contributor Author

It may take me a day or two to test this, because I'm also testing zopfli-rs/zopfli#22 and (in preparation for making it a PR) Pr0methean/zopfli@b2d0142.

@laurmaedje mentioned this issue Sep 13, 2023