CPU heavy request prevents Rocket from accepting other connections #2239

Open
oersted opened this issue Jun 20, 2022 · 9 comments
Labels
bug (Deviation from the specification or expected behavior) · upstream (An unresolvable issue: an upstream dependency bug)

Comments

@oersted

oersted commented Jun 20, 2022

I have an endpoint that can execute a long CPU heavy computation that does sync I/O and uses some sync locks (RwLock). I understand I should be using spawn_blocking for this, and indeed that works fine.

But I'd like to understand why not using spawn_blocking has such a dramatic effect in this case, to the point that no new HTTP connections can be accepted while the sync computation is running. The computation would of course hog one of the Tokio core threads, but I would expect the other threads to work normally. In a machine with 40 cores, why would this sync computation stall the whole server?
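
For reference, the working spawn_blocking version looks roughly like this (a minimal sketch with placeholder route and computation names, not the actual code):

use rocket::*;
use rocket::tokio::task;

// Placeholder for the CPU-heavy computation with sync I/O and sync RwLock reads.
fn heavy_computation() -> String {
    "done".into()
}

#[get("/compute")]
async fn compute() -> String {
    // Run the blocking work on tokio's dedicated blocking thread pool so it
    // doesn't tie up one of the async worker threads.
    task::spawn_blocking(heavy_computation)
        .await
        .expect("blocking task panicked")
}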

@oersted added the triage (A bug report being investigated) label Jun 20, 2022
@the10thWiz
Collaborator

My best guess is that both the route and main tasks are getting assigned to the same thread, and as such block each other. Although Rocket spawns a task to handle each request, Tokio is free to simply schedule it on the same thread, and due to synchronization and context switching overhead, Tokio will do this whenever it makes sense to do so. It's also possible that Tokio polls the task at least once during spawning, which would trigger this behavior. I'm also not particularly familiar with Tokio's internals, but my understanding is that each thread has a local pool of tasks to run, and long running tasks, especially if they are waiting for IO, are placed into a shared pool to enable other threads to poll them when they can make progress. If both the main server task and your handler are in the local pool, you need to yield for Tokio to even be able to transfer one of the tasks to another thread, but even then there is no guarantee that Tokio will actually transfer one of them.

Overall, there are a variety of things that could be going wrong here.

@SergioBenitez
Member

My best guess is that both the route and main tasks are getting assigned to the same thread, and as such block each other. Although Rocket spawns a task to handle each request, Tokio is free to simply schedule it on the same thread, and due to synchronization and context switching overhead, Tokio will do this whenever it makes sense to do so.

This shouldn't happen as we use the multithreaded scheduler. Even if it does schedule two tasks on one worker, a free worker should steal the task, assuming a free worker exists.

It's also possible that Tokio polls the task at least once during spawning, which would trigger this behavior.

That would be okay, but a free worker should still come by, notice a stalled task, steal it, and execute it.

I'm also not particularly familiar with Tokio's internals, but my understanding is that each thread has a local pool of tasks to run, and long running tasks, especially if they are waiting for IO, are placed into a shared pool to enable other threads to poll them when they can make progress.

This is correct. Tokio will internally use an I/O pool for certain tasks, and you can add to roughly that pool via spawn_blocking. None of those tasks implicitly block any other task.

If both the main server task and your handler are in the local pool, you need to yield for Tokio to even be able to transfer one of the tasks to another thread, but even then there is no guarantee that Tokio will actually transfer one of them.

This is what work stealing is for. No task should need to yield. As soon as a new task is scheduled, workers are notified, and if they're free they'll steal.

My guess about what's happening here is that you were contending on the same lock in the original code across multiple tasks, so blocking in one task caused blocking in all tasks. When you switched to using spawn_blocking, perhaps your locking became more coarse-grained. If you have a copy of the source for both versions, I'd love to take a look.
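
For illustration, here's a minimal sketch of that kind of cross-task contention; the routes and the shared lock are hypothetical, not taken from the reported code:

use std::sync::Mutex;
use std::time::Duration;
use rocket::*;

// A shared std::sync lock: holding it blocks the OS thread, and any other
// task that tries to acquire it blocks its own worker thread too.
static SHARED: Mutex<u64> = Mutex::new(0);

#[get("/slow")]
fn slow() -> String {
    let guard = SHARED.lock().unwrap();
    std::thread::sleep(Duration::from_secs(30)); // stand-in for the CPU-heavy work
    (*guard).to_string()
}

#[get("/fast")]
fn fast() -> String {
    // Stalls until /slow releases the lock: blocking in one task becomes
    // blocking in every task that touches the same lock.
    SHARED.lock().unwrap().to_string()
}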

@SergioBenitez
Member

To confirm, I wrote the following:

use rocket::*;

#[get("/block")]
fn block() {
    let mutex = std::sync::Mutex::new(0u32);
    std::mem::forget(mutex.lock().expect("unlocked"));
    std::mem::forget(mutex.lock().expect("deadlock"));
}

#[get("/")]
fn index() { }

#[launch]
fn rocket() -> Rocket<Build> {
    rocket::build().mount("/", routes![block, index])
}

Here's a session running this server:

>  curl http://127.0.0.1:8000/block
# ... never returns ...

# in a new shell
> curl http://127.0.0.1:8000/
> # returned immediately

@oersted Were you trying to observe behavior through a web browser? If so, see #2228.

@oersted
Author

oersted commented Jun 21, 2022

Thank you @SergioBenitez, yes, I was using a web browser. Although parallel requests work fine now with spawn_blocking, if request serialization were the root cause, that wouldn't be the case, right?

The locking has indeed changed somewhat and is now held for a smaller section of the computation. However, I'm using RwLocks, and all of the requests involved acquire them in read-only mode, so none of them should block.

I did consider providing a reproducible code sample in my first post, but it is rather tricky to untangle so that it's generic and brief, and I don't leak any confidential info.

It works fine with spawn_blocking, so this is not really a bug. And your toy example demonstrates that the base case works as expected.

I was mainly looking to correct my mental model of how the concurrency of Rocket works based on this anomaly. In that respect, I'm more or less satisfied. If no one is able to come up with another constructive hypothesis feel free to close this issue.

@samanpa

samanpa commented Jun 21, 2022

FTR I am seeing a similar issue with rc2. We have a web server that acts as a cargo registry. When a new crate is published, it calls cargo metadata, which then contacts the web server again. The server does not respond to the cargo metadata request (strace shows that it doesn't even accept the new connection). In fact, it is no longer able to process any incoming requests.

Using spawn_blocking helps but does not completely fix the issue. There are also no locks in the code. (I have also not been successful in making a simple reproduction of the behaviour.)

You can find the pstack of the deadlocked application here. It is waiting for the server to respond to cargo metadata, but the TCP connection is not being accepted. You can also see at the bottom of the trace that hyper starts everything. So I am currently unsure if the problem is in hyper or rocket.

#0  0x00007ff389a9accd in poll () from /lib64/libc.so.6
#1  0x000055dd64354c60 in {closure#0} () at library/std/src/sys/unix/pipe.rs:96
#2  cvt_r<i32, std::sys::unix::pipe::read2::{closure_env#0}> () at library/std/src/sys/unix/mod.rs:211
#3  read2 () at library/std/src/sys/unix/pipe.rs:96
#4  std::process::Child::wait_with_output::h2fceb171f08f13cb () at library/std/src/process.rs:1927
#5  0x000055dd6435477d in {closure#0} () at library/std/src/process.rs:913
#6  and_then<std::process::Child, std::io::error::Error, std::process::Output, std::process::{impl#22}::output::{closure_env#0}> () at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/result.rs:1332
#7  std::process::Command::output::h01d2a524572c78c4 () at library/std/src/process.rs:910
#8  0x000055dd63761e66 in cargo_metadata::MetadataCommand::exec::h0a3e0adb42e83416 (self=0x7ff389991870) at /tmp/.cargo/registry/src/crates.dev.bloomberg.com-44059b74dffe3dca/cargo_metadata-0.11.4/src/lib.rs:587
#9  0x000055dd636d22fb in reg_index::metadata::metadata_reg::hb7883e7d20059554 (index_url=..., manifest_path=..., crate_path=..., package_args=...) at /tmp/.cargo/registry/src/crates.dev.bloomberg.com-44059b74dffe3dca/reg-index-0.4.1/src/metadata.rs:76
#10 0x000055dd62f630c5 in reg_index::add::add_reg::hc618fca982aae3ca (index_path=0x7ff389993910, index_url=..., manifest_path=..., crate_path=..., upload=..., package_args=...) at /tmp/.cargo/registry/src/crates.dev.bloomberg.com-44059b74dffe3dca/reg-index-0.4.1/src/add.rs:90
#11 0x000055dd62f60a8f in reg_index::add::add_from_crate::h86045826c932bfde (index_path=0x7ff389993910, index_url=..., crate_path=0x7ff3899931a8, upload=...) at /tmp/.cargo/registry/src/crates.dev.bloomberg.com-44059b74dffe3dca/reg-index-0.4.1/src/add.rs:233
#12 0x000055dd62f67410 in _$LT$crates_registry..index..publish..CargoPublishTransact$u20$as$u20$crates_registry..index..Transact$GT$::run::h6ad7aa2c444f94d2 (self=..., index=...) at src/index/publish.rs:231
#13 0x000055dd62f3ab51 in crates_registry::index::RegistryIndex::run::ha089dc7c6048791c (self=..., trans=...) at src/index/mod.rs:69
#14 0x000055dd62f78f3a in crates_registry::api::v1::new::publish::_$u7b$$u7b$closure$u7d$$u7d$::hea209b4732171f51 () at src/api/v1/new.rs:93
#15 0x000055dd62e870c9 in _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hbce15a4359aa8dac (self=..., cx=0x7ff3899a4488) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91
#16 0x000055dd62f7bde7 in crates_registry::api::v1::new::publish::into_info::monomorphized_function::_$u7b$$u7b$closure$u7d$$u7d$::hf4f7b34c9a95ad98 () at src/api/v1/new.rs:34
#17 0x000055dd62e85c0a in _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h1188c6f39cdcf0bc (self=..., cx=0x7ff3899a4488) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91
#18 0x000055dd63e321af in _$LT$core..pin..Pin$LT$P$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h6be16fe87046f2ed (self=..., cx=0x7ff3899a4488) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/future.rs:124
#19 0x000055dd63e7fdaf in _$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..future..future..Future$GT$::poll::haafa47609e9b1d60 (self=..., cx=0x7ff3899a4488) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panic/unwind_safe.rs:289
#20 0x000055dd63df007e in _$LT$futures_util..future..future..catch_unwind..CatchUnwind$LT$Fut$GT$$u20$as$u20$core..future..future..Future$GT$::poll::_$u7b$$u7b$closure$u7d$$u7d$::hc41e057b0a8af397 () at /tmp/.cargo/registry/src/crates.dev.bloomberg.com-44059b74dffe3dca/futures-util-0.3.21/src/future/future/catch_unwind.rs:36
#21 0x000055dd63e8287c in _$LT$core..panic..unwind_safe..AssertUnwindSafe$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$$LP$$RP$$GT$$GT$::call_once::h84e4960d547a0c89 (self=..., _args=0) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/panic/unwind_safe.rs:271
#22 0x000055dd63cf58ab in std::panicking::try::do_call::h1aa54d11e5d912db (data=0x7ff389998ab0 "\270E") at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:492
#23 0x000055dd63d0ba8b in __rust_try () at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/slice/mod.rs:620
#24 0x000055dd63cf49bf in std::panicking::try::h085f1de032c77361 (f=...) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panicking.rs:456
#25 0x000055dd63e664ab in std::panic::catch_unwind::hdaf1202dc27e5f4f (f=...) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/std/src/panic.rs:137
#26 0x000055dd63deffb4 in _$LT$futures_util..future..future..catch_unwind..CatchUnwind$LT$Fut$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hf5fea4711e5a5be2 (self=..., cx=0x7ff3899a4488) at /tmp/.cargo/registry/src/crates.dev.bloomberg.com-44059b74dffe3dca/futures-util-0.3.21/src/future/future/catch_unwind.rs:36
#27 0x000055dd63de6785 in rocket::server::handle::_$u7b$$u7b$closure$u7d$$u7d$::h2309afedd2f41c7d () at /tmp/.cargo/registry/src/crates.dev.bloomberg.com-44059b74dffe3dca/rocket-0.5.0-rc.2/src/server.rs:54
#28 0x000055dd63e01f5b in _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::he0b232d8b5fbc5a7 (self=..., cx=0x7ff3899a4488) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91
#29 0x000055dd63ded682 in rocket::server::_$LT$impl$u20$rocket..rocket..Rocket$LT$rocket..phase..Orbit$GT$$GT$::route::_$u7b$$u7b$closure$u7d$$u7d$::h60467193a5325b1d () at /tmp/.cargo/registry/src/crates.dev.bloomberg.com-44059b74dffe3dca/rocket-0.5.0-rc.2/src/server.rs:281
#30 0x000055dd63e01e4b in _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::hd5bb1d65bc61e24e (self=..., cx=0x7ff3899a4488) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91
#31 0x000055dd63deba95 in rocket::server::_$LT$impl$u20$rocket..rocket..Rocket$LT$rocket..phase..Orbit$GT$$GT$::route_and_process::_$u7b$$u7b$closure$u7d$$u7d$::h389690c80822dce0 () at /tmp/.cargo/registry/src/crates.dev.bloomberg.com-44059b74dffe3dca/rocket-0.5.0-rc.2/src/server.rs:236
#32 0x000055dd63e009a9 in _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h11fe9c6477640cdc (self=..., cx=0x7ff3899a4488) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91
#33 0x000055dd63deac92 in rocket::server::_$LT$impl$u20$rocket..rocket..Rocket$LT$rocket..phase..Orbit$GT$$GT$::dispatch::_$u7b$$u7b$closure$u7d$$u7d$::h6b2662bca57eb976 () at /tmp/.cargo/registry/src/crates.dev.bloomberg.com-44059b74dffe3dca/rocket-0.5.0-rc.2/src/server.rs:210
#34 0x000055dd63e00789 in _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h100f58bcd1324382 (self=..., cx=0x7ff3899a4488) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91
#35 0x000055dd62c73354 in rocket::server::hyper_service_fn::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::ha63a59bb8e80b34c () at /tmp/.cargo/registry/src/crates.dev.bloomberg.com-44059b74dffe3dca/rocket-0.5.0-rc.2/src/server.rs:82
#36 0x000055dd62cab0eb in _$LT$core..future..from_generator..GenFuture$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h6a280cd89ed3aecc (self=..., cx=0x7ff3899a4488) at /rustc/fe5b13d681f25ee6474be29d748c65adcd91f69e/library/core/src/future/mod.rs:91

This was my attempt at a simple reproduction, which does not show the behaviour:

use rocket::{get, routes};

#[get("/hello")]
async fn hello() -> String {
    let stdout = std::process::Command::new("curl")
        .args(["http://localhost:8000/world"])
        .output()
        .expect("failed to execute process")
        .stdout;
    String::from_utf8(stdout).unwrap()
}

#[get("/world")]
fn world() -> &'static str {
    "Hello, world!"
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rocket = rocket::build().mount("/", routes![hello, world]);
    let _ = rocket.launch().await?;

    Ok(())
}

@SergioBenitez
Member

@samanpa You don't need any sort of locking to block. Executing a Command from std blocks until the command exits, so your route will block a worker thread until the command exits. Given that your command calls right back into the server, you'll need at least one more worker thread to be able to handle the new incoming request. Otherwise, you'll have deadlocked your server. In other words, your code as written would allow floor(n/2) requests to that route to execute in parallel, where n is the number of worker threads. If you only have one worker thread, you can execute 0 of them. I imagine this is exactly what's happening on your deployed server, and that your local server has more than 1 worker thread.

But you can avoid all of this by not blocking while the command is executing. Instead of using the Command from the standard library, you should use the async variant in tokio: https://docs.rs/tokio/latest/tokio/process/struct.Command.html. See https://rocket.rs/v0.5-rc/guide/upgrading-from-0.4/#blocking-io for more.
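
For example, the /hello route above could be rewritten against the async variant roughly as follows. This is a sketch, assuming the tokio crate (with its process feature) is available in the dependency tree; it has not been tested against the reporter's setup:

use rocket::get;

#[get("/hello")]
async fn hello() -> String {
    // tokio's Command awaits the child process instead of blocking a worker thread.
    let output = tokio::process::Command::new("curl")
        .arg("http://localhost:8000/world")
        .output()
        .await
        .expect("failed to execute process");
    String::from_utf8(output.stdout).unwrap()
}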

@samanpa

samanpa commented Jun 21, 2022

@samanpa You don't need any sort of locking to block. Executing a Command from std blocks until the command exits, so your route will block a worker thread until the command exits. Given that your command calls right back into the server, you'll need at least one more worker thread to be able to handle the new incoming request. Otherwise, you'll have deadlocked your server. In other words, your code as written would allow floor(n/2) requests to that route to execute in parallel, where n is the number of worker threads. If you only have one worker thread, you can execute 0 of them. I imagine this is exactly what's happening on your deployed server, and that your local server has more than 1 worker thread.

True, although this happens even when there are 4 workers. I would expect another worker to handle the new incoming request.

Using spawn_blocking helps, but I want to be able to understand what is going on.

E.g. the example above works with ROCKET_WORKERS=1 set.
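
(For reference, "using spawn_blocking" here refers roughly to the following pattern, i.e. moving the std Command call onto tokio's blocking thread pool. This is a sketch of the pattern, not the exact production code.)

use rocket::{get, tokio};

#[get("/hello")]
async fn hello() -> String {
    // The blocking std::process::Command call runs on tokio's blocking pool,
    // so it no longer occupies an async worker thread.
    tokio::task::spawn_blocking(|| {
        let stdout = std::process::Command::new("curl")
            .args(["http://localhost:8000/world"])
            .output()
            .expect("failed to execute process")
            .stdout;
        String::from_utf8(stdout).unwrap()
    })
    .await
    .expect("blocking task panicked")
}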

@SergioBenitez
Member

SergioBenitez commented Jun 21, 2022

You can find the pstack of the deadlocked application here.

This trace shows that the thread is waiting for the command to exit, which is to be expected. An strace of the program and all of its threads would be more useful. Then we can see what the rest of the server is waiting on.

True, although this happens even when there are 4 workers. I would expect another worker to handle the new incoming request.

This is surprising behavior; are you certain that there are workers available? Or are they potentially also stuck on the same issue?

Using spawn blocking helps but want to be able to understand what is going on.

How does it "help" in this case? The program advances, always? Or sometimes? Have you tried using the async variant of Command as I previously indicated?

e.g. the example above works with ROCKET_WORKERS=1 set.

Are you saying the example exactly as written works, or the example modified to use spawn_blocking() works?

@SergioBenitez
Member

SergioBenitez commented Jun 21, 2022

It looks like this could be a bug in tokio: tokio-rs/tokio#4730 (comment).

@SergioBenitez added the bug (Deviation from the specification or expected behavior) and upstream (An unresolvable issue: an upstream dependency bug) labels and removed the triage (A bug report being investigated) label Mar 27, 2023
@SergioBenitez mentioned this issue Feb 18, 2024