Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clone may "miss" ongoing operations #5759

Closed
djmitche opened this issue Jun 3, 2023 · 11 comments · Fixed by #5803
Closed

Clone may "miss" ongoing operations #5759

djmitche opened this issue Jun 3, 2023 · 11 comments · Fixed by #5803
Assignees
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-fs Module: tokio/fs

Comments

@djmitche
Copy link
Contributor

djmitche commented Jun 3, 2023

Version

│   │   └── tokio v1.26.0
│   │       └── tokio-macros v1.8.2 (proc-macro)
│   │   ├── tokio v1.26.0 (*)
│   │   ├── tokio-util v0.7.7
│   │   │   ├── tokio v1.26.0 (*)
│   │   ├── tokio v1.26.0 (*)
│   │   ├── tokio v1.26.0 (*)
│   │   └── tokio-native-tls v0.3.0
│   │       └── tokio v1.26.0 (*)
│   ├── tokio v1.26.0 (*)
│   ├── tokio-native-tls v0.3.0 (*)
│   ├── tokio-util v0.7.7 (*)
└── tokio v1.26.0 (*)
│   └── tokio v1.26.0 (*)
├── tokio v1.26.0 (*)
└── tokio-util v0.7.7 (*)
│   ├── tokio v1.26.0 (*)
│   └── tokio-util v0.7.7 (*)
└── tokio v1.26.0 (*)

Platform
Linux

Description
The following will fail intermittently (a simplification of taskcluster/taskcluster#6280). It fails particularly under high CPU load.

use std::error::Error;
use tempfile::tempfile;
use tokio::io::{copy, File};

const DATA: &[u8] = b"HELLO/WORLD";

#[tokio::test]
async fn t() -> Result<(), Box<dyn Error>> {
    let mut file: File = tempfile()?.into();
    file.write_all(DATA).await?;
    let mut file2 = file.try_clone().await?;
    let mut buf = Cursor::new(Vec::new());
    copy(&mut file2, &mut buf).await?;
    assert_eq!(writer.into_inner(), DATA);
}

What's happening is that the write_all is starting a write operation in another thread, and returning Poll::Ready. The try_clone, however, misses that there's an operation ongoing and creates a new File with no such operation. So there's a race between that write completing in another thread and the read performed by copy.

I would expect the try_clone to wait until any operations that have already been described as complete actually are complete. I can see that being hard to implement, though. Perhaps the documentation could be updated with something like

NOTE
Ongoing operations on the source File may not be complete when it is cloned. New operations on the cloned File may "race" with these ongoing operations.

This is similar to #4796. I couldn't find how that issue was fixed (no commits around June 30, 2022 seemed relevant), but the current documentation doesn't describe this situation (at least not in a way that's clear to me even after having found the problem).

@djmitche djmitche added A-tokio Area: The main tokio crate C-bug Category: This is a bug. labels Jun 3, 2023
@Darksonn Darksonn added the M-fs Module: tokio/fs label Jun 3, 2023
@Darksonn
Copy link
Contributor

Darksonn commented Jun 3, 2023

Seems reasonable enough to me.

@djmitche
Copy link
Contributor Author

@Darksonn which part seems reasonable? The bug, the fix, or the documentation? If the third, I can make a PR :)

@Darksonn
Copy link
Contributor

Waiting until pending operations have finished sounds good to me.

@djmitche
Copy link
Contributor Author

I'm less confident I can accomplish that, but I will give it a shot :)

@Darksonn
Copy link
Contributor

You can just call flush inside the try_clone method.

@djmitche
Copy link
Contributor Author

I got a start in https://github.com/tokio-rs/tokio/compare/master...djmitche:tokio:issue5759?expand=1. I think the fix is straightforward, although it requires try_clone to take &mut self. I also haven't managed to replicate the issue with a test yet.

Do you mind assigning this issue to me so it shows in my dashboard?

@Darksonn
Copy link
Contributor

Ah, we don't want it to take &mut self. You'll want complete_inflight instead of flush.

@Darksonn
Copy link
Contributor

At least on Linux, you can probably test this using a pipe. That will let you create a read operation that blocks until you write something to the pipe.

@djmitche
Copy link
Contributor Author

Ah, complete_inflight makes a lot more sense - thanks!

As for testing: the bug occurs when an operation is incomplete on the original File, because its spawn_blocking execution has not completed. And after the clone, the fact that this operation has not completed is visible on the clone.

To demonstrate the error in a test, I'd need to make the operation block. I could use a pipe to do so, by calling File::write with enough data that the pipe buffer fills. This pipe buffer size is unspecified, though, so that tactic feels a little risky. I'm not sure what happens in dup2(2) when there's a write(2) call outstanding in another thread -- that seems unspecified too! And finally, I'm not sure how the cloned File would behave if that write(2) hadn't completed before the dup2(2). Also, if I managed to write a test case that reliably lost the race by blocking the write operation indefinitely, then with the complete_inflight call in place it would block the test indefinitely.

I think I've convinced myself that there's no way to test reliably -- the best I can do is a test that, without the fix, fails intermittently. So I'll work on that.

@Darksonn
Copy link
Contributor

Ah, complete_inflight makes a lot more sense - thanks!

Flushing just calls complete_inflight, so that's why I initially suggested flushing.

djmitche added a commit to djmitche/tokio that referenced this issue Jun 18, 2023
If there is an ongoing operation on a file, wait for that to complete
before performing the clone in `File::try_clone`. This avoids a race
between the ongoing operation and any subsequent operations performed on
the clone.

Fixes: tokio-rs#5759
@djmitche
Copy link
Contributor Author

@Darksonn might you have a chance to take a look at #5803, or perhaps suggest another person I could ask?

Darksonn pushed a commit that referenced this issue Jun 27, 2023
If there is an ongoing operation on a file, wait for that to complete
before performing the clone in `File::try_clone`. This avoids a race
between the ongoing operation and any subsequent operations performed on
the clone.

Fixes: #5759
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate C-bug Category: This is a bug. M-fs Module: tokio/fs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants