New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Chunk FindMissingBlobsRequest appropriately #20708
base: main
Are you sure you want to change the base?
Conversation
I haven't forgotten about this @tgolsson. I just haven't made time to test this. I'll attempt to get to this no later than end of next week (4/14/2024). |
@tgolsson it worked! |
694efe3
to
b7d8edc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice
let instance_name = ""; | ||
// NOTE[TSolberg]: | ||
// This test is a bit of a hack, but it's the best way I could think of to ensure that the size of the | ||
// FindMissingBlobsRequest is roughly what we expect. The only delta would be the encoding of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sentence is truncated: maybe?
// FindMissingBlobsRequest is roughly what we expect. The only delta would be the encoding of the | |
// FindMissingBlobsRequest is roughly what we expect. The only delta would be the encoding of the instance name |
instance_name: self.instance_name.as_ref().cloned().unwrap_or_default(), | ||
blob_digests: digests.into_iter().map(|d| d.into()).collect::<Vec<_>>(), | ||
}; | ||
let blob_digests = digests.into_iter().map(|d| d.into()).collect::<Vec<_>>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One option here would be collecting into a set instead, to deduplicate any repetition. For instance empty files, or a node_modules folder with the same package installed a lot of times in different places (if that's a thing). Thoughts?
Now that I write it out, maybe best for that to be a separate change, so never mind, but floating it here to explain my thinking.
}; | ||
let blob_digests = digests.into_iter().map(|d| d.into()).collect::<Vec<_>>(); | ||
|
||
let max_digests_per_request: usize = (4 * 1024 * 1024 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this 4MiB be pulled out into constant with a comment that explains where it comes from?
@@ -68,6 +68,10 @@ Setting [the `orphan_files_behaviour = "ignore"` option](https://www.pantsbuild. | |||
|
|||
### Plugin API changes | |||
|
|||
## Bugfixes | |||
|
|||
- Remote cache: `FindMissingBlobsRequest` will now make multiple request if the number of files is large. (#20708) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to #20492 (comment):
- we've historically put this sort of change into a dedicated "remote caching/execution" section: https://github.com/pantsbuild/pants/blob/main/docs/notes/2.20.x.md#remote-cachingexecution
- we might want to make the PR reference a manual link
- self | ||
.instance_name | ||
.as_ref() | ||
.cloned() | ||
.unwrap_or_default() | ||
.len()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the instance name encoded into the gRPC message as exactly its UTF-8 bytes? I'd personally be expecting an extra one or two bytes for either a varint-encoded length or a null terminator (the first I imagine). If we get unlucky with configuration, a request could exceed the 4MiB limit by 1 byte and that'd be very annoying! (e.g. a 44 byte instance_name
would make for a 4MiB request (59918 * 70 + 44
) and thus any extra content would be too large)
I guess we could either:
- make this exactly precise (and a loop over different
instance_names
to the test) - just add semi-arbitrary slop factor:
- self | |
.instance_name | |
.as_ref() | |
.cloned() | |
.unwrap_or_default() | |
.len()) | |
- self | |
.instance_name | |
.as_ref() | |
.cloned() | |
.unwrap_or_default() | |
.len() - 10) |
provider | ||
.list_missing_digests(&mut test_data.into_iter()) | ||
.await, | ||
Ok(HashSet::new()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the test. I think there's a few plausible mistakes that we could introduce in future that won't be caught:
- only sending each unique digest once, thus invalidating this test (i.e. if we do the
HashSet
idea, thelist_missing_digests
will be working with just one digest and thus only send one request) - skipping the second+ request (either not sending them at all, or not including the results in the final aggregation)
- Not sure about this one: if we break this in another way and start sending >4MiB requests, does the
StubCAS
reject the requests (i.e. does this test fail without the fix)?
I think we can address each of them:
- check
assert_eq!(cas.request_count(RequestType::CASFindMissingBlobs), 2)
so that the test starts failing if that core assumption changes - add two digests that is missing, within each "chunk" (e.g. one at the start and one at the end)
- add request size enforcement to
StubCAS
if not already there (this might cause other tests to fail, so we might want it to be optional until we resolve them)
}; | ||
|
||
msg | ||
}); | ||
|
||
workunit_store::increment_counter_if_in_workunit(Metric::RemoteStoreExistsAttempts, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm... not sure whether these should be counting list_missing_digests
calls (i.e. remain here) or network requests (i.e. move within the map(|request| { ... })
closure). 🤔
@@ -21,6 +21,7 @@ tokio-util = { workspace = true, features = ["io"] } | |||
tonic = { workspace = true } | |||
uuid = { workspace = true, features = ["v4"] } | |||
workunit_store = { path = "../../workunit_store" } | |||
prost = { workspace = true } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this should technically be in dev-dependencies
, since it's only used by tests... but I suspect it doesn't matter given it's already in the non-dev dependency tree. Only worth moving if you're making other changes.
|
||
let client = self.cas_client.as_ref().clone(); | ||
let requests = blob_digests.chunks(max_digests_per_request).map(|digests| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This core of the fix is very nice, good choice!
No description provided.