Chunk FindMissingBlobsRequest appropriately #20708

Open · tgolsson wants to merge 4 commits into main from ts/max-message-size-find-missing-blobs

Conversation

@tgolsson (Contributor)

No description provided.

@jasonwbarnett (Contributor)

I haven't forgotten about this @tgolsson. I just haven't made time to test this. I'll attempt to get to this no later than end of next week (4/14/2024).

@jasonwbarnett (Contributor)

@tgolsson it worked!

@tgolsson tgolsson added the category:bugfix (Bug fixes for released features) and remote labels on May 13, 2024
@tgolsson tgolsson force-pushed the ts/max-message-size-find-missing-blobs branch from 694efe3 to b7d8edc on May 13, 2024 16:07
@tgolsson tgolsson marked this pull request as ready for review May 13, 2024 16:17
@tgolsson tgolsson requested a review from huonw May 13, 2024 16:17
@huonw (Contributor) left a comment:

Nice

let instance_name = "";
// NOTE[TSolberg]:
// This test is a bit of a hack, but it's the best way I could think of to ensure that the size of the
// FindMissingBlobsRequest is roughly what we expect. The only delta would be the encoding of the
@huonw (Contributor):

Sentence is truncated: maybe?

Suggested change:
- // FindMissingBlobsRequest is roughly what we expect. The only delta would be the encoding of the
+ // FindMissingBlobsRequest is roughly what we expect. The only delta would be the encoding of the instance name

instance_name: self.instance_name.as_ref().cloned().unwrap_or_default(),
blob_digests: digests.into_iter().map(|d| d.into()).collect::<Vec<_>>(),
};
let blob_digests = digests.into_iter().map(|d| d.into()).collect::<Vec<_>>();
@huonw (Contributor):

One option here would be collecting into a set instead, to deduplicate any repetition: for instance, empty files, or a node_modules folder with the same package installed many times in different places (if that's a thing). Thoughts?

Now that I write it out, maybe best for that to be a separate change, so never mind, but floating it here to explain my thinking.
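
For reference, a minimal sketch of the set idea (not part of this PR; it assumes the digest type implements `Hash` and `Eq`):

```rust
use std::collections::HashSet;

// Sketch: deduplicate before converting, so repeated digests (e.g. many
// empty files) are only queried once per FindMissingBlobs round-trip.
let unique: HashSet<Digest> = digests.into_iter().collect();
let blob_digests = unique.into_iter().map(|d| d.into()).collect::<Vec<_>>();
```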

};
let blob_digests = digests.into_iter().map(|d| d.into()).collect::<Vec<_>>();

let max_digests_per_request: usize = (4 * 1024 * 1024
@huonw (Contributor):

Could this 4MiB be pulled out into a constant with a comment that explains where it comes from?
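
For illustration, a hedged sketch of such a constant (the name and comment here are invented, not taken from the PR):

```rust
/// gRPC servers commonly cap decoded messages at 4 MiB by default (this is
/// tonic's default max message size), so each FindMissingBlobsRequest must
/// fit within this budget.
const MAX_GRPC_MESSAGE_SIZE_BYTES: usize = 4 * 1024 * 1024;
```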

@@ -68,6 +68,10 @@ Setting [the `orphan_files_behaviour = "ignore"` option](https://www.pantsbuild.

### Plugin API changes

## Bugfixes

- Remote cache: `FindMissingBlobsRequest` will now make multiple requests if the number of files is large. (#20708)
@huonw (Contributor):

Similar to #20492 (comment):

Comment on lines +401 to +406
- self
.instance_name
.as_ref()
.cloned()
.unwrap_or_default()
.len())
@huonw (Contributor):

Is the instance name encoded into the gRPC message as exactly its UTF-8 bytes? I'd personally be expecting an extra one or two bytes for either a varint-encoded length or a null terminator (the former, I imagine). If we get unlucky with configuration, a request could exceed the 4MiB limit by 1 byte and that'd be very annoying! (e.g. a 44-byte instance_name would make for an exactly-4MiB request (59918 * 70 + 44 = 4194304 bytes), and thus any extra content would push it over)

I guess we could either:

  • make this exactly precise (and add a loop over different instance_names to the test)
  • just add a semi-arbitrary slop factor:

Suggested change (replace the trailing `.len())` with `.len() - 10)`):

  - self
    .instance_name
    .as_ref()
    .cloned()
    .unwrap_or_default()
    .len() - 10)
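
If exact precision were preferred over a slop factor, a hedged sketch of one way to get it (assuming `prost::Message` is available here and that the generated request type is in scope under this name) would be measuring a digest-free prototype request rather than hand-counting field overhead:

```rust
use prost::Message;

// Sketch: encode an empty request carrying only the instance name, so the
// field tag and varint length prefix are counted automatically instead of
// being approximated by instance_name.len() alone.
let overhead = remexec::FindMissingBlobsRequest {
    instance_name: self.instance_name.as_ref().cloned().unwrap_or_default(),
    ..Default::default()
}
.encoded_len();
```

That would keep the budget arithmetic exact without a magic `- 10`.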

provider
.list_missing_digests(&mut test_data.into_iter())
.await,
Ok(HashSet::new())
@huonw (Contributor):

Thanks for the test. I think there are a few plausible mistakes that we could introduce in future that won't be caught:

  1. only sending each unique digest once, thus invalidating this test (i.e. if we do the HashSet idea, list_missing_digests will be working with just one digest and thus only send one request)
  2. skipping the second+ request (either not sending them at all, or not including the results in the final aggregation)
  3. Not sure about this one: if we break this in another way and start sending >4MiB requests, does the StubCAS reject the requests (i.e. does this test fail without the fix)?

I think we can address each of them:

  1. check assert_eq!(cas.request_count(RequestType::CASFindMissingBlobs), 2) so that the test starts failing if that core assumption changes (sketched below)
  2. add two digests that are missing within each "chunk" (e.g. one at the start and one at the end)
  3. add request size enforcement to StubCAS if not already there (this might cause other tests to fail, so we might want it to be optional until we resolve them)
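
A sketch of what point 1 could look like in the test (hedged; the `request_count` API is the one named above, and the exact enum path is assumed):

```rust
// Sketch: assert the digests were split across exactly two requests, so
// accidental deduplication (one request) or a dropped chunk fails loudly.
assert_eq!(
    provider.list_missing_digests(&mut test_data.into_iter()).await,
    Ok(HashSet::new())
);
assert_eq!(cas.request_count(RequestType::CASFindMissingBlobs), 2);
```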

};

msg
});

workunit_store::increment_counter_if_in_workunit(Metric::RemoteStoreExistsAttempts, 1);
@huonw (Contributor):

I'm... not sure whether these should be counting list_missing_digests calls (i.e. remain here) or network requests (i.e. move within the map(|request| { ... }) closure). 🤔
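
For concreteness, a sketch of the alternative placement (whether per-network-request counting is the right semantics is exactly the open question):

```rust
let requests = blob_digests.chunks(max_digests_per_request).map(|digests| {
    // Sketch: incrementing here would count one attempt per network
    // request, rather than one per list_missing_digests call.
    workunit_store::increment_counter_if_in_workunit(Metric::RemoteStoreExistsAttempts, 1);
    // ... build the FindMissingBlobsRequest for this chunk ...
});
```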

@@ -21,6 +21,7 @@ tokio-util = { workspace = true, features = ["io"] }
tonic = { workspace = true }
uuid = { workspace = true, features = ["v4"] }
workunit_store = { path = "../../workunit_store" }
prost = { workspace = true }
@huonw (Contributor):

It looks like this should technically be in dev-dependencies, since it's only used by tests... but I suspect it doesn't matter given it's already in the non-dev dependency tree. Only worth moving if you're making other changes.


let client = self.cas_client.as_ref().clone();
let requests = blob_digests.chunks(max_digests_per_request).map(|digests| {
@huonw (Contributor):

The core of the fix is very nice, good choice!
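
For readers skimming the thread, a self-contained sketch of the chunk-then-aggregate shape (illustrative only; `Digest` and `check_missing` are stand-ins here, not the PR's actual types):

```rust
use std::collections::HashSet;
use futures::future;

// Illustrative sketch: split digests into size-bounded batches, issue one
// request per batch, and union the "missing" sets from every response.
async fn list_missing_chunked<Digest, F, Fut>(
    digests: Vec<Digest>,
    max_per_request: usize,
    check_missing: F, // stand-in for the per-chunk FindMissingBlobs RPC
) -> Result<HashSet<Digest>, String>
where
    Digest: Clone + Eq + std::hash::Hash,
    F: Fn(Vec<Digest>) -> Fut,
    Fut: std::future::Future<Output = Result<Vec<Digest>, String>>,
{
    let requests = digests
        .chunks(max_per_request)
        .map(|chunk| check_missing(chunk.to_vec()));
    let responses = future::try_join_all(requests).await?;
    Ok(responses.into_iter().flatten().collect())
}
```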

@huonw (Contributor) commented May 13, 2024:

Oh, also, does this fix #20674? Could you update the PR description with "fixes #20674" to get the appropriate auto-close behaviour? Thanks!
