New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement bulk_delete_request for Azure #5681
base: master
Are you sure you want to change the base?
Conversation
Sorry for the delay, I'll try to get to this next week, it's a spicy one 😅 Azure's APIs are truly something else 🙃 |
@tustvold Don't tell me! It's sad to see the amount of gymnastics their official clients have to do, like encoding and decoding HTTP manually just because someone thought it would be a good idea to send quasi HTTP requests within a multipart request... 🫠 |
} | ||
} | ||
|
||
fn serialize_part_request( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn serialize_part_request( | |
/// https://docs.oasis-open.org/odata/odata/v4.0/errata02/os/complete/part1-protocol/odata-v4.0-errata02-os-part1-protocol-complete.html#_Toc406398359 | |
fn serialize_part_request( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've gone through this, and whilst I don't see anything obviously incorrect, I am quite nervous about it. It is implementing a fairly complex specification, https://www.ietf.org/rfc/rfc2046, and I worry there may be incorrect assumptions, implementation oversights, etc... If this were being used for a read-only request, that's one thing, but the potential impact of a malformed bulk delete is high.
I don't know if there are mature Rust implementations of multipart mime messages, but if there are, one idea might be to add this as an optional dependency, potentially gated on its own feature (e.g. azure_batch), and use that to implement this functionality. That would delegate the maintenance, correctness burden onto a crate focused on implementing that particular specification correctly
|
||
let credential = self.get_credential().await?; | ||
|
||
let boundary = format!("batch_{}", uuid::Uuid::new_v4()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let boundary = format!("batch_{}", uuid::Uuid::new_v4()); | |
// https://www.ietf.org/rfc/rfc2046 | |
let boundary = format!("batch_{}", uuid::Uuid::new_v4()); |
|
||
// Encode part headers | ||
let mut part_headers = HeaderMap::new(); | ||
part_headers.insert(CONTENT_TYPE, "application/http".parse().unwrap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can use HeaderValue::from_static here
@@ -240,6 +268,157 @@ impl<'a> PutRequest<'a> { | |||
} | |||
} | |||
|
|||
#[inline] | |||
fn extend(dst: &mut Vec<u8>, data: &[u8]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed?
} | ||
|
||
// Write header names as title case. The header name is assumed to be ASCII. | ||
fn title_case(dst: &mut Vec<u8>, name: &[u8]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW this may not be necessary as headers are case insensitive
|
||
fn write_headers(headers: &HeaderMap, dst: &mut Vec<u8>) { | ||
for (name, value) in headers { | ||
if name == "content-id" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this special case needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Azure returns 400 if we don't use Content-ID
}; | ||
|
||
// Parse part headers and retrieve part id | ||
let mut headers = [httparse::EMPTY_HEADER; 4]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears to be hard-coding an assumption that the response only contains 4 headers, is this something that can be relied upon?
}; | ||
|
||
// Parse part response headers | ||
let mut headers = [httparse::EMPTY_HEADER; 10]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment to above
) | ||
.header(CONTENT_LENGTH, HeaderValue::from(body_bytes.len())) | ||
.body(body_bytes) | ||
.with_azure_authorization(&credential, &self.config.account) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this necessary?
My opinion is that having this feature as a part of object store (with a caveat in the doc and gated behind a non default feature flag) would be a good idea. My rationale is that while I agree with the technical concerns, unless there is some alternate strategy for implementing this feature today this seems like the best way to get experience / testing for this feature |
|
||
let boundary = format!("batch_{}", uuid::Uuid::new_v4()); | ||
|
||
let mut body_bytes = Vec::with_capacity(paths.len() * 256); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 256 enough for each path? This looks quite small as there are headers, which likely always trigger re-allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest adding dedicated tests for the multipart parse and generate logic, similar to: https://github.com/apache/opendal/blob/be6f359d15b965147d2c0ecc4de93a95863167aa/core/src/raw/http_util/multipart.rs#L748
|
||
fn write_headers(headers: &HeaderMap, dst: &mut Vec<u8>) { | ||
for (name, value) in headers { | ||
if name == "content-id" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Azure returns 400 if we don't use Content-ID
// Encode part headers | ||
let mut part_headers = HeaderMap::new(); | ||
part_headers.insert(CONTENT_TYPE, "application/http".parse().unwrap()); | ||
part_headers.insert("Content-Transfer-Encoding", "binary".parse().unwrap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same to previous. It's better to leave a comment to explain why we need to use Content-ID
instead of content-id
here.
Which issue does this PR close?
Closes #5680.