
Implement bulk_delete_request for Azure #5681

Open
wants to merge 4 commits into master

Conversation

andrebsguedes (Contributor)

Which issue does this PR close?

Closes #5680.

@github-actions github-actions bot added the object-store Object Store Interface label Apr 22, 2024
@tustvold (Contributor)

Sorry for the delay, I'll try to get to this next week, it's a spicy one 😅

Azure's APIs are truly something else 🙃

@andrebsguedes (Contributor, Author)

@tustvold Don't tell me! It's sad to see the amount of gymnastics their official clients have to do, like encoding and decoding HTTP manually, just because someone thought it would be a good idea to send quasi-HTTP requests within a multipart request... 🫠
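For readers following along, here is an illustrative sketch (not the PR's actual code) of what that nesting looks like: an Azure Blob Batch body is a multipart/mixed message (RFC 2046) whose parts are themselves serialized HTTP requests. The helper name, blob paths, and boundary value are made up for illustration.

```rust
// Illustrative sketch: each part of the multipart body is itself a serialized
// HTTP request, wrapped with part headers and a boundary delimiter.
fn build_batch_body(boundary: &str, sub_requests: &[&str]) -> String {
    let mut body = String::new();
    for (id, sub) in sub_requests.iter().enumerate() {
        body.push_str(&format!(
            "--{boundary}\r\n\
             Content-Type: application/http\r\n\
             Content-Transfer-Encoding: binary\r\n\
             Content-ID: {id}\r\n\
             \r\n\
             {sub}"
        ));
    }
    // The closing delimiter ends with an extra "--".
    body.push_str(&format!("--{boundary}--\r\n"));
    body
}

fn main() {
    let sub = "DELETE /container/blob-0 HTTP/1.1\r\nContent-Length: 0\r\n\r\n";
    let body = build_batch_body("batch_example", &[sub]);
    assert!(body.contains("Content-ID: 0"));
    assert!(body.ends_with("--batch_example--\r\n"));
    println!("{body}");
}
```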

}
}

fn serialize_part_request(
Contributor suggested change:

- fn serialize_part_request(
+ /// https://docs.oasis-open.org/odata/odata/v4.0/errata02/os/complete/part1-protocol/odata-v4.0-errata02-os-part1-protocol-complete.html#_Toc406398359
+ fn serialize_part_request(

@tustvold (Contributor) left a comment:

So I've gone through this, and whilst I don't see anything obviously incorrect, I am quite nervous about it. It is implementing a fairly complex specification, https://www.ietf.org/rfc/rfc2046, and I worry there may be incorrect assumptions, implementation oversights, etc... If this were being used for a read-only request, that's one thing, but the potential impact of a malformed bulk delete is high.

I don't know if there are mature Rust implementations of multipart MIME messages, but if there are, one idea might be to add this as an optional dependency, potentially gated behind its own feature (e.g. azure_batch), and use that to implement this functionality. That would delegate the maintenance and correctness burden to a crate focused on implementing that particular specification correctly.


let credential = self.get_credential().await?;

let boundary = format!("batch_{}", uuid::Uuid::new_v4());
Contributor suggested change:

- let boundary = format!("batch_{}", uuid::Uuid::new_v4());
+ // https://www.ietf.org/rfc/rfc2046
+ let boundary = format!("batch_{}", uuid::Uuid::new_v4());


// Encode part headers
let mut part_headers = HeaderMap::new();
part_headers.insert(CONTENT_TYPE, "application/http".parse().unwrap());
Contributor:
I think you can use HeaderValue::from_static here

@@ -240,6 +268,157 @@ impl<'a> PutRequest<'a> {
}
}

#[inline]
fn extend(dst: &mut Vec<u8>, data: &[u8]) {
Contributor:
Is this needed?

}

// Write header names as title case. The header name is assumed to be ASCII.
fn title_case(dst: &mut Vec<u8>, name: &[u8]) {
Contributor:
FWIW this may not be necessary as headers are case insensitive
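For context, a minimal sketch of what such a title-casing helper might look like (assumed, not the PR's exact code). Note that naive title-casing renders "content-id" as "Content-Id", which is why a separate Content-ID special case is needed for Azure.

```rust
// Sketch: write an ASCII header name in Title-Case, uppercasing the first
// byte and each byte following a '-' (e.g. "x-ms-date" -> "X-Ms-Date").
fn title_case(dst: &mut Vec<u8>, name: &[u8]) {
    let mut upper_next = true;
    for &b in name {
        if upper_next {
            dst.push(b.to_ascii_uppercase());
        } else {
            dst.push(b);
        }
        upper_next = b == b'-';
    }
}

fn main() {
    let mut out = Vec::new();
    title_case(&mut out, b"content-id");
    assert_eq!(out, b"Content-Id".to_vec()); // not "Content-ID"!
}
```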


fn write_headers(headers: &HeaderMap, dst: &mut Vec<u8>) {
for (name, value) in headers {
if name == "content-id" {
Contributor:
Why is this special case needed?

Member:
Azure returns 400 if we don't use Content-ID
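To make the special case concrete, a hedged sketch (the helper name is assumed, not the PR's exact code):

```rust
// Sketch: emit the header name exactly as "Content-ID" for this one header,
// since naive title-casing would produce "Content-Id" and Azure's batch
// endpoint reportedly rejects other casings with 400.
fn write_header_name(dst: &mut Vec<u8>, name: &str) {
    if name.eq_ignore_ascii_case("content-id") {
        dst.extend_from_slice(b"Content-ID");
    } else {
        dst.extend_from_slice(name.as_bytes());
    }
}

fn main() {
    let mut out = Vec::new();
    write_header_name(&mut out, "content-id");
    assert_eq!(out, b"Content-ID".to_vec());
}
```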

};

// Parse part headers and retrieve part id
let mut headers = [httparse::EMPTY_HEADER; 4];
Contributor:
This appears to be hard-coding an assumption that the response only contains 4 headers, is this something that can be relied upon?

};

// Parse part response headers
let mut headers = [httparse::EMPTY_HEADER; 10];
Contributor:
Similar comment to above

)
.header(CONTENT_LENGTH, HeaderValue::from(body_bytes.len()))
.body(body_bytes)
.with_azure_authorization(&credential, &self.config.account)
Contributor:
Is this necessary?

@alamb (Contributor) commented May 20, 2024

> (quoting @tustvold's review comment above)

My opinion is that having this feature as part of object_store (with a caveat in the docs, and gated behind a non-default feature flag) would be a good idea.

My rationale is that, while I agree with the technical concerns, unless there is some alternate strategy for implementing this feature today, this seems like the best way to get experience with and testing for this feature.


let boundary = format!("batch_{}", uuid::Uuid::new_v4());

let mut body_bytes = Vec::with_capacity(paths.len() * 256);
@konjac commented May 23, 2024:
Is 256 bytes enough for each path? This looks quite small given the part headers, which would likely always trigger a re-allocation.

@Xuanwo (Member) left a comment:
I suggest adding dedicated tests for the multipart parse and generate logic, similar to: https://github.com/apache/opendal/blob/be6f359d15b965147d2c0ecc4de93a95863167aa/core/src/raw/http_util/multipart.rs#L748
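As a hedged illustration of the kind of round-trip test being suggested, using a stand-in serializer/parser pair (the PR's real `serialize_part_request` and response-parsing functions would be exercised the same way, but their signatures are not shown here):

```rust
// Stand-in multipart serializer: each part is framed by the boundary.
fn serialize_parts(boundary: &str, parts: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for part in parts {
        out.extend_from_slice(format!("--{boundary}\r\n{part}\r\n").as_bytes());
    }
    out.extend_from_slice(format!("--{boundary}--\r\n").as_bytes());
    out
}

// Stand-in parser: split on the boundary delimiter and trim CRLF framing.
fn parse_parts(boundary: &str, body: &[u8]) -> Vec<String> {
    let text = String::from_utf8_lossy(body);
    let delim = format!("--{boundary}");
    text.split(delim.as_str())
        .skip(1) // preamble before the first delimiter
        .filter(|s| !s.starts_with("--")) // closing "--" terminator
        .map(|s| s.trim_matches(|c| c == '\r' || c == '\n').to_string())
        .collect()
}

fn main() {
    let parts = ["DELETE /a HTTP/1.1\r\n\r\n", "DELETE /b HTTP/1.1\r\n\r\n"];
    let body = serialize_parts("batch_test", &parts);
    let round_tripped = parse_parts("batch_test", &body);
    assert_eq!(round_tripped.len(), 2);
    assert!(round_tripped[0].starts_with("DELETE /a"));
}
```

A test like this pins the framing logic down independently of Azure, so regressions in boundary handling surface in CI rather than as malformed bulk deletes.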



// Encode part headers
let mut part_headers = HeaderMap::new();
part_headers.insert(CONTENT_TYPE, "application/http".parse().unwrap());
part_headers.insert("Content-Transfer-Encoding", "binary".parse().unwrap());
Member:
Same as the previous comment: it's better to leave a code comment explaining why Content-ID must be used instead of content-id here.
