
Fix S3 storage driver to support R2 Multipart upload #3940

Draft: tpoxa wants to merge 1 commit into main

Conversation


@tpoxa tpoxa commented Jun 12, 2023

Add an extra call to flushPart to avoid losing pending bytes when multipartcombinesmallpart is false.

Fixes: #3873

Signed-off-by: Maksym Trofimenko <maksym@container-registry.com>
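
A minimal sketch of the change's shape, pieced together from the discussion below (toy types, and a slice standing in for S3's UploadPart; the real diff is in the commit):

```go
package main

import "fmt"

// Toy model of the driver's two write buffers (hypothetical types; the real
// ones live in the S3 storage driver). flushPart uploads readyPart, promotes
// pendingPart into its place, and, per this patch, calls itself once more so
// the promoted bytes are uploaded too rather than dropped.
type writer struct {
	readyPart, pendingPart []byte
	parts                  [][]byte // stands in for the uploaded S3 parts
}

func (w *writer) flushPart() error {
	if len(w.readyPart) == 0 && len(w.pendingPart) == 0 {
		return nil // nothing to write; the extra call below bottoms out here
	}
	w.parts = append(w.parts, w.readyPart) // stands in for S3 UploadPart
	w.readyPart, w.pendingPart = w.pendingPart, nil
	return w.flushPart() // the extra call this patch adds
}

func main() {
	w := &writer{readyPart: []byte("ready"), pendingPart: []byte("pending")}
	if err := w.flushPart(); err != nil {
		panic(err)
	}
	fmt.Println(len(w.parts)) // 2: the pending bytes become a part of their own
}
```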
@milosgajdos milosgajdos requested a review from squizzi June 12, 2023 21:21
@Vad1mo commented Jun 14, 2023

I verified the patch, and it works as expected:

Here is an example config for R2:

```yaml
storage:
  s3:
    region: auto
    bucket: a-bucket-name-as-set-in-R2
    regionendpoint: https://126dc445e8778171899b2bd6XXXXXXX.r2.cloudflarestorage.com
    multipartcombinesmallpart: false
```

@milosgajdos (Member)

This introduces a recursion... which makes me think... I haven't properly dived into the code, but the fact that this never cropped up even in the 2.8 branch/release (

```go
func (w *writer) flushPart() error {
	if len(w.readyPart) == 0 && len(w.pendingPart) == 0 {
		// nothing to write
		return nil
	}
	if len(w.pendingPart) < int(w.driver.ChunkSize) {
		// closing with a small pending part
		// combine ready and pending to avoid writing a small part
		w.readyPart = append(w.readyPart, w.pendingPart...)
		w.pendingPart = nil
	}
	partNumber := aws.Int64(int64(len(w.parts) + 1))
	resp, err := w.driver.S3.UploadPart(&s3.UploadPartInput{
		Bucket:     aws.String(w.driver.Bucket),
		Key:        aws.String(w.key),
		PartNumber: partNumber,
		UploadId:   aws.String(w.uploadID),
		Body:       bytes.NewReader(w.readyPart),
	})
	if err != nil {
		return err
	}
	w.parts = append(w.parts, &s3.Part{
		ETag:       resp.ETag,
		PartNumber: partNumber,
		Size:       aws.Int64(int64(len(w.readyPart))),
	})
	w.readyPart = w.pendingPart
	w.pendingPart = nil
	return nil
}
```
) makes this a bit peculiar... given this introduces recursion... could writer be broken somehow? I think I'd prefer fixing that 🤔

@tpoxa (Author) commented Jun 14, 2023

It seems that the driver never worked correctly with the option multipartcombinesmallpart: false during multipart upload. Pending bytes are never flushed at the end, so the entire push fails with an offset mismatch error.

It looks like recursion, but flushPart doesn't run if the ready and pending bytes are empty:

```go
if len(w.readyPart) == 0 && len(w.pendingPart) == 0 {
```

It does the trick.
But yes, I agree. It looks a bit off.
I am keen to consider other approaches...

Thanks

@milosgajdos (Member)

@tpoxa would it be possible for you to write a test that causes the S3 storage driver tests to fail? We can take it from there, I reckon 🤔

@milosgajdos (Member) commented Jul 17, 2023

Circling back to this quickly. Is there any reason why you are setting this param to false?

```yaml
multipartcombinesmallpart: false
```

@Vad1mo commented Jul 17, 2023

R2 expects all parts to be the same size; the only exception is the last part.

@milosgajdos (Member) commented Oct 2, 2023

We've recently merged a rework of the S3 push path: #4066.

I wonder if that fixes the issue this PR attempts to address.

@pditommaso

I'm currently experiencing the multipart upload issue with R2 using distribution v2.8.3, so this PR still looks valid. Any plans to get it merged?

Vad1mo added a commit to container-registry/harbor that referenced this pull request Nov 20, 2023
@milosgajdos (Member)

@pditommaso do you experience that issue when using the distribution/distribution:edge image? The 2.8.x image is way behind main.

@pditommaso

We tested this branch, but it looks like something weird is still happening:

time="2023-11-17T14:47:12.495776284Z" level=error msg="response completed with error" err.code="blob unknown" err.detail="sha256:e4c58958181a5925816faa528ce959e487632f4cfd192f8132f71b32df2744b4" err.message="blob unknown to registry" go.version=go1.20.10 http.request.host="localhost:5000" http.request.id=3f5d56e8-b528-4739-b1b2-84a33f75fecf http.request.method=HEAD http.request.remoteaddr="172.17.0.1:58870" http.request.uri="/v2/library/ubuntu/blobs/sha256:e4c58958181a5925816faa528ce959e487632f4cfd192f8132f71b32df2744b4" http.request.useragent="docker/24.0.2 go/go1.20.4 git-commit/659604f kernel/5.4.0-1025-aws os/linux arch/amd64 UpstreamClient(Docker-Client/24.0.2 \\(linux\\))" http.response.contenttype=application/json http.response.duration=170.45707ms http.response.status=404 http.response.written=157 instance.id=e7579883-1e4a-4730-a1cd-488f9b08b1d4 service=registry vars.digest="sha256:e4c58958181a5925816faa528ce959e487632f4cfd192f8132f71b32df2744b4" vars.name=library/ubuntu version=v2.7.0-2216-g4b4747a8

and Docker continues to retry it:

```
9a5e3a1e8fb7: Pushed
b5ec8a11a02d: Retrying in 7 seconds
c6c4c25cabf9: Pushed
6dfab66a44d5: Pushed
ab18cb8eb197: Retrying in 1 second
```

@milosgajdos (Member)

Hmm, interesting. Something still seems broken when multipartcombinesmallpart is set to false.

I am still unconvinced by this patch, though; maybe @corhere has better ideas.

@corhere (Collaborator) commented Dec 15, 2023

I'm also unconvinced by the patch. If the problem is that pending bytes aren't flushed when committing/closing the writer, why not just flush twice when committing or closing the writer? Why double up all flushes?
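
For contrast, a sketch of that alternative, leaving flushPart exactly as quoted earlier in the thread (no recursion) and assuming a simplified Commit as a stand-in for the driver's Commit/Close:

```go
// Flush twice only when the client finishes the upload: the first call
// uploads readyPart and promotes pendingPart into its place, the second
// call drains those promoted bytes. Every intermediate flush stays single.
func (w *writer) Commit() error {
	if err := w.flushPart(); err != nil {
		return err
	}
	return w.flushPart()
}
```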

Also, this branch needs a rebase.

(@milosgajdos I totally called it.)

@milosgajdos (Member)

> I'm also unconvinced by the patch. If the problem is that pending bytes aren't flushed when committing/closing the writer, why not just flush twice when committing or closing the writer? Why double up all flushes?
>
> (@milosgajdos I totally called it.)

I did think about that comment at the time I was "cleaning up" that code, of course.

My understanding of the pending buffer has always been as follows:

  1. We only ever flush parts when the client explicitly asks for it via Commit or Close.
  2. Unless the client has explicitly asked to Commit / Close, while it keeps writing we maintain the "overflow" data (the bytes that haven't fit into the ready buffer) in the pending buffer, but only up to chunkSize, at which point we finally give up and do a forced flush. I suspect the reason is so we don't eventually run out of memory by creating yet another buffer if the client isn't committing: in other words, pending is distribution giving the client a bit of leeway? 🤷‍♂️

If we were flushing as soon as a writer buffer fills up, the client "loses control" over the flushes, i.e. Commit essentially becomes almost obsolete, other than completing the multipart upload? Note that this is my understanding and I might be missing something here; I'm always happy to learn and understand this better.

Now, flushing that pending buffer when *combinesmallpart is set to false seems "broken", because we only do it when *combinesmallpart is set to true (plus when some other conditions are met, such as the pending data being smaller than chunkSize).

I think I might be missing something here, as I always do (as I said, I'd like to understand this better), but based on the OP reporter:

> R2 expects all parts to be the same size; the only exception is the last part.

I feel like the potential fix should maybe be done in Commit / Close 🤔

@corhere (Collaborator) commented Dec 17, 2023

> 1. We only ever flush parts when the client explicitly asks for it via Commit or Close.

We are flushing as soon as both buffers fill up:

```go
// we filled up pending buffer, flush
if w.pending.Len() == w.pending.Cap() {
	if err := w.flush(); err != nil {
		return n, err
	}
}
```

> If we were flushing as soon as a writer buffer fills up, the client "loses control" over the flushes, i.e. Commit essentially becomes almost obsolete, other than completing the multipart upload?

Yes. Buffer management is an implementation detail of the driver. It's why flush is not exported.

I just did some digital archaeology to discover the reason why there are two buffers, only to find what I was looking for... in the doc comment of type writer, which still exists in HEAD! 🤦

```go
// writer attempts to upload parts to S3 in a buffered fashion where the last
// part is at least as large as the chunksize, so the multipart upload could be
// cleanly resumed in the future. This is violated if Close is called after less
// than a full chunk is written.
```

Well, that is not entirely unreasonable. The minimum upload part size is 5 MiB on Amazon S3. If only 1 MiB is buffered when Close is called, the driver can't just discard the buffer. (The current bug notwithstanding.) And if it uploads the buffer as an upload part, the upload would not be resumable. The driver currently solves this problem by making sure it can never get into that situation: by only flushing half the buffer in Write, it always has enough data in reserve to create a large enough part on Close.
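
A toy walk-through of that reserve strategy (sizes in MiB and a hypothetical 5 MiB chunk size; this simulates the behaviour described above, it is not driver code):

```go
package main

import "fmt"

const chunkSize = 5 // MiB; also S3's minimum size for a non-final part

func main() {
	var ready, pending, parts []int
	write := func(mib int) {
		for i := 0; i < mib; i++ {
			if len(ready) < chunkSize {
				ready = append(ready, 1)
			} else {
				pending = append(pending, 1)
			}
			// both buffers full: flush only the ready half, keep the rest
			if len(pending) == chunkSize {
				parts = append(parts, len(ready))
				ready, pending = pending, nil
			}
		}
	}
	write(12)
	// Close: combine ready and pending into the final part.
	parts = append(parts, len(ready)+len(pending))
	fmt.Println(parts) // [5 7]: the reserve keeps the final part >= chunkSize
}
```

With fewer than chunkSize bytes written in total, the final part necessarily comes out under 5 MiB, which is exactly the "violated if Close is called after less than a full chunk is written" case from the doc comment.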

Unfortunately this means multipartcombinesmallpart: false (added fairly recently, in #3556) cannot easily be fixed. The buffer contents need to go somewhere on Close. Uploading a part smaller than 5 MiB when flushing on Close does not immediately blow up, but it will prevent the write from being resumed. Therefore calling flush more times is not the correct fix. The buffered data would have to be stashed in a separate temporary object until the upload is resumed. As that is going to be a pretty large chunk of work to implement, I think the best course of action would be to remove the multipartcombinesmallpart option for v3.0.0 and document that we are incompatible with Cloudflare R2 at this time.

@corhere (Collaborator) left a comment

While this change does manage to make multipartcombinesmallpart: false less broken, resuming uploads is broken when that storage option is set.

@milosgajdos (Member)

> We are flushing as soon as both buffers fill up.

That's what I said in point 2:

> ...while it keeps writing we maintain the "overflow" data (the bytes that haven't fit into the ready buffer) in the pending buffer, but only up to chunkSize, at which point we finally give up and do a forced flush...

Anyways, thanks for the git archaeology. Come to think of it, I vaguely remember someone bringing this up (resumable uploads and how distribution handles them).

I agree, we need to remove that option and mention in the docs that those types of uploads are not supported at the moment.

Successfully merging this pull request may close these issues:

Multipart upload issues with Cloudflare R2