Problem uploading large files to S3 through Python client #3318

Open · mgrauer opened this issue Dec 4, 2020 · 9 comments

@mgrauer (Contributor) commented Dec 4, 2020

When I upload large files to S3 through the Girder Python client, the files as stored in S3 get truncated at 67108864000 bytes, no matter how much bigger the full file is. Girder still reports the full file size.

I was following the logs of my EC2-based Girder instance as I pushed up a 70 GB upload; there were no issues reported in the Girder info or error logs, or in the nginx logs.
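
For anyone reproducing this, a quick way to confirm the truncation is to compare the local file size against the ContentLength S3 reports for the stored object. A minimal sketch, assuming boto3 credentials are configured; the path, bucket, and key below are hypothetical placeholders:

import os

import boto3

s3 = boto3.client('s3')

# Hypothetical local file and S3 object names.
local_size = os.path.getsize('/data/bigfile.bin')
stored_size = s3.head_object(Bucket='my-bucket', Key='my/s3/key')['ContentLength']
print('local:', local_size, 'stored:', stored_size,
      'truncated:', stored_size < local_size)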

mgrauer added the bug label on Dec 4, 2020
@satra commented Dec 4, 2020

It would also be useful to know whether changing CHUNK_LEN in our instance of the Girder server and the client would have any unknown effects.

Also, some multipart upload limits info: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
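
For quick reference, the key limits from that page, per the AWS documentation (the constant names here are mine, for illustration):

# S3 multipart upload limits, per the AWS documentation linked above.
MAX_PARTS_PER_UPLOAD = 10000      # part numbers run 1..10,000
MIN_PART_SIZE = 5 * 1024 ** 2     # 5 MiB (the final part may be smaller)
MAX_PART_SIZE = 5 * 1024 ** 3     # 5 GiB
MAX_OBJECT_SIZE = 5 * 1024 ** 4   # 5 TiB
LIST_PARTS_PAGE_SIZE = 1000       # ListParts returns at most 1,000 parts per request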

@mgrauer (Contributor, Author) commented Dec 4, 2020

@satra I'm not aware that we changed CHUNK_LEN on the Girder server; do you know that to be true?

When I ran my test, I used a fresh pip-installed girder-client, so I had not changed CHUNK_LEN in that test. That is, I'm not sure a CHUNK_LEN change is the issue.

@zachmullen (Member) commented:

The number gives it away... It's 2^26 * 1000.

2^26 is the chunk size we send in a part (64 MB). I think there may be a part count limit of 1,000.
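
The arithmetic lines up exactly with the reported truncation point:

CHUNK_LEN = 2 ** 26        # 64 MiB parts, as sent by the client
SUSPECTED_PART_CAP = 1000  # hypothesized per-listing limit
assert CHUNK_LEN * SUSPECTED_PART_CAP == 67108864000  # the observed truncated size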

@zachmullen (Member) commented:

(A lot of AWS REST APIs tend to have count limits of 1,000; I haven't confirmed that's true for multipart finalization.)

@satra commented Dec 4, 2020

> I'm not aware that we changed CHUNK_LEN on the Girder server; do you know that to be true?

I was asking if we can change it to address the limits on our side. That seems to be a variable that is currently hard-coded separately in the Girder server and the client.

@zachmullen (Member) commented:

According to this, S3 actually supports 10,000 parts per upload. I think the issue here is that when we make the part list request, we only get back 1,000 records (the default and maximum page size) and are not making all of the required page requests.

@zachmullen (Member) commented:

Indeed, here is the problematic code.

@mgrauer (Contributor, Author) commented Dec 4, 2020

From the boto docs for list_parts:

> This request returns a maximum of 1,000 uploaded parts. The default number of parts returned is 1,000 parts. You can restrict the number of parts returned by specifying the max-parts request parameter. If your multipart upload consists of more than 1,000 parts, the response returns an IsTruncated field with the value of true, and a NextPartNumberMarker element. In subsequent ListParts requests you can include the part-number-marker query string parameter and set its value to the NextPartNumberMarker field value from the previous response.

So the problematic block might be fixed by something like the following:

parts_page = self.client.list_parts(
    Bucket=self.assetstore['bucket'],
    Key=file['s3Key'],
    UploadId=upload['s3']['uploadId'])
parts_to_finalize = [{
    'ETag': part['ETag'],
    'PartNumber': part['PartNumber'],
} for part in parts_page.get('Parts', [])]

# Keep requesting pages until S3 reports the listing is complete.
while parts_page.get('IsTruncated', False):
    nextPartNumber = parts_page['NextPartNumberMarker']
    parts_page = self.client.list_parts(
        Bucket=self.assetstore['bucket'],
        Key=file['s3Key'],
        UploadId=upload['s3']['uploadId'],
        PartNumberMarker=nextPartNumber)
    parts_to_finalize.extend([{
        'ETag': part['ETag'],
        'PartNumber': part['PartNumber'],
    } for part in parts_page.get('Parts', [])])

self.client.complete_multipart_upload(
    Bucket=self.assetstore['bucket'],
    Key=file['s3Key'],
    UploadId=upload['s3']['uploadId'],
    MultipartUpload={'Parts': parts_to_finalize})
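
For what it's worth, boto3 also ships a built-in paginator for list_parts that handles the IsTruncated/NextPartNumberMarker bookkeeping itself. A sketch of an equivalent fix using it, against the same variables as above:

# Let boto3's paginator walk all ListParts pages for us.
paginator = self.client.get_paginator('list_parts')
parts_to_finalize = []
for parts_page in paginator.paginate(
        Bucket=self.assetstore['bucket'],
        Key=file['s3Key'],
        UploadId=upload['s3']['uploadId']):
    parts_to_finalize.extend({
        'ETag': part['ETag'],
        'PartNumber': part['PartNumber'],
    } for part in parts_page.get('Parts', []))

self.client.complete_multipart_upload(
    Bucket=self.assetstore['bucket'],
    Key=file['s3Key'],
    UploadId=upload['s3']['uploadId'],
    MultipartUpload={'Parts': parts_to_finalize})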

@yarikoptic commented:

> CHUNK_LEN

Observation: I assume that we are using _proxiedUploadChunk -- that one does not use CHUNK_LEN and just takes the size of the chunk, so I would assume that changing the chunk size in girder-client would have the desired effect. But initUpload decides between chunked and non-chunked upload based on CHUNK_LEN. That sounds a bit off to me -- or does the client's chunk size have nothing to do with the server's S3 backend chunk size?
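
If tuning the client chunk size were used as a stopgap before the listing fix lands, the bound is simple: only the first 1,000 listed parts get finalized, so the largest upload that survives intact is chunk_size × 1000. A sketch with a hypothetical helper:

def max_safe_upload(chunk_size, listed_parts=1000):
    # With the listing bug, only the first `listed_parts` parts are finalized,
    # so anything beyond chunk_size * listed_parts bytes is dropped.
    return chunk_size * listed_parts

print(max_safe_upload(2 ** 26))  # 67108864000 bytes (62.5 GiB), matching the report
print(max_safe_upload(2 ** 28))  # a 256 MiB chunk would quadruple the bound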
