Problem uploading large files to S3 through Python client #3318

Open · mgrauer opened this issue Dec 4, 2020 · 9 comments

@mgrauer (Contributor) commented Dec 4, 2020

When I upload large files to S3 through the Girder Python client, the files as stored in S3 get truncated at 67108864000 bytes, no matter how much bigger the full file is. Girder still reports the full file size.

I was following the logs of my EC2-based Girder instance as I pushed up a 70 GB upload; there were no issues reported in the Girder info or error logs, or in the nginx logs.
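
For anyone reproducing this, a quick way to confirm the truncation is to compare the local file size against the ContentLength S3 reports for the stored object. A minimal sketch, assuming boto3 credentials are configured; the path, bucket, and key below are hypothetical placeholders:

import os

import boto3

s3 = boto3.client('s3')

# Hypothetical local file and S3 object names.
local_size = os.path.getsize('/data/bigfile.bin')
stored_size = s3.head_object(Bucket='my-bucket', Key='my/s3/key')['ContentLength']
print('local:', local_size, 'stored:', stored_size,
      'truncated:', stored_size < local_size)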

mgrauer added the bug label on Dec 4, 2020
@satra commented Dec 4, 2020

It would also be useful to know whether changing CHUNK_LEN in our instance of the Girder server and the client would have any unknown effects.

Also, some multipart upload limits info: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
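
For quick reference, the key limits from that page, per the AWS documentation (the constant names here are mine, for illustration):

# S3 multipart upload limits, per the AWS documentation linked above.
MAX_PARTS_PER_UPLOAD = 10000      # part numbers run 1..10,000
MIN_PART_SIZE = 5 * 1024 ** 2     # 5 MiB (the final part may be smaller)
MAX_PART_SIZE = 5 * 1024 ** 3     # 5 GiB
MAX_OBJECT_SIZE = 5 * 1024 ** 4   # 5 TiB
LIST_PARTS_PAGE_SIZE = 1000       # ListParts returns at most 1,000 parts per request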

@mgrauer (Contributor, Author) commented Dec 4, 2020

@satra I'm not aware that we changed CHUNK_LEN on the Girder server; do you know that to be true?

When I ran my test, I used a fresh pip-installed girder-client, so I had not changed CHUNK_LEN in that test. That is, I'm not sure a CHUNK_LEN change is the issue.

@zachmullen (Member) commented:

The number gives it away... It's 2^26 * 1000.

2^26 is the chunk size we send in a part (64 MB). I think there may be a part count limit of 1,000.
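
The arithmetic lines up exactly with the reported truncation point:

CHUNK_LEN = 2 ** 26        # 64 MiB parts, as sent by the client
SUSPECTED_PART_CAP = 1000  # hypothesized per-listing limit
assert CHUNK_LEN * SUSPECTED_PART_CAP == 67108864000  # the observed truncated size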

@zachmullen (Member) commented:

(A lot of AWS REST APIs tend to have count limits of 1,000; I haven't confirmed that's true for multipart finalization.)

@satra commented Dec 4, 2020

> I'm not aware that we changed CHUNK_LEN on the Girder server; do you know that to be true?

I was asking if we can change it to address the limits on our side. That seems to be a variable that is currently hard-coded separately in the Girder server and the client.

@zachmullen (Member) commented:

According to this, S3 actually supports 10,000 parts per upload. I think the issue here is that when we make the part list request, we only get back 1,000 records (the default and maximum page size) and are not making all of the required page requests.

@zachmullen (Member) commented:

Indeed, here is the problematic code.

@mgrauer (Contributor, Author) commented Dec 4, 2020

From the boto docs for list_parts:

> This request returns a maximum of 1,000 uploaded parts. The default number of parts returned is 1,000 parts. You can restrict the number of parts returned by specifying the max-parts request parameter. If your multipart upload consists of more than 1,000 parts, the response returns an IsTruncated field with the value of true, and a NextPartNumberMarker element. In subsequent ListParts requests you can include the part-number-marker query string parameter and set its value to the NextPartNumberMarker field value from the previous response.

So the problematic block might be fixed by something like the following:

parts_page = self.client.list_parts(
    Bucket=self.assetstore['bucket'],
    Key=file['s3Key'],
    UploadId=upload['s3']['uploadId'])
parts_to_finalize = [{
    'ETag': part['ETag'],
    'PartNumber': part['PartNumber'],
} for part in parts_page.get('Parts', [])]

# Keep requesting pages until S3 reports the listing is complete.
while parts_page.get('IsTruncated', False):
    nextPartNumber = parts_page['NextPartNumberMarker']
    parts_page = self.client.list_parts(
        Bucket=self.assetstore['bucket'],
        Key=file['s3Key'],
        UploadId=upload['s3']['uploadId'],
        PartNumberMarker=nextPartNumber)
    parts_to_finalize.extend([{
        'ETag': part['ETag'],
        'PartNumber': part['PartNumber'],
    } for part in parts_page.get('Parts', [])])

self.client.complete_multipart_upload(
    Bucket=self.assetstore['bucket'],
    Key=file['s3Key'],
    UploadId=upload['s3']['uploadId'],
    MultipartUpload={'Parts': parts_to_finalize})
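
For what it's worth, boto3 also ships a built-in paginator for list_parts that handles the IsTruncated/NextPartNumberMarker bookkeeping itself. A sketch of an equivalent fix using it, against the same variables as above:

# Let boto3's paginator walk all ListParts pages for us.
paginator = self.client.get_paginator('list_parts')
parts_to_finalize = []
for parts_page in paginator.paginate(
        Bucket=self.assetstore['bucket'],
        Key=file['s3Key'],
        UploadId=upload['s3']['uploadId']):
    parts_to_finalize.extend({
        'ETag': part['ETag'],
        'PartNumber': part['PartNumber'],
    } for part in parts_page.get('Parts', []))

self.client.complete_multipart_upload(
    Bucket=self.assetstore['bucket'],
    Key=file['s3Key'],
    UploadId=upload['s3']['uploadId'],
    MultipartUpload={'Parts': parts_to_finalize})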

@yarikoptic commented:

> CHUNK_LEN

Observation: I assume that we are using _proxiedUploadChunk -- that one does not use CHUNK_LEN and just takes the size of the chunk, so I would assume that changing the chunk size in girder-client would have the desired effect. But initUpload decides between chunked and non-chunked upload based on CHUNK_LEN. That sounds a bit off to me -- or does the client's chunk size have nothing to do with the server's S3 backend chunk size?
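
If tuning the client chunk size were used as a stopgap before the listing fix lands, the bound is simple: only the first 1,000 listed parts get finalized, so the largest upload that survives intact is chunk_size × 1000. A sketch with a hypothetical helper:

def max_safe_upload(chunk_size, listed_parts=1000):
    # With the listing bug, only the first `listed_parts` parts are finalized,
    # so anything beyond chunk_size * listed_parts bytes is dropped.
    return chunk_size * listed_parts

print(max_safe_upload(2 ** 26))  # 67108864000 bytes (62.5 GiB), matching the report
print(max_safe_upload(2 ** 28))  # a 256 MiB chunk would quadruple the bound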
