Problem uploading large files to S3 through Python client #3318
It would also be useful to know whether we changed any of the multipart upload limits: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
@satra I'm not aware that changed the … When I ran my test, I used a fresh …
The number gives it away: it's 2^26 × 1000. 2^26 bytes (64 MiB) is the chunk size we send in a part. I think there may be a part-count limit of 1000.
(A lot of AWS REST APIs tend to have count limits of 1000; I haven't confirmed that's true for multipart finalization.)
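The arithmetic checks out: 1000 parts of 2^26 bytes each lands exactly on the truncated size from the report. A quick sanity check (plain Python, nothing Girder-specific):

```python
chunk_size = 2 ** 26        # Girder's S3 part size: 64 MiB per chunk
part_count = 1000           # suspected part-count limit on the listing
truncated_at = chunk_size * part_count
print(truncated_at)         # 67108864000 -- matches the reported file size
```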
I was asking if we can change it to address the limits on our side. That seems to be a variable that's currently hard-coded separately in the Girder server and client.
According to this, S3 actually supports 10k parts per upload. I think the issue here is that when we make the part-list request, we only get back 1k records (the default/max page size) and never make the follow-up page requests.
Indeed, here is the problematic code. |
From the boto docs for
So the problematic block might be fixed by something like the following:

```python
parts_page = self.client.list_parts(
    Bucket=self.assetstore['bucket'], Key=file['s3Key'],
    UploadId=upload['s3']['uploadId'])
parts_to_finalize = [{
    'ETag': part['ETag'],
    'PartNumber': part['PartNumber']
} for part in parts_page.get('Parts', [])]

# Keep requesting pages until the part listing is no longer truncated.
while parts_page.get('IsTruncated', False):
    nextPartNumber = parts_page['NextPartNumberMarker']
    parts_page = self.client.list_parts(
        Bucket=self.assetstore['bucket'],
        Key=file['s3Key'],
        UploadId=upload['s3']['uploadId'],
        PartNumberMarker=nextPartNumber)
    parts_to_finalize.extend([{
        'ETag': part['ETag'],
        'PartNumber': part['PartNumber']
    } for part in parts_page.get('Parts', [])])

self.client.complete_multipart_upload(
    Bucket=self.assetstore['bucket'],
    Key=file['s3Key'],
    UploadId=upload['s3']['uploadId'],
    MultipartUpload={'Parts': parts_to_finalize})
```
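As an alternative to the manual `while` loop, boto3 also ships a paginator for `list_parts` that issues the follow-up page requests itself. A sketch under that assumption; the helper name `list_all_parts` and the stand-alone function shape are mine, not Girder's:

```python
def list_all_parts(client, bucket, key, upload_id):
    """Collect every part of a multipart upload, following pagination.

    Uses boto3's built-in paginator for ListParts, which handles the
    PartNumberMarker bookkeeping for the follow-up page requests.
    """
    paginator = client.get_paginator('list_parts')
    parts = []
    for page in paginator.paginate(Bucket=bucket, Key=key, UploadId=upload_id):
        parts.extend({'ETag': p['ETag'], 'PartNumber': p['PartNumber']}
                     for p in page.get('Parts', []))
    return parts
```

The resulting list can then be passed to `complete_multipart_upload` as `MultipartUpload={'Parts': parts}`.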
Observation: I assume that we are using _proxiedUploadChunk -- that one does not use …
When I upload large files to S3 through the Girder Python client, the files as stored in S3 are truncated at 67108864000 bytes, no matter how much bigger the full file is, yet Girder still reports the full file size.
I was following the logs of my EC2-based Girder instance as I pushed up a 70 GB upload; no issues were reported in the Girder info or error logs, or in the nginx logs.
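The reported numbers are self-consistent: a 70 GiB file in 64 MiB parts needs 1120 parts, well past a 1000-part page, and a 1000-part cutoff lands exactly on 67108864000 bytes. A hypothetical checker for this symptom (the function name and constants are illustrative; in practice `remote_size` would come from a `head_object` call's `ContentLength` and `local_size` from `os.path.getsize`):

```python
import math

CHUNK_SIZE = 2 ** 26   # Girder's 64 MiB S3 part size, per the discussion above
PART_LIMIT = 1000      # parts returned by a single un-paginated ListParts call

def looks_truncated_at_part_limit(local_size, remote_size):
    """True when the stored object is cut off exactly at the 1000-part mark."""
    return remote_size < local_size and remote_size == CHUNK_SIZE * PART_LIMIT

# A 70 GiB upload needs more than 1000 parts, so it trips the bug:
print(math.ceil(70 * 2 ** 30 / CHUNK_SIZE))   # parts required: 1120
```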