
storage - Truncated upload due to length-calculation prior to utf-8 encoding #469

Open
jack1323 opened this issue May 18, 2022 · 8 comments

Comments

@jack1323

Encoding to UTF-8 can increase the byte length of a string beyond its character count, but the content length is calculated before encoding. If the string contains multi-byte characters, the declared length is too small and the result is a truncated upload.

In upload(), this:

        stream = self._preprocess_data(file_data)
        content_length = self._get_stream_len(stream)

happens before this in _upload_multipart():

        raw_body: AnyStr = stream.read()
        if isinstance(raw_body, str):
            bytes_body: bytes = raw_body.encode('utf-8')
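
The mismatch can be shown with plain Python, no library code needed: for any string containing multi-byte characters, len() of the str undercounts the bytes the UTF-8 encode later produces.

```python
# Character count vs. byte count for a string with multi-byte characters.
text = "héllo ✓"

char_len = len(text)                  # what the pre-encode length check sees
byte_len = len(text.encode("utf-8"))  # what is actually sent over the wire

print(char_len, byte_len)  # 7 10
assert byte_len > char_len  # the declared content length is too small
```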
@TheKevJames
Member

Ahh, awesome find! I would be more than happy to accept a PR which solves this.

@AdeelK93

I'm having this issue as well; what would a fix for this look like?

@AdeelK93

The fix, if anyone is wondering, is to encode the string as bytes before uploading.

@TheKevJames
Member

@AdeelK93 do you happen to have sample code for how you've done this? I would love to get this patched.

@AdeelK93

Sure, this is basically the fix in my code. For a UTF-8 string my_string, I do the following:

import aiohttp
from gcloud.aio.storage import Storage

async with aiohttp.ClientSession() as session:
    client = Storage(session=session)
    await client.upload(bucket, blob, my_string.encode())

@AdeelK93

Or in other words: don't ever upload a string, only upload bytes.
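
That workaround can be sketched as a small helper (the name ensure_bytes is mine, not part of the library): coerce any str to UTF-8 bytes before handing it to upload(), so every length check counts bytes.

```python
from typing import Union


def ensure_bytes(data: Union[str, bytes]) -> bytes:
    """Coerce str payloads to UTF-8 bytes so length checks count bytes."""
    if isinstance(data, str):
        return data.encode("utf-8")
    return data


# Bytes pass through untouched; strings are encoded.
assert ensure_bytes(b"abc") == b"abc"
assert ensure_bytes("✓") == b"\xe2\x9c\x93"
```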

@TheKevJames
Member

Huh, ok, then I'm a bit confused: we do indeed calculate the content length before encoding... but for multipart uploads we actually recompute that length after we do the bytes encode.

Does anyone have a sample file which I could test with to understand this behaviour? I've just tried a couple of str uploads with special characters and all seems to be working just fine.

@AdeelK93

Try an emoji for the simplest example, or, for something more robust, try this test file.

For the second file, you'll notice that the original file is 27 KB, but the file that gets uploaded to GCS is 22 KB.

You're not going to get an error; you're going to get an incomplete upload.

Then try again with .encode(): the full 27 KB file will be uploaded.
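
The incomplete-upload behaviour can be simulated without GCS at all. A sketch, assuming the receiving end honours the understated Content-Length by keeping only that many bytes:

```python
body = "✓" * 1000                    # 1000 characters
content_length = len(body)           # 1000, computed before encoding
encoded = body.encode("utf-8")       # 3000 bytes on the wire

# What a Content-Length-honouring server would keep of the encoded body.
received = encoded[:content_length]

print(len(received), "of", len(encoded), "bytes")  # 1000 of 3000 bytes
assert len(received) < len(encoded)  # incomplete upload, no error raised
```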
