slow http read performance on aws lambda #709

Open
stev-0 opened this issue Jul 20, 2022 · 5 comments
Labels
help wanted · performance

Comments

stev-0 (Contributor) commented Jul 20, 2022

Problem description

I am trying to use an AWS Lambda function to stream data from a website over HTTP directly into S3 with smart_open. Testing has shown that the HTTP read with smart_open is slower than the same function using requests directly, by about an order of magnitude, so the examples below reflect that for simplicity of reproduction.

Tests on a local machine do not show the same discrepancy.

I may well be doing this wrong, as I couldn't find an example of how to do this, but I'm happy to contribute one if someone can put me right. The actual HTTP-to-S3 variant I'm running is sketched below.
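
For context, the real Lambda function streams into S3 rather than /tmp. Roughly, it looks like the sketch below (the bucket name is a placeholder, and the transport_params keys are what I understand the S3 writer accepts, so I may have them slightly wrong):

import boto3
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2       # read 100 MiB at a time
MP_UPLOAD_SIZE = 50 * 1024**2    # multipart part size for the S3 writer

# Reuse a boto3 client and set the multipart part size for the upload.
transport_params = {
    "client": boto3.client("s3"),
    "min_part_size": MP_UPLOAD_SIZE,
}

with s_open("s3://my-bucket/100.bin", "wb", transport_params=transport_params) as fout:
    with s_open("https://speed.hetzner.de/100MB.bin", "rb") as fin:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            fout.write(chunk)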

Steps/code to reproduce the problem

Fast version (~5 seconds)

import logging
import requests
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2      # read 100 MiB at a time
MP_UPLOAD_SIZE = 50 * 1024**2   # multipart part size (used in the real S3 version)

# Stream the download straight to a local file via requests.
with s_open("/tmp/100.bin", 'wb') as fout:
    with requests.get("https://speed.hetzner.de/100MB.bin", stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            fout.write(chunk)

Slow version (~170 seconds)

import logging
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2      # read 100 MiB at a time
MP_UPLOAD_SIZE = 50 * 1024**2   # multipart part size (used in the real S3 version)

# Stream the download straight to a local file via smart_open's HTTP transport.
with s_open("/tmp/100.bin", 'wb') as fout:
    with s_open('https://speed.hetzner.de/100MB.bin', 'rb') as fin:
        chunk = b'0'
        while len(chunk) > 0:
            chunk = fin.read(CHUNK_SIZE)
            fout.write(chunk)

Versions

Linux-4.14.255-276-224.499.amzn2.x86_64-x86_64-with-glibc2.26
Python 3.9.13 (main, Jun 10 2022, 16:49:31) [GCC 7.3.1 20180712 (Red Hat 7.3.1-15)]
smart_open 6.0.0

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software
mpenkov added the performance and help wanted labels on Jul 29, 2022
mpenkov (Collaborator) commented Jul 29, 2022

The fast version does not run correctly because of a missing requests import. But, even after fixing that, I still could not reproduce the problem.

$ cat gitignore/slow.py 
import logging
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open( "/tmp/100.bin", 'wb') as fout:
    with s_open('https://speed.hetzner.de/100MB.bin' , 'rb') as fin: 
        chunk = b'0'
        while len(chunk) > 0:
            chunk = fin.read(CHUNK_SIZE)
            fout.write(chunk)
$ cat gitignore/fast.py 
import logging
import requests
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open( "/tmp/100.bin", 'wb') as fout:
    with requests.get("https://speed.hetzner.de/100MB.bin", stream=True) as r:
         r.raise_for_status()
         for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
             fout.write(chunk)

$ time python gitignore/slow.py 

real    1m43.211s
user    0m14.324s
sys     0m18.733s
$ time python gitignore/fast.py 

real    2m4.085s
user    0m0.987s
sys     0m0.763s

stev-0 (Contributor, Author) commented Jul 29, 2022

Thanks for pointing out the typo; I'll fix it in my original post.

Tested locally on Python 3.9 to rule that out:

python3.9 fast.py 2.20s user 2.02s system 6% cpu 1:03.77 total
python3.9 slow.py 8.83s user 5.79s system 18% cpu 1:18.89 total

And double checked again on Lambda:

fast

REPORT RequestId: xxxx Duration: 3457.29 ms Billed Duration: 3458 ms Memory Size: 350 MB Max Memory Used: 258 MB Init Duration: 407.47 ms

slow

REPORT RequestId: xxx Duration: 121736.98 ms Billed Duration: 121737 ms Memory Size: 350 MB Max Memory Used: 259 MB Init Duration: 438.44 ms

As you can see, the difference is negligible locally but huge on Lambda. I'm not sure what is going on there; possibly something to do with memory availability?

mpenkov (Collaborator) commented Jul 30, 2022

I don't have much experience with Lambda, so it's difficult for me to comment.

It's odd that the slow version is still more than twice as slow locally, though... Are you able to investigate why there is such a difference? There is a small chance that this difference is what's causing the huge slowdown on Lambda.

The way I would approach this is:

  • Make the slow version behave identically to the fast one locally (possibly by modifying smart_open); see the timing sketch below
  • Re-run the slow version on Lambda and check the duration
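
Something like this would isolate the download side locally (read-only, no file writes). It is only a rough sketch, using the same chunk size and test URL as your repro:

import time

import requests
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
URL = "https://speed.hetzner.de/100MB.bin"

def time_requests():
    # Download with requests and discard the data, timing only the reads.
    start = time.perf_counter()
    total = 0
    with requests.get(URL, stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            total += len(chunk)
    return total, time.perf_counter() - start

def time_smart_open():
    # Same thing through smart_open's HTTP transport.
    start = time.perf_counter()
    total = 0
    with s_open(URL, "rb") as fin:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start

for name, fn in [("requests", time_requests), ("smart_open", time_smart_open)]:
    nbytes, elapsed = fn()
    print(f"{name}: {nbytes / 1024**2:.0f} MiB in {elapsed:.1f} s")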

ShailChoksi commented Mar 14, 2023

I am able to reproduce this with S3 and an HTTPS request as well.

With the Python requests library reading the HTTPS stream: <1 s to upload a 70 MB file.
With smart_open: ~750 s to upload the same 70 MB file.

requests:

# uri, bucket_name, file_path, file_name, transport_params and CHUNK_SIZE are defined elsewhere in my script.
with requests.get(uri, stream=True) as r:
    r.raise_for_status()
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb", transport_params=transport_params) as fout:
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            fout.write(chunk)

~10 MB/s (I believe this is because I have the chunk size set to 10 MB).

sm_open:

# Same placeholders as above.
with sm_open(uri, "rb") as fin:
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb", transport_params=transport_params) as fout:
        for line in fin:
            fout.write(line)

~0.093 MB/s. I could try chunking as above (see the sketch below), but I wouldn't expect a slowdown of this order of magnitude.
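
For reference, the chunked version I'd try would look roughly like this (same placeholders as the snippets above; I haven't benchmarked it yet):

# Read fixed-size chunks instead of iterating lines, so the read pattern
# matches the requests version above.
with sm_open(uri, "rb") as fin:
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb", transport_params=transport_params) as fout:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            fout.write(chunk)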

mpenkov (Collaborator) commented Mar 15, 2023

Are you able to profile the code to work out where the time-consuming part is? It seems that the download is the slow part, since you're using smart_open for the upload in both cases. If so, we can probably eliminate the upload component altogether and look for the problem in the download component.

Also, make sure compression isn't causing the slowdown. By default, smart_open uses the file extension to transparently handle compression.
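
A rough way to do both at once is to profile just the download with transparent compression turned off. This is only a sketch; I believe compression='disable' is the right argument for recent smart_open versions, but double-check against the version you have installed:

import cProfile

from smart_open import open as sm_open

uri = "https://speed.hetzner.de/100MB.bin"   # substitute your actual source URL
CHUNK_SIZE = 10 * 1024**2

def download_only():
    # Read the whole stream and discard it, so only the download is measured.
    total = 0
    # compression='disable' turns off extension-based transparent (de)compression.
    with sm_open(uri, "rb", compression='disable') as fin:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total

cProfile.run("download_only()", sort="cumulative")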
