slow http read performance on aws lambda #709

Open
stev-0 opened this issue Jul 20, 2022 · 5 comments
Labels
help wanted · performance

Comments

stev-0 (Contributor) commented Jul 20, 2022

Problem description

I am trying to use an AWS Lambda function to stream data from a website over HTTP directly into S3 with smart_open. Testing has shown that the HTTP read with smart_open is slower than the same function using requests directly, by about an order of magnitude, so the examples below reflect that for simplicity of reproduction.

Tests on a local machine do not show the same discrepancy.

I may well be doing this wrong, as I couldn't find an example of how to do this, but I'm happy to contribute one if someone can put me right. The actual HTTP-to-S3 variant I'm running is sketched below.
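
For context, the real Lambda function streams into S3 rather than /tmp. Roughly, it looks like the sketch below (the bucket name is a placeholder, and the transport_params keys are what I understand the S3 writer accepts, so I may have them slightly wrong):

import boto3
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2       # read 100 MiB at a time
MP_UPLOAD_SIZE = 50 * 1024**2    # multipart part size for the S3 writer

# Reuse a boto3 client and set the multipart part size for the upload.
transport_params = {
    "client": boto3.client("s3"),
    "min_part_size": MP_UPLOAD_SIZE,
}

with s_open("s3://my-bucket/100.bin", "wb", transport_params=transport_params) as fout:
    with s_open("https://speed.hetzner.de/100MB.bin", "rb") as fin:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            fout.write(chunk)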

Steps/code to reproduce the problem

Fast version (~5 seconds)

import logging
import requests
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2      # read 100 MiB at a time
MP_UPLOAD_SIZE = 50 * 1024**2   # multipart part size (used in the real S3 version)

# Stream the download straight to a local file via requests.
with s_open("/tmp/100.bin", 'wb') as fout:
    with requests.get("https://speed.hetzner.de/100MB.bin", stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            fout.write(chunk)

Slow version (~170 seconds)

import logging
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2      # read 100 MiB at a time
MP_UPLOAD_SIZE = 50 * 1024**2   # multipart part size (used in the real S3 version)

# Stream the download straight to a local file via smart_open's HTTP transport.
with s_open("/tmp/100.bin", 'wb') as fout:
    with s_open('https://speed.hetzner.de/100MB.bin', 'rb') as fin:
        chunk = b'0'
        while len(chunk) > 0:
            chunk = fin.read(CHUNK_SIZE)
            fout.write(chunk)

Versions

Linux-4.14.255-276-224.499.amzn2.x86_64-x86_64-with-glibc2.26
Python 3.9.13 (main, Jun 10 2022, 16:49:31) [GCC 7.3.1 20180712 (Red Hat 7.3.1-15)]
smart_open 6.0.0

Checklist

Before you create the issue, please make sure you have:

  • Described the problem clearly
  • Provided a minimal reproducible example, including any required data
  • Provided the version numbers of the relevant software
mpenkov added the performance and help wanted labels on Jul 29, 2022
mpenkov (Collaborator) commented Jul 29, 2022

The fast version does not run correctly because of a missing requests import. But, even after fixing that, I still could not reproduce the problem.

$ cat gitignore/slow.py 
import logging
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open( "/tmp/100.bin", 'wb') as fout:
    with s_open('https://speed.hetzner.de/100MB.bin' , 'rb') as fin: 
        chunk = b'0'
        while len(chunk) > 0:
            chunk = fin.read(CHUNK_SIZE)
            fout.write(chunk)
$ cat gitignore/fast.py 
import logging
import requests
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
MP_UPLOAD_SIZE = 50 * 1024**2

with s_open( "/tmp/100.bin", 'wb') as fout:
    with requests.get("https://speed.hetzner.de/100MB.bin", stream=True) as r:
         r.raise_for_status()
         for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
             fout.write(chunk)

$ time python gitignore/slow.py 

real    1m43.211s
user    0m14.324s
sys     0m18.733s
$ time python gitignore/fast.py 

real    2m4.085s
user    0m0.987s
sys     0m0.763s

stev-0 (Contributor, Author) commented Jul 29, 2022

Thanks for pointing out the typo; I'll fix it in my original post.

Tested locally on Python 3.9 to rule that out:

python3.9 fast.py 2.20s user 2.02s system 6% cpu 1:03.77 total
python3.9 slow.py 8.83s user 5.79s system 18% cpu 1:18.89 total

And double checked again on Lambda:

fast

REPORT RequestId: xxxx Duration: 3457.29 ms Billed Duration: 3458 ms Memory Size: 350 MB Max Memory Used: 258 MB Init Duration: 407.47 ms

slow

REPORT RequestId: xxx Duration: 121736.98 ms Billed Duration: 121737 ms Memory Size: 350 MB Max Memory Used: 259 MB Init Duration: 438.44 ms

As you can see, the difference is negligible locally but huge on Lambda. I'm not sure what is going on there; possibly something to do with memory availability?

mpenkov (Collaborator) commented Jul 30, 2022

I don't have much experience with Lambda, so it's difficult for me to comment.

It's odd that the slow version is still more than twice as slow locally, though... Are you able to investigate why there is such a difference? There is a small chance that this difference is what's causing the huge slowdown on Lambda.

The way I would approach this is:

  • Make the slow version behave identically to the fast one locally (possibly by modifying smart_open); see the timing sketch below
  • Re-run the slow version on Lambda and check the duration
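
Something like this would isolate the download side locally (read-only, no file writes). It is only a rough sketch, using the same chunk size and test URL as your repro:

import time

import requests
from smart_open import open as s_open

CHUNK_SIZE = 100 * 1024**2
URL = "https://speed.hetzner.de/100MB.bin"

def time_requests():
    # Download with requests and discard the data, timing only the reads.
    start = time.perf_counter()
    total = 0
    with requests.get(URL, stream=True) as r:
        r.raise_for_status()
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            total += len(chunk)
    return total, time.perf_counter() - start

def time_smart_open():
    # Same thing through smart_open's HTTP transport.
    start = time.perf_counter()
    total = 0
    with s_open(URL, "rb") as fin:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start

for name, fn in [("requests", time_requests), ("smart_open", time_smart_open)]:
    nbytes, elapsed = fn()
    print(f"{name}: {nbytes / 1024**2:.0f} MiB in {elapsed:.1f} s")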

ShailChoksi commented Mar 14, 2023

I am able to reproduce this with S3 and an HTTPS request as well.

With the Python requests library reading the HTTPS stream: <1 s to upload a 70 MB file.
With smart_open: ~750 s to upload the same 70 MB file.

requests:

# uri, bucket_name, file_path, file_name, transport_params and CHUNK_SIZE are defined elsewhere in my script.
with requests.get(uri, stream=True) as r:
    r.raise_for_status()
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb", transport_params=transport_params) as fout:
        for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
            fout.write(chunk)

~10 MB/s (I believe this is because I have the chunk size set to 10 MB).

sm_open:

# Same placeholders as above.
with sm_open(uri, "rb") as fin:
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb", transport_params=transport_params) as fout:
        for line in fin:
            fout.write(line)

~0.093 MB/s. I could try chunking as above (see the sketch below), but I wouldn't expect a slowdown of this order of magnitude.
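
For reference, the chunked version I'd try would look roughly like this (same placeholders as the snippets above; I haven't benchmarked it yet):

# Read fixed-size chunks instead of iterating lines, so the read pattern
# matches the requests version above.
with sm_open(uri, "rb") as fin:
    with sm_open(f"s3://{bucket_name}/{file_path}/{file_name}", "wb", transport_params=transport_params) as fout:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            fout.write(chunk)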

mpenkov (Collaborator) commented Mar 15, 2023

Are you able to profile the code to work out where the time-consuming part is? It seems that the download is the slow part, since you're using smart_open for the upload in both cases. If so, we can probably eliminate the upload component altogether and look for the problem in the download component.

Also, make sure compression isn't causing the slowdown. By default, smart_open uses the file extension to transparently handle compression.
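
A rough way to do both at once is to profile just the download with transparent compression turned off. This is only a sketch; I believe compression='disable' is the right argument for recent smart_open versions, but double-check against the version you have installed:

import cProfile

from smart_open import open as sm_open

uri = "https://speed.hetzner.de/100MB.bin"   # substitute your actual source URL
CHUNK_SIZE = 10 * 1024**2

def download_only():
    # Read the whole stream and discard it, so only the download is measured.
    total = 0
    # compression='disable' turns off extension-based transparent (de)compression.
    with sm_open(uri, "rb", compression='disable') as fin:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total

cProfile.run("download_only()", sort="cumulative")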
