
Content-Range header for multiple part request #2248

Open
narugo1992 opened this issue Apr 24, 2024 · 4 comments

narugo1992 commented Apr 24, 2024

I'm developing a library to download individual files from tar archives in Hugging Face repositories:

It is based on the Range header in the HTTP request: downloading a tar archive with Range: bytes=xxx-yyy fetches only the specific file instead of the full archive.

In some cases, we need to download many files from different tar archives, and many of them come from the same archive. So I'm considering using Range: bytes=xxx-yyy,zzz-ttt to download all of them with a single HTTP request. This could greatly improve the performance of batch downloading and also reduce the load on the Hugging Face CDN.

But in my test, when requesting multiple ranges, the Content-Range header seems to be missing from the response.

from pprint import pprint

import requests

resp = requests.get(
    'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar',
    headers={
        'Range': 'bytes=0-99,1200-1369'
    }
)
print(resp)
pprint(dict(resp.headers))
print(len(resp.content))

The output is like this; no Content-Range is found. The length of the content seems okay, but I don't know the byte ranges of each part.

<Response [206]>
{'Accept-Ranges': 'bytes',
 'Connection': 'keep-alive',
 'Content-Disposition': "attachment; filename*=UTF-8''0008.tar; "
                        'filename="0008.tar";',
 'Content-Length': '570',
 'Content-Type': 'multipart/byteranges; '
                 'boundary=CloudFront:8C171B3C6DAD1DF1040C2DA33E27D04D',
 'Date': 'Wed, 24 Apr 2024 14:57:57 GMT',
 'ETag': '"820b63e3250678f8217c157c8b557712-135"',
 'Last-Modified': 'Sat, 20 Apr 2024 15:06:38 GMT',
 'Server': 'AmazonS3',
 'Vary': 'Origin',
 'Via': '1.1 db3cc869e0dda88ce4fa37dee230e06e.cloudfront.net (CloudFront)',
 'X-Amz-Cf-Id': 'VToeCDfStyG6NtjMCRVWdUqbHvojrQN8a29nE-tgh0zbMNF_80DMEg==',
 'X-Amz-Cf-Pop': 'TXL50-P6',
 'X-Cache': 'RefreshHit from cloudfront',
 'x-amz-server-side-encryption': 'AES256',
 'x-amz-storage-class': 'INTELLIGENT_TIERING'}
570

This header information is really important. Can it be added? Or is there an alternative way to download multiple parts in one request and save each part to a different file?

@julien-c
Member

Cool idea to implement a lazy tar parser on top of HF Hub!! What's the context/goals there?

Re. support for multiple ranges in a single Range request, I think I remember @Kakulukian took a look at this at some point (was this you @Kakulukian?)

@narugo1992
Author

narugo1992 commented Apr 25, 2024

@julien-c

In essence, the idea (detailed in the hfutils.index module) is to create an index for tar files, including offsets, sizes, and file hashes of all files within the tar. This enables downloading specific files and verifying integrity using Range: bytes=xxx-yyy during retrieval.
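As a sketch of that idea (not the actual hfutils.index implementation; the function and field names here are mine), an index can be built with the standard tarfile module:

```python
import hashlib
import tarfile

def build_tar_index(path: str) -> dict:
    """Index every regular file in a tar archive by data offset, size,
    and hash, so each member can later be fetched with a single
    Range: bytes=offset-(offset+size-1) request against the archive URL.

    A sketch of the idea only; the real hfutils.index format may differ.
    """
    index = {}
    with tarfile.open(path, "r:") as tf:  # "r:" = uncompressed tar
        for member in tf:
            if not member.isfile():
                continue
            data = tf.extractfile(member).read()
            index[member.name] = {
                "offset": member.offset_data,  # where the file's bytes start
                "size": member.size,
                "sha256": hashlib.sha256(data).hexdigest(),
            }
    return index
```

Reading `archive[offset : offset + size]` then yields exactly the member's bytes, which the stored hash can verify.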

Our requirement is to quickly retrieve a set of specific files from datasets on Hugging Face. These datasets typically comprise many (e.g., 1k) tar archives, each containing numerous image files. The archive in which an image resides depends on the image's id modulo 1000. Notably, one such dataset is nyanko7/danbooru2023, containing roughly 8 million images spread across 2k+ archive files.
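For illustration, that sharding scheme can be sketched as follows (the 4-digit "images/NNNN.tar" naming is an assumption inferred from the URLs in this issue):

```python
def archive_for_image(image_id: int, n_archives: int = 1000) -> str:
    """Map an image id to the tar archive that holds it.

    Sharding is by id modulo the archive count, as described above;
    the "images/NNNN.tar" name pattern is an assumption based on the
    URLs appearing in this issue.
    """
    return f"images/{image_id % n_archives:04d}.tar"
```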

In our practical application, we often begin by querying images based on metadata like tags, obtaining a list of required image ids (often over 1k, sometimes exceeding 100k), and then fetch all images with these ids to build a dataset. For this purpose, we're developing a library called cheesechaser. Though still a work in progress, it already supports the aforementioned danbooru2023 dataset. In our current tests, downloading 10k specified images (with consecutive ids spread across 1000 archive files), totaling approximately 18 GB, took about 17 minutes using 12 threads and involved roughly 10k download requests. This performance is satisfactory: significantly faster than downloading and decompressing approximately 9 TB of complete tar archives, with minimal local disk usage.

However, we've identified room for improvement. Primarily, due to the large number of download requests and the relatively small file sizes, most of the time is spent establishing connections rather than downloading. Additionally, as the number of downloaded files grows, the volume of requests strains Hugging Face's CDN. Supporting multi-part range requests could therefore significantly boost performance and reduce pressure on the CDN by allowing multiple files within the same archive to be downloaded at once.

Furthermore, after raising this issue and trying multi-part ranges, we ran into more problems:

  • Response times are far worse than expected.
    • When requesting multiple ranges, especially 3-4 widely spaced ones, response times are excessively long, even when the total content size is only a few hundred bytes.
  • The response format is unclear, which makes stable decoding difficult.
    • The response headers lack a consistent Content-Range.
    • While I attempted to read the response body, it appears to have a certain internal format. However, the format varies across runtime environments for the same request, sometimes returning the entire archive file.

@Kakulukian
Member

Kakulukian commented Apr 25, 2024

When you request multiple ranges, the response has a multipart/byteranges content type with a boundary. Each requested range corresponds to a block separated by this boundary, each with its own Content-Range header (https://www.rfc-editor.org/rfc/rfc7233#page-21).

For example for your request:

GET https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar HTTP/1.1
Range: bytes=0-99,1200-1369 

Response:

HTTP/1.1 206 Partial Content
Content-Length: 570
Content-Type: multipart/byteranges; boundary=CloudFront:725CE26A0B74DDB74002A7B61F84A558

--CloudFront:725CE26A0B74DDB74002A7B61F84A558
Content-Type: application/x-tar
Content-Range: bytes 0-99/2146662400

././@PaxHeader
--CloudFront:725CE26A0B74DDB74002A7B61F84A558
Content-Type: application/x-tar
Content-Range: bytes 1200-1369/2146662400

ustar00runnerdocker00000000000000
--CloudFront:725CE26A0B74DDB74002A7B61F84A558--
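A minimal sketch of splitting such a body on the boundary: it assumes CRLF line endings and trims each payload to the length implied by its Content-Range (useful because CloudFront boundaries contain a colon, so naive colon-based header parsing can trip on the delimiter line itself):

```python
import re

def parse_byteranges(body: bytes, content_type: str):
    """Split a multipart/byteranges body into (start, end, data) tuples.

    A minimal sketch: assumes CRLF line endings, no boundary bytes inside
    payloads, and trims each payload using its Content-Range length.
    """
    m = re.search(r'boundary=(?:"([^"]+)"|([^;]+))', content_type)
    boundary = (m.group(1) or m.group(2)).strip().encode()
    parts = []
    for chunk in body.split(b"--" + boundary):
        if chunk.strip() in (b"", b"--"):  # preamble / closing delimiter
            continue
        header_blob, _, payload = chunk.lstrip(b"\r\n").partition(b"\r\n\r\n")
        headers = {}
        for line in header_blob.split(b"\r\n"):
            name, _, value = line.decode("latin-1").partition(":")
            headers[name.strip().lower()] = value.strip()
        # e.g. "bytes 1200-1369/2146662400" -> "1200-1369"
        span = headers["content-range"].split()[1].split("/")[0]
        start, end = (int(x) for x in span.split("-"))
        parts.append((start, end, payload[: end - start + 1]))
    return parts
```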

@narugo1992
Author

While I attempted to read the response body, it appears to have a certain format internally. However, the format varies across different runtime environments for the same request, sometimes returning the entire archive file.

I just reproduced this.

Reproduce code

import time
from pprint import pprint

import requests

# ranges to get
ranges = [
    (0, 99),
    (1200, 1369),
    (2000, 2209),
    (2146660100, 2146660200),
]

# get ranges with standalone requests
datas = []
for i, (x, y) in enumerate(ranges):
    start_time = time.time()
    resp = requests.get(
        'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar',
        headers={
            'Range': f'bytes={x}-{y}'
        },
    )
    print(f'Range {i}, response: {resp!r}, length: {len(resp.content)}, time cost: {time.time() - start_time:.3f}s')
    datas.append(bytes(resp.content))
    assert resp.status_code == 206, f'Should be 206, but {resp.status_code} found!'

# get all the data with one request
start_time = time.time()
resp = requests.get(
    'https://huggingface.co/datasets/deepghs/yande_full/resolve/main/images/0008.tar',
    headers={
        'Range': f'bytes={",".join(map(lambda ix: f"{ix[0]}-{ix[1]}", ranges))}'
    },
)
print(f'Multipart response: {resp!r}')
print(f'Time cost: {time.time() - start_time:.3f}s')
print('Headers:')
pprint(dict(resp.headers))
print(f'Content length: {len(resp.content)}')
assert resp.status_code == 206, f'Should be 206, but {resp.status_code} found!'

full_bytes = resp.content

start_pos = 0
current_i = 0
while True:
    try:
        next_sep = full_bytes.index(b'\r\n\r\n', start_pos)
    except ValueError:
        break

    # note: the boundary line ('--CloudFront:...') contains a colon, so it is
    # parsed here as a bogus '--CloudFront' header; only Content-Range is used
    lines = list(filter(bool, full_bytes[start_pos: next_sep].decode().splitlines(keepends=False)))
    pairs = [line.split(':', maxsplit=1) for line in lines]
    headers = {
        key.strip(): value.strip()
        for key, value in pairs
    }
    start_bytes, end_bytes = headers['Content-Range'].split(' ')[-1].split('/')[0].split('-', maxsplit=1)
    start_bytes, end_bytes = int(start_bytes), int(end_bytes)
    length = end_bytes - start_bytes + 1
    current_data = full_bytes[next_sep + 4: next_sep + 4 + length]
    start_pos = next_sep + 4 + length

    print(f'Multipart, range {current_i}, headers: {headers!r}, byte-ranges: {(start_bytes, end_bytes)}')
    assert current_data == datas[current_i], f'Range {current_i} not match!'
    print(f'Range {current_i} matched!')
    current_i += 1

if current_i < len(datas):
    print(f'Range {list(range(current_i, len(datas)))} not matched!')
else:
    print('Match success!')

On my local machine

When I run this in my local environment, the result is as follows (the multipart request is really slow, but the result is correct and the status code is 206 as expected):

Range 0, response: <Response [206]>, length: 100, time cost: 2.709s               
Range 1, response: <Response [206]>, length: 170, time cost: 2.133s
Range 2, response: <Response [206]>, length: 210, time cost: 2.101s
Range 3, response: <Response [206]>, length: 101, time cost: 2.365s
Multipart response: <Response [206]>
Time cost: 23.916s                                                                                   
Headers:                                                                                             
{'Accept-Ranges': 'bytes',
 'Connection': 'keep-alive',                                                                         
 'Content-Disposition': "attachment; filename*=UTF-8''0008.tar; "
                        'filename="0008.tar";',
 'Content-Length': '1147',                                                                           
 'Content-Type': 'multipart/byteranges; '
                 'boundary=CloudFront:E5D729C94A500F62E0C8D8AF02F938EF',
 'Date': 'Thu, 25 Apr 2024 13:33:23 GMT',
 'ETag': '"820b63e3250678f8217c157c8b557712-135"',
 'Last-Modified': 'Sat, 20 Apr 2024 15:06:38 GMT', 
 'Server': 'AmazonS3',   
 'Vary': 'Origin',  
 'Via': '1.1 c1ff362c1118e059b545627964cd2e64.cloudfront.net (CloudFront)',
 'X-Amz-Cf-Id': 'I3Zj3t7Yn0ndSDNb7q9F3-_2700VGin-UGIZK-Ik9dkZmfkY5Um8Jw==',
 'X-Amz-Cf-Pop': 'SFO53-P1',
 'X-Cache': 'Miss from cloudfront',
 'x-amz-server-side-encryption': 'AES256',
 'x-amz-storage-class': 'INTELLIGENT_TIERING'}
Content length: 1147
Multipart, range 0, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 0-99/2146662400'}, byte-ranges: (0, 99)
Range 0 matched!
Multipart, range 1, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 1200-1369/2146662400'}, byte-ranges: (1200, 1369)
Range 1 matched!
Multipart, range 2, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 2000-2209/2146662400'}, byte-ranges: (2000, 2209)
Range 2 matched!
Multipart, range 3, headers: {'--CloudFront': 'E5D729C94A500F62E0C8D8AF02F938EF', 'Content-Type': 'application/x-tar', 'Content-Range': 'bytes 2146660100-2146660200/2146662400'}, byte-ranges: (2146660100, 2146660200)
Range 3 matched!                  
Match success!

my local env

  • huggingface_hub version: 0.22.2
  • Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Running in iPython ?: No
  • Running in notebook ?: No
  • Running in Google Colab ?: No
  • Token path ?: /data/.hf/token
  • Has saved token ?: True
  • Who am I ?: narugo
  • Configured git credential helpers:
  • FastAI: N/A
  • Tensorflow: N/A
  • Torch: N/A
  • Jinja2: 3.1.3
  • Graphviz: N/A
  • keras: N/A
  • Pydot: N/A
  • Pillow: N/A
  • hf_transfer: N/A
  • gradio: N/A
  • tensorboard: N/A
  • numpy: 1.24.4
  • pydantic: N/A
  • aiohttp: N/A
  • ENDPOINT: https://huggingface.co
  • HF_HUB_CACHE: /data/.hf/hub
  • HF_ASSETS_CACHE: /data/.hf/assets
  • HF_TOKEN_PATH: /data/.hf/token
  • HF_HUB_OFFLINE: False
  • HF_HUB_DISABLE_TELEMETRY: False
  • HF_HUB_DISABLE_PROGRESS_BARS: None
  • HF_HUB_DISABLE_SYMLINKS_WARNING: False
  • HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
  • HF_HUB_DISABLE_IMPLICIT_TOKEN: False
  • HF_HUB_ENABLE_HF_TRANSFER: False
  • HF_HUB_ETAG_TIMEOUT: 10
  • HF_HUB_DOWNLOAD_TIMEOUT: 10

On Hugging Face Space

When I run this code on a Hugging Face Space (I deployed a JupyterLab in a Space), the output is as follows (it fails: the entire file is returned):

Range 0, response: <Response [206]>, length: 100, time cost: 0.291s
Range 1, response: <Response [206]>, length: 170, time cost: 0.190s
Range 2, response: <Response [206]>, length: 210, time cost: 0.119s
Range 3, response: <Response [206]>, length: 101, time cost: 0.281s
Multipart response: <Response [200]>
Time cost: 23.513s
Headers:
{'Accept-Ranges': 'bytes',
 'Content-Disposition': "attachment; filename*=UTF-8''0008.tar; "
                        'filename="0008.tar";',
 'Content-Length': '2146662400',
 'Content-Type': 'application/x-tar',
 'Date': 'Thu, 25 Apr 2024 13:33:19 GMT',
 'ETag': '"820b63e3250678f8217c157c8b557712-135"',
 'Last-Modified': 'Sat, 20 Apr 2024 15:06:38 GMT',
 'Server': 'AmazonS3',
 'x-amz-id-2': 'yGuW1BP+wVzZ6c6FgVvrvuBw2vkHDuqskpgpGHFW2t5y9sDFGRNGMi/29Ywf1t3t06aL3ma6MME=',
 'x-amz-request-id': 'HBC2WBTYWNDSXDP7',
 'x-amz-server-side-encryption': 'AES256',
 'x-amz-storage-class': 'INTELLIGENT_TIERING'}
Content length: 2146662400
Traceback (most recent call last):
  File "test_main.py", line 41, in <module>
    assert resp.status_code == 206, f'Should be 206, but {resp.status_code} found!'
AssertionError: Should be 206, but 200 found!

the env

  • huggingface_hub version: 0.22.2
  • Platform: Linux-5.10.205-195.807.amzn2.x86_64-x86_64-with-glibc2.2.5
  • Python version: 3.8.1
  • Running in iPython ?: No
  • Running in notebook ?: No
  • Running in Google Colab ?: No
  • Token path ?: /home/user/.cache/huggingface/token
  • Has saved token ?: True
  • Who am I ?: narugo
  • Configured git credential helpers:
  • FastAI: N/A
  • Tensorflow: N/A
  • Torch: 2.0.1
  • Jinja2: 3.1.3
  • Graphviz: N/A
  • keras: N/A
  • Pydot: N/A
  • Pillow: 10.3.0
  • hf_transfer: N/A
  • gradio: N/A
  • tensorboard: N/A
  • numpy: 1.24.4
  • pydantic: N/A
  • aiohttp: N/A
  • ENDPOINT: https://huggingface.co
  • HF_HUB_CACHE: /home/user/.cache/huggingface/hub
  • HF_ASSETS_CACHE: /home/user/.cache/huggingface/assets
  • HF_TOKEN_PATH: /home/user/.cache/huggingface/token
  • HF_HUB_OFFLINE: False
  • HF_HUB_DISABLE_TELEMETRY: False
  • HF_HUB_DISABLE_PROGRESS_BARS: None
  • HF_HUB_DISABLE_SYMLINKS_WARNING: False
  • HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
  • HF_HUB_DISABLE_IMPLICIT_TOKEN: False
  • HF_HUB_ENABLE_HF_TRANSFER: False
  • HF_HUB_ETAG_TIMEOUT: 10
  • HF_HUB_DOWNLOAD_TIMEOUT: 10

So, two problems:

  • Requests for multipart byte ranges are too slow.
  • In some cases the entire file is returned instead of the requested byte ranges, and I have no idea what triggers this.
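Until this is resolved, a client probably has to detect whether the CDN honored the multi-range request and fall back to one request per range otherwise. A sketch of that detection logic (the helper names are mine; when probing with requests, stream=True avoids accidentally downloading a multi-GB body on a surprise 200):

```python
def build_range_header(ranges) -> str:
    """Build a multi-range Range header value, e.g. 'bytes=0-99,1200-1369'."""
    return "bytes=" + ",".join(f"{start}-{end}" for start, end in ranges)

def multi_range_honored(status_code: int, content_type: str) -> bool:
    """True only when the server answered a multi-range request properly:
    206 Partial Content with a multipart/byteranges body. A 200 means the
    whole file is coming back; a single-range 206 means the ranges were
    merged or dropped."""
    return status_code == 206 and content_type.lower().startswith("multipart/byteranges")

# Usage sketch with requests (stream=True so a 200 doesn't read the body):
#   resp = requests.get(url, headers={"Range": build_range_header(ranges)}, stream=True)
#   if not multi_range_honored(resp.status_code, resp.headers.get("Content-Type", "")):
#       resp.close()  # fall back to one request per range
```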
