Timeout when downloading dataset metadata with 8 torchrun workers #2272

Closed
samsja opened this issue May 6, 2024 · 3 comments
Labels
bug Something isn't working

Comments

samsja commented May 6, 2024

Describe the bug

Hey, I am experiencing a timeout when downloading a dataset. I would like to be able to increase this timeout, either via a longer default or via an environment variable.

Reproduction

I am loading the dataset with load_dataset("allenai/c4", "en", streaming=True) in streaming mode and get the error below.

This only happens when using torchrun with 8 workers; with 2 workers it works fine. My guess is that the workers fight for bandwidth, leading to the timeout when there are too many of them.
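
For reference, a minimal sketch of the failing setup (the script name repro.py and the exact torchrun flags are my shorthand, not verbatim from my run):

    # repro.py -- launched with e.g.: torchrun --nproc_per_node 8 repro.py
    from datasets import load_dataset

    # Each worker process resolves the dataset metadata against the Hub at
    # startup; this is the call that raises the ReadTimeout below.
    ds = load_dataset("allenai/c4", "en", streaming=True)
    print(next(iter(ds["train"])))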

I actually "fixed" the issue locally by patching the timeout in this line:

r = get_session().get(path, headers=headers, timeout=timeout, params=params)

I would like a safer way to increase this timeout.
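
For the record, here is roughly what that local patch amounts to, written as a monkeypatch so site-packages stays untouched (the wrapper and the 500s value are just illustrative; it relies on dataset_info accepting a timeout keyword argument):

    import functools

    import huggingface_hub

    _orig_dataset_info = huggingface_hub.HfApi.dataset_info

    @functools.wraps(_orig_dataset_info)
    def _patched_dataset_info(self, *args, **kwargs):
        # Force a longer timeout than the caller's default; this must be
        # applied before load_dataset runs.
        kwargs["timeout"] = 500
        return _orig_dataset_info(self, *args, **kwargs)

    huggingface_hub.HfApi.dataset_info = _patched_dataset_info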

Thanks in advance 🙏

Logs

File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2491, in repo_info
    return method(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2363, in dataset_info
    r = get_session().get(path, headers=headers, timeout=timeout, params=params)
  File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 66, in send
    return super().send(request, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: (ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co

System info

- huggingface_hub version: 0.23.0
- Platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /root/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.2.2
- Jinja2: 3.1.3
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.2.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: 1.10.15
- aiohttp: 3.9.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /root/.cache/huggingface/hub
- HF_ASSETS_CACHE: /root/.cache/huggingface/assets
- HF_TOKEN_PATH: /root/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
samsja added the bug label on May 6, 2024

Wauplin commented May 22, 2024

Hi @samsja, thanks for reporting and sorry for the delay. This timeout value is actually hard-coded to 100s in the datasets library (see here). Do you think your workers are blocked in a way that requires more than 100s to complete the HTTP call?

cc @lhoestq who maintains datasets

samsja commented May 22, 2024

I managed to solve my problem by setting HF_HUB_ETAG_TIMEOUT=500 as an environment variable.
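
For anyone else hitting this, roughly what I did (note that huggingface_hub reads this variable once at import time, so it has to be set before the import, or exported in the shell before launching torchrun):

    import os

    # Must run before huggingface_hub / datasets are imported.
    os.environ["HF_HUB_ETAG_TIMEOUT"] = "500"  # seconds

    from datasets import load_dataset

    ds = load_dataset("allenai/c4", "en", streaming=True)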

Do you think your workers are blocked in a way that requires more than 100s to complete the HTTP call?

I guess so, since increasing the timeout allows my run to start.

Feel free to close the issue now that I have a working solution.

Wauplin commented May 22, 2024

Thanks for sharing your solution @samsja! I'll close this issue then :)

Wauplin closed this as completed on May 22, 2024