Timeout when downloading dataset metadata with 8 torchrun workers #2272

Closed
samsja opened this issue May 6, 2024 · 3 comments
Labels
bug Something isn't working

Comments

samsja commented May 6, 2024

Describe the bug

Hey, I am experiencing a timeout when downloading a dataset. I would like to be able to increase this timeout, either via a longer default or via an environment variable.

Reproduction

I am loading the dataset with load_dataset("allenai/c4", "en", streaming=True) in streaming mode and get the error below.

This only happens when using torchrun with 8 workers; with 2 workers it works fine. My guess is that the workers fight for bandwidth, leading to the timeout when there are too many of them.
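
For reference, a minimal sketch of the failing setup (the script name repro.py and the exact torchrun flags are my shorthand, not verbatim from my run):

    # repro.py -- launched with e.g.: torchrun --nproc_per_node 8 repro.py
    from datasets import load_dataset

    # Each worker process resolves the dataset metadata against the Hub at
    # startup; this is the call that raises the ReadTimeout below.
    ds = load_dataset("allenai/c4", "en", streaming=True)
    print(next(iter(ds["train"])))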

I actually "fixed" the issue locally by patching the timeout in this line:

r = get_session().get(path, headers=headers, timeout=timeout, params=params)

I would like a safer way to increase this timeout.
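
For the record, here is roughly what that local patch amounts to, written as a monkeypatch so site-packages stays untouched (the wrapper and the 500s value are just illustrative; it relies on dataset_info accepting a timeout keyword argument):

    import functools

    import huggingface_hub

    _orig_dataset_info = huggingface_hub.HfApi.dataset_info

    @functools.wraps(_orig_dataset_info)
    def _patched_dataset_info(self, *args, **kwargs):
        # Force a longer timeout than the caller's default; this must be
        # applied before load_dataset runs.
        kwargs["timeout"] = 500
        return _orig_dataset_info(self, *args, **kwargs)

    huggingface_hub.HfApi.dataset_info = _patched_dataset_info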

Thanks in advance 🙏

Logs

File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2491, in repo_info
    return method(
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 2363, in dataset_info
    r = get_session().get(path, headers=headers, timeout=timeout, params=params)
  File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 602, in get
    return self.request("GET", url, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 66, in send
    return super().send(request, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: (ReadTimeoutError("HTTPSConnectionPool(host='huggingface.co

System info

- huggingface_hub version: 0.23.0
- Platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.35
- Python version: 3.10.14
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /root/.cache/huggingface/token
- Has saved token ?: False
- Configured git credential helpers:
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.2.2
- Jinja2: 3.1.3
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.2.0
- hf_transfer: N/A
- gradio: N/A
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: 1.10.15
- aiohttp: 3.9.5
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /root/.cache/huggingface/hub
- HF_ASSETS_CACHE: /root/.cache/huggingface/assets
- HF_TOKEN_PATH: /root/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
samsja added the bug label on May 6, 2024

Wauplin commented May 22, 2024

Hi @samsja, thanks for reporting and sorry for the delay. This timeout value is actually hard-coded to 100s in the datasets library (see here). Do you think your workers are blocked in a way that requires more than 100s to complete the HTTP call?

cc @lhoestq who maintains datasets

samsja commented May 22, 2024

I managed to solve my problem by setting HF_HUB_ETAG_TIMEOUT=500 as an environment variable.
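
For anyone else hitting this, roughly what I did (note that huggingface_hub reads this variable once at import time, so it has to be set before the import, or exported in the shell before launching torchrun):

    import os

    # Must run before huggingface_hub / datasets are imported.
    os.environ["HF_HUB_ETAG_TIMEOUT"] = "500"  # seconds

    from datasets import load_dataset

    ds = load_dataset("allenai/c4", "en", streaming=True)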

Do you think your workers are blocked in a way that requires more than 100s to complete the HTTP call?

I guess so, since increasing the timeout allows my run to start.

Feel free to close the issue now that I have a working solution.

Wauplin commented May 22, 2024

Thanks for sharing your solution @samsja! I'll close this issue then :)

Wauplin closed this as completed on May 22, 2024