huggingface-cli scan-cache doesn't capture cached datasets #2218

Open
sealad886 opened this issue Apr 11, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@sealad886
Contributor

Describe the bug

The cache location of a dataset varies depending on how you download it from Hugging Face:

  1. Download using the CLI:
> huggingface-cli download 'wikimedia/wikisource' --repo-type dataset

In this case, the default location (I'll use macOS paths since that's what I have, but I'm assuming some level of consistency across platforms) is $HOME/.cache/huggingface/hub/. In the above example, the directory created is datasets--wikimedia--wikisource, laid out as:

datasets--wikimedia--wikisource
|--blobs
    --<blobs>
|--refs
    --<?> #only one file in mine anyway
|--snapshots
    |--<snapshot hash>
        --<symlinked content to blobs>
  2. Download using the Hugging Face datasets library:
>>> from datasets import load_dataset
>>> ds = load_dataset('wikimedia/wikisource')

In this case, the default location is no longer controlled by the HF_HUB_CACHE environment variable, and the naming convention is also slightly different. The default location is $HOME/.cache/huggingface/datasets and the structure is:

datasets
|--downloads
    --<shared blobs location>
|--wikimedia___wikisource     # note the 3 underscores
    --<symlinked content to downloads folder>
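As an aside, this second cache location can be redirected, either with the HF_DATASETS_CACHE environment variable or with the cache_dir argument of load_dataset. A minimal sketch (the path below is a placeholder, not a default I've verified everywhere):

import os

# HF_DATASETS_CACHE is read when `datasets` is imported, so set it first.
# It only affects the `datasets` cache, not HF_HUB_CACHE.
os.environ["HF_DATASETS_CACHE"] = "/tmp/my-datasets-cache"  # placeholder path

from datasets import load_dataset

# Alternatively, override the cache for a single call:
ds = load_dataset('wikimedia/wikisource', cache_dir="/tmp/my-datasets-cache")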

Using huggingface-cli scan-cache, a user is unable to see the (actually useful) second cache location. I say "actually useful" because to date I haven't been able to figure out how to easily get a dataset cached with the CLI to be used by models in code.
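The same limitation is visible in the Python API behind that command: as far as I can tell, huggingface_hub.scan_cache_dir() only walks the hub cache, so datasets cached by load_dataset never show up in its report. A minimal sketch:

from huggingface_hub import scan_cache_dir

# Scans HF_HUB_CACHE (default ~/.cache/huggingface/hub) only.
cache_info = scan_cache_dir()
for repo in cache_info.repos:
    print(repo.repo_type, repo.repo_id, repo.size_on_disk)
# Anything under ~/.cache/huggingface/datasets is never listed here.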

Other issues that may or may not need separate tickets

  1. Datasets will be downloaded twice if both methods are used.
  2. Datasets used by one download method are inaccessible (using standard tools and defaults) to the other method.
  3. You can't delete cached datasets in the second location using huggingface-cli delete-cache (a manual workaround is sketched below).
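Since delete-cache only operates on the hub cache, the only workaround I'm aware of for the second location is deleting the dataset folder by hand. A rough sketch, assuming the default location and no HF_DATASETS_CACHE override:

import shutil
from pathlib import Path

# Default `datasets` cache on macOS/Linux.
datasets_cache = Path.home() / ".cache" / "huggingface" / "datasets"

# Dataset folders use the <org>___<name> convention (three underscores).
target = datasets_cache / "wikimedia___wikisource"
if target.exists():
    shutil.rmtree(target)

# Note: this leaves the shared files under datasets/downloads untouched;
# cleaning those safely would require knowing which other datasets still
# reference them.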

Reproduction

Well...use the code and examples above.

Logs

No response

System info

- huggingface_hub version: 0.22.2
- Platform: macOS-14.4.1-arm64-arm-64bit
- Python version: 3.12.2
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Token path ?: /Users/andrew/.cache/huggingface/token
- Has saved token ?: True
- Who am I ?: sealad886
- Configured git credential helpers: osxkeychain
- FastAI: N/A
- Tensorflow: N/A
- Torch: 2.2.2
- Jinja2: 3.1.3
- Graphviz: N/A
- keras: N/A
- Pydot: N/A
- Pillow: 10.3.0
- hf_transfer: 0.1.6
- gradio: 4.21.0
- tensorboard: N/A
- numpy: 1.26.4
- pydantic: 2.6.4
- aiohttp: 3.9.3
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /Users/andrew/.cache/huggingface/hub
- HF_ASSETS_CACHE: /Users/andrew/.cache/huggingface/assets
- HF_TOKEN_PATH: /Users/andrew/.cache/huggingface/token
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_ENABLE_HF_TRANSFER: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
sealad886 added the bug label on Apr 11, 2024
@sealad886 (Contributor, Author) commented Apr 11, 2024

The offending code can be found here, where the default cache location is sourced from the HF_HUB_CACHE environment variable:
https://github.com/huggingface/huggingface_hub/blame/ebba9ef2c338149783978b489ec142ab122af42a/src/huggingface_hub/utils/_cache_manager.py#L500

I say 'offending code', but that's just the original commit of that code; it's how things were designed at the time, I suppose. I imagine the shared blob download location was added later to allow datasets to share files? I'm guessing...
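For context, the linked default boils down to something like this (a paraphrase of the source, not a verbatim copy):

from huggingface_hub import constants

def scan_cache_dir(cache_dir=None):
    # Paraphrased: with no explicit cache_dir, only the hub cache is scanned.
    if cache_dir is None:
        cache_dir = constants.HF_HUB_CACHE
    ...

so unless a caller explicitly passes the datasets cache directory (which wouldn't match the expected repo layout anyway), scan-cache never looks at it.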

@Wauplin (Contributor) commented Apr 11, 2024

Thanks for pointing that out @sealad886!

The datasets library is indeed managing its own cache and therefore not using the huggingface_hub cache. This problem has already been reported in our ecosystem, but fixing it is not as straightforward as it seems, namely because datasets works with other providers as well. I will keep this issue open as long as the datasets <> huggingface_hub integration is not consistent. Stay tuned 😉
