In this case, the default location (I'll use macOS since that's what I have, but I'm assuming some level of consistency across platforms) is `$HOME/.cache/huggingface/hub/`. In the above example, the directory created is `datasets--wikimedia--wikisource`, such that:
```
datasets--wikimedia--wikisource
|-- blobs
|   `-- <blobs>
|-- refs
|   `-- <?>   # only one file in mine anyway
`-- snapshots
    `-- <snapshot hash>
        `-- <symlinked content to blobs>
```
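To make the naming convention concrete, the hub-style directory name can be derived from the repo id. This is a minimal sketch; the helper name `hub_cache_dirname` is mine, not a `huggingface_hub` API:

```python
def hub_cache_dirname(repo_id: str, repo_type: str = "dataset") -> str:
    """Mirror the hub cache folder naming: '<repo_type>s--<org>--<name>'."""
    return f"{repo_type}s--" + repo_id.replace("/", "--")

print(hub_cache_dirname("wikimedia/wikisource"))
# datasets--wikimedia--wikisource
```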
In this case, the default location is no longer controlled by the environment variable `HF_HUB_CACHE`, and the naming convention is also slightly different. The default location is `$HOME/.cache/huggingface/datasets` and the structure is:
```
datasets
|-- downloads
|   `-- <shared blobs location>
`-- wikimedia___wikisource   # note the 3 underscores
    `-- <symlinked content to downloads folder>
```
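For contrast, a sketch of how the second location and its naming could be computed. The helper names are hypothetical; `HF_DATASETS_CACHE` is the environment variable the `datasets` library honors for this location:

```python
import os
from pathlib import Path

def datasets_cache_dir() -> Path:
    """Default datasets-library cache; HF_DATASETS_CACHE overrides it,
    HF_HUB_CACHE does not."""
    return Path(os.environ.get(
        "HF_DATASETS_CACHE",
        str(Path.home() / ".cache" / "huggingface" / "datasets"),
    ))

def datasets_cache_dirname(repo_id: str) -> str:
    """Org and name joined by three underscores, e.g. 'wikimedia___wikisource'."""
    return repo_id.replace("/", "___")

print(datasets_cache_dir() / datasets_cache_dirname("wikimedia/wikisource"))
```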
Using `huggingface-cli scan-cache`, a user is unable to see the (actually useful) second cache location. I say "actually useful" because, to date, I haven't been able to figure out how to get a dataset cached with the CLI to be picked up when loading it in code.
**Other issues that may or may not need separate tickets**

- Datasets will be downloaded twice if both methods are used.
- Datasets downloaded by one method are inaccessible (using standard tools and defaults) to the other method.
- Datasets cached in the second location can't be deleted using `huggingface-cli delete-cache`.
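The split visibility can be illustrated by scanning both default locations directly. This is only a sketch, not a replacement for `scan-cache`; the function name and the name-matching heuristics are mine:

```python
import os
from pathlib import Path

def list_both_caches() -> dict:
    """Enumerate dataset entries in both cache locations; scan-cache
    only reports the first of these."""
    base = Path.home() / ".cache" / "huggingface"
    hub = Path(os.environ.get("HF_HUB_CACHE", str(base / "hub")))
    ds = Path(os.environ.get("HF_DATASETS_CACHE", str(base / "datasets")))
    out = {"hub": [], "datasets": []}
    if hub.is_dir():
        out["hub"] = sorted(p.name for p in hub.iterdir()
                            if p.is_dir() and p.name.startswith("datasets--"))
    if ds.is_dir():
        out["datasets"] = sorted(p.name for p in ds.iterdir()
                                 if p.is_dir() and "___" in p.name)
    return out

print(list_both_caches())
```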
I say 'offending code', but that's just the original commit of that code. It was how it was designed at the time, I suppose; I imagine it was decided later to have a shared blob download location to allow for datasets that have shared files? I'm guessing...
The datasets library is indeed managing its own cache and therefore not using the huggingface_hub cache. This problem has already been reported in our ecosystem, but fixing it is not as straightforward as it seems, namely because datasets works with other providers as well. I will keep this issue open as long as the datasets <> huggingface_hub integration is not consistent. Stay tuned 😉
Describe the bug
The cached location of datasets varies depending on how you download them from Hugging Face:
Reproduction
Well...use the code and examples above.
Logs
No response
System info