webdataset cannot stop cycling at end of epoch #5441
Comments
Hi @CoinCheung, Thank you for reaching out.
Thanks for replying!! I have one more question. If more than one tar file is given to
You can find more details in this answer.
Hi @JanuszL, just to make sure I got your point: does this mean that different tar files are loaded sequentially, but within each tar file the samples are shuffled?
Samples are shuffled inside an internal buffer that is filled sequentially. When DALI finishes reading one tar it moves on to the next, so samples from different tars can land in the same batch, but the greater the distance between two samples across the tars, the less likely they are to be mixed.
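This buffer behavior can be illustrated with a minimal pure-Python sketch (an illustration of the principle only, not DALI's actual implementation; the tar/sample labeling is hypothetical): samples enter the buffer in storage order and leave it in random order, so only samples whose storage positions lie within roughly one buffer length of each other can end up mixed.

```python
import random

def shuffled_stream(samples, buffer_size, rng):
    """Simulate a sequentially filled shuffle buffer: samples enter in
    storage order, and each output is drawn uniformly at random from the
    current buffer contents."""
    buffer = []
    for s in samples:
        buffer.append(s)
        if len(buffer) >= buffer_size:
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:  # drain what remains at end of data
        yield buffer.pop(rng.randrange(len(buffer)))

# Samples labeled by the tar they came from: 3 tars, 100 samples each.
samples = [(tar, i) for tar in range(3) for i in range(100)]
out = list(shuffled_stream(samples, buffer_size=50, rng=random.Random(0)))
# With a buffer of 50, a sample from tar 2 (storage positions 200-299)
# cannot appear among the first ~150 outputs: mixing reach is bounded
# by the buffer size.
```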
Thanks for telling me this!!! I have a suggestion: maybe at the beginning of each epoch we could shuffle the order of the tar files, and then carry out the aforementioned sequential-fill-and-random-buffer loading. This would add more randomness to the batches. If this feature is reasonable, please consider adding it in a future version.
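The suggested per-epoch shard shuffle can be sketched outside DALI (the function name and seeding scheme here are hypothetical, not part of any DALI API): derive a deterministic permutation of the tar paths from the epoch number, so that every worker computes the same order.

```python
import random

def epoch_tar_order(tar_paths, epoch, seed=1234):
    """Return a deterministic per-epoch permutation of the tar files,
    so that every worker/process derives the same shard order."""
    order = list(tar_paths)
    random.Random(seed + epoch).shuffle(order)
    return order

# Hypothetical shard names.
tars = [f"shard-{i:03d}.tar" for i in range(8)]
```

Feeding the reshuffled list to the reader each epoch would make the sequential fill start from a different shard order, on top of the existing buffer-level shuffle.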
Thank you for your suggestion. |
I am closing this since my question has been answered. I am sorry that I am not able to contribute right now. Thanks again for your help to the community!!!
Hi @CoinCheung,
The reader first fills its internal buffer of
Hi @JanuszL, I found the cause. My platform has 1 TB of memory, but the dataset is 10 TB, and I assigned
Yes, I missed the usage of
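The memory problem in this exchange comes down to simple arithmetic: the shuffle buffer must hold that many samples resident at once, so its footprint is roughly the buffer length times the average sample size. The numbers below are hypothetical, chosen only to show how a 10 TB dataset on a 1 TB machine can overflow:

```python
def shuffle_buffer_bytes(buffer_samples, avg_sample_bytes):
    """Rough RAM needed by the reader's shuffle buffer: it keeps
    buffer_samples samples resident at the same time."""
    return buffer_samples * avg_sample_bytes

# Hypothetical: buffering 10 million samples at ~150 kB each already
# needs ~1.5 TB -- more than a 1 TB machine has.
need = shuffle_buffer_bytes(10_000_000, 150_000)
```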
Version
1.31.0
Describe the bug.
I used a dataset of about 20,000 samples, so iteration over the data should stop at around iteration 700. However, the dataloader keeps feeding batches after that, and the training never stops.
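For context, the expected epoch length is just ceil(num_samples / batch_size); ~20,000 samples and ~700 iterations imply a batch size near 29 (the batch size below is a guess, not taken from the report). When the pipeline cycles indefinitely, the training loop has to stop itself at this count (or the framework iterator has to be told the epoch size, e.g. via its reader_name argument):

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of iterations in one epoch; the last batch may be partial."""
    return math.ceil(num_samples / batch_size)

# With ~20,000 samples and a hypothetical batch size of 29, one epoch
# is about 690 iterations; a loop over a cycling reader must break there.
```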
Minimum reproducible example