Does DataLoader(shuffle=True) really shuffle DBpedia dataset correctly? #2197

fujidaiti opened this issue Aug 4, 2023 · 0 comments

According to the docs, the DBpedia dataset has 14 classes (labels) with 40,000 texts per class. Hence, if I create batches using DataLoader(shuffle=True) as follows:

import torchtext.datasets as d
from torch.utils.data.dataloader import DataLoader

train = DataLoader(
    d.DBpedia(split="train", root=".cache"),
    batch_size=10000,
    shuffle=True,
)

the labels should be roughly uniformly distributed within each batch. In practice, however, only a few distinct labels appear in each batch, as the following check shows:

for labels, texts in train:
    print(len(set(labels.tolist())))

The output of the above code is:

1
1
1
2
2
2
2
3
3
3
3
4
4
3
3
.
.
.
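A plausible explanation (my assumption, not confirmed): torchtext's raw datasets are IterDataPipes, and for iterable-style inputs DataLoader(shuffle=True) shuffles through a bounded buffer (the datapipe Shuffler's default buffer_size is reportedly 10,000) rather than over the whole split. Since the DBpedia train split appears to be stored sorted by label, a 10,000-item buffer can only mix neighbouring samples. The toy simulation below (buffered_shuffle is my own sketch, not a torchtext API) reproduces this pattern on a label-sorted stream:

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    """Toy model of shuffling through a bounded buffer: each output item
    is drawn uniformly from at most `buffer_size` pending items, so an
    item cannot appear much earlier than its original position."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Swap a random element to the end and pop it in O(1).
            j = rng.randrange(len(buf))
            buf[j], buf[-1] = buf[-1], buf[j]
            yield buf.pop()
    while buf:
        j = rng.randrange(len(buf))
        buf[j], buf[-1] = buf[-1], buf[j]
        yield buf.pop()

# A label-sorted stream shaped like the DBpedia train split:
# 14 classes with 40,000 samples each, all of class 0 first, then class 1, ...
labels = [c for c in range(14) for _ in range(40000)]

shuffled = list(buffered_shuffle(labels, buffer_size=10000))
per_batch = [len(set(shuffled[i:i + 10000]))
             for i in range(0, len(shuffled), 10000)]
print(per_batch[:8])  # only a few distinct labels per batch, never all 14
```

In this simulation the first few batches contain a single label and batches near a class boundary only a handful, which looks a lot like the output above.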

How can I fix this? Or is my implementation wrong?
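If the limited mixing really comes from a bounded shuffle buffer (my working assumption), one workaround might be to make the buffer cover the whole split, e.g. by calling .shuffle(buffer_size=...) on the datapipe before wrapping it in DataLoader, or by converting it with torchtext.data.functional.to_map_style_dataset so that DataLoader can sample over all rows; I haven't verified either against this dataset. A toy simulation of a buffer-limited shuffle (buffered_shuffle is my own sketch, not a torchtext API) suggests that a dataset-sized buffer would restore uniform batches:

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    # Toy model of shuffling through a bounded buffer (not a torchtext API):
    # each output item is drawn uniformly from the pending buffer.
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            j = rng.randrange(len(buf))
            buf[j], buf[-1] = buf[-1], buf[j]  # O(1) random removal
            yield buf.pop()
    while buf:
        j = rng.randrange(len(buf))
        buf[j], buf[-1] = buf[-1], buf[j]
        yield buf.pop()

# A label-sorted stream shaped like DBpedia's train split: 14 classes x 40,000.
labels = [c for c in range(14) for _ in range(40000)]

# A buffer spanning the whole split is equivalent to a global shuffle.
shuffled = list(buffered_shuffle(labels, buffer_size=len(labels)))
per_batch = [len(set(shuffled[i:i + 10000]))
             for i in range(0, len(shuffled), 10000)]
print(min(per_batch))  # every batch now contains all 14 labels
```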

P.S.
Interactive code is available on Google Colab.
