Does DataLoader(shuffle=True) really shuffle DBpedia dataset correctly? #2197

fujidaiti opened this issue Aug 4, 2023 · 0 comments

According to the docs, the DBpedia dataset has 14 classes (labels) with 40,000 texts per class. Hence, if I create batches using DataLoader(shuffle=True) as follows:

import torchtext.datasets as d
from torch.utils.data.dataloader import DataLoader

train = DataLoader(
    d.DBpedia(split="train", root=".cache"),
    batch_size=10000,
    shuffle=True,
)

the labels should be roughly uniformly distributed within each batch. In practice, however, only a few distinct labels appear in each batch, as the following check shows:

for labels, texts in train:
    print(len(set(labels.tolist())))

The output of the above code is:

1
1
1
2
2
2
2
3
3
3
3
4
4
3
3
.
.
.
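A plausible explanation (my assumption, not confirmed): torchtext's raw datasets are IterDataPipes, and for iterable-style inputs DataLoader(shuffle=True) shuffles through a bounded buffer (the datapipe Shuffler's default buffer_size is reportedly 10,000) rather than over the whole split. Since the DBpedia train split appears to be stored sorted by label, a 10,000-item buffer can only mix neighbouring samples. The toy simulation below (buffered_shuffle is my own sketch, not a torchtext API) reproduces this pattern on a label-sorted stream:

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    """Toy model of shuffling through a bounded buffer: each output item
    is drawn uniformly from at most `buffer_size` pending items, so an
    item cannot appear much earlier than its original position."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Swap a random element to the end and pop it in O(1).
            j = rng.randrange(len(buf))
            buf[j], buf[-1] = buf[-1], buf[j]
            yield buf.pop()
    while buf:
        j = rng.randrange(len(buf))
        buf[j], buf[-1] = buf[-1], buf[j]
        yield buf.pop()

# A label-sorted stream shaped like the DBpedia train split:
# 14 classes with 40,000 samples each, all of class 0 first, then class 1, ...
labels = [c for c in range(14) for _ in range(40000)]

shuffled = list(buffered_shuffle(labels, buffer_size=10000))
per_batch = [len(set(shuffled[i:i + 10000]))
             for i in range(0, len(shuffled), 10000)]
print(per_batch[:8])  # only a few distinct labels per batch, never all 14
```

In this simulation the first few batches contain a single label and batches near a class boundary only a handful, which looks a lot like the output above.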

How can I fix this? Or is my implementation wrong?
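If the limited mixing really comes from a bounded shuffle buffer (my working assumption), one workaround might be to make the buffer cover the whole split, e.g. by calling .shuffle(buffer_size=...) on the datapipe before wrapping it in DataLoader, or by converting it with torchtext.data.functional.to_map_style_dataset so that DataLoader can sample over all rows; I haven't verified either against this dataset. A toy simulation of a buffer-limited shuffle (buffered_shuffle is my own sketch, not a torchtext API) suggests that a dataset-sized buffer would restore uniform batches:

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    # Toy model of shuffling through a bounded buffer (not a torchtext API):
    # each output item is drawn uniformly from the pending buffer.
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            j = rng.randrange(len(buf))
            buf[j], buf[-1] = buf[-1], buf[j]  # O(1) random removal
            yield buf.pop()
    while buf:
        j = rng.randrange(len(buf))
        buf[j], buf[-1] = buf[-1], buf[j]
        yield buf.pop()

# A label-sorted stream shaped like DBpedia's train split: 14 classes x 40,000.
labels = [c for c in range(14) for _ in range(40000)]

# A buffer spanning the whole split is equivalent to a global shuffle.
shuffled = list(buffered_shuffle(labels, buffer_size=len(labels)))
per_batch = [len(set(shuffled[i:i + 10000]))
             for i in range(0, len(shuffled), 10000)]
print(min(per_batch))  # every batch now contains all 14 labels
```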

P.S.
Interactive code is available on Google Colab.
