
Dataset Builder creates duplicate query-document pairs & model predictions are odd #140

Open
littlewine opened this issue Apr 20, 2020 · 1 comment
Labels
bug Something isn't working

Comments

@littlewine

I have the following issue, which is really odd and affects the evaluation of the neural models.
I build my data using the auto preparer, and I came to realize that when I try to make predictions on the test set, some document-query pairs are duplicated.
I am not sure why this is happening; my first guess was that the duplicates pad the last batch up to the batch size, but that does not seem to be the case.

Here's most of my code:

    import pandas as pd
    import matchzoo as mz

    model, prpr, dsb, dlb = preparer.prepare(model_class, train_pack)

    train_prepr = prpr.transform(train_pack)
    valid_prepr = prpr.transform(valid_pack)
    test_prepr = prpr.transform(test_pack)

    train_dataset = dsb.build(train_prepr)
    valid_dataset = dsb.build(valid_prepr)
    test_dataset = dsb.build(test_prepr)

    train_dl = dlb.build(train_dataset, stage='train')
    valid_dl = dlb.build(valid_dataset, stage='dev')
    test_dl = dlb.build(test_dataset, stage='test')

    # training the model etc....

    test_preds = pd.DataFrame(trainer.predict(test_dl), columns=['pred'])
    test_preds['id_left'] = test_dl.id_left
    test_preds['id_right'] = test_dl._dataset[:][0]['id_right']
    test_preds['length_right'] = test_dl._dataset[:][0]['length_right']

Now, it seems that the duplicates are created by the dataset builder, but I don't understand why.

    test_dataset._data_pack.frame().duplicated(['id_left', 'id_right']).sum()
    >> 297
    test_pack.frame().duplicated(['id_left', 'id_right']).sum()
    >> 0
    test_prepr.frame().duplicated(['id_left', 'id_right']).sum()
    >> 0
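To see not just how many rows are duplicated but which pairs are affected and how many copies each has, a small pandas sketch (with made-up data mirroring the pairs shown below; in the real case you would pass `test_dataset._data_pack.frame()` instead) could look like this:

```python
import pandas as pd

# Toy frame standing in for test_dataset._data_pack.frame()
frame = pd.DataFrame({
    "id_left":  ["33-1-1"] * 5,
    "id_right": ["47-07395", "47-07395", "47-07395", "98-33779", "95-23333"],
})

# Count occurrences of each (id_left, id_right) pair; any count > 1 is a duplicate
counts = frame.groupby(["id_left", "id_right"]).size()
dups = counts[counts > 1]
print(dups)  # here: the pair ('33-1-1', '47-07395') appears 3 times
```

This makes it easy to check whether the duplication is uniform (e.g., every pair repeated the same number of times, which would point at batch padding) or uneven.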

Even odder, these duplicates get different scores for the same document-query pair, and the scores are not even close to each other, so this can't be some rounding error. How is it possible that, without re-training the model, I get such different predictions for the same query-document pairs at inference time?


    print(test_preds[test_preds.duplicated(['id_right', 'id_left'],
                                           keep=False)].sort_values(['id_left', 'id_right'])
          )

>>
           pred  id_left  id_right  length_right
466  -10.889746   33-1-1  47-07395           896
499   -9.492123   33-1-1  47-07395           896
677   -6.880966   33-1-1  47-07395           896
496  -10.781660   33-1-1  98-33779           535
678   -7.954109   33-1-1  98-33779           535
1044 -11.102488   33-1-1  98-33779           535
508   -6.497414   33-1-1  95-23333           244
1326  -7.466503   33-1-1  95-23333           244
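As a stopgap while debugging, the duplicated rows can be collapsed with `drop_duplicates` (sketch below, with made-up scores). Note this only hides the symptom: since the duplicate copies score differently, which copy you keep changes the evaluation, so the underlying dataset construction still needs fixing.

```python
import pandas as pd

# Toy prediction frame with duplicated (id_left, id_right) pairs
test_preds = pd.DataFrame({
    "pred":     [-10.89, -9.49, -6.88, -10.78],
    "id_left":  ["33-1-1", "33-1-1", "33-1-1", "33-1-1"],
    "id_right": ["47-07395", "47-07395", "47-07395", "98-33779"],
})

# Keep only the first prediction seen for each query-document pair
deduped = test_preds.drop_duplicates(["id_left", "id_right"], keep="first")
print(len(deduped))  # 2 unique pairs remain
```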

The model used in this reproducible example was KNRM, but I think this happens with other models too.

@littlewine littlewine added the bug Something isn't working label Apr 20, 2020
@faneshion
Member

Hi @littlewine, there are indeed three modes for organizing a datapack: point-wise, pair-wise, and list-wise. For training, you can choose whichever mode matches your loss function. For testing, however, you should not organize the datapack pair-wise, since pair-wise mode adds duplicate instances to fill out the batch size.
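The padding behavior described above can be illustrated with a toy sketch (this is not MatchZoo's actual implementation, just a minimal model of why filling batches by repeating instances produces duplicate pairs):

```python
import math

def pad_to_batches(instances, batch_size):
    """Repeat instances from the start until the total is a multiple of batch_size."""
    n_batches = math.ceil(len(instances) / batch_size)
    padded = list(instances)
    i = 0
    while len(padded) < n_batches * batch_size:
        padded.append(instances[i % len(instances)])  # duplicates appear here
        i += 1
    return padded

pairs = [("q1", f"d{i}") for i in range(10)]  # 10 unique query-document pairs
padded = pad_to_batches(pairs, batch_size=4)
print(len(padded))                      # 12: two instances were repeated
print(len(padded) - len(set(padded)))   # 2 duplicate pairs
```

Since the model is queried once per instance (and some layers may behave non-deterministically at inference, e.g., if dropout or padding differs between copies), duplicated instances can plausibly receive different scores; building the test dataset point-wise avoids the duplication entirely.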
