-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix seq2seq collator padding #30556
Fix seq2seq collator padding #30556
Conversation
further added tests for the seq2seq data collator in the style of the `data_collator_for_token_classification` (pt, tf, np)
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code changes LGTM, and I appreciate the extensive testing! cc @amyeroberts for core maintainer review.
The point you made about side-effects on the input is interesting, though - I agree that it's unlikely to be critical (because people rarely reuse the pre-collator inputs after calling the collator), but it's still suboptimal. Maybe we could consider having our collators make copies of the inputs instead so they don't overwrite them, though this would definitely be a task for a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing and for all of these tests - beautiful!
@Rocketknight1 @vasqu Very much aligned on this - side-effects should definitely be avoided, and it's a bigger piece of work for a follow-up PR. |
@amyeroberts @Rocketknight1 I'm not sure which collators affect which inputs via side-effects so I'd suggest writing a mutability test which assures this wanted behaviour. Afaik, it currently only affects the labels when using the seq2seq collator but maybe I'll catch some others. Got some time now, will be looking into it in a bit. |
@vasqu Awesome ❤️ looking forward to seeing the PR! |
What does this PR do?
DataCollatorForSeq2Seq
currently only supports thelongest
and silently theno_padding
strategies. This PR adds some checks to ensure the padding strategy is respected if valid (e.g.max_length
without a given max length defaults back tolongest
).Furthermore, added tests for the collator in the style of the
DataCollatorForTokenClassification
for all kinds of supported datatypes (pt, tf, np).Something to note: The seq2seq collator causes side effects on the features' labels that are passed. I don't think it's a big deal but thought I'd clarify why I produce features twice within a single test.
Fixes #30521
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@Rocketknight1 @amyeroberts