Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UDOP] Add special tokens to tokenizer #29594

Merged
merged 10 commits into from
Apr 19, 2024

Conversation

NielsRogge
Copy link
Contributor

What does this PR do?

This PR makes sure the 1201 additional special tokens can be encoded/decoded properly.

Fixes #29591

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's do better testing:
paragraph<loc_58>. Hey for example with a special case

@NielsRogge
Copy link
Contributor Author

Thanks, I've added tests. Should we add spaces_between_special_tokens as argument to UdopTokenizer, to make it decode the same as the fast tokenizer?

@ArthurZucker
Copy link
Collaborator

You can add it and default it to False in the decode maybe? Cuz I don't think decode takes self. spaces_between_special_tokens into account no?

@NielsRogge
Copy link
Contributor Author

Yeah indeed, shouldn't the _decode method here take into account self. spaces_between_special_tokens in case it exists?

@NielsRogge NielsRogge requested a review from ArthurZucker April 17, 2024 06:32
Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, breaking but fixing

@NielsRogge
Copy link
Contributor Author

NielsRogge commented Apr 17, 2024

I don't think it's a breaking change, since adding spaces_between_special_tokens as an attribute doesn't do anything. The _decode method doesn't use it (whereas I think it should).

@ArthurZucker
Copy link
Collaborator

If it does not do anything why are we adding it?

@NielsRogge
Copy link
Contributor Author

Haha yes I'll remove it, and I'll remove it for Gemma too

@NielsRogge NielsRogge merged commit ecfe9be into huggingface:main Apr 19, 2024
19 checks passed
ydshieh pushed a commit that referenced this pull request Apr 23, 2024
* Add special tokens

* Add special tokens

* Use fmt

* Uncomment code

* Add test

* Remove scripts

* Address comments

* Improve tests

* Address comment

* Remove flag
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Special tokens of UDOP aren't encoded/decoded correctly
3 participants