Multicharacter Token Support #193

Open: wants to merge 1 commit into base: master


pe-trik commented Oct 13, 2021

The CTC decoder may perform worse when the vocabulary consists of multicharacter tokens (e.g., BPE units). This issue is reported in #173.

The reason is that the implementation sets is_character_based in Scorer to false because the tokens in the LM's dictionary have more than one character. When is_character_based is false, the Scorer builds its FST from malformed transitions: add_word_to_dictionary() splits each token/word in the LM's dictionary into characters rather than into tokens.

This pull request adds an option is_token_based that indicates that the vocabulary consists of custom (multicharacter) tokens.
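
To illustrate the difference between the two ways of splitting a dictionary entry, here is a minimal Python sketch. It is not the decoder's code: the function names split_into_chars and split_into_tokens, the toy vocabulary, and the greedy longest-match segmentation are all assumptions made for illustration. A real BPE tokenizer may segment differently.

```python
def split_into_chars(word, token_to_id):
    """Character-level split: every character is looked up as if it were a label.

    Characters that are not labels in the vocabulary map to None.
    """
    return [token_to_id.get(ch) for ch in word]


def split_into_tokens(word, token_to_id):
    """Greedy longest-match split of `word` into vocabulary tokens (hypothetical)."""
    ids, pos = [], 0
    max_len = max(len(tok) for tok in token_to_id)
    while pos < len(word):
        # Try the longest candidate piece first, then shorter ones.
        for end in range(min(len(word), pos + max_len), pos, -1):
            piece = word[pos:end]
            if piece in token_to_id:
                ids.append(token_to_id[piece])
                pos = end
                break
        else:
            return None  # the word cannot be spelled with this vocabulary
    return ids


# Toy multicharacter vocabulary (an assumption, not taken from the PR).
vocab = ["he", "llo", "wor", "ld"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

print(split_into_chars("llo", token_to_id))     # [None, None, None]
print(split_into_tokens("llo", token_to_id))    # [1]
print(split_into_tokens("hello", token_to_id))  # [0, 1]
```

In this sketch, the character-level split yields labels that do not exist in the acoustic model's vocabulary, so any transitions built from them cannot match the decoder's outputs, while the token-level split yields valid label indices.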


pe-trik commented Oct 15, 2021

Hi @SeanNaren, could you please review the PR? Thanks, Peter.

TehGreatCat commented

@SeanNaren please merge this branch; this is a really needed feature.
