Multicharacter Token Support #193

pe-trik · 2021-10-13T13:01:13Z

The CTC decoder might perform worse when the vocabulary consists of multicharacter tokens (e.g., BPE units). This issue is mentioned in #173.

The reason is that the implementation sets is_character_based in Scorer to false because the tokens in the LM's dictionary have more than one character. When is_character_based is false, the Scorer builds an FST from malformed transitions: add_word_to_dictionary() splits the tokens/words in the LM's dictionary into characters rather than into tokens.

This pull request adds an option, is_token_based, which indicates that the vocabulary consists of custom (multicharacter) tokens.
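
To illustrate the mismatch, here is a minimal Python sketch. It is not the library's code: a plain nested-dict trie stands in for the FST, and the helper names (`build_trie`, `accepts_prefix`) and the BPE segmentation are made up for the example. It only shows why transitions built from single characters cannot be followed by a decoder that emits multicharacter tokens.

```python
def build_trie(sequences):
    """Build a nested-dict prefix trie; each key is one transition label."""
    root = {}
    for seq in sequences:
        node = root
        for symbol in seq:
            node = node.setdefault(symbol, {})
        node["<end>"] = {}  # marks a complete dictionary entry
    return root


def accepts_prefix(trie, symbols):
    """Return True if the label sequence can be followed from the root."""
    node = trie
    for symbol in symbols:
        if symbol not in node:
            return False
        node = node[symbol]
    return True


# Hypothetical BPE segmentation of two dictionary entries.
tokenized_words = [["hel", "lo"], ["wor", "ld"]]

# Character-level split: every transition is a single character.
char_trie = build_trie([list("".join(word)) for word in tokenized_words])

# Token-level split: every transition is one vocabulary token.
token_trie = build_trie(tokenized_words)

# Labels a decoder over the BPE vocabulary would emit for "hello".
decoder_path = ["hel", "lo"]

char_ok = accepts_prefix(char_trie, decoder_path)
token_ok = accepts_prefix(token_trie, decoder_path)
print("character-level trie accepts path:", char_ok)
print("token-level trie accepts path:", token_ok)
```

In this toy setup the character-level trie has no transition labelled `hel`, so the token path is rejected at the first step, while the token-level trie accepts it.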

pe-trik · 2021-10-15T11:39:58Z

Hi @SeanNaren, could you please review the PR? Thanks, Peter.

TehGreatCat · 2023-11-09T16:34:55Z

@SeanNaren, please merge this branch; this is a much-needed feature.

Commit dd3289b: Token Support
