
Tokenizer.json compatibility with ai.djl.huggingface tokenizers broken for sentencepiece-based models #31086

Closed
jobergum opened this issue Apr 30, 2024 · 5 comments

@jobergum
Member

At some point, the tokenizer.json written by the transformers library's save_pretrained changed for sentencepiece-based tokenizer models, and the new format is incompatible with the djl tokenizer implementation we depend on. We use 0.27.0, which is the latest version.

Importing a tokenizer.json file exported with a recent transformers version prevents the embedder from starting because the djl tokenizer doesn't understand the format.
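
For reference, a minimal sketch of how such a file gets produced (the output directory name is illustrative; any recent transformers install writes the newer format):

    # Sketch: reproduce the newer tokenizer.json format with a recent
    # transformers install. The output directory name is illustrative.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
    tok.save_pretrained("multilingual-e5-small-quantized")  # writes tokenizer.json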

For example, take this config, pointing to a tokenizer.json exported from intfloat/multilingual-e5-small using either optimum-cli or our export tooling:

 <component id="e5" type="hugging-face-embedder">
            <transformer-model path="multilingual-e5-small-quantized/model_quantized.onnx"/>
            <tokenizer-model path="multilingual-e5-small-quantized/tokenizer.json"/>
        </component>

It fails with the following error:

[2024-04-30 06:53:25.891] WARNING container        Container.com.yahoo.container.di.Container	
	Failed to set up first component graph due to error when constructing one of the components
	exception=
	com.yahoo.container.di.componentgraph.core.ComponentNode$ComponentConstructorException: Error constructing 'e5' of type 'ai.vespa.embedding.huggingface.HuggingFaceEmbedder': null
	Caused by: java.lang.RuntimeException: data did not match any variant of untagged enum PreTokenizerWrapper at line 73 column 3
	at ai.djl.huggingface.tokenizers.jni.TokenizersLibrary.createTokenizerFromString(Native Method)
	at com.yahoo.language.huggingface.HuggingFaceTokenizer.lambda$getModelInfo$6(HuggingFaceTokenizer.java:124)
	at com.yahoo.language.huggingface.HuggingFaceTokenizer.withContextClassloader(HuggingFaceTokenizer.java:149)
	at com.yahoo.language.huggingface.HuggingFaceTokenizer.getModelInfo(HuggingFaceTokenizer.java:121)
@jobergum
Member Author

The older tokenizer.json file saved on the model hub imports fine.

diff tokenizer-works.json tokenizer-fails.json 
71c71,72
<     "add_prefix_space": true
---
>     "prepend_scheme": "always",
>     "split": true
157c158,159
<     "add_prefix_space": true
---
>     "prepend_scheme": "always",
>     "split": true
1000171c1000173,1000174
<     ]
---
>     ],
>     "byte_fallback": false

The failing file introduces new parameters: prepend_scheme and split in the pre-tokenizer sections, and byte_fallback in the model section. The pre-tokenizer changes sit right where the error points (line 73), which matches

data did not match any variant of untagged enum PreTokenizerWrapper

@baldersheim
Member

Have they fixed it on HEAD? If so, I guess 0.28 will be out soon.

@jobergum
Member Author

Created deepjavalibrary/djl#3141

@jobergum
Member Author

Workaround: patch the tokenizer.json file to remove the offending keys; see vespa-engine/sample-apps#1421.
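
For anyone needing this before 0.28, a rough sketch of what such a patch could look like (not the exact script from sample-apps#1421; key names are taken from the diff above):

    import json

    # Rough downgrade of a newer tokenizer.json to the schema djl 0.27 accepts.
    # Not the exact script from sample-apps#1421; keys come from the diff above.
    with open("tokenizer.json") as f:
        cfg = json.load(f)

    def downgrade(node):
        if isinstance(node, dict):
            # Newer Metaspace sections use prepend_scheme/split instead of
            # the older add_prefix_space flag.
            if node.get("type") == "Metaspace":
                if "prepend_scheme" in node:
                    node["add_prefix_space"] = node.pop("prepend_scheme") == "always"
                node.pop("split", None)
            node.pop("byte_fallback", None)  # new key in the model section
            for value in node.values():
                downgrade(value)
        elif isinstance(node, list):
            for value in node:
                downgrade(value)

    downgrade(cfg)

    with open("tokenizer.json", "w") as f:
        json.dump(cfg, f, ensure_ascii=False)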

@geirst added this to the soon milestone Apr 30, 2024
@baldersheim
Member

deepjavalibrary was upgraded to 0.28 in #31216. That should resolve this issue.
