
Tokenizer.json compatibility with ai.djl.huggingface tokenizers broken for sentencepiece-based models #31086

Closed
jobergum opened this issue Apr 30, 2024 · 5 comments

@jobergum
Member

At some point, the tokenizer.json written by the transformers library's save_pretrained changed for sentencepiece-based tokenizer models, and the new format is incompatible with the djl tokenizer implementation we depend on. We use 0.27.0, which is the latest version.

Importing a tokenizer.json file exported with a recent transformers version prevents the embedder from starting because the djl tokenizer doesn't understand the format.
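
For reference, a minimal sketch of how such a file gets produced (the output directory name is illustrative; any recent transformers install writes the newer format):

    # Sketch: reproduce the newer tokenizer.json format with a recent
    # transformers install. The output directory name is illustrative.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")
    tok.save_pretrained("multilingual-e5-small-quantized")  # writes tokenizer.json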

For example, take this config, pointing to a tokenizer.json exported from intfloat/multilingual-e5-small using either optimum-cli or our export tooling:

 <component id="e5" type="hugging-face-embedder">
            <transformer-model path="multilingual-e5-small-quantized/model_quantized.onnx"/>
            <tokenizer-model path="multilingual-e5-small-quantized/tokenizer.json"/>
        </component>

It fails with the following error:

[2024-04-30 06:53:25.891] WARNING container        Container.com.yahoo.container.di.Container	
	Failed to set up first component graph due to error when constructing one of the components
	exception=
	com.yahoo.container.di.componentgraph.core.ComponentNode$ComponentConstructorException: Error constructing 'e5' of type 'ai.vespa.embedding.huggingface.HuggingFaceEmbedder': null
	Caused by: java.lang.RuntimeException: data did not match any variant of untagged enum PreTokenizerWrapper at line 73 column 3
	at ai.djl.huggingface.tokenizers.jni.TokenizersLibrary.createTokenizerFromString(Native Method)
	at com.yahoo.language.huggingface.HuggingFaceTokenizer.lambda$getModelInfo$6(HuggingFaceTokenizer.java:124)
	at com.yahoo.language.huggingface.HuggingFaceTokenizer.withContextClassloader(HuggingFaceTokenizer.java:149)
	at com.yahoo.language.huggingface.HuggingFaceTokenizer.getModelInfo(HuggingFaceTokenizer.java:121)
@jobergum
Member Author

The older tokenizer.json file saved on the model hub imports fine.

diff tokenizer-works.json tokenizer-fails.json 
71c71,72
<     "add_prefix_space": true
---
>     "prepend_scheme": "always",
>     "split": true
157c158,159
<     "add_prefix_space": true
---
>     "prepend_scheme": "always",
>     "split": true
1000171c1000173,1000174
<     ]
---
>     ],
>     "byte_fallback": false

The failing file introduces new parameters: prepend_scheme and split in the pre-tokenizer sections, and byte_fallback in the model section. The pre-tokenizer changes sit right where the error points (line 73), which matches

data did not match any variant of untagged enum PreTokenizerWrapper

@baldersheim
Member

Have they fixed it on HEAD? If so, I guess 0.28 will be out soon.

@jobergum
Member Author

Created deepjavalibrary/djl#3141

@jobergum
Member Author

Workaround: patch the tokenizer.json file to remove the offending keys; see vespa-engine/sample-apps#1421.
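
For anyone needing this before 0.28, a rough sketch of what such a patch could look like (not the exact script from sample-apps#1421; key names are taken from the diff above):

    import json

    # Rough downgrade of a newer tokenizer.json to the schema djl 0.27 accepts.
    # Not the exact script from sample-apps#1421; keys come from the diff above.
    with open("tokenizer.json") as f:
        cfg = json.load(f)

    def downgrade(node):
        if isinstance(node, dict):
            # Newer Metaspace sections use prepend_scheme/split instead of
            # the older add_prefix_space flag.
            if node.get("type") == "Metaspace":
                if "prepend_scheme" in node:
                    node["add_prefix_space"] = node.pop("prepend_scheme") == "always"
                node.pop("split", None)
            node.pop("byte_fallback", None)  # new key in the model section
            for value in node.values():
                downgrade(value)
        elif isinstance(node, list):
            for value in node:
                downgrade(value)

    downgrade(cfg)

    with open("tokenizer.json", "w") as f:
        json.dump(cfg, f, ensure_ascii=False)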

@geirst added this to the soon milestone Apr 30, 2024
@baldersheim
Member

deepjavalibrary was upgraded to 0.28 in #31216. That should resolve this issue.
