At some point, the tokenizer.json written by the transformers library's save_pretrained changed for sentencepiece-based tokenizer models, and the new format is incompatible with the djl tokenizer implementation we depend on. We use djl 0.27.0, which is the latest version.
Importing a tokenizer.json file from a recent transformers version prevents the embedder from starting because the djl tokenizer does not understand the format.
For example, a config pointing to a tokenizer.json exported from intfloat/multilingual-e5-small using either optimum-cli or our export tooling will fail with the following:
[2024-04-30 06:53:25.891] WARNING container Container.com.yahoo.container.di.Container
Failed to set up first component graph due to error when constructing one of the components
exception=
com.yahoo.container.di.componentgraph.core.ComponentNode$ComponentConstructorException: Error constructing 'e5' of type 'ai.vespa.embedding.huggingface.HuggingFaceEmbedder': null
Caused by: java.lang.RuntimeException: data did not match any variant of untagged enum PreTokenizerWrapper at line 73 column 3
at ai.djl.huggingface.tokenizers.jni.TokenizersLibrary.createTokenizerFromString(Native Method)
at com.yahoo.language.huggingface.HuggingFaceTokenizer.lambda$getModelInfo$6(HuggingFaceTokenizer.java:124)
at com.yahoo.language.huggingface.HuggingFaceTokenizer.withContextClassloader(HuggingFaceTokenizer.java:149)
at com.yahoo.language.huggingface.HuggingFaceTokenizer.getModelInfo(HuggingFaceTokenizer.java:121)
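The "did not match any variant of untagged enum PreTokenizerWrapper" error points at the pre_tokenizer section of tokenizer.json, where newer tokenizers versions serialize the Metaspace pre-tokenizer with fields an older parser does not recognize. A minimal workaround sketch is to rewrite that entry into the older shape before handing the file to the embedder. The field names used below (`prepend_scheme`, `add_prefix_space`) are assumptions about the format change, not taken from this report; verify them against your own tokenizer.json before relying on this.

```python
import json

def downgrade_metaspace(cfg):
    """Rewrite a newer-style Metaspace pre-tokenizer entry into the older
    shape. Assumes the new format uses 'prepend_scheme' where the old one
    used the boolean 'add_prefix_space' -- check this against your file."""
    pre = cfg.get("pre_tokenizer")
    if pre and pre.get("type") == "Metaspace" and "prepend_scheme" in pre:
        scheme = pre.pop("prepend_scheme")
        # "always"/"first" roughly correspond to add_prefix_space=True
        pre["add_prefix_space"] = scheme != "never"
    return cfg

# Usage sketch (path is hypothetical):
# with open("models/tokenizer.json") as f:
#     cfg = downgrade_metaspace(json.load(f))
# with open("models/tokenizer.json", "w") as f:
#     json.dump(cfg, f, indent=2)
```

Re-exporting the tokenizer with an older transformers/tokenizers version pinned is the less invasive alternative if editing the file by hand feels fragile.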