This issue was moved to a discussion.
Different similarity results when using text-embedding-3-small or text-embedding-3-large models #372
Comments
I think that's expected behavior. Bigger and newer models capture more details and understand content better. Something that might seem relevant to ada2 might be less relevant to the other models. The opposite can happen too. In general, when switching models, it's recommended to also "fine-tune" thresholds, prompts and other "semantic" settings. Similar scenarios arise with text generation when moving from GPT 3.5 to 4, and to other models. It's similar to changing an image/sound/video compression algorithm at the core of a game, noticing different quality, performance and artifacts, and needing to revisit settings and requirements.
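One way to approach the threshold tuning mentioned here is to re-calibrate the cut-off for each embedding model from a small set of hand-labelled results. The following is only a sketch: the function name `calibrate_min_relevance` is hypothetical, and the scores are made-up illustrative numbers, not measurements from any model.

```python
def calibrate_min_relevance(samples):
    """Pick the cut-off that best separates relevant from irrelevant chunks.

    samples: list of (score, is_relevant) pairs collected with ONE embedding model.
    Returns the candidate threshold with the highest F1 score.
    """
    candidates = sorted({score for score, _ in samples})
    best_threshold, best_f1 = 0.0, -1.0
    for threshold in candidates:
        # A chunk is "returned" when its score reaches the threshold.
        tp = sum(1 for s, rel in samples if s >= threshold and rel)
        fp = sum(1 for s, rel in samples if s >= threshold and not rel)
        fn = sum(1 for s, rel in samples if s < threshold and rel)
        if tp == 0:
            continue  # nothing relevant returned, F1 is undefined or zero
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold


# Illustrative, hand-written scores (NOT real measurements).
older_model_samples = [(0.84, True), (0.81, True), (0.79, True),
                       (0.76, False), (0.74, False), (0.72, False)]
newer_model_samples = [(0.62, True), (0.57, True), (0.53, True),
                       (0.41, False), (0.35, False), (0.30, False)]

older_threshold = calibrate_min_relevance(older_model_samples)
newer_threshold = calibrate_min_relevance(newer_model_samples)
print(older_threshold, newer_threshold)
```

With score distributions shaped like these, the two models end up with different cut-offs even though both rank the relevant chunks first, which is why a threshold tuned for one model should not be carried over to another unchecked.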
Thank you @dluc for the answer. Now I'm experiencing some weird situations in which a question that had a similarity of 0.79 with Now I'm trying to increment
That's a pretty big difference, are the chunks the same?
Yes, chunks are the same. The text is relevant as
Context / Scenario
For the same document and question, similarity search returns results with lower relevance when using the text-embedding-3-small or text-embedding-3-large models than when using the text-embedding-ada-002 model.
What happened?
I'm using the code available at https://github.com/marcominerva/KernelMemoryService with SimpleVectorDb. I imported the file Taggia.pdf, which is the PDF of the Italian Wikipedia page about the town of Taggia, Italy. Then I searched for "Quante persone vivono a Taggia?" (in English, "How many people live in Taggia?").
If I use the text-embedding-ada-002 model and dig into the source code of SimpleVectorDb (kernel-memory/service/Core/MemoryStorage/DevTools/SimpleVectorDb.cs, lines 115 to 121 in d127063), I obtain this:
However, if I use text-embedding-3-small (I have of course deleted the previous memories and re-imported the document), with the same question I get:
So, with these models I need to change the minRelevance parameter I use for my query. With text-embedding-ada-002 I use a value of 0.75, while with the newer models it seems that anything greater than 0.5 is good. Do you agree?
NOTE: I get similar results also with Qdrant.
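To make the role of minRelevance concrete, here is a minimal sketch of threshold filtering. It assumes the relevance score is the cosine similarity between the question embedding and each chunk embedding. The helper names `cosine_similarity` and `search` and the two-dimensional toy vectors are hypothetical, chosen only for illustration; this is not the Kernel Memory implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(question_vec, chunks, min_relevance):
    """Return (text, score) pairs with score >= min_relevance, best first."""
    scored = [(text, cosine_similarity(question_vec, vec)) for text, vec in chunks]
    kept = [(text, score) for text, score in scored if score >= min_relevance]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (hypothetical): the "population" chunk scores about 0.6.
question = [1.0, 0.0]
chunks = [("population", [0.6, 0.8]), ("history", [0.0, 1.0])]

strict = search(question, chunks, min_relevance=0.75)   # cut-off tuned for another model
relaxed = search(question, chunks, min_relevance=0.5)   # cut-off lowered for this model
print(strict)
print(relaxed)
```

In this toy case the stricter cut-off returns nothing, while the lower one returns the chunk that actually answers the question. The ranking is unchanged; only the absolute score range differs, so the threshold has to follow the model.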
Importance
edge case
Platform, Language, Versions
Kernel Memory v0.35.240318.1
Relevant log output
No response