
Different similarity results when using text-embedding-3-small or text-embedding-3-large models #372

Closed
marcominerva opened this issue Mar 19, 2024 · 4 comments

Comments

@marcominerva (Contributor) commented Mar 19, 2024

Context / Scenario

For the same document and question, the text-embedding-3-small and text-embedding-3-large models return similarity scores with lower relevance than the text-embedding-ada-002 model.

What happened?

I'm using the code available at https://github.com/marcominerva/KernelMemoryService with SimpleVectorDb. I have imported the file Taggia.pdf, which is the PDF of the Italian Wikipedia page about the town of Taggia, Italy. Then I searched for "Quante persone vivono a Taggia?" (in English, "How many people live in Taggia?").

Digging into the source code of SimpleVectorDb, similarity is computed like this:

var similarity = new Dictionary<string, double>();

// Embed the query text once...
Embedding textEmbedding = await this._embeddingGenerator
    .GenerateEmbeddingAsync(text, cancellationToken)
    .ConfigureAwait(false);

// ...then score every stored record against the query embedding.
foreach (var record in records)
{
    similarity[record.Value.Id] = textEmbedding.CosineSimilarity(record.Value.Vector);
}
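For context, CosineSimilarity here is the standard cosine of the angle between the two embedding vectors. A minimal sketch of that computation, assuming equal-length, non-zero float vectors (Kernel Memory's Embedding.CosineSimilarity is the actual implementation; this is only an illustration):

// Minimal sketch of cosine similarity; not the library code.
static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];    // numerator: dot product
        normA += a[i] * a[i];  // squared magnitude of a
        normB += b[i] * b[i];  // squared magnitude of b
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}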

If I use the text-embedding-ada-002 model, I obtain this:

[screenshot: similarity scores returned with text-embedding-ada-002]

However, if I use text-embedding-3-small (I have of course deleted the previous memories and re-imported the document), with the same question I get:

[screenshot: similarity scores returned with text-embedding-3-small]

So, with these models I need to change the minRelevance parameter I use for my query. With text-embedding-ada-002 I use a value of 0.75, while with the newer models it seems that anything greater than 0.5 is good. Do you agree?

NOTE: I get similar results also with Qdrant.
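To illustrate the threshold change, a hypothetical query sketch, assuming the IKernelMemory.AskAsync overload that accepts a minRelevance parameter (the builder setup is illustrative, not taken from the KernelMemoryService repo):

// Hypothetical sketch: model-specific minRelevance values from this thread.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
    .Build<MemoryServerless>();

// ~0.75 worked with text-embedding-ada-002; with the text-embedding-3-*
// models anything greater than ~0.5 seems good in these tests.
var answer = await memory.AskAsync("Quante persone vivono a Taggia?", minRelevance: 0.5);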

Importance

edge case

Platform, Language, Versions

Kernel Memory v0.35.240318.1

Relevant log output

No response

@marcominerva marcominerva added bug Something isn't working triage labels Mar 19, 2024
@marcominerva marcominerva changed the title [Bug] Different similarity results when using text-embedding-3-small or text-embedding-3-large models Different similarity results when using text-embedding-3-small or text-embedding-3-large models Mar 19, 2024
@dluc (Collaborator) commented Mar 19, 2024

I think that's expected behavior. Bigger and newer models capture more details and understand content better. Something that might seem relevant to ada2 might be less relevant to the other models, and the opposite can happen too. In general, when switching models it's recommended to also "fine tune" thresholds, prompts and other "semantic" settings. Similar scenarios arise with text generation when moving from GPT 3.5 to GPT 4 and to other models. It's like changing the image/sound/video compression algorithm at the core of a game: you notice different quality, performance and artifacts, and you need to revisit settings and requirements.

We briefly called this topic out last year at //build, along with the need for a new generation of dev tools to measure AI behavior. It's still early days: there are some options for prompt fine-tuning, but I haven't seen anything for embeddings yet.
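In the meantime, one pragmatic approach could be to recalibrate empirically: score a small labeled set of (question, chunk) pairs with the new model and derive a threshold from the two score distributions. A minimal sketch, assuming Kernel Memory's ITextEmbeddingGenerator and Embedding types; the helper name, the labeled-pairs input and the midpoint heuristic are illustrative, not an existing tool:

using Microsoft.KernelMemory;
using Microsoft.KernelMemory.AI;

// Hypothetical recalibration sketch: score labeled (question, chunk) pairs
// with the new model and pick a threshold separating the two distributions.
static async Task<double> SuggestThresholdAsync(
    ITextEmbeddingGenerator generator,
    IReadOnlyList<(string Question, string Chunk, bool Relevant)> labeled)
{
    var relevant = new List<double>();
    var irrelevant = new List<double>();

    foreach (var (question, chunk, isRelevant) in labeled)
    {
        Embedding q = await generator.GenerateEmbeddingAsync(question);
        Embedding c = await generator.GenerateEmbeddingAsync(chunk);
        (isRelevant ? relevant : irrelevant).Add(q.CosineSimilarity(c));
    }

    // Naive heuristic: midpoint between the worst relevant score and the
    // best irrelevant score; real tuning would inspect full distributions.
    return (relevant.Min() + irrelevant.Max()) / 2;
}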

@dluc dluc added question Further information is requested and removed bug Something isn't working triage labels Mar 19, 2024
@marcominerva (Contributor, Author) commented Mar 19, 2024

Thank you @dluc for the answer. Now I'm running into some weird situations in which a question that had a similarity of 0.79 with text-embedding-ada-002 now has only 0.33 with text-embedding-3-small and 0.27 with text-embedding-3-large, so it is very difficult to set a valid threshold.

Now I'm trying to increase MaxMatchesCount in the SearchClientConfig and the MaxTokens used by Text Generation.
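For reference, a hypothetical sketch of those two knobs. SearchClientConfig is a Kernel Memory type, but the property names and values below are assumptions based on this thread, not verified against v0.35:

// Hypothetical sketch: widening the search so more (now lower-scoring)
// chunks reach the prompt. Values are illustrative.
var memory = new KernelMemoryBuilder()
    .WithSearchClientConfig(new SearchClientConfig
    {
        MaxMatchesCount = 10, // pull in more candidate chunks per query
        AnswerTokens = 500,   // token budget for the answer (assumption)
    })
    .Build<MemoryServerless>();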

@dluc (Collaborator) commented Mar 19, 2024

> Thank you @dluc for the answer. Now I'm running into some weird situations in which a question that had a similarity of 0.79 with text-embedding-ada-002 now has only 0.33 with text-embedding-3-small and 0.27 with text-embedding-3-large, so it is very difficult to set a valid threshold.
>
> Now I'm trying to increase MaxMatchesCount in the SearchClientConfig and the MaxTokens used by Text Generation.

That's a pretty big difference. Are the chunks the same?
Looking at the content, which model do you think is "right"? E.g., is the text actually relevant, as ada002 says, or not so much, as 3-small says?

@marcominerva (Contributor, Author) commented Mar 19, 2024

Yes, the chunks are the same, and the text is relevant, as text-embedding-ada-002 says. For example, among others I have a chunk (about 1000 tokens) that contains something like "and near the town there is Vivaldi Palace, built in 1458", and I ask "When was the Vivaldi Palace built?":

  • text-embedding-ada-002 tells me that the chunk has a similarity of 0.79 with my question
  • text-embedding-3-small returns a similarity of 0.33
  • text-embedding-3-large returns the lowest value, 0.27

@microsoft microsoft locked and limited conversation to collaborators Jun 4, 2024
@dluc dluc converted this issue into discussion #542 Jun 4, 2024
@dluc dluc added discussion and removed question Further information is requested labels Jun 4, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
