
Different similarity results when using text-embedding-3-small or text-embedding-3-large models #372

Closed
marcominerva opened this issue Mar 19, 2024 · 4 comments

Comments

@marcominerva (Contributor) commented Mar 19, 2024

Context / Scenario

For the same document and question, the text-embedding-3-small and text-embedding-3-large models return similarity scores with lower relevance than the text-embedding-ada-002 model.

What happened?

I'm using the code available at https://github.com/marcominerva/KernelMemoryService with SimpleVectorDb. I have imported the file Taggia.pdf, which is the PDF of the Italian Wikipedia page about the town of Taggia, Italy. Then I searched for "Quante persone vivono a Taggia?" (in English, "How many people live in Taggia?").

Digging into the source code of SimpleVectorDb, similarity is computed like this:

var similarity = new Dictionary<string, double>();

// Embed the query text once...
Embedding textEmbedding = await this._embeddingGenerator
    .GenerateEmbeddingAsync(text, cancellationToken)
    .ConfigureAwait(false);

// ...then score every stored record against the query embedding.
foreach (var record in records)
{
    similarity[record.Value.Id] = textEmbedding.CosineSimilarity(record.Value.Vector);
}
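For context, CosineSimilarity here is the standard cosine of the angle between the two embedding vectors. A minimal sketch of that computation, assuming equal-length, non-zero float vectors (Kernel Memory's Embedding.CosineSimilarity is the actual implementation; this is only an illustration):

// Minimal sketch of cosine similarity; not the library code.
static double CosineSimilarity(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];    // numerator: dot product
        normA += a[i] * a[i];  // squared magnitude of a
        normB += b[i] * b[i];  // squared magnitude of b
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}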

If I use the text-embedding-ada-002 model, I obtain this:

[screenshot: similarity scores returned with text-embedding-ada-002]

However, if I use text-embedding-3-small (I have of course deleted the previous memories and re-imported the document), with the same question I get:

[screenshot: similarity scores returned with text-embedding-3-small]

So, with these models I need to change the minRelevance parameter I use for my query. With text-embedding-ada-002 I use a value of 0.75, while with the newer models it seems that anything greater than 0.5 is good. Do you agree?

NOTE: I get similar results also with Qdrant.
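To illustrate the threshold change, a hypothetical query sketch, assuming the IKernelMemory.AskAsync overload that accepts a minRelevance parameter (the builder setup is illustrative, not taken from the KernelMemoryService repo):

// Hypothetical sketch: model-specific minRelevance values from this thread.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY")!)
    .Build<MemoryServerless>();

// ~0.75 worked with text-embedding-ada-002; with the text-embedding-3-*
// models anything greater than ~0.5 seems good in these tests.
var answer = await memory.AskAsync("Quante persone vivono a Taggia?", minRelevance: 0.5);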

Importance

edge case

Platform, Language, Versions

Kernel Memory v0.35.240318.1

Relevant log output

No response

@marcominerva marcominerva added bug Something isn't working triage labels Mar 19, 2024
@marcominerva marcominerva changed the title [Bug] Different similarity results when using text-embedding-3-small or text-embedding-3-large models Different similarity results when using text-embedding-3-small or text-embedding-3-large models Mar 19, 2024
@dluc (Collaborator) commented Mar 19, 2024

I think that's expected behavior. Bigger and newer models capture more details and understand content better. Something that might seem relevant to ada2 might be less relevant to the other models, and the opposite can happen too. In general, when switching models it's recommended to also "fine tune" thresholds, prompts and other "semantic" settings. Similar scenarios arise with text generation when moving from GPT 3.5 to GPT 4 and to other models. It's like changing the image/sound/video compression algorithm at the core of a game: you notice different quality, performance and artifacts, and you need to revisit settings and requirements.

We briefly called this topic out last year at //build, along with the need for a new generation of dev tools to measure AI behavior. It's still early days: there are some options for prompt fine-tuning, but I haven't seen anything for embeddings yet.
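In the meantime, one pragmatic approach could be to recalibrate empirically: score a small labeled set of (question, chunk) pairs with the new model and derive a threshold from the two score distributions. A minimal sketch, assuming Kernel Memory's ITextEmbeddingGenerator and Embedding types; the helper name, the labeled-pairs input and the midpoint heuristic are illustrative, not an existing tool:

using Microsoft.KernelMemory;
using Microsoft.KernelMemory.AI;

// Hypothetical recalibration sketch: score labeled (question, chunk) pairs
// with the new model and pick a threshold separating the two distributions.
static async Task<double> SuggestThresholdAsync(
    ITextEmbeddingGenerator generator,
    IReadOnlyList<(string Question, string Chunk, bool Relevant)> labeled)
{
    var relevant = new List<double>();
    var irrelevant = new List<double>();

    foreach (var (question, chunk, isRelevant) in labeled)
    {
        Embedding q = await generator.GenerateEmbeddingAsync(question);
        Embedding c = await generator.GenerateEmbeddingAsync(chunk);
        (isRelevant ? relevant : irrelevant).Add(q.CosineSimilarity(c));
    }

    // Naive heuristic: midpoint between the worst relevant score and the
    // best irrelevant score; real tuning would inspect full distributions.
    return (relevant.Min() + irrelevant.Max()) / 2;
}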

@dluc dluc added question Further information is requested and removed bug Something isn't working triage labels Mar 19, 2024
@marcominerva (Contributor, Author) commented Mar 19, 2024

Thank you @dluc for the answer. Now I'm running into some weird situations in which a question that had a similarity of 0.79 with text-embedding-ada-002 now has only 0.33 with text-embedding-3-small and 0.27 with text-embedding-3-large, so it is very difficult to set a valid threshold.

Now I'm trying to increase MaxMatchesCount in the SearchClientConfig and the MaxTokens used by Text Generation.
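For reference, a hypothetical sketch of those two knobs. SearchClientConfig is a Kernel Memory type, but the property names and values below are assumptions based on this thread, not verified against v0.35:

// Hypothetical sketch: widening the search so more (now lower-scoring)
// chunks reach the prompt. Values are illustrative.
var memory = new KernelMemoryBuilder()
    .WithSearchClientConfig(new SearchClientConfig
    {
        MaxMatchesCount = 10, // pull in more candidate chunks per query
        AnswerTokens = 500,   // token budget for the answer (assumption)
    })
    .Build<MemoryServerless>();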

@dluc (Collaborator) commented Mar 19, 2024

> Thank you @dluc for the answer. Now I'm running into some weird situations in which a question that had a similarity of 0.79 with text-embedding-ada-002 now has only 0.33 with text-embedding-3-small and 0.27 with text-embedding-3-large, so it is very difficult to set a valid threshold.
>
> Now I'm trying to increase MaxMatchesCount in the SearchClientConfig and the MaxTokens used by Text Generation.

That's a pretty big difference. Are the chunks the same?
Looking at the content, which model do you think is "right"? E.g., is the text actually relevant, as ada002 says, or not so much, as 3-small says?

@marcominerva (Contributor, Author) commented Mar 19, 2024

Yes, the chunks are the same, and the text is relevant, as text-embedding-ada-002 says. For example, among others I have a chunk (about 1000 tokens) that contains something like "and near the town there is Vivaldi Palace, built in 1458", and I ask "When was the Vivaldi Palace built?":

  • text-embedding-ada-002 tells me that the chunk has a similarity of 0.79 with my question
  • text-embedding-3-small returns a similarity of 0.33
  • text-embedding-3-large returns the lowest value, 0.27

@microsoft microsoft locked and limited conversation to collaborators Jun 4, 2024
@dluc dluc converted this issue into discussion #542 Jun 4, 2024
@dluc dluc added discussion and removed question Further information is requested labels Jun 4, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
