Hugging Face Transformer Based Embeddings and Cosine Similarity Search Example


This example demonstrates how to transform text into embeddings with the Hugging Face sentence-transformers library. Once the text is represented as embeddings, cosine similarity search can determine which embeddings are most similar to a search query

What are Text Embeddings

Embeddings are numerical representations of text data that capture the semantic meaning of, and relationships between, words, sentences, or documents. These numerical representations (vectors in a high-dimensional space) of the text data (sentences, paragraphs, documents, ...) are constructed so that words or texts with similar meanings/contexts sit closer to each other in the embedding space

There are several methods to generate text embeddings:

  1. Word Embeddings represent individual words as vectors. Popular techniques include Word2Vec, GloVe (Global Vectors for Word Representation), and FastText

  2. Sentence and Document Embeddings aim to represent the meaning of entire sentences or documents. Sentence/document embeddings are generated using methods like averaging the word embeddings, using recurrent neural networks (RNNs) or convolutional neural networks (CNNs), or, more recently, using transformer based models

  3. Transformer Based Embedding models, such as BERT and GPT, can create contextual embeddings. These embeddings consider the surrounding context of each word in a sentence, which can result in richer, more nuanced representations

The example in this repository uses a transformer based approach for converting text to embeddings. Specifically, it uses the all-MiniLM-L6-v2 model provided by Hugging Face.

What is Cosine Similarity Search

Cosine similarity measures the cosine of the angle between two vectors, which indicates how similar their directions are regardless of their magnitudes. Explained simply, cosine similarity quantifies how similar two vectors are. For two vectors a and b, it is the dot product a . b divided by the product of the vector magnitudes ||a|| ||b||.
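
As a quick illustration (a minimal sketch with numpy, not part of the repository code):

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    # 1.0 = same direction, 0.0 = orthogonal, -1.0 = opposite direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude
print(cosine_similarity(a, b))  # 1.0 - magnitude does not affect the score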

In search applications, if we represent both our search query and our searchable text as vectors, we can determine which searchable text is most similar to our search query.

Example in Python

Install sentence_transformers library

pip install sentence_transformers or in a Jupyter notebook run !pip install sentence_transformers

To confirm that the installation was successful, run the following import

from sentence_transformers import SentenceTransformer, util

Create corpus of documents

In this example we have a corpus of 4 documents

corpus = [
    "does this work with xbox?",
    "Does the M70 work with Android phones?",
    "does this work with iphone?",
    "Can this work with an xbox "
]

Download the transformer model and encode the documents

model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = model.encode(corpus)
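
Each document is now a fixed-length vector; all-MiniLM-L6-v2 produces 384-dimensional embeddings. As an optional sanity check (assuming the default numpy array output of model.encode):

print(embeddings.shape)  # (4, 384): one 384-dimensional vector per document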

Encode a search query

search_query = "M70 Android"
search_query_encode = model.encode(search_query)

Compute cosine similarity between the search query and all sentences

cos_sim = util.cos_sim(search_query_encode, embeddings)
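
Here cos_sim is a PyTorch tensor of shape (1, 4): one row for the single query and one column per corpus document.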

Return the most similar document to the search query

similarity_scores = []
for i in range(cos_sim.size(1)):  # one score per corpus document
    similarity_scores.append(cos_sim[0][i].item())

print(corpus[similarity_scores.index(max(similarity_scores))])

---- result below ----

Does the M70 work with Android phones?
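
To see the full ranking rather than only the top match, one possible extension (not part of the repository code) is:

for score, document in sorted(zip(similarity_scores, corpus), reverse=True):
    print(f"{score:.4f}  {document}")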
