SemanticFinder - frontend-only live semantic search with transformers.js

Try the demo or read the introduction blog post.

Semantic search right in your browser! Calculates the embeddings and cosine similarity client-side, without any server-side inference, using transformers.js and a quantized version of sentence-transformers/all-MiniLM-L6-v2.

Installation

Clone the repository and install dependencies with

npm install

Then run with

npm run start

If you want to build instead, run

npm run build

Afterwards, you'll find the index.html, main.css and bundle.js in dist.

Speed

Tested on the entire text of Moby Dick (~660,000 characters, ~13,000 lines or ~111,000 words). Initial embedding generation takes 1-2 minutes on my old i7-8550U CPU with a segment size of 1000 characters. Subsequent queries take only 20-30 seconds! If you want to query larger texts or keep an entire library of books indexed, use a proper vector database instead.

Features

You can customize everything!

Input text & search term(s)
Segment length (larger segments mean faster processing, smaller segments mean slower processing)
Highlight colors (currently hard-coded)
The number of highlights is based on the threshold value: the lower the threshold, the more results.
Live updates
Easy integration of other ML-models thanks to transformers.js
Data privacy-friendly - your input text data is not sent to a server, it stays in your browser!

Usage ideas

Basic search through anything, like your personal notes (my initial motivation by the way, a huge notes.txt file I couldn't handle anymore)
Remember poem analysis in school? You often look for possible leitmotifs or recurring categories, like food in Hänsel & Gretel.

Future ideas

One could package everything nicely and use it e.g. instead of JavaScript search engines such as Lunr.js (also being used in mkdocs-material).
Integration in mkdocs (mkdocs-material) experimental:
- when building the docs, slice all .md files into chunks (length defined in mkdocs.yaml). The chunks should be fairly large (>800 characters) for lower response time. It's also possible to build n indices, starting with a coarse index (maybe per document/.md file if the model used supports the length) followed by a refined one for the document chunks
- build the index by calculating the embeddings for all docs/chunks
- when a user queries the docs, a switch can toggle between the (fast) standard full-text search (currently with lunr.js) and the experimental semantic search
- if the latter is toggled, the client loads the model (all-MiniLM-L6-v2 is ~30 MB)
- like in SemanticFinder, the embedding is created client-side and the cosine similarity is calculated
- the highest-scoring results are returned just like with lunr.js, so the user shouldn't even notice a difference in the UI
Electron- or browser-based apps could be augmented with semantic search, e.g. VS Code, Atom or mobile apps.
Integration in personal wikis such as Obsidian, tiddlywiki etc. would save you the tedious tagging/keywords/categorisation work or could at least improve your structure further
Search your own browser history (thanks @Snapdeus)
Integration in chat apps
Allow PDF-uploads (conversion from PDF to text)
Integrate with Speech-to-Text whisper model from transformers.js to allow audio uploads.
Thanks to CodeMirror one could even use syntax highlighting for programming languages such as Python, JavaScript etc.
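
The coarse-then-refined index idea from the mkdocs item above could look roughly like the sketch below. This is only an illustration, not existing SemanticFinder code: the function names, the shape of the docs array and the topDocs/topChunks parameters are all assumptions, and the embeddings are assumed to have been computed at build time.

```javascript
// Cosine similarity between two equal-length numeric vectors (0 for zero vectors).
function cosine(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    const denom = Math.sqrt(normA) * Math.sqrt(normB);
    return denom === 0 ? 0 : dot / denom;
}

// Hypothetical index shape (an assumption for this sketch):
// docs = [{ id, embedding, chunks: [{ text, embedding }] }]
function coarseToFineSearch(queryEmbedding, docs, topDocs = 2, topChunks = 3) {
    // Stage 1 (coarse): rank whole documents by their document-level embedding.
    const shortlist = docs
        .map((doc) => ({ doc, score: cosine(queryEmbedding, doc.embedding) }))
        .sort((a, b) => b.score - a.score)
        .slice(0, topDocs);

    // Stage 2 (refined): score only the chunks of the shortlisted documents.
    return shortlist
        .flatMap(({ doc }) => doc.chunks.map((chunk) => ({
            docId: doc.id,
            text: chunk.text,
            score: cosine(queryEmbedding, chunk.embedding),
        })))
        .sort((a, b) => b.score - a.score)
        .slice(0, topChunks);
}
```

The trade-off is that a good chunk inside a document that scores poorly as a whole is never looked at, so topDocs should not be chosen too small.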

Logic

Transformers.js is doing all the heavy lifting of tokenizing the input and running the model. Without it, this demo would have been impossible.

Input

Text, as much as your browser can handle! The demo uses a part of "Hänsel & Gretel" but it can handle hundreds of PDF pages.
A search term or phrase
The approximate number of characters per segment the text should be split into
A similarity threshold value. Results with lower similarity score won't be displayed.

Output

Three highlighted string segments, the darker the higher the similarity score.
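
As a rough sketch of how the threshold and the three shaded highlights fit together: the function below filters a segment/score dictionary by the threshold, keeps the best three and assigns a darker shade to higher ranks. The name selectHighlights and the concrete opacity values are assumptions for illustration, not the demo's actual code.

```javascript
// scores: dictionary with the segment as key and the similarity score as value.
function selectHighlights(scores, threshold, maxResults = 3) {
    return Object.entries(scores)
        // Drop everything below the similarity threshold.
        .filter(([, score]) => score >= threshold)
        // Highest score first.
        .sort(([, a], [, b]) => b - a)
        .slice(0, maxResults)
        .map(([segment, score], rank) => ({
            segment,
            score,
            // Rank 0 gets the darkest highlight; the values are illustrative only.
            opacity: Math.max(0.1, 1 - rank * 0.3),
        }));
}

const highlights = selectHighlights(
    { "bread crumbs": 0.9, "dark forest": 0.2, "gingerbread house": 0.7, "sweet cake": 0.6, "old witch": 0.5 },
    0.4
);
console.log(highlights.map((h) => h.segment));
```

Lowering the threshold lets more segments pass the filter, but the output is still capped at maxResults.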

Pipeline

All scripts are loaded. The model is loaded once from HuggingFace and is cached in the browser afterwards.
A user inputs some text and a search term or phrase.
Depending on the approximate length to consider (unit = characters), the text is split into segments. Words themselves are never split, which is why the length is only approximate.
The search term embedding is created.
For each segment of the text, the embedding is created.
Meanwhile, the cosine similarity is calculated between every segment embedding and the search term embedding. It's written to a dictionary with the segment as key and the score as value.
On every iteration, the progress bar and the highlighted sections are updated in real time based on the highest scores so far.
The embeddings are cached in the dictionary so that subsequent queries are quite fast. The calculation of the cosine similarity is fairly speedy in comparison to the embedding generation.
The embeddings must be recalculated only if the user changes the segment length.
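
The steps above can be sketched in a few lines of plain JavaScript. This is a minimal sketch under stated assumptions, not the actual SemanticFinder source: the function names are made up, and toyEmbed is a hypothetical stand-in (letter counts) for the real model so that the example runs without transformers.js. In the real app the embedding call would be asynchronous.

```javascript
// Split text into segments of roughly maxChars characters without cutting words.
function splitIntoSegments(text, maxChars) {
    const words = text.split(/\s+/).filter(Boolean);
    const segments = [];
    let current = "";
    for (const word of words) {
        // Start a new segment once adding the next word would exceed the limit.
        if (current && current.length + 1 + word.length > maxChars) {
            segments.push(current);
            current = word;
        } else {
            current = current ? current + " " + word : word;
        }
    }
    if (current) segments.push(current);
    return segments;
}

// Cosine similarity between two equal-length numeric vectors (0 for zero vectors).
function cosineSimilarity(a, b) {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    const denom = Math.sqrt(normA) * Math.sqrt(normB);
    return denom === 0 ? 0 : dot / denom;
}

// Hypothetical stand-in for the model: counts of the letters a-z.
function toyEmbed(text) {
    const vec = new Array(26).fill(0);
    for (const ch of text.toLowerCase()) {
        const idx = ch.charCodeAt(0) - 97;
        if (idx >= 0 && idx < 26) vec[idx] += 1;
    }
    return vec;
}

// Segment embeddings are cached, so a new query only has to embed the query itself.
const embeddingCache = {};

function search(text, query, maxChars, embed = toyEmbed) {
    const queryEmbedding = embed(query);
    const scores = {}; // segment as key, similarity score as value
    for (const segment of splitIntoSegments(text, maxChars)) {
        if (!(segment in embeddingCache)) {
            embeddingCache[segment] = embed(segment);
        }
        scores[segment] = cosineSimilarity(embeddingCache[segment], queryEmbedding);
    }
    return scores;
}
```

Because the cache is keyed by the segment text, changing the segment length produces different segments and therefore cache misses, which matches the recalculation step described above.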

Pre-index files for rapid search

For larger documents that many people might be interested in, it's worth considering pre-indexing.

You can run the indexing (= embedding calculation) once and override the inputTextsEmbeddings variable in the source code with the result. Then either copy the original text into the textarea input field exactly as you pasted it before, or use a function to restore the original text from the inputTextsEmbeddings variable. Make sure you use the same segment length as during indexing (700 chars in the IPCC examples).
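
A minimal sketch of the export and restore steps, assuming inputTextsEmbeddings is a plain object with the segment as key and the embedding array as value. The helper names and the rounding to four decimals (to shrink the file) are assumptions, not part of SemanticFinder.

```javascript
// Serialize the segment/embedding dictionary so it can be hosted as a static JSON file.
// Rounding is optional and only an assumption here to reduce the file size.
function exportEmbeddings(inputTextsEmbeddings, decimals = 4) {
    const rounded = {};
    for (const [segment, embedding] of Object.entries(inputTextsEmbeddings)) {
        rounded[segment] = embedding.map((v) => Number(v.toFixed(decimals)));
    }
    return JSON.stringify(rounded);
}

// Rebuild the text by joining the segments (the dictionary keys) with spaces.
function restoreText(inputTextsEmbeddings) {
    return Object.keys(inputTextsEmbeddings).join(" ");
}
```

Two caveats with restoring from the keys: identical segments collapse into one key, and JavaScript lists keys that look like integers (e.g. "42") before all other keys, so such segments would be reordered. Line breaks of the original text are not recovered either.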

You can also host the embeddings somewhere else, e.g. on GitHub, and load them at runtime. See the example source code in the IPCC examples:

Longer Report: https://geo.rocks/semanticfinder/ipcc/
Summary for Policymakers: https://geo.rocks/semanticfinder/ipcc-summary/

Function example loading an external embeddings file, overriding the inputTextsEmbeddings variable, setting the textarea value and updating CodeMirror:

// Dictionary with the text segment as key and its pre-computed embedding as value.
var inputTextsEmbeddings;
var url = "https://raw.githubusercontent.com/do-me/SemanticFinder-IPCC/main/ipcc-embeddings.json";

$.ajax({
    url: url,
    dataType: 'json',
    error: function(jqXHR, textStatus, errorThrown) {
        console.error('Failed to load embeddings:', textStatus, errorThrown);
    },
    success: function(results) {
        // Override the embeddings with the pre-computed ones.
        inputTextsEmbeddings = results;

        // Restore the original text by joining the segments (the dictionary keys).
        const textarea = document.getElementById('input-text');
        textarea.value = Object.keys(inputTextsEmbeddings).join(' ');

        // Update CodeMirror; `editor` is assumed to be declared elsewhere in the app.
        editor = CodeMirror.fromTextArea(textarea, {
            lineNumbers: true,
            mode: 'text/plain',
            matchBrackets: true,
            lineWrapping: true,
        });
    }
});

Collaboration

PRs welcome!

