simstring

A Python implementation of SimString, a simple and efficient algorithm for approximate string matching.

Docs are here

Features

With this library, you can retrieve the strings or texts that meet a given similarity threshold from a large collection of strings or texts. This is useful when developing language-processing applications.

This library supports a variety of similarity functions, such as cosine similarity and Jaccard similarity, and supports word n-grams and character n-grams as features. You can also implement your own feature extractor easily.
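To make the two feature types concrete, here is a minimal sketch of character and word n-gram extraction that does not depend on this library. The function names `char_ngrams` and `word_ngrams` are hypothetical and are not part of the library's API; the library's own extractors may pad or de-duplicate features differently.

```python
def char_ngrams(text, n=2):
    """Return all contiguous character n-grams of text, in order."""
    # Hypothetical helper for illustration; not the library's extractor.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=2):
    """Return all contiguous word n-grams of text, splitting on whitespace."""
    # Hypothetical helper for illustration; not the library's extractor.
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


print(char_ngrams("fooo"))
# ['fo', 'oo', 'oo']
print(word_ngrams("You are so cool."))
# [('You', 'are'), ('are', 'so'), ('so', 'cool.')]
```

Character n-grams tolerate small spelling differences within a word, while word n-grams compare longer texts phrase by phrase.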

SimString has the following features:

Fast algorithm for approximate string retrieval.
100% exact retrieval. Although some algorithms allow misses (false negatives) for faster query response, SimString is guaranteed to achieve 100% correct retrieval with fast query response.
Unicode support.
Extensibility. You can implement your own feature extractor easily.
No Japanese support.

Please see this paper for more details.

Install

pip install simstring-fast

Usage

from simstring.feature_extractor.character_ngram import CharacterNgramFeatureExtractor
from simstring.measure.cosine import CosineMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher

db = DictDatabase(CharacterNgramFeatureExtractor(2))
db.add('foo')
db.add('bar')
db.add('fooo')

searcher = Searcher(db, CosineMeasure())
results = searcher.search('foo', 0.8)
print(results)
# => ['foo', 'fooo']
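
For intuition about what the threshold means, here is a brute-force sketch that does not depend on this library: it scores every stored string against the query and keeps those whose cosine similarity meets the threshold. The helper names (`bigrams`, `cosine`, `brute_force_search`) are hypothetical, and the features are unpadded character bigrams counted as a multiset, which is an assumption rather than the library's exact feature scheme.

```python
from collections import Counter
from math import sqrt


def bigrams(text):
    """Count unpadded character bigrams (a multiset of features)."""
    # Assumption: no padding characters are added around the string.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(x, y):
    """Cosine similarity on multisets: overlap / sqrt(size_x * size_y)."""
    size_x, size_y = sum(x.values()), sum(y.values())
    if size_x == 0 or size_y == 0:
        return 0.0
    overlap = sum((x & y).values())  # size of the multiset intersection
    return overlap / sqrt(size_x * size_y)


def brute_force_search(strings, query, threshold):
    """Return every string whose similarity to query is >= threshold."""
    q = bigrams(query)
    return [s for s in strings if cosine(q, bigrams(s)) >= threshold]


print(brute_force_search(["foo", "bar", "fooo"], "foo", 0.8))
# ['foo', 'fooo']
```

Under these assumptions, 'fooo' scores 2 / sqrt(2 * 3), roughly 0.816, which clears the 0.8 threshold, while 'bar' shares no bigrams with the query and scores 0.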

To use a different feature extractor, measure, or database, simply swap in the corresponding classes. You can also substitute your own classes.

from simstring.feature_extractor.word_ngram import WordNgramFeatureExtractor
from simstring.measure.jaccard import JaccardMeasure
from simstring.database.dict import DictDatabase
from simstring.searcher import Searcher

db = DictDatabase(WordNgramFeatureExtractor(2))
db.add('You are so cool.')

searcher = Searcher(db, JaccardMeasure())
results = searcher.search('You are cool.', 0.8)
print(results)

Supported String Similarity Measures

Cosine
Dice
Jaccard
Overlap
Left Overlap
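
As a rough guide to how these measures differ, the sketch below computes the standard set-based definitions of the first four on two small feature sets. It is independent of this library, the function names are hypothetical, and the library's measure classes may handle duplicate features or edge cases differently. Left Overlap is omitted because its definition is not given here.

```python
from math import sqrt

# Standard set-based definitions; both sets are assumed to be non-empty.


def cosine(x, y):
    return len(x & y) / sqrt(len(x) * len(y))


def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))


def jaccard(x, y):
    return len(x & y) / len(x | y)


def overlap(x, y):
    return len(x & y) / min(len(x), len(y))


# Two example feature sets sharing two of their features.
x = {"ab", "bc", "cd"}
y = {"bc", "cd", "de", "ef"}

for measure in (cosine, dice, jaccard, overlap):
    print(measure.__name__, round(measure(x, y), 3))
# cosine 0.577
# dice 0.571
# jaccard 0.4
# overlap 0.667
```

For the same pair of strings, Jaccard gives the lowest score and Overlap the highest, so a threshold that works for one measure usually needs adjusting for another.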

Supported database backends

dictionary
diskcache (sqlite)
redis (in development #37)
