mkonicek/nlp


What is this repo?

This code lets you experiment with pre-trained word embeddings using plain Python 3 with no additional dependencies.

See the blog post Playing with Word Vectors for a detailed explanation.

Usage

This repo only includes a small data file with 1000 words. To get interesting results you'll need to download the pre-trained word vectors from the fastText website.

But don't use the whole 2GB file! The program would use too much memory. Instead, once you've downloaded the file, take only the top n words, save them to a separate file, and remove the first line (the header). For example:

$ head -n 60001 data/wiki-news-300d-1M.vec | tail -n 60000 > data/vectors.vec
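Loading the trimmed file is straightforward. Here's a hypothetical sketch (not code from this repo; the function name is my own), assuming the standard .vec format where each line is a word followed by its float components:

```python
# Minimal sketch of a .vec loader: one word per line, then its components.
def load_vectors(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            vectors[word] = [float(x) for x in values]
    return vectors
```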

Then you can run:

# Find words related to given word
$ python3 related.py

and:

# Complete analogies
$ python3 analogies.py
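Both scripts rest on standard word-vector math: related words are found by cosine similarity, and analogies by adding and subtracting vectors. A minimal pure-Python sketch of these two ideas (the function names are my own, not this repo's API):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def most_similar(target, vectors, n=5):
    # Rank every word by cosine similarity to the target vector.
    scored = sorted(vectors.items(),
                    key=lambda kv: cosine_similarity(target, kv[1]),
                    reverse=True)
    return [word for word, _ in scored[:n]]

def analogy(a, b, c, vectors):
    # "a is to b as c is to ?": compute b - a + c element-wise, then rank.
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return most_similar(target, vectors)
```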

Using numpy?

The code doesn't use numpy or any other third-party dependencies, so anyone can run it easily with vanilla Python 3.

There is a separate branch that uses numpy for the vector math; it achieves roughly a 12x speedup on my laptop.

Type annotations

I use the mypy type checker to find bugs as I type in VS Code (via this plugin). However, you don't need mypy installed: Python 3 runs the annotated code just fine.
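To illustrate what that means in practice, here is a small example in the same spirit (not code from this repo): the annotations are checked statically by mypy, while at runtime Python ignores them entirely.

```python
from typing import List

# A type alias makes signatures read naturally; mypy checks callers against it.
Vector = List[float]

def add(a: Vector, b: Vector) -> Vector:
    # Element-wise sum of two vectors of equal length.
    return [x + y for x, y in zip(a, b)]
```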

LICENSE

MIT
