Biterm Topic Model - minimal real world usage fork

From the original repository:

"Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (i.e., biterms). (In contrast, LDA and PLSA are word-document co-occurrence topic models, since they model word-document co-occurrences.)

A biterm consists of two words co-occurring in the same context, for example, in the same short text window. Unlike LDA, which models word occurrences, BTM models biterm occurrences in a corpus. In the generation procedure, a biterm is generated by drawing two words independently from the same topic. In other words, the distribution of a biterm b = (w_i, w_j) is defined as:

P(b) = \sum_z P(w_i|z) P(w_j|z) P(z).

With a Gibbs sampling algorithm, we can learn the topics by estimating P(w|z) and P(z).

For more detail, refer to the following paper:

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model for Short Texts. WWW 2013."
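
To make the generative definition above concrete, the following minimal Python sketch (not part of this repository; all parameter values are made up for illustration) extracts the biterms of a toy document and evaluates P(b) under a hypothetical 2-topic model:

```python
from itertools import combinations

# Hypothetical toy model: P(z) for K = 2 topics and P(w|z) for a 3-word vocabulary.
p_z = [0.6, 0.4]               # P(z)
p_w_given_z = [                # p_w_given_z[z][w] = P(w|z)
    [0.7, 0.2, 0.1],           # topic z = 0
    [0.1, 0.3, 0.6],           # topic z = 1
]

def biterm_probability(w_i, w_j):
    """P(b) for the biterm b = (w_i, w_j): both words drawn independently from one topic."""
    return sum(p_w_given_z[z][w_i] * p_w_given_z[z][w_j] * p_z[z] for z in range(len(p_z)))

# A biterm is any unordered pair of words co-occurring in the same short text.
doc = [0, 1, 2]                           # a short document as word IDs
for w_i, w_j in combinations(doc, 2):     # biterms (0, 1), (0, 2), (1, 2)
    print((w_i, w_j), round(biterm_probability(w_i, w_j), 4))
```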

Motivation

This fork aims to provide interfaces to the authors' original code base that are better suited to real-world applications, while making minimal modifications to the code itself.

Usage / added to original project

Building

Run make in the repository's root.

Topic learning

Run script/train.py DOCUMENTS MODEL in the repository's root, where DOCUMENTS is a file with one document (consisting of space-separated tokens) per line and MODEL is the directory in which to create the model.

Training parameters can be set as follows (a usage sketch follows the list):

  • --num-topics K or -k K to set the number of topics to learn to K; this will default to K=20.
  • --alpha ALPHA or -a ALPHA to set the alpha parameter as given by the paper; this will default to ALPHA=K/50.
  • --beta BETA or -b BETA to set the beta parameter as given by the paper; this will default to BETA=5.
  • --num-iterations N_IT or -n N_IT to set the number of training iterations; this will default to N_IT=5.
  • --save-steps SAVE_STEPS or -s SAVE_STEPS to set the number of iterations after which the model is saved; this will default to SAVE_STEPS=500.
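
As a concrete (hypothetical) usage sketch, the following Python snippet writes a tiny DOCUMENTS file and starts training with two of the parameters above; the corpus, file names, and parameter values are invented for illustration:

```python
import subprocess
import sys

# Toy corpus: one document (space-separated tokens) per line.
docs = [
    "cats purr softly at night",
    "dogs bark at the mailman",
    "stock markets fell sharply today",
]
with open("documents.txt", "w") as f:
    f.write("\n".join(docs) + "\n")

# Equivalent to running: script/train.py documents.txt model -k 10 -n 100
subprocess.run(
    [sys.executable, "script/train.py", "documents.txt", "model",
     "--num-topics", "10", "--num-iterations", "100"],
    check=True,
)
```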

After training, the directory MODEL will contain the following (see the reading sketch after this list):

  • a file vocab.txt with lines ID TOKEN that encode the documents' tokens as integer IDs
  • a file topics.csv with tab-separated columns topic, prob_topic and top_words, where topic is a topic's ID z (\in [0..K-1]), prob_topic is P(z), and top_words is a comma-separated list of at most 10 tokens w with the highest values of P(w|z), i.e. the topic's highest-probability tokens
  • a directory vectors/ that holds the actual model data, i.e. the values of P(z) and P(w|z) needed for topic inference
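
The sketch below shows one way these files could be read back, assuming the model was written to a directory named model and that vocab.txt and topics.csv follow the layout described above (whether topics.csv has a header row is not specified here; the sketch assumes it does not):

```python
# Read the vocabulary: one "ID TOKEN" pair per line.
vocab = {}
with open("model/vocab.txt") as f:
    for line in f:
        token_id, token = line.split(maxsplit=1)
        vocab[int(token_id)] = token.strip()

# Read the topic summary: tab-separated topic, prob_topic, top_words.
with open("model/topics.csv") as f:
    for line in f:
        topic, prob_topic, top_words = line.rstrip("\n").split("\t")
        print(f"topic {topic}: P(z) = {prob_topic}, top words: {top_words}")
```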

Topic Inference, i.e. P(z|d)

This fork provides a Python class BTMInferrer in script/infer.py with an interface for fast topic inference on single documents; it can easily be reimplemented analogously in other programming languages.

Here, an instance i of BTMInferrer is initialized with the model's directory (see the section Topic learning). A single document's topic vector can then be inferred by calling i.infer(document), which returns a list of K floats representing the K-dimensional vector P(z|d).
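
A minimal usage sketch based on the interface described above (the model directory and document text are hypothetical, and script/ is assumed to be importable, e.g. on PYTHONPATH):

```python
from infer import BTMInferrer   # i.e. script/infer.py

i = BTMInferrer("model")                              # directory created by training
p_z_given_d = i.infer("cats purr softly at night")    # list of K floats, i.e. P(z|d)
print(len(p_z_given_d), sum(p_z_given_d))             # K, and the sum of the probabilities
```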

Notable changes from the original repository

  • the existing Makefile was revised for efficiency and to separate build artifacts from the source tree
  • the existing scripts were recreated to increase efficiency, adaptability, and ease of use
  • the existing C++ code was formatted according to the LLVM coding standards, and dynamic inference (through stdin/stdout) was added while making minimal changes and retaining all previous functionality
  • the original project's sample data has been removed to decrease the repository's size (once GitHub prunes expired refs)
