Biterm Topic Model - minimal real world usage fork

From the original repository:

"Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (i.e., biterms). (In contrast, LDA and PLSA are word-document co-occurrence topic models, since they model word-document co-occurrences.)

A biterm consists of two words co-occurring in the same context, for example, in the same short text window. Unlike LDA, which models word occurrences, BTM models biterm occurrences in a corpus. In the generation procedure, a biterm is generated by drawing two words independently from the same topic. In other words, the distribution of a biterm b = (w_i, w_j) is defined as:

P(b) = \sum_z P(w_i|z) P(w_j|z) P(z).

With a Gibbs sampling algorithm, we can learn the topics by estimating P(w|z) and P(z).

For more detail, refer to the following paper:

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model for Short Texts. WWW 2013."
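
To make the generative definition above concrete, the following minimal Python sketch (not part of this repository; all parameter values are made up for illustration) extracts the biterms of a toy document and evaluates P(b) under a hypothetical 2-topic model:

```python
from itertools import combinations

# Hypothetical toy model: P(z) for K = 2 topics and P(w|z) for a 3-word vocabulary.
p_z = [0.6, 0.4]               # P(z)
p_w_given_z = [                # p_w_given_z[z][w] = P(w|z)
    [0.7, 0.2, 0.1],           # topic z = 0
    [0.1, 0.3, 0.6],           # topic z = 1
]

def biterm_probability(w_i, w_j):
    """P(b) for the biterm b = (w_i, w_j): both words drawn independently from one topic."""
    return sum(p_w_given_z[z][w_i] * p_w_given_z[z][w_j] * p_z[z] for z in range(len(p_z)))

# A biterm is any unordered pair of words co-occurring in the same short text.
doc = [0, 1, 2]                           # a short document as word IDs
for w_i, w_j in combinations(doc, 2):     # biterms (0, 1), (0, 2), (1, 2)
    print((w_i, w_j), round(biterm_probability(w_i, w_j), 4))
```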

Motivation

This fork aims to provide interfaces to the authors' original code base that are better suited to real-world applications, while making minimal modifications to the code itself.

Usage / added to original project

Building

Run make in the repository's root.

Topic learning

Run script/train.py DOCUMENTS MODEL in the repository's root, where DOCUMENTS is a file with one document (consisting of space-separated tokens) per line and MODEL is the directory in which to create the model.

Training parameters can be set as follows (a usage sketch follows the list):

  • --num-topics K or -k K to set the number of topics to learn to K; this will default to K=20.
  • --alpha ALPHA or -a ALPHA to set the alpha parameter as given by the paper; this will default to ALPHA=K/50.
  • --beta BETA or -b BETA to set the beta parameter as given by the paper; this will default to BETA=5.
  • --num-iterations N_IT or -n N_IT to set the number of training iterations; this will default to N_IT=5.
  • --save-steps SAVE_STEPS or -s SAVE_STEPS to set the number of iterations after which the model is saved; this will default to SAVE_STEPS=500.
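
As a concrete (hypothetical) usage sketch, the following Python snippet writes a tiny DOCUMENTS file and starts training with two of the parameters above; the corpus, file names, and parameter values are invented for illustration:

```python
import subprocess
import sys

# Toy corpus: one document (space-separated tokens) per line.
docs = [
    "cats purr softly at night",
    "dogs bark at the mailman",
    "stock markets fell sharply today",
]
with open("documents.txt", "w") as f:
    f.write("\n".join(docs) + "\n")

# Equivalent to running: script/train.py documents.txt model -k 10 -n 100
subprocess.run(
    [sys.executable, "script/train.py", "documents.txt", "model",
     "--num-topics", "10", "--num-iterations", "100"],
    check=True,
)
```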

After training, the directory MODEL will contain the following (see the reading sketch after this list):

  • a file vocab.txt with lines ID TOKEN that encode the documents' tokens as integer IDs
  • a file topics.csv with tab-separated columns topic, prob_topic and top_words, where topic is a topic's ID z (\in [0..K-1]), prob_topic is P(z), and top_words is a comma-separated list of at most 10 tokens w with the highest values of P(w|z), i.e. the topic's highest-probability tokens
  • a directory vectors/ that holds the actual model data, i.e. the values of P(z) and P(w|z) needed for topic inference
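
The sketch below shows one way these files could be read back, assuming the model was written to a directory named model and that vocab.txt and topics.csv follow the layout described above (whether topics.csv has a header row is not specified here; the sketch assumes it does not):

```python
# Read the vocabulary: one "ID TOKEN" pair per line.
vocab = {}
with open("model/vocab.txt") as f:
    for line in f:
        token_id, token = line.split(maxsplit=1)
        vocab[int(token_id)] = token.strip()

# Read the topic summary: tab-separated topic, prob_topic, top_words.
with open("model/topics.csv") as f:
    for line in f:
        topic, prob_topic, top_words = line.rstrip("\n").split("\t")
        print(f"topic {topic}: P(z) = {prob_topic}, top words: {top_words}")
```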

Topic Inference, i.e. P(z|d)

This fork provides a Python class BTMInferrer in script/infer.py with an interface for fast topic inference on single documents; it can easily be reimplemented analogously in other programming languages.

Here, an instance i of BTMInferrer is initialized with the model's directory (see the section Topic learning). A single document's topic vector can then be inferred by calling i.infer(document), which returns a list of K floats representing the K-dimensional vector P(z|d).
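
A minimal usage sketch based on the interface described above (the model directory and document text are hypothetical, and script/ is assumed to be importable, e.g. on PYTHONPATH):

```python
from infer import BTMInferrer   # i.e. script/infer.py

i = BTMInferrer("model")                              # directory created by training
p_z_given_d = i.infer("cats purr softly at night")    # list of K floats, i.e. P(z|d)
print(len(p_z_given_d), sum(p_z_given_d))             # K, and the sum of the probabilities
```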

Notable changes from the original repository

  • the existing Makefile was revised for efficiency and to separate build artifacts from the source tree
  • the existing scripts were recreated to increase efficiency, adaptability, and ease of use
  • the existing C++ code was formatted according to the LLVM coding standards, and dynamic inference (through stdin/stdout) was added while making minimal changes and retaining all previous functionality
  • the original project's sample data has been removed to decrease the repository's size (once GitHub prunes expired refs)
