BC CSCI339 (Natural Language Processing) Final Project
A Twitter sentiment analysis web app written with Flask.
Note: deployment to Heroku depends on scikit-learn and the numpy/scipy stack, which is tricky to run on Heroku. We rely on @thenovices' custom Heroku/scipy buildpack, which can be set with
heroku config:set BUILDPACK_URL=https://github.com/thenovices/heroku-buildpack-scipy
This is a Heroku app with gunicorn as the web server, but the standalone app can be run on localhost with

python twittersa.py

or

foreman start

It requires Flask, Tweepy, and scikit-learn plus their dependencies, which can be installed with

pip install -r requirements.txt
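
For the foreman and Heroku/gunicorn setup, a minimal Procfile might look like the line below. This is only a sketch: it assumes the Flask application object in twittersa.py is named app, and the repository's actual Procfile may differ.

web: gunicorn twittersa:app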
Twittersa requires application-level authentication from a registered Twitter application, and thus requires valid Consumer Key and Consumer Secret API keys from http://dev.twitter.com/apps. These keys must be set as the environment variables CONSUMER_KEY and CONSUMER_SECRET (e.g. with export), or placed in .env when running with foreman or Heroku.
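
For example, a .env file for foreman holds the keys as plain KEY=value lines (placeholder values shown, not real credentials):

CONSUMER_KEY=your-consumer-key
CONSUMER_SECRET=your-consumer-secret

On Heroku, the same values can be set as config vars with heroku config:set.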
sentiment/classifiers.py includes a command-line interface to facilitate testing various Naive Bayes classifiers with different data sets and feature extraction techniques. Usage should be fairly self-explanatory from the help output:

python sentiment/classifiers.py --help
# MultinomialNB, tested with random sampling 5 times, unigrams + bigrams,
# and TF-IDF weighted transformation. Accuracy is printed as an average
# of the 5 samples.
python sentiment/classifiers.py -N 5 -n 2 25000 --tfidf -c multinomial
# BernoulliNB with unigrams and bigrams, 0 variance threshold removal,
# serialization of the classifier, and an interactive REPL for
# classification after training on the 25000 tweet data set
python sentiment/classifiers.py -n 2 25000 -vpr
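
For orientation, the first example above corresponds roughly to the scikit-learn setup sketched below. This is an illustration with placeholder data and hypothetical variable names, not the repository's actual code:

# Sketch of the first example: unigrams + bigrams, TF-IDF weighting,
# MultinomialNB, accuracy averaged over 5 random train/test samples.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Placeholder tweets and labels; the real script loads a 25000-tweet data set.
texts = ["great phone", "love it", "terrible service", "worst app ever"]
labels = ["positive", "positive", "negative", "negative"]

pipeline = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("tfidf", TfidfTransformer()),                   # TF-IDF weighting
    ("clf", MultinomialNB()),
])

cv = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)  # 5 random samples
scores = cross_val_score(pipeline, texts, labels, cv=cv)
print("average accuracy:", scores.mean())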
Currently, the global variables in the script prefixed with PROD_ are automatically selected by Twittersa to serve as the classifier backing the web application.
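
Illustratively, the convention looks something like the following; the names and values here are hypothetical, not the script's actual globals:

# Hypothetical example of the PROD_ convention in sentiment/classifiers.py;
# the real variable names and values may differ.
PROD_CLASSIFIER = "multinomial"  # classifier type backing the web app
PROD_NGRAM_ORDER = 2             # unigrams + bigrams
PROD_USE_TFIDF = True            # apply TF-IDF weighting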
The tests can be run with

python tests.py
The util/ directory contains helper scripts for related tasks. NB: these are intended to be run from the repository home directory, e.g. python util/noslang_parser.py. I should probably make this a module?
noslang_parser.py
- Parses and serializes abbreviations from the NoSlang dictionary
semeval/
- Contains Tweet corpora from the SemEval 2013 (?) classification task
- To download, use tweet_download.py
tweet_download.py
- Downloads the Tweets referenced in the SemEval .tsv files by scraping their URLs.
pickle_corpus.py
- Grabs the training .csv files specified in corpora/, parses them, removes everything but sentiment and text, and serializes them in lib/.
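
For illustration, a corpus serialized by pickle_corpus.py could be loaded back along the lines below; the file name and the exact layout of the data in lib/ are assumptions:

# Sketch only: load a serialized corpus of (sentiment, text) pairs from lib/.
# The actual file name and structure produced by pickle_corpus.py may differ.
import pickle

with open("lib/corpus.pkl", "rb") as f:
    corpus = pickle.load(f)

for sentiment, text in corpus[:5]:
    print(sentiment, text)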