VaasuDevanS/Natural-Language-Processing-Assignments

Natural-Language-Processing-Assignments

University of New Brunswick Fall-2018 CS6765: Natural Language Processing

This repository contains the Python code for the Fall term assignments.
The code was developed with Python 2.7 using only built-in modules; numpy and nltk are not used anywhere.
sklearn is used only in Assignment 3, for Logistic Regression.

Getting started

| No | Python file | Usage |
|----|-------------|-------|
| 1 | tokenize.py | python tokenize.py FILE > FILE.tokens |
| 1 | count.py | python count.py FILE.tokens > FILE.freqs |
| 2 | lm.py | python lm.py MODEL TRAIN_FILE TEST_FILE > OUTPUT |
| 2 | perplexity.py | python perplexity.py OUTPUT |
| 3 | classify.py | python classify.py METHOD TRAIN_DOCS TRAIN_CLASSES TEST_DOCS > PREDICTED_CLASSES |
| 3 | score.py | python score.py PREDICTED_CLASSES TRUE_CLASSES |
| 4 | tag.py | python tag.py TRAIN_FILE TEST_FILE METHOD > SYSTEM_OUTPUT |
| 4 | accuracy.py | python accuracy.py TRUE_TAGS SYSTEM_OUTPUT |
| 5 | chatbot.py | python chatbot.py METHOD |

Arguments

| No | Argument | Value or file location (in the individual assignment folder) |
|----|----------|---------------------------------------------------------------|
| 1 | FILE | Data/tweets-en.txt.gz |
| 2 | MODEL | 1 or 2 or interp |
| 2 | TRAIN_FILE | Data/reuters-train.txt |
| 2 | TEST_FILE | Data/reuters-dev.txt |
| 3 | METHOD | baseline or lr or lexicon or nb or nbbin |
| 3 | TRAIN_DOCS | Data/train.docs.txt |
| 3 | TRAIN_CLASSES | Data/train.classes.txt |
| 3 | TEST_DOCS | Data/dev.docs.txt |
| 3 | TRUE_CLASSES | Data/dev.classes.txt |
| 4 | TRAIN_FILE | Data/train.en.txt |
| 4 | TEST_FILE | Data/dev.en.words.txt |
| 4 | METHOD | baseline or hmm |
| 4 | TRUE_TAGS | Data/dev.en.tags.txt |
| 5 | METHOD | overlap or w2v or both |

Assignment 2: - MODEL

  • 1 represents Unigram (with Add-1 smoothing)
  • 2 represents Bigram (with Add-k smoothing)
  • interp represents Interpolated (both Unigram and Bigram)

Assignment 3: - METHOD

  • baseline represents Most-Frequent-Class-Baseline
  • lr represents Logistic Regression (used from sklearn)
  • lexicon represents Sentiment Lexicon containing + and - words
  • nb represents Naive Bayes Model (with add-k smoothing)
  • nbbin represents Binarized Naive Bayes
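
As a rough illustration of the Naive Bayes options, the sketch below trains a multinomial Naive Bayes classifier with add-k smoothing and optionally binarizes word counts per document. It is not the code from classify.py: the function names, the model dictionary layout, the default k, and the decision to skip words unseen in training are assumptions of this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, k=1.0, binarize=False):
    """Count classes and per-class words; docs is a list of token lists."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        if binarize:
            tokens = set(tokens)  # each word counts once per document
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "k": k, "binarize": binarize}


def predict_nb(model, tokens):
    """Return the class with the highest log posterior for one document."""
    if model["binarize"]:
        tokens = set(tokens)
    n_docs = float(sum(model["class_counts"].values()))
    vocab_size = len(model["vocab"])
    k = model["k"]
    best_label, best_score = None, float("-inf")
    for label in sorted(model["class_counts"]):
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(model["class_counts"][label] / n_docs)  # log prior
        for tok in tokens:
            if tok not in model["vocab"]:
                continue  # ignore words never seen in training (assumption)
            score += math.log((counts[tok] + k) / (total + k * vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids numeric underflow when documents are long.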

Assignment 4: - METHOD

  • baseline represents Most-Frequent-Tag-Baseline
  • hmm represents Hidden Markov Model (Bigram with add-k smoothing) and Viterbi Algorithm
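
For readers unfamiliar with the Hidden Markov Model option, here is a minimal sketch of a bigram HMM tagger with add-k smoothing decoded by the Viterbi algorithm. It is not the code from tag.py: the function names, the default k, the extra emission slot for unknown words, and the omission of an end-of-sentence transition are simplifying assumptions.

```python
import math
from collections import Counter


def train_hmm(tagged_sentences):
    """Count tag transitions and tag-to-word emissions from (word, tag) pairs."""
    transitions, emissions, tag_counts = Counter(), Counter(), Counter()
    vocab = set()
    for sentence in tagged_sentences:
        prev = "<s>"
        tag_counts[prev] += 1
        for word, tag in sentence:
            transitions[(prev, tag)] += 1
            emissions[(tag, word)] += 1
            tag_counts[tag] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(t for t in tag_counts if t != "<s>")
    return transitions, emissions, tag_counts, tags, vocab


def viterbi(words, model, k=0.1):
    """Return the most probable tag sequence for a list of words."""
    transitions, emissions, tag_counts, tags, vocab = model
    if not words:
        return []
    n_tags = len(tags)
    n_words = len(vocab) + 1  # +1 slot for unknown words (assumption)

    def log_trans(prev, tag):
        return math.log((transitions[(prev, tag)] + k)
                        / (tag_counts[prev] + k * n_tags))

    def log_emit(tag, word):
        return math.log((emissions[(tag, word)] + k)
                        / (tag_counts[tag] + k * n_words))

    # best[i][t]: best log-probability of any tag path for words[:i+1] ending in t
    best = [dict((t, log_trans("<s>", t) + log_emit(t, words[0])) for t in tags)]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            p = max(tags, key=lambda q: best[i - 1][q] + log_trans(q, t))
            best[i][t] = best[i - 1][p] + log_trans(p, t) + log_emit(t, words[i])
            back[i][t] = p
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path
```

The table has one cell per (position, tag) pair, so decoding costs time proportional to the sentence length times the square of the number of tags.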

Assignment 5: - METHOD

  • overlap represents Chatbot responses chosen by word overlap with the input
  • w2v represents the response with the highest cosine similarity (computed from pre-trained fastText vectors)
  • both represents the responses from both overlap and w2v, with their cosine values
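
The two scoring ideas can be sketched as follows, using only built-in modules. This is not the code from chatbot.py: the function names are hypothetical, representing a sentence as the average of its word vectors is an assumption, and the `vectors` argument stands in for a dictionary loaded from pre-trained fastText vectors.

```python
import math


def overlap_score(query_tokens, candidate_tokens):
    """Number of distinct words shared by the query and a candidate."""
    return len(set(query_tokens) & set(candidate_tokens))


def average_vector(tokens, vectors):
    """Average the vectors of known tokens; None if no token has a vector."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / float(len(known)) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_response(query_tokens, candidates, vectors):
    """Return (candidate, similarity) for the candidate closest to the query."""
    query_vec = average_vector(query_tokens, vectors)
    best, best_sim = None, -1.0
    if query_vec is None:
        return best, best_sim
    for candidate in candidates:
        cand_vec = average_vector(candidate, vectors)
        if cand_vec is None:
            continue  # no known words, so no similarity can be computed
        sim = cosine(query_vec, cand_vec)
        if sim > best_sim:
            best, best_sim = candidate, sim
    return best, best_sim
```

Word overlap only rewards exact matches, while the vector-based score can also match related words, which is one reason to report both.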