Toxic Comment Classification

This project builds a classification model to identify and classify toxic comments on Wikipedia. The project was done by Qimo Li, Chen He and Kun Qiu for the course "Introduction to Natural Language Processing in Python" at Brandeis University.

This project is based on the Kaggle Toxic Comment Classification Challenge. The dataset is included as train.csv. For more information about the challenge, see https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

This project contains the following files:

The dataclean.py script removes punctuation, numbers, and abbreviations.
The tokenizer.py script defines a Tokenizer class that tokenizes raw text and sentences.
The pos_tagger.py script creates a part-of-speech (POS) tagger using a Random Forest model.
The lemmatizer.py script creates a WordNet-based lemmatizer.
The vector.py script vectorizes the comments based on the given vocabulary, calculates the term frequency (tf) values, and returns ndarrays.
The vocabulary.py script creates a vocabulary from the dataset with stop words removed. The Vocabulary class contains a method for computing the inverse document frequency (idf) value for each word in the vocabulary.
The visualization.ipynb notebook creates several basic exploratory data visualizations and an interactive HTML visualization using the scattertext library.
The visualization folder contains the HTML visualization created by visualization.ipynb. The resulting screenshots of the HTML visualization are in the visualization_results_screenshots folder.
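
To make the cleaning, vocabulary, and tf-idf steps above more concrete, here is a minimal sketch using only the standard library. It is not the code in dataclean.py, vocabulary.py, or vector.py: the function names, the tiny stop-word list, the idf formula log(N / df), and the length-normalized tf are all assumptions chosen for illustration, and it returns plain lists where the project returns ndarrays.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list; the project may use a different one.
STOP_WORDS = {"a", "an", "the", "is", "are", "you", "this", "and", "to", "of"}


def clean(text):
    """Lowercase the text and replace everything except letters and whitespace."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenize(text):
    """Split cleaned text on whitespace and drop stop words."""
    return [t for t in clean(text).split() if t not in STOP_WORDS]


def build_idf(docs):
    """Assumed formula: idf(w) = log(N / df(w)), df(w) = number of docs containing w."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))  # count each word once per document
    return {word: math.log(n_docs / count) for word, count in df.items()}


def tfidf_vector(doc, vocab, idf):
    """Assumed tf: raw count divided by the number of tokens in the document."""
    tokens = tokenize(doc)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return [counts[w] / total * idf[w] for w in vocab]


docs = [
    "You are an idiot!!!",
    "Thanks for the edit, this is great.",
    "Great edit 123",
]
idf = build_idf(docs)
vocab = sorted(idf)  # fixed word order shared by all vectors
vectors = [tfidf_vector(d, vocab, idf) for d in docs]
```

Each entry of `vectors` has one value per vocabulary word, so the rows can be stacked into a matrix and passed to a classifier.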

The entry point of the project is the following file:

The main.py script is the core of the project. It loads the dataset, cleans and parses the comments, vectorizes the text, encodes tf-idf values, splits the data for cross-validation, and builds an ensemble classification model consisting of six logistic regression models. The main.py script also contains methods for splitting training and testing data, mapping tag sets, and evaluating performance.
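
As a rough illustration of an ensemble of six logistic regression models, the sketch below fits one binary sklearn `LogisticRegression` per label. The one-model-per-label arrangement is an assumption based on the six label columns in the Kaggle data (toxic, severe_toxic, obscene, threat, insult, identity_hate); main.py may combine its models differently. The helper names and the synthetic feature matrix, which stands in for real tf-idf vectors, are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The six label columns of the Kaggle challenge's train.csv.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def fit_per_label(X, Y):
    """Fit one independent binary logistic regression per label column (assumed design)."""
    models = {}
    for j, label in enumerate(LABELS):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, Y[:, j])
        models[label] = clf
    return models


def predict_proba_per_label(models, X):
    """Return an (n_samples, 6) array of positive-class probabilities, one column per label."""
    return np.column_stack(
        [models[label].predict_proba(X)[:, 1] for label in LABELS]
    )


# Synthetic stand-in for tf-idf features: 60 "comments" with 6 features each.
rng = np.random.default_rng(0)
X = rng.random((60, 6))
# Toy labels: label j is positive when feature j is large.
Y = (X > 0.5).astype(int)

# Hold out the last 20 rows for prediction.
models = fit_per_label(X[:40], Y[:40])
probs = predict_proba_per_label(models, X[40:])
```

Because each label gets its own model, a comment can receive several labels at once or none at all, which matches the multi-label format of the challenge.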

This project requires the following libraries:

re
nltk
pandas
numpy
os
joblib
sklearn
scattertext
IPython
spacy

For motivation, implementation details, and results, see the project report in report.pdf.

How to run our code:

$ python main.py
