Readme

This repository contains the code and data needed to replicate the experiments in the paper "Hate Speech Detection is Not as Easy as You May Think: A Closer Look at Model Validation", accepted as a full paper at SIGIR 2019.

In the paper we focus on the experimental methodology used by two state-of-the-art methods, presented by Badjatiya et al. [2] and by Agrawal and Awekar [3]. We analyze the methodology implemented in these works and how well the methods generalize to other datasets. Our results reveal methodological problems and a relevant dataset bias. As a consequence, the performance claims of the current state of the art are overestimated. The issues we encountered are mostly due to overfitting and sampling problems. We analyze the implications for current research and rerun the experiments to give a more accurate picture of the current state-of-the-art methods.

Below you will find specific instructions on how to download and preprocess the data that we used, as well as how to run the experiments to reproduce our results.

Datasets

In this work we use three different datasets. The first is the English training dataset of SemEval 2019 Task 5 [5]. Please download the dataset from here and unzip it into Data/.
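The unzip step can also be scripted. The following is a minimal sketch, assuming the SemEval download is a regular .zip archive; the helper name `unzip_into` is hypothetical and not part of this repository.

```python
import zipfile
from pathlib import Path


def unzip_into(archive_path, target_dir="Data"):
    """Extract a .zip archive into target_dir and return the archived names."""
    target = Path(target_dir)
    # Create Data/ (or any other target folder) if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
        return sorted(archive.namelist())
```

Calling `unzip_into("path/to/downloaded_archive.zip")` extracts into Data/ relative to the current working directory, so it should be run from the repository root.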

The second dataset is the one constructed by Waseem and Hovy [1]. They made the tweet IDs and labels public, and we recovered the tweets using the Twitter API.

A third dataset was constructed by us, using part of the Waseem and Hovy dataset and part of the dataset described in Davidson et al. [4]. To get the second and third datasets ready for use, run the following command:

$ python Download_Data.py -ct 'consumer_token' -cs 'consumer_secret' -at 'access_token' -ats 'access_token_secret'

Here consumer_token, consumer_secret, access_token and access_token_secret are your credentials for the Twitter API.
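To avoid typing the credentials on the command line, the command can be assembled from environment variables. This is a minimal sketch: the four flags come from the command above, while the environment variable names and the helper `build_download_command` are assumptions made for illustration.

```python
import os

# Flags are the ones accepted by Download_Data.py as shown above.
# The environment variable names are hypothetical.
FLAG_TO_ENV = {
    "-ct": "TWITTER_CONSUMER_TOKEN",
    "-cs": "TWITTER_CONSUMER_SECRET",
    "-at": "TWITTER_ACCESS_TOKEN",
    "-ats": "TWITTER_ACCESS_TOKEN_SECRET",
}


def build_download_command(env=None):
    """Return the argument list for Download_Data.py using credentials from env."""
    env = os.environ if env is None else env
    missing = [name for name in FLAG_TO_ENV.values() if not env.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    command = ["python", "Download_Data.py"]
    for flag, name in FLAG_TO_ENV.items():
        command += [flag, env[name]]
    return command
```

The resulting list can be passed to `subprocess.run(build_download_command(), check=True)` from the repository root.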

Word vectors

For the experiments we also need word embeddings for initialization. Please download the following vectors and unzip them into the Vectors/ folder.

GloVe

Sentiment Specific word embeddings (SSWE)
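Once the vectors are unzipped, a quick way to check a file is to load it into a dictionary. The sketch below assumes the plain-text layout used by the GloVe distribution files (one word per line followed by its whitespace-separated components); the function name `load_text_vectors` is hypothetical, and the experiment scripts may load the vectors differently.

```python
def load_text_vectors(path):
    """Read 'word v1 v2 ...' lines into a {word: [float, ...]} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.split()
            # Skip blank lines and lines that carry no vector components.
            if len(parts) < 2:
                continue
            vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors
```

All vectors in one file should have the same length, which is an easy sanity check after loading.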

Requirements

Keras
Theano
Gensim
xgboost
NLTK
Sklearn
Numpy
Tflearn
Tensorflow
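Before running the experiments, it can help to check that these packages are importable. This is a minimal sketch; the import names in `REQUIRED_MODULES` are assumed from the package list above (for example, Sklearn is imported as `sklearn`), and no versions are checked.

```python
import importlib.util

# Import names assumed for the packages listed above.
REQUIRED_MODULES = [
    "keras", "theano", "gensim", "xgboost", "nltk",
    "sklearn", "numpy", "tflearn", "tensorflow",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the names in `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list from `missing_modules()` means every listed package was found.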

Instructions to run

After running the previous command and downloading and unzipping the SemEval dataset, you can run the experiments with the following command:

$ python ModelX_Experiments/Experiment_X.py

For example, to run Experiment 1 of our paper with the models of Badjatiya et al., run:

$ python Model1_Experiments/Experiment_1.py

To run Experiment 1 with the models of Agrawal and Awekar, run:

$ python Model2_Experiments/Experiment_1.py
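The naming pattern of the commands above can be captured in a small helper. This is a sketch only: the function `experiment_command` is hypothetical, and it does not check that a given experiment number exists in the repository.

```python
# Model numbers follow the examples above:
# 1 = models of Badjatiya et al., 2 = models of Agrawal and Awekar.
MODELS = {1: "Badjatiya et al.", 2: "Agrawal and Awekar"}


def experiment_command(model, experiment):
    """Build the argument list for ModelX_Experiments/Experiment_X.py."""
    if model not in MODELS:
        raise ValueError("model must be one of {}".format(sorted(MODELS)))
    script = "Model{}_Experiments/Experiment_{}.py".format(model, experiment)
    return ["python", script]
```

For example, `subprocess.run(experiment_command(2, 1), check=True)` runs the same script as the last command above.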

References

[1] Z. Waseem, D. Hovy. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter.

[2] P. Badjatiya, S. Gupta, M. Gupta, V. Varma. Deep learning for hate speech detection in tweets.

[3] S. Agrawal, A. Awekar. Deep learning for detecting cyberbullying across multiple social media platforms.

[4] T. Davidson, D. Warmsley, W. Macy, I. Weber. Automated Hate Speech Detection and the Problem of Offensive Language.

[5] V. Valerio, C. Bosco, V. Patti, I. Weber, M. Sanguinetti, E. Fersini, D. Nozza, F. Rangel, P. Rosso. Shared Task on Multilingual Detection of Hate.

Citation

To cite our work please use the following:

@inproceedings{APP19,
  title={Hate Speech Detection is Not as Easy as You May Think: A Closer Look at Model Validation},
  author={Ayme Arango and Jorge P\'erez and Barbara Poblete},
  booktitle={To appear in SIGIR 2019},
}
