
An Empirical study on Pre-trained Embeddings and Language Models for Bot Detection.

Motivation

While word embeddings are learnt from large corpora, their use in neural models to solve specific tasks is limited to the input layer. In practice, a task-specific neural model is therefore built almost from scratch: most of the model parameters are initialized randomly and need to be optimized for the task at hand, which requires large amounts of data to produce a high-performance model.

Recent advances in neural language models (BERT or OpenAI GPT) have shown evidence that task-specific architectures are no longer necessary, and that transferring some internal representations (attention blocks) along with shallow feed-forward networks is enough.
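For illustration, below is a minimal sketch of this fine-tuning setup using the Hugging Face transformers library. The model name, example texts, labels, and training settings are placeholders, not the exact configuration used in the paper.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Pretrained encoder plus a shallow, randomly initialized classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 2 classes: bot vs. human

texts = ["example tweet text", "another example tweet"]  # placeholder inputs
labels = torch.tensor([1, 0])                            # placeholder labels

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One fine-tuning step: all parameters, including the pretrained ones,
# are updated with the task loss from the classification head.
model.train()
outputs = model(**batch, labels=labels)
outputs.loss.backward()
optimizer.step()
```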

In (Garcia-Silva et al., 2019) we presented an experimental study on the use of word embeddings as input to CNN and Bi-LSTM architectures to tackle the bot detection task, and compared these results with fine-tuning pre-trained language models.

Evaluation results, presented in the figure below, show that fine-tuning language models yields overall better results than training task-specific neural architectures fed with a mixture of: i) pre-trained contextualized word embeddings (ELMo), ii) pre-trained context-independent word embeddings learnt from Common Crawl (fastText), Twitter (GloVe), and Urban Dictionary (word2vec), plus embeddings optimized by the neural network during the learning process.

Bot detection classification task

Dataset generation

Please find the required code to generate the dataset in dataset generation. Dataset generation can be somewhat time-consuming; if you would like to obtain the datasets directly, please email agarcia@expertsystem.com, cberrio@expertsystem.com, or jmgomez@expertsystem.com.

CNN and BiLSTM experiments

Since there were many experiments in the paper, only the code for those with the best results has been uploaded. Please feel free to ask for the other experiments; a minimal sketch of the Bi-LSTM setup is shown below.
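As a rough companion to the uploaded code, the following is a minimal sketch of a Bi-LSTM classifier fed with a pretrained embedding matrix (e.g., GloVe or fastText vectors), written in PyTorch. The vocabulary size, dimensions, and hyperparameters are illustrative placeholders, not the configuration from the paper.

```python
import numpy as np
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, embedding_matrix, hidden_size=128, num_classes=2):
        super().__init__()
        # Initialize the embedding layer from pretrained vectors;
        # freeze=False also lets the network optimize them during training.
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix), freeze=False)
        self.lstm = nn.LSTM(embedding_matrix.shape[1], hidden_size,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)
        # Concatenate the final forward and backward hidden states.
        features = torch.cat([hidden[-2], hidden[-1]], dim=-1)
        return self.classifier(features)

# Placeholder matrix; in practice each row holds the pretrained vector
# for the corresponding vocabulary entry.
vocab_size, embed_dim = 20000, 200
embedding_matrix = np.random.normal(size=(vocab_size, embed_dim)).astype("float32")

model = BiLSTMClassifier(embedding_matrix)
logits = model(torch.randint(0, vocab_size, (4, 50)))  # batch of 4 token sequences
```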

References

Garcia-Silva, Andres, et al. "An Empirical Study on Pre-trained Embeddings and Language Models for Bot Detection." Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). 2019.

To cite this paper, use the following BibTeX entry:

@inproceedings{garcia-silva-etal-2019-empirical,
    title = "An Empirical Study on Pre-trained Embeddings and Language Models for Bot Detection",
    author = "Garcia-Silva, Andres  and
      Berrio, Cristian  and
      G{\'o}mez-P{\'e}rez, Jos{\'e} Manuel",
    booktitle = "Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4317",
    doi = "10.18653/v1/W19-4317",
    pages = "148--155",
}