Arabic News Article Classification

Based on: Building TALAA, a Free General and Categorized Arabic Corpus

University of Science and Technology Houari Boumediene, Algiers, Algeria

Corpus

"The TALAA corpus is a voluminous general Arabic corpus, built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words with 15,891,729 tokens contained in 57,827 different articles." [1]

Description of the TALAA corpus [1]:

Feature	Value
Number of articles	57,827
Number of categories	8
Number of words	14,068,407
Number of types	582,531
Number of tokens	15,891,729

The corpus is distributed across 8 categories [1]:

Category	Number of articles
Culture	5322
Economic	8768
Politics	9620
Religion	4526
Society	9744
Sports	9103
World	6344
Other	4400
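As a quick sanity check on the table above, the per-category counts can be summed and turned into shares. This is a minimal sketch that uses only the numbers listed in the table; the variable names are illustrative.

```python
# Article counts per category, copied from the table above.
counts = {
    "Culture": 5322,
    "Economic": 8768,
    "Politics": 9620,
    "Religion": 4526,
    "Society": 9744,
    "Sports": 9103,
    "World": 6344,
    "Other": 4400,
}

# Total number of articles across all categories.
total = sum(counts.values())

# Share of the corpus held by each category.
shares = {name: n / total for name, n in counts.items()}

# Print categories from largest to smallest.
for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<9}{n:>6}  {shares[name]:6.1%}")
print("Total", total)
```

The total matches the 57,827 articles reported for the corpus, and the shares show that the categories are not evenly balanced, which is worth keeping in mind when reading averaged scores.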

Pre-processing

The following data pre-processing steps have been performed:

0. Example:

أمرت السلطات القطرية الأسواق و المراكز التجارية في البلاد برفع و إزالة السلع الواردة من السعودية و البحرين و الإمارات و مصر في الذكرى الأولى لإعلان هذه الدول الحصار عليها.

1. Tokenization

Each collected article was segmented into tokens, using NLTK.

[ أمرت, السلطات, القطرية, الأسواق, و, المراكز, التجارية, في, البلاد, ب, رفع, و, إزالة, السلع, الواردة, من, السعودية, و, البحرين, و, الإمارات, و, مصر, في, الذكرى, الأولى, ل, إعلان, هذه, الدول, الحصار, عليها, . ]
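A minimal sketch of this step, assuming NLTK's regex-based `wordpunct_tokenize` (the README does not say which NLTK tokenizer was used). Note that this tokenizer only splits on whitespace and punctuation; it does not separate attached clitics such as ب and ل, which appear as separate tokens in the output above. The `tokenize` helper name is hypothetical.

```python
import re

try:
    # wordpunct_tokenize is regex-based and needs no downloaded NLTK models.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback using the same pattern, so the sketch also runs without NLTK.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)


def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return wordpunct_tokenize(text)


# Shortened version of the example sentence.
sentence = "أمرت السلطات القطرية الأسواق و المراكز التجارية في البلاد."
tokens = tokenize(sentence)
print(tokens)
```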

2. Removing stopwords

Stopwords were removed from the tokenized text using a complete, reviewed list of 750 stop words.

[ أمرت, السلطات, القطرية, الأسواق, المراكز, التجارية, البلاد, رفع, إزالة, السلع, الواردة, السعودية, البحرين, الإمارات, مصر, الذكرى, الأولى, إعلان, الدول, الحصار ]
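The filtering itself can be sketched as a set lookup. The stop word set below is a small hypothetical subset chosen to cover the example; it is not the 750-word list used in the project, and the punctuation set is likewise an assumption.

```python
# Hypothetical subset for illustration; the project's list has 750 entries.
STOPWORDS = {"و", "في", "من", "هذه", "عليها", "ب", "ل"}

# Punctuation tokens to drop (assumed; adjust to the tokenizer's output).
PUNCTUATION = {".", "،", "؛", "؟", "!"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are neither stop words nor punctuation."""
    return [t for t in tokens if t not in stopwords and t not in PUNCTUATION]


tokens = ["أمرت", "السلطات", "و", "المراكز", "في", "البلاد", "ب", "رفع", "."]
filtered = remove_stopwords(tokens)
print(filtered)
```

Using a set rather than a list keeps each membership test constant-time, which matters when filtering millions of tokens.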

3. Stemming

Each word was stemmed using the Farasa Arabic text processing toolkit.

[ أمر, سلطة, قطر, سوق, مركز, تجاري, بلد, رفع, إزالة, سلعة, وارد, سعودية, بحرين, إمارات, مصر, ذكرى, أول, إعلان, دولة, حصار ]

Dataset

Categories = {الجزائر : Algeria, الثقافة : entertainment, الدين : religion, المجتمع : society, الرياضة : sport, العالم : world}

Machine Learning Models

Several machine learning algorithms were evaluated:

Algorithm	Precision	Recall	F-measure
Decision Tree	0.82	0.84	0.83
SVM (SGD)	0.94	0.94	0.94
Naive Bayes	0.89	0.87	0.88
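The F-measure column can be related to the other two columns through the harmonic mean, F = 2PR / (P + R). The sketch below recomputes it from the reported precision and recall. The README does not say how scores were averaged across categories, so treating each row as a single precision/recall pair is an assumption.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied from the table above.
reported = {
    "Decision Tree": (0.82, 0.84, 0.83),
    "SVM (SGD)": (0.94, 0.94, 0.94),
    "Naive Bayes": (0.89, 0.87, 0.88),
}

for name, (p, r, f) in reported.items():
    print(f"{name:<14} recomputed={f_measure(p, r):.2f} reported={f:.2f}")
```

Under that assumption, the recomputed values agree with the reported ones to two decimal places.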

Evaluation (Confusion matrix)

Confusion matrix for the best model, SVM trained with stochastic gradient descent:
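As a minimal sketch of how such a matrix is tallied, assuming lists of true and predicted labels: rows are true categories and columns are predicted categories. The labels and predictions below are made up for illustration and are not the project's results.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts items of true class
    labels[i] that were predicted as labels[j]."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(t, p)] for p in labels] for t in labels]


# Hypothetical labels and predictions, for illustration only.
labels = ["religion", "society", "sport", "world"]
y_true = ["sport", "sport", "world", "society", "religion", "society"]
y_pred = ["sport", "sport", "world", "world", "religion", "society"]

matrix = confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, matrix):
    print(f"{label:<9}", row)

# Correct predictions lie on the diagonal.
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)
print("accuracy", round(accuracy, 3))
```

Off-diagonal entries show which categories are confused with each other; in this toy data, one "society" article is predicted as "world".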

TODO

Contributing

Credits

Teammate: Fawzi TOUATI
Initial idea and mentor: Pr. Ahmed GUESSOUM
Mentor: Dr. Riadh BELKEBIR
