Automatic document categorization consists of assigning a category to a text based on the information it contains. We follow several supervised machine learning approaches.

saidziani/Arabic-News-Article-Classification

Arabic News Article Classification

University of Science and Technology Houari Boumediene, Algiers, Algeria


Corpus

"The TALAA corpus is a voluminous general Arabic corpus, built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words with 15,891,729 tokens contained in 57,827 different articles." [1]

Description of the TALAA corpus [1]:

| Feature | Value |
|---|---|
| Number of articles | 57,827 |
| Number of categories | 8 |
| Number of words | 14,068,407 |
| Number of types | 582,531 |
| Number of tokens | 15,891,729 |

The corpus is distributed across 8 categories [1]:

| Category | Number of articles |
|---|---|
| Culture | 5322 |
| Economic | 8768 |
| Politics | 9620 |
| Religion | 4526 |
| Society | 9744 |
| Sports | 9103 |
| World | 6344 |
| Other | 4400 |
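The per-category counts above can be checked against the reported total, and the class balance inspected, with a few lines of Python. This is a minimal sketch that hard-codes the figures from the table; the variable names are illustrative and not part of the repository.

```python
# Article counts per category, copied from the table above.
articles_per_category = {
    "Culture": 5322,
    "Economic": 8768,
    "Politics": 9620,
    "Religion": 4526,
    "Society": 9744,
    "Sports": 9103,
    "World": 6344,
    "Other": 4400,
}

total = sum(articles_per_category.values())

# Share of the corpus held by each category, largest first.
shares = sorted(
    ((name, count / total) for name, count in articles_per_category.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, share in shares:
    print(f"{name:<10} {share:6.1%}")

largest = shares[0][0]
smallest = shares[-1][0]
imbalance_ratio = articles_per_category[largest] / articles_per_category[smallest]
print(f"total={total}, largest/smallest ratio={imbalance_ratio:.2f}")
```

The total matches the 57,827 articles reported for the corpus. The largest category has slightly more than twice as many articles as the smallest, which may be worth keeping in mind when reading aggregate scores.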

Pre-processing

The following data pre-processing steps have been performed:

0. Example:

أمرت السلطات القطرية الأسواق و المراكز التجارية في البلاد برفع و إزالة السلع الواردة من السعودية و البحرين و الإمارات و مصر في الذكرى الأولى لإعلان هذه الدول الحصار عليها.

(English: Qatari authorities ordered markets and shopping centers in the country to pull and remove goods imported from Saudi Arabia, Bahrain, the UAE, and Egypt on the first anniversary of these countries declaring the blockade against it.)

1. Tokenization

Each collected article was segmented into tokens using NLTK.

[ أمرت, السلطات, القطرية, الأسواق, و, المراكز, التجارية, في, البلاد, ب, رفع, و, إزالة, السلع, الواردة, من, السعودية, و, البحرين, و, الإمارات, و, مصر, في, الذكرى, الأولى, ل, إعلان, هذه, الدول, الحصار, عليها, . ]
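
The README names NLTK but does not show the call that was used. As a rough, dependency-free illustration of word-level tokenization, the sketch below splits text into word and punctuation tokens with a regular expression. It is a stand-in, not the NLTK tokenizer, and unlike the output shown above it does not separate attached prefixes such as ب or ل. The names `TOKEN_PATTERN` and `tokenize` are illustrative.

```python
import re

# One token per run of word characters, or per single punctuation mark.
# Python's \w is Unicode-aware for str patterns, so Arabic letters match.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens (no clitic splitting)."""
    return TOKEN_PATTERN.findall(text)

# First clause of the example sentence, with a closing full stop added.
sentence = "أمرت السلطات القطرية الأسواق و المراكز التجارية في البلاد."
tokens = tokenize(sentence)
print(tokens)
```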

2. Removing stopwords

Stopwords were removed from the tokenized text. A complete, reviewed stop word list was used; it contains 750 stop words.

[ أمرت, السلطات, القطرية, الأسواق, المراكز, التجارية, البلاد, رفع, إزالة, السلع, الواردة, السعودية, البحرين, الإمارات, مصر, الذكرى, الأولى, إعلان, الدول, الحصار ]
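
The filtering step itself is a simple set lookup. The sketch below assumes a tiny, hypothetical stop word set chosen to cover the example; it is not the 750-word list mentioned above. It also drops punctuation tokens, as the example output does.

```python
import unicodedata

# Hypothetical subset for illustration only, not the 750-word list.
STOPWORDS = frozenset({"و", "في", "من", "هذه", "عليها", "ب", "ل"})

def is_punctuation(token):
    """True if every character in the token is Unicode punctuation."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep tokens that are neither stop words nor punctuation."""
    return [t for t in tokens if t not in stopwords and not is_punctuation(t)]

tokens = ["أمرت", "السلطات", "القطرية", "الأسواق", "و",
          "المراكز", "التجارية", "في", "البلاد", "."]
cleaned = remove_stopwords(tokens)
print(cleaned)
```

A `frozenset` keeps each membership test constant-time, which matters more once the full list and a corpus of millions of tokens are involved.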

3. Stemming

Each word was stemmed using the Farasa Arabic text processing toolkit.

[ أمر, سلطة, قطر, سوق, مركز, تجاري, بلد, رفع, إزالة, سلعة, وارد, سعودية, بحرين, إمارات, مصر, ذكرى, أول, إعلان, دولة, حصار ]


Dataset

Categories = {الجزائر : Algeria, الثقافة : entertainment, الدين : religion, المجتمع : society, الرياضة : sport, العالم : world}

TALAA Categories
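
Most classifiers expect numeric class labels. The following is a minimal sketch of one way to encode the mapping above; the names `CATEGORIES`, `LABEL_TO_ID`, and `encode` are illustrative assumptions, not taken from the repository.

```python
# Arabic category name -> English label, copied from the mapping above.
CATEGORIES = {
    "الجزائر": "Algeria",
    "الثقافة": "entertainment",
    "الدين": "religion",
    "المجتمع": "society",
    "الرياضة": "sport",
    "العالم": "world",
}

# Stable integer id per English label, in the order listed above.
LABEL_TO_ID = {label: i for i, label in enumerate(CATEGORIES.values())}

def encode(arabic_label):
    """Map an Arabic category name to its integer class id."""
    return LABEL_TO_ID[CATEGORIES[arabic_label]]

for arabic, english in CATEGORIES.items():
    print(encode(arabic), english, arabic)
```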

Machine Learning Models

Several machine learning algorithms were evaluated:

| Algorithm | Precision | Recall | F-measure |
|---|---|---|---|
| Decision Tree | 0.82 | 0.84 | 0.83 |
| SVM (SGD) | 0.94 | 0.94 | 0.94 |
| Naive Bayes | 0.89 | 0.87 | 0.88 |
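
The README does not include the training code. To illustrate what "SVM (SGD)" refers to, here is a minimal pure-Python sketch of a linear SVM trained by stochastic gradient descent on the L2-regularized hinge loss. It is a binary toy example on hypothetical English tokens, not the multi-class model evaluated above; the function names, hyperparameters, and documents are all assumptions made for illustration.

```python
def bag_of_words(docs):
    """Turn token lists into count vectors over a sorted vocabulary."""
    vocab = sorted({token for doc in docs for token in doc})
    index = {token: i for i, token in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vector = [0.0] * len(vocab)
        for token in doc:
            vector[index[token]] += 1.0
        vectors.append(vector)
    return vocab, vectors

def decision(weights, bias, x):
    """Signed score of x under the linear model."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train_linear_svm_sgd(X, y, epochs=50, lr=0.1, reg=0.01):
    """Minimize L2-regularized hinge loss, one example at a time.

    Labels must be +1 or -1. Examples are visited in a fixed order to keep
    the sketch deterministic; practical SGD usually shuffles each epoch.
    """
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * decision(weights, bias, x)
            # Gradient step on the L2 penalty (shrinks the weights).
            weights = [w * (1.0 - lr * reg) for w in weights]
            if margin < 1.0:
                # Hinge loss is active: move toward the correct side.
                weights = [w + lr * label * xi for w, xi in zip(weights, x)]
                bias += lr * label
    return weights, bias

def predict(weights, bias, x):
    return 1 if decision(weights, bias, x) >= 0 else -1

# Hypothetical toy documents (already tokenized); +1 = sport, -1 = economy.
docs = [["goal", "match", "goal"], ["match"], ["market", "price"], ["price"]]
labels = [1, 1, -1, -1]

vocab, X = bag_of_words(docs)
weights, bias = train_linear_svm_sgd(X, labels)
predictions = [predict(weights, bias, x) for x in X]
print(predictions)
```

For several categories, a common extension is to train one such binary model per category (one-vs-rest) and pick the category with the highest score.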

Evaluation (Confusion matrix)

Confusion matrix for the best model, an SVM trained with stochastic gradient descent:

Confusion matrix
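
For readers unfamiliar with the metric, the sketch below shows how a confusion matrix is built from true and predicted labels, and how per-class precision, recall, and F-measure follow from it. The labels and predictions are hypothetical toy values, not results from this project, and the function names are illustrative.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

def per_class_scores(matrix, labels):
    """Return {label: (precision, recall, f_measure)} from a confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        denominator = precision + recall
        f_measure = 2 * precision * recall / denominator if denominator else 0.0
        scores[label] = (precision, recall, f_measure)
    return scores

# Hypothetical toy data for illustration only.
labels = ["religion", "society", "sport"]
y_true = ["sport", "sport", "society", "religion", "society", "sport"]
y_pred = ["sport", "society", "society", "religion", "society", "sport"]

matrix = confusion_matrix(y_true, y_pred, labels)
scores = per_class_scores(matrix, labels)

for row_label, row in zip(labels, matrix):
    print(f"{row_label:<10}", row)
for label, (p, r, f) in scores.items():
    print(f"{label:<10} precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```

Off-diagonal cells show which categories are confused with each other; here one sport article is predicted as society, which lowers recall for sport and precision for society.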

TODO


Contributing


Credits
