This repository has been archived by the owner on Sep 30, 2019. It is now read-only.

fpaupier/spam_classifier

Spam classifier

In this project I implement a simple spam classifier in Matlab/Octave.

Implementing a spam classifier using Support Vector Machines

SVMs can often learn complex decision boundaries more effectively than classic logistic regression.
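To make the idea concrete, here is a minimal sketch of training a linear SVM by batch subgradient descent on the regularized hinge loss. It is not the training routine used in this project: the toy data, the linear kernel, and the values of lambda and alpha are all assumptions chosen for illustration.

```matlab
% Toy 2-D training set (made up for illustration): one row per example.
X = [1 1; 2 1; 1 2; 4 4; 5 4; 4 5];
y = [-1; -1; -1; 1; 1; 1];        % labels in {-1, +1}

lambda = 0.01;                    % L2 regularization strength (arbitrary choice)
alpha  = 0.01;                    % learning rate (arbitrary choice)
w = zeros(size(X, 2), 1);         % weight vector
b = 0;                            % bias term
m = numel(y);

for iter = 1:10000
  margins = y .* (X * w + b);
  viol = margins < 1;             % examples inside the margin or misclassified
  % Subgradient of (lambda/2)*||w||^2 + mean(max(0, 1 - margins)).
  grad_w = lambda * w - X(viol, :)' * y(viol) / m;
  grad_b = -sum(y(viol)) / m;
  w = w - alpha * grad_w;
  b = b - alpha * grad_b;
end

% Predict with the sign of the decision function and measure training accuracy.
pred = sign(X * w + b);
accuracy = mean(pred == y);
fprintf('Training accuracy: %.2f\n', accuracy);
```

For a spam classifier, X would hold one binary feature vector per email and y would mark each email as spam or not spam. A non-linear kernel would be needed to learn boundaries that are not linear in the features.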

Steps

  1. Normalize the email by reducing each word to its stem (for example, hosting, host, and hosted are all reduced to 'host').
  2. Each stem corresponds to an entry in a vocabulary file, where each word is linked to an id.
  3. For each email, build a feature vector with one row per word in the vocabulary file: the i-th row is 1 if the i-th vocabulary word is present in the email, and 0 otherwise.
  4. Feed a training set of feature vectors to the SVM.
  5. Evaluate the classifier's performance on a test set.
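Steps 2 and 3 above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the five-word vocabulary is a hypothetical stand-in for the vocabulary file, and the sample text is assumed to be already stemmed.

```matlab
% Hypothetical toy vocabulary; the real project reads it from a vocabulary file.
vocab = {'click', 'free', 'host', 'money', 'now'};

% Sample email text, assumed to be already stemmed (e.g. "hosting" -> "host").
email = 'Free host for you! Click now.';

% Lower-case, replace every run of non-letters with a space, split into words.
cleaned = regexprep(lower(email), '[^a-z]+', ' ');
words = strsplit(strtrim(cleaned), ' ');

% Look up each word in the vocabulary; idx is 0 for words that are not in it.
[found, idx] = ismember(words, vocab);
word_indices = idx(found);

% Binary feature vector: row i is 1 if vocabulary word i occurs in the email.
x = zeros(numel(vocab), 1);
x(word_indices) = 1;

disp(x');
```

Here 'for' and 'you' are dropped because they are not in the toy vocabulary, and the row for 'money' stays 0 because that word does not occur in the email.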

Run with a sample email

Simply run the spam_classifier.m script and the output will be displayed in the console.

Try with your email

To classify one of your own emails, copy and paste its text content into a file (say, my_email.txt) under the code\samples directory. Then modify line 70 of spam_classifier.m to use your filename:

filename = 'samples/my_email.txt';

Then, run the spam_classifier.m script.
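If the script cannot find your file, a quick check like the one below can help. It is only a sketch: it assumes you run it from the same folder as spam_classifier.m, and it uses the example filename from above.

```matlab
% Example path from above; fullfile picks the right separator for your OS.
filename = fullfile('samples', 'my_email.txt');

if exist(filename, 'file') == 2
  % The file is readable: load it as one string and report its size.
  file_contents = fileread(filename);
  fprintf('Read %d characters from %s\n', numel(file_contents), filename);
else
  % Print the current folder to help spot a wrong working directory.
  fprintf('File not found: %s (current folder: %s)\n', filename, pwd);
end
```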

Further readings

Check out honeypot projects, which try to gather as many spam emails as possible. Such data can be used to build a better vocabulary file or other types of features.

Note

This project was part of Andrew Ng's MOOC on machine learning, which I strongly recommend. This project is no longer updated.