Skip to content

lorenanicole/python-naive-bayes-spam-classifier

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Basic Naive Bayes Classifier in Python

This approach makes use of pre-labeled data provided by the Kaggle Classroom spam detection challenge.

naive-bayes Python 2 Classifier

Python project code in naive-bayes is written with Python 2.7.

For setup create a virtualenv with the requirements:

virtualenv nbenv
source nbenv/bin/activate
pip install -r pathway/to/naive-baves/requirements.txt

To run the Naive Bayes classifier:

cd naive-bayes
python spam_detector.py

Python 3 Jupyter Notebook

The Python 2.7 project has been ported to Python 3 and can be run in the Jupyter notebook.

First you will want to create a Python3 virtualenv:

pyenv-3.5 python3env  # Update 3.5 with your version of Python 3
source python3env/bin/activate  # Name your env whatever you like!
pip3 install -r requirements.txt 

Then start the notebook!

jupyter notebook

Notes on Python Naive Bayes Implementation

You can have the detector either train and evaluate itself against the training data (using 90% of the pre-labeled data as training and 10% to label) with:

detector.train_and_evaluate()

Or you can train against the entire labeled data set (2500 emails) and classify on the unlabeled data (1827 emails).

detector.train()
detector.classify(1827)  # Number of emails to classify

Ham has a label of 1 while Spam has a label of 0.

How Naive Bayes Implemented

This solution makes use of Python's 2.7 Decimal module, which is used for floating point arithmetic. (Prevents floating point underflow!)

Inside the NaiveBayes#train method each document has common stop words removed using [NLTK](install http://www.nltk.org/install.html). The words have not yet been stemmed as this is a forthcoming feature.

Only the corpus of words are used as selectors to determine if an email is spam or ham.

To prevent words with 0 frequency from miscontruing the results, Laplace smoothing is applied to increment each 0 frequency word to 1.

About

Python 2 and Python 3 naive bayes spam classifier trained with nltk.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published