Build a Spam Filter

In this lesson, we will try to build a spam filter using the Enron email dataset.

Pre-Requisites:

Introduction to Natural Language Processing with NLTK

Intro to NLTK, Part 2

Build a sentiment analysis program

Analysing the Enron Email Corpus

This is the first guided practice session I’m trying. The aim is: I’ll give you hints on how to complete each challenge, the same as in the practice sessions, but I’ll also give you the solution in the next video.

You can just watch the solution videos if you want, but I recommend trying to solve each problem yourself first. If you have completed all the pre-requisites, the challenges should be easy.

The repo is here. Enron Spam Practice.ipynb is the file you will be working in, while Enron Spam Solution.ipynb contains the solutions.

Challenge 1: Print all the directories and files

The first video introduces the Enron Spam dataset. Get it from here: http://www.aueb.gr/users/ion/data/enron-spam/

Extract everything into a folder called Enron Spam. You will need 7-Zip if you are on Windows. Spend some time studying how the emails are laid out. All the spam emails are in a folder called spam, while the non-spam emails are in a folder called ham.

The first challenge is simply to print all the directories, subdirectories and files in the folder. This should be quite simple, as it’s the first thing we did in the Enron example.

Solution 1

The solution is quite simple.
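The solution code isn’t reproduced in this write-up, but it presumably used os.walk(), something like this sketch (the folder name is the one suggested above):

```python
import os

# Walk the extracted dataset and print every directory,
# sub-directory and file inside it
for root, dirs, files in os.walk("Enron Spam"):
    print("Directory:", root)
    for name in files:
        print("    File:", name)
```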

Challenge 2: Print only the files in the Ham and Spam folders

The video for this has been merged with the previous solution video (from 3:55 onwards).

The challenge is: Instead of printing all files and folders, only print the files when we are in the ham or spam folder.

I’ll give you a hint: os.path.split() can be used to find out which directory you are in. Like so:
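The hint code isn’t reproduced in this write-up; os.path.split() behaves like this (the paths are illustrative):

```python
import os

# os.path.split() splits off the last path component, so the
# second element of the tuple is the current folder's name
print(os.path.split("Enron Spam/enron1"))        # ('Enron Spam', 'enron1')
print(os.path.split("Enron Spam/enron1/spam"))   # ('Enron Spam/enron1', 'spam')

# third example: detecting that we are in the ham folder
root = "Enron Spam/enron1/ham"
print(os.path.split(root)[1] == "ham")           # True
```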

The third example above shows you how to detect you are in the ham folder.

Solution 2

We take the code we had before and modify it so that we check whether we are in the ham or spam folder.

The key part is this:
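The snippet itself isn’t shown in this write-up; it presumably looked something like this sketch:

```python
import os

for root, dirs, files in os.walk("Enron Spam"):
    # the folder we are currently in is the last path component
    current_folder = os.path.split(root)[1]
    # only print files when that folder is ham or spam
    if current_folder == "ham" or current_folder == "spam":
        for name in files:
            print(name)
```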

This is where we are checking if we are in the right folder.

Why does it matter?

Because when we start reading the files, we want to make sure we are only reading them from the spam and ham folders.

Challenge 3: Read all the files in the Ham and Spam folders

Now that you can print the files in the spam and ham folders, it’s time to go ahead and read all the files in those folders.

I’ve given you some code to start off:
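The starter code isn’t reproduced here; it probably looked roughly like this (folder name assumed):

```python
import os

ham_list = []
spam_list = []

for root, dirs, files in os.walk("Enron Spam"):
    if os.path.split(root)[1] == "ham":
        # your code: read each file and append its text to ham_list
        pass
    if os.path.split(root)[1] == "spam":
        # your code: read each file and append its text to spam_list
        pass
```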

There is a ham_list and spam_list, and you have to store the content of the files in these lists.

A warning: the spam files will throw a Unicode error when you try to read them. You can either Google for the solution or look at mine. A hint: it’s a Unicode problem (duh) that only appears on Python 3, not Python 2.

But I still recommend you give the challenge a try. Feel free to comment out the spam part of the code, as the two parts are similar anyway.

Solution 3

Let’s look at the code in parts. First, the code to read the ham files:
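That code isn’t reproduced in this write-up; a sketch of what it likely looked like (note there’s no encoding argument yet, which is what causes the error discussed below):

```python
import os

ham_list = []
spam_list = []

for root, dirs, files in os.walk("Enron Spam"):
    if os.path.split(root)[1] == "ham":
        for name in files:
            # open each ham file, read the text, store it in the list
            with open(os.path.join(root, name)) as f:
                ham_list.append(f.read())
    if os.path.split(root)[1] == "spam":
        for name in files:
            with open(os.path.join(root, name)) as f:
                spam_list.append(f.read())
```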

It’s the same walk as before; the new part is opening and reading each file.

We loop over all the files, open each, read the text and append it to the ham_list. And we do the same for the spam list.

Run the code, and you will see the Unicode error. Why is that?

By putting in some extra debug, I found a few of the files throwing errors. One is 2248.2004-09-23.GP.spam.txt in the Enron1/spam folder. Open this in a text editor:

(screenshot: 2248.2004-09-23.GP.spam.txt opened in a text editor, showing special characters)

You will note the file has special characters in it. This means it’s not pure text (which is not that surprising, as it is spam).

Python 2 would allow you to get away with this. Python 3 won’t.

The way around this is to specify the encoding. Here’s a page with some details: http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

The link explains the Unicode error you get. It also tells you that you have to specify the encoding. There are several options, but the simplest one is latin-1. This is closest to what Python 2 did.

With latin-1, every byte maps to some character, so Python can always decode the text without throwing an error. The downside is that the data might be misread. However, this is spam, so we expect bad data. Also, we’ll have millions of words in the spam list, and even if one or two are corrupt, we can live with that.

So the way to stop Python throwing an error is to change this line to add an encoding:
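The changed line isn’t shown here; the fix is simply to pass encoding="latin-1" to open(). Below is a small demonstration of why latin-1 never fails (the byte value is just an example):

```python
# The open() call gains an encoding argument, roughly:
#     with open(os.path.join(root, name), encoding="latin-1") as f:
#         text = f.read()

# latin-1 maps all 256 byte values to a character, so decoding can
# never raise an error; UTF-8 (often Python 3's default) can.
raw = b"Best \xae offer"        # 0xae is not valid UTF-8 on its own
print(raw.decode("latin-1"))    # works; 0xae becomes the character ®
```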

The code now works.

Challenge 4: Prepare our data for the Naive Bayes filter

We will be using the Naive Bayes classifier for our spam filter. If you have done the NLTK lessons, you know it expects the input in a particular format. That’s the goal of this challenge. I’ll give you a few hints:

So the first thing is to write a create_word_features() function. The second is to use it (with word_tokenize):
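The hint code isn’t included in this write-up; roughly, the skeleton looks like this (str.split() stands in for NLTK’s word_tokenize so the sketch runs without the punkt data, and the sample text is made up):

```python
def create_word_features(words):
    # your challenge: return a dictionary mapping each word to True
    pass

text = "Congratulations , you have won a free prize"
words = text.split()   # the lesson: words = word_tokenize(text)
features = create_word_features(words)
print(words[:3])       # ['Congratulations', ',', 'you']
```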

Solution 4

Since this stuff has been covered in the NLTK tutorial, I will zip through it a bit.

First, the create_word_features()  function:
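The function isn’t reproduced above; based on the description, it’s essentially:

```python
def create_word_features(words):
    # map every word to True; this {word: True} dictionary is the
    # input format NLTK's NaiveBayesClassifier expects
    return dict((word, True) for word in words)

print(create_word_features(["free", "prize", "free"]))
# {'free': True, 'prize': True}
```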

We are just creating a dictionary that maps each word to True. I’m not removing stop words for this example.

Next, we use our function. I’m going to be working with the code from the previous example. Last time, we were reading the data and appending the raw text to ham_list. That will now change: instead of the raw text, we append word features plus a label.

So we break the text into words with word_tokenize(), and then call the create_word_features() function we wrote.

We are appending a “ham” at the end. This is to tell the machine learning algorithm that this text is of type ham. The actual word doesn’t matter, as long as it’s consistent.

We will do the same for the spam list, so the final code looks like:
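That final code isn’t reproduced in this write-up; pieced together from the steps above, it would look roughly like this (str.split() stands in for NLTK’s word_tokenize, and the folder name is the one suggested earlier):

```python
import os

def create_word_features(words):
    return dict((word, True) for word in words)

ham_list = []
spam_list = []

for root, dirs, files in os.walk("Enron Spam"):
    folder = os.path.split(root)[1]
    if folder == "ham":
        for name in files:
            with open(os.path.join(root, name), encoding="latin-1") as f:
                text = f.read()
            words = text.split()   # the lesson: word_tokenize(text)
            # the "ham" label tells the classifier this email's type
            ham_list.append((create_word_features(words), "ham"))
    if folder == "spam":
        for name in files:
            with open(os.path.join(root, name), encoding="latin-1") as f:
                text = f.read()
            words = text.split()
            spam_list.append((create_word_features(words), "spam"))
```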

This looks fairly complex, but since we built it in parts, it should be easy to follow.

Challenge 5: Create the test/train data and call the Naive Bayes filter

We are reaching the final steps. The next challenge is to create the test/train samples. I give you some code:
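The given code isn’t shown in this write-up; it amounts to the following (with tiny stand-in lists for the (features, label) pairs built in Challenge 4, so the sketch runs on its own):

```python
import random

# stand-ins for the (features, label) pairs built in Challenge 4
ham_list = [({"meeting": True}, "ham"), ({"report": True}, "ham")]
spam_list = [({"free": True}, "spam"), ({"prize": True}, "spam")]

combined = ham_list + spam_list
random.shuffle(combined)   # randomise the order before splitting
print(len(combined))       # 4
```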

We combine our spam and ham list, and then shuffle it, so that it is randomised.

The first thing you need to do is create the test/train splits. Then call the Naive Bayes classifier and find the accuracy.

Solution 5

We start by calculating what 70% of the data will be, converting it with int() so we get a whole number.

We then divide our samples into test and training sets, create the Naive Bayes classifier with the training data, and test it on the test data:
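The solution code isn’t reproduced here; a self-contained sketch of those steps (it assumes NLTK is installed, and uses tiny stand-in data in place of the real `combined` list from the emails):

```python
import random
import nltk  # assumes NLTK is installed

# tiny stand-in data; the real `combined` list comes from the emails
ham = [({"meeting": True, "report": True}, "ham") for _ in range(5)]
spam = [({"free": True, "prize": True}, "spam") for _ in range(5)]
combined = ham + spam
random.shuffle(combined)

training_part = int(len(combined) * 0.7)   # 70% of the data for training

train_set = combined[:training_part]
test_set = combined[training_part:]

# train the classifier, then measure accuracy on the held-out data
classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.util.accuracy(classifier, test_set)
print(accuracy * 100)
```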

We are getting an accuracy of 98%, which is good.

We will also look at the most informative features:
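The call isn’t shown above; NLTK exposes this as show_most_informative_features(). A self-contained sketch (with a tiny stand-in training set in place of the real email data):

```python
import nltk  # assumes NLTK is installed

# tiny stand-in training set; the real classifier was trained on the emails
train_set = [
    ({"meeting": True}, "ham"), ({"report": True}, "ham"),
    ({"free": True}, "spam"), ({"free": True, "prize": True}, "spam"),
]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# print the features that most strongly separate spam from ham
classifier.show_most_informative_features(5)
```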

So the word Enron is more likely to appear in ham emails than spam, while sex is more likely to appear in spam. php also appears a lot in spam, but that may be a 90s thing (as do corel and macromedia, both expensive software).

Challenge 6: Identify messages as spam or ham

We have 3 messages we want to classify as spam or ham:

The first is clearly spam. The second is also spam, but doesn’t use spam-like language. The third is my own email (and should not be classified as spam!).

At the top, I give you a hint about what to do. There are three steps: 1) break the message into words, 2) create word features, 3) call the classify() function.

Solution 6

There isn’t a lot to explain in this code:
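The solution code isn’t reproduced here; the three steps map to three lines. A self-contained sketch (it assumes NLTK is installed; the message, the training set and the str.split() tokenizer are all stand-ins for the lesson’s):

```python
import nltk  # assumes NLTK is installed

def create_word_features(words):
    return dict((word, True) for word in words)

# stand-in training set; the real one was built from the Enron emails
train_set = [
    (create_word_features("meeting report attached".split()), "ham"),
    (create_word_features("project schedule update".split()), "ham"),
    (create_word_features("free prize winner click".split()), "spam"),
    (create_word_features("free money offer click".split()), "spam"),
]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# a hypothetical message (the lesson's three messages aren't shown here);
# str.split() stands in for NLTK's word_tokenize
msg = "Congratulations you are a winner claim your free prize now"
words = msg.split()                       # 1) break into words
features = create_word_features(words)    # 2) create word features
print(classifier.classify(features))      # 3) classify the message
```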

And there you go. A simple spam filter.

Some final comments: this spam filter was built on spam from the 90s, and spam messages have evolved since then. If you wanted to use this today, you would add a few modern spam messages to the training data and retrain.

I hope you also appreciated how complex-looking code becomes easy if you build it in small parts.