Chapter 4 Spam filter

4.1 Naive Bayes: Spam or Ham?

Nobody likes spam emails. How can Bayes help? In this chapter, we’ll keep expanding our data science knowledge with a practical example. A simple yet effective way of using Bayesian probability to create a spam filter from scratch will be introduced. The filter will examine emails and classify them as either spam or ham (the word for non-spam emails) based on their content.

What we will be implementing here is a supervised learning model, in other words, a classification model that has been trained on previously classified data. Think of it like a machine to which you can give some input, like an email, and will give you some label to that input, like spam or ham. This machine has a lot of tiny knobs, and based on their particular configuration it will output some label for each input. Supervised learning involves iteratively finding the right configuration of these knobs by letting the machine make a guess with some pre-classified data, checking if the guess matches the true label, and if not, tune the knobs in some controlled way. The way our machine will make predictions is based on the underlying mathematical model. For a spam filter, a naive Bayes approach has proven to be effective, and you will have the opportunity to verify that yourself at the end of the chapter. In a naive Bayes model, Bayes’ theorem is the main tool for classifying, and it is naive because we make very loose assumptions about the data we are analyzing. This will be clearer once we dive into the implementation.

4.2 The Training Data

For the Bayesian spam filter to work correctly, we need to feed it some good training data. In this context, that means having a large enough corpus of emails that have been pre-classified as spam or ham. The emails should be collected from a sufficiently heterogeneous group of people. After all, spam is a somewhat subjective category: one person’s spam may be another person’s ham. The proportion of spam vs. ham in our data should also be somewhat representative of the real proportion of emails we receive.

Fortunately, there are a lot of very good datasets available online. We’ll use the “Email Spam Classification Dataset CSV” from Kaggle, a website where data science enthusiasts and practitioners publish datasets, participate in competitions, and share their knowledge. The dataset’s description included online helps us make sense of its contents:

The .csv file contains 5,172 rows, one row for each email. There are 3,002 columns. The first column indicates Email name. The name has been set with numbers and not recipients’ name to protect privacy. The last column has the labels for prediction: for spam, for not spam. The remaining 3,000 columns are the 3,000 most common words in all the emails, after excluding the non-alphabetical characters/words. For each row, the count of each word(column) in that email(row) is stored in the respective cells.

Let’s take a look at the data. The following code snippet outputs a view of the first and last rows of the dataset.

raw_df = CSV.read("./04_naive_bayes/data/emails.csv", DataFrame)
## 5172×3002 DataFrame
##   Row │ Email No.   the    to     ect    and    for    of     a      you    ho ⋯
##       │ String15    Int64  Int64  Int64  Int64  Int64  Int64  Int64  Int64  In ⋯
## ──────┼─────────────────────────────────────────────────────────────────────────
##     1 │ Email 1         0      0      1      0      0      0      2      0     ⋯
##     2 │ Email 2         8     13     24      6      6      2    102      1
##     3 │ Email 3         0      0      1      0      0      0      8      0
##     4 │ Email 4         0      5     22      0      5      1     51      2
##     5 │ Email 5         7      6     17      1      5      2     57      0     ⋯
##     6 │ Email 6         4      5      1      4      2      3     45      1
##     7 │ Email 7         5      3      1      3      2      1     37      0
##     8 │ Email 8         0      2      2      3      1      2     21      6
##   ⋮   │     ⋮         ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮      ⋮       ⋱
##  5166 │ Email 5166      1      0      1      0      3      1     12      1     ⋯
##  5167 │ Email 5167      1      0      1      1      0      0      4      0
##  5168 │ Email 5168      2      2      2      3      0      0     32      0
##  5169 │ Email 5169     35     27     11      2      6      5    151      4
##  5170 │ Email 5170      0      0      1      1      0      0     11      0     ⋯
##  5171 │ Email 5171      2      7      1      0      2      1     28      2
##  5172 │ Email 5172     22     24      5      1      6      5    148      8
##                                               2993 columns and 5157 rows omitted

As you can see, the output informs the amount of rows and columns and the type of each column and allows us to see a sample of the data.

4.3 Preprocessing the Data

Before we use the data to train our filter, we need to preprocess it a little bit. First, we should filter out very common words, such as articles and pronouns, which will most likely add noise rather than information to our classification algorithm.

all_words = names(raw_df)[2:end-1]

all_words_text = join(all_words, " ")
document = StringDocument(all_words_text)

prepare!(document, strip_articles)
prepare!(document, strip_pronouns)

vocabulary = split(TextAnalysis.text(document))
clean_words_df = raw_df[!, vocabulary]

data_matrix = Matrix(clean_words_df)'

In the first line, we create a variable all_words to store a list of all the words present in the emails. As our dataset has a column for each word, we do this by storing the names of every column with the names function, except for the first and last column, which are for email id and for the spam or ham label, respectively.

Let’s move on to the second and third lines of the code. We would like to filter out some words that are very common in the English language, such as articles and pronouns, which will most likely add noise rather than information to our classification algorithm. For this we will use two Julia packages that are specially designed for working with texts of any type. These are Languages.jl and TextAnalysis.jl. In the third line, we create a StringDocument, which is a struct provided by TextAnalysis.jl and we use its built-in methods to remove articles and pronouns from the list of words we created before. This is done by calling the prepare function two times, with two different flags: strip_articles and strip_pronouns. What follows is just code to recover our original DataFrame with only the relevant columns, i.e., the words that were not filtered. A clean_words_df DataFrame is created selecting those columns only. Finally, we turn our DataFrame into a matrix with its rows and columns transposed. This is just the convention used by the packages we are working with to make our analysis; each column is one data realization.

Next, we need to divide the data in two: a training set and a testing set. This is standard practice when working with models that learn from data, like the one we’re going to implement. We’re going to train the model on the training set, and then evaluate the model’s accuracy by having it make predictions on the testing set. In Julia, the package MLDataUtils.jl has some nice functionalities for data manipulations like this.

labels = raw_df.Prediction
(x_train, y_train), (x_test, y_test) = splitobs(shuffleobs((data_matrix, labels)), at = 0.7)

The function splitobs splits our dataset into a training set and a testing set, and shuffleobs randomizes the order of the data in the split. We pass a labels array to our split function so it knows how to properly split the dataset. Now we can turn our attention to building the spam filter.

4.4 The Naive Bayes Approach

As we mentioned, what we are facing here is a classification problem, and we will code from scratch and use a supervised learning algorithm to find a solution with the help of Bayes’ theorem. We’re going to use a naive Bayes classifier to create our spam filter. We’re going to use a classifier to create our spam filter. This method is going to treat each email just as a collection of words, with no regard for the order in which they appear. This means we won’t take into account semantic considerations like the particular relationship between words and their context.

Our strategy will be to estimate a probability of an incoming email being ham or spam and make a decision based on that. Our general approach can be summarized as:

\(P(spam|email) \propto P(email|spam)P(spam)\)

\(P(ham|email) \propto P(email|ham)P(ham)\)

Notice we use the \(\propto\) sign, meaning proportional to, instead of the = sign because the denominator from Bayes’s theorem is missing. In this case, we won’t need to calculate it, as it’s the same for both probabilities and all we’re going to care about is a comparison of these two probabilities.

In this naive approach, where semantics aren’t taken into account and each email is just a collection of words, the conditional probability \(P(email|spam)\) means the probability that a given email can be generated with the collection of words that appear in the spam category of our data. Let’s take a quick example. Imagine for a moment that our training set of emails consists just of these three emails, all labeled as spam:

  • Email 1: ‘Are you interested in buying my product?’
  • Email 2: ‘Congratulations! You’ve won $1000!’
  • Email 3: ‘Check out this product!’

Also imagine we receive a new, unclassified email and we want to discover \(P(email|spam)\). The new email looks like this:

  • New email: ‘Apply and win all these products!’

The new email contains the words win and product, which are rather common in our example’s training data. We would therefore expect \(P(email|spam)\), the probability of the new email being generated by the words encountered in the training spam email set, to be relatively high.

(The word \emph{win} appears in the form \emph{won} in the training set, but that’s OK. The standard linguistic technique of \emph{lemmatization} groups together any related forms of a word and treats them as the same word.)

Mathematically, the way to calculate \(P(email|spam)\) is to take each word in our target email, calculate the probability of it appearing in spam emails based on our training set, and multiply those probabilties together.

\(P(email|spam) = \prod_{i=1}^{n}P(word_i|spam)\)

We use a similar calculation to determine \(P(email|ham)\), the probability of the new email being generated by the words encountered in the training ham email set:

\(P(email|ham) = \prod_{i=1}^{n}P(word_i|ham)\)

The multiplication of each of the probabilities associated with a particular word here stems from the naive assumption that all the words in the email are statistically independent. In reality, this assumption isn’t necessarily true. In fact, it’s most likely false. Words in a language are never independent from one another, but this simple assumption seems to be enough for the level of complexity our problem requires.

The probability of a given word \(word_i\) being in a given category is calculated like so:

\[P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}\] \[P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}\]

These formulas tell us exactly what we have to calculate from our data. We need the numbers \(N_{word_i|spam}\) and \(N_{word_i|ham}\) for each word, meaning the number of times that \(word_i\) is used in the spam and ham categories, respectively. \(N_{spam}\) and \(N_{ham}\) are the total number of words used in the spam and ham categories (including all word repetitions), and \(N_{vocabulary}\) is the total number of unique words in the dataset. The variable \(\alpha\) is a smoothing parameter that prevents the probability of a given word being in a given category from going down to zero. If a given word hasn’t appeared in the spam category in our training dataset, for example, we don’t want to assign it zero probability of appearing in new spam emails.

As all of this information will be specific to our dataset, a clever way to aggregate it is to use a Julia struct, with attributes for the pieces of data we’ll need to access over and over during the prediction process. Here’s the implementation:

mutable struct BayesSpamFilter
    words_count_ham::Dict{String, Int64}
    words_count_spam::Dict{String, Int64}
    N_ham::Int64
    N_spam::Int64
    vocabulary::Array{String}
    BayesSpamFilter() = new()
end

The relevant attributes of the struct are words_count_ham and words_count_spam, two dictionaries containing the frequency of appearance of each word in the ham and spam datasets; N_ham and N_spam, the total number of words appearing in each category; and vocabulary, an array of all the unique words in the dataset.

The line BayesSpamFilter() = new() is the constructor of this struct. Because the constructor is empty, all the attributes will be undefined when we instantiate the filter. We’ll have to define some functions to fill these variables with values that are relevant to our particular problem. First, here’s a function word_count that counts the occurrences of each word in the ham and spam categories.

Now we are going to define some functions that will be important for our filter implementation.

function words_count(word_data, vocabulary, labels, spam=0)
    count_dict = Dict{String,Int64}()
    n_emails = size(word_data)[2]
    for (i, word) in enumerate(vocabulary)
        count_dict[word] = sum([word_data[i, j] for j in 1:n_emails if labels[j] == spam])
    end
    return count_dict
end

The function word_count counts the occurrences of each word in the ham and spam categories. One of its parameters is word_data, which we defined before and is a matrix where each column is an email and each row is a word.

Next, we’ll define a fit! function for our spam filter struct. Notice we’re using the bang (!) convention here to indicate a function that modifies its arguments in-place (in this case, the spam filter struct itself). This function fits our model to the data, a typical procedure in data science and machine learning areas.

function fit!(model::BayesSpamFilter, x_train, y_train, voc)
    model.vocabulary = voc
    model.words_count_ham = words_count(x_train, model.vocabulary, y_train, 0)
    model.words_count_spam = words_count(x_train, model.vocabulary, y_train, 1)
    model.N_ham = sum(values(model.words_count_ham))
    model.N_spam = sum(values(model.words_count_spam))
    return
end
## fit! (generic function with 1 method)

What we mean by fitting the model to the data is mainly filling all the undefined parameters in our struct with values informed by the training data. To do this, we use the words_count function we defined earlier. Notice that we’re only fitting the model to the training portion of the data, since we’re reserving the testing portion to evaluate the model’s accuracy.

4.5 Training the Model

Now it’s time to instantiate our spam filter and fit the model to the training data. With the struct and helper functions we’ve defined, the process is quite straightforward.

spam_filter = BayesSpamFilter()
fit!(spam_filter, x_train, y_train, vocabulary)

We create an instance of our BayesSpamFilter struct and pass it to our fit! function along with the data. Notice that we’re only passing in the training portion of the dataset, since we want to reserve the testing portion to evaluate the model’s accuracy later.

4.6 Making Predictions

Now that we have our model, we can use it to make some spam vs. ham predictions and assess its performance. We’ll define a few more functions to help with this process. First, we need a function implementing the TAL formula that we discussed earlier.

function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α)
    ham_prob = (words_count_ham[word] + α) / (N_ham + α * (n_vocabulary))
    spam_prob = (words_count_spam[word] + α) / (N_spam + α * (n_vocabulary))
    return ham_prob, spam_prob
end
## word_spam_probability (generic function with 1 method)

This function calculates \(P(word_i|spam)\) and \(P(word_i|ham)\) for a given word. We’ll call it for each word of an incoming email within another function, spam_predict, to calculate the probability of that email being spam or ham.

function spam_predict(email, model::BayesSpamFilter, α, tol=100)
    ngrams_email = ngrams(StringDocument(email))
    email_words = keys(ngrams_email)
    n_vocabulary = length(model.vocabulary)
    ham_prior = model.N_ham / (model.N_ham + model.N_spam)
    spam_prior = model.N_spam / (model.N_ham + model.N_spam)

    if length(email_words) > tol
        word_freq = values(ngrams_email)
        sort_idx = sortperm(collect(word_freq), rev=true)
        email_words = collect(email_words)[sort_idx][1:tol]
    end

    email_ham_probability = BigFloat(1)
    email_spam_probability = BigFloat(1)

    for word in intersect(email_words, model.vocabulary)
        word_ham_prob, word_spam_prob = word_spam_probability(word, model.words_count_ham, model.words_count_spam, model.N_ham, model.N_spam, n_vocabulary, α)
        email_ham_probability *= word_ham_prob
        email_spam_probability *= word_spam_prob
    end
    return ham_prior * email_ham_probability, spam_prior * email_spam_probability
end

This function takes as input a new email that we want to classify as spam or ham, our fitted model, an \(α\) value (which we’ve already discussed), and a tolerance value tol. The latter sets the maximum number of unique words in an email that we’ll look at. We saw that the calculations for \(P(email|spam)\) and \(P(email|ham)\) require the multiplication of each \(P(word_i|spam)\) and \(P(word_i|ham)\) term. When emails consist of a large number of words, this multiplication may lead to very small probabilities, up to the point that the computer interprets those probabilities as zero. This isn’t desirable; we need values of \(P(email|spam)\) and \(P(email|ham)\) that are larger than zero in order to multiply them by \(P(spam)\) and \(P(ham)\), respectively, and compare these values to make a prediction. To avoid probabilities of zero, we’ll only consider up to the tol most frequently used words in the email.

Finally, we arrive to the point of actually testing our model. We create another function to manage the process. This function classifies each email into Ham (represented by the number 0) or Spam (represented by the number 1)

function get_predictions(x_test, y_test, model::BayesSpamFilter, α, tol=200)
    N = length(y_test)
    predictions = Array{Int64,1}(undef, N)
    for i in 1:N
        email = string([repeat(string(word, " "), N) for (word, N) in zip(model.vocabulary, x_test[:, i])]...)
        pham, pspam = spam_predict(email, model, α, tol)
        pred = argmax([pham, pspam]) - 1
        predictions[i] = pred
    end

    predictions
end

This function takes in the testing portion of the data and our trained model. We call our spam_predict function for each email in the testing data and use the maximum (argmax) of the two returned probability values to predict (pred) if the email is spam or ham. We return the predictions as an array of values, which will contain zeros for ham emails, and ones for spam emails. Here we call the function to make predictions about the test data:

predictions = get_predictions(x_test, y_test, spam_filter, 1)

Let’s take a look at the predicted classifications of just the first five emails in the test data.

predictions[1:5]
## 5-element Vector{Int64}:
##  0
##  0
##  1
##  0
##  1

Of the first five emails, one (the third) was classified as spam, and the rest were classified as ham.

4.7 Evaluating the Accuracy

Looking at the predictions themselves is pretty meaningless; what we really want to know is the model’s accuracy. We’ll define another function to calculate this.

function spam_filter_accuracy(predictions, actual)
    N = length(predictions)
    correct = sum(predictions .== actual)
    accuracy = correct / N
    accuracy
end

This function compares the predicted classifications with the actual classifications of the test data, counts the number of correct predictions, and divides this number by the total number of test emails, giving us an accuracy measurement. Here we call the function:

spam_filter_accuracy(predictions, y_test)
## 0.9510309278350515

The output indicates our model is about 95 percent accurate. It appears our model is performing very well! Such a high accuracy rate is quite astonishing for a model so naive and simple. In fact, it may be a little too good to be true, because we have to take into account one more thing. Our model classifies emails into spam or ham, but the amount of ham emails in our data set is considerably higher than the spam ones. Let’s see the percentages:

sum(raw_df[!, :Prediction])/length(raw_df[!, :Prediction])
## 0.2900232018561485

To calculate the proportion of spam to ham emails, we sum over the Prediction column of the dataset remembering it only consists of 0s and 1s, and then we divide by the total amount of emails.This type of classification problem, where there’s an unequal distribution of classes in the dataset, is called imbalanced. With unbalanced data, a better way to see how the model is performing is to construct a confusion matrix, an \(N \times N\) matrix, where \(N\) is the number of target classes (in our case, 2, for spam and ham). The matrix compares the actual values for each class with those predicted by the model. Here’s a function that builds a confusion matrix for our spam filter:

function spam_filter_confusion_matrix(y_test, predictions)
    # 2x2 matrix is instantiated with zeros
    confusion_matrix = zeros((2, 2))

    confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test))
    confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test))
    confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test))
    confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test))

    # Now we convert the confusion matrix into a DataFrame 
    confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[])
    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2]))
    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2]))

    return confusion_df
end

Now let’s call our function to build the confusion matrix for our model.

confusion_matrix = spam_filter_confusion_matrix(y_test[:], predictions)
## 2×3 DataFrame
##  Row │ prediction            ham_mail  spam_mail
##      │ String                Float64   Float64
## ─────┼───────────────────────────────────────────
##    1 │ Model predicted Ham     1056.0       33.0
##    2 │ Model predicted Spam      43.0      420.0

Row 1 of the confusion matrix shows us all the times our model classified emails to be ham; 1,056 of those classifications were correct and 36 were incorrect. Similarly, the spam_mail column shows us the classifications for all the spam emails; 36 were misidentified as ham, and 427 were correctly identified as spam.

Now that we have the confusion matrix, we can calculate the accuracy of the model segmented by category.

ham_accuracy = confusion_matrix[1, :ham_mail] / (confusion_matrix[1, :ham_mail] + confusion_matrix[2, :ham_mail])
## 0.9608735213830755
spam_accuracy = confusion_matrix[2, :spam_mail] / (confusion_matrix[1, :spam_mail] + confusion_matrix[2, :spam_mail])
## 0.9271523178807947

With these values now we have a more fine-grained measure of the accuracy of our model. Now we know that our spam filter doesn’t have the same degree of accuracy for spam and for ham emails. As a consequence of the imbalance in our data, ham emails will be classified as such more accurately than spam emails. Still, with both percentages above 90, the accuracy is pretty good for a model so simple and naive. Models like these can be used like a baseline for creating more complex ones on top of them.

4.8 Summary

In this chapter, we’ve used a naive Bayes approach to build a simple email spam filter. We walked through the whole process of training, testing, and evaluating a learning model. First, we obtained a dataset of emails already classified as spam or ham and preprocessed the data. Then we considered the theoretical framework for our naive analysis. Using Bayes’s theorem on the data available, we assigned a probability of belonging to a spam or ham email to each word of the email dataset. The probability of a new email being classified as spam is therefore the product of the probabilities of each of its constituent words. We defined a Julia struct for the spam filter object and created functions to fit the spam filter object to the data. Finally, we made predictions on new data and evaluated our model’s performance by calculating the accuracy and making a confusion matrix.

4.9 Appendix - A little more about alpha

As we have seen, to calculate the probability of the email being a spam email, we should use

\(P(email|spam)=∏ni=1P(wordi|spam)=P(word0|spam)P(word1|spam)...P(wordnp|spam)\)

where P(wordnp|spam) stands for the probability of the word that is not presen t in our dataset. What probability should be assigned to this word? One way to handle this could be to simply ignore that term in the multiplication. In other words, assigning P(wordnp|spam)=1. Without thinking about it too much, we can conclude that this doesn’t make any sense, since that would mean that the probability to find that word in a spam (or ham, too) email would be equal to 1. A more logically consistent approach would be to assign 0 probability to that word. But there is a problem: with \(P(wordnp\|spam)=0\)

we can quickly see that

\(P(word0|spam)P(word1|spam)...P(wordnp|spam)=0\)

This is the motivation for introducing the smoothing parameter α

into our equation. In a real-word scenario, we should expect that words not present in our training set will appear, and altough it makes sense that they don’t have a high probability, it can’t be 0. When such a word appears, the probability assigned for it will be simply

\(P(wordnp|spam)=Nwordnp|spam+αNspam+αNvocabulary=0+αNspam+αNvocabulary=αNspam+αNvocabulary\)

In summary, α is just a smoothing parameter, so that the probability of finding a word that is not in our dataset, doesn’t go down to 0. Since we want to keep the probability for these words low enough, it makes sense to use α=1