Chapter 4: Spam Filter
4.1 Naive Bayes: Spam or Ham?
Nobody likes spam emails. How can Bayes help? In this chapter, we’ll keep expanding our data science knowledge with a practical example: a simple yet effective way of using Bayesian probability to build a spam filter from scratch. The filter will examine emails and classify them as either spam or ham (the word for non-spam emails) based on their content.
What we will be implementing here is a supervised learning model, in other words, a classification model that has been trained on previously classified data. Think of it as a machine to which you can give some input, like an email, and which will assign a label to that input, like spam or ham. This machine has a lot of tiny knobs, and based on their particular configuration it will output some label for each input. Supervised learning involves iteratively finding the right configuration of these knobs: we let the machine make a guess on some pre-classified data, check whether the guess matches the true label, and if not, tune the knobs in some controlled way. The way our machine makes predictions is based on its underlying mathematical model. For a spam filter, a naive Bayes approach has proven to be effective, and you will have the opportunity to verify that yourself at the end of the chapter. In a naive Bayes model, Bayes’ theorem is the main classification tool, and it is naive because it makes some strong simplifying assumptions about the data we are analyzing. This will become clearer once we dive into the implementation.
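To make the knob metaphor concrete, here is a minimal, purely illustrative sketch of that supervised learning loop. The one-knob "model" (a single threshold on a number) and its update rule are invented for illustration; our actual spam filter will tune its parameters very differently.

# A hypothetical one-knob classifier: label an input 1 ("spam") when it
# exceeds a threshold. Purely illustrative, not part of the spam filter.
predict(threshold, x) = x > threshold ? 1 : 0

function train(threshold, inputs, labels; step=0.1)
    for (x, label) in zip(inputs, labels)
        guess = predict(threshold, x)
        # If the guess was wrong, nudge the knob in the direction that
        # would have made it right; if correct, guess == label and the
        # knob stays put.
        threshold += step * (guess - label)
    end
    return threshold
end

threshold = train(0.5, [0.2, 0.9, 0.4, 0.8], [0, 1, 0, 1])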
4.2 The Training Data
For the Bayesian spam filter to work correctly, we need to feed it some good training data. In this context, that means having a large enough corpus of emails that have been pre-classified as spam or ham. The emails should be collected from a sufficiently heterogeneous group of people. After all, spam is a somewhat subjective category: one person’s spam may be another person’s ham. The proportion of spam vs. ham in our data should also be somewhat representative of the real proportion of emails we receive.
Fortunately, there are a lot of very good datasets available online. We’ll use the “Email Spam Classification Dataset CSV” from Kaggle, a website where data science enthusiasts and practitioners publish datasets, participate in competitions, and share their knowledge. The dataset’s description included online helps us make sense of its contents:
The .csv file contains 5,172 rows, one row for each email. There are 3,002 columns. The first column indicates the email name. The name has been set with numbers rather than recipients’ names to protect privacy. The last column has the labels for prediction: 1 for spam, 0 for not spam. The remaining 3,000 columns are the 3,000 most common words in all the emails, after excluding the non-alphabetical characters/words. For each row, the count of each word (column) in that email (row) is stored in the respective cells.
Let’s take a look at the data. The following code snippet outputs a view of the first and last rows of the dataset.
using CSV, DataFrames

raw_df = CSV.read("./04_naive_bayes/data/emails.csv", DataFrame)
## 5172×3002 DataFrame
## Row │ Email No. the to ect and for of a you ho ⋯
## │ String15 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 In ⋯
## ──────┼─────────────────────────────────────────────────────────────────────────
## 1 │ Email 1 0 0 1 0 0 0 2 0 ⋯
## 2 │ Email 2 8 13 24 6 6 2 102 1
## 3 │ Email 3 0 0 1 0 0 0 8 0
## 4 │ Email 4 0 5 22 0 5 1 51 2
## 5 │ Email 5 7 6 17 1 5 2 57 0 ⋯
## 6 │ Email 6 4 5 1 4 2 3 45 1
## 7 │ Email 7 5 3 1 3 2 1 37 0
## 8 │ Email 8 0 2 2 3 1 2 21 6
## ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋱
## 5166 │ Email 5166 1 0 1 0 3 1 12 1 ⋯
## 5167 │ Email 5167 1 0 1 1 0 0 4 0
## 5168 │ Email 5168 2 2 2 3 0 0 32 0
## 5169 │ Email 5169 35 27 11 2 6 5 151 4
## 5170 │ Email 5170 0 0 1 1 0 0 11 0 ⋯
## 5171 │ Email 5171 2 7 1 0 2 1 28 2
## 5172 │ Email 5172 22 24 5 1 6 5 148 8
## 2993 columns and 5157 rows omitted
As you can see, the output reports the number of rows and columns and the type of each column, and shows a sample of the data.
4.3 Preprocessing the Data
Before we use the data to train our filter, we need to preprocess it a little bit. First, we should filter out very common words, such as articles and pronouns, which will most likely add noise rather than information to our classification algorithm.
using Languages, TextAnalysis

all_words = names(raw_df)[2:end-1]
all_words_text = join(all_words, " ")

document = StringDocument(all_words_text)
prepare!(document, strip_articles)
prepare!(document, strip_pronouns)

vocabulary = split(TextAnalysis.text(document))
clean_words_df = raw_df[!, vocabulary]
data_matrix = Matrix(clean_words_df)'
In the first line, we create a variable all_words to store a list of all the words present in the emails. As our dataset has a column for each word, we do this by retrieving the names of every column with the names function, except for the first and last columns, which hold the email ID and the spam or ham label, respectively.

To filter out the articles and pronouns, we use two Julia packages specially designed for working with text of any type: Languages.jl and TextAnalysis.jl. We create a StringDocument, a struct provided by TextAnalysis.jl, and use its built-in methods to remove articles and pronouns from the list of words we built. This is done by calling the prepare! function two times, with two different flags: strip_articles and strip_pronouns. What follows is just code to recover our original DataFrame with only the relevant columns, i.e., the words that were not filtered out. A clean_words_df DataFrame is created by selecting those columns only. Finally, we turn our DataFrame into a matrix and transpose it, so that each column is one data realization (an email); this is just the convention expected by the analysis we’ll do next.
Next, we need to divide the data in two: a training set and a testing set. This is standard practice when working with models that learn from data, like the one we’re going to implement. We’re going to train the model on the training set, and then evaluate the model’s accuracy by having it make predictions on the testing set. In Julia, the package MLDataUtils.jl has some nice functionalities for data manipulations like this.
using MLDataUtils

labels = raw_df.Prediction
(x_train, y_train), (x_test, y_test) = splitobs(shuffleobs((data_matrix, labels)), at = 0.7)
The function splitobs splits our dataset into a training set and a testing set, and shuffleobs randomizes the order of the data before the split. We pass the labels array alongside the data so the function knows how to split the dataset properly. Now we can turn our attention to building the spam filter.
4.4 The Naive Bayes Approach
As we mentioned, what we are facing here is a classification problem, and we will code a supervised learning algorithm from scratch to find a solution, with the help of Bayes’ theorem. We’re going to use a naive Bayes classifier to create our spam filter. This method is going to treat each email just as a collection of words, with no regard for the order in which they appear. This means we won’t take into account semantic considerations like the particular relationship between words and their context.
Our strategy will be to estimate a probability of an incoming email being ham or spam and make a decision based on that. Our general approach can be summarized as:
\(P(spam|email) \propto P(email|spam)P(spam)\)
\(P(ham|email) \propto P(email|ham)P(ham)\)
Notice we use the \(\propto\) sign, meaning proportional to, instead of the = sign, because the denominator from Bayes’ theorem is missing. In this case, we won’t need to calculate it: it’s the same for both probabilities, and all we’re going to care about is comparing the two.
In this naive approach, where semantics aren’t taken into account and each email is just a collection of words, the conditional probability \(P(email|spam)\) means the probability that a given email can be generated with the collection of words that appear in the spam category of our data. Let’s take a quick example. Imagine for a moment that our training set of emails consists just of these three emails, all labeled as spam:
- Email 1: ‘Are you interested in buying my product?’
- Email 2: ‘Congratulations! You’ve won $1000!’
- Email 3: ‘Check out this product!’
Also imagine we receive a new, unclassified email and we want to discover \(P(email|spam)\). The new email looks like this:
- New email: ‘Apply and win all these products!’
The new email contains the words win and product, which are rather common in our example’s training data. We would therefore expect \(P(email|spam)\), the probability of the new email being generated by the words encountered in the training spam email set, to be relatively high.
(The word ‘win’ appears in the form ‘won’ in the training set, but that’s OK. The standard linguistic technique of lemmatization groups together related forms of a word and treats them as the same word.)
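As an aside, TextAnalysis.jl does not provide a full lemmatizer, but it does ship a stemmer, a rougher technique that chops related word forms down to a shared root. A small sketch, assuming the stem! function behaves as in current TextAnalysis.jl releases (the exact output may vary by version):

using TextAnalysis

# Stemming maps "wins" and "winning" to the root "win", but an irregular
# form like "won" is beyond its reach; that gap is what lemmatization closes.
sd = StringDocument("wins winning won")
stem!(sd)
TextAnalysis.text(sd)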
Mathematically, the way to calculate \(P(email|spam)\) is to take each word in our target email, calculate the probability of it appearing in spam emails based on our training set, and multiply those probabilities together.
\(P(email|spam) = \prod_{i=1}^{n}P(word_i|spam)\)
We use a similar calculation to determine \(P(email|ham)\), the probability of the new email being generated by the words encountered in the training ham email set:
\(P(email|ham) = \prod_{i=1}^{n}P(word_i|ham)\)
The multiplication of each of the probabilities associated with a particular word here stems from the naive assumption that all the words in the email are statistically independent. In reality, this assumption isn’t necessarily true. In fact, it’s most likely false. Words in a language are never independent from one another, but this simple assumption seems to be enough for the level of complexity our problem requires.
The probability of a given word \(word_i\) being in a given category is calculated like so:
\[P(word_i|spam) = \frac{N_{word_i|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}}\] \[P(word_i|ham) = \frac{N_{word_i|ham} + \alpha}{N_{ham} + \alpha N_{vocabulary}}\]
These formulas tell us exactly what we have to calculate from our data. We need the numbers \(N_{word_i|spam}\) and \(N_{word_i|ham}\) for each word, meaning the number of times that \(word_i\) is used in the spam and ham categories, respectively. \(N_{spam}\) and \(N_{ham}\) are the total number of words used in the spam and ham categories (including all word repetitions), and \(N_{vocabulary}\) is the total number of unique words in the dataset. The variable \(\alpha\) is a smoothing parameter that prevents the probability of a given word being in a given category from going down to zero. If a given word hasn’t appeared in the spam category in our training dataset, for example, we don’t want to assign it zero probability of appearing in new spam emails.
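To make the formula concrete, here is a quick sketch of the arithmetic using the toy three-email spam corpus from before. The counts are rough illustrative assumptions, not taken from the real dataset:

# Assumed toy counts: "product" appears twice among roughly 15 total
# words in the spam emails, out of 20 unique words in the whole corpus.
N_word_spam = 2
N_spam = 15
N_vocabulary = 20
α = 1

P_product_spam = (N_word_spam + α) / (N_spam + α * N_vocabulary)  # 3/35 ≈ 0.086

# Thanks to α, a word never seen in spam still gets a small nonzero probability:
P_unseen_spam = (0 + α) / (N_spam + α * N_vocabulary)             # 1/35 ≈ 0.029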
As all of this information will be specific to our dataset, a clever way to aggregate it is to use a Julia struct, with attributes for the pieces of data we’ll need to access over and over during the prediction process. Here’s the implementation:
mutable struct BayesSpamFilter
    words_count_ham::Dict{String, Int64}
    words_count_spam::Dict{String, Int64}
    N_ham::Int64
    N_spam::Int64
    vocabulary::Array{String}
    BayesSpamFilter() = new()
end
The relevant attributes of the struct are words_count_ham and words_count_spam, two dictionaries containing the frequency of appearance of each word in the ham and spam datasets; N_ham and N_spam, the total number of words appearing in each category; and vocabulary, an array of all the unique words in the dataset.

The line BayesSpamFilter() = new() is the constructor of this struct. Because the constructor is empty, all the attributes will be undefined when we instantiate the filter. We’ll have to define some functions to fill these attributes with values that are relevant to our particular problem. First, here’s a function words_count that counts the occurrences of each word in the ham and spam categories.
function words_count(word_data, vocabulary, labels, spam=0)
    count_dict = Dict{String,Int64}()
    n_emails = size(word_data)[2]
    for (i, word) in enumerate(vocabulary)
        # Sum this word's counts over all emails carrying the selected label
        count_dict[word] = sum([word_data[i, j] for j in 1:n_emails if labels[j] == spam])
    end
    return count_dict
end
One of the function’s parameters is word_data, the matrix we defined before, where each row is a word and each column is an email; the optional spam parameter selects which category gets counted (0 for ham, 1 for spam).
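To see how it behaves, here is a toy run on a made-up 3-word × 4-email count matrix (the numbers and labels are invented for illustration):

# Rows are words, columns are emails; emails 1 and 3 are labeled spam.
toy_vocabulary = ["win", "product", "meeting"]
toy_data = [1 0 2 0;
            1 0 1 0;
            0 2 0 1]
toy_labels = [1, 0, 1, 0]

words_count(toy_data, toy_vocabulary, toy_labels, 1)
# Dict("win" => 3, "product" => 2, "meeting" => 0)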
Next, we’ll define a fit! function for our spam filter struct. Notice we’re using the bang (!) convention here, indicating a function that modifies its arguments in place (in this case, the spam filter struct itself). This function fits our model to the data, a typical procedure in data science and machine learning.
function fit!(model::BayesSpamFilter, x_train, y_train, voc)
    model.vocabulary = voc
    model.words_count_ham = words_count(x_train, model.vocabulary, y_train, 0)
    model.words_count_spam = words_count(x_train, model.vocabulary, y_train, 1)
    model.N_ham = sum(values(model.words_count_ham))
    model.N_spam = sum(values(model.words_count_spam))
    return
end
## fit! (generic function with 1 method)
What we mean by fitting the model to the data is mainly filling all the undefined attributes in our struct with values informed by the training data. To do this, we use the words_count function we defined earlier. Notice that we’re only fitting the model to the training portion of the data, since we’re reserving the testing portion to evaluate the model’s accuracy.
4.5 Training the Model
Now it’s time to instantiate our spam filter and fit the model to the training data. With the struct and helper functions we’ve defined, the process is quite straightforward.
spam_filter = BayesSpamFilter()
fit!(spam_filter, x_train, y_train, vocabulary)
We create an instance of our BayesSpamFilter struct and pass it to our fit! function along with the data. Notice that we’re only passing in the training portion of the dataset, since we want to reserve the testing portion to evaluate the model’s accuracy later.
4.6 Making Predictions
Now that we have our model, we can use it to make some spam vs. ham predictions and assess its performance. We’ll define a few more functions to help with this process. First, we need a function implementing the per-word probability formulas we discussed earlier.
function word_spam_probability(word, words_count_ham, words_count_spam, N_ham, N_spam, n_vocabulary, α)
    ham_prob = (words_count_ham[word] + α) / (N_ham + α * n_vocabulary)
    spam_prob = (words_count_spam[word] + α) / (N_spam + α * n_vocabulary)
    return ham_prob, spam_prob
end
## word_spam_probability (generic function with 1 method)
This function calculates \(P(word_i|spam)\) and \(P(word_i|ham)\) for a given word. We’ll call it for each word of an incoming email within another function, spam_predict, to calculate the probability of that email being spam or ham.
function spam_predict(email, model::BayesSpamFilter, α, tol=100)
    ngrams_email = ngrams(StringDocument(email))
    email_words = keys(ngrams_email)
    n_vocabulary = length(model.vocabulary)
    ham_prior = model.N_ham / (model.N_ham + model.N_spam)
    spam_prior = model.N_spam / (model.N_ham + model.N_spam)

    # Keep only the `tol` most frequent words to limit numerical underflow
    if length(email_words) > tol
        word_freq = values(ngrams_email)
        sort_idx = sortperm(collect(word_freq), rev=true)
        email_words = collect(email_words)[sort_idx][1:tol]
    end

    email_ham_probability = BigFloat(1)
    email_spam_probability = BigFloat(1)

    for word in intersect(email_words, model.vocabulary)
        word_ham_prob, word_spam_prob = word_spam_probability(word, model.words_count_ham, model.words_count_spam, model.N_ham, model.N_spam, n_vocabulary, α)
        email_ham_probability *= word_ham_prob
        email_spam_probability *= word_spam_prob
    end
    return ham_prior * email_ham_probability, spam_prior * email_spam_probability
end
This function takes as input a new email that we want to classify as spam or ham, our fitted model, an \(α\) value (which we’ve already discussed), and a tolerance value tol. The latter sets the maximum number of unique words in an email that we’ll look at. We saw that the calculations for \(P(email|spam)\) and \(P(email|ham)\) require multiplying each \(P(word_i|spam)\) and \(P(word_i|ham)\) term. When an email consists of a large number of words, this multiplication may lead to probabilities so small that the computer rounds them to zero. This isn’t desirable; we need values of \(P(email|spam)\) and \(P(email|ham)\) larger than zero in order to multiply them by \(P(spam)\) and \(P(ham)\), respectively, and compare those values to make a prediction. To avoid probabilities of zero, we’ll only consider up to the tol most frequently used words in the email.
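Here is a two-line illustration of the underflow problem that tol (together with the BigFloat accumulators in spam_predict) guards against:

# One hundred factors of 1e-5 underflow Float64 to exactly zero,
# while BigFloat can still represent the result.
prod(fill(1e-5, 100))            # 0.0
prod(fill(BigFloat(1e-5), 100))  # ≈ 1.0e-500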
Finally, we arrive at the point of actually testing our model. We create another function to manage the process. This function classifies each email as ham (represented by the number 0) or spam (represented by the number 1).
function get_predictions(x_test, y_test, model::BayesSpamFilter, α, tol=200)
    N = length(y_test)
    predictions = Array{Int64,1}(undef, N)
    for i in 1:N
        # Rebuild the email text by repeating each vocabulary word
        # as many times as it occurs in this email's count vector
        email = string([repeat(string(word, " "), n) for (word, n) in zip(model.vocabulary, x_test[:, i])]...)
        pham, pspam = spam_predict(email, model, α, tol)
        pred = argmax([pham, pspam]) - 1
        predictions[i] = pred
    end
    predictions
end
This function takes in the testing portion of the data and our trained model. We call our spam_predict function for each email in the testing data and use the maximum (argmax) of the two returned probability values to predict (pred) whether the email is spam or ham. We return the predictions as an array containing zeros for ham emails and ones for spam emails. Here we call the function to make predictions about the test data:
predictions = get_predictions(x_test, y_test, spam_filter, 1)
Let’s take a look at the predicted classifications of just the first five emails in the test data.
predictions[1:5]
## 5-element Vector{Int64}:
## 0
## 0
## 1
## 0
## 1
Of the first five emails, one (the third) was classified as spam, and the rest were classified as ham.
4.7 Evaluating the Accuracy
Looking at the predictions themselves is pretty meaningless; what we really want to know is the model’s accuracy. We’ll define another function to calculate this.
function spam_filter_accuracy(predictions, actual)
    N = length(predictions)
    correct = sum(predictions .== actual)
    accuracy = correct / N
    accuracy
end
This function compares the predicted classifications with the actual classifications of the test data, counts the number of correct predictions, and divides this number by the total number of test emails, giving us an accuracy measurement. Here we call the function:
spam_filter_accuracy(predictions, y_test)
## 0.9510309278350515
The output indicates our model is about 95 percent accurate. It appears our model is performing very well! Such a high accuracy rate is quite remarkable for a model so naive and simple. In fact, it may be a little too good to be true, because we have to take one more thing into account: our model classifies emails as spam or ham, but the number of ham emails in our dataset is considerably higher than the number of spam emails. Let’s see the percentages:
sum(raw_df[!, :Prediction])/length(raw_df[!, :Prediction])
## 0.2900232018561485
To calculate the proportion of spam emails in the dataset, we sum over the Prediction column (remembering it consists only of 0s and 1s) and divide by the total number of emails. This type of classification problem, where there’s an unequal distribution of classes in the dataset, is called imbalanced. With imbalanced data, a better way to see how the model is performing is to construct a confusion matrix, an \(N \times N\) matrix, where \(N\) is the number of target classes (in our case, 2, for spam and ham). The matrix compares the actual values for each class with those predicted by the model. Here’s a function that builds a confusion matrix for our spam filter:
function spam_filter_confusion_matrix(y_test, predictions)
    # 2x2 matrix is instantiated with zeros
    confusion_matrix = zeros((2, 2))

    confusion_matrix[1, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 0) for i in 1:length(y_test))
    confusion_matrix[1, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 0) for i in 1:length(y_test))
    confusion_matrix[2, 1] = sum(isequal(y_test[i], 0) & isequal(predictions[i], 1) for i in 1:length(y_test))
    confusion_matrix[2, 2] = sum(isequal(y_test[i], 1) & isequal(predictions[i], 1) for i in 1:length(y_test))

    # Now we convert the confusion matrix into a DataFrame
    confusion_df = DataFrame(prediction=String[], ham_mail=Int64[], spam_mail=Int64[])
    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Ham", ham_mail=confusion_matrix[1, 1], spam_mail=confusion_matrix[1, 2]))
    confusion_df = vcat(confusion_df, DataFrame(prediction="Model predicted Spam", ham_mail=confusion_matrix[2, 1], spam_mail=confusion_matrix[2, 2]))

    return confusion_df
end
Now let’s call our function to build the confusion matrix for our model.
confusion_matrix = spam_filter_confusion_matrix(y_test[:], predictions)
## 2×3 DataFrame
## Row │ prediction ham_mail spam_mail
## │ String Float64 Float64
## ─────┼───────────────────────────────────────────
## 1 │ Model predicted Ham 1056.0 33.0
## 2 │ Model predicted Spam 43.0 420.0
Row 1 of the confusion matrix shows us all the times our model classified emails as ham: 1,056 of those classifications were correct and 33 were incorrect. Similarly, the spam_mail column shows us the classifications for all the spam emails: 33 were misidentified as ham, and 420 were correctly identified as spam.
Now that we have the confusion matrix, we can calculate the accuracy of the model segmented by category.
ham_accuracy = confusion_matrix[1, :ham_mail] / (confusion_matrix[1, :ham_mail] + confusion_matrix[2, :ham_mail])
## 0.9608735213830755
spam_accuracy = confusion_matrix[2, :spam_mail] / (confusion_matrix[1, :spam_mail] + confusion_matrix[2, :spam_mail])
## 0.9271523178807947
With these values we now have a more fine-grained measure of our model’s accuracy. We know that our spam filter doesn’t have the same degree of accuracy for spam as for ham emails. As a consequence of the imbalance in our data, ham emails are classified more accurately than spam emails. Still, with both percentages above 90, the accuracy is pretty good for a model so simple and naive. Models like these can be used as a baseline for building more complex ones on top of them.
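For an imbalanced problem like this one, it’s also common to report precision, recall, and the F1 score for the minority class. The chapter’s code doesn’t compute these, but here is a quick sketch using the counts from the confusion matrix above, treating spam as the positive class:

tp = 420.0   # spam correctly flagged as spam
fp = 43.0    # ham incorrectly flagged as spam
fn = 33.0    # spam that slipped through as ham

prec = tp / (tp + fp)                  # precision ≈ 0.907
rec = tp / (tp + fn)                   # recall ≈ 0.927
f1 = 2 * prec * rec / (prec + rec)     # ≈ 0.917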
4.8 Summary
In this chapter, we’ve used a naive Bayes approach to build a simple email spam filter. We walked through the whole process of training, testing, and evaluating a learning model. First, we obtained a dataset of emails already classified as spam or ham and preprocessed the data. Then we considered the theoretical framework for our naive analysis. Using Bayes’ theorem on the available data, we assigned to each word in the dataset a probability of appearing in a spam or a ham email. The probability of a new email being classified as spam is then proportional to the product of the probabilities of each of its constituent words. We defined a Julia struct for the spam filter object and created functions to fit it to the data. Finally, we made predictions on new data and evaluated our model’s performance by calculating the accuracy and building a confusion matrix.
4.9 Appendix: A Little More About Alpha
As we have seen, to calculate the probability of an email being spam, we compute

\(P(email|spam) = \prod_{i=1}^{n}P(word_i|spam) = P(word_1|spam)P(word_2|spam) \cdots P(word_{np}|spam)\)
where \(P(word_{np}|spam)\) stands for the probability of a word that is not present in our dataset. What probability should be assigned to this word? One way to handle this would be to simply ignore that term in the multiplication, in other words, to assign \(P(word_{np}|spam) = 1\). Without thinking about it too much, we can conclude that this doesn’t make any sense, since it would mean that the probability of finding that word in a spam (or ham) email is equal to 1. A more logically consistent approach would be to assign the word zero probability. But there is a problem: with \(P(word_{np}|spam) = 0\),
we can quickly see that

\(P(word_1|spam)P(word_2|spam) \cdots P(word_{np}|spam) = 0\)
This is the motivation for introducing the smoothing parameter \(α\) into our equation. In a real-world scenario, we should expect words that are not present in our training set to appear, and although it makes sense for them not to have a high probability, it can’t be 0. When such a word appears, the probability assigned to it will simply be
\[P(word_{np}|spam) = \frac{N_{word_{np}|spam} + \alpha}{N_{spam} + \alpha N_{vocabulary}} = \frac{0 + \alpha}{N_{spam} + \alpha N_{vocabulary}} = \frac{\alpha}{N_{spam} + \alpha N_{vocabulary}}\]
In summary, \(α\) is just a smoothing parameter that keeps the probability of finding a word that is not in our dataset from going down to 0. Since we want to keep the probability for these words low, it makes sense to use \(α = 1\).
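To get a feel for how \(α\) moves this probability, here is a small sketch with made-up corpus sizes (10,000 total spam words and a 3,000-word vocabulary, loosely echoing our dataset’s scale):

N_spam, N_vocabulary = 10_000, 3_000

for α in (0.1, 1, 10)
    p_unseen = α / (N_spam + α * N_vocabulary)
    println("α = $α  →  P(unseen word | spam) ≈ $(round(p_unseen, sigdigits=2))")
end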