Introduction
We all receive plenty of text messages on our phones. In this post I am going to build a machine learning model that classifies a text message as spam or not.
Mapping to machine learning problem
The input is a text message and the end goal is to decide whether it is spam or not, which makes this a binary classification problem. So we need to build a classifier that takes the raw text as input and outputs the class.
Data Set
I used the SMS Spam Collection Dataset from Kaggle. The dataset consists of around 5k messages labelled as spam or ham. It has two columns: one contains the SMS text and the other contains the label (spam or ham).
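Loading the data might look like the minimal sketch below; the file name spam.csv and the column names v1/v2 are assumptions based on the Kaggle download, not taken from the original notebook.

```python
import pandas as pd

# spam.csv with columns v1 (label) and v2 (message text) is assumed
# from the Kaggle "SMS Spam Collection Dataset" download
df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "text"]

print(df.shape)                     # roughly 5.5k rows
print(df["label"].value_counts())   # counts of ham vs spam
```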
Exploratory Data Analysis (EDA)
From the above plot we can clearly see that the dataset is imbalanced: there are only around 747 spam messages.
- From the above graph we can see that most sentence lengths lie between 0 and 200, so we can treat any sentence longer than 210 characters as an outlier.
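A quick sketch of how that length distribution can be plotted, assuming the df loaded above:

```python
import matplotlib.pyplot as plt

# character length of each message, later reused as an extra feature
df["length"] = df["text"].str.len()

plt.hist(df["length"], bins=50)
plt.xlabel("Message length (characters)")
plt.ylabel("Count")
plt.title("Distribution of SMS message lengths")
plt.show()

print(df["length"].min(), df["length"].max())  # roughly 2 and 925 for this dataset
```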
Conclusions after EDA
- Dataset is highly imbalanced.
- The maximum and minimum sentence lengths are 925 and 2 characters respectively.
- We can treat any sentence with length above 210 as an outlier.
- The majority of sentences have lengths between 2 and 210.
Feature engineering
- Since we are dealing with text data, we can add the length of the sentence as a new feature alongside the text itself.
Preprocessing the text data
- As part of preprocessing, we first convert the text to lower case, remove punctuation and stop words, and perform stemming.
- We also normalize the sentence-length column: its values range over [2, 210], and after normalization they lie in [0, 1].
- We use a TF-IDF vectorizer to convert the text into numbers, since we cannot feed raw text to the model.
Applying TF-IDF vectorization to the corpus produces a sparse matrix.
Next we combine the TF-IDF vectors and the normalized length feature using hstack.
- We will use a label encoder to convert labels into numbers.
- From EDA we know the dataset is highly imbalanced, so we apply SMOTE to the data we have. A condensed sketch of this preprocessing pipeline follows this list.
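The sketch below condenses the steps above (lower-casing, punctuation and stop-word removal, stemming, TF-IDF, length scaling, hstack and label encoding), assuming the df built earlier; the exact parameters in the original notebook may differ.

```python
import re
from nltk.corpus import stopwords          # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from scipy.sparse import hstack

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    # lower-case, strip punctuation, drop stop words, then stem
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(stemmer.stem(w) for w in text.split() if w not in stop_words)

df["clean"] = df["text"].apply(clean_text)

# TF-IDF turns the cleaned corpus into a sparse matrix
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(df["clean"])

# scale the sentence-length feature into [0, 1] and append it as an extra column
length_scaled = MinMaxScaler().fit_transform(df[["length"]])
X = hstack([X_text, length_scaled]).tocsr()

# encode ham/spam labels as 0/1
y = LabelEncoder().fit_transform(df["label"])
```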
What is Synthetic Minority Oversampling Technique (SMOTE)?
SMOTE is a form of data augmentation for the minority class. Instead of adding duplicates of existing samples to the dataset, it creates synthetic data points by interpolating between a minority-class sample and its nearest minority-class neighbours.
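A minimal sketch of applying SMOTE with the imbalanced-learn library, assuming the X and y produced in the preprocessing step; oversampling is applied only to the training split.

```python
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# hold out a stratified test set, then oversample only the training portion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```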
Machine learning models
- Naïve Bayes: First I tried a naïve Bayes model, using grid search CV to find the best parameters. After training, I obtained train and test F1-scores of 96.17% and 94.97%.
- Ensemble techniques: Here we used a majority-voting classifier that combines KNN, random forest and logistic regression.
- First I used grid search CV to find the best hyperparameters for KNN, random forest and logistic regression.
- Then I trained these models with their best hyperparameters and combined them with a voting classifier on top (see the sketch after this list).
- Finally, I got train and test F1-scores of about 99% and 96%.
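A sketch of both approaches, assuming the resampled training data from the SMOTE step; the hyperparameter grids and values here are illustrative, not necessarily the ones found by the original grid search.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

# naive Bayes with a grid search over the smoothing parameter alpha
nb = GridSearchCV(MultinomialNB(), {"alpha": [0.01, 0.1, 1.0]}, scoring="f1", cv=5)
nb.fit(X_train_res, y_train_res)
print("NB test F1:", f1_score(y_test, nb.predict(X_test)))

# majority-vote ensemble of KNN, random forest and logistic regression
voting = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
voting.fit(X_train_res, y_train_res)
print("Voting test F1:", f1_score(y_test, voting.predict(X_test)))
```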
Deep Learning Model
- Here I used a CNN model for text classification.
- Instead of using the TF-IDF vectorizer, I constructed an embedding matrix whose weights are trained along with the neural network.
- Steps to create the embedding matrix (a code sketch of these steps follows this list):
- First convert the text to tokens with a tokenizer, then pad the sequences so that all of them have the same length. Here I padded to length 250, i.e. if a text has 215 tokens, the remaining positions are padded with 0s up to length 250.
- After tokenization we construct an embedding matrix using the glove.6B.100d.txt vectors: each token maps to a 100-dimensional vector, so every padded sequence of 250 tokens becomes a 250×100 matrix.
- The weights of this embedding matrix are trained along with the neural network we build.
- The final architecture of the model that is going to be trained is shown below.
- Finally, I got train and test F1-scores of 99% and 96.02%.
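A sketch of the tokenise, pad, GloVe embedding matrix and Conv1D steps, assuming df["clean"] and y from the earlier preprocessing; the layer sizes are illustrative rather than the blog's exact architecture, and glove.6B.100d.txt is assumed to be downloaded locally.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.initializers import Constant
from tensorflow.keras import layers, models

MAX_LEN, EMB_DIM = 250, 100

# convert text to integer tokens and pad every sequence with 0s to length 250
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df["clean"])
sequences = tokenizer.texts_to_sequences(df["clean"])
X_pad = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")

# build the embedding matrix: one 100-d GloVe vector per token in the vocabulary
embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, EMB_DIM))
for word, i in tokenizer.word_index.items():
    if word in embeddings:
        embedding_matrix[i] = embeddings[word]

# simple 1-D CNN; trainable=True lets the embedding weights be updated during training
model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(vocab_size, EMB_DIM,
                     embeddings_initializer=Constant(embedding_matrix),
                     trainable=True),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```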
Using BERT model
- Instead of building a model from scratch, we can use a state-of-the-art model like BERT for the classification task. Going into the details of BERT's architecture would make this blog lengthy, so you can find them in A Visual Guide to Using BERT for the First Time.
- Generally, BERT takes three inputs: token IDs, an attention mask and segment (token type) IDs.
- These are passed to BERT, which outputs a feature vector; on top of that vector we build our own classifier.
- Here X_train_pooled_output and X_test_pooled_output are the feature vectors that we pass to our own classifier.
- The final architecture of the model can be seen below.
- We are not fine-tuning or training the entire BERT model; instead we pass our data through BERT to get the feature vectors and build a classifier on top (a sketch follows this list).
- Finally, I got train and test F1-scores of 88.23% and 86.27%.
- Since the amount of available data is small, using BERT here is not a great idea.
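A hedged sketch of using BERT as a frozen feature extractor, here via the Hugging Face transformers library (which may not be the library used in the original notebook); train_texts and test_texts stand for the raw train/test messages.

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.trainable = False  # feature extraction only, no fine-tuning

def pooled_features(texts, max_len=64):
    # tokenizer produces the three BERT inputs: token IDs, attention mask, token type IDs
    enc = tokenizer(list(texts), padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="tf")
    return bert(enc).pooler_output.numpy()

X_train_pooled_output = pooled_features(train_texts)  # train_texts/test_texts assumed
X_test_pooled_output = pooled_features(test_texts)

# small dense classifier on top of the 768-d pooled vectors
clf = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(768,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
clf.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```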
Conclusions
As you can see, the majority-vote classifier and the deep learning model give almost the same results. We can host the majority-vote classifier in the cloud, since the inference time of deep learning models is costly.
Further Works
- We can build a front end, write an API for model prediction, and integrate the two.
- We can package the model in a Docker container and host it in the cloud.
Source code
Let's connect on LinkedIn