SENTENCE CORRECTION USING RNN

Chinuteja
7 min read · Jul 16, 2021


Introduction

Most of us use social media platforms to communicate with people or express ourselves. Generally we use informal English to chat with friends, while ML and DL models for NLP tasks such as sentiment analysis, text classification, and next-sentence prediction are trained on standard English. The informal English most of us type, full of shortened sentences and abbreviations, is therefore not very helpful for these NLP tasks.

Table of contents

  1. Business Problem
  2. Mapping to a deep learning problem
  3. Why seq2seq?
  4. Prerequisites
  5. Basic Introduction to Seq2Seq
  6. Dataset Overview
  7. EDA
  8. Modelling
  9. Hosting

1) Business Problem

Here we build a model that takes corrupted text (shortened text, like an SMS message) as input and outputs the same text in standard English.

basic seq2seq overview model

You can see here that we gave corrupted text, i.e. “Hw r u?”, as input to the model, and the model converted it into standard English text, i.e. “How are you?”.

2) Mapping to a deep learning problem

In the above image you can see that we give text as input and get text as output, i.e. we convert corrupted text into standard English. This is essentially a machine translation problem: translating SMS text into standard English. So we go for a deep learning approach, which performs better than the traditional approach.

machine translation example

3) Why seq2seq?

In general, when we want to translate from one language to another, we listen to the entire sentence and only then translate it. Context and the order of words therefore play an important role. That’s the reason we go for seq2seq models, which use RNNs to process the input as an ordered sequence.

4) Prerequisites

Before going further, basic knowledge of RNNs, LSTMs, encoders, and decoders is a must, or else it would be like watching a Korean movie with French subtitles.

confused state of mind if you read this blog without any prior knowledge of RNNs and seq2seq

5) Basic Introduction to seq2seq

The following picture gives a high-level view of how seq2seq works. Let’s walk through it.

top level view of seq2seq

We first have to compile a “vocabulary” list containing all the words we want our model to be able to use or read. The model inputs will then be tensors containing the IDs of the words in the sequence. We also add two special symbols, <SOS> (start of sentence) and <EOS> (end of sentence), to the token IDs fed to the model. We ignore the encoder’s output at every intermediate time step and keep only its output at the last time step. This context vector (the vector obtained at the last time step) is fed to the decoder as one of its inputs, along with <SOS> as the decoder input at time step t = 0. The output obtained at t = 0 is then given as the decoder input at t = 1, and so on until the model predicts <EOS>. This is a rough overall view of how the seq2seq model in this blog works. I am not going to take a deep dive into it, as the blog would become too lengthy and we need to stick to the core concept of the case study. A small sketch of the vocabulary and ID conversion follows.
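Here is a minimal sketch (not the author’s exact code) of the vocabulary and <SOS>/<EOS> wrapping described above; the sentences and vocabulary are illustrative.

```python
# Build a toy vocabulary: special symbols first, then every word we saw.
sentences = ["hw r u", "how are you"]
vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2}
for s in sentences:
    for w in s.split():
        vocab.setdefault(w, len(vocab))

def to_ids(sentence):
    """Wrap a sentence's word IDs with <SOS> ... <EOS>."""
    return [vocab["<SOS>"]] + [vocab[w] for w in sentence.split()] + [vocab["<EOS>"]]

print(to_ids("hw r u"))   # -> [1, 3, 4, 5, 2]
```

At inference time the decoder runs this in reverse: starting from the context vector and <SOS>, it keeps feeding its own last prediction back in until it emits <EOS>.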

6) Dataset overview

For this case study, the dataset is taken from the following website:

https://www.comp.nus.edu.sg/~nlp/corpora.html

This dataset consists of the SMS text and the standardized text, along with a Chinese translation. The dataset looks like the following.

overview of dataset

As you can see, the first sentence is the corrupted text, the second is the standard text, and the third is the Chinese translation. We only need the corrupted text and the standard text, so I extracted both and made a dataframe. The dataframe consists of 2000 rows and two columns: corrupted text and standardized text.
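As a rough illustration, the extraction could look like the sketch below, assuming a plain-text export in which each example spans three consecutive lines (SMS, standard English, Chinese); the file name and layout are assumptions, not the corpus’s documented format.

```python
import pandas as pd

rows = []
with open("sms_corpus.txt", encoding="utf-8") as f:    # hypothetical file name
    lines = [line.strip() for line in f if line.strip()]

for i in range(0, len(lines) - 2, 3):
    # line i is the SMS text, i+1 the standard text, i+2 the Chinese text (skipped)
    rows.append((lines[i], lines[i + 1]))

df = pd.DataFrame(rows, columns=["corrupted_text", "standard_text"])
print(df.shape)   # roughly (2000, 2) for this dataset
```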

7) EDA

Let’s look into the EDA. I mostly used bar plots so that the data is easy to understand.

length of corrupted text

The above plot gives us information about the top 10 lengths of SMS text. As you can see, the longest sentence in our dataset is 221 characters, and this count includes punctuation.

The shortest corrupted text is 2 characters long, followed by lengths of 3 and 4.

You can see the distribution of the lengths of the corrupted text: the average length is between 50 and 55 characters, and the 99th percentile is 159.0. We can reasonably treat sentences longer than 160 characters as outliers.
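A small sketch of those length statistics, using pandas and NumPy on stand-in data (the numbers quoted above come from the full dataframe):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"corrupted_text": ["Hw r u?", "c u l8r", "gud nite"]})  # stand-in
lengths = df["corrupted_text"].str.len()

print(lengths.mean())                 # average length (~50-55 on the real data)
print(np.percentile(lengths, 99))     # 99th percentile (159.0 on the real data)
outliers = df[lengths > 160]          # sentences we may drop as outliers
```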

The above plot represents the top 10 lengths of standard text; as you can see, the maximum length is 281.

The minimum length of standard text is 3.

The distribution of the normal text is shown in the above plot. The average length is around 55-65 characters. The 99th percentile is 190.0, so any standard-text sentence longer than 190 characters can be considered an outlier.

E is the most common letter in both the SMS text and the standard text in our corpus, followed by A and N.

normal text and SMS text word cloud.

The above word cloud shows the most common words that appear in the SMS text and the normal text.
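For reference, a word cloud like the ones above can be generated with the wordcloud package; the sentences here are stand-ins.

```python
from wordcloud import WordCloud

sms_texts = ["hw r u", "c u l8r", "gud nite"]   # stand-in SMS sentences
cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate(" ".join(sms_texts))
cloud.to_file("sms_wordcloud.png")    # saves the image to disk
```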

8) Modelling

Here I used three models:

  1. Baseline model, i.e. char-char encoding without attention mechanism
  2. Word-word embedding using fastText with attention mechanism
  3. Model with data augmentation + fastText embedding with attention mechanism

Let’s briefly discuss the above models.

  1. Baseline model, i.e. char-char encoding without attention mechanism: Here I built a simple baseline model with character-level one-hot encoding; a minimal Keras sketch follows the architecture figure below.
architecture of baseline model
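A minimal Keras sketch of such a char-level encoder-decoder, in the spirit of the ten-minute Keras seq2seq tutorial linked in the references (the vocabulary sizes and latent_dim are illustrative, not the author’s exact settings):

```python
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Model

num_enc_chars, num_dec_chars, latent_dim = 70, 75, 256   # illustrative sizes

# Encoder: discard per-step outputs, keep only the final hidden/cell states.
encoder_inputs = Input(shape=(None, num_enc_chars))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: initialised with the encoder states, predicts one character per step.
decoder_inputs = Input(shape=(None, num_dec_chars))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                     initial_state=[state_h, state_c])
decoder_outputs = Dense(num_dec_chars, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```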

The results for the baseline model are not up to the mark, so we moved on to an attention mechanism.

2. Word-word embedding using fastText with attention mechanism: Here we removed the outliers, i.e. SMS texts longer than 170 characters and normal texts longer than 225. We used a pre-trained fastText model for word-level embeddings, and we padded each sentence in the SMS text with zeros up to the maximum length so that we can process batches of equal-length sequences (a preprocessing sketch follows the figure below). We also used an attention mechanism for this model, and we split the data in a 99:1 ratio, as we have very little data. I also used a bi-directional LSTM for better results.

encoder decoder with attention mechanism
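A sketch of the preprocessing for this model, using the fasttext package and Keras padding; the file name, vocabulary, and maximum length are illustrative:

```python
import fasttext
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

ft = fasttext.load_model("cc.en.300.bin")   # pre-trained fastText vectors

vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "how": 3, "are": 4, "you": 5}
embedding_matrix = np.zeros((len(vocab), ft.get_dimension()))
for word, idx in vocab.items():
    embedding_matrix[idx] = ft.get_word_vector(word)  # fastText covers OOV words too

id_sequences = [[1, 3, 4, 5, 2]]            # "<SOS> how are you <EOS>"
padded = pad_sequences(id_sequences, maxlen=170, padding="post", value=0)
```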

3. Model with data augmentation + fastText embedding with attention mechanism: As I mentioned, we had only 2000 sentences, which is very little for training a seq2seq model, so I went for NLP augmentation to improve the model’s performance. Here I used synonym and random augmentation on the normal text; a sketch follows below. After augmentation I had a total of 6000 sentences, which is a reasonable amount for training the seq2seq model. Finally, the model with NLP augmentation + fastText embeddings + a bi-directional LSTM gave the best performance of the three.
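A sketch of that augmentation step using the nlpaug package (synonym replacement needs NLTK’s WordNet data downloaded):

```python
import nlpaug.augmenter.word as naw

syn_aug = naw.SynonymAug(aug_src="wordnet")      # WordNet synonym replacement
rnd_aug = naw.RandomWordAug(action="swap")       # random word swaps

sentence = "How are you doing today?"
print(syn_aug.augment(sentence))   # e.g. a synonym swapped in
print(rnd_aug.augment(sentence))   # e.g. word order lightly shuffled
```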

9) Hosting

I used model 3, i.e. the model with data augmentation + fastText embedding with attention mechanism, as the final model for hosting. I hosted it on a local server with Flask; a minimal sketch follows.
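A minimal Flask sketch of the hosting; correct_sentence() is a stand-in for the trained model’s inference loop, not the author’s actual code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def correct_sentence(text):
    # Placeholder: the real app runs the seq2seq inference loop here.
    return text

@app.route("/predict", methods=["POST"])
def predict():
    sms_text = request.form["text"]
    return jsonify({"standard_text": correct_sentence(sms_text)})

if __name__ == "__main__":
    app.run(debug=True)   # local development server
```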

References:

https://cs224d.stanford.edu/reports/Lewis.pdf

https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

https://towardsdatascience.com/day-1-2-attention-seq2seq-models-65df3f49e263

Source code

https://github.com/chinuteja/sentence-correction-using-RNN

LinkedIn

For any queries regarding this case study, you can contact me on LinkedIn.

Written by Chinuteja

Working as a data scientist.
