Intro

I decided to write this tutorial after doing some research online, only to be surprised by the amount of false information about NLP with Keras and TensorFlow, especially when it comes to the embedding layer. In this tutorial, I will explain exactly what the embedding layer is, both from a technical perspective and a machine learning perspective, as well as how to remove the embedding layer, should you need to, after training your model.

I will start with a simple example and walk you through a model that does... yes, sentiment analysis. Note that the focus here is not how to do sentiment analysis, but the Keras embedding layer.

Let's start with the necessary imports.

In [0]:
import numpy as np
import pandas as pd 
import keras
from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from IPython.core.display import HTML, display
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional, \
   Flatten, InputLayer
from keras import regularizers
from keras.preprocessing import sequence

from sklearn.model_selection import train_test_split
from keras.utils.np_utils import to_categorical
import re
We start off by loading our data. In this example, we will be using Keras' IMDB dataset. The documentation for this dataset (and Keras in general) is pretty clear.

In short, we use Keras to load a prepared dataset containing text segments from IMDB with both negative and positive sentiments. The only parameter we will use in this example is num_words: "Top most frequent words to consider. Any less frequent word will appear as oov_char value in the sequence data".

In [0]:
from keras.datasets import imdb
num_words = 1000
maxlen = 400
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=num_words)
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)

print(X_train.shape)
(25000, 400)

Variable length sequences

When you have to work with the text directly yourself, the Keras preprocessing library is your friend: it contains a word tokenizer that can be used to split texts into words, as well as sequence.pad_sequences, which is used to make all your textual data the same length.
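
For instance, if you were starting from raw strings instead of the pre-encoded IMDB data, a minimal sketch (with made-up sentences, purely for illustration, using the Tokenizer and pad_sequences already imported above) might look like this:

# Toy corpus, just to illustrate the preprocessing utilities
texts = ["the movie was great", "the movie was terrible, truly terrible"]

toy_tokenizer = Tokenizer(num_words=1000)
toy_tokenizer.fit_on_texts(texts)                    # builds the word -> integer ID mapping
sequences = toy_tokenizer.texts_to_sequences(texts)  # turns each sentence into a list of IDs

# Pad (by default, by prepending zeros) so every sequence has the same length
padded = pad_sequences(sequences, maxlen=10)
print(padded.shape)   # (2, 10)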

What are those numbers?

Here's the thing: computers don't really understand textual data, least of all neural networks (keep in mind that when you use a String data type, it's actually represented as an array of bytes). For this reason, we assign a unique integer ID to each unique word. What this index represents is pretty interesting; more on that in the following sections.
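
Continuing the toy example above, you can peek at the mapping that fit_on_texts builds; each unique word gets its own integer ID (1-based, since 0 is reserved for padding):

print(toy_tokenizer.word_index)
# something like {'the': 1, 'movie': 2, 'was': 3, 'terrible': 4, 'great': 5, 'truly': 6}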

Timeout - A word for the wise: data sanitization

In this example, fortunately, the data is pre-sanitized. In real life, however, textual data is full of... well... surprises. A good engineer always codes against surprises and unexpected cases. Keep in mind that machine learning models can run for weeks; having your code throw an exception after a few days can be a huge waste of time and resources. Some basic cleaning I would recommend (see the sketch after this list):

  • If your code expects ASCII, make sure to convert all your text to ASCII.
  • Stopword removal can be beneficial depending on your data; use your judgment.
  • Removing punctuation, unless you have a very good reason not to, is almost always a necessity.
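
A minimal cleaning helper along those lines (the stopword list below is just a placeholder; for real use you would plug in a proper one, e.g. NLTK's):

import re
import string

# Placeholder stopword list, for illustration only
STOPWORDS = {'the', 'a', 'an', 'is', 'was', 'of', 'and'}

def clean_text(text):
    # Keep ASCII only
    text = text.encode('ascii', errors='ignore').decode('ascii')
    # Strip punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Lowercase, split on whitespace, and (optionally) drop stopwords
    words = re.split(r'\s+', text.lower().strip())
    return ' '.join(w for w in words if w and w not in STOPWORDS)

print(clean_text("Cafés are GREAT -- aren't they?!"))   # 'cafs are great arent they'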

Back to the Keras IMDB data: the training data comes with a word index, which is simply a hashmap of all the words in the training set; the value mapped to each word is the aforementioned unique ID.

To get a better grasp of the review data, we can use the word index to convert those numbers back to words. The following piece of code does that:

In [0]:
word_index = imdb.get_word_index()

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def sequence_to_text(seq, reverse_index):
      return ' '.join([reverse_index.get(i, 'N/A') for i in seq])

sequence_to_text(X_train[0], reverse_word_index)
Out[0]:
"the as you with out themselves powerful and and their becomes and had and of lot from anyone to have after out atmosphere never more room and it so heart shows to years of every never going and help moments or of every and and movie except her was several of enough more with is now and film as you of and and unfortunately of you than him that with out themselves her get for was and of you movie sometimes movie that with scary but and to story wonderful that in seeing in character to of and and with heart had and they of here that with her serious to have does when from why what have and they is you that isn't one will very to as itself with other and in of seen over and for anyone of and br and to whether from than out themselves history he name half some br of and and was two most of mean for 1 any an and she he should is thought and but of script you not while history he heart to real at and but when from one bit then have two of script their with her and most that with wasn't to with and acting watch an for with and film want an"

As expected, when we convert the numbers back into their textual representation, we can see a rough reconstruction of the original review (it looks a bit off because imdb.load_data reserves the first few indices for special tokens and our reverse lookup does not account for that shift). Keep in mind that this is solely for debugging; as far as your neural network is concerned, you will be passing indices (technically, the embedding is what gets passed to the rest of the network, more on that later).

Now we get to the core of this tutorial: the word embedding. I assume the reader has an idea of what word embedding is as a concept; if not, feel free to check this wonderful article by Stanford University.

That being said, let us load the word embeddings. For this post I will be using Stanford's GloVe pretrained embeddings of dimension 50. The structure of the file is very simple: each line consists of the <word> followed by the weights of that word, all separated by whitespace. We can create a word embedding dictionary by simply splitting each line, using the first split as the key and the rest as the values (the word embedding). Finally, as recommended by a very talented friend of mine (Andre Cianflone), it's advisable to create a random weight vector for all the unknown words. Unknown words may not have an embedding for a variety of reasons: misspelling, the word is not in English, etc.

In [0]:
embeddings_index = {}
embedding_dim = 50
with open('sample_data/glove.6B.50d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Create a random weight vector for all unknown words
embeddings_index['<unk>'] = np.random.uniform(low=-0.0001, high=0.0001, size=embedding_dim)

And now we create the embedding matrix. The purpose of the embedding matrix is to map word indices to embedding weights. To create the matrix, we loop through our word index and copy the corresponding weight vector from the embeddings dictionary into the matrix. If a word has no pretrained embedding, we assign it the random <unk> vector instead.

In [0]:
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        # Words without a pretrained embedding get the random <unk> vector
        embedding_matrix[i] = embeddings_index.get('<unk>')
In [0]:
print(embedding_matrix.shape)
print(len(word_index) + 1)
(88585, 50)
88585

Finally, we create an Embedding layer from the embedding matrix. The Embedding layer acts as a look-up table for the neural network, converting a word index into its vector.

In [0]:
embedding_layer = Embedding(len(word_index) + 1,
                            50,
                            weights=[embedding_matrix],
                            input_length=X_train.shape[1],
                            trainable=False)

If you're a curious reader, you're probably wondering why we're creating all these variables just to use word embeddings. Can't we simply map words to vectors directly, without an Embedding layer?

Well, the answer is that you actually can, but as with all things, it's a bit more complicated than that. If we skip the embedding layer, then our X_train array would be a list of vectors, each of size embedding_dim (50 in this case, but it could be as high as 300). That means the size of our X_train array would be embedding_dim times larger than if we used an Embedding layer.

This becomes particularly important when training on a GPU: rather than feeding the model a big fat array of vectors, we feed a list of integers and the model looks up whichever vectors it needs during training, without flooding memory with the full-sized vectors.
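
As a rough back-of-the-envelope comparison, assuming int32 indices, float32 embeddings, and the shapes from this example:

n_reviews, seq_len, emb_dim = 25000, 400, 50

indices_bytes = n_reviews * seq_len * 4            # one int32 per token
vectors_bytes = n_reviews * seq_len * emb_dim * 4  # emb_dim float32 values per token

print(indices_bytes / 1e6, "MB")   # 40.0 MB
print(vectors_bytes / 1e9, "GB")   # 2.0 GB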

We're all set to create the model. Here we go:

In [0]:
model = Sequential()
model.add(embedding_layer)

model.add(Bidirectional(LSTM(units=10, return_sequences=True,
                dropout=0.5, kernel_regularizer=regularizers.l2(0.001),
                activity_regularizer=regularizers.l2(0.001))))

model.add(Bidirectional(LSTM(units=10,
                dropout=0.5, kernel_regularizer=regularizers.l2(0.001),
                activity_regularizer=regularizers.l2(0.001))))

model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
batch_size = 32
model.fit(X_train, y_train, epochs=20, batch_size=batch_size, verbose=2)
Model: "sequential_19"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_14 (Embedding)     (None, 400, 50)           4429250   
_________________________________________________________________
bidirectional_37 (Bidirectio (None, 400, 20)           4880      
_________________________________________________________________
bidirectional_38 (Bidirectio (None, 20)                2480      
_________________________________________________________________
dense_18 (Dense)             (None, 1)                 21        
=================================================================
Total params: 4,436,631
Trainable params: 7,381
Non-trainable params: 4,429,250
_________________________________________________________________
(25000, 400) (25000,)
(25000, 400) (25000,)
Epoch 1/20
 - 530s - loss: 1.6622 - acc: 0.4990
Epoch 2/20
 - 520s - loss: 0.7674 - acc: 0.5008
Epoch 3/20
 - 518s - loss: 0.7307 - acc: 0.4920
Epoch 4/20
 - 522s - loss: 0.7139 - acc: 0.4914
Epoch 5/20
 - 518s - loss: 0.7042 - acc: 0.4976
Epoch 6/20
 - 518s - loss: 0.6989 - acc: 0.4946
Epoch 7/20
 - 520s - loss: 0.6960 - acc: 0.5002
Epoch 8/20
 - 524s - loss: 0.6946 - acc: 0.5005
Epoch 9/20
 - 523s - loss: 0.6939 - acc: 0.4962
Epoch 10/20
 - 525s - loss: 0.6936 - acc: 0.4960
Epoch 11/20
 - 522s - loss: 0.6934 - acc: 0.4947
Epoch 12/20
 - 521s - loss: 0.6933 - acc: 0.5007
Epoch 13/20
 - 525s - loss: 0.6932 - acc: 0.4953
Epoch 14/20
 - 528s - loss: 0.6932 - acc: 0.5006
Epoch 15/20
 - 528s - loss: 0.6932 - acc: 0.5001
Epoch 16/20
 - 517s - loss: 0.6932 - acc: 0.4986
Epoch 17/20
 - 517s - loss: 0.6932 - acc: 0.4979
Epoch 18/20
 - 520s - loss: 0.6932 - acc: 0.4907
Epoch 19/20
 - 520s - loss: 0.6932 - acc: 0.4939
Epoch 20/20
 - 522s - loss: 0.6932 - acc: 0.4979
Out[0]:
<keras.callbacks.History at 0x7f0a0bc16fd0>

Of course, when you train a model for actual usage, you should aim for a higher accuracy, and you definitely should have a test set. I will skip that, since this post is about word embeddings, not how to train your dragon... I mean model.

And now to the fun part: we get to actually use the model to predict the sentiment of a tweet. I engineered these two examples carefully to make a point: the word "waggish" is more or less a synonym for "fun", but it does not appear in our training data, and thus it does not have a value in the tokenizer's word_index.

In [0]:
tweets = np.array([
"Me reading my family comments about how great the was real fun, i enjoyed the debate",
"Me reading my family comments about how great the was waggish, i enjoyed the debate"
])
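
The cells below assume a tokenizer and a max_sent_len. A minimal sketch of one way to set them up, reusing the IMDB word_index so the integer IDs roughly line up with what the model was trained on (this ignores the small index offset that imdb.load_data applies), would be:

# Hypothetical setup: reuse the IMDB vocabulary so texts_to_sequences
# produces the same kind of integer IDs the model was trained on
tokenizer = Tokenizer()
tokenizer.word_index = word_index

# Pad prediction data to the same length used during training
max_sent_len = maxlen  # 400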

Let's take a look at what the tokenized sentences look like:

In [0]:
tokenizer.texts_to_sequences(tweets)

If you take a close look at the arrays above, you'll notice that one entry is missing: the word "waggish". Keras' texts_to_sequences fails silently when it sees an unknown word and simply omits it from the resulting array.
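
As an aside, if you would rather see unknown words mapped to an explicit placeholder than have them silently dropped, the Tokenizer accepts an oov_token argument; a tiny, self-contained illustration (separate from the tokenizer used above):

toy = Tokenizer(oov_token='<unk>')
toy.fit_on_texts(["the debate was real fun"])

# "waggish" was never seen, so it maps to the <unk> index instead of vanishing
print(toy.texts_to_sequences(["the debate was waggish"]))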

On that note, let's try predicting the sentiment of the above tweets anyway. Note that whatever padding/cleaning/pre-processing was done to the training data needs to be repeated on the prediction data; this applies to any machine learning model, not just this one.

In [0]:
X_predict = tokenizer.texts_to_sequences(tweets)
X_predict = pad_sequences(X_predict, maxlen=max_sent_len)

The data is now ready to be passed to the predict function; the result will be the probability that the sentiment is positive.

In [0]:
preds = model.predict(X_predict)
print(preds)

Well, the model did guess the correct sentiment for both sentences, but it seems less confident about the second one, which should not be the case since both sentences are pretty much identical. The reason, as stated earlier, is that the model actually sees the second tweet as "Me reading my family comments about how great the was <nothing>, i enjoyed the debate".

Luckily, there is a way to introduce new words to the model, which is the main reason we used word embeddings in the first place. The logic behind this is simple: if we replace the word "fun" with "waggish", we should ideally get a similar embedding vector for each word. The problem is that the word "waggish" was not part of the vocabulary we trained the model on. However, we can simply replicate the trained model, replacing the embedding layer with an input layer that takes the word vectors directly as input. Note that this does slightly affect the performance of the model, especially if we are running our predictions on a huge dataset, so I would recommend you try to fix your vocabulary beforehand if you can. If not, the code below shows how you can use your trained model (or technically a copy of it) to predict sentences that contain unseen words. The code is very simple; the only trick is knowing the shape of your model's input, which you can follow in the comments.

In [0]:
newModel = Sequential()
# Add an input layer of shape (max_sent_len, embedding_dim):
# our input consists of max_sent_len entries per sentence,
# each entry being a word vector of size embedding_dim
newModel.add(InputLayer(input_shape=(max_sent_len, 50), dtype='float32'))

#and then, we simply copy the layers over from the old model
newModel.add(model.layers[1])
newModel.add(model.layers[2])
newModel.add(model.layers[3])

newModel.summary()

With the embedding layer gone, we basically have to do what it did manually. It's very straightforward: we start by tokenizing the sentences.

In [0]:
tweets_tok = [keras.preprocessing.text.text_to_word_sequence(x) for x in tweets]
print(tweets_tok)

The next part is where the magic happens: we loop through our tokenized tweets and convert each word to its corresponding vector. We also need to prepend (or append, depending on how you configured pad_sequences while training) zero vectors for padding.

Each entry of the resulting array should be of shape (max_sent_len, embedding_dim), or (400, 50) in our case; each of those 400 rows is either a word vector or a zero vector in the case of padding.


In [0]:
X_predict_newmodel = []

for t in tweets_tok:
    X_sent = None
    for i, word in enumerate(t):
        if word in embeddings_index:
            # Known word: look up its GloVe vector
            vec = embeddings_index.get(word)
        else:
            # Unknown word: assign a small random vector, as we did for <unk>
            vec = np.random.uniform(low=-0.0001, high=0.0001, size=50)
        X_sent = vec if i == 0 else np.vstack((X_sent, vec))
    # Prepend zero vectors so each sentence has exactly max_sent_len rows,
    # matching the 'pre' padding applied to the training data
    for j in range(0, max_sent_len - X_sent.shape[0]):
        X_sent = np.vstack((np.zeros(50), X_sent))
    X_predict_newmodel.append(X_sent)

X_predict_newmodel = np.array(X_predict_newmodel)

print(X_predict_newmodel.shape)

And we're all set. Let's predict the sentiment for this non-human-readable bunch of vectors:

In [0]:
print(newModel.predict(X_predict_newmodel))

Noticed anything interesting? How about two things?

First, the probabilities for the first tweet are identical to our earlier prediction using the embedding layer. In fact, this is a good way to debug your model: the predictions should match for any sentence consisting solely of words that the model saw while training (and thus are part of the word_index). The only exception is when a sentence has words that do not have corresponding embeddings and get assigned random values, so when debugging, try to engineer an example that consists of words that were both seen in training AND have embeddings.
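
A quick sanity check along those lines, assuming preds and X_predict_newmodel from the cells above:

# The first tweet contains only in-vocabulary words that have GloVe embeddings,
# so both models should produce (near-)identical probabilities for it
preds_new = newModel.predict(X_predict_newmodel)
print(np.allclose(preds[0], preds_new[0], atol=1e-5))   # expect True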

Second, and more importantly, the model is now more confident in the sentiment of the second tweet. That makes sense, as we fed it the full sentence along with the word "waggish". What happens in this case, as opposed to the previous prediction where we skipped "waggish", is that the same neurons that get activated when seeing the vector for the word "fun" will also be activated for "waggish", resulting in a more accurate prediction.