When working with text data for machine learning problems, we need to convert the text to real-valued data before feeding it to ML algorithms, which only work on real-valued input. This process of converting text to real-valued data is generally referred to as text encoding. We split the text into tokens (characters, words, punctuation marks, etc.) and then map each token to a real-valued representation. There are various ways to encode text data, such as word frequency, Tf-Idf, one-hot encoding, and word embeddings. The approaches commonly used earlier were word frequency, Tf-Idf, and one-hot encoding, all of which carry very limited information per token (typically a single number or a sparse indicator). A single number cannot capture the meaning of a token/word, which can differ across text contexts. The word embeddings approach that we have explained in this tutorial takes the concept further and uses a real-valued vector to represent a single token. Each token is represented by a list of floats, which can be of any length greater than one. As more numbers are used to represent a token/word, the representation can capture more information and better reflect its meaning.
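To make the idea concrete, here is a tiny, purely illustrative Python sketch (the tokens and vector values below are made up, not trained embeddings) showing tokens being represented by lists of floats:

## Hypothetical word embeddings: each token maps to a vector of floats (made-up values)
embeddings = {
    "sports" : [ 0.12, -0.45, 0.88, 0.03],
    "cricket": [ 0.10, -0.40, 0.91, 0.07],
    "stocks" : [-0.67,  0.22, 0.05, 0.54],
}
print(embeddings["sports"])  ## one 4-dimensional embedding vector for the token 'sports'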
As a part of this tutorial, we'll explain how we can design neural networks using Keras that use word embeddings for text classification tasks. We'll start with random word embeddings for tokens and then update them by training the network on our data so that they learn the meaning of tokens/words. We have also explained various approaches to using word embeddings.
Below, we have listed the important sections of the tutorial to give an overview of the material covered.
Below, we have imported the necessary Python libraries and printed the versions that we have used in our tutorial.
import tensorflow
print("Tensorflow Version : {}".format(tensorflow.__version__))

from tensorflow import keras
print("Keras Version : {}".format(keras.__version__))
import torchtext
print("Torchtext Version : {}".format(torchtext.__version__))
In this section, we have prepared the data to be fed directly to the neural network. We first loaded the dataset and then trained a tokenizer with the list of text examples to populate its vocabulary with tokens (words and punctuation marks). The vocabulary is a simple mapping from a token to a unique integer index; each token is assigned a unique integer index starting from 1. Then, we vectorize each text example using this populated vocabulary: we tokenize the text example so that we have a list of tokens, and then we retrieve the index of each token from the vocabulary.
Below, we have loaded the dataset that we'll use for our purpose. We have loaded the AG NEWS dataset available from the torchtext library. It has text examples from 4 different categories of news ("World", "Sports", "Business", and "Sci/Tech"). The dataset is already divided into train and test sets.
import numpy as np

train_dataset, test_dataset = torchtext.datasets.AG_NEWS()

X_train_text, Y_train = [], []
for Y, X in train_dataset:
    X_train_text.append(X)
    Y_train.append(Y)

X_test_text, Y_test = [], []
for Y, X in test_dataset:
    X_test_text.append(X)
    Y_test.append(Y)

unique_classes = list(set(Y_train))
target_classes = ["World", "Sports", "Business", "Sci/Tech"]

## Subtracted 1 from labels to bring range from 1-4 to 0-3
Y_train, Y_test = np.array(Y_train) - 1, np.array(Y_test) - 1

len(X_train_text), len(X_test_text)
In this section, we have vectorized the train and test datasets so that they can be given directly to the neural network. In order to vectorize the data, we have used the Tokenizer() class available from Keras.
We first created an instance of Tokenizer and then called its fit_on_texts() method with our datasets (train and test) to populate the vocabulary with tokens. The method tokenizes each text example and adds its tokens to the vocabulary.
Once we have populated the vocabulary, we can translate any text example to a list of indexes by calling the texts_to_sequences() method on the Tokenizer object. The method takes a list of text examples as input and returns a list of token indexes for each of them.
Below, we have illustrated the vectorization process with a simple example. Tokens that have an entry in the vocab get replaced with their respective indexes, and tokens for which we don't have a mapping get replaced with the index of the '<unk>' token.
text = "Hello, How are you? Where are you planning to go?"
tokens = ['hello', ',', 'how', 'are', 'you', '?', 'where',
'are', 'you', 'planning', 'to', 'go', '?']
vocab = {
'hello': 0,
'bye': 1,
'how': 2,
'the': 3,
'welcome': 4,
'are': 5,
'you': 6,
'to': 7,
'<unk>': 8,
}
vector = [0,8,2,4,6,8,8,5,6,8,7,8,8]
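A minimal sketch of how this mapping can be computed in plain Python, reusing the tokens and vocab defined above:

## Replace each token with its vocabulary index, falling back to '<unk>' for unknown tokens
vector = [vocab.get(token, vocab['<unk>']) for token in tokens]
print(vector)  ## [0, 8, 2, 5, 6, 8, 8, 5, 6, 8, 7, 8, 8]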
The length of each text can be arbitrary, but we have decided to keep a maximum of 50 tokens per text example. We have enforced this using the pad_sequences() function. It keeps the length of each text example at 50 tokens by padding examples that have fewer than 50 tokens with 0s and truncating examples that have more than 50 tokens.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_text+X_test_text)
print("Vocabulary Size : {}".format(len(tokenizer.index_word)))
max_tokens = 50
## Vectorizing data to keep max_tokens words per sample.
X_train_vect = pad_sequences(tokenizer.texts_to_sequences(X_train_text), maxlen=max_tokens, padding="post", truncating="post", value=0)
X_test_vect = pad_sequences(tokenizer.texts_to_sequences(X_test_text), maxlen=max_tokens, padding="post", truncating="post", value=0)
print(X_train_vect[:2])
X_train_vect.shape, X_test_vect.shape
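To see concretely what pad_sequences() does, below is a small sketch with a made-up sequence of 3 token indexes padded to a maximum length of 5:

## Pad a short sequence of indexes with 0s at the end; longer sequences would get truncated instead
demo = pad_sequences([[11, 7, 42]], maxlen=5, padding="post", truncating="post", value=0)
print(demo)  ## [[11  7 42  0  0]]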
## Which word has index 444 in the vocabulary?
print(tokenizer.index_word[444])

## How many times does it appear in the first text document?
print(X_train_text[0]) ## 2 times
Our first approach keeps the embeddings of all tokens of a text example and lays them next to each other to create one big vector. We have designed a network that has a single embedding layer and 2 dense layers for performing the text classification task.
In this section, we have defined our text classification network using the Sequential API of Keras. The network consists of 3 layers.
The first layer is the embedding layer. We can create an embedding layer using the Embedding() constructor available from the layers sub-module of Keras. The first parameter to the constructor is the number of different tokens (vocabulary size) and the second parameter is the embedding length per token. The input_length parameter specifies the number of tokens per text example. The constructor internally creates a weight matrix of shape (vocab_size, embed_len), hence we have a vector of length embed_len for each token of the vocabulary. The embedding layer is simply an indexing layer that takes a list of token indexes per text example as input and returns the respective embeddings from the weight matrix. The input to the embedding layer is of shape (batch_size, max_tokens) = (batch_size, 50) and the output is of shape (batch_size, max_tokens, embed_len) = (batch_size, 50, 25). So basically, the text example gets tokenized to create a list of tokens, those tokens are mapped to their respective indexes using the vocabulary, and the embedding layer maps the indexes to their respective embedding vectors.
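Below is a minimal standalone sketch (with a made-up vocabulary size of 1,000) showing the shape transformation that an embedding layer performs:

import numpy as np
from tensorflow.keras import layers

demo_embed = layers.Embedding(input_dim=1000, output_dim=25)  ## hypothetical vocab of 1,000 tokens, 25-dim embeddings
demo_idxs = np.random.randint(1, 1000, size=(2, 50))          ## 2 examples with 50 token indexes each
print(demo_embed(demo_idxs).shape)                            ## (2, 50, 25) = (batch_size, max_tokens, embed_len)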
The output of embedding layer is flattened hence the shape is transformed from (batch_size, max_tokens, embed_len) to (batch_size, max_tokens x embed_len) = (batch_size, 1250).
The flattened output is given to a dense layer with 128 output units. The dense layer applies relu activation function to the output.
The output of the first dense layer is then given to the second dense layer, which has 4 output units. The second dense layer applies the softmax activation function to the output, hence the output is 4 probabilities in the range 0-1 (which all sum up to 1) per text example.
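As a quick illustration of the softmax step (with made-up raw outputs), the 4 values are turned into probabilities that lie in the range 0-1 and sum to 1:

import numpy as np

logits = np.array([2.0, 1.0, 0.1, -1.0])       ## hypothetical raw outputs of the last dense layer
probs = np.exp(logits) / np.exp(logits).sum()  ## softmax
print(probs, probs.sum())                      ## 4 probabilities summing to 1 (up to floating point)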
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
import gc

embed_len = 25

model = Sequential([
    layers.Embedding(len(tokenizer.index_word)+1, embed_len, input_length=max_tokens),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(len(target_classes), activation="softmax")
])
model.summary()
In this section, we have compiled a network to use Adam optimizer, cross entropy loss, and accuracy evaluation metric.
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
In this section, we have trained our network for 5 epochs with a batch size of 256 by calling fit() method. We have provided train and test data as well. We can notice from the loss and accuracy getting printed after each epoch that our model is doing quite a good job at the text classification task.
history = model.fit(X_train_vect, Y_train, batch_size=256, epochs=5, validation_data=(X_test_vect, Y_test))
gc.collect()
In this section, we have evaluated the performance of the network by calculating accuracy, classification report (precision, recall, and f1-score per target class) and confusion matrix metrics for test predictions. We can notice from the accuracy that our model is doing quite a good job at predicting target labels for test examples.
We have calculated various ML metrics using functions available from scikit-learn. Please feel free to check the below link if you are interested in learning about various ML metrics available from sklearn in-depth.
Apart from calculating metrics, we have also plotted the confusion matrix using python library scikit-plot. We can notice from the visualization that our model seems to be doing a good job at classifying text documents from categories Sports and World compared to categories Business and Sci/Tech.
The scikit-plot has visualizations for many commonly used ML metrics. Please feel free to check the below link if you want to learn about them as well.
from sklearn.metrics import accuracy_score, classification_report
train_preds = model.predict(X_train_vect)
test_preds = model.predict(X_test_vect)
print("Train Accuracy : {}".format(accuracy_score(Y_train, np.argmax(train_preds, axis=1))))
print("Test Accuracy : {}".format(accuracy_score(Y_test, np.argmax(test_preds, axis=1))))
print("\nClassification Report : ")
print(classification_report(Y_test, np.argmax(test_preds, axis=1), target_names=target_classes))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test],
                                    [target_classes[i] for i in np.argmax(test_preds, axis=1)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5));
plt.xticks(rotation=90);
In this section, we have tried to explain predictions made by our trained network using the LIME algorithm. The algorithm is commonly used to explain predictions made by black-box ML models. It lets us create a visualization that highlights the important tokens of text that contributed to predicting a particular target label. We'll be using the Python library lime for our purpose, which has an implementation of the algorithm. In order to explain a prediction, we need to follow a few simple steps, which we have covered below.
If you are someone who is new to the LIME concept then we recommend going through the below tutorials in your free time to know about it in detail.
Below, we have first created an instance of LimeTextExplainer() with the names of target classes. This instance will be used to explain the prediction of the network.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
Below, we have first created a function that takes a batch of text examples as input and returns their predicted probabilities. The function tokenizes the text data, vectorizes it, and gives it to the network to make predictions. The output probabilities of the network are returned from the function. We'll use this function in the next cell for explanation purposes.
After defining a function, we randomly selected one text example from the test dataset and made predictions on it using our trained model. Our model correctly predicts the target label as 'Sci/Tech' for the selected text example.
import numpy as np

def make_predictions(X_batch_text):
    ## Vectorize the batch of raw text examples and return predicted probabilities
    X_batch = pad_sequences(tokenizer.texts_to_sequences(X_batch_text), maxlen=50, padding="post", truncating="post", value=0)
    preds = model.predict(X_batch)
    return preds
rng = np.random.RandomState(1234)
idx = rng.randint(1, len(X_test_text))
print("Prediction : ", target_classes[model.predict(X_test_vect[idx:idx+1]).argmax(axis=-1)[0]])
print("Actual : ", target_classes[Y_test[idx]])
In the below cell, we have created a visualization explaining the prediction of our network. To do that, we have first called explain_instance() method on LimeTextExplainer object. We have given it a selected text example, our prediction function, and the target label. The method returns an Explanation object which has details about the explanation. We have then called show_in_notebook() method on Explanation object to create visualization.
We can notice from the visualization that words like 'RFID', 'privacy', 'frequency', 'technology', etc are used for predicting the target label as Sci/Tech which makes sense as they are commonly used words in the field.
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
Our approach to working with embeddings in this section is a little bit different from our previous approach. In the previous section, we flattened the output of the embedding layer. In this section, we instead average the embeddings of the tokens of each example and give the result to the next dense layer. The majority of the code is exactly the same as in the previous section, with the only change being the definition of the text classification network.
Below, we have defined the network that we'll use for our task in this section. We have defined the network using the Model (functional) API of Keras this time. The network consists of 3 layers (1 embedding and 2 dense) like our previous approach. The only difference is that the output of the embedding layer is averaged over the tokens dimension. The input shape of the embedding layer is (batch_size, max_tokens) and the output shape is (batch_size, max_tokens, embed_len); after averaging at the tokens level, the shape gets transformed to (batch_size, embed_len). The output of shape (batch_size, embed_len) is then given to the dense layer.
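A minimal sketch (with random values) of how averaging over the tokens dimension changes the shape:

## Average token embeddings: (batch_size, max_tokens, embed_len) -> (batch_size, embed_len)
demo = tensorflow.random.normal((4, 50, 25))
print(tensorflow.reduce_mean(demo, axis=1).shape)  ## (4, 25)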
from tensorflow.keras.models import Model
from tensorflow.keras import layers
embed_len = 25
inputs = layers.Input(shape=(max_tokens, ))
embed_layer = layers.Embedding(len(tokenizer.index_word)+1, embed_len, input_length=max_tokens)
dense1 = layers.Dense(128, activation="relu")
dense2 = layers.Dense(len(target_classes), activation="softmax")
x = embed_layer(inputs)
x = tensorflow.reduce_mean(x, axis=1) ## Averaged word embeddings of single example
x = dense1(x)
output = dense2(x)
model = Model(inputs=inputs, outputs=output)
model.summary()
Here, we have compiled the network to use the Adam optimization algorithm, cross entropy loss, and accuracy metric.
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
Below, we have trained our network for 5 epochs with a batch size of 256. We can notice from the loss and accuracy values getting printed after each epoch that our model is doing a good job at the text classification task.
history = model.fit(X_train_vect, Y_train, batch_size=256, epochs=5, validation_data=(X_test_vect, Y_test))
gc.collect()
In this section, we have evaluated the performance of the network by calculating accuracy score, classification report and confusion matrix metrics on test predictions. We can notice from the accuracy score that our model is doing a little better job compared to the previous approach. We have also plotted the confusion matrix for reference purposes.
from sklearn.metrics import accuracy_score, classification_report
train_preds = model.predict(X_train_vect)
test_preds = model.predict(X_test_vect)
print("Train Accuracy : {}".format(accuracy_score(Y_train, np.argmax(train_preds, axis=1))))
print("Test Accuracy : {}".format(accuracy_score(Y_test, np.argmax(test_preds, axis=1))))
print("\nClassification Report : ")
print(classification_report(Y_test, np.argmax(test_preds, axis=1), target_names=target_classes))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test],
                                    [target_classes[i] for i in np.argmax(test_preds, axis=1)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5));
plt.xticks(rotation=90);
In this section, we have tried to explain the prediction of our network on a random text example using the LIME algorithm. Our network correctly predicts the target label as Sci/Tech for the selected text example. From the explanation visualization, we can notice that words like 'RFID', 'frequency', 'technology', 'privacy', 'identification', etc. are used to predict the target label Sci/Tech.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
rng = np.random.RandomState(1234)
idx = rng.randint(1, len(X_test_text))
print("Prediction : ", target_classes[model.predict(X_test_vect[idx:idx+1]).argmax(axis=-1)[0]])
print("Actual : ", target_classes[Y_test[idx]])
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
Our approach in this section is almost the same as our approach in the previous section with one minor change. In the previous section, we had averaged embeddings at the tokens level whereas in this section we have summed embeddings at the tokens level. The majority of the code is the same as in our previous section with one minor change in network definition.
Below, we have defined the network that we'll use for our task in this section. The definition of the network is exactly the same as in the previous approach, with the only difference being that we have summed embeddings using the reduce_sum() function. The rest of the code is the same as earlier.
from tensorflow.keras.models import Model
from tensorflow.keras import layers
embed_len = 25
inputs = layers.Input(shape=(max_tokens, ))
embed_layer = layers.Embedding(len(tokenizer.index_word)+1, embed_len, input_length=max_tokens)
dense1 = layers.Dense(128, activation="relu")
dense2 = layers.Dense(len(target_classes), activation="softmax")
x = embed_layer(inputs)
x = tensorflow.reduce_sum(x, axis=1) ## Sum word embeddings of single example
x = dense1(x)
output = dense2(x)
model = Model(inputs=inputs, outputs=output)
model.summary()
Here, we have compiled a network to use Adam optimizer, cross entropy loss, and accuracy metric.
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
Here, we have trained our network for 5 epochs with a batch size of 256. The loss and accuracy values getting printed at the end of each epoch hint that our model is doing a good job at the classification task.
history = model.fit(X_train_vect, Y_train, batch_size=256, epochs=5, validation_data=(X_test_vect, Y_test))
gc.collect()
In this section, we have again evaluated the performance of the network by calculating the accuracy score, classification report and confusion matrix metrics on test predictions. We can notice from the accuracy score that it is almost the same as our first approach. We have also plotted the confusion matrix for reference purposes.
from sklearn.metrics import accuracy_score, classification_report
train_preds = model.predict(X_train_vect)
test_preds = model.predict(X_test_vect)
print("Train Accuracy : {}".format(accuracy_score(Y_train, np.argmax(train_preds, axis=1))))
print("Test Accuracy : {}".format(accuracy_score(Y_test, np.argmax(test_preds, axis=1))))
print("\nClassification Report : ")
print(classification_report(Y_test, np.argmax(test_preds, axis=1), target_names=target_classes))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test],
                                    [target_classes[i] for i in np.argmax(test_preds, axis=1)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5));
plt.xticks(rotation=90);
Here, we have again explained the prediction made by our trained network on a random test example using the LIME algorithm. Words like 'RFID', 'frequency', 'technology', 'identification', 'threat', 'privacy', etc. are used to predict the target label as Sci/Tech, which makes sense as they are commonly used in that field.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
rng = np.random.RandomState(1234)
idx = rng.randint(1, len(X_test_text))
print("Prediction : ", target_classes[model.predict(X_test_vect[idx:idx+1]).argmax(axis=-1)[0]])
print("Actual : ", target_classes[Y_test[idx]])
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
Below, we have listed a results summary of all approaches we tried.
Approach | Max Tokens | Embedding Length | Test Accuracy (%) |
---|---|---|---|
Word Embeddings Flattened | 50 | 25 | 91.15 |
Word Embeddings Averaged | 50 | 25 | 91.60 |
Word Embeddings Summed | 50 | 25 | 91.07 |
This ends our small tutorial explaining how we can use word embeddings with keras networks for text classification tasks. Please feel free to let us know your views in the comments section.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.