When working with text data for machine learning problems, we need to convert the text to real-valued data before feeding it to ML algorithms, which only work on real-valued input. This process of converting text to real-valued data is generally referred to as text encoding. We split the text into tokens (characters, words, punctuation marks, etc.) and then map each token to a real-valued representation. There are various ways to encode text data, such as word frequency, Tf-Idf, one-hot encoding, and word embeddings. The approaches commonly used earlier were word frequency, Tf-Idf, and one-hot encoding, all of which carry very limited information per token (typically a single number or a sparse indicator). A single number cannot capture the meaning of a token/word, which can differ across text contexts. The word embeddings approach that we have explained in this tutorial takes the concept further and uses a real-valued vector to represent a single token. Each token is represented by a list of floats, which can be of any length greater than one. As more numbers are used to represent a token/word, the representation can capture more information and better reflect its meaning.
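To make the idea concrete, here is a tiny, purely illustrative Python sketch (the tokens and vector values below are made up, not trained embeddings) showing tokens being represented by lists of floats:

## Hypothetical word embeddings: each token maps to a vector of floats (made-up values)
embeddings = {
    "sports" : [ 0.12, -0.45, 0.88, 0.03],
    "cricket": [ 0.10, -0.40, 0.91, 0.07],
    "stocks" : [-0.67,  0.22, 0.05, 0.54],
}
print(embeddings["sports"])  ## one 4-dimensional embedding vector for the token 'sports'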
As a part of this tutorial, we'll explain how we can design neural networks using Keras that use word embeddings for text classification tasks. We'll start with random word embeddings for tokens and then update them by training the network on our data so that they learn the meaning of tokens/words. We have also explained various approaches to using word embeddings.
Below, we have listed the important sections of the tutorial to give an overview of the material covered.
Below, we have imported the necessary Python libraries and printed the versions that we have used in our tutorial.
import tensorflow
print("Tensorflow Version : {}".format(tensorflow.__version__))

from tensorflow import keras
print("Keras Version : {}".format(keras.__version__))
import torchtext
print("Torchtext Version : {}".format(torchtext.__version__))
In this section, we have prepared the data to be fed directly to the neural network. We first loaded the dataset and then trained a tokenizer with the list of text examples to populate its vocabulary with tokens (words and punctuation marks). The vocabulary is a simple mapping from a token to a unique integer index; each token is assigned a unique integer index starting from 1. Then, we vectorize each text example using this populated vocabulary: we tokenize the text example so that we have a list of tokens, and then we retrieve the index of each token from the vocabulary.
Below, we have loaded the dataset that we'll use for our purpose. We have loaded the AG NEWS dataset available from the torchtext library. It has text examples from 4 different categories of news ("World", "Sports", "Business", and "Sci/Tech"). The dataset is already divided into train and test sets.
import numpy as np

train_dataset, test_dataset = torchtext.datasets.AG_NEWS()

X_train_text, Y_train = [], []
for Y, X in train_dataset:
    X_train_text.append(X)
    Y_train.append(Y)

X_test_text, Y_test = [], []
for Y, X in test_dataset:
    X_test_text.append(X)
    Y_test.append(Y)

unique_classes = list(set(Y_train))
target_classes = ["World", "Sports", "Business", "Sci/Tech"]

## Subtracted 1 from labels to bring range from 1-4 to 0-3
Y_train, Y_test = np.array(Y_train) - 1, np.array(Y_test) - 1

len(X_train_text), len(X_test_text)
In this section, we have vectorized the train and test datasets so that they can be given directly to the neural network. In order to vectorize the data, we have used the Tokenizer() class available from Keras.
We first created an instance of Tokenizer and then called its fit_on_texts() method with our datasets (train and test) to populate the vocabulary with tokens. The method tokenizes each text example and adds its tokens to the vocabulary.
Once we have populated the vocabulary, we can translate any text example to a list of indexes by calling the texts_to_sequences() method on the Tokenizer object. The method takes a list of text examples as input and returns a list of token indexes for each of them.
Below, we have illustrated the vectorization process with a simple example. Tokens that have an entry in the vocab get replaced with their respective indexes, and tokens for which we don't have a mapping get replaced with the index of the '<unk>' token.
text = "Hello, How are you? Where are you planning to go?"
tokens = ['hello', ',', 'how', 'are', 'you', '?', 'where',
'are', 'you', 'planning', 'to', 'go', '?']
vocab = {
'hello': 0,
'bye': 1,
'how': 2,
'the': 3,
'welcome': 4,
'are': 5,
'you': 6,
'to': 7,
'<unk>': 8,
}
vector = [0,8,2,4,6,8,8,5,6,8,7,8,8]
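A minimal sketch of how this mapping can be computed in plain Python, reusing the tokens and vocab defined above:

## Replace each token with its vocabulary index, falling back to '<unk>' for unknown tokens
vector = [vocab.get(token, vocab['<unk>']) for token in tokens]
print(vector)  ## [0, 8, 2, 5, 6, 8, 8, 5, 6, 8, 7, 8, 8]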
The length of each text can be arbitrary, but we have decided to keep a maximum of 50 tokens per text example. We have enforced this using the pad_sequences() function. It keeps the length of each text example at 50 tokens by padding examples that have fewer than 50 tokens with 0s and truncating examples that have more than 50 tokens.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_text+X_test_text)
print("Vocabulary Size : {}".format(len(tokenizer.index_word)))
max_tokens = 50
## Vectorizing data to keep max_tokens words per sample.
X_train_vect = pad_sequences(tokenizer.texts_to_sequences(X_train_text), maxlen=max_tokens, padding="post", truncating="post", value=0)
X_test_vect = pad_sequences(tokenizer.texts_to_sequences(X_test_text), maxlen=max_tokens, padding="post", truncating="post", value=0)
print(X_train_vect[:2])
X_train_vect.shape, X_test_vect.shape
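To see concretely what pad_sequences() does, below is a small sketch with a made-up sequence of 3 token indexes padded to a maximum length of 5:

## Pad a short sequence of indexes with 0s at the end; longer sequences would get truncated instead
demo = pad_sequences([[11, 7, 42]], maxlen=5, padding="post", truncating="post", value=0)
print(demo)  ## [[11  7 42  0  0]]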
## Which word has index 444 in the vocabulary?
print(tokenizer.index_word[444])

## How many times does it appear in the first text document?
print(X_train_text[0]) ## 2 times
Our first approach keeps the embeddings of all tokens of a text example and lays them next to each other to create one big vector. We have designed a network that has a single embedding layer and 2 dense layers for performing the text classification task.
In this section, we have defined our text classification network using the Sequential API of Keras. The network consists of 3 layers.
The first layer is the embedding layer. We can create an embedding layer using the Embedding() constructor available from the layers sub-module of Keras. The first parameter to the constructor is the number of different tokens (vocabulary size) and the second parameter is the embedding length per token. The input_length parameter specifies the number of tokens per text example. The constructor internally creates a weight matrix of shape (vocab_size, embed_len), hence we have a vector of length embed_len for each token of the vocabulary. The embedding layer is simply an indexing layer that takes a list of token indexes per text example as input and returns the respective embeddings from the weight matrix. The input to the embedding layer is of shape (batch_size, max_tokens) = (batch_size, 50) and the output is of shape (batch_size, max_tokens, embed_len) = (batch_size, 50, 25). So basically, the text example gets tokenized to create a list of tokens, those tokens are mapped to their respective indexes using the vocabulary, and the embedding layer maps the indexes to their respective embedding vectors.
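Below is a minimal standalone sketch (with a made-up vocabulary size of 1,000) showing the shape transformation that an embedding layer performs:

import numpy as np
from tensorflow.keras import layers

demo_embed = layers.Embedding(input_dim=1000, output_dim=25)  ## hypothetical vocab of 1,000 tokens, 25-dim embeddings
demo_idxs = np.random.randint(1, 1000, size=(2, 50))          ## 2 examples with 50 token indexes each
print(demo_embed(demo_idxs).shape)                            ## (2, 50, 25) = (batch_size, max_tokens, embed_len)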
The output of embedding layer is flattened hence the shape is transformed from (batch_size, max_tokens, embed_len) to (batch_size, max_tokens x embed_len) = (batch_size, 1250).
The flattened output is given to a dense layer with 128 output units. The dense layer applies relu activation function to the output.
The output of the first dense layer is then given to the second dense layer, which has 4 output units. The second dense layer applies the softmax activation function to the output, hence the output is 4 probabilities in the range 0-1 (which all sum up to 1) per text example.
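As a quick illustration of the softmax step (with made-up raw outputs), the 4 values are turned into probabilities that lie in the range 0-1 and sum to 1:

import numpy as np

logits = np.array([2.0, 1.0, 0.1, -1.0])       ## hypothetical raw outputs of the last dense layer
probs = np.exp(logits) / np.exp(logits).sum()  ## softmax
print(probs, probs.sum())                      ## 4 probabilities summing to 1 (up to floating point)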
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
import gc

embed_len = 25

model = Sequential([
    layers.Embedding(len(tokenizer.index_word)+1, embed_len, input_length=max_tokens),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(len(target_classes), activation="softmax")
])
model.summary()
In this section, we have compiled a network to use Adam optimizer, cross entropy loss, and accuracy evaluation metric.
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
In this section, we have trained our network for 5 epochs with a batch size of 256 by calling fit() method. We have provided train and test data as well. We can notice from the loss and accuracy getting printed after each epoch that our model is doing quite a good job at the text classification task.
history = model.fit(X_train_vect, Y_train, batch_size=256, epochs=5, validation_data=(X_test_vect, Y_test))
gc.collect()
In this section, we have evaluated the performance of the network by calculating accuracy, classification report (precision, recall, and f1-score per target class) and confusion matrix metrics for test predictions. We can notice from the accuracy that our model is doing quite a good job at predicting target labels for test examples.
We have calculated various ML metrics using functions available from scikit-learn. Please feel free to check the below link if you are interested in learning about various ML metrics available from sklearn in-depth.
Apart from calculating metrics, we have also plotted the confusion matrix using python library scikit-plot. We can notice from the visualization that our model seems to be doing a good job at classifying text documents from categories Sports and World compared to categories Business and Sci/Tech.
The scikit-plot has visualizations for many commonly used ML metrics. Please feel free to check the below link if you want to learn about them as well.
from sklearn.metrics import accuracy_score, classification_report
train_preds = model.predict(X_train_vect)
test_preds = model.predict(X_test_vect)
print("Train Accuracy : {}".format(accuracy_score(Y_train, np.argmax(train_preds, axis=1))))
print("Test Accuracy : {}".format(accuracy_score(Y_test, np.argmax(test_preds, axis=1))))
print("\nClassification Report : ")
print(classification_report(Y_test, np.argmax(test_preds, axis=1), target_names=target_classes))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test],
                                    [target_classes[i] for i in np.argmax(test_preds, axis=1)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5));
plt.xticks(rotation=90);
In this section, we have tried to explain predictions made by our trained network using the LIME algorithm. The algorithm is commonly used to explain predictions made by black-box ML models. It lets us create a visualization that highlights the important tokens of text that contributed to predicting a particular target label. We'll be using the Python library lime for our purpose, which has an implementation of the algorithm. In order to explain a prediction, we need to follow a few simple steps, which we have covered below.
If you are someone who is new to the LIME concept then we recommend going through the below tutorials in your free time to know about it in detail.
Below, we have first created an instance of LimeTextExplainer() with the names of target classes. This instance will be used to explain the prediction of the network.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
Below, we have first created a function that takes a batch of text examples as input and returns their predicted probabilities. The function tokenizes the text data, vectorizes it, and gives it to the network to make predictions. The output probabilities of the network are returned from the function. We'll use this function in the next cell for explanation purposes.
After defining a function, we randomly selected one text example from the test dataset and made predictions on it using our trained model. Our model correctly predicts the target label as 'Sci/Tech' for the selected text example.
import numpy as np

def make_predictions(X_batch_text):
    ## Vectorize the batch of raw text examples and return predicted probabilities
    X_batch = pad_sequences(tokenizer.texts_to_sequences(X_batch_text), maxlen=50, padding="post", truncating="post", value=0)
    preds = model.predict(X_batch)
    return preds
rng = np.random.RandomState(1234)
idx = rng.randint(1, len(X_test_text))
print("Prediction : ", target_classes[model.predict(X_test_vect[idx:idx+1]).argmax(axis=-1)[0]])
print("Actual : ", target_classes[Y_test[idx]])
In the below cell, we have created a visualization explaining the prediction of our network. To do that, we have first called explain_instance() method on LimeTextExplainer object. We have given it a selected text example, our prediction function, and the target label. The method returns an Explanation object which has details about the explanation. We have then called show_in_notebook() method on Explanation object to create visualization.
We can notice from the visualization that words like 'RFID', 'privacy', 'frequency', 'technology', etc are used for predicting the target label as Sci/Tech which makes sense as they are commonly used words in the field.
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
Our approach to working with embeddings in this section is a little bit different from our previous approach. In the previous section, we flattened the output of the embedding layer. In this section, we instead average the embeddings of the tokens of each example and give the result to the next dense layer. The majority of the code is exactly the same as in the previous section, with the only change being the definition of the text classification network.
Below, we have defined the network that we'll use for our task in this section. We have defined the network using the Model (functional) API of Keras this time. The network consists of 3 layers (1 embedding and 2 dense) like our previous approach. The only difference is that the output of the embedding layer is averaged over the tokens dimension. The input shape of the embedding layer is (batch_size, max_tokens) and the output shape is (batch_size, max_tokens, embed_len); after averaging at the tokens level, the shape gets transformed to (batch_size, embed_len). The output of shape (batch_size, embed_len) is then given to the dense layer.
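A minimal sketch (with random values) of how averaging over the tokens dimension changes the shape:

## Average token embeddings: (batch_size, max_tokens, embed_len) -> (batch_size, embed_len)
demo = tensorflow.random.normal((4, 50, 25))
print(tensorflow.reduce_mean(demo, axis=1).shape)  ## (4, 25)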
from tensorflow.keras.models import Model
from tensorflow.keras import layers
embed_len = 25
inputs = layers.Input(shape=(max_tokens, ))
embed_layer = layers.Embedding(len(tokenizer.index_word)+1, embed_len, input_length=max_tokens)
dense1 = layers.Dense(128, activation="relu")
dense2 = layers.Dense(len(target_classes), activation="softmax")
x = embed_layer(inputs)
x = tensorflow.reduce_mean(x, axis=1) ## Averaged word embeddings of single example
x = dense1(x)
output = dense2(x)
model = Model(inputs=inputs, outputs=output)
model.summary()
Here, we have compiled the network to use the Adam optimization algorithm, cross entropy loss, and accuracy metric.
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
Below, we have trained our network for 5 epochs with a batch size of 256. We can notice from the loss and accuracy values getting printed after each epoch that our model is doing a good job at the text classification task.
history = model.fit(X_train_vect, Y_train, batch_size=256, epochs=5, validation_data=(X_test_vect, Y_test))
gc.collect()
In this section, we have evaluated the performance of the network by calculating accuracy score, classification report and confusion matrix metrics on test predictions. We can notice from the accuracy score that our model is doing a little better job compared to the previous approach. We have also plotted the confusion matrix for reference purposes.
from sklearn.metrics import accuracy_score, classification_report
train_preds = model.predict(X_train_vect)
test_preds = model.predict(X_test_vect)
print("Train Accuracy : {}".format(accuracy_score(Y_train, np.argmax(train_preds, axis=1))))
print("Test Accuracy : {}".format(accuracy_score(Y_test, np.argmax(test_preds, axis=1))))
print("\nClassification Report : ")
print(classification_report(Y_test, np.argmax(test_preds, axis=1), target_names=target_classes))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test],
                                    [target_classes[i] for i in np.argmax(test_preds, axis=1)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5));
plt.xticks(rotation=90);
In this section, we have tried to explain the prediction of our network on a random text example using the LIME algorithm. Our network correctly predicts the target label as Sci/Tech for the selected text example. From the explanation visualization, we can notice that words like 'RFID', 'frequency', 'technology', 'privacy', 'identification', etc. are used to predict the target label Sci/Tech.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
rng = np.random.RandomState(1234)
idx = rng.randint(1, len(X_test_text))
print("Prediction : ", target_classes[model.predict(X_test_vect[idx:idx+1]).argmax(axis=-1)[0]])
print("Actual : ", target_classes[Y_test[idx]])
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
Our approach in this section is almost the same as our approach in the previous section with one minor change. In the previous section, we had averaged embeddings at the tokens level whereas in this section we have summed embeddings at the tokens level. The majority of the code is the same as in our previous section with one minor change in network definition.
Below, we have defined the network that we'll use for our task in this section. The definition of the network is exactly the same as in the previous approach, with the only difference being that we have summed embeddings using the reduce_sum() function. The rest of the code is the same as earlier.
from tensorflow.keras.models import Model
from tensorflow.keras import layers
embed_len = 25
inputs = layers.Input(shape=(max_tokens, ))
embed_layer = layers.Embedding(len(tokenizer.index_word)+1, embed_len, input_length=max_tokens)
dense1 = layers.Dense(128, activation="relu")
dense2 = layers.Dense(len(target_classes), activation="softmax")
x = embed_layer(inputs)
x = tensorflow.reduce_sum(x, axis=1) ## Sum word embeddings of single example
x = dense1(x)
output = dense2(x)
model = Model(inputs=inputs, outputs=output)
model.summary()
Here, we have compiled a network to use Adam optimizer, cross entropy loss, and accuracy metric.
model.compile("adam", "sparse_categorical_crossentropy", metrics=["accuracy"])
Here, we have trained our network for 5 epochs with a batch size of 256. The loss and accuracy values getting printed at the end of each epoch hint that our model is doing a good job at the classification task.
history = model.fit(X_train_vect, Y_train, batch_size=256, epochs=5, validation_data=(X_test_vect, Y_test))
gc.collect()
In this section, we have again evaluated the performance of the network by calculating the accuracy score, classification report and confusion matrix metrics on test predictions. We can notice from the accuracy score that it is almost the same as our first approach. We have also plotted the confusion matrix for reference purposes.
from sklearn.metrics import accuracy_score, classification_report
train_preds = model.predict(X_train_vect)
test_preds = model.predict(X_test_vect)
print("Train Accuracy : {}".format(accuracy_score(Y_train, np.argmax(train_preds, axis=1))))
print("Test Accuracy : {}".format(accuracy_score(Y_test, np.argmax(test_preds, axis=1))))
print("\nClassification Report : ")
print(classification_report(Y_test, np.argmax(test_preds, axis=1), target_names=target_classes))
from sklearn.metrics import confusion_matrix
import scikitplot as skplt
import matplotlib.pyplot as plt
skplt.metrics.plot_confusion_matrix([target_classes[i] for i in Y_test],
                                    [target_classes[i] for i in np.argmax(test_preds, axis=1)],
                                    normalize=True,
                                    title="Confusion Matrix",
                                    cmap="Blues",
                                    hide_zeros=True,
                                    figsize=(5,5));
plt.xticks(rotation=90);
Here, we have again explained the prediction made by our trained network on a random test example using the LIME algorithm. Words like 'RFID', 'frequency', 'technology', 'identification', 'threat', 'privacy', etc. are used to predict the target label as Sci/Tech, which makes sense as they are commonly used in that field.
from lime import lime_text
explainer = lime_text.LimeTextExplainer(class_names=target_classes, verbose=True)
rng = np.random.RandomState(1234)
idx = rng.randint(1, len(X_test_text))
print("Prediction : ", target_classes[model.predict(X_test_vect[idx:idx+1]).argmax(axis=-1)[0]])
print("Actual : ", target_classes[Y_test[idx]])
explanation = explainer.explain_instance(X_test_text[idx], classifier_fn=make_predictions, labels=Y_test[idx:idx+1])
explanation.show_in_notebook()
Below, we have listed a results summary of all approaches we tried.
Approach | Max Tokens | Embedding Length | Test Accuracy (%) |
---|---|---|---|
Word Embeddings Flattened | 50 | 25 | 91.15 |
Word Embeddings Averaged | 50 | 25 | 91.60 |
Word Embeddings Summed | 50 | 25 | 91.07 |
This ends our small tutorial explaining how we can use word embeddings with keras networks for text classification tasks. Please feel free to let us know your views in the comments section.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.