Text generation is an NLP task of generating new text using a trained network. The networks used for text generation tasks are referred to as language models, and the process of training such networks is referred to as language modeling. Other NLP tasks like translation, text summarization, speech-to-text, conversational systems (chatbots), etc. are also language modeling tasks. Nowadays, deep neural networks are commonly developed for language modeling. Generally, recurrent neural networks and their variants (LSTM, GRU, etc.) outperform other neural network architectures on text generation tasks. The reason behind this is that language sentences have structure: the word sequence is important for correct grammar and for forming sentences. RNNs are quite good at capturing and remembering those sequences and are hence commonly used for language modeling tasks.
As a part of this tutorial, we have explained how we can create a Recurrent Neural Network (RNN) consisting of LSTM layers using the Python deep learning library MXNet for a text generation task. We have used a character-based approach for text generation, which means we'll give the network a specified number of characters of a sentence and make it predict the next character after them. The text data is encoded using a bag-of-words approach, where each character is mapped to its index in the vocabulary. We have used the WikiText-2 Wikipedia dataset available from torchtext for our task. It has the text of well-curated Wikipedia articles. We have another tutorial on text generation using MXNet that uses character embeddings for encoding text data. Please check the below link to take a look at it.
Please make a NOTE that language models are quite hard to train on a CPU, hence we recommend using a GPU for this task. We have used a GPU for training in this tutorial (see below after the library imports).
Below, we have listed essential sections of the tutorial to give an overview of the material covered.
Below, we have imported the necessary libraries and printed the versions that we have used in this tutorial.
import mxnet
print("MXNet Version : {}".format(mxnet.__version__))
import gluonnlp
print("GluonNLP Version : {}".format(gluonnlp.__version__))
import torchtext
print("TorchText Version : {}".format(torchtext.__version__))
device = mxnet.gpu() if mxnet.test_utils.list_gpus() else mxnet.cpu()
device
In this section, we are preparing data for our text generation task. As discussed earlier, we'll be using a character-based approach for the task. We'll be designing a network that takes 100 characters of data as input and predicts the next character. In order to train the network, we'll organize the data following the below steps.
1. Load the text examples of the train dataset.
2. Populate a vocabulary of the unique characters present in the text.
3. Move a window of 100 characters through each text example, taking the characters inside the window as data features and the single character right after the window as the target value.
4. Retrieve the index of each feature and target character from the vocabulary.
5. Convert the resulting arrays to MXNet ndarrays and reshape the features as required by the LSTM layers.
The output of the 5th step will be used to train the network. It's okay if you don't understand the steps exactly; they will become clear as we implement them below.
In this section, we have loaded our Wikipedia dataset from the torchtext library. The dataset is already divided into train, validation, and test sets. We'll be using only the train set for our purpose. It has nearly 36k text entries taken from well-curated Wikipedia articles.
from mxnet.gluon.data import ArrayDataset
train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()
X_train = list(train_dataset)
train_dataset = ArrayDataset(X_train)
len(train_dataset)
In this section, we have populated the vocabulary of unique characters. In order to populate the vocabulary, we have used the count_tokens() helper function available from the GluonNLP library that accompanies MXNet. We have first created a Counter object available from the collections Python module. This object is a dictionary-like object that maintains characters and their counts. After defining the counter object, we loop through each text example of the dataset, calling count_tokens() with the list of characters of the example and the counter object. This method keeps updating the counter object with the frequency of each character. After the loop completes, the Counter object holds all characters present in the dataset along with their counts.
Then, we have created a vocabulary by calling the Vocab() constructor available from the gluonnlp module with the Counter object. This Vocab object holds our vocabulary of characters. We have printed the number of entries present in the vocabulary as well as the vocabulary contents themselves.
from collections import Counter
counter = Counter()
for dataset in [train_dataset, ]:
for X in dataset:
gluonnlp.data.count_tokens(list(X), to_lower=True, counter=counter)
vocab = gluonnlp.Vocab(counter=counter, min_freq=1)
print("Vocabulary Size : {}".format(len(vocab)))
print(vocab.token_to_idx)
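Below is a small optional check (a minimal sketch, assuming the vocab object built above; the sample word is just an illustration) showing how character-to-index and index-to-character lookups behave. The same vocab(...) and to_tokens() calls are used later for vectorizing the dataset and decoding predictions.
sample_chars = list("hello")                      ## Characters of a sample word
sample_indexes = vocab(sample_chars)              ## Character -> index lookup
print("Indexes : {}".format(sample_indexes))
print("Tokens  : {}".format(vocab.to_tokens(sample_indexes)))  ## Index -> character round trip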
In this section, we have organized our dataset for training purposes. We loop through the text examples of the train dataset one by one and move a window of size 100 through each example, as discussed earlier. We have created data features (X_train) and target values (Y_train) that hold characters. Then, we have retrieved the index of each character from the vocabulary for both data features (X_train) and target values (Y_train). Now our dataset consists of character indexes, which will be given to the network for training. We have also converted the arrays to MXNet ndarrays as required by MXNet networks.
Please make a NOTE that we have used only the first 7,500 text examples from the dataset for training and not the whole dataset. The reason behind this is that it would take a lot of time to train the network if we used all examples.
Below, we have tried to explain the process with one simple example.
vocab = {
'h':1,
'e':2,
'l':3,
'o':4,
' ':5,
',':6,
'w':7,
'a':8,
'r':9,
'y':10,
'u':11,
'?':12,
'c':13,
'm':14,
't':15,
'd':16,
'z':17,
'n':18
}
text_example = "Hello, How are you? Welcome to coderzcolumn?"
seq_length = 10
X_train = [
['h','e','l','l','o',',',' ', 'h','o','w'],
['e','l','l','o',',',' ', 'h','o','w',' '],
['l','l','o',',',' ', 'h','o','w', ' ', 'a'],
['l','o',',',' ', 'h','o','w',' ', 'a', 'r'],
...
['d','e','r','z','c','o','l', 'u','m','n']
]
Y_train = [' ','a','r','e',' ','y','o','u','?',' ',..., '?']
X_train_vectorized = [
[1,2,3,3,4,6,5,1,4,7],
[2,3,3,4,6,5,1,4,7,5],
[3,3,4,6,5,1,4,7,5,8],
[3,4,6,5,1,4,7,5,8,9],
...
[16,2,9,17,13,4,3,11,14,18]
]
Y_train_vectorized = [5,8,9,2,5,10,4,11,12,5,...., 12]
%%time
import gluonnlp.data.batchify as bf
from mxnet import nd
import numpy as np
train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()
train_dataset = ArrayDataset(list(train_dataset))
seq_length = 100 ## Network Hyperparameter to tune
X_train, Y_train = [], []
for text in list(train_dataset)[:7500]:
for i in range(0, len(text)-seq_length):
inp_seq = list(text[i:i+seq_length].lower())
out_seq = text[i+seq_length].lower()
X_train.append(vocab(inp_seq)) ## Retrieve character index
Y_train.append(vocab[out_seq]) ## Retrieve character index
X_train, Y_train = nd.array(X_train, dtype=np.float32), nd.array(Y_train)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1) ## Extra dimension is added for LSTM layer
X_train.shape, Y_train.shape
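As an optional sanity check (a minimal sketch using the X_train/Y_train ndarrays created above), we can decode the first prepared example back to text to confirm that the input is a 100-character window and the target is the character immediately following it.
sample_in = X_train[0].asnumpy().astype(int).flatten().tolist()  ## Indexes of the first input sequence
sample_out = int(Y_train[0].asscalar())                          ## Index of its target character
print("Input Window : {}".format("".join(vocab.to_tokens(sample_in))))
print("Next Char    : {}".format(vocab.idx_to_token[sample_out]))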
In this section, we have created a dataset and data loader from the data features and target values ndarrays created in the previous step. The data loader will let us loop through the training data in batches. We have set shuffle to False to prevent shuffling of examples because the order of character sequences is important. The batch size is set to 1024.
from mxnet.gluon.data import DataLoader, ArrayDataset
vectorized_train_dataset = ArrayDataset(X_train, Y_train)
train_loader = DataLoader(vectorized_train_dataset, batch_size=1024, shuffle=False)
for X, Y in train_loader:
print(X.shape, Y.shape)
break
In this section, we have defined the LSTM network that we'll use for our task. As we'll be predicting one of the vocabulary characters as output, our task is treated as a classification task. The network is simple and consists of 3 layers.
The first two layers of our network are LSTM layers. We have created them using the LSTM() constructor available from the rnn sub-module of MXNet's gluon sub-module. We have set the hidden size of the LSTM layers to 256. By setting the num_layers parameter to 2, we have asked the constructor to stack two LSTM layers. The input shape to the first LSTM layer will be (batch_size, seq_length, 1) = (batch_size, 100, 1) and the output shape will be (batch_size, seq_length, hidden_size) = (batch_size, 100, 256). The output of the first LSTM layer is given to the second LSTM layer, which also produces an output of shape (batch_size, 100, 256). The LSTM layers process the sequence of character indexes internally: for each example, they go through the 100 characters one by one and produce an output that remembers information about this 100-character sequence.
The output of the second LSTM layer at the last time step is given to the dense layer. The dense layer has the same number of output units as the length of the vocabulary. The output of the dense layer will be of shape (batch_size, vocab_len) and is the prediction of the network.
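To make the shape discussion above concrete, below is a small standalone sketch (hypothetical demo layers on the CPU, not part of the final network) that feeds a random batch through a 2-layer LSTM with layout "NTC" followed by a dense layer and prints the intermediate shapes.
from mxnet import nd
from mxnet.gluon import nn, rnn
demo_lstm = rnn.LSTM(hidden_size=256, num_layers=2, layout="NTC", input_size=1)
demo_dense = nn.Dense(10)                  ## 10 is just a placeholder for the vocabulary size
demo_lstm.initialize()
demo_dense.initialize()
x = nd.random.randn(4, 100, 1)             ## (batch_size, seq_length, 1)
lstm_out = demo_lstm(x)                    ## (batch_size, seq_length, hidden_size) = (4, 100, 256)
dense_out = demo_dense(lstm_out[:, -1])    ## Last time step only -> (batch_size, 10)
print(lstm_out.shape, dense_out.shape)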
After defining the network, we initialized it and performed a forward pass through it for verification purposes. We have also printed the shapes of the weights/biases of the network layers for information purposes.
We have not discussed the LSTM layer in detail here. If you are new to it, we recommend that you go through the below link in your free time, where we have explained it. That tutorial uses an LSTM network for a text classification task and will help you understand LSTM layers a little better.
If you are new to MXNet and want to learn how to create neural networks using it then please check the below link in your free time.
from mxnet.gluon import nn, rnn
from mxnet import gluon
hidden_dim = 256
n_layers = 2
class LSTMTextGenerator(nn.Block):
def __init__(self, **kwargs):
super(LSTMTextGenerator, self).__init__(**kwargs)
self.lstm = rnn.LSTM(hidden_size=hidden_dim, num_layers=n_layers, layout="NTC", input_size=1)
self.dense = nn.Dense(len(vocab))
def forward(self, x):
x = self.lstm(x)
return self.dense(x[:, -1])
model = LSTMTextGenerator()
model
from mxnet import init, initializer
model.initialize(initializer.Xavier(), ctx=mxnet.Context(device))
preds = model(nd.random.randn(10,seq_length,1, ctx=device))
preds.shape
for key,val in model.collect_params().items():
print("{:25s} : {}".format(key, val.shape))
In this section, we are training our network. In order to train the network, we have designed a function that will help us perform the training process. The function takes the trainer object, the train data loader, and the number of epochs as input. It then executes the training loop for the given number of epochs. During each epoch, it loops through the training data in batches. For each batch of data, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates the network parameters. The function records the loss of each batch and prints the average loss across batches every 5 epochs.
from mxnet import autograd
from tqdm import tqdm
from sklearn.metrics import accuracy_score
def TrainModelInBatches(trainer, train_loader, epochs):
for i in range(1, epochs+1):
losses = [] ## Record loss of each batch
for X_batch, Y_batch in tqdm(train_loader):
with autograd.record():
preds = model(X_batch.as_in_context(device)) ## Forward pass to make predictions
train_loss = loss_func(preds.squeeze(), Y_batch.as_in_context(device)) ## Calculate Loss
train_loss.backward() ## Calculate Gradients
train_loss = train_loss.mean().asscalar()
losses.append(train_loss)
trainer.step(len(X_batch)) ## Update weights
if (i%5)==0:
print("Train CrossEntropyLoss : {:.3f}".format(np.array(losses).mean()))
Below, we are actually training the network using the function designed in the previous cell. We have set the number of epochs to 50 and the learning rate to 0.001. Then, we have initialized our model, cross-entropy loss, Adam optimizer, and Trainer object (with the network parameters). At last, we have called our training routine with the necessary parameters to perform training. We can notice from the decreasing loss values printed during training that our model seems to be improving. Next, we'll generate some text using it.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=50
learning_rate = 0.001
model = LSTMTextGenerator()
model.initialize(initializer.Xavier(), ctx=mxnet.Context(device))
loss_func = loss.SoftmaxCrossEntropyLoss()
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, epochs)
In this section, we are generating text using our trained model. To start with, we have randomly selected an example from our dataset and printed its characters. Then, we have executed a loop that generates 100 new characters. In the first iteration, the selected example is given to the model to predict the next character. The predicted character is appended to the end of the sequence and the first character is removed so that the sequence stays at the length of 100 required by our model. This modified sequence becomes the input to the model for the next iteration. The process is repeated for each iteration: generate a new character, append it to the sequence, and drop the first character. After generating 100 new characters, we have printed them.
We can notice from the generated text that our model has learned to form words and there are no spelling mistakes. The generated text seems to be in the English language, though it does not make much sense. The predictions made by the network seem a little deterministic, as it keeps repeating words. We'll train the network for more epochs to see whether we can further improve the results.
import random
random.seed(123)
idx = random.randint(0, len(X_train)-1)
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))
generated_text = []
for i in range(100):
X_batch = nd.array(pattern, dtype=np.float32).reshape(1, seq_length, 1) ## Design Batch
preds = model(X_batch.as_in_context(device)) ## Make Prediction
predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
generated_text.append(predicted_index) ## Add token index to result
pattern.append(predicted_index) ## Add token index to original pattern
pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.
print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))
In this section, we have trained the network for another 50 epochs. We have set the learning rate to 0.0003. We can notice from the loss values that our network is improving further because the loss is decreasing at every epoch. Next, we'll test it by generating new text.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=50
learning_rate = 0.0003
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, epochs)
In this section, we have again generated text using our further-trained model. The code for text generation is almost the same as earlier, and we have started with the same random example. We can notice that the generated text is a little better compared to last time: it generates more words and, as usual, there are no spelling mistakes. However, the model still seems deterministic. We can train it for a few more epochs to check whether that helps.
import random
random.seed(123)
idx = random.randint(0, len(X_train)-1)
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))
generated_text = []
for i in range(100):
X_batch = nd.array(pattern, dtype=np.float32).reshape(1, seq_length, 1) ## Design Batch
preds = model(X_batch.as_in_context(device)) ## Make Prediction
predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
generated_text.append(predicted_index) ## Add token index to result
pattern.append(predicted_index) ## Add token index to original pattern
pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.
print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))
In this section, we have trained the network for another 50 epochs with a learning rate of 0.0001. Please make a note that we have reduced the learning rate a second time. The loss values indicate that our network has improved further. Next, we'll test it by generating new text.
from mxnet import gluon, initializer
from mxnet.gluon import loss
from mxnet import autograd
from mxnet import optimizer
epochs=50
learning_rate = 0.0001
optimizer = optimizer.Adam(learning_rate=learning_rate)
trainer = gluon.Trainer(model.collect_params(), optimizer)
TrainModelInBatches(trainer, train_loader, epochs)
Here, we have again generated new text using our trained network. Our logic for generating text is the same as earlier, and it starts with the same example. We can notice that the generated text is quite different compared to earlier: it generates punctuation marks as well this time, and the words are spelled correctly. Next, we have given a few recommendations on how text generation models can be improved further.
import random
random.seed(123)
idx = random.randint(0, len(X_train)-1)
pattern = X_train[idx].asnumpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.to_tokens(pattern))))
generated_text = []
for i in range(100):
X_batch = nd.array(pattern, dtype=np.float32).reshape(1, seq_length, 1) ## Design Batch
preds = model(X_batch.as_in_context(device)) ## Make Prediction
predicted_index = preds.argmax(axis=-1).asnumpy().astype(int)[0] ## Retrieve token index
generated_text.append(predicted_index) ## Add token index to result
pattern.append(predicted_index) ## Add token index to original pattern
pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.
print("Generated Text : {}".format("".join([vocab.idx_to_token[i] for i in generated_text])))
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.