Text generation, also referred to as natural language generation, is a kind of language modeling problem where we build a model that tries to understand the structure of a text and produce another text. Tasks like machine translation, conversational systems (chatbots), speech-to-text, and text summarization all build language models at their core. Nowadays, deep learning models are commonly developed for language modeling tasks. In the case of text generation, the language model tries to predict the next token (character/word/n-gram) in a text based on previously seen tokens. In order to predict the next token in a sequence, the language model needs to understand the order in which tokens are laid out. Recurrent Neural Networks (RNNs) and their variants (LSTM, GRU, etc.) are quite good at modeling the order of input data and hence can be used for language modeling tasks.
As a part of this tutorial, we have explained how to create Recurrent Neural Networks consisting of LSTM layers using the Python deep learning library PyTorch for a text generation task. We have used a character-based approach, where the model takes a specified number of characters as input and predicts the next character in the sequence. In the same way, we can also create networks that take a sequence of words as input and predict the next word. We have encoded text data by replacing each character with its integer index from a vocabulary. We have used the WikiText-2 Wikipedia text corpus available from the torchtext library (a PyTorch helper library for NLP tasks) for our purpose. We have another tutorial on text generation using PyTorch which uses character embeddings for encoding text data. Please feel free to check it out.
Please make a NOTE that language models are generally big and take time to train before they can produce meaningful text. Training them on a CPU is slow and a GPU speeds up training considerably, hence we recommend training language models on a GPU.
Below, we have listed the important sections of the tutorial to give an overview of the material covered.
Below, we have imported the necessary Python libraries and printed the versions that we have used in our tutorial.
import torch
print("PyTorch Version : {}".format(torch.__version__))
import torchtext
print("TorchText Version : {}".format(torchtext.__version__))
device = "cuda" if torch.cuda.is_available() else "cpu"
device
import gc
In this section, we are preparing our data for training our network. As we said earlier, we are going to use a character-based approach for text generation, hence we'll feed a few characters to the network and make it predict the next character in the sequence. We have decided to feed sequences of 100 characters to the network and make it predict the character that follows them.
We'll be encoding data by mapping each character to its integer index in a vocabulary. We'll follow the below steps to encode and prepare data.
The data generated after following the above steps will be given to the LSTM network for processing. The network will process a sequence of 100 characters at a time and try to predict the next character. We have explained the steps in more detail below to make them easier to grasp.
In this section, we have simply loaded our Wikipedia dataset (WikiText-2). The dataset is already divided into train, validation, and test sets. We'll use only the train set for our task. The train set has ~36k text examples. Each example is a line of text (a heading or a paragraph) taken from a Wikipedia article.
train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()
In this section, we are building a vocabulary of all unique characters present in our dataset. In order to create a vocabulary, we have used the build_vocab_from_iterator() function available from the 'vocab' sub-module of the torchtext library. The function accepts an iterator that yields a list of characters on each iteration. We have created a small generator function named build_vocabulary() for this purpose. The function takes datasets as input and loops through all datasets and their examples one at a time, yielding a list of characters for each example. Our text examples have a special token named <unk> which represents the unknown character, and we have handled it specially so that it counts as one token instead of being broken into characters.
After building vocabulary, we have printed vocabulary and the number of characters present in it.
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
def build_vocabulary(datasets):
    for dataset in datasets:
        for text in dataset:
            if "<unk>" in text:
                ## Keep "<unk>" as a single token instead of splitting it into characters
                texts = text.split("<unk>")
                total = list(texts[0].lower())
                for t in texts[1:]:
                    total.extend(["<unk>", ] + list(t.lower()))
                yield total
            else:
                yield list(text.lower())
vocab = build_vocab_from_iterator(build_vocabulary([train_dataset, ]), min_freq=1, specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
len(vocab)
print(vocab.get_itos())
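To make the vocabulary step concrete, here is a minimal pure-Python sketch of the same idea without torchtext. The helper names `tokenize_chars` and `build_char_vocab` are hypothetical, and the index order (specials first, then characters by descending frequency) is only an assumption; the ordering torchtext actually produces may differ.

```python
from collections import Counter

def tokenize_chars(text, unk="<unk>"):
    """Split text into lowercase characters, keeping `unk` as one token."""
    parts = text.split(unk)
    tokens = list(parts[0].lower())
    for part in parts[1:]:
        tokens.append(unk)
        tokens.extend(part.lower())
    return tokens

def build_char_vocab(texts, specials=("<unk>",)):
    """Map each token to an integer index (hypothetical stand-in for torchtext)."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize_chars(text))
    for special in specials:
        counts.pop(special, None)  # specials get the first indices below
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

toy_vocab = build_char_vocab(["The <unk> cat", "a hat"])
print(tokenize_chars("The <unk> cat"))
print(toy_vocab)
```

Note that `<` and `>` never enter the vocabulary as separate characters here, which is the point of the special handling.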
In this section, we are reorganizing our dataset examples so that they can be used to train our LSTM network. We simply loop through each text example of our train dataset and slide a window of 100 characters over it. We take 100 characters as data features and the next character in the sequence as the target value, then move the window by 1 character and continue the process until we reach the end of the text. We have also replaced each character with its integer index using our vocabulary. Please make a NOTE that we have not used all examples available from the dataset (only the first 7,500) for training the model, as it would take quite long.
After organizing the dataset, we have converted them to torch tensors. We have also added one extra dimension at the end in order to feed data to the LSTM layer.
Below, we have tried to explain the process with a simple example.
vocab = {
    'h':1,
    'e':2,
    'l':3,
    'o':4,
    ' ':5,
    ',':6,
    'w':7,
    'a':8,
    'r':9,
    'y':10,
    'u':11,
    '?':12,
    'c':13,
    'm':14,
    't':15,
    'd':16,
    'z':17,
    'n':18
}
text_example = "Hello, How are you? Welcome to coderzcolumn?"
seq_length = 10
X_train = [
    ['h','e','l','l','o',',',' ', 'h','o','w'],
    ['e','l','l','o',',',' ', 'h','o','w',' '],
    ['l','l','o',',',' ', 'h','o','w', ' ', 'a'],
    ['l','o',',',' ', 'h','o','w',' ', 'a', 'r'],
    ...
    ['d','e','r','z','c','o','l', 'u','m','n']
]
Y_train = [' ','a','r','e',..., '?']
X_train_vectorized = [
    [1,2,3,3,4,6,5,1,4,7],
    [2,3,3,4,6,5,1,4,7,5],
    [3,3,4,6,5,1,4,7,5,8],
    [3,4,6,5,1,4,7,5,8,9],
    ...
    [16,2,9,17,13,4,3,11,14,18]
]
Y_train_vectorized = [5,8,9,2,..., 12]
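The toy example above can also be generated programmatically. The sketch below is a minimal pure-Python version of the sliding-window step; the function name `make_windows` is hypothetical, and the small vocabulary is copied from the example rather than built by torchtext.

```python
toy_vocab = {'h': 1, 'e': 2, 'l': 3, 'o': 4, ' ': 5, ',': 6, 'w': 7, 'a': 8, 'r': 9,
             'y': 10, 'u': 11, '?': 12, 'c': 13, 'm': 14, 't': 15, 'd': 16, 'z': 17, 'n': 18}

def make_windows(text, seq_length):
    """Return (inputs, targets): each input is seq_length chars, each target the next char."""
    text = text.lower()
    inputs, targets = [], []
    for i in range(len(text) - seq_length):
        inputs.append(list(text[i:i + seq_length]))
        targets.append(text[i + seq_length])
    return inputs, targets

X_toy, Y_toy = make_windows("Hello, How are you? Welcome to coderzcolumn?", seq_length=10)

# Replace every character with its integer index
X_toy_vec = [[toy_vocab[ch] for ch in window] for window in X_toy]
Y_toy_vec = [toy_vocab[ch] for ch in Y_toy]

print(X_toy[0], "->", repr(Y_toy[0]))
print(X_toy_vec[0], "->", Y_toy_vec[0])
print("Number of samples :", len(X_toy))
```

A text of length n yields n - seq_length samples, so texts shorter than the window contribute nothing.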
%%time
train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()
seq_length = 100 ## Network Hyperparameter to tune
X_train, Y_train = [], []
for text in list(train_dataset)[:7500]:
    for i in range(0, len(text)-seq_length):
        inp_seq = list(text[i:i+seq_length].lower())
        out_seq = text[i+seq_length].lower()
        X_train.append(vocab(inp_seq))
        Y_train.append(vocab[out_seq])
X_train, Y_train = torch.tensor(X_train, dtype=torch.float32), torch.tensor(Y_train)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], 1) ## Extra dimension is added for LSTM layer
X_train.shape, Y_train.shape
In this section, we have simply wrapped our torch tensors in a dataset and created a data loader from it. The data loader will let us process data in batches during the training process. We have set the batch size to 1024.
from torch.utils.data import DataLoader, TensorDataset
vectorized_train_dataset = TensorDataset(X_train, Y_train)
train_loader = DataLoader(vectorized_train_dataset, batch_size=1024, shuffle=False)
for X, Y in train_loader:
    print(X.shape, Y.shape)
    break
gc.collect()
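As a rough illustration of what batching does, the following pure-Python sketch splits a list of samples into consecutive batches, the way a non-shuffled loader that keeps the last partial batch would. `iter_batches` is a hypothetical helper, not part of PyTorch, and the sample count is made up.

```python
import math

def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples`; the last batch may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(2500))  # stand-in for 2,500 (X, Y) pairs
batches = list(iter_batches(samples, batch_size=1024))

print([len(batch) for batch in batches])   # size of each batch
print(math.ceil(len(samples) / 1024))      # number of batches per epoch
```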
In this section, we have defined a neural network that we'll use for our task. Our task will be considered a classification task as our network predicts one of the characters from the vocabulary.
The network that we have defined consists of 2 LSTM layers and one linear layer. The output size of each LSTM layer is set to 256. Using two consecutive LSTM layers helps us better capture the sequence of characters found in the data. We have defined the LSTM layers using the LSTM() constructor, where we have set the num_layers parameter to 2, instructing it to stack two LSTM layers. The output of the second LSTM layer is given to the Linear layer, which has as many output units as the size of the vocabulary.
After defining the network, we initialized it, printed the shape of weights/biases of layers, and performed a forward pass for verification purposes.
If you are new to PyTorch or don't have a background in LSTM networks, then we recommend that you go through the below links as they will help you with the background. We have not covered the inner workings of LSTM in depth here as it is already covered there.
from torch import nn
from torch.nn import functional as F
hidden_dim = 256
n_layers=2
class LSTMTextGenerator(nn.Module):
    def __init__(self):
        super(LSTMTextGenerator, self).__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True)
        self.linear = nn.Linear(hidden_dim, len(vocab))
    def forward(self, X_batch):
        ## Random initial hidden and carry (cell) states, one per LSTM layer
        hidden, carry = torch.randn(n_layers, len(X_batch), hidden_dim).to(device), torch.randn(n_layers, len(X_batch), hidden_dim).to(device)
        output, (hidden, carry) = self.lstm(X_batch, (hidden, carry))
        return self.linear(output[:,-1]) ## Only the output of the last time step is used
text_generator = LSTMTextGenerator().to(device)
text_generator
for layer in text_generator.children():
    print("Layer : {}".format(layer))
    print("Parameters : ")
    for param in layer.parameters():
        print(param.shape)
    print()
out = text_generator(torch.randn(1024, seq_length, 1).to(device))
out.shape
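When the parameter shapes are printed, it can help to know what to expect. The sketch below computes them from the standard LSTM layout, in which the four gates are stacked along the first dimension of each weight. `lstm_param_shapes` and `count` are hypothetical helpers, and `vocab_size = 100` is only a placeholder, not the real vocabulary size.

```python
def lstm_param_shapes(input_size, hidden_size, num_layers):
    """Shapes of (weight_ih, weight_hh, bias_ih, bias_hh) for each stacked LSTM layer."""
    shapes = []
    for layer in range(num_layers):
        # Layers after the first receive the previous layer's hidden output
        in_features = input_size if layer == 0 else hidden_size
        shapes.append([
            (4 * hidden_size, in_features),  # input-to-hidden weights
            (4 * hidden_size, hidden_size),  # hidden-to-hidden weights
            (4 * hidden_size,),              # input-to-hidden bias
            (4 * hidden_size,),              # hidden-to-hidden bias
        ])
    return shapes

def count(shape):
    """Number of scalar values in a tensor of the given shape."""
    total = 1
    for dim in shape:
        total *= dim
    return total

shapes = lstm_param_shapes(input_size=1, hidden_size=256, num_layers=2)
lstm_params = sum(count(shape) for layer in shapes for shape in layer)

vocab_size = 100  # placeholder; the tutorial uses len(vocab)
linear_params = 256 * vocab_size + vocab_size  # weights + biases

print(shapes[0])
print(shapes[1])
print(lstm_params, linear_params)
```

Most of the parameters sit in the hidden-to-hidden weights and in the second layer, since the first layer sees only one input feature per time step.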
Here, we are training our network. To simplify the training process, we have created a helper training function. The function takes the model, loss function, optimizer, train data loader, and the number of epochs as input. It then executes the training loop once per epoch, looping through the whole training data in batches each time. For each batch of data, it performs a forward pass to make predictions, calculates the loss, calculates gradients, and updates network parameters using the gradients. It records the loss value for each batch and prints the average loss over all batches at the end of each epoch.
from tqdm import tqdm
from sklearn.metrics import accuracy_score
import gc
def TrainModel(model, loss_fn, optimizer, train_loader, epochs=10):
    for i in range(1, epochs+1):
        losses = []
        for X, Y in tqdm(train_loader):
            Y_preds = model(X.to(device))
            loss = loss_fn(Y_preds, Y.to(device))
            losses.append(loss.item())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print("Train Loss : {:.3f}".format(torch.tensor(losses).mean()))
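The loss that this training loop is run with in the tutorial is cross entropy over the vocabulary. As a sketch of what that number means for a single example, the pure-Python functions below compute the negative log of the softmax probability assigned to the correct character. `softmax` and `cross_entropy` are hypothetical helpers written for illustration, and the 50-class case is an arbitrary example.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability of the correct class for one example."""
    return -math.log(softmax(logits)[target_index])

# Equal scores for 50 classes: the model has no preference
uniform_loss = cross_entropy([0.0] * 50, target_index=3)
# A high score on the correct class gives a small loss
confident_loss = cross_entropy([0.0, 5.0, 0.0], target_index=1)

print(round(uniform_loss, 3), round(confident_loss, 3))
```

A useful reference point: a model that spreads probability evenly over V characters has a loss of ln(V), so printed values well below that suggest the network has learned something about the character sequence.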
Below, we are actually training our network using the training routine from the previous cell. We have set the number of epochs to 25 and the learning rate to 0.001. Then, we have initialized the cross entropy loss, our LSTM model, and the Adam optimizer. At last, we have called our training routine with the necessary parameters to perform training. We have trained the network for 25 epochs to see what kind of results it produces. We can notice from the loss value printed after each epoch that the network seems to be doing a good job at learning the sequence of characters.
%%time
from torch.optim import Adam
epochs = 25
learning_rate = 1e-3
loss_fn = nn.CrossEntropyLoss().to(device)
text_generator = LSTMTextGenerator().to(device)
optimizer = Adam(text_generator.parameters(), lr=learning_rate)
TrainModel(text_generator, loss_fn, optimizer, train_loader, epochs)
In this section, we are trying to generate data using our trained network. We have first retrieved a random text example from our organized train dataset. We have then printed the characters of that example. Then, we have a loop that generates 100 new characters. The logic starts with the initial randomly selected sequence and makes the next character prediction. It then removes the first character from the sequence and adds a newly predicted character at the end. Then, it makes another prediction and the process repeats for 100 characters.
We can notice from the results that our model is not making any spelling errors even though it is predicting one character at a time. The sequence of characters generated does not make much sense but reads like an English sentence. It is also predicting punctuation marks. Because we always pick the most probable character (argmax), the output tends to fall into a loop and repeat the same sequence of characters after some time. This can be avoided by introducing some randomness when choosing the next character from the output of the network.
The results look good overall considering that we have trained the network for just 25 epochs. Next, we'll train the network for more epochs, which should hopefully improve the results further.
import random
random.seed(123)
idx = random.randint(0, len(X_train)-1) ## randint() includes both end points
pattern = X_train[idx].numpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.lookup_tokens(pattern))))
generated_text = []
for i in range(100):
    X_batch = torch.tensor(pattern, dtype=torch.float32).reshape(1, seq_length, 1) ## Design Batch
    preds = text_generator(X_batch.to(device)) ## Make Prediction
    predicted_index = preds.argmax(dim=-1).item() ## Retrieve token index as Python int
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.
print("Generated Text : {}".format("".join(vocab.lookup_tokens(generated_text))))
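One common way to introduce the randomness mentioned above is to sample the next character from the predicted distribution instead of always taking the most likely one. The following is a minimal pure-Python sketch under that assumption. `sample_index` and its `temperature` parameter are hypothetical and not part of the tutorial's code, and the logits are made-up values.

```python
import math
import random

def sample_index(logits, temperature=1.0, rng=random):
    """Sample a class index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # unnormalized probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(123)
logits = [2.0, 1.0, 0.5, 0.1]

# A very low temperature behaves like argmax
greedy_like = [sample_index(logits, temperature=0.01, rng=rng) for _ in range(5)]
# Temperature 1.0 samples from the plain softmax distribution
varied = [sample_index(logits, temperature=1.0, rng=rng) for _ in range(20)]

print(greedy_like)
print(varied)
```

In the generation loop, the scores for the single example in the batch (for instance `preds[0].tolist()`) could be passed as `logits` in place of the argmax call. Lower temperatures stay closer to the greedy output, while higher ones give more varied but less reliable text.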
In this section, we have trained our network for another 50 epochs. We have also reduced the learning rate from 0.001 to 0.0003. We can notice from the printed loss values that the loss decreases at every epoch, which suggests that our network is getting better at the text generation task.
epochs = 50
learning_rate = 3e-4
optimizer = Adam(text_generator.parameters(), lr=learning_rate)
TrainModel(text_generator, loss_fn, optimizer, train_loader, epochs)
Here, we have again generated new characters using our further-trained network. We have used the same example that we had used earlier. We can notice that the results seem to have improved a little bit. The model is not making spelling mistakes and new words are generated for the same example. The output is still repetitive and produces the same characters again and again. We can train the network further to see whether it helps or not.
import random
random.seed(123)
idx = random.randint(0, len(X_train)-1) ## randint() includes both end points
pattern = X_train[idx].numpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.lookup_tokens(pattern))))
generated_text = []
for i in range(100):
    X_batch = torch.tensor(pattern, dtype=torch.float32).reshape(1, seq_length, 1) ## Design Batch
    preds = text_generator(X_batch.to(device)) ## Make Prediction
    predicted_index = preds.argmax(dim=-1).item() ## Retrieve token index as Python int
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.
print("Generated Text : {}".format("".join(vocab.lookup_tokens(generated_text))))
In this section, we have trained our network for another 50 epochs. We have reduced the learning rate from 0.0003 to 0.0001. We can notice from the loss values printed at the end of each epoch that the network is improving further.
epochs = 50
learning_rate = 1e-4
optimizer = Adam(text_generator.parameters(), lr=learning_rate)
TrainModel(text_generator, loss_fn, optimizer, train_loader, epochs)
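Taken together, the tutorial trains in three stages with a smaller learning rate each time. The sketch below only tabulates that schedule. The `stages` list is a hypothetical summary of the values used in the training cells, and no training is performed here.

```python
stages = [(25, 1e-3), (50, 3e-4), (50, 1e-4)]  # (epochs, learning rate) per stage

total_epochs = 0
for epochs, lr in stages:
    start = total_epochs + 1
    total_epochs += epochs
    print("Epochs {:3d}-{:3d} : lr = {:g}".format(start, total_epochs, lr))

print("Total epochs :", total_epochs)
```

Because a new Adam optimizer is created for each stage, its internal moment estimates start fresh every time the learning rate is lowered.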
Here, we are again generating text for the same text example using our trained network. We can notice that the results this time are a little better than earlier, though they are still repetitive.
import random
random.seed(123)
idx = random.randint(0, len(X_train)-1) ## randint() includes both end points
pattern = X_train[idx].numpy().astype(int).flatten().tolist()
print("Initial Pattern : {}".format("".join(vocab.lookup_tokens(pattern))))
generated_text = []
for i in range(100):
    X_batch = torch.tensor(pattern, dtype=torch.float32).reshape(1, seq_length, 1) ## Design Batch
    preds = text_generator(X_batch.to(device)) ## Make Prediction
    predicted_index = preds.argmax(dim=-1).item() ## Retrieve token index as Python int
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.
print("Generated Text : {}".format("".join(vocab.lookup_tokens(generated_text))))
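Since the same generation cell is repeated after each training stage, it could be wrapped in a small function. The sketch below shows the sliding-window update on its own. `generate_indices` and the `predict_next` callable are hypothetical, with `predict_next` standing in for the network's forward pass followed by argmax.

```python
def generate_indices(pattern, predict_next, n_tokens=100):
    """Repeatedly predict one token, append it, and drop the oldest token."""
    pattern = list(pattern)  # copy so the caller's list is left untouched
    generated = []
    for _ in range(n_tokens):
        next_index = predict_next(pattern)
        generated.append(next_index)
        pattern = pattern[1:] + [next_index]  # window keeps its length
    return generated

# Dummy predictor for illustration: "next token" is the last token + 1, modulo 5
def dummy_predict(window):
    return (window[-1] + 1) % 5

result = generate_indices([0, 1, 2], dummy_predict, n_tokens=6)
print(result)
```

With a trained network, `predict_next` would build the `(1, seq_length, 1)` batch from the window and return the predicted token index.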
Below we have suggested a few more things that can be tried to improve network performance further.
This ends our small tutorial explaining how to design LSTM networks using PyTorch for text generation tasks. Please feel free to contact us if you have questions.