Text Generation is an active area of research in Natural Language Processing (NLP) where we build models that generate text the way humans do. The models used to generate text are referred to as Language Models, and the process is referred to as Language Modeling. Nowadays, various deep learning models are being developed for language modeling tasks like conversational systems (chatbots), text translation, text summarization, etc. Deep learning models like Recurrent Neural Networks (RNNs) and their variants are commonly used to create language models for text generation. RNNs are, by design, good at remembering sequences: they keep track of previously seen data and use it to make a prediction for the current step. This is exactly what text generation needs, where we want to know the previous words/characters in order to generate the new word/character that follows them.
As a part of this tutorial, we have explained how we can create Recurrent Neural Networks consisting of LSTM layers for text generation tasks using the Python deep learning library Keras. We have used a character-based model for the text generation task, which takes a specified number of characters as input and predicts the next character of the sequence. We'll encode the characters of the text by assigning a unique integer index to each character. This encoded data will be given to the network for training. For training purposes, we have used the WikiText-2 dataset, which consists of good-quality articles from Wikipedia. We have another tutorial on text generation using Keras which uses character embeddings for encoding text data. Please feel free to check it from the below link.
Please make a NOTE that language models take a lot of time to train and generally require a GPU. Training a language model on a CPU will take a very long time. We have trained the model in this tutorial on a GPU and we recommend doing the same.
Below, we have listed the important sections of the tutorial to give an overview of the material covered.
Below, we have imported the necessary libraries and printed the versions that we have used in our tutorial.
import tensorflow
from tensorflow import keras
print("Keras Version : {}".format(keras.__version__))
import torchtext
print("TorchText Version : {}".format(torchtext.__version__))
import gc
In this section, we are preparing our data to be given to the neural network for training purposes. As we said earlier, we'll be using a character-based approach for text generation, which means that we'll give a specified number of characters to the network and train it to predict the next character after those characters. Neural networks work on real numbers, hence we map each character of the text to a unique integer index. We have followed the below steps to prepare data for the network.

1. Load the WikiText-2 dataset.
2. Populate a vocabulary of unique characters using a Keras Tokenizer.
3. Move a window of seq_length characters over each text example, taking the window as data features (X) and the character right after it as the target (Y).
4. Map the characters to their integer indexes from the vocabulary and reshape the arrays.

After completing these 4 steps, we'll have arrays of integers (X, Y) which we can give to the neural network for training. The steps will become clearer as we go through them below.
In this section, we have simply loaded the WikiText-2 dataset that we are going to use for our purpose. The dataset is already divided into train, validation, and test sets. We'll be using only the train set for our purpose. The training set has ~36k entries (lines of text) drawn from well-curated Wikipedia articles.
train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()
X_train_text = [text for text in train_dataset]
len(X_train_text)
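To get a feel for the raw data before processing it, we can peek at the first few entries. Many WikiText-2 lines are blank or are section headings, so don't be surprised by short entries:

## Peek at the first few raw entries of the train set
for text in X_train_text[:5]:
    print(repr(text[:100])) ## Print (at most) the first 100 characters of each entry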
In this section, we have populated the vocabulary of unique characters. In order to populate the vocabulary, we have created an instance of Tokenizer() available from the keras.preprocessing.text module. We have set char_level to True to inform it to break text at the character level, otherwise by default it breaks text into words. In order to populate the vocabulary, we have called the fit_on_texts() method on the tokenizer instance with our train text examples. Once populated, the vocabulary is available through the word_index attribute of the tokenizer instance. We have printed the vocabulary for reference purposes.
Please make a NOTE that the vocabulary starts from index 1; index 0 is reserved by Keras and is never assigned to a character (it is commonly used for padding). Characters not seen during fitting are simply dropped by texts_to_sequences(), unless an oov_token is specified when creating the Tokenizer, in which case unseen characters are mapped to that token's index.
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X_train_text)
print(tokenizer.word_index)
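As a quick sanity check, we can run a short string through the fitted tokenizer. The exact index values depend on the character frequencies in the corpus, so the output will vary:

print("Vocabulary Size : {}".format(len(tokenizer.word_index)))
print(tokenizer.texts_to_sequences(["hello"])) ## A single list with five integer indexes, one per character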
In this section, we are organizing our dataset for training. We have set the seq_length variable to 100 because we are going to use sequences of 100 characters. Then, we loop through each text example of the data. For each text example, we move a window of 100 characters through it, putting the 100 characters into the data features (X_train) and the next character after them into the targets (Y_train). After setting characters into the X_train and Y_train arrays, we also retrieve their indexes from our populated vocabulary. We have used the texts_to_sequences() method to transform character sequences into indexes. After converting characters to indexes, we introduce one extra dimension at the end of the data features (X_train) so that they can be processed by the LSTM layer, which expects a sequence of feature vectors for each example.
Below, we have explained the process with a simple example.
vocab = {
'h':1,
'e':2,
'l':3,
'o':4,
' ':5,
',':6,
'w':7,
'a':8,
'r':9,
'y':10,
'u':11,
'?':12,
'c':13,
'm':14,
't':15,
'd':16,
'z':17,
'n':18
}
text_example = "Hello, How are you? Welcome to coderzcolumn?"
seq_length = 10
X_train = [
['h','e','l','l','o',',',' ', 'h','o','w'],
['e','l','l','o',',',' ', 'h','o','w',' '],
['l','l','o',',',' ', 'h','o','w', ' ', 'a'],
['l','o',',',' ', 'h','o','w',' ', 'a', 'r'],
...
['d','e','r','z','c','o','l', 'u','m','n']
]
Y_train = [' ', 'a', 'r', 'e', ..., '?']
X_train_vectorized = [
[1,2,3,3,4,6,5,1,4,7],
[2,3,3,4,6,5,1,4,7,5],
[3,3,4,6,5,1,4,7,5,8],
...
[16,2,9,17,13,4,3,11,14,18]
]
Y_train_vectorized = [5, 8, 9, 2, ...., 12]
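If you'd like to verify the sliding-window idea above, here is a tiny runnable sketch of it in plain Python. It only builds the character windows; mapping characters to indexes is done by the tokenizer in the actual code below:

text_example = "Hello, How are you? Welcome to coderzcolumn?"
seq_length = 10
text_lower = text_example.lower() ## The tokenizer lowercases text by default
X, Y = [], []
for i in range(len(text_lower) - seq_length):
    X.append(list(text_lower[i:i+seq_length])) ## 10-character input window
    Y.append(text_lower[i+seq_length])         ## Character right after the window
print(X[0], "->", repr(Y[0])) ## ['h','e','l','l','o',',',' ','h','o','w'] -> ' '
print(Y[:4])                  ## [' ', 'a', 'r', 'e']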
%%time
import numpy as np
train_dataset, valid_dataset, test_dataset = torchtext.datasets.WikiText2()
seq_length = 100 ## Network Hyperparameter to tune
X_train, Y_train = [], []
for text in X_train_text[:6000]: ## Using only a few text examples to keep training time manageable
    for i in range(0, len(text)-seq_length):
        inp_seq = text[i:i+seq_length].lower() ## 100-character input sequence
        out_seq = text[i+seq_length].lower()   ## Next character (target)
        X_train.append(inp_seq)
        Y_train.append(tokenizer.word_index[out_seq]) ## Retrieve index of target character from vocabulary
X_train = tokenizer.texts_to_sequences(X_train) ## Retrieve index for characters from vocabulary
X_train, Y_train = np.array(X_train, dtype=np.int32).reshape(-1, seq_length,1), np.array(Y_train)
X_train.shape, Y_train.shape
gc.collect()
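To double-check the preparation, we can decode the first prepared example back into characters using the tokenizer's index_word mapping (the inverse of word_index):

## Decode the first prepared example back into text to verify the preparation
sample_input = "".join([tokenizer.index_word[int(i)] for i in X_train[0].flatten()])
sample_target = tokenizer.index_word[int(Y_train[0])]
print(repr(sample_input), "->", repr(sample_target))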
In this section, we have defined the network that we'll use for our text generation task. The task is a classification task, as we are making the network predict one of the possible characters from the vocabulary. The first two layers of the network are LSTM layers with an output size of 256. The first LSTM layer takes input of shape (batch_size, seq_length, 1) = (batch_size, 100, 1) and transforms it to (batch_size, seq_length, lstm_out) = (batch_size, 100, 256). This output is then given to the second LSTM layer, which transforms it to shape (batch_size, 256). The output of the second LSTM layer is given to a dense layer with vocabulary length + 1 output units (the extra unit accounts for the reserved index 0), which transforms the shape to (batch_size, vocab_len+1). The softmax activation is applied to the output of the dense layer.
After defining a network, we have also printed a summary of the model which has output shapes of layers and their parameters counts.
We have not covered LSTM layers in detail here. Please feel free to check the below link if you want to learn about them in more detail. It explains the usage of LSTM networks for text classification tasks.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

lstm_out = 256

model = Sequential([
    LSTM(lstm_out, input_shape=(seq_length, 1), return_sequences=True),
    LSTM(lstm_out),
    Dense(len(tokenizer.word_index)+1, activation="softmax") ## +1 output unit for the reserved index 0
])
model.summary()
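As a quick sanity check that the shapes described above line up, we can push a dummy batch of zeros through the untrained network and look at the output shape:

import numpy as np

dummy_batch = np.zeros((2, seq_length, 1), dtype=np.float32) ## Batch of 2 sequences of 100 characters each
print(model(dummy_batch).shape) ## (2, vocab_len + 1) -> one probability per vocabulary entry (index 0 is reserved)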
Below, we have first compiled our network to use the Adam optimizer (learning rate 0.001) for updating parameters and sparse categorical cross entropy loss for measuring network performance.
After compiling the network, we have trained it for 50 epochs with a batch size of 1024. We can notice from the loss value printed after each epoch that it keeps decreasing, which is a good sign that our model is learning.
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K
model.compile(optimizer=Adam(learning_rate=0.001), loss="sparse_categorical_crossentropy")
model.fit(X_train, Y_train, batch_size=1024, epochs=50, verbose=2)
In this section, we have generated text using our trained network. We have generated 100 new characters. We first selected a text example at random from our train data and printed its characters. This text example works as the starting point. We then loop 100 times to generate new characters. The first iteration of the loop starts with the selected example and generates a new character. This character is added to the end of the text sequence and the first character of the sequence is removed. This modified sequence of 100 characters is used in the second iteration of the loop to generate another character, which again gets added to the end of the sequence. This process is repeated 100 times. After generating 100 new characters, we have printed them as well.
We can notice from the results that the text generated by the model looks like English language text, though it does not make much sense. The network has learned to properly spell words. The generation is also deterministic (we always pick the most likely character with argmax) and it repeats a few words; this can be avoided by adding a little randomness to the prediction (see the sampling sketch after the generation code below). We'll train the network for more epochs in the next sections to see whether it improves the results or not.
import random
random.seed(123)
idx = random.randint(0, len(X_train)-1)
pattern = X_train[idx].flatten().tolist()
print("Initial Pattern : {}".format("".join([tokenizer.index_word[idx] for idx in pattern])))
generated_text = []
for i in range(100):
    X_batch = np.array(pattern, dtype=np.int32).reshape(1, seq_length, 1) ## Design Batch
    preds = model.predict(X_batch) ## Make Prediction
    predicted_index = preds.argmax(axis=-1)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.
print("Generated Text : {}".format("".join([tokenizer.index_word[idx] for idx in generated_text])))
Here, we have trained the network for another 50 epochs, this time with the learning rate reduced to 0.0003. The loss values printed during training hint that the network has improved further. Next, we'll try to generate text using this further-trained network.
K.set_value(model.optimizer.learning_rate, 0.0003)
model.fit(X_train, Y_train, batch_size=1024, epochs=50, verbose=2)
Here, we have again tried to generate 100 characters using our trained network. We have started with the same text example with which we had started earlier. We can notice that the network has generated slightly different text this time. It has also generated a punctuation mark this time, though it is still repeating a few words. We'll train the network more to see whether it helps improve things further. Language models generally give good results after training for many epochs.
import random
random.seed(123)
idx = random.randint(0, len(X_train)-1)
pattern = X_train[idx].flatten().tolist()
print("Initial Pattern : {}".format("".join([tokenizer.index_word[idx] for idx in pattern])))
generated_text = []
for i in range(100):
    X_batch = np.array(pattern, dtype=np.int32).reshape(1, seq_length, 1) ## Design Batch
    preds = model.predict(X_batch) ## Make Prediction
    predicted_index = preds.argmax(axis=-1)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.
print("Generated Text : {}".format("".join([tokenizer.index_word[idx] for idx in generated_text])))
In this section, we have trained the network for another 50 epochs, with the learning rate set to 0.0001 for these epochs. The loss values printed after each epoch hint that the model is improving further. We'll test it by generating text again.
K.set_value(model.optimizer.learning_rate, 0.0001)
model.fit(X_train, Y_train, batch_size=1024, epochs=50, verbose=2)
In this section, we have again generated new text of 100 characters using our model. We have used the same text example that we had used earlier as the starting point. We can notice from the generated text that it looks like English language text without any spelling errors. The network is still deterministic and repeats a few words, but this can be improved by trying different approaches. In the next section, we have suggested a few things which can help get better results for text generation tasks.
import random
random.seed(123)
idx = random.randint(0, len(X_train)-1)
pattern = X_train[idx].flatten().tolist()
print("Initial Pattern : {}".format("".join([tokenizer.index_word[idx] for idx in pattern])))
generated_text = []
for i in range(100):
    X_batch = np.array(pattern, dtype=np.int32).reshape(1, seq_length, 1) ## Design Batch
    preds = model.predict(X_batch) ## Make Prediction
    predicted_index = preds.argmax(axis=-1)[0] ## Retrieve token index
    generated_text.append(predicted_index) ## Add token index to result
    pattern.append(predicted_index) ## Add token index to original pattern
    pattern = pattern[1:] ## Resize pattern to bring again to seq_length length.
print("Generated Text : {}".format("".join([tokenizer.index_word[idx] for idx in generated_text])))
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.