Create Your Own Harry Potter Short Story Using RNNs and TensorFlow
“Of course it is happening inside your head, Harry, but why on earth should that mean that it is not real?”¹
Still waiting for your Hogwarts letter?
Want to enjoy the feast in the Great Hall?
Explore the secret passages in Hogwarts?
Buy your first wand from Ollivander’s?
*sigh* You are not alone.
I have (after all this time?) always been obsessed with Harry Potter, and I recently started learning about neural networks. It’s fascinating to see how creative you can get with Deep Learning, so I thought, why not brew them up together?
So I implemented a simple text generation model using TensorFlow to create my own version of a Harry Potter short story (can't get as good as J.K. Rowling, duh!).
This article runs you through the entire code I wrote to implement it.
But for all the Hermiones out there, you can directly find the GitHub code here and run it yourself!
So here’s something which will cast a Banishing Charm on your boredom while you’re quarantined.
Background
What is an RNN?
A Recurrent Neural Network differs from other neural networks in that it has a memory (a hidden state) which stores information about all the inputs it has processed so far, and it computes the next output on the basis of this memory. For a simple introduction to RNNs, you can refer to this.
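To make the idea of memory concrete, here is a minimal sketch of a single recurrent unit in plain Python. The weights w_x, w_h and b are made-up illustrative values, not anything learned by the model in this article; the point is only that each new hidden state depends on both the current input and the previous hidden state.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a toy scalar RNN: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, h0=0.0):
    """Feed a sequence through the toy RNN and return every hidden state."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h)  # the new state mixes the input with the old memory
        states.append(h)
    return states

# The same input (1.0) gives a different state at each step,
# because the memory carried over from earlier steps differs.
states = run_rnn([1.0, 1.0, 1.0])
print(states)
```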
GRU vs LSTM
Both of these are great for text generation. GRUs are the newer concept, and there isn’t actually a way to determine which one is better in general. Tuning your hyper-parameters well will likely improve your model’s performance more than choosing between the two architectures.²
However, if the amount of data is not a problem, LSTMs may perform better. If you have less data, GRUs have fewer parameters, so they train faster and can generalize well from the smaller dataset.
Feel free to check out this article for a more detailed explanation.
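The claim that GRUs have fewer parameters can be checked with a rough count. The sketch below uses the textbook formulas (an LSTM has four weight sets, a GRU has three) and assumes an input size of 300 and 512 units, to match the hyper-parameters used later in this article. As far as I know, the default Keras GRU keeps a second bias vector, so model.summary() should report a slightly larger number for the GRU than this estimate.

```python
def lstm_params(input_dim, units):
    """Textbook LSTM: 4 weight sets (input, forget, output gates + candidate)."""
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    """Textbook GRU: 3 weight sets (update, reset gates + candidate)."""
    return 3 * (units * (input_dim + units) + units)

input_dim, units = 300, 512  # assumed: embedding size and first RNN layer size
print("LSTM:", lstm_params(input_dim, units))  # 1665024
print("GRU: ", gru_params(input_dim, units))   # 1248768
```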
Why character-based?
When working with large datasets like this, the total number of unique words in the corpus is much higher than the number of unique characters. A large dataset will have a great many unique words, and when we one-hot encode over such a large vocabulary we’re likely to run into memory issues. The labels alone can take up terabytes of RAM.
So, the same principles that you use to predict words can be applied here, but now you’ll be working with a much smaller vocabulary.
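A rough back-of-the-envelope estimate shows why vocabulary size matters. The figures below (a corpus of about six million characters or one million words, 100 unique characters, 25,000 unique words) are illustrative assumptions, not measurements of the Harry Potter text.

```python
def one_hot_bytes(num_tokens, vocab_size, bytes_per_value=4):
    """Memory needed to store one dense one-hot float32 vector per token."""
    return num_tokens * vocab_size * bytes_per_value

# Illustrative, assumed sizes (not measured from the dataset)
char_level = one_hot_bytes(num_tokens=6_000_000, vocab_size=100)
word_level = one_hot_bytes(num_tokens=1_000_000, vocab_size=25_000)

print(f"character-level: {char_level / 1e9:.1f} GB")  # 2.4 GB
print(f"word-level:      {word_level / 1e9:.1f} GB")  # 100.0 GB
```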
The code
So let’s get started!
First, import the libraries you need
import tensorflow as tf
import numpy as np
import os
import time
Now, read the data
You can find and download transcripts of all the Harry Potter books from this Kaggle dataset. Here, I am combining all seven books into one text file named ‘harrypotter.txt’. You can also train your model on any one book if you like. Just experiment with it!
files = ['1SorcerersStone.txt', '2ChamberofSecrets.txt', '3ThePrisonerOfAzkaban.txt', '4TheGobletOfFire.txt', '5OrderofthePhoenix.txt', '6TheHalfBloodPrince.txt', '7DeathlyHollows.txt']
with open('harrypotter.txt', 'w') as outfile:
    for file in files:
        with open(file) as infile:
            outfile.write(infile.read())
text = open('harrypotter.txt').read()
Looking at the data
print(text[:300])
“Harry Potter and the Sorcerer’s Stone
CHAPTER ONE
THE BOY WHO LIVED
Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you’d expect to be involved in anything strange or mysterious, because they”³
Processing the data
We map all the unique characters in vocab to numbers by making two look-up tables:
- mapping the characters to numbers (char2index)
- mapping the numbers back to the characters (index2char)
Then we convert our text to numbers.
vocab = sorted(set(text))
char2index = {u:i for i, u in enumerate(vocab)}
index2char = np.array(vocab)
text_as_int = np.array([char2index[c] for c in text])
# how it looks:
print('{} -- characters mapped to int -- > {}'.format(repr(text[:13]), text_as_int[:13]))
'Harry Potter ' -- characters mapped to int -- > [39 64 81 81 88 3 47 78 83 83 68 81 3]
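As a quick sanity check, the two look-up tables should be exact inverses of each other. The sketch below rebuilds them on a short sample string in plain Python (a list stands in for the NumPy array) and confirms that encoding and then decoding returns the original text. The names prefixed with sample_ are made-up stand-ins so they do not clash with the real tables built on the full corpus.

```python
sample_text = "Harry Potter"  # stand-in for the full corpus

sample_vocab = sorted(set(sample_text))
sample_char2index = {ch: i for i, ch in enumerate(sample_vocab)}
sample_index2char = list(sample_vocab)  # position i holds the character with index i

def encode(s):
    """Map each character to its integer index."""
    return [sample_char2index[ch] for ch in s]

def decode(indices):
    """Map integer indices back to a string."""
    return "".join(sample_index2char[i] for i in indices)

encoded = encode(sample_text)
print(encoded)
print(decode(encoded))  # Harry Potter
```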
Each input sequence for our model will contain seq_length number of characters from the text, and its corresponding target sequence will be of the same length with all characters shifted one place to the right. So we break the text into chunks of seq_length+1.⁴
tf.data.Dataset.from_tensor_slices converts the text vector into a stream of character indices and the batch method lets us group these characters into batches of the required length.
By using the map method to apply a simple function to each batch, we create our inputs and targets.
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

def split_input_target(data):
    input_text = data[:-1]
    target_text = data[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
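To see what the chunking and shifting does, here is the same idea in plain Python on a toy string, without tf.data. The helper name make_examples and the tiny seq_length are assumptions for illustration only.

```python
def make_examples(text, seq_length):
    """Cut text into chunks of seq_length + 1 and split each into (input, target)."""
    chunk = seq_length + 1
    examples = []
    # Drop the incomplete chunk at the end, like drop_remainder=True
    for start in range(0, len(text) - chunk + 1, chunk):
        piece = text[start:start + chunk]
        examples.append((piece[:-1], piece[1:]))
    return examples

for inp, tgt in make_examples("Harry Potter", seq_length=5):
    print(repr(inp), "->", repr(tgt))
# 'Harry' -> 'arry '
# 'Potte' -> 'otter'
```

Each target is the input moved along by one character, so at every position the model is asked to predict the character that comes next.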
Before feeding this data into the model, we shuffle the data and divide it into batches. tf.data maintains a buffer in which it shuffles elements.
BATCH_SIZE = 64
BUFFER_SIZE = 10000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
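The buffered shuffle can be pictured with a small plain-Python generator. This is a simplified sketch of the idea (fill a buffer, then repeatedly emit a random element and replace it with the next incoming one), not TensorFlow's actual implementation; buffered_shuffle is a hypothetical helper name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new item in its slot
        i = rng.randrange(buffer_size)
        yield buffer[i]
        buffer[i] = item
    # Input exhausted: drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
print(shuffled)
```

With buffer_size=4, the first element emitted is always one of the first four inputs, which is why a larger buffer gives a more thorough shuffle.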
Building the Model
Given all the characters computed until this moment, what will the next character be? This is what we will be training our RNN model to predict.
I have used tf.keras.Sequential to define the model since all the layers in it only have a single input and produce a single output. The different layers used are:
- tf.keras.layers.Embedding: This is the input layer. An embedding is used to map all the unique characters to vectors in multi-dimensional space, having embedding_dim dimensions.
- tf.keras.layers.GRU: A type of RNN with rnn_units number of units. (You can also use an LSTM layer here to see what works best for your data.)
- tf.keras.layers.Dense: This is the output layer, with vocab_size outputs.
It is also useful to define all the hyper-parameters separately so that it’s easier for you to change them later without editing the model definition.
vocab_size = len(vocab)
embedding_dim = 300
# Number of RNN units
rnn_units1 = 512
rnn_units2 = 256
rnn_units = [rnn_units1, rnn_units2]

def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    # rnn_units is a list with the size of each GRU layer
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units[0], return_sequences=True,
                            stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.GRU(rnn_units[1], return_sequences=True,
                            stateful=True, recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

model = build_model(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)
Training the model
The standard tf.keras.losses.sparse_categorical_crossentropy loss function works well with our model because it is applied across the last dimension of the predictions. We set from_logits to True because the model returns logits. Then we choose the adam optimizer and compile our model.
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])
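For intuition, this is what the loss computes for a single character position: the negative log of the softmax probability assigned to the correct character. The plain-Python sketch below is an illustration of that formula, not the TensorFlow implementation, and the logits are made-up numbers.

```python
import math

def sparse_ce_from_logits(label, logits):
    """Cross-entropy for one position: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Uniform logits over 4 classes give a loss of log(4), about 1.386
print(sparse_ce_from_logits(0, [0.0, 0.0, 0.0, 0.0]))
# A confident, correct prediction gives a loss near zero
print(sparse_ce_from_logits(2, [0.0, 0.0, 10.0, 0.0]))
```

An untrained model predicts roughly uniformly, so a first-epoch loss near log(vocab_size) is a sign that the setup is sane.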
You can configure a checkpoint callback like this to ensure that the model weights are saved during training.
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix, save_weights_only=True)
The training time of each epoch depends on your model layers and the hyper-parameters used. I have set epochs to 50 to see how accuracy and loss change over time, but it may not be necessary to train for all 50 epochs. Make sure to stop training when you see that your loss starts to increase or remains constant for a few epochs. The path to the checkpoint from the last epoch you train will be stored in latest_check. If using Google Colab, set the runtime to GPU to reduce training time.
EPOCHS= 50
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])
latest_check = tf.train.latest_checkpoint(checkpoint_dir)
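The advice to stop once the loss stops improving can be automated. Keras ships a tf.keras.callbacks.EarlyStopping callback for this; the plain-Python sketch below only illustrates the underlying rule on a list of per-epoch losses. The function name should_stop, the patience of 3 and the loss values are assumptions for illustration.

```python
def should_stop(losses, patience=3, min_delta=0.0):
    """Return True if the loss has not improved for `patience` epochs in a row."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    # Stop when none of the last `patience` epochs beat the earlier best
    return recent_best >= best_before - min_delta

print(should_stop([2.1, 1.6, 1.3, 1.2, 1.1]))          # False: still improving
print(should_stop([2.1, 1.6, 1.3, 1.31, 1.33, 1.32]))  # True: no improvement for 3 epochs
```

In the training code, a comparable setup would be to add tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3) to the callbacks list passed to model.fit.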
Text generation
If you wish to use a different batch size, you need to rebuild the model and reload the checkpoints before running. I have used a batch_size of 1 to keep it simple.
(You can run model.summary() to get insights on the layers of your model and the output shape after each layer.)
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(latest_check)
model.build(tf.TensorShape([1, None]))
model.summary()
The following function now generates the text:
- It accepts a start_string, initializes the RNN state and sets the number of output characters to num_generate.
- It gets the prediction distribution of the next character using start_string and the RNN state. Then it samples the index of the predicted character, which is our next input to the model.
- The output state returned by the model is fed back into the model so that it now has more context. After predicting the next character, the cycle continues. This way the RNN builds up its memory from the previous outputs.⁴
- A lower scaling results in a more predictable text whereas a higher scaling gives a more surprising text.
def generate_text(model, start_string):
    num_generate = 1000  # can be anything you like
    # Convert the start string to numbers
    input_eval = [char2index[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    text_generated = []
    scaling = 0.5  # kept at a lower value here
    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)
        predictions = predictions / scaling
        # sample the next character from the prediction at the last time step
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1, 0].numpy()
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(index2char[predicted_id])
    return start_string + ''.join(text_generated)
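The effect of scaling is easiest to see on a small example. The sketch below applies a softmax to made-up logits at different scaling values in plain Python: a low value concentrates probability on the most likely character, while a high value flattens the distribution.

```python
import math

def softmax_with_scaling(logits, scaling):
    """Divide logits by `scaling`, then normalize them into probabilities."""
    scaled = [z / scaling for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate characters
for scaling in (0.5, 1.0, 2.0):
    probs = softmax_with_scaling(logits, scaling)
    print(scaling, [round(p, 3) for p in probs])
```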
And you’re done!
Outputs
You can try giving it different start strings to get different outputs.
Here is a part of the output using my favorite character:
print(generate_text(model, start_string=u"Severus Snape"))
Severus Snape moved to the scarlet Hogwarts students. Hermione said, “Well, I think it’s all right, all right, a bit dead before. . . .”
“I think I’ll have to go to the other than you be to help him a question of the staff table and the doors opened and he stared at the clock to Harry. “I think it make the sword of Gryffindor, who was there too, he was on his pillows, and he and Ron stared at him. “I am sure we can bother the boy — “
“You should have been there,” said Ron, and he took a strange and color.
“I mean, he was a really good …
You can also try different sentences:
Voldemort died of coronavirus.”
“You didn’t know what to do,” said Harry, “it was a surrounding cloak, he was the one who sustain you to go to the way.”
“Yeah, well, I think you might have done that!” she said, striding up the steps, and the strength were so far as he was a pretty great tent that was the first time they might have realized I saw him to be devastated and screaming of the crowd through the darkness at the time shouts and silence.
“You see, Harry!”
“I don’t know, see you haven’t got anything to do with a prater of the Ministry of Magic …
Here is one example if you train the model using just the first book, Sorcerer’s Stone³:
Dumbledore in the Leaky Cauldron, now empty. Harry had never been to London before. Although Hagrid seemed very cold and green eyes. He was still shaking.
Harry sat down next to the bowl of peas. “What did you talk to Professor Dumbledore.”
She eyed him with a mixture of shock and suspicion.
“Who’s there?” he said suddenly as they climbed the street. He could just see the bundle of blankets on the step of number four.
Dudley’s favorite punching bag was Harry, but he couldn’t often catch him. Harry didn’t say anything …
You’ll see that the model knows when to capitalize words and start a new paragraph, and it imitates a magical writing vocabulary!
Mischief Managed.
To make the sentences more coherent, you can improve the model by
- changing the different parameter values like seq_length, rnn_units, embedding_dim and scaling to find the best settings
- training it for more epochs
- adding more layers of GRU / LSTM
This model can be trained on any other series you like. Do share your own stories in the comments and have fun!
References:
[1] J.K. Rowling, Harry Potter and the Deathly Hallows, 2007
[2] Denny Britz, Recurrent Neural Network Tutorial, Part 4: Implementing a GRU/LSTM RNN with Python and Theano, 2015
[3] J.K. Rowling, Harry Potter and the Sorcerer’s Stone, 1998
[4] Text generation with an RNN, TensorFlow