Neural Networks With Pound Cakes and a Little Math

Gentle introduction to neural networks

Renu Gehring
Published in Towards AI · 8 min read · Dec 25, 2023


Photo by charlesdeluvio on Unsplash

My family loves pound cakes, and I enjoy baking them, especially this time of the year. Did you know that the pound cake is aptly named because it uses a pound of butter, flour, sugar, and eggs? A dozen eggs. Six whole eggs and six egg yolks give the pound cake its rich color and its dense yet melt-in-your-mouth silky texture.

What might happen if you experimented with different combinations of whole eggs and yolks? Ms. Baker, an imaginary friend of mine, did just that. An infinitely curious person, Ms. Baker is also interested in neural networks. This post describes how Ms. Baker learned about neural networks through reading and experimentation.

Ms. Baker conducts a number of experiments, using her children to test her cakes. The kids are discerning; they either gobble up their mom’s creations or refuse to eat them altogether. Ms. Baker’s quality measure is her products’ destination: tummies or compost bin. Cake Quality is coded 0/1 (compost bin/tummies).

Table created by author

Ms. Baker discovered that she could go two steps in either direction of the optimal 6 whole eggs and 6 yolks and her kids would still eat the cake, but with any further deviation, like 3 whole eggs and 9 egg yolks, the cake ends up in the compost bin.

A good data scientist, Ms. Baker creates a scatterplot of her data, with eggs on the x-axis and quality on the y-axis.

Scatterplot created by author

She wonders what function might capture this relationship. She realizes that no single straight line can fit her data, so perhaps a number of curvy, bendy lines are called for. Perhaps a neural network would work.

Ms. Baker is a huge fan of StatQuest, the YouTube series created by Josh Starmer (with his sidekick StatSquatch). She loves learning practical concepts in math, stats, and data science through these delightful and entertaining pieces. After watching “Neural Networks Pt. 1: Inside the Black Box” twice, she begins to understand how neural networks are constructed. She creates a diagram of a neural network with draw.io, a free drawing program.

Drawing created by author

Neural networks have nodes, connections, and layers. Nodes are circles, connections are arrows, and each vertical column is a layer in the above drawing. Input data comes in from the left and passes through the input, hidden, and output layers to create a prediction. Math happens in the nodes and the connections through model parameters. In Ms. Baker’s data, input data is #eggs and output is quality, and model parameters determine how quality is predicted from #eggs. The smart people who created neural networks set up the following three rules:

  1. There are two types of model parameters. One is called a weight term, which serves to multiply incoming values. A unique weight is associated with each connection. The other parameter is called a bias term, which is additive. Each node has its own bias term.
  2. How many parameters in Ms. Baker’s neural network? Simply add the connections and nodes. There are a total of 8 connections and 5 nodes. Therefore, the neural network above has 8 weights and 5 bias terms for a total of 13 parameters. [Note that this count assumes the input data is a single column; additional columns bring additional weights.]
  3. Nodes do something useful and special: they “squash” incoming values using a math function. This is by design, preventing any single computation from having an undue influence upon the entire network.

Ms. Baker understands that the neural network works in the following manner. For connection 1, the model takes the first value from #eggs, multiplies it by its weight, weight1, and hands off the result to node 1. For connection 2, it takes the first value from #eggs, multiplies it by weight2, and passes the result to node 2.

Each node has its own additive bias term. Node 1 adds bias1 to the incoming value and then “squashes” the result. Now node 1 is ready to hand off its value to connection 3. Node 2 adds bias2 to the incoming value, “squashes” it, and then passes the result along its own outgoing connection.

Connection 3 takes the incoming value from node 1, multiplies it by weight3, and hands it to node 3. Node 3 accepts values from connection 3 AND from the connection between nodes 2 and 3. Node 3 adds these two values together, adds its own bias term, and then “squashes” the result.
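To make the arithmetic concrete, here is a minimal sketch of that hand-off in Python. The numeric values, the sigmoid squashing function, and the name weight4 (for the unnamed connection between node 2 and node 3) are illustrative assumptions, not values from Ms. Baker’s actual network.

import numpy as np

def squash(value):
    # an assumed "squashing" function: the logistic sigmoid, which maps any number into (0, 1)
    return 1.0 / (1.0 + np.exp(-value))

# hypothetical parameter values, chosen only to show the mechanics
weight1, weight2, weight3, weight4 = 0.5, -0.3, 1.2, 0.8
bias1, bias2, bias3 = 0.1, 0.4, -0.2

eggs_value = 6                                             # first value from the #eggs column

node1 = squash(eggs_value * weight1 + bias1)               # connection 1, then node 1
node2 = squash(eggs_value * weight2 + bias2)               # connection 2, then node 2
node3 = squash(node1 * weight3 + node2 * weight4 + bias3)  # node 3 sums, adds bias, squashes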

In Ms. Baker’s image, every node in one layer has a connection to every node in the next layer. This is called a dense network. Neural networks are also called black boxes because the computations in the hidden layers are not directly observable. Deep neural networks can have hundreds or even thousands of hidden layers.

The beauty of all this math, which consists mostly of multiplication and addition, is that the parameters are all “learnt” during the neural network training (or building) process, during which the model is given real data so that it can learn the best values. How does it do that? By repeated iterations, with each one getting closer and closer to the best possible values.

It starts by assigning random values to the parameters and then taking the data through the model; this process is called feed-forward. It compares the predicted output to the real output, examining how far off it is and how each parameter contributed to that error. Then it makes slight updates to the parameters, which is called back-propagation, and it is off to the races with another feed-forward pass.

After many iterations, the model will “learn” optimal values for the parameters, which are optimal because they minimize the differences between the predicted and the real output.
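As a rough sketch of that loop, here is what repeated feed-forward passes and parameter updates look like for a single “neuron” (one weight, one bias) with a sigmoid squash. The data, learning rate, and update rule are illustrative assumptions; a real network repeats the same kind of update for every one of its parameters.

import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0], dtype=float)

w, b = np.random.randn(), np.random.randn()    # start from random parameter values
learning_rate = 0.05

for step in range(5000):
    p = 1 / (1 + np.exp(-(w * x + b)))         # feed-forward: multiply, add, squash
    error = p - y                              # how far off the predictions are
    w -= learning_rate * np.mean(error * x)    # back-propagation: nudge each parameter
    b -= learning_rate * np.mean(error)        #   in the direction that reduces the error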

Because they are computationally intensive, neural networks make sense when other machine learning approaches fail. This often happens when data is very large: long (millions of rows) and wide (thousands of columns).

This background so fascinates Ms. Baker that she is inspired to practice this with Google Colab.

import pandas as pd

eggs = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
quality = pd.Series([0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0])   # 1 = tummies, 0 = compost bin

Ms. Baker knows that 13 rows of data are not sufficient for a neural network, so she creates synthetic data using the Gen AI feature newly available in Google Colab. She types the following prompt:

# prompt: Generate series x and y with 100 values each that best fit data in eggs and quality.  

And voilà, the following code is generated for her:

import numpy as np
import matplotlib.pyplot as plt
from scipy.interpolate import splrep, splev

x_new = np.linspace(0, 12, 100)   # 100 evenly spaced egg counts between 0 and 12
tck = splrep(eggs, quality)       # fit a smoothing spline to the original 13 points
y_new = splev(x_new, tck)         # evaluate the spline at the new egg counts

plt.plot(x_new, y_new)
plt.show()

Ms. Baker likes her test data and now sets about training a dense neural network with Keras.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# define a dense network: 8-node input layer, 4-node hidden layer, 1-node output layer
model = Sequential()
model.add(Dense(8, input_shape=(1,), activation='softmax'))
model.add(Dense(4, activation='softmax'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam')
# fit the keras model on the dataset
history = model.fit(x_new, y_new, epochs=5000, batch_size=100, verbose=0)

Since Ms. Baker only has one explanatory variable (the #eggs column), she puts in ‘1’ for input_shape. She chooses the “softmax” activation function, which “squashes” a layer’s outputs into values between 0 and 1 (that sum to 1 across the layer’s nodes). She selects 8 nodes for her input layer, 4 for her hidden layer, and 1 for her output layer. With the model.compile statement, she specifies the loss function, binary_crossentropy, because it does well for binary classification problems. She selects the optimizer ‘adam’ because it is good at changing parameter values in an efficient way.

Using the model.fit statement, Ms. Baker specifies that the model use all of its observations in each forward pass. She wants the neural network to iterate 5,000 times to reach the best values of the parameters. Because batch_size equals the number of rows in her training data, the parameters are updated only once per epoch (an epoch is one full pass through the training data).
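A quick back-of-the-envelope check of that claim (the numbers mirror the fit call above):

import math

n_rows, batch_size = 100, 100
updates_per_epoch = math.ceil(n_rows / batch_size)   # 1: parameters change once per epoch
total_updates = updates_per_epoch * 5000             # 5,000 updates over the whole run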

After a few seconds, the model finishes training and Ms. Baker peruses her training history with the following plot.

import seaborn as sns

# collect the loss recorded at each epoch into a data frame
df = pd.DataFrame(history.history)
df['epoch'] = range(1, len(df) + 1)

# Plot loss
sns.lineplot(data=df, x='epoch', y='loss', label='Training Loss')
plt.title('Loss over epochs')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()

After examining her “Loss over epochs” plot, Ms. Baker is satisfied that the model has most likely converged (attained parameter values that are optimal for this particular architecture).

Ms. Baker is curious to know how her model does with her #eggs series, although she is aware that she should not be testing her model with the same data that she trained it with.

predicted_values = model.predict(eggs)   # predicted probabilities for the original 13 egg counts
predicted_values

Nicely done, Ms. Baker.
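The sigmoid output layer returns probabilities rather than 0/1 labels. If Ms. Baker wanted hard labels to compare against her quality column, one common (assumed, not shown in the article) convention is to threshold at 0.5:

predicted_labels = (predicted_values > 0.5).astype(int).flatten()
print(predicted_labels)   # compare against the original 0/1 quality values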

Not content to rest, Ms. Baker decides to take a further look at the model’s parameters.

model.get_weights()   # returns a list of arrays: the weights and biases for each layer

Wow! All these numbers. As a final step, Ms. Baker validates whether the number of parameters is what it should be. Applying the rules she learnt above, she knows that there are 44 connections (8 between the input data and the input layer, 32 (8 x 4) between the input layer and the hidden layer, and a final 4 between the hidden layer and the output layer). There are 13 nodes in all, so another 13 bias terms. The model should have 57 parameters.

She now checks her math with the following code:

model.summary()

And she is right! She has also discovered the following formula for calculating the parameters for a fully connected layer.

parameters in fully connected layer = (#input nodes + 1) x (#output nodes)
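Her formula can be checked layer by layer in a couple of lines; the layer sizes below are the ones from her model.

layer_sizes = [1, 8, 4, 1]   # input column, input layer, hidden layer, output layer
total_params = sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
print(total_params)          # 57, matching model.summary()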

I hope that you enjoyed my explanation of neural networks. You can see why this approach is suited for BIG data and machines: the thought of doing all that math by hand makes my head spin, but computers are built to crunch through numbers, as the impressive results of neural networks attest.

A technical note: most explanations of deep learning models’ parameters do not count the parameters between the input data and the input layer, because that number depends on both the number of input columns AND the number of nodes. If there are 8 nodes and 1 column of data, there are 8 weights and 8 bias terms; with 2 columns of data, there are 16 weights and still 8 bias terms. Why only 8 bias terms? Because bias terms belong to nodes, not connections!

Another technical note: the example above feeds #eggs into the model. Generally, input data is first transformed and/or standardized before being used for neural networks.
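For example, a common approach (not used in this article) is to rescale the input column to mean 0 and standard deviation 1 with scikit-learn’s StandardScaler before training:

from sklearn.preprocessing import StandardScaler
import numpy as np

scaler = StandardScaler()
eggs_scaled = scaler.fit_transform(np.array(eggs).reshape(-1, 1))   # standardized #eggs column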

I want to thank Josh Starmer of StatQuest. You do amazing work, Josh, and are an inspiration to all of us.
