I always had trouble getting the shapes of the various matrices right when implementing forward and backward propagation in neural networks, until I came across a ten-minute video by Andrew Ng in his Deep Learning Specialisation that cleared up a lot of my doubts. I have tried to reproduce the ideas from the video here in my own words, hoping to solidify my own understanding and help others in the process.
I am going to use the neural network above for a demo of what the shapes should be for various matrices calculated during forward and backward propagation, but the ideas can be generalised easily to any other architecture.
The Neural Network shown above contains one input layer, four hidden layers and one output layer.
The input X to the network has m samples each with n𝕩 (=2 in this example) features.
Only one sample is shown in the figure, which is represented by layer l₀.
Hidden layer l₁ contains n₁ (=4) neurons, l₂ contains n₂ (=5) neurons, l₃ contains n₃ (=3) neurons, l₄ contains n₄ (=2) neurons, and the output layer l₅ produces n₅ (=1) output.
The weights between layers l₀ and l₁ are given by w⁽¹⁾, those between l₁ and l₂ by w⁽²⁾, and so on up to w⁽⁵⁾ between layers l₄ and l₅.
The corresponding intercepts are b⁽¹⁾, b⁽²⁾ … b⁽⁵⁾.
The activation function at each layer is given by g⁽ˡ⁾(), where l represents the layer number.
The input vector x can be treated as the activation of the first layer, a⁽⁰⁾, and the subsequent activations are a⁽¹⁾, a⁽²⁾ … a⁽⁵⁾.
The forward propagation steps are:
1. z⁽¹⁾ = w⁽¹⁾x + b⁽¹⁾
Considering only one sample, x has shape (n𝕩,1) and z⁽¹⁾ has shape (n₁,1).
This implies that the shape of w⁽¹⁾ should be such that when the dot product of w⁽¹⁾ is taken with x shaped (n𝕩,1), the resultant matrix is of shape (n₁,1).
So, using the rule for matrix multiplication
(a matrix A of shape (p,q) multiplied by a matrix B of shape (q,r) results in a matrix C of shape (p,r)),
the shape of w⁽¹⁾ will be (n₁,n𝕩).
The shape of b⁽¹⁾ will be the same as that of z⁽¹⁾, i.e. (n₁,1).
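As a quick sanity check, here is a minimal sketch of this first step in NumPy (the library choice and the variable names are mine, not from the video), using the example sizes n𝕩 = 2 and n₁ = 4:

```python
import numpy as np

n_x, n_1 = 2, 4                  # input features and neurons in layer l1

x  = np.random.randn(n_x, 1)     # one sample, shape (n_x, 1)
w1 = np.random.randn(n_1, n_x)   # weights between l0 and l1, shape (n_1, n_x)
b1 = np.random.randn(n_1, 1)     # intercepts of layer l1, shape (n_1, 1)

z1 = np.dot(w1, x) + b1          # (n_1, n_x) . (n_x, 1) -> (n_1, 1)
print(z1.shape)                  # (4, 1)
```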
2. a⁽¹⁾ = g⁽¹⁾(z⁽¹⁾)
Since a⁽¹⁾ is an element-wise function of z⁽¹⁾, the shape of a⁽¹⁾ will be the same as that of z⁽¹⁾, i.e. (n₁,1).
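A tiny sketch of this step (the sigmoid here is just my placeholder for g⁽¹⁾):

```python
import numpy as np

def g1(z):
    # element-wise sigmoid, used purely as a concrete example of g(1)
    return 1.0 / (1.0 + np.exp(-z))

z1 = np.random.randn(4, 1)   # stand-in for z(1), shape (n_1, 1)
a1 = g1(z1)
print(a1.shape)              # (4, 1): element-wise activations preserve shape
```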
3. z⁽²⁾ = w⁽²⁾a⁽¹⁾ + b⁽²⁾
Here the dimensions of a⁽¹⁾ are (n₁,1) and those of z⁽²⁾ are (n₂,1). So, using the matrix-multiplication rule above, the shape of
w⁽²⁾ is (n₂,n₁) and that of b⁽²⁾ is (n₂,1).
The same process continues up to the output layer. So, the shape of
z⁽³⁾, a⁽³⁾, b⁽³⁾ ⟹ (n₃,1)
z⁽⁴⁾, a⁽⁴⁾, b⁽⁴⁾ ⟹ (n₄,1)
z⁽⁵⁾, a⁽⁵⁾, b⁽⁵⁾ ⟹ (n₅, 1)
and the shape of
w⁽³⁾ ⟹ (n₃, n₂)
w⁽⁴⁾ ⟹ (n₄, n₃)
w⁽⁵⁾ ⟹ (n₅, n₄)
In other words, the shape of the weight matrix w for layer l is given by:
w⁽ˡ⁾ ⟹ (n⁽ˡ⁾, n⁽ˡ⁻¹⁾),
and for the intercepts b:
b⁽ˡ⁾ ⟹ (n⁽ˡ⁾,1),
where
n⁽ˡ⁾ ⟹ number of neurons in layer l
n⁽ˡ⁻¹⁾ ⟹ number of neurons in layer (l-1).
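These two rules are enough to set up every parameter in the network. Here is a minimal sketch (mine, not from the video) that builds the weight and intercept matrices for the example architecture and prints their shapes:

```python
import numpy as np

# neurons per layer in the example network: [n_x, n_1, n_2, n_3, n_4, n_5]
layer_sizes = [2, 4, 5, 3, 2, 1]

params = {}
for l in range(1, len(layer_sizes)):
    # w(l) has shape (n(l), n(l-1)) and b(l) has shape (n(l), 1)
    w = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * 0.01
    b = np.zeros((layer_sizes[l], 1))
    params["w" + str(l)], params["b" + str(l)] = w, b
    print("w%d: %s, b%d: %s" % (l, w.shape, l, b.shape))

# w1: (4, 2), b1: (4, 1)
# w2: (5, 4), b2: (5, 1)
# w3: (3, 5), b3: (3, 1)
# w4: (2, 3), b4: (2, 1)
# w5: (1, 2), b5: (1, 1)
```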
During backpropagation, using the 'a' and 'z' values cached during forward propagation, we compute the partial derivatives of the loss function, 'da' and 'dz', and from them the gradients 'dw' and 'db'. The shapes of these gradients are the same as those of their corresponding 'a', 'z', 'w' and 'b' matrices.
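For a single linear layer and a single sample, the standard gradient formulas make this shape correspondence easy to verify; the sketch below is my own illustration, not code from the course:

```python
import numpy as np

n_prev, n_l = 4, 5                     # neurons in layer l-1 and layer l

a_prev = np.random.randn(n_prev, 1)    # a(l-1), shape (n_prev, 1)
w      = np.random.randn(n_l, n_prev)  # w(l),   shape (n_l, n_prev)
dz     = np.random.randn(n_l, 1)       # dz(l),  same shape as z(l)

dw      = np.dot(dz, a_prev.T)         # (n_l, 1) . (1, n_prev) -> same shape as w
db      = dz                           # same shape as b(l): (n_l, 1)
da_prev = np.dot(w.T, dz)              # (n_prev, n_l) . (n_l, 1) -> same shape as a(l-1)

print(dw.shape, db.shape, da_prev.shape)   # (5, 4) (5, 1) (4, 1)
```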
Vectorization
We have till now considered only one sample from the input X, which has a total of m samples. Instead of using a for-loop to iterate over the m samples, the whole process can be vectorized, which means we can feed all m samples, each with n𝕩 features, into our neural network at once. This is computationally more efficient than looping over the samples one at a time.
With vectorization, our input becomes X with shape (n𝕩,m), and we write the vectorized quantities with capital letters, so the shapes change as follows:
A⁽⁰⁾ ⟹ (n𝕩,m)
Z⁽¹⁾, A⁽¹⁾ ⟹ (n₁,m)
Z⁽²⁾, A⁽²⁾ ⟹ (n₂,m)
Z⁽³⁾, A⁽³⁾ ⟹ (n₃,m)
Z⁽⁴⁾, A⁽⁴⁾ ⟹ (n₄,m)
Z⁽⁵⁾, A⁽⁵⁾ ⟹ (n₅, m)
The shapes of the weight matrices W and the intercepts b remain the same, even when we use vectorization.
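Here is a short sketch of the vectorized forward pass through the example network, assuming m = 10 samples (the sigmoid again stands in for the actual activations):

```python
import numpy as np

def g(Z):
    # placeholder element-wise activation
    return 1.0 / (1.0 + np.exp(-Z))

m = 10
layer_sizes = [2, 4, 5, 3, 2, 1]            # [n_x, n_1, ..., n_5]

A = np.random.randn(layer_sizes[0], m)      # A(0) = X, shape (n_x, m)
for l in range(1, len(layer_sizes)):
    W = np.random.randn(layer_sizes[l], layer_sizes[l - 1])
    b = np.zeros((layer_sizes[l], 1))       # still (n(l), 1), not (n(l), m)
    Z = np.dot(W, A) + b                    # b is broadcast across the m columns
    A = g(Z)
    print("Z%d, A%d: %s" % (l, l, Z.shape))

# Z1, A1: (4, 10)
# Z2, A2: (5, 10)
# Z3, A3: (3, 10)
# Z4, A4: (2, 10)
# Z5, A5: (1, 10)
```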
So, for an L-layer neural network with input X of shape (n𝕩, m), the shapes of the weight matrices, bias vectors and other intermediate terms can be summarised as:
W⁽ˡ⁾, dW⁽ˡ⁾ ⟹ (n⁽ˡ⁾, n⁽ˡ⁻¹⁾)
b⁽ˡ⁾, db⁽ˡ⁾ ⟹ (n⁽ˡ⁾, 1)
Z⁽ˡ⁾, dZ⁽ˡ⁾, A⁽ˡ⁾, dA⁽ˡ⁾ ⟹ (n⁽ˡ⁾, m)
One question that might arise is: how does the vector b⁽ˡ⁾ with shape (n⁽ˡ⁾,1) get added to the product of W⁽ˡ⁾ with shape (n⁽ˡ⁾, n⁽ˡ⁻¹⁾) and A⁽ˡ⁻¹⁾ with shape (n⁽ˡ⁻¹⁾, m) in the following equation:
Z⁽ˡ⁾ = W⁽ˡ⁾A⁽ˡ⁻¹⁾ + b⁽ˡ⁾
This works because of broadcasting: during the addition, the (n⁽ˡ⁾,1) vector b⁽ˡ⁾ is implicitly stretched (copied) across the m columns so that it matches the (n⁽ˡ⁾,m) shape of W⁽ˡ⁾A⁽ˡ⁻¹⁾.
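You can see this directly in NumPy: adding an (n⁽ˡ⁾,1) column vector to an (n⁽ˡ⁾,m) matrix adds the vector to every column:

```python
import numpy as np

WA = np.ones((3, 5))               # stands in for W(l)A(l-1), shape (n_l, m)
b  = np.array([[1.], [2.], [3.]])  # shape (n_l, 1)

Z = WA + b          # b is broadcast across the 5 columns
print(Z.shape)      # (3, 5)
print(Z[:, 0])      # [2. 3. 4.] -- every column has b added to it
```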
Thanks for reading. Please clap if this helped make things clear for you in the same way the original video did for me.