Deep Neural Network: Forward and Backward Propagation

Dec 9, 2021

In this article, I will go over the steps involved in solving a binary classification problem using a deep neural network having L layers. If you are new to neural networks, I would suggest going over this article first, where I have discussed a single neuron neural network.

The input data X is of size (nₓ, m), where nₓ is the number of features and m is the number of samples. The output Y is of size (1, m).

The number of neurons in layer l is denoted by n⁽ˡ⁾, with n⁽⁰⁾ = nₓ for the input layer.

We will use the ‘relu’ activation function for layers 1 to L-1, and ‘sigmoid’ for layer L. We initialise A⁽⁰⁾ = X, and the weight matrices W and intercepts b for each layer l, from 1 to L, as follows:

W⁽ˡ⁾ = np.random.randn(n⁽ˡ⁾, n⁽ˡ⁻¹⁾) * 0.01

The ‘np.random.randn’ function returns an array of the given shape filled with samples from the standard normal distribution. We multiply it by 0.01 to keep the initial weights small; large initial weights can push the sigmoid into its flat regions, where the gradients become very small (the vanishing gradients problem).

b⁽ˡ⁾ = np.zeros((n⁽ˡ⁾,1))
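The initialisation above can be sketched for all layers at once. This is a minimal sketch, not a definitive implementation: the function name `initialize_parameters`, the `layer_dims` list (nₓ followed by the layer sizes) and the dictionary keys `W1`, `b1`, ... are assumptions made for illustration.

```python
import numpy as np

def initialize_parameters(layer_dims, seed=0):
    """layer_dims = [n_x, n_1, ..., n_L] (hypothetical argument name).

    Returns a dict with W1..WL (small random values) and b1..bL (zeros).
    """
    np.random.seed(seed)  # fixed seed only to make the sketch reproducible
    params = {}
    for l in range(1, len(layer_dims)):
        # W^(l) has shape (n^(l), n^(l-1)); scaled by 0.01 to keep it small
        params[f"W{l}"] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        # b^(l) has shape (n^(l), 1)
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

# Example: 4 input features, two hidden layers (5 and 3 neurons), 1 output neuron
params = initialize_parameters([4, 5, 3, 1])
for name, value in params.items():
    print(name, value.shape)
```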

Forward Propagation

1. Calculate Z for each layer l, from layer 1 to layer L:

(Take A⁽⁰⁾ = X)

Z⁽ˡ⁾ = W⁽ˡ⁾A⁽ˡ⁻¹⁾ + b⁽ˡ⁾

2. Store A⁽ˡ⁻¹⁾, W⁽ˡ⁾, b⁽ˡ⁾ and Z⁽ˡ⁾ in a list (the cache), appending one entry for each layer.

3. Calculate A⁽ˡ⁾

A⁽ˡ⁾ = relu(Z⁽ˡ⁾)

For the layer L,

A⁽ᴸ⁾ = sigmoid(Z⁽ᴸ⁾)

4. Compute cost, using :

cost, J = -(Y(log(A⁽ᴸ⁾))ᵀ + (1-Y)(log(1-A⁽ᴸ⁾))ᵀ)/m

where Aᵀ denotes the transpose of matrix A.
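The four forward-propagation steps can be sketched in NumPy as follows. This is a minimal sketch under a few assumptions: the function names, the `params` dictionary with keys `W1`, `b1`, ..., and the toy data are all hypothetical, and the cache is stored as a list of tuples.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_propagation(X, params):
    """Returns A^(L) and a list of (A_prev, W, b, Z) tuples, one per layer."""
    L = len(params) // 2                      # each layer has one W and one b
    A = X                                     # A^(0) = X
    caches = []
    for l in range(1, L + 1):
        A_prev = A
        W, b = params[f"W{l}"], params[f"b{l}"]
        Z = W @ A_prev + b                    # step 1: linear part
        caches.append((A_prev, W, b, Z))      # step 2: cache for backprop
        A = sigmoid(Z) if l == L else relu(Z) # step 3: activation
    return A, caches

def compute_cost(AL, Y):
    """Step 4: binary cross-entropy cost averaged over the m samples."""
    m = Y.shape[1]
    cost = -(Y @ np.log(AL).T + (1 - Y) @ np.log(1 - AL).T) / m
    return cost.item()                        # (1, 1) array -> Python float

# Toy example: 3 features, 5 samples, one hidden layer of 4 neurons
np.random.seed(1)
layer_dims = [3, 4, 1]
params = {}
for l in range(1, len(layer_dims)):
    params[f"W{l}"] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
    params[f"b{l}"] = np.zeros((layer_dims[l], 1))
X = np.random.randn(3, 5)
Y = np.array([[1, 0, 1, 1, 0]])

AL, caches = forward_propagation(X, params)
cost = compute_cost(AL, Y)
print(AL.shape, len(caches), cost)
```

With weights this small, every prediction starts close to 0.5, so the initial cost is close to log(2).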

Backward Propagation

1. For the layer L:

dZ⁽ᴸ⁾ = A⁽ᴸ⁾ - Y

Refer to this article for the details of the above equation.

Then, using the values of A⁽ᴸ⁻¹⁾, W⁽ᴸ⁾ and b⁽ᴸ⁾ from the cache created during forward propagation:

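dW⁽ᴸ⁾ = (dZ⁽ᴸ⁾(A⁽ᴸ⁻¹⁾)ᵀ)/m

db⁽ᴸ⁾ = np.sum(dZ⁽ᴸ⁾, axis=1, keepdims=True)/m

dA⁽ᴸ⁻¹⁾ = (W⁽ᴸ⁾)ᵀdZ⁽ᴸ⁾

Here dZ⁽ᴸ⁾ has one column per sample, so it is summed over the samples to give db⁽ᴸ⁾ the same shape as b⁽ᴸ⁾.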

2. For all the other layers, from l = L-1 down to l = 1:

dZ⁽ˡ⁾ = dA⁽ˡ⁾ * relu'(Z⁽ˡ⁾)

where ‘*’ denotes element-wise multiplication between two matrices and relu'(Z⁽ˡ⁾) denotes the derivative of the relu activation function with respect to Z⁽ˡ⁾ (1 where Z⁽ˡ⁾ > 0, and 0 otherwise).

and

dW⁽ˡ⁾ = dZ⁽ˡ⁾(A⁽ˡ⁻¹⁾)ᵀ/m

db⁽ˡ⁾ = np.sum(dZ⁽ˡ⁾, axis=1, keepdims=True)/m

dA⁽ˡ⁻¹⁾ = (W⁽ˡ⁾)ᵀdZ⁽ˡ⁾

and store the gradients dW and db in a list

3. Use the gradients stored above to update the parameters W and b for each layer

W⁽ˡ⁾ = W⁽ˡ⁾ - 𝝰dW⁽ˡ⁾

b⁽ˡ⁾ = b⁽ˡ⁾ - 𝝰db⁽ˡ⁾

where 𝝰 denotes the learning rate.
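The backward pass and the parameter update can be sketched as below. This is a rough sketch rather than a reference implementation: the function names, the `grads` dictionary keys and the hand-built two-layer example are assumptions. Here db is computed as the sum of dZ over the samples divided by m, so that it keeps the (n⁽ˡ⁾, 1) shape of b.

```python
import numpy as np

def backward_propagation(AL, Y, caches):
    """caches[l-1] = (A_prev, W, b, Z) for layer l, as stored in the forward pass."""
    m = Y.shape[1]
    grads = {}
    dZ = AL - Y                                    # step 1: output layer L
    for l in range(len(caches), 0, -1):
        A_prev, W, b, Z = caches[l - 1]
        grads[f"dW{l}"] = dZ @ A_prev.T / m
        grads[f"db{l}"] = np.sum(dZ, axis=1, keepdims=True) / m
        if l > 1:
            dA_prev = W.T @ dZ                     # dA^(l-1)
            Z_prev = caches[l - 2][3]              # Z^(l-1) from the cache
            dZ = dA_prev * (Z_prev > 0)            # step 2: relu'(Z) is 1 where Z > 0
    return grads

def update_parameters(params, grads, learning_rate):
    """Step 3: one gradient-descent step for every W and b."""
    return {name: value - learning_rate * grads["d" + name]
            for name, value in params.items()}

# Hand-built forward pass for a 2-layer network (3 features, 6 samples)
np.random.seed(0)
X = np.random.randn(3, 6)
Y = (np.random.rand(1, 6) > 0.5).astype(float)
params = {
    "W1": np.random.randn(4, 3) * 0.01, "b1": np.zeros((4, 1)),
    "W2": np.random.randn(1, 4) * 0.01, "b2": np.zeros((1, 1)),
}
Z1 = params["W1"] @ X + params["b1"]
A1 = np.maximum(0, Z1)
Z2 = params["W2"] @ A1 + params["b2"]
A2 = 1 / (1 + np.exp(-Z2))
caches = [(X, params["W1"], params["b1"], Z1), (A1, params["W2"], params["b2"], Z2)]

grads = backward_propagation(A2, Y, caches)
new_params = update_parameters(params, grads, learning_rate=0.1)
for name, value in grads.items():
    print(name, value.shape)
```

Each gradient has the same shape as the parameter it updates, which is a quick sanity check on the implementation.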

Repeat forward propagation, backward propagation and the parameter update for n iterations, until the cost converges, to find the optimal values of W and b.
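Putting the pieces together, the whole loop might look like the sketch below. The function name `train`, its default learning rate and iteration count, and the noisy toy dataset are all assumptions chosen for illustration, not values from the original material.

```python
import numpy as np

def train(X, Y, layer_dims, learning_rate=0.1, n_iterations=2000, seed=0):
    """Plain gradient descent for an L-layer relu/sigmoid network (sketch)."""
    np.random.seed(seed)
    L = len(layer_dims) - 1
    m = X.shape[1]
    W = {l: np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
         for l in range(1, L + 1)}
    b = {l: np.zeros((layer_dims[l], 1)) for l in range(1, L + 1)}
    costs = []
    for _ in range(n_iterations):
        # Forward propagation
        A, Z = {0: X}, {}
        for l in range(1, L + 1):
            Z[l] = W[l] @ A[l - 1] + b[l]
            A[l] = 1 / (1 + np.exp(-Z[l])) if l == L else np.maximum(0, Z[l])
        cost = -(Y @ np.log(A[L]).T + (1 - Y) @ np.log(1 - A[L]).T) / m
        costs.append(cost.item())
        # Backward propagation and parameter update
        dZ = A[L] - Y
        for l in range(L, 0, -1):
            dW = dZ @ A[l - 1].T / m
            db = np.sum(dZ, axis=1, keepdims=True) / m
            if l > 1:
                # Propagate through W[l] before it is updated
                dZ = (W[l].T @ dZ) * (Z[l - 1] > 0)
            W[l] = W[l] - learning_rate * dW
            b[l] = b[l] - learning_rate * db
    return W, b, costs

# Hypothetical toy data: the label depends on x1 + x2 plus some noise
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 200))
Y = ((X[0] + X[1] + rng.standard_normal(200)) > 0).astype(float).reshape(1, -1)

W, b, costs = train(X, Y, layer_dims=[2, 4, 1])
print(f"cost: {costs[0]:.3f} -> {costs[-1]:.3f}")
```

Tracking the cost at every iteration, as `costs` does here, makes it easy to see whether the loop has converged or whether the learning rate needs adjusting.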

In the case of L2 regularization, the cost function becomes:

J_regularized = J + (𝝺/(2m)) Σₗ Σₖ Σⱼ (W⁽ˡ⁾ₖⱼ)²

where l runs over the layers, k and j index the rows and columns of W⁽ˡ⁾ (the neurons in layer l and layer l-1 respectively), and 𝝺 is the regularization parameter.

The gradients dW⁽ˡ⁾ change accordingly:

dW⁽ˡ⁾ = dZ⁽ˡ⁾(A⁽ˡ⁻¹⁾)ᵀ/m + (𝝺/m)W⁽ˡ⁾
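A minimal sketch of the two regularized quantities follows. It assumes the commonly used penalty (𝝺/(2m)) times the sum of squared weights over all layers, which is the form consistent with the (𝝺/m)W⁽ˡ⁾ term in the gradient above; the function names and the name `lambd` (chosen because `lambda` is a Python keyword) are hypothetical.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """Extra cost term: (lambd / 2m) * sum of squared entries of every W."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

def regularized_dW(dZ, A_prev, W, lambd):
    """dW with the L2 term added; db is unchanged by L2 regularization."""
    m = A_prev.shape[1]
    return dZ @ A_prev.T / m + (lambd / m) * W

# Small example with all-ones arrays so the numbers are easy to follow
weights = [np.ones((2, 3)), np.ones((1, 2))]   # 6 + 2 = 8 squared entries
penalty = l2_penalty(weights, lambd=0.5, m=4)

dW = regularized_dW(dZ=np.ones((2, 4)), A_prev=np.ones((3, 4)),
                    W=np.ones((2, 3)), lambd=0.5)
print(penalty, dW[0, 0])
```

The penalty would be added to the unregularized cost J, and `regularized_dW` would replace the plain dW in the backward pass.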

For any bugs that arise in the code, one thing that helps a lot in troubleshooting is checking the dimensions of the matrices at each step. For details refer here:

Getting Matrix Dimensions Right in Neural Networks

I have always had problems in getting the shape of the various matrices right when trying to use forward or backward…

psrivasin.medium.com
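As a rough illustration of the dimension check, a small helper can compare every array against the shapes implied by the layer sizes. The helper name `shape_mismatches` and the `layer_dims` convention (nₓ followed by the layer sizes) are assumptions for this sketch.

```python
import numpy as np

def shape_mismatches(X, Y, params, layer_dims):
    """Return a list of messages, one per array whose shape is not as expected."""
    m = X.shape[1]
    expected = {"X": (layer_dims[0], m), "Y": (1, m)}
    actual = {"X": X.shape, "Y": Y.shape}
    for l in range(1, len(layer_dims)):
        expected[f"W{l}"] = (layer_dims[l], layer_dims[l - 1])
        expected[f"b{l}"] = (layer_dims[l], 1)
        actual[f"W{l}"] = params[f"W{l}"].shape
        actual[f"b{l}"] = params[f"b{l}"].shape
    return [f"{name}: expected {expected[name]}, got {actual[name]}"
            for name in expected if expected[name] != actual[name]]

layer_dims = [3, 4, 1]
X, Y = np.zeros((3, 5)), np.zeros((1, 5))
params = {
    "W1": np.zeros((4, 3)), "b1": np.zeros((4, 1)),
    "W2": np.zeros((1, 4)), "b2": np.zeros((1,)),   # b2 is deliberately 1-D
}
problems = shape_mismatches(X, Y, params, layer_dims)
print(problems)
```

A 1-D bias such as `b2` here is a typical source of silent broadcasting bugs, which is why it is worth checking shapes explicitly.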

All the content here has been sourced from DeepLearning.AI. The wording is mine, and so are any mistakes.

Thanks for reading. Any comments, queries, or suggestions are welcome.


Written by Prakhar S