Deep Neural Network: Forward and Backward Propagation

Prakhar S
3 min read · Dec 9, 2021
Deep Neural Network. Image by Author

In this article, I will go over the steps involved in solving a binary classification problem using a deep neural network having L layers. If you are new to neural networks, I would suggest going over this article first, where I have discussed a single neuron neural network.

The input data X is of size (nₓ, m), where nₓ is the number of features and m is the number of samples. The output Y is of size (1, m).

The number of neurons in layer l is denoted by n⁽ˡ⁾.

We will use the ‘relu’ activation function for layers 1 to L-1, and ‘sigmoid’ for layer L. We initialise A⁽⁰⁾ = X and the weight matrices W and intercepts b for each layer l, from 1 to L, as follows:

W⁽ˡ⁾ = np.random.randn(n⁽ˡ⁾, n⁽ˡ⁻¹⁾) * 0.01

The ‘np.random.randn’ function generates a random, normally distributed array of the specified size, and we multiply it by 0.01 to keep our initial weights small, which can otherwise lead to the vanishing gradients problem.

b⁽ˡ⁾ = np.zeros((n⁽ˡ⁾,1)) 
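The initialisation above can be sketched in NumPy. This is a minimal sketch: the function name `initialize_parameters` and the `layer_dims` list (holding nₓ followed by n⁽¹⁾ … n⁽ᴸ⁾) are my own illustrative choices, not from the original text.

```python
import numpy as np

def initialize_parameters(layer_dims, seed=0):
    """layer_dims = [n_x, n1, ..., nL]; returns a dict of W1..WL, b1..bL."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # small random weights (scaled by 0.01) to avoid vanishing gradients
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        # intercepts start at zero, shape (n_l, 1)
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```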

Forward Propagation

  1. Calculate Z for each layer l, from layer 1 to layer L:

(Take A⁽⁰⁾ = X)

Z⁽ˡ⁾ = W⁽ˡ⁾A⁽ˡ⁻¹⁾ + b⁽ˡ⁾

2. Store A⁽ˡ⁻¹⁾, W⁽ˡ⁾, b⁽ˡ⁾ and Z⁽ˡ⁾ in a list, appending for each layer.

3. Calculate A⁽ˡ⁾

A⁽ˡ⁾ = relu(Z⁽ˡ⁾)

For the layer L,

A⁽ᴸ⁾ = sigmoid(Z⁽ᴸ⁾)

4. Compute the cost, using:

J = -(Y(log(A⁽ᴸ⁾))ᵀ + (1-Y)(log(1-A⁽ᴸ⁾))ᵀ)/m

where Aᵀ denotes the transpose of matrix A.
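The forward pass described in steps 1–4 can be sketched as follows. The function names (`forward_propagation`, `compute_cost`) and the dict-based parameter layout are my own assumptions; the cost is computed with an element-wise sum, which is equivalent to the transpose form above.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_propagation(X, params):
    """Layers 1..L: relu for layers 1..L-1, sigmoid for layer L.
    Returns A_L and a cache of (A_prev, W, b, Z) per layer."""
    L = len(params) // 2
    A = X  # A0 = X
    caches = []
    for l in range(1, L + 1):
        A_prev = A
        W, b = params[f"W{l}"], params[f"b{l}"]
        Z = W @ A_prev + b            # Z_l = W_l A_{l-1} + b_l
        A = sigmoid(Z) if l == L else relu(Z)
        caches.append((A_prev, W, b, Z))
    return A, caches

def compute_cost(AL, Y):
    """Binary cross-entropy, averaged over the m samples."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
```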

Backward Propagation

  1. For the layer L:
dZ⁽ᴸ⁾ = A⁽ᴸ⁾ - Y

Refer to this article for the derivation of the above equation.

Then, getting the values of A⁽ᴸ⁻¹⁾, W⁽ᴸ⁾ and b⁽ᴸ⁾ from the cache created during forward propagation:

dW⁽ᴸ⁾ = (dZ⁽ᴸ⁾(A⁽ᴸ⁻¹⁾)ᵀ)/m

db⁽ᴸ⁾ = (sum of dZ⁽ᴸ⁾ over the m samples)/m

dA⁽ᴸ⁻¹⁾ = (W⁽ᴸ⁾)ᵀdZ⁽ᴸ⁾

2. For each layer l from L-1 down to 1:

dZ⁽ˡ⁾ = dA⁽ˡ⁾ * relu'(Z⁽ˡ⁾)

where ‘*’ denotes element-wise multiplication between two matrices and relu′(Z⁽ˡ⁾) denotes the derivative of the relu activation function with respect to Z⁽ˡ⁾.


dW⁽ˡ⁾ = dZ⁽ˡ⁾(A⁽ˡ⁻¹⁾)ᵀ/m

db⁽ˡ⁾ = (sum of dZ⁽ˡ⁾ over the m samples)/m

dA⁽ˡ⁻¹⁾ = (W⁽ˡ⁾)ᵀdZ⁽ˡ⁾

and store the gradients dW and db in a list.

3. Use the gradients stored above to update the parameters W and b for each layer

W⁽ˡ⁾ = W⁽ˡ⁾ - 𝝰 dW⁽ˡ⁾

b⁽ˡ⁾ = b⁽ˡ⁾ - 𝝰 db⁽ˡ⁾

where 𝝰 denotes the learning rate.
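The backward pass and parameter update in steps 1–3 can be sketched as below. The cache layout (A_prev, W, b, Z) per layer and the function names are my own illustrative choices; db is the row-wise sum of dZ over the m samples, as in the equations above.

```python
import numpy as np

def relu_backward(Z):
    # derivative of relu wrt Z: 1 where Z > 0, else 0
    return (Z > 0).astype(float)

def backward_propagation(AL, Y, caches):
    """caches[l - 1] holds (A_prev, W, b, Z) saved during forward propagation."""
    grads = {}
    L = len(caches)
    m = Y.shape[1]
    dZ = AL - Y  # layer L: sigmoid output with cross-entropy loss
    for l in range(L, 0, -1):
        A_prev, W, b, Z = caches[l - 1]
        if l < L:
            dZ = dA * relu_backward(Z)  # hidden layers use relu
        grads[f"dW{l}"] = dZ @ A_prev.T / m
        grads[f"db{l}"] = np.sum(dZ, axis=1, keepdims=True) / m
        dA = W.T @ dZ  # becomes dA of layer l - 1
    return grads

def update_parameters(params, grads, alpha):
    """Gradient-descent step: W := W - alpha*dW, b := b - alpha*db."""
    for l in range(1, len(params) // 2 + 1):
        params[f"W{l}"] -= alpha * grads[f"dW{l}"]
        params[f"b{l}"] -= alpha * grads[f"db{l}"]
    return params
```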

Repeat all the steps for n iterations, until the loss converges, to find the optimum values of W and b.

In the case of L2 regularization, the cost function becomes:

J = -(Y(log(A⁽ᴸ⁾))ᵀ + (1-Y)(log(1-A⁽ᴸ⁾))ᵀ)/m + (𝝺/2m) Σₗ Σₖ Σⱼ (Wₖⱼ⁽ˡ⁾)²

where l denotes the layers, k and j index the neurons in each layer, and 𝝺 is the regularization parameter.

Also, the gradients dW⁽ˡ⁾ change accordingly:

dW⁽ˡ⁾ = dZ⁽ˡ⁾(A⁽ˡ⁻¹⁾)ᵀ/m + (𝝺/m)W⁽ˡ⁾
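As a minimal sketch of the regularized gradient (the helper name `dW_with_l2` and the `lambd` parameter are my own):

```python
import numpy as np

def dW_with_l2(dZ, A_prev, W, lambd):
    """Weight gradient with the L2 penalty term (lambd/m)*W added."""
    m = A_prev.shape[1]
    return dZ @ A_prev.T / m + (lambd / m) * W
```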

For any bugs that arise in the code, one thing that helps a lot in troubleshooting is checking the dimensions of the matrices at each step.
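One simple way to do such dimension checks (a sketch; the helper name `check_shapes` and the `layer_dims` list are my own):

```python
import numpy as np

def check_shapes(params, layer_dims):
    """Assert W^l is (n_l, n_{l-1}) and b^l is (n_l, 1) for every layer."""
    for l in range(1, len(layer_dims)):
        assert params[f"W{l}"].shape == (layer_dims[l], layer_dims[l - 1]), f"W{l} shape mismatch"
        assert params[f"b{l}"].shape == (layer_dims[l], 1), f"b{l} shape mismatch"
```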

All the content here has been sourced from DeepLearning.AI. The wording is mine, and so are any mistakes.

Thanks for reading. Any comments, queries, or suggestions are welcome.