# Deep Neural Network: Forward and Backward Propagation


In this article, I will go over the steps involved in solving a binary classification problem using a deep neural network with L layers. If you are new to neural networks, I would suggest going over this article first, where I discuss a single-neuron neural network.

The input data X is of size (nₓ, m), where nₓ is the number of features and m is the number of samples. The output Y is of size (1, m).

The number of neurons in layer l is denoted by n⁽ˡ⁾.

We will use the ‘relu’ activation function for layers 1 to L-1, and ‘sigmoid’ for layer L. We initialise A⁽⁰⁾ = X, and the weight matrices W and intercepts b for each layer l, from 1 to L, as follows:

`W⁽ˡ⁾ = np.random.randn(n⁽ˡ⁾, n⁽ˡ⁻¹⁾) * 0.01`

The ‘np.random.randn’ function generates a normally distributed random array of the specified size, and we multiply it by 0.01 to keep our initial weights small; large initial weights can saturate the sigmoid activation, which leads to the vanishing gradients problem.

`b⁽ˡ⁾ = np.zeros((n⁽ˡ⁾, 1))`
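The initialisation for all L layers can be sketched in NumPy as follows (the function name and the `layer_dims` argument are my own, and I use NumPy's newer `default_rng` interface in place of a bare `np.random.randn` call):

```python
import numpy as np

def initialize_parameters(layer_dims, seed=1):
    """Initialise W and b for layers 1..L.

    layer_dims[0] is n_x; layer_dims[l] is the number of
    neurons n^(l) in layer l.
    """
    rng = np.random.default_rng(seed)
    parameters = {}
    L = len(layer_dims) - 1  # number of layers, excluding the input
    for l in range(1, L + 1):
        # small random weights, scaled by 0.01 as described above
        parameters["W" + str(l)] = rng.standard_normal(
            (layer_dims[l], layer_dims[l - 1])) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters
```

For example, `initialize_parameters([4, 3, 1])` builds a two-layer network with nₓ = 4 inputs, 3 hidden neurons and 1 output neuron.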

## Forward Propagation

1. Calculate Z⁽ˡ⁾ for each layer l, from layer 1 to layer L:

(Take A⁽⁰⁾ = X)

`Z⁽ˡ⁾ = W⁽ˡ⁾A⁽ˡ⁻¹⁾ + b⁽ˡ⁾`

2. Store A⁽ˡ⁻¹⁾, W⁽ˡ⁾, b⁽ˡ⁾ and Z⁽ˡ⁾ in a cache (a list), appending an entry for each layer.

3. Calculate A⁽ˡ⁾. For layers 1 to L-1:

`A⁽ˡ⁾ = relu(Z⁽ˡ⁾)`

For the layer L,

`A⁽ᴸ⁾ = sigmoid(Z⁽ᴸ⁾)`

4. Compute the cost J, using:

`J = -(Y log(A⁽ᴸ⁾)ᵀ + (1-Y) log(1-A⁽ᴸ⁾)ᵀ)/m`

where Aᵀ denotes the transpose of matrix A.
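Steps 1–4 can be sketched in NumPy as follows (the function names, the tuple layout of the cache, and the small `eps` guard against log(0) are my own additions):

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_propagation(X, parameters):
    """Run layers 1..L, caching (A_prev, W, b, Z) for each layer."""
    caches = []
    A = X  # A^(0) = X
    L = len(parameters) // 2  # each layer contributes a W and a b
    for l in range(1, L + 1):
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        Z = W @ A + b
        caches.append((A, W, b, Z))
        # relu for layers 1..L-1, sigmoid for the output layer L
        A = sigmoid(Z) if l == L else relu(Z)
    return A, caches

def compute_cost(AL, Y):
    """Binary cross-entropy cost J; eps guards against log(0)."""
    m = Y.shape[1]
    eps = 1e-12
    cost = -(Y @ np.log(AL + eps).T + (1 - Y) @ np.log(1 - AL + eps).T) / m
    return float(np.squeeze(cost))
```

Because Y is (1, m) and log(A⁽ᴸ⁾)ᵀ is (m, 1), each matrix product in the cost collapses to a single number, matching the formula above.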

## Backward Propagation

1. For the layer L :
`dZ⁽ᴸ⁾ = A⁽ᴸ⁾ - Y`

and getting the values of A⁽ᴸ⁻¹⁾, W⁽ᴸ⁾, b⁽ᴸ⁾ from the cache created during forward propagation:

`dW⁽ᴸ⁾ = (dZ⁽ᴸ⁾(A⁽ᴸ⁻¹⁾)ᵀ)/m`

`db⁽ᴸ⁾ = Σ(dZ⁽ᴸ⁾)/m` (the sum runs over the m samples)

`dA⁽ᴸ⁻¹⁾ = (W⁽ᴸ⁾)ᵀdZ⁽ᴸ⁾`

2. For all the other layers, from l = L-1 down to l = 1:

`dZ⁽ˡ⁾ = dA⁽ˡ⁾ * relu'(Z⁽ˡ⁾)`

where ‘*’ denotes element-wise multiplication between two matrices and relu'(Z⁽ˡ⁾) denotes the derivative of the relu activation function with respect to Z⁽ˡ⁾.

and

`dW⁽ˡ⁾ = (dZ⁽ˡ⁾(A⁽ˡ⁻¹⁾)ᵀ)/m`

`db⁽ˡ⁾ = Σ(dZ⁽ˡ⁾)/m`

`dA⁽ˡ⁻¹⁾ = (W⁽ˡ⁾)ᵀdZ⁽ˡ⁾`

and store the gradients dW and db in a list.
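Steps 1 and 2 can be sketched as follows, assuming each cache entry holds (A⁽ˡ⁻¹⁾, W⁽ˡ⁾, b⁽ˡ⁾, Z⁽ˡ⁾) as stored during forward propagation (the function and dictionary key names are my own):

```python
import numpy as np

def backward_propagation(AL, Y, caches):
    """Compute dW and db for layers L..1 from the forward-pass caches.

    Each cache entry holds (A_prev, W, b, Z) for one layer.
    """
    grads = {}
    m = Y.shape[1]
    L = len(caches)
    dZ = AL - Y  # layer L: sigmoid output with cross-entropy cost
    for l in range(L, 0, -1):
        A_prev, W, b, Z = caches[l - 1]
        grads["dW" + str(l)] = (dZ @ A_prev.T) / m
        grads["db" + str(l)] = np.sum(dZ, axis=1, keepdims=True) / m
        if l > 1:
            dA_prev = W.T @ dZ          # dA^(l-1) = W^T dZ
            Z_prev = caches[l - 2][3]   # Z of layer l-1
            dZ = dA_prev * (Z_prev > 0)  # relu'(Z) is 1 where Z > 0, else 0
    return grads
```

Note that db⁽ˡ⁾ sums dZ⁽ˡ⁾ across the m samples, which is why `axis=1, keepdims=True` is needed to keep its shape (n⁽ˡ⁾, 1).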

3. Use the stored gradients to update the parameters W and b for each layer:

`W⁽ˡ⁾ = W⁽ˡ⁾ - 𝝰dW⁽ˡ⁾`

`b⁽ˡ⁾ = b⁽ˡ⁾ - 𝝰db⁽ˡ⁾`

where 𝝰 denotes the learning rate.
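The update step can be sketched as (the function name and dictionary layout are my own; `alpha` stands for 𝝰):

```python
import numpy as np

def update_parameters(parameters, grads, alpha):
    """One gradient-descent step: W -= alpha*dW, b -= alpha*db per layer."""
    L = len(parameters) // 2  # each layer contributes a W and a b
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= alpha * grads["dW" + str(l)]
        parameters["b" + str(l)] -= alpha * grads["db" + str(l)]
    return parameters
```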

Repeat all the steps for n iterations, until the loss converges, to find the optimum values of W and b.
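Putting it all together, here is a minimal end-to-end sketch on toy data (the data, the network sizes, the learning rate and the iteration count are all illustrative choices of mine, not values from the original):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: m = 200 samples, n_x = 2 features; label is 1 when x0 + x1 > 0.
X = rng.standard_normal((2, 200))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)

# Two-layer net: 4 relu units in layer 1, 1 sigmoid unit in layer 2.
W1 = rng.standard_normal((4, 2)) * 0.01; b1 = np.zeros((4, 1))
W2 = rng.standard_normal((1, 4)) * 0.01; b2 = np.zeros((1, 1))
alpha, m = 0.5, X.shape[1]

costs = []
for i in range(2000):
    # forward propagation
    Z1 = W1 @ X + b1; A1 = np.maximum(0, Z1)
    Z2 = W2 @ A1 + b2; A2 = 1.0 / (1.0 + np.exp(-Z2))
    cost = float(np.squeeze(
        -(Y @ np.log(A2 + 1e-12).T + (1 - Y) @ np.log(1 - A2 + 1e-12).T) / m))
    costs.append(cost)
    # backward propagation
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m; db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)  # relu'
    dW1 = dZ1 @ X.T / m; db1 = dZ1.sum(axis=1, keepdims=True) / m
    # parameter update
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
```

On this linearly separable toy problem the cost falls steadily from its initial value of about log 2 ≈ 0.693.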

In the case of L2 regularization, the cost function becomes:

`J = -(Y log(A⁽ᴸ⁾)ᵀ + (1-Y) log(1-A⁽ᴸ⁾)ᵀ)/m + (𝝺/2m) Σₗ Σₖ Σⱼ (Wₖ,ⱼ⁽ˡ⁾)²`

where l runs over the layers, k and j index the neurons in each layer, and 𝝺 is the regularization parameter.

The gradients dW⁽ˡ⁾ also change accordingly:

`dW⁽ˡ⁾ = dZ⁽ˡ⁾(A⁽ˡ⁻¹⁾)ᵀ/m + (𝝺/m)W⁽ˡ⁾`
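As a sketch, the regularised gradient for a single layer (the helper name and the `lambd` spelling for 𝝺 are my own):

```python
import numpy as np

def dW_with_l2(dZ, A_prev, W, lambd, m):
    """dW for one layer under L2 regularisation:
    the usual gradient plus the (lambda/m) * W penalty term."""
    return (dZ @ A_prev.T) / m + (lambd / m) * W
```

The db⁽ˡ⁾ gradients are unchanged, since the penalty involves only the weights W.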

For any bugs that arise in the code, one thing that helps a lot in troubleshooting is checking the dimensions of the matrices at each step. For details, refer here:

All the content here has been sourced from DeepLearning.AI. The wordings are mine and so are any mistakes.