In this article, I will go over the steps involved in solving a binary classification problem using a deep neural network with L layers. If you are new to neural networks, I would suggest going through this article first, where I discuss a single-neuron neural network.
The input data X is of size (nₓ, m), where nₓ is the number of features and m is the number of samples. The output Y is of size (1, m).
The number of neurons in layer l is denoted by n⁽ˡ⁾.
We will use the ‘relu’ activation function for layers 1 to L-1, and ‘sigmoid’ for layer L. We initialise A⁽⁰⁾ = X, and the weight matrices W and intercepts b for each layer l, from 1 to L, as follows:
W⁽ˡ⁾ = np.random.randn(n⁽ˡ⁾, n⁽ˡ⁻¹⁾) * 0.01
The ‘np.random.randn’ function generates a normally distributed random array of the specified size, and we multiply it by 0.01 to keep our initial weights small, which otherwise can lead to the vanishing gradients problem.
b⁽ˡ⁾ = np.zeros((n⁽ˡ⁾,1))
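Putting the initialisation together, here is a minimal sketch in NumPy. The name layer_dims is my own choice for illustration: an assumed list [n⁽⁰⁾, n⁽¹⁾, …, n⁽ᴸ⁾] of layer sizes, with n⁽⁰⁾ = nₓ.

```python
import numpy as np

def initialize_parameters(layer_dims):
    # layer_dims = [n0, n1, ..., nL]; n0 = number of input features
    parameters = {}
    L = len(layer_dims) - 1          # number of layers, excluding the input layer
    for l in range(1, L + 1):
        # small random weights, zero biases, as described above
        parameters["W" + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters["b" + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters
```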
Forward Propagation
1. Calculate Z⁽ˡ⁾ for each layer l, from layer 1 to layer L (taking A⁽⁰⁾ = X):
Z⁽ˡ⁾ = W⁽ˡ⁾A⁽ˡ⁻¹⁾ + b⁽ˡ⁾
2. Store A⁽ˡ⁻¹⁾, W⁽ˡ⁾, b⁽ˡ⁾, Z⁽ˡ⁾ in a cache (a list), appending one entry for each layer.
3. Calculate A⁽ˡ⁾
A⁽ˡ⁾ = relu(Z⁽ˡ⁾)
For the layer L,
A⁽ᴸ⁾ = sigmoid(Z⁽ᴸ⁾)
4. Compute the cost J, using:
J = -(Y·log(A⁽ᴸ⁾)ᵀ + (1-Y)·log(1-A⁽ᴸ⁾)ᵀ)/m
where Aᵀ denotes the transpose of matrix A.
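Putting steps 1–4 together, here is a minimal sketch in NumPy. The helper names (relu, sigmoid, forward_propagation, compute_cost) and the structure of the caches list are my own choices for illustration, not a fixed API:

```python
def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_propagation(X, parameters):
    caches = []                       # one (A_prev, W, b, Z) tuple per layer
    A = X                             # A(0) = X
    L = len(parameters) // 2
    for l in range(1, L):             # layers 1 .. L-1 use relu
        A_prev = A
        W, b = parameters["W" + str(l)], parameters["b" + str(l)]
        Z = W @ A_prev + b
        caches.append((A_prev, W, b, Z))
        A = relu(Z)
    W, b = parameters["W" + str(L)], parameters["b" + str(L)]
    Z = W @ A + b                     # layer L uses sigmoid
    caches.append((A, W, b, Z))
    AL = sigmoid(Z)
    return AL, caches

def compute_cost(AL, Y):
    m = Y.shape[1]
    # J = -(Y log(AL)^T + (1-Y) log(1-AL)^T) / m
    cost = -(Y @ np.log(AL).T + (1 - Y) @ np.log(1 - AL).T) / m
    return float(np.squeeze(cost))
```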
Backward Propagation
1. For the layer L:
dZ⁽ᴸ⁾ = A⁽ᴸ⁾ - Y
Refer to this article for the details of the above equation. Then, retrieving the values of A⁽ᴸ⁻¹⁾, W⁽ᴸ⁾, b⁽ᴸ⁾ from the cache created during forward propagation:
dW⁽ᴸ⁾ = dZ⁽ᴸ⁾(A⁽ᴸ⁻¹⁾)ᵀ/m
db⁽ᴸ⁾ = sum(dZ⁽ᴸ⁾)/m (summed over the m samples)
dA⁽ᴸ⁻¹⁾ = (W⁽ᴸ⁾)ᵀdZ⁽ᴸ⁾
2. For all the other layers, from l = L-1 down to l = 1:
dZ⁽ˡ⁾ = dA⁽ˡ⁾ * relu'(Z⁽ˡ⁾)
where ‘*’ denotes element-wise multiplication between two matrices and relu′(Z⁽ˡ⁾) denotes the derivative of the relu activation function with respect to Z⁽ˡ⁾.
and
dW⁽ˡ⁾ = dZ⁽ˡ⁾(A⁽ˡ⁻¹⁾)ᵀ/m
db⁽ˡ⁾ = sum(dZ⁽ˡ⁾)/m
dA⁽ˡ⁻¹⁾ = (W⁽ˡ⁾)ᵀdZ⁽ˡ⁾
and store the gradients dW⁽ˡ⁾ and db⁽ˡ⁾ in a list.
3. Use the gradients stored above to update the parameters W and b for each layer:
W⁽ˡ⁾ = W⁽ˡ⁾ - 𝝰 dW⁽ˡ⁾
b⁽ˡ⁾ = b⁽ˡ⁾ - 𝝰 db⁽ˡ⁾
where 𝝰 denotes the learning rate.
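A matching sketch of the backward pass and the update step, under the same assumptions as the forward-propagation sketch (the relu derivative is taken as 1 where Z > 0 and 0 elsewhere):

```python
def backward_propagation(AL, Y, caches):
    grads = {}
    L = len(caches)
    m = Y.shape[1]
    dZ = AL - Y                                   # dZ(L) = A(L) - Y
    for l in range(L, 0, -1):
        A_prev, W, b, Z = caches[l - 1]
        grads["dW" + str(l)] = dZ @ A_prev.T / m
        grads["db" + str(l)] = np.sum(dZ, axis=1, keepdims=True) / m
        if l > 1:
            dA_prev = W.T @ dZ
            Z_prev = caches[l - 2][3]
            dZ = dA_prev * (Z_prev > 0)           # relu'(Z) = 1 where Z > 0, else 0
    return grads

def update_parameters(parameters, grads, alpha):
    L = len(parameters) // 2
    for l in range(1, L + 1):
        parameters["W" + str(l)] -= alpha * grads["dW" + str(l)]
        parameters["b" + str(l)] -= alpha * grads["db" + str(l)]
    return parameters
```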
Repeat forward propagation, backward propagation, and the parameter update for n iterations, until the loss converges, to find the optimal values of W and b.
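Tying everything together into a training loop (reusing the hypothetical helpers sketched above; the default learning rate and iteration count here are arbitrary choices):

```python
def train(X, Y, layer_dims, alpha=0.0075, n_iterations=2500):
    parameters = initialize_parameters(layer_dims)
    for i in range(n_iterations):
        AL, caches = forward_propagation(X, parameters)
        grads = backward_propagation(AL, Y, caches)
        parameters = update_parameters(parameters, grads, alpha)
        if i % 100 == 0:
            print(f"iteration {i}: cost = {compute_cost(AL, Y):.4f}")
    return parameters
```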
In the case of L2 regularization, the cost function becomes:
J = -(Y·log(A⁽ᴸ⁾)ᵀ + (1-Y)·log(1-A⁽ᴸ⁾)ᵀ)/m + (𝝺/2m) ΣₗΣₖΣⱼ (Wₖⱼ⁽ˡ⁾)²
where l runs over the layers, k and j index the entries of each weight matrix W⁽ˡ⁾, and 𝝺 is the regularization parameter.
The gradients dW⁽ˡ⁾ also change accordingly:
dW⁽ˡ⁾ = dZ⁽ˡ⁾(A⁽ˡ⁻¹⁾)ᵀ/m + (𝝺/m)W⁽ˡ⁾
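In code, this changes only the cost and the dW gradients; a minimal sketch, assuming the parameters dictionary from above and a regularization parameter lambd:

```python
def compute_cost_l2(AL, Y, parameters, lambd):
    m = Y.shape[1]
    cross_entropy = -(Y @ np.log(AL).T + (1 - Y) @ np.log(1 - AL).T) / m
    L = len(parameters) // 2
    # sum of squared weights over all layers (biases are not regularized)
    l2_term = sum(np.sum(np.square(parameters["W" + str(l)])) for l in range(1, L + 1))
    return float(np.squeeze(cross_entropy)) + (lambd / (2 * m)) * l2_term

# and in backward_propagation, the dW line becomes:
# grads["dW" + str(l)] = dZ @ A_prev.T / m + (lambd / m) * W
```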
For any bugs that arise in the code, one thing that helps a lot in troubleshooting is checking the dimensions of the matrices at each step. For details, refer here.
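One lightweight way to do this is to assert the expected shapes as you go; for example, inside the forward-propagation loop sketched above:

```python
# inside the forward-propagation loop, for each layer l:
assert W.shape == (layer_dims[l], layer_dims[l - 1])   # W(l): (n_l, n_(l-1))
assert b.shape == (layer_dims[l], 1)                   # b(l): (n_l, 1)
assert Z.shape == (layer_dims[l], X.shape[1])          # Z(l), A(l): (n_l, m)
```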
All the content here has been sourced from DeepLearning.AI. The wording is mine, and so are any mistakes.
Thanks for reading. Any comments, queries, or suggestions are welcome.