Neural Networks: Logistic Regression
This article gives an overview of training a single-neuron neural network (also known as a perceptron) to solve a logistic regression problem.
Dataset:
The dataset consists of an input matrix X containing m samples, each with n features (each sample forms one column of X, so X has shape (n, m)), and an output/target variable y of shape (1, m), whose entries are 0 or 1.
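To make the shapes concrete, here is a minimal NumPy sketch of such a dataset. The variable names and the synthetic values are only illustrative placeholders.

import numpy as np

# Toy dataset: n = 2 features, m = 4 samples (one sample per column of X)
X = np.array([[0.5, 1.2, -0.3, 2.0],
              [1.0, -0.7, 0.8, 1.5]])   # shape (n, m) = (2, 4)
Y = np.array([[1, 0, 0, 1]])            # shape (1, m), labels are 0 or 1
n, m = X.shape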
We have a binary classification problem: given X, we need to predict whether y = 1 or y = 0. We formulate this with the following equation:
ŷ = σ(wᵀX + b)
where:
ŷ is the vector of predictions, of shape (1, m), for the given input X,
w is the weight vector of shape (n, 1),
b is the intercept, which is a scalar value, and
σ is an activation function. We use the sigmoid function, σ(z) = 1 ∕ (1 + e⁻ᶻ), as our activation function for the 0–1 binary classification problem, since it maps any real-valued input to an output between 0 and 1.
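As a quick illustration, the sigmoid can be written in NumPy as follows (reusing the numpy import from the dataset sketch above; the function name is just a convenient choice for this article):

def sigmoid(z):
    # Logistic function 1 / (1 + e^(-z)); works element-wise on arrays
    return 1.0 / (1.0 + np.exp(-z))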
The loss function for logistic regression is given by (vectorized over the m samples):
Loss, L = −(1∕m) · Σ [ y·log(ŷ) + (1−y)·log(1−ŷ) ]
where y is the vector of true labels,
ŷ is the vector of predicted values calculated above,
and the sum runs over all m samples.
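In NumPy, this vectorized loss could be computed roughly as below, where A stands for the vector of predictions ŷ (it is computed in the forward-propagation sketch further on); the variable names are illustrative:

# Cross-entropy loss averaged over the m samples
L = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m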
Our objective in logistic regression is to find optimum values of w and b that minimise the loss L. In other words, we want to find w and b such that
∂L∕∂w = 0 and ∂L∕∂b = 0
We will use a single-neuron neural network to find the optimum w and b. Its structure is simple: the n input features feed into one unit that computes z = wᵀx + b and then applies the sigmoid activation a = σ(z).
Forward Propagation Steps:
- Initialise a weight vector w of shape (n,1) with 0s (use np.zeros).
- Initialise intercept b = 0.0
- Calculate z = wᵀx + b for a sample x.
- Calculate a = 𝜎(z). {where 𝜎 is the sigmoid function}
- Calculate the loss using:
L = −(y·log(a) + (1−y)·log(1−a))
(see the NumPy sketch just after this list)
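Putting these steps together for a single sample from the toy dataset, a rough sketch (reusing sigmoid and the arrays defined above; variable names are illustrative) could look like:

# Take one sample from the toy dataset, just for illustration
x = X[:, 0:1]                  # shape (n, 1)
y = Y[0, 0]                    # its scalar label, 0 or 1

w = np.zeros((n, 1))           # weight vector initialised with zeros
b = 0.0                        # intercept initialised to zero

z = np.dot(w.T, x) + b         # z = wᵀx + b (a 1x1 array)
a = sigmoid(z)                 # a = σ(z)
L = -(y * np.log(a) + (1 - y) * np.log(1 - a))   # per-sample loss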
Now we need to use back-propagation to calculate the gradients ∂L∕∂w and ∂L∕∂b, and update the weights ‘w’ and intercept ‘b’ using the update equations
w := w − 𝛼·(∂L∕∂w) and b := b − 𝛼·(∂L∕∂b)
where 𝛼 is the learning rate.
Using the chain rule of differentiation, we can write
∂L∕∂w = (∂L∕∂a) · (∂a∕∂z) · (∂z∕∂w)
Computing each term separately, we get
∂L∕∂a = −(y∕a − (1−y)∕(1−a))
∂a∕∂z = a·(1−a) {the derivative of the sigmoid}
∂z∕∂w = x
Substituting everything in the chain rule formula, we get
∂L∕∂w = (a − y)·x
The same way we can obtain
∂L∕∂b = a − y
Also, we can show that
∂L∕∂z = (∂L∕∂a)·(∂a∕∂z) = a − y
which implies
∂L∕∂w = x·(∂L∕∂z) and ∂L∕∂b = ∂L∕∂z
Normally, since we are taking the derivative of the loss function with respect to w, b and z, the convention is to denote these derivatives by ‘dw’, ‘db’ and ‘dz’. So the above equations become:
dz = a − y
dw = x·dz
db = dz
The back-propagation step consists of calculating the above gradients once we have the value of ‘a’ from forward propagation.
We keep calculating the gradients for each sample and accumulating them, to finally obtain the gradients for one pass through the whole data. We then use these gradients to update w and b with the update equations defined above, as in the sketch below.
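A rough sketch of this per-sample loop, reusing X, Y, n, m, w, b and sigmoid from the sketches above, might look like the following. It is intentionally the slow, unvectorized version, and alpha is an assumed learning-rate value:

alpha = 0.01                          # learning rate (illustrative value)
dw = np.zeros((n, 1))                 # accumulated gradient for w
db = 0.0                              # accumulated gradient for b
cost = 0.0                            # accumulated loss

for i in range(m):
    x_i = X[:, i:i+1]                 # i-th sample as an (n, 1) column
    y_i = Y[0, i]                     # i-th label (scalar)
    z = (np.dot(w.T, x_i) + b).item() # z = wᵀx + b as a plain scalar
    a = sigmoid(z)                    # a = σ(z)
    cost += -(y_i * np.log(a) + (1 - y_i) * np.log(1 - a))
    dz = a - y_i                      # dz = a − y
    dw += x_i * dz                    # accumulate dw = x·dz
    db += dz                          # accumulate db = dz

dw /= m                               # average the gradients over the m samples
db /= m
cost /= m

w = w - alpha * dw                    # gradient-descent update of w
b = b - alpha * db                    # gradient-descent update of b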
Vectorization
Instead of repeating this process ‘m’ times, we can use vectorization and avoid the computationally expensive ‘for’ loop over the samples when calculating the gradients.
The formulas then become (using NumPy for the vectorized operations):
Z = wᵀ·X + b
  = np.dot(w.T, X) + b
A = σ(Z)
dZ = A - Y
dw = np.dot(X, dZ.T) / m
db = np.sum(dZ) / m
w := w - 𝛼·dw
b := b - 𝛼·db
where 𝛼 is the learning rate.
We will still have to loop through a number of iterations for the loss to converge, but the vectorization above lets us avoid looping over the individual samples.
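Putting the vectorized formulas together, a complete training loop might look roughly like the sketch below. It reuses the sigmoid helper and the numpy import from above, and the function name, the default learning rate and the number of iterations are my own illustrative choices:

def train_logistic_regression(X, Y, alpha=0.01, num_iterations=1000):
    # X: shape (n, m), one sample per column; Y: shape (1, m) with 0/1 labels
    n, m = X.shape
    w = np.zeros((n, 1))                 # weights initialised to zeros
    b = 0.0                              # intercept initialised to zero

    for _ in range(num_iterations):
        # Forward propagation, vectorized over all m samples
        Z = np.dot(w.T, X) + b           # shape (1, m)
        A = sigmoid(Z)                   # predictions, shape (1, m)

        # Backward propagation, vectorized gradients
        dZ = A - Y                       # shape (1, m)
        dw = np.dot(X, dZ.T) / m         # shape (n, 1)
        db = np.sum(dZ) / m              # scalar

        # Gradient-descent update
        w = w - alpha * dw
        b = b - alpha * db

    return w, b

With the toy X and Y defined at the start of the article, w, b = train_logistic_regression(X, Y) produces the learned parameters, and predictions can then be made with sigmoid(np.dot(w.T, X) + b) > 0.5.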
This was a simple demo of how forward propagation and back-propagation work in a neural network. For larger neural networks the computation becomes more involved as the number and size of the hidden layers grow, but the process remains the same: use forward propagation to calculate the predictions, use backward propagation of the error to obtain the gradients ‘dw’ and ‘db’, use them to update the weights ‘w’ and intercepts ‘b’, and repeat the whole process until the loss is minimised.
All the ideas are from DeepLearning.AI. I have here tried to reproduce what I have learned in my own words.
Thank you for reading. Please share your inputs, suggestions or queries if any.