# A Brief Overview of the Most Common Supervised Machine Learning Algorithms

In this article, I have tried to summarise the main features of the most common supervised ML algorithms in use, excluding ensemble methods such as RandomForest or XGboost, which I will deal with in another article.

**Please note that in all the cases below, n represents the number of input rows and m represents the number of input features.**

**K-Nearest Neighbours (KNN)**

**How it works**

i) Start with a dataset with known k categories

ii) For a new data point in X with an unknown category, calculate the distance from each of the other points.

iii) Sort the distances in the increasing order

iv) From this sorted list, select the top K rows

v) Find the most frequent category from these K top rows. This will be the predicted class for the new data point

**Hyperparameters**

k, which decides how many neighbouring points to consider

**Bias-Variance Tradeoff**

As k increases, bias increases and variance decreases. When k=1, the model overfits as it tries to classify each point into a separate class. As k increases, variance increases and bias increases.

**Does not work well when**

In the presence of outliers and randomly spread points

**Training time complexity**

O(n*m) where **n is the number of data points in the training dataset and m is the number of features.**

**Run Time complexity**

O(n*m) as we need to compute the distances from each point.

**Space-Time complexity**

O(n*m) we need to store all the points in memory to compute distances

**Multiclass classification**

KNN can inherently do multiclass classification.

**Interpretability**

When the number of features m is small, Knn is interpretable.

**When to use**

Knn works very well when m << n and n itself is not very large.

Normally not used in low latency requirements because of the run time and space complexity

**Naive Bayes**

Called Naive Bayes because we use the ‘naive’ assumption of independence of features.

It is a very good baseline for text classification problems.

**Laplace Smoothing**

For a rare word or feature w, P(w|y=k) ≈ 0, where k is one of the values for labels. We cannot drop P(w|y=k), because that would be equivalent to saying P(w|y=k) = 1. So instead we use Laplace smoothing with hyperparameter 𝛼, which is typically taken as 1.

**Bias-variance tradeoff**

𝛼 increases, bias increases and variance decreases.

**Log probabilities**

We take log probabilities in Naive Bayes, to avoid multiplication with very small values.

**Interpretability**

Naive Bayes models are highly interpretable where the feature importance can be directly derived from the model.

**Imbalanced dataset**

For an imbalanced dataset, the majority class has an advantage.

**Outliers**

Laplace smoothing usually takes care of the outliers

**Multiclass Classification**

Naive Bayes can inherently do multiclass classification

It can also handle large dimensionality data as long as we use log probabilities.

**Use cases**

If conditional independence assumption between the features is true, Naive Bayes works very well

Even if some features are dependent, Naive Bayes performance is quite good

Naive Bayes is typically used when we have categorical features, not so much for numerical features

It is super interpretable, so used in cases where interpretability is highly required

**Train and run time and space complexity**

Train and run time and space complexity is low, of the order of O(m).

Can easily overfit if we do not use Laplace smoothing.

**Logistic Regression**

Works on the assumption that the classes are almost linearly separable.

**Bias-Variance Tradeoff**:

Hyperparameter 𝜆 for regularisation. **𝜆 increases, variance decreases and bias increases.**

**Train Time Complexity**

O(nm) where n is the number of data points and m is the number of features.

**Test Time Complexity**

Runtime complexity is O(m), where m is the dimensionality of the data. Hence it is quite useful for low-latency applications, especially if the dimension of the data is small.

**Space Complexity**

Space complexity is in the order of m, ie O(m).

**Support Vector Machines, SVM**

We try to find the margin maximising hyperplane.

Support Vectors are the point closest to the margin maximising hyperplane if we move parallelly in either direction.

The loss function in SVM is given by hinge loss.

RBF kernel or polynomial kernel are used for more complex decision boundaries.

**Hyperparameters**

For Linear Kernel, the hyperparameter is given by C.

**As C increases, bias decreases and variance increases**

For RBF kernel, gamma and C are the hyperparameters.

**As gamma increases bias increases and variance decreases**.

RBF kernel works very well, same as KNN, but has better space and time complexity.

**Train time and Run Time complexity**

Training time O(n²) for SVM. So typically not used when n is large.

Run time complexity is O(km) where k is the number of support vectors.

**Interpretability**

Not highly interpretable.

**Optimal use cases**

Very good for large m

Not for large n

**Decision Trees**

Decision Trees are nested if-else classifiers.

**Entropy**

If the probability of all classes is equal, then entropy is maximum ie 1.

If one class has a higher probability than others, entropy is almost 0.

**Gini Impurity**

The highest value is 0.5. Much easier to compute than Entropy.

**Information gain**

The decrease in entropy or Gini impurity while we traverse a branch in a decision tree.

**Building a Decision Tree**

If we have 4 features, we break the decision tree on each feature and calculate the information gain for each, The feature giving the highest information gain is chosen as the root.

No need for standardization of the numerical features.

For a categorical feature with many values, convert it numerical feature, and conduct P(Yi = 1|Pj) where this denotes one of the feature values.

**Hyperparameters**

The depth of the decision tree is a hyperparameter, along with the max_leaf_nodes and max_features

**Train time complexity**

O(n*logn*m) if m is large

**Run time complexity**

Convert a DT model into if-else conditions. Run time complexity is very small since the depth of the tree is typically 5~10.

For regression using DT, we use MAE as the feature selection criteria.

**Cases for DT**

Balance the data. Avoid One Hot encoding. Highly interpretable

*We have covered the salient features of the most common supervised machine learning algorithms. Stay tuned for similar articles for ensemble machine learning algorithms and unsupervised machine learning algorithms.*

*Thanks for reading !!*