A Brief Overview of the Most Common Supervised Machine Learning Algorithms

Prakhar S
5 min read · Feb 1, 2022
Source: https://bigdata-madesimple.com/machine-learning-explained-understanding-supervised-unsupervised-and-reinforcement-learning/

In this article, I have tried to summarise the main features of the most common supervised ML algorithms in use, excluding ensemble methods such as Random Forest or XGBoost, which I will cover in another article.

Please note that in all the cases below, n represents the number of input rows and m represents the number of input features.

K-Nearest Neighbours (KNN)

Source: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
  • How it works

i) Start with a dataset of points whose categories are known.
ii) For a new data point with an unknown category, calculate its distance to every point in the training set.
iii) Sort the distances in increasing order.
iv) From this sorted list, select the top K rows.
v) Find the most frequent category among these K rows. This is the predicted class for the new data point (see the sketch below).
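
To make the steps concrete, here is a minimal NumPy sketch of the procedure, assuming Euclidean distance and a simple majority vote (the function name and variables are my own, not from any particular library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the class of x_new by majority vote among its k nearest neighbours."""
    # ii) distance from x_new to every training point (Euclidean here)
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # iii)-iv) indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # v) most frequent category among the k nearest neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]
```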

  • Hyperparameters

k, which decides how many neighbouring points to consider

  • Bias-Variance Tradeoff

As k increases, bias increases and variance decreases. When k = 1, the model overfits, effectively memorising the training data. When k becomes very large, the model underfits and ends up predicting the most common class almost everywhere.

  • Does not work well when

When the data contains outliers, or when the points are randomly spread with no clear local structure.

  • Training time complexity

O(n*m), where n is the number of data points in the training dataset and m is the number of features. KNN is a lazy learner, so "training" essentially amounts to storing the data.

  • Run Time complexity

O(n*m) as we need to compute the distances from each point.

  • Space complexity

O(n*m), as we need to store all the points in memory to compute distances.

  • Multiclass classification

KNN can inherently do multiclass classification.

  • Interpretability

When the number of features m is small, KNN is interpretable.

  • When to use

KNN works very well when m << n and n itself is not very large.

It is normally not used in low-latency applications because of its run-time and space complexity.

Naive Bayes

https://www.saedsayad.com/naive_bayesian.htm

It is called Naive Bayes because it relies on the ‘naive’ assumption that the features are conditionally independent given the class.

It is a very good baseline for text classification problems.
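
As an illustration, a minimal text-classification baseline with scikit-learn's CountVectorizer and MultinomialNB could look like the sketch below; the toy documents and labels are invented purely for the example:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, just to show the shape of a Naive Bayes text baseline.
docs   = ["free prize inside", "meeting at noon", "win a free prize", "lunch meeting today"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))  # alpha = Laplace smoothing
model.fit(docs, labels)
print(model.predict(["free lunch prize"]))
```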

  • Laplace Smoothing

For a rare word or feature w, P(w|y=k) ≈ 0, where k is one of the label values. We cannot simply drop the term P(w|y=k) from the product, because that would be equivalent to assuming P(w|y=k) = 1. So instead we use Laplace smoothing with a hyperparameter 𝛼, which is typically set to 1.
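
As a sketch of what the smoothed estimate looks like for a multinomial (bag-of-words) Naive Bayes; the helper function below is hypothetical and only illustrates the formula:

```python
def smoothed_likelihood(count_w_k, count_k, vocab_size, alpha=1.0):
    """Laplace-smoothed estimate of P(w | y=k) for a multinomial Naive Bayes.

    count_w_k  : number of times word w appears in documents of class k
    count_k    : total word count over all documents of class k
    vocab_size : number of distinct words in the vocabulary
    alpha      : smoothing hyperparameter (1 gives classic Laplace smoothing)
    """
    return (count_w_k + alpha) / (count_k + alpha * vocab_size)

# A word never seen in class k still gets a small, non-zero probability:
print(smoothed_likelihood(count_w_k=0, count_k=1000, vocab_size=5000))  # ~0.000167
```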

  • Bias-variance tradeoff

As 𝛼 increases, bias increases and variance decreases.

  • Log probabilities

We take log probabilities in Naive Bayes to avoid numerical underflow from multiplying many very small values.
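
A quick numerical sketch of why this matters:

```python
import numpy as np

probs = np.full(1000, 1e-5)      # 1000 small per-feature probabilities
print(np.prod(probs))            # 0.0 -- the raw product underflows
print(np.sum(np.log(probs)))     # about -11512.9 -- the sum of logs is well-behaved
```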

  • Interpretability

Naive Bayes models are highly interpretable; feature importance can be read directly from the learned conditional probabilities.

  • Imbalanced dataset

For an imbalanced dataset, the class priors favour the majority class, giving it an advantage.

  • Outliers

Laplace smoothing usually takes care of outliers (rare words or feature values).

  • Multiclass Classification

Naive Bayes can inherently do multiclass classification

It can also handle high-dimensional data, as long as we use log probabilities.

  • Use cases

If the conditional independence assumption between the features holds, Naive Bayes works very well.

Even if some features are dependent, Naive Bayes performance is often still quite good.

Naive Bayes is typically used with categorical features, and not so much with numerical features.

It is highly interpretable, so it is used in cases where interpretability is a strong requirement.

  • Train and run time and space complexity

Training time is O(n*m) (a single counting pass over the data), while run-time and space complexity are of the order of O(m*c), where c is the number of classes; all of these are low.

Can easily overfit if we do not use Laplace smoothing.

Logistic Regression

https://www.javatpoint.com/linear-regression-vs-logistic-regression-in-machine-learning

Works on the assumption that the classes are almost linearly separable.

  • Bias-Variance Tradeoff:

The regularisation strength 𝜆 is a hyperparameter. As 𝜆 increases, variance decreases and bias increases.
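
Note that scikit-learn's LogisticRegression exposes C, the inverse of 𝜆, so a small C means strong regularisation. A rough sketch on a synthetic dataset (used here only for illustration):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

weak   = LogisticRegression(C=100.0).fit(X, y)   # weak regularisation (small lambda)
strong = LogisticRegression(C=0.01).fit(X, y)    # strong regularisation (large lambda)

# Stronger regularisation shrinks the learned weights (higher bias, lower variance).
print(abs(weak.coef_).mean(), abs(strong.coef_).mean())
```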

  • Train Time Complexity

O(nm) where n is the number of data points and m is the number of features.

  • Test Time Complexity

Runtime complexity is O(m), where m is the dimensionality of the data. Hence it is quite useful for low-latency applications, especially if the dimension of the data is small.

  • Space Complexity

Space complexity is of the order of m, i.e. O(m), as we only need to store the m learned weights.

Support Vector Machines (SVM)

https://en.wikipedia.org/wiki/Support-vector_machine

We try to find the margin-maximising hyperplane.

Support vectors are the points that lie closest to the margin-maximising hyperplane, i.e. the first points touched when the hyperplane is moved parallel to itself in either direction.

The loss function in SVM is given by hinge loss.

An RBF or polynomial kernel is used for more complex, non-linear decision boundaries.

  • Hyperparameters

For the linear kernel, the hyperparameter is C.

As C increases, bias decreases and variance increases.

For the RBF kernel, gamma and C are the hyperparameters.

As gamma increases, bias increases and variance decreases.

The RBF kernel can perform very well, much like KNN, but with better space and time complexity.
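
A small sketch with scikit-learn's SVC on a toy two-moons dataset (the data is only for illustration); C and gamma are the hyperparameters discussed above, and n_support_ shows how many support vectors drive the prediction cost:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.n_support_)   # number of support vectors per class
print(clf.score(X, y))  # training accuracy on the toy data
```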

  • Train time and Run Time complexity

Training time is roughly O(n²) for kernel SVMs, so they are typically not used when n is large.

Run time complexity is O(k*m), where k is the number of support vectors.

  • Interpretability

Not highly interpretable.

  • Optimal use cases

Very good for large m

Not for large n

Decision Trees

https://datascience.foundation/sciencewhitepaper/understanding-decision-trees-with-python

Decision Trees are nested if-else classifiers.

  • Entropy

If all classes are equally probable, entropy is at its maximum (1 bit for two balanced classes).

If one class has a probability close to 1, entropy approaches 0.

  • Gini Impurity

The highest value is 0.5 (for two balanced classes). It is much cheaper to compute than entropy because it avoids logarithms.

  • Information gain

The decrease in entropy (or Gini impurity) achieved by a split as we traverse a branch of the decision tree.
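
A short NumPy sketch of these three quantities, using made-up class distributions:

```python
import numpy as np

def entropy(p):
    """Entropy (in bits) of a class probability distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini impurity of a class probability distribution p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # 1.0, 0.5   -> maximum impurity
print(entropy([0.9, 0.1]), gini([0.9, 0.1]))   # ~0.47, 0.18 -> one class dominates

# Information gain of a split = parent impurity - weighted child impurity.
parent   = entropy([0.5, 0.5])
children = 0.5 * entropy([0.8, 0.2]) + 0.5 * entropy([0.2, 0.8])
print(parent - children)                        # ~0.28
```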

  • Building a Decision Tree

If we have, say, 4 features, we try splitting on each feature and calculate the information gain for each split. The feature giving the highest information gain is chosen as the root, and the process is repeated recursively for each child node.

No need for standardization of the numerical features.

For a categorical feature with many values, convert it into a numerical feature by replacing each category value Pj with P(Yi = 1 | Pj), i.e. the fraction of positive labels observed for that category (response coding).
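
A rough pandas sketch of this response-coding idea, assuming a binary label; the column names and values are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "city":  ["A", "A", "B", "B", "B", "C"],
    "label": [ 1,   0,   1,   1,   0,   0 ],
})

# Replace each category with P(label = 1 | category) estimated from the data.
encoding = df.groupby("city")["label"].mean()
df["city_encoded"] = df["city"].map(encoding)
print(df)
```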

  • Hyperparameters

The depth of the decision tree is a hyperparameter, along with max_leaf_nodes and max_features.
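
These hyperparameters map directly onto scikit-learn's DecisionTreeClassifier arguments; a quick sketch on the Iris dataset:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    max_depth=5,          # depth of the tree
    max_leaf_nodes=20,    # cap on the number of leaves
    max_features=None,    # number of features considered at each split
    criterion="gini",     # or "entropy"
).fit(X, y)

print(tree.get_depth(), tree.get_n_leaves())
```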

  • Train time complexity

O(n*log(n)*m), since each of the m features is sorted (the n*log(n) factor) while searching for the best split.

  • Run time complexity

A trained DT model can be converted into nested if-else conditions. Run time complexity is therefore very small, since the depth of the tree is typically 5–10.

For regression using a DT, we use a regression criterion such as MAE (or MSE) for selecting the splits.
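
For example, scikit-learn's DecisionTreeRegressor accepts "squared_error" (MSE) or "absolute_error" (MAE) as its splitting criterion; a small sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

reg = DecisionTreeRegressor(criterion="absolute_error", max_depth=5).fit(X, y)
print(reg.predict([[3.0]]))
```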

  • Cases for DT

Balance the data before training. Avoid one-hot encoding of high-cardinality categorical features. Decision trees are highly interpretable.

We have covered the salient features of the most common supervised machine learning algorithms. Stay tuned for similar articles on ensemble machine learning algorithms and unsupervised machine learning algorithms.

Thanks for reading!!
