Hyperparameter Tuning in Machine Learning

Prakhar S
6 min read · Nov 13, 2021


Photo by Adi Goldstein on Unsplash

Every machine learning model consists of model parameters, which define how the input data is converted to output, and hyperparameters, which shape the architecture of the model itself. Model parameters are learned by the model from our data during the training process, while hyperparameters cannot be learned from the data and have to be found by experimentation.

Examples of model parameters are the weights in a linear regression or logistic regression model, which are learned during model training.

Examples of hyperparameters include:

i) Number of neighbours in K Nearest Neighbours.

ii) Maximum Depth of a Decision Tree.

iii) Degree of polynomial features in Linear Regression Model.

iv) Number of trees to be included in the Random Forest Model.

v) Number of layers in our Neural Network.

vi) Number of neurons in each layer in our Neural Network.

The most basic method of finding hyperparameters is manual selection: we decide on a range or list of candidate values, iterate over them, and calculate the model's performance for each candidate using an evaluation metric such as accuracy. A more convenient and systematic way is to use the prebuilt search utilities in libraries like scikit-learn, such as GridSearchCV or RandomizedSearchCV.
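As an example of the manual approach, a search over the number of neighbours in KNN might look like this minimal sketch (the iris dataset, the 80:20 split and the candidate values are arbitrary illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Manually try each candidate value of the hyperparameter n_neighbors
# (which data we should score on is discussed in the next section)
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = accuracy_score(y_test, knn.predict(X_test))
    print(f"k={k}: accuracy = {score:.3f}")
```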

Before we discuss these, let's talk a little about cross-validation and the need for splitting our data into train, test and cross-validation sets:

Cross-Validation

Consider a dataset with n rows, along with labels (for classification tasks) or target values (for regression tasks). To find the model parameters, we split the data into training and test sets, usually in a 70:30 or 80:20 ratio. We choose a model, train it on the training data, and obtain our model parameters. Using these parameters, we predict the class labels or real values for the test data. Finally, we pick an evaluation metric such as log-loss or mean squared error and check that the learned parameters give us good scores on both the training and the test data.
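A minimal sketch of this train/test workflow, using linear regression on sklearn's diabetes dataset (the dataset and the 80:20 split are illustrative assumptions):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 80:20 train/test split
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Training learns the model parameters (here, the regression weights)
model = LinearRegression().fit(X_train, y_train)

# Score the learned parameters on both splits with a chosen metric
print("train MSE:", mean_squared_error(y_train, model.predict(X_train)))
print("test  MSE:", mean_squared_error(y_test, model.predict(X_test)))
```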

But what about hyperparameters? We cannot rely on the training process itself, because there is no way for the model to learn its hyperparameters from the training data. Nor is it a good idea to tune them by minimising the training loss: more flexible settings (a deeper tree, a higher polynomial degree) will almost always fit the training data better, so we would simply end up picking hyperparameters that overfit.

We could try using our test data to measure the effect of each set of hyperparameters and choose the ones that minimise the loss on the test data, but such a model will not necessarily generalise well. Why? Because the test data is then no longer unseen: information from it has leaked into our hyperparameter choices, so its score is no longer an honest estimate of performance on new data. And the ultimate aim of any machine learning model is to perform well on data it has never seen during training.

Here's where cross-validation comes into the picture. Instead of dividing our data only into train and test sets, we divide it into train, cross-validation and test sets, typically in a 60:20:20 ratio. We use the training data to find the model parameters, the cross-validation data to compare hyperparameter settings and find the best ones, and the test data to check whether our model parameters and hyperparameters have been tuned well enough to minimise the loss on unseen data.
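One way to get such a 60:20:20 split is with two calls to sklearn's train_test_split (a minimal sketch; GridSearchCV and RandomizedSearchCV below instead handle the cross-validation part internally via k-fold splits of the data passed to them):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off the test set (20% of the data)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Then split the remainder into train (60% overall) and cross-validation (20% overall)
X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_cv), len(X_test))  # roughly a 60:20:20 split
```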

Let's now discuss some algorithms we can use for hyperparameter tuning. Both GridSearchCV and RandomizedSearchCV come prebuilt in sklearn's model_selection module.

GridSearchCV


In GridSearchCV, we pass predefined values for the hyperparameters we want to tune, and GridSearchCV exhaustively tries every combination of those values, evaluating the model for each one. We can then choose the combination that gives the best performance.

Code example of GridSearchCV, adapted from sklearn's official documentation:
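A minimal sketch using an SVM on the iris dataset (the candidate values and the dataset are illustrative choices, not prescriptive):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Candidate values for the two hyperparameters we want to tune
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}

svc = svm.SVC()
clf = GridSearchCV(svc, parameters)  # uses 5-fold cross-validation by default
clf.fit(iris.data, iris.target)

print(clf.best_estimator_)  # e.g. SVC(C=1, kernel='linear')
```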

Here we have taken SVM as our classification algorithm, with the type of kernel and the regularisation parameter C as our hyperparameters. We choose a few values for each and pass them to GridSearchCV, which takes the model and the hyperparameter grid and uses cross-validation to find the best combination. We can then retrieve these hyperparameters through the best_estimator_ attribute, as shown above (C=1, kernel='linear').

Disadvantages of GridSearchCV:

  • It does not scale well. If we have more than 3 hyperparameters, each to be tested over a range of values, GridSearchCV becomes quite computationally expensive.
  • There is no guarantee that we explore the right space. Since the values on which we test our model are fixed beforehand by us, we might miss the best hyperparameter values for our model altogether.

RandomizedSearchCV


RandomizedSearchCV differs from GridSearchCV in that the search is not exhaustive over each and every combination of hyperparameters. Usually, in random search, we do not provide a discrete set of values for each hyperparameter; instead, we provide a statistical distribution from which values are randomly sampled.

How many samples are drawn for model evaluation depends on the number of iterations for which we want to run the algorithm.

The snippet below is adapted from sklearn's official documentation, mirroring the GridSearchCV example above:
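A minimal sketch, again using an SVM on the iris dataset; the uniform range chosen for C is an illustrative assumption:

```python
from scipy.stats import uniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

# C is drawn from a continuous uniform distribution rather than a fixed grid
distributions = {'kernel': ['linear', 'rbf'], 'C': uniform(loc=0.1, scale=10)}

svc = svm.SVC()
clf = RandomizedSearchCV(svc, distributions, n_iter=20, random_state=0)
clf.fit(iris.data, iris.target)

print(clf.best_params_)
```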

The code is quite similar to that for GridSearchCV, except that here we have passed the regularisation hyperparameter C as a uniform distribution. We have also set n_iter = 20, which means the algorithm will try 20 combinations of hyperparameters randomly sampled from the specified distributions and select the best among them. The greater the number of iterations, the better the chances of finding the best parameters, though we also have to consider the runtime.

One big disadvantage of both GridSearch and RandomSearch is that each evaluation is completely independent of the past evaluations, and hence these algorithms spend a lot of time evaluating bad hyperparameters.

Bayesian Optimization

Bayesian optimization improves upon the above algorithms by using the results of previous experiments to guide where to sample next. Hence it is more efficient, typically requiring fewer evaluations, because it makes informed decisions based on past results.

The way it works is as follows: we define a model with hyperparameters λ which, after training, receives a score v according to some evaluation metric. We then use the previously evaluated (λ, v) pairs to compute a posterior expectation over the hyperparameter space, choose the hyperparameter values that look most promising under this posterior as our next candidate, and repeat the process iteratively until we converge to an optimum.
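In practice, this is available through libraries such as scikit-optimize, whose BayesSearchCV offers a GridSearchCV-like interface. A minimal sketch (scikit-optimize is a third-party package, not part of sklearn, and the search space below is an illustrative assumption):

```python
from sklearn import datasets, svm
from skopt import BayesSearchCV          # third-party: pip install scikit-optimize
from skopt.space import Categorical, Real

iris = datasets.load_iris()

# Search space: C on a log-uniform scale, kernel as a categorical choice
# (these ranges are illustrative, not taken from the original post)
search_spaces = {
    'C': Real(1e-2, 1e2, prior='log-uniform'),
    'kernel': Categorical(['linear', 'rbf']),
}

opt = BayesSearchCV(svm.SVC(), search_spaces, n_iter=20, cv=5, random_state=0)
opt.fit(iris.data, iris.target)

print(opt.best_params_)
```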


Thanks for reading. Your feedback and comments are welcome.
