We generally split our dataset into train and test sets, train our model on the train set, and evaluate it on the test set. With this approach the model only ever sees a single training set, which is usually around 4/5 of the data.
Cross-validation gives a better estimate of how well the model generalizes because it makes use of more of the data. In cross-validation, several models are built using different training sets and non-overlapping test sets, and the scores on those test sets are then aggregated into a more reliable estimate of performance.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from collections import Counter
np.set_printoptions(precision=2)
%matplotlib inline
Below we take the default approach to a classification task: divide the data into train/test sets, train the model, and evaluate it on the test set. We try only one split of the dataset without any kind of cross-validation, so the model does not see the data fully, which can result in a less general model.
from sklearn import datasets
iris = datasets.load_iris()
X_iris, Y_iris = iris.data, iris.target
print('Dataset Size : ', X_iris.shape, Y_iris.shape)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X_iris, Y_iris, train_size=0.80, test_size=0.20, random_state=12, stratify=Y_iris)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
print('Train Accuracy : %.2f'%knn.score(X_train, Y_train))
print('Test Accuracy : %.2f'%knn.score(X_test, Y_test))
Below we take the default approach to a regression task: divide the data into train/test sets, train the model, and evaluate it on the test set. We try only one split of the dataset without any kind of cross-validation, so the model does not see the data fully, which can result in a less general model.
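## NOTE (environment assumption): load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2, so the cell below needs an older scikit-learn release; the rest of the tutorial works the same with any regression dataset.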
boston = datasets.load_boston()
X_boston, Y_boston = boston.data, boston.target
print('Dataset Size : ', X_boston.shape, Y_boston.shape)
from sklearn.neighbors import KNeighborsRegressor
X_train, X_test, Y_train, Y_test = train_test_split(X_boston, Y_boston, train_size=0.80, test_size=0.20, random_state=12)
print('Train/Test Sizes : ',X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
knn = KNeighborsRegressor()
knn.fit(X_train, Y_train)
print('Train R^2 Score : %.2f'%knn.score(X_train, Y_train))
print('Test R^2 Score : %.2f'%knn.score(X_test, Y_test))
The above implementation considers only one train/test split, so the model never sees the whole dataset. We might get better results with other train/test combinations, hence it is worth trying several splits to obtain results that generalize well.
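To make this concrete, here is a small sketch of ours (not part of the original notebook) that refits the same k-nearest-neighbors classifier on a few different random splits. The test accuracy typically varies a bit from split to split, which is exactly why averaging over several splits gives a more reliable estimate.
for seed in [0, 1, 2, 3, 4]:
    X_tr, X_te, Y_tr, Y_te = train_test_split(X_iris, Y_iris, train_size=0.80, random_state=seed, stratify=Y_iris)
    model = KNeighborsClassifier().fit(X_tr, Y_tr)  ## same default KNN as above, only the split changes
    print('random_state=%d -> Test Accuracy : %.2f'%(seed, model.score(X_te, Y_te)))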
sklearn also provides various splitting strategies, as mentioned below.
sklearn provides the cross_val_score method, which tries various train/test splits and returns the test score of each split as output. sklearn also provides a cross_validate method, which works like cross_val_score except that it returns a dictionary containing the fit time, score time, and test score for each split.
Below we try StratifiedKFold and StratifiedShuffleSplit on the classification dataset (iris), and KFold and ShuffleSplit on the regression dataset (boston).
K-Fold is the most common form of cross-validation. In K-Fold CV, the dataset is typically divided into 5 or 10 folds, and for each iteration of model training one fold is held out as the test set while the remaining folds are combined into the train set.
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import KFold,StratifiedKFold, ShuffleSplit, StratifiedShuffleSplit
print('Regression With Default CV : ', cross_val_score(KNeighborsRegressor(), X_boston, Y_boston, cv=5)) ## cv=5 uses KFold by default for regressors
print('Regression With KFold Cross Validation : ', cross_val_score(KNeighborsRegressor(), X_boston, Y_boston, cv=KFold(n_splits=5)))
print('Regression With Default CV : \n', cross_validate(KNeighborsRegressor(), X_boston, Y_boston, cv=5)) ## cv=5 uses KFold by default for regressors
print('\nRegression With KFold Cross Validation : \n', cross_validate(KNeighborsRegressor(), X_boston, Y_boston, cv=KFold(n_splits=5)))
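The per-split scores returned above are usually summarized into a single number. As a small sketch of ours (not part of the original code), we can aggregate the cross-validated scores with their mean and standard deviation:
scores = cross_val_score(KNeighborsRegressor(), X_boston, Y_boston, cv=KFold(n_splits=5))  ## array with one R^2 score per fold
print('Average R^2 Score : %.2f (+/- %.2f)'%(scores.mean(), scores.std()))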
We are trying to split the IRIS classification dataset with KFold. Notice that we also print the class distribution in the train and test sets after each split, and that the class distribution is not preserved. By class distribution, we mean that each class of the classification dataset has the same proportion of samples in both train and test sets: if one class represents 30% of the whole dataset, then it should have a 30% representation in both the train and test sets. Hence we should generally use StratifiedKFold for classification datasets and KFold for regression datasets.
kfold = KFold(n_splits=5)
masks = []
for i, (train_indexes, test_indexes) in enumerate(kfold.split(X_iris)):
    mask = np.array([(False if j in train_indexes else True) for j in range(len(Y_iris))])
    print('Split[%d] Train Index Distribution by class : '%(i+1), np.bincount(Y_iris[train_indexes])/len(Y_iris))
    print('Split[%d] Test Index Distribution by class : '%(i+1), np.bincount(Y_iris[test_indexes])/len(Y_iris))
    masks.append(mask)
Below we visualize the splits created by KFold in the previous step. We recorded, as boolean masks, which samples went into the test set at each split. Note from the plot that the Y-axis represents the split number. We can see that in the first split it took the first 30 samples as the test set and the remaining 120 samples as the train set, then the next 30 samples as the test set in the second split, and so on.
with plt.style.context(('seaborn', 'ggplot')):
    plt.matshow(masks, cmap=plt.cm.Blues, fignum=1)
    plt.yticks(range(5), range(1,6))
    plt.grid(None);
StratifiedKFold is commonly used for classification tasks. It works almost like KFold, with the only difference that it keeps the class distribution in the train/test sets the same as in the original dataset. So if one class makes up 30% of the samples in the original dataset, then after splitting, both the train and test sets will also contain 30% of that class.
print('Classification With Default CV : ', cross_val_score(KNeighborsClassifier(), X_iris, Y_iris, cv=5)) ## cv=5 uses StratifiedKFold by default for classifiers
print('Classification With StratifiedKFold Cross Validation : ', cross_val_score(KNeighborsClassifier(), X_iris, Y_iris, cv=StratifiedKFold(n_splits=5)))
print('Classification With Default CV : \n', cross_validate(KNeighborsClassifier(), X_iris, Y_iris, cv=5)) ## cv=5 uses StratifiedKFold by default for classifiers
print('\nClassification With StratifiedKFold Cross Validation : \n', cross_validate(KNeighborsClassifier(), X_iris, Y_iris, cv=StratifiedKFold(n_splits=5)))
The cross_val_score method first divides the dataset into 5 folds and, for each iteration, takes one fold as the test set and the remaining folds as the train set. When an integer is passed for cv, it uses KFold by default to create the folds for regression problems and StratifiedKFold for classification problems.
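If you want to verify which splitter an integer cv value resolves to, scikit-learn's check_cv helper can be used as a quick check (this snippet is our own addition, not part of the original tutorial):
from sklearn.model_selection import check_cv
print(check_cv(cv=5, y=Y_iris, classifier=True))    ## resolves to StratifiedKFold(n_splits=5, ...)
print(check_cv(cv=5, y=Y_boston, classifier=False)) ## resolves to KFold(n_splits=5, ...)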
We are trying to split the classification dataset with StratifiedKFold. Notice that we again print the class distribution in the train and test sets after each split. Here we can see that the class distribution is preserved in both the train and test sets.
skfold = StratifiedKFold(n_splits=5)
masks = []
for i, (train_indexes, test_indexes) in enumerate(skfold.split(X_iris, Y_iris)):
    print('Split[%d] Train Index Distribution by class : '%(i+1), np.bincount(Y_iris[train_indexes])/len(Y_iris))
    print('Split[%d] Test Index Distribution by class : '%(i+1), np.bincount(Y_iris[test_indexes])/len(Y_iris))
    mask = np.array([(False if j in train_indexes else True) for j in range(len(Y_iris))])
    masks.append(mask)
Below we visualize the splits created by StratifiedKFold in the previous step. We recorded, as boolean masks, which samples went into the test set at each split. Note from the plot that the Y-axis represents the split number. Because the iris samples are ordered by class, in the first split the test set consists of the first 10 samples of each class (30 samples in total) while the remaining 120 samples form the train set, so the class proportions are maintained; the next 10 samples of each class become the test set in the second split, and so on.
with plt.style.context(('seaborn', 'ggplot')):
    plt.matshow(masks, cmap=plt.cm.Blues)
    plt.yticks(range(5), range(1,6))
    plt.grid(None);
ShuffleSplit, as its name suggests, splits the dataset based on randomly selected indices (by default it puts 10% of the samples into each test set). It's commonly used for regression tasks.
print('Regression With Default CV : ', cross_val_score(KNeighborsRegressor(), X_boston, Y_boston, cv=5)) ## cv=5 uses KFold by default for regressors
print('Regression With ShuffleSplit Cross Validation : ', cross_val_score(KNeighborsRegressor(), X_boston, Y_boston, cv=ShuffleSplit(n_splits=5)))
print('Regression With Default CV : \n', cross_validate(KNeighborsRegressor(), X_boston, Y_boston, cv=5)) ## cv=5 uses KFold by default for regressors
print('\nRegression With ShuffleSplit Cross Validation : \n', cross_validate(KNeighborsRegressor(), X_boston, Y_boston, cv=ShuffleSplit(n_splits=5)))
We are trying to split the classification dataset with ShuffleSplit. Notice that we again print the class distribution in the train and test sets after each split, and that the class distribution is not preserved. Hence we should generally use StratifiedShuffleSplit for classification datasets and ShuffleSplit for regression datasets.
shuffle_split = ShuffleSplit(n_splits=5)
masks = []
for i, (train_indexes, test_indexes) in enumerate(shuffle_split.split(X_iris)):
    print('Split[%d] Train Index Distribution by class : '%(i+1), np.bincount(Y_iris[train_indexes])/len(Y_iris))
    print('Split[%d] Test Index Distribution by class : '%(i+1), np.bincount(Y_iris[test_indexes])/len(Y_iris))
    mask = np.array([(False if j in train_indexes else True) for j in range(len(Y_iris))])
    masks.append(mask)
We can notice from the visualization below that ShuffleSplit selects samples randomly, unlike KFold, which selects them serially.
with plt.style.context(('seaborn', 'ggplot')):
    plt.matshow(masks, cmap=plt.cm.Blues)
    plt.yticks(range(5), range(1,6))
    plt.grid(None);
StratifiedShuffleSplit works exactly like ShuffleSplit but is designed for classification tasks where we need to maintain the class proportions after splitting the data.
print('Classification With Default CV : ', cross_val_score(KNeighborsClassifier(), X_iris, Y_iris, cv=5)) ## cv=5 uses StratifiedKFold by default for classifiers
print('Classification With StratifiedShuffleSplit Cross Validation : ', cross_val_score(KNeighborsClassifier(), X_iris, Y_iris, cv=StratifiedShuffleSplit(n_splits=5)))
print('Classification With Default CV : \n', cross_validate(KNeighborsClassifier(), X_iris, Y_iris, cv=5)) ## cv=5 uses StratifiedKFold by default for classifiers
print('\nClassification With StratifiedShuffleSplit Cross Validation : \n', cross_validate(KNeighborsClassifier(), X_iris, Y_iris, cv=StratifiedShuffleSplit(n_splits=5)))
We are trying to split the classification dataset with StratifiedShuffleSplit. Notice that we again print the class distribution in the train and test sets after each split. Here we can see that the class distribution is preserved in both the train and test sets.
shuffle_split = StratifiedShuffleSplit(n_splits=5)
masks = []
for i, (train_indexes, test_indexes) in enumerate(shuffle_split.split(X_iris, Y_iris)):
    print('Split[%d] Train Index Distribution by class : '%(i+1), np.bincount(Y_iris[train_indexes])/len(Y_iris))
    print('Split[%d] Test Index Distribution by class : '%(i+1), np.bincount(Y_iris[test_indexes])/len(Y_iris))
    mask = np.array([(False if j in train_indexes else True) for j in range(len(Y_iris))])
    masks.append(mask)
with plt.style.context(('seaborn', 'ggplot')):
    plt.matshow(masks, cmap=plt.cm.Blues)
    plt.yticks(range(5), range(1,6))
    plt.grid(None);
sklearn also provides the validation_curve method, which takes a single hyperparameter and a list of values for it, and returns the train and test scores across cross-validation folds for each value. It's generally used for plotting purposes.
from sklearn.model_selection import validation_curve
n_neighbors = [1, 3, 5, 10, 20, 50]
## We use the classifier here since iris is a classification dataset and we plot accuracy below.
train_scores, test_scores = validation_curve(KNeighborsClassifier(), X_iris, Y_iris, param_name="n_neighbors",
                                             param_range=n_neighbors, cv=StratifiedShuffleSplit(n_splits=5, random_state=123))
with plt.style.context(('seaborn', 'ggplot')):
    plt.plot(n_neighbors, train_scores.mean(axis=1), label="train accuracy")
    plt.plot(n_neighbors, test_scores.mean(axis=1), label="test accuracy")
    plt.ylabel('Accuracy')
    plt.xlabel('Number of neighbors')
    #plt.xlim([50, 0])
    plt.legend(loc="best");
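As a small follow-up sketch of ours (not in the original notebook), the same arrays can be used programmatically to pick the neighbor count with the best mean cross-validated test score:
best_k = n_neighbors[int(np.argmax(test_scores.mean(axis=1)))]  ## index of the highest mean test score
print('Best n_neighbors According to Validation Curve : ', best_k)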
All complex machine learning models have more than one hyperparameter. Most models ship with default values for these parameters, and if we fit the train data with the default model it may not fit the data well: it can overfit or underfit. We need to find a proper trade-off between overfitting and underfitting by doing a grid search over various values of the model's hyperparameters.
Grid search tries every combination of the values given for a list of hyperparameters, records the model's performance according to an evaluation metric, and keeps track of the best model and its hyperparameters. We can try all combinations ourselves by writing one loop inside another for each hyperparameter, as below.
X_train, X_test, Y_train, Y_test = train_test_split(X_boston, Y_boston,
                                                     train_size=0.80,
                                                     test_size=0.20,
                                                     random_state=12)
from sklearn.ensemble import RandomForestRegressor
best_score = 0.0
best_params = {'max_depth': None, 'max_features': 'auto','n_estimators': 10}
for max_depth in [None, 2, 3, 5]:
    for max_features in ['auto', 'sqrt', 'log2']:
        for n_estimators in [10, 100]:
            score = cross_val_score(RandomForestRegressor(n_estimators=n_estimators,
                                                          max_features=max_features,
                                                          max_depth=max_depth,
                                                          random_state=123),
                                    X_train,
                                    Y_train,
                                    cv=ShuffleSplit(n_splits=5, random_state=123),
                                    n_jobs=-1).mean()
            if score > best_score:
                best_score = score
                best_params['max_depth'], best_params['max_features'], best_params['n_estimators'] = max_depth, max_features, n_estimators
            print('max_depth : %s, max_features : %s, n_estimators : %s , Average R^2 Score : %.2f'%(str(max_depth), max_features, str(n_estimators), score))
print('\nBest Score : %.2f, Best Params : %s'%(best_score, str(best_params)))
rf_best = RandomForestRegressor(**best_params)
rf_best.fit(X_train, Y_train)
print("Test R^2 Score : ", rf_best.score(X_test, Y_test))
sklearn provides the GridSearchCV class, which takes a dictionary of hyperparameters and their candidate values, tries all combinations on the model, and keeps track of the results for each cross-validation fold.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
grid = GridSearchCV(RandomForestRegressor(random_state=123),
                    param_grid={'max_depth': [None, 2, 3, 5], 'max_features': ['auto', 'sqrt', 'log2'], 'n_estimators': [10, 100]},
                    cv=ShuffleSplit(n_splits=5, random_state=123),
                    verbose=50,
                    n_jobs=-1)
grid.fit(X_train, Y_train)
print('\nBest R^2 Score : %.2f'%grid.best_score_, ' Best Params : ', str(grid.best_params_))
The grid object also keeps track of every hyperparameter combination tried on every cross-validation split, along with the per-split scores and fit times, their means, and their standard deviations. It also ranks the parameter settings by performance, with the best setting ranked 1, the next one 2, and so on.
grid.cv_results_.keys()
pd.DataFrame(grid.cv_results_)[['param_max_depth', 'param_max_features', 'param_n_estimators','mean_test_score', 'rank_test_score']]
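As an optional convenience (our own addition, not part of the original code), the same dataframe can be sorted by rank so that the best parameter settings appear first:
results_df = pd.DataFrame(grid.cv_results_)
results_df.sort_values('rank_test_score')[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].head()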
The grid object also keeps the best model available through its best_estimator_ attribute so that it can be used for further predictions.
grid.best_estimator_
print('First Few preds : ', grid.predict(X_boston)[:5])
print('Actual Values : ', Y_boston[:5])
print("Test R^2 Score : ", grid.score(X_test, Y_test))
RandomizedSearchCV is another approach to hyperparameter tuning. Unlike GridSearchCV, which tries all possible parameter settings passed to it, RandomizedSearchCV tries only a specified number of parameter settings from the total parameter search space. It accepts a parameter named n_iter (an integer) that tells RandomizedSearchCV how many parameter settings to sample from all possible settings and try on the model. Below we explain its usage on the Boston housing dataset, using the train/test sets created earlier for the grid-search examples.
from sklearn.model_selection import RandomizedSearchCV
grid = RandomizedSearchCV(RandomForestRegressor(random_state=123), n_iter=5,
                          param_distributions={'max_depth': [None, 2, 3, 5], 'max_features': ['auto', 'sqrt', 'log2'], 'n_estimators': [10, 100]},
                          cv=ShuffleSplit(n_splits=5, random_state=123),
                          verbose=50,
                          n_jobs=-1)
grid.fit(X_train, Y_train)
print('\nBest R^2 Score : %.2f'%grid.best_score_, ' Best Params : ', str(grid.best_params_))
We can notice from the above output that even though the number of possible parameter settings is quite high (24 combinations here), only 5 different settings are tried. It shows a total of 25 fits because it performs cross-validation with 5 splits for each parameter setting.
Below we print the results for each parameter setting, converted to a pandas dataframe.
pd.DataFrame(grid.cv_results_)[['param_max_depth', 'param_max_features', 'param_n_estimators','mean_test_score', 'rank_test_score']]
grid.best_estimator_
print('First Few preds : ', grid.predict(X_boston)[:5])
print('Actual Values : ', Y_boston[:5])
print("Test R^2 Score : ", grid.score(X_test, Y_test))
This ends our small tutorial on cross-validation and hyperparameter tuning with grid search using scikit-learn. Please feel free to let us know your views in the comments section.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.