Machine Learning and Artificial Intelligence are among the most talked-about topics of the 21st century, and practitioners are trying many different kinds of ML models to solve their tasks.
Many ML models can automate tasks that would otherwise require manual work; one just needs enough data to train them. There are many Python libraries (scikit-learn, statsmodels, xgboost, catboost, lightgbm, etc.) providing implementations of well-known ML algorithms, and their easy-to-use APIs make training ML models straightforward.
But how does one check whether a trained model is meeting expectations?
Has the model generalized or not?
Is it really good enough at the task to automate things or derive insights from?
How does one evaluate the performance of ML models for a given task?
That's where ML metrics come in.
A Machine Learning metric (ML metric) is a measure of the performance of an ML model on a given task. It helps us decide whether a model is good enough or needs improvement. The task can be any ML task like classification, regression, clustering, etc.
An ML metric generally gives us a number that we can use to decide whether to keep the model, try another algorithm, or perform hyperparameters tuning.
For classification tasks, it can be 'accuracy', which tells us what fraction of labels were predicted correctly. For regression tasks, it can be mean absolute error, which tells us on average how far our predictions are from the actual values.
There can be more than one metric that lets us understand model performance from different angles, like accuracy, ROC AUC, and the confusion matrix for classification tasks.
These metrics help us understand whether our ML model has generalized or not, which leads to better decision-making.
The Python library scikit-learn (sklearn) is the first choice of many ML developers for trying out ML models, and it provides implementations of many ML metrics.
As a part of this tutorial, we have explained how to use the various ML metrics available from scikit-learn through its 'metrics' sub-module. The tutorial covers metrics available for classification, regression, and clustering tasks. We have also explained how to create custom metrics.
In scikit-learn, the default metric for classification is 'accuracy', which is the fraction of labels classified correctly, and the default for regression is 'r2', the coefficient of determination.
Below, we have listed the important sections of the tutorial to give an overview of the material covered.
Below, we have imported the necessary Python libraries for our tutorial and printed the versions used.
import sklearn
print("Scikit-Learn Version : ",sklearn.__version__)
import sys
print("Python Verion : ", sys.version)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
np.set_printoptions(precision=2)
%matplotlib inline
In this section, we'll explain various classification metrics available from scikit-learn. We'll train a simple classification model and then calculate various metrics to evaluate its performance.
If you want to learn about classification using scikit-learn, then we recommend that you go through the below link. It has detailed guidance on the topic.
Supervised Learning: Classification using Scikit-Learn
We have loaded the breast cancer dataset available from scikit-learn for this section. The dataset has various measurements of tumors as features, and the target variable is binary (malignant - 0, benign - 1). The dataset has 569 examples, 30 features, and 2 target classes.
After loading the dataset, we split it into a train set (80% of samples) and a test set (20% of samples).
We'll be using a simple LogisticRegression model for training purposes. We'll then proceed to introduce various classification metrics that evaluate model performance on test data from various angles.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import datasets
X,Y = datasets.load_breast_cancer(return_X_y=True)
print('Dataset Size : ',X.shape,Y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
train_size=0.80, test_size=0.20,
stratify=Y,
random_state=1)
print('Train/Test Size : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
log_reg = LogisticRegression(random_state=123)
log_reg.fit(X_train, Y_train)
Y_preds = log_reg.predict(X_test)
Accuracy is the number of correct predictions divided by the total number of samples. It tells us the percentage/fraction of examples that were predicted correctly by the model.
Scikit-learn has a function named 'accuracy_score()' that lets us calculate the accuracy of a model. We need to provide actual labels and predicted labels to the function, and it'll return an accuracy score.
Also, all classification models calculate accuracy by default when we call their score() method to evaluate model performance.
from sklearn.metrics import accuracy_score
print(Y_preds[:15])
print(Y_test[:15])
print('Test Accuracy : {:.3f}'.format(accuracy_score(Y_test, Y_preds)))
print('Test Accuracy : {:.3f}'.format(log_reg.score(X_test, Y_test))) ## Score method also evaluates accuracy for classification models.
print('Training Accuracy : {:.3f}'.format(log_reg.score(X_train, Y_train)))
For binary and multi-class classification problems, the confusion matrix is another metric that helps in identifying which classes are easy to predict and which are hard. It shows how many examples of each class are correctly classified and how many are confused with other classes.
The confusion matrix for binary classification problems has the below-mentioned structure, where TN = true negatives, FP = false positives, FN = false negatives, and TP = true positives.
[[TN, FP ]
 [FN, TP ]]
Scikit-learn provides a function named 'confusion_matrix()' through the 'metrics' sub-module to calculate the confusion matrix. We need to provide actual labels and predicted labels to it.
from sklearn.metrics import confusion_matrix
conf_mat = confusion_matrix(Y_test, Y_preds)
print(conf_mat)
Below, we are plotting the confusion matrix, as it helps in interpreting results quickly.
We have created the chart using the Python library scikit-plot, which provides visualizations for many different ML metrics. Please feel free to check the below link to learn about it. We'll use it again below for some other ML metrics.
import scikitplot as skplt
skplt.metrics.plot_confusion_matrix(Y_test, Y_preds,
normalize=False,
title="Confusion Matrix",
cmap="Purples",
);
Classification report metrics provide precision, recall, f1-score, and support for each class.
Below, we have included a visualization that gives an exact idea about precision and recall.
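For reference, for the positive class these metrics are defined as below (TP = true positives, FP = false positives, FN = false negatives).
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-Score = 2 * Precision * Recall / (Precision + Recall)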
Scikit-learn provides various functions to calculate precision, recall and f1-score metrics.
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score, precision_recall_fscore_support
print('Precision : %.3f'%precision_score(Y_test, Y_preds))
print('Recall : %.3f'%recall_score(Y_test, Y_preds))
print('F1-Score : %.3f'%f1_score(Y_test, Y_preds))
print('\nPrecision Recall F1-Score Support Per Class : \n',precision_recall_fscore_support(Y_test, Y_preds))
print('\nClassification Report : ')
print(classification_report(Y_test, Y_preds))
The ROC (Receiver Operating Characteristic) curve helps us better understand the performance of a model when handling an imbalanced dataset.
The ROC curve works with the output of the decision function or the predicted probabilities by setting different threshold values for classifying examples. It then calculates the false positive rate (FPR) and true positive rate (TPR) for each threshold value. We can then plot a line chart with TPR on the Y-axis and FPR on the X-axis. The area under that line is generally referred to as ROC AUC (ROC Area Under the Curve).
E.g., the output of a binary classification model is a probability in the range 0-1. We can try different threshold values (0.2, 0.35, 0.5, 0.65, 0.8, 0.95) to classify examples. For a selected threshold value, an example with probability less than the threshold is classified as the negative class and one with probability greater than the threshold is classified as the positive class. This way we get different positive and negative counts for each threshold.
In the case of SVC, for example, the default threshold on the output of the decision function is 0, whereas the ROC curve tries various threshold values like [2, 1, -1, -2], including negative values, since the decision function returns an unbounded score rather than a probability.
In the case of LogisticRegression, the output is a probability in [0,1], so the default threshold to differentiate positive/negative classes is 0.5, and the ROC curve again tries many different threshold values.
Note: It's restricted to binary classification tasks.
Scikit-learn provides various metrics to calculate ROC and ROC AUC metrics.
Below, we have explained how to calculate ROC & ROC AUC using sklearn. We can use either predict_proba() or decision_function() for calculation.
from sklearn.metrics import roc_curve, roc_auc_score
#fpr, tpr, thresholds = roc_curve(Y_test, log_reg.predict_proba(X_test)[:, 1])
fpr, tpr, thresholds = roc_curve(Y_test, log_reg.decision_function(X_test))
#auc = roc_auc_score(Y_test, log_reg.predict_proba(X_test)[:,1])
auc = roc_auc_score(Y_test, log_reg.decision_function(X_test))
acc = log_reg.score(X_test, Y_test)
print("False Positive Rates : {}".format(fpr))
print("True Positive Rates : {}".format(tpr))
print("Threshols : {}".format(thresholds))
print("Accuracy : {:.3f}".format(acc))
print("AUC : {:.3f}".format(auc))
Below, we have plotted ROC using matplotlib.
with plt.style.context(('ggplot','seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(fpr, tpr, c='blue')
    plt.plot(fpr, tpr, label="Accuracy:%.2f AUC:%.2f" % (acc, auc), linewidth=2, c='red')
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate (recall)")
    plt.title('ROC Curve')
    plt.legend(loc='best');
Below, we have plotted ROC using Python library scikit-plot.
skplt.metrics.plot_roc_curve(Y_test, log_reg.predict_proba(X_test),
title="ROC Curve", figsize=(12,6));
With a very low decision threshold, both the true positive rate and the false positive rate will be high (few false negatives but many false positives), while with a very high threshold, both rates will be low (few false positives but many false negatives).
So in general, the curve goes from the lower left to the upper right. A diagonal line reflects chance performance, and the goal is to be as close to the top left corner as possible. We want the ROC curve to cover almost 100% of the area for good performance; 50% area coverage corresponds to a chance model (random predictions).
The precision-recall curve helps a lot in the case of imbalanced datasets. It works much like the ROC curve but uses precision and recall values instead.
Plotting precision vs recall for different thresholds helps us evaluate the performance of the model better in the case of imbalanced classes. It does not take true negatives (the majority class) into consideration, while true positives represent the minority class, which has only a few occurrences.
Note: It's restricted to binary classification tasks.
Scikit-learn provides various metrics to calculate Precision-Recall Curve and Precision-Recall Curve AUC metrics.
Below, we have explained how to calculate precision-recall curve & precision-recall AUC using sklearn. We can use either predict_proba() or decision_function() for calculation.
from sklearn.metrics import precision_recall_curve, auc, average_precision_score
#precision, recall, thresholds = precision_recall_curve(Y_test, log_reg.predict_proba(X_test)[:,1])
precision, recall, thresholds = precision_recall_curve(Y_test, log_reg.decision_function(X_test))
acc = log_reg.score(X_test, Y_test)
p_auc = auc(recall, precision)
print("Precision : {}".format(precision))
print("Recall : {}".format(recall))
print("Threshols : {}".format(thresholds))
print("Accuracy : {:.3f}".format(acc))
print("AUC : {:.3f}".format(p_auc))
Below, we have plotted Precision Recall Curve using matplotlib.
with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(8,6))
    plt.scatter(recall, precision, c='blue')
    plt.plot(recall, precision, label="Accuracy:%.2f, AUC:%.2f" % (acc, p_auc), linewidth=2, c='red')
    plt.hlines(0.5, 0.0, 1.0, linestyle='dashed', colors=['orange'])
    plt.xlabel("Recall (Sensitivity)")
    plt.ylabel("Precision")
    plt.title('Precision Recall Curve')
    plt.legend(loc='best');
Below, we have plotted Precision Recall Curve using Python library scikit-plot.
skplt.metrics.plot_precision_recall_curve(Y_test, log_reg.predict_proba(X_test),
title="Precision Recall Curve", figsize=(12,6));
The precision-recall curve degrades sharply if our model is not performing well on an imbalanced dataset. ROC curves sometimes give optimistic results, hence it's better to consider precision-recall curves as well for imbalanced datasets. We also recommend looking at the classification report for such datasets.
Log loss refers to the negative log-likelihood of the true labels given the probabilities predicted by the classifier. It's a cost function that classifiers try to minimize while updating the weights of the model.
log_loss = -y * log(y_probs) - (1-y) * log(1-y_probs)
Scikit-learn provides a function named log_loss() through the 'metrics' sub-module for calculating log loss. We need to provide actual target labels and predicted probabilities to the function.
from sklearn.metrics import log_loss
print('Test Log Loss : %.3f'%log_loss(Y_test, log_reg.predict_proba(X_test)))
print('Train Log Loss : %.3f'%log_loss(Y_train, log_reg.predict_proba(X_train)))
Zero-one loss returns the number of misclassifications or the fraction of misclassifications. It accepts a normalize parameter: if set to True (the default) it returns the fraction of misclassifications, and if set to False it returns the number of misclassifications.
Scikit-learn provides a function named 'zero_one_loss()' through the 'metrics' sub-module. We need to provide actual and predicted target labels to calculate the zero-one classification loss.
from sklearn.metrics import zero_one_loss
print('Number of Misclassified Examples : ', zero_one_loss(Y_test, Y_preds, normalize=False))
print('Fraction of Misclassified Examples : ', zero_one_loss(Y_test, Y_preds))
Balanced accuracy returns the average recall over all classes of a classification problem. It's useful when dealing with imbalanced datasets.
It has a parameter adjusted which, when set to True, adjusts the result for chance so that a randomly performing model scores 0 and perfect performance scores 1.0.
We can calculate balanced accuracy using the 'balanced_accuracy_score()' function of the 'sklearn.metrics' module. We need to provide actual and predicted labels to the function.
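In formula form, balanced accuracy (with adjusted=False) is simply the average of the per-class recall values.
balanced_accuracy = sum(recall_per_class) / n_classes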
from sklearn.metrics import balanced_accuracy_score
print('Balanced Accuracy : ',balanced_accuracy_score(Y_test, Y_preds))
print('Balanced Accuracy Adjusted : ',balanced_accuracy_score(Y_test, Y_preds, adjusted=True))
Brier loss computes the mean squared difference between the actual class labels and the predicted probabilities. It should be as low as possible for good performance, and it's for binary classification problems only. By default it takes 1 as the positive class; if one needs to consider 0 as the positive class, one can use the pos_label parameter as below.
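In formula form, with y the actual label (0 or 1) and y_prob the predicted probability of the positive class:
brier_loss = mean((y_prob - y)^2)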
We can calculate Brier loss using 'brier_score_loss()' from scikit-learn. We need to provide actual target labels and predicted probabilities of the positive class to it.
from sklearn.metrics import brier_score_loss
print('Brier Loss : ',brier_score_loss(Y_test, log_reg.predict_proba(X_test)[:,1]))
print('Brier Loss (0 as Positive Class) : ', brier_score_loss(Y_test, log_reg.predict_proba(X_test)[:,1], pos_label=0))
The F-beta score is a weighted harmonic mean of precision and recall, where the beta parameter controls the weighting. If beta < 1, it gives more weight to precision, while beta > 1 gives more weight to recall. Its best value is 1.0 and its worst is 0.0.
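In terms of precision and recall, it is computed as below.
f_beta = (1 + beta^2) * Precision * Recall / (beta^2 * Precision + Recall)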
It has a parameter called average which is required for multiclass problems. It accepts the values [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted']. If None is specified, the score for each class is returned; otherwise, the average specified by the parameter is returned for a multi-class problem.
We can calculate F-beta score using fbeta_score() function of scikit-learn.
from sklearn.metrics import fbeta_score
print('Fbeta Favouring Precision : ', fbeta_score(Y_test, Y_preds, beta=0.5))
print('Fbeta Favouring Recall : ' ,fbeta_score(Y_test, Y_preds, beta=2.0))
Hamming loss returns the fraction of labels that are misclassified. We can calculate it using the hamming_loss() function of scikit-learn.
from sklearn.metrics import hamming_loss
print('Hamming Loss : ', hamming_loss(Y_test, Y_preds))
In this section, we'll introduce model evaluation metrics for regression tasks. We'll first train a simple regression model and then evaluate its performance by calculating various regression metrics.
If you want to learn how to handle regression tasks using scikit-learn, then please check the below link. It covers the topic in detail.
Supervised Learning: Regression using Scikit-Learn
We'll start by loading the Boston housing dataset available in scikit-learn.
We'll split the dataset into train/test sets with 80% of samples in the train set and 20% in the test set.
We'll now initialize a simple LinearRegression model and train it on the train dataset. We'll then check its performance by evaluating various regression metrics provided by scikit-learn.
from sklearn.linear_model import LinearRegression
boston = datasets.load_boston()
X, Y = boston.data, boston.target
print('Dataset Size : ', X.shape, Y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, random_state=123)
print('Train/Test Size : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)
Y_preds = lin_reg.predict(X_test)
The coefficient of determination (R2) is defined as below.
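R2 = 1 - (sum of squared residuals) / (total sum of squares)
   = 1 - sum((y_actual - y_predicted)^2) / sum((y_actual - mean(y_actual))^2)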
The R2 score generally takes values in the range 0-1, and values near 1 indicate a good model.
The best possible score is 1.0, and it can be negative if the model is performing badly. A model that always predicts the mean of the target values gets a score of 0.0.
The score() method of most regression estimators returns this metric. It is quite different from MSE (mean squared error), so the two should not be confused.
Scikit-learn provides a function named 'r2_score()' through the 'metrics' sub-module to calculate the R2 score.
from sklearn.metrics import r2_score
print(Y_preds[:10])
print(Y_test[:10])
print('Test R^2 : %.3f'%r2_score(Y_test, Y_preds))
print('Test R^2 : %.3f'%lin_reg.score(X_test, Y_test))
print('Training R^2 : %.3f'%lin_reg.score(X_train, Y_train))
Below, we are doing a grid search over various values of the max_depth parameter of GradientBoostingRegressor, using r2 as the evaluation metric whose value will be optimized.
If you do not have a background in grid search and want to learn about it, then we would recommend that you check the below link in your free time. It'll help you with the concept.
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(GradientBoostingRegressor(),
param_grid = {'max_depth': [None, 3, 4, 6]},
scoring="r2", cv=5)
grid.fit(X_train, Y_train) ## fit on the train split only so the test scores below stay unbiased
print('Best Parameters : ',grid.best_params_)
print('Best Score : ',grid.best_score_)
print('Test R^2 : %.3f'%r2_score(Y_test, grid.best_estimator_.predict(X_test)))
print('Test R^2 : %.3f'%grid.best_estimator_.score(X_test, Y_test))
print('Training R^2 : %.3f'%grid.best_estimator_.score(X_train, Y_train))
Y_preds = grid.best_estimator_.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
Mean absolute error is the sum of absolute differences between actual and predicted target values divided by the number of samples.
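In formula form:
mae = sum(|y_actual - y_predicted|) / n_samples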
Scikit-learn provides a function named 'mean_absolute_error()' through the 'metrics' sub-module to calculate mean absolute error.
from sklearn.metrics import mean_absolute_error
print('Test MAE : %.3f'%mean_absolute_error(Y_test, Y_preds))
print('Train MAE : %.3f'%mean_absolute_error(Y_train, lin_reg.predict(X_train)))
Below, we are doing a grid search over various values of the max_depth parameter of GradientBoostingRegressor, using neg_mean_absolute_error as the evaluation metric whose value will be optimized.
grid = GridSearchCV(GradientBoostingRegressor(),
param_grid = {'max_depth': [None, 3, 4, 6]},
scoring="r2", cv=5)
grid.fit(X_train, Y_train)
print('Best Parameters : ',grid.best_params_)
print('Test MAE : %.3f'%mean_absolute_error(Y_test, grid.best_estimator_.predict(X_test)))
print('Train MAE : %.3f'%mean_absolute_error(Y_train, grid.best_estimator_.predict(X_train)))
Y_preds = grid.best_estimator_.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
The mean squared error loss is the sum of squared differences between actual and predicted values divided by the number of samples.
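In formula form:
mse = sum((y_actual - y_predicted)^2) / n_samples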
Scikit-learn provides a function named 'mean_squared_error()' through the 'metrics' sub-module to calculate mean squared error.
from sklearn.metrics import mean_squared_error, mean_squared_log_error
print('Test MSE : %.3f'%mean_squared_error(Y_test, Y_preds))
print('Train MSE : %.3f'%mean_squared_error(Y_train, lin_reg.predict(X_train)))
Below, we are doing a grid search over various values of the max_depth parameter of GradientBoostingRegressor, using neg_mean_squared_error as the evaluation metric whose value will be optimized.
grid = GridSearchCV(GradientBoostingRegressor(),
param_grid = {'max_depth': [None, 3, 4, 6]},
scoring="r2", cv=5)
grid.fit(X_train, Y_train)
print('Best Parameters : ',grid.best_params_)
print('Test MSE : %.3f'%mean_squared_error(Y_test, grid.best_estimator_.predict(X_test)))
print('Train MSE : %.3f'%mean_squared_error(Y_train, grid.best_estimator_.predict(X_train)))
Y_preds = grid.best_estimator_.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
Mean squared log error is the average of squared differences between the log of (1 + actual value) and the log of (1 + predicted value). It cannot be used when targets or predictions contain negative values.
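In formula form:
msle = sum((log(1 + y_actual) - log(1 + y_predicted))^2) / n_samples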
Scikit-learn provides a function named 'mean_squared_log_error()' through the 'metrics' sub-module to calculate mean squared log error.
from sklearn.metrics import mean_squared_log_error
print("Mean Squared Log Error : {:.3f}".format(mean_squared_log_error(Y_test, Y_preds)))
Median absolute error is the median of the absolute differences between actual and predicted values. Scikit-learn provides a function named 'median_absolute_error()' through the 'metrics' sub-module to calculate it.
from sklearn.metrics import median_absolute_error
print('Median Absolute Error : {}'.format(median_absolute_error(Y_test, Y_preds)))
print('Median Absolute Error : {}'.format(np.median(np.abs(Y_test - Y_preds))))
The explained variance score measures the proportion of the variance in the target that is explained by the model. The best value is 1.0, and lower values indicate a worse model.
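In formula form:
explained_variance = 1 - Var(y_actual - y_predicted) / Var(y_actual)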
Scikit-learn provides a function named 'explained_variance_score()' through the 'metrics' sub-module to calculate the explained variance score.
from sklearn.metrics import explained_variance_score
print('Explained Variance Score : {:.3f}'.format(explained_variance_score(Y_test, Y_preds)))
Max error returns the maximum absolute difference between the actual and predicted values over all samples.
Scikit-learn provides a function named 'max_error()' through the 'metrics' sub-module to calculate this maximum residual error.
from sklearn.metrics import max_error
print('Maximum Residual Error : {:.3f}'.format(max_error(Y_test, Y_preds)))
print('Maximum Residual Error : {:.3f}'.format(max_error([1,2,3,4], [1,2,3.5,7]))) ## here 4th sample has highest difference
We'll now introduce evaluation metrics for unsupervised learning - clustering tasks. We'll train a simple ML model to solve a clustering task and then evaluate its performance by calculating various metrics.
If you want to learn about clustering, then we would recommend that you go through the below link, as it covers the topic in detail.
Unsupervised Learning: Clustering using Scikit-Learn
from sklearn.cluster import KMeans, MeanShift
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, adjusted_rand_score, confusion_matrix
iris = load_iris()
X, Y = iris.data, iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, stratify=Y, random_state=12)
print('Train/Test Sizes : ', X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)
kmeans = KMeans(n_clusters=3)
kmeans.fit(X_train) ## KMeans is unsupervised, so target labels are not needed for fitting
Clustering algorithms return a cluster label for each sample, but those labels might not follow the same numbering as the original class labels. It can happen that a class labeled 1 in the original dataset gets assigned a different cluster number by the algorithm.
We'll use the IRIS dataset and KMeans for explanation purposes and plot the results to show the difference. We'll show that the score improves once we use 'adjusted_rand_score()' as the evaluation function, since it is invariant to such label permutations.
Y_preds = kmeans.predict(X_test)
print('Confusion Matrix : ')
print(confusion_matrix(Y_test, Y_preds))
print('Accuracy of Model : %.3f'%accuracy_score(Y_test, Y_preds))
print('Adjusted Accuracy : %.3f'%adjusted_rand_score(Y_test, Y_preds))
with plt.style.context(('ggplot', 'seaborn')):
    plt.figure(figsize=(10,4))
    plt.subplot(121)
    plt.scatter(X_test[:, 1], X_test[:, 2], c=Y_test, cmap=plt.cm.viridis)
    plt.xlabel(iris.feature_names[1])
    plt.ylabel(iris.feature_names[2])
    plt.title('Y Original')
    plt.subplot(122)
    plt.scatter(X_test[:, 1], X_test[:, 2], c=Y_preds, cmap=plt.cm.viridis)
    plt.xlabel(iris.feature_names[1])
    plt.ylabel(iris.feature_names[2])
    plt.title('Y Predicted');
Users can also define their own scoring function if the metric they need is not available among sklearn's built-in scorers. In GridSearchCV and cross_val_score, one can pass an object with a __call__ method, or a plain function, to the scoring parameter. The object or function needs to accept an estimator object, test features (X), and target (Y) as input and return a float.
Below, we define RMSE (Root Mean Squared Error) both as a class and as a function. We'll then use it in cross_val_score() to check performance and compare its value with the square root of the negated 'neg_mean_squared_error' score.
boston = datasets.load_boston()
X, Y = boston.data, boston.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80, test_size=0.20, random_state=1, )
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
class RootMeanSquareError(object):
    def __call__(self, model, X, Y):
        Y_preds = model.predict(X)
        return np.sqrt(((Y - Y_preds)**2).mean())

def rootMeanSquareError(model, X, Y):
    Y_preds = model.predict(X)
    return np.sqrt(((Y - Y_preds)**2).mean())
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)
rmse_obj = RootMeanSquareError()
print("Train RMSE : {}".format(rmse_obj(lin_reg, X_train, Y_train)))
print("Train RMSE : {}".format(rootMeanSquareError(lin_reg, X_train, Y_train)))
print("\nTest RMSE : {}".format(rmse_obj(lin_reg, X_test, Y_test)))
print("Test RMSE : {}".format(rootMeanSquareError(lin_reg, X_test, Y_test)))
Below, we have explained how to use custom metrics with the scikit-learn function cross_val_score().
If you are someone who does not have a background in cross validation, then we would recommend that you check the below link.
rmse1 = cross_val_score(lin_reg, X, Y, scoring=RootMeanSquareError())
rmse2 = cross_val_score(lin_reg, X, Y, scoring=rootMeanSquareError)
rmse3 = np.sqrt(-1*cross_val_score(lin_reg, X, Y, scoring='neg_mean_squared_error'))
print('Cross Val Score Using Object : {}'.format(rmse1))
print('Cross Val Score Using Function : {}'.format(rmse2))
print('Cross Val Score Using Square Root of Neg Mean Squared Error : {}'.format(rmse3))
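Scikit-learn also offers a more standard route for custom metrics via make_scorer(). Below is a minimal sketch (the rmse helper function is our own, defined only for illustration): make_scorer() wraps a metric that takes (y_true, y_pred), and greater_is_better=False negates the result so that utilities like cross_val_score and GridSearchCV, which maximize the score, effectively minimize RMSE.
from sklearn.metrics import make_scorer, mean_squared_error

def rmse(y_true, y_pred):
    ## Root mean squared error computed from actual and predicted targets
    return np.sqrt(mean_squared_error(y_true, y_pred))

## greater_is_better=False makes the scorer report negative RMSE internally,
## so score-maximizing utilities end up minimizing RMSE.
rmse_scorer = make_scorer(rmse, greater_is_better=False)

neg_rmse = cross_val_score(lin_reg, X, Y, scoring=rmse_scorer)
print('Cross Val Score Using make_scorer : {}'.format(-neg_rmse))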
Below, we have explained how to use custom metrics with grid search.
If you are someone who does not have a background in grid search, then we would recommend that you check the below link.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
grid = GridSearchCV(GradientBoostingRegressor(),
param_grid = {'max_depth': [3, 4, 6]},
scoring=RootMeanSquareError(), cv=5) ## Caution: GridSearchCV maximizes the score, so a scorer returning raw RMSE will favor larger error; return negative RMSE (or use make_scorer with greater_is_better=False) for a real search.
grid.fit(X_train, Y_train)
print('Best Parameters : ',grid.best_params_)
print("Best MSE : {:5f}".format(grid.best_score_))
print('Test MSE : {:.5f}'.format(mean_squared_error(Y_test, grid.best_estimator_.predict(X_test))))
print('Train MSE : {:.5f}'.format(mean_squared_error(Y_train, grid.best_estimator_.predict(X_train))))
Y_preds = grid.best_estimator_.predict(X_test)
print(Y_preds[:10])
print(Y_test[:10])
Below is a list of scikit-learn's built-in scorers.
print('List of Inbuilt Scorers : \n')
for metric in sklearn.metrics.SCORERS:
    print(metric)
This ends our small tutorial explaining how to use the various ML metrics available from the 'metrics' sub-module of 'sklearn' to evaluate the performance of ML models trained on classification, regression, and clustering tasks.
Though ML metrics are a good starting point for evaluating the performance of ML models, sometimes they are not enough. There can be situations where ML metrics give good numbers, indicating a good model, but in reality our model has not generalized.
That's where algorithms for interpreting the predictions of ML models come in handy. They let us see which features are contributing to predictions and how strongly each feature influences them, which can help us make even more informed decisions.
We would recommend that you read our tutorials on interpreting the predictions of ML models, listed below in the references section, as they'll help you evaluate model performance even better.
If you are more comfortable learning through video tutorials then we would recommend that you subscribe to our YouTube channel.
When going through coding examples, it's quite common to have doubts and errors.
If you have doubts about some code examples or are stuck somewhere when trying our code, send us an email at coderzcolumn07@gmail.com. We'll help you or point you in the direction where you can find a solution to your problem.
You can even send us a mail if you are trying something new and need guidance regarding coding. We'll try to respond as soon as possible.
If you want to