
Key Metrics In Model Monitoring And How To Measure Them

In this blog, you will get familiar with the key monitoring metrics for machine learning models.

Neetika Khandelwal

Every machine learning task comes with its own monitoring and evaluation measures. Different metrics exist for classification, regression, statistical, and ranking tasks. Using several metrics, rather than a single one, helps you improve your model's overall predictive capacity both before and after deploying it on unseen data. So, let's first understand what model metrics are.

Model metrics are used to track and measure a model's performance (during training and testing). They aren't required to be differentiable. Evaluation metrics and monitoring metrics are the two most common kinds.

But how do these two metrics differ?

If a machine learning model is deployed on unseen data after being judged on accuracy alone, without a proper evaluation across several metrics, it can cause problems and produce wrong predictions. Evaluation metrics are therefore used to measure the model's quality before deployment. A variety of evaluation measures can be used to test a model, including classification accuracy, logarithmic loss, and the confusion matrix.

Once the model is in use, you must immediately think about keeping it running correctly because it is now providing business value. Any degradation in model performance results in a direct business loss. The real challenge for any machine learning model is maintainability, i.e., it should behave the same on live data as it did on training data. This is where model monitoring metrics kick in. They help you avoid data skew and stale models. Typical monitoring measures include checks for data quality, data drift, target drift, segment performance, bias, and outliers.

It's also worth noting that model metrics aren't the same as loss functions. Loss functions measure model performance and are used to train a machine learning model, so they are usually differentiable with respect to the model parameters. Metrics, on the other hand, are used to track and measure a model's performance (during training and testing) and do not need to be differentiable.

It's critical to choose the right metrics for your machine learning model, because the chosen metric determines how the performance of machine learning algorithms is measured and compared.

Key Metrics for Model Monitoring

Model performance is a central concern for machine learning practitioners, who concentrate on constructing a model that generalizes. No machine learning model is 100% accurate, but its predictions on unseen data must be close enough to the truth to be useful; otherwise, the model is useless. To construct and deploy a generalized model, you must evaluate it using a relevant metric or a custom metric created from a combination of different metrics. This will allow you to improve the model's performance, fine-tune it, and achieve a better outcome.

Furthermore, because different evaluation measures suit different kinds of datasets, it is critical to grasp the advantages and disadvantages of each metric.

With the significance of evaluation metrics in mind, let's begin by looking at the metrics commonly employed in machine learning projects.

Regression model metrics

Regression is a machine learning technique for discovering relationships between independent and dependent variables, i.e., a machine learning problem in which you predict continuous quantities such as price, rating, fees, etc. The following are the metrics for regression projects:

Mean Absolute Error (MAE)

It's the average of the absolute differences between the actual values and the values predicted by the model. A low MAE indicates that the model is good at prediction, while a large MAE indicates that your model may struggle in some areas. An MAE of 0 means that your model predicts the outputs perfectly. MAE is also known as the L1 loss function. For $n$ samples with actual values $y_i$ and predicted values $\hat{y}_i$, the MAE is calculated as

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
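The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation; the function name `mean_absolute_error` and the sample values are assumptions made for the example.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only (assumed for this sketch).
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
print(mean_absolute_error(actual, predicted))  # 0.5
```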


Mean Squared Error (MSE)

The mean squared error is the average of the squared differences between the actual and predicted values. It is a widely used and straightforward statistic that is a slight variation on the mean absolute error. Because the errors are squared, large errors are penalized the most, so the MSE grows quickly when the dataset contains outliers. The MSE is calculated as

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
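As a rough sketch of how squaring changes the picture, the snippet below computes the MSE in plain Python and shows how a single large error dominates the result. The function name `mean_squared_error` and the data are assumptions for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only (assumed for this sketch).
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
print(mean_squared_error(actual, predicted))  # 0.375

# One badly missed point (an error of 10) raises the MSE sharply.
with_outlier = [2.5, 0.0, 2.0, 17.0]
print(mean_squared_error(actual, with_outlier))  # 25.125
```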

Root Mean Squared Error (RMSE)

It's the square root of the average squared difference between the actual and predicted values, i.e., the square root of the MSE. The lower the RMSE, the more accurate the model's predictions are, so you want it to be as low as feasible. A higher RMSE indicates that the predicted and actual values differ significantly.

Since RMSE is not scale-invariant, the scale of the data influences comparisons of models using this metric. As a result, RMSE is frequently applied to standardized data. Compared to MAE, it is less robust to outliers. RMSE is calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
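A minimal sketch of RMSE in plain Python follows. The function name `root_mean_squared_error` and the sample data are assumptions; the point is only that RMSE is the square root of the mean squared error and is expressed in the same units as the target.

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean of squared differences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Illustrative values only (assumed for this sketch).
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
print(round(root_mean_squared_error(actual, predicted), 3))  # 0.612
```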

Root Mean Squared Log Error (RMSLE)

It's the RMSE of the log-transformed predicted and target values, i.e., the predicted and actual values of the dependent variable in RMSE are replaced by their logarithms. It's helpful when you care about relative (percentage) errors rather than absolute ones.

If you understand the above three evaluation metrics, you won't have any trouble comprehending RMSLE or most other evaluation metrics used in regression-based machine learning models. The RMSLE metric comes in handy when you build a model without scaling the data, since in that case the outputs can vary over a wide range.

Because actual and predicted values can be 0, and the log of 0 is undefined, 1 is added as a constant before taking the logarithm. RMSLE penalizes underestimating the actual value more heavily than overestimating it.

$$\mathrm{RMSLE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(\hat{y}_i + 1) - \log(y_i + 1) \right)^2}$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
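The asymmetry described above can be checked with a small sketch. It assumes non-negative values and uses `math.log1p`, which computes log(1 + x); the function name `rmsle` and the numbers are illustrative assumptions.

```python
import math


def rmsle(y_true, y_pred):
    """RMSE computed on log(1 + value); assumes all values are >= 0."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Same absolute error of 50, once below and once above the actual value of 100.
under = rmsle([100.0], [50.0])   # about 0.68
over = rmsle([100.0], [150.0])   # about 0.40
print(under > over)  # True: underestimation is penalized more
```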


R² (R-Squared)

It is the proportion of the variance in the dependent variable that is explained by the independent variables. An R² score is at most 1, and for a reasonable model it lies between 0 and 1. The closer R² is to 1, the better the regression model. When R² is equal to 0, the model does no better than a baseline that always predicts the mean of the actual values. If R² is negative, the model is worse than that baseline. So R-squared lets us compare a model to a baseline model, which none of the other metrics above can do. This metric is calculated using this formula:

$$R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}$$

SSE = the sum of the squared differences between the actual values and the predicted values, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.

SST = the total sum of the squared differences between the actual values and the mean of the actual values, $\sum_{i=1}^{n} (y_i - \bar{y})^2$.
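The SSE and SST definitions translate directly into code. The sketch below is plain Python with an assumed function name `r_squared` and made-up data; it also shows the mean-predicting baseline scoring exactly 0.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SSE / SST, using the mean of the actual values as the baseline."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    sst = sum((t - mean_true) ** 2 for t in y_true)
    if sst == 0:
        raise ValueError("R^2 is undefined when all actual values are identical")
    return 1 - sse / sst


# Illustrative values only (assumed for this sketch).
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
print(round(r_squared(actual, predicted), 3))  # 0.949

# A baseline that always predicts the mean of the actual values scores 0.
baseline = [sum(actual) / len(actual)] * len(actual)
print(r_squared(actual, baseline))  # 0.0
```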


Adjusted R Squared

It is a variant of R-squared that accounts for the number of predictors in the model. The adjusted R-squared rises when a new term improves the model more than would be expected by chance, and it declines when a predictor improves the model by less than expected. The adjusted R² indicates the percentage of variance in the dependent variable that the independent variables can explain, penalized for the number of predictors.

$$R^2_{\mathrm{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

where $n$ is the number of samples and $p$ is the number of predictors.
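To make the effect of the predictor count concrete, here is a small sketch using the standard adjusted R² formula, 1 - (1 - R²)(n - 1)/(n - p - 1). The function name `adjusted_r_squared` and the numbers are assumptions chosen for illustration.

```python
def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjust R^2 for the number of predictors in the model."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need more samples than predictors + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Same raw R^2 of 0.90 on 50 samples, with 5 versus 20 predictors.
few = adjusted_r_squared(0.90, n_samples=50, n_predictors=5)
many = adjusted_r_squared(0.90, n_samples=50, n_predictors=20)
print(round(few, 4))   # 0.8886
print(round(many, 4))  # 0.831
print(many < few)      # True: extra predictors are penalized
```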


Max Error

The maximum error (ME) is the absolute value of the largest difference between a predicted value and its actual value. While the most popular measure is the root mean squared error (RMSE), it can be hard to interpret on its own. The max error metric captures the worst-case error between the predicted and true values. If the maximum error is substantially larger than the RMSE, the model likely failed to predict outliers correctly.

$$\mathrm{ME} = \max_{1 \le i \le n} \left| y_i - \hat{y}_i \right|$$
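A minimal sketch of the worst-case error in plain Python; the function name `max_error` and the data are assumptions for illustration.

```python
def max_error(y_true, y_pred):
    """Largest absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


# Illustrative values only (assumed for this sketch).
actual = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]
print(max_error(actual, predicted))  # 1.0
```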


Classification Model Metrics

Classification metrics assess a model's performance and tell you whether the classification is good or bad, but each does so in its own way. Because classification models produce discrete output, we need metrics that compare discrete classes. Listed below are some of the popular metrics used in classification models.

Classification Accuracy

The most basic metric is classification accuracy. The accuracy of a classifier is the percentage of times it predicts correctly, i.e., the number of correct predictions divided by the total number of predictions.

Note: Accuracy is informative when the target classes are well balanced, but it is misleading for imbalanced classes.
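The note about imbalanced classes is easy to demonstrate. In the sketch below (assumed function name `accuracy`, made-up labels), a model that always predicts the majority class still reaches 95% accuracy while finding none of the positive cases.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Imbalanced toy data: 95 negatives (0) and 5 positives (1).
actual = [0] * 95 + [1] * 5
always_negative = [0] * 100  # a "model" that never predicts the positive class

print(accuracy(actual, always_negative))  # 0.95
```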

Confusion Matrix

The confusion matrix (also known as the error matrix) is a tabular depiction of model predictions versus ground-truth labels. It is one of the essential ideas in classification performance. Each row of the confusion matrix represents the examples in a predicted class, whereas each column represents the examples in an actual class. It is one of the most intuitive and straightforward tools for judging the model's correctness and accuracy. The matrix looks like this:


Figure: The confusion matrix | Source: Towards Data Science


  1. True positives: It occurs when the data point's actual class was True, and the predicted class was also True.
  2. True negatives: It occurs when the data point's actual class was False, and the predicted class was also False.
  3. False positives: It occurs when the data point's actual class was False, but the predicted class was True. It is "false" because the model predicted the wrong thing, and "positive" because the class it predicted was the positive one.
  4. False negatives: It occurs when the data point's actual class was True, but the predicted class was False. It is "false" because the model predicted the wrong thing, and "negative" because the class it predicted was the negative one.

Note: Depending on the use case, aim to minimize whichever of false positives or false negatives is more costly for your model.
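The four outcomes listed above can be counted with a short loop. This is a sketch for binary labels only; the function name `confusion_counts`, the `positive` parameter, and the sample labels are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


# Illustrative labels only (assumed for this sketch).
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(actual, predicted))
# {'tp': 3, 'tn': 3, 'fp': 1, 'fn': 1}
```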



Precision

Precision tells you how many of the cases predicted as positive actually turned out to be positive. It comes in handy when false positives are more of a problem than false negatives. When the positive class is the minority class, precision reflects how accurate the model is on that class. It is computed as the number of correctly predicted positive instances (true positives, TP) divided by the total number of predicted positive instances (true positives plus false positives, FP). Here is the formula:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
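As a minimal sketch, precision can be computed from the true positive and false positive counts of a confusion matrix. The function name `precision` and the counts are assumptions; returning 0.0 when nothing was predicted positive is a convention chosen here, not something the article specifies.

```python
def precision(tp, fp):
    """True positives divided by all predicted positives."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return 0.0  # assumed convention when no positive predictions were made
    return tp / predicted_positive


# Illustrative counts only: 3 correct alarms and 1 false alarm.
print(precision(tp=3, fp=1))  # 0.75
```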




Recall

Another significant metric is recall, defined as the percentage of the actual samples of a class that the model correctly predicts. So, if we want to focus on limiting false negatives (FN), we'd like our recall to be as close to 100 percent as possible without sacrificing too much precision. If we focus on minimizing false positives, we'd want precision to be as close to 100 percent as possible. This is how recall is calculated:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
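Recall follows the same pattern, using false negatives in the denominator instead of false positives. The function name `recall`, the counts, and the 0.0 fallback for a class with no actual positives are assumptions made for this sketch.

```python
def recall(tp, fn):
    """True positives divided by all actual positives."""
    actual_positive = tp + fn
    if actual_positive == 0:
        return 0.0  # assumed convention when there are no actual positives
    return tp / actual_positive


# Illustrative counts only: 8 positives found, 2 positives missed.
print(recall(tp=8, fn=2))  # 0.8
```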




F1 Score

The F1 score combines precision and recall into a single metric: it is the harmonic mean of the two. For a given average of precision and recall, it peaks when the two are equal. A high F1 score indicates a low number of both false positives and false negatives, meaning that you correctly identify real threats and are not bothered by false alarms. When the F1 score is 1, the model is deemed perfect, but when it is 0, the model is considered a complete failure. This is how the F1 score is calculated:

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
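The harmonic mean punishes an imbalance between precision and recall, which a short sketch makes visible. The function name `f1_score` and the input values are illustrative assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # both are zero, so the harmonic mean is taken as 0
    return 2 * precision * recall / (precision + recall)


# Both pairs have the same arithmetic mean (0.75), but different F1 scores.
balanced = f1_score(0.75, 0.75)
unbalanced = f1_score(0.5, 1.0)
print(balanced)              # 0.75
print(round(unbalanced, 4))  # 0.6667
```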




Specificity

Specificity (or the true negative rate) is the fraction of actual negatives that are predicted as negatives. The remaining actual negatives are predicted as positives and are referred to as false positives.

For example, specificity is the proportion of healthy people who are correctly predicted as healthy.

$$\mathrm{Specificity} = \frac{TN}{TN + FP}$$
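Using the healthy-patient example, specificity can be sketched from the true negative and false positive counts. The function name `specificity` and the counts are assumptions for illustration.

```python
def specificity(tn, fp):
    """True negatives divided by all actual negatives."""
    actual_negative = tn + fp
    if actual_negative == 0:
        return 0.0  # assumed convention when there are no actual negatives
    return tn / actual_negative


# Illustrative counts only: 90 healthy people predicted healthy,
# 10 healthy people wrongly flagged as sick.
print(specificity(tn=90, fp=10))  # 0.9
```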



Statistical metrics

Although machine learning can be thought of as applied statistics, and thus all ML metrics can be considered statistical metrics, statisticians often employ a few measures to evaluate the performance of statistical models. The following is one of the most commonly used measures in this area:



Correlation

Correlation gives us a sense of the degree to which two variables are related. It's a bivariate analysis measure for describing the relationship between two variables. In most businesses, expressing one quantity in terms of its relationship with others is helpful. It serves as the basis for a variety of modeling methodologies. Correlation analysis that is done correctly leads to a better understanding of the data.

There are two sorts of correlation: positive and negative.

A positive correlation means that as the value of one variable rises, the value of the other tends to rise with it. A negative correlation, on the other hand, means that as the value of one variable rises, the value of the other tends to fall.

Additionally, no correlation means that when the value of one variable rises or falls, the value of the other shows no consistent rise or fall.
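The article does not name a specific correlation coefficient, so the sketch below assumes the Pearson coefficient, a common choice for linear relationships. The function name `pearson_correlation` and the data are illustrative; values near +1, -1, and 0 correspond to positive, negative, and no linear correlation.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Illustrative values only (assumed for this sketch).
x = [1, 2, 3, 4, 5]
print(pearson_correlation(x, [2, 4, 6, 8, 10]))  # 1.0  (positive)
print(pearson_correlation(x, [10, 8, 6, 4, 2]))  # -1.0 (negative)
```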


Ranking metrics

Regardless of the modeling choices, recommender systems ultimately produce a ranked list of items. As a result, instead of relying on proxy measurements, it's critical to evaluate ranking quality directly. The effectiveness of a recommender is assessed using a variety of evaluation metrics.


Hit Ratio (HR)

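The hit ratio is simply the percentage of users for whom the correct answer appears in the recommendation list of length L. The greater the L, the higher the hit ratio, because the correct answer is more likely to be included in the recommended list. As a result, selecting an appropriate value for L is critical.

$$\mathrm{HR}@L = \frac{\left| U_{\mathrm{hit}}^{L} \right|}{\left| U_{\mathrm{all}} \right|}$$

where $U_{\mathrm{hit}}^{L}$ is the set of users whose correct answer appears in their top-L list and $U_{\mathrm{all}}$ is the set of all users.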
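A minimal sketch of the hit ratio, assuming each user has exactly one held-out "correct" item. The function name `hit_ratio`, the dictionary layout, and the user and item names are all hypothetical; the example also shows the ratio growing with L.

```python
def hit_ratio(recommendations, held_out, list_length):
    """Share of users whose held-out item appears in their top-L recommendations."""
    if not held_out:
        raise ValueError("held_out must contain at least one user")
    hits = sum(
        1
        for user, item in held_out.items()
        if item in recommendations.get(user, [])[:list_length]
    )
    return hits / len(held_out)


# Hypothetical ranked recommendations and held-out items per user.
recommendations = {
    "u1": ["a", "b", "c"],
    "u2": ["d", "e", "f"],
    "u3": ["g", "h", "i"],
}
held_out = {"u1": "b", "u2": "f", "u3": "z"}

print(round(hit_ratio(recommendations, held_out, list_length=2), 3))  # 0.333
print(round(hit_ratio(recommendations, held_out, list_length=3), 3))  # 0.667
```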

Mean Reciprocal Rank (MRR)

It is also referred to as the average reciprocal hit ratio (ARHR). This is the most straightforward metric. It attempts to answer the question, "Where is the first relevant item?" One could argue that hit ratio is a binary variant of MRR because it scores 1 if there is a relevant item in the list and 0 otherwise, whereas MRR also accounts for where that item appears. For a set of queries Q, MRR is the average of the reciprocal ranks of the first relevant item.

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

where $\mathrm{rank}_i$ is the position of the first relevant item for the $i$-th query.
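The averaging of reciprocal ranks can be sketched as follows. The function name `mean_reciprocal_rank` and the data are hypothetical, and scoring a query with no relevant item as 0 is a common convention assumed here rather than something the article states.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1 / rank of the first relevant item for each query."""
    if len(ranked_lists) != len(relevant_sets) or not ranked_lists:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranking, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first relevant item counts
        # a query with no relevant item contributes 0 (assumed convention)
    return total / len(ranked_lists)


# Hypothetical rankings: first relevant item at rank 1, at rank 3, and missing.
rankings = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"a"}, {"f"}, {"z"}]
print(round(mean_reciprocal_rank(rankings, relevant), 4))  # 0.4444
```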


In this blog, you learned about some of the popular metrics that play a crucial role in the monitoring stage of the machine learning model lifecycle. To ensure the consistent performance of your model, you need to track some of the metrics mentioned above, selected according to your model and use case. This tracking is crucial: because your models contribute significantly to your business, they require rigorous performance assessment. These metrics help you ensure that your model is in good condition and working as expected.

Censius is a platform that allows you to monitor and identify issues automatically, as well as track the health of all your models in one place. With Censius, teams can design higher performing and responsible models by explaining model decisions.
