
How To Monitor ML Models In Production

In this post, we'll look at how to keep track of machine learning models once they've been deployed and are running in production.

Neetika Khandelwal

You have just deployed your model into production, and it's ready to handle the real-world data inputs to give accurate outputs. Congratulations!

But wait, will the model's results still be accurate after it has spent some time in production?

As you know, data doesn't stay the same. New technologies and changing conditions keep producing new kinds of data, so live inputs gradually move away from the data the model was trained on.

Model Monitoring and its Importance

Figure: Basic lifecycle of a machine learning model

Model monitoring is an essential stage in every machine learning process, and it becomes critical once your model is in production. It lets you identify and fix problems such as weak predictive power, changing parameters, and poor generalization, so that the solution keeps performing well.

Your model's performance will be affected by the transition from a lab to the real world. In production, models are more dynamic than in development. As a data scientist or machine learning engineer, you must be aware of this difficulty.

Monitoring should be designed to provide early alerts for the many things that might go wrong with a production machine learning model. One such challenge is data skew, which occurs when your training data is not representative of the live data, i.e., the data we used to train the model in the research environment does not correspond to the data we receive in our live system.

When models are automatically trained on data collected in production, a more complicated situation occurs. If the data is skewed or contaminated in any manner, the trained models will perform poorly.
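One common way to put a number on the skew described above is the Population Stability Index (PSI), which compares how a feature is distributed across bins in the training data and in the live data. The sketch below is only an illustration: the function name `psi`, the bin count, the sample values, and the rule of thumb that a PSI above roughly 0.2 deserves attention are all assumptions, not something prescribed by this post.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all training values are equal

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # Live values outside the training range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical feature values: training data spread over [0, 1),
# live data concentrated in the upper half of that range.
train = [i / 100 for i in range(100)]
live_same = [i / 100 for i in range(100)]
live_shifted = [0.5 + i / 200 for i in range(100)]

print("PSI, unchanged data:", psi(train, live_same))
print("PSI, shifted data:  ", psi(train, live_shifted))
```

Computing a score like this per feature on a schedule, and alerting when it crosses a chosen threshold, is one simple way to get the early warning that monitoring is meant to provide.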

Metrics to be monitored

The following are some of the critical metrics you can monitor to keep track of the model's performance:


Precision

Precision is the percentage of your positive predictions that are actually relevant. It assesses a model's accuracy in predicting positive labels: out of all the times the model predicted a positive outcome, how often was it correct? When the cost of a false positive is high and the cost of a false negative is low, precision is an excellent evaluation metric to use.

Precision = True Positives/(True Positives + False Positives)


Recall

Recall is a model's capacity to find all relevant cases within a data set. Mathematically, it is the number of true positives divided by the number of true positives plus the number of false negatives. Recall is the metric to watch when missing a positive case is costly, for example when determining whether or not a credit card charge is fraudulent.

Recall = True Positives/(True Positives + False Negatives)
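As a small worked example of the two formulas above, the sketch below counts the confusion-matrix cells for a handful of binary predictions and derives precision and recall from them. The labels are made up for illustration (1 = fraudulent, 0 = legitimate), and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision(tp, fp):
    # Guard against the case where the model predicted no positives at all.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # Guard against the case where the data contains no actual positives.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical fraud labels: 4 fraudulent charges, 6 legitimate ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("precision:", precision(tp, fp))  # 2 of 3 flagged charges were fraud
print("recall:   ", recall(tp, fn))     # 2 of 4 fraudulent charges were caught
```

Here the model looks reasonable on precision but misses half of the fraud, which is why the two metrics are usually tracked together.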


Accuracy

Accuracy is a metric that measures how often a model's predictions are correct. It's simple to calculate: divide the number of correct predictions by the total number of predictions.

Accuracy = Correct Predictions/Total Predictions
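In production, accuracy is usually more informative when tracked over a recent window of predictions rather than over the model's whole lifetime, so that a recent drop is not averaged away. The sketch below is one possible way to do that, assuming ground-truth labels eventually arrive for each prediction; the class name `RollingAccuracy`, the window size, and the alert threshold are hypothetical choices.

```python
from collections import deque

class RollingAccuracy:
    """Accuracy over the most recent `window` predictions with known ground truth."""

    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)  # old outcomes drop off automatically
        self.threshold = threshold

    def update(self, prediction, actual):
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        if not self.outcomes:
            return None  # nothing observed yet
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        acc = self.accuracy()
        return acc is not None and acc < self.threshold

# Hypothetical stream of (prediction, actual) pairs.
monitor = RollingAccuracy(window=5, threshold=0.8)
for pred, actual in [(1, 1), (0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]:
    monitor.update(pred, actual)

print("rolling accuracy:", monitor.accuracy())
print("alert:", monitor.degraded())
```

With a window of five, only the last five outcomes count, so the first correct prediction has already dropped out of the calculation by the time the loop ends.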

F1 Score

The F1 score is a measure used for classification models in machine learning. It combines two simpler performance indicators, precision and recall, into a single number by taking their harmonic mean. As a result, this score considers both false positives and false negatives. Although it is not as intuitive as accuracy, F1 is frequently more valuable than accuracy, especially if the class distribution is unequal.

F1 Score = 2 * (Recall * Precision)/(Recall + Precision)
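To see why F1 can be more useful than accuracy on unequal classes, consider the sketch below. It computes F1 as twice the product of precision and recall divided by their sum, and applies it to a made-up data set with 95 negatives and 5 positives and a hypothetical model that always predicts the negative class.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced data: 95 negatives, 5 positives.
# A model that always predicts "negative" gets these confusion-matrix counts.
tp, fp, fn, tn = 0, 0, 5, 95

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print("accuracy:", accuracy)                   # looks strong
print("F1:      ", f1_score(precision, recall))  # reveals no positives were found
```

The accuracy is high only because negatives dominate the data, while the F1 score drops to zero because the model never finds a single positive case.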



Gain and Lift

Gain or lift measures a classification model's effectiveness as the ratio between the results obtained with the model and the results obtained without it. Gain and lift charts are visual aids for evaluating classification model performance.
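One common way to compute lift is to rank examples by model score, take the top fraction, and compare the share of positives in that group with the share of positives overall, which is what you would expect from picking at random without a model. The sketch below follows that definition; the function name `lift_at`, the scores, and the labels are made up for illustration.

```python
def lift_at(scores, labels, fraction=0.1):
    """Lift of the top `fraction` of examples when ranked by model score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))            # size of the targeted group
    top_rate = sum(label for _, label in ranked[:k]) / k   # positives with the model
    base_rate = sum(labels) / len(labels)                  # positives without the model
    return top_rate / base_rate if base_rate else float("nan")

# Hypothetical model scores and true labels (1 = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

print("lift in top 20%:", lift_at(scores, labels, fraction=0.2))
```

A lift above 1 means that targeting the highest-scored examples finds positives more often than random selection would. Repeating the calculation for increasing fractions gives the points of a lift chart.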


Specificity

Specificity (or the true negative rate) is the fraction of actual negatives that are predicted as negatives. The remaining actual negatives are predicted as positives and are referred to as false positives. The proportion of healthy people who are correctly predicted to be healthy is a good example of specificity.

Specificity = True Negative/(True Negative + False Positive)
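The healthy-people example can be worked through in a few lines. The sketch below uses made-up screening results (0 = healthy, 1 = sick) and a hypothetical helper that looks only at the actual negatives.

```python
def specificity(y_true, y_pred):
    """True negative rate for binary labels where 0 is the negative class."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Guard against data that contains no actual negatives.
    return tn / (tn + fp) if (tn + fp) else 0.0

# Hypothetical screening results: 8 healthy people, 2 sick people.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

print("specificity:", specificity(y_true, y_pred))  # 6 of 8 healthy people predicted healthy
```

The two sick people do not enter the calculation at all, which is why specificity is usually reported alongside recall.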

Model monitoring tools and their limitations

Grafana: It is an open-source tool for time-series analytics that helps you visualize monitoring metrics. You can build customizable dashboards for various tasks and manage everything from one place.

Limitation: There are several areas where it falls short. Some options are not configurable through the UI. For example, we've had to open configuration files in command-line text editors and make changes manually for settings such as LDAP/SSO. It doesn't have a log analysis feature.

Amazon SageMaker Model Monitor: It detects and warns you when models in production make incorrect predictions, allowing you to maintain model accuracy. You don't need to write any code to choose and analyze the data you want to monitor and evaluate. It also includes built-in analysis in the form of statistical rules for detecting data drift and model quality issues.

Limitation: There are a few areas where it falls short. It is not possible to schedule training jobs with it. It doesn't have a way to readily track metrics recorded throughout training. Running on larger data sets takes a long time.

Anodot: It is an artificial intelligence (AI) monitoring application that automatically understands your data. It can keep track of several things, including customer experience, partners, revenue, and telco networking. 

Limitation: There are a few instances where it lags. For non-engineers, the UI is not intuitive. Complex roles, permissions, and templating features are missing. It necessitates far too much configuration and guidance.

Seldon Alibi Detect: Outlier, adversarial, and drift detection are all covered by Alibi Detect, an open-source Python package. The package attempts to include detectors for tabular data, text, pictures, and time series that can be used both online and offline. You are free to use it in any way you see fit. 

Limitation: The designers are open about the assumptions they made when creating it. However, it may not remain the best answer for data drift detection in a few years for various reasons, one of which is its limited community.

Censius AI Observability Platform

As discussed above, the process of machine learning model observability is the foundation that empowers ML teams to continuously deliver an excellent product to the end-users and improve results from the lab to production. 

Censius AI Observability Platform makes it easy to monitor the ML pipelines, analyze issues and explain what the model understands. You can automatically monitor and identify problems that could arise in production models before it's too late. Additionally, it lets you perform root-cause analysis to understand model anomalies. It is helping several data scientists, ML engineers, and various business stakeholders to achieve their goals every day.

You can begin using the AI Observability Platform in a few steps:

  • With just a few lines of code, you can register your model, log the features and capture the model prediction.
  • You can choose from dozens of monitor configs to track the entire ML pipeline.
  • Track the monitors and perform analysis without writing a single line of code.

Sit back, as your entire ML pipeline is monitored. You can even run thousands of monitors without any additional engineering effort.

Following are the features of the Censius AI Observability Platform:

  • It will continuously monitor the model input and output, track data drifts and concept drifts, check for data integrity across the pipeline and provide real-time alerts for monitoring violations.
  • It will track the model's health and performance across different model versions. Based on the observations, you can view the evaluations and historical performance of the model on its customizable dashboard.
  • It lets the teams analyze and ship the models with better performance, increasing business growth.


Conclusion

In this blog, you got a brief introduction to how you can monitor your machine learning models after deployment. You may encounter several kinds of issues that degrade your model's performance, but watching the metrics mentioned above will help you catch and correct them. You also learned about some of the tools used for model monitoring and their limitations. Finally, you came across a tool that can be part of your model monitoring process and set you up for success.
