
Model Monitoring in Machine Learning - All You Need to Know

What can go wrong with models in production? What needs to be monitored, and why?

Sanya Sinha

Machine learning has taken the world by storm. 91.5% of business firms invest in AI and ML to break down the silos that stand in the way of sound corporate management.

To benefit from ML professionals who can move the organization’s needle, a working knowledge of the ML lifecycle is essential, as is knowledge of the individual machine learning workflows and the tools used to carry them out. Model monitoring is one such workflow: it tracks the model’s performance and helps troubleshoot issues in the production environment.

Machine Learning Model Development Lifecycle

An overview of the Steps in the ML Lifecycle

The ML model development lifecycle can be divided into multiple phases, as shown in the image. For simplicity, however, four broad categories cover everything in the pipelines used to develop resilient ML models. The four high-level steps in an ML lifecycle are:

  1. Data preparation
  2. Model development and testing
  3. Model deployment
  4. Model monitoring

These four steps form a loop: the results of evaluating the developed models help refine the dataset, which in turn supports the development of better models. Various tools on the market also help ML engineers accelerate every step of the pipeline to develop high-quality and efficient models. Let us examine these steps.

Dealing with Data: Preparation, Augmentation, and Cleaning

Data needs to be collected, pre-processed, augmented, cleaned, and managed before it can serve the model’s target task.

Collecting more data samples is advisable because often only a small subset of the samples is annotated. Automated augmentation can expand the data at scale and bring it into the desired formats. If the model is not behaving as expected, there is a significant chance that the data contains outliers or biases. Rebalancing the dataset and removing redundant or missing records can noticeably improve model performance. Finally, data needs to be managed properly, using methodologies like ETL (Extract, Transform, Load), to ensure quality models.

Model Creation, Training, and Testing

This step is the core of the machine learning cycle. Transfer learning lets users solve real-world problems by starting from pre-existing baseline models, available in centralized code repositories, that were trained on related problems. These models can then be tuned to fit the designated dataset by constructing a training loop. Evaluation metrics like accuracy, precision, and recall are used to determine how well the model performs on the given problem. Selecting the right metrics and analyzing the errors are important parts of machine learning model development.
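As a rough illustration of the evaluation metrics mentioned above, the sketch below computes accuracy, precision, and recall for a binary classifier using only the Python standard library. The function name `classification_metrics` and the sample labels are hypothetical and chosen purely for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for a binary task."""
    pairs = list(zip(y_true, y_pred))
    # True positives, false positives, and false negatives for the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative labels only, not results from any real model.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Which of the three numbers matters most depends on the problem: precision when false alarms are costly, recall when missed positives are costly.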

Model Deployment

After the machine learning model has been developed, the next step in the ML lifecycle is deploying it to the production environment. ML models can be deployed across diverse environments and are often integrated with apps through real-time APIs. The deployment phase is where the model is put to use to solve real-world problems.

Model Monitoring

Model monitoring is one of the most important parts of the ML lifecycle. It enables ML engineers to watch the complete ML pipeline for inference speed, concept drift, data drift, and other performance metrics. Services offered by AI Observability platforms like Censius help ML engineers proactively fix issues that affect business outcomes.

Why is Model Monitoring Required?

The usefulness of a machine learning model is ultimately tested in the production environment, where a broad spectrum of issues and discrepancies can crop up. Monitoring ML pipelines is therefore essential for keeping model performance consistent and for troubleshooting errors.

Reducing Generalization Issues

Different business cases require data samples of different sizes. Larger datasets are generally preferred, because outliers can render many records invalid. Once rogue data is excluded, only a small subset of the dataset may remain usable for training. Generalization issues then crop up when this smaller dataset is augmented and annotated on a larger scale. As a result, the fit may be imprecise, and the model may underfit or overfit.

Data Integration and Volatile Parameters

The format of the datasets a model uses is volatile and can change as requirements evolve. For example, a model created ten years ago to recognize the faces of family members must be updated because faces change with age. Significant changes in the data’s schema (parameters and sampling methods included) could adversely affect model performance.

Data Drifts and Concept Drifts

The data that models use is highly sensitive to market trends, paradigm shifts, and industry fluctuations. So even if the input data’s schema remains unchanged, the values in the training data can become obsolete over time. For example, shopping patterns and customer behavior shifted considerably during COVID-19: online shopping prospered, and customer interest swung sharply towards healthcare and sanitation. Models built on pre-COVID historical datasets are therefore likely to perform poorly. Seemingly trivial errors, such as wrongly labeled data or unsuitable recommendations, can then drag model performance down, because the output values may be stale.

Catering Issues

The resources a deployed model needs depend largely on how users receive it. If model latency skyrockets and traffic is not what was targeted, the model may not function as it was meant to.

Most of the threats mentioned above go unnoticed in the ML lifecycle until they grow large enough to cause visible damage. They also make the model appear to be a black box. Hence, monitoring the model’s vital signs has become essential for flexible control and fast problem-solving. Automated monitoring relieves Data Analysts, ML engineers, and DevOps experts of continuous manual monitoring: Censius’s AI Observability Platform does that for them and alerts them if there are inconsistencies in the model.

Machine Learning Model Monitoring- What Should be Monitored?

While it has been established that model monitoring is indispensable for maintaining model quality, it is less obvious which metrics need to be monitored.

Monitoring can be broadly divided into two categories: functional monitoring and operational monitoring.

Two broad categories of ML model monitoring

Functional Monitoring

Functional monitoring gauges model performance in terms of its inputs, outputs, and modeling capabilities. It covers monitoring at the level of the input data, the model, and the predicted outputs.

Data- Quality, Drift, and Outliers

Unhealthy data pipelines that feed input data to the model degrade model performance. Data loss, data corruption, and changes in the source data’s schema are factors that make data updates mandatory. The AI Observability Platform offered by Censius enables users to flag poor-quality data by testing input data for missing or redundant values, new values, range violations, and type mismatches.
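The kinds of checks listed above (missing values, new values, range, and type mismatch) can be sketched in a few lines of plain Python. This is only a minimal illustration and not the Censius API: the `check_record` function, the schema layout, and the feature names are all hypothetical.

```python
def check_record(record, schema):
    """Return a list of data-quality issues found in one input record.

    `schema` (a hypothetical layout) maps each feature name to a dict with
    "type" and, optionally, "min"/"max" (numeric range) or "allowed"
    (the categories seen during training).
    """
    issues = []
    for name, rule in schema.items():
        value = record.get(name)
        if value is None:
            issues.append(f"{name}: missing value")
            continue
        if not isinstance(value, rule["type"]):
            issues.append(f"{name}: type mismatch")
            continue
        if "min" in rule and value < rule["min"]:
            issues.append(f"{name}: below range")
        if "max" in rule and value > rule["max"]:
            issues.append(f"{name}: above range")
        if "allowed" in rule and value not in rule["allowed"]:
            issues.append(f"{name}: new value")
    return issues


# Illustrative schema and records only.
schema = {
    "age": {"type": int, "min": 0, "max": 120},
    "country": {"type": str, "allowed": {"IN", "US", "DE"}},
}
good = check_record({"age": 34, "country": "IN"}, schema)
bad = check_record({"age": "34", "country": "BR"}, schema)
missing = check_record({"country": "US"}, schema)
print(good, bad, missing)
```

In practice such checks would run on every batch or request before the data reaches the model, and the collected issues would feed an alerting system.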

Shifts in data distribution lead to data drift, which can markedly degrade model performance. The Censius AI Observability Platform performs rigorous distribution checks by measuring statistical metrics such as measures of central tendency. Divergence and distance tests can be applied to continuous features, while the chi-squared test can be used for categorical features. When a model is retrained in response to drift, A/B testing or shadow testing lets the user determine whether the challenger is fit to compete with the champion model. If there is not enough data for remodeling, new data can be combined with historical data, with larger weights allocated to the drifting features.
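To make the distribution checks above more concrete, here is a minimal sketch of two of them. For a continuous feature it uses the Population Stability Index (PSI), one example of a divergence measure, and for a categorical feature it uses the Pearson chi-squared statistic. The choice of PSI, the bin edges, the smoothing floor, and all sample data are assumptions made for illustration, and the sketch does not claim to reflect how any particular platform implements its checks.

```python
import bisect
import math
from collections import Counter


def psi(reference, production, edges, floor=1e-4):
    """Population Stability Index of a continuous feature over fixed bins."""
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for value in values:
            counts[bisect.bisect_right(edges, value)] += 1
        # Floor each proportion so empty bins do not produce log(0).
        return [max(count / len(values), floor) for count in counts]

    ref, prod = proportions(reference), proportions(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


def chi_squared_statistic(reference, production):
    """Pearson chi-squared statistic of production category counts against
    the counts expected from the reference proportions.

    Assumes both samples are non-empty. Categories that never occur in the
    reference are ignored here; they are better caught as "new values" by a
    data-quality check.
    """
    ref_counts = Counter(reference)
    prod_counts = Counter(production)
    statistic = 0.0
    for category, ref_count in ref_counts.items():
        expected = ref_count / len(reference) * len(production)
        statistic += (prod_counts[category] - expected) ** 2 / expected
    return statistic


# Illustrative data only.
ref_spend = list(range(1, 11))
prod_spend = [6, 7, 8, 9, 10, 8, 9, 10, 9, 10]
spend_psi = psi(ref_spend, prod_spend, edges=[2.5, 5, 7.5])

ref_channel = ["store"] * 50 + ["online"] * 50
prod_channel = ["store"] * 2 + ["online"] * 18
channel_chi2 = chi_squared_statistic(ref_channel, prod_channel)
print(spend_psi, channel_chi2)
```

A common rule of thumb treats a PSI above 0.25 as a significant shift, and a chi-squared statistic above 3.841 rejects "no change" at the 5% level for a feature with two categories (one degree of freedom). Both thresholds should be tuned to the use case.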

Outliers are anomalies in data that are exceedingly tedious to catch by hand. However, monitoring tools can use both statistical distance tests and unsupervised learning algorithms to detect them. Once an outlier has been detected, data slicing can be applied to the affected data segments to assess model performance on a smaller scale.
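As one simple example of a statistical approach to outlier detection, the sketch below flags values using the modified z-score based on the median absolute deviation (MAD). This technique is a stand-in chosen for illustration, not necessarily what a given monitoring tool uses. The function name, the sample values, and the 3.5 cutoff (a commonly cited default) are assumptions.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds `threshold`."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # More than half the values equal the median; the score is undefined.
        return []
    return [
        (i, v)
        for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


# Illustrative feature values only.
basket_sizes = [10, 12, 11, 13, 12, 11, 95, 12]
outliers = mad_outliers(basket_sizes)
print(outliers)
```

The median-based score is used here instead of the mean and standard deviation because a single extreme value can inflate the standard deviation enough to hide itself.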

Model- Model drifts, Versions, and Security Concerns

Dynamic changes in the business landscape lead to model drift. Model drift can be classified by how the degradation unfolds over time: it can be sudden, gradual, temporary, or recurring. Statistical metrics on model performance can be used to detect it.

Besides drift and decay, keeping track of model versions is vital for successful deployment. Each prediction should be tied to the version of the model that produced it, so that results can be traced accurately.

Machine learning models are becoming ever more popular, so model security is increasingly important. Applying adversarial metrics to reduce latency and the risk of compromise is an important feature integrated with Censius’s platform.
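One simple way to turn a performance metric into a drift signal is to track accuracy over a sliding window of recent predictions and compare it with a baseline measured at deployment time. The class below is a minimal sketch of that idea; the class name, window size, baseline, and tolerance are hypothetical values chosen for illustration.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Flags drift when windowed accuracy falls below baseline - tolerance."""

    def __init__(self, baseline, window=100, tolerance=0.05):
        self.baseline = baseline
        self.tolerance = tolerance
        # Oldest outcomes drop out automatically once the window is full.
        self.outcomes = deque(maxlen=window)

    def update(self, prediction, label):
        """Record one labeled prediction; return True if drift is suspected."""
        self.outcomes.append(prediction == label)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.baseline - self.tolerance


# Illustrative stream: ten correct predictions followed by two wrong ones.
monitor = RollingAccuracyMonitor(baseline=0.9, window=10, tolerance=0.05)
alerts = [monitor.update(1, 1) for _ in range(10)]
alerts += [monitor.update(1, 0), monitor.update(1, 0)]
print(alerts)
```

A short window reacts quickly to sudden drift but is noisy, while a long window is steadier and better suited to gradual drift.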


Comparing the predicted values stored in the model evaluation store with the ground truth is crucial for determining the model’s performance. However, complete and well-annotated ground truth is rarely available. When ground truth isn’t available, the distribution of prediction results, aligned with the business KPI (key performance indicator), is used instead.
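The fallback described above can be sketched as follows: predictions are joined with whatever ground truth has arrived, and if none has, the distribution of predicted labels is reported instead so it can be compared against a business KPI. The function name, the use of request IDs as join keys, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def evaluate_predictions(predictions, ground_truth):
    """Score predictions against ground truth, or summarize them without it.

    Both arguments map a request ID to a label (hypothetical layout).
    """
    if not predictions:
        raise ValueError("no predictions to evaluate")
    matched = [rid for rid in predictions if rid in ground_truth]
    if matched:
        correct = sum(predictions[rid] == ground_truth[rid] for rid in matched)
        return {
            "mode": "ground_truth",
            "accuracy": correct / len(matched),
            # Share of predictions that have a label so far.
            "coverage": len(matched) / len(predictions),
        }
    counts = Counter(predictions.values())
    return {
        "mode": "prediction_distribution",
        "distribution": {
            label: n / len(predictions) for label, n in counts.items()
        },
    }


# Illustrative data only.
predictions = {"r1": "approve", "r2": "reject", "r3": "approve", "r4": "approve"}
labels = {"r1": "approve", "r2": "approve"}
with_truth = evaluate_predictions(predictions, labels)
without_truth = evaluate_predictions(predictions, {})
print(with_truth, without_truth)
```

Reporting coverage alongside accuracy is a deliberate choice: an accuracy figure computed on a small labeled fraction should be read with more caution.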

Operational Monitoring

Operational monitoring tracks the model’s resources and system health in production. The MLOps infrastructure monitors ML pipeline health, performance metrics, and cost.


System performance metrics like GPU and memory utilization, number of API calls, and response time can be monitored through dashboards. System infrastructure metrics like network uptime and cluster specifics define the system’s reliability.
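Response time is usually summarized with percentiles rather than averages, since a few slow requests can hide behind a healthy mean. The sketch below computes a nearest-rank percentile over recorded latencies and compares the 95th percentile with a budget. The function names, the latency values, and the 300 ms budget are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of `values` for q in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def latency_alert(latencies_ms, p95_budget_ms):
    """Return the p95 latency and whether it exceeds the budget."""
    p95 = percentile(latencies_ms, 95)
    return p95, p95 > p95_budget_ms


# Illustrative response times in milliseconds.
latencies = [120, 95, 110, 130, 105, 98, 102, 115, 125, 900]
median_ms = percentile(latencies, 50)
p95_ms, breached = latency_alert(latencies, p95_budget_ms=300)
print(median_ms, p95_ms, breached)
```

In this sample the median looks healthy while the 95th percentile exposes the single slow request, which is the kind of gap a dashboard should surface.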


Data pipeline components such as the input data, the preprocessing and validation workflows, and the output data can be tracked using statistical metrics. Monitoring both the data and model pipelines helps maintain input quality, since unhealthy data pipelines degrade data quality significantly.

Observability platforms validate model dependencies so that each model version has its dependencies in place.


Hosting and deploying a model incurs charges. Data storage, compute, and orchestration costs can multiply rapidly if an operational monitoring platform like Censius isn’t used to track them.


Deploying a model to production introduces multiple real-time complexities. A model monitoring service can detect faults and inconsistencies and help troubleshoot issues as they crop up. AI Observability platforms like Censius help pinpoint data drift, concept drift, and data augmentation issues. The system, the ML pipelines, and the cost can all be monitored to keep models healthy and optimized.
