Model monitoring in machine learning is the practice of tracking the performance metrics of a model in production. It ensures that machine learning models keep predicting the desired outcomes. Effective model monitoring is essential to the success of any ML service, because the production phase of the ML lifecycle brings both rewards and pitfalls.
Given the volatility of the production world, successful deployment is a mammoth task. Inconsistencies in data and ever-shifting production dynamics are only a few of the many obstacles in the way of smooth model deployment.
Model monitoring steps in as a robust intervention, as tenacious as a second-grade class monitor. It allows ML engineers to closely watch performance metrics, market volatility, and data drift in their ML pipelines, supporting a thorough examination of the entire ML workflow and the implementation of tweaks that pay off in the long run.
What should I monitor?
While model monitoring has emerged as a savior from production rigors, it is far from simple. This is mainly because the evaluated performance metrics are derived from offline datasets and are thus susceptible to production dynamics. A lack of understanding of the model monitoring approach within MLOps teams exacerbates the situation. As a result, MLOps personnel need a quick, on-the-fly understanding of model monitoring pipelines, and "What should we monitor?" is a valid question they often ask.
Monitor your Data
“Any machine learning model is only as good as the data it is fed.”
Data is the foundation of any machine learning model, and models can fail if that data is not sufficiently cleaned and processed. MLOps staff must therefore keep a close eye on the data used to build an ML model. Monitoring the input data schema across the development and production landscapes is an excellent example of a model monitoring use case: a change in the input data schema can break features and cause the model to collapse. For model monitoring, it is also crucial to keep track of both the training and validation datasets.
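To make this concrete, a minimal schema check could look like the following sketch; the expected column names and types are hypothetical examples, not a prescribed schema:

```python
# Minimal sketch of input-schema monitoring. The expected schema
# (column names and types) below is a hypothetical example.
EXPECTED_SCHEMA = {"age": int, "income": float, "country": str}

def check_schema(record: dict, expected: dict = EXPECTED_SCHEMA) -> list:
    """Return a list of schema violations for one production record."""
    issues = []
    for column, expected_type in expected.items():
        if column not in record:
            issues.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            issues.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )
    for column in record:  # columns the training schema never saw
        if column not in expected:
            issues.append(f"unexpected column: {column}")
    return issues
```

A record such as `{"age": 42, "income": "high"}` would be flagged both for the wrong `income` type and for the missing `country` column, letting the pipeline raise an alert before the model ever sees the malformed input.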
Monitor the Performance
How does a teacher assess a student's overall comprehension of a subject? She conducts an evaluation, comparing the students' responses to a set of questions against a predetermined set of acceptable answers. Students are promoted if they meet the required standards.
Similarly, a model's performance is assessed by comparing its output to the ground-truth values. However, because real-time datasets are dynamic, variations in production make assessing the model's performance significantly more difficult. The model must be monitored on smaller data slices to acquire granular insights into class-wise performance.
Individual slices with poor model performance might be specifically targeted to improve model performance.
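A slice-wise evaluation along these lines could be sketched as follows; the slicing column (`country`) and the 0.9 acceptability threshold are illustrative assumptions:

```python
# Sketch of slice-wise accuracy monitoring. Records carry a true
# label, a predicted label, and a slicing attribute (here "country").
from collections import defaultdict

def slice_accuracy(records, slice_key):
    """Compute accuracy per slice of the data."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        s = r[slice_key]
        total[s] += 1
        correct[s] += int(r["y_true"] == r["y_pred"])
    return {s: correct[s] / total[s] for s in total}

records = [
    {"country": "US", "y_true": 1, "y_pred": 1},
    {"country": "US", "y_true": 0, "y_pred": 0},
    {"country": "CL", "y_true": 1, "y_pred": 0},
    {"country": "CL", "y_true": 1, "y_pred": 1},
]
# Flag slices whose accuracy falls below an acceptable level (0.9 here).
weak_slices = {s: a for s, a in slice_accuracy(records, "country").items()
               if a < 0.9}
```

The `weak_slices` dictionary then names exactly the segments (here the `"CL"` slice) that deserve targeted retraining or more labeled data.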
Monitor the Scope of your Model
Understanding the capabilities and limitations of an ML model is compulsory for MLOps teams. No model is perfect, and models must be updated over time to deal with the latest trends and paradigms. For example, a model trained to recognize family members' faces would be rendered obsolete in 5 to 10 years, as people's facial features change over time. As new data flows in, models should be retrained in parallel to handle both breaking and behavioral data changes, which keeps them usable across varied use cases.
Challenges in Model Monitoring
Performance Metrics and Proxy Metrics Tracking
Tracking performance indicators in real time appears to be one of the most realistic techniques for model monitoring. In production, a model's predictions typically trigger downstream workflows, and since it is impossible to monitor the entire workflow manually, performance metrics must be used to track production details on a minute-by-minute basis. The usefulness of performance metrics, however, is limited by the significant latency between the predicted values and the arrival of the ground truth.
While monitoring models on a LIVE dataset seems the most apparent solution to deployment challenges, it is not always financially or computationally feasible, so metrics generated from the training datasets are used as proxy metrics for the model.
Likewise, creating entirely new models from scratch can be computationally intensive and time-consuming, so ML engineers might reuse a network pretrained on a dataset as large as ImageNet to solve a problem as simple as classifying dog and cat images. Proxy metrics step in to evaluate such general-purpose, off-the-shelf tools on scenarios that only partially match the purpose they were originally trained for, helping optimize solutions that do not perfectly adhere to the predefined use cases.
We know that performance and proxy metrics must be used extensively when monitoring models. However, the outputs of these metrics are influenced by a wide variety of factors. Some metrics, for example, are computed throughout the model deployment phase, whereas others are computed at specific points in the ML lifecycle, so their results naturally differ. How should the MLOps team choose which metric to use to evaluate the model?
Similarly, if you want to set up an alerting system based on model performance, you'll need to select a specified threshold for each metric: an alert is raised if the metric declines below this level. Choosing a permissible threshold for these largely context-dependent metrics, however, necessitates substantial domain expertise.
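Such a threshold-based alerting rule could be sketched as follows; the metric names and threshold values are placeholders that, as noted above, would need domain expertise to set properly:

```python
# Sketch of a threshold-based alerting rule. The metric names and
# threshold values here are illustrative placeholders.
THRESHOLDS = {"accuracy": 0.90, "f1": 0.85}

def check_alerts(metrics: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return an alert message for every metric below its threshold."""
    return [
        f"ALERT: {name} = {value:.3f} fell below threshold {thresholds[name]}"
        for name, value in metrics.items()
        if name in thresholds and value < thresholds[name]
    ]

alerts = check_alerts({"accuracy": 0.87, "f1": 0.91})
```

With these example numbers only the accuracy metric trips its threshold, so a single alert is produced; in practice the alert would be routed to a paging or dashboard system.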
Manual Intervention Needed
Domain knowledge is of paramount importance when setting the threshold triggers for an alarm-based warning system, which means personnel with in-depth subject expertise are needed for data labeling. Even heavily automated production pipelines therefore still require manual support.
These subject-matter experts help validate the results for poor-quality models. They also help in exhaustive data annotation and relabeling for retraining models. For example, if an image classification model is being trained to classify retinal fundus images with proliferative diabetic retinopathy, ophthalmologists would have to be employed in the production loop for labeling the photos. Therefore, the need for manual support is still indispensable in the production pipelines.
Outliers and Discrepancies
Large-scale augmentation of datasets introduces unwanted anomalies and outliers into the data, and these anomalies can render models unfit for predictions on unseen datasets. Outlier detection is therefore invaluable in ML pipelines, and choosing a suitable outlier detector for a given application is vital for successful model monitoring.
The choice of outlier detector depends on a broad spectrum of factors, including the structure and dimensionality of the data and the availability of labeled and anomalous data. At a high level, outlier detectors are either online or offline. Online detectors handle real-time data and must be continuously updated to keep up with the flow of data. Offline detectors, often deployed as individual ML models themselves, monitor static datasets. Choosing suitable outlier detectors can be exhausting and prone to errors.
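As an illustration, a simple online detector can maintain running statistics over the stream and flag observations far from the mean; the 3-sigma rule used here is a common but purely illustrative choice:

```python
# Sketch of an online outlier detector using running statistics
# (Welford's algorithm). The 3-sigma threshold is illustrative.
class OnlineZScoreDetector:
    def __init__(self, threshold: float = 3.0):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
        self.threshold = threshold

    def update(self, x: float) -> bool:
        """Ingest one observation; return True if it looks like an outlier."""
        is_outlier = False
        if self.n >= 2:
            std = (self.m2 / (self.n - 1)) ** 0.5  # sample std so far
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_outlier = True
        # Update running mean/variance regardless (Welford's algorithm),
        # so the detector continuously adapts to the data flow.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_outlier
```

Feeding the detector a stream of values around 10 and then a spike of 50 would flag only the spike. An offline detector would instead fit once on the full static dataset, for example an isolation-forest-style model, rather than updating per observation.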
Change in Data Distribution
The most significant problem encountered during model monitoring is the unpredictability of data in production. Because the data used for testing and validation constantly fluctuates, any ML model gradually loses efficacy. For example, after the COVID-19 pandemic broke out, sales of face masks in Chile increased by nearly 800 percent in February 2020. As a result, machine learning models built on pre-COVID datasets were likely to fail when applied to data collected after the pandemic's peak.
After the model is deployed, drift detectors are employed to identify any divergence in the input data distribution; such a disparity can be a significant sign of data drift, leading to model failure. Detecting drift in high-dimensional and nebulous datasets, however, is time-consuming and often necessitates dimensionality reduction first.
Introduction to Observability
Observability is a novel approach to understanding the inner workings of an ML pipeline in order to troubleshoot errors. It allows the MLOps team to evaluate the operation's internal metrics, traces, and logs, working back from the repercussions of a problem to its roots. No matter how intricate the microservice architecture used in model serving, Observability can detect deviations from the expected behavior and report them to the MLOps team.
Observability and Monitoring are often used interchangeably, but they are not the same. Model monitoring records the overall health of the ML pipeline, while Observability provides granular insights into the possible causes of any collapsing metric. In simpler words, monitoring answers the question 'what,' whereas observability answers the questions 'why' and 'how.' Monitoring falls under the umbrella of Observability.
Observability As The Solution
AI Observability aims for complete transparency in ML pipelines, building a comprehensive understanding of each stage of the ML lifecycle. AI Observability platforms monitor and troubleshoot the ML infrastructure for outliers and drifts. Observability platforms like Censius allow MLOps personnel to monitor complicated ML pipelines extensively and detect anomalies along with their underlying causes. Leveraging such platforms not only improves the overall health of the operations pipeline but also braces MLOps teams for impending threats, prompting them to take action. Observability helps MLOps teams build end-to-end ML solutions without being throttled by the rigors of production.