If you're a data scientist or otherwise involved in research, you know that working with data is complex: collecting, labeling, and organizing it into a structured form that people or machines can interpret takes significant time and effort. Data scientists and machine learning engineers often worry about things like bias in training data or inconsistent datasets. But there's one thing that can cause even worse problems: drift.
Model drift is the tendency of a deployed model's behavior to change over time as the data and the world around it change, sometimes in unpredictable ways. If you don't keep an eye on these changes, your model may fail.
What is Drift?
Drift is an important concept in machine learning and artificial intelligence. It refers to changes in production data that can alter how well a model generalizes. These changes are often hard for humans to predict, which is why they're such an important issue for AI research. There are two major types of drift: concept drift and data drift.
Data drift is a change in the statistical properties of the independent variables, that is, the model's input features. It arises when the data a model sees in production gradually stops resembling the data it was trained on, so a training set that was once representative becomes inaccurate or irrelevant over time.
- Data drift concerns the attributes of the inputs (features), not the target variable.
- Data drift is also known as feature drift or covariate shift.
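One way to make "a change in the properties of the independent variables" concrete is to compare a feature's distribution at training time with its distribution in production. The sketch below uses the two-sample Kolmogorov-Smirnov statistic, written in plain Python. The feature values, the sample sizes, and the alert threshold of 0.1 are illustrative assumptions, not values from this article.

```python
import random
from bisect import bisect_right


def ks_statistic(reference, production):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical CDFs of the two samples (0 = identical, 1 = fully separated)."""
    ref = sorted(reference)
    prod = sorted(production)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for value in ref + prod:
        cdf_ref = bisect_right(ref, value) / len(ref)
        cdf_prod = bisect_right(prod, value) / len(prod)
        max_gap = max(max_gap, abs(cdf_ref - cdf_prod))
    return max_gap


# Hypothetical feature: customer age at training time vs. in production.
rng = random.Random(0)
training_ages = [rng.gauss(40, 10) for _ in range(1000)]
production_ages = [rng.gauss(50, 10) for _ in range(1000)]  # population got older

DRIFT_THRESHOLD = 0.1  # assumed alert level; tune it for your own data
gap = ks_statistic(training_ages, production_ages)
print(f"KS statistic: {gap:.3f}, drift suspected: {gap > DRIFT_THRESHOLD}")
```

A check like this looks only at the inputs, so it can flag data drift even before any labels for the new data are available.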
The term "concept drift" is related to, but distinct from, "data drift". In machine learning, concept drift is a shift over time in the relationship between the input data and the output data of the underlying problem. The statistical properties of the target variable, which the model is trying to predict, change in unexpected ways over time.
- As a result, the model based on historical data is no longer valid.
- The model's assumptions based on historical data must be changed using current data.
- This is a problem because the predictions become less accurate as time goes on.
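The points above can be illustrated with a toy simulation. In this sketch the input distribution never changes; only the rule that maps inputs to labels does. The model, both labeling rules, and the 0.5 and 0.8 cut-offs are made up for illustration.

```python
import random


def model(x):
    """A 'trained' model frozen at deployment time: it learned the old rule."""
    return 1 if x > 0.5 else 0


def old_rule(x):
    """Relationship between input and label when the model was trained."""
    return 1 if x > 0.5 else 0


def new_rule(x):
    """Relationship after the concept has drifted (assumed for illustration)."""
    return 1 if x > 0.8 else 0


def accuracy(inputs, label_rule):
    """Fraction of inputs on which the frozen model agrees with the true labels."""
    correct = sum(model(x) == label_rule(x) for x in inputs)
    return correct / len(inputs)


rng = random.Random(42)
inputs = [rng.random() for _ in range(10_000)]  # same input distribution throughout

acc_before = accuracy(inputs, old_rule)  # 1.0: the model matches the old rule exactly
acc_after = accuracy(inputs, new_rule)   # roughly 0.7: inputs in (0.5, 0.8] are now wrong
print(f"accuracy before drift: {acc_before:.2f}, after drift: {acc_after:.2f}")
```

Because the inputs look exactly the same before and after, a check on the features alone would not catch this; the drop only shows up once predictions are compared with actual outcomes.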
We learn more about concept drift in the next section.
Recommended Reading - Machine Learning Model Monitoring: Why Is It So Important to Monitor Post-Deployment?
Understanding Concept Drift
Concept drift is the phenomenon where the concepts an artificial intelligence system has learned stop matching reality, because the real-world relationships behind them change or evolve over time. Let's take a simple example to better understand concept drift.
Suppose you created a facial recognition model that recognizes faces and marks the attendance of students or employees. After COVID-19, everything changed: people started wearing masks, so a model trained on unmasked faces no longer works. This change in the whole concept behind the application is called concept drift.
Recommended Reading: How to Address Concept Drift?
Drift can be sudden, gradual, or recurring:
Sudden drift happens when the concept behind the model changes abruptly. The COVID-19 pandemic supplied striking examples, such as the worldwide lockdowns that abruptly changed population behavior. Models developed before COVID saw sudden changes after it: buying habits transformed almost overnight, affecting a wide range of models. The graph below shows that sales of numerous items unexpectedly increased. In these circumstances, model predictions can go wrong.
Gradual drift takes a long time to happen, and it's quite normal in many situations. Inflation is a good example. Gradual or incremental changes are usually addressed in time series models by capturing the change in seasonality; if this is not done, the drift is a source of concern that must be addressed.
Recurring drift re-occurs after a period of time. The shift in customers' buying habits as the season or the day of the week changes is the best example. As shown in the graph below, a user spends more money on shopping every weekend than on weekdays, so this is a recurring trend.
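To make the three patterns easier to compare, here is a minimal sketch that generates the expected value of some monitored quantity (say, daily spend) over time for each pattern. The levels of 10 and 20, the change point, the growth rate, and the 7-day cycle are arbitrary illustrative choices.

```python
def sudden(t, change_at=50):
    """Sudden drift: the level jumps from 10 to 20 at a single time step."""
    return 10.0 if t < change_at else 20.0


def gradual(t, rate=0.1):
    """Gradual drift: the level creeps upward a little at every step."""
    return 10.0 + rate * t


def recurring(t, period=7):
    """Recurring drift: the level is higher on the last two days of every
    7-day cycle (a 'weekend') and returns to normal afterwards."""
    return 20.0 if t % period >= period - 2 else 10.0


for name, pattern in [("sudden", sudden), ("gradual", gradual), ("recurring", recurring)]:
    first_two_weeks = [round(pattern(t), 1) for t in range(14)]
    print(f"{name:>9}: {first_two_weeks}")
```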
Dealing with Concept Drift
Before we go into how to deal with or prevent drift, let's look at what causes it in the first place.
Causes of drift
There are many reasons for drift to occur in production models.
- The data distribution changes because of external events.
- The input data shifts, for example because customer preferences change during a pandemic or because a product is launched in a new market.
- Problems with data integrity.
- Data was collected incorrectly, or there was an issue with the data source.
- Sometimes the data itself is correct, but poor data engineering still causes drift.
Recommended Reading - Dealing with Concept Drift and Class Imbalance in Multi-Label Stream Classification
How to detect drifts?
Because drift involves a statistical change in the data, the quickest way to detect it is to keep watch on the data's statistical properties, the model's predictions, and their interactions with other parameters. There are many different methods you can use to detect drift. You can also use open-source and paid platforms to monitor your models. Here are a handful of the most common ways of detecting drift:
- Adaptive Windowing (ADWIN)
ADWIN is an algorithm that detects concept drift in real time so that machine learning models can be updated as needed. It maintains an adaptive window of recent observations: the window grows while the data looks stable and shrinks, dropping its oldest observations, when two parts of the window have averages that differ by more than a statistical threshold. The statistics the model relies on are then computed from the current window. You can learn more about ADWIN on the linked website.
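The windowing idea can be sketched in a few lines. This is a deliberately simplified version, not a production implementation: it stores every value and checks every split point, whereas the published algorithm compresses the window into buckets for efficiency. The class name, the `delta` default, and the demo stream are assumptions for illustration, and the inputs are assumed to lie in [0, 1] (for example, a 0/1 error indicator).

```python
import math
from collections import deque


class SimpleAdwin:
    """Simplified ADWIN-style detector for values in [0, 1]."""

    def __init__(self, delta=0.002, max_window=1000):
        self.delta = delta                      # confidence parameter
        self.window = deque(maxlen=max_window)  # most recent observations

    def _has_cut(self):
        """True if some split of the window into old/new parts has means that
        differ by at least the Hoeffding-style threshold eps_cut."""
        n = len(self.window)
        if n < 2:
            return False
        total = sum(self.window)
        log_term = math.log(4.0 * n / self.delta)  # ln(4 / delta'), delta' = delta / n
        head_sum = 0.0
        for n0, value in enumerate(self.window, start=1):
            n1 = n - n0
            if n1 == 0:
                break
            head_sum += value
            m = 1.0 / (1.0 / n0 + 1.0 / n1)        # harmonic-mean-style size term
            eps_cut = math.sqrt(log_term / (2.0 * m))
            if abs(head_sum / n0 - (total - head_sum) / n1) >= eps_cut:
                return True
        return False

    def update(self, value):
        """Add one observation; return True if the window had to shrink."""
        self.window.append(value)
        drift = False
        while self._has_cut():
            self.window.popleft()  # drop the oldest observation and re-check
            drift = True
        return drift


# Demo stream: the monitored value jumps from 0.2 to 0.8 at index 300.
detector = SimpleAdwin()
stream = [0.2] * 300 + [0.8] * 300
alarms = [i for i, value in enumerate(stream) if detector.update(value)]
print("first drift signal at index:", alarms[0] if alarms else None)
```

Each update costs time proportional to the window length here, which is fine for a demonstration but is exactly the cost that bucket compression avoids in practice.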
- Drift Detection Method (DDM)
This concept change detection approach is based on a premise from the PAC learning model: as long as the data distribution remains constant, the learner's error rate decreases as the number of examined samples increases. A significant increase in the error rate therefore suggests that the distribution has changed. You can learn more about DDM on the linked website.
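A minimal sketch of this idea follows. It tracks the running error rate p and its standard deviation s, remembers the point where p + s was lowest, and raises a warning or a drift signal when p + s later exceeds that minimum by two or three standard deviations. Those levels and the 30-sample warm-up are the commonly used defaults; the class name and the demo stream are assumptions for illustration.

```python
import math


class SimpleDDM:
    """Minimal Drift Detection Method sketch over a stream of 0/1 errors."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0                   # samples seen since the last reset
        self.errors = 0              # misclassifications seen
        self.p_min = float("inf")    # error rate at the best point so far
        self.s_min = float("inf")    # its standard deviation

    def update(self, is_error):
        """Feed True/1 when the model mispredicted. Returns a status string."""
        self.n += 1
        self.errors += int(is_error)
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < self.min_samples:
            return "stable"          # not enough evidence yet
        if p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s >= self.p_min + self.drift_level * self.s_min:
            self.reset()             # start learning the new concept from scratch
            return "drift"
        if p + s >= self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# Demo: a 10% error rate for 500 samples, then a 50% error rate.
error_stream = [i % 10 == 0 for i in range(500)] + [i % 2 == 0 for i in range(300)]
ddm = SimpleDDM()
statuses = [ddm.update(err) for err in error_stream]
print("first drift signal at index:", statuses.index("drift"))
```

Note that DDM needs the true labels to know whether each prediction was an error, so it suits settings where feedback arrives soon after the prediction.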
- Early Drift Detection Method (EDDM)
Instead of evaluating the number of errors, this method analyzes the distance between two consecutive errors. It tracks the running average of that distance and its standard deviation, as well as the maximum values they have reached; when errors start arriving noticeably closer together than at that maximum, drift is signaled. You can learn more about EDDM on the linked website.
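The distance-based idea can be sketched as follows. The detector keeps a running mean and standard deviation of the gap between consecutive errors, remembers the largest value of mean + 2 * std, and compares the current value with that maximum. The 0.95 and 0.90 ratios and the 30-error warm-up are commonly used defaults; the class name and the demo stream are assumptions for illustration.

```python
import math


class SimpleEDDM:
    """Minimal Early Drift Detection Method sketch over a stream of 0/1 errors."""

    def __init__(self, min_errors=30, warning_ratio=0.95, drift_ratio=0.90):
        self.min_errors = min_errors
        self.warning_ratio = warning_ratio
        self.drift_ratio = drift_ratio
        self.reset()

    def reset(self):
        self.n = 0               # samples seen since the last reset
        self.num_errors = 0
        self.last_error_at = 0   # position of the previous error
        self.mean = 0.0          # running mean distance between errors
        self.m2 = 0.0            # running sum of squared deviations (Welford)
        self.max_score = 0.0     # largest mean + 2 * std seen so far

    def update(self, is_error):
        """Feed True/1 when the model mispredicted. Returns a status string."""
        self.n += 1
        if not is_error:
            return "stable"
        distance = self.n - self.last_error_at
        self.last_error_at = self.n
        self.num_errors += 1
        # Welford's online update of the mean and variance of the distances.
        delta = distance - self.mean
        self.mean += delta / self.num_errors
        self.m2 += delta * (distance - self.mean)
        std = math.sqrt(self.m2 / self.num_errors)
        score = self.mean + 2.0 * std
        if score > self.max_score:
            self.max_score = score
        if self.num_errors < self.min_errors:
            return "stable"      # wait for enough errors before judging
        ratio = score / self.max_score
        if ratio < self.drift_ratio:
            self.reset()
            return "drift"
        if ratio < self.warning_ratio:
            return "warning"
        return "stable"


# Demo: one error every 10 samples, then one error every 2 samples.
error_stream = [i % 10 == 9 for i in range(500)] + [i % 2 == 1 for i in range(400)]
eddm = SimpleEDDM()
statuses = [eddm.update(err) for err in error_stream]
print("first drift signal at index:", statuses.index("drift"))
```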
How to prevent drifts?
Data scientists and ML Engineers must take action when they detect a potential drift in order to avoid future problems that could arise from unintentional changes. Here we will explore how ML Engineers and data scientists can prevent them.
- Retrain the model regularly, or whenever its performance falls below a certain level.
- You can train your model online, which means that your model weights are automatically updated with new data on a regular basis. The frequency of updates could be daily, monthly, or whenever new data is received. This solution is ideal if you expect incremental concept drift or an unstable model.
- Another technique to deal with drift is to drop features. Build multiple models, each with a different set of features, and if you discover that some features aren't working, remove them and conduct A/B testing.
- To prevent drift, handle data-quality problems such as missing values, outliers, and inconsistent label encoding.
- Missing values and outliers are frequently encountered while collecting data. Missing values reduce the data available for analysis, compromising the study's statistical power and, eventually, the reliability of its results.
- Maintain a static model as a baseline for comparison. Without one, it can be difficult to spot concept drift and determine whether a model has degraded over time. A static baseline helps you understand changes in model accuracy and check whether the changes you make to counter concept drift actually help: after each intervention, compare the accuracy of the updated model against the baseline.
- Continuously monitor machine learning models (inputs, outputs, and data) while keeping an eye on the ML pipeline. This is where the Censius AI Observability Platform comes to the rescue. With Censius, you can:
- Track for prediction, data, and concept drift
- Receive real-time alerts for monitoring violations
- Check for data integrity across the pipeline
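Independently of any monitoring platform, the retraining trigger, online training, and static baseline ideas from the list above can be sketched in plain Python. The tiny linear model, the noiseless synthetic stream, the learning rate, and the 0.1 retraining threshold are all assumptions made for illustration.

```python
class OnlineLinearModel:
    """Tiny model y = w * x + b that can learn one example at a time."""

    def __init__(self, w, b, learning_rate=0.1):
        self.w, self.b, self.learning_rate = w, b, learning_rate

    def predict(self, x):
        return self.w * x + self.b

    def learn_one(self, x, y):
        # One stochastic gradient descent step on the squared error.
        error = self.predict(x) - y
        self.w -= self.learning_rate * error * x
        self.b -= self.learning_rate * error


def make_stream(n, slope):
    """Hypothetical noiseless data: x cycles through 0.0..0.9, y = slope * x + 1."""
    return [((i % 10) / 10, slope * ((i % 10) / 10) + 1.0) for i in range(n)]


def compare(stream, window=100):
    """Mean absolute error over the last `window` samples for a frozen
    baseline and for a copy that keeps learning online."""
    baseline = OnlineLinearModel(w=2.0, b=1.0)  # static: never updated
    online = OnlineLinearModel(w=2.0, b=1.0)    # same start, updated on every sample
    baseline_errors, online_errors = [], []
    for x, y in stream:
        baseline_errors.append(abs(baseline.predict(x) - y))
        online_errors.append(abs(online.predict(x) - y))  # evaluate first, then learn
        online.learn_one(x, y)
    return sum(baseline_errors[-window:]) / window, sum(online_errors[-window:]) / window


def needs_retraining(recent_error, threshold=0.1):
    """Performance-based trigger: retrain once the recent error is too high."""
    return recent_error > threshold


# The true slope drifts from 2 to 3 after the first 500 samples.
stream = make_stream(500, slope=2.0) + make_stream(2000, slope=3.0)
baseline_mae, online_mae = compare(stream)
print(f"static baseline MAE: {baseline_mae:.3f}, retrain: {needs_retraining(baseline_mae)}")
print(f"online model MAE:    {online_mae:.3f}, retrain: {needs_retraining(online_mae)}")
```

The frozen copy plays the role of the static baseline: the gap between its error and the online model's error shows how much the updates are actually helping.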
Censius makes managing your machine learning applications a lot easier. You can improve your model's performance and health, and run hundreds of monitors on different model versions without any additional engineering effort.
When monitoring production models, you can face many issues, such as feature drift, errors, outliers, and debugging problems. A data scientist or machine learning engineer compares a specified window of live traffic with a baseline using different approaches. Next, after detecting drift in the model output, she determines which variables led to it. Without proper tools, monitoring predictions can be time-consuming.
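One common way to compare a window of live traffic with a baseline is the Population Stability Index (PSI), shown here as a plain-Python sketch. This is a generic illustration, not the method used by any particular platform. The equal-width binning, the epsilon guard against empty bins, the sample data, and the 0.25 alert level (a commonly quoted rule of thumb) are assumptions.

```python
import math
import random


def psi(baseline, live, bins=10, eps=1e-6):
    """Population Stability Index between a baseline sample and a live window.
    0 means identical binned distributions; larger values mean a bigger shift."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all baseline values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clip values outside the baseline range
            counts[idx] += 1
        # eps keeps the logarithm finite when a bin is empty
        return [max(c / len(values), eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(live)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical feature values: baseline vs. a live window whose mean has shifted.
rng = random.Random(7)
baseline_window = [rng.gauss(100, 15) for _ in range(2000)]
live_window = [rng.gauss(115, 15) for _ in range(2000)]

score = psi(baseline_window, live_window)
print(f"PSI: {score:.3f}, significant shift: {score > 0.25}")
```

Running the same function per feature gives a simple ranking of which variables moved the most, which is a starting point for finding the ones that led to the drift.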
With Censius AI Observability Platform, you can monitor different parameters like performance, traffic, data quality, drift, and a lot more. Drift monitors observe the distribution of statistical properties of the streaming input data, and the output data, allowing the user to get information about changing data properties and their effects on model performance.
- Get access to Censius.
- With Censius' intuitive user interface, you can easily add/update models, projects, datasets, and much more.
- You can submit a model log using the REST API or the Python SDK. You can create API keys from the settings page after logging in.
- With Censius, you can set up different types of monitors on specific features of a model. Censius will then monitor these features continuously and alert you when violations occur.
- For concept drift, you can track prediction data and alert users to statistical changes relative to actual outputs.
Censius provides a host of monitors across various data categories that can be used to monitor data and model health, providing a broad metric view of the model and its performance. Send your queries to email@example.com.
Model Monitoring Best Practices
- Training machine learning models/applications with large data sets improves output accuracy. With more data, the algorithm can learn a wider variety of patterns, which helps the model generalize to the cases it meets in production. As precision improves, the model performs better in production and is less likely to be thrown off by small shifts in the data.
- Updating machine learning models regularly. For example, most machine learning methods that use weights or coefficients, such as regression algorithms and neural networks, can benefit from periodic updates.
- Developing new models to address concept drift that occurs frequently or unexpectedly. As behavior evolves, models trained on historical data will become less trustworthy.
- In some domains, such as time series problems, the data may be expected to change over time. In these types of problems, it is common to prepare the data in such a way as to remove the systematic changes to the data over time, such as trends and seasonality, by differencing.
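The differencing mentioned in the last point can be sketched in a few lines. The synthetic series below (a linear trend plus a weekly bump on two days of each week) is made up for illustration.

```python
def difference(series, lag=1):
    """Subtract from each value the value observed `lag` steps earlier."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Hypothetical daily sales: upward trend of 2 per day plus a weekend bump of 30.
weekly_pattern = [0, 0, 0, 0, 0, 30, 30]
sales = [100 + 2 * day + weekly_pattern[day % 7] for day in range(28)]

deseasonalized = difference(sales, lag=7)       # removes the weekly pattern
stationary = difference(deseasonalized, lag=1)  # removes the remaining trend

print(deseasonalized)  # every entry is 14: seven days of trend, no weekend bump
print(stationary)      # every entry is 0: nothing systematic is left
```

A model trained on the differenced series no longer has to cope with the systematic trend and seasonality, and forecasts can be converted back by adding the subtracted values again.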
We learned what concept drift is, why it occurs, and how to deal with it. We also looked at how the Censius AI Observability Platform can help you with your machine learning development and monitoring journey. We hope you liked the article. Keep experimenting!