What is Data Drift?
Drift is a common occurrence in real-world data. When the data received in production differs from the training data in its characteristics, statistical behavior, or inter-feature relationships, it is referred to as data drift.
What Causes Data Drift?
Data drift is the result of several factors that play out in a real-world setting. For example, a change in data entry personnel at the data source, a malfunctioning sensor, or even a power failure can lead to drifting data. The causes behind data drift are varied but can be grouped under the category of disruptive triggers.
However, considering the dynamic or ever-evolving nature of data, the biggest high-level factor behind any drift can be credited to time. By the time the data moves across the cycle of data collection, data cleaning, processing, selection, training, and production, it can be safely assumed that the data has drifted in most cases, slightly if not significantly.
So while the model performs well on training, testing, and validation sets, it might not perform as well on live incoming data. Retraining the model on the latest batch is not always a solution either, since the triggering cause might invalidate the relationships the model has learned.
The Consequences of Data Drift
The consequences of data drift are straightforward. If the data drifts and a stale model is used on it, the predictions are bound to be off.
If we consider data to be a product, the above instance would easily pass as a product bug. Yet it is so often overlooked. To illustrate, consider the example of Amazon, which uses data as a prime product. Amazon's data is used to suggest similar items to buyers based on their latest purchases to drive up sales.
With a stale model for this association problem, sales are likely to drop because users no longer find the suggested items relevant or interesting. To sum it up, the model is not the product; the data is. The model wraps around the data to pick up cues and guide business decisions. Not updating the model to suit drifting data hampers these business models and impacts the end-user.
Ways to Detect Data Drift
It is not possible to stop data from drifting, but it is possible to work with drifting data and build high-performing models that are least affected by drifts. This is achievable through an automated process that detects the slightest drift, suggests the causes behind it, and offers a plausible solution.
Automatic detection can be configured through a variety of drift detection techniques such as:
- Kolmogorov-Smirnov (K-S) test, which compares the distribution of a sample against a baseline distribution (for example, production data against training data)
- Population Stability Index (PSI), which compares the binned distribution of a variable across two datasets, such as training and testing sets
- Page-Hinkley method, which tracks the cumulative deviation of incoming values from their running mean to detect changes in a stream of data
- A classification model trained to tell training samples apart from production samples; if it can do so reliably, the data has drifted
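To make the first three techniques concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not a production implementation: the function names (`ks_statistic`, `psi`, `page_hinkley`), the equal-width binning, and the default `delta` and `threshold` values are choices made for this example. The K-S function returns only the test statistic; libraries such as SciPy (`scipy.stats.ks_2samp`) also return a p-value.

```python
import bisect
import math
import random


def ks_statistic(reference, current):
    """Two-sample K-S statistic: the largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def psi(reference, current, bins=10, eps=1e-4):
    """Population Stability Index over equal-width bins spanning the reference."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant reference

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_share, cur_share = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_share, cur_share))


def page_hinkley(stream, delta=0.005, threshold=50.0):
    """Return the 0-based index where an upward mean shift is flagged, else None."""
    mean = cumulative = minimum = 0.0
    for i, x in enumerate(stream, start=1):
        mean += (x - mean) / i            # running mean of the stream so far
        cumulative += x - mean - delta    # cumulative deviation from that mean
        minimum = min(minimum, cumulative)
        if cumulative - minimum > threshold:
            return i - 1
    return None


# Synthetic data for illustration only: production mean shifted by one std.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(2000)]
prod = [rng.gauss(1.0, 1.0) for _ in range(2000)]
print("K-S statistic:", round(ks_statistic(train, prod), 3))
print("PSI:", round(psi(train, prod), 3))
print("Page-Hinkley alarm index:", page_hinkley(train[:500] + prod[:500]))
```

A commonly quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, but any threshold should be validated against the data at hand. Note that this Page-Hinkley variant only watches for increases in the mean; detecting decreases needs a mirrored check.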
A few other techniques that detect differences between data distributions include the Jensen-Shannon (JS) divergence, the Kullback-Leibler (KL) divergence, and the Wasserstein metric (also known as the Earth Mover's Distance).
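These distance measures can be sketched in a few lines as well. The snippet below is a rough standard-library illustration; the helper names, the equal-width `histogram` binning, and the equal-sample-size restriction in `wasserstein_1d` are simplifying assumptions made for this example.

```python
import math


def histogram(values, lo, hi, bins=10):
    """Share of values falling in each equal-width bin on [lo, hi]."""
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in values:
        counts[min(max(int((v - lo) / width), 0), bins - 1)] += 1
    return [c / len(values) for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in bits. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded in [0, 1] with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def wasserstein_1d(a, b):
    """1-D Wasserstein (Earth Mover's) distance for samples of equal size."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Illustrative values only.
train = [1.0, 2.0, 2.5, 3.0, 4.0, 5.0]
prod = [3.0, 4.0, 4.5, 5.0, 6.0, 7.0]
p = histogram(train, 0.0, 8.0, bins=4)
q = histogram(prod, 0.0, 8.0, bins=4)
print("JS divergence:", js_divergence(p, q))
print("Wasserstein distance:", wasserstein_1d(train, prod))
```

KL divergence is asymmetric and undefined when the second distribution has an empty bin that the first does not, which is one reason the JS divergence is often preferred for monitoring. SciPy offers `scipy.stats.wasserstein_distance` for samples of unequal size and `scipy.spatial.distance.jensenshannon`, which returns the square root of the divergence computed here.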
How to perform them
Studying, understanding, implementing, and experimenting with each of the above techniques is time- and resource-intensive, and an in-house implementation might not scale due to code fragility or people dependencies.
Plug-and-play solutions monitor and detect drifts with a few clicks on a UI or a couple of lines of wrapper code on the console. All the complex algorithms and drift detection techniques are abstracted and wrapped up in such a way that even business teams can configure and start monitoring or analyzing drifts.
How to Detect and Resolve Data Drift
Here’s a mini-guide on using plug-and-play solutions to instantly start picking up drifts in existing or new models (Step 1) and then fixing them quickly to ensure minimum impact or downtime on the customer’s end (Step 2).
Step 1: Setting up monitors early on
It is best practice to set up data drift monitors early on in the production stage. A data drift monitor can be set up in a few steps:
1. Select the Data drift monitor and the feature you want to monitor
2. Configure the threshold and the data segment for the monitor
3. Create an alert and add all necessary email and Slack channels/IDs
In three quick steps, the Censius AI observability platform starts to monitor any data drift in the selected high-risk data segment. Global segments can also be created to manage drift for the overall data. Once the data drift goes over the admissible threshold, the monitor is triggered and the alert is sent out to the teams or individuals accountable for the process.
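For readers who want to see what such a monitor does conceptually, the three steps can be approximated in plain Python. This is a hypothetical sketch and not the Censius API: the `DriftMonitor` class, its fields, the `mean_shift` drift score, and the sample rows are all invented for illustration.

```python
import statistics
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


def mean_shift(reference: List[float], current: List[float]) -> float:
    """Toy drift score: shift in mean, in units of the reference std deviation."""
    spread = statistics.pstdev(reference) or 1.0
    return abs(statistics.fmean(current) - statistics.fmean(reference)) / spread


@dataclass
class DriftMonitor:
    feature: str                                      # step 1: feature to monitor
    threshold: float                                  # step 2: admissible drift
    segment: Optional[Callable[[Row], bool]] = None   # step 2: None means global
    alerts: List[Callable[[str], None]] = field(default_factory=list)  # step 3

    def check(self, reference: List[Row], current: List[Row]) -> float:
        keep = self.segment or (lambda row: True)
        ref = [r[self.feature] for r in reference if keep(r)]
        cur = [r[self.feature] for r in current if keep(r)]
        if not ref or not cur:
            raise ValueError("segment matched no rows")
        score = mean_shift(ref, cur)
        if score > self.threshold:
            message = f"drift on {self.feature!r}: {score:.2f} > {self.threshold}"
            for send in self.alerts:  # e.g. an email or Slack sender
                send(message)
        return score


# Illustrative rows only; `sent.append` stands in for a real alert channel.
sent: List[str] = []
monitor = DriftMonitor(
    feature="basket_value",
    threshold=1.0,
    segment=lambda row: row["region"] == "EU",
    alerts=[sent.append],
)
reference = [{"region": "EU", "basket_value": v} for v in (10, 12, 11, 13, 9)]
current = [{"region": "EU", "basket_value": v} for v in (20, 22, 19, 21, 23)]
score = monitor.check(reference, current)
print(score, sent)
```

In practice the drift score would be one of the statistical tests listed earlier rather than a simple mean shift, and the check would run on a schedule against each new batch of production data.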
Step 2: Root Cause Analysis
What happens after you find out about a data drift? As stressed before, retraining a model on the latest data is not always the go-to solution. Depending on the root cause analysis (RCA), the solution could range from retraining the model or dropping a feature or two to remodeling from scratch.
Through RCA, the following details can be surfaced:
- Magnitude of drift
- Features impacted by the drift
- Causal patterns between drifted features and other variables
- Causal patterns between drift and performance
- Changes in distribution
- Causal patterns between drift and data anomalies
With plug-and-play solutions, all the above details are a few clicks away, without the need to build an RCA module from scratch. Building one that caters to a wide range of explainability modules is especially resource-intensive.
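As a rough idea of what the first two items in the RCA list (magnitude of drift and features impacted) involve, the sketch below scores every feature and ranks the ones that moved. It is an assumption-laden illustration: `rank_drifted_features`, the choice of the K-S statistic as the magnitude measure, the 0.1 cut-off, and the feature names are all made up for this example.

```python
import bisect
from typing import Dict, List, Tuple


def ks_statistic(reference, current):
    """Two-sample K-S statistic: the largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    return max(
        abs(bisect.bisect_right(ref, x) / len(ref)
            - bisect.bisect_right(cur, x) / len(cur))
        for x in ref + cur
    )


def rank_drifted_features(
    reference: Dict[str, List[float]],
    current: Dict[str, List[float]],
    threshold: float = 0.1,
) -> List[Tuple[str, float]]:
    """Return (feature, drift magnitude) pairs above threshold, largest first."""
    scores = {
        name: ks_statistic(reference[name], current[name])
        for name in reference
        if name in current
    }
    impacted = [(name, score) for name, score in scores.items() if score > threshold]
    return sorted(impacted, key=lambda item: item[1], reverse=True)


# Illustrative data only: "age" is unchanged, the other two features have shifted.
reference = {
    "age": list(range(10)),
    "income": list(range(10)),
    "tenure": list(range(10)),
}
current = {
    "age": list(range(10)),
    "income": list(range(10, 20)),
    "tenure": list(range(5, 15)),
}
print(rank_drifted_features(reference, current))
```

A ranking like this only says which features moved and by how much; linking those movements to performance or anomalies needs the further analysis described in the list above.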
Getting started with RCA
1. Open the Explain tab and select the data segment
At this stage, the top features for the selected segment will be displayed. If data drift is high for this segment, it can be attributed to the features with high feature importance. Sometimes, it is also possible that high drift in one feature causes higher turbulence in another feature. At this stage, one can simply note down the causal relationships between drift magnitude and feature importance.
2. Scroll down to the Feature analysis section
In this section, the distribution of the drifted feature can be observed and linked to anomalous behavior. Changes in the distribution pattern are also worth noting along with the vitals of the feature such as correlation with the target variable, percentage of missing values, SHAP value, etc.
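Two of the vitals mentioned here, the percentage of missing values and the correlation with the target variable, are simple to compute by hand. The following sketch assumes missing values are encoded as `None` or NaN and uses Pearson correlation; the function names are hypothetical, and SHAP values are out of scope for this example.

```python
import math
from typing import List, Optional


def is_missing(value: Optional[float]) -> bool:
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_pct(values: List[Optional[float]]) -> float:
    """Percentage of missing entries in a feature column."""
    return 100.0 * sum(is_missing(v) for v in values) / len(values)


def pearson(xs: List[Optional[float]], ys: List[Optional[float]]) -> float:
    """Pearson correlation over rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not is_missing(x) and not is_missing(y)]
    n = len(pairs)
    if n < 2:
        return float("nan")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


# Illustrative column and target only.
feature = [1.0, 2.0, None, 4.0]
target = [2.0, 4.0, 6.0, 8.0]
print("missing %:", missing_pct(feature))
print("correlation with target:", pearson(feature, target))
```

Comparing these vitals between the training window and the drifted window is one way to tell whether a distribution change comes from a data quality issue, such as a spike in missing values, or from a genuine shift in the feature.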
3. Scroll down to the Performance Vs Data Drift section
The final section offers a view of the causal relationships between data drift of the feature under study in the chosen data segment and all the available performance metrics. This gives direct insight into the changing model behavior with drifting data. Paired up with insights from the entire Explainability module, the next steps become fairly clear. The developers can then pick up the red zones and fix them up in the shortest possible time to ensure minimal impact on end-users.
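The idea of picking out red zones can be illustrated with a small sketch that lines up drift and a performance metric per time window. Everything here is hypothetical: the `Window` record, the `red_zones` helper, the thresholds, and the numbers, which are made up purely to show the mechanics.

```python
from typing import List, NamedTuple


class Window(NamedTuple):
    label: str       # e.g. a week or a batch identifier
    drift: float     # drift score of the feature in this window
    accuracy: float  # model performance measured on the same window


def red_zones(
    windows: List[Window], drift_threshold: float, min_accuracy: float
) -> List[str]:
    """Labels of windows where drift is high and performance is low together."""
    return [
        w.label
        for w in windows
        if w.drift > drift_threshold and w.accuracy < min_accuracy
    ]


# Made-up numbers for illustration, not measured results.
history = [
    Window("week 1", drift=0.05, accuracy=0.92),
    Window("week 2", drift=0.30, accuracy=0.81),
    Window("week 3", drift=0.28, accuracy=0.93),
]
print(red_zones(history, drift_threshold=0.2, min_accuracy=0.85))
```

Windows where drift is high but performance holds up, like the third one in this toy history, suggest that the drifted feature matters less to the model, which is worth knowing before deciding between retraining and dropping the feature.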
Get in Touch
If you wish to learn more about monitoring, explainability, and the reliable AI framework, refer to our resources (blogs, ebooks, whitepapers). To jump right into implementing monitors and explaining data drifts or other issues such as low performance, poor data quality, or bias, feel free to request a demo or sign up for free access to our monitoring and explainability platform.