A shift in the distribution of features between training and serving data while the relationship between the input and target is unchanged.
What is Data Drift?
In this information-rich world, enormous volumes of data are generated at every moment. But this data can change for several reasons, such as changes in the data collection system, real-world changes, or the dynamic behavior of noise in the data. When data changes in a way that affects a machine learning model’s performance, it is a data drift issue. Data drift is also referred to as feature, population, or covariate drift.
Statistically, dataset shift between a source distribution S and a target distribution T is defined as a change in the joint distribution of features and target:
P(xs, ys) ≠ P(xt, yt)
Data drift is the special case where the feature distribution changes, P(xs) ≠ P(xt), while the conditional relationship P(y | x) between input and target stays the same.
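As a quick illustration, the sketch below simulates covariate drift: the feature distribution shifts between source and target samples while the labeling rule stays fixed. The Gaussian shift and the threshold labeling rule are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Source (training) and target (serving) feature samples: P(x) shifts.
x_source = rng.normal(loc=0.0, scale=1.0, size=10_000)
x_target = rng.normal(loc=1.0, scale=1.0, size=10_000)

# The input->target relationship is unchanged: y = 1 if x > 0 else 0.
def label(x):
    return (x > 0).astype(int)

y_source, y_target = label(x_source), label(x_target)

# The marginal feature distribution has drifted even though P(y | x) is the same.
print(round(x_source.mean(), 2), round(x_target.mean(), 2))
```

Even though the same rule generates the labels in both samples, a model trained on the source data would see far more positive examples in the target data, which is exactly the covariate-drift scenario above.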
Typical causes of data drift include:
- Data quality issues, change of data source pipeline, or sensors that have become inaccurate over time
- Natural drift in the data like mean temperature changing with the seasons
- Upstream process changes, such as units of measurement switching from inches to centimeters
- Covariate shift, a change in the distribution of the input features, such as the model observing new age demographics as the user base expands
Why is Data Drift Monitoring Important?
Flagging data drift and automating model retraining jobs with new data help ensure that the model stays relevant in production and delivers reliable predictions over time. Timely data drift detection helps avoid model decay through industry best practices such as:
- Incremental learning, retraining the model as new data arrives
- Training with weighted data
- Periodic retraining and updating models
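The first practice, incremental learning, can be sketched with scikit-learn's `SGDClassifier.partial_fit`, which updates model weights on each new batch instead of retraining from scratch. The synthetic batches and the drift pattern below are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared up front for partial_fit

# Each batch's feature distribution drifts slightly; partial_fit folds the
# new batch into the existing model rather than refitting on all past data.
for batch in range(5):
    X = rng.normal(loc=0.2 * batch, scale=1.0, size=(200, 3))
    y = (X.sum(axis=1) > 0.3 * batch).astype(int)
    model.partial_fit(X, y, classes=classes)

preds = model.predict(X)  # model reflects the most recent data distribution
```

In practice, the retraining trigger would come from a drift monitor rather than a fixed loop, and weighted or periodic retraining are alternatives when incremental updates are not suitable for the model class.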
How to Detect Data Drift?
Drift detection is an important stage of the ML model lifecycle for reliable ML performance in production environments. A common approach to detecting data drift is to compare the distributions of the training and production datasets using a nonparametric test, such as the two-sample Kolmogorov–Smirnov test.
Or you can use monitoring solutions like the Censius AI Observability Platform that facilitate setting up custom alerts and thresholds to trigger user notifications. As soon as drift is detected, the platform alerts users and reminds them to take the next course of action, which might include adding new training data, model retraining, or model redevelopment.