A Comprehensive Guide to Types of Data Drift Detection


Posted Nov 22, 2024


Data drift detection is a crucial aspect of machine learning model maintenance. Data drift occurs when the underlying distribution of the data changes over time, affecting the model's performance.

This can happen due to various reasons such as changes in user behavior, new features, or even seasonal variations. For instance, a model trained on data from a specific season may perform poorly during another season.

There are several types of data drift, which can be broadly categorized into three main types: concept drift, covariate shift, and label shift.


Types of Data Drift

Data drift can occur in various forms, and understanding these types is crucial for maintaining the accuracy of your model. There are three main types of data drift: Real Concept Drift, Covariate Shift, and Label Shift.

Real Concept Drift refers to a change in the posterior probability distribution of target labels, indicating a change in the underlying target concept of data. This type of drift requires an adaptation of the model's decision boundary to preserve its accuracy. Covariate Shift, on the other hand, refers to a change in the input data probability distribution, which can manifest itself only in a sub-region of the input space or be implied by the emergence of new attributes.


Label Shift refers to a change in the prior probability distribution of target labels, which can badly affect the prediction performance of the model if the change in the distribution is significant. These three types of data drift can be accompanied by changes in the input data probability distribution, and understanding their differences is essential for developing strategies to combat them.

Here are the main types of data drift:

  • Real Concept Drift: a change in the posterior probability distribution of the target labels
  • Covariate Shift: a change in the input data probability distribution
  • Label Shift: a change in the prior probability distribution of the target labels

Concept

Concept drift is a change in the underlying relation between the data and the label: it's not just the input data changing, but the relationship between the data and the label itself. For example, the rise of high-tech companies changed the job market so that, for programming jobs, job experience became more significant than a degree in computer science.


Concept drift is closely related to the other types of data drift: real concept drift refers to a change in the posterior probability distribution of target labels, covariate shift refers to a change in the input data probability distribution, and label shift to a change in the prior distribution of the labels.

Here is how the three types differ, with examples:

  • Real Concept Drift: change in the posterior probability distribution of target labels, such as a change in the underlying target concept of data.
  • Covariate Shift: change in the input data probability distribution, such as a change in the shopping preferences of customers.
  • Label Shift: change in the prior probability distribution of target labels, such as the emergence of new classes in the distribution.
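On synthetic data, the three types can be told apart by which distribution moves. Here is a toy sketch, assuming NumPy; the distributions and label rules are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reference period: x ~ N(0, 1), and the "concept" is y = 1 when x > 0.
x_ref = rng.normal(0.0, 1.0, 10_000)
y_ref = (x_ref > 0).astype(int)

# Covariate shift: P(x) moves, but the concept y|x stays the same.
x_cov = rng.normal(1.5, 1.0, 10_000)
y_cov = (x_cov > 0).astype(int)

# Real concept drift: P(x) is unchanged, but the decision boundary moves.
x_con = rng.normal(0.0, 1.0, 10_000)
y_con = (x_con > 1.0).astype(int)

# Label shift: the prior P(y) changes, e.g. the positive class becomes rarer.
y_lab = rng.choice([0, 1], 10_000, p=[0.9, 0.1])
```

Under covariate shift only the inputs move, so a model whose decision boundary still matches the concept keeps working; under real concept drift the boundary itself must be adapted.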

Concept drift will almost always require some changes to the model, usually by retraining of the model on newer data. This is because the relationships between the data and the label have changed, and the model needs to adapt to this new information.


Causes

Data drift is a common issue in machine learning systems. It can arise from many factors, such as gradual degradation of the equipment and sensors that a system collects its data from.

Changes due to seasonality can also cause data drift. This is especially true for systems that collect data over long periods of time.

Adversarial activities can also lead to data drift. These activities can be intentional attempts to manipulate the data and throw off the model.


Personal behaviors or preferences can also change over time, leading to data drift. This can be due to a variety of factors, such as changes in demographics or shifts in market trends.

Every ML system operating in a non-stationary environment has to address data drift, either by detecting it or by passively or actively adapting the model.

Tabular

Tabular data drift can be detected using various methods. Deepchecks offers three checks for this purpose.

Feature Drift uses univariate measures to detect changes in the distribution of individual features. Multivariate Drift, on the other hand, uses a domain classifier to detect changes in the relationships between multiple features.

For label distribution drift, Label Drift uses univariate measures. This check is essential when the labels are available.

In cases where labels are not available, Prediction Drift comes in handy. It uses the same methods as Label Drift but on the model's predictions, allowing it to detect possible changes in the label distribution.
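The per-feature logic behind a univariate drift scan can be sketched as follows. This is an illustrative stand-in built on SciPy's two-sample tests, not the Deepchecks implementation; the column names and data are made up:

```python
# Univariate per-feature drift scan: one statistical test per column.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

reference = {
    "age": rng.normal(40, 10, 5_000),                           # numeric
    "plan": rng.choice(["free", "pro"], 5_000, p=[0.7, 0.3]),   # categorical
}
current = {
    "age": rng.normal(45, 10, 5_000),                           # mean shifted: drift
    "plan": rng.choice(["free", "pro"], 5_000, p=[0.7, 0.3]),   # unchanged
}

def feature_drift(ref, cur):
    """Return {feature: (test_name, p_value)}, choosing the test by dtype."""
    results = {}
    for name in ref:
        if np.issubdtype(np.asarray(ref[name]).dtype, np.number):
            _, p = stats.ks_2samp(ref[name], cur[name])
            results[name] = ("ks", p)
        else:
            cats = np.union1d(ref[name], cur[name])
            table = [[np.sum(ref[name] == c) for c in cats],
                     [np.sum(cur[name] == c) for c in cats]]
            _, p, _, _ = stats.chi2_contingency(table)
            results[name] = ("chi2", p)
    return results

results = feature_drift(reference, current)
```

A low p-value flags a drifting feature: here "age" should be flagged while "plan" should not.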

Text


Text data can't be measured for drift directly because it's not structured data. We can use methods like Text Embeddings and Text Properties to represent text as a structured variable, and then measure drift on that variable.

Text Embeddings can find more complex patterns in the text, but these patterns may be difficult to explain. Text Property Drift Check uses properties to measure drift using univariate measures, which can be more explainable.

The Text Embeddings Drift Check uses embeddings to measure drift using a domain classifier. This approach is more effective in capturing complex patterns, but may require more expertise to interpret the results.

To detect data or concept drift in text data, we recommend using both Text Embeddings and Text Property methods. This will provide a more comprehensive understanding of the drift in the data.

Here's a summary of the two methods:

  • Text Embeddings Drift Check: represents the text as embeddings and measures drift with a domain classifier; captures complex patterns, but the results are harder to explain.
  • Text Property Drift Check: represents the text as properties and measures drift with univariate measures; captures simpler patterns, but is more explainable.

For drift in the label's distribution, we can use the Label Drift check, which uses univariate measures. This approach is useful for detecting changes in the distribution of the target variable.

Measuring


Measuring data drift is a crucial step in detecting changes in your data.

To measure data drift, you need to understand which distribution you want to test and check if it's drifting relative to the distribution you choose as your reference distribution. This is not a straightforward task, but it's essential for defining the right drift metrics.

There are various methods to measure data drift, including using the Kolmogorov-Smirnov statistic or Wasserstein metric for continuous numeric distributions, and Cramer's V or Population Stability Index (PSI) for discrete or categorical distributions.

The choice of method depends on the type of data you're working with. For example, if you're dealing with continuous numeric data, the Kolmogorov-Smirnov statistic or Wasserstein metric may be a good choice.

Here are some common methods used to measure data drift:

  • Kolmogorov-Smirnov statistic: used for continuous numeric distributions
  • Wasserstein metric (Earth Mover's Distance): used for continuous numeric distributions
  • Cramer's V: used for discrete or categorical distributions
  • Population Stability Index (PSI): used for discrete or categorical distributions

Ultimately, the key to measuring data drift is to choose the right method for your specific use case and to understand the limitations of each method.
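As a concrete example, PSI is straightforward to implement by hand. This sketch assumes NumPy; binning by reference quantiles is one common convention, and the alert thresholds often quoted in industry (around 0.1 and 0.2) are rules of thumb rather than standards:

```python
# Hand-rolled Population Stability Index (PSI) for a numeric feature.
import numpy as np

def psi(reference, current, n_bins=10, eps=1e-6):
    """PSI = sum((p_ref - p_cur) * ln(p_ref / p_cur)) over reference-quantile bins."""
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range values
    p_ref = np.histogram(reference, bins=edges)[0] / len(reference)
    p_cur = np.histogram(current, bins=edges)[0] / len(current)
    p_ref = np.clip(p_ref, eps, None)              # avoid log(0)
    p_cur = np.clip(p_cur, eps, None)
    return float(np.sum((p_ref - p_cur) * np.log(p_ref / p_cur)))

rng = np.random.default_rng(7)
stable = psi(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
shifted = psi(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000))
```

Here `stable` comes out near zero while `shifted` is substantially larger; note that PSI has no upper bound.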

Types of Data Drift Detection


There are several types of data drift detection methods, but one of the simplest and most common is the univariate measure method. This method involves taking one variable at a time and measuring the difference between newer and older samples of that variable.

For continuous numeric distributions, the Kolmogorov-Smirnov statistic or Wasserstein metric (Earth Mover's Distance) produces the best results. For discrete or categorical distributions, Cramer’s V or Population Stability Index (PSI) is preferred.

This method has the advantage of being simple to use and producing explainable results, but it's limited to checking each feature one at a time and can't detect drift in the relations between features. If drift occurs in multiple features, it will usually detect drift multiple times.
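A multivariate domain classifier covers exactly that gap: train a classifier to distinguish reference samples from newer samples, and treat separability well above chance (AUC near 0.5 means no detectable drift) as evidence of drift. Here is a sketch, assuming scikit-learn; the synthetic data keeps each feature's marginal distribution unchanged but flips the relation between the features:

```python
# Domain-classifier drift detection on a multivariate relation change.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2_000

# Reference window: the second feature follows the first.
x1_ref = rng.normal(0.0, 1.0, n)
ref = np.column_stack([x1_ref, x1_ref + rng.normal(0.0, 0.3, n)])

# Current window: same marginal distributions, but the relation flips sign.
x1_cur = rng.normal(0.0, 1.0, n)
cur = np.column_stack([x1_cur, -x1_cur + rng.normal(0.0, 0.3, n)])

X = np.vstack([ref, cur])
y = np.concatenate([np.zeros(n), np.ones(n)])   # 0 = reference, 1 = current

# Cross-validated AUC of the domain classifier: ~0.5 would mean no drift.
auc = cross_val_score(
    GradientBoostingClassifier(random_state=0), X, y, cv=3, scoring="roc_auc"
).mean()
```

Any univariate test on these two features would pass, yet the classifier separates the two windows easily.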

How to Detect

Detecting data drift can be a complex task, but it's essential to catch changes in your data before they affect your model's performance. The simplest and most common drift detection method is univariate measure, where you take one variable at a time and measure the difference between newer and older samples.


This method is easy to use and produces explainable results, but it has its limitations. For continuous numeric distributions, the Kolmogorov-Smirnov statistic or Wasserstein metric (Earth Mover's Distance) is recommended, while for discrete or categorical distributions, Cramer’s V or Population Stability Index (PSI) is preferred.

Cramer’s V is a good default choice, as it's always in the range [0, 1] and is based on Pearson's chi-squared test. PSI is also widely used in industry, but it has no upper limit and is harder to interpret.

In classification problems with unbalanced data, running the Label Drift or Prediction Drift checks with the default parameters may lead to false negatives. To detect this kind of drift, set the balance_classes parameter to True, which will cause the check to consider all classes equally, regardless of their size.
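A toy calculation shows why balancing matters. This illustrates the idea behind such an option, not the Deepchecks internals; the class frequencies are invented:

```python
# Why unbalanced labels hide drift: the majority class dominates raw distances.
import numpy as np

# Label frequencies: the rare "fraud" class doubles from 1% to 2%.
ref = np.array([0.99, 0.01])    # [normal, fraud] in the reference window
cur = np.array([0.98, 0.02])    # [normal, fraud] in the current window

# Unweighted distance between the distributions: tiny.
tv = 0.5 * np.abs(ref - cur).sum()          # total variation distance

# Treating classes equally: compare per-class frequency ratios instead,
# which exposes that the rare class's frequency doubled.
balanced_gap = np.abs(np.log(cur / ref)).max()
```

The raw distance is about 0.01 and looks negligible, while the per-class view shows a 2x change in the fraud rate, which is exactly the kind of drift the check would otherwise miss.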

Here are some recommended methods for detecting drift in different scenarios:

  • Continuous numeric features: Kolmogorov-Smirnov statistic or Wasserstein metric (Earth Mover's Distance)
  • Discrete or categorical features: Cramer’s V or Population Stability Index (PSI)
  • Drift in the relations between features: multivariate methods, such as a domain classifier
  • Unbalanced classification labels: Label Drift or Prediction Drift with balance_classes set to True

Keep in mind that the univariate approach has its advantages, such as being simple to implement and easy to drill down to drifting features. However, it can be impacted by redundancy and cannot capture multivariate drifts.

To quantify drift for the entire dataset, you can either compute the drift on a per-feature basis and then use some aggregation function to get a single aggregated drift metric, or leverage multivariate methods from the get-go.
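The per-feature aggregation route can be sketched in a few lines; the scores and weights below are invented for illustration:

```python
# Aggregating per-feature drift scores into one dataset-level metric.
import numpy as np

per_feature = {"age": 0.02, "income": 0.31, "tenure": 0.05, "plan": 0.01}
# Optional importance weights (e.g. from the model); use uniform if unknown.
weights = {"age": 0.4, "income": 0.3, "tenure": 0.2, "plan": 0.1}

scores = np.array([per_feature[f] for f in per_feature])
w = np.array([weights[f] for f in per_feature])

max_drift = float(scores.max())               # sensitive to one bad feature
mean_drift = float(scores.mean())             # can dilute a strong drift
weighted_drift = float((scores * w).sum())    # emphasizes important features
```

Max is sensitive to a single badly drifting feature, mean can dilute it, and importance weighting emphasizes the features the model actually relies on.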

LLM


LLM drift is a challenge for Large Language Models: models that aren't continually updated become outdated, limiting their ability to produce results about recent events or recent changes in natural language.

Popular LLMs are frequently updated with new information to avoid these issues, but the updates themselves can change model behavior: performance losses as large as 60% on particular tasks have been reported over time.

Over longer time horizons, natural language itself also changes: new expressions and terms come into use while others are naturally deprecated by speakers.

The performance and behavior of LLMs like GPT-3.5 and GPT-4 can vary greatly over time, with some improvements but also significant degradations occurring within relatively short periods.

The importance of LLM Monitoring has been highlighted, especially for services that deprecate and update their underlying models in an opaque way, causing the so-called prompt drift phenomenon.

LLM migration and deprecation can lead to changes in prompt-injection data at inference, making it essential to test generative apps before migration and develop apps that are somewhat agnostic to the underlying LLM.
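One lightweight way to test a generative app before migration is a prompt regression harness: run a fixed evaluation set against each model version and compare pass rates. The `call_llm` signature and the eval cases below are hypothetical stand-ins for your provider's API:

```python
# Prompt regression harness: fixed prompts with machine-checkable expectations.
import re

EVAL_SET = [
    {"prompt": "Reply with exactly: OK", "expect": r"^OK$"},
    {"prompt": "What is 2 + 2? Answer with the number only.", "expect": r"^4$"},
]

def pass_rate(call_llm, model: str) -> float:
    """Fraction of eval prompts whose response matches its expected pattern."""
    hits = sum(
        bool(re.search(case["expect"], call_llm(case["prompt"], model).strip()))
        for case in EVAL_SET
    )
    return hits / len(EVAL_SET)

# Canned stub standing in for two versions of a real model API:
def fake_llm(prompt: str, model: str) -> str:
    if "OK" in prompt:
        # The newer version "drifted" on output formatting.
        return "OK" if model == "v1" else "Okay!"
    return "4"

# With this stub, "v1" passes both cases while "v2" fails the strict-format
# one; a drop in pass rate across versions is a migration red flag.
```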

Data Drift in Specific Domains


Data drift in specific domains can be particularly challenging to detect and address. It often occurs in domains with rapidly changing user behavior, such as online advertising.

In the finance domain, data drift can be caused by changes in market trends or economic conditions. For example, a sudden shift in interest rates can affect the performance of a model trained on historical data.

Data drift in the healthcare domain can be caused by changes in patient demographics or treatment protocols. This can lead to a model's performance degrading over time.

In the retail domain, data drift can be caused by changes in customer behavior or product offerings. For instance, a company that introduces a new product line may see a change in customer purchasing patterns.

Data drift in the transportation domain can be caused by changes in traffic patterns or road conditions. This can affect the performance of models used for route optimization or traffic forecasting.

Data drift can have significant consequences in these domains, including decreased model accuracy, increased costs, and reduced customer satisfaction.


Addressing Data Drift


Data drift is a reality in machine learning, and it's essential to address it to maintain model performance.

Retraining your model on new data is the most straightforward solution: updating the model with fresh data that better represents the current distribution keeps it up to date. A robust model monitoring setup is also necessary, providing visibility into the current model quality and ensuring timely intervention.

Data drift analysis is a useful technique for model troubleshooting and debugging. By comparing per-feature distributions, you can identify which features have shifted most significantly and visually explore their distributions to interpret the change.

In cases of concept drift, retraining your model on new data is usually necessary. However, retraining can also be beneficial for data drift that caused a change in the label's distribution, but not in the ability to predict the label from the data.
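Putting monitoring and retraining together, a drift-triggered retraining loop can be sketched as follows; the threshold, window size, and the stub `drift_score` and `retrain` functions are illustrative assumptions:

```python
# Drift-triggered retraining: score each batch, refit when a threshold trips.
import numpy as np

DRIFT_THRESHOLD = 0.3   # illustrative; tune per metric and use case
WINDOW_SIZE = 1_000     # how much recent data to keep for retraining

def drift_score(reference, batch):
    """Placeholder univariate score: absolute difference in means."""
    return abs(np.mean(reference) - np.mean(batch))

def retrain(data):
    """Placeholder for refitting the model on recent data."""
    return {"trained_on": len(data)}

reference = np.random.default_rng(0).normal(0.0, 1.0, WINDOW_SIZE)
model = retrain(reference)
window, retrained = [], 0

for step in range(10):
    shift = 0.0 if step < 5 else 1.0           # drift begins at step 5
    batch = np.random.default_rng(step).normal(shift, 1.0, 200)
    window = (window + list(batch))[-WINDOW_SIZE:]
    if drift_score(reference, batch) > DRIFT_THRESHOLD:
        model = retrain(window)                # refit on recent data
        reference = np.asarray(window)         # reset the baseline
        retrained += 1
```

In practice the placeholder score would be one of the measures discussed above (PSI, Kolmogorov-Smirnov, a domain classifier), and retraining would refit the real model rather than a stub.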

Frequently Asked Questions

What is the difference between feature drift and data drift?

Feature drift refers to a change in a single feature's distribution over time, while data drift is a broader change in the overall data distribution. Understanding the difference between these two types of drift is crucial for maintaining the accuracy of machine learning models.

Keith Marchal

Senior Writer

Keith Marchal is a passionate writer who has been sharing his thoughts and experiences on his personal blog for more than a decade. He is known for his engaging storytelling style and insightful commentary on a wide range of topics, including travel, food, technology, and culture. With a keen eye for detail and a deep appreciation for the power of words, Keith's writing has captivated readers all around the world.
