Data labeling is an essential step in building a high-performing machine learning model. Although labeling may look straightforward, it rarely is: organizations encounter several challenges when collecting and selecting data. In this post, we'll look at what redundant data is, how to minimize it, and how to make data labeling more efficient.
What is Redundant Data?
In simple terms, data redundancy means having two similar or near-identical samples in different places within your dataset. To understand the concept, consider the following scenario: you're building a classifier to differentiate between images of a bike and images of a cycle. You already have 200 bike images and are looking for cycle images. Two of your teammates offer their datasets:
- John shares 200 pictures of his cycle.
- Jack has photos of 200 different cycles accumulated over the last year.
Both datasets can make sense depending on the use case, but in general, Jack's dataset is the better choice. Let's see why.
Assume that each image can contribute a certain amount of information to our collection. Images carrying the same information (in this case, the same cycle) contribute less than images carrying new information (in this case, a different cycle), so more varied data improves the efficiency of the ML model. In general, there are two types of redundancy:
- Semantic redundancy: The differences between samples are small or negligible; in simple terms, the samples visually resemble each other, as shown below.
- Scene similarity: The samples share similar scene conditions with only minor differences between video/image frames. This is especially common in video datasets, where a single object or situation may span thousands of near-identical frames.
According to the study The 10% You Don't Need, several public datasets, such as CIFAR-10 and ImageNet, contain at least 10% redundant samples.
Why Should We Minimize Redundant Data?
If we don't examine datasets, our model's performance may suffer in various ways.
Model Performance
The data used to train a model significantly impacts its performance. As the example above shows, a model fed redundant data will perform well in some cases but lack expertise in others. Training on redundant samples harms a model's generalization and accuracy, so careful data selection is essential to reduce redundancy as much as possible.
To increase the effectiveness of your model, you should use a data-centric approach, which focuses on data quality rather than quantity. You can start by improving data labeling, feature engineering, and data augmentation. It will help you reduce data discrepancies and inaccuracies.
Cost and resources
A significant amount of effort and money goes into labeling datasets, so labeling samples that aren't needed is wasteful and expensive. Data-related operations also become more time-consuming and costly as raw data passes through the standard machine learning pipeline.
Check: How you can optimize your ML model with Censius.
How to Filter Redundant Data?
To find redundancies in a dataset, we look at the semantic space of a model pre-trained on the full dataset. The research paper The 10% You Don't Need shares a two-step technique to locate and delete the less useful samples.
The process starts with training the model and generating embeddings. The authors then apply Agglomerative Clustering, using the common cosine distance as the grouping metric, to eliminate nearest-neighbor samples (a minimal sketch of this clustering step follows the questions below). This leaves two issues to address:
- How do we achieve effective embedding?
- How can we speed up the process?
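Before tackling those questions, here is a minimal sketch of the clustering step itself, assuming you already have one embedding vector per image. The embedding array and the 0.1 distance threshold are placeholders rather than values from the paper, and the `metric` argument assumes a recent version of scikit-learn.

```python
# Minimal sketch: near-duplicate removal via agglomerative clustering on embeddings,
# loosely following the approach described in "The 10% You Don't Need".
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# One row per image, e.g. produced by a pre-trained CNN (placeholder data here)
embeddings = np.random.rand(1000, 512)

# Group samples whose cosine distance falls below a chosen threshold
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.1,   # assumed threshold; tune per dataset
    metric="cosine",
    linkage="average",
)
labels = clustering.fit_predict(embeddings)

# Keep a single representative per cluster and drop the near-duplicates
keep = [np.where(labels == c)[0][0] for c in np.unique(labels)]
print(f"Kept {len(keep)} of {len(embeddings)} samples")
```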
Depending on the scenario, there are different methods you can use to reduce data redundancy.
Active learning and sampling methods
Active learning is a class of semi-supervised machine learning techniques that helps practitioners select the most informative samples to label. Approaches to active learning include:
- Membership query synthesis - Generates a synthetic instance and requests a label for it.
- Pool-based sampling - Ranks all unlabeled instances by informativeness and chooses the most informative ones to annotate (see the sketch after this list).
- Stream-based selective sampling - Examines unlabeled instances one at a time and labels or discards each based on its informativeness or uncertainty.
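As an illustration, here is a minimal sketch of pool-based sampling using a least-confidence uncertainty score. The classifier, the placeholder data, and the query batch size of 100 are all illustrative assumptions rather than a prescribed setup.

```python
# Minimal sketch: pool-based active learning with least-confidence sampling
import numpy as np
from sklearn.linear_model import LogisticRegression

X_labeled = np.random.rand(50, 16)          # small labeled seed set (placeholder)
y_labeled = np.random.randint(0, 2, 50)
X_pool = np.random.rand(5000, 16)           # large unlabeled pool (placeholder)

model = LogisticRegression().fit(X_labeled, y_labeled)

# Rank pool samples by how unsure the model is about them
probs = model.predict_proba(X_pool)
uncertainty = 1.0 - probs.max(axis=1)       # least-confidence score per sample
query_idx = np.argsort(uncertainty)[-100:]  # 100 most informative samples to annotate next
```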
Generate embeddings
To filter data redundancies, we need to get a good embedding. We already know that images hold a lot of data in their pixel values. Comparing images pixel by pixel would be a time-consuming task with unsatisfactory results.
Instead, we can construct an embedding for each image using a pre-trained model. An embedding is the output of a deep model that turns an image into a vector of a few thousand values, distilling the information held in millions of pixels. We can use a pre-trained model as a base and fine-tune it with self-supervision on the given dataset.
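As a minimal sketch, here is one way to produce such embeddings with a pre-trained torchvision ResNet-50. The backbone choice and preprocessing values are standard ImageNet defaults, the image path is a placeholder, and any fine-tuned or self-supervised model could stand in for the backbone.

```python
# Minimal sketch: image embeddings from a pre-trained ResNet-50 (torchvision >= 0.13)
import torch
from torchvision import models, transforms
from PIL import Image

# Drop the classification head so the model outputs a 2048-d feature vector
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(image_path):
    """Return a 2048-d embedding for a single image file (path is a placeholder)."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return backbone(img).squeeze(0)
```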
Next, we want a fast embedding-based data selection algorithm, since agglomerative clustering can be inefficient depending on the scenario. There are two families of algorithms we can use:
- Destructive algorithms, which start with the entire dataset and then eliminate samples.
- Constructive algorithms, which start from an empty set and add only relevant examples one by one (a minimal sketch of this approach follows below).
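Here is a minimal sketch of the constructive idea: a sample is kept only if it is not too similar to anything already selected. The 0.95 similarity cutoff is an assumed value, not one taken from the paper.

```python
# Minimal sketch: constructive (greedy) selection of diverse samples
import numpy as np
from sklearn.preprocessing import normalize

def select_diverse(embeddings, max_similarity=0.95):
    """Return indices of samples whose cosine similarity to all kept samples stays below the cutoff."""
    normed = normalize(embeddings)          # unit vectors, so dot product = cosine similarity
    selected = []
    for i, vec in enumerate(normed):
        if not selected:
            selected.append(i)
            continue
        sims = normed[selected] @ vec       # similarity to everything kept so far
        if sims.max() < max_similarity:     # keep only sufficiently novel samples
            selected.append(i)
    return selected

kept = select_diverse(np.random.rand(1000, 512))  # placeholder embeddings
print(f"Kept {len(kept)} samples")
```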
Either way, this selection process makes filtering effective even without using labels. You can read The 10% You Don't Need research paper to learn more.
Calculate similarity
We can use standard similarity measures to determine how similar each image embedding is to the others. Cosine similarity, as implemented in scikit-learn, is a good choice since it's a straightforward method that works well in high-dimensional spaces.
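For example, here is a minimal sketch of flagging near-duplicate pairs with scikit-learn's cosine similarity; the placeholder embeddings and the 0.98 threshold are illustrative assumptions you would tune for your data.

```python
# Minimal sketch: flag near-duplicate pairs by thresholding pairwise cosine similarity
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

embeddings = np.random.rand(500, 512)       # placeholder image embeddings

sim = cosine_similarity(embeddings)         # (n, n) similarity matrix
np.fill_diagonal(sim, 0.0)                  # ignore self-similarity

# Pairs above the threshold are near-duplicate candidates for review or removal
pairs = np.argwhere(np.triu(sim, k=1) > 0.98)
print(f"Found {len(pairs)} near-duplicate pairs")
```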
Several tools and platforms are available to help you detect duplicate or near-identical data. These tools simplify the job and boost model efficiency once you upload your datasets.
- FiftyOne - It is an open-source application for creating high-quality datasets and computer vision models. FiftyOne works seamlessly with your existing tools.
- Lightly AI - It allows you to find and remove redundancy and bias generated by the data collection process to prevent overfitting and enhance the generalization of machine learning models.
Read On: Learn More about Lightly AI
Best Practices
Let's look at a few data labeling best practices that will help you improve your model accuracy.
- Ensure consistent, high-quality data throughout the machine learning lifecycle.
- Remove any noisy samples; more data is not necessarily better.
- Focus on a subset of data using error analysis.
- Ensure label correctness and update them as needed.
With Censius, you can proactively detect and resolve performance regression, poor data quality, and model drifts to build reliable machine learning models. Censius enables you to automate model monitoring to increase model accuracy and grow your business.
Conclusion
Data plays a vital role in any machine learning model. Less noisy and less redundant data make machine learning models more efficient. This article discussed data redundancy, why we should minimize it, and ways to filter it. I hope you liked the article.
Explore how Censius helps you monitor, analyze and explain your ML models
Explore Platform