The Data Does Not Speak For Itself: Data Labeling Deep Dive 1 of 2

A walk-through of why ML development needs data labeling and what its process components are

As a child, you might have encountered the beloved picture of creatures surrounding Adam, waiting patiently to be named. The image captured our childhood fascination because of the diverse array of animals whose names we would try to remember.

Even now, as adults, when we hear these names, they form a picture in our minds. Think of the name Elephant and the image of a tusked, big-eared, dark animal stomps right through your vision. Or think of a parrot and you associate bright plumage and that typical beak with the bird. Each of these creatures is recognized by specific attributes that are tied to its name. That is just what Adam had wanted: to put a label on each animal and remember what it should be called.

Similarly, in Machine Learning (ML) development, you must have encountered data that needed names and supporting information. Be it records of individual information or media like images, the instances belonging to these datasets would need labels. This is particularly true if you plan to use the dataset for supervised or semi-supervised ML.

The development of any ML system depends on the availability of labeled data. Labeled data are datasets whose instances have been pre-classified so that a classifier can learn from them. They can also be data collected in the production environment for performance monitoring of the live model. Data labeling, therefore, is the process of attaching information to the instances used to train, test, and validate an ML model. It is an unavoidable part of an ML pipeline and can be labor-intensive, or costly if labeling services are bought from a third party. However, the labeling process can also be the first opportunity for developers to gain insight into the data.

Why Data Labeling?

A typical ML development pipeline would consist of the following processes:

  • Data collection
  • Data preprocessing
  • Model selection
  • Parameter tuning
  • Evaluation
  • Deployment to a target environment

The data collection stage matters because it ensures that suitable training and testing datasets are available. After all, data is the fuel for ML development. Collection is not just about publishing surveys, reading sensors and network traffic, or approaching a third-party data owner or custodian. The process can involve various tasks such as acquiring data, labeling it, and improving existing data.

The high-level landscape of data collection tasks for ML development. Image source: The author

The task of data acquisition is to find datasets that can be used to train machine learning models. It can be achieved through data discovery, augmentation of an existing dataset, or data synthesis. Augmenting and generating datasets by synthesizing relevant existing data is fast gaining popularity thanks to the availability of generative models.

Acquiring new data and labeling it can be labor-intensive. An alternative approach is to improve the labeling of an existing dataset. This is more practical when the business problem is novel and discovery or augmentation would be difficult. Additionally, piling on more training or testing data may not necessarily improve model performance. In some cases, re-labeling and cleaning the existing data can improve model performance at lower cost and effort.

The task of data labeling can be quite complex depending on the type of data, its dimensionality and size, and the model objective. Examples of labeling tasks include drawing bounding boxes around objects in image datasets and having humans tag text with the sentiment it expresses. While a classifier can start making predictions after being trained on even a small labeled dataset, its accuracy depends on how much information the labels convey.
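To make the two labeling examples above concrete, here is a minimal sketch of what such labels might look like as data structures. The class names, field names, and file path are hypothetical and chosen only for illustration; real annotation tools define their own formats.

```python
from dataclasses import dataclass, field


@dataclass
class BoundingBox:
    """One labeled object in an image, in pixel coordinates."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self):
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


@dataclass
class ImageAnnotation:
    """All bounding boxes drawn on a single image."""
    image_path: str
    boxes: list = field(default_factory=list)


@dataclass
class SentimentAnnotation:
    """A human-assigned sentiment for a piece of text."""
    text: str
    sentiment: str  # e.g. "positive", "negative", "neutral"


# Hypothetical labeled instances.
image_label = ImageAnnotation(
    image_path="images/street_01.jpg",
    boxes=[
        BoundingBox("car", 40, 60, 200, 180),
        BoundingBox("person", 220, 50, 260, 190),
    ],
)
text_label = SentimentAnnotation("The delivery was quick and painless.", "positive")
```

The point of the sketch is that a label is more than a class name: the richer the structure (coordinates, categories, free-text notes), the more information the model can learn from.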

Data Labeling Processes

Your team has acquired an ample amount of data and the next step is to label the individual observations. Well, this is the good part!

As seen in the above image, the labeling strategy depends on whether labels are already present. If the dataset lacks labels altogether, then either manual labeling or weak labeling may be employed. On the other hand, if some labels are available, semi-supervised learning can be used to generate further labels. The data labeling processes can be categorized as follows:

Crowd-based labeling

Historically, manual labeling has been the most accurate approach. A popular example is the labeling of millions of images for ImageNet classification, an ambitious project that was assisted by Amazon Mechanical Turk. However, since Human-In-The-Loop (HITL) techniques can be labor-intensive, innovations have been directed at speeding up the manual work.

While active learning has been the traditional method followed by the ML community, in recent years crowdsourcing has gained favor. Crowdsourcing is driven by workers who may not necessarily be experts but bring the advantage of larger numbers. Gamification can also make crowdsourced labeling faster and more fun.

Active learning

This method assists the human labelers by selecting the unlabeled observations that are deemed most worth labeling. The workers in this case hold a certain degree of expertise. Some consider active learning a special case of semi-supervised learning, distinguished by the HITL factor.
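The selection step can be sketched with uncertainty sampling, one common active learning strategy. The text above does not name a specific strategy, so this choice is an assumption, as are the function names and the example predictions.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` unlabeled items the model is least sure about.

    `predictions` maps an item id to the model's class probabilities for it.
    """
    ranked = sorted(
        predictions,
        key=lambda item_id: entropy(predictions[item_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled images (two classes).
unlabeled_predictions = {
    "img_001": [0.98, 0.02],
    "img_002": [0.55, 0.45],
    "img_003": [0.70, 0.30],
    "img_004": [0.50, 0.50],
}

# The two most ambiguous images are sent to the expert labelers first.
to_label = select_for_labeling(unlabeled_predictions, budget=2)
```

Once the selected items are labeled, the model would be retrained and the selection repeated, so the experts' time goes to the observations where a label helps the most.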


Conversely, crowdsourcing techniques are driven by workers who may not be labeling experts. It is a method that is more about quantity than quality. Since mistakes in labels are expected, there is more focus on interfacing with the workers. After all, it is the workers' understanding of the instructions that influences their labeling accuracy.

The crowdsourcing method can work either through voting, where workers vote for labels, or through explaining and categorization. In the latter case, the rationale behind a label serves as its justification: in explaining, a worker provides the justification, while in categorization, other workers review the provided explanations and mark conflicts where applicable.
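The voting variant can be sketched as a simple majority vote per item. The function name, the tie-handling rule, and the sample votes are assumptions made for illustration; production systems often also weight workers by their past accuracy.

```python
from collections import Counter


def majority_vote(votes):
    """Aggregate one item's worker votes (assumed non-empty).

    Returns (label, agreement), where agreement is the share of workers who
    chose the winning label. On a tie the label is None, so the item can be
    sent back for more votes or for expert review.
    """
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    agreement = top_count / len(votes)
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, agreement
    return top_label, agreement


# Hypothetical sentiment votes from crowd workers.
crowd_votes = {
    "review_1": ["positive", "positive", "negative"],
    "review_2": ["negative", "negative", "negative"],
    "review_3": ["positive", "negative"],
}

aggregated = {item: majority_vote(votes) for item, votes in crowd_votes.items()}
```

The agreement score is a cheap signal of label quality: items with low agreement are natural candidates for the explaining and categorization route.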


Use of existing labels

In certain novel situations, dataset acquisition may not be practical. Since X-ray images of SARS-CoV-2 patients were not readily available during the early stage of the pandemic, researchers used existing images of pneumonia patients. In such cases, it is advisable to improve either the labeling of the dataset or the model training. Here we have listed some common techniques for making use of existing labels.

Improvement of existing data

The labeled data at hand can be improved through the following processes:

Data cleaning:

Data cleaning, as part of preprocessing, has moved beyond its conventional purpose to more advanced functions such as mitigating potential bias in the data and sanitizing it to counter data poisoning. Additionally, some recently developed solutions use rules, correlations, and reference data to extract details through probabilistic modeling.

Re-labeling of the dataset:

A trained model is only as good as its training data. While labeling is crucial to model performance, continually acquiring and labeling new data may not result in higher accuracy. This is particularly true if the labels are noisy. Therefore, the quality of the existing labels should be assessed and, if required, expert workers should be involved to revise them.
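One possible way to assess existing labels is to flag the instances where a model confidently disagrees with the given label and route only those to the experts. This is a sketch under assumptions: the record layout, the function name, and the 0.9 threshold are hypothetical, and the probabilities are assumed to come from a model that did not train on the instance it is scoring.

```python
def flag_suspect_labels(records, confidence_threshold=0.9):
    """Return ids whose given label disagrees with a confident model prediction."""
    suspects = []
    for record in records:
        probs = record["probs"]
        predicted = max(probs, key=probs.get)
        is_confident = probs[predicted] >= confidence_threshold
        if is_confident and predicted != record["label"]:
            suspects.append(record["id"])
    return suspects


# Hypothetical labeled records with held-out model probabilities.
records = [
    {"id": "a", "label": "cat", "probs": {"cat": 0.97, "dog": 0.03}},
    {"id": "b", "label": "cat", "probs": {"cat": 0.04, "dog": 0.96}},
    {"id": "c", "label": "dog", "probs": {"cat": 0.60, "dog": 0.40}},
]

# Only record "b" is queued for expert review; "c" disagrees but not confidently.
review_queue = flag_suspect_labels(records)
```

A flagged instance is not necessarily mislabeled, since the model itself may be wrong. The sketch only narrows down where expert attention is likely to pay off.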

Improvement of the model training

Common methods to improve on model training may include:

Making the model robust against noise and bias

Just like feature values, labels are quite likely to be noisy or even adversarial. If a dataset has only a small number of clean labels, then discarding the noisy ones will leave you with significantly less training data.

Noisy labels could be of two types. Firstly, the unclear content of the images can result in confusing labels. Secondly, errors or mismatches between the image and the supporting text may result in random noise in the labels.

To make a more robust model, the relationships between images, class labels, and label noise can be integrated into the model training. Imbalanced labels can be handled through techniques like SMOTE, which over-samples minority classes. Rather than duplicating existing minority examples, which invites overfitting, SMOTE generates synthetic examples.
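The core idea behind SMOTE-style over-sampling can be sketched in a few lines: pick a minority-class point, pick one of its nearest minority-class neighbors, and create a new point somewhere on the line between them. This is a simplified illustration rather than the reference implementation; the function name, parameters, and sample points are assumptions, and it needs at least two minority points.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create `n_new` synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbors of the chosen point, excluding itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical two-feature minority class with four examples.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority_points, n_new=3)
```

Because every synthetic point lies between two real minority examples, the new data stays inside the region the minority class already occupies instead of repeating the same points.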

Transfer learning

When training data or training time is in short supply, transfer learning has become a preferred technique. While this technique uses the metadata of the models, a promising approach is to extend and apply it to the metadata of the datasets as well.

Weak labeling

Some applications, especially when based on deep learning, require large-sized datasets. In such cases, manual labeling may not be a savory thought. Additionally, a model trained on the labeled data should be able to handle unknown labels as well. Imagine an AI-based picking robot working in a warehouse. A new package shape or unknown lighting condition should not throw it into a tizzy. Therefore weak supervision techniques, where a large number of labels can be generated, are fast becoming popular.

Compared to manual labels, the generated labels may not be as accurate, but they are adequate for training a model.

Data programming

This method is particularly suitable for scalability needs despite the incurred costs. Take the example of a factory, where defects on the products need to be labeled. The potential defects can be of different types and in big numbers.

Data programming is a good candidate in this case and its workflow could look like this:

Crowdsourcing can be used to annotate defects on different component images. The annotations can then be programmatically converted to labeling functions. These labeling functions can either be used to train a generative model or be combined through majority voting. The generative model would output weak labels that can further train the defect detection model.

Example of data programming. Source: A Survey on Data Collection for Machine Learning
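The workflow can be sketched with a few hand-written labeling functions combined by majority voting, the simpler of the two options mentioned above (a generative model would instead learn how much to trust each function). The feature names, thresholds, and function names are hypothetical stand-ins for heuristics derived from crowd annotations.

```python
from collections import Counter

DEFECT, OK, ABSTAIN = 1, 0, -1


# Hypothetical labeling functions; each encodes one heuristic and
# abstains when the heuristic does not apply.
def lf_dark_spot(part):
    return DEFECT if part["dark_area_ratio"] > 0.05 else ABSTAIN


def lf_edge_crack(part):
    return DEFECT if part["edge_discontinuities"] >= 2 else ABSTAIN


def lf_uniform_surface(part):
    return OK if part["texture_variance"] < 0.1 else ABSTAIN


LABELING_FUNCTIONS = [lf_dark_spot, lf_edge_crack, lf_uniform_surface]


def weak_label(part):
    """Combine labeling-function votes by majority; abstain on ties or no votes."""
    votes = [lf(part) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


# Hypothetical image-derived features for three components.
parts = [
    {"dark_area_ratio": 0.08, "edge_discontinuities": 3, "texture_variance": 0.05},
    {"dark_area_ratio": 0.01, "edge_discontinuities": 0, "texture_variance": 0.02},
    {"dark_area_ratio": 0.02, "edge_discontinuities": 1, "texture_variance": 0.40},
]

weak_labels = [weak_label(p) for p in parts]
```

Here the first component is labeled as defective, the second as fine, and the third receives no label because every function abstains. Items left unlabeled are simply excluded from the weakly labeled training set.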

Manual labeling is driven by human intuition and the labelers' understanding of the problem. Data programming, on the other hand, can draw on feature engineering to generate large numbers of weak labels capable of providing higher accuracy.

Fact extraction

This method of generating weak labels is based on information extraction. Databases, even the ones hosted on the internet, contain facts that can be extracted through queries and joins. A fact could be a property of an entity, such as <India, capital, Delhi>. Such an extracted fact can be treated as a label and used further as a seed for data programming. It should also be noted that fact extraction from text can give different output than detection on images.
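Extracting facts through a query and a join can be sketched with an in-memory SQLite database. The table layout and names are assumptions made for illustration; the triple format follows the example above.

```python
import sqlite3

# Build a tiny in-memory database to stand in for an existing data source.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE capitals (country_id INTEGER, city TEXT);
    INSERT INTO countries VALUES (1, 'India'), (2, 'France');
    INSERT INTO capitals VALUES (1, 'Delhi'), (2, 'Paris');
    """
)

# Join the two tables to recover (entity, value) pairs.
rows = conn.execute(
    """
    SELECT c.name, k.city
    FROM countries AS c
    JOIN capitals AS k ON k.country_id = c.id
    ORDER BY c.id
    """
).fetchall()
conn.close()

# Express each row as an <entity, property, value> fact.
facts = [(country, "capital", city) for country, city in rows]
```

Each resulting triple can then serve as a weak label or as a seed for a labeling function.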



We have listed numerous techniques that are applicable depending on the type of dataset, the availability of elbow grease, and the existing labels. The following table summarizes when to choose a certain technique, or perhaps a combination.


We hope that you enjoyed reading this blog and got a clearer picture of the data labeling process. We have also put together a discussion on data labeling solutions.
