The Data Does Not Speak For Itself: Data Labeling Deep Dive 2 Of 2

A walk-through of what data labeling platforms are, how to get the best out of them, and the pitfalls to watch out for

Gatha

In the previous part of this series, we introduced the concepts of data labeling. While the process is critical to machine learning (ML) development, it adds labor and time costs to a project, all the more so when it is not planned around the business problem and the training and testing data required.

Irrespective of the domain, scientists and engineers have to wade through massive amounts of data and make it suitable for further use. According to a study by Cognilytica, about 80% of ML development time can be consumed by data aggregation, cleaning, labeling, and augmentation. Here’s a pie chart that puts numbers to the statement:

*The various tasks that comprise ML development and the typical time allocated to them. Source:* *Cognilytica*

As the distribution above shows, data labeling demands a significant chunk of that time. The process should therefore be planned well, because labeling errors cascade into every subsequent stage of development. The scale of the work also calls for tools and platforms that ease the process.

Enter Data Labeling Platforms

Arduous as the task sounds, a multitude of data labeling platforms are available to ease the roughly 25% of ML development time that labeling takes up. In addition to simplifying the labeling process, these tools foster team collaboration and reuse. The datasets at hand can range from text to media such as images, video, and audio. Without further ado, let us look at how to label data the smart way.
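
To make those percentages concrete, here is a minimal sketch that turns the shares quoted in this post (about 80% for data preparation overall, of which labeling is about 25 percentage points) into hours. The function name and the 1,000-hour project are hypothetical and only serve as an illustration.

```python
def time_budget(total_hours, data_prep_pct=80, labeling_pct=25):
    """Split project hours using the percentage shares quoted in this post."""
    data_prep = total_hours * data_prep_pct / 100
    labeling = total_hours * labeling_pct / 100
    return {
        "data_prep_hours": data_prep,          # aggregation, cleaning, labeling, augmentation
        "labeling_hours": labeling,            # the labeling slice of data preparation
        "other_hours": total_hours - data_prep,  # everything that is not data preparation
    }


budget = time_budget(1000)  # hypothetical 1,000-hour project
print(budget)
```

Even on a modest project, the labeling slice alone is large enough to justify choosing the tooling deliberately.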

Amazon SageMaker Ground Truth

This platform bundles together AWS functionalities for ML development. Apart from labeling, you can also build the model, train and test it, and eventually deploy and manage it.

*The look and feel of Amazon SageMaker. Source:* *SageMaker*

The advantages offered by SageMaker include

Beginner-friendly: Imagine starting off with ML development and finding related tools in one place. This is the benefit of using the AWS bundle that comes along with this platform.
Scalability: The built-in workflow management lets labeling jobs grow with the project.
Well-supported: The platform is well supported through tutorials, FAQs, and reviews by the users.
Vroom-vroom: Since the required tools are available on a single platform, your end-to-end needs can be catered to at a faster pace.

Some of the drawbacks are

User interface: The UI of SageMaker is oriented toward users who have the technical know-how. It is less suitable for analysts who may not have technical expertise.
The monies: The costs associated with the platform are not intuitive, especially if you intend to scale up the project.
Low scope for customization: While it is suitable for getting up and running fast, the SageMaker APIs lack flexibility if your team wishes to run custom training jobs or schedule them alongside other tasks.

IBM Cloud Annotations

Introduced in 2020, this platform uses the power of AI to generate annotations.

The benefits offered by IBM Cloud Annotations include but are not limited to

Sharing: Users can store data on IBM Cloud and collaborate in real time.
Customization: Apart from the templates, the tool allows for enhancements and applications of filters for customization needs.
Accessibility: Suitable for users with different levels of ML knowledge.

The cons of this platform are

The monies: Currently, there are no free-for-use plans.
Support for image types: While it works well for photographs, the platform may not perform well for images such as x-rays, receipts, or hand drawings. Users also need to be careful when working with images in which one dimension is much larger than the other.
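
One way to act on the last point is to flag elongated images before uploading them. The sketch below is only an illustration: the helper name, the sample records, and the 2:1 threshold are assumptions of ours, not limits documented by IBM.

```python
def is_elongated(width, height, max_ratio=2.0):
    """Return True when the longer side exceeds max_ratio times the shorter side."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longer, shorter = max(width, height), min(width, height)
    return longer / shorter > max_ratio


# Hypothetical (file name, width, height) records
images = [
    ("photo.jpg", 1024, 768),
    ("receipt.png", 400, 1800),
    ("banner.jpg", 3000, 600),
]

flagged = [name for name, w, h in images if is_elongated(w, h)]
print(flagged)
```

Flagged images can then be cropped, tiled, or simply reviewed by hand before they reach the annotation queue.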

Google's AI Platform Data Labeling Service

This data labeling and annotation service is part of the AI services bouquet offered by Google.

*The look and feel of the labeling service offered by the Google AI platform. Source:* *Google Cloud AI Platform: Human Data labeling-as-a-Service*

The pros in favor of this service are

Documentation and support: Coming from the Google stable, the documentation is helpful even for beginners
Ease of use with a gentle learning curve
Seamless integration with BigQuery and Google Cloud Services
Variety of bounding box options to label images

Cons include

No free lunches: The services are offered in different price ranges depending on the number of human labelers and annotation units
Vulnerable to bias, since it depends heavily on human labelers
Anomaly detection is missing from the Google AI Platform
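
Since the price depends on the number of human labelers and annotation units, it helps to estimate the bill before committing. The following is a rough sketch of that arithmetic. The function, its parameters, and the price of 35 per 1,000 units are hypothetical placeholders, not Google's actual rates, so check the current pricing page before relying on any number.

```python
def estimate_labeling_cost(num_items, labelers_per_item,
                           price_per_1000_units, units_per_item=1):
    """Estimate cost when every item is labeled by several human labelers."""
    # Each labeler who sees an item consumes that item's annotation units.
    total_units = num_items * units_per_item * labelers_per_item
    return total_units * price_per_1000_units / 1000


# Hypothetical job: 50,000 images, 3 labelers per image, 35 per 1,000 units
cost = estimate_labeling_cost(50_000, 3, 35.0)
print(cost)
```

Note how the number of labelers per item multiplies the cost directly, which is the trade-off between redundancy (and hence label quality) and budget.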

Label Studio

Its creators fondly call Label Studio the Swiss Army knife of data labeling. Let us see how it weighs on the scales of advantages and drawbacks.

Some of the advantages offered by Label Studio are:

Open-source and free to use: The tool can be easily installed using the pip command.
Customizable interface: You can build the labeling interface much as you would build a webpage, except that it is written with Label Studio's own XML-like configuration tags.
Playground: If you are not comfortable with building the interface or would like to work online, then the playground offers many examples.
Varieties galore: Can be used to label text, images, HTML documents, or audio for any combination of annotation tasks like segmentation, classification, regression, etc.
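
To give a feel for the points above: the tool is typically installed with `pip install label-studio`, and a labeling interface is described with a small tag-based configuration. The sketch below shows what a text classification setup might look like, assuming Label Studio's XML-style tags (`View`, `Text`, `Choices`, `Choice`). The sentiment labels are our own example, and the Python code only checks that the configuration is well-formed. It does not talk to a Label Studio server.

```python
import xml.etree.ElementTree as ET

# Example labeling configuration for sentiment classification of text.
# "$text" refers to the "text" field of each imported task.
LABEL_CONFIG = """
<View>
  <Text name="text" value="$text"/>
  <Choices name="sentiment" toName="text">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
    <Choice value="Neutral"/>
  </Choices>
</View>
"""

# Parse the configuration to catch unclosed tags before pasting it into the UI.
root = ET.fromstring(LABEL_CONFIG.strip())
choices = [choice.get("value") for choice in root.iter("Choice")]
print(root.tag, choices)
```

The same configuration can be pasted into the project settings in the web interface, where the preview shows how annotators will see it.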

The platform has certain drawbacks:

It could be a little overwhelming for a non-programmer user.
In a collaborative environment using the SQLite database, the import of a large volume of data by another user may slow down the labeling for others. This problem can be worked around by timing the upload or switching to a different database like PostgreSQL or Redis.
Annotations may be flawed for some audio formats, which then need to be converted to .wav.
PDF files need to be converted to HTML before they can be annotated.
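
For the audio conversion, a common route is the ffmpeg command-line tool. The sketch below only builds the command, assuming ffmpeg is installed and on your PATH. The helper name and the file path are hypothetical.

```python
from pathlib import Path


def wav_conversion_command(source):
    """Build an ffmpeg command that converts an audio file to .wav."""
    src = Path(source)
    target = src.with_suffix(".wav")  # same location, new extension
    return ["ffmpeg", "-i", str(src), str(target)]


cmd = wav_conversion_command("clips/interview.mp3")  # hypothetical file
print(cmd)

# To actually run the conversion (requires ffmpeg to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Running the conversion in a batch before import keeps the annotators from hitting format problems halfway through a task.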

Clarifai

Clarifai is yet another platform that offers a bunch of services, including labeling and annotation. As per Gartner, it has been rated slightly higher than the Google AI platform.

For a detailed comparison, you may read this report.

*What image labeling looks like on Clarifai. Source:* *Clarifai*

The benefits offered by Clarifai include but are not limited to

The interface is easy to build and use
The interpretations of the images are highly precise
Non-English speakers can use the platform as it offers functionalities in other languages as well

Some common complaints by users include:

Limited documentation: The available guides and FAQs are limited and oriented towards developers
The APIs are undergoing continuous improvement, so users have to keep relearning them
Works only for still images and not streams or videos
The monies: The services come at a price. However, a limited-duration free trial is also on offer.

Labellerr

Labellerr is a SaaS solution to answer the data labeling and annotation needs of ML projects.

*A screenshot of Labellerr. Source:* *Product Hunt*

Some of the advantages offered by Labellerr are:

Supports different types of data concerned with natural language processing, computer vision, and speech recognition
Easy to use and intuitive interface
Powerful auto-labeling features
Good customer support

Cons include

No free lunches: The free trial is available for limited use only.
Only available for cloud-based deployment. No desktop, mobile or on-premise availability.

V7 Darwin

This is a powerful labeling and annotation tool for computer vision.

The benefits offered by V7 Darwin include but are not limited to

The tool has a gentle learning curve with an easy-to-understand user interface.
Auto-annotation gives accurate outputs.
Extensive documentation.
The tool also offers the exciting option of plugging in external models.

The platform has certain drawbacks:

Only caters to computer vision requirements.
Only available for cloud-based deployment. No desktop, mobile or on-premise availability.
Not free to use except for a trial version.
The command-line interface can be challenging for beginners.

Labelbox

Labelbox is another SaaS web-based tool popularly used for image annotation.

Some advantages this data labeling tool offers are

Easy setup and customization options.
The tasks are easy to track, and guidelines can be attached to each.
Can handle data scarcity issues.
It is an open-source tool, and a free version with required functions is available

Cons include

Free use is limited to 10,000 annotations.
On-premise installation is possible only through the enterprise edition.
Users have reported glitches in the user interface, slow image uploads, and slow report downloads.

Additional Tools

Apart from those mentioned above, there are many other data labeling solutions like

Computer Vision Annotation Tool (CVAT) by Intel.
Superannotate also lets you automate the AI pipeline.
Tagtog is a text annotation tool that supports various formats.
Playment is a task-based data labeling tool.
Dataturks is another open-source tool for NER, POS tagging, and segmentation.
LightTag can be an answer to your NLP-specific labeling needs.
Figure Eight brings Human-In-The-Loop (HITL) to training data pipelines.
Of course, information extraction from unlabeled data is equally important. Check out this discussion on BatchBALD to know more.

Getting the Best Out of Data Labeling

We will now wrap up this blog by listing potential icebergs that can tank your data labeling ship and how to sail around them:

Vision matters

An organization needs to set out clear goals, the required resources, and performance metrics. Your annotation process can go haywire if the actions are not aligned with those goals.

Additionally, budgeting is an important aspect that would help you decide what tools and technologies to invest in. Lastly, the success of ML development tasks should be gauged based on intermediate metrics and KPIs.

The H of the HITL

Data labeling is still dependent on the human factor. Because the process is labor-intensive, workforce management becomes an important requirement. An organization should therefore invest in training its annotators.

Additionally, collaboration between different teams such as data scientists, annotators, and managers should be encouraged.

Privacy compliance

The introduction of regulations such as GDPR, CCPA, and HIPAA has translated into a greater need for privacy compliance. Tasks like data labeling handle sensitive user information and media, so the organization needs to lay down privacy standards for them.

Moreover, the work of human annotators should be monitored so that bias is not introduced at this stage.
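
One practical privacy measure is to mask obvious personal identifiers before text reaches the annotators. The sketch below is a deliberately simple illustration using regular expressions. The patterns and placeholder tokens are our own assumptions, they will miss many real-world formats, and they are no substitute for a proper compliance review.

```python
import re

# Simplistic patterns for illustration only; real PII detection needs more care.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


sample = "Contact Jane at jane.doe@example.com or +1 (555) 010-9999."
print(redact(sample))
```

Names, addresses, and identifiers inside images or audio need separate handling, which is where the organization's privacy standards come in.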

It is still about the data

Data labeling makes or breaks the quality of the data to be used for training and testing. While workforce management can ensure the quantity of usable data and collaboration, quality assurance is another much-needed practice. In addition to privacy standards, an organization should lay down guidelines for high-quality data and annotations.
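
A simple quality-assurance practice is to have several annotators label the same item and measure how much they agree. The sketch below uses plain majority voting, which is one option among many. The function name, the sample labels, and the 0.6 agreement threshold are hypothetical.

```python
from collections import Counter


def consensus(labels, min_agreement=0.6):
    """Return (winning_label, agreement, needs_review) for one item."""
    if not labels:
        raise ValueError("at least one label is required")
    winner, votes = Counter(labels).most_common(1)[0]
    agreement = votes / len(labels)
    # Items with weak agreement go back to a reviewer instead of the training set.
    return winner, agreement, agreement < min_agreement


# Hypothetical labels from three annotators per image
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "dog"],
    "img_003": ["cat", "dog", "bird"],
}

results = {item: consensus(votes) for item, votes in annotations.items()}
for item, result in results.items():
    print(item, result)
```

Tracking the share of items that need review over time also gives you a ready-made metric for how clear your labeling guidelines are.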

Human workers have so far been the primary drivers of data labeling, and their presence remains essential for accurate and relevant labels. AI can in turn be used to build better AI: in part 1 of this blog, we saw how AI-ML algorithms could help bootstrap the process and further enhance the existing data and models.

As discussed in this story, the workforce could be hesitant about adopting AI in their operations, yet a hybrid approach can be a game-changer. The platforms mentioned above and many of their counterparts have harnessed the power of AI to provide more accurate labels. Organizations should therefore invest in training and tools that foster human-AI collaboration for processes like data labeling as well.
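
A hybrid workflow of this kind often comes down to a routing rule: a model proposes labels, confident predictions are accepted, and the rest go to human annotators. Here is a minimal sketch of that idea. The function name, the sample predictions, and the 0.9 confidence threshold are assumptions for illustration and are not taken from any of the platforms above.

```python
def route_predictions(predictions, threshold=0.9):
    """Split model pre-labels into auto-accepted ones and ones needing human review."""
    auto_accepted, needs_human = [], []
    for item_id, label, confidence in predictions:
        target = auto_accepted if confidence >= threshold else needs_human
        target.append((item_id, label))
    return auto_accepted, needs_human


# Hypothetical model output: (item id, predicted label, confidence)
model_output = [
    ("doc_1", "invoice", 0.97),
    ("doc_2", "receipt", 0.62),
    ("doc_3", "invoice", 0.91),
]

accepted, review_queue = route_predictions(model_output)
print("auto-accepted:", accepted)
print("for human review:", review_queue)
```

Lowering the threshold saves annotator time at the cost of more unchecked labels, so it is worth auditing a sample of the auto-accepted items regularly.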

We hope that you enjoyed reading this blog and getting a clearer picture of the data labeling process. To provide you with easily digestible tidbits of information, we also send out a newsletter that you can sign up for here.
