According to Anaconda's State of Data Science 2022 Report, data professionals spend over 38% of their time on data preparation and cleansing, while model selection, training, and deployment take just 9% of their overall time.
Here is a graphic by Anaconda:
But is this distribution fair for data professionals?
The best approach is to streamline ML projects with the right MLOps tools and practices. In this blog, let's focus on a single stage of MLOps: feature engineering. It is one of the crucial stages in ML projects, and if handled improperly, it can be expensive for data-driven organizations. The significant harms include data scientists' time lost to data preparation, delayed processes, lack of reproducibility, and, as a result, people losing their trust in ML outcomes.
Fortunately, we have you covered with a solution to these problems: feature stores. This blog will discuss feature stores, their types, their benefits, and the options available.
So let's jump right in.
Starting from Scratch: A Quick History of Feature Stores
Uber coined the term "feature store" when it introduced its Michelangelo machine learning platform in 2017. Feature stores helped Uber operationalize its ML projects, and after that, the trend continued with tech giants like Google, AWS, and Databricks. The following timeline highlights the key milestones.
What is a Feature in Machine Learning?
A feature is an input to an ML model that influences the predictions the model delivers. Features are also referred to as independent variables or attributes.
For example, in a credit score prediction model, features might include monthly income and current debt amount.
Feature engineering is the process of generating relevant features from raw datasets. These processes range from simple aggregations to complex feature transformations.
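For instance, a simple aggregation might turn a customer's raw transaction history into model-ready features. Here is a minimal sketch in plain Python; the record fields and feature names are hypothetical, not from any particular library:

```python
from statistics import mean

# Hypothetical raw transaction records for one customer
transactions = [
    {"customer_id": 1, "amount": 120.0, "category": "groceries"},
    {"customer_id": 1, "amount": 45.5, "category": "transport"},
    {"customer_id": 1, "amount": 300.0, "category": "rent"},
]

def engineer_features(txns):
    """Aggregate raw transaction rows into model-ready features."""
    amounts = [t["amount"] for t in txns]
    return {
        "txn_count": len(amounts),    # simple count feature
        "total_spend": sum(amounts),  # simple aggregation
        "avg_spend": mean(amounts),   # average transaction size
        "max_spend": max(amounts),    # largest single transaction
    }

print(engineer_features(transactions))
# {'txn_count': 3, 'total_spend': 465.5, 'avg_spend': 155.16..., 'max_spend': 300.0}
```

Real pipelines perform the same kind of transformation at scale, typically with tools like Spark or pandas rather than hand-rolled loops.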
What is a Feature Store?
A feature store is a centralised platform that stores all features, makes them accessible and reusable when required, and enables easy feature management.
Feature stores can be online or offline, based on the type of data they ingest. Let's have a detailed look:
- Batch data: Data stored in data lakes and data warehouses. Large chunks of data arrive in batches and are not updated in real time. For example, customer data like age, city, and address.
- Real-time data: Data fetched in real time from streaming sources and log events. For example, a bank's transaction data can be fed to the feature store as transactions occur.
Based on the above two data types, feature stores can be of these two types:
- Offline feature stores: These usually ingest and process batch data to build historical features, which are made available to the model training pipeline. Some examples include Apache Hive, IBM Cloud Object Storage, or databases such as PostgreSQL and MySQL.
- Online feature stores: These combine data from offline stores with features preprocessed from streaming data sources, supporting the fastest access to the most up-to-date feature values. Typical implementations include MySQL, Cassandra, Redis, and other low-latency systems.
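The offline/online split can be illustrated with a toy in-memory sketch. This is a simplified illustration only (the class and method names are invented); production stores use the backends listed above:

```python
from collections import defaultdict

class ToyFeatureStore:
    """Toy illustration of the offline/online split: an append-only
    offline log of history plus an online key-value view that holds
    only the latest values for low-latency serving."""

    def __init__(self):
        self.offline = defaultdict(list)  # entity -> full history (training)
        self.online = {}                  # entity -> latest values (serving)

    def ingest(self, entity_id, features, timestamp):
        # The offline store keeps every historical row
        self.offline[entity_id].append({"ts": timestamp, **features})
        # The online store overwrites with the freshest values
        self.online[entity_id] = features

    def get_online(self, entity_id):
        return self.online.get(entity_id)

store = ToyFeatureStore()
store.ingest("user_1", {"avg_spend": 100.0}, timestamp=1)
store.ingest("user_1", {"avg_spend": 150.0}, timestamp=2)
print(store.get_online("user_1"))    # only the latest values
print(len(store.offline["user_1"]))  # full history retained
```

The key design point is that the same ingested features feed both views: training reads the full history from the offline side, while serving reads only the latest snapshot from the online side.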
How Do Feature Stores Benefit ML Pipelines?
Streamline workflows without redundancy
One of the significant advantages of feature stores is avoiding duplication of work. Without a feature store, for example, your workflows might include multiple models that each access the data individually, duplicating everything from pulling data to running transformations for feature extraction. Feature stores avoid this duplication by keeping organized, versioned features that can be reused to train future models.
Bring feature reusability
Time spent on repetitive tasks has long been a burning issue for data science professionals. Feature stores help data teams with this problem: you can create features, store them, and reuse them in later experiments. This greatly reduces the time and effort spent on feature creation.
Ensure consistency between experimentation and production
Feature stores are crucial for closing the gap between the experimentation and production environments. Moving models from experimentation to production often becomes inconsistent when different tools and frameworks are used at each stage. Feature stores help ensure consistent performance in both environments while keeping features accessible and reusable at scale.
Enhance team collaboration
ML projects involve multiple teams working on different models and their associated features. By bringing feature management under a single roof, feature stores help enhance team collaboration.
When to Build or Adopt a Feature Store (and When Not To)
It is recommended to build or adopt a feature store when your enterprise deals with several models based on common entities like customers, users, products, services, etc.
While working on similar applications, it is imperative to build features that can be reused across multiple models. In most cases, data scientists build features and store them in the feature store so that they are later available for other models and computations.
Feature stores constitute an essential component of your MLOps tools. That said, you can avoid using them in case you are at a PoC stage or dealing with a minimal number of models.
Until now, we covered fundamental concepts of feature stores. Now let’s head over to the solutions available and how to choose the right feature store.
Three Popular Feature Stores - An Overview and Comparison
Feast
Feast is an open-source feature store that began as a collaboration between Go-JEK and Google Cloud. It helps ML practitioners create, manage, share, and serve features.
Feast decouples ML from data infrastructure by providing a single data-access layer that abstracts feature storage from feature retrieval. It eliminates the need to manage and deploy dedicated infrastructure by reusing existing components and provisioning new resources as needed. Teams can easily consume existing feature views instead of starting from scratch.
With a centralised registry, Feast aids data science professionals in publishing features and helps engineering teams ship features into production with minimal oversight and organisational friction. It serves feature data to models using these two options:
- A low-latency online store for real-time prediction
- An offline store for batch scoring or model training
Point-in-time feature retrieval is an excellent approach to solving data leakage challenges: when building training sets, each feature value is joined as of the training example's timestamp, so the model never sees information from the future.
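The idea can be sketched as follows: for each training example, look up the most recent feature value recorded at or before the example's timestamp, never a later one. This is a simplified illustration of the concept, not Feast's actual implementation (the function name and data are invented):

```python
def point_in_time_lookup(feature_history, event_ts):
    """Return the latest feature value known at or before event_ts.

    feature_history: list of (timestamp, value) pairs sorted by timestamp.
    Values recorded after event_ts are ignored, preventing leakage of
    future information into the training data.
    """
    result = None
    for ts, value in feature_history:
        if ts <= event_ts:
            result = value  # newest value so far that is not in the future
        else:
            break           # history is sorted; everything after is too new
    return result

# Hypothetical credit-score feature recorded at different times
history = [(10, 620), (20, 650), (30, 700)]

print(point_in_time_lookup(history, event_ts=25))  # 650: latest value as of t=25
print(point_in_time_lookup(history, event_ts=5))   # None: no value known yet
```

Without this timestamp-aware join, a training row at t=25 could accidentally be paired with the value 700 recorded at t=30, inflating offline metrics that the model cannot reproduce in production.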
However, Feast may not be the right choice if:
- You are at an early stage of ML adoption and not yet sure about its business impact
- You are a part of a small team that supports several use cases
- You rely mainly on unstructured data
Hopsworks
Hopsworks is an enterprise-grade feature store that helps manage the transformations, storage, and retrieval/serving of features. It is a data platform for machine learning with a Python-centric feature store and MLOps capabilities.
You can use it as a standalone feature store or for managing, governing, and serving your models. Moreover, you can use it to build and operate feature pipelines and training pipelines.
This modular platform brings collaboration to ML teams, providing a secure, governed option for developing, managing, and sharing ML assets: features, models, training data, batch scoring data, logs, and more.
Hopsworks supports a vast array of infrastructure options, including AWS, Azure, GCP, Kubernetes, and on-premise hardware. It also supports numerous data sources, including Snowflake, Redshift, and HDFS.
Its web UI aids in browsing and exploring existing features. Hopsworks comes in both open-source (free) and paid options.
Tecton
Tecton is an enterprise-grade feature platform built by the creators of Uber's Michelangelo. It enables data professionals to take control of the entire lifecycle of features, from feature creation to deployment.
Tecton is primarily designed to serve as a complete feature platform for enterprises and is gaining traction with a wide array of supported capabilities. Its creators also contribute to and back the open-source feature store Feast.
Tecton supports these capabilities to make feature management more organised:
- Discover, use, monitor, and govern end-to-end feature pipelines
- Design and define ML features
- Register and collaborate on feature definitions using the feature repository
- Transform raw data into fresh feature values
- Store and serve fresh feature values
As a managed solution, Tecton enables seamless feature management and execution of feature pipelines. It is costlier than Feast but compensates with additional capabilities such as autoscaling, disaster recovery, and testing.
The following comparison table covers all three feature stores discussed above.
Build the Right MLOps Stack for You
With ML becoming a necessity for modern enterprises, suitable tools are needed to support every MLOps stage.
Learn more about ML interpretability and MLOps tools to evaluate the options available at different MLOps stages.
Explore how Censius helps you monitor, analyze, and explain your ML models.