In an alternate reality, or this one, a friend asked you to watch their seven-year-old child. And like a typical young one, they were bored and demanded that you play the game of twenty questions with them. Having complete faith in their attention span, you decided to play a game of five questions instead. They wanted you to guess an insect, and this is how the game transpired:
You: "Does it sting people?"
You: "Does it have wings?"
You: "How many legs does it have?"
Child: "It has eight legs. Usually."
Ignoring the last part, you quickly guessed that the child was talking about a spider: An eight-legged insect that does not sting or fly.
In machine learning (ML) development, a model plays a similar game during training and testing. It questions certain attributes and shortlists possible answers based on the attribute values. If the model is for supervised learning, then the output could be class labels or target values. Unfortunately, the attributes, also called features or parameters, are not always this relevant. The training dataset may contain features with more complex values and require some digging and dusting to get useful information, say, the number of wings or eyes.
In the hypothetical game of five questions, you made an incremental discovery of information and used the possessed knowledge to answer an otherwise vast domain problem. This is the crux of feature engineering (FE). The process allows you to discover features that should contribute to strengthening the argument behind predictions or classifications. At its core, it is a technique that works on representing the data to extract information relevant to the ML algorithm.
Feature Engineering Advantages
Apart from the refinement of an algorithm, feature engineering can also help in:
Addressing the curse of dimensionality
The data acquisition process is resource-intensive and may collect as much data as possible. Features that truly contribute to the model output may drown in a sea of similar-meaning yet unnecessary parameters. Additionally, following the commonly cited rule of thumb that requires five training instances per dimension, training your model on an avoidably high-dimensional dataset will require a more extensive input dataset.
Reducing overfitting risks
A model that follows the training data too closely can fail to generalize. Assume that you modeled a predictive algorithm for a drug’s side effects, and one of the features in the collected trial patients’ data checked for dehydration. While the question seems pertinent in isolation, the model also needed to consider the weather when the symptoms were reported. A person taking the drug in winter may not experience dehydration as much as in summer. Feature engineering can thereby help prevent the model from overfitting to the training data.
Feature Engineering Process
Based on the dataset and the availability of resources, feature engineering can be realized in many ways. Some of these processes may be combined to achieve the best results. Given raw data, the first step should be to analyze it and devise features suitable for the candidate ML algorithms. Let us see the types of analysis that drive FE.
Exploratory Data Analysis
Also commonly known as EDA, it is the process that lets you test uncharted waters. The initial exploration of an unknown dataset helps you understand the different columns and their properties. These properties can be uncovered by a statistical summary comprising the mean, median, bounds on the allowed values, variance, and standard deviation.
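As a minimal sketch of such a statistical summary, the snippet below uses pandas on a small made-up table. The column names and values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "legs": [6, 8, 6, 6, 8],
    "wings": [2, 0, 4, 2, 0],
    "stings": [1, 0, 1, 0, 0],
})

# describe() reports count, mean, std, min, quartiles (50% is the median), and max.
summary = df.describe()

# Variance is not part of describe(), so compute it separately (sample variance).
variance = df.var()

print(summary)
print(variance)
```

The `min` and `max` rows give the observed bounds of each column, which you can compare against the values you expect to be allowed.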
By deriving meaningful inferences about the features, you can progress towards establishing relationships between them and the target variable. Some points to remember when you do EDA of a dataset:
- Visualizations derived at this stage help you understand relationships among the features and the target variables. The discovery of outliers and statistical inferences of interest are additional benefits.
- Correlations uncovered during EDA help with the modeling. The variability of a feature could be correlated with the target variable and thereby provide more information to the ML model.
- The process guided by someone with domain knowledge can further refine the assumptions that form the basis of the ML algorithm. If a certain feature does not agree with the expected behavior, it could be an indicator of data sampling or extraction errors. Conversely, you may have analyzed the feature with incorrect assumptions and may need to revise the existing knowledge.
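To make the point about correlations concrete, here is a small sketch that ranks candidate features by their Pearson correlation with a numeric target. The data and the column names (`temperature`, `noise`, `water_intake`) are invented for illustration, and correlation is only a first look, not proof that a feature is useful.

```python
import pandas as pd

# Hypothetical data: two candidate features and a numeric target.
df = pd.DataFrame({
    "temperature": [10, 15, 20, 25, 30, 35],
    "noise": [3, 1, 4, 1, 5, 2],
    "water_intake": [1.0, 1.4, 2.1, 2.4, 3.1, 3.4],  # target column
})

# Pearson correlation of every feature with the target column.
corr_with_target = df.corr(method="pearson")["water_intake"].drop("water_intake")

# Rank features by the absolute strength of the correlation.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```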
Approaching data exploration with distrust can help you understand the causes of unexpected behavior among the features. While certain sets of observations could bring down the performance of an ML model, error analysis can uncover the features that contributed to the error more than others. For instance, consider the heatmap plotted for different cities and their historical features:
The white areas in the plot show that some features have missing values, while the different tones of gray are varying quartiles. The elimination of such features or appropriate preprocessing strategies can improve the model performance.
Additionally, it is not possible to debug ML algorithms like conventional programs. Any divergence from expected behavior indicates potential bugs and faulty feature engineering. Error analysis of feature behavior can therefore help address modeling issues.
The initial identification is achieved by following the intuition of which feature will contribute to the prediction. Indeed, EDA has provided you with some good features. But wait! How do you decide if a particular feature brings more value to the table than the other? It is very simple. You just have to ask these three questions:
- Is the feature informative to a human? Remember that the selected feature would also contribute to model interpretability and error analysis.
- Are the feature values available for the majority of observations? A feature with many missing values cannot contribute to an accurate decision.
- Are the feature values distinguishable to help discriminate among the target classes or correlate to a prediction value?
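The second and third questions can be given a rough first pass in code. The sketch below assumes a hypothetical table with missing values and a class label; the 0.5 coverage threshold is an assumption you would tune for your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values; names are illustrative.
df = pd.DataFrame({
    "wingspan_mm": [12.0, np.nan, 11.5, np.nan, np.nan, 30.0],
    "legs": [6, 6, 6, 8, 8, 8],
    "label": ["insect", "insect", "insect", "spider", "spider", "spider"],
})
features = ["wingspan_mm", "legs"]

# Question 2: fraction of observations where each feature is present.
coverage = df[features].notna().mean()

# Keep features available for a majority of rows (0.5 is an assumed threshold).
well_covered = coverage[coverage > 0.5].index.tolist()

# Question 3: do the per-class averages differ? This is only a rough indicator.
class_means = df.groupby("label")[well_covered].mean()

print(coverage)
print(class_means)
```

The first question, whether a feature is informative to a human, still needs your own judgment or that of a domain expert.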
Representing raw data as features is a decisive process similar to modeling in mathematics and engineering. While domain modeling is different from featurization, the decisions taken in that process are more of a follow-up after the initial features have been identified.
Feature Engineering Techniques
Feature engineering is commonly realized through selection or extraction techniques. The two methods could be used together or individually as per your requirements. This image can further explain the difference between the two:
As can be seen, feature selection techniques choose a subset from the pool of features. The selection is made using three main methods:
- Filter Selection
- This approach filters the dataset and takes only a subset that contains relevant features, e.g., by using a Pearson correlation matrix.
- Wrapper Selection
- This approach evaluates a specific model sequentially using different potential subsets of features. While it is computationally costly and has a higher chance of overfitting, its success rates are promising.
- Embedded Selection
- This approach examines different training iterations of the ML model to rank the importance of each feature e.g., Lasso Regularization.
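The three selection approaches described above can be sketched with scikit-learn as follows. This is one possible illustration on synthetic data, not the only way to do it: `f_regression` (a univariate score derived from Pearson correlation) stands in for the filter, recursive feature elimination (`RFE`) for the wrapper, and a Lasso model inside `SelectFromModel` for the embedded method. The dataset sizes and the `alpha` value are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: 10 features, of which 3 are informative (assumed setup).
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Filter: score each feature on its own, independently of any model.
filter_sel = SelectKBest(score_func=f_regression, k=3).fit(X, y)

# Wrapper: repeatedly fit a model and drop the weakest feature each round.
wrapper_sel = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)

# Embedded: L1 regularization shrinks weak coefficients to zero during training.
embedded_sel = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)

for name, sel in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    # get_support(indices=True) lists the column indices that were kept.
    print(name, sel.get_support(indices=True))
```

Each fitted selector also exposes `transform(X)`, which returns only the selected columns.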
The three feature selection techniques can be pictured as:
The second type of feature engineering, called feature extraction, builds new features using the existing set through different operations. Some common methods of feature extraction that you may have heard of include
- Principal Component Analysis (PCA)
- It is an unsupervised method for dimensionality reduction that offers the additional benefits of noise filtering and visualization for multi-dimensional datasets.
- Independent Component Analysis (ICA)
- This is an ML technique to separate independent sources from a mixed signal. While PCA works on the covariance matrix of the data, ICA focuses neither on the variance among data points nor on the mutual orthogonality of the components.
- Linear Discriminant Analysis
- A dimensionality reduction technique commonly used for supervised classification problems.
- Locally Linear Embedding
- An unsupervised dimensionality reduction method.
- t-distributed Stochastic Neighbour Embedding (t-SNE)
- An unsupervised non-linear dimensionality reduction method that calculates a similarity measure between pairs of instances in the high-dimensional and low-dimensional spaces.
- Autoencoders
- An autoencoder is a neural network architecture capable of discovering structure within data to develop a compressed representation of the input.
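As an example of feature extraction, the following sketch applies PCA from scikit-learn to synthetic data. The data generation (five observed columns driven by two hidden factors) is an assumption made for the illustration, as is the choice of two components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 5 observed columns driven by 2 hidden factors plus small noise.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(100, 5))

# Standardize first, since PCA is sensitive to the scale of each feature.
X_scaled = StandardScaler().fit_transform(X)

# Build 2 new features as linear combinations of the original 5 columns.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
# Share of the total variance retained by the two components.
print(pca.explained_variance_ratio_.sum())
```

Unlike feature selection, the two resulting columns are new features rather than a subset of the originals, so they are harder to interpret directly.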
Feature Engineering Tools
The techniques discussed above can be implemented through tools and libraries. Moreover, similar to other ML development processes, FE too can be realized through automation to reduce the risks of human error or delays. Here, we have curated some tools that you can use to up the FE game of your ML development:
- Apache Superset is an open-source business intelligence application that provides polished visualization capabilities. The insightful visualizations can help teams discern dataset properties and features’ behavior.
- Apache Zeppelin is yet another tool that provides web-based data discovery, visualizations, and real-time collaboration.
- While the Sklearn library requires an ML engineer to spend time on algorithm selection and hyperparameter tuning, auto-sklearn, an AutoML toolkit, automates these same functions.
- While auto-sklearn works with categorical features, AutoGluon from the stables of Amazon caters to tabular data, images, and text. Such AutoML tools reduce the effort typical of manual feature transformations.
- Columbus is a popular feature selection framework for the R development environment.
- Featuretools is a popular open-source Python framework that can automate feature engineering techniques for relational as well as temporal datasets. Its popularity can be attributed to its compatibility with Pandas dataframes and Sklearn APIs.
- The performance of an ML pipeline can degrade if feature engineering logic takes a long time. An automation tool like TPOT searches for the fastest feature preprocessing logic, model selection, and parameter optimization among all of the possible permutations of the pipeline.
- In case you have the specific requirement of tinkering with time series, we suggest that you use the tsfresh Python package for feature extraction.
Feast is a vastly popular feature store. Source: Censius
- Lastly, a discussion on FE is not complete without mentioning feature stores. A feature store is a data system that runs pipelines to automate the construction of features, their storage, and their productionization for monitoring. Feast is an open-source Python library for feature store operations. It also supports a majority of cloud platforms and data sources.
In this blog, we introduced you to feature engineering, which is nothing but a gift that keeps giving. On the one hand, it improves the efficiency of an ML lifecycle; on the other, it protects against potential issues like overfitting and the curse of dimensionality. We also gave you an insight into common methods to implement FE or, better still, employ automation tools to achieve a smooth-running ML pipeline. Thank you for reading.