What is MLlib?
MLlib is Apache Spark's scalable machine learning (ML) library. It fits into Spark's APIs and interoperates with NumPy in Python and with R libraries. Because Spark excels at iterative computation, MLlib's algorithms run fast. MLlib is developed as part of the Apache Spark project and is tested and updated with each Spark release.
MLlib supports classification, regression, clustering, decision trees, random forests, association rules, and sequential pattern mining. It also provides ML workflow utilities for featurization, ML pipeline construction, hyper-parameter tuning, and model evaluation, along with utilities for PCA, SVD, and statistics-oriented tasks such as hypothesis testing and summary statistics.
How does MLlib help?
As part of the Apache Spark project, MLlib helps data engineers on several fronts.
- Works with most technology stacks, since it supports Python, Java, R, and Scala
- Integrates easily with Hadoop data sources such as HDFS and HBase, as well as local files
- Runs on Hadoop YARN, Kubernetes, Mesos, Spark's standalone cluster mode, or in the cloud on EC2
- Highly scalable and flexible
- Offers a comprehensive algorithm library
- Supports tools for building, tuning, evaluating, and saving ML pipelines
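- Excels at iterative computation: the Spark project claims runs up to 100x faster than MapReduce, and MLlib's iterative algorithms can yield better results than the one-pass approximations sometimes used on MapReduce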
- Supports workflows with standardization, feature transformations, and normalizations
Key Features of MLlib
Comprehensive algorithm library
MLlib provides algorithms for regression, clustering, classification, collaborative filtering, and pattern mining. These include alternating least squares (ALS), Gaussian mixture models (GMMs), latent Dirichlet allocation (LDA), random forests and gradient-boosted trees, linear and survival regression, K-means, and association rule mining.
Supports featurization
MLlib offers ML workflow utilities for featurization, which covers feature extraction, feature transformation, feature selection, and dimensionality reduction.
ML pipelines
MLlib standardizes the APIs of its ML algorithms, which makes it easier to combine multiple algorithms into a single pipeline, or workflow. It provides tools for constructing, evaluating, and tuning ML pipelines.
ML persistence
MLlib supports ML persistence: algorithms, models, and pipelines can be saved to storage and loaded back later.
Utility support
MLlib supports several ML workflow utilities that enable feature transformations, model evaluation, hyper-parameter tuning, and ML pipeline construction. It also includes utilities for distributed linear algebra and statistics.