Jun 2020
Apache-2.0 License

What is MLlib?

MLlib is Apache Spark's scalable machine learning (ML) library. It fits into Spark's APIs and interoperates with NumPy in Python and with R libraries. Spark's strength in iterative computation, combined with high-quality algorithms, lets MLlib run fast. It is developed as part of the Apache Spark project and is tested and updated with each Spark release.

MLlib supports classification, regression, clustering, decision trees, random forests, association rules, and sequential pattern mining. It also provides ML workflow utilities for featurization, ML pipeline construction, hyper-parameter tuning, and model evaluation, along with utilities for PCA, SVD, and statistics-oriented tasks such as hypothesis testing and summary statistics.

How does MLlib help?

As part of the Apache Spark project, MLlib helps data engineers on several fronts.

  • Works with most technology stacks, since it supports Python, Java, R, and Scala
  • Integrates with Hadoop data sources such as HDFS and HBase, as well as local files
  • Runs on Hadoop YARN, Kubernetes, Mesos, and standalone clusters, including on EC2
  • Highly scalable and flexible
  • Offers a comprehensive algorithm library
  • Provides tools for building, tuning, evaluating, and saving ML pipelines
  • Excels at iterative computation: the Spark project claims speeds up to 100x faster than MapReduce, with iterative algorithms that can yield better results than the one-pass approximations sometimes used on MapReduce
  • Supports workflows with standardization, normalization, and other feature transformations

Key Features of MLlib

Colossal algorithm library

MLlib provides a comprehensive algorithm library for regression, clustering, classification, collaborative filtering, and pattern mining. It includes alternating least squares (ALS), Gaussian mixture models (GMMs), latent Dirichlet allocation (LDA), random forests and gradient-boosted trees, linear and survival regression, K-means, and association rule mining.

Supports featurization

MLlib offers ML workflow utilities for featurization, covering feature extraction, feature transformation, feature selection, and dimensionality reduction.

ML pipelines

MLlib offers uniform APIs for ML algorithms, which makes it easier to combine multiple algorithms into a single pipeline or workflow. It provides tools for constructing, evaluating, and tuning ML pipelines.

ML persistence

MLlib supports ML persistence: algorithms, models, and pipelines can be saved to storage and loaded back later.

Utility support

MLlib supports several ML workflow utilities that enable feature transformations, model evaluation, hyper-parameter tuning, and ML pipeline construction. It also includes utilities for distributed linear algebra and statistics.

