Horovod

Category: Modeling

Released: Dec 2018  •  Documentation  •  License: Apache-2.0

GitHub open issues: 250
GitHub stars: 11,866
GitHub last commit: 19 Nov
Stack Overflow questions: 38

What is Horovod?

Horovod is Uber’s open-source framework for distributed deep learning. It supports the major deep learning toolkits, including TensorFlow, Keras, PyTorch, and Apache MXNet, and lets an existing training script run on hundreds of GPUs with only a few lines of Python code. Horovod can be installed on-premises or run on cloud platforms such as AWS, Azure, and Databricks.

Horovod can also run on top of Apache Spark, unifying data processing and model training in a single pipeline. The same infrastructure can be used to train models with PyTorch, TensorFlow, or MXNet, and switching between them is straightforward. Horovod accelerates distributed training and includes several built-in optimizations to make it faster still.
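As a rough sketch of what running Horovod on Spark can look like (the function name `train` and `num_proc=4` are illustrative, and an active SparkSession with Horovod installed on the executors is assumed):

```python
# Minimal sketch: launch a Horovod-aware training function on a Spark cluster.
# Assumes a SparkSession is already running and executors have Horovod plus
# the chosen deep learning framework installed.
import horovod.spark
import horovod.torch as hvd


def train():
    # Runs once per Spark task; each task becomes one Horovod worker.
    hvd.init()
    print(f"worker {hvd.rank()} of {hvd.size()}")
    # ... build the model and run the usual Horovod training loop here ...


# num_proc controls how many parallel workers (Spark tasks) are started.
horovod.spark.run(train, num_proc=4)
```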

How does Horovod help?

Horovod simplifies scaling a single-GPU training script to many GPUs running in parallel. It focuses on two things: how much code has to change, and how fast the resulting distributed training runs.

Horovod builds on Message Passing Interface (MPI) concepts, so training scripts scale with minimal code changes, unlike earlier approaches such as Distributed TensorFlow with parameter servers. Once adapted, the same script runs on a single GPU, multiple GPUs, or even multiple hosts without further code changes, as the sketch below illustrates.
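A hedged sketch of the handful of additions Horovod typically asks for in a PyTorch script; the model and optimizer here are placeholders, not code from this page:

```python
# Sketch of the few Horovod-specific lines added to an ordinary PyTorch
# training script; the model/optimizer setup is placeholder code.
import torch
import horovod.torch as hvd

hvd.init()                                   # 1. initialize Horovod
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())  # 2. pin each process to one GPU

model = torch.nn.Linear(10, 1)               # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# 3. wrap the optimizer so gradients are averaged across workers
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# 4. make sure every worker starts from the same weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# ...the rest of the training loop stays unchanged...
```

The same script is then launched across processes with the `horovodrun` command, for example `horovodrun -np 4 python train.py`.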

Installing MPI and NCCL is not entirely hassle-free, but it is a one-time job for whoever sets up the infrastructure; the rest of the team can then scale their ML training scripts seamlessly.

Key Features of Horovod

Stand-alone Python package

Horovod is a stand-alone Python package that implements the ring-allreduce algorithm without requiring an upgrade to the latest version of TensorFlow or patches to the installed version. Installation takes anywhere from a few minutes to about an hour, depending on the hardware.

Effective NCCL implementation

Horovod replaced the Baidu ring-allreduce implementation with NVIDIA’s NCCL library, which provides highly optimized collective communication, including an optimized ring-allreduce. NCCL 2 also allows ring-allreduce to run across multiple machines, improving performance further.

MPI concepts

Horovod’s core principles are based on Message Passing Interface (MPI) concepts such as rank, size, local rank, allreduce, allgather, alltoall, and broadcast.
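These concepts map directly onto the Horovod API. A small illustrative snippet (the tensors are made up for the example):

```python
# Illustrative use of Horovod's MPI-style primitives with PyTorch tensors.
import torch
import horovod.torch as hvd

hvd.init()

# rank = process id, local_rank = id on this host, size = total worker count
print(hvd.rank(), hvd.local_rank(), hvd.size())

x = torch.ones(3) * hvd.rank()            # a toy per-worker tensor
avg = hvd.allreduce(x)                    # element-wise average across workers
gathered = hvd.allgather(x.view(1, -1))   # stack every worker's tensor
y = hvd.broadcast(x, root_rank=0)         # copy rank 0's tensor to all workers
```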

Supported frameworks

Horovod supports TensorFlow, Keras, PyTorch, and Apache MXNet, and works with TensorFlow’s XLA compiler.

Efficiency

Horovod scales training to hundreds of GPUs with upwards of 90% scaling efficiency while staying easy to use and portable across environments.

Companies using Horovod

The Linux Foundation, NVIDIA, Uber
