What is Horovod?
Horovod is Uber’s open-source framework for distributed deep learning. It supports the major deep learning toolkits, including Keras, TensorFlow, PyTorch, and Apache MXNet, and lets an existing single-GPU training script run on hundreds of GPUs with only a few lines of Python changes. It can be installed on-premises or run on cloud platforms such as AWS, Azure, and Databricks.
Horovod can also run on top of Apache Spark, unifying data processing and model training on the same infrastructure while letting teams switch among PyTorch, TensorFlow, and MXNet. It also ships with several optimization methods that make distributed training faster.
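As a minimal sketch of the Spark integration, the snippet below uses Horovod's horovod.spark.run API to launch one Horovod worker per Spark task. The trivial train function and the worker count of 4 are placeholders for illustration, and an active SparkSession is assumed:

    import horovod.spark
    import horovod.torch as hvd

    def train():
        # Each Spark task runs one copy of this function as a Horovod worker.
        hvd.init()
        return hvd.rank()

    # Launch 4 Horovod workers on the current Spark cluster and collect
    # each worker's return value on the driver, ordered by rank.
    ranks = horovod.spark.run(train, num_proc=4)
    print(ranks)  # e.g. [0, 1, 2, 3]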
How Does Horovod Help?
Horovod simplifies scaling a single-GPU training script to run on many GPUs in parallel. Two aspects matter most here: how much code must change, and how fast the distributed training runs.
Horovod builds on Message Passing Interface (MPI) concepts, which let it scale training scripts with minimal code changes, unlike earlier solutions such as Distributed TensorFlow with parameter servers. Once modified, the same script can run on a single GPU, multiple GPUs, or even multiple hosts without further code changes, as the sketch below illustrates.
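As a rough illustration (not Horovod's official tutorial code, though the calls used are standard Horovod API), here are the typical changes for a PyTorch script; the linear model is a stand-in for an existing model:

    import torch
    import torch.nn as nn
    import horovod.torch as hvd

    hvd.init()                                   # one worker process per GPU
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())  # pin this process to one GPU

    model = nn.Linear(10, 1)                     # stand-in for an existing model
    # Scale the learning rate by the number of workers (a common heuristic).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

    # Wrap the optimizer so gradients are averaged across workers via allreduce.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())

    # Start every worker from rank 0's initial weights and optimizer state.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)
    # The training loop itself is unchanged.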
Although installing MPI and NCCL is not hassle-free, it is a one-time job, after which the rest of the team can scale their ML training scripts seamlessly.
Key Features of Horovod
Stand-alone Python package
Horovod is a stand-alone Python package that brings the ring-allreduce algorithm to existing frameworks without requiring an upgrade to the latest version of TensorFlow or patches to the installed one. Installation takes anywhere from a few minutes to an hour, depending on the hardware.
Effective NCCL implementation
Horovod replaced Baidu’s ring-allreduce implementation with NVIDIA’s NCCL library, which provides highly optimized collective communication primitives, including an optimized version of ring-allreduce. NCCL 2 added the ability to run ring-allreduce across multiple machines, improving performance further.
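To check whether a given Horovod build was compiled with NCCL (and MPI) support, the library exposes introspection helpers; a quick sketch:

    import horovod.torch as hvd

    hvd.init()
    # Report which communication backends this Horovod build includes.
    print("NCCL built:", hvd.nccl_built())
    print("MPI built:",  hvd.mpi_built())
    print("Gloo built:", hvd.gloo_built())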
MPI concepts
Horovod’s core principles are based on Message Passing Interface (MPI) concepts such as rank, size, local rank, allreduce, allgather, alltoall, and broadcast; a short demonstration follows.
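The snippet below sketches what a few of these primitives look like in Horovod's PyTorch API (the tensor values are arbitrary examples):

    import torch
    import horovod.torch as hvd

    hvd.init()
    # rank: this worker's global index; size: total number of workers;
    # local_rank: this worker's index on its own host.
    print(f"rank {hvd.rank()} of {hvd.size()}, local rank {hvd.local_rank()}")

    # allreduce: average (by default) a tensor across all workers.
    avg = hvd.allreduce(torch.tensor([float(hvd.rank())]))

    # broadcast: copy rank 0's tensor to every worker.
    shared = hvd.broadcast(torch.tensor([42.0]), root_rank=0)

    # allgather: concatenate each worker's tensor along dimension 0.
    gathered = hvd.allgather(torch.tensor([[float(hvd.rank())]]))

Such a script is typically launched with Horovod's CLI, for example horovodrun -np 4 python demo.py to start four workers.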
Supported frameworks
Horovod supports TensorFlow (including XLA), Keras, PyTorch, and Apache MXNet.
Efficiency
Horovod scales training to hundreds of GPUs with upwards of 90% scaling efficiency, combined with easy-to-use mechanisms and portability.