Communication Backends: Raw Performance Benchmarking

Distributed learning requires workers to collaborate by swiftly sharing learned information with their “colleagues”. With the accelerating growth of model sizes in modern deep learning, this aspect becomes even more important.

MLBench supports both one and multiple processes per node, in addition to multi-node training. Communication between workers is crucial and heavily affects performance, notably for communication-bound training algorithms.

This blog post analyzes the raw performance of different communication backends on commodity networking hardware when transmitting large arrays or tensors.

Communication Backends

Currently, MLBench supports 3 communication backends out of the box:

- MPI (Message Passing Interface)
- GLOO
- NCCL (NVIDIA Collective Communications Library)

Each backend has its own benefits and disadvantages and is designed for specific use cases, and those differences are reflected in the results.

Differences

The following table illustrates the main differences between the 3 backends.

| Backend | Comm. functions                                      | Optimized for | Float32 | Float16 |
|---------|------------------------------------------------------|---------------|---------|---------|
| MPI     | All                                                  | CPU, GPU      | Yes     | No      |
| GLOO    | All (on CPU); broadcast & all-reduce (on GPU)        | CPU           | Yes     | Yes     |
| NCCL    | broadcast, all-reduce, reduce and all-gather (on GPU)| GPU only      | Yes     | Yes     |
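
For reference, these communication functions correspond directly to collectives in PyTorch's torch.distributed API. The sketch below is a minimal illustration of the four collectives named in the table; it assumes a process group has already been initialized (see below) and that the tensor lives on the right device for the chosen backend.

import torch
import torch.distributed as dist

def run_collectives(tensor):
    # Assumes dist.init_process_group(...) has already been called.
    world_size = dist.get_world_size()

    # broadcast: rank 0 sends its tensor to every other worker.
    dist.broadcast(tensor, src=0)

    # all-reduce: every worker ends up with the element-wise sum of all tensors.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)

    # reduce: only rank 0 receives the summed result.
    dist.reduce(tensor, dst=0, op=dist.ReduceOp.SUM)

    # all-gather: every worker receives a copy of every worker's tensor.
    gathered = [torch.empty_like(tensor) for _ in range(world_size)]
    dist.all_gather(gathered, tensor)
    return gathered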

As we can see, each backend has at least one clear advantage and is best suited to specific situations.

It is also important to note that PyTorch ships with GLOO and NCCL support included, so using one of those two is usually more convenient. The MPI backend, on the other hand, requires MPI to be compiled and installed on the machine, with PyTorch built from source against it.
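
For concreteness, here is a minimal sketch of how one of the three backends is typically selected when initializing PyTorch's process group. This is not MLBench's actual launch code; it assumes the usual environment-variable rendezvous (MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE set by the launcher) for GLOO and NCCL.

import torch.distributed as dist

def init_backend(backend="gloo"):
    # Availability depends on how PyTorch was built.
    available = {
        "mpi": dist.is_mpi_available(),
        "gloo": dist.is_gloo_available(),
        "nccl": dist.is_nccl_available(),
    }
    if not available[backend]:
        raise RuntimeError(f"Backend '{backend}' is not available in this PyTorch build")

    if backend == "mpi":
        # The MPI backend gets its rank and world size from the MPI runtime (e.g. mpirun).
        dist.init_process_group(backend="mpi")
    else:
        # GLOO and NCCL read MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment.
        dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_rank(), dist.get_world_size()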


Experiments

In order to evaluate the performance of communication backends, we have created a dummy task called Task 0, which repeatedly sends random tensors of increasing sizes using an all-reduce operation. This means that each worker shares its tensor with all other workers and ends up with the sum of all the tensors.

The time taken by this operation is measured 100 times for each tensor size on each worker, and the measurements are averaged to obtain a statistically meaningful estimate of communication time.
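
A minimal sketch of such a timing loop is shown below. It is not MLBench's actual Task 0 code; the barrier before each measurement and the explicit torch.cuda.synchronize() for GPU tensors (CUDA collectives return asynchronously) are assumptions about how one would time this fairly.

import time
import torch
import torch.distributed as dist

def time_all_reduce(numel, device="cpu", dtype=torch.float32, repeats=100):
    # Random tensor of the requested size, dtype and device.
    tensor = torch.rand(numel, device=device).to(dtype)
    times = []
    for _ in range(repeats):
        dist.barrier()                    # make sure all workers start together
        start = time.perf_counter()
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
        if tensor.is_cuda:
            torch.cuda.synchronize()      # wait for the asynchronous GPU collective to finish
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)        # average time for this tensor size on this worker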

To obtain those results, we have used the following hardware/software:

Results

As stated above, we compare communication times for different tensor types and backends.

There are 4 tensor types: Float16 and Float32 tensors, on either CPU or GPU.

CPU vs GPU tensors?

MPI and GLOO support communication of both CPU and GPU tensors, while NCCL only supports GPU tensors. CPU support is a great advantage, as CPU training is less costly and can also be sped up using distributed training.
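
In practice, the difference only affects where the tensor must live before the collective call, as the hypothetical helper below illustrates (the tensor size is arbitrary and the process group is assumed to be initialized with the corresponding backend).

import torch
import torch.distributed as dist

def all_reduce_example(backend):
    if backend in ("gloo", "mpi"):
        # GLOO and MPI can reduce tensors directly from host (CPU) memory.
        tensor = torch.rand(1_000_000, dtype=torch.float32)
    else:
        # NCCL only handles GPU tensors, so the data must live on the GPU.
        tensor = torch.rand(1_000_000, dtype=torch.float32, device="cuda")
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor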

CPU

In the graph below, we compare the time taken to perform an all-reduce operation of Float16 and Float32 CPU tensors between 2, 4 and 8 workers.

[Figure: Backend performance comparison (CPU tensors)]

Key differences

GPU

We now compare the times for GPU tensors. Here, NCCL joins the comparison.

[Figure: Backend performance comparison (GPU tensors)]

Key differences

Comparison

The results obtained clearly depict the different use cases for each backend, and how each could be used to fulfill one's needs:

How to run

The code for this benchmark is available here. The Docker image can be pulled with docker pull mlbench/pytorch-backend-benchmark:latest and can also be used to benchmark any other backend.

To benchmark a custom backend, it must first be installed in the image. For that, simply modify the Dockerfile and rebuild the image.