Comparison of MLBench and MLPerf

MLPerf is a broad benchmark suite for measuring the performance of machine learning (ML) software frameworks, hardware platforms, and cloud platforms.

In this post, we will highlight the main differences between MLBench and MLPerf.

Key Advantages of MLBench

Results reporting

Both MLBench and MLPerf use end-to-end time to accuracy as their primary metric. MLBench, however, also reports how much of that time was spent on communication and how much on computation, and its results are generally finer-grained than MLPerf's. While MLPerf allows distributed training, it does not distinguish results obtained from single-node and multi-node training, nor between a single GPU and multiple GPUs per node. The number of nodes and GPUs per node are reported, but there is no fine-grained breakdown of the time spent on communication versus computation: MLPerf reports a single number, time to accuracy, for every scenario.

As a consequence, MLPerf cannot accurately show the effects of scaling the number of nodes or GPUs, nor pinpoint why a model's performance improved or degraded. MLBench, by contrast, is fully focused on distributed training: it can show the effects of scaling, identify performance bottlenecks, and accurately measure how hyperparameters affect individual components such as communication and computation. In this respect, MLBench offers a more powerful and versatile benchmarking suite than MLPerf.
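To make the distinction concrete, below is a minimal sketch of our own (not code from either benchmark suite) of how a distributed training loop can time computation and communication separately. It assumes a PyTorch model and an already-initialized torch.distributed process group, with gradients averaged manually via all_reduce.

```python
import time
import torch
import torch.distributed as dist

def train_epoch(model, optimizer, loader, device):
    """One epoch that separately accumulates computation and communication time.

    Assumes dist.init_process_group() has already been called and that
    gradients are averaged manually with all_reduce.
    """
    compute_time, comm_time = 0.0, 0.0
    world_size = dist.get_world_size()

    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)

        # --- computation: forward and backward pass ---
        t0 = time.perf_counter()
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        if device.type == "cuda":
            torch.cuda.synchronize()  # make GPU timings meaningful
        compute_time += time.perf_counter() - t0

        # --- communication: average gradients across workers ---
        t0 = time.perf_counter()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad)  # sum gradients over all workers
                p.grad /= world_size     # turn the sum into an average
        if device.type == "cuda":
            torch.cuda.synchronize()
        comm_time += time.perf_counter() - t0

        optimizer.step()

    return compute_time, comm_time
```

Summing the two counters recovers an MLPerf-style end-to-end time, while keeping them separate is what allows a slowdown to be attributed to the network rather than to the model itself.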

Hyperparameter tuning

MLPerf restricts the set of hyperparameters that may be tuned, and it allows submitters to borrow hyperparameter values from other submissions. MLBench currently goes further and provides exact values for all hyperparameters, so that submissions differ only in system performance rather than in the training recipe.
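For illustration, a fixed configuration in this spirit might look like the following sketch; the parameter names and values are hypothetical (loosely modelled on a common ResNet/CIFAR-10 recipe), not the official task definitions of either suite.

```python
# Hypothetical fixed hyperparameters for an image-classification task.
# In the MLBench approach every value is pinned, so two submissions can
# differ only in hardware and systems software, never in the training recipe.
FIXED_HYPERPARAMETERS = {
    "optimizer": "SGD",
    "batch_size_per_worker": 128,   # fixed, not tunable
    "learning_rate": 0.1,
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "lr_schedule": "multistep",
    "lr_decay_epochs": [82, 109],
    "num_epochs": 164,
}

# Under MLPerf-style rules, a subset of these (e.g. the learning rate and
# batch size) could instead be tuned or borrowed from another submission.
```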

Benchmark Suites

In the table below, "-" means the suite does not cover the task, and "TODO" marks tasks that MLBench has planned but not yet implemented.

| Benchmark | Dataset (MLBench) | Dataset (MLPerf) | Quality Target (MLBench) | Quality Target (MLPerf) | Reference Model (MLBench) | Reference Model (MLPerf) | Frameworks (MLBench) | Frameworks (MLPerf) |
|---|---|---|---|---|---|---|---|---|
| Image classification | CIFAR10 (32x32) | - | 80% Top-1 Accuracy | - | ResNet-20 | - | PyTorch, TensorFlow | - |
| Image classification | ImageNet (224x224) | ImageNet (224x224) | TODO | 75.9% Top-1 Accuracy | TODO | ResNet-50 v1.5 | TODO | MXNet, TensorFlow |
| Object detection (lightweight) | - | COCO 2017 | - | 23% mAP | - | SSD-ResNet34 | - | TensorFlow, PyTorch |
| Object detection (heavyweight) | - | COCO 2017 | - | 0.377 Box min AP, 0.339 Mask min AP | - | Mask R-CNN | - | TensorFlow, PyTorch |
| Language modelling | Wikitext2 | - | Perplexity <= 50 | - | RNN-LM | - | PyTorch | - |
| Translation (recurrent) | WMT16 EN-DE | WMT English-German | 24.0 BLEU | 24.0 BLEU | GNMT | GNMT | PyTorch | TensorFlow, PyTorch |
| Translation (non-recurrent) | WMT17 EN-DE | WMT English-German | 25.0 BLEU | 25.0 BLEU | Transformer | Transformer | PyTorch | TensorFlow, PyTorch |
| Reinforcement learning | - | N/A | - | Pre-trained checkpoint | - | Mini Go | - | TensorFlow |