Google Cloud Tutorial

This tutorial guides you through setting up MLBench in a Google Cloud Kubernetes Engine cluster and explains basic MLBench functionality. For setup in other environments, please refer to our installation documentation. We use Google Cloud as an example, but MLBench runs in any Kubernetes cluster.

Please be aware of any costs that might be incurred by running this tutorial on Google Cloud. Costs should usually only be on the order of 5-10 USD. We do not take any responsibility for costs incurred.

Prerequisites

This tutorial assumes you have a Google Cloud account with permissions to create a new cluster. You also need Python, Git, and Docker installed locally, and the Docker daemon must be running.
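
You can quickly verify that these tools are available, and that the Docker daemon is up, from a terminal (docker info fails if the daemon is not running):

$ python --version
$ git --version
$ docker info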

Check out the mlbench-helm GitHub repository and have a terminal open in the checked-out mlbench-helm directory.

$ git clone git@github.com:mlbench/mlbench-helm.git

Enter the newly created directory:

$ cd mlbench-helm

Installing MLBench

MLBench can be installed with the google_cloud_setup.sh script.

Note: n1-standard-2 instances have 2 CPU cores each. However, because Google Kubernetes Engine runs its own monitoring and management pods, which also use some CPU, it is advisable to set MLBench to use one core less than is available on each node.

First, create a GKE cluster:

$ NUM_NODES=4 NUM_CPUS=1 ./google_cloud_setup.sh create-cluster

and then install the helm chart:

$ NUM_NODES=4 NUM_CPUS=1 ./google_cloud_setup.sh install-chart

That’s it, this should set up MLBench in your Google Kubernetes Engine cluster. The Dashboard URL can be found at the end of the output of the last command (e.g. http://172.16.0.1:32145).

Simply open the URL in your browser and you should be ready to go.
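
If the dashboard is not reachable, or you lose the URL, you can inspect the cluster from the command line. Listing the services and nodes lets you recover the dashboard NodePort and a node's external IP (the exact service names depend on the Helm release, so check the listing):

$ kubectl get pods
$ kubectl get svc
$ kubectl get nodes -o wide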

You can set many more options for the google_cloud_setup.sh script, such as adding GPUs to the nodes or giving the cluster a custom name. To see all available options, execute the help command:

$ ./google_cloud_setup.sh help

Using MLBench

Once you open the dashboard URL, you will be greeted by a screen similar to this:

The MLBench Dashboard

This shows all worker nodes currently in use (2 by default) along with their state and resource usage. The workers are continuously monitored and the display is updated accordingly.

Clicking on the name of a worker will open up a more detailed view of its resource usage.

Detailed Resource Usage of a Worker

The Runs page in the menu on the left allows you to start new experiments as well as view experiments that were already started.

When adding a run, you can choose a name for this particular run, the number of worker nodes to utilize, as well as resource constraints for the individual workers.

Note: In the future, you will also be able to choose different models/frameworks/etc., but this was not yet implemented at the time this tutorial was written. By default, ResNet-18 is run.

Starting a new Experiment

When you start a new run, MLBench automatically rescales the worker StatefulSet in Kubernetes and applies any resource limitations you might have set. It then starts training the distributed machine learning model.
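
For reference, scaling a StatefulSet is something you can also inspect and do by hand with kubectl. The StatefulSet name below is a hypothetical placeholder, so look up the actual name first:

$ kubectl get statefulsets
$ kubectl scale statefulset my-release-mlbench-worker --replicas=3  # placeholder name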

You can then see the details of the experiment by clicking on its entry in the list of experiments. You can see the stdout and stderr of all workers, as well as any performance metrics the workers send back to the dashboard (e.g. Training Accuracy, Training Loss). You can also download all collected metrics as JSON files (including the resource usage of individual workers during the experiment).

Note: You can download metrics at any point during a run, but only the values available up until that point will be downloaded. If no metrics are available yet, the download will be empty.
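
The downloaded JSON can be inspected with standard tools; for instance, to pretty-print a downloaded file (metrics.json is an example file name):

$ python -m json.tool metrics.json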

Stdout of an experiment

Training Loss curve of an experiment

That’s it! You successfully ran a distributed machine learning algorithm in the cloud. You can also easily develop custom worker images for your own models and compare them to existing benchmarking code without a lot of overhead.

Cleanup

To delete MLBench, run:

$ ./google_cloud_setup.sh uninstall-chart

To delete the whole Cluster (and cleanup firewall rules), run:

$ ./google_cloud_setup.sh delete-cluster
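
To confirm the cluster is gone and no longer incurring charges, list your remaining clusters:

$ gcloud container clusters list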

Appendix 1: Use NFS for Data storage

To avoid downloading datasets every time we reinstall MLBench, we can use a persistent disk to save the data. To do so, create a GCE disk like this:

$ gcloud compute disks create --size=10GB --zone=europe-west1-b my-pd-name
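
You can check that the disk was created successfully:

$ gcloud compute disks list --filter="name=my-pd-name"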

and add the name of the persistent disk to myvalues.yaml:

gcePersistentDisk:
  enabled: True
  pdName: my-pd-name
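
If you install the chart manually with Helm rather than through the setup script, the values file can be passed with the -f flag (the release name mlbench is just an example):

$ helm upgrade --install mlbench . -f myvalues.yaml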

Note that the Kubernetes PersistentVolume and PersistentVolumeClaim resources will be deleted when we delete the Helm chart, but the datasets will persist on the GCE disk.