Google Cloud Tutorial
10 Sep 2018 - Written by R. Grubenmann
This tutorial guides you through setting up MLBench in a Google Cloud Kubernetes Engine cluster and explains basic MLBench functionality. For setup in other environments, please refer to our installation documentation. We use Google Cloud as an example, but MLBench runs in any Kubernetes cluster.
Please be aware of any costs that running this tutorial on Google Cloud might incur. Usually, costs should only be on the order of 5-10 USD. We do not take any responsibility for costs incurred.
This tutorial assumes you have a Google Cloud account with permissions to create a new cluster. You also need to have Python, Git, and Docker installed locally and the Docker Daemon should be running.
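Before starting, you can quickly verify the local prerequisites from a terminal. This is just a convenience sketch using the standard CLI names (`python3` is assumed as the Python interpreter):

```shell
# Check that the required local tools are installed.
for tool in python3 git docker; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done

# Check that the Docker daemon is reachable (prints a warning if it is not running).
if docker info >/dev/null 2>&1; then
  echo "docker daemon: running"
else
  echo "docker daemon: not running"
fi
```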
Check out the mlbench-helm GitHub repository and open a terminal in the checked-out mlbench-helm directory.
$ git clone https://github.com/mlbench/mlbench-helm.git
Enter the newly created directory:
$ cd mlbench-helm
MLBench can be installed with the included google_cloud_setup.sh script.
n1-standard-2 instances have 2 CPU cores. But because Google Kubernetes Engine runs its own monitoring and management pods, which also consume some CPU, it is advisable to configure MLBench to use one core less than is available on each node.
First, create a GKE cluster:
$ NUM_NODES=4 NUM_CPUS=1 ./google_cloud_setup.sh create-cluster
and then install the helm chart:
$ NUM_NODES=4 NUM_CPUS=1 ./google_cloud_setup.sh install-chart
That’s it, this should set up MLBench in your Google Kubernetes cluster. The Dashboard URL is printed at the end of the output of the last command.
Simply open the URL in your browser and you should be ready to go.
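If the dashboard does not come up, you can inspect the cluster state with kubectl (installed as part of the Google Cloud SDK). This requires a running cluster; the exact pod names depend on the release name the setup script chose, so treat this as a sketch:

```shell
# List all pods; the MLBench master and worker pods should be Running.
kubectl get pods

# Show services; the dashboard service exposes the external IP and port.
kubectl get svc
```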
You can set many more options for the google_cloud_setup.sh script, such as adding GPUs to the nodes or giving the cluster a custom name. To see all available options, execute the help command:
$ ./google_cloud_setup.sh help
Once you open the dashboard URL, you will be greeted by a screen similar to this:
This shows you all currently used worker nodes (2 by default) and their current state and resource usage. Changes to the workers are continuously monitored and updated.
Clicking on the name of a worker will open up a more detailed view of its resource usage.
The Runs page in the menu on the left allows you to start new experiments as well as view experiments that have already been started.
When adding a run, you can choose a name for this particular run, the number of worker nodes to utilize, as well as resource constraints for the individual workers.
Note: In the future, you will also be able to choose different models, frameworks, etc., but this was not yet implemented at the time this tutorial was written. By default, ResNet-18 is run.
When you start a new run, MLBench automatically rescales the worker StatefulSet in Kubernetes and applies any resource limitations you might have set. It then starts training the distributed machine learning model.
You can then see the details of the experiment by clicking on its entry in the list of experiments. You can see the stderr of all workers, as well as any performance metrics the workers send back to the dashboard (e.g. training accuracy, training loss). You can also download all collected metrics as JSON files (including the resource usage of individual workers during the experiment).
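Since the downloads are plain JSON, they are easy to post-process. The exact schema of the file is an assumption here (inspect your actual download and adapt the field names); this sketch just prints metric names and values with a small inline Python script:

```shell
# Create a small example file standing in for a downloaded metrics JSON
# (the real schema may differ -- adapt the field names to your download).
cat > metrics.json <<'EOF'
[{"name": "train_loss", "value": 2.1},
 {"name": "train_loss", "value": 1.7}]
EOF

# Print each recorded metric.
python3 - <<'EOF'
import json

with open("metrics.json") as f:
    metrics = json.load(f)

for m in metrics:
    print(m["name"], m["value"])
EOF
```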
Note: You can download metrics at any point during a run, but only the values available up to that point will be included. If no metrics are available yet, the download will be empty.
That’s it! You successfully ran a distributed machine learning algorithm in the cloud. You can also easily develop custom worker images for your own models and compare them to existing benchmarking code without a lot of overhead.
To delete MLBench, run:
$ ./google_cloud_setup.sh uninstall-chart
To delete the whole Cluster (and cleanup firewall rules), run:
$ ./google_cloud_setup.sh delete-cluster
Appendix 1: Use a Persistent Disk for Data Storage
To avoid downloading datasets every time we reinstall MLBench, we can use a persistent disk to store the data. To do so, create a GCE disk like this:
$ gcloud compute disks create --size=10G --zone=europe-west1-b my-pd-name
and add the name of the persistent disk to the chart's values:

gcePersistentDisk:
  enabled: True
  pdName: my-pd-name
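Equivalently, these values can be passed on the command line when installing the chart with helm directly, using helm's --set flags. This requires a running cluster, and the release name mlbench here is an assumption:

```shell
# Install (or upgrade) the chart with the persistent disk enabled.
helm upgrade --install mlbench . \
  --set gcePersistentDisk.enabled=true \
  --set gcePersistentDisk.pdName=my-pd-name
```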
Note that Kubernetes resources such as the PersistentVolumeClaim will be deleted when we delete the helm chart, but the datasets will persist on the GCE disk.