8. Distributed Machine Learning on HPC from k8s using KubeRay operator and interLink
8.1. Starting a Ray cluster on HPC from Kubernetes for HPO and distributed ML training
Author(s): Linus Eickhoff (CERN), Matteo Bunino (CERN)
This tutorial demonstrates how to set up a KubeRay cluster and run the itwinai training or hyperparameter optimization (HPO) pipeline using the hython-itwinai-plugin.
In this tutorial, we'll use a Kubernetes cluster hosted on grnet (a cloud provider) to run KubeRay, which will then access
computing resources on the Vega supercomputer (an HPC system) through interLink. While we use specific infrastructure in
this example, the concepts apply to any Kubernetes cluster accessing HPC resources.
In this guide we use the following cloud and HPC resources:
grnet is the cloud server where we install and start the KubeRay cluster (grnet homepage)
Vega is the HPC environment whose resources are accessed from the KubeRay cluster via interLink (Vega homepage)
The KubeRay cluster runs its pods on HPC resources using interLink.
For additional background on Ray job submission and KubeRay, refer to:
KubeRay overview: Setting up and managing Ray on Kubernetes
Ray job submission guide: Explains how to submit jobs to a Ray cluster
Job submission quickstart: A hands-on walkthrough to get started quickly
8.1.1. Prerequisites
Make sure you have the singularity container file in an accessible location on HPC.
To pull the singularity container image of the hython plugin on HPC, run:
singularity pull --force hython:sif docker://ghcr.io/intertwin-eu/hython-itwinai-plugin:<tag>
This creates a file named hython:sif in the current directory.
Check the available hython images for the appropriate tag.
8.1.2. Connect to cloud server via SSH
First, request access to the cloud server.
KubeRay creates multiple pods on the cloud server, and interLink transparently offloads them to HPC.
If you are using grnet, switch to a superuser shell and navigate to the working directory:
# for grnet instance
sudo su
cd .interlink
8.1.3. Start the Ray cluster
To start the Ray cluster, create a values file (e.g., <your-name>_raycluster.yaml).
You can use the raycluster_example.yaml in this directory as a template.
Edit your values file so that it points to the correct sif file:
image:
  # TODO Change to the path where your singularity container file resides. (example is for file named hython:sif)
  repository: <path>/hython
  tag: sif
  pullPolicy: IfNotPresent
# TODO Edit resources as needed (e.g. increase number and resources per head/worker pod)
For an overview of the attributes available in Ray values files, consult the Ray documentation on RayCluster Configuration.
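The image section above splits the container file name at the colon: a file named hython:sif becomes repository `<path>/hython` and tag `sif`. As a minimal sketch of that mapping, the hypothetical helper `split_sif_path` (not part of itwinai or KubeRay) prints the two values for a given file path; the path in the usage line is a placeholder.

```shell
# Hypothetical helper (not part of itwinai or KubeRay): derive the
# "repository" and "tag" values from a container file path of the
# form <path>/<name>:<tag>.
split_sif_path() {
  sif_path="$1"
  # Everything before the last colon is the repository.
  echo "repository: ${sif_path%:*}"
  # Everything after the last colon is the tag.
  echo "tag: ${sif_path##*:}"
}

# Placeholder path for illustration only.
split_sif_path "/path/to/hython:sif"
```

The two printed lines can be copied into the image section of your values file.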
Then execute:
helm upgrade --install raycluster kuberay/ray-cluster --version 1.2.2 --values <your-name>_raycluster.yaml
This command starts the KubeRay cluster based on your values file.
To check the status of pods with "raycluster" in their name:
kubectl get pods | grep raycluster
Since the pods need to allocate jobs on HPC, wait a few minutes for the cluster to be ready for submissions.
The pods are ready when each pod shows 1/1 and Running.
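The readiness rule above (every raycluster pod shows 1/1 and Running) can be checked mechanically. The sketch below assumes the default column layout of `kubectl get pods` (NAME, READY, STATUS, ...); `raycluster_ready` is a hypothetical helper name, not a kubectl or KubeRay command.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and
# succeeds only if at least one raycluster pod is listed and all of
# them show READY 1/1 and STATUS Running.
raycluster_ready() {
  awk '
    /raycluster/ {
      total++
      if ($2 == "1/1" && $3 == "Running") ready++
    }
    END {
      printf "%d/%d raycluster pods ready\n", ready, total
      exit !(total > 0 && ready == total)
    }
  '
}

# Usage on the cloud server (polls every 30 seconds until ready):
#   until kubectl get pods | raycluster_ready; do sleep 30; done
```

The polling interval is arbitrary; since the pods wait for HPC job allocation, a longer interval works just as well.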
Warning
Remember to shut down your raycluster when it's no longer in use to free up HPC resources.
See 8.1.5. Shutting down and deleting the KubeRay cluster.
8.1.4. Submit to the KubeRay cluster
To submit a Ray job to the KubeRay cluster from HPC, run:
ray job submit --address <address> --working-dir <cwd> -- <command>
For example, to start the hython training pipeline from the hython-itwinai-plugin directory:
ray job submit --address <address> --working-dir configuration_files/ -- itwinai exec-pipeline --config-name vega_training +pipe_key=training
Note
The address is not publicly documented. Please contact one of the contributors if you don't have it. It is the public address of the Ray head node, exposed via Ingress in this setup. This configuration may differ in other setups.
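Because the address should not be written into shared scripts, one option is to keep it in an environment variable and assemble the submission command from it. The following is a minimal sketch under that assumption: `ray_submit_cmd` is a hypothetical helper that only prints the command shown above for a given config name and pipeline key, and the choice of the variable `RAY_ADDRESS` to hold the address is likewise an assumption of this example.

```shell
# Hypothetical dry-run helper: prints the `ray job submit` command from
# this tutorial, taking the address from the RAY_ADDRESS variable
# instead of hard-coding it.
ray_submit_cmd() {
  config_name="$1"
  pipe_key="$2"
  if [ -z "${RAY_ADDRESS:-}" ]; then
    echo "RAY_ADDRESS is not set" >&2
    return 1
  fi
  printf 'ray job submit --address %s --working-dir configuration_files/ -- itwinai exec-pipeline --config-name %s +pipe_key=%s\n' \
    "$RAY_ADDRESS" "$config_name" "$pipe_key"
}

# Usage from the hython-itwinai-plugin directory:
#   export RAY_ADDRESS=<address>
#   ray_submit_cmd vega_training training          # inspect the command
#   eval "$(ray_submit_cmd vega_training training)" # run it
```

Printing the command first makes it easy to review the configuration name and pipeline key before any HPC resources are used.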
To log to the intertwin MLflow server, override the tracking_uri and prefix the command with your MLflow authentication environment variables.
For example, to run the HPO pipeline of the hython-itwinai-plugin:
ray job submit \
--address <address> \
--working-dir configuration_files/ \
-- \
MLFLOW_TRACKING_USERNAME=<username> \
MLFLOW_TRACKING_PASSWORD=<password> \
itwinai exec-pipeline \
--config-name <config-name> \
tracking_uri=http://mlflow.intertwin.fedcloud.eu/ \
+pipe_key=hpo
Note
Ensure experiment_name is set to a unique name. If someone else has already created an experiment with the same name, your job will fail with a permission denied error.
To obtain credentials, first create an account on the intertwin MLflow server, then use your email address as the username.
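One way to avoid experiment name collisions is to derive the name from your user name and a timestamp. The sketch below is only an illustration: `unique_experiment_name` is a hypothetical helper, and passing the result as an `experiment_name=...` override on the command line assumes your configuration file exposes that key in the same way as tracking_uri.

```shell
# Hypothetical helper: builds an experiment name of the form
# <prefix>-<user>-<YYYYMMDD>-<HHMMSS>, which is unlikely to clash with
# experiments created by other users.
unique_experiment_name() {
  prefix="$1"
  echo "${prefix}-${USER:-user}-$(date +%Y%m%d-%H%M%S)"
}

# Usage (assumes experiment_name can be overridden like tracking_uri):
#   itwinai exec-pipeline --config-name <config-name> \
#     experiment_name="$(unique_experiment_name hpo)" +pipe_key=hpo
```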
8.1.5. Shutting down and deleting the KubeRay cluster
When finished, run the following on the cloud instance:
helm delete raycluster
This command releases the associated HPC resources.
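To confirm that the cluster is really gone, the pod listing used earlier should no longer contain any raycluster pods. A minimal sketch of that check, using a hypothetical helper `raycluster_gone` that reads `kubectl get pods` output on stdin:

```shell
# Hypothetical helper: succeeds only if no line of the `kubectl get pods`
# output on stdin mentions raycluster.
raycluster_gone() {
  # grep -c prints 0 and exits non-zero when nothing matches.
  remaining="$(grep -c raycluster || true)"
  if [ "$remaining" -eq 0 ]; then
    echo "no raycluster pods left"
  else
    echo "$remaining raycluster pods still present" >&2
    return 1
  fi
}

# Usage on the cloud instance, after `helm delete raycluster`:
#   kubectl get pods | raycluster_gone
```

Pods may take a short while to terminate, so a failing check right after deletion is not necessarily a problem.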
8.2. raycluster_example.yaml
This file defines the RayCluster. It is the values file referenced in the tutorial, used to deploy the Ray cluster on Kubernetes with the KubeRay operator. It specifies the configuration for head and worker nodes, including resource requests, environment variables, and startup commands. For a full reference of supported fields and structure, see the Ray on Kubernetes config documentation.
# TODO: Change to your image path (filename should be <name>:<tag>)
image:
  repository: /ceph/hpc/data/st2301-itwin-users/lineick/hython
  tag: sif
  pullPolicy: IfNotPresent

# TODO: change resources for head as needed
head:
  resources:
    limits:
      cpu: "64"
      # To avoid out-of-memory issues, never allocate less than 2G memory for the Ray head.
      memory: "128G"
    requests:
      cpu: "64"
      # To avoid out-of-memory issues, never allocate less than 2G memory for the Ray head.
      memory: "128G"
  annotations:
    slurm-job.vk.io/flags: "-p gpu --gres=gpu:1 --time 230" # TODO: Adjust as needed
    # TODO: Add container envs here
    slurm-job.vk.io/singularity-options: "--no-home --compat --no-mount /exa5 --env POD_IP=$POD_IP --env HYDRA_FULL_ERROR=1,NCCL_SOCKET_IFNAME=br0,RAY_record_ref_creation_sites=1,SLURM_NNODES=1,ITWINAI_LOG_LEVEL=DEBUG"
    slurm-job.vk.io/singularity-mounts: "--bind /ceph"
    interlink.eu/pod-vpn: "true"
    slurm-job.vk.io/pre-exec: "mkdir -p /ceph/hpc/data/st2301-itwin-users/interlink/ray; cd /ceph/hpc/home/ciangottinid/test_dodas_net && singularity exec --no-mount /scratch,/exa5,/cvmfs --bind /ceph/hpc/data/st2301-itwin-users/eurac/:/ceph/hpc/data/st2301-itwin-users/eurac/ --bind $PWD/wireguard:/var/run/wireguard --bind /ceph/hpc/data/st2301-itwin-users/interlink/ray:/mnt/cluster_storage --env INTERNAL_IP=$INTERNAL_IP --env POD_IP=$POD_IP /ceph/hpc/home/ciangottinid/launch:latest ./slirp.sh"
  nodeSelector:
    kubernetes.io/hostname: vega-virtual-node
  tolerations:
    - key: virtual-node.interlink/no-schedule
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  rayStartParams:
    node-ip-address: "$POD_IP"
    verbose: true
  # volumeMounts:
  #   - mountPath: /ceph
  #     name: ceph-volume
  #     readOnly: true
  headService: {}

# TODO: change resources for worker as needed
worker:
  # If you want to disable the default workergroup
  # uncomment the line below
  # disabled: true
  groupName: workergroup
  replicas: 3 # TODO: Change number of workers in cluster
  minReplicas: 1
  maxReplicas: 3
  resources:
    limits:
      cpu: "64"
      memory: "128G"
    requests:
      cpu: "64"
      memory: "128G"
  annotations:
    slurm-job.vk.io/flags: "-p gpu --gres=gpu:1 --time 230" # TODO: change num of gpus and cpus
    slurm-job.vk.io/singularity-options: "--no-home --compat --no-mount /exa5 --env HYDRA_FULL_ERROR=1,RAY_record_ref_creation_sites=1,ITWINAI_LOG_LEVEL=DEBUG,NCCL_SOCKET_IFNAME=br0,SLURM_NNODES=1"
    slurm-job.vk.io/singularity-mounts: "--bind /ceph"
    interlink.eu/pod-vpn: "true"
    slurm-job.vk.io/pre-exec: "mkdir -p /ceph/hpc/data/st2301-itwin-users/interlink/ray; cd /ceph/hpc/home/ciangottinid/test_dodas_net && singularity exec --no-mount /scratch,/exa5,/cvmfs --bind /ceph/hpc/data/st2301-itwin-users/eurac/:/ceph/hpc/data/st2301-itwin-users/eurac/ --bind $PWD/wireguard:/var/run/wireguard --bind /ceph/hpc/data/st2301-itwin-users/interlink/ray:/mnt/cluster_storage --env INTERNAL_IP=$INTERNAL_IP --env POD_IP=$POD_IP /ceph/hpc/home/ciangottinid/launch:latest ./slirp.sh"
  nodeSelector:
    kubernetes.io/hostname: vega-virtual-node
  tolerations:
    - key: virtual-node.interlink/no-schedule
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  rayStartParams:
    #num-gpus: 1
    #num-cpus: 4
    #memory: 12000
    verbose: true

# Configuration for Head's Kubernetes Service
service:
  # This is optional, and the default is ClusterIP.
  type: ClusterIP
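The `--time 230` entry in the slurm-job.vk.io/flags annotations is a Slurm time limit, and Slurm reads a bare number as minutes. When adjusting it, it can help to see the limit in hours. The sketch below uses a hypothetical helper, `minutes_to_hms`, to convert a minute count into the hours:minutes:seconds form that Slurm also accepts.

```shell
# Hypothetical helper: convert a Slurm time limit given in minutes
# into H:MM:SS.
minutes_to_hms() {
  minutes="$1"
  printf '%d:%02d:00\n' $(( minutes / 60 )) $(( minutes % 60 ))
}

minutes_to_hms 230   # prints 3:50:00
```

With the values in this file, each head and worker pod therefore holds its HPC allocation for at most 3 hours and 50 minutes.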