8. Distributed Machine Learning on HPC from k8s using KubeRay operator and interLink

8.1. Starting a Ray cluster on HPC from Kubernetes for HPO and distributed ML training

Author(s): Linus Eickhoff (CERN), Matteo Bunino (CERN)

This tutorial demonstrates how to set up a KubeRay cluster and run the itwinai training or hyperparameter optimization (HPO) pipeline using the hython-itwinai-plugin.

In this tutorial, we’ll use a Kubernetes cluster hosted on grnet (a cloud provider) to run KubeRay, which will then access computing resources on the Vega supercomputer (an HPC system) through interLink. While we use specific infrastructure in this example, the concepts apply to any Kubernetes cluster accessing HPC resources.

In this guide we use the following as cloud and HPC resources:

grnet is the cloud server where we install and start the KubeRay cluster (grnet homepage)
Vega is the HPC system whose resources the KubeRay cluster accesses via interLink (Vega homepage)

The KubeRay cluster runs its pods on HPC resources using interLink.

For additional background on Ray job submission and KubeRay, refer to:

KubeRay overview – Setting up and managing Ray on Kubernetes
Ray job submission guide – Explains how to submit jobs to a Ray cluster
Job submission quickstart – A hands-on walkthrough to get started quickly

8.1.1. Prerequisites

Make sure the Singularity container image is stored in an accessible location on the HPC system.

To pull the Singularity container image of the hython plugin on HPC, run:

singularity pull --force hython:sif docker://ghcr.io/intertwin-eu/hython-itwinai-plugin:<tag>

This creates a file named hython:sif in the current directory. Check the available hython images for the appropriate tag.

8.1.2. Connect to cloud server via SSH

First, request access to the cloud server. KubeRay creates multiple pods there, and interLink transparently offloads them to the HPC system.

In the specific case of grnet, switch to a superuser shell and navigate to the work directory:

# for grnet instance
sudo su
cd .interlink

8.1.3. Start the Ray cluster

To start the Ray cluster, create a values file (e.g., <your-name>_raycluster.yaml). You can use the raycluster_example.yaml in this directory as a template.

Edit your values file to ensure it points to the correct sif file:

image:
    # TODO Change to the path where your singularity container file resides. (example is for file named hython:sif)
    repository: <path>/hython
    tag: sif
    pullPolicy: IfNotPresent
# TODO Edit resources as needed (e.g. increase number and resources per head/worker pod)
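The mapping between the image file name and the two image fields can be sketched as follows. This is a minimal illustration, assuming the file is named <name>:<tag> as in the prerequisites step; the helper image_values and the directory /path/to/images are hypothetical and not part of itwinai or KubeRay.

```python
from pathlib import PurePosixPath


def image_values(sif_path: str) -> dict:
    """Split '<dir>/<name>:<tag>' into values for image.repository and image.tag."""
    path = PurePosixPath(sif_path)
    # Split on the last colon of the file name, e.g. 'hython:sif' -> ('hython', 'sif').
    name, sep, tag = path.name.rpartition(":")
    if not sep or not name or not tag:
        raise ValueError(f"expected a file named <name>:<tag>, got {path.name!r}")
    return {"repository": str(path.parent / name), "tag": tag}


# Hypothetical location of the image pulled in the prerequisites step.
values = image_values("/path/to/images/hython:sif")
print(f"repository: {values['repository']}")
print(f"tag: {values['tag']}")
```

The helper rejects file names without a colon, which catches the common slip of pulling the image as hython.sif instead of hython:sif.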

For an overview of the attributes available in Ray values files, consult the Ray documentation on RayCluster configuration.

Then execute:

helm upgrade --install raycluster kuberay/ray-cluster --version 1.2.2 --values <your-name>_raycluster.yaml

This command starts the KubeRay cluster based on your values file.

To check the status of pods with “raycluster” in their name:

kubectl get pods | grep raycluster

Since the pods need to allocate jobs on HPC, wait a few minutes for the cluster to be ready for submissions. The pods are ready when each pod shows 1/1 and Running.
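The readiness rule above can be checked programmatically. The sketch below parses text in the default column layout of kubectl get pods (NAME, READY, STATUS, ...) and reports whether every matching pod is fully ready and Running. The function name and the sample pod names are hypothetical; only the readiness criterion comes from this tutorial.

```python
def raycluster_pods_ready(kubectl_output: str, name_filter: str = "raycluster") -> bool:
    """Return True if at least one matching pod exists and all are n/n and Running."""
    states = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        if name_filter not in name:
            continue
        ready_count, _, total = ready.partition("/")
        states.append(ready_count == total and status == "Running")
    return bool(states) and all(states)


# Hypothetical output while one worker is still waiting for its HPC allocation.
sample = """\
NAME                                   READY   STATUS    RESTARTS   AGE
raycluster-kuberay-head-abcde          1/1     Running   0          5m
raycluster-kuberay-workergroup-fghij   0/1     Pending   0          5m
"""
print(raycluster_pods_ready(sample))
```

In practice the input would be the captured output of kubectl get pods; an empty result is treated as not ready, so a cluster that was never started does not pass the check.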

Warning

Remember to shut down your Ray cluster when it’s no longer in use to free up HPC resources. See 8.1.5. Shutting down and deleting the KubeRay cluster.

8.1.4. Submit to the KubeRay cluster

To submit a Ray job to the KubeRay cluster from HPC, run:

ray job submit --address <address> --working-dir <cwd> -- <command>

For example, to start the hython training pipeline from the hython-itwinai-plugin directory:

ray job submit --address <address> --working-dir configuration_files/ -- itwinai exec-pipeline --config-name vega_training +pipe_key=training

Note

The address is not published; please contact one of the contributors if you don’t have it. It is the externally reachable address of the Ray head node, which is exposed via Ingress in this setup. Other setups may expose the head node differently.

To log to the intertwin MLflow server, override tracking_uri and prefix the command with your MLflow authentication environment variables. For example, to run the HPO pipeline of the hython-itwinai-plugin:

ray job submit \
  --address <address> \
  --working-dir configuration_files/ \
  -- \
  MLFLOW_TRACKING_USERNAME=<username> \
  MLFLOW_TRACKING_PASSWORD=<password> \
  itwinai exec-pipeline \
    --config-name <config-name> \
    tracking_uri=http://mlflow.intertwin.fedcloud.eu/ \
    +pipe_key=hpo
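The structure of this command (Ray options, then the -- separator, then environment assignments, then the entrypoint) can be made explicit by assembling it in a script. The helper below is a hypothetical sketch that only builds the argument list shown above; it does not call Ray itself, and the placeholder values are kept as in the tutorial.

```python
import shlex


def ray_job_submit_argv(address, working_dir, entrypoint, env=None):
    """Build the argument list for 'ray job submit' with an optional env prefix."""
    env_prefix = [f"{key}={value}" for key, value in (env or {}).items()]
    return [
        "ray", "job", "submit",
        "--address", address,
        "--working-dir", working_dir,
        "--",           # everything after this separator is the job entrypoint
        *env_prefix,    # VAR=value assignments placed before the command
        *entrypoint,
    ]


argv = ray_job_submit_argv(
    address="<address>",
    working_dir="configuration_files/",
    entrypoint=[
        "itwinai", "exec-pipeline",
        "--config-name", "<config-name>",
        "tracking_uri=http://mlflow.intertwin.fedcloud.eu/",
        "+pipe_key=hpo",
    ],
    env={
        "MLFLOW_TRACKING_USERNAME": "<username>",
        "MLFLOW_TRACKING_PASSWORD": "<password>",
    },
)
# Print a shell-quoted version for inspection before running it.
print(shlex.join(argv))
```

Once the placeholders are filled in, the list could be passed to subprocess.run. Reading the credentials from os.environ instead of writing them into the script would keep them out of version control.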

Note

First, create an account here, then use your email as the username. Ensure experiment_name is set to a unique name: the job fails if the name is already in use, and if someone else created an experiment with that name, it fails with a permission denied error.
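One way to avoid name collisions is to generate the experiment name rather than type it by hand. The sketch below appends a UTC timestamp and a short random suffix to a prefix; the function name and the prefix hython-hpo are hypothetical. Passing the result as an experiment_name=<value> override on the command line is an assumption based on how tracking_uri is overridden above.

```python
import uuid
from datetime import datetime, timezone


def unique_experiment_name(prefix: str) -> str:
    """Return '<prefix>-<UTC timestamp>-<8 random hex chars>'."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    suffix = uuid.uuid4().hex[:8]  # random part guards against same-second clashes
    return f"{prefix}-{stamp}-{suffix}"


name = unique_experiment_name("hython-hpo")
print(f"experiment_name={name}")
```

The timestamp keeps runs sortable in the MLflow UI, while the random suffix makes a clash with another user's experiment unlikely.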

8.1.5. Shutting down and deleting the KubeRay cluster

When finished, run the following on the cloud instance:

helm delete raycluster

This command releases the associated HPC resources.

8.2. raycluster_example.yaml

This file defines the RayCluster. The tutorial references it as the values file passed to the kuberay/ray-cluster Helm chart to deploy the Ray cluster on Kubernetes. It specifies the configuration for head and worker nodes, including resource requests, environment variables, and startup commands. For a full reference of supported fields and structure, see the Ray on Kubernetes config documentation.

# TODO: Change to your image path (filename should be <name>:<tag>)
image:
  repository: /ceph/hpc/data/st2301-itwin-users/lineick/hython
  tag: sif
  pullPolicy: IfNotPresent

# TODO: change resources for head as needed
head:
  resources:
    limits:
      cpu: "64"
      # To avoid out-of-memory issues, never allocate less than 2G memory for the Ray head.
      memory: "128G"
    requests:
      cpu: "64"
      # To avoid out-of-memory issues, never allocate less than 2G memory for the Ray head.
      memory: "128G"
  annotations:
    slurm-job.vk.io/flags: "-p gpu --gres=gpu:1  --time 230" # TODO: Adjust as needed
    # TODO: Add container envs here
    slurm-job.vk.io/singularity-options: "--no-home --compat --no-mount /exa5 --env POD_IP=$POD_IP  --env HYDRA_FULL_ERROR=1,NCCL_SOCKET_IFNAME=br0,RAY_record_ref_creation_sites=1,SLURM_NNODES=1,ITWINAI_LOG_LEVEL=DEBUG"
    slurm-job.vk.io/singularity-mounts: "--bind /ceph"
    interlink.eu/pod-vpn: "true"
    slurm-job.vk.io/pre-exec: "mkdir -p  /ceph/hpc/data/st2301-itwin-users/interlink/ray; cd /ceph/hpc/home/ciangottinid/test_dodas_net && singularity exec --no-mount /scratch,/exa5,/cvmfs --bind /ceph/hpc/data/st2301-itwin-users/eurac/:/ceph/hpc/data/st2301-itwin-users/eurac/ --bind $PWD/wireguard:/var/run/wireguard --bind /ceph/hpc/data/st2301-itwin-users/interlink/ray:/mnt/cluster_storage --env INTERNAL_IP=$INTERNAL_IP --env POD_IP=$POD_IP  /ceph/hpc/home/ciangottinid/launch:latest  ./slirp.sh"
  nodeSelector:
    kubernetes.io/hostname: vega-virtual-node

  tolerations:
    - key: virtual-node.interlink/no-schedule
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  rayStartParams:
    node-ip-address: "$POD_IP"
    verbose: true
    #  volumeMounts:
    #    - mountPath: /ceph
    #      name: ceph-volume
    #      readOnly: true
  headService: {}

# TODO: change resources for worker as needed
worker:
  # If you want to disable the default workergroup
  # uncomment the line below
  # disabled: true
  groupName: workergroup
  replicas: 3 # TODO: Change number of workers in cluster
  minReplicas: 1
  maxReplicas: 3
  resources:
    limits:
      cpu: "64"
      memory: "128G"
    requests:
      cpu: "64"
      memory: "128G"
  annotations:
    slurm-job.vk.io/flags: "-p gpu --gres=gpu:1 --time 230" # TODO: change num of gpus and cpus
    slurm-job.vk.io/singularity-options: "--no-home --compat --no-mount /exa5 --env HYDRA_FULL_ERROR=1,RAY_record_ref_creation_sites=1,ITWINAI_LOG_LEVEL=DEBUG,NCCL_SOCKET_IFNAME=br0,SLURM_NNODES=1"
    slurm-job.vk.io/singularity-mounts: "--bind /ceph"
    interlink.eu/pod-vpn: "true"
    slurm-job.vk.io/pre-exec: "mkdir -p  /ceph/hpc/data/st2301-itwin-users/interlink/ray; cd /ceph/hpc/home/ciangottinid/test_dodas_net && singularity exec --no-mount /scratch,/exa5,/cvmfs --bind /ceph/hpc/data/st2301-itwin-users/eurac/:/ceph/hpc/data/st2301-itwin-users/eurac/ --bind $PWD/wireguard:/var/run/wireguard --bind /ceph/hpc/data/st2301-itwin-users/interlink/ray:/mnt/cluster_storage --env INTERNAL_IP=$INTERNAL_IP --env POD_IP=$POD_IP  /ceph/hpc/home/ciangottinid/launch:latest  ./slirp.sh"
  nodeSelector:
    kubernetes.io/hostname: vega-virtual-node
  tolerations:
    - key: virtual-node.interlink/no-schedule
      operator: Exists
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  rayStartParams:
    #num-gpus: 1
    #num-cpus: 4
    #memory: 12000
    verbose: true

# Configuration for Head's Kubernetes Service
service:
  # This is optional, and the default is ClusterIP.
  type: ClusterIP
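Before deploying, it can help to estimate how much the cluster will request in total, since every pod becomes its own job on the HPC system. The sketch below sums the values from the example file for one head and three workers. The helper names are hypothetical, and how interLink translates Kubernetes resource requests into Slurm allocations depends on the site configuration, so treat the totals as an upper-level estimate rather than the exact allocation.

```python
import re


def pod_gpus(slurm_flags: str) -> int:
    """Return N from a '--gres=gpu:N' flag, or 0 if the flag is absent."""
    match = re.search(r"--gres=gpu:(\d+)", slurm_flags)
    return int(match.group(1)) if match else 0


def cluster_totals(head: dict, worker: dict, replicas: int) -> dict:
    """Sum CPU, memory (in G) and GPUs over the head pod and all worker pods."""
    pods = [head] + [worker] * replicas
    return {
        "pods": len(pods),
        "cpu": sum(int(p["cpu"]) for p in pods),
        "memory_g": sum(int(p["memory"].rstrip("G")) for p in pods),
        "gpus": sum(pod_gpus(p["flags"]) for p in pods),
    }


# Values copied from raycluster_example.yaml above.
head = {"cpu": "64", "memory": "128G", "flags": "-p gpu --gres=gpu:1  --time 230"}
worker = {"cpu": "64", "memory": "128G", "flags": "-p gpu --gres=gpu:1 --time 230"}

print(cluster_totals(head, worker, replicas=3))
# → {'pods': 4, 'cpu': 256, 'memory_g': 512, 'gpus': 4}
```

With the example values, the cluster asks for four Slurm jobs of one GPU each. The --time 230 flag is a bare number, which Slurm interprets as minutes, so each job is limited to 3 hours and 50 minutes.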