Fast particle detector simulation (CERN)

This use case trains a 3D Generative Adversarial Network (3DGAN) to generate images of calorimeter depositions. It is based on the prototype 3DGAN model developed at CERN and is implemented with the PyTorch Lightning framework.

This section covers the CERN use case, which uses the PyTorch Lightning framework for training and evaluation. Below you can find instructions for running the use case and its scripts:

Integration with itwinai

Integration author(s): Kalliopi Tsolaki (CERN), Matteo Bunino (CERN)

First, from the repository root, create a PyTorch environment by following the installation instructions.

Then install the custom requirements for this use case, listed in the requirements.txt file. Example:

source .venv-pytorch/bin/activate
cd use-cases/3dgan
pip install -r requirements.txt

Note

The Python commands below are assumed to be executed from within the virtual environment.

Training

Make sure you are in the use-cases/3dgan folder. Before you can start training, you have to download the data using the dataloading script:

itwinai exec-pipeline +pipe_key=training_pipeline +pipe_steps=[dataloading_step]

Now you can launch training using itwinai and the provided training configuration config.yaml:

itwinai exec-pipeline +pipe_key=training_pipeline

The command above runs the training with a single worker. To run distributed ML training you have two options: interactive (launch from the terminal) or batch (launch from a SLURM job script).

Warning

Before running distributed ML, make sure that the distributed strategy used by PyTorch Lightning is set to ddp_find_unused_parameters_true. You can set this manually with distributed_strategy: ddp_find_unused_parameters_true in config.yaml.
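The setting can also be applied programmatically. The following is a minimal sketch, assuming distributed_strategy appears as a top-level `key: value` line in config.yaml; the helper name `set_distributed_strategy` is hypothetical. It edits plain text rather than parsing YAML, so comments and key order in the file are preserved.

```python
import re


def set_distributed_strategy(
    config_text: str,
    strategy: str = "ddp_find_unused_parameters_true",
) -> str:
    """Return config_text with the top-level distributed_strategy key set."""
    # Assumption: the key sits at the start of a line (top level of the YAML).
    pattern = re.compile(r"^distributed_strategy:.*$", flags=re.MULTILINE)
    new_line = f"distributed_strategy: {strategy}"
    if pattern.search(config_text):
        # Replace only the first occurrence; a callable avoids escape handling.
        return pattern.sub(lambda _match: new_line, config_text, count=1)
    # Key not present: append it on its own line.
    separator = "" if config_text.endswith("\n") or not config_text else "\n"
    return f"{config_text}{separator}{new_line}\n"
```

To use it, read config.yaml as text, pass the content through the function, and write the result back.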

To learn more about SLURM, see our SLURM cheatsheet.

Distributed training on a single node (interactive)

If you want to use SLURM in interactive mode, do the following:

# Allocate resources (on JSC)
$ salloc --partition=batch --nodes=1 --account=intertwin --gres=gpu:4 --time=1:59:00
job ID is XXXX
# Get a shell in the compute node (if using SLURM)
$ srun --jobid XXXX --overlap --pty /bin/bash
# Now you are inside the compute node

# On JSC, you may need to load some modules
ml --force purge
ml Stages/2024 GCC OpenMPI CUDA/12 MPI-settings/CUDA Python HDF5 PnetCDF libaio mpi4py

# ...before activating the Python environment (adapt this to your env name/path)
source ../../envAI_hdfml/bin/activate

To launch the training with torch DDP use:

torchrun --standalone --nnodes=1 --nproc-per-node=gpu \
    $(which itwinai) exec-pipeline +pipe_key=training_pipeline

# Alternatively, from a SLURM login node:
srun --jobid XXXX --ntasks-per-node=1 torchrun --standalone --nnodes=1 --nproc-per-node=gpu \
    $(which itwinai) exec-pipeline +pipe_key=training_pipeline
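torchrun starts one process per GPU and tells each process its role through the RANK, LOCAL_RANK and WORLD_SIZE environment variables. The sketch below reads them, which can help when checking that a launch spawned the expected number of workers. The `WorkerInfo` class and `worker_info` function are hypothetical names, and the defaults assume a single, non-distributed process.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class WorkerInfo:
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node
    world_size: int  # total number of processes

    @property
    def is_main(self) -> bool:
        """True for the global rank 0 process."""
        return self.rank == 0


def worker_info(env: Mapping[str, str] = os.environ) -> WorkerInfo:
    """Read the variables set by torchrun, falling back to one worker."""
    return WorkerInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

With `--nproc-per-node=gpu` on a node with four GPUs, each worker should report a world size of 4 and a distinct local rank between 0 and 3.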

Distributed training with SLURM (batch mode)

Unlike the interactive approach, batch mode lets you use more than one compute node, so you can scale distributed ML training to larger resources.

Remember that on JSC there is no internet connection on compute nodes, thus if your script tries to contact the internet it will fail. If needed, make sure to download the datasets from the SLURM login node before launching the job.

# Launch a SLURM batch job (on JSC)
sbatch slurm.jsc.sh

# Launch a SLURM batch job (on Vega)
sbatch slurm.vega.sh

# Check the job in the SLURM queue
squeue -u YOUR_USERNAME

# Check the job status
sacct -j JOBID

The job’s stdout is usually saved to job.out and its stderr to job.err.
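When submitting from a script, it helps to capture the job ID so it can be passed to sacct afterwards. This is a minimal sketch, relying on sbatch printing `Submitted batch job <ID>` on success; the function names are hypothetical and the default script name is taken from the JSC example above.

```python
import re
import subprocess


def parse_job_id(sbatch_stdout: str) -> int:
    """Extract the numeric job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"Could not find a job ID in: {sbatch_stdout!r}")
    return int(match.group(1))


def submit(script: str = "slurm.jsc.sh") -> int:
    """Submit a SLURM job script and return its job ID."""
    result = subprocess.run(
        ["sbatch", script], check=True, capture_output=True, text=True
    )
    return parse_job_id(result.stdout)
```

The returned ID can then be used as JOBID in `sacct -j JOBID`.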

Visualize the results of training

Depending on the logging service that you are using, there are different ways to inspect the logs generated during ML training.

To visualize the logs generated with MLflow, if you set a local path as the tracking URI, run the following in the terminal:

mlflow ui --backend-store-uri LOCAL_TRACKING_URI

Then select the “3DGAN” experiment.

Offload training with interLink (batch mode)

To submit a SLURM job to a remote HPC system, you can use interLink (find more info in the interLink directory).

In short, this is the usual workflow:

Make sure to have kubectl installed on your system (e.g., laptop, VM) and to have it using a valid kubeconfig.
Create a Docker container containing your code and the Python environment (all Python dependencies needed by your code).
Push the Docker container to your preferred public container registry.
Update the Kubernetes pod accordingly, making sure that the newly created container image is pulled on the remote HPC system and used to run your job.
Submit the job through interLink.

Example:

# Check current pods status
kubectl get pods

# Delete existing pod before re-submitting it
kubectl delete pod POD_NAME

# Submit new pod for ML training
kubectl apply -f interLink/3dgan-train.yaml

# Once completed, retrieve the pod logs
kubectl logs --insecure-skip-tls-verify-backend POD_NAME

Inference

As the inference dataset we can reuse the training/validation dataset, for instance the one downloaded from the Google Drive folder: if the dataset root folder is not present, the dataset will be downloaded. The inference dataset is a set of H5 files stored inside exp_data sub-folders:
```
├── exp_data
│   ├── data
|   │   ├── file_0.h5
|   │   ├── file_1.h5
...
|   │   ├── file_N.h5
```
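To sanity-check an existing dataset folder before running inference, the H5 files can be listed with a few lines of Python. This is only a sketch: the helper name `list_h5_files` is hypothetical, and it assumes the layout shown above with files ending in `.h5` somewhere under the dataset root.

```python
from pathlib import Path


def list_h5_files(dataset_root: str = "exp_data") -> list[Path]:
    """Return all .h5 files found under dataset_root, sorted by path."""
    root = Path(dataset_root)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset root not found: {root}")
    # Search recursively so files inside sub-folders such as data/ are included.
    return sorted(root.rglob("*.h5"))
```

For example, `len(list_h5_files("exp_data"))` gives the number of input files available.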
As for the model, if a pre-trained checkpoint is not available, we can create a dummy version of it with:
```
python create_inference_sample.py
```
Run the inference command:
```
itwinai exec-pipeline +pipe_key=inference_pipeline
```

The inference execution will produce a folder called 3dgan-generated-data containing generated 3D particle trajectories (overwritten if already there). Each generated 3D image is stored both as a torch tensor (.pth) and 3D scatter plot (.jpg):

├── 3dgan-generated-data
|   ├── energy=1.296749234199524&angle=1.272539496421814.pth
|   ├── energy=1.296749234199524&angle=1.272539496421814.jpg
...
|   ├── energy=1.664689540863037&angle=1.4906378984451294.pth
|   ├── energy=1.664689540863037&angle=1.4906378984451294.jpg

However, if aggregate_predictions in the ParticleImagesSaver step is set to True, only one pickled file will be generated inside the 3dgan-generated-data folder. Note that multiple inference calls will create new files under the 3dgan-generated-data folder.
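When predictions are not aggregated, the conditioning values are encoded in each file name. A small sketch for recovering them follows, assuming the `energy=<float>&angle=<float>` naming pattern shown in the listing above; `SampleMeta` and `parse_sample_name` are hypothetical names.

```python
import re
from pathlib import Path
from typing import NamedTuple


class SampleMeta(NamedTuple):
    energy: float
    angle: float


# Matches the file stem, i.e. the name without the .pth or .jpg extension.
_NAME_PATTERN = re.compile(r"^energy=(?P<energy>[^&]+)&angle=(?P<angle>.+)$")


def parse_sample_name(path: str) -> SampleMeta:
    """Extract energy and angle from a generated sample's file name."""
    stem = Path(path).stem
    match = _NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unexpected file name: {path!r}")
    return SampleMeta(float(match["energy"]), float(match["angle"]))
```

This makes it possible, for instance, to group the `.pth` tensors by energy before plotting them.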

With field overrides:

# Override variables
export CERN_DATA_ROOT="../.."  # data root
export TMP_DATA_ROOT=$CERN_DATA_ROOT
export CERN_CODE_ROOT="." # where code and configuration are stored
export MAX_DATA_SAMPLES=20000 # max dataset size
export BATCH_SIZE=1024 # increase to fill up GPU memory
export NUM_WORKERS_DL=4 # num worker processes used by the dataloader to pre-fetch data
export AGGREGATE_PREDS="true" # write predictions in a single file
export ACCELERATOR="gpu" # choose "cpu" or "gpu"
export STRATEGY="auto" # distributed strategy
export DEVICES="0," # GPU devices list


itwinai exec-pipeline --config-path $CERN_CODE_ROOT \
    +pipe_key=inference_pipeline \
    dataset_location=$CERN_DATA_ROOT/exp_data \
    logs_dir=$TMP_DATA_ROOT/ml_logs/mlflow_logs \
    distributed_strategy=$STRATEGY \
    devices=$DEVICES \
    hw_accelerators=$ACCELERATOR \
    checkpoints_path=$TMP_DATA_ROOT/checkpoints \
    inference_model_uri=$CERN_CODE_ROOT/3dgan-inference.pth \
    max_dataset_size=$MAX_DATA_SAMPLES \
    batch_size=$BATCH_SIZE \
    num_workers_dataloader=$NUM_WORKERS_DL \
    inference_results_location=$TMP_DATA_ROOT/3dgan-generated-data \
    aggregate_predictions=$AGGREGATE_PREDS

Docker image

Build from the project root with:

# Local
docker buildx build -t itwinai:0.0.1-3dgan-0.1 -f use-cases/3dgan/Dockerfile .

# Ghcr.io
docker buildx build -t ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 -f use-cases/3dgan/Dockerfile .
docker push ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1

You can run inference from wherever a sample of H5 files is available (in a folder called exp_data/):

├── $PWD
|   ├── exp_data
|   │   ├── data
|   |   │   ├── file_0.h5
|   |   │   ├── file_1.h5
...
|   |   │   ├── file_N.h5

docker run -it --rm --name running-inference -v "$PWD":/tmp/data ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1

This command will store the results in a folder called 3dgan-generated-data:

├── $PWD
|   ├── 3dgan-generated-data
|   │   ├── energy=1.296749234199524&angle=1.272539496421814.pth
|   │   ├── energy=1.296749234199524&angle=1.272539496421814.jpg
...
|   │   ├── energy=1.664689540863037&angle=1.4906378984451294.pth
|   │   ├── energy=1.664689540863037&angle=1.4906378984451294.jpg

To override fields in the configuration file at runtime, append the overrides inline at the end of the command. Example: path.to.config.element=NEW_VALUE.
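When launching from Python, the same inline overrides can be assembled from a dictionary. This is a sketch only: `build_exec_pipeline_cmd` is a hypothetical helper, it does not check that the keys exist in config.yaml, and the override names used in the usage note are the ones that appear in this document.

```python
import shlex
from typing import Mapping, Optional


def build_exec_pipeline_cmd(
    pipe_key: str,
    overrides: Mapping[str, object],
    config_path: Optional[str] = None,
) -> list[str]:
    """Assemble an `itwinai exec-pipeline` call with key=value overrides."""
    cmd = ["itwinai", "exec-pipeline"]
    if config_path is not None:
        cmd += ["--config-path", config_path]
    cmd.append(f"+pipe_key={pipe_key}")
    # Each override is appended at the end of the command as key=value.
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


def as_shell(cmd: list[str]) -> str:
    """Render the argument list as a single shell-quoted string."""
    return shlex.join(cmd)
```

For example, `build_exec_pipeline_cmd("inference_pipeline", {"batch_size": 64, "hw_accelerators": "gpu"})` yields an argument list that can be passed to `subprocess.run`.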

A complete example follows, showing how to override the default configuration by setting some environment variables:

# Override variables
export CERN_DATA_ROOT="/usr/data"
export TMP_DATA_ROOT=$CERN_DATA_ROOT # logs, checkpoints and results go to the mounted folder
export CERN_CODE_ROOT="/usr/src/app"
export MAX_DATA_SAMPLES=10 # max dataset size
export BATCH_SIZE=64 # increase to fill up GPU memory
export NUM_WORKERS_DL=4 # num worker processes used by the dataloader to pre-fetch data
export AGGREGATE_PREDS="true" # write predictions in a single file
export ACCELERATOR="gpu" # choose "cpu" or "gpu"
export STRATEGY="auto" # distributed strategy
export DEVICES="0," # GPU devices list

docker run -it --rm --name running-inference \
-v "$PWD":/usr/data ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 \
/bin/bash -c "itwinai exec-pipeline \
    --config-path $CERN_CODE_ROOT \
    +pipe_key=inference_pipeline \
    dataset_location=$CERN_DATA_ROOT/exp_data \
    logs_dir=$TMP_DATA_ROOT/ml_logs/mlflow_logs \
    distributed_strategy=$STRATEGY \
    devices=$DEVICES \
    hw_accelerators=$ACCELERATOR \
    checkpoints_path=$TMP_DATA_ROOT/checkpoints \
    inference_model_uri=$CERN_CODE_ROOT/3dgan-inference.pth \
    max_dataset_size=$MAX_DATA_SAMPLES \
    batch_size=$BATCH_SIZE \
    num_workers_dataloader=$NUM_WORKERS_DL \
    inference_results_location=$TMP_DATA_ROOT/3dgan-generated-data \
    aggregate_predictions=$AGGREGATE_PREDS "

How to fully exploit GPU resources

Keeping the example above as a reference, increase the value of BATCH_SIZE as much as possible (just below “out of memory” errors) and make sure that ACCELERATOR="gpu". To collect meaningful performance data, also use a large enough dataset by raising the value of MAX_DATA_SAMPLES. Each H5 file contains roughly 5k items, so setting MAX_DATA_SAMPLES=10000 should be enough to use all items in each input H5 file.

You can try:

export MAX_DATA_SAMPLES=10000 # max dataset size
export BATCH_SIZE=1024 # increase to fill up GPU memory
export ACCELERATOR="gpu"
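The relation between these values can be checked with some quick arithmetic. The sketch below assumes roughly 5,000 items per H5 file, as stated above; the constant and function names are hypothetical.

```python
import math

# Assumption taken from the text: each H5 file holds about 5k items.
ITEMS_PER_FILE = 5_000


def files_needed(max_data_samples: int, items_per_file: int = ITEMS_PER_FILE) -> int:
    """Number of H5 files required to supply max_data_samples items."""
    return math.ceil(max_data_samples / items_per_file)


def batches_per_pass(max_data_samples: int, batch_size: int) -> int:
    """Number of batches in one pass over the dataset (last one may be partial)."""
    return math.ceil(max_data_samples / batch_size)
```

With MAX_DATA_SAMPLES=10000 and BATCH_SIZE=1024, this gives two input files and ten batches per pass, the last of which is only partially filled.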

Singularity

Run the Docker container with Singularity:

singularity run --nv -B "$PWD":/usr/data docker://ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 /bin/bash -c \
"cd /usr/src/app && itwinai exec-pipeline +pipe_key=inference_pipeline"

Example with overrides (as above for Docker):

# Override variables
export CERN_DATA_ROOT="/usr/data"
export TMP_DATA_ROOT=$CERN_DATA_ROOT # logs, checkpoints and results go to the mounted folder
export CERN_CODE_ROOT="/usr/src/app"
export MAX_DATA_SAMPLES=10 # max dataset size
export BATCH_SIZE=64 # increase to fill up GPU memory
export NUM_WORKERS_DL=4 # num worker processes used by the dataloader to pre-fetch data
export AGGREGATE_PREDS="true" # write predictions in a single file
export ACCELERATOR="gpu" # choose "cpu" or "gpu"
export STRATEGY="auto" # distributed strategy
export DEVICES="0," # GPU devices list

singularity run --nv -B "$PWD":/usr/data docker://ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 /bin/bash -c \
"cd /usr/src/app && itwinai exec-pipeline \
    --config-path $CERN_CODE_ROOT \
    +pipe_key=inference_pipeline \
    dataset_location=$CERN_DATA_ROOT/exp_data \
    logs_dir=$TMP_DATA_ROOT/ml_logs/mlflow_logs \
    distributed_strategy=$STRATEGY \
    devices=$DEVICES \
    hw_accelerators=$ACCELERATOR \
    checkpoints_path=$TMP_DATA_ROOT/checkpoints \
    inference_model_uri=$CERN_CODE_ROOT/3dgan-inference.pth \
    max_dataset_size=$MAX_DATA_SAMPLES \
    batch_size=$BATCH_SIZE \
    num_workers_dataloader=$NUM_WORKERS_DL \
    inference_results_location=$TMP_DATA_ROOT/3dgan-generated-data \
    aggregate_predictions=$AGGREGATE_PREDS "

3DGAN plugin for itwinai

The integration code of the 3DGAN model has been adapted to be distributed as an independent itwinai plugin called itwinai-3dgan-plugin.

Offloading jobs via interLink

The CERN use case also has an integration with interLink. You can find the relevant files in the interLink directory on GitHub. You can also look at the README for more information:

Offloading through interLink

This folder contains Kubernetes pod examples to be used alongside interLink, to offload computation to remote HPC providers.

To use these pods, you will need to install kubectl and set up a kubeconfig.

Manage pod

# A pod needs to be deleted before re-submitting another one with the same name
kubectl delete pod POD_NAME
# Alternatively
kubectl apply --overwrite --force -f test.yaml

# Submit pod
kubectl apply -f my-pod.yaml

# Get status
kubectl get nodes
kubectl get pods

# Get pod STDOUT
kubectl logs --insecure-skip-tls-verify-backend POD_NAME

Pod annotations

Allocate resources through SLURM:

slurm-job.vk.io/flags: "-p gpu --gres=gpu:1 --ntasks-per-node=1 --nodes=1"

On some HPC systems you may need to download the Docker container before submitting the offloaded job. To do so, you can use the following annotation:

slurm-job.vk.io/pre-exec: "singularity pull /ceph/hpc/data/st2301-itwin-users/itwinaiv6_1.sif docker://ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.2"

IMPORTANT: add this annotation only once, when the image is not there.

Request resources: requests and limits

GPUs must be requested in the SLURM annotation slurm-job.vk.io/flags as described above. The number of CPUs and the amount of memory, on the other hand, are defined using pod limits and requests.

For instance:

resources:
    limits:
        cpu: "48"
        memory: 150Gi
    requests:
        cpu: "4"
        memory: 20Gi

limits define upper bounds, whereas requests define lower bounds. See the Kubernetes documentation to learn more.
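Since requests must not exceed limits, a pod specification can be checked before it is submitted. The following is a minimal sketch, assuming whole-number CPU values and memory written with binary suffixes such as Gi, as in the example above; other Kubernetes quantity formats (for instance `500m`) are not handled, and the function names are hypothetical.

```python
# Binary suffixes used by Kubernetes memory quantities.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_bytes(quantity: str) -> int:
    """Convert a quantity such as '20Gi' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain bytes


def check_resources(resources: dict) -> None:
    """Raise ValueError if any request is larger than its limit."""
    requests, limits = resources["requests"], resources["limits"]
    if int(requests["cpu"]) > int(limits["cpu"]):
        raise ValueError("CPU request exceeds CPU limit")
    if memory_bytes(requests["memory"]) > memory_bytes(limits["memory"]):
        raise ValueError("Memory request exceeds memory limit")
```

The dictionary passed to `check_resources` mirrors the `resources` mapping of the pod, with `limits` and `requests` as keys.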

Node selector

To select the remote system to offload to, change the value in the node selector:

nodeSelector:
    kubernetes.io/hostname: vega-new-vk

Additional information is available in the interLink docs.

Secrets

See this guide on how to set Kubernetes secrets as env variables of a container.

Example:

# Create secret to store MLFlow server credentials
kubectl create secret generic mlflow-server --from-literal=username='XYZ' --from-literal=password='ABC'

# Inspect secrets
kubectl get secret
kubectl describe secret SECRET_NAME

# Delete secrets
kubectl delete secret SECRET_NAME
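Inside the container, the code can then read the credentials from the environment. The sketch below assumes the pod maps the secret's username and password keys to MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD, the variables MLflow uses for HTTP basic authentication. That mapping is an assumption about the pod definition, and `mlflow_credentials` is a hypothetical helper.

```python
import os
from typing import Mapping, Tuple

# Assumption: the pod exposes the secret's keys under these variable names.
_REQUIRED = ("MLFLOW_TRACKING_USERNAME", "MLFLOW_TRACKING_PASSWORD")


def mlflow_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (username, password), failing early if either is missing."""
    missing = [name for name in _REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env[_REQUIRED[0]], env[_REQUIRED[1]]
```

Failing early with a clear message is useful on offloaded jobs, where a missing secret would otherwise only show up as an authentication error in the pod logs.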