Fast particle detector simulation (CERN)
This use case trains a 3D Generative Adversarial Network (3DGAN) to generate images of calorimeter depositions. It is based on the prototype 3DGAN model developed at CERN and is implemented with the PyTorch Lightning framework.
This section covers the CERN use case, which uses the PyTorch Lightning framework for training and evaluation. Below you can find instructions to execute the CERN use case and its scripts:
Integration with itwinai
Integration author(s): Kalliopi Tsolaki (CERN), Matteo Bunino (CERN)
First of all, from the repository root, create a torch environment following the installation instructions.
Next, install the custom requirements for this use case, listed in the
requirements.txt file. Example:
source .venv-pytorch/bin/activate
cd use-cases/3dgan
pip install -r requirements.txt
Note
The Python commands below are assumed to be executed from within the virtual environment.
Training
Make sure to be in the use-cases/3dgan folder. Before you can start training, you have
to download the data using the dataloading script:
itwinai exec-pipeline +pipe_key=training_pipeline +pipe_steps=[dataloading_step]
Now you can launch training using itwinai and the provided training configuration config.yaml:
itwinai exec-pipeline +pipe_key=training_pipeline
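If you would rather launch the pipeline from a Python script, the two commands above can be assembled programmatically. The sketch below is only an illustration: the helper name build_exec_pipeline_cmd is hypothetical and not part of itwinai. It simply reproduces the command-line syntax shown above.

```python
import shlex
from typing import Optional, Sequence


def build_exec_pipeline_cmd(
    pipe_key: str,
    pipe_steps: Optional[Sequence[str]] = None,
) -> list[str]:
    """Assemble the argument list for `itwinai exec-pipeline` (hypothetical helper)."""
    cmd = ["itwinai", "exec-pipeline", f"+pipe_key={pipe_key}"]
    if pipe_steps:
        # Same list syntax as on the command line: +pipe_steps=[step_a,step_b]
        cmd.append(f"+pipe_steps=[{','.join(pipe_steps)}]")
    return cmd


if __name__ == "__main__":
    download = build_exec_pipeline_cmd("training_pipeline", ["dataloading_step"])
    train = build_exec_pipeline_cmd("training_pipeline")
    # Print shell-quoted commands for inspection
    print(shlex.join(download))
    print(shlex.join(train))
```

To actually run a command, pass the returned list to subprocess.run(cmd, check=True) from within the use-cases/3dgan folder.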
The command above shows how to run the training using a single worker. If you want to run distributed ML training, you have two options: interactive (launch from the terminal) or batch (launch from a SLURM job script).
Warning
Before running distributed ML, make sure that the distributed strategy used
by PyTorch Lightning is set to ddp_find_unused_parameters_true. You can set
this manually by setting
distributed_strategy: ddp_find_unused_parameters_true in config.yaml.
To learn more about SLURM, see our SLURM cheatsheet.
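Before submitting a distributed job, it can be handy to verify the strategy mentioned in the warning above. The following is a minimal sketch that assumes distributed_strategy is a top-level key in config.yaml, as the warning suggests. The function name is hypothetical, and a real YAML parser would be more robust than this regular expression.

```python
import re

REQUIRED_STRATEGY = "ddp_find_unused_parameters_true"


def uses_required_strategy(config_text: str) -> bool:
    """Return True if a top-level `distributed_strategy` key has the required value."""
    match = re.search(
        # Key at the start of a line, optional quotes, optional trailing comment
        r"^distributed_strategy:[ \t]*['\"]?([\w-]+)['\"]?[ \t]*(?:#.*)?$",
        config_text,
        flags=re.MULTILINE,
    )
    return match is not None and match.group(1) == REQUIRED_STRATEGY
```

For example, uses_required_strategy(pathlib.Path("config.yaml").read_text()) could be called at the top of a launch script to fail early.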
Distributed training on a single node (interactive)
If you want to use SLURM in interactive mode, do the following:
# Allocate resources (on JSC)
$ salloc --partition=batch --nodes=1 --account=intertwin --gres=gpu:4 --time=1:59:00
job ID is XXXX
# Get a shell in the compute node (if using SLURM)
$ srun --jobid XXXX --overlap --pty /bin/bash
# Now you are inside the compute node
# On JSC, you may need to load some modules
ml --force purge
ml Stages/2024 GCC OpenMPI CUDA/12 MPI-settings/CUDA Python HDF5 PnetCDF libaio mpi4py
# ...before activating the Python environment (adapt this to your env name/path)
source ../../envAI_hdfml/bin/activate
To launch the training with torch DDP use:
torchrun --standalone --nnodes=1 --nproc-per-node=gpu \
$(which itwinai) exec-pipeline +pipe_key=training_pipeline
# Alternatively, from a SLURM login node:
srun --jobid XXXX --ntasks-per-node=1 torchrun --standalone --nnodes=1 --nproc-per-node=gpu \
$(which itwinai) exec-pipeline +pipe_key=training_pipeline
Distributed training with SLURM (batch mode)
Unlike the interactive approach, batch mode allows you to use more than one compute node, so that distributed ML can scale to larger resources.
Remember that on JSC there is no internet connection on compute nodes, thus if your script tries to contact the internet it will fail. If needed, make sure to download the datasets from the SLURM login node before launching the job.
# Launch a SLURM batch job (on JSC)
sbatch slurm.jsc.sh
# Launch a SLURM batch job (on Vega)
sbatch slurm.vega.sh
# Check the job in the SLURM queue
squeue -u YOUR_USERNAME
# Check the job status
sacct -j JOBID
The job's stdout is usually saved to job.out and its stderr is saved to job.err.
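The submit-and-check steps above can also be scripted. The sketch below relies on the "Submitted batch job <id>" message that sbatch prints on success. The function names are hypothetical, and submit only works on a machine where SLURM is installed.

```python
import re
import subprocess


def parse_job_id(sbatch_stdout: str) -> str:
    """Extract the job ID from sbatch's 'Submitted batch job <id>' message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_stdout)
    if match is None:
        raise ValueError(f"Unexpected sbatch output: {sbatch_stdout!r}")
    return match.group(1)


def submit(script: str) -> str:
    """Submit a job script with sbatch and return its job ID (requires SLURM)."""
    result = subprocess.run(
        ["sbatch", script], check=True, capture_output=True, text=True
    )
    return parse_job_id(result.stdout)
```

The returned ID can then be passed to sacct -j to check the job status, for instance after calling submit("slurm.jsc.sh").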
Visualize the results of training
Depending on the logging service that you are using, there are different ways to inspect the logs generated during ML training.
To visualize the logs generated with MLflow, if you set a local path as tracking URI, run the following in the terminal:
mlflow ui --backend-store-uri LOCAL_TRACKING_URI
Then select the "3DGAN" experiment.
Offload training with interLink (batch mode)
To submit a SLURM job to a remote HPC system, you can use interLink (find more info in the interLink directory).
In short, this is the usual workflow:
Make sure to have kubectl installed on your system (e.g., laptop, VM) and to have it using a valid kubeconfig.
Create a Docker container containing your code and the Python environment (all Python dependencies needed by your code).
Push the Docker container to your preferred public containers registry.
Update the Kubernetes pod accordingly, making sure that the newly created container image is pulled on the remote HPC system and used to run your job.
Submit the job through interLink.
Example:
# Check current pods status
kubectl get pods
# Delete existing pod before re-submitting it
kubectl delete pod POD_NAME
# Submit new pod for ML training
kubectl apply -f interLink/3dgan-train.yaml
# Once completed, retrieve the pod logs
kubectl logs --insecure-skip-tls-verify-backend POD_NAME
Inference
As the inference dataset, we can reuse the training/validation dataset, for instance the one downloaded from the Google Drive folder: if the dataset root folder is not present, the dataset will be downloaded. The inference dataset is a set of H5 files stored inside
exp_data sub-folders:
├── exp_data
│   ├── data
│   │   ├── file_0.h5
│   │   ├── file_1.h5
...
│   │   ├── file_N.h5
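To check that a dataset folder follows the layout above before running inference, a small helper can list the H5 files it contains. This is a minimal sketch using only the standard library. The function name is hypothetical, and it assumes the files sit directly under the data sub-folder of the dataset root.

```python
from pathlib import Path


def list_h5_files(dataset_root: str) -> list[Path]:
    """Return the H5 files found under <dataset_root>/data, sorted by name."""
    data_dir = Path(dataset_root) / "data"
    if not data_dir.is_dir():
        # Missing folder: nothing to run inference on yet
        return []
    return sorted(data_dir.glob("*.h5"))
```

An empty result for list_h5_files("exp_data") indicates that the dataset has not been downloaded yet or is stored elsewhere.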
As the model, if a pre-trained checkpoint is not available, we can create a dummy version of it with:
python create_inference_sample.py
Run the inference command. This will generate a 3dgan-generated-data folder containing generated particle traces in the form of torch tensors (.pth files) and 3D scatter plots (.jpg images).
itwinai exec-pipeline +pipe_key=inference_pipeline
The inference execution will produce a folder called
3dgan-generated-data containing
generated 3D particle trajectories (overwritten if already
there). Each generated 3D image is stored both as a
torch tensor (.pth) and 3D scatter plot (.jpg):
├── 3dgan-generated-data
│   ├── energy=1.296749234199524&angle=1.272539496421814.pth
│   ├── energy=1.296749234199524&angle=1.272539496421814.jpg
...
│   ├── energy=1.664689540863037&angle=1.4906378984451294.pth
│   ├── energy=1.664689540863037&angle=1.4906378984451294.jpg
However, if aggregate_predictions in the ParticleImagesSaver step is set to True,
only one pickled file will be generated inside the 3dgan-generated-data folder.
Note that multiple inference calls will create new files under the 3dgan-generated-data folder.
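The file names shown above encode the energy and angle of each generated sample, so they can be recovered when post-processing the results. The following sketch assumes the energy=<value>&angle=<value> naming pattern from the listing above. The function name is hypothetical, and the single aggregated file produced with aggregate_predictions may not follow this pattern.

```python
from pathlib import Path


def parse_sample_name(file_name: str) -> dict[str, float]:
    """Parse 'energy=<float>&angle=<float>.<ext>' into a dict of floats."""
    # .stem drops only the final extension (.pth or .jpg), keeping the decimals
    stem = Path(file_name).stem
    fields = {}
    for pair in stem.split("&"):
        key, value = pair.split("=", 1)
        fields[key] = float(value)
    return fields
```

For instance, this makes it possible to pair each .pth tensor with its .jpg plot, or to sort the generated samples by energy.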
Example with field overrides:
# Override variables
export CERN_DATA_ROOT="../.." # data root
export TMP_DATA_ROOT=$CERN_DATA_ROOT
export CERN_CODE_ROOT="." # where code and configuration are stored
export MAX_DATA_SAMPLES=20000 # max dataset size
export BATCH_SIZE=1024 # increase to fill up GPU memory
export NUM_WORKERS_DL=4 # num worker processes used by the dataloader to pre-fetch data
export AGGREGATE_PREDS="true" # write predictions in a single file
export ACCELERATOR="gpu" # choose "cpu" or "gpu"
export STRATEGY="auto" # distributed strategy
export DEVICES="0," # GPU devices list
itwinai exec-pipeline --config-path $CERN_CODE_ROOT \
+pipe_key=inference_pipeline \
dataset_location=$CERN_DATA_ROOT/exp_data \
logs_dir=$TMP_DATA_ROOT/ml_logs/mlflow_logs \
distributed_strategy=$STRATEGY \
devices=$DEVICES \
hw_accelerators=$ACCELERATOR \
checkpoints_path=$TMP_DATA_ROOT/checkpoints \
inference_model_uri=$CERN_CODE_ROOT/3dgan-inference.pth \
max_dataset_size=$MAX_DATA_SAMPLES \
batch_size=$BATCH_SIZE \
num_workers_dataloader=$NUM_WORKERS_DL \
inference_results_location=$TMP_DATA_ROOT/3dgan-generated-data \
aggregate_predictions=$AGGREGATE_PREDS
Docker image
Build from the project root with:
# Local
docker buildx build -t itwinai:0.0.1-3dgan-0.1 -f use-cases/3dgan/Dockerfile .
# Ghcr.io
docker buildx build -t ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 -f use-cases/3dgan/Dockerfile .
docker push ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1
You can run inference from wherever a sample of H5 files is available
(folder called exp_data/):
├── $PWD
│   ├── exp_data
│   │   ├── data
│   │   │   ├── file_0.h5
│   │   │   ├── file_1.h5
...
│   │   │   ├── file_N.h5
docker run -it --rm --name running-inference -v "$PWD":/tmp/data ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1
This command will store the results in a folder called 3dgan-generated-data:
├── $PWD
│   ├── 3dgan-generated-data
│   │   ├── energy=1.296749234199524&angle=1.272539496421814.pth
│   │   ├── energy=1.296749234199524&angle=1.272539496421814.jpg
...
│   │   ├── energy=1.664689540863037&angle=1.4906378984451294.pth
│   │   ├── energy=1.664689540863037&angle=1.4906378984451294.jpg
To override fields in the configuration file at runtime, append the overrides inline
at the end of the command. Example: path.to.config.element=NEW_VALUE.
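When many fields need to be overridden, the path.to.config.element=NEW_VALUE strings can be generated from a nested dictionary. This is a minimal sketch of that idea. The function name to_overrides is hypothetical, and lowercasing booleans is an assumption made to match the "true" value used in this page's examples.

```python
from collections.abc import Mapping
from typing import Any


def to_overrides(config: Mapping[str, Any], prefix: str = "") -> list[str]:
    """Flatten a nested mapping into 'path.to.config.element=VALUE' strings."""
    overrides = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            # Recurse into nested sections, extending the dotted path
            overrides.extend(to_overrides(value, path))
        else:
            if isinstance(value, bool):
                value = str(value).lower()  # True -> "true"
            overrides.append(f"{path}={value}")
    return overrides
```

The resulting list can be appended to the itwinai exec-pipeline argument list before launching the command.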
Below is a complete example showing how to override the default configuration by setting some environment variables:
# Override variables
export CERN_DATA_ROOT="/usr/data"
export TMP_DATA_ROOT=$CERN_DATA_ROOT # where logs, checkpoints and results are written
export CERN_CODE_ROOT="/usr/src/app"
export MAX_DATA_SAMPLES=10 # max dataset size
export BATCH_SIZE=64 # increase to fill up GPU memory
export NUM_WORKERS_DL=4 # num worker processes used by the dataloader to pre-fetch data
export AGGREGATE_PREDS="true" # write predictions in a single file
export ACCELERATOR="gpu" # choose "cpu" or "gpu"
export STRATEGY="auto" # distributed strategy
export DEVICES="0," # GPU devices list
docker run -it --rm --name running-inference \
-v "$PWD":/usr/data ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 \
/bin/bash -c "itwinai exec-pipeline \
--config-path $CERN_CODE_ROOT \
+pipe_key=inference_pipeline \
dataset_location=$CERN_DATA_ROOT/exp_data \
logs_dir=$TMP_DATA_ROOT/ml_logs/mlflow_logs \
distributed_strategy=$STRATEGY \
devices=$DEVICES \
hw_accelerators=$ACCELERATOR \
checkpoints_path=$TMP_DATA_ROOT/checkpoints \
inference_model_uri=$CERN_CODE_ROOT/3dgan-inference.pth \
max_dataset_size=$MAX_DATA_SAMPLES \
batch_size=$BATCH_SIZE \
num_workers_dataloader=$NUM_WORKERS_DL \
inference_results_location=$TMP_DATA_ROOT/3dgan-generated-data \
aggregate_predictions=$AGGREGATE_PREDS "
How to fully exploit GPU resources
Keeping the example above as reference, increase the value of BATCH_SIZE as much as possible
(just below the point where "out of memory" errors occur). Also make sure that ACCELERATOR="gpu"
and that the dataset is large enough to collect meaningful performance data, by changing the
value of MAX_DATA_SAMPLES. Each H5 file contains roughly 5k items, so setting
MAX_DATA_SAMPLES=10000 should be enough to use all items of about two input H5 files.
You can try:
export MAX_DATA_SAMPLES=10000 # max dataset size
export BATCH_SIZE=1024 # increase to fill up GPU memory
export ACCELERATOR="gpu"
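Finding the largest batch size that still fits in GPU memory can be automated with a doubling phase followed by a binary search. The sketch below is generic and does not depend on itwinai: fits is a hypothetical callback that, in practice, would run one inference batch of the given size and return False when an out-of-memory error is raised. It assumes that if a batch size fails, all larger ones fail too.

```python
from typing import Callable


def largest_fitting_batch_size(
    fits: Callable[[int], bool], start: int = 1, limit: int = 1 << 20
) -> int:
    """Return the largest batch size in [start, limit] accepted by fits(), or 0 if none."""
    if not fits(start):
        return 0
    lo, hi = start, start * 2
    # Doubling phase: grow until a size fails or the limit is exceeded
    while hi <= limit and fits(hi):
        lo, hi = hi, hi * 2
    if hi > limit:
        hi = limit + 1  # sizes beyond the limit are treated as not fitting
    # Binary search phase: fits(lo) is True, hi is known (or assumed) to fail
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

The value found this way could then be exported as BATCH_SIZE, possibly with some safety margin since memory usage may vary between batches.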
Singularity
Run the Docker container with Singularity:
singularity run --nv -B "$PWD":/usr/data docker://ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 /bin/bash -c \
"cd /usr/src/app && itwinai exec-pipeline +pipe_key=inference_pipeline"
Example with overrides (as above for Docker):
# Override variables
export CERN_DATA_ROOT="/usr/data"
export TMP_DATA_ROOT=$CERN_DATA_ROOT # where logs, checkpoints and results are written
export CERN_CODE_ROOT="/usr/src/app"
export MAX_DATA_SAMPLES=10 # max dataset size
export BATCH_SIZE=64 # increase to fill up GPU memory
export NUM_WORKERS_DL=4 # num worker processes used by the dataloader to pre-fetch data
export AGGREGATE_PREDS="true" # write predictions in a single file
export ACCELERATOR="gpu" # choose "cpu" or "gpu"
export STRATEGY="auto" # distributed strategy
export DEVICES="0," # GPU devices list
singularity run --nv -B "$PWD":/usr/data docker://ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1 /bin/bash -c \
"cd /usr/src/app && itwinai exec-pipeline \
--config-path $CERN_CODE_ROOT \
+pipe_key=inference_pipeline \
dataset_location=$CERN_DATA_ROOT/exp_data \
logs_dir=$TMP_DATA_ROOT/ml_logs/mlflow_logs \
distributed_strategy=$STRATEGY \
devices=$DEVICES \
hw_accelerators=$ACCELERATOR \
checkpoints_path=$TMP_DATA_ROOT/checkpoints \
inference_model_uri=$CERN_CODE_ROOT/3dgan-inference.pth \
max_dataset_size=$MAX_DATA_SAMPLES \
batch_size=$BATCH_SIZE \
num_workers_dataloader=$NUM_WORKERS_DL \
inference_results_location=$TMP_DATA_ROOT/3dgan-generated-data \
aggregate_predictions=$AGGREGATE_PREDS "
3DGAN plugin for itwinai
The integration code of the 3DGAN model has been adapted to be distributed as an independent itwinai plugin called itwinai-3dgan-plugin.
Offloading jobs via interLink
The CERN use case also has an integration with interLink. You can find the relevant files in the interLink directory on GitHub. You can also look at the README for more information:
Offloading through interLink
This folder contains Kubernetes pod examples to be used alongside interLink, to offload computation to remote HPC providers.
To use these pods, you will need to install kubectl and set up a kubeconfig.
Manage pod
# A pod needs to be deleted before re-submitting another one with the same name
kubectl delete pod POD_NAME
# Alternatively
kubectl apply --overwrite --force -f test.yaml
# Submit pod
kubectl apply -f my-pod.yaml
# Get status
kubectl get nodes
kubectl get pods
# Get pod STDOUT
kubectl logs --insecure-skip-tls-verify-backend POD_NAME
Pod annotations
Allocate resources through SLURM:
slurm-job.vk.io/flags: "-p gpu --gres=gpu:1 --ntasks-per-node=1 --nodes=1"
On some HPC systems it may be necessary to download the Docker container before submitting the offloaded job. To do so, you can use the following annotation:
slurm-job.vk.io/pre-exec: "singularity pull /ceph/hpc/data/st2301-itwin-users/itwinaiv6_1.sif docker://ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.2"
IMPORTANT: add this annotation only once, when the image is not there.
Request resources: requests and limits
GPUs shall be requested in the SLURM annotation slurm-job.vk.io/flags as described above.
The number of CPUs and the amount of memory, on the other hand, are defined using pod limits and requests.
For instance:
resources:
  limits:
    cpu: "48"
    memory: 150Gi
  requests:
    cpu: "4"
    memory: 20Gi
limits define upper bounds, whereas requests define lower bounds.
See here
to learn more.
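Since requests are lower bounds and limits are upper bounds, a quick sanity check is that every request is not larger than the corresponding limit. The sketch below is an illustration only: the function names are hypothetical, and it handles just the binary memory suffixes (Ki, Mi, Gi, Ti) and the plain or millicore CPU notations, a subset of the quantity formats Kubernetes accepts.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity with a binary suffix (e.g. '150Gi') to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain number of bytes


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity ('4', '0.5' or '500m') to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def requests_within_limits(resources: dict) -> bool:
    """Check that CPU and memory requests do not exceed the limits."""
    requests, limits = resources["requests"], resources["limits"]
    return (
        cpu_to_millicores(requests["cpu"]) <= cpu_to_millicores(limits["cpu"])
        and memory_to_bytes(requests["memory"]) <= memory_to_bytes(limits["memory"])
    )
```

Applied to the example above (4 CPUs and 20Gi requested, 48 CPUs and 150Gi as limits), the check passes.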
Node selector
To select the remote system to offload to, change the value in the node selector:
nodeSelector:
  kubernetes.io/hostname: vega-new-vk
Additional info in interLink docs.
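To see how the annotation, the resources and the node selector fit together, the sketch below assembles them into a single Pod manifest and prints it as JSON, a format that kubectl apply -f also accepts. This is an assumption-laden illustration: the function and the pod name are hypothetical, and the real pod files in the interLink directory may require additional fields that are not shown here.

```python
import json


def build_pod_manifest(name: str, image: str, slurm_flags: str, hostname: str) -> dict:
    """Assemble a minimal Pod manifest with the fields discussed above."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # GPUs are requested through the SLURM flags annotation
            "annotations": {"slurm-job.vk.io/flags": slurm_flags},
        },
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # CPUs and memory are set through limits and requests
                    "resources": {
                        "limits": {"cpu": "48", "memory": "150Gi"},
                        "requests": {"cpu": "4", "memory": "20Gi"},
                    },
                }
            ],
            # Selects the remote system to offload to
            "nodeSelector": {"kubernetes.io/hostname": hostname},
        },
    }


if __name__ == "__main__":
    manifest = build_pod_manifest(
        name="example-3dgan-pod",  # hypothetical pod name
        image="ghcr.io/intertwin-eu/itwinai:0.0.1-3dgan-0.1",
        slurm_flags="-p gpu --gres=gpu:1 --ntasks-per-node=1 --nodes=1",
        hostname="vega-new-vk",
    )
    print(json.dumps(manifest, indent=2))
```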
Secrets
See this guide on how to set Kubernetes secrets as env variables of a container.
Example:
# Create secret to store MLFlow server credentials
kubectl create secret generic mlflow-server --from-literal=username='XYZ' --from-literal=password='ABC'
# Inspect secrets
kubectl get secret
kubectl describe secret SECRET_NAME
# Delete secrets
kubectl delete secret SECRET_NAME
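Once the mlflow-server secret exists, a container can read its keys as environment variables through secretKeyRef entries in the pod specification. The sketch below builds such entries as Python dictionaries. The helper name is hypothetical, and the choice of MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD as variable names is an assumption about how the credentials would be consumed.

```python
import json


def secret_env(env_name: str, secret_name: str, key: str) -> dict:
    """Build one container `env` entry that reads its value from a Secret key."""
    return {
        "name": env_name,
        "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}},
    }


if __name__ == "__main__":
    env = [
        # Keys match the --from-literal names used when creating the secret
        secret_env("MLFLOW_TRACKING_USERNAME", "mlflow-server", "username"),
        secret_env("MLFLOW_TRACKING_PASSWORD", "mlflow-server", "password"),
    ]
    # This list would go under spec.containers[].env in the pod manifest
    print(json.dumps(env, indent=2))
```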