Virgo

The code is adapted from this notebook available on the Virgo use case’s repository.

To learn more about the interTwin Virgo Noise detector use case and its DT, please see the published deliverables D4.2, D7.2 and D7.4.

You can find the relevant code in the use case’s folder on GitHub, or by consulting the use case’s README:


Installation

If running the pipeline directly on a node (or from your terminal), first install the required libraries in the pre-existing itwinai environment using the following command:

pip install -r requirements.txt

Training Pipelines

This repository offers two main approaches for training based on the dataset size:

Small Dataset Pipeline

This pipeline allows you to generate a small synthetic dataset on-the-fly as part of the training process. It’s suited for quick tests and debugging where the entire workflow stays self-contained in memory. The dataset generation step can also be skipped for subsequent runs.

To run the entire pipeline, including dataset generation, use the following command:

itwinai exec-pipeline --config config.yaml --pipe-key training_pipeline_small

If you’ve already generated the dataset in a previous run, you can skip the dataset generation step by executing the following command:

itwinai exec-pipeline --config config.yaml --pipe-key training_pipeline_small --steps 1:

This will load the dataset from memory and proceed with the training steps.
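The `--steps 1:` value reads like a Python slice: keep every step from index 1 onward and drop step 0 (here, dataset generation). The sketch below illustrates that selection under the assumption of slice-like semantics. It is not itwinai's implementation, and the step names are hypothetical rather than taken from config.yaml.

```python
def select_steps(steps, spec):
    """Select pipeline steps from a 'start:stop' string, mimicking Python slicing.

    Assumption: this only illustrates slice-like step selection; it is not
    how itwinai itself parses --steps.
    """
    if ":" not in spec:
        # A bare index selects a single step.
        return [steps[int(spec)]]
    start_text, _, stop_text = spec.partition(":")
    start = int(start_text) if start_text else None
    stop = int(stop_text) if stop_text else None
    return steps[start:stop]


# Hypothetical step names, for illustration only.
pipeline = ["generate_dataset", "split_dataset", "train_model"]
remaining = select_steps(pipeline, "1:")
print(remaining)
```

With this reading, `1:` skips only the first step, while a value such as `:1` would run the first step alone.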

Large Dataset Pipeline

The large dataset pipeline is designed to handle massive datasets that are stored on disk. To generate this data, this project includes a SLURM job script, synthetic-data-gen/data_generation_hdf5.sh, which generates a synthetic dataset for the Virgo gravitational wave detector use case.

The synthetic data is generated using a Python script, file_gen_hdf5.py, which creates multiple HDF5 files containing simulated data. We generate multiple files as this allows us to create them in parallel, saving us some time. To do this, we use SLURM job arrays. After generating the files, they are concatenated into a single, large file using concat_hdf5_dataset_files.py.
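With SLURM job arrays, each task receives its own index through the `SLURM_ARRAY_TASK_ID` environment variable, so every task can write to a distinct file without coordination. The sketch below shows one way a task could derive its output path from that index. The `shard_NNNN.hdf5` naming scheme and the `data` folder are hypothetical and are not taken from file_gen_hdf5.py.

```python
import os
from pathlib import Path


def shard_path(out_dir, task_id):
    """Return the output file for one array task (hypothetical naming scheme)."""
    return Path(out_dir) / f"shard_{task_id:04d}.hdf5"


def current_task_id(default=0):
    """Read the index that SLURM assigns to each task of a job array."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


if __name__ == "__main__":
    task_id = current_task_id()
    # Outside of a job array the variable is unset, so the default index is used.
    print(f"Task {task_id} would write {shard_path('data', task_id)}")
```

Zero-padding the index keeps the files in task order when they are sorted by name, which is convenient for a later concatenation step.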

To generate a new dataset, you can run the SLURM script with the following command:

sbatch synthetic-data-gen/data_generation_hdf5.sh

Once the dataset is generated, you can proceed with training:

itwinai exec-pipeline --config config.yaml --pipe-key training_pipeline

You can also run the training in a distributed manner using all strategies by running runall.sh:

bash runall.sh

Change the $TRAINING_CMD variable in runall.sh to reflect the pipeline you wish to run, as explained above. This will launch jobs for all the strategies and log their outputs into the logs_slurm folder.

When using the MLflow logger, you can visualize the logs from the MLflow UI:

mlflow ui --backend-store-uri mllogs/mlflow

# In background
mlflow ui --backend-store-uri mllogs/mlflow > /dev/null 2>&1 &

Running scaling tests

Scaling tests provide information about how well the different distributed strategies scale. We have integrated them into this use case and you can run them using the scaling-test.sh script:

bash scaling-test.sh

To generate the plots, refer to the Scaling-Test Tutorial.

Running HPO for Virgo on JSC

Hyperparameter optimization (HPO) is integrated into the pipeline using Ray Tune and Ray Train. This allows you to run multiple trials in parallel and fine-tune model parameters efficiently. There are two methods to run HPO. Both methods are launched with:

sbatch slurm_ray.sh

This script sets up a Ray cluster and runs the script for hyperparameter tuning. Change the run command in slurm_ray.sh to run the script you want. You have two options:

You can run non-distributed HPO by using the command

python hpo.py --num_samples 4 --max_iterations 2 --ngpus $num_gpus --ncpus $num_cpus --pipeline_name training_pipeline

at the end of the SLURM script. Change the num_samples argument to run a different number of trials, and change max_iterations to raise or lower the stopping criterion.

You can run distributed HPO by using the command

$PYTHON_VENV/bin/itwinai exec-pipeline --config config.yaml --pipe-key ray_training_pipeline

at the end of the slurm script.
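To make the flags of the non-distributed command easier to read, here is a minimal `argparse` sketch that accepts the same flag names. It is not the actual hpo.py: the defaults and help texts are assumptions chosen for illustration.

```python
import argparse


def build_parser():
    """Build a parser mirroring the flags shown in the hpo.py command.

    Assumption: defaults and help texts are illustrative, not copied from hpo.py.
    """
    parser = argparse.ArgumentParser(description="Sketch of the HPO command-line flags")
    parser.add_argument("--num_samples", type=int, default=4,
                        help="number of trials to run")
    parser.add_argument("--max_iterations", type=int, default=2,
                        help="iteration limit used as the stopping criterion")
    parser.add_argument("--ngpus", type=int, default=1, help="GPUs to use")
    parser.add_argument("--ncpus", type=int, default=1, help="CPUs to use")
    parser.add_argument("--pipeline_name", default="training_pipeline",
                        help="key of the pipeline in config.yaml")
    return parser


# Parse an example command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--num_samples", "8", "--max_iterations", "5"])
print(args.num_samples, args.max_iterations, args.pipeline_name)
```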

Please refer to the itwinai documentation for more guides and tutorials on these two HPO methods.

Scalability Metrics

Here are some examples of the scalability metrics for this use case:

Average Epoch Time Comparison

This plot compares the average time per epoch for each strategy and number of nodes.

../_images/absolute_scalability_plot1.png

Relative Epoch Time Speedup

This plot compares the speedup across the different numbers of nodes for each strategy. The speedup is calculated using the lowest number of nodes as a baseline.

../_images/relative_scalability_plot1.png
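As a rough illustration of the baseline rule, the speedup at a given node count can be computed as the baseline epoch time divided by the epoch time at that node count. The function below is a sketch under that assumption, not the plotting code used for the figure, and the timings are made-up values.

```python
def relative_speedup(avg_epoch_time_by_nodes):
    """Compute speedup per node count relative to the smallest node count.

    Assumption: speedup = baseline time / time at n nodes.
    """
    baseline_nodes = min(avg_epoch_time_by_nodes)
    baseline_time = avg_epoch_time_by_nodes[baseline_nodes]
    return {
        nodes: baseline_time / epoch_time
        for nodes, epoch_time in sorted(avg_epoch_time_by_nodes.items())
    }


# Made-up average epoch times in seconds, keyed by number of nodes.
times = {1: 120.0, 2: 60.0, 4: 40.0}
speedups = relative_speedup(times)
print(speedups)
```

In this made-up example the speedup at 4 nodes is 3.0 rather than the ideal 4.0, which is the kind of gap the plot makes visible.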

Communication vs Computation

This plot shows how much of the GPU time is spent doing computation compared to communication between GPUs and nodes, for each strategy and number of nodes. The shaded area is communication and the colored area is computation. They have all been normalized so that the values are between 0 and 1.0.
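One plausible way to obtain such normalized values is to divide each component by the total of computation and communication time, so the two shares sum to 1. The sketch below assumes exactly that; the source does not spell out the normalization, so treat the formula as an assumption.

```python
def normalized_shares(computation_s, communication_s):
    """Return (computation share, communication share), each between 0 and 1.

    Assumption: both values are divided by their sum.
    """
    total = computation_s + communication_s
    if total <= 0:
        raise ValueError("total time must be positive")
    return computation_s / total, communication_s / total


# Made-up timings in seconds, for illustration only.
comp_share, comm_share = normalized_shares(3.0, 1.0)
print(comp_share, comm_share)
```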

GPU Utilization

This plot shows the GPU utilization for each strategy and number of nodes, as a percentage from 0 to 100. Utilization is defined as the fraction of time spent in computation mode, and it does not directly correlate with FLOPs.

Power Consumption

This plot shows the total energy consumption in watt-hours for the different strategies and number of nodes.
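For reference, energy in watt-hours can be derived from sampled power readings by integrating power over time and dividing by 3600. The sketch below uses trapezoidal integration. It assumes timestamps in seconds and power in watts, and it is not necessarily how the values in the plot were computed.

```python
def energy_watt_hours(timestamps_s, power_w):
    """Integrate power samples (watts) over time (seconds) and return watt-hours.

    Assumption: trapezoidal integration between consecutive samples.
    """
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    joules = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        joules += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return joules / 3600.0  # 1 Wh = 3600 J


# Made-up samples: a constant 100 W draw over one hour.
energy = energy_watt_hours([0, 1800, 3600], [100.0, 100.0, 100.0])
print(energy)
```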