Getting Started with Hyperparameter Optimization in itwinai
Author(s): Anna Lappe (CERN)
This tutorial will walk you through setting up and running hyperparameter optimization (HPO)
for your itwinai pipeline using Ray Tune. By the end, you’ll be able to customize the provided
template code for your own pipeline and start optimizing your models.
You can find the code for the tutorial on GitHub.
It includes a complete example of running a simple custom itwinai TorchTrainer on the MNIST dataset.
Here we will go through the hpo.py template step by step to help you understand the workflow
and guide you in customizing it for your specific needs. That said, the script is designed
to be general-purpose, so with minimal adjustments (just one or two lines), you should be able
to run your own training pipeline using the Slurm script and the HPO template provided in the tutorial.
Prerequisites
Before getting started, ensure you have:
A basic understanding of itwinai pipelines and trainers. If not, refer to the tutorials on workflows.
A YAML file defining your pipeline.
All necessary components implemented (e.g., data getter, loader, trainer, etc.).
Step 1: Import the Required Libraries
Let’s start by importing all the libraries needed for this script.
import argparse
from typing import Dict
import torch
from ray import train, tune
from itwinai.parser import ConfigParser
Step 2: Define the Trial Execution Function
The run_trial function executes a single HPO trial using the provided configuration and data.
config contains the hyperparameters (e.g., learning rate, batch size). data is a general
dictionary that can contain any values you need to pass to the function. In our template,
it includes only the pipeline name defined in your config.yaml, but you may pass additional
values here.
Note: Keep the size of the data dictionary small because it is serialized by Ray Tune when sent to workers. Avoid including large objects like datasets here.
def run_trial(config: Dict, data: Dict):
    pipeline_name = data["pipeline_name"]
    parser = ConfigParser(
        config="config.yaml",
        override_keys={
            # Set hyperparameters controlled by ray
            "batch_size": config["batch_size"],
            "optim_lr": config["optim_lr"]
        },
    )
    my_pipeline = parser.parse_pipeline(
        pipeline_nested_key=pipeline_name, verbose=False
    )
    my_pipeline.execute()
Adaptation:
If necessary, replace config.yaml with the path to your own configuration file.
Adjust hyperparameter keys (e.g., batch_size, optim_lr) to match the keys used in your pipeline.
If your pipeline requires additional configurations, add them to the override_keys dictionary.
For a detailed explanation of defining and parsing pipelines, refer to the Workflow Page.
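The earlier note about keeping the data dictionary small can be checked before launching a run. The sketch below uses the standard-library pickle module as a rough stand-in for Ray's own serializer, so the byte counts are only indicative; the helper serialized_size_bytes and the pipeline name "training_pipeline" are hypothetical.

```python
import pickle
from typing import Any, Dict


def serialized_size_bytes(data: Dict[str, Any]) -> int:
    """Approximate the payload size, using stdlib pickle as a proxy for Ray's serializer."""
    return len(pickle.dumps(data))


# Small payload: only the pipeline name, as in the template (name is hypothetical)
small = {"pipeline_name": "training_pipeline"}

# Large payload: an in-memory, dataset-like list (what the note advises against)
large = {"pipeline_name": "training_pipeline", "samples": list(range(1_000_000))}

print(serialized_size_bytes(small), "bytes")
print(serialized_size_bytes(large), "bytes")
```

If the second number is what your data dictionary looks like, consider passing a file path or identifier instead and loading the data inside the trial.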
Step 3: Configure and Run HPO
The run_hpo function is the core of this script. It sets up Ray Tune to manage the hyperparameter optimization process.
1. Define the search space: Specify the range of hyperparameters to explore. In our example:
search_space = {
    "batch_size": tune.choice([3, 4, 5, 6]),
    "optim_lr": tune.uniform(1e-5, 1e-3),
}
Adaptation:
Add or remove hyperparameters to suit your model.
Use tune.choice for discrete parameters and tune.uniform for continuous, uniform ranges. You can add different sampling distributions and types. For additional search space options, please refer to the Ray Tune documentation. Any search space definition supported by Ray Tune can be used in this template as well.
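To illustrate why the choice of sampling distribution matters, the following sketch compares uniform and log-uniform sampling over the learning-rate range used above. It uses only the Python standard library as a stand-in and is not Ray Tune's implementation; in Ray Tune the log-uniform counterpart is tune.loguniform. The helper names are hypothetical.

```python
import math
import random


def sample_uniform(low: float, high: float, rng: random.Random) -> float:
    """Draw a value uniformly between low and high."""
    return rng.uniform(low, high)


def sample_loguniform(low: float, high: float, rng: random.Random) -> float:
    """Draw uniformly in log space, then map back to the original scale."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
n = 10_000
uniform_draws = [sample_uniform(1e-5, 1e-3, rng) for _ in range(n)]
log_draws = [sample_loguniform(1e-5, 1e-3, rng) for _ in range(n)]

# Fraction of draws that fall in the lower decade of the range (below 1e-4)
frac_uniform = sum(x < 1e-4 for x in uniform_draws) / n
frac_log = sum(x < 1e-4 for x in log_draws) / n
print(f"uniform:     {frac_uniform:.2f} of draws below 1e-4")
print(f"log-uniform: {frac_log:.2f} of draws below 1e-4")
```

With uniform sampling, only a small share of trials explores learning rates below 1e-4, while log-uniform sampling covers both decades about equally. Which one fits depends on how your model responds to the parameter.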
2. Define Ray configs: Set up the tuning and run configurations to control the number of trials, optimization goal, and stopping criteria.
tune_config = tune.TuneConfig(
    metric=args.metric, # Metric to optimize (loss by default)
    mode="min", # Minimize the loss
    num_samples=args.num_samples, # Number of trials to run
)
run_config = train.RunConfig(
    name="Virgo-Ray-Experiment",
    stop={"training_iteration": args.max_iterations}
)
Adaptation:
Replace args.metric with the metric relevant to your optimization task. This metric has to be reported at the end of each iteration (by default this is one epoch) in your trainer. We explain how to do this in Step 4.
If needed, replace the stopping criteria defined in stop. If the limit is set lower than your trainer’s number of epochs, trials stop early.
3. Allocate Resources: Assign GPUs and CPUs for each trial. In this example, we split the total number of available GPUs and CPUs evenly across the trials, with a minimum of one of each per trial.
ngpus_per_trial = max(1, args.ngpus // args.num_samples)
ncpus_per_trial = max(1, args.ncpus // args.num_samples)
# Set resource allocation for each trial (number of GPUs and/or number of CPUs)
resources_per_trial = {"gpu": ngpus_per_trial, "cpu": ncpus_per_trial}
run_with_resources = tune.with_resources(run_trial, resources=resources_per_trial)
Adaptation:
If not using GPUs, set ngpus_per_trial to 0.
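The allocation rule above can be worked through with a few numbers. This is a minimal sketch that mirrors the template's arithmetic in a standalone function; the function name resources_per_trial and the example hardware counts are hypothetical.

```python
from typing import Dict


def resources_per_trial(total_gpus: int, total_cpus: int, num_samples: int) -> Dict[str, int]:
    """Mirror the template's rule: even integer split, but at least 1 of each."""
    return {
        "gpu": max(1, total_gpus // num_samples),
        "cpu": max(1, total_cpus // num_samples),
    }


# 4 GPUs and 32 CPUs shared by 2 trials
print(resources_per_trial(4, 32, 2))  # {'gpu': 2, 'cpu': 16}

# More trials than GPUs: the floor of 1 GPU per trial applies
print(resources_per_trial(4, 32, 8))  # {'gpu': 1, 'cpu': 4}

# No GPUs available: the rule still requests 1 GPU per trial
print(resources_per_trial(0, 8, 4))   # {'gpu': 1, 'cpu': 2}
```

In the second case, eight trials each request one GPU while only four exist, so not all trials can run at the same time. The third case shows why a CPU-only run needs the GPU count set to 0 explicitly: the max(1, ...) floor would otherwise request a GPU that is not there.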
4. Set Up and Execute the Tuner:
Combine the trial function, configurations, and search space into a Tuner object, and run it.
data = {"pipeline_name": args.pipeline_name}
trainable_with_parameters = tune.with_parameters(run_with_resources, data=data)
# Set up Ray Tune Tuner
tuner = tune.Tuner(
    trainable_with_parameters,
    tune_config=tune_config,
    run_config=run_config,
    param_space=search_space, # Search space defined above
)
result_grid = tuner.fit()
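Once tuner.fit() returns, the result grid can be queried for the best trial. The helper below is a sketch rather than part of the template: summarize_best is a hypothetical name, and it relies on the get_best_result method of Ray Tune's result grid and on the config and metrics attributes of the returned result.

```python
from typing import Any, Dict


def summarize_best(result_grid: Any, metric: str = "loss", mode: str = "min") -> Dict[str, Any]:
    """Return the hyperparameters and final metric value of the best trial."""
    # get_best_result compares trials on the given metric, using their last reported value
    best = result_grid.get_best_result(metric=metric, mode=mode)
    return {"config": best.config, metric: best.metrics[metric]}
```

Inside run_hpo, this could be called as summarize_best(result_grid, metric=args.metric) right after tuner.fit().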
Finally, we can call our function:
# Main entry point for script execution
if __name__ == "__main__":
    # Parse command-line arguments
    ...
    # Check for available GPU
    if torch.cuda.is_available():
        device = "cuda"
        print(f"Using GPU: {torch.cuda.get_device_name(torch.cuda.current_device())}")
    else:
        device = "cpu"
        print("Using CPU")
    run_hpo(args)
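The argument parsing is elided in the snippet above. One possible parser is sketched here, with attribute names chosen to match the args.* fields used throughout this tutorial. The flag spellings, the function name parse_args, and all defaults except the "loss" metric are assumptions, not the template's actual values.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Build a parser whose attributes match those read in run_hpo (defaults are placeholders)."""
    parser = argparse.ArgumentParser(description="Hyperparameter optimization with Ray Tune")
    parser.add_argument("--pipeline_name", type=str, default="training_pipeline",
                        help="Name of the pipeline defined in config.yaml (hypothetical default)")
    parser.add_argument("--num_samples", type=int, default=10,
                        help="Number of trials to run")
    parser.add_argument("--max_iterations", type=int, default=10,
                        help="Stop each trial after this many training iterations")
    parser.add_argument("--ngpus", type=int, default=1,
                        help="Total number of GPUs available to the experiment")
    parser.add_argument("--ncpus", type=int, default=4,
                        help="Total number of CPUs available to the experiment")
    parser.add_argument("--metric", type=str, default="loss",
                        help="Metric to optimize")
    return parser.parse_args(argv)


# Example: override two values and keep the remaining defaults
args = parse_args(["--num_samples", "4", "--ngpus", "2"])
print(args)
```

Calling parse_args() without arguments reads from the command line, which is what a script launched by the Slurm job would do.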
Step 4: Add Reporting Call to Your Trainer
To manage trials, i.e. to know when to stop poorly performing ones and where to search next for good hyperparameter configurations, Ray Tune needs to know how the model is doing during training. For this, we add a call that reports metrics at the end of each training iteration:
class MyItwinaiTrainer(TorchTrainer):
    ...

    def train(self):
        for epoch in epochs:
            ...
            # Report training metrics of last epoch to Ray
            train.report({"loss": epoch_val_loss})
It is important that the reported metric name is the same as the one you specify when setting up your Tune config, as described in Step 3.
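The reporting pattern can be tried out without Ray by injecting the report function. This is a minimal sketch under stated assumptions: train_loop, METRIC, and the placeholder loss values are hypothetical, and in a real trainer the report function would be Ray's train.report.

```python
from typing import Callable, Dict, List

METRIC = "loss"  # must equal the metric name passed to the Tune config


def train_loop(epochs: int, report_fn: Callable[[Dict[str, float]], None]) -> None:
    """Run a dummy training loop that reports one metric dictionary per epoch."""
    for epoch in range(epochs):
        # Placeholder for the real training and validation step
        epoch_val_loss = 1.0 / (epoch + 1)
        # One report per epoch; Ray Tune counts each report as one training iteration
        report_fn({METRIC: epoch_val_loss})


# Collect the reports in a list instead of sending them to Ray
reports: List[Dict[str, float]] = []
train_loop(epochs=3, report_fn=reports.append)
print(reports)
```

Keeping the metric name in a single constant is one way to avoid a mismatch between the key that is reported and the key that the tuner optimizes.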
Running the Script
Once you have your HPO script and your trainer set up, you can launch your training by executing the Slurm script:
cd tutorials/hpo-workflows/simple-workflow
sbatch slurm_hpo.sh
This script launches a Ray cluster and executes the hpo.py script.
For more details, see the HPO introduction.