.. _hpo_torchtrainer_workflow:

Hyperparameter Optimization with TorchTrainer on MNIST
======================================================

**Author(s)**: Anna Lappe (CERN), Matteo Bunino (CERN)

This tutorial provides a step-by-step guide to using the
:class:`~itwinai.torch.trainer.TorchTrainer` class for running a hyperparameter
optimization (HPO) study. We assume that you are familiar with the
:class:`~itwinai.torch.trainer.TorchTrainer` class and the itwinai training pipeline.
If you are not, you might want to go through the tutorials on these first.
To illustrate the process, we will work with the FashionMNIST dataset.

By the end of this tutorial, you will:

* Understand how the :class:`~itwinai.torch.trainer.TorchTrainer` functions.
* Know how to create a configuration file to define your HPO study.
* Understand the steps required to define and run an HPO study with the itwinai
  training pipeline.

You can find the full code for this tutorial `on Github `_.

Setting up the Trainer
----------------------

Let's start by defining our trainer class. When you extend the itwinai
:class:`~itwinai.torch.trainer.TorchTrainer`, you inherit all the necessary logic to
connect to an existing Ray cluster and to perform distributed machine learning (ML),
hyperparameter optimization (HPO), or both simultaneously.

When selecting a strategy (e.g., Horovod or DeepSpeed) to distribute training, the
:class:`~itwinai.torch.trainer.TorchTrainer` ensures that workers can communicate via
the underlying Ray cluster. Similarly, the
:class:`~itwinai.torch.trainer.TorchTrainer` allows you to run HPO **without requiring
any code changes**.

During HPO, Ray executes the :meth:`~itwinai.torch.trainer.TorchTrainer.train` method
for each trial **independently**, meaning that trials are completely agnostic of one
another. For more details, see the
:doc:`HPO introduction <../../how-it-works/hpo/explain-hpo>`.

If you need to implement custom training logic not supported by
:class:`~itwinai.torch.trainer.TorchTrainer`, you can create a new trainer that
inherits from it and override the :meth:`~itwinai.torch.trainer.TorchTrainer.train`
method.

.. important::

   **For Ray to correctly execute the training code,** you must call the
   :meth:`~itwinai.torch.trainer.TorchTrainer.ray_report` method **at the end of each
   epoch** to report the validation metric that you want to optimize during tuning
   (typically, the validation loss). This is essential for both **distributed ML
   training and HPO**, as it allows Ray workers to communicate back with the head
   process, keeping it updated on the validation loss evolution.

   Additionally, this method enables checkpoint saving to a persistent storage
   location. See the official Ray documentation for more details:
   `Saving Checkpoints in Ray `_.

   Also, note that when a Ray cluster is not available and you are not running HPO,
   the call to :meth:`~itwinai.torch.trainer.TorchTrainer.ray_report` is automatically
   ignored. In other words, you don't need to remove it when you are not using Ray for
   distributed ML training or HPO.

In this tutorial, we will tune two hyperparameters: **batch size** and
**learning rate**. Our model will be a **ResNet18**, trained on the **FashionMNIST**
dataset.

Below you can find an example of how the
:meth:`~itwinai.torch.trainer.TorchTrainer.train` method can be overridden:

.. code-block:: python

   import numpy as np
   import torch

   # Method of a trainer class that inherits from TorchTrainer
   def train(self) -> None:
       device = self.strategy.device()

       for self.current_epoch in range(self.epochs):
           self.set_epoch()

           train_losses = []
           val_losses = []

           # Training epoch
           self.model.train()
           for images, labels in self.train_dataloader:
               images, labels = images.to(device), labels.to(device)
               outputs = self.model(images)
               train_loss = self.loss(outputs, labels)
               self.optimizer.zero_grad()
               train_loss.backward()
               self.optimizer.step()
               train_losses.append(train_loss.item())

           # Validation epoch (no gradients needed)
           self.model.eval()
           with torch.no_grad():
               for images, labels in self.validation_dataloader:
                   images, labels = images.to(device), labels.to(device)
                   outputs = self.model(images)
                   val_loss = self.loss(outputs, labels)
                   val_losses.append(val_loss.item())

           # Average the per-batch losses over the whole epoch
           epoch_train_loss = float(np.mean(train_losses))
           epoch_val_loss = float(np.mean(val_losses))

           # Log metrics with itwinai loggers
           self.log(epoch_train_loss, "train_loss", kind="metric", step=self.current_epoch)
           self.log(epoch_val_loss, "val_loss", kind="metric", step=self.current_epoch)

           # Report the epoch-level metric and checkpoint to the Ray head
           checkpoint = {
               "epoch": self.current_epoch,
               "loss": epoch_train_loss,
               "val_loss": epoch_val_loss,
           }
           metrics = {"loss": epoch_val_loss}
           self.ray_report(metrics=metrics, checkpoint_data=checkpoint)

Configuring our Trainer
-----------------------

Now that we have our Trainer set up, the next step is to define a configuration file
for our HPO pipeline. Once again, this configuration will look very similar to any
other itwinai pipeline configuration, but we will add some HPO-specific parameters to
define our search space, search algorithm and scheduling algorithm.

When you want to run distributed ML training or HPO with Ray, you can specify
additional Ray-specific configuration objects and pass them to the
:class:`~itwinai.torch.trainer.TorchTrainer` through the arguments that start with the
``ray_`` prefix (e.g., ``ray_tune_config``). In the configuration file, Ray
configurations can be defined using Hydra syntax for objects.

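
As a rough illustration of what "Hydra syntax for objects" means: a mapping with a
``_target_`` key names a class by its dotted import path, and the remaining keys are
passed to it as keyword arguments. The sketch below mimics this idea using only the
standard library. Note that ``instantiate_sketch`` is a hypothetical helper written
for this illustration, not Hydra's implementation, and that it targets
``datetime.timedelta`` instead of a Ray class so that it runs without Ray installed.

.. code-block:: python

   import importlib
   from typing import Any


   def instantiate_sketch(node: Any) -> Any:
       """Recursively build objects from mappings that carry a ``_target_`` key.

       Hypothetical helper for illustration only; not part of Hydra or itwinai.
       """
       if isinstance(node, list):
           return [instantiate_sketch(item) for item in node]
       if not isinstance(node, dict):
           return node  # plain value: leave untouched

       # Resolve children first, so nested ``_target_`` mappings become objects
       resolved = {key: instantiate_sketch(value) for key, value in node.items()}
       target = resolved.pop("_target_", None)
       if target is None:
           return resolved  # ordinary mapping without a target

       # Split "package.module.ClassName" into module path and attribute name
       module_name, _, attr_name = target.rpartition(".")
       factory = getattr(importlib.import_module(module_name), attr_name)
       return factory(**resolved)


   config = {
       "timeout": {"_target_": "datetime.timedelta", "minutes": 5},
       "retries": 3,
   }
   built = instantiate_sketch(config)
   print(built["timeout"].total_seconds())

With Hydra itself, the corresponding entry point is ``hydra.utils.instantiate``.
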
The Ray search space defines the domains of all the hyperparameters that we want to
tune. Once Ray's search algorithm samples a hyperparameter set for a trial, the
sampled values override the default values in the
:class:`~itwinai.torch.config.TrainingConfiguration`, which is passed using the
``config`` argument of the :class:`~itwinai.torch.trainer.TorchTrainer`.

.. code-block:: yaml

   # For more info: https://docs.ray.io/en/latest/train/api/doc/ray.train.ScalingConfig.html
   ray_scaling_config:
     _target_: ray.train.ScalingConfig
     num_workers: 1
     use_gpu: true
     resources_per_worker:
       CPU: 8
       GPU: 1

   # For more info: https://docs.ray.io/en/latest/tune/api/doc/ray.tune.TuneConfig.html
   ray_tune_config:
     _target_: ray.tune.TuneConfig
     num_samples: 2
     scheduler:
       _target_: ray.tune.schedulers.ASHAScheduler
       metric: loss  # name of the metric to optimize during HPO
       mode: min
       max_t: 10
       grace_period: 5
       reduction_factor: 4
       brackets: 1

   # For more info: https://docs.ray.io/en/latest/tune/api/doc/ray.tune.RunConfig.html
   ray_run_config:
     _target_: ray.tune.RunConfig
     storage_path: ${itwinai.cwd:}/ray_checkpoints
     name: FashionMNIST-HPO-Experiment

   # For more info: https://docs.ray.io/en/latest/tune/api/search_space.html
   ray_search_space:
     batch_size:
       type: choice
       categories: [32, 64, 128]
     optim_lr:  # learning rate; the name must match TrainingConfiguration
       type: uniform
       lower: 1e-5
       upper: 1e-3

Okay, let's break down the Ray configuration objects.

* The ``ray_scaling_config`` argument defines how we distribute resources between our
  trials. To learn more about the options for setting resources, please refer to the
  `Ray Train documentation `_ on this topic. Make sure that you have allocated
  sufficient resources on your cluster to execute at least one trial. This means that
  if your configuration demands 4 GPUs and 32 CPUs per trial under
  ``resources_per_worker``, you should have allocated at least this many GPUs and CPUs
  for your job.
* In the ``ray_tune_config`` we configure which search algorithm and scheduler to use
  to search the hyperparameter space and sample new configurations. You can refer to
  the Ray documentation to learn more about the supported `search algorithms `_ and
  `schedulers `_. The ``num_samples`` argument specifies how many trials you wish to
  run; the default is one. Ray will queue trials if they cannot all be executed at
  once.
* The ``ray_run_config`` defines a path that is used for checkpointing. Setting it is
  mandatory if you want to distribute any one trial across more than one node, because
  Ray uses this path as a shared directory to coordinate and share data generated on
  each of the nodes. The ``ray_run_config`` is of type ``ray.tune.RunConfig`` even for
  Ray distributed training without HPO. Find out more about the ``RunConfig``
  `here `_.
* In the ``ray_search_space`` we define which hyperparameters we want to tune. For
  each tunable parameter we have to specify its type and define its domain. For more
  information on which parameter types are possible and how to define their domains,
  have a look at `this page `_.

.. danger::

   **IMPORTANT:** When tuning, **you must use the exact hyperparameter names** as
   defined in :class:`~itwinai.torch.config.TrainingConfiguration`. If you use
   different names, **the hyperparameters will be ignored**, making the entire tuning
   process **invalid**.

   **Example:** In :class:`~itwinai.torch.config.TrainingConfiguration`, the learning
   rate is defined as ``optim_lr``. Therefore, when defining a search space for the
   learning rate, you **must** use ``optim_lr`` as its name.

   **Why?** The trainer accesses the learning rate using ``self.config.optim_lr``. If
   you define it under a different name (e.g., ``lr`` or ``learning_rate``), the tuner
   will set the learning rate with the wrong name in the training configuration,
   **and it will be ignored by the trainer**.

.. note::

   Notice how in the ``ray_run_config`` we use the custom OmegaConf resolver
   ``${itwinai.cwd:}`` provided by itwinai to dynamically compute the absolute path to
   the current working directory, depending on where the pipeline is executed. It is
   important to use an absolute path because the run config expects a URI for the
   ``storage_path``.

Running our Code
----------------

Great! So we have created our custom trainer inheriting from the
:class:`~itwinai.torch.trainer.TorchTrainer`, and we have defined our pipeline in a
configuration file. Now, all that is left to do is launch our training on HPC:

.. code-block:: bash

   cd tutorials/hpo-workflows/fashion-mnist
   sbatch slurm_hpo.sh
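
To recap how the search space and the training configuration interact, the following
standard-library sketch samples one trial from a search space shaped like
``ray_search_space`` and overrides the matching defaults. Everything in it is
hypothetical: the ``DEFAULTS`` dictionary stands in for
:class:`~itwinai.torch.config.TrainingConfiguration` with made-up values, and
``sample_trial`` and ``apply_overrides`` are not part of itwinai or Ray. It also makes
the naming pitfall concrete: a misnamed key is added next to the real field, so the
value the trainer reads stays at its default.

.. code-block:: python

   import random
   from typing import Any, Dict

   # Stand-in for the defaults held by TrainingConfiguration (values are made up)
   DEFAULTS: Dict[str, Any] = {"batch_size": 32, "optim_lr": 1e-3}

   # Same shape as the ``ray_search_space`` entry of the configuration file
   SEARCH_SPACE: Dict[str, Dict[str, Any]] = {
       "batch_size": {"type": "choice", "categories": [32, 64, 128]},
       "optim_lr": {"type": "uniform", "lower": 1e-5, "upper": 1e-3},
   }


   def sample_trial(space: Dict[str, Dict[str, Any]], rng: random.Random) -> Dict[str, Any]:
       """Draw one value per hyperparameter from its domain."""
       sampled: Dict[str, Any] = {}
       for name, domain in space.items():
           if domain["type"] == "choice":
               sampled[name] = rng.choice(domain["categories"])
           elif domain["type"] == "uniform":
               sampled[name] = rng.uniform(domain["lower"], domain["upper"])
           else:
               raise ValueError(f"Unsupported domain type: {domain['type']}")
       return sampled


   def apply_overrides(defaults: Dict[str, Any], sampled: Dict[str, Any]) -> Dict[str, Any]:
       """Return a copy of the defaults with the sampled values written on top."""
       return {**defaults, **sampled}


   rng = random.Random(0)  # fixed seed so the sketch is reproducible
   trial_config = apply_overrides(DEFAULTS, sample_trial(SEARCH_SPACE, rng))
   print(trial_config)

   # A misnamed hyperparameter is stored under a new key, while the field the
   # trainer actually reads (``optim_lr``) keeps its default value.
   misnamed = apply_overrides(DEFAULTS, {"learning_rate": 5e-4})
   print(misnamed["optim_lr"] == DEFAULTS["optim_lr"])
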