Tracking ML workflows ========================= **Author(s)**: Matteo Bunino (CERN), Killian Verder (CERN) One key aspect of understanding the intricacies of your ML experiment lies in the descriptive data around hyper-parameters, metrics, and model checkpoints. This information allows us to compare different models and reproduce previous results. In order to facilitate this, itwinai provides a :class:`~itwinai.loggers.Logger` wrapper which provides users with a simple, unified method to extract descriptive information about their ML training run regardless of which supported logger is used. The core idea is to abstract user-developed code from popular logging frameworks (such as MLFlow, Weights&Biases, Tensorboard), allowing itwinai users to easily switch between loggers with no changes to their code. Moreover, this abstraction layer allows the use of multiple logging frameworks at the same time, enabling our users to pick the best from each one. The supported loggers, also shown in figure :numref:`logger_abstraction_fig`, are listed in the :py:mod:`itwinai.loggers` module documentation. .. _logger_abstraction_fig: .. figure:: figures/logger_fig.png :alt: Diagram of supported loggers in itwinai `logger` wrapper :align: center :scale: 12% Informal representation of itwinai logger abstraction. Getting started with itwinai loggers ------------------------------------- itwinai provides a wrapper that allows the user to call a logger in a unified manner regardless of which logger is used on the backend. The abstraction of logging logic that the itwinai logger provides can avoid redundant code duplication and simplify the maintenance of large-scale projects. The itwinai :class:`~itwinai.loggers.Logger` provides a unique interface to log any *kind* of data trough the :meth:`~itwinai.loggers.LogMixin.log` method. You can use it to log metrics, model checkpoints, images, text, provenance information, and more. However, to make sure that the item you want to log is correctly managed, you need to specify its kind by means of the ``kind`` argument. See :py:mod:`itwinai.loggers` module documentation for a list of object kinds that can be logged. .. admonition:: Example of basic usage of itwinai Logger The following outlines how to concretely use the logger on a toy example: .. code-block:: python from itwinai.loggers import MLFlowLogger my_logger = MLFlowLogger( experiment_name='my_awesome_experiment', log_freq=100 ) # Initialize logger my_logger.create_logger_context() # During PyTorch model training for epoch in range(100): loss_accumulator = .0 for batch_idx, batch in enumerate(train_dataloader): x, y = batch pred = my_net(x) loss = loss_criterion(y, pred) ... # Log the loss with "batch" granularity my_logger.log( item=loss.item(), identifier='my_training_loss_batch', kind='metric', batch_idx=batch_idx ) loss_accumulator += loss.item() ... epoch_loss = loss_accumulator / len(train_dataloader) # Log the loss with "epoch" granularity my_logger.log( item=epoch_loss, identifier='my_training_loss_epoch', kind='metric' ) # Save model after training my_logger.log( item=my_net, identifier='trained_model', kind='model' ) # Destroy logger before exiting my_logger.destroy_logger_context() Similarly, you can use itwinai loggers in the itwinai :class:`~itwinai.torch.trainer.TorchTrainer`. .. admonition:: Example of itwinai Logger and TorchTrainer .. code-block:: python from itwinai.loggers import WandBLogger from itwinai.torch.trainer import TorchTrainer my_logger = WandBLogger( project_name='my_awesome_experiment', log_freq=100 ) # If needed, override the default trainer class MyCustomTrainer(TorchTrainer): ... def train(self, ...): ... self.logger.log( item=some_metric, identifier='my_metric_name', kind='metric' ) # Instantiate the trainer passing some itwinai logger my_trainer = MyCustomTrainer( config=..., epochs=100, model=my_net, strategy='ddp', logger=my_logger ) ... # Start training my_trainer.execute(train_dataset, validation_dataset) Logging frequency trade-off +++++++++++++++++++++++++++ Neural networks parameters are iteratively optimized on small data samples extracted from some training dataset, also called **batches**. On the other hand, an **epoch** refers to one sweep through the entire dataset, iterating over all batches that compose it. Thus, an epoch consists of multiple batch iterations. Logging ML metadata for each batch of each epoch would provide users with the most granular information possible, but it comes at a significant cost in training speed due to the slow process of writing to disk after each batch (a.k.a. I/O bottleneck). Thus, logging every few batches or only once per epoch can be a worthy trade-off depending on the use case. The ``log_freq`` argument in the :class:`~itwinai.loggers.Logger` constructor allows the user to determine at which batch interval the action of logging is actually performed, without having to change the training routine. When an integer is passed as ``log_freq``, the logger will honor the call to the :meth:`~itwinai.loggers.LogMixin.log` method for each batch, if that batch's index is a multiple of the given integer (i.e., ``batch_id % log_freq == 0``). In all the other cases (i.e., ``batch_id % log_freq != 0``), the logger will ignore the :meth:`~itwinai.loggers.LogMixin.log` call in a transparent way for the user. .. admonition:: Example on the functioning of ``log_freq`` constructor argument Should ``log_freq = 10``, the first batch (``batch_id = 0``) is logged, after which the 11th batch (``batch_id = 10``) is logged, after which the 21th batch is logged and so on. ``log_freq`` can also receive the following string values: ``"epoch"`` or ``"batch"``. - When set to ``"epoch"``, the logger only logs when called outside of the nested training loop iterating over dataset batches. In other words, the logger only logs if ``batch_idx`` is not passed as parameter to the :meth:`~itwinai.loggers.LogMixin.log` method. - When set to ``"batch"``, every batch is logged. .. warning:: The logger assumes to be outside of the inner training loop (namely, the one iterating over dataset batches) when the ``batch_idx`` argument of the :meth:`~itwinai.loggers.LogMixin.log` method is set to ``None`` or is simply not given. **It is therefore your responsibility to make sure that** ``batch_idx`` **is always passed to the** :meth:`~itwinai.loggers.LogMixin.log` **method when available (i.e., when iterating over batches)!** If you don't do it, the logger will always log, regardless of what you pass to the ``log_freq`` argument in the :class:`~itwinai.loggers.Logger` constructor. In other words, if you do not pass the ``batch_idx`` argument to the :meth:`~itwinai.loggers.LogMixin.log` where you should, then you can run into the situation where you told the logger to only log on epochs, but then it also logs on batches. Logging during distributed ML ++++++++++++++++++++++++++++++++++ In distributed workflows, multiple workers perform tasks at the same time. This can sometimes cause problems called race conditions, where the order in which different workers access or modify the same resource affects the software's behavior. For example, if every worker in a distributed ML job tries to log its local copy of a variable, they might all write to the same file simultaneously, leading to errors. In other cases, if all workers log identical information (like model parameters when using data parallelism), it can result in unnecessary redundancy. To manage this, each worker in a distributed environment is given a unique number, called its **global rank**. This is simply an integer going from :math:`0` to :math:`N - 1`, where :math:`N` is the total number of workers. A worker's rank can be accessed from a process using various methods, such as through environment variables set by the `torchrun `_ launcher. The itwinai logger helps control which workers are allowed to log and, conversely, in which workers calls to the :meth:`~itwinai.loggers.LogMixin.log` method should be ignored. Using the ``log_on_workers argument`` argument in the :class:`~itwinai.loggers.Logger` constructor, you can specify which worker(s) should log based on their global rank. To make sure each logger knows its worker's rank, the :meth:`~itwinai.loggers.Logger.create_logger_context` method accepts a ``rank`` integer. .. admonition:: Example of itwinai Logger in a distributed ML job In this example, we use the itwinai :class:`~itwinai.loggers.MLFlowLogger` to log only on the worker with global rank equal to 0. In all the other workers, calls to the :meth:`~itwinai.loggers.LogMixin.log` method will be ignored. As you can see, the code is very similar to the example above, with the only differences being in the logger constructor and the :meth:`~itwinai.loggers.Logger.create_logger_context` method. When using the itwinai :class:`~itwinai.torch.trainer.TorchTrainer`, the rank will be automatically passed to the logger's :meth:`~itwinai.loggers.Logger.create_logger_context` method by the :class:`~itwinai.torch.trainer.TorchTrainer`. .. code-block:: python import os from itwinai.loggers import MLFlowLogger my_logger = MLFlowLogger( experiment_name='my_awesome_experiment', log_freq=100, log_on_workers=0 ) # Initialize logger, assuming RANK env variable by torchrun launcher my_logger.create_logger_context(rank=int(os.environ['RANK'])) # During PyTorch model training for epoch in range(100): loss_accumulator = .0 for batch_idx, batch in enumerate(train_dataloader): x, y = batch pred = my_net(x) loss = loss_criterion(y, pred) ... # Log the loss with "batch" granularity my_logger.log( item=loss.item(), identifier='my_training_loss_batch', kind='metric', batch_idx=batch_idx ) loss_accumulator += loss.item() ... epoch_loss = loss_accumulator / len(train_dataloader) # Log the loss with "epoch" granularity my_logger.log( item=loss.item(), identifier='my_training_loss_epoch', kind='metric' ) # Save model after training my_logger.log( item=my_net, identifier='trained_model', kind='model' ) # Destroy logger before exiting my_logger.destroy_logger_context() .. note:: When logging on more than one worker using the same logger, **it is responsibility of the user to verify that the chosen logging framework supports multiprocessing**, adapting it accordingly if not. Pytorch Lightning Logger +++++++++++++++++++++++++++ In addition to the itwinai loggers, itwinai provides a way for users to integrate itwinai logging functionalities within PyTorch Lightning workflows. Designed to work as a direct substitute for the PyTorch Lightning logger, the :class:`~itwinai.loggerstorch..ItwinaiLogger` wraps any existing itwinai logger (e.g., MLFlow or Prov4ML logger) and adapts it to the PyTorch Lightning interface. By implementing the same methods as the PyTorch Lightning logger, the :class:`~itwinai.loggerstorch..ItwinaiLogger` allows users to replace the default logger in their existing code with this wrapper while maintaining compatibility with PyTorch Lightning's logging operations, including integration with the ModelCheckpoint callback. .. admonition:: Example of Pytorch Lightning Trainer using ItwinaiLogger In this example, we see how we can instantiate the :class:`~itwinai.loggerstorch..ItwinaiLogger` with an itwinai logger instance of our choice and pass it to a Pytorch Lightning Trainer to use just as any other Lightning Logger. .. code-block:: python import pytorch_lightning as pl from my_models import MyModel from itwinai.loggers import Prov4ML from itwinai.torch.loggers import ItwinaiLogger my_model = MyModel() my_prov_logger = Prov4ML() my_lightning_logger = ItwinaiLogger(itwinai_logger=my_prov_logger) trainer = pl.Trainer(logger=my_itwinai_logger) trainer.fit(model) This illustrates the basic structure of instantiating the :class:`~itwinai.loggerstorch..ItwinaiLogger` and passing it to a Pytorch Lightning trainer instance. A more detailed example of the use of the :class:`~itwinai.loggerstorch..ItwinaiLogger` in an itwinai Pytorch Lightning training pipeline on the MNIST dataset can be found in :doc:`Pytorch Lightning <../../use-cases/mnist_doc>` Further references ------------------- - :py:mod:`itwinai.loggers` module documentation. - `MLFlow `_: An open-source logger, MLFlow integrates with most commonly used ML libraries. MLFlow offers tools such as a model registry to aid in version tracking, facilitation of model deployment through MLFlow Models, and strong integration with commonly used ML frameworks. - `Weights&Biases `_: Besides comprehensive tracking of hyperparameters, model metrics, and system performance measures, WandB offers an interactive web-based dashboard that visualises logged metrics in real time. - `Tensorboard for TensorFlow `_: Tensorboard offers a comprehensive suite of visualisation tools including real-time plotting, graph visualisation of neural networks, and image and audio logging besides scalar outputs. - `Tensorboard for PyTorch `_ Similarly as for TensorFlow, you can use this variant of TensorBoard when training PyTorch models.