Profiling Overview

This is an overview over the different profiling methods used in itwinai, as well as a guide on when to use which profiler.

itwinai Profilers — a Quick Intro

These are the different options for profiling your training with itwinai:

  • Computation vs Other: Tries to approximate the time spent doing computation and not in computation to understand potential bottlenecks with the distribution across multiple GPUs. We count any call to PyTorch’s ATen library as computation.

  • GPU Energy Consumption and Utilization: Measures how much energy is spent and the average utilization for the GPUs.

  • Time per Epoch: Measures how much time is spent per epoch to understand how well the training algorithm scales.

  • General Profiling with py-spy: Measures how much time is spent in each function with statistical sampling to help you focus your optimization efforts on the right part of the code.

The first three can be toggled with the following boolean flags in your configuration:

  • enable_torch_profiling: Activate the PyTorch Profiler for computation vs other.

  • store_torch_profiling_traces: Store the traces from the PyTorch Profiler.

    • Requires enable_torch_profiling to be activated as well.

  • measure_gpu_data: Measure the GPU energy consumption and utilization.

  • measure_epoch_time: Measure the time per epoch.

As these flags are input parameters to the TorchTrainer, make sure to place them under this target, as shown in the following example:

...
training_step:
  _target_: itwinai.torch.trainer.TorchTrainer
  enable_torch_profiling: True
  store_torch_profiling_traces: True
  measure_gpu_data: True
  measure_epoch_time: True

The profiling data will be logged to the selected loggers, if you want to generate a scalability report afterwards, ensure that the the MLFlowLogger is set up in your configuration, as this is the data source used to generate the report. If you want a full example on how to set up your configuration, you can have a look at the MNIST use case.

For more information on how to activate the py-spy profiler, read the py-spy profiling guide.

Selection Guide

This section guides you in choosing the right profiler based on what you’re trying to measure. Some profilers are primarily intended for analyzing scalability across different training setups, while others are best suited for debugging general bottlenecks.

Understanding Scalability

If you’re running your code on multiple GPUs or nodes and want to evaluate how well it scales, itwinai provides several tools to help you break down where time is spent and how hardware is used.

Note

When evaluating the scalability of your model/algorithm, factors such as network congestion or heat can cause fluctuations in training speed, thus adding noise to the scalability data. Because of this, we recommend running multiple identical runs. This likely reduces the noise and gives you more robust results. To do this, you can run the same test multiple times, with the same run_name. This is already supported by the scalability report generated by itwinai.

enable_torch_profiling

Approximates time spent on computation vs other to help identify scaling bottlenecks when running on multiple GPUs or nodes. This is done using the averaged results from the PyTorch Profiler. We compare the time spent in the ATen library, PyTorch’s computation library, to the time spent in other calls. This is done using regex matching.

Warning

This measure is only a rough approximation, as it does not account for overlap in time. Also note that distributed training frameworks differ in their implementation, so comparisons across frameworks are not meaningful. Use this to compare how each strategy scales, not as an absolute measure of potential overhead.

store_torch_profiling_traces

Saves the traces from the profiling using the TensorBoard Trace Handler. Requires that enable_torch_profiling also is activated.

measure_gpu_data

Monitors GPU energy consumption and utilization. Useful for assessing whether your GPUs are underutilized.

Reports average utilization and total energy usage per GPU for the full training run.

measure_epoch_time

Tracks the wall-clock time per epoch to evaluate how your training scales with more data or compute.

This is a coarse but direct measure of scalability. The output can be plotted or compared across runs and configurations.

Diagnosing Python-Side Bottlenecks

py-spy

External profiler that captures a statistical overview of where time is spent in your Python code.

Particularly useful for spotting performance issues that are unrelated to scaling—such as slow Python loops, blocking calls, or I/O overhead. Best used when you’re unsure where to begin optimizing.

For more details, see the py-spy profiling guide.