Profiling Overview
This is an overview of the different profiling methods used in itwinai, as well as a
guide on when to use which profiler.
itwinai Profilers — a Quick Intro
These are the different options for profiling your training with itwinai:
Computation vs Other: Approximates the time spent on computation versus everything else, to help identify bottlenecks when training is distributed across multiple GPUs. Any call to PyTorch’s ATen library counts as computation.
GPU Energy Consumption and Utilization: Measures how much energy is spent and the average utilization for the GPUs.
Time per Epoch: Measures how much time is spent per epoch to understand how well the training algorithm scales.
General Profiling with py-spy: Measures how much time is spent in each function with statistical sampling to help you focus your optimization efforts on the right part of the code.
The first three can be toggled with the following boolean flags in your configuration:
enable_torch_profiling: Activate the PyTorch Profiler for computation vs other.
store_torch_profiling_traces: Store the traces from the PyTorch Profiler. Requires enable_torch_profiling to be activated as well.
measure_gpu_data: Measure the GPU energy consumption and utilization.
measure_epoch_time: Measure the time per epoch.
As these flags are input parameters to the TorchTrainer, make sure to place them under
this target, as shown in the following example:
...
training_step:
  _target_: itwinai.torch.trainer.TorchTrainer
  enable_torch_profiling: True
  store_torch_profiling_traces: True
  measure_gpu_data: True
  measure_epoch_time: True
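The dependency between the flags can be checked before launching a run. The following is a minimal sketch of such a check, not part of itwinai: the function name check_profiling_flags is hypothetical, the configuration is assumed to be already loaded into a plain dictionary holding the keys under the TorchTrainer target, and a missing flag is treated as switched off, which is an assumption made for this example only.

```python
def check_profiling_flags(trainer_cfg: dict) -> list[str]:
    """Return a list of problems found in the profiling flags (empty if none).

    Hypothetical helper for illustration; itwinai may validate these differently.
    """
    problems = []
    flags = (
        "enable_torch_profiling",
        "store_torch_profiling_traces",
        "measure_gpu_data",
        "measure_epoch_time",
    )
    for flag in flags:
        # Assumption: a flag that is not set counts as False.
        value = trainer_cfg.get(flag, False)
        if not isinstance(value, bool):
            problems.append(f"{flag} should be a boolean, got {value!r}")

    # Traces can only be stored if the PyTorch Profiler is enabled at all.
    if trainer_cfg.get("store_torch_profiling_traces") and not trainer_cfg.get(
        "enable_torch_profiling"
    ):
        problems.append(
            "store_torch_profiling_traces requires enable_torch_profiling"
        )
    return problems


if __name__ == "__main__":
    cfg = {
        "_target_": "itwinai.torch.trainer.TorchTrainer",
        "store_torch_profiling_traces": True,
        "measure_epoch_time": True,
    }
    for problem in check_profiling_flags(cfg):
        print(problem)
```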
The profiling data is logged to the selected loggers. If you want to generate a scalability
report afterwards, make sure that the MLFlowLogger is set up in your configuration, as
it is the data source used to generate the report.
For a full example of how to set up your configuration, have a look at the
MNIST use case.
For more information on how to activate the py-spy profiler, read the py-spy profiling guide.
Selection Guide
This section guides you in choosing the right profiler based on what you’re trying to measure. Some profilers are primarily intended for analyzing scalability across different training setups, while others are best suited for debugging general bottlenecks.
Understanding Scalability
If you’re running your code on multiple GPUs or nodes and want to evaluate how well it scales,
itwinai provides several tools to help you break down where time is spent and how hardware
is used.
Note
When evaluating the scalability of your model/algorithm, factors such as network congestion
or heat can cause fluctuations in training speed, thus adding noise to the scalability
data. Because of this, we recommend repeating each run several times with identical
settings, which tends to average out the noise and give more robust results. To do this,
run the same test multiple times with the same run_name. The scalability report generated
by itwinai already supports this.
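To illustrate why repeated runs help, the sketch below averages epoch times over identical runs and derives a speedup relative to a baseline. It is not the code behind the itwinai scalability report: the function names summarize_runs and speedup are hypothetical, and the timings are made-up placeholder values rather than measurements.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_runs(epoch_times):
    """Group (num_gpus, seconds_per_epoch) samples by GPU count.

    Returns {num_gpus: (mean_seconds, sample_std_seconds)}.
    """
    grouped = defaultdict(list)
    for num_gpus, seconds in epoch_times:
        grouped[num_gpus].append(seconds)

    summary = {}
    for num_gpus, samples in sorted(grouped.items()):
        # The sample standard deviation needs at least two repetitions.
        spread = stdev(samples) if len(samples) > 1 else 0.0
        summary[num_gpus] = (mean(samples), spread)
    return summary


def speedup(summary, baseline_gpus):
    """Speedup of each configuration relative to the baseline's mean epoch time."""
    base_mean, _ = summary[baseline_gpus]
    return {n: base_mean / m for n, (m, _) in summary.items()}


if __name__ == "__main__":
    # Placeholder timings: two repetitions per GPU count.
    samples = [(1, 100.0), (1, 110.0), (2, 60.0), (2, 50.0)]
    summary = summarize_runs(samples)
    for gpus, factor in speedup(summary, baseline_gpus=1).items():
        avg, spread = summary[gpus]
        print(f"{gpus} GPU(s): {avg:.1f}s +/- {spread:.1f}s, speedup {factor:.2f}x")
```

A large spread relative to the mean suggests that more repetitions are needed before drawing conclusions about scaling.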
- enable_torch_profiling
Approximates the time spent on computation vs other to help identify scaling bottlenecks when running on multiple GPUs or nodes. It uses the averaged results from the PyTorch Profiler and compares the time spent in ATen, PyTorch’s computation library, to the time spent in all other calls. Calls are assigned to one of the two groups using regex matching.
Warning
This measure is only a rough approximation, as it does not account for computation and other operations overlapping in time. Also note that distributed training frameworks differ in their implementation, so comparisons across frameworks are not meaningful. Use this to compare how each strategy scales, not as an absolute measure of potential overhead.
- store_torch_profiling_traces
Saves the traces from the profiling using the TensorBoard Trace Handler. Requires that
enable_torch_profiling is also activated.
- measure_gpu_data
Monitors GPU energy consumption and utilization. Useful for assessing whether your GPUs are underutilized.
Reports average utilization and total energy usage per GPU for the full training run.
- measure_epoch_time
Tracks the wall-clock time per epoch to evaluate how your training scales with more data or compute.
This is a coarse but direct measure of scalability. The output can be plotted or compared across runs and configurations.
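The computation vs other split described for enable_torch_profiling can be sketched as follows. This is a simplified illustration and not the itwinai implementation: it assumes that ATen operators appear in profiler output with names starting with "aten::" (as in PyTorch Profiler tables), the exact regex used by itwinai is not given here, and the function name computation_fraction and the event timings are hypothetical.

```python
import re

# Assumption: ATen operator names start with "aten::".
ATEN_PATTERN = re.compile(r"^aten::")


def computation_fraction(events):
    """Return the share of total time attributed to ATen calls.

    events: iterable of (name, self_time) pairs, e.g. averaged profiler rows.
    Overlap between calls is ignored, so the result is only approximate.
    """
    computation = 0.0
    other = 0.0
    for name, self_time in events:
        if ATEN_PATTERN.match(name):
            computation += self_time
        else:
            other += self_time
    total = computation + other
    return computation / total if total > 0 else 0.0


if __name__ == "__main__":
    # Placeholder rows, not real profiler output.
    rows = [
        ("aten::addmm", 30.0),
        ("aten::relu", 10.0),
        ("nccl:all_reduce", 40.0),
        ("cudaMemcpyAsync", 20.0),
    ]
    print(f"computation share: {computation_fraction(rows):.0%}")
```

Comparing this share across GPU counts for the same strategy shows whether the non-computation part grows as the training is scaled out.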
Diagnosing Python-Side Bottlenecks
- py-spy
External profiler that captures a statistical overview of where time is spent in your Python code.
Particularly useful for spotting performance issues that are unrelated to scaling—such as slow Python loops, blocking calls, or I/O overhead. Best used when you’re unsure where to begin optimizing.
For more details, see the py-spy profiling guide.