Profiling Overview
==================

This is an overview over the different profiling methods used in ``itwinai``, as well as a
guide on when to use which profiler.

``itwinai`` Profilers — a Quick Intro
-------------------------------------

These are the different options for profiling your training with ``itwinai``:

* **Computation vs Other**: Tries to approximate the time spent doing computation and not in
  computation to understand potential bottlenecks with the distribution across multiple GPUs.
  We count any call to PyTorch's ATen library as computation.
* **GPU Energy Consumption and Utilization**: Measures how much energy is spent and the
  average utilization for the GPUs. 
* **Time per Epoch**: Measures how much time is spent per epoch to understand how well the
  training algorithm scales.
* **General Profiling with py-spy**: Measures how much time is spent in each function with
  statistical sampling to help you focus your optimization efforts on the right part of the
  code. 

The first three can be toggled with the following boolean flags in your configuration:

* ``enable_torch_profiling``: Activate the PyTorch Profiler for computation vs other.
* ``store_torch_profiling_traces``: Store the traces from the PyTorch Profiler.

  * Requires ``enable_torch_profiling`` to be activated as well.

* ``measure_gpu_data``: Measure the GPU energy consumption and utilization.
* ``measure_epoch_time``: Measure the time per epoch.

As these flags are input parameters to the ``TorchTrainer``, make sure to place them under
this target, as shown in the following example:

.. code-block:: yaml

      ...
      training_step:
        _target_: itwinai.torch.trainer.TorchTrainer
        enable_torch_profiling: True
        store_torch_profiling_traces: True
        measure_gpu_data: True
        measure_epoch_time: True


The profiling data will be logged to the selected loggers, if you want to generate a scalability
report afterwards, ensure that the the ``MLFlowLogger`` is set up in your configuration, as
this is the data source used to generate the report.
If you want a full example on how to set up your configuration, you can have a look at the
:doc:`MNIST use case <../../use-cases/mnist_doc>`.

For more information on how to activate the **py-spy** profiler, read the
:doc:`py-spy profiling guide <py-spy-profiling>`.


Selection Guide
---------------

This section guides you in choosing the right profiler based on what you're trying to measure.
Some profilers are primarily intended for analyzing **scalability** across different training
setups, while others are best suited for **debugging general bottlenecks**.

Understanding Scalability
^^^^^^^^^^^^^^^^^^^^^^^^^

If you're running your code on multiple GPUs or nodes and want to evaluate how well it scales,
``itwinai`` provides several tools to help you break down where time is spent and how hardware
is used.

.. note::

   When evaluating the scalability of your model/algorithm, factors such as network congestion
   or heat can cause fluctuations in training speed, thus adding noise to the scalability
   data. Because of this, we recommend running multiple identical runs. This likely reduces
   the noise and gives you more robust results. To do this, you can run the same test multiple
   times, with the same ``run_name``. This is already supported by the
   scalability report generated by ``itwinai``. 

enable_torch_profiling
    Approximates time spent on computation vs other to help identify scaling bottlenecks when
    running on multiple GPUs or nodes. This is done using the averaged results from the
    PyTorch Profiler. We compare the time spent in the ATen library, PyTorch's computation
    library, to the time spent in other calls. This is done using regex matching. 

    .. warning::

       This measure is only a rough approximation, as it does not account for overlap in time.
       Also note that distributed training frameworks differ in their implementation, so
       comparisons across frameworks are not meaningful. Use this to compare how each strategy
       scales, not as an absolute measure of potential overhead.


store_torch_profiling_traces
    Saves the traces from the profiling using the TensorBoard Trace Handler. Requires that
    ``enable_torch_profiling`` also is activated. 
    

measure_gpu_data
    Monitors GPU energy consumption and utilization. Useful for assessing whether your GPUs are
    underutilized.

    Reports average utilization and total energy usage per GPU for the full training run.

measure_epoch_time
    Tracks the wall-clock time per epoch to evaluate how your training scales with more data or
    compute.

    This is a coarse but direct measure of scalability. The output can be plotted or compared
    across runs and configurations.

Diagnosing Python-Side Bottlenecks
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

py-spy
    External profiler that captures a statistical overview of where time is spent in your
    Python code.

    Particularly useful for spotting performance issues that are unrelated to scaling—such as
    slow Python loops, blocking calls, or I/O overhead. Best used when you're unsure where to
    begin optimizing.

    For more details, see the :doc:`py-spy profiling guide <py-spy-profiling>`.