itwinai.scalability_report

data.py

itwinai.scalability_report.data.read_scalability_metrics_from_csv(data_dir: Path | str, expected_columns: Set) → DataFrame

Reads and validates scalability metric CSV files from a directory and combines them into a single DataFrame.

Parameters:
  • data_dir (Path | str) – Path to the directory containing the CSV files. All files in the directory must have a .csv extension.

  • expected_columns (Set) – A set of column names expected to be present in each CSV file.

Returns:

A DataFrame containing the concatenated data from all valid CSV files in the directory.

Return type:

pd.DataFrame

Raises:
  • ValueError – If the directory contains non-CSV files, if no .csv files are found, or if any file is missing the expected columns.
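The validation rules above can be sketched with the standard library alone. This is a minimal illustration and not itwinai's implementation: the helper name read_metrics is hypothetical, and a list of row dictionaries stands in for the pandas DataFrame so the example has no third-party dependencies.

```python
import csv
import tempfile
from pathlib import Path


def read_metrics(data_dir, expected_columns):
    """Hypothetical stand-in: returns a list of row dicts, not a DataFrame."""
    data_dir = Path(data_dir)
    files = sorted(data_dir.iterdir())

    # Rule 1: every entry in the directory must be a .csv file.
    non_csv = [f.name for f in files if f.suffix != ".csv"]
    if non_csv:
        raise ValueError(f"Non-CSV files in {data_dir}: {non_csv}")

    # Rule 2: the directory must not be empty.
    if not files:
        raise ValueError(f"No .csv files found in {data_dir}")

    rows = []
    for file_path in files:
        with file_path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            # Rule 3: each file must provide all expected columns.
            missing = set(expected_columns) - set(reader.fieldnames or [])
            if missing:
                raise ValueError(f"{file_path} is missing columns: {sorted(missing)}")
            rows.extend(reader)
    return rows


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "run_a.csv").write_text("name,nodes,epoch_id,time\nddp,1,0,12.5\n")
    Path(tmp, "run_b.csv").write_text("name,nodes,epoch_id,time\nddp,2,0,7.0\n")
    combined = read_metrics(tmp, {"name", "nodes", "epoch_id", "time"})
```

Here combined holds one row per CSV line across both files, mirroring the concatenation described above.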

plot.py

itwinai.scalability_report.plot.calculate_plot_dimensions(num_datapoints: int) → Tuple[int, int]

Calculates the height and width of a plot, given a number of datapoints.

Returns:

A tuple containing the calculated height and the calculated width, in that order.

Return type:

Tuple[int, int]

itwinai.scalability_report.plot.absolute_avg_epoch_time_plot(avg_epoch_time_df: DataFrame, gpus_per_node: int = 4) → Tuple[Figure, Axes]

Generates a log-log plot of average epoch training times against the number of GPUs for distributed training strategies.

Parameters:
  • avg_epoch_time_df (pd.DataFrame) – A DataFrame containing the following columns: “nodes” (number of nodes used in the training process), “avg_epoch_time” (average time, in seconds, taken for an epoch), and “name” (name of the distributed training strategy).

  • gpus_per_node (int) – Number of GPUs per node. Used to calculate the total number of GPUs for each training configuration. Defaults to 4.

Returns:

A tuple containing the matplotlib Figure and Axes objects of the generated plot.

Return type:

Tuple[Figure, Axes]

Raises:

ValueError – If avg_epoch_time_df is missing required columns.

itwinai.scalability_report.plot.relative_epoch_time_speedup_plot(avg_epoch_time_df: DataFrame, gpus_per_node: int = 4) → Tuple[Figure, Axes]

Creates a log-log plot showing the relative training speedup for distributed training strategies as the number of GPUs increases.

Parameters:
  • avg_epoch_time_df (pd.DataFrame) – A DataFrame containing the following columns: “nodes” (number of nodes used in the training process), “avg_epoch_time” (average time, in seconds, taken for an epoch), and “name” (name of the distributed training strategy).

  • gpus_per_node (int) – Number of GPUs per node. Used to calculate the total number of GPUs for each training configuration. Defaults to 4.

Returns:

A tuple containing the matplotlib Figure and Axes objects of the generated plot.

Return type:

Tuple[Figure, Axes]

Raises:

ValueError – If avg_epoch_time_df is missing required columns.
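As a rough illustration of the quantities such a plot shows, the sketch below derives the GPU count and the relative speedup for a single strategy. It assumes that speedup is measured against the run with the fewest nodes, which the reference above does not state explicitly. The function name relative_speedup and the ideal_speedup field are hypothetical additions for comparison.

```python
def relative_speedup(avg_epoch_times, gpus_per_node=4):
    """avg_epoch_times maps node count -> average epoch time in seconds."""
    # Assumption: the smallest node count serves as the baseline.
    baseline_nodes = min(avg_epoch_times)
    baseline_time = avg_epoch_times[baseline_nodes]

    points = []
    for nodes in sorted(avg_epoch_times):
        points.append({
            "num_gpus": nodes * gpus_per_node,
            "speedup": baseline_time / avg_epoch_times[nodes],
            # Linear scaling would give a speedup equal to the node ratio.
            "ideal_speedup": nodes / baseline_nodes,
        })
    return points


points = relative_speedup({1: 100.0, 2: 55.0, 4: 30.0})
```

A speedup below ideal_speedup at a given GPU count indicates that the strategy scales sublinearly there.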

itwinai.scalability_report.plot.gpu_bar_plot(data_df: DataFrame, plot_title: str, y_label: str, main_column: str) → Tuple[Figure, Axes]

Creates a bar plot for the specified data, grouped by strategy and number of GPUs.

Parameters:
  • data_df (pd.DataFrame) – A DataFrame containing the data to plot. Must include the columns “strategy”, “num_global_gpus”, and the column specified in main_column.

  • plot_title (str) – The title of the plot.

  • y_label (str) – The label for the y-axis.

  • main_column (str) – The column in data_df to use for the bar heights.

Returns:

A tuple containing the matplotlib Figure and Axes objects of the generated bar plot.

Return type:

Tuple[Figure, Axes]

Raises:

ValueError – If data_df is missing required columns.

itwinai.scalability_report.plot.computation_fraction_bar_plot(communication_data_df: DataFrame) → Tuple[Figure, Axes]

Creates a stacked bar plot showing computation and communication fractions for different strategies and GPU counts.

Parameters:

communication_data_df (pd.DataFrame) – A DataFrame containing the following columns: “strategy” (name of the distributed training strategy), “num_gpus” (number of GPUs used), and “computation_fraction” (fraction of time spent on computation).

Returns:

A tuple containing the matplotlib Figure and Axes objects of the generated plot.

Return type:

Tuple[Figure, Axes]

Raises:

ValueError – If the DataFrame is missing required columns or has invalid data.

reports.py

itwinai.scalability_report.reports.epoch_time_report(epoch_time_dir: Path | str, plot_dir: Path | str, backup_dir: Path, do_backup: bool = False) → None

Generates reports and plots for epoch training times across distributed training strategies, including a log-log plot of absolute average epoch times against the number of GPUs and a log-log plot of relative speedup as more GPUs are added. The function optionally creates backups of the data.

Parameters:
  • epoch_time_dir (Path | str) – Path to the directory containing CSV files with epoch time metrics. The files must include the columns “name”, “nodes”, “epoch_id”, and “time”.

  • plot_dir (Path | str) – Path to the directory where the generated plots will be saved.

  • backup_dir (Path) – Path to the directory where backups of the data will be stored if do_backup is True.

  • do_backup (bool) – Whether to create a backup of the epoch time data in the backup_dir. Defaults to False.
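The plots take an “avg_epoch_time” column, while the input files carry one “time” value per “epoch_id”. One plausible way to get from one to the other, assumed here rather than taken from the itwinai source, is to average the epoch times per strategy name and node count. The sketch uses plain dictionaries in place of a DataFrame.

```python
from collections import defaultdict
from statistics import mean

# Illustrative rows with the columns "name", "nodes", "epoch_id", and "time".
rows = [
    {"name": "ddp", "nodes": 1, "epoch_id": 0, "time": 10.0},
    {"name": "ddp", "nodes": 1, "epoch_id": 1, "time": 12.0},
    {"name": "ddp", "nodes": 2, "epoch_id": 0, "time": 6.0},
    {"name": "ddp", "nodes": 2, "epoch_id": 1, "time": 6.0},
]

# Collect the epoch times of each (strategy name, node count) pair.
times = defaultdict(list)
for row in rows:
    times[(row["name"], row["nodes"])].append(row["time"])

# Average over epochs to get one value per pair.
avg_epoch_time = {key: mean(values) for key, values in times.items()}
```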

itwinai.scalability_report.reports.gpu_data_report(gpu_data_dir: Path | str, plot_dir: Path | str, backup_dir: Path, do_backup: bool = False) → None

Generates reports and plots for GPU energy consumption and utilization across distributed training strategies. Includes bar plots for energy consumption and GPU utilization by strategy and number of GPUs. The function optionally creates backups of the data.

Parameters:
  • gpu_data_dir (Path | str) – Path to the directory containing CSV files with GPU data. The files must include the columns “sample_idx”, “utilization”, “power”, “local_rank”, “node_idx”, “num_global_gpus”, “strategy”, and “probing_interval”.

  • plot_dir (Path | str) – Path to the directory where the generated plots will be saved.

  • backup_dir (Path) – Path to the directory where backups of the data will be stored if do_backup is True.

  • do_backup (bool) – Whether to create a backup of the GPU data in the backup_dir. Defaults to False.

itwinai.scalability_report.reports.communication_data_report(communication_data_dir: Path | str, plot_dir: Path | str, backup_dir: Path, do_backup: bool = False) → None

Generates reports and plots for communication and computation fractions across distributed training strategies. Includes a bar plot showing the fraction of time spent on computation vs communication for each strategy and GPU count. The function optionally creates backups of the data.

Parameters:
  • communication_data_dir (Path | str) – Path to the directory containing CSV files with communication data. The files must include the columns “strategy”, “num_gpus”, “global_rank”, “name”, and “self_cuda_time_total”.

  • plot_dir (Path | str) – Path to the directory where the generated plot will be saved.

  • backup_dir (Path) – Path to the directory where backups of the data will be stored if do_backup is True.

  • do_backup (bool) – Whether to create a backup of the communication data in the backup_dir. Defaults to False.

utils.py

itwinai.scalability_report.utils.check_contains_columns(df: DataFrame, expected_columns: Set, file_path: Path | None = None) → None

Validates that the given DataFrame contains all the expected columns. Raises a ValueError if any columns are missing, including the file path in the error message if provided.
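The check amounts to a set difference between the expected and the actual column names. The sketch below is a hypothetical stand-in that takes a plain iterable of column names instead of a DataFrame. The name check_columns and the message wording are illustrative and not taken from itwinai.

```python
def check_columns(columns, expected_columns, file_path=None):
    """Raise ValueError if any expected column is absent from `columns`."""
    missing = set(expected_columns) - set(columns)
    if missing:
        # Mention the file only when the caller supplied a path.
        location = f" in file '{file_path}'" if file_path is not None else ""
        raise ValueError(f"Missing columns{location}: {sorted(missing)}")


# Passes silently: every expected column is present.
check_columns(["name", "nodes", "time"], {"name", "nodes"})

# Fails: "nodes" is absent, and the message names the offending file.
try:
    check_columns(["name"], {"name", "nodes"}, file_path="run.csv")
    message = ""
except ValueError as err:
    message = str(err)
```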

itwinai.scalability_report.utils.check_probing_interval_consistency(gpu_data_df: DataFrame) → None

Checks that the probing_interval is consistent within each group of strategy and number of GPUs.

Raises:

ValueError – If the probing intervals are inconsistent for any group.
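A consistency check of this kind can be sketched by collecting the distinct probing intervals of each group and rejecting any group with more than one. The code below uses plain dictionaries instead of a DataFrame. The name check_intervals is hypothetical, and this is an illustration of the idea rather than the library's code.

```python
from collections import defaultdict


def check_intervals(rows):
    """Raise ValueError if a (strategy, GPU count) group mixes intervals."""
    intervals = defaultdict(set)
    for row in rows:
        key = (row["strategy"], row["num_global_gpus"])
        intervals[key].add(row["probing_interval"])

    # A group is inconsistent when it holds more than one distinct interval.
    inconsistent = {k: sorted(v) for k, v in intervals.items() if len(v) > 1}
    if inconsistent:
        raise ValueError(f"Inconsistent probing intervals: {inconsistent}")


consistent_rows = [
    {"strategy": "ddp", "num_global_gpus": 4, "probing_interval": 1},
    {"strategy": "ddp", "num_global_gpus": 4, "probing_interval": 1},
    {"strategy": "ddp", "num_global_gpus": 8, "probing_interval": 2},
]
check_intervals(consistent_rows)  # different groups may use different intervals
```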

itwinai.scalability_report.utils.calculate_gpu_statistics(gpu_data_df: DataFrame, expected_columns: Set) → DataFrame

Calculates both the total energy expenditure (in Watt-hours) and the average GPU utilization for each strategy and number of GPUs. Ensures consistent probing intervals.

Returns:

A DataFrame containing the total energy expenditure and average GPU utilization for each strategy and number of GPUs, with the columns strategy, num_global_gpus, total_energy_wh, and utilization.

Return type:

pd.DataFrame

Raises:
  • ValueError – If the given DataFrame does not contain the expected columns.

  • ValueError – If the probing intervals are inconsistent for any group.
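To make the unit conversion concrete, the sketch below computes both statistics from sample rows. It assumes that “power” is in watts and “probing_interval” is in seconds, so that power × interval yields joules and dividing by 3600 yields watt-hours. These units and the helper name gpu_statistics are assumptions, not confirmed by the reference above.

```python
from collections import defaultdict
from statistics import mean


def gpu_statistics(samples):
    """Aggregate energy (Wh) and mean utilization per (strategy, GPU count)."""
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["strategy"], sample["num_global_gpus"])].append(sample)

    stats = []
    for (strategy, num_gpus), group in sorted(groups.items()):
        # Assumed units: watts * seconds = joules; 3600 joules = 1 watt-hour.
        joules = sum(s["power"] * s["probing_interval"] for s in group)
        stats.append({
            "strategy": strategy,
            "num_global_gpus": num_gpus,
            "total_energy_wh": joules / 3600,
            "utilization": mean(s["utilization"] for s in group),
        })
    return stats


samples = [
    {"strategy": "ddp", "num_global_gpus": 4, "power": 300.0,
     "utilization": 80.0, "probing_interval": 60},
    {"strategy": "ddp", "num_global_gpus": 4, "power": 300.0,
     "utilization": 90.0, "probing_interval": 60},
]
stats = gpu_statistics(samples)
```

Two samples at 300 W, each covering 60 s, add up to 36,000 J, which is 10 Wh.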

itwinai.scalability_report.utils.calculate_comp_and_comm_time(df: DataFrame) → Tuple[float, float]

Calculates the time spent on computation and communication in seconds from the given DataFrame, assuming an NCCL backend.

Raises:
  • ValueError – If the DataFrame is missing the required columns “name” or “self_cuda_time_total”.
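One simple way to split profiler rows into the two categories is sketched below. It rests on two assumptions that the reference above does not confirm: kernels whose name contains “nccl” count as communication and everything else as computation, and “self_cuda_time_total” is given in microseconds. The actual function may use a more specific list of operation names.

```python
def comp_and_comm_time(rows):
    """Return (computation_seconds, communication_seconds) for profiler rows."""
    for column in ("name", "self_cuda_time_total"):
        if any(column not in row for row in rows):
            raise ValueError(f"Missing required column '{column}'")

    comp_us = 0.0
    comm_us = 0.0
    for row in rows:
        # Assumption: NCCL kernels are identified by "nccl" in their name.
        if "nccl" in row["name"].lower():
            comm_us += row["self_cuda_time_total"]
        else:
            comp_us += row["self_cuda_time_total"]

    # Assumption: times are in microseconds; convert to seconds.
    return comp_us / 1e6, comm_us / 1e6


rows = [
    {"name": "aten::mm", "self_cuda_time_total": 2_000_000},
    {"name": "ncclDevKernel_AllReduce_Sum_f32_RING_LL", "self_cuda_time_total": 500_000},
]
comp_s, comm_s = comp_and_comm_time(rows)
```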

itwinai.scalability_report.utils.get_computation_fraction_data(df: DataFrame) → DataFrame

Calculates the computation fraction for each strategy and GPU configuration, returning a DataFrame with the results. The computation fraction is defined as the ratio of computation time to the total time (computation + communication).
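The definition translates directly into a few lines of arithmetic. The helper below is a hypothetical illustration of the ratio for a single configuration. The zero-total guard is an addition for this sketch and not documented behaviour.

```python
def computation_fraction(computation_time, communication_time):
    """Fraction of total time (computation + communication) spent computing."""
    total = computation_time + communication_time
    if total == 0:
        # Guard added for this sketch; the ratio is undefined without any time.
        raise ValueError("Total time must be positive")
    return computation_time / total


# 2.0 s of computation and 0.5 s of communication.
fraction = computation_fraction(2.0, 0.5)
```

With these inputs the fraction is 0.8, so the remaining 0.2 of the time goes to communication.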