itwinai.scalability_report

data

itwinai.scalability_report.data.read_profiling_data_from_mlflow(mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None, expected_columns: Set[str] | None = None) → DataFrame | None

Reads and validates profiling data from an MLflow experiment and combines it into a single DataFrame.

Parameters:
  • mlflow_client (MlflowClient) – An instance of MlflowClient to interact with MLflow.

  • experiment_name (str) – Name of the MLflow experiment to read from.

  • run_names (List[str] | None) – Names of the runs to read metrics from. If empty, all runs in the experiment will be considered.

  • expected_columns (Set[str] | None) – A set of column names expected to be present in the profiling data. If None, no validation is performed on the columns.

Returns:

A DataFrame containing the concatenated data from all valid CSV files in the directory.

Return type:

pd.DataFrame | None

itwinai.scalability_report.data.read_epoch_time_from_mlflow(mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None) → DataFrame | None

Reads and validates epoch time metrics from an MLflow experiment and combines them into a single DataFrame.

Parameters:
  • experiment_name (str) – Name of the MLflow experiment to read from.

  • run_names (List[str] | None) – Names of the runs to read metrics from. If empty, all runs in the experiment will be considered.

Returns:

A DataFrame containing the concatenated data from all epoch time metrics in the given runs of the experiment.

Return type:

pd.DataFrame | None

itwinai.scalability_report.data.read_gpu_metrics_from_mlflow(mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None) → DataFrame | None

Reads and validates GPU metrics from an MLflow experiment and combines them into a single DataFrame.

Parameters:
  • experiment_name (str) – Name of the MLflow experiment to read from.

  • run_names (List[str] | None) – Names of the runs to read metrics from. If empty, all runs in the experiment will be considered.

Returns:

A DataFrame containing the concatenated data from all GPU metrics in the given runs of the experiment.

Return type:

pd.DataFrame | None

plot

itwinai.scalability_report.plot.calculate_plot_dimensions(num_datapoints: int) → Tuple[int, int]

Calculates the height and width of a plot, given a number of datapoints.

Returns:

A tuple containing the calculated height and the calculated width.

Return type:

Tuple[int, int]

itwinai.scalability_report.plot.absolute_avg_epoch_time_plot(avg_epoch_time_df: DataFrame) → Tuple[Figure, Axes]

Generates a log-log plot of average epoch training times against the number of GPUs for distributed training strategies.

Parameters:

avg_epoch_time_df (pd.DataFrame) – A DataFrame containing the following columns:
  • “global_world_size”: Number of GPUs used in the training process.
  • “avg_epoch_time”: Average time (in seconds) taken for an epoch.
  • “strategy”: Name of the distributed training strategy.

Returns:

A tuple containing the matplotlib Figure and Axes objects of the generated plot.

Return type:

Tuple[Figure, Axes]

Raises:

ValueError – If avg_epoch_time_df is missing required columns.

itwinai.scalability_report.plot.relative_epoch_time_speedup_plot(avg_epoch_time_df: DataFrame) → Tuple[Figure, Axes]

Creates a log-log plot showing the relative training speedup for distributed training strategies as the number of workers increases.

Parameters:

avg_epoch_time_df (pd.DataFrame) – A DataFrame containing the following columns:
  • “global_world_size”: Number of GPUs used in the training process.
  • “avg_epoch_time”: Average time (in seconds) taken for an epoch.
  • “strategy”: Name of the distributed training strategy.

Returns:

A tuple containing the matplotlib Figure and Axes objects of the generated plot.

Return type:

Tuple[Figure, Axes]

Raises:

ValueError – If avg_epoch_time_df is missing required columns.
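
As a rough illustration of what a relative speedup means here, the sketch below computes it by hand from average epoch times. It assumes that the baseline is the run with the smallest global_world_size and that the speedup is the baseline time divided by the current time. Both are assumptions made for this example and may differ from what itwinai does internally. The helper name relative_speedup is hypothetical.

```python
def relative_speedup(avg_epoch_times):
    """Map each world size to baseline_time / time for one strategy.

    avg_epoch_times: dict of global_world_size -> avg_epoch_time (seconds).
    The smallest world size is used as the baseline (assumption of this sketch).
    """
    if not avg_epoch_times:
        raise ValueError("No epoch times given")
    baseline_size = min(avg_epoch_times)
    baseline_time = avg_epoch_times[baseline_size]
    return {
        size: baseline_time / time
        for size, time in sorted(avg_epoch_times.items())
    }


# Made-up timings for a single strategy.
speedups = relative_speedup({4: 100.0, 8: 50.0, 16: 40.0})
print(speedups)
```

With ideal scaling, doubling the number of GPUs would double the speedup, so values that fall below that line indicate scaling overhead.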

itwinai.scalability_report.plot.gpu_bar_plot(data_df: DataFrame, plot_title: str, y_label: str, main_column: str, ray_footnote: str | None = None) → Tuple[Figure, Axes]

Creates a centered bar plot grouped by number of GPUs and strategy.

Parameters:
  • data_df (pd.DataFrame) – DataFrame containing “strategy”, “global_world_size”, and main_column.

  • plot_title (str) – The title of the plot.

  • y_label (str) – The label for the y-axis.

  • main_column (str) – Column name for bar heights.

  • ray_footnote (str | None) – Optional footnote to add if a ray strategy is present.

Returns:

The generated bar plot.

Return type:

Tuple[Figure, Axes]

itwinai.scalability_report.plot.computation_fraction_bar_plot(communication_data_df: DataFrame) → Tuple[Figure, Axes]

Creates a stacked bar plot showing computation and communication fractions for different strategies and GPU counts.

Parameters:

communication_data_df (pd.DataFrame) – A DataFrame containing the following columns:
  • “strategy”: The name of the distributed training strategy.
  • “num_gpus”: The number of GPUs used.
  • “computation_fraction”: The fraction of time spent on computation.

Returns:

A tuple containing the matplotlib Figure and Axes objects of the generated plot.

Return type:

Tuple[Figure, Axes]

Raises:

ValueError – If the DataFrame is missing required columns or has invalid data.

itwinai.scalability_report.plot.computation_vs_other_bar_plot(computation_data_df: DataFrame) → Tuple[Figure, Axes]

Creates a stacked bar plot showing computation and other fractions for different strategies and GPU counts.

Parameters:

computation_data_df (pd.DataFrame) – A DataFrame containing the following columns:
  • “num_gpus”: The number of GPUs used.
  • “strategy”: The name of the distributed training strategy.
  • “computation_fraction”: The fraction of time spent on computation.

Returns:

A tuple containing the matplotlib Figure and Axes objects of the generated plot.

Return type:

Tuple[Figure, Axes]

Raises:

ValueError – If the DataFrame is missing required columns or has invalid data.

reports

itwinai.scalability_report.reports.epoch_time_report(plot_dir: Path | str, mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None, plot_file_suffix: str = '.png') → str | None

Generates reports and plots for epoch training times across distributed training strategies, including a log-log plot of absolute average epoch times against the number of GPUs and a log-log plot of relative speedup as more GPUs are added.

Parameters:
  • plot_dir (Path | str) – Path to the directory where the generated plots will be saved.

  • mlflow_client (MlflowClient) – MLflow client to interact with the MLflow tracking server.

  • experiment_name (str) – Name of the MLflow experiment to retrieve epoch time data from.

  • run_names (List[str] | None) – List of specific run names to filter the epoch time data. If None, all runs in the experiment will be considered.

  • plot_file_suffix (str) – Suffix for the plot file names. Defaults to “.png”.

Returns:

A string representation of the epoch time statistics table, or None if no data was found.

Return type:

str | None
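
A call to this function might look like the sketch below. The wrapper name build_epoch_time_report, the tracking URI, and the default plot directory are hypothetical; only the epoch_time_report parameters are taken from the signature above. The imports are placed inside the function so that the snippet can be loaded without mlflow or itwinai installed.

```python
from pathlib import Path


def build_epoch_time_report(tracking_uri, experiment_name, plot_dir="plots"):
    """Hypothetical wrapper that prints and returns the epoch time table."""
    # Lazy imports: these packages are only needed when the function runs.
    from mlflow.tracking import MlflowClient
    from itwinai.scalability_report.reports import epoch_time_report

    client = MlflowClient(tracking_uri=tracking_uri)
    table = epoch_time_report(
        plot_dir=Path(plot_dir),
        mlflow_client=client,
        experiment_name=experiment_name,
        run_names=None,  # None: consider all runs in the experiment
        plot_file_suffix=".png",
    )
    if table is None:
        print("No epoch time data found")
    else:
        print(table)
    return table
```

The other report functions in this module take the same leading arguments, so the same pattern should carry over to them.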

itwinai.scalability_report.reports.gpu_data_report(plot_dir: Path | str, mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None, plot_file_suffix: str = '.png', ray_footnote: str | None = None) → str | None

Generates reports and plots for GPU energy consumption and utilization across distributed training strategies. Includes bar plots for energy consumption and GPU utilization by strategy and number of GPUs.

Parameters:
  • plot_dir (Path | str) – Path to the directory where the generated plots will be saved.

  • mlflow_client (MlflowClient) – MLflow client to interact with the MLflow tracking server.

  • experiment_name (str) – Name of the MLflow experiment to retrieve GPU data from.

  • run_names (List[str] | None) – List of specific run names to filter the GPU data. If None, all runs in the experiment will be considered.

  • plot_file_suffix (str) – Suffix for the plot file names. Defaults to “.png”.

  • ray_footnote (str | None) – Optional footnote for energy plots containing ray strategies. Defaults to None.

Returns:

A string representation of the GPU data statistics table, or None if no data is available.

Return type:

str | None

itwinai.scalability_report.reports.communication_data_report(plot_dir: Path | str, mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None, plot_file_suffix: str = '.png') → str | None

Generates reports and plots for communication and computation fractions across distributed training strategies. Includes a bar plot showing the fraction of time spent on computation vs communication for each strategy and GPU count.

Parameters:
  • plot_dir (Path | str) – Path to the directory where the generated plot will be saved.

  • mlflow_client (MlflowClient) – MLflow client to interact with the MLflow tracking server.

  • experiment_name (str) – Name of the MLflow experiment to retrieve data from.

  • run_names (List[str] | None) – List of specific run names to filter the data. If None, all runs in the experiment will be considered.

  • plot_file_suffix (str) – Suffix for the plot file names. Defaults to “.png”.

itwinai.scalability_report.reports.computation_data_report(plot_dir: Path | str, mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None, plot_file_suffix: str = '.png') → str | None

Generates reports and plots for computation and other fractions across distributed training strategies. Includes a bar plot showing the fraction of time spent on computation vs other for each strategy and GPU count.

Parameters:
  • plot_dir (Path | str) – Path to the directory where the generated plot will be saved.

  • mlflow_client (MlflowClient) – MLflow client to interact with the MLflow tracking server.

  • experiment_name (str) – Name of the MLflow experiment to retrieve data from.

  • run_names (List[str] | None) – List of specific run names to filter the data. If None, all runs in the experiment will be considered.

  • plot_file_suffix (str) – Suffix for the plot file names. Defaults to “.png”.

Returns:

A string representation of the computation data statistics table, or None if no data is available.

Return type:

str | None

utils

itwinai.scalability_report.utils.check_contains_columns(df: DataFrame, expected_columns: Set) → None

Validates that the given DataFrame contains all the expected columns. Raises a ValueError if any columns are missing, including the file path in the error message if provided.
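
The validation described above amounts to a set difference between expected and actual column names. The sketch below shows that idea on plain column-name collections rather than a DataFrame (with pandas one would pass df.columns). The helper names missing_columns and check_columns and the error message wording are hypothetical, not the library's.

```python
def missing_columns(columns, expected_columns):
    """Return the expected column names that are absent, sorted."""
    return sorted(set(expected_columns) - set(columns))


def check_columns(columns, expected_columns):
    """Raise ValueError if any expected column is missing."""
    missing = missing_columns(columns, expected_columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")


# All expected columns are present: no error is raised.
check_columns(["strategy", "global_world_size", "value"], {"strategy", "value"})

# One expected column is absent: a ValueError is raised.
try:
    check_columns(["strategy"], {"strategy", "value"})
except ValueError as err:
    message = str(err)
    print(message)
```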

itwinai.scalability_report.utils.check_probing_interval_consistency(gpu_data_df: DataFrame) → None

Checks that the probing_interval is consistent within each group of strategy and number of GPUs.

Raises:

ValueError – If the probing intervals are inconsistent for any group.
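
One way to picture this check: group the rows by strategy and number of GPUs, then flag every group that contains more than one distinct probing_interval. The sketch below does this on a list of dictionaries instead of a DataFrame. The helper name inconsistent_groups and the sample values are made up for illustration.

```python
from collections import defaultdict


def inconsistent_groups(rows):
    """Return (strategy, global_world_size) groups with differing intervals."""
    intervals = defaultdict(set)
    for row in rows:
        key = (row["strategy"], row["global_world_size"])
        intervals[key].add(row["probing_interval"])
    return sorted(key for key, values in intervals.items() if len(values) > 1)


# Made-up rows: the second group mixes two probing intervals.
rows = [
    {"strategy": "ddp", "global_world_size": 4, "probing_interval": 1},
    {"strategy": "ddp", "global_world_size": 4, "probing_interval": 1},
    {"strategy": "horovod", "global_world_size": 4, "probing_interval": 1},
    {"strategy": "horovod", "global_world_size": 4, "probing_interval": 2},
]
bad = inconsistent_groups(rows)
if bad:
    print(f"Inconsistent probing intervals in groups: {bad}")
```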

itwinai.scalability_report.utils.calculate_epoch_statistics(epoch_time_df: DataFrame, expected_columns: Set) → DataFrame

Calculates the average epoch time for each strategy and number of GPUs from the given DataFrame. The DataFrame is expected to contain the columns ‘strategy’, ‘global_world_size’, ‘sample_idx’, ‘metric_name’, and ‘value’. The ‘metric_name’ column should contain the value ‘epoch_time_s’ for epoch time measurements.

Parameters:
  • epoch_time_df (pd.DataFrame) – DataFrame containing epoch time data.

  • expected_columns (Set) – Set of expected columns in the DataFrame.

Returns:

A DataFrame containing the average epoch time for each strategy and number of GPUs, with the columns strategy, global_world_size, sample_idx, and avg_epoch_time.

Return type:

pd.DataFrame

Raises:
  • ValueError – If the given DataFrame does not contain the expected columns.

  • ValueError – If the probing intervals are inconsistent for any group.
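
The averaging step can be sketched without pandas as follows: keep only rows whose metric_name is ‘epoch_time_s’, group them by strategy and global_world_size, and take the mean of value. This is a simplified illustration that ignores sample_idx. The helper name average_epoch_times and the sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_epoch_times(rows):
    """Average 'epoch_time_s' values per (strategy, global_world_size)."""
    times = defaultdict(list)
    for row in rows:
        if row["metric_name"] == "epoch_time_s":
            key = (row["strategy"], row["global_world_size"])
            times[key].append(row["value"])
    return {key: mean(values) for key, values in sorted(times.items())}


# Made-up measurements; the last row is another metric and is skipped.
rows = [
    {"strategy": "ddp", "global_world_size": 4, "metric_name": "epoch_time_s", "value": 10.0},
    {"strategy": "ddp", "global_world_size": 4, "metric_name": "epoch_time_s", "value": 12.0},
    {"strategy": "ddp", "global_world_size": 8, "metric_name": "epoch_time_s", "value": 6.0},
    {"strategy": "ddp", "global_world_size": 8, "metric_name": "other_metric", "value": 99.0},
]
averages = average_epoch_times(rows)
print(averages)
```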

itwinai.scalability_report.utils.calculate_gpu_statistics(gpu_data_df: DataFrame, expected_columns: Set) → DataFrame

Calculates both the total energy expenditure (in Watt-hours) and the average GPU utilization for each strategy and number of GPUs. Ensures consistent probing intervals.

Returns:

A DataFrame containing the total energy expenditure and average GPU utilization for each strategy and number of GPUs, with the columns strategy, global_world_size, total_energy_wh, and utilization.

Return type:

pd.DataFrame

Raises:
  • ValueError – If the given DataFrame does not contain the expected columns.

  • ValueError – If the probing intervals are inconsistent for any group.
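
A consistent probing interval matters because, with a fixed interval, the total energy reduces to a simple sum over power samples. The sketch below assumes power readings in watts sampled every probing_interval_s seconds. The parameter names and the formula are assumptions for illustration and are not taken from the itwinai source.

```python
def total_energy_wh(power_samples_w, probing_interval_s):
    """Approximate energy in Watt-hours from fixed-interval power samples.

    Each sample is treated as the constant power draw over one interval,
    so energy in joules is sum(power) * interval; dividing by 3600
    converts watt-seconds to Watt-hours.
    """
    return sum(power_samples_w) * probing_interval_s / 3600


# Four made-up samples of 300 W taken every 15 minutes, i.e. one hour in total.
energy = total_energy_wh([300.0, 300.0, 300.0, 300.0], probing_interval_s=900)
print(energy)
```

If the interval varied within a group, each sample would need its own weight, which is presumably why inconsistent intervals are rejected.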

itwinai.scalability_report.utils.calculate_comp_and_comm_time(df: DataFrame) → Tuple[float, float]

Calculates the time spent on computation and communication in seconds from the given DataFrame, assuming an NCCL backend.

Raises:
  • ValueError – If the DataFrame is missing the required columns ‘name’ or ‘self_cuda_time_total’.

itwinai.scalability_report.utils.calculate_comp_time(df: DataFrame) → float

Calculates the time spent on computation in seconds from the given DataFrame.

Raises:
  • ValueError – If the DataFrame is missing the required columns ‘name’ or ‘self_cuda_time_total’.

itwinai.scalability_report.utils.get_computation_fraction_data(df: DataFrame) → DataFrame

Calculates the computation fraction for each strategy and GPU configuration, returning a DataFrame with the results. The computation fraction is defined as the ratio of computation time to the total time (computation + communication).
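
The definition above can be written out as a small worked example. The function name computation_fraction and the guard against a non-positive total are additions made for this sketch, not part of the documented API.

```python
def computation_fraction(comp_time_s, comm_time_s):
    """Return computation time divided by (computation + communication) time."""
    total = comp_time_s + comm_time_s
    if total <= 0:
        raise ValueError("Total time must be positive")
    return comp_time_s / total


# Made-up timings: 30 s of computation and 10 s of communication.
fraction = computation_fraction(30.0, 10.0)
print(fraction)
```

A fraction close to 1 means that almost all measured time went into computation, while lower values indicate a larger communication share.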

itwinai.scalability_report.utils.get_computation_vs_other_data(df: DataFrame) → DataFrame

Calculates the computation fraction for each strategy and GPU configuration, returning a DataFrame with the results. The computation fraction is defined as the ratio of computation time to the total time of profiling.