itwinai.scalability_report
data
- itwinai.scalability_report.data.read_profiling_data_from_mlflow(mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None, expected_columns: Set[str] | None = None) → DataFrame | None
Reads and validates profiling data from an MLflow experiment and combines them into a single DataFrame.
- Parameters:
mlflow_client (MlflowClient) – An instance of MlflowClient to interact with MLflow.
experiment_name (str) – Name of the MLflow experiment to read from.
run_names (List[str] | None) – Names of the runs to read metrics from. If empty, all runs in the experiment will be considered.
expected_columns (Set[str] | None) – A set of column names expected to be present in the profiling data. If None, no validation is performed on the columns.
- Returns:
A DataFrame containing the concatenated data from all valid CSV files in the directory.
- Return type:
pd.DataFrame | None
- itwinai.scalability_report.data.read_epoch_time_from_mlflow(mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None) → DataFrame | None
Reads and validates epoch time metrics from an MLflow experiment and combines them into a single DataFrame.
- Parameters:
mlflow_client (MlflowClient) – An instance of MlflowClient to interact with MLflow.
experiment_name (str) – Name of the MLflow experiment to read from.
run_names (List[str] | None) – Names of the runs to read metrics from. If empty, all runs in the experiment will be considered.
- Returns:
A DataFrame containing the concatenated data from all epoch time metrics in the given runs of the experiment.
- Return type:
pd.DataFrame | None
- itwinai.scalability_report.data.read_gpu_metrics_from_mlflow(mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None) → DataFrame | None
Reads and validates GPU metrics from an MLflow experiment and combines them into a single DataFrame.
- Parameters:
mlflow_client (MlflowClient) – An instance of MlflowClient to interact with MLflow.
experiment_name (str) – Name of the MLflow experiment to read from.
run_names (List[str] | None) – Names of the runs to read metrics from. If empty, all runs in the experiment will be considered.
- Returns:
A DataFrame containing the concatenated data from all GPU metrics in the given runs of the experiment.
- Return type:
pd.DataFrame | None
plot
- itwinai.scalability_report.plot.calculate_plot_dimensions(num_datapoints: int) → Tuple[int, int]
Calculates the height and width of a plot, given a number of datapoints.
- Returns:
A tuple containing the calculated height and the calculated width, in that order.
- Return type:
Tuple[int, int]
- itwinai.scalability_report.plot.absolute_avg_epoch_time_plot(avg_epoch_time_df: DataFrame) → Tuple[Figure, Axes]
Generates a log-log plot of average epoch training times against the number of GPUs for distributed training strategies.
- Parameters:
avg_epoch_time_df (pd.DataFrame) – A DataFrame containing the following columns: “global_world_size” (number of GPUs used in the training process), “avg_epoch_time” (average time, in seconds, taken for an epoch), and “strategy” (name of the distributed training strategy).
- Returns:
A tuple containing the matplotlib Figure and Axes objects of the generated plot.
- Return type:
Tuple[Figure, Axes]
- Raises:
ValueError – If avg_epoch_time_df is missing required columns.
- itwinai.scalability_report.plot.relative_epoch_time_speedup_plot(avg_epoch_time_df: DataFrame) → Tuple[Figure, Axes]
Creates a log-log plot showing the relative training speedup for distributed training strategies as the number of workers increases.
- Parameters:
avg_epoch_time_df (pd.DataFrame) – A DataFrame containing the following columns: “global_world_size” (number of GPUs used in the training process), “avg_epoch_time” (average time, in seconds, taken for an epoch), and “strategy” (name of the distributed training strategy).
- Returns:
A tuple containing the matplotlib Figure and Axes objects of the generated plot.
- Return type:
Tuple[Figure, Axes]
- Raises:
ValueError – If avg_epoch_time_df is missing required columns.
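The relative speedup shown in this plot can be illustrated with a minimal sketch. It assumes that speedup is measured against the smallest world size available for each strategy; that baseline choice, the function name relative_speedup, and the use of plain tuples instead of a DataFrame are assumptions made for illustration, not the library's implementation.

```python
from collections import defaultdict


def relative_speedup(rows):
    """Compute speedup relative to the smallest world size of each strategy.

    rows: iterable of (strategy, global_world_size, avg_epoch_time) tuples.
    Returns {strategy: [(global_world_size, speedup), ...]} sorted by size.
    """
    by_strategy = defaultdict(list)
    for strategy, world_size, avg_time in rows:
        by_strategy[strategy].append((world_size, avg_time))

    result = {}
    for strategy, points in by_strategy.items():
        points.sort()  # smallest world size first
        baseline_time = points[0][1]  # assumed baseline: fewest GPUs
        result[strategy] = [(n, baseline_time / t) for n, t in points]
    return result


# Hypothetical measurements: (strategy, GPUs, average epoch time in seconds)
rows = [("ddp", 4, 100.0), ("ddp", 8, 55.0), ("ddp", 16, 32.0)]
speedups = relative_speedup(rows)
```

Under this definition, ideal linear scaling from 4 to 16 GPUs would give a speedup of 4, so the gap between the measured and ideal values indicates the scaling overhead.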
- itwinai.scalability_report.plot.gpu_bar_plot(data_df: DataFrame, plot_title: str, y_label: str, main_column: str, ray_footnote: str | None = None) → Tuple[Figure, Axes]
Creates a centered bar plot grouped by number of GPUs and strategy.
- Parameters:
data_df (pd.DataFrame) – DataFrame containing “strategy”, “global_world_size”, and main_column.
plot_title (str) – The title of the plot.
y_label (str) – The label for the y-axis.
main_column (str) – Column name for bar heights.
ray_footnote (str | None) – Optional footnote to add if a ray strategy is present.
- Returns:
A tuple containing the matplotlib Figure and Axes objects of the generated bar plot.
- Return type:
Tuple[Figure, Axes]
- itwinai.scalability_report.plot.computation_fraction_bar_plot(communication_data_df: DataFrame) → Tuple[Figure, Axes]
Creates a stacked bar plot showing computation and communication fractions for different strategies and GPU counts.
- Parameters:
communication_data_df (pd.DataFrame) – A DataFrame containing the following columns: “strategy” (name of the distributed training strategy), “num_gpus” (number of GPUs used), and “computation_fraction” (fraction of time spent on computation).
- Returns:
A tuple containing the matplotlib Figure and Axes objects of the generated plot.
- Return type:
Tuple[Figure, Axes]
- Raises:
ValueError – If the DataFrame is missing required columns or has invalid data.
- itwinai.scalability_report.plot.computation_vs_other_bar_plot(computation_data_df: DataFrame) → Tuple[Figure, Axes]
Creates a stacked bar plot showing computation and other fractions for different strategies and GPU counts.
- Parameters:
computation_data_df (pd.DataFrame) – A DataFrame containing the following columns: “num_gpus” (number of GPUs used), “strategy” (name of the distributed training strategy), and “computation_fraction” (fraction of time spent on computation).
- Returns:
A tuple containing the matplotlib Figure and Axes objects of the generated plot.
- Return type:
Tuple[Figure, Axes]
- Raises:
ValueError – If the DataFrame is missing required columns or has invalid data.
reports
- itwinai.scalability_report.reports.epoch_time_report(plot_dir: Path | str, mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None, plot_file_suffix: str = '.png') → str | None
Generates reports and plots for epoch training times across distributed training strategies, including a log-log plot of absolute average epoch times against the number of GPUs and a log-log plot of relative speedup as more GPUs are added.
- Parameters:
plot_dir (Path | str) – Path to the directory where the generated plots will be saved.
mlflow_client (MlflowClient) – MLflow client to interact with the MLflow tracking server.
experiment_name (str) – Name of the MLflow experiment to retrieve epoch time data from.
run_names (List[str] | None) – List of specific run names to filter the epoch time data. If None, all runs in the experiment will be considered.
plot_file_suffix (str) – Suffix for the plot file names. Defaults to “.png”.
- Returns:
A string representation of the epoch time statistics table, or None if no data was found.
- Return type:
str | None
- itwinai.scalability_report.reports.gpu_data_report(plot_dir: Path | str, mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None, plot_file_suffix: str = '.png', ray_footnote: str | None = None) → str | None
Generates reports and plots for GPU energy consumption and utilization across distributed training strategies. Includes bar plots for energy consumption and GPU utilization by strategy and number of GPUs.
- Parameters:
plot_dir (Path | str) – Path to the directory where the generated plots will be saved.
mlflow_client (MlflowClient) – MLflow client to interact with the MLflow tracking server.
experiment_name (str) – Name of the MLflow experiment to retrieve GPU data from.
run_names (List[str] | None) – List of specific run names to filter the GPU data. If None, all runs in the experiment will be considered.
plot_file_suffix (str) – Suffix for the plot file names. Defaults to “.png”.
ray_footnote (str | None) – Optional footnote for energy plots containing ray strategies. Defaults to None.
- Returns:
A string representation of the GPU data statistics table, or None if no data is available.
- Return type:
str | None
- itwinai.scalability_report.reports.communication_data_report(plot_dir: Path | str, mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None, plot_file_suffix: str = '.png') → str | None
Generates reports and plots for communication and computation fractions across distributed training strategies. Includes a bar plot showing the fraction of time spent on computation vs communication for each strategy and GPU count.
- Parameters:
plot_dir (Path | str) – Path to the directory where the generated plot will be saved.
mlflow_client (MlflowClient) – MLflow client to interact with the MLflow tracking server.
experiment_name (str) – Name of the MLflow experiment to retrieve data from.
run_names (List[str] | None) – List of specific run names to filter the data. If None, all runs in the experiment will be considered.
plot_file_suffix (str) – Suffix for the plot file names. Defaults to “.png”.
- Returns:
A string representation of the communication data statistics table, or None if no data is available.
- Return type:
str | None
- itwinai.scalability_report.reports.computation_data_report(plot_dir: Path | str, mlflow_client: MlflowClient, experiment_name: str, run_names: List[str] | None = None, plot_file_suffix: str = '.png') → str | None
Generates reports and plots for computation and other fractions across distributed training strategies. Includes a bar plot showing the fraction of time spent on computation vs other for each strategy and GPU count.
- Parameters:
plot_dir (Path | str) – Path to the directory where the generated plot will be saved.
mlflow_client (MlflowClient) – MLflow client to interact with the MLflow tracking server.
experiment_name (str) – Name of the MLflow experiment to retrieve data from.
run_names (List[str] | None) – List of specific run names to filter the data. If None, all runs in the experiment will be considered.
plot_file_suffix (str) – Suffix for the plot file names. Defaults to “.png”.
- Returns:
A string representation of the computation data statistics table, or None if no data is available.
- Return type:
str | None
utils
- itwinai.scalability_report.utils.check_contains_columns(df: DataFrame, expected_columns: Set) → None
Validates that the given DataFrame contains all the expected columns. Raises a ValueError if any columns are missing, including the file path in the error message if provided.
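The kind of check described here can be sketched with a set difference. The helper name check_columns_present and the plain list of column names (standing in for the DataFrame's columns) are hypothetical and used only for illustration.

```python
def check_columns_present(present_columns, expected_columns):
    """Raise ValueError if any expected column name is absent."""
    missing = set(expected_columns) - set(present_columns)
    if missing:
        # Sort so the error message is deterministic.
        raise ValueError(f"Missing required columns: {sorted(missing)}")


# Passes silently: every expected column is present.
check_columns_present(
    ["strategy", "global_world_size", "value"],
    {"strategy", "global_world_size"},
)
```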
- itwinai.scalability_report.utils.check_probing_interval_consistency(gpu_data_df: DataFrame) → None
Checks that the probing_interval is consistent within each group of strategy and number of GPUs.
- Raises:
ValueError – If the probing intervals are inconsistent for any group.
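A minimal sketch of such a consistency check follows. It assumes the grouping keys are named "strategy" and "global_world_size" and that each record carries a "probing_interval" value; the function name and the list-of-dicts input are hypothetical stand-ins for the DataFrame.

```python
from collections import defaultdict


def check_interval_consistency(records):
    """Raise ValueError if a (strategy, world size) group mixes intervals."""
    intervals = defaultdict(set)
    for rec in records:
        key = (rec["strategy"], rec["global_world_size"])
        intervals[key].add(rec["probing_interval"])

    # A group is inconsistent when it holds more than one distinct interval.
    inconsistent = {k: sorted(v) for k, v in intervals.items() if len(v) > 1}
    if inconsistent:
        raise ValueError(f"Inconsistent probing intervals: {inconsistent}")
```

Different groups may use different intervals in this sketch; only mixing intervals within one group is treated as an error.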
- itwinai.scalability_report.utils.calculate_epoch_statistics(epoch_time_df: DataFrame, expected_columns: Set) → DataFrame
Calculates the average epoch time for each strategy and number of GPUs from the given DataFrame. The DataFrame is expected to contain the columns ‘strategy’, ‘global_world_size’, ‘sample_idx’, ‘metric_name’, and ‘value’. The ‘metric_name’ column should contain the value ‘epoch_time_s’ for epoch time measurements.
- Parameters:
epoch_time_df (pd.DataFrame) – DataFrame containing epoch time data.
expected_columns (Set) – Set of expected columns in the DataFrame.
- Returns:
A DataFrame containing the average epoch time for each strategy and number of GPUs, with the columns strategy, global_world_size, sample_idx, and avg_epoch_time.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the given DataFrame does not contain the expected columns.
ValueError – If the probing intervals are inconsistent for any group.
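The averaging step can be sketched as a group-and-mean over the 'epoch_time_s' records. This simplified version groups only by strategy and world size and ignores 'sample_idx'; the function name and the list-of-dicts input are assumptions for illustration rather than the library's code.

```python
from collections import defaultdict
from statistics import mean


def average_epoch_time(records):
    """Return {(strategy, global_world_size): mean epoch time in seconds}."""
    groups = defaultdict(list)
    for rec in records:
        # Only rows holding epoch time measurements contribute.
        if rec["metric_name"] == "epoch_time_s":
            key = (rec["strategy"], rec["global_world_size"])
            groups[key].append(rec["value"])
    return {key: mean(values) for key, values in groups.items()}
```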
- itwinai.scalability_report.utils.calculate_gpu_statistics(gpu_data_df: DataFrame, expected_columns: Set) → DataFrame
Calculates both the total energy expenditure (in Watt-hours) and the average GPU utilization for each strategy and number of GPUs. Ensures consistent probing intervals.
- Returns:
A DataFrame containing the total energy expenditure and average GPU utilization for each strategy and number of GPUs, with the columns strategy, global_world_size, total_energy_wh, and utilization.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the given DataFrame does not contain the expected columns.
ValueError – If the probing intervals are inconsistent for any group.
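One way to see why a consistent probing interval matters is the following sketch of an energy calculation. It assumes each sample is a power reading in watts taken once per probing interval (in seconds); the function name, the input shape, and the units are illustrative assumptions, not the library's definition.

```python
def total_energy_wh(power_samples_w, probing_interval_s):
    """Approximate energy in Watt-hours from evenly spaced power samples."""
    # Each sample is assumed to represent one full probing interval.
    energy_joules = sum(power_samples_w) * probing_interval_s
    return energy_joules / 3600.0  # 1 Wh = 3600 J


# Three hypothetical readings taken one minute apart.
energy = total_energy_wh([300.0, 320.0, 280.0], probing_interval_s=60.0)
```

Because a single interval multiplies the whole sum, mixing samples taken at different intervals within one group would skew the total.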
- itwinai.scalability_report.utils.calculate_comp_and_comm_time(df: DataFrame) → Tuple[float, float]
Calculates the time spent on computation and communication in seconds from the given DataFrame, assuming an NCCL backend.
- Raises:
ValueError – If the DataFrame is missing the required columns ‘name’ or ‘self_cuda_time_total’.
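A rough sketch of splitting profiler rows into computation and communication is shown below. It rests on several assumptions: communication kernels are recognised by the substring "nccl" in their name, every other row counts as computation, and 'self_cuda_time_total' is given in microseconds. The actual function may classify operations differently.

```python
def comp_and_comm_time_s(rows):
    """Return (computation_seconds, communication_seconds) from profiler rows."""
    required = {"name", "self_cuda_time_total"}
    comp_us = 0.0
    comm_us = 0.0
    for row in rows:
        missing = required - set(row)
        if missing:
            raise ValueError(f"Missing required columns: {sorted(missing)}")
        if "nccl" in row["name"].lower():
            comm_us += row["self_cuda_time_total"]  # assumed NCCL kernel
        else:
            comp_us += row["self_cuda_time_total"]  # assumed computation
    # Convert from (assumed) microseconds to seconds.
    return comp_us / 1e6, comm_us / 1e6


rows = [
    {"name": "aten::mm", "self_cuda_time_total": 3_000_000},
    {"name": "ncclDevKernel_AllReduce_Sum_f32", "self_cuda_time_total": 1_000_000},
]
comp_s, comm_s = comp_and_comm_time_s(rows)
```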
- itwinai.scalability_report.utils.calculate_comp_time(df: DataFrame) → float
Calculates the time spent on computation in seconds from the given DataFrame.
- Raises:
ValueError – If the DataFrame is missing the required columns ‘name’ or ‘self_cuda_time_total’.
- itwinai.scalability_report.utils.get_computation_fraction_data(df: DataFrame) → DataFrame
Calculates the computation fraction for each strategy and GPU configuration, returning a DataFrame with the results. The computation fraction is defined as the ratio of computation time to the total time (computation + communication).
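The ratio defined above can be written out directly. The function name and the guard against a non-positive total are illustrative choices, not taken from the library.

```python
def computation_fraction(comp_time_s, comm_time_s):
    """Fraction of (computation + communication) time spent computing."""
    total = comp_time_s + comm_time_s
    if total <= 0:
        raise ValueError("Total time must be positive.")
    return comp_time_s / total


# Hypothetical timings: 7.5 s of computation and 2.5 s of communication.
fraction = computation_fraction(comp_time_s=7.5, comm_time_s=2.5)
```

A value close to 1 means the run is dominated by computation, while lower values indicate that communication takes a larger share of the time.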
- itwinai.scalability_report.utils.get_computation_vs_other_data(df: DataFrame) → DataFrame
Calculates the computation fraction for each strategy and GPU configuration, returning a DataFrame with the results. The computation fraction is defined as the ratio of computation time to the total time of profiling.