itwinai.slurm

configuration

class itwinai.slurm.configuration.SlurmScriptConfiguration(*, job_name: str | None = None, account: str, partition: str, time: str = '00:30:00', std_out: Path | None = None, err_out: Path | None = None, num_nodes: int = 1, num_tasks: int | None = None, num_tasks_per_node: int = 1, gpus_per_node: int = 4, cpus_per_task: int = 16, memory: str = '16G', exclusive: bool = False, pre_exec_command: str | None = None, exec_command: str | None = None, save_script: bool = False, submit_job: bool = False, save_dir: Path | None = PosixPath('slurm-scripts'), pre_exec_file: str | None = None, exec_file: str | None = None)[source]

Bases: BaseModel

Configuration object for the SLURM script. It contains all the settings for the SLURM script such as which hardware you are requesting or for how long to run it. As it allows for any pre_exec_command and exec_command, it should work for any SLURM script.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

job_name: str | None

Optional job name for the SLURM job. Defaults to None (auto-generated later).

account: str

Billing account to charge the job to. Required.

partition: str

Partition/queue the job should run on. Required.

time: str

Wall-clock time limit for the job (HH:MM:SS). Defaults to 00:30:00.

std_out: Path | None

Path to standard output file. Defaults to None (filled later).

err_out: Path | None

Path to standard error file. Defaults to None (filled later).

num_nodes: int

Number of nodes requested. Defaults to 1.

num_tasks: int | None

Total number of tasks, on all nodes. Defaults to None (computed dynamically).

num_tasks_per_node: int

Number of tasks per node. Defaults to 1.

gpus_per_node: int

GPUs per node requested. Defaults to 4.

cpus_per_task: int

CPUs per task requested. Defaults to 16.

memory: str

Memory per node requested. Defaults to “16G”.

exclusive: bool

Whether to request exclusive node access. Defaults to False.

pre_exec_command: str | None

Pre-execution command content (shell). Defaults to None (set by builder). Typically used to set up the environment before executing the command, e.g. “ml Python”, “source .venv/bin/activate”, etc. Users should normally leave this unset, except in advanced use cases; the SLURM script builder generates it from the configuration.

exec_command: str | None

Main execution command content (shell). Defaults to None (set by builder). This is the command to execute, typically an ‘srun’ command. Users should normally leave this unset, except in advanced use cases; the SLURM script builder generates it from the configuration.

save_script: bool

Whether to save the generated SLURM script. Defaults to False.

submit_job: bool

Whether to submit the generated SLURM script. Defaults to False.

save_dir: Path | None

Directory where the script should be saved. Defaults to “slurm-scripts”.

pre_exec_file: str | None

Path/URL to a pre-exec file to load content from. Ignored if not provided. Defaults to None.

exec_file: str | None

Path/URL to an exec file to load content from. Ignored if not provided. Defaults to None.

exclusive_line() → str[source]
generate_script() → str[source]

Uses the provided configuration parameters and formats a SLURM script with the requested settings.

Returns:

A string containing the SLURM script.

Return type:

str
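As a rough illustration of what generate_script() produces, the sketch below maps the documented fields onto #SBATCH directives. The directive names are standard sbatch options, but the template itwinai actually uses is not reproduced in this reference, so the helper render_sbatch_header and the exact layout are assumptions made for illustration only.

```python
from typing import Optional


def render_sbatch_header(
    account: str,
    partition: str,
    job_name: Optional[str] = None,
    time: str = "00:30:00",
    num_nodes: int = 1,
    num_tasks_per_node: int = 1,
    gpus_per_node: int = 4,
    cpus_per_task: int = 16,
    memory: str = "16G",
    exclusive: bool = False,
) -> str:
    """Hypothetical helper (not part of itwinai): format fields as #SBATCH lines."""
    lines = ["#!/bin/bash"]
    if job_name is not None:
        lines.append(f"#SBATCH --job-name={job_name}")
    lines += [
        f"#SBATCH --account={account}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={time}",
        f"#SBATCH --nodes={num_nodes}",
        f"#SBATCH --ntasks-per-node={num_tasks_per_node}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --mem={memory}",
    ]
    # The flag is only emitted when exclusive node access is requested.
    if exclusive:
        lines.append("#SBATCH --exclusive")
    return "\n".join(lines) + "\n"


# Defaults mirror the documented field defaults; account and partition are required.
header = render_sbatch_header(account="my-account", partition="gpu", job_name="demo")
```

The real script also contains the pre-execution and execution commands after the header; those are omitted here.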

class itwinai.slurm.configuration.MLSlurmBuilderConfig(*, job_name: str | None = None, account: str, partition: str, time: str = '00:30:00', std_out: Path | None = None, err_out: Path | None = None, num_nodes: int = 1, num_tasks: int | None = None, num_tasks_per_node: int = 1, gpus_per_node: int = 4, cpus_per_task: int = 16, memory: str = '16G', exclusive: bool = False, pre_exec_command: str | None = None, exec_command: str | None = None, save_script: bool = False, submit_job: bool = False, save_dir: Path | None = PosixPath('slurm-scripts'), pre_exec_file: str | None = None, exec_file: str | None = None, use_ray: bool = False, container_path: Path | None = None, distributed_strategy: Literal['ddp', 'horovod', 'deepspeed'], mode: Literal['single', 'runall', 'scaling-test'] = 'single', training_cmd: str | None = '{itwinai_launcher} exec-pipeline --config-name={config_name} --config-path={config_path} --strategy={distributed_strategy} --run-name={run_name} +pipe_key={pipe_key} ', python_venv: str | None = None, config_name: str = 'config', config_path: str = '.', pipe_key: str = 'training_pipeline', scalability_nodes: List[int] = <factory>, py_spy: bool = False, profiling_sampling_rate: int = 10, run_name: str = 'main-run')[source]

Bases: SlurmScriptConfiguration

Extends the base SLURM configuration with ML builder-specific options.

model_config: ClassVar[ConfigDict] = {'extra': 'forbid'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

use_ray: bool

Whether to launch jobs via Ray. Defaults to False.

container_path: Path | None

Optional container path to export. Defaults to None.

distributed_strategy: Literal['ddp', 'horovod', 'deepspeed']

Distributed strategy to use for training. Required.

mode: Literal['single', 'runall', 'scaling-test']

Execution mode: a single job (“single”), all strategies (“runall”), or a scaling test with all strategies (“scaling-test”). Defaults to “single”.

training_cmd: str | None

Optional custom training command template. Can reference any field in this config. Defaults to {itwinai_launcher} exec-pipeline --config-name={config_name} --config-path={config_path} --strategy={distributed_strategy} --run-name={run_name} +pipe_key={pipe_key}.

python_venv: str | None

Python virtual environment to activate. Defaults to None.

config_name: str

Hydra config name to pass to exec-pipeline. Defaults to “config”.

config_path: str

Hydra config path to pass to exec-pipeline. Defaults to “.”.

pipe_key: str

Pipeline key to execute. Defaults to “training_pipeline”.

scalability_nodes: List[int]

Node counts to use for scaling tests. Defaults to [1, 2, 4, 8].

py_spy: bool

Enable py-spy profiling. Defaults to False.

profiling_sampling_rate: int

Sampling rate for py-spy profiling. Defaults to 10.

run_name: str

Run name for tracking. Defaults to “main-run”.

classmethod parse_scalability_nodes(value)[source]
classmethod normalize_choices(value)[source]
build_training_command() → str[source]

Render the training command using the configured template and fields.
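A minimal sketch of the rendering step, assuming the template is filled with Python's str.format. The template text is the default from the class signature; the value used for {itwinai_launcher} is an assumption, since this reference does not list it as a config field.

```python
# Default template as shown in the MLSlurmBuilderConfig signature.
TRAINING_CMD_TEMPLATE = (
    "{itwinai_launcher} exec-pipeline "
    "--config-name={config_name} "
    "--config-path={config_path} "
    "--strategy={distributed_strategy} "
    "--run-name={run_name} "
    "+pipe_key={pipe_key} "
)

fields = {
    # ASSUMPTION: the launcher value is not documented here; "itwinai" is a placeholder.
    "itwinai_launcher": "itwinai",
    # The remaining values are the documented defaults ("ddp" is one allowed strategy).
    "config_name": "config",
    "config_path": ".",
    "distributed_strategy": "ddp",
    "run_name": "main-run",
    "pipe_key": "training_pipeline",
}

# Fill the placeholders and drop the trailing space left by the template.
training_cmd = TRAINING_CMD_TEMPLATE.format(**fields).strip()
```

A custom training_cmd can reference the same placeholders; a placeholder with no matching key would raise a KeyError under this str.format assumption.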

script_builder

class itwinai.slurm.script_builder.SlurmScriptBuilder(config: SlurmScriptConfiguration)[source]

Bases: object

Base builder for SLURM scripts that handles defaults, execution prep, and dispatch.

Parameters:

config (SlurmScriptConfiguration) – configuration object.

Note

The provided SlurmScriptConfiguration may be modified while preparing the script.

config: SlurmScriptConfiguration
static submit_script(script: str) → None[source]

Submits the given script with ‘sbatch’ using a temporary file.
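The temporary-file approach can be sketched as follows. This is not the itwinai implementation; the function name submit_script_sketch and the injectable run parameter are assumptions, added so the sketch can be exercised on a machine without SLURM.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def submit_script_sketch(script: str, run: Callable[..., object] = subprocess.run) -> None:
    """Hypothetical stand-in: write the script to a temp file and pass it to sbatch."""
    with tempfile.NamedTemporaryFile(mode="w", suffix=".sh", delete=False) as handle:
        handle.write(script)
        tmp_path = Path(handle.name)
    try:
        # check=True makes a failing sbatch call raise CalledProcessError.
        run(["sbatch", str(tmp_path)], check=True)
    finally:
        # The temporary file is removed whether or not submission succeeded.
        tmp_path.unlink(missing_ok=True)


# Dry run with a stand-in for subprocess.run, so no SLURM installation is needed.
recorded: List[Tuple[List[str], str]] = []


def fake_run(cmd: List[str], check: bool = False) -> None:
    # Record the command and the file content seen at submission time.
    recorded.append((cmd, Path(cmd[1]).read_text()))


submit_script_sketch("#!/bin/bash\necho hello\n", run=fake_run)
```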

static save_script(script: str, file_path: Path) → None[source]

Saves the given script to the given file path.

static print_script(script: str) → None[source]

Prints the given script to stdout using the cli_logger.

process_script() → None[source]

Processes the given script by submitting and/or saving it, or by printing it to stdout. Also prepares the script by inserting default values wherever they are not set, as well as creating the needed directories.

class itwinai.slurm.script_builder.MLSlurmBuilder(config: MLSlurmBuilderConfig)[source]

Bases: SlurmScriptBuilder

Builds a SLURM script tailored to distributed machine learning.

Uses the provided MLSlurmBuilderConfig to build the script and inserts values as needed.

Parameters:

config (MLSlurmBuilderConfig) – Validated configuration controlling script generation.

Note

The given configuration object might be modified by some of the methods.

config: MLSlurmBuilderConfig
get_exec_command() → str[source]

Generates an execution command for the SLURM script. Takes into account whether Ray is enabled and selects the matching bash function.

get_pre_exec_command() → str[source]

Generates a pre-execution command for the SLURM script. Adds a command to source the python venv if given and a command to export a container path variable if given.
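A hedged sketch of the kind of shell text this could produce. The activation path layout and the variable name CONTAINER_PATH are assumptions for illustration; this reference does not state the exact commands or variable name that itwinai emits.

```python
from typing import List, Optional


def build_pre_exec_sketch(python_venv: Optional[str], container_path: Optional[str]) -> str:
    """Hypothetical helper: assemble pre-execution shell lines from optional settings."""
    lines: List[str] = []
    if python_venv is not None:
        # ASSUMPTION: the venv follows the standard <venv>/bin/activate layout.
        lines.append(f"source {python_venv}/bin/activate")
    if container_path is not None:
        # ASSUMPTION: CONTAINER_PATH is a placeholder variable name.
        lines.append(f"export CONTAINER_PATH={container_path}")
    return "\n".join(lines)


pre_exec = build_pre_exec_sketch(python_venv=".venv", container_path="image.sif")
```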

process_script() → None[source]

Generate the SLURM script then print, save, and/or submit based on config flags.

  • Always renders the script (filling defaults, loading exec/pre-exec files).

  • Prints to stdout when neither submit_job nor save_script is set.

  • Saves to save_dir when save_script is True.

  • Submits via sbatch when submit_job is True (ensures log dirs exist).
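The flag handling listed above can be summarised in a small decision function. This is only a sketch of the documented rules; the function name planned_actions is hypothetical, and the relative order of saving and submitting is an assumption.

```python
from typing import List


def planned_actions(save_script: bool, submit_job: bool) -> List[str]:
    """Hypothetical helper: list what process_script would do for the given flags."""
    actions = ["render"]  # the script is always rendered
    if not save_script and not submit_job:
        actions.append("print")  # printing is the fallback when no flag is set
    if save_script:
        actions.append("save")
    if submit_job:
        actions.append("submit")
    return actions


defaults = planned_actions(save_script=False, submit_job=False)
both = planned_actions(save_script=True, submit_job=True)
```

With both flags at their default of False, the script is only printed, which makes the default invocation a dry run.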

process_all_strategies(strategies: Iterable[str] = ('ddp', 'horovod', 'deepspeed'))[source]

Runs the SLURM script with all the given strategies.

process_scaling_test(strategies: Iterable[str] = ('ddp', 'horovod', 'deepspeed'), num_nodes_list: Iterable[int] = (1, 2, 4, 8))[source]

Runs a scaling test, i.e. runs all the strategies with separate runs for each distinct number of nodes.
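With the default arguments, a scaling test covers every combination of strategy and node count. The sketch below enumerates those combinations; the iteration order (strategies outermost) is an assumption, not something this reference specifies.

```python
from itertools import product

# Default arguments from the process_scaling_test signature.
strategies = ("ddp", "horovod", "deepspeed")
num_nodes_list = (1, 2, 4, 8)

# One run per (strategy, node count) pair: 3 strategies x 4 node counts.
runs = list(product(strategies, num_nodes_list))
```

Each pair corresponds to a separate run, so the defaults yield twelve runs in total.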

itwinai.slurm.script_builder.generate_default_slurm_script(config: MLSlurmBuilderConfig) → None[source]

Generates and optionally submits a default SLURM script from a validated config.

itwinai.slurm.script_builder.process_builder(slurm_script_builder: MLSlurmBuilder)[source]

utils

itwinai.slurm.utils.retrieve_remote_file(url: str) → str[source]

Fetches a remote file from the given URL and returns its content.

Parameters:

url – URL to the raw configuration file (YAML/JSON format), e.g. raw GitHub link.
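For orientation, fetching text from a URL can be sketched with the standard library alone. This is not the itwinai implementation, which may use a different HTTP client and error handling; the name fetch_text and the timeout parameter are assumptions. A file:// URL is used so that the demonstration runs offline.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Hypothetical stand-in: download the resource at url and decode it as UTF-8."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Offline demonstration: serve a local file through a file:// URL.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = Path(tmp_dir) / "pre_exec.sh"
    sample.write_text("ml Python\n", encoding="utf-8")
    content = fetch_text(sample.as_uri())
```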

itwinai.slurm.utils.remove_indentation_from_multiline_string(multiline_string: str) → str[source]

Removes all indentation from the start of each line in a multi-line string.

If you want to remove only the shared indentation of all lines, thus preserving indentation for nested structures, use the builtin textwrap.dedent function instead.

The main purpose of this function is allowing you to define multi-line strings that only appear indented in the code, thus increasing readability.
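The difference from textwrap.dedent is easiest to see side by side. The function strip_all_indentation below is a stand-in written from the description above, not the itwinai source, so details such as the handling of blank lines are assumptions.

```python
import textwrap


def strip_all_indentation(multiline_string: str) -> str:
    """Stand-in for the documented behaviour: drop leading whitespace on every line."""
    return "\n".join(line.lstrip() for line in multiline_string.split("\n"))


script = """
    if true; then
        echo "hello"
    fi
"""

# Every line starts at column zero; the nesting of the echo line is lost.
flattened = strip_all_indentation(script)

# Only the shared four-space prefix is removed; the echo line stays indented.
dedented = textwrap.dedent(script)
```

For shell snippets, losing the nested indentation is harmless, which fits the stated purpose of keeping multi-line strings readable in the source code.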