Complete Workflow Example
This page shows you how to run a complete machine learning workflow with itwinai in just a few steps. This example demonstrates distributed training with SLURM integration, plugin usage, and pipeline configuration all in a single configuration file.
This guide shows how to use the itwinai run command, which is a single entry
point to install an itwinai plugin, define an ML workflow and its
hyperparameters, and submit a distributed AI job on a SLURM cluster. This works
because a single configuration file combines the plugin configuration, the
SLURM cluster configuration, and the ML workflow definition (including
hyperparameters).
If you want to know more about how to use the itwinai run command, please
refer to the official CLI reference for run.
Note
Difference between itwinai run and itwinai exec-pipeline
itwinai run receives a configuration file that includes configuration for plugins, the SLURM cluster, and ML workflows, and it takes care of executing the complete workflow end-to-end. See the official CLI reference for run.
itwinai exec-pipeline is more low-level and is designed to receive only the configuration of an ML workflow (a.k.a. a "pipeline"). It does not take care of SLURM submission or plugin installation: it only executes the ML workload described by the pipeline, i.e., a subset of the configuration that itwinai run would use. See the official CLI reference for exec-pipeline.
Prerequisites
itwinai installed: pip install itwinai
Access to an HPC cluster with SLURM (and a corresponding pre-execution script)
Git access for plugin installation
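One way to check these prerequisites up front is a small preflight script. The sketch below is not part of itwinai. It assumes that, once the prerequisites are met, the itwinai command, the SLURM client sbatch, and git are all available on PATH; the list of tool names is an assumption to adapt to your system.

```python
import shutil

# Executables this guide relies on. This list is an ASSUMPTION made for
# illustration: adjust it to match your system.
REQUIRED_TOOLS = ("itwinai", "sbatch", "git")


def check_tools(tools=REQUIRED_TOOLS):
    """Map each tool name to True if an executable with that name is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    for tool, found in check_tools().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

Running it on the login node of the cluster gives a quick indication of whether plugin installation (git) and job submission (sbatch) are likely to work before you launch the workflow.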
Steps
Create a configuration file, e.g. run_config.yaml, that contains a pipeline, the required plugins, and a SLURM configuration, as follows:

    # Default fields (always needed)
    strategy: ddp
    run_name: "mnist"

    plugins:
      - <my first required plugin>
      - <my second required plugin>

    slurm_config:
      job_name: my-job-name
      account: my-billing-account
      partition: my-partition
      submit_job: true # Set this if you actually want to submit the job
      save_script: true # Set this if you want to store the generated SLURM script(s)
      pre_exec_file: <path or URL to pre-execution file for current system>

      # Fields referring to this config file
      pipe_key: training_pipeline
      config_name: <name of this config file>
      config_path: <path to the directory containing this config file>

      # Propagate global config to this slurm_config
      distributed_strategy: ${strategy}
      run_name: ${run_name}

      # This provides a template for the training command launched using:
      # $ itwinai exec-pipeline -c <this config>.yaml [ARGS]
      # The template is filled in using the fields of this slurm_config.
      # Example:
      training_cmd: >
        {itwinai_launcher} exec-pipeline
        --config-name={config_name}
        --config-path={config_path}
        --strategy={distributed_strategy}
        --run-name={run_name}
        +pipe_key={pipe_key}

      # Any other SLURM configuration options you want to set.
      # Check out the SLURM builder for more information.
      ...

    # Your pipeline. You can name it whatever you want, but make sure to set
    # the ``pipe_key`` variable accordingly.
    training_pipeline:
      _target_: itwinai.pipeline.Pipeline
      steps:
        dataloading_step:
          ...
Run the workflow:
itwinai run -c run_config.yaml
The command above will install the dependencies and produce a SLURM job script, but it will not submit the job to SLURM. To also submit the job to the SLURM queue, add the -j option:
itwinai run -jc run_config.yaml
The slurm_config section follows itwinai.slurm.configuration.MLSlurmBuilderConfig (extending itwinai.slurm.configuration.SlurmScriptConfiguration), which documents each field and its default. Use the YAML to set values; -j and -s are the only CLI overrides applied on top of the config for submission and saving.
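The training_cmd field is a template whose {placeholders} are filled in from the fields of slurm_config. The Python sketch below mimics that substitution with str.format to show what the resulting command could look like. It is an illustration, not itwinai's implementation: the field values are examples, and the value given to itwinai_launcher is a hypothetical stand-in, since the real launcher prefix is chosen by itwinai.

```python
# Template as it appears in slurm_config, written on one line (roughly what
# YAML's ">" folded block scalar produces).
TRAINING_CMD = (
    "{itwinai_launcher} exec-pipeline "
    "--config-name={config_name} "
    "--config-path={config_path} "
    "--strategy={distributed_strategy} "
    "--run-name={run_name} "
    "+pipe_key={pipe_key}"
)

# Example field values. "itwinai_launcher" is a HYPOTHETICAL placeholder value.
slurm_fields = {
    "itwinai_launcher": "itwinai",
    "config_name": "run-example",
    "config_path": ".",
    "distributed_strategy": "ddp",
    "run_name": "mnist",
    "pipe_key": "training_pipeline",
}


def fill_training_cmd(template, fields):
    """Substitute every {placeholder} in the template with its field value."""
    return template.format(**fields)


command = fill_training_cmd(TRAINING_CMD, slurm_fields)
print(command)
```

A missing field raises KeyError in this sketch, which is a useful reminder that every placeholder used in training_cmd needs a matching entry in slurm_config.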
MNIST Example
Here's a concrete example showing how to run distributed MNIST training on the Vega HPC system using the itwinai MNIST plugin. This simplified example shows the key components:
# Default fields (always needed)
strategy: ddp
run_name: "mnist"
# General config
dataset_root: .tmp/
num_classes: 10
batch_size: 128
num_workers_dataloader: 4
pin_memory: False
lr: 0.001
momentum: 0.9
fp16_allreduce: False
use_adasum: False
gradient_predivide_factor: 1.0
epochs: 5
test_data_path: mnist-sample-data
inference_model_mlflow_uri: mnist-pre-trained.pth
predictions_dir: mnist-predictions
predictions_file: predictions.csv
class_labels: null
checkpoints_location: checkpoints
checkpoint_every: 1
plugins:
  - git+https://github.com/matbun/itwinai-mnist-plugin.git
slurm_config:
  job_name: mnist-job
  account: s24r05-03-users
  partition: gpu
  memory: 64G
  mode: single
  num_nodes: 2
  submit_job: true
  save_script: true
  pre_exec_file: https://raw.githubusercontent.com/interTwin-eu/itwinai/refs/heads/main/src/itwinai/slurm/system-base-scripts/vega_pre_exec.sh
  # Fields referring to this config file
  pipe_key: training_pipeline
  config_name: run-example # Assuming this is the name of this file
  config_path: . # Assuming that run-example.yaml is in the current directory
  # Propagate global config to this slurm_config
  distributed_strategy: ${strategy}
  run_name: ${run_name}
  # This provides a template for the training command launched
  # using itwinai exec-pipeline -c <this config>.yaml
  # The template is filled in using the fields of this slurm_config.
  training_cmd: >
    {itwinai_launcher} exec-pipeline
    --config-name={config_name}
    --config-path={config_path}
    --strategy={distributed_strategy}
    --run-name={run_name}
    +pipe_key={pipe_key}
# Workflows configuration
training_pipeline:
  _target_: itwinai.pipeline.Pipeline
  steps:
    dataloading_step:
      _target_: itwinai.plugins.mnist.dataloader.MNISTDataModuleTorch
      save_path: ${dataset_root}
    training_step:
      _target_: itwinai.torch.trainer.TorchTrainer
      strategy: ${strategy}
      measure_gpu_data: False
      enable_torch_profiling: False
      store_torch_profiling_traces: False
      measure_epoch_time: False
      run_name: ${run_name}
      time_ray: True # track time for ray report and fit
      # from_checkpoint: ${itwinai.cwd:}/checkpoints_ddp/best_model/
      config:
        batch_size: ${batch_size}
        num_workers_dataloader: ${num_workers_dataloader}
        pin_gpu_memory: ${pin_memory}
        ... # The rest of the pipeline is omitted for the sake of readability
Note
This example has been simplified for readability. The full configuration includes additional
parameters, hyperparameter optimization settings, detailed metrics, and more complete pipeline
steps. For a working example, please refer to use-cases/mnist/torch/run-example.yaml.
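Values such as ${strategy} and ${run_name} in the example refer to top-level keys defined earlier in the same file, so a hyperparameter is written once and reused in both slurm_config and the pipeline. The syntax resembles OmegaConf/Hydra-style interpolation. The stdlib-only sketch below merely illustrates the idea on a plain dictionary and should not be read as the way itwinai resolves its configuration: the function name resolve is made up for this example, and only simple top-level references are handled.

```python
import re

# Matches simple references such as ${strategy}. Nested keys and resolvers
# like ${itwinai.cwd:} are deliberately out of scope for this sketch.
_REFERENCE = re.compile(r"\$\{(\w+)\}")


def resolve(node, root):
    """Return a copy of `node` with each ${key} replaced by root[key]."""
    if isinstance(node, dict):
        return {key: resolve(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve(item, root) for item in node]
    if isinstance(node, str):
        whole = _REFERENCE.fullmatch(node)
        if whole:
            # A value that is exactly one reference keeps the original type.
            return root[whole.group(1)]
        return _REFERENCE.sub(lambda m: str(root[m.group(1)]), node)
    return node


# A trimmed-down version of the example configuration above.
config = {
    "strategy": "ddp",
    "run_name": "mnist",
    "batch_size": 128,
    "slurm_config": {
        "distributed_strategy": "${strategy}",
        "run_name": "${run_name}",
    },
    "training_step": {"config": {"batch_size": "${batch_size}"}},
}

resolved = resolve(config, config)
print(resolved["slurm_config"])
print(resolved["training_step"])
```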
Key Components
Plugin: Uses the itwinai-mnist-plugin, which provides MNIST-specific components (based on the code in use-cases/mnist/torch/)
SLURM: Configured for the Vega HPC system with 2 GPU nodes
Pipeline: Two-step workflow with data loading and distributed training
Logging: Combines console output with MLflow experiment tracking
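Each pipeline step in the example names its class through a _target_ key holding a dotted import path, a convention also used by Hydra. As an illustration only (itwinai's own instantiation logic may differ), the sketch below resolves such a path with importlib and passes the remaining keys as keyword arguments. To stay runnable without itwinai or the plugin installed, it targets a standard-library class rather than an itwinai component; the helper name instantiate is hypothetical.

```python
import importlib


def instantiate(spec):
    """Build an object from a dict with a dotted `_target_` path plus kwargs."""
    kwargs = {key: value for key, value in spec.items() if key != "_target_"}
    module_path, _, attr_name = spec["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr_name)
    return target(**kwargs)


# Stand-in for a pipeline step: a stdlib class, NOT an itwinai component.
step_spec = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
step = instantiate(step_spec)
print(step)
```

This is also why the plugin has to be installed before the pipeline runs: a _target_ such as itwinai.plugins.mnist.dataloader.MNISTDataModuleTorch can only be imported once the package that provides it is present.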
Full Example Configuration
For a complete configuration with hyperparameter optimization, advanced metrics, and more detailed settings,
see the full example at use-cases/mnist/torch/run-example.yaml. You can run it directly with:
itwinai run -jc https://raw.githubusercontent.com/interTwin-eu/itwinai/refs/heads/main/use-cases/mnist/torch/run-example.yaml
What This Example Does
This configuration demonstrates several key itwinai features:
- Plugin Integration
The example uses the MNIST plugin from GitHub, showing how to extend itwinai with external components.
- SLURM Integration
The slurm_config section automatically generates and submits SLURM jobs for HPC execution, including multi-node distributed training setup.
- Unified Configuration
All training parameters, infrastructure settings, and pipeline definitions are in one file, making it easy to reproduce experiments.
- Distributed Training
Configured for 2-node distributed training using the DDP (Distributed Data Parallel) strategy.
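To make the unified layout concrete, the sketch below separates a configuration dictionary into the three concerns described in this guide: plugins, SLURM settings, and the pipeline selected by pipe_key. It is a simplified illustration under the assumption that the file has already been loaded into a plain dict. The function split_run_config is hypothetical and is not an itwinai API, and the pipeline steps are left empty for brevity.

```python
def split_run_config(config):
    """Separate a unified run configuration into its three concerns."""
    slurm = config["slurm_config"]
    # pipe_key names the top-level entry that holds the pipeline definition.
    pipe_key = slurm["pipe_key"]
    return {
        "plugins": list(config.get("plugins", [])),
        "slurm": slurm,
        "pipeline": config[pipe_key],
    }


# Abbreviated stand-in for the MNIST configuration (steps left empty here).
run_config = {
    "strategy": "ddp",
    "plugins": ["git+https://github.com/matbun/itwinai-mnist-plugin.git"],
    "slurm_config": {
        "job_name": "mnist-job",
        "num_nodes": 2,
        "pipe_key": "training_pipeline",
    },
    "training_pipeline": {"_target_": "itwinai.pipeline.Pipeline", "steps": {}},
}

parts = split_run_config(run_config)
print(parts["pipeline"]["_target_"])
```

If the pipeline is renamed, pipe_key has to be updated to match; in this sketch a mismatch surfaces as a KeyError.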
Expected Output
When you run this example, itwinai will:
Download and install the MNIST plugin
Generate a SLURM job script
Submit the job to your HPC cluster
Run distributed MNIST training across 2 nodes
Save checkpoints and training logs