Complete Workflow Example

This page shows you how to run a complete machine learning workflow with itwinai in just a few steps. This example demonstrates distributed training with SLURM integration, plugin usage, and pipeline configuration all in a single configuration file.

This guide uses the itwinai run command, a single entry point that installs itwinai plugins, defines an ML workflow and its hyperparameters, and submits a distributed AI job to a SLURM cluster. This works because one configuration file holds the plugin configuration, the SLURM cluster configuration, and the ML workflow definition (including hyperparameters).

For more details on the itwinai run command, see the official CLI reference for run.

Note

Difference between itwinai run and itwinai exec-pipeline

itwinai run receives a configuration file that covers plugins, the SLURM cluster, and ML workflows, and it takes care of executing the complete workflow end-to-end. See the official CLI reference for run.

itwinai exec-pipeline is lower-level and receives only the configuration of an ML workflow (a.k.a. a “pipeline”). It does not handle SLURM submission or plugin installation: it only executes the ML workload described by the pipeline, i.e., a subset of the configuration that itwinai run would use. See the official CLI reference for exec-pipeline.

Prerequisites

itwinai installed: pip install itwinai
Access to an HPC cluster with SLURM (and a corresponding pre-execution script)
Git access for plugin installation

Steps

Create a configuration file, e.g. run_config.yaml, that contains a pipeline, the required plugins, and a SLURM configuration, as follows:

# Default fields (always needed)
strategy: ddp
run_name: "mnist"

plugins:
  - <my first required plugin>
  - <my second required plugin>

slurm_config:
  job_name: my-job-name
  account: my-billing-account
  partition: my-partition
  submit_job: true   # Set this if you actually want to submit the job
  save_script: true  # Set this if you want to store the generated SLURM script(s)
  pre_exec_file: <path or URL to pre-execution file for current system>

  # Fields referring to this config file
  pipe_key: training_pipeline
  config_name: <name of this config file>
  config_path: <path to the directory containing this config file>

  # Propagate global config to this slurm_config
  distributed_strategy: ${strategy}
  run_name: ${run_name}

  # This provides a template for the training command launched using:
  # $ itwinai exec-pipeline -c <this config>.yaml [ARGS]
  # The template is filled in using the fields of this slurm_config.
  # Example:
  training_cmd: >
    {itwinai_launcher} exec-pipeline
    --config-name={config_name}
    --config-path={config_path}
    --strategy={distributed_strategy}
    --run-name={run_name}
    +pipe_key={pipe_key}

  # Any other SLURM configuration options you want to set.
  # Check out the SLURM builder for more information.
  ...

# Your pipeline. You can name it whatever you want, but make sure to set
# the pipe_key field in slurm_config to the same name.
training_pipeline:
  _target_: itwinai.pipeline.Pipeline
  steps:
    dataloading_step:
      ...

Run the workflow:
```
itwinai run -c run_config.yaml
```
The command above will install the dependencies and produce a SLURM job script, but it will not submit the job to SLURM. To also submit the job to the SLURM queue, add the -j option:
```
itwinai run -jc run_config.yaml
```
The slurm_config section follows the schema of itwinai.slurm.configuration.MLSlurmBuilderConfig (which extends itwinai.slurm.configuration.SlurmScriptConfiguration); that class documents each field and its default. Set values in the YAML file: the only command-line overrides applied on top of it are -j (submit the job) and -s (save the generated script).
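
The configuration above uses two placeholder styles that are resolved at different stages: ${strategy} and ${run_name} refer to other keys in the same file, while the {config_name}-style fields in training_cmd are filled in from the fields of slurm_config. The Python sketch below imitates both stages using only the standard library. It is an illustration under stated assumptions, not itwinai's implementation: the helpers resolve_refs and render_training_cmd, the launcher value "itwinai", and the trimmed-down dictionary are all hypothetical.

```python
import re

# Hypothetical, trimmed-down version of the configuration shown above.
config = {
    "strategy": "ddp",
    "run_name": "mnist",
    "slurm_config": {
        "pipe_key": "training_pipeline",
        "config_name": "run_config",
        "config_path": ".",
        "distributed_strategy": "${strategy}",
        "run_name": "${run_name}",
        # A YAML folded scalar (">") joins its lines with single spaces.
        "training_cmd": (
            "{itwinai_launcher} exec-pipeline "
            "--config-name={config_name} "
            "--config-path={config_path} "
            "--strategy={distributed_strategy} "
            "--run-name={run_name} "
            "+pipe_key={pipe_key}"
        ),
    },
}

_REF = re.compile(r"\$\{(\w+)\}")


def resolve_refs(section: dict, root: dict) -> dict:
    """Replace ${key} references with top-level values taken from root."""
    resolved = {}
    for key, value in section.items():
        if isinstance(value, str):
            value = _REF.sub(lambda m: str(root[m.group(1)]), value)
        resolved[key] = value
    return resolved


def render_training_cmd(slurm_config: dict, launcher: str) -> str:
    """Fill the {field} placeholders of training_cmd from slurm_config."""
    fields = {k: v for k, v in slurm_config.items() if k != "training_cmd"}
    return slurm_config["training_cmd"].format(itwinai_launcher=launcher, **fields)


slurm_config = resolve_refs(config["slurm_config"], config)
training_cmd = render_training_cmd(slurm_config, launcher="itwinai")
print(training_cmd)
```

Keeping the two stages separate mirrors the comments in the configuration: global values are first propagated into slurm_config, and only then is the command template filled in from that section.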

MNIST Example

Here’s a concrete example showing how to run distributed MNIST training on the Vega HPC system using the itwinai MNIST plugin. This simplified example shows the key components:

# Default fields (always needed)
strategy: ddp
run_name: "mnist"

# General config
dataset_root: .tmp/
num_classes: 10
batch_size: 128
num_workers_dataloader: 4
pin_memory: False
lr: 0.001
momentum: 0.9
fp16_allreduce: False
use_adasum: False
gradient_predivide_factor: 1.0
epochs: 5
test_data_path: mnist-sample-data
inference_model_mlflow_uri: mnist-pre-trained.pth
predictions_dir: mnist-predictions
predictions_file: predictions.csv
class_labels: null
checkpoints_location: checkpoints
checkpoint_every: 1

plugins:
  - git+https://github.com/matbun/itwinai-mnist-plugin.git

slurm_config:
  job_name: mnist-job
  account: s24r05-03-users
  partition: gpu
  memory: 64G
  mode: single
  num_nodes: 2
  submit_job: true
  save_script: true
  pre_exec_file: https://raw.githubusercontent.com/interTwin-eu/itwinai/refs/heads/main/src/itwinai/slurm/system-base-scripts/vega_pre_exec.sh

  # Fields referring to this config file
  pipe_key: training_pipeline
  config_name: run-example # Assuming this is the name of this file
  config_path: . # Assuming that run-example.yaml is in the current directory

  # Propagate global config to this slurm_config
  distributed_strategy: ${strategy}
  run_name: ${run_name}

  # This provides a template for the training command launched
  # using itwinai exec-pipeline -c <this config>.yaml
  # The template is filled in using the fields of this slurm_config.
  training_cmd: >
    {itwinai_launcher} exec-pipeline
    --config-name={config_name}
    --config-path={config_path}
    --strategy={distributed_strategy}
    --run-name={run_name}
    +pipe_key={pipe_key}

# Workflows configuration
training_pipeline:
  _target_: itwinai.pipeline.Pipeline
  steps:
    dataloading_step:
      _target_: itwinai.plugins.mnist.dataloader.MNISTDataModuleTorch
      save_path: ${dataset_root}
    training_step:
      _target_: itwinai.torch.trainer.TorchTrainer
      strategy: ${strategy}
      measure_gpu_data: False
      enable_torch_profiling: False
      store_torch_profiling_traces: False
      measure_epoch_time: False
      run_name: ${run_name}
      time_ray: True # track time for ray report and fit
      # from_checkpoint: ${itwinai.cwd:}/checkpoints_ddp/best_model/
      config:
        batch_size: ${batch_size}
        num_workers_dataloader: ${num_workers_dataloader}
        pin_gpu_memory: ${pin_memory}

    ... # The rest of the pipeline is omitted for the sake of readability

Note

This example has been simplified for readability. The full configuration includes additional parameters, hyperparameter optimization settings, detailed metrics, and more complete pipeline steps. For a working example, please refer to use-cases/mnist/torch/run-example.yaml.
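
Each _target_ key in the pipeline names a Python class by its dotted import path, and the sibling keys act as constructor arguments. This is the convention used by Hydra's instantiate utility. The sketch below imitates the idea with the standard library only. The helper build is hypothetical, and fractions.Fraction stands in for an itwinai class so that the snippet runs without itwinai installed. It does not handle lists or ${...} interpolation.

```python
from importlib import import_module
from typing import Any


def build(node: Any) -> Any:
    """Recursively build objects from dicts that carry a _target_ key."""
    if not isinstance(node, dict):
        return node
    # Build nested values first so they can be passed as arguments.
    kwargs = {k: build(v) for k, v in node.items() if k != "_target_"}
    if "_target_" not in node:
        return kwargs
    # Split "package.module.ClassName" into module path and class name.
    module_path, _, class_name = node["_target_"].rpartition(".")
    cls = getattr(import_module(module_path), class_name)
    return cls(**kwargs)


# Stand-in for a pipeline step; a real step would point at an itwinai class.
step = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = build(step)
print(type(obj).__name__, obj)
```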

Key Components

Plugin: Uses the itwinai-mnist-plugin which provides MNIST-specific components (based on the code in use-cases/mnist/torch/)
SLURM: Configured for Vega HPC system with 2 GPU nodes
Pipeline: Two-step workflow with data loading and distributed training
Logging: Combines console output with MLflow experiment tracking

Full Example Configuration

For a complete configuration with hyperparameter optimization, advanced metrics, and more detailed settings, see the full example at use-cases/mnist/torch/run-example.yaml. You can run it directly with:

itwinai run -jc https://raw.githubusercontent.com/interTwin-eu/itwinai/refs/heads/main/use-cases/mnist/torch/run-example.yaml

What This Example Does

This configuration demonstrates several key itwinai features:

Plugin Integration: The example uses the MNIST plugin from GitHub, showing how to extend itwinai with external components.
SLURM Integration: The slurm_config section automatically generates and submits SLURM jobs for HPC execution, including multi-node distributed training setup.
Unified Configuration: All training parameters, infrastructure settings, and pipeline definitions are in one file, making it easy to reproduce experiments.
Distributed Training: Configured for 2-node distributed training using the DDP (Distributed Data Parallel) strategy.
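
One way to picture the unified configuration is as three groups of top-level keys: plugins, slurm_config, and everything else (global parameters plus one or more pipelines, one of which is selected by pipe_key). The sketch below splits a plain dictionary along those lines and checks that pipe_key points at an existing section. The helper split_config is hypothetical and written only for this page; it is not part of itwinai, and the dictionary is a shortened copy of the MNIST example.

```python
def split_config(config: dict) -> tuple[list, dict, dict]:
    """Split a unified config into plugins, SLURM settings, and the rest."""
    plugins = list(config.get("plugins", []))
    slurm = dict(config.get("slurm_config", {}))
    workflow = {
        k: v for k, v in config.items() if k not in ("plugins", "slurm_config")
    }
    # The pipeline named by pipe_key must exist as a top-level section.
    pipe_key = slurm.get("pipe_key")
    if pipe_key is not None and pipe_key not in workflow:
        raise KeyError(f"pipe_key '{pipe_key}' does not match any top-level section")
    return plugins, slurm, workflow


# Shortened copy of the MNIST example configuration.
config = {
    "strategy": "ddp",
    "run_name": "mnist",
    "plugins": ["git+https://github.com/matbun/itwinai-mnist-plugin.git"],
    "slurm_config": {
        "job_name": "mnist-job",
        "num_nodes": 2,
        "pipe_key": "training_pipeline",
    },
    "training_pipeline": {"_target_": "itwinai.pipeline.Pipeline", "steps": {}},
}

plugins, slurm, workflow = split_config(config)
print(sorted(workflow))
```

A check like this catches a renamed pipeline section before any job is generated, which is the mismatch the comment above training_pipeline warns about.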

Expected Output

When you run this example, itwinai will:

Download and install the MNIST plugin
Generate a SLURM job script
Submit the job to your HPC cluster
Run distributed MNIST training across 2 nodes
Save checkpoints and training logs