Complete Workflow Example
=========================

This page shows how to run a complete machine learning workflow with itwinai
in a few steps, combining distributed training, SLURM integration, plugin
usage, and pipeline configuration in a single configuration file.

The workflow is driven by the ``itwinai run`` command, which is a single entry
point to install an itwinai plugin, define an ML workflow and its
hyperparameters, and submit a distributed AI job on a SLURM cluster. This is
possible because the same configuration file holds the plugin configuration,
the SLURM cluster configuration, and the ML workflow definition (including
hyperparameters). To learn more about the ``itwinai run`` command, refer to
the official CLI reference for ``run``.

.. note::

   **Difference between** ``itwinai run`` **and** ``itwinai exec-pipeline``

   - ``itwinai run`` receives a configuration file that includes configuration
     for plugins, the SLURM cluster, and ML workflows, and it takes care of
     executing the complete workflow end-to-end. See the official CLI
     reference for ``run``.
   - ``itwinai exec-pipeline`` is more low-level and is designed to receive
     only the configuration of an ML workflow (a.k.a. a "pipeline"). It does
     not take care of SLURM submission or plugin installation: it only
     executes the ML workload described by the pipeline, i.e., a subset of the
     configuration that ``itwinai run`` would use. See the official CLI
     reference for ``exec-pipeline``.

Prerequisites
-------------

- itwinai installed: ``pip install itwinai``
- Access to an HPC cluster with SLURM (and a corresponding pre-execution
  script)
- Git access for plugin installation

Steps
-----

1. Create a configuration file, e.g. ``run_config.yaml``, that contains a
   pipeline, the required plugins, and a SLURM configuration, as follows:

   .. code-block:: yaml

      # Default fields (always needed)
      strategy: ddp
      run_name: "mnist"

      # One entry per plugin to install (the MNIST example below uses a git URL)
      plugins:
        -
        -

      slurm_config:
        job_name: my-job-name
        account: my-billing-account
        partition: my-partition
        submit_job: true # Set this if you actually want to submit the job
        save_script: true # Set this if you want to store the generated SLURM script(s)
        pre_exec_file: # Pre-execution script for your cluster

        # Fields referring to this config file
        pipe_key: training_pipeline
        config_name: # Name of this file
        config_path: # Directory that contains this file

        # Propagate global config to this slurm_config
        distributed_strategy: ${strategy}
        run_name: ${run_name}

        # This provides a template for the training command launched using:
        # $ itwinai exec-pipeline -c .yaml [ARGS]
        # The template is filled in using the fields of this slurm_config.
        # Example:
        training_cmd: >
          {itwinai_launcher} exec-pipeline
          --config-name={config_name}
          --config-path={config_path}
          --strategy={distributed_strategy}
          --run-name={run_name}
          +pipe_key={pipe_key}

        # Any other SLURM configuration options you want to set.
        # Check out the SLURM builder for more information.
        ...

      # Your pipeline. You can name it whatever you want, but make sure to set
      # the ``pipe_key`` variable accordingly.
      training_pipeline:
        _target_: itwinai.pipeline.Pipeline
        steps:
          dataloading_step:
            ...

2. Run the workflow:

   .. code-block:: bash

      itwinai run -c run_config.yaml

   The command above will install the dependencies and produce a SLURM job
   script, but it will not submit the job to SLURM. To also submit the job to
   the SLURM queue, add the ``-j`` option:

   .. code-block:: bash

      itwinai run -jc run_config.yaml

The ``slurm_config`` section follows
:class:`itwinai.slurm.configuration.MLSlurmBuilderConfig` (extending
:class:`itwinai.slurm.configuration.SlurmScriptConfiguration`), which
documents each field and its default. Use the YAML to set values; ``-j``
(submission) and ``-s`` (saving) are the only CLI overrides applied on top of
the config.
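
As a rough illustration of how the ``training_cmd`` template above could be
expanded, the sketch below fills the ``{...}`` placeholders with plain Python
string formatting. It is not taken from the itwinai code base: the helper name
``fill_training_cmd`` and all field values are hypothetical (in particular the
``itwinai`` value used for ``itwinai_launcher``), and the substitution actually
performed by itwinai may differ.

.. code-block:: python

   # Hypothetical values mirroring the slurm_config fields shown above.
   slurm_config = {
       "itwinai_launcher": "itwinai",  # assumption: the real launcher may differ
       "config_name": "run_config",
       "config_path": ".",
       "distributed_strategy": "ddp",
       "run_name": "mnist",
       "pipe_key": "training_pipeline",
   }

   # Same placeholders as the training_cmd template in the YAML example.
   training_cmd_template = (
       "{itwinai_launcher} exec-pipeline "
       "--config-name={config_name} "
       "--config-path={config_path} "
       "--strategy={distributed_strategy} "
       "--run-name={run_name} "
       "+pipe_key={pipe_key}"
   )


   def fill_training_cmd(template: str, fields: dict) -> str:
       """Replace each {placeholder} with the matching entry of ``fields``."""
       return template.format(**fields)


   training_cmd = fill_training_cmd(training_cmd_template, slurm_config)
   print(training_cmd)

Under these assumptions, the printed command is the ``itwinai exec-pipeline``
invocation that would end up inside the generated SLURM script, which makes it
easier to see why ``config_name``, ``config_path``, and ``pipe_key`` must point
back to the configuration file itself.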
MNIST Example
-------------

Here's a concrete example showing how to run distributed MNIST training on
the Vega HPC system using the itwinai MNIST plugin. This simplified example
shows the key components:

.. code-block:: yaml

   # Default fields (always needed)
   strategy: ddp
   run_name: "mnist"

   # General config
   dataset_root: .tmp/
   num_classes: 10
   batch_size: 128
   num_workers_dataloader: 4
   pin_memory: False
   lr: 0.001
   momentum: 0.9
   fp16_allreduce: False
   use_adasum: False
   gradient_predivide_factor: 1.0
   epochs: 5
   test_data_path: mnist-sample-data
   inference_model_mlflow_uri: mnist-pre-trained.pth
   predictions_dir: mnist-predictions
   predictions_file: predictions.csv
   class_labels: null
   checkpoints_location: checkpoints
   checkpoint_every: 1

   plugins:
     - git+https://github.com/matbun/itwinai-mnist-plugin.git

   slurm_config:
     job_name: mnist-job
     account: s24r05-03-users
     partition: gpu
     memory: 64G
     mode: single
     num_nodes: 2
     submit_job: true
     save_script: true
     pre_exec_file: https://raw.githubusercontent.com/interTwin-eu/itwinai/refs/heads/main/src/itwinai/slurm/system-base-scripts/vega_pre_exec.sh

     # Fields referring to this config file
     pipe_key: training_pipeline
     config_name: run-example # Assuming this is the name of this file
     config_path: . # Assuming that run-example.yaml is in the current directory

     # Propagate global config to this slurm_config
     distributed_strategy: ${strategy}
     run_name: ${run_name}

     # This provides a template for the training command launched
     # using itwinai exec-pipeline -c .yaml
     # The template is filled in using the fields of this slurm_config.
     training_cmd: >
       {itwinai_launcher} exec-pipeline
       --config-name={config_name}
       --config-path={config_path}
       --strategy={distributed_strategy}
       --run-name={run_name}
       +pipe_key={pipe_key}

   # Workflows configuration
   training_pipeline:
     _target_: itwinai.pipeline.Pipeline
     steps:
       dataloading_step:
         _target_: itwinai.plugins.mnist.dataloader.MNISTDataModuleTorch
         save_path: ${dataset_root}

       training_step:
         _target_: itwinai.torch.trainer.TorchTrainer
         strategy: ${strategy}
         measure_gpu_data: False
         enable_torch_profiling: False
         store_torch_profiling_traces: False
         measure_epoch_time: False
         run_name: ${run_name}
         time_ray: True # track time for ray report and fit
         # from_checkpoint: ${itwinai.cwd:}/checkpoints_ddp/best_model/
         config:
           batch_size: ${batch_size}
           num_workers_dataloader: ${num_workers_dataloader}
           pin_gpu_memory: ${pin_memory}
           ... # The rest of the pipeline is omitted for the sake of readability

.. note::

   This example has been simplified for readability. The full configuration
   includes additional parameters, hyperparameter optimization settings,
   detailed metrics, and more complete pipeline steps. For a working example,
   please refer to ``use-cases/mnist/torch/run-example.yaml``.

Key Components
~~~~~~~~~~~~~~

- **Plugin**: Uses the ``itwinai-mnist-plugin`` which provides MNIST-specific
  components (based on the code in ``use-cases/mnist/torch/``)
- **SLURM**: Configured for Vega HPC system with 2 GPU nodes
- **Pipeline**: Two-step workflow with data loading and distributed training
- **Logging**: Combines console output with MLFlow experiment tracking

**Full Example Configuration**

For a complete configuration with hyperparameter optimization, advanced
metrics, and more detailed settings, see the full example at
``use-cases/mnist/torch/run-example.yaml``. You can run it directly with:

.. code-block:: bash

   itwinai run -jc https://raw.githubusercontent.com/interTwin-eu/itwinai/refs/heads/main/use-cases/mnist/torch/run-example.yaml

What This Example Does
~~~~~~~~~~~~~~~~~~~~~~

This configuration demonstrates several key itwinai features:

**Plugin Integration**

The example uses the MNIST plugin from GitHub, showing how to extend itwinai
with external components.

**SLURM Integration**

The ``slurm_config`` section automatically generates and submits SLURM jobs
for HPC execution, including multi-node distributed training setup.

**Unified Configuration**

All training parameters, infrastructure settings, and pipeline definitions
are in one file, making it easy to reproduce experiments.

**Distributed Training**

Configured for 2-node distributed training using DDP (Distributed Data
Parallel) strategy.

Expected Output
~~~~~~~~~~~~~~~

When you run this example, itwinai will:

1. Download and install the MNIST plugin
2. Generate a SLURM job script
3. Submit the job to your HPC cluster
4. Run distributed MNIST training across 2 nodes
5. Save checkpoints and training logs
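
The unified configuration relies on ``${...}`` references such as
``${strategy}`` and ``${run_name}`` to propagate top-level values into
``slurm_config`` and the pipeline steps. The sketch below imitates that
behaviour with the Python standard library only, to show why changing a
top-level value once is enough. It is an illustration under stated
assumptions, not the mechanism itwinai uses: the ``resolve`` helper is
hypothetical and only handles values that consist entirely of a single
reference to a top-level key.

.. code-block:: python

   import re

   # A trimmed-down, hypothetical version of the MNIST configuration above.
   config = {
       "strategy": "ddp",
       "run_name": "mnist",
       "batch_size": 128,
       "slurm_config": {
           "distributed_strategy": "${strategy}",
           "run_name": "${run_name}",
       },
       "training_pipeline": {
           "strategy": "${strategy}",
           "batch_size": "${batch_size}",
       },
   }

   # Matches a value that is exactly one reference, e.g. "${strategy}".
   _REFERENCE = re.compile(r"\$\{(\w+)\}")


   def resolve(node, root):
       """Return a copy of ``node`` with whole-value references looked up in ``root``."""
       if isinstance(node, dict):
           return {key: resolve(value, root) for key, value in node.items()}
       if isinstance(node, str):
           match = _REFERENCE.fullmatch(node)
           if match:
               # Keep the referenced value's original type (e.g. int for batch_size).
               return root[match.group(1)]
       return node


   resolved = resolve(config, config)
   print(resolved["slurm_config"])
   print(resolved["training_pipeline"])

In this sketch, both ``slurm_config`` and ``training_pipeline`` pick up the
same ``strategy`` value, so the SLURM job and the trainer cannot drift apart.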