.. _using_configuration_files: ======================================= Using Configuration Files in itwinai ======================================= Pipeline and configuration files ================================ **Author(s)**: Anna Lappe (CERN), Matteo Bunino (CERN) In the previous tutorial, we introduced how to create new components and assemble them into a **Pipeline** for a simplified workflow execution. The **Pipeline** executes components in the order they are defined, assuming that each component's outputs will fit as inputs to the next one. Sometimes, you might want to define your pipeline in a configuration **YAML** file instead. This allows for: - Easier modification and reuse of pipeline definitions - Clear separation between code and configuration - Dynamic overrides of parameters at runtime Example Configuration File ========================== .. code-block:: yaml training_pipeline: _target_: itwinai.pipeline.Pipeline steps: my-getter: _target_: basic_components.MyDataGetter data_size: 200 my-splitter: _target_: basic_components.MyDatasetSplitter train_proportion: 0.5 validation_proportion: 0.25 test_proportion: 0.25 my-trainer: _target_: basic_components.MyTrainer itwinai uses `Hydra `_ to parse configuration files and instantiate pipelines dynamically. There are two ways you can use your configuration file to run a pipeline: from the command-line interface (CLI) or from within your code. Parsing Pipelines from CLI ========================== You can execute a pipeline from a configuration file (which by default is called ``config.yaml``) with the CLI using: .. code-block:: bash itwinai exec-pipeline This command loads the configuration file and executes the defined pipeline. You can customize execution using the following options: 1. Setting a ``--config-path`` and ``--config-name`` ----------------------------------------------------- By default, the parser will look for a file called ``config.yaml`` inside your current working directory. If you want to change this, set the path to your configuration file with the ``--config-path`` option. This can be either absolute or relative to your current working directory and should point to the directory in which your configuration file is located. If your configuration file has a different name (not ``config.yaml``), you may specify this with the ``--config-name`` flag: .. code-block:: bash itwinai exec-pipeline --config-path path/to/dir --config-name my-config-file Note that we omit the extension ``.yaml``. This is intentional, as hydra expects only the stem of the filename. 2. Selecting a pipeline by name (``+pipe_key``) ----------------------------------------------- A configuration file can contain multiple pipelines. The default key that the parser will look for is ``training_pipeline``. Use the ``pipe_key`` argument to overwrite this default and specify which pipeline to execute: .. code-block:: bash itwinai exec-pipeline +pipe_key=another_training_pipeline 3. Selecting Steps to Run (``+pipe_steps``) ------------------------------------------- If you only want to run a subset of specific steps of the pipeline, use ``pipe_steps``: .. code-block:: bash itwinai exec-pipeline +pipe_steps=[my-splitter, my-trainer] This will execute only the ``MyDatasetSplitter`` and ``MyTrainer`` steps of the pipeline. You can also give ``pipe_steps`` as a list of indices, if your configuration file defines your steps in list format. 4. Dynamically overriding configuration fields ---------------------------------------------- You can override any parameter in the configuration file directly from the command line: .. code-block:: bash itwinai exec-pipeline +training_pipeline.steps.my-getter.data_size=500 This modifies the ``data_size`` parameter inside the pipeline configuration. You can also override fields if your pipeline steps are defined in the form of a list in your configuration file. In this case, you give the step's index instead of its name, for example .. code-block:: bash itwinai exec-pipeline +training_pipeline.steps.0.data_size=500 Advanced Functionality with Hydra ================================= Since this implementation is based on **Hydra**, you can use all of Hydra’s command-line arguments, such as for multi-run execution, merging configuration files, and debugging. For more details, refer to the `Hydra documentation `_. .. note:: If your pipeline execution fails and you need detailed error messages, we recommend that you set the following environment variable before running the pipeline: .. code-block:: bash export HYDRA_FULL_ERROR=1 This will give you more verbose error messages including the full stack trace given by Hydra. If you do not want the variable to persist, i.e. you only want to run your command with the the detailed error message once, you can also run it such that the environment variable ``HYDRA_FULL_ERROR`` will not persist and reset after your command has been executed: .. code-block:: bash HYDRA_FULL_ERROR=1 itwinai exec-pipeline Parsing Pipelines from Python ============================= In some cases, you may want to parse and execute a pipeline from a configuration file from within your Python code. You can do this by running: .. code-block:: python from hydra import compose, initialize from itwinai import exec_pipeline_with_compose # Here, we show how to run a pre-existing pipeline stored as # a configuration file from within Python code, with the possibility of dynamically # overriding some fields # Load pipeline from saved YAML (dynamic deserialization) with initialize(): cfg = compose( config_name="my-config.yaml", overrides=[ "pipeline.steps.0.data_size=400", ], ) exec_pipeline_with_compose(cfg) Reproducibility =============== Each execution logs the pipeline configuration under the ``outputs/`` directory. This ensures reproducibility by recording the exact parameters used for execution.