Using Configuration Files in itwinaiο
Pipeline and configuration filesο
Author(s): Anna Lappe (CERN), Matteo Bunino (CERN)
In the previous tutorial, we introduced how to create new components and assemble them into a Pipeline for a simplified workflow execution. The Pipeline executes components in the order they are defined, assuming that each componentβs outputs will fit as inputs to the next one.
Sometimes, you might want to define your pipeline in a configuration YAML file instead. This allows for:
Easier modification and reuse of pipeline definitions
Clear separation between code and configuration
Dynamic overrides of parameters at runtime
Example Configuration Fileο
training_pipeline:
_target_: itwinai.pipeline.Pipeline
steps:
my-getter:
_target_: basic_components.MyDataGetter
data_size: 200
my-splitter:
_target_: basic_components.MyDatasetSplitter
train_proportion: 0.5
validation_proportion: 0.25
test_proportion: 0.25
my-trainer:
_target_: basic_components.MyTrainer
itwinai uses Hydra to parse configuration files and instantiate pipelines dynamically. There are two ways you can use your configuration file to run a pipeline: from the command-line interface (CLI) or from within your code.
Parsing Pipelines from CLIο
You can execute a pipeline from a configuration file (which by default is called config.yaml)
with the CLI using:
itwinai exec-pipeline
This command loads the configuration file and executes the defined pipeline. You can customize execution using the following options:
1. Setting a --config-path and --config-nameο
By default, the parser will look for a file called config.yaml inside your current working
directory. If you want to change this, set the path to your configuration file with the
--config-path option. This can be either absolute or relative to your current working
directory and should point to the directory in which your configuration file is located.
If your configuration file has a different name (not config.yaml), you may specify this with the --config-name flag:
itwinai exec-pipeline --config-path path/to/dir --config-name my-config-file
Note that we omit the extension .yaml. This is intentional, as hydra expects only the stem
of the filename.
2. Selecting a pipeline by name (+pipe_key)ο
A configuration file can contain multiple pipelines. The default key that the parser will look
for is training_pipeline. Use the pipe_key argument to overwrite this default and
specify which pipeline to execute:
itwinai exec-pipeline +pipe_key=another_training_pipeline
3. Selecting Steps to Run (+pipe_steps)ο
If you only want to run a subset of specific steps of the pipeline, use pipe_steps:
itwinai exec-pipeline +pipe_steps=[my-splitter, my-trainer]
This will execute only the MyDatasetSplitter and MyTrainer steps of the pipeline. You can also
give pipe_steps as a list of indices, if your configuration file defines your steps in list format.
4. Dynamically overriding configuration fieldsο
You can override any parameter in the configuration file directly from the command line:
itwinai exec-pipeline +training_pipeline.steps.my-getter.data_size=500
This modifies the data_size parameter inside the pipeline configuration. You can also override
fields if your pipeline steps are defined in the form of a list in your configuration file.
In this case, you give the stepβs index instead of its name, for example
itwinai exec-pipeline +training_pipeline.steps.0.data_size=500
Advanced Functionality with Hydraο
Since this implementation is based on Hydra, you can use all of Hydraβs command-line arguments, such as for multi-run execution, merging configuration files, and debugging. For more details, refer to the Hydra documentation.
Note
If your pipeline execution fails and you need detailed error messages, we recommend that you set the following environment variable before running the pipeline:
export HYDRA_FULL_ERROR=1
This will give you more verbose error messages including the full stack trace given by Hydra.
If you do not want the variable to persist, i.e. you only want to run your command with the
the detailed error message once, you can also run it such that the environment variable
HYDRA_FULL_ERROR will not persist and reset after your command has been executed:
HYDRA_FULL_ERROR=1 itwinai exec-pipeline
Parsing Pipelines from Pythonο
In some cases, you may want to parse and execute a pipeline from a configuration file from within your Python code. You can do this by running:
from hydra import compose, initialize
from itwinai import exec_pipeline_with_compose
# Here, we show how to run a pre-existing pipeline stored as
# a configuration file from within Python code, with the possibility of dynamically
# overriding some fields
# Load pipeline from saved YAML (dynamic deserialization)
with initialize():
cfg = compose(
config_name="my-config.yaml",
overrides=[
"pipeline.steps.0.data_size=400",
],
)
exec_pipeline_with_compose(cfg)
Reproducibilityο
Each execution logs the pipeline configuration under the outputs/ directory. This ensures
reproducibility by recording the exact parameters used for execution.