{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e5064a94",
   "metadata": {},
   "source": [
    "# Pipeline and configuration files\n",
    "\n",
    "**Author(s)**: Matteo Bunino (CERN)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "588e0f6a",
   "metadata": {},
   "source": [
    "In the previous tutorial we saw how to create new components and assemble them\n",
    "into a Pipeline for a simplified workflow execution. The Pipeline executes\n",
    "the components in the order in which they are given, *assuming* that the\n",
    "outputs of a component will fit as inputs of the following component.\n",
    "This is not always true, thus you can use the ``Adapter`` component to\n",
    "compensate for mismatches. This component allows users to define a policy to\n",
    "rearrange intermediate results between two components.\n",
    "\n",
    "Moreover, it is good for reproducibility to keep track of the pipeline\n",
    "configuration used to achieve some outstanding ML results. It would be a shame\n",
    "to forget how you achieved state-of-the-art results!\n",
    "\n",
    "itwinai allows to export the Pipeline form Python code to configuration file,\n",
    "to persist both parameters and workflow structure. Exporting to configuration\n",
    "file assumes that each component class resides in a separate python file, so\n",
    "that the pipeline configuration is agnostic from the current python script.\n",
    "\n",
    "Once the Pipeline has been exported to configuration file (YAML), it can\n",
    "be executed directly from CLI:\n",
    "\n",
    "```bash\n",
    "itwinai exec-pipeline --config my-pipeline.yaml --override nested.key=42\n",
    "```\n",
    "\n",
    "The itwinai CLI allows for dynamic override of configuration fields, by means\n",
    "of nested key notation. Also list indices are supported:\n",
    "\n",
    "```bash\n",
    "itwinai exec-pipeline --config my-pipe.yaml --override nested.list.2.0=42\n",
    "```"
   ]
  },
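  {
   "cell_type": "markdown",
   "id": "7c1e9a20",
   "metadata": {},
   "source": [
    "To make the nested key notation more concrete, the next cell gives a minimal\n",
    "sketch of how a dotted key such as `nested.list.2.0` could be resolved against\n",
    "a nested configuration. The `set_nested` helper is hypothetical and written only\n",
    "for illustration: it is not part of itwinai and may differ from how the CLI\n",
    "applies overrides internally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f5b8d42",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper (not part of itwinai): apply a dotted-key override\n",
    "# to a nested structure made of dicts and lists.\n",
    "def set_nested(config, dotted_key, value):\n",
    "    keys = dotted_key.split('.')\n",
    "    node = config\n",
    "    for key in keys[:-1]:\n",
    "        # Segments index into lists by position and into dicts by name\n",
    "        node = node[int(key)] if isinstance(node, list) else node[key]\n",
    "    last = keys[-1]\n",
    "    if isinstance(node, list):\n",
    "        node[int(last)] = value\n",
    "    else:\n",
    "        node[last] = value\n",
    "    return config\n",
    "\n",
    "\n",
    "# Toy configuration, mirroring the keys used in the CLI examples above\n",
    "config = {'nested': {'key': 1, 'list': [0, 1, [7, 8]]}}\n",
    "set_nested(config, 'nested.key', 42)\n",
    "set_nested(config, 'nested.list.2.0', 42)\n",
    "print(config)"
   ]
  },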
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "c30284ef",
   "metadata": {},
   "outputs": [],
   "source": [
    "from itwinai.pipeline import Pipeline\n",
    "from itwinai.parser import ConfigParser\n",
    "from itwinai.components import Adapter\n",
    "\n",
    "# Now we are importing the components from an external module\n",
    "from basic_components import MyDataGetter, MyDatasetSplitter, MyTrainer, MySaver\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ad31f43c",
   "metadata": {},
   "source": [
    "## Running the pipeline\n",
    "\n",
    "Here you can find a graphical representation of the pipeline implemented below.\n",
    "\n",
    "![pipeline](sample_pipeline_2.jpg)\n",
    "\n",
    "### Important!\n",
    "\n",
    "Pipeline components can be serialized only when they are imported from an external file!\n",
    "\n",
    "In this case, `MyDataGetter`, `MyDatasetSplitter`, and `MyTrainer` are imported from `basic_components`.\n",
    "Otherwise, the pipe serialization cannot be deserialized by another process."
   ]
  },
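  {
   "cell_type": "markdown",
   "id": "b94d06e7",
   "metadata": {},
   "source": [
    "The requirement above can be pictured with a small sketch. It assumes that the\n",
    "configuration file refers to each component through an importable dotted class\n",
    "path together with its `init_args`; the `class_path` name and the `instantiate`\n",
    "helper below are assumptions made for illustration, not itwinai's actual loader.\n",
    "A class defined inside a notebook or script lives in `__main__`, which a\n",
    "different process cannot import, whereas a class in a module such as\n",
    "`basic_components` can be found again by name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d2a6c815",
   "metadata": {},
   "outputs": [],
   "source": [
    "import importlib\n",
    "\n",
    "\n",
    "# Hypothetical helper (not itwinai's implementation): rebuild an object\n",
    "# from a dotted class path and its keyword arguments.\n",
    "def instantiate(class_path, init_args):\n",
    "    module_name, class_name = class_path.rsplit('.', 1)\n",
    "    # Raises ImportError if the loading process cannot import the module\n",
    "    module = importlib.import_module(module_name)\n",
    "    return getattr(module, class_name)(**init_args)\n",
    "\n",
    "\n",
    "# A standard-library class stands in for a pipeline component here\n",
    "frac = instantiate('fractions.Fraction', {'numerator': 1, 'denominator': 4})\n",
    "print(frac)"
   ]
  },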
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "826fe8c7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "#######################################\n",
      "# Starting execution of 'Pipeline'... #\n",
      "#######################################\n",
      "###########################################\n",
      "# Starting execution of 'MyDataGetter'... #\n",
      "###########################################\n",
      "#####################################\n",
      "# 'MyDataGetter' executed in 0.000s #\n",
      "#####################################\n",
      "################################################\n",
      "# Starting execution of 'MyDatasetSplitter'... #\n",
      "################################################\n",
      "##########################################\n",
      "# 'MyDatasetSplitter' executed in 0.000s #\n",
      "##########################################\n",
      "########################################\n",
      "# Starting execution of 'MyTrainer'... #\n",
      "########################################\n",
      "##################################\n",
      "# 'MyTrainer' executed in 0.000s #\n",
      "##################################\n",
      "######################################\n",
      "# Starting execution of 'Adapter'... #\n",
      "######################################\n",
      "################################\n",
      "# 'Adapter' executed in 0.000s #\n",
      "################################\n",
      "######################################\n",
      "# Starting execution of 'MySaver'... #\n",
      "######################################\n",
      "################################\n",
      "# 'MySaver' executed in 0.000s #\n",
      "################################\n",
      "#################################\n",
      "# 'Pipeline' executed in 0.001s #\n",
      "#################################\n",
      "Trained model:  my_trained_model\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# In this pipeline, the MyTrainer produces 4 elements as output: train,\n",
    "# validation, test datasets, and trained model. The Adapter selects the\n",
    "# trained model only, and forwards it to the saver, which expects a single\n",
    "# item as input.\n",
    "pipeline = Pipeline([\n",
    "    MyDataGetter(data_size=100),\n",
    "    MyDatasetSplitter(\n",
    "        train_proportion=.5,\n",
    "        validation_proportion=.25,\n",
    "        test_proportion=0.25\n",
    "    ),\n",
    "    MyTrainer(),\n",
    "    Adapter(policy=[f\"{Adapter.INPUT_PREFIX}-1\"]),\n",
    "    MySaver()\n",
    "])\n",
    "\n",
    "# Run pipeline\n",
    "trained_model = pipeline.execute()\n",
    "print(\"Trained model: \", trained_model)\n",
    "\n",
    "# Serialize pipeline to YAML\n",
    "pipeline.to_yaml(\"basic_pipeline_example.yaml\", \"pipeline\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b6a7391f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MyDataGetter's data_size is now: 200\n",
      "\n",
      "#######################################\n",
      "# Starting execution of 'Pipeline'... #\n",
      "#######################################\n",
      "###########################################\n",
      "# Starting execution of 'MyDataGetter'... #\n",
      "###########################################\n",
      "#####################################\n",
      "# 'MyDataGetter' executed in 0.000s #\n",
      "#####################################\n",
      "################################################\n",
      "# Starting execution of 'MyDatasetSplitter'... #\n",
      "################################################\n",
      "##########################################\n",
      "# 'MyDatasetSplitter' executed in 0.000s #\n",
      "##########################################\n",
      "########################################\n",
      "# Starting execution of 'MyTrainer'... #\n",
      "########################################\n",
      "##################################\n",
      "# 'MyTrainer' executed in 0.000s #\n",
      "##################################\n",
      "######################################\n",
      "# Starting execution of 'Adapter'... #\n",
      "######################################\n",
      "################################\n",
      "# 'Adapter' executed in 0.000s #\n",
      "################################\n",
      "######################################\n",
      "# Starting execution of 'MySaver'... #\n",
      "######################################\n",
      "################################\n",
      "# 'MySaver' executed in 0.000s #\n",
      "################################\n",
      "#################################\n",
      "# 'Pipeline' executed in 0.001s #\n",
      "#################################\n",
      "Trained model (2):  my_trained_model\n"
     ]
    }
   ],
   "source": [
    "# Here, we show how to run a pre-existing pipeline stored as\n",
    "# a configuration file, with the possibility of dynamically\n",
    "# override some fields\n",
    "\n",
    "# Load pipeline from saved YAML (dynamic deserialization)\n",
    "parser = ConfigParser(\n",
    "    config=\"basic_pipeline_example.yaml\",\n",
    "    override_keys={\n",
    "        \"pipeline.init_args.steps.0.init_args.data_size\": 200\n",
    "    }\n",
    ")\n",
    "pipeline = parser.parse_pipeline()\n",
    "print(f\"MyDataGetter's data_size is now: {pipeline.steps[0].data_size}\\n\")\n",
    "\n",
    "# Run parsed pipeline, with new data_size for MyDataGetter\n",
    "trained_model = pipeline.execute()\n",
    "print(\"Trained model (2): \", trained_model)\n",
    "\n",
    "# Save new pipeline to YAML file\n",
    "pipeline.to_yaml(\"basic_pipeline_example_v2.yaml\", \"pipeline\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "416ac5ab",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "#######################################\n",
      "# Starting execution of 'Pipeline'... #\n",
      "#######################################\n",
      "###########################################\n",
      "# Starting execution of 'MyDataGetter'... #\n",
      "###########################################\n",
      "#####################################\n",
      "# 'MyDataGetter' executed in 0.000s #\n",
      "#####################################\n",
      "################################################\n",
      "# Starting execution of 'MyDatasetSplitter'... #\n",
      "################################################\n",
      "##########################################\n",
      "# 'MyDatasetSplitter' executed in 0.000s #\n",
      "##########################################\n",
      "########################################\n",
      "# Starting execution of 'MyTrainer'... #\n",
      "########################################\n",
      "##################################\n",
      "# 'MyTrainer' executed in 0.000s #\n",
      "##################################\n",
      "######################################\n",
      "# Starting execution of 'Adapter'... #\n",
      "######################################\n",
      "################################\n",
      "# 'Adapter' executed in 0.000s #\n",
      "################################\n",
      "######################################\n",
      "# Starting execution of 'MySaver'... #\n",
      "######################################\n",
      "################################\n",
      "# 'MySaver' executed in 0.000s #\n",
      "################################\n",
      "#################################\n",
      "# 'Pipeline' executed in 0.001s #\n",
      "#################################\n"
     ]
    }
   ],
   "source": [
    "# Emulate pipeline execution from CLI, with dynamic override of\n",
    "# pipeline configuration fields\n",
    "!itwinai exec-pipeline --config basic_pipeline_example_v2.yaml \\\\\n",
    "    --override pipeline.init_args.steps.0.init_args.data_size=300 \\\\\n",
    "    --override pipeline.init_args.steps.1.init_args.train_proportion=0.4"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}