CLI

Here you can find the itwinai CLI reference.

itwinai command line interface.

Usage:

$ itwinai [OPTIONS] COMMAND [ARGS]...

Options:

-v, --version: Show version and exit.
--install-completion: Install completion for the current shell.
--show-completion: Show completion for the current shell, to copy it or customize the installation.
--help: Show this message and exit.

Commands:

generate-flamegraph: Generates a flamegraph from the given…
generate-py-spy-report: Generates an aggregation of the raw py-spy…
generate-scalability-report: Generates scalability reports for epoch…
sanity-check: Run sanity checks on the installation of…
check-distributed-cluster: This command provides a suite of tests for…
generate-slurm: Generate a SLURM script from a…
run: Launch ML jobs with dependency…
exec-pipeline: Execute a pipeline from configuration file…
mlflow-ui: Visualize logs with Mlflow.
mlflow-server: Spawn Mlflow server.
kill-mlflow-server: Kill Mlflow server.
download-mlflow-data: Download metrics data from MLFlow…
tensorboard-ui: Visualize logs with TensorBoard.
upload-model-to-hub: Upload a model checkpoint to the AI Model…

`generate-flamegraph`

Generates a flamegraph from the given profiling output.

Usage:

$ itwinai generate-flamegraph [OPTIONS]

Options:

--file TEXT: The location of the raw profiling data. [required]
--output-filename TEXT: The filename of the resulting flamegraph. [default: flamegraph.svg]
--help: Show this message and exit.
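A minimal sketch of an invocation, assuming the CLI is installed as the `itwinai` executable. The input file name `profiling_data.txt` is hypothetical. The command is printed first and is run only when both the executable and the input file are present.

```shell
# Hypothetical path to the raw profiling output; replace with your own file.
PROFILE_FILE="profiling_data.txt"

# Assemble the invocation as positional parameters so it can be printed and run unchanged.
set -- itwinai generate-flamegraph --file "$PROFILE_FILE" --output-filename flamegraph.svg
echo "$*"

# Run only when itwinai is installed and the input file exists.
if command -v itwinai >/dev/null 2>&1 && [ -f "$PROFILE_FILE" ]; then
    "$@"
fi
```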

`generate-py-spy-report`

Generates an aggregation of the raw py-spy profiling data, showing which leaf functions collected the most samples.

Usage:

$ itwinai generate-py-spy-report [OPTIONS]

Options:

--file TEXT: The location of the raw profiling data. [required]
--num-rows TEXT: Number of rows to display. Pass ‘all’ to print the full table. [default: 10]
--aggregate-leaf-paths / --no-aggregate-leaf-paths: Whether to aggregate all unique leaf calls across different call stacks. [default: no-aggregate-leaf-paths]
--library-name TEXT: Which library name to find the lowest contact point of. [default: itwinai]
--help: Show this message and exit.
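As an illustration, the options above could be combined to print the full table with leaf calls aggregated across call stacks. This is a sketch that assumes the `itwinai` executable is on the PATH; the file name `profiling_data.txt` is hypothetical.

```shell
# Hypothetical path to the raw py-spy output; replace with your own file.
PROFILE_FILE="profiling_data.txt"

# 'all' prints every row instead of the default 10.
set -- itwinai generate-py-spy-report --file "$PROFILE_FILE" --num-rows all \
    --aggregate-leaf-paths --library-name itwinai
echo "$*"

# Run only when itwinai is installed and the input file exists.
if command -v itwinai >/dev/null 2>&1 && [ -f "$PROFILE_FILE" ]; then
    "$@"
fi
```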

`generate-scalability-report`

Generates scalability reports for epoch time, GPU data, and communication data based on the MLflow logs.

This command processes runs under the given experiment at a tracking URI. It generates plots and metrics for scalability analysis and saves them in the plot_dir.

Usage:

$ itwinai generate-scalability-report [OPTIONS]

Options:

--tracking-uri TEXT: The tracking URI of the MLFlow server. [default: mllogs/mlflow]
--experiment-name TEXT: The name of the mlflow experiment to use for the GPU data report. [default: unnamed-experiment]
--plot-dir TEXT: Which directory to save the resulting plots in. [default: plots]
--run-names TEXT: Which run names to read, presented as comma-separated values e.g. ‘run0,run1’.
--plot-file-suffix TEXT: Which file suffix to use for the plots. Useful for switching between raster and vector images. [default: .png]
--include-communication / --no-include-communication: Include communication data in the scalability report. Disclaimer: Communication fractions are unreliable and vary significantly for different HPC systems. [default: no-include-communication]
--no-warnings / --no-no-warnings: Create plots without warnings. [default: no-no-warnings]
--help: Show this message and exit.
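A sketch of a report restricted to two runs and saved as vector images. The experiment name `my-experiment` and the run names are hypothetical, and the `itwinai` executable is assumed to be on the PATH.

```shell
# Default local tracking location from the options above.
TRACKING_URI="mllogs/mlflow"
# Hypothetical experiment name; replace with your own.
EXPERIMENT="my-experiment"

# Limit the report to two runs and write .svg plots into ./plots.
set -- itwinai generate-scalability-report \
    --tracking-uri "$TRACKING_URI" \
    --experiment-name "$EXPERIMENT" \
    --run-names run0,run1 \
    --plot-dir plots \
    --plot-file-suffix .svg
echo "$*"

# Run only when itwinai is installed and the local log directory exists.
if command -v itwinai >/dev/null 2>&1 && [ -d "$TRACKING_URI" ]; then
    "$@"
fi
```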

`sanity-check`

Run sanity checks on the installation of itwinai and its dependencies by trying to import itwinai modules. By default, only itwinai core modules (neither torch, nor tensorflow) are tested.

Usage:

$ itwinai sanity-check [OPTIONS]

Options:

--torch / --no-torch: Check also itwinai.torch modules. [default: no-torch]
--tensorflow / --no-tensorflow: Check also itwinai.tensorflow modules. [default: no-tensorflow]
--all / --no-all: Check all modules. [default: no-all]
--optional-deps TEXT: List of optional dependencies.
--help: Show this message and exit.
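For example, checking the torch modules in addition to the core ones might look like the following sketch, which assumes the `itwinai` executable is on the PATH.

```shell
# Check the itwinai core modules plus the itwinai.torch modules.
set -- itwinai sanity-check --torch
echo "$*"

# Run only when itwinai is installed.
if command -v itwinai >/dev/null 2>&1; then
    "$@"
fi
```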

`check-distributed-cluster`

This command provides a suite of tests for a quick sanity check of the network setup for torch distributed. It is useful when working with containers on HPC. Remember to either prepend torchrun to this command or start a Ray cluster first.

Usage:

$ itwinai check-distributed-cluster [OPTIONS]

Options:

--platform TEXT: Hardware platform: nvidia or amd. [default: nvidia]
--launcher TEXT: Distributed ML cluster: torchrun or ray. [default: torchrun]
--help: Show this message and exit.
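The note about prepending torchrun could be sketched as follows for a single node. This rests on assumptions not stated above: the `itwinai` entry point is a Python script that torchrun can launch by path, two worker processes are wanted, and the torchrun flags `--standalone` and `--nproc_per_node` suit your setup. Adapt the launcher arguments to your cluster.

```shell
# Resolve the full path of the itwinai entry point (falls back to the bare name).
ITWINAI_BIN="$(command -v itwinai || echo itwinai)"

# Single-node launch with 2 worker processes (assumed values).
set -- torchrun --standalone --nproc_per_node=2 "$ITWINAI_BIN" \
    check-distributed-cluster --platform nvidia --launcher torchrun
echo "$*"

# Run only when both torchrun and itwinai are installed.
if command -v torchrun >/dev/null 2>&1 && command -v itwinai >/dev/null 2>&1; then
    "$@"
fi
```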

`generate-slurm`

Generate a SLURM script from a configuration file.

Usage:

$ itwinai generate-slurm [OPTIONS]

Options:

-c, --config TEXT: Path or URL to a YAML SLURM configuration file. [required]
-j, --submit-job / --no-submit-job: Whether to submit the SLURM job after generating the script.
-s, --save-script / --no-save-script: Whether to save the generated SLURM script to disk.
--help: Show this message and exit.
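A sketch that writes the generated script to disk without submitting the job, assuming the `itwinai` executable is on the PATH. The configuration file name `slurm_config.yaml` is hypothetical and its contents are not shown here.

```shell
# Hypothetical SLURM configuration file; replace with your own path or URL.
SLURM_CONFIG="slurm_config.yaml"

# Save the script for inspection and do not submit the job.
set -- itwinai generate-slurm --config "$SLURM_CONFIG" --save-script --no-submit-job
echo "$*"

# Run only when itwinai is installed and the configuration file exists.
if command -v itwinai >/dev/null 2>&1 && [ -f "$SLURM_CONFIG" ]; then
    "$@"
fi
```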

`run`

Launch ML jobs with dependency installation and SLURM scheduling.

Usage:

$ itwinai run [OPTIONS]

Options:

-c, --config TEXT: Path or URL to a configuration file in yaml format. [required]
-j, --submit-job / --no-submit-job: Whether to submit the SLURM job after generating the script.
-s, --save-script / --no-save-script: Whether to save the generated SLURM script to disk.
--help: Show this message and exit.

`exec-pipeline`

Execute a pipeline from a configuration file using the Hydra CLI. Fields can be overridden dynamically by appending a list of overrides (e.g., batch_size=32). By default, the command expects a configuration file called “config.yaml” in the current working directory; to change this, set --config-name and --config-path. By default, the command executes the whole pipeline under the “training_pipeline” field in the configuration file. To execute a different pipeline, pass “+pipe_key=your_pipeline” in the list of overrides; to execute only a subset of the steps, pass “+pipe_steps=[0,1]”.

Usage:

$ itwinai exec-pipeline [OPTIONS] [OVERRIDES]...

Arguments:

[OVERRIDES]...: Any key=value arguments to override config values (use dots for.nested=overrides), using the Hydra syntax.

Options:

--hydra-help / --no-hydra-help: Show Hydra’s help page [default: no-hydra-help]
--version / --no-version: Show Hydra’s version and exit [default: no-version]
-c, --cfg TEXT: Show config instead of running
--resolve / --no-resolve: Used in conjunction with --cfg, resolve config interpolations before printing. [default: no-resolve]
-p, --package TEXT: Config package to show
-r, --run TEXT: Run a job
-m, --multirun TEXT: Run multiple jobs with the configured launcher and sweeper
-sc, --shell-completion TEXT: Install or Uninstall shell completion
--strategy TEXT: Override the global ‘strategy’ field in the config (creates it if missing).
--run-name TEXT: Override the global ‘run_name’ field in the config (creates it if missing).
-cp, --config-path TEXT: Overrides the config_path specified in hydra.main(). The config_path is absolute, or relative to the current working directory. Defaults to the current working directory.
-cn, --config-name TEXT: Overrides the config_name specified in hydra.main() [default: config]
-cd, --config-dir TEXT: Adds an additional config dir to the config search path
--experimental-rerun TEXT: Rerun a job from a previous config pickle
-i, --info TEXT: Print Hydra information
--help: Show this message and exit.
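The override mechanism described for this command can be sketched as below, assuming the `itwinai` executable is on the PATH and a `config.yaml` in the current directory. The `batch_size` field is taken from the description and may not exist in your configuration. The `+pipe_steps` override is quoted so the shell does not treat the square brackets as a glob pattern.

```shell
# Look for config.yaml in the current directory (the documented defaults, made explicit).
CONFIG_PATH="."
CONFIG_NAME="config"

# Run only the first two steps of the default pipeline and override one field.
set -- itwinai exec-pipeline --config-path "$CONFIG_PATH" --config-name "$CONFIG_NAME" \
    "+pipe_steps=[0,1]" batch_size=32
echo "$*"

# Run only when itwinai is installed and the configuration file exists.
if command -v itwinai >/dev/null 2>&1 && [ -f "$CONFIG_PATH/$CONFIG_NAME.yaml" ]; then
    "$@"
fi
```

To select a different pipeline, an override such as `+pipe_key=your_pipeline` would be appended in the same way.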

`mlflow-ui`

Visualize logs with Mlflow.

Usage:

$ itwinai mlflow-ui [OPTIONS]

Options:

--path TEXT: Path to logs storage. [default: mllogs/mlflow]
--port INTEGER: Port on which the MLFlow UI is listening. [default: 5000]
--host TEXT: Which host to use. Switch to ‘0.0.0.0’ to e.g. allow for port-forwarding. [default: 127.0.0.1]
--help: Show this message and exit.

`mlflow-server`

Spawn Mlflow server.

Usage:

$ itwinai mlflow-server [OPTIONS]

Options:

--path TEXT: Path to logs storage. [default: mllogs/mlflow]
--port INTEGER: Port on which the server is listening. [default: 5000]
--help: Show this message and exit.

`kill-mlflow-server`

Kill Mlflow server.

Usage:

$ itwinai kill-mlflow-server [OPTIONS]

Options:

--port INTEGER: Port on which the server is listening. [default: 5000]
--help: Show this message and exit.

`download-mlflow-data`

Download metrics data from MLFlow experiments and save to a CSV file.

Requires MLFlow authentication if the server is configured to use it. Authentication must be provided via the following environment variables: ‘MLFLOW_TRACKING_USERNAME’ and ‘MLFLOW_TRACKING_PASSWORD’.

Usage:

$ itwinai download-mlflow-data [OPTIONS]

Options:

--tracking-uri TEXT: The tracking URI of the MLFlow server. [default: https://mlflow.intertwin.fedcloud.eu/]
--experiment-id TEXT: The experiment ID that you wish to retrieve data from. [default: 48]
--output-file TEXT: The file path to save the data to. [default: mlflow_data.csv]
--help: Show this message and exit.
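A sketch of a download against the default server, assuming that server requires authentication and that the `itwinai` executable is on the PATH. The credentials are read from the two environment variables named above and are never written into the script.

```shell
# Default values from the options above, spelled out for clarity.
set -- itwinai download-mlflow-data \
    --tracking-uri https://mlflow.intertwin.fedcloud.eu/ \
    --experiment-id 48 \
    --output-file mlflow_data.csv
echo "$*"

# Refuse to run without credentials; otherwise run only when itwinai is installed.
if [ -z "${MLFLOW_TRACKING_USERNAME:-}" ] || [ -z "${MLFLOW_TRACKING_PASSWORD:-}" ]; then
    echo "Set MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD first." >&2
elif command -v itwinai >/dev/null 2>&1; then
    "$@"
fi
```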

`tensorboard-ui`

Visualize logs with TensorBoard.

Usage:

$ itwinai tensorboard-ui [OPTIONS]

Options:

--path TEXT: Path to logs storage. [default: mllogs/tensorboard]
--port INTEGER: Port on which the Tensorboard UI is listening. [default: 6006]
--host TEXT: Which host to use. Switch to ‘0.0.0.0’ to e.g. allow for port-forwarding. [default: 127.0.0.1]
--help: Show this message and exit.

`upload-model-to-hub`

Upload a model checkpoint to the AI Model Hub. Note that this command requires an internet connection to push to the model hub.

The model directory should contain:

model checkpoint file(s)
manifest.yaml with model id and metadata
(optional) metadata.json with additional information

Usage:

$ itwinai upload-model-to-hub [OPTIONS] MODEL_DIR

Arguments:

MODEL_DIR: Path to directory with model checkpoint, manifest.yaml and metadata. [required]

Options:

--hub-url TEXT: URL of model hub server. If not provided, use HYPHA_SERVER_URL or .env file.
--api-token TEXT: API token. If not provided, use HYPHA_TOKEN or .env file.
--env-file TEXT: Path to .env file containing MODEL_HUB_URL and MODEL_HUB_API_TOKEN.
--upload-script TEXT: Path to upload_model.py script. If not provided, downloads from GitHub.
--help: Show this message and exit.
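A sketch of an upload that first checks for the required manifest.yaml, assuming the `itwinai` executable is on the PATH. The directory name `my_model` is hypothetical, and the `.env` file is assumed to hold the hub URL and token.

```shell
# Hypothetical model directory; replace with your own.
MODEL_DIR="my_model"

# The manifest is required; metadata.json is optional and therefore not checked.
missing=0
if [ ! -f "$MODEL_DIR/manifest.yaml" ]; then
    echo "Missing $MODEL_DIR/manifest.yaml" >&2
    missing=1
fi

# Read the hub URL and API token from a .env file instead of passing them on the command line.
set -- itwinai upload-model-to-hub --env-file .env "$MODEL_DIR"
echo "$*"

# Run only when the manifest is present and itwinai is installed.
if [ "$missing" -eq 0 ] && command -v itwinai >/dev/null 2>&1; then
    "$@"
fi
```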
