CLI

Here you can find the itwinai CLI reference.

Usage:

$ itwinai [OPTIONS] COMMAND [ARGS]...

Options:

--install-completion: Install completion for the current shell.
--show-completion: Show completion for the current shell, to copy it or customize the installation.
--help: Show this message and exit.

Commands:

generate-flamegraph: Generates a flamegraph from the given…
generate-py-spy-report: Generates an aggregation of the raw py-spy…
generate-scalability-report: Generates scalability reports for epoch…
sanity-check: Run sanity checks on the installation of…
check-distributed-cluster: This command provides a suite of tests for…
generate-slurm: Generates a default SLURM script using…
exec-pipeline: Execute a pipeline from configuration file…
mlflow-ui: Visualize logs with MLflow.
mlflow-server: Spawn MLflow server.
kill-mlflow-server: Kill MLflow server.
download-mlflow-data: Download metrics data from MLFlow…
tensorboard-ui: Visualize logs with TensorBoard.

`generate-flamegraph`

Generates a flamegraph from the given profiling output.

Usage:

$ generate-flamegraph [OPTIONS]

Options:

--file TEXT: The location of the raw profiling data. [required]
--output-filename TEXT: The filename of the resulting flamegraph. [default: flamegraph.svg]
--help: Show this message and exit.

`generate-py-spy-report`

Generates an aggregation of the raw py-spy profiling data, showing which leaf functions collected the most samples.

Usage:

$ generate-py-spy-report [OPTIONS]

Options:

--file TEXT: The location of the raw profiling data. [required]
--num-rows TEXT: Number of rows to display. Pass ‘all’ to print the full table. [default: 10]
--aggregate-leaf-paths / --no-aggregate-leaf-paths: Whether to aggregate all unique leaf calls across different call stacks. [default: no-aggregate-leaf-paths]
--library-name TEXT: Which library name to find the lowest contact point of. [default: itwinai]
--help: Show this message and exit.
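
As a minimal sketch, the two profiling commands above can be run one after the other on the same raw py-spy output. The file name `profiling_data.txt` is a placeholder rather than a name produced by itwinai, and it is an assumption that both commands accept the same input file. The `command -v` guard only keeps the snippet harmless on machines where itwinai is not installed.

```shell
# Placeholder path to the raw py-spy profiling data (hypothetical name).
PROFILE_FILE="profiling_data.txt"
# Pass "all" instead of a number to print the full table.
NUM_ROWS="all"

if command -v itwinai >/dev/null 2>&1 && [ -f "$PROFILE_FILE" ]; then
    # Aggregate leaf calls across call stacks and show every row.
    itwinai generate-py-spy-report --file "$PROFILE_FILE" \
        --num-rows "$NUM_ROWS" --aggregate-leaf-paths
    # Render the profiling data as a flamegraph (default name: flamegraph.svg).
    itwinai generate-flamegraph --file "$PROFILE_FILE" \
        --output-filename flamegraph.svg
else
    echo "itwinai or $PROFILE_FILE not found; nothing to do" >&2
fi
```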

`generate-scalability-report`

Generates scalability reports for epoch time, GPU data, and communication data based on the MLflow logs.

This command processes the runs under the given experiment at a tracking URI. It generates plots and metrics for scalability analysis and saves them in the directory given by --plot-dir.

Usage:

$ generate-scalability-report [OPTIONS]

Options:

--tracking-uri TEXT: The tracking URI of the MLFlow server. [default: mllogs/mlflow]
--experiment-name TEXT: The name of the mlflow experiment to use for the GPU data report. [default: unnamed-experiment]
--plot-dir TEXT: Which directory to save the resulting plots in. [default: plots]
--run-names TEXT: Which run names to read, presented as comma-separated values e.g. ‘run0,run1’.
--plot-file-suffix TEXT: Which file suffix to use for the plots. Useful for switching between raster and vector images. [default: .png]
--include-communication / --no-include-communication: Include communication data in the scalability report. Disclaimer: Communication fractions are unreliable and vary significantly for different HPC systems. [default: no-include-communication]
--no-warnings / --no-no-warnings: Create plots without warnings. [default: no-no-warnings]
--help: Show this message and exit.
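
The options above might be combined as in the following sketch. The experiment name and run names are hypothetical, and the `.svg` suffix assumes the plotting backend can write vector images, as the description of --plot-file-suffix suggests.

```shell
# Hypothetical experiment and run names; replace with your own.
EXPERIMENT="my-experiment"
RUN_NAMES="run0,run1"   # comma-separated values, no spaces

if command -v itwinai >/dev/null 2>&1; then
    itwinai generate-scalability-report \
        --tracking-uri mllogs/mlflow \
        --experiment-name "$EXPERIMENT" \
        --run-names "$RUN_NAMES" \
        --plot-dir plots \
        --plot-file-suffix .svg
else
    echo "itwinai is not installed; skipping the report" >&2
fi
```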

`sanity-check`

Run sanity checks on the installation of itwinai and its dependencies by trying to import itwinai modules. By default, only itwinai core modules (neither torch, nor tensorflow) are tested.

Usage:

$ sanity-check [OPTIONS]

Options:

--torch / --no-torch: Check also itwinai.torch modules. [default: no-torch]
--tensorflow / --no-tensorflow: Check also itwinai.tensorflow modules. [default: no-tensorflow]
--all / --no-all: Check all modules. [default: no-all]
--optional-deps TEXT: List of optional dependencies.
--help: Show this message and exit.

`check-distributed-cluster`

This command provides a suite of tests for a quick sanity check of the network setup for torch distributed. It is useful when working with containers on HPC. Remember to prepend torchrun to this command, or to start a Ray cluster first.

Usage:

$ check-distributed-cluster [OPTIONS]

Options:

--platform TEXT: Hardware platform: nvidia or amd. [default: nvidia]
--launcher TEXT: Distributed ML cluster: torchrun or ray. [default: torchrun]
--help: Show this message and exit.
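
One way to prepend torchrun, sketched here for a single node, is shown below. This is an assumption about the invocation rather than the documented itwinai recipe: `--standalone`, `--nproc_per_node`, and `--no_python` are torchrun options, and `--no_python` is used so that torchrun executes the `itwinai` entry point directly. The process count is a placeholder.

```shell
# Number of worker processes to start on this node (hypothetical value).
NPROC=2

if command -v torchrun >/dev/null 2>&1 && command -v itwinai >/dev/null 2>&1; then
    # --no_python: run the itwinai executable as-is instead of "python itwinai".
    torchrun --standalone --nproc_per_node="$NPROC" --no_python \
        itwinai check-distributed-cluster --platform nvidia --launcher torchrun
else
    echo "torchrun and itwinai are both required for this check" >&2
fi
```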

`generate-slurm`

Generates a default SLURM script using arguments and optionally a configuration file.

Usage:

$ generate-slurm [OPTIONS]

Options:

--job-name TEXT: The name of the SLURM job.
--account TEXT: The billing account for the SLURM job. [default: intertwin]
--time TEXT: The time limit of the SLURM job. [default: 00:30:00]
--partition TEXT: Which partition of the cluster the SLURM job is going to run on. [default: develbooster]
--std-out TEXT: The standard out file.
--err-out TEXT: The error out file.
--num-nodes INTEGER: The number of nodes that the SLURM job is going to run on. [default: 1]
--gpus-per-node INTEGER: The requested number of GPUs per node. [default: 4]
--cpus-per-gpu INTEGER: The requested number of CPUs per SLURM task. [default: 4]
--save-script: Whether to save the script or not.
--submit-script: Whether to submit the script or not.
--save-dir TEXT: In which directory to save the script, if saving it.
--exec-file TEXT: The location of the file containing the execution command. Also accepts a remote url.
--pre-exec-file TEXT: The location of the file containing the pre-execution command. Also accepts a remote url.
--use-ray: Whether to enable Ray or not.
--memory TEXT: How much memory to allocate per node. [default: 16G]
--exclusive: Whether to make the SLURM job exclusive or not.
--run-name TEXT: Which run name to use. [default: 16G]
--exp-name TEXT: Which experiment name to use. [default: 16G]
--container-path TEXT: Path to container that should be exported.
--config-path TEXT: The path to the directory containing the config file to use for training. [default: .]
--config-name TEXT: The name of the config file to use for training. [default: config]
--pipe-key TEXT: Which pipe key to use for running the pipeline. [default: rnn_training_pipeline]
--mode TEXT: Which mode to run, e.g. scaling test, all strategies, or a single run. [default: single]
--dist-strat TEXT: Which distributed strategy to use. [default: ddp]
--training-cmd TEXT: The training command to use for the python script.
--python-venv TEXT: Which python venv to use for running the command. [default: .venv]
--scalability-nodes TEXT: A comma-separated list of node numbers to use for the scalability test. [default: 1,2,4,8]
--config TEXT: The path to the SLURM configuration file.
--py-spy: Whether to activate profiling with py-spy or not.
--profiling-sampling-rate INTEGER: The rate at which to profile with the py-spy profiler. [default: 10]
--help: Show this message and exit.
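
A possible invocation, using only options from the list above, is sketched below. The job name and save directory are placeholders. The account and partition shown are simply the documented defaults and will likely need to be replaced with values valid on your cluster. Because --submit-script is not passed, the script is presumably only written to disk.

```shell
# Hypothetical job name and output directory.
JOB_NAME="my-training-job"
SAVE_DIR="slurm_scripts"

if command -v itwinai >/dev/null 2>&1; then
    itwinai generate-slurm \
        --job-name "$JOB_NAME" \
        --account intertwin \
        --partition develbooster \
        --time 01:00:00 \
        --num-nodes 2 \
        --gpus-per-node 4 \
        --dist-strat ddp \
        --save-script \
        --save-dir "$SAVE_DIR"
else
    echo "itwinai is not installed; no script generated" >&2
fi
```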

`exec-pipeline`

Execute a pipeline from a configuration file using the Hydra CLI. Fields can be overridden dynamically by appending a list of overrides (e.g., batch_size=32). By default, the command expects a configuration file called “config.yaml” in the current working directory; to change this, set --config-name and --config-path. By default, the command executes the whole pipeline under the “training_pipeline” field in the configuration file. To execute a different pipeline, pass “+pipe_key=your_pipeline” in the list of overrides; to execute only a subset of the steps, pass “+pipe_steps=[0,1]”.

Usage:

$ exec-pipeline [OPTIONS] [OVERRIDES]...

Arguments:

[OVERRIDES]...: Any key=value arguments to override config values (use dots for.nested=overrides), using the Hydra syntax.

Options:

--hydra-help / --no-hydra-help: Show Hydra’s help page [default: no-hydra-help]
--version / --no-version: Show Hydra’s version and exit [default: no-version]
-c, --cfg TEXT: Show config instead of running
--resolve / --no-resolve: Used in conjunction with --cfg, resolve config interpolations before printing. [default: no-resolve]
-p, --package TEXT: Config package to show
-r, --run TEXT: Run a job
-m, --multirun TEXT: Run multiple jobs with the configured launcher and sweeper
-sc, --shell-completion TEXT: Install or Uninstall shell completion
-cp, --config-path TEXT: Overrides the config_path specified in hydra.main(). The config_path is absolute, or relative to the current working directory. Defaults to the current working directory.
-cn, --config-name TEXT: Overrides the config_name specified in hydra.main() [default: config]
-cd, --config-dir TEXT: Adds an additional config dir to the config search path
--experimental-rerun TEXT: Rerun a job from a previous config pickle
-i, --info TEXT: Print Hydra information
--help: Show this message and exit.
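
The override syntax described for this command can be sketched as follows. The pipeline key is the placeholder used in the description, and `batch_size=32` assumes that your configuration actually defines a top-level `batch_size` field. The step list is quoted because an unquoted `[0,1]` could be interpreted by the shell as a glob pattern.

```shell
# Placeholder pipeline key, as in the command description.
PIPE_KEY="your_pipeline"
# Single-quoted so the shell does not expand [0,1] as a glob pattern.
PIPE_STEPS='+pipe_steps=[0,1]'

if command -v itwinai >/dev/null 2>&1; then
    # Reads ./config.yaml and runs only steps 0 and 1 of the chosen pipeline.
    itwinai exec-pipeline --config-path . --config-name config \
        "+pipe_key=$PIPE_KEY" "$PIPE_STEPS" batch_size=32
else
    echo "itwinai is not installed; pipeline not executed" >&2
fi
```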

`mlflow-ui`

Visualize logs with MLflow.

Usage:

$ mlflow-ui [OPTIONS]

Options:

--path TEXT: Path to logs storage. [default: mllogs/mlflow]
--port INTEGER: Port on which the MLFlow UI is listening. [default: 5000]
--host TEXT: Which host to use. Set to ‘0.0.0.0’ to allow, for example, port forwarding. [default: 127.0.0.1]
--help: Show this message and exit.

`mlflow-server`

Spawn MLflow server.

Usage:

$ mlflow-server [OPTIONS]

Options:

--path TEXT: Path to logs storage. [default: mllogs/mlflow]
--port INTEGER: Port on which the server is listening. [default: 5000]
--help: Show this message and exit.

`kill-mlflow-server`

Kill MLflow server.

Usage:

$ kill-mlflow-server [OPTIONS]

Options:

--port INTEGER: Port on which the server is listening. [default: 5000]
--help: Show this message and exit.

`download-mlflow-data`

Download metrics data from MLFlow experiments and save to a CSV file.

Requires MLFlow authentication if the server is configured to use it. Authentication must be provided via the following environment variables: ‘MLFLOW_TRACKING_USERNAME’ and ‘MLFLOW_TRACKING_PASSWORD’.

Usage:

$ download-mlflow-data [OPTIONS]

Options:

--tracking-uri TEXT: The tracking URI of the MLFlow server. [default: https://mlflow.intertwin.fedcloud.eu/]
--experiment-id TEXT: The experiment ID that you wish to retrieve data from. [default: 48]
--output-file TEXT: The file path to save the data to. [default: mlflow_data.csv]
--help: Show this message and exit.
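
For a server that requires authentication, the two environment variables named above could be set before calling the command, as in this sketch. The credentials are placeholders, and the remaining values are the documented defaults. In practice it is safer to load the password from a secrets store than to hard-code it.

```shell
# Placeholder credentials; do not commit real values to version control.
export MLFLOW_TRACKING_USERNAME="your-username"
export MLFLOW_TRACKING_PASSWORD="your-password"

# Output path for the downloaded metrics (the documented default).
OUTPUT_FILE="mlflow_data.csv"

if command -v itwinai >/dev/null 2>&1; then
    itwinai download-mlflow-data \
        --tracking-uri https://mlflow.intertwin.fedcloud.eu/ \
        --experiment-id 48 \
        --output-file "$OUTPUT_FILE"
else
    echo "itwinai is not installed; nothing downloaded" >&2
fi
```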

`tensorboard-ui`

Visualize logs with TensorBoard.

Usage:

$ tensorboard-ui [OPTIONS]

Options:

--path TEXT: Path to logs storage. [default: mllogs/tensorboard]
--port INTEGER: Port on which the TensorBoard UI is listening. [default: 6006]
--host TEXT: Which host to use. Set to ‘0.0.0.0’ to allow, for example, port forwarding. [default: 127.0.0.1]
--help: Show this message and exit.
