CLI
Here you can find the itwinai CLI reference.
Usage:
$ itwinai [OPTIONS] COMMAND [ARGS]...
Options:
--install-completion: Install completion for the current shell.
--show-completion: Show completion for the current shell, to copy it or customize the installation.
--help: Show this message and exit.
Commands:
generate-flamegraph: Generates a flamegraph from the given…
generate-py-spy-report: Generates an aggregation of the raw py-spy…
generate-scalability-report: Generates scalability reports for epoch…
sanity-check: Run sanity checks on the installation of…
check-distributed-cluster: This command provides a suite of tests for…
generate-slurm: Generates a default SLURM script using…
exec-pipeline: Execute a pipeline from configuration file…
mlflow-ui: Visualize logs with Mlflow.
mlflow-server: Spawn Mlflow server.
kill-mlflow-server: Kill Mlflow server.
download-mlflow-data: Download metrics data from MLFlow…
tensorboard-ui: Visualize logs with TensorBoard.
generate-flamegraph
Generates a flamegraph from the given profiling output.
Usage:
$ generate-flamegraph [OPTIONS]
Options:
--file TEXT: The location of the raw profiling data. [required]
--output-filename TEXT: The filename of the resulting flamegraph. [default: flamegraph.svg]
--help: Show this message and exit.
generate-py-spy-report
Generates an aggregation of the raw py-spy profiling data, showing which leaf functions collected the most samples.
Usage:
$ generate-py-spy-report [OPTIONS]
Options:
--file TEXT: The location of the raw profiling data. [required]
--num-rows TEXT: Number of rows to display. Pass ‘all’ to print the full table. [default: 10]
--aggregate-leaf-paths / --no-aggregate-leaf-paths: Whether to aggregate all unique leaf calls across different call stacks. [default: no-aggregate-leaf-paths]
--library-name TEXT: Which library name to find the lowest contact point of. [default: itwinai]
--help: Show this message and exit.
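The options above can be assembled programmatically, for example from a Python script that post-processes profiling output. The following is a minimal sketch, not part of itwinai: it assumes the CLI is installed as an executable named itwinai, and the helper name py_spy_report_cmd and the file name profile.txt are hypothetical.

```python
import shlex


def py_spy_report_cmd(profile_file, num_rows=10, aggregate_leaf_paths=False,
                      library_name="itwinai"):
    """Build the argument list for `generate-py-spy-report` (hypothetical helper)."""
    cmd = [
        "itwinai",  # assumed name of the installed executable
        "generate-py-spy-report",
        "--file", str(profile_file),
        "--num-rows", str(num_rows),  # an integer or the literal "all"
        "--library-name", library_name,
    ]
    # Boolean options come in --flag / --no-flag pairs.
    cmd.append("--aggregate-leaf-paths" if aggregate_leaf_paths
               else "--no-aggregate-leaf-paths")
    return cmd


# Print the command rather than running it; "profile.txt" is a placeholder.
print(shlex.join(py_spy_report_cmd("profile.txt", num_rows="all",
                                   aggregate_leaf_paths=True)))
```

To execute the command, the returned list can be passed to subprocess.run(cmd, check=True).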
generate-scalability-report
Generates scalability reports for epoch time, GPU data, and communication data based on the mlflow logs.
This command processes runs under the given experiment at a tracking uri.
It generates plots and metrics for scalability analysis and saves them in the plot_dir.
Usage:
$ generate-scalability-report [OPTIONS]
Options:
--tracking-uri TEXT: The tracking URI of the MLFlow server. [default: mllogs/mlflow]
--experiment-name TEXT: The name of the mlflow experiment to use for the GPU data report. [default: unnamed-experiment]
--plot-dir TEXT: Which directory to save the resulting plots in. [default: plots]
--run-names TEXT: Which run names to read, presented as comma-separated values e.g. ‘run0,run1’.
--plot-file-suffix TEXT: Which file suffix to use for the plots. Useful for changing between raster and vector based images. [default: .png]
--include-communication / --no-include-communication: Include communication data in the scalability report. Disclaimer: Communication fractions are unreliable and vary significantly for different HPC systems. [default: no-include-communication]
--no-warnings / --no-no-warnings: Create plots without warnings. [default: no-no-warnings]
--help: Show this message and exit.
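Because --run-names expects a single comma-separated string, it can be convenient to build it from a list. The sketch below is illustrative only: it assumes an executable named itwinai, the helper scalability_report_cmd is hypothetical, and whether a vector suffix such as .svg is accepted depends on the plotting backend, which this reference does not specify.

```python
import shlex


def scalability_report_cmd(experiment_name, run_names=(), plot_dir="plots",
                           plot_file_suffix=".png",
                           tracking_uri="mllogs/mlflow"):
    """Build the argument list for `generate-scalability-report` (hypothetical helper)."""
    cmd = [
        "itwinai",  # assumed name of the installed executable
        "generate-scalability-report",
        "--tracking-uri", tracking_uri,
        "--experiment-name", experiment_name,
        "--plot-dir", plot_dir,
        "--plot-file-suffix", plot_file_suffix,
    ]
    if run_names:
        # The option takes one comma-separated value, e.g. "run0,run1".
        cmd += ["--run-names", ",".join(run_names)]
    return cmd


# ".svg" is an assumed example of a vector format suffix.
print(shlex.join(scalability_report_cmd("unnamed-experiment",
                                        run_names=["run0", "run1"],
                                        plot_file_suffix=".svg")))
```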
sanity-check
Run sanity checks on the installation of itwinai and its dependencies by trying to import itwinai modules. By default, only itwinai core modules (neither torch, nor tensorflow) are tested.
Usage:
$ sanity-check [OPTIONS]
Options:
--torch / --no-torch: Check also itwinai.torch modules. [default: no-torch]
--tensorflow / --no-tensorflow: Check also itwinai.tensorflow modules. [default: no-tensorflow]
--all / --no-all: Check all modules. [default: no-all]
--optional-deps TEXT: List of optional dependencies.
--help: Show this message and exit.
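The idea of checking an installation by attempting imports can be sketched in a few lines. This is not the itwinai implementation, only an illustration of the technique; the function name check_imports is hypothetical, and standard-library module names stand in for itwinai modules so that the sketch runs anywhere.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; return {name: error message} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            failures[name] = str(exc)
    return failures


# "json" exists in the standard library; the second name is deliberately bogus.
failed = check_imports(["json", "a_module_that_does_not_exist"])
print(sorted(failed))
```

An empty dictionary means every listed module could be imported.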
check-distributed-cluster
This command provides a suite of tests for a quick sanity check of the network setup for torch distributed. Useful when working with containers on HPC. Remember to prefix this command with torchrun or to start a Ray cluster beforehand.
Usage:
$ check-distributed-cluster [OPTIONS]
Options:
--platform TEXT: Hardware platform: nvidia or amd [default: nvidia]
--launcher TEXT: Distributed ML cluster: torchrun or ray [default: torchrun]
--help: Show this message and exit.
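As a rough illustration of launching this check under torchrun, the sketch below composes a single-node command line. It rests on several assumptions that this reference does not confirm: that the executable is named itwinai, that it can be launched through torchrun's --no-python option, and that four processes per node is appropriate. The helper cluster_check_cmd is hypothetical.

```python
import shlex

PLATFORMS = {"nvidia", "amd"}     # values listed for --platform
LAUNCHERS = {"torchrun", "ray"}   # values listed for --launcher


def cluster_check_cmd(platform="nvidia", launcher="torchrun", procs_per_node=4):
    """Build a command line for `check-distributed-cluster` (hypothetical helper)."""
    if platform not in PLATFORMS:
        raise ValueError(f"unknown platform: {platform!r}")
    if launcher not in LAUNCHERS:
        raise ValueError(f"unknown launcher: {launcher!r}")
    check = [
        "itwinai",  # assumed name of the installed executable
        "check-distributed-cluster",
        "--platform", platform,
        "--launcher", launcher,
    ]
    if launcher == "torchrun":
        # Assumed single-node launch; --no-python runs the executable directly.
        prefix = ["torchrun", "--standalone",
                  f"--nproc-per-node={procs_per_node}", "--no-python"]
        return prefix + check
    # With Ray, a cluster is expected to be running already.
    return check


print(shlex.join(cluster_check_cmd()))
print(shlex.join(cluster_check_cmd(platform="amd", launcher="ray")))
```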
generate-slurm
Generates a default SLURM script using arguments and optionally a configuration file.
Usage:
$ generate-slurm [OPTIONS]
Options:
--job-name TEXT: The name of the SLURM job.
--account TEXT: The billing account for the SLURM job. [default: intertwin]
--time TEXT: The time limit of the SLURM job. [default: 00:30:00]
--partition TEXT: Which partition of the cluster the SLURM job is going to run on. [default: develbooster]
--std-out TEXT: The standard out file.
--err-out TEXT: The error out file.
--num-nodes INTEGER: The number of nodes that the SLURM job is going to run on. [default: 1]
--gpus-per-node INTEGER: The requested number of GPUs per node. [default: 4]
--cpus-per-gpu INTEGER: The requested number of CPUs per SLURM task. [default: 4]
--save-script: Whether to save the script or not.
--submit-script: Whether to submit the script or not.
--save-dir TEXT: In which directory to save the script, if saving it.
--exec-file TEXT: The location of the file containing the execution command. Also accepts a remote url.
--pre-exec-file TEXT: The location of the file containing the pre-execution command. Also accepts a remote url.
--use-ray: Whether to enable Ray or not.
--memory TEXT: How much memory to allocate per node. [default: 16G]
--exclusive: Whether to make the SLURM job exclusive or not.
--run-name TEXT: Which run name to use. [default: 16G]
--exp-name TEXT: Which experiment name to use. [default: 16G]
--container-path TEXT: Path to container that should be exported.
--config-path TEXT: The path to the directory containing the config file to use for training. [default: .]
--config-name TEXT: The name of the config file to use for training. [default: config]
--pipe-key TEXT: Which pipe key to use for running the pipeline. [default: rnn_training_pipeline]
--mode TEXT: Which mode to run, e.g. scaling test, all strategies, or a single run. [default: single]
--dist-strat TEXT: Which distributed strategy to use. [default: ddp]
--training-cmd TEXT: The training command to use for the python script.
--python-venv TEXT: Which python venv to use for running the command. [default: .venv]
--scalability-nodes TEXT: A comma-separated list of node numbers to use for the scalability test. [default: 1,2,4,8]
--config TEXT: The path to the SLURM configuration file.
--py-spy: Whether to activate profiling with py-spy or not.
--profiling-sampling-rate INTEGER: The rate at which to profile with the py-spy profiler. [default: 10]
--help: Show this message and exit.
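To make the resource options concrete, the following sketch builds a generate-slurm command for a single run and works out how many GPUs each entry of a --scalability-nodes list would request at a fixed number of GPUs per node. It is an illustration under assumptions: the executable name itwinai, the helper names, the job name my-job, and the directory slurm_scripts are all hypothetical, and the arithmetic shows only nodes times GPUs per node, not how itwinai schedules the scalability test internally.

```python
import shlex


def parse_node_counts(spec):
    """Turn a comma-separated string such as "1,2,4,8" into a list of ints."""
    return [int(part) for part in spec.split(",") if part.strip()]


def generate_slurm_cmd(job_name, num_nodes=1, gpus_per_node=4,
                       time_limit="00:30:00", save_dir=None, submit=False):
    """Build the argument list for `generate-slurm` (hypothetical helper)."""
    cmd = [
        "itwinai",  # assumed name of the installed executable
        "generate-slurm",
        "--job-name", job_name,
        "--time", time_limit,
        "--num-nodes", str(num_nodes),
        "--gpus-per-node", str(gpus_per_node),
    ]
    if save_dir is not None:
        # --save-script is a plain flag; --save-dir says where to write the file.
        cmd += ["--save-script", "--save-dir", save_dir]
    if submit:
        cmd.append("--submit-script")
    return cmd


print(shlex.join(generate_slurm_cmd("my-job", num_nodes=2,
                                    save_dir="slurm_scripts")))

# Total GPUs requested per node count, at the default 4 GPUs per node.
print([nodes * 4 for nodes in parse_node_counts("1,2,4,8")])
```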
exec-pipeline
Execute a pipeline from a configuration file using the Hydra CLI. Fields can be overridden dynamically by appending a list of overrides (e.g., batch_size=32).
By default, the command expects a configuration file called “config.yaml” in the current working directory. To override this default, set --config-name and --config-path.
By default, the command executes the whole pipeline under the “training_pipeline” field in the configuration file. To execute a different pipeline, pass “+pipe_key=your_pipeline” in the list of overrides; to execute only a subset of the steps, pass “+pipe_steps=[0,1]”.
Usage:
$ exec-pipeline [OPTIONS] [OVERRIDES]...
Arguments:
[OVERRIDES]...: Any key=value arguments to override config values (use dots for.nested=overrides), using the Hydra syntax.
Options:
--hydra-help / --no-hydra-help: Show Hydra’s help page [default: no-hydra-help]
--version / --no-version: Show Hydra’s version and exit [default: no-version]
-c, --cfg TEXT: Show config instead of running
--resolve / --no-resolve: Used in conjunction with --cfg, resolve config interpolations before printing. [default: no-resolve]
-p, --package TEXT: Config package to show
-r, --run TEXT: Run a job
-m, --multirun TEXT: Run multiple jobs with the configured launcher and sweeper
-sc, --shell-completion TEXT: Install or Uninstall shell completion
-cp, --config-path TEXT: Overrides the config_path specified in hydra.main(). The config_path is absolute, or relative to the current working directory. Defaults to the current working directory.
-cn, --config-name TEXT: Overrides the config_name specified in hydra.main() [default: config]
-cd, --config-dir TEXT: Adds an additional config dir to the config search path
--experimental-rerun TEXT: Rerun a job from a previous config pickle
-i, --info TEXT: Print Hydra information
--help: Show this message and exit.
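The override syntax described for exec-pipeline (key=value pairs, plus +pipe_key and +pipe_steps) can be generated from Python values. The sketch below is a minimal illustration that assumes an executable named itwinai; the helper exec_pipeline_cmd is hypothetical, and your_pipeline is the placeholder name used in the description of this command.

```python
import shlex


def exec_pipeline_cmd(overrides=None, pipe_key=None, pipe_steps=None,
                      config_path=None, config_name=None):
    """Build the argument list for `exec-pipeline` (hypothetical helper)."""
    cmd = ["itwinai", "exec-pipeline"]  # assumed executable name
    if config_path is not None:
        cmd += ["--config-path", config_path]
    if config_name is not None:
        cmd += ["--config-name", config_name]
    if pipe_key is not None:
        # Select a pipeline other than the default "training_pipeline".
        cmd.append(f"+pipe_key={pipe_key}")
    if pipe_steps is not None:
        # Run only a subset of steps, written as a Hydra list such as [0,1].
        steps = ",".join(str(step) for step in pipe_steps)
        cmd.append(f"+pipe_steps=[{steps}]")
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


cmd = exec_pipeline_cmd({"batch_size": 32}, pipe_key="your_pipeline",
                        pipe_steps=[0, 1])
# shlex.join quotes the bracketed override so a shell does not try to expand it.
print(shlex.join(cmd))
```

Building the arguments as a list and passing them to subprocess.run avoids shell quoting issues altogether, which matters for overrides that contain square brackets.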
mlflow-ui
Visualize logs with Mlflow.
Usage:
$ mlflow-ui [OPTIONS]
Options:
--path TEXT: Path to logs storage. [default: mllogs/mlflow]
--port INTEGER: Port on which the MLFlow UI is listening. [default: 5000]
--host TEXT: Which host to use. Switch to ‘0.0.0.0’ to e.g. allow for port-forwarding. [default: 127.0.0.1]
--help: Show this message and exit.
mlflow-server
Spawn Mlflow server.
Usage:
$ mlflow-server [OPTIONS]
Options:
--path TEXT: Path to logs storage. [default: mllogs/mlflow]
--port INTEGER: Port on which the server is listening. [default: 5000]
--help: Show this message and exit.
kill-mlflow-server
Kill Mlflow server.
Usage:
$ kill-mlflow-server [OPTIONS]
Options:
--port INTEGER: Port on which the server is listening. [default: 5000]
--help: Show this message and exit.
download-mlflow-data
Download metrics data from MLFlow experiments and save to a CSV file.
Requires MLFlow authentication if the server is configured to use it. Authentication must be provided via the following environment variables: ‘MLFLOW_TRACKING_USERNAME’ and ‘MLFLOW_TRACKING_PASSWORD’.
Usage:
$ download-mlflow-data [OPTIONS]
Options:
--tracking-uri TEXT: The tracking URI of the MLFlow server. [default: https://mlflow.intertwin.fedcloud.eu/]
--experiment-id TEXT: The experiment ID that you wish to retrieve data from. [default: 48]
--output-file TEXT: The file path to save the data to. [default: mlflow_data.csv]
--help: Show this message and exit.
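Since authentication is supplied through environment variables, a script that wraps this command needs to set them for the child process. The following sketch shows one way to do that without modifying the parent environment. It assumes an executable named itwinai; the helper download_cmd_and_env and the credential values are hypothetical placeholders.

```python
import os


def download_cmd_and_env(username, password, experiment_id="48",
                         output_file="mlflow_data.csv",
                         tracking_uri="https://mlflow.intertwin.fedcloud.eu/"):
    """Return (argument list, environment) for `download-mlflow-data`."""
    # Copy the current environment so the parent process stays untouched.
    env = dict(os.environ)
    env["MLFLOW_TRACKING_USERNAME"] = username
    env["MLFLOW_TRACKING_PASSWORD"] = password
    cmd = [
        "itwinai",  # assumed name of the installed executable
        "download-mlflow-data",
        "--tracking-uri", tracking_uri,
        "--experiment-id", str(experiment_id),
        "--output-file", output_file,
    ]
    return cmd, env


# Placeholder credentials; in practice read them from a secret store.
cmd, env = download_cmd_and_env("my-user", "my-password")
print(cmd)
```

The pair can then be passed on as subprocess.run(cmd, env=env, check=True).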
tensorboard-ui
Visualize logs with TensorBoard.
Usage:
$ tensorboard-ui [OPTIONS]
Options:
--path TEXT: Path to logs storage. [default: mllogs/tensorboard]
--port INTEGER: Port on which the Tensorboard UI is listening. [default: 6006]
--host TEXT: Which host to use. Switch to ‘0.0.0.0’ to e.g. allow for port-forwarding. [default: 127.0.0.1]
--help: Show this message and exit.