3. Using the itwinai TorchTrainer Class

The code used in this tutorial is adapted from this example.

The itwinai TorchTrainer class works as a wrapper that manages most aspects of training. It facilitates distributed machine learning and allows for extensive customization by subclassing and overriding the desired methods.
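
As a rough illustration of the subclass-and-override pattern described above, the sketch below uses stand-in classes rather than the real itwinai API. The names `BaseTrainer`, `ScaledTrainer`, `train_step` and `run` are hypothetical and chosen only for this example; the methods that `TorchTrainer` actually exposes for overriding should be checked in the itwinai documentation.

```python
# Stand-in classes only: this does not import or reproduce itwinai.
class BaseTrainer:
    """Hypothetical base class that owns the training loop."""

    def __init__(self, epochs: int) -> None:
        self.epochs = epochs

    def train_step(self, batch: list[float]) -> float:
        # Default behaviour: use the batch mean as a fake "loss".
        return sum(batch) / len(batch)

    def run(self, batches: list[list[float]]) -> list[float]:
        # The loop stays in the base class; subclasses only replace the hooks.
        history = []
        for _ in range(self.epochs):
            for batch in batches:
                history.append(self.train_step(batch))
        return history


class ScaledTrainer(BaseTrainer):
    """Subclass that overrides a single hook and inherits everything else."""

    def train_step(self, batch: list[float]) -> float:
        # Reuse the parent logic, then customise the result.
        return 0.5 * super().train_step(batch)


history = ScaledTrainer(epochs=2).run([[1.0, 3.0], [2.0, 6.0]])
print(history)
```

The point of the design is that the loop, logging and distribution logic live in one place, while a subclass changes only the step it cares about.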

You can find all the associated code in the GitHub repository.

3.1. Setting Up the Training Script

The following outline shows how you can set up the training script:

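from itwinai.torch.trainer import TorchTrainer

# Create dataset as usual
train_dataset = ...

# Create model as usual
model = ...

# Wrap the model in a trainer and select the distributed strategy
trainer = TorchTrainer(config={}, model=model, strategy="ddp")

# The last element of the returned tuple is the trained model
_, _, _, trained_model = trainer.execute(train_dataset, ...)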

3.2. Launching Distributed Training

To launch training across multiple workers, i.e. multiple GPUs that may be spread over several nodes, you can use torchrun, which starts the worker processes and lets them communicate with each other. On a system that uses SLURM, you can combine srun and torchrun to start the processes on different nodes as well. Here is an example of how you could do this, assuming your code is in train.py:

srun --cpu-bind=none --ntasks-per-node=1 \
    bash -c "torchrun \
    --nnodes=2 \
    --nproc_per_node=4 \
    --rdzv_id=151152 \
    --rdzv_conf=is_host=\$(((SLURM_NODEID)) && echo 0 || echo 1) \
    --rdzv_backend=c10d \
    --rdzv_endpoint='$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)'i:29500 \
    train.py"
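
torchrun tells each worker its position in the job through environment variables such as `RANK`, `LOCAL_RANK` and `WORLD_SIZE`. The sketch below is not part of the original example: it reads these variables with single-process fallbacks, so the same code also works when the script is started with plain `python`. The helper name `read_worker_info` is hypothetical. With the launch command above (2 nodes with 4 processes each), `WORLD_SIZE` would be 8.

```python
import os


def read_worker_info(env=None) -> dict:
    """Return rank information exported by torchrun, or single-process defaults."""
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("RANK", 0)),              # global index of this process
        "local_rank": int(env.get("LOCAL_RANK", 0)),  # index of the process within its node
        "world_size": int(env.get("WORLD_SIZE", 1)),  # total number of processes
    }


# Simulate the environment of a worker with global rank 7 when launching
# with --nnodes=2 and --nproc_per_node=4.
nnodes, nproc_per_node = 2, 4
simulated = {
    "RANK": "7",
    "LOCAL_RANK": "3",
    "WORLD_SIZE": str(nnodes * nproc_per_node),
}
info = read_worker_info(simulated)
print(info)
```

Calling `read_worker_info()` without arguments reads the real environment of the current process instead of the simulated one.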

3.3. Complete TorchTrainer Example

Below is a complete example of how to use the TorchTrainer to train a model on the MNIST dataset, which is also available on GitHub here. It can be run locally using

python train.py

or in a distributed manner as explained in the section above. To analyze the resulting MLflow logs, you can use the following command:

itwinai mlflow-ui --path mllogs/mlflow

Note

You might have to change the port or the host, depending on which system you are on. If you are running this on a server and wish to forward the port to your local machine, set the host to 0.0.0.0 using --host. For details, search online for how to forward ports with SSH.

Here you can see the contents of train.py:

# --------------------------------------------------------------------------------------
# Part of the interTwin Project: https://www.intertwin.eu/
#
# Created by: Matteo Bunino
#
# Credit:
# - Matteo Bunino <matteo.bunino@cern.ch> - CERN
# - Jarl Sondre Sæther <jarl.sondre.saether@cern.ch> - CERN
# --------------------------------------------------------------------------------------

"""Adapted from: https://github.com/pytorch/examples/blob/main/mnist/main.py"""

import argparse

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchmetrics
from torchvision import datasets, transforms

from itwinai.loggers import MLFlowLogger
from itwinai.torch.config import TrainingConfiguration
from itwinai.torch.trainer import TorchTrainer


# Step 1: set up your neural network architecture
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        return x


def main():
    # Step 2 (optional): Parse your arguments from the command line
    parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
    parser.add_argument(
        "--batch-size",
        type=int,
        default=64,
        help="input batch size for training (default: 64)",
    )
    parser.add_argument(
        "--epochs", type=int, default=14, help="number of epochs to train (default: 14)"
    )
    parser.add_argument(
        "--strategy", type=str, default="ddp", help="distributed strategy (default=ddp)"
    )
    parser.add_argument(
        "--lr", type=float, default=1.0, help="learning rate (default: 1.0)"
    )
    parser.add_argument("--seed", type=int, default=1, help="random seed (default: 1)")
    parser.add_argument(
        "--ckpt-interval",
        type=int,
        default=10,
        help="interval between model checkpoints (default: 10)",
    )
    args = parser.parse_args()

    # Step 3: Create your datasets
    transform = transforms.Compose(
        [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]
    )
    train_dataset = datasets.MNIST(
        "../data", train=True, download=True, transform=transform
    )
    validation_dataset = datasets.MNIST("../data", train=False, transform=transform)

    # Step 4: Configure your model and your training configuration
    model = Net()

    training_config = TrainingConfiguration(
        batch_size=args.batch_size,
        optim_lr=args.lr,
        optimizer="adadelta",
        loss="cross_entropy",
    )

    # Step 5 (optional): Configure a logger and some metrics
    logger = MLFlowLogger(experiment_name="mnist-tutorial", log_freq=10)

    metrics = {
        "accuracy": torchmetrics.Accuracy(task="multiclass", num_classes=10),
        "precision": torchmetrics.Precision(task="multiclass", num_classes=10),
    }

    # Step 6: Create your Trainer
    trainer = TorchTrainer(
        config=training_config,
        model=model,
        metrics=metrics,
        logger=logger,
        strategy=args.strategy,
        epochs=args.epochs,
        random_seed=args.seed,
        checkpoint_every=args.ckpt_interval,
    )

    # Step 7: Launch your training
    _, _, _, trained_model = trainer.execute(train_dataset, validation_dataset, None)


if __name__ == "__main__":
    main()
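
The input size of 9216 for `fc1` in the script above follows from the layer shapes: each 3x3 convolution with stride 1 and no padding shrinks a 28x28 MNIST image by 2 pixels per side, and the 2x2 max pool halves the result. The following sketch checks that arithmetic in plain Python. The helper `conv_out` is introduced here for illustration only and is not part of train.py.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of one spatial dimension after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


side = 28                                   # MNIST images are 28x28
side = conv_out(side, kernel=3)             # conv1: 28 -> 26
side = conv_out(side, kernel=3)             # conv2: 26 -> 24
side = conv_out(side, kernel=2, stride=2)   # max_pool2d(x, 2): 24 -> 12

channels = 64                               # output channels of conv2
flattened = channels * side * side          # features per sample after torch.flatten(x, 1)
print(flattened)  # 9216
```

If you change the kernel sizes, the pooling or the input resolution, the first argument of `nn.Linear` for `fc1` has to be recomputed in the same way.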