Training a neural network

Author(s): Matteo Bunino (CERN)

itwinai aims to simplify the way you train deep learning models and to help you scale training to HPC resources, while integrating popular logging frameworks such as MLflow, Weights & Biases, and TensorBoard.

itwinai TorchTrainer

Below, you can find some tutorials that will help you get familiar with the itwinai TorchTrainer:

Explanation of Distributed Data Parallelism

Author(s): Killian Verder (CERN), Matteo Bunino (CERN)

Deep neural networks (DNNs) are often extremely large and are trained on massive amounts of data, more than most computers can hold in memory. Even smaller DNNs can take days to train. Distributed Data Parallel (DDP) addresses these two issues, long training times and limited memory, by using multiple machines, each of which holds a copy of the model and trains it on a different portion of the data.

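Data parallelism is a comparatively easy way for a developer to reduce training times. Rather than being limited to single-node parallelism, DDP scales to multiple machines: each worker processes a different slice of every batch, and the resulting gradients are synchronised across workers so that all model copies stay identical. The more workers take part, the more samples are processed in parallel and the shorter each pass over the dataset becomes.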
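The idea can be illustrated without any distributed backend. The following is a minimal sketch in pure Python, not the itwinai or PyTorch implementation: it assumes a toy linear model, a hypothetical `world_size` of 4 workers, and equal-sized shards, and it uses a plain average (`all_reduce_mean`, a name chosen here for illustration) as a stand-in for the gradient synchronisation step. Under those assumptions, the averaged per-worker gradients match the gradient computed on the full batch.

```python
import random


def gradient(w, b, xs, ys):
    """Gradient of the mean squared error of y = w * x + b over one batch."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


def all_reduce_mean(grads):
    """Average per-worker gradients; stands in for an all-reduce operation."""
    k = len(grads)
    return tuple(sum(g[i] for g in grads) / k for i in range(len(grads[0])))


random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(8)]
ys = [3.0 * x + 0.5 for x in xs]

w, b = 0.0, 0.0   # every worker starts from the same model copy
world_size = 4    # hypothetical number of workers

# Each worker sees only its own equal-sized shard of the batch.
shards = [(xs[rank::world_size], ys[rank::world_size]) for rank in range(world_size)]

# Workers compute local gradients independently, then synchronise them.
local_grads = [gradient(w, b, shard_x, shard_y) for shard_x, shard_y in shards]
ddp_grad = all_reduce_mean(local_grads)

# Reference: gradient computed by a single worker on the whole batch.
full_grad = gradient(w, b, xs, ys)
```

Because every worker applies the same averaged gradient to the same starting weights, the model copies remain identical after each step, which is what lets the work be split without changing the result.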

Another benefit of DDP is that it relaxes single-machine memory constraints. Since the dataset can be split across several machines, and model states can additionally be partitioned with extensions such as ZeRO, it becomes possible to work with much larger datasets or models.
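How a dataset might be divided among workers can be sketched as follows. This is an illustrative pure-Python example under stated assumptions, not the itwinai API: the helper `shard_indices` is hypothetical, and its strided assignment with wrap-around padding is only similar in spirit to what distributed samplers typically do. Each worker needs to load only the samples whose indices it receives.

```python
def shard_indices(dataset_size, world_size, rank):
    """Return the sample indices assigned to one worker (hypothetical helper).

    Indices are padded by wrapping around to the start so that every
    worker receives the same number of samples.
    """
    per_rank = -(-dataset_size // world_size)  # ceiling division
    total = per_rank * world_size
    padded = [i % dataset_size for i in range(total)]
    # Strided assignment: rank r takes positions r, r + world_size, ...
    return padded[rank:total:world_size]


dataset_size = 10  # assumed toy dataset
world_size = 4     # hypothetical number of workers

assignments = {
    rank: shard_indices(dataset_size, world_size, rank)
    for rank in range(world_size)
}

for rank, indices in assignments.items():
    print(f"rank {rank}: {indices}")
```

With these values each worker holds 3 of the 10 samples, and two samples are repeated as padding so that all workers perform the same number of steps.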

Below is a list of resources expanding on theoretical aspects and practical implementations of DDP:

Data parallelism with DeepSpeed’s Zero Redundancy Optimizer (ZeRO):

https://sumanthrh.com/post/distributed-and-efficient-finetuning/#zero-powered-data-parallelism

Investigation of expected performance improvement:

https://www.mdpi.com/2079-9292/11/10/1525