Glossary

Author: Linus Eickhoff (CERN)

This page summarizes key terms used in distributed training and high-performance computing (HPC), providing a quick reference for terminology relevant to the itwinai project, its documentation, and its codebase.

HPC Terms

accelerator

Specialized compute device (e.g., GPU, TPU, FPGA) designed to speed up specific workloads.
collectives / collective operations / collective communications

Communication primitives that involve all ranks in a communicator to move or reduce data in a single operation.
- all-gather
  
  Each rank contributes its local tensor and receives the concatenation of the tensors from all ranks.
- all-reduce
  
  All ranks combine their tensors with an element-wise reduction (e.g., sum) and obtain the identical reduced result.
- broadcast
  
  A single source rank transmits a tensor that every other rank receives unchanged.
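The three collectives above can be illustrated with a minimal single-process simulation. This is only a sketch: the ranks are modeled as entries of a Python list, and the function names (`all_gather`, `all_reduce_sum`, `broadcast`) are hypothetical helpers written for this glossary, not the API of MPI, NCCL, or any other library.

```python
from typing import List

Tensor = List[float]


def all_gather(per_rank: List[Tensor]) -> List[Tensor]:
    """Every rank ends up with the concatenation of all ranks' tensors."""
    gathered = [x for tensor in per_rank for x in tensor]
    return [list(gathered) for _ in per_rank]


def all_reduce_sum(per_rank: List[Tensor]) -> List[Tensor]:
    """Every rank ends up with the element-wise sum over all ranks."""
    reduced = [sum(values) for values in zip(*per_rank)]
    return [list(reduced) for _ in per_rank]


def broadcast(per_rank: List[Tensor], src: int) -> List[Tensor]:
    """Every rank ends up with an unchanged copy of the source rank's tensor."""
    return [list(per_rank[src]) for _ in per_rank]


# Three simulated ranks, each holding a two-element local tensor.
local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

gathered = all_gather(local)      # each rank holds all six elements
reduced = all_reduce_sum(local)   # each rank holds [9.0, 12.0]
sent = broadcast(local, src=0)    # each rank holds [1.0, 2.0]
```

In a real distributed run each rank is a separate process and the data moves over the interconnect; the simulation only shows what each rank holds once the operation completes.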
directive

Compiler instruction embedded in source code that modifies compilation or execution behavior, such as enabling parallelism or optimization, without altering program logic.
job

Submitted unit of work that the scheduler runs with allocated resources.
InfiniBand

High-bandwidth, low-latency network interconnect widely used in HPC clusters.
node

Physical server in a cluster containing CPUs, memory, and often accelerators.
NUMA

Non-Uniform Memory Access: architecture in which memory access latency depends on whether the memory is attached to the socket of the accessing core or to a remote socket.
NVLink

Point-to-point GPU interconnect from NVIDIA that offers higher bandwidth than PCIe.
NVSwitch

NVIDIA switch chip that provides all-to-all NVLink connectivity among multiple GPUs in a node.
PCIe

Serial expansion bus standard that connects CPUs, GPUs, Network Interface Controllers (NICs), and storage devices.
rank

Index assigned to each process (typically one per device, e.g., a GPU) in a distributed group, identifying its position in collective operations.
RDMA

Remote Direct Memory Access lets one host read or write another host’s memory without CPU intervention.
RPC (Remote Procedure Call)

Protocol that allows a program to execute a procedure or function on a remote system as if it were local.
scheduler (e.g. SLURM)

Software that queues jobs and assigns cluster resources according to policy and priority.
straggler

Task or node that runs significantly slower than its peers, delaying synchronous operations.
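The effect of a straggler on a synchronous step can be shown with a few numbers. The per-rank timings and the 1.5x-median threshold below are made-up values chosen for illustration, not measurements or a recommended detection rule.

```python
import statistics

# Hypothetical per-rank compute times (seconds) for one training step.
step_times = {0: 1.00, 1: 1.02, 2: 0.98, 3: 1.75}

# A synchronous operation completes only when the slowest rank arrives.
sync_step_time = max(step_times.values())

# Flag ranks that are much slower than the median (threshold is an assumption).
THRESHOLD = 1.5
median_time = statistics.median(step_times.values())
stragglers = [rank for rank, t in step_times.items() if t > THRESHOLD * median_time]

# Time each rank spends waiting for the slowest one.
idle_time = {rank: sync_step_time - t for rank, t in step_times.items()}
```

Here rank 3 sets the step time for everyone, so the three faster ranks each sit idle for roughly 0.75 seconds per step.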
task (SLURM)

Smallest schedulable execution unit within a job, typically a single process (which may itself run multiple threads).
wall time / wall-clock time

Real-world elapsed time from job start to finish.

Distributed ML Terms

data parallelism

Replicates the full model on every device, gives each replica a different portion of the mini-batch, and synchronizes gradients during or after each mini-batch.
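A minimal sketch of one data-parallel step, simulated in a single process: a one-parameter linear model `y = w * x` with a mean-squared-error loss, two replicas, and a plain average standing in for the gradient all-reduce. The model, data, and learning rate are illustrative assumptions.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
world_size = 2
lr = 0.01
w = 0.0

# Each replica starts from the same weights and sees a different data shard.
replicas = [w] * world_size
shards = [(xs[r::world_size], ys[r::world_size]) for r in range(world_size)]

# Local backward pass on every replica.
local_grads = [grad_mse(replicas[r], *shards[r]) for r in range(world_size)]

# Gradient synchronization: an all-reduce (sum) followed by division.
avg_grad = sum(local_grads) / world_size

# Every replica applies the same update, so the weights stay identical.
replicas = [w_r - lr * avg_grad for w_r in replicas]

# With equal-sized shards the averaged gradient equals the full-batch gradient.
full_grad = grad_mse(w, xs, ys)
```

Because every replica applies the same averaged gradient, the copies of the model never drift apart, which is what makes the scheme equivalent to training on the whole mini-batch at once.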
HPO (Hyperparameter Optimization)

Process of systematically searching for the best hyperparameter values to maximize model performance.
model parallelism

Splits model layers or parameter shards across devices so a single forward pass spans multiple accelerators.
tensor parallelism

Partitions individual tensors along dimensions, letting different accelerators compute slices of the same layer.
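As an illustration, the sketch below splits the weight matrix of one linear layer by columns across two simulated devices, computes each output slice independently, and concatenates the slices. It uses plain Python lists and a hypothetical `matmul` helper; column-wise splitting is only one of several possible partitioning schemes.

```python
def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


x = [[1.0, 2.0]]                  # input activations, shape 1x2
W = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]        # layer weights, shape 2x4

n_devices = 2
cols = len(W[0]) // n_devices

# Each device stores only its own block of columns of W.
W_shards = [[row[d * cols:(d + 1) * cols] for row in W] for d in range(n_devices)]

# Each device computes its slice of the layer output from the same input.
partials = [matmul(x, shard) for shard in W_shards]

# Concatenating the slices (an all-gather in practice) rebuilds the full output.
y_parallel = [sum((p[i] for p in partials), []) for i in range(len(x))]
y_reference = matmul(x, W)
```

Both paths produce the same output row; the parallel path simply never materializes the full weight matrix on a single device.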
trial

Single training run of a model with a specific set of hyperparameters, evaluated independently.
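The relationship between HPO and trials can be sketched as a small grid search in which every hyperparameter combination is one trial. The `run_trial` function is a hypothetical stand-in for a real training run, and its loss formula is invented purely so the example has a known optimum.

```python
import itertools


def run_trial(lr, batch_size):
    """Hypothetical stand-in for a training run; returns a validation loss."""
    return (lr - 0.01) ** 2 * 1e4 + abs(batch_size - 64) / 64


search_space = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}

# One trial per hyperparameter combination; trials are independent of each
# other, so they could run in parallel on separate resources.
trials = []
for lr, batch_size in itertools.product(search_space["lr"], search_space["batch_size"]):
    config = {"lr": lr, "batch_size": batch_size}
    trials.append((run_trial(**config), config))

best_loss, best_config = min(trials, key=lambda t: t[0])
```

Search strategies other than an exhaustive grid follow the same pattern and differ mainly in how the next configuration is chosen.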
world size

Total number of devices participating in the current distributed run.
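Rank and world size are commonly used together to decide which part of the work a process handles. The sketch below assumes the launcher exposes them through the `RANK` and `WORLD_SIZE` environment variables (as `torchrun` does) and falls back to a single-process run otherwise; `shard_indices` is a hypothetical helper.

```python
import os


def shard_indices(num_samples, rank, world_size):
    """Sample indices handled by `rank` under a strided split."""
    return list(range(rank, num_samples, world_size))


# Assumption: the launcher sets RANK and WORLD_SIZE; default to one process.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

my_indices = shard_indices(10, rank, world_size)

# Example with four ranks: every sample is assigned to exactly one rank.
example = [shard_indices(10, r, 4) for r in range(4)]
```

With four ranks, rank 1 would handle samples 1, 5, and 9, and the union of all shards covers the dataset without overlap.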

Libraries for Distributed Computing

CUDA

NVIDIA’s GPU-computing platform and runtime for C, C++, and Python kernels.
gRPC

High-performance RPC framework using HTTP/2 and Protocol Buffers for language-agnostic services.
Kubernetes

Cluster-orchestration system for scheduling and managing containerized applications.
- helm
  
  Package manager that deploys and upgrades Kubernetes applications via declarative charts.
- pod
  
  Smallest deployable Kubernetes object, grouping one or more tightly coupled containers.
MPI (Message Passing Interface)

Family of libraries implementing the Message Passing Interface standard for distributed communication. Used for point-to-point and collective operations in distributed applications.
- MPI
  
  The MPI specification defining point-to-point and collective semantics for parallel programs.
- OpenMPI
  
  Popular open-source, production-grade implementation of the MPI standard.
NCCL

NVIDIA Collective Communications Library optimized for intra- and inter-node GPU collectives.
OpenMP

Compiler-directive API for shared-memory parallelism on multicore CPUs.
RCCL

AMD’s ROCm Communication Collectives Library, drop-in compatible with NCCL for AMD GPUs.
ROCm

AMD’s open-source GPU-computing stack analogous to CUDA.
Singularity

Container runtime tailored to HPC that runs unprivileged, reproducible images (similar to Docker).
SLURM

Open-source workload manager that queues jobs and allocates nodes on HPC systems.

Libraries for ML

DDP

PyTorch’s DistributedDataParallel wrapper enabling synchronous data-parallel training across ranks.
DeepSpeed

Microsoft library that extends PyTorch with memory-efficient optimizers, ZeRO sharding, and kernel fusions.
- shard
  
  A slice of parameters or optimizer states stored on a specific rank in ZeRO.
- ZeRO
  
  Zero Redundancy Optimizer: memory-optimization technique that partitions optimizer states, gradients, and parameters across data-parallel ranks to fit massive models.
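The idea of sharding optimizer state can be sketched as follows. This is a heavily simplified single-process illustration, not the DeepSpeed API: it assumes gradients have already been reduced across ranks, uses SGD with momentum as the optimizer, and the `partition` helper and all numeric values are hypothetical.

```python
def partition(num_elements, world_size):
    """Return one [start, end) range per rank, covering all elements."""
    base, extra = divmod(num_elements, world_size)
    bounds, start = [], 0
    for rank in range(world_size):
        end = start + base + (1 if rank < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


params = [0.5, -1.0, 2.0, 0.25, 3.0]
grads = [0.1, 0.2, -0.3, 0.4, -0.5]   # assumed to be already reduced
world_size = 2
lr, beta = 0.1, 0.9

shards = partition(len(params), world_size)

# Each rank keeps momentum only for its own shard instead of all parameters.
momentum = {rank: [0.0] * (end - start) for rank, (start, end) in enumerate(shards)}

updated_shards = []
for rank, (start, end) in enumerate(shards):
    state = momentum[rank]
    new_values = []
    for i in range(start, end):
        state[i - start] = beta * state[i - start] + grads[i]
        new_values.append(params[i] - lr * state[i - start])
    updated_shards.append(new_values)

# An all-gather of the updated shards restores the full parameter vector.
new_params = [p for shard in updated_shards for p in shard]
```

Per-rank optimizer memory shrinks roughly in proportion to the world size, at the cost of an extra gather after each update.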
Horovod

Framework providing MPI/NCCL-backed data-parallel training APIs across major DL frameworks.
Ray

Distributed execution framework for Python offering task, actor, and object-store abstractions, along with libraries for HPO.
- placement group
  
  Ray construct for requesting a set of resources that are grouped or located together on the same machine or nearby.
- KubeRay
  
  Kubernetes operator that provisions and manages Ray clusters as native resources.
- Ray Tune
  
  Ray’s HPO library that supports distributed trials, advanced search algorithms, and trial-scheduling algorithms.