Glossary

Author: Linus Eickhoff (CERN)

This page summarizes key terms used in distributed training and high-performance computing (HPC), providing quick reference for terminology relevant to the itwinai project, its documentation, and codebase.

HPC Terms

  • accelerator

    Specialized compute device (e.g., GPU, TPU, FPGA) designed to speed up specific workloads.

  • collectives / collective operations / collective communications

    Communication primitives in which all ranks of a communicator take part in a single operation that moves or reduces data.

    • all-gather

      Each rank contributes its local tensor and receives the concatenation of the tensors from all ranks.

    • all-reduce

      All ranks combine their tensors with an element-wise reduction (e.g., sum) and obtain the identical reduced result.

    • broadcast

      A single source rank transmits a tensor that every other rank receives unchanged.

  • directive

    Compiler instruction embedded in source code that modifies compilation or execution behavior, such as enabling parallelism or optimization, without altering program logic.

  • job

    Submitted unit of work that the scheduler runs with allocated resources.

  • InfiniBand

    High-bandwidth, low-latency network interconnect widely used in HPC clusters.

  • node

    Physical server in a cluster containing CPUs, memory, and often accelerators.

  • NUMA

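    Non-Uniform Memory Access: architecture in which memory access latency depends on where the memory sits relative to the CPU socket accessing it (local memory is faster than memory attached to another socket).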

  • NVLink

    Point-to-point GPU interconnect from NVIDIA that offers higher bandwidth than PCIe.

  • NVSwitch

    NVIDIA switch chip providing all-to-all NVLink connectivity among multiple GPUs in a node.

  • PCIe

    Serial expansion bus standard that connects CPUs, GPUs, Network Interface Controllers (NICs), and storage devices.

  • rank

    Unique index assigned to each process (typically one per device, e.g., GPU) in a distributed group, identifying its position in collective operations.

  • RDMA

    Remote Direct Memory Access: mechanism that lets one host read or write another host’s memory without involving the remote host’s CPU.

  • RPC (Remote Procedure Call)

    Protocol that allows a program to execute a procedure or function on a remote system as if it were local.

  • scheduler (e.g. SLURM)

    Software that queues jobs and assigns cluster resources according to policy and priority.

  • straggler

    Task or node that runs significantly slower than its peers, delaying synchronous operations.

  • task (SLURM)

    Smallest schedulable execution unit within a job, typically a single process (e.g., one MPI rank).

  • wall time / wall-clock time

    Real-world elapsed time from job start to finish.
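
The semantics of the collective operations defined above (all-gather, all-reduce, broadcast) can be illustrated with a small single-process simulation. This is only a sketch under simplifying assumptions: each rank's tensor is a plain Python list, all "ranks" live in one process, and the function names are hypothetical helpers, not the API of MPI, NCCL, or itwinai.

```python
from typing import List

Tensor = List[float]


def all_gather(per_rank: List[Tensor]) -> List[Tensor]:
    """Every rank receives the concatenation of all ranks' tensors."""
    gathered = [value for tensor in per_rank for value in tensor]
    return [list(gathered) for _ in per_rank]


def all_reduce_sum(per_rank: List[Tensor]) -> List[Tensor]:
    """Every rank receives the same element-wise sum over all ranks."""
    reduced = [sum(values) for values in zip(*per_rank)]
    return [list(reduced) for _ in per_rank]


def broadcast(per_rank: List[Tensor], src: int) -> List[Tensor]:
    """Every rank receives an unchanged copy of the source rank's tensor."""
    return [list(per_rank[src]) for _ in per_rank]


# Index i of the outer list holds the local tensor of rank i (world size 3).
local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

gathered = all_gather(local)     # each rank: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
reduced = all_reduce_sum(local)  # each rank: [9.0, 12.0]
sent = broadcast(local, src=1)   # each rank: [3.0, 4.0]
```

In a real distributed run each rank only ever sees its own entry of `local`; the outer list here stands in for the communicator.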

Distributed ML Terms

  • data parallelism

    Replicates the full model on every device, feeds each replica a different slice of the data, and synchronizes gradients during or after the backward pass of each mini-batch.

  • HPO (Hyperparameter Optimization)

    Process of systematically searching for the best hyperparameter values to maximize model performance.

  • model parallelism

    Splits model layers or parameter shards across devices so a single forward pass spans multiple accelerators.

  • tensor parallelism

    Partitions individual tensors along dimensions, letting different accelerators compute slices of the same layer.

  • trial

    Single training run of a model with a specific set of hyperparameters, evaluated independently.

  • world size

    Total number of processes (typically one per device) participating in the current distributed run.
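
To make the data parallelism and world size entries above concrete, the following sketch simulates synchronous data-parallel training in a single process. It assumes a toy one-parameter model `y = w * x` with a mean-squared-error loss and equally sized shards; the helper `grad_mse` and the variable names are hypothetical, and the averaging step stands in for an all-reduce.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def grad_mse(w: float, batch: List[Sample]) -> float:
    """Gradient of the mean squared error with respect to w for y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
world_size = 2
w = 0.0  # every rank holds an identical replica of the model

# Rank r takes every world_size-th sample, starting at index r.
shards = [data[rank::world_size] for rank in range(world_size)]

# Each rank computes a gradient on its own shard only.
local_grads = [grad_mse(w, shard) for shard in shards]

# Averaging the local gradients plays the role of an all-reduce (mean).
synced_grad = sum(local_grads) / world_size

# With equally sized shards this matches the gradient on the full batch.
full_grad = grad_mse(w, data)
```

Because every rank applies the same synchronized gradient, the replicas stay identical after each optimizer step.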

Libraries for Distributed Computing

  • CUDA

    NVIDIA’s GPU-computing platform and runtime for C, C++, and Python kernels.

  • gRPC

    High-performance RPC framework using HTTP/2 and Protocol Buffers for language-agnostic services.

  • Kubernetes

    Cluster-orchestration system for scheduling and managing containerized applications.

    • helm

      Package manager that deploys and upgrades Kubernetes applications via declarative charts.

    • pod

      Smallest deployable Kubernetes object, grouping one or more tightly coupled containers.

  • MPI (Message Passing Interface)

    Family of libraries implementing the Message Passing Interface standard for distributed communication. Used for point-to-point and collective operations in distributed applications.

    • MPI

      The MPI specification defining point-to-point and collective semantics for parallel programs.

    • OpenMPI

      Popular open-source, production-grade implementation of the MPI standard.

  • NCCL

    NVIDIA Collective Communications Library optimized for intra- and inter-node GPU collectives.

  • OpenMP

    Compiler-directive API for shared-memory parallelism on multicore CPUs.

  • RCCL

    AMD’s ROCm Communication Collectives Library, drop-in compatible with NCCL for AMD GPUs.

  • ROCm

    AMD’s open-source GPU-computing stack analogous to CUDA.

  • Singularity

    Container runtime tailored to HPC that runs unprivileged, reproducible images (similar to Docker).

  • SLURM

    Open-source workload manager that queues jobs and allocates nodes on HPC systems.
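
As an illustration of how the scheduler, task, and rank concepts fit together, the sketch below derives rank information from the environment variables that SLURM sets for each task (`SLURM_PROCID`, `SLURM_LOCALID`, `SLURM_NTASKS`). The helper name `rank_info_from_slurm` and the fallback defaults are assumptions made for this example; it is not how itwinai or any particular launcher necessarily does it.

```python
import os
from typing import Mapping, Tuple


def rank_info_from_slurm(env: Mapping[str, str] = os.environ) -> Tuple[int, int, int]:
    """Return (global rank, local rank, world size) from SLURM task variables.

    Falls back to a single-process setup when the variables are absent,
    e.g. when the script runs outside a SLURM job.
    """
    rank = int(env.get("SLURM_PROCID", "0"))        # task index within the job
    local_rank = int(env.get("SLURM_LOCALID", "0"))  # task index within its node
    world_size = int(env.get("SLURM_NTASKS", "1"))   # total number of tasks
    return rank, local_rank, world_size


# Hypothetical environment of one task in an 8-task job (4 tasks per node).
fake_env = {"SLURM_PROCID": "5", "SLURM_LOCALID": "1", "SLURM_NTASKS": "8"}
rank, local_rank, world_size = rank_info_from_slurm(fake_env)
```

Passing the environment as an argument keeps the helper easy to test without an actual SLURM allocation.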

Libraries for ML

  • DDP

    PyTorch’s DistributedDataParallel wrapper enabling synchronous data-parallel training across ranks.

  • DeepSpeed

    Microsoft library that extends PyTorch with memory-efficient optimizers, ZeRO sharding, and kernel fusions.

    • shard

      A slice of parameters or optimizer states stored on a specific rank in ZeRO.

    • ZeRO

      Zero Redundancy Optimizer: memory-optimization technique that partitions optimizer states, gradients, and parameters across data-parallel ranks so that very large models fit in device memory.

  • Horovod

    Framework providing MPI/NCCL-backed data-parallel training APIs across major DL frameworks.

  • Ray

    Distributed execution framework for Python offering task, actor, and object-store abstractions, along with libraries such as Ray Tune for HPO.

    • placement group

      Ray construct that reserves a group of resource bundles together, either packed close to each other (e.g., on the same machine) or spread across machines.

    • KubeRay

      Kubernetes operator that provisions and manages Ray clusters as native resources.

    • Ray Tune

      Ray’s HPO library that supports distributed trials, advanced search algorithms, and trial schedulers.
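
The relationship between HPO and trials can be sketched with a plain random search. This example deliberately avoids any Ray Tune API: `run_trial` is a hypothetical toy objective standing in for a real training run, and the hyperparameter ranges are arbitrary assumptions chosen for illustration.

```python
import random
from typing import Dict, List


def run_trial(config: Dict[str, float]) -> float:
    """Toy objective standing in for a training run; lower is better."""
    return (config["lr"] - 0.1) ** 2 + (config["momentum"] - 0.9) ** 2


def random_search(num_trials: int, seed: int = 0) -> List[Dict[str, float]]:
    """Run independent trials, each with randomly sampled hyperparameters."""
    rng = random.Random(seed)  # seeded so the search is reproducible
    results = []
    for trial_id in range(num_trials):
        config = {
            "lr": rng.uniform(0.001, 0.5),
            "momentum": rng.uniform(0.5, 0.99),
        }
        results.append({"trial_id": trial_id, **config, "loss": run_trial(config)})
    return results


results = random_search(num_trials=20)
best = min(results, key=lambda r: r["loss"])  # trial with the lowest loss
```

Since trials are independent of each other, they can run in parallel on separate resources, which is the part a distributed HPO library takes care of.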