/tag/slurm
GPU Best Practices
Core Principles Measure before you scale. Always take a short, single‑GPU baseline and record simple metrics. Right‑size, don’t over‑ask. Request only the GPUs/CPUs/RAM and walltime your measurements justify. Keep GPUs busy. If utilization is low, fix input/data issues before adding more GPUs. Short interactive, long batch. Use OOD for quick experiments; move long work to SLURM. Be a good citizen. Release idle sessions, clean up scratch, and prefer storage patterns that reduce system load. Right‑Sizing in 5 Steps Baseline (≤5 minutes): Run a tiny slice on 1 GPU. Note: Throughput (samples/s or tokens/s) GPU utilization and memory usage Any stalls from CPU or I/O Find the knee: Increase batch size and enable mixed precision if supported.