---
title: Accelerating AI workloads with GPUs in Kubernetes
weight: 3
tags:
  - keynote
  - ai
  - nvidia
---

{{% button href="https://youtu.be/gn5SZWyaZ34" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}

Kevin and Sanjay from NVIDIA

## Enabling GPUs in Kubernetes today

* Host-level components: drivers, container toolkit
* Kubernetes components: device plugin, GPU feature discovery, node selector (pod spec sketch in the appendix below)
* NVIDIA humbly brings you a GPU Operator that bundles these components

## GPU sharing

* Time-slicing: processes take turns on the GPU over time (config sketch in the appendix below)
* Multi-Process Service (MPS): processes always run on the GPU but share it spatially
* Multi-Instance GPU (MIG): space-separated sharing, partitioned in hardware
* Virtual GPU (vGPU): virtualizes time-slicing or MIG
* CUDA Streams: run multiple kernels within a single application

## Dynamic resource allocation

* A new alpha feature since Kubernetes 1.26 for dynamic resource requests
* You just request a resource via the API and have fun (claim sketch in the appendix below)
* The sharing itself is an implementation detail

## GPU scale-out challenges

* NVIDIA Picasso is a foundry for model creation, powered by Kubernetes
* The workload is training, split into batches
* Challenge: schedule multiple training jobs from different users, with priorities (PriorityClass sketch in the appendix below)

### Topology-aware placement

* You need thousands of GPUs, but a typical node has 8 GPUs with fast NVLink communication - beyond that, traffic goes through switches
* Target: optimize placement of related jobs based on GPU node distance and NUMA locality (affinity sketch in the appendix below)

### Fault tolerance and resiliency

* Things break, resulting in slowdowns or errors
* Challenge: detect faults and handle them (taint/toleration sketch in the appendix below)
* Observability, both in-band and out-of-band, that exposes node conditions in Kubernetes
* Needed: automated fault-tolerant scheduling

### Multidimensional optimization

* There are different KPIs: starvation, priority, occupancy, fairness
* Challenge: what to optimize for (a multidimensional decision problem)
* Needed: a scheduler that can balance these dimensions
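
## Appendix: example manifests

A minimal sketch of the device-plugin path described above: the pod requests the `nvidia.com/gpu` extended resource and uses a node selector on a label published by GPU feature discovery. The image tag is an assumption; adjust for your cluster.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"     # label published by GPU feature discovery
  containers:
    - name: cuda
      image: nvidia/cuda:12.3.1-base-ubuntu22.04   # image tag is an assumption
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1            # extended resource advertised by the device plugin
  restartPolicy: Never
```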
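
For time-slicing, the device plugin can advertise each physical GPU as several schedulable resources. A sketch of the sharing config, assuming the plugin is deployed so that it reads this ConfigMap (the exact wiring depends on how the plugin or GPU Operator is installed; name and namespace are assumptions):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config    # name/namespace are assumptions
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4                # each physical GPU shows up as 4 allocatable GPUs
```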
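
For dynamic resource allocation, a sketch against the Kubernetes 1.26 `resource.k8s.io/v1alpha1` alpha API (the API surface has changed in later releases; the resource class name is driver-specific and assumed here):

```yaml
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: gpu-claim-template
spec:
  spec:
    resourceClassName: gpu.example.com   # provided by a DRA driver; name is an assumption
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-pod
spec:
  containers:
    - name: app
      image: nvidia/cuda:12.3.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        claims:
          - name: gpu                  # references the claim below, not a device count
  resourceClaims:
    - name: gpu
      source:
        resourceClaimTemplateName: gpu-claim-template
```

Note that the pod only names a claim; whether the driver backs it with a time slice, a MIG slice, or a whole device is hidden - the sharing itself is an implementation detail.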
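
Prioritizing training jobs from different users can lean on stock Kubernetes priorities. A minimal sketch (the class name and value are assumptions):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high                  # name is an assumption
value: 100000
preemptionPolicy: PreemptLowerPriority # may evict lower-priority training pods
description: "High-priority training jobs"
```

Pods opt in via `spec.priorityClassName: training-high`; balancing preemption like this against starvation and fairness is exactly the multidimensional problem from the last section.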
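
For topology-aware placement, one sketch is pod affinity over a topology label, so workers of the same job pack into a single NVLink/switch domain. The `example.com/nvlink-domain` label and the job label are hypothetical; a real cluster would publish comparable labels via node feature discovery or similar:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer-worker-0
  labels:
    job-name: train-foo                # hypothetical job label
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              job-name: train-foo      # co-locate with other workers of this job
          topologyKey: example.com/nvlink-domain   # hypothetical topology label on nodes
  containers:
    - name: worker
      image: nvidia/cuda:12.3.1-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 8            # a full 8-GPU node, as in the talk
```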
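
For fault handling, one common pattern is a health agent that taints a node when it detects a GPU fault, so new workloads avoid it while diagnostic pods can still land there. The taint key is hypothetical:

```yaml
# A health agent might apply:
#   kubectl taint nodes <node> example.com/gpu-unhealthy=true:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: gpu-diagnostics
spec:
  tolerations:
    - key: example.com/gpu-unhealthy   # hypothetical taint key set by the health agent
      operator: Exists
      effect: NoSchedule
  containers:
    - name: diag
      image: nvidia/cuda:12.3.1-base-ubuntu22.04
      command: ["nvidia-smi", "-q"]
```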