---
title: Accelerating AI workloads with GPUs in Kubernetes
weight: 3
---

Kevin and Sanjay from NVIDIA

## Enabling GPUs in Kubernetes today

* Host-level components: Container toolkit, GPU drivers
* Kubernetes components: Device plugin, feature discovery, node selectors
* NVIDIA humbly brings you the GPU Operator, which deploys and manages all of the above

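To make the list above concrete, here is a minimal sketch of what a workload has to ask for once the device plugin is running. It builds a Pod manifest as a plain Python dict. `nvidia.com/gpu` is the extended resource name advertised by the NVIDIA device plugin; the helper name `gpu_pod_manifest`, the image and the node label used below are assumptions for illustration and were not shown in the talk.

```python
import json


def gpu_pod_manifest(name, image, gpus=1, node_labels=None):
    """Build a Pod manifest (plain dict) that requests whole GPUs.

    Hypothetical helper for illustration only.
    """
    if gpus < 1:
        raise ValueError("gpus must be a positive integer")
    manifest = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resource advertised by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }
    if node_labels:
        # Labels written by feature discovery can be used to pick GPU nodes.
        manifest["spec"]["nodeSelector"] = dict(node_labels)
    return manifest


if __name__ == "__main__":
    pod = gpu_pod_manifest(
        "cuda-test",
        "example.com/cuda-sample:latest",  # placeholder image (assumption)
        gpus=1,
        node_labels={"nvidia.com/gpu.present": "true"},  # assumed label
    )
    print(json.dumps(pod, indent=2))
```

The printed JSON can be piped into `kubectl apply -f -`. Check the labels that actually exist on your nodes before relying on the selector.
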
## GPU sharing

* Time slicing: Workloads take turns on the GPU, switching by time
* Multi-Process Service (MPS): Processes always run on the GPU but share it (in space)
* Multi-Instance GPU (MIG): Space-separated sharing, partitioned in hardware
* Virtual GPU (vGPU): Virtualizes time slicing or MIG
* CUDA streams: Run multiple kernels concurrently in a single app

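The difference between sharing in time and sharing in space can be illustrated with a toy model. This is only a sketch: it does not call any NVIDIA API, the function names are made up, and the even split ignores the fixed profiles that real MIG partitioning uses.

```python
def time_slice(workloads, quantum, total_time):
    """Temporal sharing: one workload owns the whole GPU per time quantum.

    Returns a list of (start_time, workload) tuples in round-robin order.
    """
    if not workloads or quantum <= 0:
        raise ValueError("need at least one workload and a positive quantum")
    timeline = []
    start, turn = 0, 0
    while start < total_time:
        timeline.append((start, workloads[turn % len(workloads)]))
        start += quantum
        turn += 1
    return timeline


def partition(workloads, total_slices):
    """Spatial sharing: every workload holds its own slices at the same time.

    Splits the slices as evenly as possible; leftovers go to the first ones.
    """
    if not workloads:
        raise ValueError("need at least one workload")
    per_workload, rest = divmod(total_slices, len(workloads))
    return {
        name: per_workload + (1 if index < rest else 0)
        for index, name in enumerate(workloads)
    }


if __name__ == "__main__":
    print(time_slice(["a", "b"], quantum=10, total_time=40))
    # [(0, 'a'), (10, 'b'), (20, 'a'), (30, 'b')]
    print(partition(["a", "b", "c"], total_slices=7))
    # {'a': 3, 'b': 2, 'c': 2}
```

With time slicing every workload sees the full GPU but only part of the time; with partitioning every workload runs all the time but only on its share of the hardware.
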
## Dynamic resource allocation

* A new alpha feature since Kubernetes 1.26 for requesting resources dynamically
* You just request a resource via the API and have fun
* The sharing itself is an implementation detail

## GPU scale-out challenges

* NVIDIA Picasso is a foundry for model creation powered by Kubernetes
* The workload consists of training jobs that are split into batches
* Challenge: Schedule multiple prioritized training jobs from different users

### Topology-aware placements

* You need thousands of GPUs. A typical node has 8 GPUs with fast NVLink communication; beyond that, traffic goes through switches
* Target: Optimize the placement of related jobs based on GPU node distance and NUMA placement

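A rough sketch of what distance-aware placement could look like: prefer a single node (NVLink only), then a single switch group, and only then spread across groups. This is an assumed greedy heuristic for illustration, not the scheduler described in the talk. The input dicts and function names are hypothetical, and NUMA placement is left out.

```python
from collections import defaultdict


def _greedy(gpus_needed, nodes, free_gpus):
    """Fill the request from the emptiest nodes first; None if it does not fit."""
    assignment, remaining = {}, gpus_needed
    for node in sorted(nodes, key=lambda n: (-free_gpus[n], n)):
        if remaining == 0:
            break
        take = min(free_gpus[node], remaining)
        if take > 0:
            assignment[node] = take
            remaining -= take
    return assignment if remaining == 0 else None


def place_job(gpus_needed, free_gpus, switch_of):
    """Return {node: gpus} for a job, keeping its GPUs as close as possible.

    free_gpus: {node: free GPU count}, switch_of: {node: switch group}.
    """
    # 1. Best case: the job fits on one node, so all traffic stays on NVLink.
    #    Pick the tightest fit to leave large nodes free for large jobs.
    fitting = [n for n, free in free_gpus.items() if free >= gpus_needed]
    if fitting:
        best = min(fitting, key=lambda n: (free_gpus[n], n))
        return {best: gpus_needed}

    # 2. Next best: stay behind one switch, using as few nodes as possible.
    groups = defaultdict(list)
    for node in free_gpus:
        groups[switch_of[node]].append(node)
    candidates = []
    for _, nodes in sorted(groups.items()):
        assignment = _greedy(gpus_needed, nodes, free_gpus)
        if assignment is not None:
            candidates.append(assignment)
    if candidates:
        return min(candidates, key=len)

    # 3. Fallback: span switch groups.
    return _greedy(gpus_needed, list(free_gpus), free_gpus)


if __name__ == "__main__":
    free = {"n1": 8, "n2": 4, "n3": 8, "n4": 8}
    switch = {"n1": "s1", "n2": "s1", "n3": "s2", "n4": "s2"}
    print(place_job(4, free, switch))   # {'n2': 4}
    print(place_job(16, free, switch))  # {'n3': 8, 'n4': 8}
```
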
### Fault tolerance and resiliency

* Stuff can break, resulting in slowdowns or errors
* Challenge: Detect faults and handle them
* Observability, both in-band and out-of-band, that exposes node conditions in Kubernetes
* Needed: Automated fault-tolerant scheduling

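Once faults show up as node conditions, a scheduler can filter on them. The sketch below works on node objects in the shape returned by the Kubernetes API (`metadata.name`, `status.conditions`). The condition type `GpuUnhealthy` and the function name are hypothetical; the talk did not name specific conditions.

```python
def schedulable_nodes(nodes, fault_conditions=("GpuUnhealthy",)):
    """Return the names of nodes that are Ready and report no known fault.

    nodes: list of dicts shaped like Kubernetes Node objects.
    fault_conditions: condition types that mark a node as faulty when their
    status is "True" (the default value is a made-up example).
    """
    healthy = []
    for node in nodes:
        conditions = {
            c["type"]: c["status"]
            for c in node.get("status", {}).get("conditions", [])
        }
        if conditions.get("Ready") != "True":
            continue  # node is down or its state is unknown
        if any(conditions.get(fault) == "True" for fault in fault_conditions):
            continue  # monitoring flagged a fault on this node
        healthy.append(node["metadata"]["name"])
    return healthy


if __name__ == "__main__":
    sample = [
        {"metadata": {"name": "gpu-1"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "gpu-2"},
         "status": {"conditions": [
             {"type": "Ready", "status": "True"},
             {"type": "GpuUnhealthy", "status": "True"},
         ]}},
    ]
    print(schedulable_nodes(sample))  # ['gpu-1']
```
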
### Multi-dimensional optimization

* There are different KPIs: starvation, priority, occupancy, fairness
* Challenge: What to optimize for (the multi-dimensional decision problem)
* Needed: A scheduler that can balance the dimensions

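One simple way to balance several KPIs is a weighted sum, sketched below. This is an assumption chosen for illustration, not the approach presented by the speakers. All field names are hypothetical and every input is expected to be normalized to the range 0 to 1.

```python
def rank_jobs(jobs, weights):
    """Order jobs by a weighted sum over the four KPIs (highest score first).

    Each job is a dict with:
      wait       - how long it has waited (guards against starvation)
      priority   - its priority class
      occupancy  - how well it would fill the currently free GPUs
      user_share - how much of the cluster its owner already uses
    """
    def score(job):
        return (
            weights["starvation"] * job["wait"]
            + weights["priority"] * job["priority"]
            + weights["occupancy"] * job["occupancy"]
            # Fairness favors users who hold a small share so far.
            + weights["fairness"] * (1.0 - job["user_share"])
        )

    # Ties are broken by name so the order is deterministic.
    return sorted(jobs, key=lambda job: (-score(job), job["name"]))


if __name__ == "__main__":
    jobs = [
        {"name": "old-low", "wait": 0.9, "priority": 0.1,
         "occupancy": 0.5, "user_share": 0.5},
        {"name": "new-high", "wait": 0.1, "priority": 0.9,
         "occupancy": 0.5, "user_share": 0.5},
    ]
    favor_waiting = {"starvation": 1.0, "priority": 0.0,
                     "occupancy": 0.0, "fairness": 0.0}
    print([job["name"] for job in rank_jobs(jobs, favor_waiting)])
    # ['old-low', 'new-high']
```

Changing the weights flips the order, which is exactly the decision problem: the code is trivial, choosing the weights is not.
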