docs(day1): GPU Talk
content/day1/04_gpus-go-round.md (new file, 56 lines)
---
title: THE GPUs on the bus go round and round
weight: 4
tags:
- kubecon
- gpu
- nvidia
---

<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->

## Background

- They are the GeForce NOW folks
- Large fleet of clusters all over the world (60,000+ GPUs)
- They use KubeVirt to pass GPUs through to VMs (vfio driver) or to provide vGPUs (see the sketch after this list)
- Devices fail from time to time
- Sometimes a failure requires a restart

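
The notes only mention the passthrough mechanism, so here is a minimal, hedged sketch of a KubeVirt `VirtualMachineInstance` that requests a passed-through GPU. It assumes the `kubevirt.io/api/core/v1` Go types of recent KubeVirt releases, and `nvidia.com/TU104GL_Tesla_T4` is only a placeholder for whatever resource name the node's GPU device plugin actually advertises.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubevirtv1 "kubevirt.io/api/core/v1"
)

func main() {
	// Sketch of a VMI asking KubeVirt to pass a host GPU (bound to vfio-pci) into the guest.
	// The DeviceName must match a resource the GPU device plugin exposes on the node.
	vmi := &kubevirtv1.VirtualMachineInstance{
		ObjectMeta: metav1.ObjectMeta{Name: "gpu-workload"},
		Spec: kubevirtv1.VirtualMachineInstanceSpec{
			Domain: kubevirtv1.DomainSpec{
				Devices: kubevirtv1.Devices{
					GPUs: []kubevirtv1.GPU{
						{Name: "gpu1", DeviceName: "nvidia.com/TU104GL_Tesla_T4"}, // placeholder resource name
					},
				},
			},
		},
	}
	fmt.Printf("VMI %s requests GPU resource %s\n", vmi.Name, vmi.Spec.Domain.Devices.GPUs[0].DeviceName)
}
```
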
## Failure discovery

- Goal: Maintain capacity
- Failure reasons: overheating, insufficient power, driver issues, hardware faults, ...
- Problem: Failures were only noticed when capacity dropped or a driver switch failed
- Fix: First detect the failure, then remediate
- A GPU problem detector is part of their internal device plugin (a minimal detection sketch follows after this list)
- Node Problem Detector -> triggers remediation through maintenance

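
The internal device plugin and its problem detector are not public, so the following is only a sketch of the detection idea under my own assumptions: a node agent periodically runs `nvidia-smi --list-gpus` and flags the node when the tool fails or fewer GPUs than expected are visible. A real detector would report a node condition instead of printing.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// countVisibleGPUs asks nvidia-smi for the GPU list; an error is treated as a
// detection failure (driver wedged, device fallen off the bus, ...).
func countVisibleGPUs() (int, error) {
	out, err := exec.Command("nvidia-smi", "--list-gpus").Output()
	if err != nil {
		return 0, err
	}
	trimmed := strings.TrimSpace(string(out))
	if trimmed == "" {
		return 0, nil
	}
	return len(strings.Split(trimmed, "\n")), nil
}

func main() {
	const expectedGPUs = 2 // per the talk: 2 GPUs per node
	for {
		n, err := countVisibleGPUs()
		switch {
		case err != nil:
			fmt.Println("GPU check failed (candidate for remediation):", err)
		case n < expectedGPUs:
			fmt.Printf("only %d of %d GPUs visible (candidate for remediation)\n", n, expectedGPUs)
		default:
			fmt.Printf("all %d GPUs healthy\n", n)
		}
		time.Sleep(30 * time.Second)
	}
}
```
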
## Remediation approaches

- Reboot: Works every time, but has workload-related downsides -> a legitimate solution, but draining can take very long
- Detection of remediation loops -> too many reboots indicate that something is not quite right
- Optimized drain: Prioritize draining nodes with failed devices ahead of other maintenance
- The current workflow is: Reboot (automated) -> Power cycle (automated) -> Rebuild node (automated) -> Manual intervention / RMA (see the escalation sketch after this list)

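
The escalation chain is only named at a high level in the notes; the snippet below is an assumed illustration of such a ladder, where the count of recent remediations doubles as the loop detector: after too many attempts the action escalates instead of rebooting again.

```go
package main

import "fmt"

// Remediation steps in escalation order, mirroring the workflow from the talk.
var steps = []string{"reboot", "power cycle", "rebuild node", "manual intervention / RMA"}

// nextStep picks the remediation action based on how often the node has already
// been remediated recently. Too many recent attempts indicate a remediation loop,
// so the action escalates rather than rebooting forever.
func nextStep(recentRemediations int) string {
	if recentRemediations >= len(steps) {
		return steps[len(steps)-1]
	}
	return steps[recentRemediations]
}

func main() {
	for attempts := 0; attempts < 5; attempts++ {
		fmt.Printf("attempt %d -> %s\n", attempts+1, nextStep(attempts))
	}
}
```
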
## Prevention

> Problems should not affect workloads

- Health checks with alerts (a monitoring sketch follows after this list)
- Firmware & driver updates
- Thermal & power management

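
As a hedged example of the health-check idea, this sketch polls temperature and power draw through nvidia-smi's query interface and prints an alert above assumed thresholds; the concrete limits and the alerting pipeline are not from the talk.

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// readGPUStats queries per-GPU temperature (°C) and power draw (W) via nvidia-smi.
func readGPUStats() ([][2]float64, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=temperature.gpu,power.draw",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		return nil, err
	}
	var stats [][2]float64
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fields := strings.Split(line, ",")
		if len(fields) != 2 {
			continue
		}
		temp, _ := strconv.ParseFloat(strings.TrimSpace(fields[0]), 64)
		power, _ := strconv.ParseFloat(strings.TrimSpace(fields[1]), 64)
		stats = append(stats, [2]float64{temp, power})
	}
	return stats, nil
}

func main() {
	const maxTemp, maxPower = 85.0, 300.0 // assumed alert thresholds
	stats, err := readGPUStats()
	if err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	for i, s := range stats {
		if s[0] > maxTemp || s[1] > maxPower {
			fmt.Printf("GPU %d: ALERT temp=%.0f°C power=%.0fW\n", i, s[0], s[1])
		}
	}
}
```
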
## Future Challenges

- What if a high-density node with 8 GPUs has a single failure?
- What is an acceptable ratio of working to broken GPUs per node? (see the sketch after this list)
- If a problematic node has to be rebooted every couple of days, should the scheduler avoid that node?

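
One way to frame the acceptable-ratio question is as a drain-policy threshold. The helper below is hypothetical, not how the presenters decide it; it just makes the trade-off explicit for a dense 8-GPU node.

```go
package main

import "fmt"

// shouldDrain decides whether a node with some broken GPUs is still worth keeping
// in rotation. maxBrokenRatio is a policy knob: 0 means drain on the first failure,
// 0.5 would tolerate half the GPUs on a dense 8-GPU node being broken.
func shouldDrain(brokenGPUs, totalGPUs int, maxBrokenRatio float64) bool {
	if totalGPUs == 0 {
		return true
	}
	return float64(brokenGPUs)/float64(totalGPUs) > maxBrokenRatio
}

func main() {
	// A high-density node with 8 GPUs and one failure, under two example policies.
	fmt.Println(shouldDrain(1, 8, 0))    // true: drain on any failure
	fmt.Println(shouldDrain(1, 8, 0.25)) // false: keep serving with 7/8 GPUs
}
```
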
## Q&A

- Are there any plans to open-source the GPU problem detection? We could certainly do it, but it is not on the roadmap right now
- Are the failure rates representative, and what counts as a failure?
  - A failure means not being able to run a workload on a node (could be a hardware or driver failure)
  - The failure rate is 0.6%, but the affected capacity is 1.2% (with 2 GPUs per node; see the back-of-the-envelope check after this list)

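
A quick back-of-the-envelope check of that doubling, assuming independent GPU failures and a whole node leaving rotation when any of its GPUs fails (the exact accounting was not given in the talk):

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const gpuFailureRate = 0.006 // 0.6% of GPUs have failed
	const gpusPerNode = 2.0

	// Probability that a node has at least one failed GPU, assuming independent failures.
	nodeAffected := 1 - math.Pow(1-gpuFailureRate, gpusPerNode)

	// If a node with any failed GPU is pulled from rotation, the affected capacity
	// share equals the share of affected nodes.
	fmt.Printf("affected capacity ≈ %.1f%%\n", nodeAffected*100) // prints ≈ 1.2%, matching the Q&A figure
}
```
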