docs(day1): GPU Talk

2025-04-02 17:43:21 +02:00
parent a36f562cf4
commit 77f34ed1ab
1 changed files with 56 additions and 0 deletions
--- a/content/day1/04_gpus-go-round.md
+++ b/content/day1/04_gpus-go-round.md
@@ -0,0 +1,56 @@
+---
+title: THE GPUs on the bus go round and round
+weight: 4
+tags:
+ - kubecon
+ - gpu
+ - nvidia
+---
+
+<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
+<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
+
+## Background
+
+- They are the GeForce NOW folks
+- Large fleet of clusters all over the world (60,000+ GPUs)
+- They use KubeVirt to pass through GPUs (vfio driver) or vGPUs
+- Devices fail from time to time
+- Sometimes failures require a restart
+
+## Failure discovery
+
+- Goal: Maintain capacity
+- Failure reasons: Overheating, insufficient power, driver issues, hardware faults, ...
+- Problem: Failures were only noticed indirectly, through decreasing capacity or a failed driver switch
+- Fix: First detect the failure, then remediate
+    - GPU problem detector as part of their internal device plugin
+    - Node problem detector -> triggers remediation through maintenance
+
+## Remediation approaches
+
+- Reboot: Works every time, but has workload-related downsides -> Legit solution, but the drain can take very long
+- Discovery of remediation loops -> Too many reboots indicate that something is not quite right
+- Optimized drain: Prioritize draining nodes with failed devices over other maintenance
+- The current workflow is: Reboot (automated) -> Power cycle (automated) -> Rebuild Node (automated) -> Manual intervention / RMA
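The escalation workflow above, combined with the idea of spotting remediation loops, can be sketched as a small decision function. This is only an illustration and not the speakers' implementation: the step names, the `NodeHistory` record and the 7-day window are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Escalation order taken from the notes; the identifiers are illustrative.
ESCALATION = ["reboot", "power_cycle", "rebuild_node", "manual_intervention_rma"]


@dataclass
class NodeHistory:
    """Hypothetical per-node record of (timestamp in seconds, step) attempts."""
    attempts: list[tuple[float, str]] = field(default_factory=list)


def next_step(history: NodeHistory, now: float, window_s: float = 7 * 24 * 3600) -> str:
    """Pick the remediation step for a node whose GPU check just failed.

    Assumption: every failure inside the window escalates one step further,
    so a node that keeps failing after reboots does not loop on reboots.
    Attempts older than the window are ignored and the ladder starts over.
    """
    recent = [step for ts, step in history.attempts if now - ts <= window_s]
    if not recent:
        return ESCALATION[0]
    furthest = max(ESCALATION.index(step) for step in recent)
    # Stay on the last step (manual intervention / RMA) once it is reached.
    return ESCALATION[min(furthest + 1, len(ESCALATION) - 1)]
```

A caller would append each executed step to `history.attempts` and ask `next_step` again on the next failure. Whether a real system retries a step several times before escalating is a tuning decision that the notes do not cover.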
+
+## Prevention
+
+> Problems should not affect workload
+
+- Healthchecks with alerts
+- Firmware & Driver updates
+- Thermal & power management
+
+## Future Challenges
+
+- What if a high-density node with 8 GPUs has one failure?
+- What is an acceptable ratio of working to broken GPUs per node?
+- If a problematic node has to be rebooted every couple of days, should the scheduler avoid this node?
+
+## Q&A
+
+- Are there any plans to open source the GPU problem detection: We could certainly do it, but it is not on the roadmap right now
+- Are the failure rates representative and what is counted as failure:
+    - Failure is not being able to run a workload on a node (could be hardware or driver failure)
+    - The failure rate is 0.6%, but the affected capacity is 1.2% (with 2 GPUs per node)
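The gap between failure rate and affected capacity can be reproduced with a short calculation. It rests on one reading of the answer, which is an assumption: the 0.6% is a per-GPU failure rate, a single failed GPU takes the whole node out of service, and failures are rare enough to land on different nodes. The function name is made up for the example.

```python
def affected_capacity(gpu_failure_rate: float, gpus_per_node: int) -> float:
    """Approximate share of GPU capacity lost when one failed GPU removes its whole node.

    Assumes failures hit distinct nodes, which is reasonable for small rates;
    the result is capped at 1.0 (all capacity lost).
    """
    return min(1.0, gpu_failure_rate * gpus_per_node)


# Numbers from the Q&A: 0.6% failures with 2 GPUs per node.
two_gpu_nodes = affected_capacity(0.006, 2)

# Extrapolation (not from the talk): the same rate on 8-GPU nodes.
eight_gpu_nodes = affected_capacity(0.006, 8)

print(f"2 GPUs per node: {two_gpu_nodes:.1%}")
print(f"8 GPUs per node: {eight_gpu_nodes:.1%}")
```

Under these assumptions the loss grows linearly with node density, which is why the question about a single failure on an 8-GPU node matters.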