title | weight | tags
---|---|---
The GPUs on the bus go round and round | 4 |
{{% button href="https://youtu.be/cLJRh4y4vXg" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
Background
- They are the GeForce Now folks
- Large fleet of clusters all over the world (60,000+ GPUs)
- They use KubeVirt to pass GPUs through to VMs (VFIO driver) or vGPUs
- Devices fail from time to time
- Some failures require restarts
Failure discovery
- Goal: Maintain capacity
- Failure reasons: Overheating, insufficient power, driver issues, hardware faults, ...
- Problem: Failures were only noticed when capacity dropped or a driver switch failed
- Fix: First detect the failure, then remediate
- GPU problem detector as part of their internal device plugin (see the sketch after this list)
- Node Problem Detector -> triggers remediation through maintenance
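
A minimal sketch of how a device plugin can surface failed GPUs to the kubelet, assuming the failure detection itself happens elsewhere; the talk did not show their implementation, and `markFailedDevices` is a hypothetical helper. Devices reported as `Unhealthy` through the device plugin's `ListAndWatch` stream are removed from the node's allocatable GPU capacity.

```go
package gpuplugin

import (
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// markFailedDevices flips failed GPUs to Unhealthy in the device list that a
// device plugin advertises to the kubelet via ListAndWatch. The kubelet then
// drops them from the node's allocatable capacity, which is what makes the
// failure visible to the rest of the cluster.
func markFailedDevices(devices []*pluginapi.Device, failed map[string]bool) {
	for _, d := range devices {
		if failed[d.ID] {
			d.Health = pluginapi.Unhealthy
		} else {
			d.Health = pluginapi.Healthy
		}
	}
}
```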
Remediation approaches
- Reboot: Works every time, but has workload-related downsides -> Legit solution, but the drain can take very long
- Detection of remediation loops -> Too many reboots indicate something is not quite right
- Optimized drain: Prioritize draining nodes with failed devices over other maintenance (see the sketch after this list)
- The current workflow is: Reboot (automated) -> Power cycle (automated) -> Rebuild node (automated) -> Manual intervention / RMA
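
A rough sketch of what the "optimized drain" prioritization could look like, assuming the GPU problem detector reports failures as a node condition; the condition type `GpuFailure` and the helper names are made up for illustration, not taken from the talk.

```go
package maintenance

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// hasFailedGPU reports whether a node carries the (hypothetical) condition
// that the GPU problem detector would set when a device fails.
func hasFailedGPU(node corev1.Node) bool {
	for _, c := range node.Status.Conditions {
		if c.Type == "GpuFailure" && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// prioritizeDrain orders maintenance candidates so that nodes with failed
// devices are drained (and rebooted) before healthy ones.
func prioritizeDrain(nodes []corev1.Node) []corev1.Node {
	sort.SliceStable(nodes, func(i, j int) bool {
		return hasFailedGPU(nodes[i]) && !hasFailedGPU(nodes[j])
	})
	return nodes
}
```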
Prevention
Problems should not affect workloads
- Health checks with alerts (see the sketch after this list)
- Firmware & driver updates
- Thermal & power management
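
One way such a periodic health check could be implemented is by polling `nvidia-smi`; this is a sketch under that assumption, not their actual tooling. A non-responding `nvidia-smi` is itself a strong failure signal worth alerting on.

```go
package health

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// gpuTemperatures queries nvidia-smi for per-GPU temperatures. A reading above
// a chosen threshold, or a failing command, would be surfaced as an alert.
func gpuTemperatures() ([]int, error) {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=temperature.gpu",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		// nvidia-smi not responding often means the driver or device is gone.
		return nil, fmt.Errorf("nvidia-smi failed: %w", err)
	}
	var temps []int
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		t, err := strconv.Atoi(strings.TrimSpace(line))
		if err != nil {
			return nil, err
		}
		temps = append(temps, t)
	}
	return temps, nil
}
```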
Future Challenges
- What if a high-density node with 8 GPUs has a single failure?
- What is an acceptable ratio of working to broken GPUs per node?
- If a problematic node has to be rebooted every couple of days, should the scheduler avoid that node?
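
One possible answer to that last question (not something the talk committed to) is a soft taint: a `PreferNoSchedule` taint steers new workloads away from a flaky node without evicting what is already running. A sketch, with a made-up taint key:

```go
package remediation

import (
	corev1 "k8s.io/api/core/v1"
)

// addFlakyTaint adds a PreferNoSchedule taint so the scheduler prefers other
// nodes for new workloads while existing pods keep running. The key is a
// placeholder chosen for illustration.
func addFlakyTaint(node *corev1.Node) {
	taint := corev1.Taint{
		Key:    "example.com/flaky-gpu-node",
		Effect: corev1.TaintEffectPreferNoSchedule,
	}
	for _, t := range node.Spec.Taints {
		if t.Key == taint.Key {
			return // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, taint)
}
```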
Q&A
- Are there any plans to open-source the GPU problem detection? We could certainly do it, but it is not on the roadmap right now
- Are the failure rates representative and what counts as a failure?
  - A failure is not being able to run a workload on a node (could be a hardware or driver failure)
  - The failure rate is 0.6%, but the affected capacity is 1.2%: with 2 GPUs per node, one failed GPU takes the whole node out of service, roughly doubling the impact