diff --git a/content/day1/04_gpus-go-round.md b/content/day1/04_gpus-go-round.md
new file mode 100644
index 0000000..917212c
--- /dev/null
+++ b/content/day1/04_gpus-go-round.md
@@ -0,0 +1,56 @@
+---
+title: The GPUs on the bus go round and round
+weight: 4
+tags:
+  - kubecon
+  - gpu
+  - nvidia
+---
+
+## Background
+
+- The speakers are the GeForce NOW folks
+- They run a large fleet of clusters all over the world (60,000+ GPUs)
+- They use KubeVirt to pass GPUs through to VMs (via the vfio driver) or to provide vGPUs
+- Devices fail from time to time
+- Some failures require a restart to recover
+
+## Failure discovery
+
+- Goal: Maintain capacity
+- Failure reasons: Overheating, insufficient power, driver issues, hardware faults, ...
+- Problem: They used to notice failures only when capacity dropped or when switching drivers failed
+- Fix: First detect the failure, then remediate
+  - A GPU problem detector as part of their internal device plugin (a minimal sketch of such a check is at the end of these notes)
+  - Node Problem Detector -> triggers remediation through maintenance
+
+## Remediation approaches
+
+- Reboot: Works every time, but has workload-related downsides -> a legitimate solution, although draining can take very long
+- Detection of remediation loops -> too many reboots indicate that something is not quite right
+- Optimized drain: Prioritize draining nodes with failed devices ahead of other maintenance
+- The current workflow is: Reboot (automated) -> Power cycle (automated) -> Rebuild node (automated) -> Manual intervention / RMA
+
+## Prevention
+
+> Problems should not affect workloads
+
+- Health checks with alerts
+- Firmware & driver updates
+- Thermal & power management
+
+## Future Challenges
+
+- What if a high-density node with 8 GPUs has a single GPU failure?
+- What is an acceptable ratio of working to broken GPUs per node?
+- If a problematic node has to be rebooted every couple of days, should the scheduler avoid that node?
+
+## Q&A
+
+- Are there any plans to open-source the GPU problem detection? We could certainly do it, but it is not on the roadmap right now
+- Are the failure rates representative, and what counts as a failure?
+  - A failure means not being able to run a workload on a node (could be a hardware or a driver failure)
+  - The failure rate is 0.6%, but the affected capacity is 1.2%: with 2 GPUs per node, one broken GPU takes the whole node out of service, doubling the impact
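+
+## Appendix: Illustrative GPU health check
+
+The talk did not show any code; the following is a minimal, hypothetical sketch (referenced from the "Failure discovery" section above) of what a GPU health probe feeding into node-problem-detector could look like. The expected device count, the `nvidia-smi` query fields, and the overall structure are assumptions for illustration, not the speakers' actual implementation. node-problem-detector's custom-plugin monitor runs such a binary periodically, treating exit code 0 as healthy and 1 as a detected problem, with the condition message taken from stdout.
+
+```go
+// gpucheck: hypothetical node-problem-detector custom plugin that reports
+// a problem when the driver cannot enumerate the expected number of GPUs.
+// This is an illustrative sketch, not the implementation described in the talk.
+package main
+
+import (
+	"fmt"
+	"os"
+	"os/exec"
+	"strings"
+)
+
+// expectedGPUs is an assumed per-node device count (the talk mentions
+// nodes with 2 GPUs); adjust to the actual hardware.
+const expectedGPUs = 2
+
+func main() {
+	// Ask the driver which GPUs it can currently see.
+	out, err := exec.Command(
+		"nvidia-smi",
+		"--query-gpu=index,uuid,temperature.gpu",
+		"--format=csv,noheader",
+	).CombinedOutput()
+	if err != nil {
+		// nvidia-smi failing outright usually points at a driver or device problem.
+		fmt.Printf("nvidia-smi failed: %v: %s", err, strings.TrimSpace(string(out)))
+		os.Exit(1)
+	}
+
+	// One CSV line per visible GPU.
+	trimmed := strings.TrimSpace(string(out))
+	var gpus []string
+	if trimmed != "" {
+		gpus = strings.Split(trimmed, "\n")
+	}
+
+	if len(gpus) < expectedGPUs {
+		// Fewer devices than expected: a GPU has fallen off the bus.
+		fmt.Printf("only %d of %d GPUs visible", len(gpus), expectedGPUs)
+		os.Exit(1)
+	}
+
+	fmt.Printf("%d GPUs visible and responding", len(gpus))
+	os.Exit(0)
+}
+```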