---
title: The GPUs on the bus go round and round
weight: 4
tags:
  - kubecon
  - gpu
  - nvidia
---

## Background

- They are the GeForce NOW folks
- Large fleet of clusters all over the world (60,000+ GPUs)
- They use KubeVirt to pass through GPUs (VFIO driver) or vGPUs
- Devices fail from time to time
- Sometimes failures need restarts

## Failure discovery

- Goal: Maintain capacity
- Failure reasons: Overheating, insufficient power, driver issues, hardware faults, ...
- Problem: They only detected failures indirectly, by noticing capacity decreasing or driver switches failing
- Fix: First detect the failure, then remediate
- GPU problem detector as part of their internal device plugin (a minimal stand-in sketch is at the end of these notes)
- Node Problem Detector -> triggers remediation through maintenance

## Remediation approaches

- Reboot: Works every time, but has workload-related downsides -> a legitimate solution, but draining can take very long
- Discovery of remediation loops -> Too many reboots indicate that something is not quite right
- Optimized drain: Prioritize draining nodes with failed devices ahead of other maintenance (see the drain-prioritization sketch at the end of these notes)
- The current workflow is: Reboot (automated) -> Power cycle (automated) -> Rebuild node (automated) -> Manual intervention / RMA

## Prevention

> Problems should not affect workloads

- Health checks with alerts
- Firmware & driver updates
- Thermal & power management

## Future Challenges

- What if a high-density node with 8 GPUs has a single failure?
- What is an acceptable ratio of working to broken GPUs per node?
- If a problematic node has to be rebooted every couple of days, should the scheduler avoid that node?

## Q&A

- Are there any plans to open-source the GPU problem detection? We could certainly do it, but it is not on the roadmap right now
- Are the failure rates representative, and what counts as a failure?
  - A failure means not being able to run a workload on a node (could be a hardware or driver fault)
  - The failure rate is 0.6%, but the affected capacity is 1.2% (with 2 GPUs per node)
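
Their GPU problem detector is internal and not public, so the following is only a minimal sketch of the pattern in the style of a node-problem-detector custom-plugin check (exit 0 for healthy, non-zero for a detected problem, stdout as the condition message). The `EXPECTED_GPU_COUNT` environment variable is a hypothetical parameter, not something from the talk.

```go
// gpu-check: a minimal GPU health probe in the style of a node-problem-detector
// custom plugin. Exit code 0 means healthy, non-zero means a problem was
// detected; stdout becomes the condition message.
//
// NOTE: illustrative sketch only, not NVIDIA's internal detector.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

func main() {
	// Expected GPU count for this node (hypothetical configuration knob);
	// the Q&A mentioned 2 GPUs per node, so default to that.
	expected, err := strconv.Atoi(os.Getenv("EXPECTED_GPU_COUNT"))
	if err != nil {
		expected = 2
	}

	// If nvidia-smi cannot enumerate the devices at all (driver wedged,
	// device fell off the bus), report the node as unhealthy.
	out, err := exec.Command("nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader").Output()
	if err != nil {
		fmt.Printf("nvidia-smi failed: %v\n", err)
		os.Exit(1)
	}

	// Count the GPUs the driver still sees and compare against expectation.
	seen := 0
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if strings.TrimSpace(line) != "" {
			seen++
		}
	}
	if seen < expected {
		fmt.Printf("only %d of %d GPUs visible\n", seen, expected)
		os.Exit(1)
	}

	fmt.Printf("all %d GPUs visible\n", seen)
	os.Exit(0)
}
```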
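
The talk did not show how the optimized drain is implemented, so this is only a sketch of one possible approach using client-go: cordon nodes whose GPU health condition is failing so the maintenance tooling drains them ahead of routine work. The `GpuUnhealthy` condition name and the kubeconfig path are assumptions for illustration, not anything presented in the session.

```go
// prioritize-drain: cordon nodes that report a failing GPU health condition so
// they get drained ahead of other maintenance. Sketch only; the condition type
// "GpuUnhealthy" is a hypothetical name a problem detector could set.
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const gpuUnhealthyCondition = "GpuUnhealthy" // hypothetical node condition type

// hasFailingGPU reports whether the node carries the GPU-unhealthy condition.
func hasFailingGPU(node corev1.Node) bool {
	for _, cond := range node.Status.Conditions {
		if string(cond.Type) == gpuUnhealthyCondition && cond.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// Cordon GPU-unhealthy nodes first; the actual eviction of workloads
	// (e.g. KubeVirt VMs) would be handled by the existing drain tooling.
	for _, node := range nodes.Items {
		if !hasFailingGPU(node) || node.Spec.Unschedulable {
			continue
		}
		node.Spec.Unschedulable = true
		if _, err := client.CoreV1().Nodes().Update(ctx, &node, metav1.UpdateOptions{}); err != nil {
			log.Printf("failed to cordon %s: %v", node.Name, err)
			continue
		}
		fmt.Printf("cordoned %s for prioritized drain\n", node.Name)
	}
}
```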