---
title: Scaling GPU Clusters without melting down
weight: 1
tags:
- ml
- nvidia
- ai
- apiserver
- go
- kubecon
---

{{% button href="https://youtu.be/dUfp3j1j-mg" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/50/Scaling%20GPU%20Clusters%20Without%20Melting%20Down%21%20%281%29.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}

## Baseline

- We need more and more GPUs -> the control plane needs to keep track of more objects
- Goal: Scale workers without scaling the control plane

## Current Problems

### Secret list calls go up and the control plane goes down

- Scenario: High number of list calls against large secrets
- Problem: OOM in the apiserver because of the cache
- Fix: API Priority & Fairness (only allow two concurrent list calls, queue the rest; see the conceptual sketch at the end of this page)
- Result: Fewer OOM crashes

### High memory usage until we restart the apiserver

- Scenario: The apiserver frees up to 40% of its memory utilization when restarted
- Main suspect: Garbage collection
- Idea: Tune GOGC (env var `GOGC`) -> they lowered the default of 100 to 50 (see the sketch at the end of this page)
- Result: Lower memory utilization and no more growth over time

### Large skew in memory utilization

- Scenario: Skew in memory utilization across apiserver pods
- Problem: If a pod with high utilization gets hit with a list call, that apiserver OOMs -> the LB redirects to the other two -> those OOM as well
- Observation: The LB in front of the apiserver pods also shows some skew -> explains the memory skew
- Root cause: The LB keeps long-lived TCP connections to the apiservers and balances based on connections, not requests
- Idea: Change the LB configuration -> not quite the right angle
- Fix: The `--goaway-chance` parameter in the apiserver - a random HTTP/2 `GOAWAY` frame is sent -> the connection is torn down gracefully and the client reconnects (see the sketch at the end of this page)

### Architectural mistakes

- Large number of secrets per workload -> list and encode/decode overhead
- No caching -> too many list calls

### Preview

- A bunch of SIG API Machinery improvements are planned

## The future

- The switch from custom NUMA-aware GPU device handling to DRA
- DRA is powerful enough to get rid of the custom NUMA components

### The stack

- Currently:
  - Control plane: APIServer, Controller Manager, Scheduler and a topology-aware scheduler
  - Worker: Device plugin, NFD topology updater
- Future:
  - Control plane: APIServer, Controller Manager, Scheduler
  - Worker: Device plugin

### Testing scaling

- Tool: KWOK (Kubernetes WithOut Kubelet) - used to simulate GPU workloads
- Environment: K8s 1.32, scaling from 0 to 4000 workloads
- Metrics:
  - Scheduling latency: The topology-aware scheduler was affected far more
  - Scheduler memory utilization: 30% of memory saved with DRA
  - APIServer memory: Another 20% of memory saved
- Result: They are confident that DRA will be stable and even save memory and CPU utilization
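
## Sketches

The APF fix above caps concurrent list calls and queues the rest. The real knobs are `FlowSchema`/`PriorityLevelConfiguration` objects on the apiserver; the snippet below is only a conceptual Go sketch of the same idea (a fixed number of in-flight requests, the rest waiting in a bounded queue, overflow rejected), not the apiserver implementation, and all names in it are made up.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// limiter allows at most `concurrency` requests in flight; further requests
// wait in a bounded queue and are rejected once the queue is full.
// Conceptually this is what API Priority & Fairness does per priority level.
type limiter struct {
	slots chan struct{} // execution slots
	queue chan struct{} // waiting slots
}

func newLimiter(concurrency, queueLength int) *limiter {
	return &limiter{
		slots: make(chan struct{}, concurrency),
		queue: make(chan struct{}, queueLength),
	}
}

func (l *limiter) Do(ctx context.Context, fn func()) error {
	// Try to enter the wait queue; reject immediately if it is full
	// (APF answers with 429 in that case).
	select {
	case l.queue <- struct{}{}:
	default:
		return errors.New("queue full, request rejected")
	}

	// Wait for an execution slot, then leave the queue.
	select {
	case l.slots <- struct{}{}:
		<-l.queue
	case <-ctx.Done():
		<-l.queue
		return ctx.Err()
	}
	defer func() { <-l.slots }()

	fn()
	return nil
}

func main() {
	// Allow two concurrent "list" calls, queue up to three more, reject the rest.
	l := newLimiter(2, 3)
	for i := 0; i < 8; i++ {
		go func(i int) {
			err := l.Do(context.Background(), func() {
				fmt.Println("handling list call", i)
				time.Sleep(100 * time.Millisecond)
			})
			if err != nil {
				fmt.Println("list call", i, "rejected:", err)
			}
		}(i)
	}
	time.Sleep(time.Second)
}
```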
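
The GOGC tuning above is just an environment variable on the apiserver container (`GOGC=50`), no code change required. For illustration, the same knob can be set from inside a Go process via `runtime/debug`; this sketch only shows the mechanism, it is not from the talk.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// GOGC=100 (the default) lets the heap grow to roughly twice the live set
	// before the next GC cycle; GOGC=50 triggers GC earlier, trading some CPU
	// for a smaller steady-state heap. Setting the GOGC env var on the
	// apiserver has the same effect as this call.
	old := debug.SetGCPercent(50)
	fmt.Printf("GC target changed from GOGC=%d to GOGC=50\n", old)
}
```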
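
The `--goaway-chance` fix above makes the apiserver occasionally send an HTTP/2 GOAWAY so long-lived connections get torn down and re-balanced across instances. Below is a minimal sketch of that idea in plain `net/http` (Go's HTTP/2 server turns a `Connection: close` response header into a GOAWAY and drains the connection); the handler and the 0.1 probability are made up for illustration, and the real flag takes much smaller values.

```go
package main

import (
	"fmt"
	"log"
	"math/rand"
	"net/http"
)

// withProbabilisticGoaway closes roughly `chance` of the connections after a
// response, mirroring the idea behind the apiserver's --goaway-chance flag:
// long-lived connections are occasionally dropped so clients reconnect and
// the load balancer can spread them out again.
func withProbabilisticGoaway(next http.Handler, chance float64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if rand.Float64() < chance {
			// Over HTTP/2 (which needs TLS with net/http), Go's server
			// translates this header into a GOAWAY frame and gracefully
			// drains the connection; over HTTP/1.1 it simply closes it.
			w.Header().Set("Connection", "close")
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	log.Fatal(http.ListenAndServe(":8080", withProbabilisticGoaway(api, 0.1)))
}
```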