---
title: Scaling GPU Clusters without melting down
weight: 1
tags:
  - ml
  - nvidia
  - ai
  - apiserver
  - go
  - kubecon
---

{{% button href="https://youtu.be/dUfp3j1j-mg" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} {{% button href="https://static.sched.com/hosted_files/kccnceu2025/50/Scaling%20GPU%20Clusters%20Without%20Melting%20Down%21%20%281%29.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}

## Baseline

- We need more and more GPUs -> the control plane needs to keep track of more and more objects
- Goal: Scale workers without scaling the control plane

## Current Problems

### Secret list calls go up and control plane goes down

- Scenario: High number of list calls returning large secrets
- Problem: The apiserver OOMs because of its cache
- Fix: API Priority & Fairness (only allow two concurrent list calls, queue the rest) - see the sketch below
- Result: Decreased number of OOM crashes
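
A minimal sketch of what such an API Priority & Fairness setup could look like. APF expresses limits as concurrency shares of the apiserver's total budget rather than an absolute request count, so the object names, share value, and queue sizes below are illustrative assumptions; only the "throttle the secret list calls" idea comes from the talk.

```yaml
# Hypothetical priority level: small concurrency budget, queue the rest.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: throttled-secret-lists   # assumed name
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 2  # illustrative; shares, not an absolute "2 requests"
    limitResponse:
      type: Queue
      queuing:
        queues: 8
        handSize: 4
        queueLengthLimit: 50
---
# FlowSchema that routes secret list calls into the priority level above.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: throttle-secret-lists    # assumed name
spec:
  priorityLevelConfiguration:
    name: throttled-secret-lists
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: Group
          group:
            name: system:authenticated
      resourceRules:
        - verbs: ["list"]
          apiGroups: [""]
          resources: ["secrets"]
          namespaces: ["*"]
```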

### High memory usage until we restart the apiserver

- Scenario: The apiserver frees up to 40% of its memory utilization when restarted
- Main suspect: Go garbage collection
- Idea: Tune GOGC (the `GOGC` env var) -> they lowered it from the default of 100 to 50
- Result: Decreased memory utilization and no more growth over time
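
A minimal sketch of setting this on a static-pod kube-apiserver, assuming the manifest lives in the usual `/etc/kubernetes/manifests` location; the value of 50 is from the talk, everything else is illustrative.

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt, illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
    - name: kube-apiserver
      image: registry.k8s.io/kube-apiserver:v1.32.0
      command:
        - kube-apiserver
        # ... usual flags ...
      env:
        # Run the Go garbage collector more aggressively: trigger a cycle once
        # the heap has grown 50% over the live set instead of the default 100%.
        - name: GOGC
          value: "50"
```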

### Large skew in memory utilization

- Scenario: Skew in memory utilization across the apiserver pods
- Problem: If a pod with high utilization gets hit with a list call, that apiserver OOMs -> the LB redirects traffic to the other two -> those OOM as well
- Observation: The LB in front of the apiserver pods also shows some skew -> explains the memory skew
- Root cause: The LB holds long-lived TCP connections to the apiservers and balances based on connections, not requests
- Idea: Switch up the LB configuration -> not quite the right angle
- Fix: The `--goaway-chance` flag on the apiserver - a random HTTP/2 GOAWAY gets sent, tearing the connection down gracefully so the client reconnects and gets rebalanced - see the sketch below
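
A minimal sketch of enabling this; `--goaway-chance` is a real kube-apiserver flag, but the 0.001 value below is an illustrative assumption, not a number from the talk.

```yaml
# kube-apiserver static pod manifest (excerpt, illustrative)
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        # ... usual flags ...
        # Send an HTTP/2 GOAWAY to roughly 1 in 1000 requests so clients
        # periodically reconnect and the LB can spread new connections evenly.
        - --goaway-chance=0.001
```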

## Architectural mistakes

- Large number of secrets per workload -> list and encode/decode overhead
- No caching -> too many list calls

## Preview

- There are a bunch of SIG API Machinery improvements planned

## The future

- The switch from NUMA-aware GPU devices to DRA (Dynamic Resource Allocation) - see the sketch below
- DRA is powerful enough to get rid of the custom NUMA stuff
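
A rough sketch of what a DRA-based GPU request could look like with the `resource.k8s.io/v1beta1` API in Kubernetes 1.32; the class/claim names, driver name, and CEL selector are assumptions, and field names may differ in other API versions.

```yaml
# Hypothetical DeviceClass published by a GPU DRA driver.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: example-gpu
spec:
  selectors:
    - cel:
        expression: device.driver == "gpu.example.com"
---
# Claim template requesting one device of that class per pod.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: example-gpu
---
# Pod consuming the claim instead of an extended resource from a device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: main
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        claims:
          - name: gpu
```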

## The stack

- Currently:
  - Control plane: apiserver, controller manager, scheduler and topology-aware scheduler
  - Worker: device plugin, NFD topology updater
- Future:
  - Control plane: apiserver, controller manager, scheduler
  - Worker: device plugin

## Testing scaling

- Tool: KWOK (Kubernetes WithOut Kubelet) - used to simulate GPU workloads (see the fake node sketch after this list)
- Env: Kubernetes 1.32, scaling from 0 to 4000 workloads
- Metrics:
  - Scheduling latency: the topology-aware setup was way more latency-affected
  - Scheduler memory utilization: 30% of memory saved with DRA
  - Apiserver memory: another 20% of memory saved
- Result: They are confident that DRA will be stable and even save memory and CPU utilization
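
A minimal sketch of a KWOK-managed fake GPU node; the node name, labels, and the advertised `nvidia.com/gpu` capacity are assumptions made for illustration, while the `kwok.x-k8s.io/node: fake` annotation is what tells the KWOK controller to manage the node.

```yaml
# Fake node: KWOK marks it Ready and "runs" pods scheduled onto it
# without any real kubelet or GPU behind it.
apiVersion: v1
kind: Node
metadata:
  name: kwok-gpu-node-0
  annotations:
    kwok.x-k8s.io/node: fake
  labels:
    type: kwok
status:
  capacity:
    cpu: "64"
    memory: 512Gi
    pods: "110"
    nvidia.com/gpu: "8"   # assumed fake GPU capacity
  allocatable:
    cpu: "64"
    memory: 512Gi
    pods: "110"
    nvidia.com/gpu: "8"
```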