From bd7d9fe87d4b6f675acc28d0939448caed725926 Mon Sep 17 00:00:00 2001
From: Nicolai Ort
Date: Wed, 2 Apr 2025 12:44:09 +0200
Subject: [PATCH] docs(day1): First talk

---
 content/day1/01_scaling-gpu.md | 75 ++++++++++++++++++++++++++++++++++
 content/day1/_index.md         |  2 +-
 2 files changed, 76 insertions(+), 1 deletion(-)
 create mode 100644 content/day1/01_scaling-gpu.md

diff --git a/content/day1/01_scaling-gpu.md b/content/day1/01_scaling-gpu.md
new file mode 100644
index 0000000..fab9b41
--- /dev/null
+++ b/content/day1/01_scaling-gpu.md
@@ -0,0 +1,75 @@
+---
+title: Scaling GPU Clusters without melting down
+weight: 1
+tags:
+  - ml
+  - nvidia
+  - ai
+  - apiserver
+  - go
+---

## Baseline

- We need more and more GPUs -> the control plane needs to keep track of more and more objects
- Goal: scale the workers without scaling the control plane

## Current Problems

### Secret list calls go up and the control plane goes down

- Scenario: a high number of list calls against large secrets
- Problem: the apiserver OOMs because of its cache
- Fix: API Priority & Fairness - only allow two concurrent list calls and queue the rest (config sketch below)
- Result: decreased number of OOM crashes

### High memory usage until we restart the apiserver

- Scenario: the apiserver frees up to 40% of its memory utilization when restarted
- Main suspect: garbage collection
- Idea: tune GOGC (env var `GOGC`) -> they lowered the default of 100 to 50 (manifest sketch below)
- Result: lower memory utilization and no more growth over time

### Large skew in memory utilization

- Scenario: memory utilization skews between the apiserver pods
- Problem: if a pod with high utilization gets hit by a list call, that apiserver OOMs -> the LB redirects to the other two -> those OOM as well
- Observation: the LB in front of the apiserver pods also shows some skew -> explains the memory skew
- Root cause: the LB keeps long-lived TCP connections to the servers and balances on connections, not requests
- Idea: switch up the LB configuration -> not quite the right angle
- Fix: the `--goaway-chance` parameter on the apiserver - a random HTTP/2 `GOAWAY` is sent, tearing the connection down gracefully so the client reconnects (same manifest sketch below)

### Architectural mistakes

- Large number of secrets per workload -> list and encode/decode overhead
- No caching -> too many list calls

### Preview

- There are a bunch of SIG API Machinery improvements planned

## The future

- The switch from NUMA-aware GPU device plugins to DRA
- DRA is powerful enough to get rid of the custom NUMA handling (claim sketch below)

### The stack

- Currently:
  - Control plane: apiserver, controller manager, scheduler and a topology-aware scheduler
  - Worker: device plugin, NFD topology updater
- Future:
  - Control plane: apiserver, controller manager, scheduler
  - Worker: device plugin

### Testing scaling

- Tool: KWOK (Kubernetes WithOut Kubelet) - used to simulate the GPU workloads
- Env: K8s 1.32, scaling from 0 to 4000 workloads
- Metrics:
  - Scheduling latency: the topology-aware setup was far more latency-affected
  - Scheduler memory utilization: 30% of memory saved with DRA
  - Apiserver memory: another 20% of memory saved
- Result: they are confident that DRA will be stable and even save memory and CPU

diff --git a/content/day1/_index.md b/content/day1/_index.md
index 47741c4..15bfc9b 100644
--- a/content/day1/_index.md
+++ b/content/day1/_index.md
@@ -4,7 +4,7 @@
 title: Day 1
 weight: 5
 ---

-TODO:
+Day 1 of the main KubeCon event started with a bunch of keynotes from the CNCF themselves (announcing the next )

 ## Talk recommendations
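
For reference, an API Priority & Fairness setup along the lines of the first fix could look roughly like this. It is a sketch rather than the talk's actual config: the object names are made up, and `nominalConcurrencyShares` is a relative share of the apiserver's concurrency budget, not a hard cap of exactly two requests.

```yaml
# Hypothetical example - dedicate a small, queued priority level to secret lists.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: secret-lists
spec:
  type: Limited
  limited:
    # Small relative share of the apiserver's total concurrency.
    nominalConcurrencyShares: 2
    limitResponse:
      type: Queue
      queuing:
        queues: 8
        handSize: 2
        queueLengthLimit: 50
---
# Route list calls on secrets into that priority level.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: limit-secret-lists
spec:
  priorityLevelConfiguration:
    name: secret-lists
  matchingPrecedence: 500
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: Group
      group:
        name: system:authenticated
    resourceRules:
    - verbs: ["list"]
      apiGroups: [""]
      resources: ["secrets"]
      namespaces: ["*"]
```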
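
The GOGC tuning and the GOAWAY fix both boil down to small changes on the kube-apiserver itself. A minimal sketch, assuming a kubeadm-style static pod manifest; the image tag and the `0.001` chance are placeholder values, not the ones from the talk:

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (fragment)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    image: registry.k8s.io/kube-apiserver:v1.32.0
    command:
    - kube-apiserver
    # Send an HTTP/2 GOAWAY to a random ~0.1% of requests so long-lived
    # connections get torn down and re-balanced across apiserver replicas.
    - --goaway-chance=0.001
    # ... all other apiserver flags ...
    env:
    # Run the Go garbage collector more aggressively than the default of 100.
    - name: GOGC
      value: "50"
```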
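
And for the DRA-based stack of the future, this is roughly what claiming a GPU looks like with the `resource.k8s.io/v1beta1` API in Kubernetes 1.32. Again a sketch: the `gpu.example.com` device class and the image are assumptions; in practice the device class is published by the vendor's DRA driver.

```yaml
# Hypothetical example - claim one GPU per pod via a ResourceClaimTemplate.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com   # provided by the DRA driver
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: app
    image: registry.example.com/gpu-app:latest   # placeholder image
    resources:
      claims:
      - name: gpu
```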