docs(day1): First talk

2025-04-02 12:44:09 +02:00
parent f4858d81a8
commit bd7d9fe87d
2 changed files with 76 additions and 1 deletions
--- a/content/day1/01_scaling-gpu.md
+++ b/content/day1/01_scaling-gpu.md
@@ -0,0 +1,75 @@
+---
+title: Scaling GPU Clusters without melting down
+weight: 1
+tags:
+ - ml
+ - nvidia
+ - ai
+ - apiserver
+ - go
+---
+
+<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
+
+## Baseline
+
+- We need more and more GPUs -> The control plane has to keep track of more objects
+- Goal: Scale workers without scaling the control plane
+
+## Current Problems
+
+### Secret list calls go up and control plane goes down
+
+- Scenario: High number of list calls against large secrets
+- Problem: The apiserver runs out of memory (OOM) because of its cache
+- Fix: API Priority & Fairness (only allow two concurrent list calls, queue the rest)
+- Result: Fewer OOM crashes
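+
+The "two concurrent list calls, queue the rest" idea can be sketched with a counting semaphore. This is only an illustration of the concurrency cap, not how API Priority & Fairness is implemented (APF adds flow schemas, priority levels and fair queuing on top); `listLimiter` and `runLists` are hypothetical names.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync"
+	"sync/atomic"
+)
+
+// listLimiter is a hypothetical helper (not part of Kubernetes) that caps
+// how many list calls may run at the same time.
+type listLimiter struct {
+	slots chan struct{}
+}
+
+func newListLimiter(maxConcurrent int) *listLimiter {
+	return &listLimiter{slots: make(chan struct{}, maxConcurrent)}
+}
+
+// do blocks (queues) until a slot is free, runs fn, then releases the slot.
+func (l *listLimiter) do(fn func()) {
+	l.slots <- struct{}{}
+	defer func() { <-l.slots }()
+	fn()
+}
+
+// runLists sends n simulated list calls through the limiter and returns the
+// highest number of calls that were in flight at the same time.
+func runLists(n, maxConcurrent int) int32 {
+	limiter := newListLimiter(maxConcurrent)
+	var inFlight, peak int32
+	var wg sync.WaitGroup
+	for i := 0; i < n; i++ {
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			limiter.do(func() {
+				cur := atomic.AddInt32(&inFlight, 1)
+				// Record a new peak if this call raised it.
+				for {
+					old := atomic.LoadInt32(&peak)
+					if cur <= old || atomic.CompareAndSwapInt32(&peak, old, cur) {
+						break
+					}
+				}
+				atomic.AddInt32(&inFlight, -1)
+			})
+		}()
+	}
+	wg.Wait()
+	return peak
+}
+
+func main() {
+	fmt.Println("peak concurrent list calls:", runLists(20, 2))
+}
+```
+
+The peak never exceeds the configured limit, the remaining callers simply wait for a free slot.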
+
+### High memory usage until we restart the apiserver
+
+- Scenario: The API server frees up to 40% of its memory utilization when restarted
+- Main suspect: Garbage collection
+- Idea: Tune the garbage collector (env var `GOGC`) -> They lowered it from the default of 100 to 50
+- Result: Lower memory utilization, and utilization no longer grows over time
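+
+For reference, `GOGC=50` tells the Go runtime to start a collection once the heap has grown by 50% over the live heap left after the previous cycle (instead of 100%). For the apiserver this would be set through the environment of the process. The sketch below shows the in-process equivalent, `debug.SetGCPercent`; `setGCTarget` is a hypothetical wrapper.
+
+```go
+package main
+
+import (
+	"fmt"
+	"runtime/debug"
+)
+
+// setGCTarget applies a new GC percentage and returns the previous setting.
+// Calling it with 50 has the same effect as starting the process with GOGC=50.
+func setGCTarget(percent int) int {
+	return debug.SetGCPercent(percent)
+}
+
+func main() {
+	prev := setGCTarget(50)
+	fmt.Println("previous GC percent:", prev)
+}
+```
+
+A lower value trades more frequent GC cycles (more CPU) for a smaller heap.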
+
+### Large skew in memory utilization
+
+- Scenario: Skew in memory utilization across the API server pods
+- Problem: If a pod with high utilization gets hit with a list call, that API server OOMs -> The LB redirects traffic to the other two -> Those OOM as well
+- Observation: The LB in front of the API server pods also shows some skew -> Explains the memory skew
+- Root cause: The LB holds long-lived TCP connections to the servers and balances per connection, not per request
+- Idea: Change the LB configuration -> Not quite the right angle
+- Fix: `--goaway-chance` parameter of the apiserver: an HTTP/2 `GOAWAY` frame gets sent at random -> The connection is torn down gracefully and re-created
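+
+A minimal sketch of the GOAWAY idea as an HTTP middleware: for a small random fraction of HTTP/2 requests the handler sets `Connection: close`, which Go's HTTP/2 server answers with a `GOAWAY` frame, so the client reconnects and the LB gets a chance to pick another backend. This mirrors the general idea only, it is not the apiserver's code; `withProbabilisticGoaway`, the request path and the chance value are assumptions for illustration.
+
+```go
+package main
+
+import (
+	"fmt"
+	"math/rand"
+	"net/http"
+	"net/http/httptest"
+)
+
+// withProbabilisticGoaway (hypothetical name) wraps next. For HTTP/2 requests
+// it marks the response with "Connection: close" with probability chance.
+// random is injectable so the behaviour can be tested deterministically.
+func withProbabilisticGoaway(next http.Handler, chance float64, random func() float64) http.Handler {
+	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		if r.ProtoMajor == 2 && random() < chance {
+			w.Header().Set("Connection", "close")
+		}
+		next.ServeHTTP(w, r)
+	})
+}
+
+func main() {
+	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		fmt.Fprint(w, "ok")
+	})
+	h := withProbabilisticGoaway(ok, 0.001, rand.Float64)
+
+	// Simulate HTTP/2 requests in-process; no real connection is involved.
+	req := httptest.NewRequest(http.MethodGet, "/api/v1/secrets", nil)
+	req.Proto, req.ProtoMajor, req.ProtoMinor = "HTTP/2.0", 2, 0
+
+	marked := 0
+	for i := 0; i < 100000; i++ {
+		rec := httptest.NewRecorder()
+		h.ServeHTTP(rec, req)
+		if rec.Header().Get("Connection") == "close" {
+			marked++
+		}
+	}
+	fmt.Println("responses marked for GOAWAY:", marked)
+}
+```
+
+Because only a small share of requests triggers a reconnect, connections get redistributed over time without a noticeable reconnect storm.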
+
+### Architectural mistakes
+
+- Large number of secrets per workload -> List, Encode/Decode overhead
+- No caching -> Too many list calls
+
+### Preview
+
+- There are a bunch of SIG API Machinery improvements planned
+
+## The future
+
+- The switch from NUMA GPU devices to DRA (Dynamic Resource Allocation)
+- DRA is powerful enough to get rid of the custom NUMA stuff
+
+### The stack
+
+- Currently:
+    - CP: API server, controller manager, scheduler and topology-aware scheduler
+    - Worker: Device plugin, NFD topology updater
+- Future:
+    - CP: API server, controller manager, scheduler
+    - Worker: Device plugin
+
+### Testing scaling
+
+- Tool: KWOK (Kubernetes WithOut Kubelet) - used to simulate GPU workloads
+- Env: Kubernetes 1.32 with scaling from 0 to 4000 workloads
+- Metrics:
+    - Scheduling latency: The topology-aware scheduler showed considerably higher latency
+    - Scheduler memory utilization: 30% of memory saved with DRA
+    - API server memory: Another 20% of memory saved
+- Result: They are confident that DRA will be stable and even save memory and CPU utilization
--- a/content/day1/_index.md
+++ b/content/day1/_index.md
@@ -4,7 +4,7 @@ title: Day 1
 weight: 5
 ---

-TODO:
+Day 1 of the main KubeCon event started with a bunch of keynotes from the CNCF themselves (announcing the next )

 ## Talk recommendations