docs(day1): First talk
All checks were successful
Build latest image / build-container (push) Successful in 47s
This commit is contained in:
parent f4858d81a8
commit bd7d9fe87d

content/day1/01_scaling-gpu.md (new file, 75 lines added)
@@ -0,0 +1,75 @@
---
title: Scaling GPU Clusters without melting down
weight: 1
tags:
- ml
- nvidia
- ai
- apiserver
- go
---

<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->

## Baseline

- We need more and more GPUs -> The control plane needs to keep track of more objects
- Goal: Scale workers without scaling the control plane

## Current Problems

### Secret list calls go up and control plane goes down

- Scenario: High number of list calls with large secrets
- Problem: The apiserver OOMs because of its cache
- Fix: API Priority & Fairness (only allow two concurrent list calls, queue the rest) - see the sketch after this list
- Result: Decreased number of OOM crashes
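
Not from the talk itself, just a minimal Go sketch of the idea behind the fix: cap how many expensive LIST calls are handled concurrently and queue the rest. The real mechanism is the apiserver's API Priority & Fairness feature (configured through FlowSchema/PriorityLevelConfiguration objects); the handler, path, and the hard limit of two below are made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// listSemaphore caps how many LIST requests are handled at once. Two slots,
// mirroring the "only allow two concurrent list calls" idea from the talk;
// everything else waits in line instead of blowing up the apiserver's memory.
var listSemaphore = make(chan struct{}, 2)

// limitLists blocks incoming requests until one of the slots is free.
func limitLists(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		listSemaphore <- struct{}{}        // acquire a slot (blocks while both are taken)
		defer func() { <-listSemaphore }() // release it once the response is written
		next.ServeHTTP(w, r)
	})
}

func main() {
	expensiveList := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "pretend this is a huge list of secrets")
	})
	http.Handle("/api/v1/secrets", limitLists(expensiveList))
	http.ListenAndServe(":8080", nil)
}
```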

### High memory usage until we restart the apiserver

- Scenario: The apiserver frees up to 40% of its memory utilization when restarted
- Main suspect: Garbage collection
- Idea: Tune GOGC (env var `GOGC`) -> They lowered the default of 100 to 50 (see the sketch after this list)
- Result: Decrease in memory utilization and no more growth over time
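
Not from the talk, just a small illustration of the knob: GOGC can be set as an environment variable or changed at runtime via `runtime/debug`, and a lower value makes the Go garbage collector run earlier, trading some CPU for a smaller heap.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

func main() {
	// GOGC=100 (the default) lets the heap grow to roughly twice the live set
	// before the next GC cycle; GOGC=50 triggers collection after ~50% growth,
	// keeping memory lower at the cost of more frequent GC runs - the trade-off
	// described in the talk.
	fmt.Println("GOGC from environment:", os.Getenv("GOGC"))

	// The same knob can be flipped at runtime; SetGCPercent returns the old value.
	old := debug.SetGCPercent(50)
	fmt.Println("previous GC percent:", old)
}
```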

### Large skew in memory utilization

- Scenario: Skew in memory utilization across the api-server pods
- Problem: If a pod with high utilization gets hit with a list call, that api-server will OOM -> The LB redirects to the other 2 -> Those OOM as well
- Observation: The LB in front of the api-server pods also shows some skew -> Explains the skew
- Root cause: The LB keeps long-lived TCP connections to the servers and balances based on connections, not requests
- Idea: Switch up the LB configuration -> Not quite the right angle
- Fix: The `--goaway-chance` parameter of the apiserver - a random HTTP/2 `GOAWAY` frame gets sent, tearing down the connection gracefully so the client reconnects and gets rebalanced (sketch after this list)
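
A rough sketch (not the actual apiserver code) of what such a probabilistic GOAWAY filter can look like in Go: with a small probability the handler marks the response with `Connection: close`, which the HTTP/2 server layer turns into a graceful GOAWAY, so long-lived clients periodically reconnect through the load balancer. The chance value and handler are placeholders.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
)

// withProbabilisticGoaway asks the HTTP/2 layer to close the client connection
// after a response with probability `chance`. Clients then reconnect and can
// land on a different apiserver behind the load balancer.
func withProbabilisticGoaway(next http.Handler, chance float64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Proto == "HTTP/2.0" && rand.Float64() < chance {
			// Over HTTP/2, Go's server reacts to this header by sending a
			// GOAWAY frame and draining the connection gracefully.
			w.Header().Set("Connection", "close")
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	})
	// A tiny chance is enough to slowly re-shuffle long-lived connections.
	http.Handle("/", withProbabilisticGoaway(hello, 0.001))
	// Note: net/http only speaks HTTP/2 over TLS, so a real setup would use
	// ListenAndServeTLS; plain HTTP here just keeps the sketch short.
	http.ListenAndServe(":8080", nil)
}
```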

### Architectural mistakes

- Large number of secrets per workload -> List, encode/decode overhead
- No caching -> Too many list calls (see the informer sketch after this list)
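
Not something the talk showed, but the usual way to avoid repeated list calls from controllers is a client-go informer: it does one LIST plus a WATCH and serves all reads from a local cache. A minimal sketch, assuming in-cluster credentials and the `default` namespace as placeholders.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config; outside a cluster you would load a kubeconfig instead.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// One LIST + WATCH against the apiserver; the informer keeps a local cache
	// up to date and every read below is answered from that cache.
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	secretLister := factory.Core().V1().Secrets().Lister()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Served from the cache - no additional list call hits the apiserver.
	secrets, err := secretLister.Secrets("default").List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Println("secrets in cache:", len(secrets))
}
```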

### Preview

- There are a bunch of SIG API Machinery improvements planned

## The future

- The switch from NUMA-aware GPU devices to DRA
- DRA is powerful enough to get rid of the custom NUMA stuff

### The stack

- Currently:
  - CP: APIServer, Controller Manager, Scheduler and the topology-aware scheduler
  - Worker: Device Plugin, NFD topology updater
- Future:
  - CP: APIServer, Controller Manager, Scheduler
  - Worker: Device Plugin

### Testing scaling

- Tool: KWOK (Kubernetes WithOut Kubelet) - used to simulate GPU workloads (see the fake-node sketch after this list)
- Env: K8s 1.32, scaling from 0 to 4000 workloads
- Metrics:
  - Scheduling latency: The topology-aware scheduler was affected far more
  - Scheduler memory utilization: 30% of memory saved with DRA
  - API server memory: Another 20% of memory saved
- Result: They are confident that DRA will be stable and even save memory and CPU utilization
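
For context (not from the talk): KWOK takes over Node objects marked with its `kwok.x-k8s.io/node: fake` annotation, so a scale test can register thousands of fake GPU nodes without running a kubelet. Below is a sketch of registering one such node with client-go; the node name, labels, and capacity figures are made up, and a real test would template this out to thousands of nodes.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Use the local kubeconfig, e.g. the one kwokctl writes for its cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A fake node: the annotation tells KWOK to take over its lifecycle,
	// and the fake GPU capacity lets the scheduler place GPU workloads on it.
	node := &corev1.Node{
		ObjectMeta: metav1.ObjectMeta{
			Name:        "fake-gpu-node-0",
			Annotations: map[string]string{"kwok.x-k8s.io/node": "fake"},
			Labels:      map[string]string{"type": "kwok"},
		},
		Status: corev1.NodeStatus{
			Capacity: corev1.ResourceList{
				corev1.ResourceCPU:    resource.MustParse("64"),
				corev1.ResourceMemory: resource.MustParse("256Gi"),
				"nvidia.com/gpu":      resource.MustParse("8"),
			},
		},
	}

	created, err := client.CoreV1().Nodes().Create(context.TODO(), node, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("registered fake node:", created.Name)
}
```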

@@ -4,7 +4,7 @@ title: Day 1
weight: 5
---

TODO:
Day 1 of the main KubeCon event started with a bunch of keynotes from the CNCF themselves (announcing the next )

## Talk recommendations