---
title: Scaling GPU Clusters without melting down
weight: 1
tags:
  - ml
  - nvidia
  - ai
  - apiserver
  - go
  - kubecon
---

{{% button href="https://youtu.be/dUfp3j1j-mg" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} {{% button href="https://static.sched.com/hosted_files/kccnceu2025/50/Scaling%20GPU%20Clusters%20Without%20Melting%20Down%21%20%281%29.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}

## Baseline

- We need more and more GPUs -> the control plane needs to keep track of more and more objects
- Goal: Scale workers without scaling the control plane

## Current Problems

### Secret list calls go up and control plane goes down

- Scenario: High number of list calls returning large secrets
- Problem: The apiserver OOMs because of its cache
- Fix: API Priority & Fairness (only allow two concurrent list calls, queue the rest) - see the sketch below
- Result: Decreased number of OOM crashes
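
A minimal sketch of what such an API Priority & Fairness setup could look like. APF expresses limits as concurrency shares of the apiserver's total budget rather than an absolute request count, so the object names, share value, and queue sizes below are illustrative assumptions; only the "throttle the secret list calls" idea comes from the talk.

```yaml
# Hypothetical priority level: small concurrency budget, queue the rest.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: PriorityLevelConfiguration
metadata:
  name: throttled-secret-lists   # assumed name
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 2  # illustrative; shares, not an absolute "2 requests"
    limitResponse:
      type: Queue
      queuing:
        queues: 8
        handSize: 4
        queueLengthLimit: 50
---
# FlowSchema that routes secret list calls into the priority level above.
apiVersion: flowcontrol.apiserver.k8s.io/v1
kind: FlowSchema
metadata:
  name: throttle-secret-lists    # assumed name
spec:
  priorityLevelConfiguration:
    name: throttled-secret-lists
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: Group
          group:
            name: system:authenticated
      resourceRules:
        - verbs: ["list"]
          apiGroups: [""]
          resources: ["secrets"]
          namespaces: ["*"]
```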

### High memory usage until we restart the apiserver

- Scenario: The apiserver frees up to 40% of its memory utilization when restarted
- Main suspect: Go garbage collection
- Idea: Tune GOGC (the `GOGC` env var) -> they lowered it from the default of 100 to 50
- Result: Decreased memory utilization and no more growth over time
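
A minimal sketch of setting this on a static-pod kube-apiserver, assuming the manifest lives in the usual `/etc/kubernetes/manifests` location; the value of 50 is from the talk, everything else is illustrative.

```yaml
# /etc/kubernetes/manifests/kube-apiserver.yaml (excerpt, illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
    - name: kube-apiserver
      image: registry.k8s.io/kube-apiserver:v1.32.0
      command:
        - kube-apiserver
        # ... usual flags ...
      env:
        # Run the Go garbage collector more aggressively: trigger a cycle once
        # the heap has grown 50% over the live set instead of the default 100%.
        - name: GOGC
          value: "50"
```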

### Large skew in memory utilization

- Scenario: Skew in memory utilization across the apiserver pods
- Problem: If a pod with high utilization gets hit with a list call, that apiserver OOMs -> the LB redirects traffic to the other two -> those OOM as well
- Observation: The LB in front of the apiserver pods also shows some skew -> explains the memory skew
- Root cause: The LB holds long-lived TCP connections to the apiservers and balances based on connections, not requests
- Idea: Switch up the LB configuration -> not quite the right angle
- Fix: The `--goaway-chance` flag on the apiserver - a random HTTP/2 GOAWAY gets sent, tearing the connection down gracefully so the client reconnects and gets rebalanced - see the sketch below
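
A minimal sketch of enabling this; `--goaway-chance` is a real kube-apiserver flag, but the 0.001 value below is an illustrative assumption, not a number from the talk.

```yaml
# kube-apiserver static pod manifest (excerpt, illustrative)
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        # ... usual flags ...
        # Send an HTTP/2 GOAWAY to roughly 1 in 1000 requests so clients
        # periodically reconnect and the LB can spread new connections evenly.
        - --goaway-chance=0.001
```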

## Architectural mistakes

- Large number of secrets per workload -> list and encode/decode overhead
- No caching -> too many list calls

## Preview

- There are a bunch of SIG API Machinery improvements planned

## The future

- The switch from NUMA-aware GPU devices to DRA (Dynamic Resource Allocation) - see the sketch below
- DRA is powerful enough to get rid of the custom NUMA stuff
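
A rough sketch of what a DRA-based GPU request could look like with the `resource.k8s.io/v1beta1` API in Kubernetes 1.32; the class/claim names, driver name, and CEL selector are assumptions, and field names may differ in other API versions.

```yaml
# Hypothetical DeviceClass published by a GPU DRA driver.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: example-gpu
spec:
  selectors:
    - cel:
        expression: device.driver == "gpu.example.com"
---
# Claim template requesting one device of that class per pod.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
        - name: gpu
          deviceClassName: example-gpu
---
# Pod consuming the claim instead of an extended resource from a device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: main
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      resources:
        claims:
          - name: gpu
```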

## The stack

- Currently:
  - Control plane: apiserver, controller manager, scheduler and topology-aware scheduler
  - Worker: device plugin, NFD topology updater
- Future:
  - Control plane: apiserver, controller manager, scheduler
  - Worker: device plugin

## Testing scaling

- Tool: KWOK (Kubernetes WithOut Kubelet) - used to simulate GPU workloads (see the fake node sketch after this list)
- Env: Kubernetes 1.32, scaling from 0 to 4000 workloads
- Metrics:
  - Scheduling latency: the topology-aware setup was way more latency-affected
  - Scheduler memory utilization: 30% of memory saved with DRA
  - Apiserver memory: another 20% of memory saved
- Result: They are confident that DRA will be stable and even save memory and CPU utilization
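
A minimal sketch of a KWOK-managed fake GPU node; the node name, labels, and the advertised `nvidia.com/gpu` capacity are assumptions made for illustration, while the `kwok.x-k8s.io/node: fake` annotation is what tells the KWOK controller to manage the node.

```yaml
# Fake node: KWOK marks it Ready and "runs" pods scheduled onto it
# without any real kubelet or GPU behind it.
apiVersion: v1
kind: Node
metadata:
  name: kwok-gpu-node-0
  annotations:
    kwok.x-k8s.io/node: fake
  labels:
    type: kwok
status:
  capacity:
    cpu: "64"
    memory: 512Gi
    pods: "110"
    nvidia.com/gpu: "8"   # assumed fake GPU capacity
  allocatable:
    cpu: "64"
    memory: 512Gi
    pods: "110"
    nvidia.com/gpu: "8"
```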