| title | weight | tags |
|---|---|---|
| Your Cluster Isn't Flat: A First-Class API for Real-World Infrastructure Topology | 9 | |
By a Volcano maintainer from Huawei, a very wholesome guy. I don't know why the organizers tend to schedule these very technical talks, given by speakers with a somewhat stronger accent (totally understandable, but very quiet), near the end of the conference or day. I thank the Sakura Edition Red Bull for keeping my attention span up and running for the last two sessions of the day.
History of Volcano
- 2017: Kube-Batch open source
- 2019: Volcano Open Source
- 2020: CNCF Sandbox
- 2022: CNCF Incubation
- 2026: Road to graduation
Volcano feature overview
- Unified Scheduler
- Queue Management
- Workload Colocation
- Multi-cluster scheduling
- Heterogeneous Device Support
- Multiple Scheduling policies
Why topology awareness?
- Scenario 1: Bottlenecks in LLM training when jobs are not placed on GPUs that are close to each other
- Scenario 2: Inference runs as separate prefill and decode jobs on different hardware -> Short network hops needed
- Node labels can be used but are very limited
- Datacenter network architectures are heterogeneous -> Everyone can build in their own style
Scheduler notation mechanisms
- Label: Kueue, Koordinator, KAI Scheduler
  - Vendor-specific syntax
  - No hierarchy
  - Needs to be set manually
  - No health checks
  - Cloud-specific
- CRD (long term): Volcano
  - Standardized API (HyperNode)
  - Hierarchical (trees/zones)
  - Auto-discovery, plugin-ready (e.g. NVIDIA)
  - Health checks
  - Unified across clouds and on-prem
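
To make the "hierarchical" point a bit more concrete for myself: a minimal Python sketch of why a tree beats flat labels. It is not Volcano code. The topology, the names (`node-0`, `leaf-0`, `spine-0`) and the helper functions are all made up for illustration. The idea is that once every node hangs under a leaf and every leaf under a spine, a scheduler can ask "what is the lowest tier that contains this whole placement?" and prefer the smallest answer.

```python
# Hypothetical two-tier topology: child -> parent. All names are illustrative.
PARENT = {
    "node-0": "leaf-0",
    "node-1": "leaf-0",
    "node-2": "leaf-1",
    "node-3": "leaf-1",
    "leaf-0": "spine-0",
    "leaf-1": "spine-0",
}

# Tier of each grouping: 1 = closest to the nodes, higher = more network hops.
TIER = {"leaf-0": 1, "leaf-1": 1, "spine-0": 2}


def ancestors(name):
    """Return the ancestors of `name`, ordered from the closest up to the root."""
    chain = []
    while name in PARENT:
        name = PARENT[name]
        chain.append(name)
    return chain


def placement_tier(nodes):
    """Lowest tier whose grouping contains every node, or None if there is none."""
    common = None
    for node in nodes:
        chain = ancestors(node)
        # Keep only ancestors shared by all nodes seen so far (order preserved).
        common = chain if common is None else [a for a in common if a in chain]
    return TIER[common[0]] if common else None


if __name__ == "__main__":
    print(placement_tier(["node-0", "node-1"]))  # same leaf
    print(placement_tier(["node-0", "node-2"]))  # has to cross the spine
```

With plain node labels you would have to encode each level as its own label key and keep them consistent by hand. With a tree, the "how far apart are these?" question falls out of the structure.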
Architecture CRD Sample
TODO: Steal Leaf sample from slides
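
Until the slide sample is in here, a rough stand-in. This is not the sample from the talk. It is a sketch of a leaf (tier 1) HyperNode manifest built as a Python dict, using the field names as I remember them from the upstream Volcano docs (`topology.volcano.sh/v1alpha1`, `spec.tier`, `spec.members[].selector.exactMatch.name`). Treat all of those as assumptions and check them against the CRD version installed in your cluster. The helper `leaf_hypernode` and the node names are hypothetical.

```python
import json


def leaf_hypernode(name, node_names):
    """Build a tier-1 HyperNode manifest whose members are plain Kubernetes nodes.

    Field names are assumed from the upstream Volcano documentation and may
    differ between versions.
    """
    return {
        "apiVersion": "topology.volcano.sh/v1alpha1",  # assumed group/version
        "kind": "HyperNode",
        "metadata": {"name": name},
        "spec": {
            "tier": 1,  # leaf level: members are nodes, not other HyperNodes
            "members": [
                {"type": "Node", "selector": {"exactMatch": {"name": n}}}
                for n in node_names
            ],
        },
    }


if __name__ == "__main__":
    # JSON output can be piped into `kubectl apply -f -`.
    print(json.dumps(leaf_hypernode("leaf-0", ["node-0", "node-1"]), indent=2))
```

A higher tier would presumably look the same, with `tier: 2` and members of type `HyperNode` pointing at the leaves, which is what gives the tree its hierarchy.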
What's next
- GPU 3D Architectures (Internal interconnects, NUMA, external interconnects)
- DRA integration/collaboration
- Promotion of HyperNode to a first-class citizen -> Extraction from Volcano to be truly generic