diff --git a/content/day-2/09_clusternotflat.md b/content/day-2/09_clusternotflat.md
new file mode 100644
index 0000000..6028929
--- /dev/null
+++ b/content/day-2/09_clusternotflat.md
@@ -0,0 +1,63 @@
+---
+title: "Your Cluster Isn't Flat: A First-Class API for Real-World Infrastructure Topology"
+weight: 9
+tags:
+  - rejekts
+---
+
+
+
+
+
+By a Volcano maintainer from Huawei - a very wholesome guy.
+I don't know why the organizers always tend to schedule these very technical topics, given by speakers with a slightly stronger accent (totally understandable, but very quiet), near the end of the conference or day. I thank the Sakura Edition Red Bull for keeping my attention span up and running for the last two sessions of the day.
+
+## History of Volcano
+
+- 2017: Kube-Batch open source
+- 2019: Volcano open source
+- 2020: CNCF Sandbox
+- 2022: CNCF Incubation
+- 2026: Road to graduation
+
+## Volcano feature overview
+
+- Unified scheduler
+- Queue management
+- Workload colocation
+- Multi-cluster scheduling
+- Heterogeneous device support
+- Multiple scheduling policies
+
+## Why topology awareness?
+
+- Scenario 1: Bottlenecks in LLM training when jobs are not placed on GPUs that are close to each other
+- Scenario 2: Inference runs as separate prefill and decode jobs on different hardware -> short network hops needed
+- Node labels can be used but are very limited
+- Datacenter network architectures are heterogeneous -> everyone can build in their own style
+
+## Scheduler notation mechanisms
+
+- Label: Kueue, Koordinator, KAI Scheduler
+  - Vendor-specific syntax
+  - No hierarchy
+  - Needs to be set manually
+  - No healthchecks
+  - Cloud-specific
+- CRD (long term): Volcano
+  - Standardized API (HyperNode)
+  - Hierarchical (trees/zones)
+  - Auto-discovery - plugin-ready (e.g. NVIDIA)
+  - Healthchecks
+  - Unified across clouds and on-prem
+
+## Architecture CRD Sample
+
+TODO: Steal Leaf sample from slides
+
+
+## What's next
+
+- GPU 3D architectures (internal interconnects, NUMA, external interconnects)
+- DRA integration/collaboration
+- Promotion of HyperNode to a first-class citizen -> extraction from Volcano to be truly generic
diff --git a/content/day-2/_index.md b/content/day-2/_index.md
index 754e834..954fd40 100644
--- a/content/day-2/_index.md
+++ b/content/day-2/_index.md
@@ -22,4 +22,9 @@ I have to admit that I'm very bad with names and don't always regocnize people b
 
 ## Other stuff I learned or people i talk to
 
-- TODO:
\ No newline at end of file
+- Arik about deprecation of CNCF projects
+- Simon and Koray about demo prep for talks
+- Arik and Simon about the review process for conference talks
+- Nico
+- Stephan
+- A nice guy whose name I forgot (did I mention that I'm bad with names yet?) about the process of bleaching/dyeing my hair (he asked for a friend)
\ No newline at end of file