docs(day-2): Added multitenancy talk
63
content/day-2/09_clusternotflat.md
Normal file
@@ -0,0 +1,63 @@
---
title: "Your Cluster Isn't Flat: A First-Class API for Real-World Infrastructure Topology"
weight: 9
tags:
- rejekts
---

<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
<!-- {{% button href="https://github.com/JesseStutler" style="info" icon="code" %}}Code/Demo{{% /button %}} -->

By a Volcano maintainer from Huawei, a very wholesome guy.
I don't know why the organizers always tend to schedule these very technical topics, presented by people with a somewhat harder accent (totally understandable, but very quiet), near the end of the conference or day. I thank the Sakura Edition Red Bull for keeping my attention span up and running for the last two sessions of the day.

## History of Volcano

- 2017: Kube-Batch open source
- 2019: Volcano open source
- 2020: CNCF Sandbox
- 2022: CNCF Incubation
- 2026: Road to graduation

## Volcano feature overview

- Unified scheduler
- Queue management
- Workload colocation
- Multi-cluster scheduling
- Heterogeneous device support
- Multiple scheduling policies

## Why topology awareness?

- Scenario 1: Bottlenecks in LLM training when jobs are not placed on GPUs that are close to each other
- Scenario 2: Inference runs as separate prefill and decode jobs on different hardware -> Short network hops needed
- Node labels can be used but are very limited
- Datacenter network architectures are heterogeneous -> Everyone can build in their own style

## Scheduler notation mechanisms

- Label: Kueue, Koordinator, KAI Scheduler
  - Vendor-specific syntax
  - No hierarchy
  - Needs to be set manually
  - No healthchecks
  - Cloud-specific
- CRD (long term): Volcano
  - Standardized API (HyperNode)
  - Hierarchical (trees/zones)
  - Auto-discovery, plugin-ready (e.g. NVIDIA)
  - Healthchecks
  - Unified across clouds and on-prem

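To make the "no hierarchy" vs. "hierarchical" contrast a bit more concrete, here is a minimal Python sketch of my own (not from the talk or the slides). It is not the Volcano CRD schema: the `HyperNode` class, its fields, the `smallest_domain` helper, and all node names are hypothetical, and I am assuming that a lower tier means fewer network hops. The idea is that a tree lets a scheduler ask "what is the tightest network domain that holds all these nodes?", which a flat label cannot answer.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HyperNode:
    """Hypothetical stand-in for one level of a topology tree (not the real CRD)."""
    name: str
    tier: int  # assumption: 1 = closest to the hardware, higher = more hops
    nodes: list[str] = field(default_factory=list)          # worker nodes at leaf level
    children: list["HyperNode"] = field(default_factory=list)

    def all_nodes(self) -> set[str]:
        """Collect every worker node below this HyperNode."""
        found = set(self.nodes)
        for child in self.children:
            found |= child.all_nodes()
        return found


def smallest_domain(root: HyperNode, wanted: set[str]) -> Optional[HyperNode]:
    """Return the deepest HyperNode that contains every wanted node, or None."""
    if not wanted <= root.all_nodes():
        return None
    for child in root.children:
        hit = smallest_domain(child, wanted)
        if hit is not None:
            return hit
    return root


# Made-up two-tier topology: two leaf switches below one spine.
leaf_a = HyperNode("leaf-a", tier=1, nodes=["node-0", "node-1"])
leaf_b = HyperNode("leaf-b", tier=1, nodes=["node-2", "node-3"])
spine = HyperNode("spine-0", tier=2, children=[leaf_a, leaf_b])

print(smallest_domain(spine, {"node-0", "node-1"}).name)  # leaf-a
print(smallest_domain(spine, {"node-1", "node-2"}).name)  # spine-0
```

With flat labels, both placements would only look like "same label" or "different label"; the tree additionally shows that the second placement has to cross the tier-2 domain.
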
## Architecture CRD Sample

TODO: Steal Leaf sample from slides


## What's next

- GPU 3D architectures (internal interconnects, NUMA, external interconnects)
- DRA integration/collaboration
- Promotion of HyperNode to a first-class citizen -> Extraction from Volcano to be truly generic
@@ -22,4 +22,9 @@ I have to admit that I'm very bad with names and don't always regocnize people b

## Other stuff I learned or people I talked to

- TODO:
- Arik about deprecation of CNCF projects
- Simon and Koray about demo prep for talks
- Arik and Simon about the review process for conference talks
- Nico
- Stephan
- A nice guy whose name I forgot (did I mention that I'm bad with names yet?) about the process of bleaching/dyeing my hair (he asked for a friend)