docs(day-2): Added multitenancy talk

2026-03-21 16:50:20 +01:00
parent 9f9371bd71
commit 25aa419cc5
2 changed files with 69 additions and 1 deletions

@@ -0,0 +1,63 @@
---
title: "Your Cluster Isn't Flat: A First-Class API for Real-World Infrastructure Topology"
weight: 9
tags:
- rejekts
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
<!-- {{% button href="https://github.com/JesseStutler" style="info" icon="code" %}}Code/Demo{{% /button %}} -->
By a Volcano maintainer from Huawei, a very wholesome guy.
I don't know why the organizers tend to schedule these very technical topics, presented by people with a somewhat stronger accent (totally understandable, but very quiet), near the end of the conference or day. I thank the Sakura Edition Red Bull for keeping my attention span up and running for the last two sessions of the day.
## History of Volcano
- 2017: kube-batch open-sourced
- 2019: Volcano open-sourced
- 2020: CNCF Sandbox
- 2022: CNCF Incubation
- 2026: Road to graduation
## Volcano feature overview
- Unified Scheduler
- Queue Management
- Workload Colocation
- Multi-cluster scheduling
- Heterogeneous device support
- Multiple scheduling policies
## Why topology awareness?
- Scenario 1: Bottlenecks in LLM training when jobs are not placed on GPUs that are close to each other
- Scenario 2: Inference runs as separate prefill and decode jobs on different hardware -> short network hops needed
- Node labels can be used but are very limited
- Datacenter network architectures are heterogeneous -> everyone can build in their own style
## Scheduler notation mechanisms
- Label: Kueue, Koordinator, KAI Scheduler
  - Vendor-specific syntax
  - No hierarchy
  - Needs to be set manually
  - No health checks
  - Cloud-specific
- CRD (long term): Volcano
  - Standardized API (HyperNode)
  - Hierarchical (trees/zones)
  - Auto-discovery, plugin-ready (e.g. NVIDIA)
  - Health checks
  - Unified across clouds and on-prem
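
To make the "hierarchical vs. flat label" point a bit more concrete, here is a minimal sketch in Python. It is not Volcano code: `TopoNode` and `lowest_common_tier` are hypothetical names made up for illustration, and the tier numbering (0 = physical node, higher = wider network domain) is an assumption. The idea is that a tree can tell you how far apart two nodes are, while a flat label can only tell you "same" or "different".

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality, the tree contains parent/child cycles
class TopoNode:
    """One element of a hypothetical topology tree (not Volcano's real types)."""
    name: str
    tier: int  # assumption: 0 = physical node, higher = wider network domain
    parent: Optional["TopoNode"] = None
    children: List["TopoNode"] = field(default_factory=list)

    def add(self, child: "TopoNode") -> "TopoNode":
        """Attach a child to this element and return the child."""
        child.parent = self
        self.children.append(child)
        return child


def ancestors(node):
    """Return the node itself followed by all of its ancestors up to the root."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def lowest_common_tier(a, b):
    """Tier of the smallest domain containing both nodes, or None if unrelated."""
    seen = {id(n) for n in ancestors(a)}
    for n in ancestors(b):
        if id(n) in seen:
            return n.tier
    return None


# Made-up example: one spine switch, two leaf switches, three nodes.
spine = TopoNode("spine-0", tier=2)
leaf_a = spine.add(TopoNode("leaf-a", tier=1))
leaf_b = spine.add(TopoNode("leaf-b", tier=1))
n0 = leaf_a.add(TopoNode("node-0", tier=0))
n1 = leaf_a.add(TopoNode("node-1", tier=0))
n2 = leaf_b.add(TopoNode("node-2", tier=0))

print(lowest_common_tier(n0, n1))  # nodes under the same leaf
print(lowest_common_tier(n0, n2))  # nodes that only meet at the spine
```

A scheduler could prefer placements with a lower common tier for tightly coupled jobs. With a single flat label per node, that ranking is not available without inventing extra labels per level.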
## Architecture CRD Sample
TODO: Steal Leaf sample from slides
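
This is not the sample from the slides. As a rough placeholder, here is a sketch of what a leaf (tier 1) HyperNode manifest could look like, built as a Python dict. The field names follow my understanding of the `topology.volcano.sh/v1alpha1` schema in the Volcano docs and should be checked against the actual CRD; the helper `leaf_hypernode` and all node names are made up.

```python
import json


def leaf_hypernode(name, node_names):
    """Build a tier-1 HyperNode manifest that groups the given Kubernetes nodes.

    Assumption: the schema (tier, members, selector.exactMatch) matches the
    Volcano v1alpha1 HyperNode CRD. Verify against the real CRD before use.
    """
    return {
        "apiVersion": "topology.volcano.sh/v1alpha1",
        "kind": "HyperNode",
        "metadata": {"name": name},
        "spec": {
            "tier": 1,  # leaf level: members are real nodes
            "members": [
                {"type": "Node", "selector": {"exactMatch": {"name": n}}}
                for n in node_names
            ],
        },
    }


# Hypothetical leaf switch with two nodes attached.
manifest = leaf_hypernode("leaf-a", ["node-0", "node-1"])
print(json.dumps(manifest, indent=2))
```

Higher tiers would presumably list other HyperNodes as members instead of nodes, which is what gives the tree shape.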
## What's next
- GPU 3D Architectures (Internal interconnects, NUMA, external interconnects)
- DRA integration/collaboration
- Promotion of HyperNode to a first-class citizen -> Extraction from Volcano to be truly generic

@@ -22,4 +22,9 @@ I have to admit that I'm very bad with names and don't always recognize people b
## Other stuff I learned or people I talked to
- TODO:
- Arik about deprecation of CNCF projects
- Simon and Koray about demo prep for talks
- Arik and Simon about the review process for conference talks
- Nico
- Stephan
- A nice guy whose name I forgot (did I mention that I'm bad with names yet?) about the process of bleaching/dyeing my hair (he asked for a friend)