Title: Your Cluster Isn't Flat: A First-Class API for Real-World Infrastructure Topology
Weight: 9
Tags: rejekts

By a Volcano maintainer from Huawei, a very wholesome guy. I don't know why the organizers tend to schedule these very technical topics, presented by people with a somewhat heavier accent (totally understandable, but very quiet), near the end of the conference or the day. Thanks to the Sakura Edition Red Bull for keeping my attention up through the last two sessions of the day.

History of Volcano

  • 2017: Kube-Batch open source
  • 2019: Volcano Open Source
  • 2020: CNCF Sandbox
  • 2022: CNCF Incubation
  • 2026: Road to graduation

Volcano feature overview

  • Unified Scheduler
  • Queue Management
  • Workload Colocation
  • Multi-cluster scheduling
  • Heterogeneous Device Support
  • Multiple scheduling policies

Why topology awareness?

  • Scenario 1: Bottlenecks in LLM training when jobs are not placed on GPUs that are close to each other
  • Scenario 2: Inference runs as separate prefill and decode jobs on different hardware -> short network hops needed
  • Node labels can be used but are very limited
  • Datacenter network architectures are heterogeneous -> everyone can build in their own style
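
To make the "node labels are very limited" point concrete, here is a minimal sketch. It assumes a hypothetical two-level spine/leaf fabric; the node names, the `TOPOLOGY` mapping, and the `topology_distance` helper are invented for illustration and are not from the talk or from Volcano. A flat label can only answer "same value or not", while a path through the hierarchy also answers "how far apart".

```python
# Hypothetical fabric: each node maps to its path from the top of the network
# down to the switch it is attached to (spine first, then leaf).
TOPOLOGY = {
    "node-a": ("spine-1", "leaf-1"),
    "node-b": ("spine-1", "leaf-1"),
    "node-c": ("spine-1", "leaf-2"),
    "node-d": ("spine-2", "leaf-3"),
}


def topology_distance(a: str, b: str) -> int:
    """Count the hierarchy levels that two nodes do NOT share.

    0 = same leaf switch, 1 = same spine but different leaf,
    2 = different spines (traffic has to cross the top of the fabric).
    """
    path_a, path_b = TOPOLOGY[a], TOPOLOGY[b]
    shared = 0
    for level_a, level_b in zip(path_a, path_b):
        if level_a != level_b:
            break
        shared += 1
    return len(path_a) - shared


if __name__ == "__main__":
    for pair in [("node-a", "node-b"), ("node-a", "node-c"), ("node-a", "node-d")]:
        print(pair, topology_distance(*pair))
```

A scheduler that only sees a single `leaf=...` label could tell that `node-a` and `node-c` differ, but not that they are still closer to each other than `node-a` and `node-d`.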

Scheduler notation mechanisms

  • Label: Kueue, Koordinator, KAI Scheduler
    • Vendor-Specific Syntax
    • No hierarchy
    • Need to be manually set
    • No healthchecks
    • Cloud Specific
  • CRD (Long term): Volcano
    • Standardized API (HyperNode)
    • Hierarchical (Trees/Zones)
    • Auto-discovery - Plugin-ready (e.g. NVIDIA)
    • Healthchecks
    • Unified across clouds and on-prem
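
The "hierarchical" advantage in the comparison above can be sketched as follows. This is a toy model, not Volcano's HyperNode schema: the `Domain` class, its fields, and `smallest_fitting_domain` are hypothetical names, and the convention that tier 1 sits closest to the nodes is an assumption made for this example. It shows the kind of question a tree-shaped topology lets a scheduler ask: which is the lowest tier that can still hold the whole job?

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Domain:
    """Hypothetical topology domain (e.g. a leaf switch or a spine)."""
    name: str
    tier: int  # assumption: 1 = closest to the nodes, higher = wider span
    nodes: list = field(default_factory=list)     # worker nodes attached directly
    children: list = field(default_factory=list)  # lower-tier Domain objects

    def all_nodes(self) -> list:
        """Collect the nodes of this domain and of everything below it."""
        result = list(self.nodes)
        for child in self.children:
            result.extend(child.all_nodes())
        return result


def smallest_fitting_domain(root: Domain, needed: int) -> Optional[Domain]:
    """Return a lowest-tier domain containing at least `needed` nodes, or None."""
    best = None
    stack = [root]
    while stack:
        domain = stack.pop()
        if len(domain.all_nodes()) < needed:
            continue  # nothing below this domain can fit the job either
        if best is None or domain.tier < best.tier:
            best = domain
        stack.extend(domain.children)
    return best


leaf_1 = Domain("leaf-1", tier=1, nodes=["n1", "n2"])
leaf_2 = Domain("leaf-2", tier=1, nodes=["n3", "n4", "n5", "n6"])
spine = Domain("spine-1", tier=2, children=[leaf_1, leaf_2])

if __name__ == "__main__":
    for size in (3, 5, 7):
        match = smallest_fitting_domain(spine, size)
        print(size, match.name if match else None)
```

With flat labels, the same decision would need one hand-maintained label per level and custom logic to compare them, which is the gap the CRD-based approach is meant to close.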

Architecture CRD Sample

TODO: Steal Leaf sample from slides

What's next

  • GPU 3D Architectures (Internal interconnects, NUMA, external interconnects)
  • DRA integration/collaboration
  • Promotion of HyperNode to a first-class citizen -> Extraction from Volcano to be truly generic