---
title: "The Hitchhiker's Guide to Kubernetes Platforms: Don’t Panic, Just Launch!"
weight: 7
tags:
  - platform
  - scaling
  - operators
  - dx
---

This talk looks at bootstrapping platforms using KServe, with a focus on AI workflows.

## Scenario

* Deploy AI workloads, sometimes consisting of different parts
* Models get stored in a model registry

## Baseline

* Consistent APIs throughout the platform
* Not the kube API directly because:
  * Data scientists are a bit overwhelmed by the kube API
  * Not only Kubernetes (also monitoring tools, feedback tools, etc.)
  * Better debugging experience for specific workloads

## The debugging API

* Specific API with enhanced statuses and consistent UX across code and UI
* Example endpoints: Pods, Deployments, InferenceServices
* Provides a status summary -> consistent health info across all related resources
  * Example: Deployments have progress/availability, Pods have phases, containers have readiness -> what do we interpret how?
  * Evaluation: Progressing, available count vs. readiness, ReplicaFailure, Pod phase, container readiness
  * The rules themselves may be pretty complex, but the status is simple, since the user doesn't have to check the rules themselves

### Debugging Metrics

* Dashboards (utilization, throughput, latency)
* Events
* Logs

## Deployment API

* Launchpad: Just select your model and version -> the DB (dock) stores all manifests (spaceships)
* Manifests relate to models from a model registry
* Multi-tenancy is implemented using k8s namespaces
* Kine is used to replace/extend etcd with the relational dock DB -> the namespace<->manifest relation is stored here and RBAC can be used
* Launchpad: Select a namespace and check resource (fuel) availability/utilization

### Cluster maintenance

* Deployments can be launched to multiple clusters (even two clusters at once) -> HA through identical clusters
* The exact same manifests get deployed to two clusters
* Cluster desired state is stored externally to enable effortless upgrades, rescaling, etc.

### Versioning API

* Basically the dock DB
* CRDs are the representations of the inference manifests
* Rollbacks, promotion and history are managed via the CRs
* Why not GitOps: internal diffs, deployment overrides, customized features

### UX

* User-driven API design
* Customized tools
* Everything gets replicated 1:1 for HA
* Large onboarding guide
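
To make the status summary from the debugging API section more concrete, here is a minimal sketch of how such an evaluation could collapse several Kubernetes signals into one value. It is not the implementation shown in the talk: the class names, the `summarize` function, the summary values (`Healthy`, `Progressing`, `Degraded`, `Failed`) and the rule ordering are all assumptions made for illustration. Only the inputs (the `Progressing`, `Available` and `ReplicaFailure` conditions, available count, Pod phase, container readiness) come from the notes above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PodState:
    phase: str                    # Kubernetes Pod phase, e.g. "Pending", "Running", "Failed"
    containers_ready: List[bool]  # one readiness flag per container


@dataclass
class DeploymentState:
    conditions: Dict[str, str]    # condition type -> "True" / "False" / "Unknown"
    desired_replicas: int
    available_replicas: int
    pods: List[PodState] = field(default_factory=list)


def summarize(dep: DeploymentState) -> str:
    """Collapse the detailed state into one (hypothetical) summary status."""
    # Hard failures take precedence over everything else.
    if dep.conditions.get("ReplicaFailure") == "True":
        return "Failed"
    if any(p.phase == "Failed" for p in dep.pods):
        return "Failed"

    # Available, enough replicas, and every container ready: healthy.
    all_ready = all(all(p.containers_ready) for p in dep.pods)
    if (
        dep.conditions.get("Available") == "True"
        and dep.available_replicas >= dep.desired_replicas
        and all_ready
    ):
        return "Healthy"

    # Not healthy yet, but the rollout is still making progress.
    if dep.conditions.get("Progressing") == "True":
        return "Progressing"

    # Neither failed, healthy, nor progressing.
    return "Degraded"


if __name__ == "__main__":
    rolling_out = DeploymentState(
        conditions={"Available": "False", "Progressing": "True"},
        desired_replicas=2,
        available_replicas=1,
        pods=[PodState("Running", [True]), PodState("Pending", [False])],
    )
    print(summarize(rolling_out))
```

The point of the sketch is the one made in the talk notes: the rules can grow arbitrarily complex inside `summarize`, while the user only ever sees a single, simple status.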