kubecon24/content/day4/06_global_operator.md

title: TikTok's Edge Symphony: Scaling Beyond Boundaries with Multi-Cluster Controllers
weight: 6
tags: platform, operator, scaling

A talk by TikTok/ByteDance (duh) focused on using central controllers instead of running them on the edge.

Background

Global means non-China

  • Edge platform team for CDN, livestreaming, uploads, realtime communication, etc.
  • Around 250 clusters with 10-600 nodes each - mostly non-cloud aka bare metal
  • Architecture: Control plane clusters (platform services) - data plane clusters (workload by other teams)
  • Platform includes logs, metrics, configs, secrets, ...

Challenges

Operators

  • Operators are essential for platform features
  • As the feature requests increase, more operators are needed
  • The deployment of operators throughout many clusters is complex (namespaces, deployments, policies, ...)

Edge

  • Limited resources
  • Cost implications of platform features
  • Real time processing demands by platform features
  • Balancing act between resources used by workload vs platform features (20-25%)

The classic flow

  1. A new feature gets requested
  2. Use kubebuilder with the SDK to create the operator
  3. Create namespaces and configs in all clusters
  4. Deploy the operator to all clusters
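To get a feeling for why this flow scales badly, here is a minimal back-of-the-envelope sketch. The 250 clusters come from the talk; the number of operators and the assumption of three rollout actions per operator and cluster (namespace, config, deployment) are made up for illustration.

```go
package main

import "fmt"

// rolloutSteps counts the per-cluster actions of the classic flow.
// Assumption: every operator needs one namespace, one config and one
// deployment applied in every cluster (three steps in total).
func rolloutSteps(clusters, operators int) int {
	const stepsPerOperator = 3 // namespace + config + deployment (assumed)
	return clusters * operators * stepsPerOperator
}

func main() {
	// 250 clusters is the figure from the talk; 10 operators is hypothetical.
	fmt.Println(rolloutSteps(250, 10)) // 250 * 10 * 3 = 7500
}
```

Every additional operator adds another full pass over all clusters, which is the pain point the centralized approach tries to remove.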

Possible Solution

Centralized Control Plane

  • Problem: The controller implementation is limited to a cluster boundary
  • Idea: Why not create a single operator that can manage multiple edge clusters
  • Implementation: Just modify kubebuilder to accept multiple clients (and caches)
  • Result: It works -> Simpler deployment and troubleshooting
  • Concerns: High code complexity -> Long familiarization
  • Balance between "simple central operator" and operator-complexity is hard
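The idea of one operator holding a client (and cache) per cluster can be modelled roughly like this. This is a stdlib-only sketch, not the speakers' code and not the kubebuilder or controller-runtime API: `ClusterClient`, `CentralOperator` and `Reconcile` are hypothetical names, and the map of objects only stands in for a real per-cluster cache.

```go
package main

import (
	"fmt"
	"sort"
)

// ClusterClient is a hypothetical stand-in for a per-cluster API client
// plus cache. It is NOT a controller-runtime type.
type ClusterClient struct {
	Name    string
	objects map[string]string // object name -> spec, standing in for the cache
}

// CentralOperator holds one client per edge cluster.
type CentralOperator struct {
	clients map[string]*ClusterClient
}

// NewCentralOperator creates an empty client for every given cluster name.
func NewCentralOperator(names ...string) *CentralOperator {
	op := &CentralOperator{clients: map[string]*ClusterClient{}}
	for _, n := range names {
		op.clients[n] = &ClusterClient{Name: n, objects: map[string]string{}}
	}
	return op
}

// Reconcile writes the desired spec into every cluster that does not have
// it yet and returns the sorted names of the clusters it changed.
func (op *CentralOperator) Reconcile(object, spec string) []string {
	var changed []string
	for name, c := range op.clients {
		if c.objects[object] != spec {
			c.objects[object] = spec
			changed = append(changed, name)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	op := NewCentralOperator("edge-a", "edge-b")
	fmt.Println(op.Reconcile("log-agent", "v1")) // first pass touches both clusters
	fmt.Println(op.Reconcile("log-agent", "v1")) // second pass changes nothing
}
```

Even in this toy version the reconciler has to loop over all clusters itself, which hints at the code complexity concern mentioned above.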

Second attempt: a bit more like kubebuilder

  • Each cluster has its own manager
  • There is a central multimanager that starts all of the cluster-specific managers
  • Controller registration to the manager now handles cluster names
  • The reconciler knows which cluster it is working on
  • The multi-cluster management basically just gets all of the cluster secrets and creates a manager+controller for each cluster secret
  • Challenges: Network connectivity
  • Solutions:
    • Dynamic add/remove of clusters with Go channels to prevent pod restarts
    • Connectivity health checks -> On connection loss, a recreate of the manager gets triggered
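The health check idea could look roughly like the following sketch. It assumes a probe function per cluster and models "recreate the manager" by bumping a generation counter. `Manager`, `MultiManager`, `CheckAll` and `failFor` are hypothetical names for illustration, not part of kubebuilder or of the implementation shown in the talk.

```go
package main

import "fmt"

// Manager is a hypothetical stand-in for a per-cluster manager.
type Manager struct {
	Cluster    string
	Generation int // incremented every time the manager is recreated
}

// MultiManager keeps one manager per cluster name.
type MultiManager struct {
	managers map[string]*Manager
	probe    func(cluster string) error // connectivity check, injected
}

// NewMultiManager creates a generation-0 manager for every cluster.
func NewMultiManager(probe func(string) error, clusters ...string) *MultiManager {
	mm := &MultiManager{managers: map[string]*Manager{}, probe: probe}
	for _, c := range clusters {
		mm.managers[c] = &Manager{Cluster: c}
	}
	return mm
}

// CheckAll probes every cluster and recreates the manager of each cluster
// whose probe fails. It returns the number of recreated managers.
func (mm *MultiManager) CheckAll() int {
	recreated := 0
	for name, m := range mm.managers {
		if err := mm.probe(name); err != nil {
			mm.managers[name] = &Manager{Cluster: name, Generation: m.Generation + 1}
			recreated++
		}
	}
	return recreated
}

// Generation reports how often the manager of a cluster was recreated.
func (mm *MultiManager) Generation(cluster string) int {
	return mm.managers[cluster].Generation
}

// failFor returns a probe that reports exactly one cluster as unreachable.
func failFor(bad string) func(string) error {
	return func(cluster string) error {
		if cluster == bad {
			return fmt.Errorf("cluster %s unreachable", cluster)
		}
		return nil
	}
}

func main() {
	mm := NewMultiManager(failFor("edge-b"), "edge-a", "edge-b")
	fmt.Println(mm.CheckAll())                                      // one manager recreated
	fmt.Println(mm.Generation("edge-a"), mm.Generation("edge-b")) // only edge-b was bumped
}
```

In a real setup the probe would presumably hit the cluster's API server on a timer; here it is injected so the behaviour stays easy to follow.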

```mermaid
flowchart TD
    mcm-->m1
    mcm-->m2
    mcm-->m3
```

```mermaid
flowchart LR
    secrets-->ch(go channels)
    ch-->|CREATE|create(Create manager + Add controller + Start manager)
    ch-->|UPDATE|update(Stop manager + Create manager + Add controller + Start manager)
    ch-->|DELETE|delete(Stop manager)
```
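The CREATE/UPDATE/DELETE handling from the second flowchart could be sketched with a channel and cancellable contexts like this. It is a stdlib-only model under assumptions: `ClusterEvent`, `MultiManager`, `Run` and `Shutdown` are hypothetical names, and the goroutine per cluster only stands in for a real manager with its controllers.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"sync"
)

type EventType string

const (
	Create EventType = "CREATE"
	Update EventType = "UPDATE"
	Delete EventType = "DELETE"
)

// ClusterEvent describes a change to one cluster secret (hypothetical type).
type ClusterEvent struct {
	Type    EventType
	Cluster string
}

// MultiManager tracks the cancel function of every running per-cluster manager.
type MultiManager struct {
	root    context.Context // kept in the struct only to keep the sketch short
	running map[string]context.CancelFunc
	wg      sync.WaitGroup
	Starts  int // how many managers were started in total
}

func NewMultiManager() *MultiManager {
	return &MultiManager{root: context.Background(), running: map[string]context.CancelFunc{}}
}

// start launches a stand-in manager; a duplicate start replaces the old one.
func (mm *MultiManager) start(cluster string) {
	mm.stop(cluster)
	ctx, cancel := context.WithCancel(mm.root)
	mm.running[cluster] = cancel
	mm.Starts++
	mm.wg.Add(1)
	go func() {
		defer mm.wg.Done()
		<-ctx.Done() // a real manager would run its controllers until cancelled
	}()
}

// stop cancels the manager of a cluster, if one is running.
func (mm *MultiManager) stop(cluster string) {
	if cancel, ok := mm.running[cluster]; ok {
		cancel()
		delete(mm.running, cluster)
	}
}

// Run consumes events until the channel is closed.
func (mm *MultiManager) Run(events <-chan ClusterEvent) {
	for ev := range events {
		switch ev.Type {
		case Create:
			mm.start(ev.Cluster)
		case Update:
			mm.stop(ev.Cluster)
			mm.start(ev.Cluster)
		case Delete:
			mm.stop(ev.Cluster)
		}
	}
}

// Running returns the sorted names of clusters with an active manager.
func (mm *MultiManager) Running() []string {
	names := make([]string, 0, len(mm.running))
	for name := range mm.running {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// Shutdown stops all managers and waits for their goroutines to exit.
func (mm *MultiManager) Shutdown() {
	for name := range mm.running {
		mm.stop(name)
	}
	mm.wg.Wait()
}

func main() {
	events := make(chan ClusterEvent, 4)
	events <- ClusterEvent{Type: Create, Cluster: "edge-a"}
	events <- ClusterEvent{Type: Create, Cluster: "edge-b"}
	events <- ClusterEvent{Type: Update, Cluster: "edge-a"}
	events <- ClusterEvent{Type: Delete, Cluster: "edge-b"}
	close(events)

	mm := NewMultiManager()
	mm.Run(events)
	fmt.Println(mm.Running(), mm.Starts) // edge-a is left running after three starts
	mm.Shutdown()
}
```

Because clusters are added and removed through events, the operator pod itself never has to restart when a cluster secret changes.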

Conclusion

  • Acknowledge resource constraints on the edge
  • Embrace open source adoption instead of building your own
  • Simplify deployment
  • Recognize your own opinionated approach and its use cases