---
title: "TikTok’s Edge Symphony: Scaling Beyond Boundaries with Multi-Cluster Controllers"
weight: 6
tags:
  - platform
  - operator
  - scaling
---

A talk by TikTok/ByteDance (duh) focused on running controllers centrally instead of on the edge.

## Background

> Global means non-China

* Edge platform team for CDN, livestreaming, uploads, realtime communication, etc.
* Around 250 clusters with 10-600 nodes each, mostly non-cloud aka bare metal
* Architecture: Control plane clusters (platform services) - data plane clusters (workloads by other teams)
* Platform includes logs, metrics, configs, secrets, ...

## Challenges

### Operators

* Operators are essential for platform features
* As the feature requests increase, more operators are needed
* The deployment of operators across many clusters is complex (namespaces, deployments, policies, ...)

### Edge

* Limited resources
* Cost implications of platform features
* Real-time processing demands by platform features
* Balancing act between resources used by workloads vs platform features (20-25%)

### The classic flow

1. A new feature gets requested
2. Use kubebuilder with the SDK to create the operator
3. Create namespaces and configs in all clusters
4. Deploy the operator to all clusters

## Possible Solution

### Centralized Control Plane

* Problem: The controller implementation is limited to a cluster boundary
* Idea: Why not create a single operator that can manage multiple edge clusters
* Implementation: Just modify kubebuilder to accept multiple clients (and caches)
* Result: It works -> Simpler deployment and troubleshooting
* Concerns: High code complexity -> Long familiarization
* Balance between "simple central operator" and operator complexity is hard

### Attempt it a bit more like kubebuilder

* Each cluster has its own manager
* There is a central multimanager that starts all of the cluster-specific managers
* Controller registration to the manager now handles cluster names
* The reconciler knows which cluster it is working on
* The multi-cluster management basically just gets all of the cluster secrets and creates a manager + controller for each cluster secret
* Challenges: Network connectivity
* Solutions:
  * Dynamic add/remove of clusters with go channels to prevent pod restarts
  * Connectivity health checks -> On connection loss, the manager recreation gets triggered

```mermaid
flowchart TD
    mcm-->m1
    mcm-->m2
    mcm-->m3
```

```mermaid
flowchart LR
    secrets-->ch(go channels)
    ch-->|CREATE|create(Create manager + Add controller + Start manager)
    ch-->|UPDATE|update(Stop manager + Create manager + Add controller + Start manager)
    ch-->|DELETE|delete(Stop manager)
```

## Conclusion

* Acknowledge resource constraints on the edge
* Embrace open source adoption instead of building your own
* Simplify deployment
* Recognize your own opinionated approach and its use cases
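
To make the CREATE/UPDATE/DELETE flow from the second attempt more concrete, here is a minimal Go sketch of a multimanager that reacts to cluster events arriving on a channel. It is not the code shown in the talk and it does not use kubebuilder or controller-runtime: `ClusterEvent`, `MultiManager`, `clusterManager` and `demo` are hypothetical names, and the per-cluster "manager" is just a goroutine that blocks until its context is cancelled. Only the lifecycle handling (start, restart, stop without restarting the pod) is illustrated.

```go
package main

import (
	"context"
	"sync"
)

// EventType mirrors the three edges of the channel flowchart.
type EventType string

const (
	Create EventType = "CREATE"
	Update EventType = "UPDATE"
	Delete EventType = "DELETE"
)

// ClusterEvent is a hypothetical message derived from a cluster secret change.
type ClusterEvent struct {
	Type       EventType
	Cluster    string
	Kubeconfig string // stand-in for the secret payload
}

// clusterManager is a stand-in for a real per-cluster controller manager.
type clusterManager struct {
	kubeconfig string
	cancel     context.CancelFunc
	done       chan struct{}
}

// MultiManager owns one clusterManager per known cluster.
type MultiManager struct {
	mu       sync.Mutex
	managers map[string]*clusterManager
}

func NewMultiManager() *MultiManager {
	return &MultiManager{managers: make(map[string]*clusterManager)}
}

// start creates a manager for the cluster and runs it in its own goroutine.
func (m *MultiManager) start(parent context.Context, ev ClusterEvent) {
	ctx, cancel := context.WithCancel(parent)
	cm := &clusterManager{kubeconfig: ev.Kubeconfig, cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(cm.done)
		<-ctx.Done() // a real manager would run its controllers here
	}()
	m.mu.Lock()
	m.managers[ev.Cluster] = cm
	m.mu.Unlock()
}

// stop cancels the cluster's manager and waits for it to exit.
// It is a no-op for unknown clusters.
func (m *MultiManager) stop(cluster string) {
	m.mu.Lock()
	cm, ok := m.managers[cluster]
	delete(m.managers, cluster)
	m.mu.Unlock()
	if ok {
		cm.cancel()
		<-cm.done
	}
}

// Run consumes events until the channel is closed.
func (m *MultiManager) Run(ctx context.Context, events <-chan ClusterEvent) {
	for ev := range events {
		switch ev.Type {
		case Create, Update:
			// Assumption of this sketch: a CREATE for an already known
			// cluster is handled like an UPDATE (stop, then start again).
			m.stop(ev.Cluster)
			m.start(ctx, ev)
		case Delete:
			m.stop(ev.Cluster)
		}
	}
}

// Active returns the running clusters and the kubeconfig each one was started with.
func (m *MultiManager) Active() map[string]string {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make(map[string]string, len(m.managers))
	for name, cm := range m.managers {
		out[name] = cm.kubeconfig
	}
	return out
}

// demo replays a small event sequence and reports what is still running.
func demo() map[string]string {
	events := make(chan ClusterEvent, 4)
	events <- ClusterEvent{Type: Create, Cluster: "edge-a", Kubeconfig: "v1"}
	events <- ClusterEvent{Type: Create, Cluster: "edge-b", Kubeconfig: "v1"}
	events <- ClusterEvent{Type: Update, Cluster: "edge-a", Kubeconfig: "v2"}
	events <- ClusterEvent{Type: Delete, Cluster: "edge-b"}
	close(events)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // stops any manager that is still running
	mm := NewMultiManager()
	mm.Run(ctx, events)
	return mm.Active()
}
```

Calling `demo()` from a `main` function replays two cluster additions, one secret update and one removal. A connectivity health check along the lines mentioned in the talk could presumably feed the same channel with an UPDATE event for the affected cluster, so a lost connection takes the same stop-and-recreate path.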