title | weight |
---|---|
TikTok’s Edge Symphony: Scaling Beyond Boundaries with Multi-Cluster Controllers | 6 |
A talk by TikTok/ByteDance (duh) focused on running controllers centrally instead of on the edge clusters.
Background
Global means non-China
- Edge platform team for CDN, livestreaming, uploads, real-time communication, etc.
- Around 250 clusters with 10-600 nodes each, mostly non-cloud (aka bare metal)
- Architecture: Control plane clusters (platform services) - data plane clusters (workload by other teams)
- Platform includes logs, metrics, configs, secrets, ...
Challenges
Operators
- Operators are essential for platform features
- As the feature requests increase, more operators are needed
- The deployment of operators throughout many clusters is complex (namespaces, deployments, policies, ...)
Edge
- Limited resources
- Cost implications of platform features
- Real-time processing demands by platform features
- Balancing act between resources used by workloads vs. platform features (20-25%)
The classic flow
- A new feature gets requested
- Use kubebuilder with the SDK to create the operator
- Create namespaces and configs in all clusters
- Deploy the operator to all clusters
Possible Solution
Centralized Control Plane
- Problem: The controller implementation is limited to a cluster boundary
- Idea: Why not create a single operator that can manage multiple edge clusters?
- Implementation: Just modify kubebuilder to accept multiple clients (and caches)
- Result: It works -> Simpler deployment and troubleshooting
- Concerns: High code complexity -> Long familiarization
- Balance between "simple central operator" and operator-complexity is hard
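The "one operator, many clients" idea can be sketched with the Go standard library alone. This is a minimal illustration, not the code from the talk: `ClusterClient`, `fakeClient`, and `MultiClusterReconciler` are hypothetical names and do not correspond to kubebuilder or controller-runtime APIs.

```go
package main

import (
	"fmt"
	"sort"
)

// ClusterClient is a hypothetical stand-in for a per-cluster API client.
type ClusterClient interface {
	Apply(name, spec string) error
}

// fakeClient records applied objects in memory so the sketch runs without a cluster.
type fakeClient struct {
	objects map[string]string
}

func newFakeClient() *fakeClient {
	return &fakeClient{objects: map[string]string{}}
}

func (f *fakeClient) Apply(name, spec string) error {
	f.objects[name] = spec
	return nil
}

// MultiClusterReconciler holds one client per edge cluster instead of a single client.
type MultiClusterReconciler struct {
	clients map[string]ClusterClient
}

// Reconcile applies the desired spec to the named cluster only.
func (r *MultiClusterReconciler) Reconcile(cluster, name, spec string) error {
	c, ok := r.clients[cluster]
	if !ok {
		return fmt.Errorf("unknown cluster %q", cluster)
	}
	return c.Apply(name, spec)
}

// Clusters returns the registered cluster names in sorted order.
func (r *MultiClusterReconciler) Clusters() []string {
	names := make([]string, 0, len(r.clients))
	for n := range r.clients {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	r := &MultiClusterReconciler{clients: map[string]ClusterClient{
		"edge-a": newFakeClient(),
		"edge-b": newFakeClient(),
	}}
	// One central process reconciles the same object into every edge cluster.
	for _, c := range r.Clusters() {
		if err := r.Reconcile(c, "log-agent", "v1"); err != nil {
			fmt.Println("reconcile failed:", err)
		}
	}
	// A cluster without a registered client is rejected.
	fmt.Println(r.Reconcile("edge-z", "log-agent", "v1"))
}
```

Even in this toy form, every reconcile call has to carry the cluster name and look up the right client, which hints at where the extra code complexity mentioned above comes from.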
Second attempt: a bit more like kubebuilder
- Each cluster has its own manager
- There is a central multi-manager that starts all of the cluster-specific managers
- Controller registration to the manager now handles cluster names
- The reconciler knows which cluster it is working on
- The multi-cluster management basically just gets all of the cluster secrets and creates a manager + controller for each cluster secret
- Challenges: Network connectivity
- Solutions:
  - Dynamic add/remove of clusters with Go channels to prevent pod restarts
  - Connectivity health checks -> On connection loss, recreation of the manager gets triggered
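The health-check idea could look roughly like the following stdlib-only Go sketch. The `Probe` type, `CheckOnce`, and `ErrUnreachable` are hypothetical names chosen for illustration; the talk did not show how the probe is implemented or how often it runs.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnreachable is a hypothetical error a probe returns when a cluster cannot be reached.
var ErrUnreachable = errors.New("cluster unreachable")

// Probe is a hypothetical connectivity check against one cluster.
type Probe func(cluster string) error

// CheckOnce probes every cluster once and sends the name of each unreachable
// cluster on the recreate channel. It returns how many clusters were reported.
func CheckOnce(clusters []string, probe Probe, recreate chan<- string) int {
	lost := 0
	for _, c := range clusters {
		if err := probe(c); err != nil {
			recreate <- c
			lost++
		}
	}
	return lost
}

func main() {
	clusters := []string{"edge-a", "edge-b", "edge-c"}
	// Fake probe: only edge-b is unreachable in this example.
	probe := func(cluster string) error {
		if cluster == "edge-b" {
			return ErrUnreachable
		}
		return nil
	}

	// Buffered so that CheckOnce never blocks in this single-goroutine example.
	recreate := make(chan string, len(clusters))
	n := CheckOnce(clusters, probe, recreate)
	close(recreate)

	for c := range recreate {
		fmt.Println("recreate manager for", c)
	}
	fmt.Println("clusters lost:", n)
}
```

In a long-running operator this check would presumably run on a timer, with a consumer on the other end of the channel that stops and recreates the affected manager.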
flowchart TD
mcm-->m1
mcm-->m2
mcm-->m3
flowchart LR
secrets-->ch(go channels)
ch-->|CREATE|create(Create manager + Add controller + Start manager)
ch-->|UPDATE|update(Stop manager + Create manager + Add controller + Start manager)
ch-->|DELETE|delete(Stop manager)
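The CREATE/UPDATE/DELETE handling can be sketched as a small event loop. This is a stdlib-only approximation under stated assumptions: `MultiManager`, `ClusterEvent`, and the internal `manager` type are hypothetical, and each "manager" here is just a goroutine waiting for cancellation rather than a real controller-runtime manager.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// EventType mirrors the CREATE/UPDATE/DELETE labels from the notes.
type EventType string

const (
	Create EventType = "CREATE"
	Update EventType = "UPDATE"
	Delete EventType = "DELETE"
)

// ClusterEvent is a hypothetical message derived from a cluster secret change.
type ClusterEvent struct {
	Type    EventType
	Cluster string
}

// manager is a stand-in for a per-cluster manager.
type manager struct {
	cancel context.CancelFunc
	done   chan struct{}
}

// MultiManager owns one manager per cluster, keyed by cluster name.
type MultiManager struct {
	mu       sync.Mutex
	managers map[string]*manager
}

func NewMultiManager() *MultiManager {
	return &MultiManager{managers: map[string]*manager{}}
}

// start creates a manager for the cluster and runs it until it is cancelled.
func (m *MultiManager) start(cluster string) {
	ctx, cancel := context.WithCancel(context.Background())
	mgr := &manager{cancel: cancel, done: make(chan struct{})}
	m.managers[cluster] = mgr
	go func() {
		defer close(mgr.done)
		<-ctx.Done() // a real manager would run its controllers here
	}()
}

// stop cancels the cluster's manager, if present, and waits for it to exit.
func (m *MultiManager) stop(cluster string) {
	if mgr, ok := m.managers[cluster]; ok {
		mgr.cancel()
		<-mgr.done
		delete(m.managers, cluster)
	}
}

// Handle applies one event to the set of running managers.
func (m *MultiManager) Handle(ev ClusterEvent) {
	m.mu.Lock()
	defer m.mu.Unlock()
	switch ev.Type {
	case Create:
		if _, ok := m.managers[ev.Cluster]; !ok {
			m.start(ev.Cluster)
		}
	case Update:
		m.stop(ev.Cluster)
		m.start(ev.Cluster)
	case Delete:
		m.stop(ev.Cluster)
	}
}

// Run consumes events until the channel is closed.
func (m *MultiManager) Run(events <-chan ClusterEvent) {
	for ev := range events {
		m.Handle(ev)
	}
}

// Running reports how many managers are currently active.
func (m *MultiManager) Running() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return len(m.managers)
}

func main() {
	events := make(chan ClusterEvent, 4)
	events <- ClusterEvent{Type: Create, Cluster: "edge-a"}
	events <- ClusterEvent{Type: Create, Cluster: "edge-b"}
	events <- ClusterEvent{Type: Update, Cluster: "edge-a"}
	events <- ClusterEvent{Type: Delete, Cluster: "edge-b"}
	close(events)

	mm := NewMultiManager()
	mm.Run(events)
	fmt.Println("running managers:", mm.Running()) // prints "running managers: 1"
}
```

Because clusters are added and removed through events inside one long-lived process, a changed cluster secret only restarts that cluster's manager, not the whole pod.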
Conclusion
- Acknowledge resource constraints on the edge
- Embrace open source adoption instead of building your own
- Simplify deployment
- Recognize your own opinionated approach and its use cases