---
title: "TikTok’s Edge Symphony: Scaling Beyond Boundaries with Multi-Cluster Controllers"
weight: 6
tags:
- platform
- operator
- scaling
---

A talk by TikTok/ByteDance (duh) focused on running controllers centrally instead of on the edge.

## Background

> "Global" means non-China

* Edge platform team for CDN, livestreaming, uploads, real-time communication, etc.
* Around 250 clusters with 10-600 nodes each, mostly non-cloud (aka bare metal)
* Architecture: control plane clusters (platform services) and data plane clusters (workloads from other teams)
* Platform includes logs, metrics, configs, secrets, ...

## Challenges

### Operators

* Operators are essential for platform features
* As feature requests increase, more operators are needed
* Deploying operators across many clusters is complex (namespaces, deployments, policies, ...)

### Edge

* Limited resources
* Cost implications of platform features
* Real-time processing demands of platform features
* Balancing act between resources used by workloads vs. platform features (20-25%)

### The classic flow

1. New feature gets requested
2. Use kubebuilder with the SDK to create the operator
3. Create namespaces and configs in all clusters
4. Deploy operator to all clusters

## Possible Solution

### Centralized Control Plane

* Problem: The controller implementation is limited to a cluster boundary
* Idea: Why not create a single operator that can manage multiple edge clusters?
* Implementation: Just modify kubebuilder to accept multiple clients (and caches)
* Result: It works -> Simpler deployment and troubleshooting
* Concerns: High code complexity -> Long familiarization time
* Striking a balance between a "simple central operator" and operator complexity is hard

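
The "single operator, multiple clients" idea can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the code shown in the talk: `ClusterClient`, `fakeClient` and `CentralReconciler` are hypothetical names, and a real implementation would hold a kubebuilder/controller-runtime client and cache per cluster instead of the fake used here.

```go
package main

import (
	"fmt"
	"sort"
)

// ClusterClient is a hypothetical stand-in for a per-cluster API client.
type ClusterClient interface {
	Apply(resource string) error
}

// fakeClient only records what was applied; it exists purely for this sketch.
type fakeClient struct {
	applied []string
}

func (f *fakeClient) Apply(resource string) error {
	f.applied = append(f.applied, resource)
	return nil
}

// CentralReconciler holds one client per edge cluster instead of a single client.
type CentralReconciler struct {
	clients map[string]ClusterClient
}

// Reconcile applies the resource to every known cluster and returns the
// sorted names of the clusters where applying failed.
func (r *CentralReconciler) Reconcile(resource string) []string {
	var failed []string
	for name, c := range r.clients {
		if err := c.Apply(resource); err != nil {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed)
	return failed
}

func main() {
	r := &CentralReconciler{clients: map[string]ClusterClient{
		"edge-a": &fakeClient{},
		"edge-b": &fakeClient{},
	}}
	fmt.Println("failed clusters:", r.Reconcile("configmap/platform-config"))
}
```

Even in this toy version the reconciler has to carry the cluster fan-out and per-cluster error handling itself, which hints at where the code complexity mentioned above comes from.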

### Attempt it a bit more like kubebuilder

* Each cluster has its own manager
* There is a central multimanager that starts all the cluster-specific managers
* Controller registration to the manager now handles cluster names
* The reconciler knows which cluster it is working on
* The multi-cluster management basically just checks all the cluster secrets and creates a manager + controller for each cluster secret
* Challenges: Network connectivity
* Solutions:
  * Dynamic add/remove of clusters with go channels to prevent pod restarts
  * Connectivity health checks -> On connection loss, the `recreate manager` gets triggered

```mermaid
flowchart TD
mcm-->m1
mcm-->m2
mcm-->m3
```

```mermaid
flowchart LR
secrets-->ch(go channels)
ch-->|CREATE|create(Create manager + Add controller + Start manager)
ch-->|UPDATE|update(Stop manager + Create manager + Add controller + Start manager)
ch-->|DELETE|delete(Stop manager)
```

## Conclusion

* Acknowledge resource constraints on edge
* Embrace open source adoption instead of building your own
* Simplify deployment
* Recognize your own opinionated approach and its use cases