---
title: "TikTok’s Edge Symphony: Scaling Beyond Boundaries with Multi-Cluster Controllers"
weight: 6
tags:
  - platform
  - operator
  - scaling
---

A talk by TikTok/ByteDance (duh) focussed on running controllers centrally instead of on the edge.

## Background

> Global means non-China

* Edge platform team for CDN, livestreaming, uploads, realtime communication, etc.
* Around 250 clusters with 10-600 nodes each - mostly non-cloud aka bare metal
* Architecture: Control plane clusters (platform services) - data plane clusters (workload by other teams)
* Platform includes logs, metrics, configs, secrets, ...

## Challenges

### Operators

* Operators are essential for platform features
* As the feature requests increase, more operators are needed
* Deploying operators across many clusters is complex (namespaces, deployments, policies, ...)

### Edge

* Limited resources
* Cost implications of platform features
* Real-time processing demands by platform features
* Balancing act between resources used by workloads vs platform features (20-25%)

### The classic flow

1. A new feature gets requested
2. Use kubebuilder with the SDK to create the operator
3. Create namespaces and configs in all clusters
4. Deploy the operator to all clusters

## Possible Solution

### Centralized Control Plane

* Problem: The controller implementation is limited to a cluster boundary
* Idea: Why not create a single operator that can manage multiple edge clusters
* Implementation: Just modify kubebuilder to accept multiple clients (and caches)
* Result: It works -> Simpler deployment and troubleshooting
* Concerns: High code complexity -> Long familiarization
* Balance between "simple central operator" and operator-complexity is hard
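
To make the "multiple clients" idea concrete, here is a minimal sketch in plain Go. It does not use kubebuilder or controller-runtime; `ClusterClient`, `fakeClient`, `Request`, and `MultiClusterReconciler` are hypothetical names made up for illustration, and the talk did not show this code. The point is only that one central reconciler looks up the right client by cluster name on every request.

```go
package main

import "fmt"

// ClusterClient is a hypothetical stand-in for a per-cluster API client
// (in kubebuilder terms: a client plus its cache).
type ClusterClient interface {
	EnsureNamespace(name string) error
}

// fakeClient records the namespaces it was asked to create (illustration only).
type fakeClient struct {
	namespaces map[string]bool
}

func (f *fakeClient) EnsureNamespace(name string) error {
	f.namespaces[name] = true
	return nil
}

// Request identifies the target cluster in addition to the object.
type Request struct {
	Cluster   string
	Namespace string
}

// MultiClusterReconciler is a single central reconciler that holds
// one client per edge cluster.
type MultiClusterReconciler struct {
	clients map[string]ClusterClient
}

// Reconcile picks the client for the requested cluster and applies the change there.
func (r *MultiClusterReconciler) Reconcile(req Request) error {
	c, ok := r.clients[req.Cluster]
	if !ok {
		return fmt.Errorf("unknown cluster %q", req.Cluster)
	}
	return c.EnsureNamespace(req.Namespace)
}

func main() {
	edgeA := &fakeClient{namespaces: map[string]bool{}}
	edgeB := &fakeClient{namespaces: map[string]bool{}}
	r := &MultiClusterReconciler{clients: map[string]ClusterClient{
		"edge-a": edgeA,
		"edge-b": edgeB,
	}}

	if err := r.Reconcile(Request{Cluster: "edge-a", Namespace: "logging"}); err != nil {
		fmt.Println("reconcile failed:", err)
	}
	fmt.Println("edge-a has logging:", edgeA.namespaces["logging"])
	fmt.Println("edge-b has logging:", edgeB.namespaces["logging"])
}
```

Every reconciler has to carry this lookup and the cluster-aware request type, which hints at where the code complexity mentioned above comes from.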

### Second attempt: a bit more like kubebuilder

* Each cluster has its own manager
* There is a central multimanager that starts all of the cluster-specific managers
* Controller registration to the manager now handles cluster names
* The reconciler knows which cluster it is working on
* The multi-cluster management basically just gets all of the cluster secrets and creates a manager+controller for each cluster secret
* Challenges: Network connectivity
* Solutions:
  * Dynamic add/remove of clusters with Go channels to prevent pod restarts
  * Connectivity health checks -> On connection loss, the manager gets recreated

```mermaid
flowchart TD
    mcm-->m1
    mcm-->m2
    mcm-->m3
```

```mermaid
flowchart LR
    secrets-->ch(go channels)
    ch-->|CREATE|create(Create manager + Add controller + Start manager)
    ch-->|UPDATE|update(Stop manager + Create manager + Add controller + Start manager)
    ch-->|DELETE|delete(Stop manager)
```

## Conclusion

* Acknowledge resource constraints on the edge
* Embrace open source adoption instead of building your own
* Simplify deployment
* Recognize your own opinionated approach and its use cases