diff --git a/content/day4/06_global_operator.md b/content/day4/06_global_operator.md
new file mode 100644
index 0000000..501f30b
--- /dev/null
+++ b/content/day4/06_global_operator.md
@@ -0,0 +1,82 @@
+---
+title: "TikTok’s Edge Symphony: Scaling Beyond Boundaries with Multi-Cluster Controllers"
+weight: 6
+---
+
+A talk by TikTok/ByteDance (duh) focused on running controllers centrally instead of on the edge.
+
+## Background
+
+> Global means non-China
+
+* Edge platform team for CDN, livestreaming, uploads, realtime communication, etc.
+* Around 250 clusters with 10-600 nodes each, mostly non-cloud (aka bare metal)
+* Architecture: Control plane clusters (platform services) and data plane clusters (workloads from other teams)
+* Platform includes logs, metrics, configs, secrets, ...
+
+## Challenges
+
+### Operators
+
+* Operators are essential for platform features
+* As the feature requests increase, more operators are needed
+* The deployment of operators across many clusters is complex (namespaces, deployments, policies, ...)
+
+### Edge
+
+* Limited resources
+* Cost implications of platform features
+* Real-time processing demands of platform features
+* Balancing act between resources used by workloads vs. platform features (20-25%)
+
+### The classic flow
+
+1. New feature gets requested
+2. Use kubebuilder with the SDK to create the operator
+3. Create namespaces and configs in all clusters
+4. Deploy operator to all clusters
+
+## Possible Solution
+
+### Centralized Control Plane
+
+* Problem: The controller implementation is limited to a cluster boundary
+* Idea: Why not create a single operator that can manage multiple edge clusters?
+* Implementation: Just modify kubebuilder to accept multiple clients (and caches)
+* Result: It works -> Simpler deployment and troubleshooting
+* Concerns: High code complexity -> Long familiarization
+* Finding the balance between a "simple central operator" and operator complexity is hard
+
+### Attempt: a bit more like kubebuilder
+
+* Each cluster has its own manager
+* There is a central multimanager that starts all of the cluster-specific managers
+* Controller registration to the manager now handles cluster names
+* The reconciler knows which cluster it is working on
+* The multi-cluster management basically just gets all of the cluster secrets and creates a manager + controller for each cluster secret
+* Challenges: Network connectivity
+* Solutions:
+  * Dynamic add/remove of clusters with Go channels to prevent pod restarts
+  * Connectivity health checks -> On connection loss, a manager recreation gets triggered
+
+```mermaid
+flowchart TD
+    mcm-->m1
+    mcm-->m2
+    mcm-->m3
+```
+
+```mermaid
+flowchart LR
+    secrets-->ch(go channels)
+    ch-->|CREATE|create(Create manager + Add controller + Start manager)
+    ch-->|UPDATE|update(Stop manager + Create manager + Add controller + Start manager)
+    ch-->|DELETE|delete(Stop manager)
+```
+
+## Conclusion
+
+* Acknowledge resource constraints on the edge
+* Embrace open source adoption instead of building your own
+* Simplify deployment
+* Recognize your own opinionated approach and its use cases
diff --git a/content/day4/07_fluentbit.md b/content/day4/07_fluentbit.md
new file mode 100644
index 0000000..e70d177
--- /dev/null
+++ b/content/day4/07_fluentbit.md
@@ -0,0 +1,78 @@
+---
+title: "Fluent Bit v3: Unified Layer for Logs, Metrics and Traces"
+weight: 7
+---
+
+The last talk of the conference.
+Notes may be a bit unstructured due to a tired note taker.
+
+## Background
+
+* Fluentd has already graduated
+* Fluent Bit is a sub-project of Fluentd (also graduated)
+
+## Basics
+
+* Fluent Bit is compatible with
+  * Prometheus (it can replace the Prometheus scraper and node exporter)
+  * OpenMetrics
+  * OpenTelemetry (HTTPS input/output)
+* Fluent Bit can export to Prometheus, Splunk, InfluxDB or others
+* So pretty much it can be used to collect data from a bunch of sources and pipe it out to different backend destinations
+* Fluent ecosystem: No vendor lock-in for observability
+
+### Architectures
+
+* The Fluent agent collects data and can send it to one or multiple locations
+* Fluent Bit can be used for aggregation from other sources
+
+### In the Kubernetes logging ecosystem
+
+* Pods log to the console -> Streamed stdout/stderr gets piped to a file
+* The logs in the file get encoded as JSON with metadata (date, channel)
+* Labels and annotations only live in the control plane -> You have to collect them additionally -> Expensive
+
+## New stuff
+
+### Limitations with classic architectures
+
+* Problem: Multiple filters slow down the main loop
+
+```mermaid
+flowchart LR
+    subgraph main[Main Thread/Event loop]
+        buffer
+        schedule
+        retry
+        filter1
+        filter2
+        filter3
+    end
+    in-->|pipe in data|main
+    main-->|filter and pipe out|out
+```
+
+### Solution
+
+* Solution: Processor - a separate thread segmented by telemetry type
+* Plugins can be written in your favourite language (C, Rust, Go, ...)
+
+```mermaid
+flowchart LR
+    subgraph in
+        reader
+        processor1
+        processor2
+        processor3
+    end
+    in-->|pipe in data|main(Main Thread/Event loop)
+    main-->|filter and pipe out|out
+```
+
+### General new features in v3
+
+* Native HTTP/2 support in core
+* Content modifier with multiple operations (insert, upsert, delete, rename, hash, extract, convert)
+* Metrics selector (include or exclude metrics) with matcher (name, prefix, substring, regex)
+* SQL processor -> Use SQL expressions for selections (instead of filters)
+* Better OpenTelemetry output
diff --git a/content/lessons_learned/01_operators.md b/content/lessons_learned/01_operators.md
index 6b5341e..b8726d1 100644
--- a/content/lessons_learned/01_operators.md
+++ b/content/lessons_learned/01_operators.md
@@ -4,4 +4,8 @@ title: Operators
 
 ## Observability
 
-* Export reconcile loop steps as opentelemetry traces
\ No newline at end of file
+* Export reconcile loop steps as opentelemetry traces
+
+## Work queue
+
+* Go channels as queues