---
title: Building a large scale multi-cloud multi-region SaaS platform with kubernetes controllers
weight: 8
tags:
  - platform
  - operator
  - scaling
---

{{% button href="https://youtu.be/VhloarnpxVo" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}

> Interchangeable wording in this talk: controller == operator

A talk by Elastic.

## About Elastic

* Elastic Cloud as a managed service
* Deployed across AWS/GCP/Azure in over 50 regions
* 600000+ Containers

### Elastic and Kube

* They offer Elastic Observability
* They offer the ECK operator for simplified deployments

## The baseline

* Goal: A large scale (1M+ containers) resilient platform on k8s
* Architecture
  * Global Control: The control plane (API) for users with controllers
  * Regional Apps: The "shitload" of Kubernetes clusters where the actual customer services live

## Scalability

* Challenge: How large can our cluster be, how many clusters do we need
* Problem: Only basic guidelines exist for that
* Decision: Horizontally scale the number of clusters (500-1K nodes each)
* Decision: Disposable clusters
  * Throw away without data loss
  * Single source of truth is not cluster etcd but external -> No etcd backups needed
  * Everything can be recreated any time

## Controllers

{{% notice style="note" %}}
I won't copy the explanations of operators/controllers in these notes
{{% /notice %}}

* Many controllers, including (but not limited to)
  * Cluster controller: Register cluster to controller
  * Project controller: Schedule user's project to cluster
  * Product controllers (Elasticsearch, Kibana, etc.)
  * Ingress/Cert manager
* Sometimes controllers depend on controllers -> potential complexity
* Pro:
  * Resilient (Self-healing)
  * Level triggered (desired state vs procedure triggered)
  * Simple reasoning when comparing desired state vs state machine
  * Official controller runtime lib
* Workqueue: Automatic Dedup, Retry back off and so on
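The two properties above (level-triggered reconciliation and a deduplicating work queue) can be sketched in plain Go. This is only an illustration, not Elastic's code and not the controller-runtime API: `dedupQueue`, `reconcile` and the `es-a`/`es-old` keys are hypothetical names, and the "cluster" is just a map.

```go
package main

import "fmt"

// dedupQueue is a hypothetical, simplified stand-in for a controller work
// queue: a key that is already waiting is not queued a second time.
type dedupQueue struct {
	order   []string
	pending map[string]bool
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: make(map[string]bool)}
}

// Add enqueues key unless it is already waiting.
func (q *dedupQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get pops the oldest key; ok is false when the queue is empty.
func (q *dedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// reconcile is level triggered: it only compares desired and actual state,
// not which event caused the call, and returns the actions it took.
func reconcile(key string, desired, actual map[string]int) []string {
	want, wanted := desired[key]
	have, exists := actual[key]
	switch {
	case !wanted && exists:
		delete(actual, key)
		return []string{"delete " + key}
	case wanted && (!exists || have != want):
		actual[key] = want
		return []string{fmt.Sprintf("set %s=%d", key, want)}
	}
	return nil // already converged, nothing to do
}

func main() {
	desired := map[string]int{"es-a": 3}
	actual := map[string]int{"es-a": 1, "es-old": 2}

	q := newDedupQueue()
	for _, k := range []string{"es-a", "es-a", "es-old", "es-a"} {
		q.Add(k) // three events for es-a collapse into one queue entry
	}
	for k, ok := q.Get(); ok; k, ok = q.Get() {
		fmt.Println(k, reconcile(k, desired, actual))
	}
}
```

Calling `reconcile` a second time for the same key does nothing, which is what makes retries and full re-syncs safe.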

## Global Controllers

* Basic operation
  * Uses project config from Elastic cloud as the desired state
  * The actual state is a k8s resource in another cluster
* Challenge: Where is the source of truth if the data is not stored in etcd
* Solution: External data store (Postgres)
* Challenge: How do we sync the db sources to Kubernetes
* Potential solutions: Replace etcd with the external db
* Chosen solution:
  * The controllers don't use CRDs for storage, but they expose a web-API
  * Reconciliation still happens, but it now interacts with the external db and Go channels (queue) instead
  * Then the CRs for the operators get created by the global controller
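A rough sketch of how such a global controller loop could look, assuming an in-memory map in place of Postgres and a plain struct in place of the regional cluster API. All names (`Project`, `Store`, `Regional`, `run`) are made up for illustration and are not taken from the talk.

```go
package main

import "fmt"

// Project is a hypothetical desired-state record. In the talk it lives in an
// external store (Postgres); here it is kept in memory for illustration.
type Project struct {
	ID       string
	Revision int
	Spec     string
}

// Store stands in for the external database (assumed interface).
type Store interface {
	Get(id string) (Project, bool)
}

type memStore map[string]Project

func (m memStore) Get(id string) (Project, bool) {
	p, ok := m[id]
	return p, ok
}

// Regional stands in for a regional cluster's API; Apply plays the role of
// creating or updating the CR that the product operators then act on.
type Regional struct {
	CRs map[string]Project
}

func (r *Regional) Apply(p Project)  { r.CRs[p.ID] = p }
func (r *Regional) Delete(id string) { delete(r.CRs, id) }

// run drains the Go channel (the queue) and reconciles each project ID:
// desired state comes from the store, actual state from the regional cluster.
func run(queue <-chan string, store Store, region *Regional) {
	for id := range queue {
		p, ok := store.Get(id)
		if !ok {
			region.Delete(id) // no longer desired, remove the CR
			continue
		}
		if cur, exists := region.CRs[id]; !exists || cur.Revision != p.Revision {
			region.Apply(p)
		}
	}
}

func main() {
	store := memStore{
		"proj-1": {ID: "proj-1", Revision: 2, Spec: "elasticsearch: 3 nodes"},
	}
	region := &Regional{CRs: map[string]Project{
		"proj-gone": {ID: "proj-gone", Revision: 1},
	}}

	queue := make(chan string, 8)
	queue <- "proj-1"
	queue <- "proj-gone"
	close(queue)

	run(queue, store, region)
	fmt.Println(len(region.CRs), region.CRs["proj-1"].Spec)
}
```

Because nothing authoritative is stored in the regional cluster, a lost cluster can be rebuilt by replaying every project ID through the same loop.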

### Large scale

* Problem: Reconcile gets triggered for all objects on restart (this makes sure nothing gets missed and every object is processed by the latest controller version)
* Idea: Just create more workers for 100K+ Objects
* Problem: CPU go brrr and db gets overloaded
* Problem: If you create an item during restart, suddenly it is at the end of a 100K+ item work-queue

### Reconcile

* User-driven events are processed asap
* Reconcile of everything should still happen, but slowly in the background with low priority
* Solution: Status: LastReconciledRevision (timestamp) gets compared to the revision, if the revision is larger -> User change
* Prioritization: Just a custom event handler with the normal queue and a low-priority queue
* Low-priority queue: Just a queue that adds items to the normal work-queue with a rate limit

```mermaid
flowchart LR
    low-->rl(ratelimit)
    rl-->wq(work queue)
    wq-->controller
    high-->wq
```

## Related

* Argo for CI/CD
* Crossplane for cluster autoprovision