Compare commits

2 commits: 78b3826cbb...00d8ae29c4

| Author | SHA1 | Date |
|---|---|---|
| Nicolai Ort | 00d8ae29c4 | |
| Nicolai Ort | 33f615aaf0 | |
@@ -1,5 +1,5 @@
 ---
-title: Sponsored: Build an open source platform for ai/ml
+title: "Sponsored: Build an open source platform for ai/ml"
 weight: 4
 ---
@@ -1,6 +1,6 @@
 ---
 title: Is your image really distroless?
-weight:7
+weight: 7
 ---

 Laurent Goderre from Docker.
@@ -0,0 +1,98 @@
---
title: Building a large scale multi-cloud multi-region SaaS platform with kubernetes controllers
weight: 8
---

> Interchangeable wording in this talk: controller == operator

A talk by Elastic.

## About Elastic

* Elastic Cloud as a managed service
* Deployed across AWS/GCP/Azure in over 50 regions
* 600,000+ containers

### Elastic and Kube

* They offer Elastic Observability
* They offer the ECK operator for simplified deployments

## The baseline

* Goal: A large-scale (1M+ containers) resilient platform on k8s
* Architecture
  * Global Control: The control plane (API) for users, driven by controllers
  * Regional Apps: The "shitload" of kubernetes clusters where the actual customer services live

## Scalability

* Challenge: How large can our clusters be, and how many clusters do we need?
* Problem: Only basic guidelines exist for that
* Decision: Horizontally scale the number of clusters (500-1K nodes each)
* Decision: Disposable clusters
  * Throw away without data loss
  * The single source of truth is not the cluster's etcd but an external store -> no etcd backups needed
  * Everything can be recreated at any time

## Controllers

{{% notice style="note" %}}
I won't copy the explanations of operators/controllers into these notes
{{% /notice %}}

* Many different controllers, including (but not limited to)
  * Cluster controller: registers clusters with the global control plane
  * Project controller: schedules a user's project to a cluster
  * Product controllers (Elasticsearch, Kibana, etc.)
  * Ingress/cert-manager
* Sometimes controllers depend on controllers -> potential complexity
* Pro:
  * Resilient (self-healing)
  * Level-triggered (converge on desired state) vs. procedure-triggered
  * Simple reasoning when comparing desired state vs a state machine
* Official controller-runtime lib
  * Workqueue: automatic dedup, retry backoff and so on
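
If you haven't seen controller-runtime before, the basic wiring looks roughly like this (my own minimal sketch, not code from the talk; the `ProjectReconciler` name and the ConfigMap watch are illustrative):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ProjectReconciler is an illustrative reconciler; controller-runtime's
// workqueue gives it dedup and retry-with-backoff for free.
type ProjectReconciler struct {
	client.Client
}

func (r *ProjectReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cm corev1.ConfigMap
	if err := r.Get(ctx, req.NamespacedName, &cm); err != nil {
		// Returning an error re-enqueues the key with backoff.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ... compare desired state vs observed state and converge ...
	return ctrl.Result{}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	r := &ProjectReconciler{Client: mgr.GetClient()}
	if err := ctrl.NewControllerManagedBy(mgr).For(&corev1.ConfigMap{}).Complete(r); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```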

## Global Controllers

* Basic operation
  * Uses the project config from Elastic Cloud as the desired state
  * The actual state is a k8s resource in another cluster
* Challenge: Where is the source of truth if the data is not stored in etcd?
  * Solution: External datastore (Postgres)
* Challenge: How do we sync the DB contents to kubernetes?
  * Potential solution: Replace etcd with the external DB
  * Chosen solution:
    * The controllers don't use CRDs for storage, but they expose a web API
    * Reconciliation now interacts with the external DB and Go channels (queue) instead
    * Then the CRs for the operators get created by the global controller
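
A minimal sketch of that flow as I understood it (all names such as `ProjectStore`, `LoadProject` and `buildElasticsearchCR` are my own placeholders, not Elastic's actual code):

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Project is a placeholder for a row in the external datastore.
type Project struct{ Name string }

// ProjectStore is a stand-in for the Postgres-backed source of truth.
type ProjectStore interface {
	LoadProject(ctx context.Context, name string) (Project, error)
}

// GlobalReconciler sketches a controller whose desired state lives in an
// external datastore instead of etcd-backed CRDs.
type GlobalReconciler struct {
	DB             ProjectStore
	RegionalClient client.Client // client for the target regional cluster
}

func (r *GlobalReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Read the desired state from the external DB, not from a CRD.
	desired, err := r.DB.LoadProject(ctx, req.Name)
	if err != nil {
		return ctrl.Result{}, err
	}
	// 2. Materialize the CR for the regional operators (e.g. an ECK
	//    Elasticsearch resource) via server-side apply.
	cr := buildElasticsearchCR(desired)
	err = r.RegionalClient.Patch(ctx, cr, client.Apply, client.FieldOwner("global-controller"))
	return ctrl.Result{}, err
}

// buildElasticsearchCR is a stub that would translate a DB row into a CR.
func buildElasticsearchCR(p Project) client.Object {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "elasticsearch.k8s.elastic.co", Version: "v1", Kind: "Elasticsearch",
	})
	u.SetName(p.Name)
	u.SetNamespace("default")
	return u
}
```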

### Large scale

* Problem: Reconcile gets triggered for all objects on restart -> make sure nothing gets missed and everything is handled by the latest controller version
* Idea: Just create more workers for 100K+ objects
  * Problem: CPU goes brrr and the DB gets overloaded
  * Problem: If you create an item during a restart, it suddenly sits at the end of a 100K+ item work-queue

### Reconcile

* User-driven events are processed asap
* Reconcile of everything should still happen, but with low prio, slowly in the background
* Solution: Status.LastReconciledRevision (timestamp) gets compared to the revision; if the revision is larger -> user change
* Prioritization: Just a custom event handler with the normal queue and a low-prio queue
* Low-prio queue: Just a queue that adds items to the normal work-queue with a rate limit

```mermaid
flowchart LR
low-->rl(ratelimit)
rl-->wq(work queue)
wq-->controller
high-->wq
```
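
The talk showed no code, but such a low-prio queue could be sketched with client-go's workqueue package roughly like this (my own illustration):

```go
package controllers

import (
	"context"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// forwardLowPrio drains a background queue into the main work-queue
// through a rate limiter, so full resyncs never starve user-driven
// events (which are added to the main queue directly).
func forwardLowPrio(ctx context.Context, low, main workqueue.Interface, limiter *rate.Limiter) {
	for {
		item, shutdown := low.Get()
		if shutdown {
			return
		}
		// Block until the limiter allows another background item.
		if err := limiter.Wait(ctx); err != nil {
			low.Done(item)
			return // context cancelled
		}
		main.Add(item)
		low.Done(item)
	}
}
```

With e.g. `rate.NewLimiter(rate.Limit(10), 1)` the background resync trickles in at roughly 10 objects/second while user events take the direct path into the work-queue.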

## Related

* Argo for CI/CD
* Crossplane for cluster auto-provisioning
@@ -0,0 +1,85 @@
---
title: "Safety or usability: Why not both? Towards referential auth in k8s"
weight: 9
---

A talk by Google and Microsoft with the premise of better auth in k8s.

## Baselines

* Most access controllers have read access to all secrets -> they are not really designed for keeping these secrets safe
  * Result: CVEs
  * Example: Just use ingress-nginx, put some Lua code into the config and voilà: service account token
  * Fix: No more fun

## Basic solutions

* Separate control (the controller) from data (the ingress)
* Namespace-limited ingress

## Current state of cross-namespace stuff

* Why: Reference a TLS cert for the Gateway API that lives in the cert team's namespace
* Why: Move all ingress configs to one namespace
* Classic solution: An annotation in Contour that references a namespace containing all certs (rewrites secret to certs/secret)
* Gateway solution (see the sketch after this list):
  * The Gateway TLS secret ref includes a namespace
  * ReferenceGrant pretty much allows referencing from X (Gateway) to Y (Secret)
* Limits:
  * Has to be implemented via controllers
  * The controllers still have read-all; they just check whether they are supposed to act
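
For illustration, this is roughly what such a grant looks like when built with the upstream Gateway API Go types (my own minimal example; namespaces and names are made up):

```go
package controllers

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayv1beta1 "sigs.k8s.io/gateway-api/apis/v1beta1"
)

// ExampleReferenceGrant allows Gateways in "gateway-ns" to reference
// Secrets (e.g. TLS certs) in the cert team's namespace "certs".
var ExampleReferenceGrant = gatewayv1beta1.ReferenceGrant{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "allow-gateway-to-certs", // made-up name
		Namespace: "certs",                  // lives in the *target* namespace
	},
	Spec: gatewayv1beta1.ReferenceGrantSpec{
		From: []gatewayv1beta1.ReferenceGrantFrom{{
			Group:     "gateway.networking.k8s.io",
			Kind:      "Gateway",
			Namespace: "gateway-ns",
		}},
		To: []gatewayv1beta1.ReferenceGrantTo{{
			Group: "", // core API group
			Kind:  "Secret",
		}},
	},
}
```

The grant lives in the target namespace, so the owner of the Secret stays in control of who may reference it.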

## Goals

### Global

* Grant controllers access only to the resources relevant to them (using references and maybe class segmentation)
* Allow for safe cross-namespace references
* Make it easy for API devs to adopt it

### Personas

* Alex, API author
* Kai, controller author
* Rohan, resource owner

### What our stakeholders want

* Alex: Define relationships via ReferencePatterns
* Kai: Specify the controller identity (service account), define the relationship API
* Rohan: Define cross-namespace references (aka resource grants that allow access to their resources)

## Result of the paper

### Architecture

* ReferencePattern: Where do I find the references? -> Example: GatewayClass in the Gateway API
* ReferenceConsumer: Who (identity) has access under which conditions?
* ReferenceGrant: Allow specific references

### POC

* Minimum access: You only get access if the grant exists AND the reference actually exists
* Their basic implementation works with the kube API

### Open questions

* Naming
* How to make people adopt this
* What about namespace-scoped ReferenceConsumers?
* Is there a need for RBAC verb support (not only read access)?

## Alternative

* Idea: Just extend RBAC Roles with a selector (match labels, etc.)
* Problems:
  * Requires changes to kubernetes core auth
  * Everything but list and watch is a pain
  * How do you handle AND vs OR selection?
  * Field selectors: they exist
* Benefits: Simple controller implementation

## Meanwhile

* Prefer tools that support isolation between controller and dataplane
* Disable all non-needed features -> especially scripting
@@ -0,0 +1,34 @@
---
title: Developers Demand UX for K8s!
weight: 10
---

A talk by UX and software people at RedHat (Podman team).
The talk mainly followed the academic study process (aka this is the survey I did for my bachelor's/master's thesis).

## Research

* User research study with 11 devs and platform engineers over three months
* Focus was on a new Podman Desktop feature
* Experience range: 2-3 years average (low: no experience, high: old-school kube)
* 16 questions regarding environment, workflow, debugging and pain points
* Analysis: Affinity mapping

## Findings

* Where do I start when things are broken? -> There may be solutions, but devs don't know about them
* Network debugging is hard because of the many layers; problems occurring between CNI and infra are really hard -> network topology issues are rare but hard
* YAML indentation -> tool support is needed for visualisation
* YAML validation -> just use validation in dev and gitops
* YAML cleanup -> normalize YAML (order, anchors, etc.) for easy diffs (see the sketch after this list)
* Inadequate security analysis (too verbose, non-issues are warnings) -> realtime insights (and during dev)
* Crash loops -> identify stuck containers, simple debug containers
* CLI vs GUI -> enable an experience-level-oriented GUI, enhance in-time troubleshooting
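
A tiny sketch of such a normalization step (my own illustration; `sigs.k8s.io/yaml` round-trips through JSON, so map keys come out sorted and anchors get expanded):

```go
package main

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

// normalize re-serializes a YAML document into a canonical form
// (sorted keys, consistent indentation) so that diffs stay small.
func normalize(in []byte) ([]byte, error) {
	var doc map[string]interface{}
	if err := yaml.Unmarshal(in, &doc); err != nil {
		return nil, err
	}
	return yaml.Marshal(doc)
}

func main() {
	messy := []byte("metadata:\n  name: demo\napiVersion: v1\nkind: ConfigMap\n")
	out, err := normalize(messy)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // apiVersion, kind, metadata in sorted order
}
```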

## General issues

* No direct fs access
* Multiple kubeconfigs
* SaaS is sometimes only provided on kube, which sounds like complexity
* Where do I begin my troubleshooting?
* Interoperability/fragility with updates
@@ -0,0 +1,153 @@
---
title: Comparing sidecarless service mesh from cilium and istio
weight: 11
---

A talk by the global field CTO at Solo.io, with a hint of service mesh background.

## History

* Linkerd 1.x was the first modern service mesh and basically an opt-in service proxy
* Challenges: JVM (size), latencies, ...

### Why not node-proxy?

* Per-node resource consumption is unpredictable
* Per-node proxy must ensure fairness
* Blast radius is always the entire node
* Per-node proxy is a fresh attack vector

### Why sidecar?

* Transparent (ish)
* Part of app lifecycle (up/down)
* Single tenant
* No noisy neighbor

### Sidecar drawbacks

* Race conditions
* Security of certs/keys
* Difficult sizing
* Apps need to be proxy-aware
* Can be circumvented
* Challenging upgrades (infra and app live side by side)

## Our lord and savior

* Potential solution: eBPF
* Problem: Not quite the perfect solution
* Result: We still need an L7 proxy (but some L4 stuff can be implemented in the kernel)

### Why sidecarless

* Full transparency
* Optimized networking
* Lower resource allocation
* No race conditions
* No manual pod injection
* No credentials in the app

## Architecture

* Control Plane
* Data Plane
* mTLS
* Observability
* Traffic Control

## Cilium

### Basics

* CNI with eBPF on L3/4
* A lot of nice observability
* Kube-proxy replacement
* Ingress (via Gateway API)
* Mutual authentication
* Specialized CiliumNetworkPolicy
* Configure Envoy through Cilium

### Control Plane

* Cilium agent on each node that reacts to scheduled workloads by programming the local dataplane
* API via Gateway API and CiliumNetworkPolicy

```mermaid
flowchart TD
subgraph kubeserver
kubeapi
end
subgraph node1
kubeapi<-->control1
control1-->data1
end
subgraph node2
kubeapi<-->control2
control2-->data2
end
subgraph node3
kubeapi<-->control3
control3-->data3
end
```

### Data plane

* Configured by the control plane
* Does all of the eBPF things in L4
* Does all of the Envoy things in L7
* In-kernel WireGuard for optional transparent encryption

### mTLS

* Network policies get applied at the eBPF layer (check if identity A can talk to identity B)
* When mTLS is enabled, there is an auth check in advance -> if it fails, proceed with the agents
* Agents talk to each other for the mTLS auth and save the result to a cache -> now eBPF can say yes
* Problem: The caches can lead to identity confusion

## Istio

### Basics

* L4/7 service mesh without its own CNI
* Based on Envoy
* mTLS
* Classically via sidecar, nowadays also sidecarless (ambient mode)

### Ambient mode

* Separates L4 and L7 -> can run on Cilium
* mTLS
* Gateway API

### Control plane

```mermaid
flowchart TD
kubeapi-->xDS

xDS-->dataplane1
xDS-->dataplane2

subgraph node1
dataplane1
end

subgraph node2
dataplane2
end
```

* Central xDS control plane
* Per-node dataplane that reads updates from the control plane

### Data Plane

* L4 runs via the zTunnel daemonset, which handles mTLS
* The zTunnel traffic gets handed over to the CNI
* The L7 proxy lives somewhere™ and traffic gets routed through it as an "extra hop" aka waypoint (see the sketch below)
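
My rough sketch of the resulting packet path (as I understood it, not an official diagram):

```mermaid
flowchart LR
pod1-->ztunnel1(zTunnel node A)
ztunnel1-->waypoint(waypoint L7 proxy)
waypoint-->ztunnel2(zTunnel node B)
ztunnel2-->pod2
```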

### mTLS

* The zTunnel creates an HBONE (HTTP-Based Overlay Network Environment) tunnel with mTLS
@@ -26,4 +26,29 @@ Who have I talked to today, are there any follow-ups or learnings?
They will follow up
{{% /notice %}}

* We mostly talked about traefik hub as an API portal

## Postman

* I asked them about their new cloud-only stuff: they will keep their direction
* They are also planning to work on info material on why Postman SaaS is not a big security risk

## Mattermost

{{% notice style="note" %}}
I should follow up
{{% /notice %}}

* I talked about our problems with the Mattermost operator and was asked to get back to them with the errors
* They're currently migrating the Mattermost cloud offering to ARM - therefore ARM support will be coming in the next months
* The Mattermost guy had exactly the same problems with notifications and read/unread using Element

## Vercel

* Nice guys, talked a bit about convincing customers to switch to the edge
* Also talked about policy validation

## Renovate

* The paid Renovate offering now includes build failure estimation
* I was told not to buy it after telling the technical guy that we just use build pipelines as MR verification
@@ -1,6 +1,7 @@
 ---
 archetype: chapter
 title: Day 2
 weight: 2
 ---

+Day two is also the official day one of KubeCon (Day one was just CloudNativeCon).
@@ -5,3 +5,4 @@ title: Check this out
 Just a loose list of stuff that sounded interesting

 * Dapr
+* etcd backups