diff --git a/content/day2/04_sponsored_ai_platform.md b/content/day2/04_sponsored_ai_platform.md
index f935a04..6dbdfd5 100644
--- a/content/day2/04_sponsored_ai_platform.md
+++ b/content/day2/04_sponsored_ai_platform.md
@@ -1,5 +1,5 @@
 ---
-title: Sponsored: Build an open source platform for ai/ml
+title: "Sponsored: Build an open source platform for ai/ml"
 weight: 4
 ---

diff --git a/content/day2/07_is_your_image_distroless.md b/content/day2/07_is_your_image_distroless.md
index 6990eab..a3a924d 100644
--- a/content/day2/07_is_your_image_distroless.md
+++ b/content/day2/07_is_your_image_distroless.md
@@ -1,6 +1,6 @@
 ---
 title: Is your image really distroless?
-weight:7
+weight: 7
 ---

 Laurent Goderre from Docker.
diff --git a/content/day2/08_multicloud_saas.md b/content/day2/08_multicloud_saas.md
new file mode 100644
index 0000000..07fcf06
--- /dev/null
+++ b/content/day2/08_multicloud_saas.md
@@ -0,0 +1,98 @@
+---
+title: Building a large-scale multi-cloud multi-region SaaS platform with Kubernetes controllers
+weight: 8
+---
+
+> Interchangeable wording in this talk: controller == operator
+
+A talk by Elastic.
+
+## About Elastic
+
+* Elastic Cloud as a managed service
+* Deployed across AWS/GCP/Azure in over 50 regions
+* 600,000+ containers
+
+### Elastic and Kube
+
+* They offer Elastic Observability
+* They offer the ECK operator for simplified deployments
+
+## The baseline
+
+* Goal: A large-scale (1M+ containers), resilient platform on k8s
+* Architecture
+  * Global Control: The control plane (API) for users, with controllers
+  * Regional Apps: The "shitload" of Kubernetes clusters where the actual customer services live
+
+## Scalability
+
+* Challenge: How large can our clusters be, and how many clusters do we need?
+* Problem: Only basic guidelines exist for that
+* Decision: Horizontally scale the number of clusters (500-1K nodes each)
+* Decision: Disposable clusters
+  * Throw away without data loss
+  * Single source of truth is not the cluster's etcd but external -> No etcd backups needed
+  * Everything can be recreated at any time
+
+## Controllers
+
+{{% notice style="note" %}}
+I won't copy the explanations of operators/controllers into these notes
+{{% /notice %}}
+
+* Many different controllers, including (but not limited to)
+  * Cluster controller: Register cluster to controller
+  * Project controller: Schedule a user's project to a cluster
+  * Product controllers (Elasticsearch, Kibana, etc.)
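+
+To make the level-triggered idea concrete, here is a minimal reconciler sketch of my own (not from the talk) built on the official controller-runtime lib; it reconciles plain ConfigMaps as a stand-in for a real Project CR:
+
+```go
+package main
+
+import (
+    "context"
+
+    corev1 "k8s.io/api/core/v1"
+    ctrl "sigs.k8s.io/controller-runtime"
+    "sigs.k8s.io/controller-runtime/pkg/client"
+)
+
+// reconciler is level-triggered: it only looks at the current state of the
+// object, never at the event that woke it up.
+type reconciler struct {
+    client.Client
+}
+
+func (r *reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
+    var obj corev1.ConfigMap // stand-in for a real "Project" CR
+    if err := r.Get(ctx, req.NamespacedName, &obj); err != nil {
+        // Object gone -> nothing to do; other errors are retried by the
+        // work queue with exponential backoff.
+        return ctrl.Result{}, client.IgnoreNotFound(err)
+    }
+
+    // ... compare desired state with actual state and converge here ...
+
+    return ctrl.Result{}, nil
+}
+
+func main() {
+    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
+    if err != nil {
+        panic(err)
+    }
+    // The work queue behind this controller dedups keys and handles retry
+    // backoff automatically.
+    if err := ctrl.NewControllerManagedBy(mgr).
+        For(&corev1.ConfigMap{}).
+        Complete(&reconciler{Client: mgr.GetClient()}); err != nil {
+        panic(err)
+    }
+    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
+        panic(err)
+    }
+}
+```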
+  * Ingress/cert-manager
+* Sometimes controllers depend on controllers -> potential complexity
+* Pro:
+  * Resilient (self-healing)
+  * Level-triggered (desired state vs procedure-triggered)
+  * Simple reasoning when comparing desired state vs a state machine
+  * Official controller-runtime lib (see the sketch above)
+* Work queue: automatic dedup, retry backoff and so on
+
+## Global Controllers
+
+* Basic operation
+  * Uses the project config from Elastic Cloud as the desired state
+  * The actual state is a k8s resource in another cluster
+* Challenge: Where is the source of truth if the data is not stored in etcd?
+* Solution: External datastore (Postgres)
+* Challenge: How do we sync the DB contents to Kubernetes?
+* Potential solution: Replace etcd with the external DB
+* Chosen solution:
+  * The controllers don't use CRDs for storage, but they expose a web API
+  * Reconciliation now interacts with the external DB and Go channels (queue) instead
+  * Then the CRs for the operators get created by the global controller
+
+### Large scale
+
+* Problem: Reconcile gets triggered for all objects on restart -> Make sure nothing gets missed and everything is handled by the latest controller version
+* Idea: Just create more workers for 100K+ objects
+* Problem: CPU go brrr and the DB gets overloaded
+* Problem: If you create an item during a restart, it suddenly sits at the end of a 100K+ item work queue
+
+### Reconcile
+
+* User-driven events are processed asap
+* Reconcile of everything should still happen, but with low prio, slowly in the background
+* Solution: Status: LastReconciledRevision (timestamp) gets compared to the revision; if the revision is larger -> user change
+* Prioritization: Just a custom event handler with the normal queue and a low-prio queue
+* Low-prio queue: Just a queue that adds items to the normal work queue with a rate limit (rough Go sketch below the diagram)
+
+```mermaid
+flowchart LR
+    low-->rl(ratelimit)
+    rl-->wq(work queue)
+    wq-->controller
+    high-->wq
+```
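+
+A rough sketch of my own (not from the talk) of such a low-priority lane, using the client-go work queue helpers; the rate is made up:
+
+```go
+package main
+
+import (
+    "context"
+
+    "golang.org/x/time/rate"
+    "k8s.io/client-go/util/workqueue"
+)
+
+func main() {
+    // Normal work queue that the controller workers consume; user-driven
+    // events ("high" in the diagram) are added here directly.
+    wq := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
+
+    // Low-priority queue for the background "reconcile everything" pass.
+    low := workqueue.New()
+
+    // Drain the low-prio queue into the normal one through a rate limiter,
+    // e.g. at most 10 background items per second.
+    limiter := rate.NewLimiter(rate.Limit(10), 1)
+    go func() {
+        for {
+            item, shutdown := low.Get()
+            if shutdown {
+                return
+            }
+            _ = limiter.Wait(context.Background())
+            wq.Add(item)
+            low.Done(item)
+        }
+    }()
+
+    // ... controller workers keep calling wq.Get()/wq.Done() as usual ...
+}
+```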
+
+## Related
+
+* Argo for CI/CD
+* Crossplane for cluster autoprovisioning
diff --git a/content/day2/09_safety_usability_auth.md b/content/day2/09_safety_usability_auth.md
new file mode 100644
index 0000000..71b8c5b
--- /dev/null
+++ b/content/day2/09_safety_usability_auth.md
@@ -0,0 +1,85 @@
+---
+title: "Safety or usability: Why not both? Towards referential auth in k8s"
+weight: 9
+---
+
+A talk by Google and Microsoft with the premise of better auth in k8s.
+
+## Baselines
+
+* Most access controllers have read access to all secrets -> They are not really designed for keeping these secrets
+* Result: CVEs
+* Example: Just use ingress-nginx, put some Lua code into the config and voilà: service account token
+* Fix: No more fun
+
+## Basic solutions
+
+* Separate control (the controller) from data (the ingress)
+* Namespace-limited ingress
+
+## Current state of cross-namespace stuff
+
+* Why: Reference a TLS cert for the Gateway API in the cert team's namespace
+* Why: Move all ingress configs to one namespace
+* Classic solution: Annotations in Contour that reference a namespace containing all certs (rewrites secret to certs/secret)
+* Gateway solution:
+  * The Gateway TLS secret ref includes a namespace
+  * ReferenceGrant pretty much allows referencing from X (Gateway) to Y (Secret) - see the example below
+* Limits:
+  * Has to be implemented via controllers
+  * The controllers still have read-all - they just check whether they are supposed to do this
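+
+For reference, a minimal example of my own (not from the talk) of what today's Gateway API ReferenceGrant looks like; namespaces and names are made up. It lives in the namespace that owns the Secret and allows Gateways from one namespace to read a specific TLS Secret:
+
+```yaml
+apiVersion: gateway.networking.k8s.io/v1beta1
+kind: ReferenceGrant
+metadata:
+  name: allow-gateways-to-wildcard-cert
+  namespace: cert-team              # namespace that owns the Secret
+spec:
+  from:                             # who may reference...
+    - group: gateway.networking.k8s.io
+      kind: Gateway
+      namespace: edge-gateways      # ...and from where
+  to:                               # ...what may be referenced
+    - group: ""                     # core API group
+      kind: Secret
+      name: wildcard-tls            # optional: limit the grant to a single Secret
+```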
+
+## Goals
+
+### Global
+
+* Grant controllers access to only the resources relevant for them (using references and maybe class segmentation)
+* Allow for safe cross-namespace references
+* Make it easy for API devs to adopt it
+
+### Personas
+
+* Alex, API author
+* Kai, controller author
+* Rohan, resource owner
+
+### What our stakeholders want
+
+* Alex: Define relationships via ReferencePatterns
+* Kai: Specify controller identity (ServiceAccount), define the relationship API
+* Rohan: Define cross-namespace references (aka resource grants that allow access to their resources)
+
+## Result of the paper
+
+### Architecture
+
+* ReferencePattern: Where do I find the references? -> example: GatewayClass in the Gateway API
+* ReferenceConsumer: Who (identity) has access under which conditions?
+* ReferenceGrant: Allow specific references
+
+### POC
+
+* Minimum access: You only get access if the grant is there AND the reference actually exists
+* Their basic implementation works with the kube API
+
+### Open questions
+
+* Naming
+* Make people adopt this
+* What about namespace-scoped ReferenceConsumers?
+* Is there a need for RBAC verb support (not only read access)?
+
+## Alternative
+
+* Idea: Just extend RBAC Roles with a selector (match labels, etc.)
+* Problems:
+  * Requires changes to Kubernetes core auth
+  * Everything but list and watch is a pain
+  * How do you handle AND vs OR selection?
+  * Field selectors: They exist
+* Benefits: Simple controller implementation
+
+## Meanwhile
+
+* Prefer tools that support isolation between controller and data plane
+* Disable all non-needed features -> Especially scripting
\ No newline at end of file
diff --git a/content/day2/10_dev_ux.md b/content/day2/10_dev_ux.md
new file mode 100644
index 0000000..9a4d8fb
--- /dev/null
+++ b/content/day2/10_dev_ux.md
@@ -0,0 +1,34 @@
+---
+title: Developers Demand UX for K8s!
+weight: 10
+---
+
+A talk by UX and software people at Red Hat (the Podman team).
+The talk mainly followed the academic study process (aka this is the survey I did for my bachelor's/master's thesis).
+
+## Research
+
+* User research study including 11 devs and platform engineers over three months
+* Focus was on a new Podman Desktop feature
+* Experience range: 2-3 years average (low: no experience, high: old-school kube)
+* 16 questions regarding environment, workflow, debugging and pain points
+* Analysis: Affinity mapping
+
+## Findings
+
+* Where do I start when things are broken? -> There may be solutions, but devs don't know about them
+* Network debugging is hard b/c of the many layers, and problems occurring between CNI and infra are really hard -> Network topology issues are rare but hard
+* YAML indentation -> Tool support is needed for visualisation
+* YAML validation -> Just use validation in dev and GitOps
+* YAML cleanup -> Normalize YAML (order, anchors, etc.) for easy diffs
+* Inadequate security analysis (too verbose, non-issues are warnings) -> Realtime insights (and during dev)
+* Crash loop -> Identify stuck containers, simple debug containers
+* CLI vs GUI -> Enable an experience-level-oriented GUI, enhance in-time troubleshooting
+
+## General issues
+
+* No direct fs access
+* Multiple kubeconfigs
+* SaaS is sometimes only provided on kube, which sounds like complexity
+* Where do I begin my troubleshooting?
+* Interoperability/fragility with updates
diff --git a/content/day2/99_networking.md b/content/day2/99_networking.md
index 1a04830..402cc94 100644
--- a/content/day2/99_networking.md
+++ b/content/day2/99_networking.md
@@ -26,4 +26,29 @@ Who have I talked to today, are there any follow-ups or learnings?
 They will follow up
 {{% /notice %}}

-* We mostly talked about traefik hub as an API-portal
\ No newline at end of file
+* We mostly talked about traefik hub as an API-portal
+
+## Postman
+
+* I asked them about their new cloud-only stuff: They will keep their direction
+* They are also planning to work on info material on why Postman SaaS is not a big security risk
+
+## Mattermost
+
+{{% notice style="note" %}}
+I should follow up
+{{% /notice %}}
+
+* I talked about our problems with the Mattermost operator and was asked to get back to them with the errors
+* They're currently migrating the Mattermost cloud offering to ARM - therefore ARM support will be coming in the next months
+* The Mattermost guy had exactly the same problems with notifications and read/unread using Element
+
+## Vercel
+
+* Nice guys, talked a bit about convincing customers to switch to the edge
+* Also talked about policy validation
+
+## Renovate
+
+* The paid Renovate offering now includes build failure estimation
+* I was told not to buy it after telling the technical guy that we just use build pipelines as MR verification
diff --git a/content/day2/_index.md b/content/day2/_index.md
index 33edfa0..179f361 100644
--- a/content/day2/_index.md
+++ b/content/day2/_index.md
@@ -1,6 +1,7 @@
 ---
 archetype: chapter
 title: Day 2
+weight: 2
 ---

 Day two is also the official day one of KubeCon (Day one was just CloudNativeCon).
diff --git a/content/lessons_learned/99_checkout.md b/content/lessons_learned/99_checkout.md
index b7619df..19f580e 100644
--- a/content/lessons_learned/99_checkout.md
+++ b/content/lessons_learned/99_checkout.md
@@ -5,3 +5,4 @@ title: Check this out
 Just a loose list of stuff that souded interesting

 * Dapr
+* etcd backups