Compare commits
53 Commits
936a4c8c3a
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| b9060af72d | |||
| 3afb07e4c1 | |||
| 4becb06ad3 | |||
| 0e24bf4fd6 | |||
| f06c486182 | |||
| f71971e793 | |||
| a7a3817a03 | |||
| 47f7869257 | |||
| b2fd7a4c81 | |||
| 1213be7c30 | |||
| 1f49a42edc | |||
| c6f716ced1 | |||
| 09ac5a9051 | |||
| 5ed623d0ca | |||
| f8ca21416b | |||
| dc4dd2d883 | |||
| 957bc94344 | |||
| 44a3653c84 | |||
| 6bf47e49c5 | |||
| 39d92acdb4 | |||
| 4d528bf5de | |||
| d2f3f5f95d | |||
| 6d0c95a8ac | |||
| 3e4fbb616b | |||
| d9605d602e | |||
| 745e8f5896 | |||
| 78ca5973b8 | |||
| 77f34ed1ab | |||
| a36f562cf4 | |||
| 9ad9af0f9c | |||
| 4f39c1102c | |||
| df93624814 | |||
| 46b06c66fd | |||
| b4d8aa29c3 | |||
| 4cec1917bf | |||
| bd7d9fe87d | |||
| f4858d81a8 | |||
| bfcfe88cea | |||
| 45a26383e0 | |||
| 8dbdfd938f | |||
| 8941108720 | |||
| f8512dc6ae | |||
| c09bf8f637 | |||
| d90d5b8eab | |||
| 8b78108a60 | |||
| d09e3ff3d1 | |||
| 8ddf87d2f4 | |||
| 720d68803d | |||
| f0229abafd | |||
| 723051c498 | |||
| 7e6d0fc47f | |||
| fe8fa9693a | |||
| 8aab9217fe |
@@ -1,4 +1,4 @@
|
||||
FROM registry.odit.services/hub/hugomods/hugo:exts AS build
|
||||
FROM registry.odit.services/hub/hugomods/hugo:exts-0.145.0 AS build
|
||||
WORKDIR /app
|
||||
|
||||
COPY . /app/
|
||||
|
||||
@@ -6,5 +6,6 @@ tags:
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
TODO:
|
||||
@@ -10,11 +10,12 @@ This current version is probably full of typos - will fix later. This is what ty
|
||||
## How did I get there?
|
||||
|
||||
I attended Cloud Native Rejekts and KubeCon + CloudNativeCon Europe 2025 in London.
|
||||
This year I was sent there by my employer [DATEV eG](https://datev.de) - thanks again to everyone who helped me with getting this trip approved (you know who you are).
|
||||
|
||||
Why? Because learning about all the new things in the world of cloud is really important, and war stories help to avoid mistakes that others have already made.
|
||||
And [last year's experience](https://kubecon24.nicolai-ort.com) was really good, so I wanted to go again.
|
||||
|
||||
Plus I actually presented a talk at Cloud Native Rejekts.
|
||||
Plus I actually presented a talk at Cloud Native Rejekts 🥳.
|
||||
|
||||
## And how does this website get its content
|
||||
|
||||
@@ -24,9 +25,22 @@ graph LR
|
||||
Nicolai-->|"Takes notes (and typos) + commits"|Repo
|
||||
Repo-->|Triggers|Actions
|
||||
Actions-->|Builds image and pushes to|Registry
|
||||
Kubernetes-->|Pulls latest image|Registry
|
||||
Flux-->|Detects new image|Registry
|
||||
Flux-->|Rolls out new image|Kubernetes
|
||||
```
|
||||
|
||||
## Changelog™️
|
||||
|
||||
- 2025-03-28: Initial repo and deployment setup
|
||||
- 2025-03-30: First day of Cloud Native Rejekts
|
||||
- 2025-03-31: Second day of Cloud Native Rejekts
|
||||
- 2025-04-01: First day of KubeCon/CloudNativeCon
|
||||
- 2025-04-02: Second day of KubeCon/CloudNativeCon
|
||||
- 2025-04-03: Added video links for Cloud Native Rejekts
|
||||
- 2025-04-03: Third day of KubeCon/CloudNativeCon
|
||||
- 2025-04-04: Fourth day of KubeCon/CloudNativeCon
|
||||
- 2025-04-07: Added missing images and slide links for KubeCon/CloudNativeCon
|
||||
|
||||
## Style Guide
|
||||
|
||||
The basic structure is as follows: `day/event-or-session`.
|
||||
@@ -6,7 +6,8 @@ tags:
|
||||
- security
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=JAy6Ra0ulSw" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## Baseline
|
||||
|
||||
|
||||
@@ -2,10 +2,12 @@
|
||||
title: "The Hidden Brains of Kubernetes: Meet Controllers Powering the Cloud"
|
||||
weight: 2
|
||||
tags:
|
||||
- <tag>
|
||||
- rejekts
|
||||
- operator
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=PciVvE02L2w" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## Big Picture
|
||||
|
||||
|
||||
12
content/day-1/02_gslb.md
Normal file
@@ -0,0 +1,12 @@
|
||||
---
|
||||
title: Evaluating Global Load Balancing Options for Kubernetes in Practice
|
||||
weight: 2
|
||||
tags:
|
||||
- rejekts
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/RBMRU8rtxfI" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://github.com/nicolaiort/rejekts2025-gslb" style="tip" icon="code" %}}Demo-Code and more{{% /button %}}
|
||||
{{% button href="https://de.slideshare.net/slideshow/evaluating-global-load-balancing-options-for-kubernetes-in-practice-kubermatic-datev/277640385" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
My talk, notes will be released soon
|
||||
@@ -5,7 +5,8 @@ tags:
|
||||
- rejekts
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=DdQzGsiounY" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## The clans (popular solutions)
|
||||
|
||||
|
||||
@@ -2,11 +2,11 @@
|
||||
title: Understanding and Debugging DNS in Kubernetes Clusters
|
||||
weight: 4
|
||||
tags:
|
||||
- <tag>
|
||||
- rejekts
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://github.com/mqasimsarfraz/talks/tree/main/CloudNativeRejekts-2025" style="transparent" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
{{% button href="https://www.youtube.com/watch?v=awXjABDknww" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://github.com/mqasimsarfraz/talks/tree/main/CloudNativeRejekts-2025" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
|
||||
|
||||
|
||||
@@ -6,7 +6,8 @@ tags:
|
||||
- edge
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=jywpFlOH3z0" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## The far edge
|
||||
|
||||
|
||||
@@ -6,7 +6,8 @@ tags:
|
||||
- multicluster
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=w8rDxtrMGG8" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## Baseline Infra
|
||||
|
||||
@@ -48,4 +49,11 @@ TODO: Steal diagram from slides
|
||||
|
||||
## Demo
|
||||
|
||||
Pretty interesting, watch the video to find out
|
||||
Pretty interesting, watch the video to find out
|
||||
|
||||
|
||||
## Q&A
|
||||
|
||||
- Do you need a flat network: No, just expose the TCP LB
|
||||
- Did you think about using etcd to implement the leases instead of objects: They use managed hostplanes and don't want another etcd
|
||||
- Have you tried to commit upstream: Nope, pretty much not an option thanks to the managed control-plane not being able to set appropriate flags
|
||||
|
||||
@@ -9,10 +9,10 @@ This was another very interesting day and I can only recommend attending cloud n
|
||||
|
||||
## Talk recommendations
|
||||
|
||||
- My Talk: [Evaluating Global Load Balancing Options for Kubernetes in Practice](todo:)
|
||||
- Service Mesh Intro + Comparison: [The service mesh wars - a new hope for kubernetes](../03_service-mesh)
|
||||
- How to handle eviction and statefulness across clusters: [Scaling PDBs: Introducing Multi-Cluster Resilience with x-pdb](../06_scaling-pdbs)
|
||||
- Intro to operators: [The Hidden Brains of Kubernetes: Meet Controllers Powering the Cloud](../02_controllers)
|
||||
- My Talk: [Evaluating Global Load Balancing Options for Kubernetes in Practice](./02_gslb)
|
||||
- Service Mesh Intro + Comparison: [The service mesh wars - a new hope for kubernetes](./03_service-mesh)
|
||||
- How to handle eviction and statefulness across clusters: [Scaling PDBs: Introducing Multi-Cluster Resilience with x-pdb](./06_scaling-pdbs)
|
||||
- Intro to operators: [The Hidden Brains of Kubernetes: Meet Controllers Powering the Cloud](./02_controllers)
|
||||
|
||||
## Other stuff I learned or people I talked to
|
||||
|
||||
|
||||
@@ -7,5 +7,6 @@ tags:
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
Short opening keynote thanking volunteers and attendees.
|
||||
@@ -8,7 +8,8 @@ tags:
|
||||
- multicluster
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=r0W6cCJAGro" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
The talk started with a basic introduction to Cluster API and the operations at Giant Swarm.
|
||||
|
||||
|
||||
@@ -6,7 +6,8 @@ tags:
|
||||
- keynote
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=m9NRk-6MSvY" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
A short keynote from Microsoft about their contributions to open source and the tools they use:
|
||||
- infra (Kubernetes, Istio, Hyperlight)
|
||||
|
||||
@@ -6,7 +6,8 @@ tags:
|
||||
- multicluster
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=e1BmT0jc_Fs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## Background
|
||||
|
||||
|
||||
@@ -5,7 +5,8 @@ tags:
|
||||
- rejekts
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=CAPtQnH4rPY" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## Recruitment & Staffing
|
||||
|
||||
|
||||
@@ -5,7 +5,8 @@ tags:
|
||||
- rejekts
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=qNShvqSTKCU" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## Background: The state of cloud in Mauritius
|
||||
|
||||
|
||||
@@ -6,7 +6,8 @@ tags:
|
||||
- performance
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=EYipC5y-8rM" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
There were more details in the talk than I copied into these notes.
|
||||
Most of them were just too much to write down or application specific.
|
||||
|
||||
@@ -6,7 +6,8 @@ tags:
|
||||
- crossplane
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=D4bKe4rAasc" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
Joint effort of Novo Nordisk and Upbound.
|
||||
|
||||
|
||||
@@ -6,7 +6,8 @@ tags:
|
||||
- security
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=rJacyDygVi0" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## Why does e2e authenticity matter?
|
||||
|
||||
|
||||
@@ -5,7 +5,8 @@ tags:
|
||||
- rejekts
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://www.youtube.com/watch?v=1US_-3udMDo" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## Hypothesis
|
||||
|
||||
|
||||
@@ -10,12 +10,12 @@ This is the first day of Cloud Native Rejekts and the first time of me attending
|
||||
|
||||
> Ranked by should watch to could watch
|
||||
|
||||
- How to hire, manage and develop engineers: [Tech is broken and AI won't fix it](../05_broken-tech)
|
||||
- What if my homelab is an African island: [Geographically Distributed Clusters: Resilient Distributed Compute on the Edge](../06_geo-distributed-clusters)
|
||||
- Bootstrap and CI/CD with crossplane: [Building air-gapped control planes for a global pharma leader using crossplane and argo](../08_airgapped-cp)
|
||||
- Handling large number of clusters: [CRD Data Architecture for Multi-Cluster Kubernetes](../04_multicluster-crd)
|
||||
- Handling large scale migrations: [The Cluster API Migration Retrospective: Live migrating hundreds of clusters to Cluster API](../02_clusterapi)
|
||||
- How to hire, manage and develop engineers: [Tech is broken and AI won't fix it](./05_broken-tech)
|
||||
- What if my homelab is an African island: [Geographically Distributed Clusters: Resilient Distributed Compute on the Edge](./06_geo-distributed-clusters)
|
||||
- Bootstrap and CI/CD with crossplane: [Building air-gapped control planes for a global pharma leader using crossplane and argo](./08_airgapped-cp)
|
||||
- Handling large number of clusters: [CRD Data Architecture for Multi-Cluster Kubernetes](./04_multicluster-crd)
|
||||
- Handling large scale migrations: [The Cluster API Migration Retrospective: Live migrating hundreds of clusters to Cluster API](./02_clusterapi)
|
||||
|
||||
## Other stuff I learned or people I talked to
|
||||
|
||||
- Throughout the lunch break I talked to a nice guy who heard my government question during the [Tech is broken and AI won't fix it](../05_broken-tech)-Talk, we talked
|
||||
- Throughout the lunch break I talked to a nice guy who heard my government question during the [Tech is broken and AI won't fix it](./05_broken-tech)-Talk, we talked
|
||||
27
content/day0/01_project-update.md
Normal file
@@ -0,0 +1,27 @@
|
||||
---
|
||||
title: Project update
|
||||
weight: 1
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/70/Platforms%20WG%20Update%20slides%20-%20Kubecon%20EU%202025.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
An update from the platform working group which will be renamed to the CNCF Platform Engineering Community.
|
||||
Alongside the new name, a bit of restructuring will take place because the working group outgrew the working group label.
|
||||
|
||||
## Initiatives
|
||||
|
||||
### Supported initiatives
|
||||
|
||||
- Platform Glossary and Whitepaper: What is a platform
|
||||
- Platform Maturity Model & Assessment: A Platform is a living thing that evolves
|
||||
- Platform as a Product: Currently in the research stage
|
||||
- Platform Community Formation: The above-mentioned restructuring
|
||||
|
||||
### Monitored initiatives
|
||||
|
||||
- Cloud Native Platform Engineering Associate (CNPA): Certification is being formed
|
||||
- Cloud Native Platform Engineer (CNPE): Will follow after CNPA
|
||||
30
content/day0/02_sponsored-stbsdw.md
Normal file
@@ -0,0 +1,30 @@
|
||||
---
|
||||
title: Stop building, start delivering workloads
|
||||
weight: 2
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
- sponsored
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/7tbs3J7mgE0" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## States of platform
|
||||
|
||||
1. Platform is being built and getting delayed
|
||||
2. Platform finished and not adopted
|
||||
3. Re-Platforming and guessing if the new platform will meet the same end
|
||||
4. Platform is low maintenance and devs are happy (nice story bro)
|
||||
|
||||
Failure should be fine but it's no longer an option in most cases
|
||||
|
||||
## What do we want?
|
||||
|
||||
> Wishlist
|
||||
|
||||
- Support for all workloads
|
||||
- Consistent experiences across UI, API, CLI and GitOps
|
||||
- Pathway from preview to prod
|
||||
- Multi-cloud and onprem
|
||||
- Abstract infra
|
||||
32
content/day0/03_sponsored-cortex.md
Normal file
@@ -0,0 +1,32 @@
|
||||
---
|
||||
title: "Platform Engineering with a Product Management Mindset: 10x your DevEx"
|
||||
weight: 3
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
- sponsored
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/MFLXFNlmMMI" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
This whole talk is pretty much a product manager's view on platform engineering.
|
||||
|
||||
## Where can it go wrong
|
||||
|
||||
- Assuming customer needs - build for hypothetical developers
|
||||
- Output > Outcome
|
||||
- Ignore stakeholder ecosystem
|
||||
|
||||
TODO: Steal slide
|
||||
|
||||
## PaaP (Platform as a product)
|
||||
|
||||
- Anticipate developer needs: Don't just fulfill requests
|
||||
- Design for all personas and survey related teams
|
||||
- Prioritize Features according to research themes
|
||||
- Deliver incremental value with feedback loops
|
||||
|
||||
## Hierarchy of goals and baselines
|
||||
|
||||
TODO: Copy slide over
|
||||
27
content/day0/04_sponsored-gitpod.md
Normal file
@@ -0,0 +1,27 @@
|
||||
---
|
||||
title: "The platform Engineer gauntlet: Three defining challenges in the AI era"
|
||||
weight: 4
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
- sponsored
|
||||
---
|
||||
|
||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## Conviction
|
||||
|
||||
- Background: There is an absence of platform leadership
|
||||
- Reason: Most "leaders" don't push services or features to developers with conviction
|
||||
- Solution: Be proud and use your leadership role with courage
|
||||
|
||||
## Focus
|
||||
|
||||
- Focus on developers
|
||||
- Don't only focus on the production ecosystem (observability, ci/cd) but also the path to this end
|
||||
|
||||
## Foundations
|
||||
|
||||
- Problem: Many companies are running behind their AI goals thanks to missing baseline automation
|
||||
- Solution: Embrace the AI
|
||||
13
content/day0/05_sponsored-vultr.md
Normal file
@@ -0,0 +1,13 @@
|
||||
---
|
||||
title: "Containerization beyond CPUs - A Kubernetes based serverless platform for ai native applications"
|
||||
weight: 5
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
- sponsored
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/XrMsJIL35Oc" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
Hypothesis: We are at the beginning of a 10 year cycle that is moving towards ai-native applications.
|
||||
61
content/day0/06_hire-engineers.md
Normal file
@@ -0,0 +1,61 @@
|
||||
---
|
||||
title: So you want to hire for platform engineering
|
||||
weight: 6
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/cl-MO7j7MHY" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
Hypothesis: The bar for good interviewing is somewhere near the earth's core and we need to improve this (because we need more engineers)
|
||||
|
||||
## Resilience engineering
|
||||
|
||||
> The overarching concepts that apply to platforms or just "how to make code work"
|
||||
|
||||
Idea: Four main goals that align with different roles under the mothership "resilience engineering"
|
||||
|
||||
- Rebound: SRE
|
||||
- Robustness: Infra
|
||||
- Graceful extensibility: Platform Engineering
|
||||
- Sustained adaptability: DevEx (often pulled out into something else)
|
||||
|
||||
Bonus things to look out for
|
||||
|
||||
- Intellectual Humility: The ability to learn new things and to accept that you might know much but not everything
|
||||
- Ecological awe: The awe experienced when looking at beautiful nature and feeling small or just looking at the CNCF landscape
|
||||
|
||||
## What do you need for the first team
|
||||
|
||||
- People who are able to hire new people and willing to step up to leadership in the long term
|
||||
- Generalists
|
||||
|
||||
## The process and what to do
|
||||
|
||||
What should happen before we hire someone (either in one or multiple interviews).
|
||||
|
||||
1. Learn about each other
|
||||
2. Solve a technical problem together
|
||||
3. Solve a sociological problem together
|
||||
4. How do you and your future coworkers/stakeholders get along
|
||||
|
||||
Make sure the end2end time (first interview to yes or no) is low (best is under two weeks)
|
||||
All of your current engineers should be able to pass the interview without studying in advance (no stupid)
|
||||
|
||||
## Potential Failures and fallacies
|
||||
|
||||
- The fallacy of demographics in = demographics out
|
||||
- Treating interviews like hazing
|
||||
- You don't track after-hire indicators
|
||||
- Whiteboard interviews: They are stupid repetition and regurgitation and have zero relation to real-world work
|
||||
- There are no real studies on how to assess and hire talent
|
||||
|
||||
### Flags
|
||||
|
||||
- Passion is usually interpreted as "puts up with abuse" and should not be mistaken for caring -> See "Ecological awe"
|
||||
- Side projects probably indicate a lack of family/social time ("I make my wife raise the kids") -> Side projects are not a good indicator; maybe they are brilliant at their job but love their free time
|
||||
- A Moneyball-like process (data-driven decision) completely counters how talent is perceived -> Expand the hiring pool to anybody and ignore the classical "indicators of talent"
|
||||
- Discriminated demographics probably have a better grip on systems thinking (due to being forced to make choices)
|
||||
- Systems thinking is more important than platform knowledge (If you can think in terms of organization and dependencies you can work on platforms)
|
||||
62
content/day0/07_past-present-future.md
Normal file
@@ -0,0 +1,62 @@
|
||||
---
|
||||
title: The past, the present and the future of platform engineering
|
||||
weight: 7
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
- viktor
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/uwDoHm-AxTM" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
The good old baseline is "I am a developer, I write code - now I have to do stuff to continue writing code".
|
||||
Most developers will continue on to "now I have to write scripts" in order to just do their jobs instead of working on infra.
|
||||
|
||||
These scripts evolve to tools which evolve into an internal platform (if everyone starts using it).
|
||||
Other base components can also feel like platforms (for example application servers).
|
||||
|
||||
## The early day evolution
|
||||
|
||||
- Hudson
|
||||
- Docker: Not really building platforms, rather standardized application packaging
|
||||
- Kubernetes (and Nomad + Swarm): A new concept of scheduling instead of just running the application in a container
|
||||
|
||||
=> We've been building platforms (or failing to build them) for years and years but now we kinda agree about what a platform is
|
||||
|
||||
## Present
|
||||
|
||||
We have the base idea of a platform
|
||||
|
||||
```mermaid
graph LR
ServiceConsumers-->|Consume through|HTTPAPI-->|Trigger work on|Controllers-->|So|Services
ServiceOwner-->|Manages|Services
```
|
||||
|
||||
- The first question: Do we use public controllers (e.g. the CNCF landscape projects) or build our own?
|
||||
- Result: Mostly a mix starting with public, realizing needs and expanding
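
A rough sketch of the diagram above in plain Python (my own illustration, not code from the talk and not real Kubernetes code). All names here (`PlatformAPI`, `DatabaseController`, the `orders-db` request) are made up: consumers only tell the API what they want, and a controller does the actual work on the services.

```python
from dataclasses import dataclass, field


@dataclass
class PlatformAPI:
    """Stands in for the HTTP API: consumers only record what they want."""
    desired: dict[str, dict] = field(default_factory=dict)

    def request(self, name: str, spec: dict) -> None:
        self.desired[name] = spec


@dataclass
class DatabaseController:
    """Hypothetical controller that turns desired state into 'running' services."""
    services: dict[str, dict] = field(default_factory=dict)

    def reconcile(self, api: PlatformAPI) -> list[str]:
        actions = []
        # Create or update everything that differs from the desired state.
        for name, spec in api.desired.items():
            if self.services.get(name) != spec:
                self.services[name] = dict(spec)
                actions.append(f"apply {name}")
        # Remove services nobody asks for anymore.
        for name in list(self.services):
            if name not in api.desired:
                del self.services[name]
                actions.append(f"delete {name}")
        return actions


api = PlatformAPI()
controller = DatabaseController()
api.request("orders-db", {"kind": "database", "size": "small"})
first_run = controller.reconcile(api)   # does the work
second_run = controller.reconcile(api)  # nothing left to do
```

The second run returning no actions is the point: the consumer never triggers work directly, the controller just keeps closing the gap between what was requested and what exists.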
|
||||
|
||||
## Make it your own
|
||||
|
||||
- Goal: Make the platform domain specific for your developers
|
||||
- Evolution: Tools like DAPR for developers or Crossplane for api-building
|
||||
- Build the API and Controllers first - dashboard, gitops, observability, ... second
|
||||
- Remember that kubernetes can manage anything - not just containers
|
||||
|
||||
TODO: Steal image
|
||||
|
||||
## Blueprints
|
||||
|
||||
Take all of the projects you need, combine them and hide the complexity
|
||||
High level architecture of internal platforms is the same as public ones (aws, ...) but internal and built on kubernetes.
|
||||
|
||||
TODO: Steal images for platform blueprints (3 slides)
|
||||
|
||||
## Future
|
||||
|
||||
- Platform Engineering certification by the CNCF is on the horizon
|
||||
- Do we need to hide kubernetes from developers? Maybe -> The CNCF is starting groups to get app devs closer to platform engineers
|
||||
- More multi-cluster specialized tools have sprung up in the last year (scheduling, discovery, management)
|
||||
- AI things are happening and we should utilize them, but not just by calling an LLM directly and calling it a day -> e.g. the Dapr LLM abstraction API
|
||||
- Platforms are not built in isolation, we need to help each other
|
||||
75
content/day0/08_product-thinking.md
Normal file
@@ -0,0 +1,75 @@
|
||||
---
|
||||
title: Product thinking for cloud native engineers
|
||||
weight: 8
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/8_pB9RAfzrY" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/48/Product%20Thinking%20for%20Cloud%20Native%20Engineers%20PlatformEngineeringDay-EU-25.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
## How & Why
|
||||
|
||||
- IT was a cost center for a long time - now it's critical but still treated as a cost center
|
||||
- Why is it important: Too much focus on the technical aspects instead of value delivery
|
||||
- Importance: Show the value of your work (which means your work has to provide value)
|
||||
- Operations and coordination work is not easily visible, but very important
|
||||
|
||||
## Principles
|
||||
|
||||
- Focus on user value: User problems > Solutions
|
||||
- Outcome (Value) > Output (Tickets closed)
|
||||
- Products (lifecycle and ownership) before projects (just setting stuff up)
|
||||
|
||||
### User value
|
||||
|
||||
- "Who is the user": Builders, Enablers, Regulatory, "Viewers"
|
||||
- "What is the value": Make the organization more efficient while avoiding risks
|
||||
|
||||
## How to start?
|
||||
|
||||

|
||||
|
||||
### Exploring the Problem Space
|
||||
|
||||
Goals:
|
||||
- Identify top pains
|
||||
- Build empathy and understanding
|
||||
- Investigate key business aims
|
||||
|
||||
Techniques:
|
||||
- Customer and stakeholder interviews: Talk to people, they will probably tell you about their pain
|
||||
- Data/Process analysis: Where are our bottlenecks
|
||||
- Shadowing: Really see how the day to day works
|
||||
- Ask "Why"
|
||||
- Read business updates (current goals)
|
||||
- Build dashboards that show progress and value
|
||||
|
||||
### Defining the problem space
|
||||
|
||||
Goals:
|
||||
- Identify opportunities
|
||||
- Prioritise
|
||||
- Gather insights and data
|
||||
|
||||
Techniques:
|
||||
- Value stream mapping
|
||||
- RICE, Value vs Effort or other cost-benefit analysis
|
||||
- Analyse your exploration process
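RICE is usually computed as reach times impact times confidence, divided by effort. The sketch below shows that arithmetic on a made-up backlog; the item names, numbers and units are assumptions for illustration and were not part of the talk.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    name: str
    reach: float       # e.g. developers affected per quarter (assumed unit)
    impact: float      # relative impact per user, e.g. 0.25 to 3
    confidence: float  # 0.0 to 1.0
    effort: float      # e.g. person-months, must be positive


def rice_score(o: Opportunity) -> float:
    """RICE = (reach * impact * confidence) / effort."""
    if o.effort <= 0:
        raise ValueError("effort must be positive")
    return o.reach * o.impact * o.confidence / o.effort


def prioritise(opportunities):
    """Return the opportunities sorted by descending RICE score."""
    return sorted(opportunities, key=rice_score, reverse=True)


# Hypothetical backlog, only here to show the ranking.
backlog = [
    Opportunity("self-service namespaces", reach=200, impact=2, confidence=0.8, effort=4),
    Opportunity("faster CI runners", reach=400, impact=1, confidence=0.5, effort=2),
    Opportunity("new dashboard theme", reach=200, impact=0.25, confidence=1.0, effort=1),
]
ranked = [o.name for o in prioritise(backlog)]
```

The score is only as good as the estimates behind it, so it works best as a conversation starter next to the interview and shadowing results.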
|
||||
|
||||
## Did we reach our goal?
|
||||
|
||||
### Product metrics
|
||||
|
||||
- Someone will measure your work, hope they do it right or rather do it yourself to show how you provide value
|
||||
- Product metrics should measure outcome not output (or performance metrics)
|
||||
- Baseline: You need to know the desired outcome
|
||||
|
||||
|
||||
### Frameworks
|
||||
|
||||
- DevEx: Triangle of flow state (build&test speed), feedback loops () and cognitive load (code complexity, docs clarity)
|
||||
- DORA
|
||||
- SPACE
|
||||
- DX Core 4
|
||||
129
content/day0/09_promotions.md
Normal file
@@ -0,0 +1,129 @@
|
||||
---
|
||||
title: A million ways to promote changes between environments
|
||||
weight: 9
|
||||
tags:
|
||||
- argo
|
||||
- cloudnativecon
|
||||
- viktor
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/iCTgRC3AQQk" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
## Baseline
|
||||
|
||||
- Promotion: Move things from one env to another
|
||||
- Options: Sequentially or both
|
||||
- Challenge: Env differences
|
||||
- Challenge: How do we link our promotion tasks?
|
||||
|
||||
### GitOps
|
||||
|
||||
- Declarative: YAML, JSON, XML (Not helm or kcl or anything else)
|
||||
- Versioned and immutable: Git
|
||||
- Pulled automatically: No write access from cluster
|
||||
- Continuously reconciled: Maintain parity between desired and actual state
|
||||
|
||||
### Rules
|
||||
|
||||
- Part of SDLC
|
||||
- Declarative
|
||||
- Versioned and immutable
|
||||
- Pulled automatically
|
||||
- Continuously reconciled
|
||||
|
||||
## Workflows
|
||||
|
||||
### Manual
|
||||
|
||||
1. Deploy
|
||||
2. Run tests
|
||||
3. Push to next stage
|
||||
4. Test again or roll back
|
||||
|
||||
### Manual with gitops
|
||||
|
||||
1. Update manifest
|
||||
2. Push to git
|
||||
3. Test
|
||||
4. Next stage
|
||||
|
||||
Problem: Eventual consistency makes the process async instead of sync (important for tests)
|
||||
|
||||
### Generic workflows
|
||||
|
||||
1. Dev: Bump, push
|
||||
2. QS: Wait for success of 1 (how?), do the same
|
||||
3. Prod: Wait for success of 2 (how?)
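The "(how?)" in the steps above is the hard part: because reconciliation is asynchronous, a pipeline has to wait until the previous stage reports the revision it just pushed, not merely a healthy state left over from the old one. A minimal polling sketch, assuming a hypothetical `get_status` callable that returns a dict with `revision`, `sync` and `health` keys (these names are illustrative, not a real API):

```python
import time
from typing import Callable


def wait_for_stage(
    get_status: Callable[[], dict],
    expected_revision: str,
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the stage runs expected_revision and is healthy, or time out."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        promoted = (
            status.get("revision") == expected_revision  # guards against stale state
            and status.get("sync") == "Synced"
            and status.get("health") == "Healthy"
        )
        if promoted:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

Only when this returns `True` for dev would the pipeline bump QS, and so on. Polling like this keeps the pull principle intact, but it is still the pipeline trusting reported state.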
|
||||
|
||||
TODO: Steal code screenshots from slides
|
||||
|
||||
## Tools
|
||||
|
||||
### Extend your standard CI
|
||||
|
||||
|
||||
Not async, risk of flapping, either blindly trust the state or break the pull-principle by running argo sync or kubectl apply
|
||||
|
||||
### AppSets Progressive Sync
|
||||
|
||||
- Built in to Application Sets (alpha)
|
||||
- Targeting by label, promotes everything
|
||||
- Not supported with autosync, because it basically manually triggers sync one after another
|
||||
- Changes from git have to be manually triggered
|
||||
|
||||
### Image updater
|
||||
|
||||
- Subscribe to semver based image updates and write them to kubernetes and/or git
|
||||
- You have to implement promotions via image naming schemes
|
||||
|
||||
TODO: Steal flowchart
|
||||
|
||||
### Kargo
|
||||
|
||||
- Freight: Artifact or manifest versions to promote
|
||||
- Stage: ArgoCD Apps
|
||||
|
||||
TODO: Steal flowchart
|
||||
|
||||
### Telefonistka
|
||||
|
||||
- IaC Agnostic tooling
|
||||
- Idea: Watch folder contents and copy contents to new folder
|
||||
- Pretty much a bundled CI script
|
||||
|
||||
TODO: Draw your own chart
|
||||
|
||||
### Codefresh GitOps
|
||||
|
||||
> This is one of the speaker's tools
|
||||
|
||||
- Product: Applications with relationships
|
||||
- Env: Any cluster and/or namespace
|
||||
- Promotion: CRD for policy (when does it happen, what gets validated)
|
||||
- Promotions can happen manually or automated via commit/pr
|
||||
- Based on Argo Workflows
|
||||
|
||||
### GitOps Promoter (Intuit)
|
||||
|
||||
- Define Manifests once and hydrate them later
|
||||
- Source hydrator: ArgoCD feature that handles the rendering and commits it to a new dedicated branch (one branch per stage)
|
||||
- These branches are the ones used by argo, e.g. `environments/dev` gets watched by the dev cluster
|
||||
- Changes result in environment proposal branches: A PR gets opened, PR checks run, and when the PR requirements (tests) are met, it gets merged into the real env branches
|
||||
|
||||
TODO: Steal Pattern
|
||||
|
||||
## Overview of the philosophies
|
||||
|
||||
Artifact Oriented: Imageupdater, Kargo
|
||||
Define Manifests once: AppSets Progressive Sync, GitOps Promoter
|
||||
Deff and workflow: CI, Codefresh
|
||||
|
||||
TODO: Steal from slides
|
||||
|
||||
## Best practices
|
||||
|
||||
- Can you recover from git at any point? No -> Do better
|
||||
- Does git reflect what's deployed without looking?
|
||||
- Does this enable SDLC?
|
||||
- Interfaces in folders, not branches? -> Branches may get crowded
|
||||
89
content/day0/10_abstractions.md
Normal file
@@ -0,0 +1,89 @@
|
||||
---
|
||||
title: "Platform abstractions: Asset or liability"
|
||||
weight: 10
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/M5X5NCzlzIA" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/52/atul-talk-platform-engineering-kubecon-london-2025_final.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
Fair warning: Food analogies incoming
|
||||
|
||||
## Baseline
|
||||
|
||||
### What do abstractions achieve
|
||||
|
||||
- Structure through simplification
|
||||
- Complexity made simple
|
||||
- Hidden details, visible value
|
||||
|
||||
### Dilemma
|
||||
|
||||
1. Platform team creates abstraction
|
||||
2. Abstraction works for 10 Teams
|
||||
3. Other team requests extension
|
||||
4. Question: How do we deal with this
|
||||
|
||||
### Possible Solutions
|
||||
|
||||
- Add Config Options: Increases complexity of abstraction
|
||||
- Make One-off exceptions: Breaks standardization, introduces inconsistency
|
||||
- Require conformity: Hinders innovation, creates enemies
|
||||
- Allow bypassing: Creates shadow IT, risking security and resource control
|
||||
|
||||
=> Debt trap: The cost of maintaining a stable platform rises and rises
|
||||
|
||||
## The debt cycle
|
||||
|
||||
### The abstraction cycle
|
||||
|
||||
1. Simplify
|
||||
2. Adopt
|
||||
3. New Requirements
|
||||
4. Add complexity
|
||||
5. Repeat
|
||||
|
||||

|
||||
|
||||
### Warning signs
|
||||
|
||||
- Rising customization requests
|
||||
- Workarounds
|
||||
- Shadow IT
|
||||
|
||||
### Impact
|
||||
|
||||
- Each new feature becomes harder to implement
|
||||
- Teams lose trust in the platform capabilities
|
||||
- Platform evolution slows down
|
||||
- New tech is difficult to incorporate
|
||||
|
||||
## Abstraction elasticity
|
||||
|
||||
> The abstraction should stretch a bit to accommodate change without breaking
|
||||
|
||||
- Adaptability: Ease of handling new requirements
|
||||
- Transparency: Understand what your user wants and why
|
||||
- Extension patterns: Document ways to customize the platform behavior
|
||||
- Migration Paths: Ease of moving away from the platform abstraction
|
||||
|
||||
### Elasticity
|
||||
|
||||
- Can teams access lower level controls (when needed) while staying with the abstraction
|
||||
- Do users understand what happens underneath (when needed)
|
||||
- Are there documented extension/customization points?
|
||||
|
||||
## Patterns to break the debt trap
|
||||
|
||||
- Layered abstraction patterns: Start with low-level abstractions that get abstracted on higher levels, so users can choose the right abstraction level for themselves without having to configure everything themselves
|
||||
- Expert API: Additional api parameters that are not needed but can be set
|
||||
- Policy based guard rails: Change the guardrails based on the environment (e.g. deep access in dev, not in prod)
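The expert parameters and the policy based guardrails can be combined in one small sketch: sensible defaults, optional overrides, and an environment-specific allow-list for those overrides. All field names, defaults and policies below are hypothetical and only illustrate the pattern.

```python
from dataclasses import dataclass, field

# Hypothetical guardrails: which expert overrides each environment permits.
ALLOWED_OVERRIDES = {
    "dev": {"replicas", "cpu", "memory", "host_network"},
    "prod": {"replicas", "cpu", "memory"},
}

# Hypothetical defaults that cover the common case without any configuration.
DEFAULTS = {"replicas": 2, "cpu": "500m", "memory": "512Mi", "host_network": False}


@dataclass
class AppRequest:
    name: str
    environment: str
    overrides: dict = field(default_factory=dict)  # optional "expert" parameters


def render(request: AppRequest) -> dict:
    """Merge defaults with the overrides this environment allows."""
    allowed = ALLOWED_OVERRIDES[request.environment]
    rejected = sorted(set(request.overrides) - allowed)
    if rejected:
        raise PermissionError(
            f"overrides not allowed in {request.environment}: {rejected}"
        )
    return {"name": request.name, **DEFAULTS, **request.overrides}
```

Most users would only ever send `AppRequest("shop", "prod")`; the overrides stay out of sight until someone needs them, and the same request shape is stricter in prod than in dev.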
|
||||
|
||||
## The end goal
|
||||
|
||||
- Increase adoption
|
||||
- Eliminate shadow IT
|
||||
- Improved satisfaction
|
||||
- Reduced overhead
|
||||
43
content/day0/11_t-env.md
Normal file
@@ -0,0 +1,43 @@
|
||||
---
|
||||
title: "The story of t-env: Scaling a platform to improve the velocity of hundreds of developers"
|
||||
weight: 11
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/qXRHpIYxU_c" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/da/KubeCon%20Talk_%20Lemonade%27s%20t-env.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
Okteto: Ephemeral environments for testing
|
||||
|
||||
## History
|
||||
|
||||
- Starting point: Local Dev -> Setup for new devices or devs is really slow (on average 10hrs a week)
|
||||
- Next Idea: EC2 Instances with a fancy docker-compose and scripts -> No more local dev
|
||||
- Problems: Still complex - just in the cloud, manual updates, always-on required (no working on the train)
|
||||
- Risks: Developers will just create workarounds and shadow IT
|
||||
|
||||
## T-Env
|
||||
|
||||
- Baseline: Setup an environment on kubernetes for each dev with ci/cd
|
||||
- Okteto: A single command to enter dev mode `t dev start` with file sync from local
|
||||
- Implementation: Wrapper around the okteto cli
|
||||
- Why: Because devs seem to love the cli
|
||||
- Self service observability for troubleshooting in your env
|
||||
|
||||
Open source tools used: Pulumi, Grafana, Okteto, K8s
|
||||
|
||||
### Did it work?
|
||||
|
||||
- The time to test is way faster
|
||||
- The path was clear
|
||||
- The environments should be ephemeral but devs don't like that -> They decided to allow for long lived envs
|
||||
- Cloud cost is relatively high with long living envs -> They implemented a sleep system based on dev timezone
|
||||
(or manual wake-up)
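A timezone based sleep rule could look roughly like the sketch below. The working hours, the weekend rule and the use of a fixed UTC offset per developer are assumptions for illustration; the talk did not show the actual implementation.

```python
from datetime import datetime, time, timedelta, timezone


def should_sleep(
    now_utc: datetime,
    utc_offset_hours: float,
    work_start: time = time(8, 0),   # assumed start of the working day
    work_end: time = time(19, 0),    # assumed end of the working day
) -> bool:
    """Return True if the dev's environment can be scaled down right now.

    now_utc must be timezone-aware; utc_offset_hours is the dev's offset.
    """
    local = now_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    if local.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (work_start <= local.time() < work_end)
```

A manual wake-up would then simply bypass this check for a limited time.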
|
||||
|
||||
## The futuuuuure
|
||||
|
||||
- The company is not getting smaller -> More devs and more services
|
||||
- AI agents will write some of the code in the future
|
||||
- Idea: Only run modified code in env instead of everything
|
||||
50
content/day0/12_many-clusters.md
Normal file
@@ -0,0 +1,50 @@
|
||||
---
|
||||
title: "Performance perseverance: Taming 1000 kubernetes clusters"
|
||||
weight: 12
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/ZTT8M74RD1M" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/d5/kubecon_2025_v4.2.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
## History
|
||||
|
||||
- They started with upstream kubernetes - the hard way
|
||||
- Env grew to over 200 prod apps
|
||||
- Pains: Single Cluster, single point of failure and complexity
|
||||
- What worked: Dev adoption and autonomy, no vendor
|
||||
|
||||
## Challenges
|
||||
|
||||
> Based on stakeholder expectations
|
||||
|
||||
- One tenant per cluster -> Over 1000 Clusters
|
||||
- Release management
|
||||
- Small team (3 Engineers)
|
||||
|
||||
## Guiding principles
|
||||
|
||||
- Platform as a product
|
||||
- Stability: trust
|
||||
- Standardization -> Scalability and inter team collab
|
||||
- Day 2 support
|
||||
- Dogfooding
|
||||
|
||||
## Tenancy
|
||||
|
||||
- One cluster per product
|
||||
- Own CLI, devs like cli
|
||||
- Custom operator and crds
|
||||
|
||||
## Stack
|
||||
|
||||
- Keopsctl? Pretty much their own cluster operator
|
||||
- A Simple Cluster CRD
|
||||
|
||||
## Migration
|
||||
|
||||
1. Build trust in platform
|
||||
2. Support with docs, onboarding, q&a
|
||||
3. Co-create with devs while keeping an eye on day2 -> Feature-Flag based rollout
|
||||
56
content/day0/13_paap.md
Normal file
@@ -0,0 +1,56 @@
|
||||
---
|
||||
title: Platform as a Product
|
||||
weight: 13
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/DoiaHfl9Y7Y" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
The CNCF's research into product thinking for platforms.
|
||||
|
||||
## But why
|
||||
|
||||
- Get insights into the current product thinking practices of platform builders
|
||||
- Topics: Needs/Painpoints/Behaviour
|
||||
- Target: Create personas based on insights
|
||||
- Find out what people are doing, not how they are doing it
|
||||
|
||||
## How?
|
||||
|
||||
- Survey for quantity
|
||||
- Interviews for quality
|
||||
|
||||
## Challenges
|
||||
|
||||
- Asking questions without suggesting answers
|
||||
- Consensus on research goals
|
||||
- Motivation and time investment (on interviewer and interviewee side) + Non-Responses
|
||||
- Tooling: There is no standard tooling at the CNCF for this kind of research
|
||||
- Small sample size -> No real research insights, just signals/hints
|
||||
|
||||
## Analysis
|
||||
|
||||
- Working with assumptions was hard in combination with the small sample size
|
||||
- Survey: Survey Tool (Google Forms) combined with a whiteboard tool for clustering and analysis
|
||||
- Interviews: They used AI for time efficiency but the prompt escalated a bit, leading to no real time gain -> But you can scale the same prompt to infinite sample sizes
|
||||
- Challenge: AI confidently churns out wrong answers -> Use source links and scoping to verify
|
||||
|
||||
TODO: Steal workflow from slides
|
||||
|
||||
## Outcome/Signals
|
||||
|
||||
- Platform Orgs use Prioritization Frameworks unconsciously: "We don't use product management and tools like that" -> Well you do, you just don't call it PM and are a bit unstructured
|
||||
- Structured Activities: Interviews (talking to each other), Focus groups, quantitative data, ...
|
||||
- Roadmap influence: Insight, prioritization, painpoints, backlogs
|
||||
- Regular planning meetings
|
||||
- Platform orgs struggle to define and actually implement measures of success: Measure activity over impact, success is often felt instead of proved
|
||||
- Platform teams have varied control over their work: Depending on company size and business relationships
|
||||
|
||||
## Future
|
||||
|
||||
- Baseline: They have some signals
|
||||
- Question: Are these patterns successful?
|
||||
- Needed: More data and better organization
|
||||
58
content/day0/14_lego.md
Normal file
@@ -0,0 +1,58 @@
|
||||
---
|
||||
title: Building Platforms with empathy and yaml at the lego group
|
||||
weight: 14
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/8FmJWd7vRt4" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
Very nice kids playing with lego intro analogy about creativity, sharing and collaboration.
|
||||
|
||||
## The golden brick
|
||||
|
||||
- The brick could get picked up and sometimes picking it up is mandatory
|
||||
- Development in close collaboration and trust with users
|
||||
- Focus on "good enough" instead of "perfect, but everyone is unhappy"
|
||||
|
||||
### Guidelines
|
||||
|
||||
- API first: Define a separation between users and services with abstractions
|
||||
- Self services: Freedom of choice and combination
|
||||
- Constraints that are soft and can be modified on feedback
|
||||
|
||||
### Offers
|
||||
|
||||
- Kubernetes as a service
|
||||
- Runtime as a Service: Namespace as a service with argo and without cluster access
|
||||
- Problem: Users want kubeapi access
|
||||
- Method: Talk with the users
|
||||
- Solution: Zero Trust proxy that provides operational access to kubeapi via OIDC
|
||||
- There are multiple APIs that can be combined -> You need constraints
|
||||
|
||||
### What's needed
|
||||
|
||||
- Conversation
|
||||
- Trust
|
||||
- Striking a balance
|
||||
|
||||
## The human aspect
|
||||
|
||||
- Treat people as colleagues instead of customers
|
||||
- Build empathy to reach a balanced "good enough"
|
||||
- Lead with transparency: Publish your metrics
|
||||
- Visit their context
|
||||
- Explore unknowns together
|
||||
- Create a shared understanding of challenges
|
||||
|
||||
### Team culture
|
||||
|
||||
- Know who you are helping and who helps you
|
||||
- Empower them to shine by getting to know their context
|
||||
- Hear them out in small meetings or in person
|
||||
|
||||
## Platform maturity
|
||||
|
||||
TODO: Steal maturity chart
|
||||
29
content/day0/15_internal-marketing.md
Normal file
@@ -0,0 +1,29 @@
|
||||
---
|
||||
title: 10 Quick tips on how to internally market your platform
|
||||
weight: 15
|
||||
tags:
|
||||
- platform
|
||||
- cloudnativecon
|
||||
- lightning
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/kiUV8En8Co4" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/42/2025-PE-Day-10-Tips.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
## Baseline
|
||||
|
||||
- Even great tech does not sell itself - you need marketing
|
||||
- We don't have a big marketing budget for our internal platform
|
||||
- No adoption -> No Trust -> No new users -> No adoption
|
||||
|
||||
## Tips
|
||||
|
||||
- Define personas and a value proposition map
|
||||
- Build a brand: Name, logo, story, swag
|
||||
- Have a launch party or milestone parties
|
||||
- Provide clear, accessible communication (with clear channels, docs, ...)
|
||||
- Build a community that can help each other (and don't separate yourself from the community)
|
||||
- Capture metrics for success for yourself and from a user's perspective
|
||||
- Provide a 5-minute Wow-Moment/demo where the user can feel like they achieved something
|
||||
- Level up with gamification
|
||||
- Leverage external events for internal visibility
|
||||
BIN
content/day0/_img/abstraction-cycle.png
Normal file
|
After Width: | Height: | Size: 572 KiB |
BIN
content/day0/_img/product-compass.png
Normal file
|
After Width: | Height: | Size: 270 KiB |
@@ -4,8 +4,27 @@ title: Day 0
|
||||
weight: 4
|
||||
---
|
||||
|
||||
TODO:
|
||||
Day 0 of KubeCon aka CloudNativeCon aka the day on which the co-located events happen.
|
||||
This year I spent most of my time at the platform engineering day (with a short visit to argocon).
|
||||
The emerging motto of platform engineering day was "platform as a product".
|
||||
|
||||
This was the third conference day (fourth travel day) and in the afternoon I started to feel the brain-overflow.
|
||||
But powering through, I ended up attending two keynotes (no notes, they were pretty much a welcome and goodbye) and 14 talks.
|
||||
|
||||
And most importantly: This is the day my friends and coworkers joined (they are only in town for kubecon, not for rejekts).
|
||||
Sometimes we ended up in the same talks, sometimes in different talks, which led to a rich set of talk notes.
|
||||
|
||||
## Talk recommendations
|
||||
|
||||
* TODO:
|
||||
- How to design a good hiring process: [So you want to hire for platform engineering](./06_hire-engineers)
|
||||
- Evolution of Platforms and Platform Engineering: [The past, the present and the future of platform engineering](./07_past-present-future)
|
||||
- How to design a good product: [Product thinking for cloud native engineers](./08_product-thinking)
|
||||
- Staging with gitops: [A million ways to promote changes between environments](./09_promotions)
|
||||
- How to handle abstractions and new requirements: [Platform abstractions: Asset or liability](./10_abstractions)
|
||||
- Very nice slides: [Building Platforms with empathy and yaml at the lego group](./14_lego)
|
||||
|
||||
## Other stuff I learned or people I talked to
|
||||
|
||||
- Talked to the Vultr people - they have a manifesto for ai with amd and nvidia gpus
|
||||
- Talked to Meshcloud: They build developer platform tooling (currently mostly integrated with cloud providers)
|
||||
- Want to look into Okteto for dev envs: <https://github.com/okteto/okteto>
|
||||
77
content/day1/01_scaling-gpu.md
Normal file
@@ -0,0 +1,77 @@
|
||||
---
|
||||
title: Scaling GPU Clusters without melting down
|
||||
weight: 1
|
||||
tags:
|
||||
- ml
|
||||
- nvidia
|
||||
- ai
|
||||
- apiserver
|
||||
- go
|
||||
- kubecon
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/dUfp3j1j-mg" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/50/Scaling%20GPU%20Clusters%20Without%20Melting%20Down%21%20%281%29.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
## Baseline
|
||||
|
||||
- We need more and more gpus -> Control Plane needs to keep track of more objects
|
||||
- Goal: Scale Workers without scaling control plane
|
||||
|
||||
## Current Problems
|
||||
|
||||
### Secret list calls go up and control plane goes down
|
||||
|
||||
- Scenario: High number of list calls with larger secrets
|
||||
- Problem: OOM apiserver b/c cache
|
||||
- Fix: API Priority & Fairness (only allow two concurrent list calls, queue the rest)
|
||||
- Result: Decreased number of oom crashes
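The effect of "only two concurrent list calls, queue the rest" can be illustrated with a plain semaphore. This is a conceptual sketch only: real API Priority & Fairness is configured on the apiserver through its own resources and queueing logic, and the handler below is a hypothetical stand-in.

```python
import threading
import time

MAX_CONCURRENT_LISTS = 2
_list_slots = threading.BoundedSemaphore(MAX_CONCURRENT_LISTS)
_lock = threading.Lock()
in_flight = 0
peak_in_flight = 0


def handle_list_call(work_s: float = 0.01) -> None:
    """Hypothetical LIST handler: at most two run at once, the rest wait."""
    global in_flight, peak_in_flight
    with _list_slots:          # callers beyond the limit block (queue) here
        with _lock:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
        time.sleep(work_s)     # stand-in for the expensive secret LIST
        with _lock:
            in_flight -= 1


threads = [threading.Thread(target=handle_list_call) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Ten callers arrive at once, but memory only ever has to hold the working set of two list responses, which is why the OOM crashes went down.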
|
||||
|
||||
### High memory usage until we restart the apiserver
|
||||
|
||||
- Scenario: API-Server frees up to 40% of its memory util when restarted
|
||||
- Main suspect: Memory collection
|
||||
- Idea: Tune GOGC (ENV Var `GOGC`) -> They lowered it from the default 100 to 50
|
||||
- Result: Decrease in memory util and no more growing util over time
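Why lowering `GOGC` helps: roughly speaking, the Go garbage collector lets the heap grow to the live heap plus `GOGC` percent of it before the next collection. The sketch below uses that simplified rule (it ignores GC roots and any memory limit) with a hypothetical 4 GiB live heap, which is not a number from the talk.

```python
GIB = 1024 ** 3


def heap_target_bytes(live_heap_bytes: float, gogc: int = 100) -> float:
    """Simplified Go GC heap target: live heap * (1 + GOGC / 100)."""
    return live_heap_bytes * (1 + gogc / 100)


live_heap = 4 * GIB  # hypothetical live heap of an apiserver
default_target = heap_target_bytes(live_heap, gogc=100)  # 8 GiB
tuned_target = heap_target_bytes(live_heap, gogc=50)     # 6 GiB
```

The trade-off is more frequent collections, so CPU usage goes up in exchange for the lower peak memory.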
|
||||
|
||||
### Large skew in memory utilization
|
||||
|
||||
- Scenario: Skew between api server memory utilization across api-server pods
|
||||
- Problem: If a pod with high util gets hit with a list, the api-server will oom -> The LB redirects to the other 2 -> Those OOM
|
||||
- Observation: The lb in front of the api server pods also shows some skew -> Explains the skew
|
||||
- Root cause: lb has long living tcp connections to the servers and balances based on connections and not requests
|
||||
- Idea: Switch up the lb configuration -> Not quite the right angle
|
||||
- Fix: `--goaway-chance` param in apiserver - a random HTTP/2 `GOAWAY` message gets sent -> Tearing down connection gracefully, recreate connection
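The root cause is easy to reproduce on paper: a load balancer that spreads long-lived connections evenly can still spread requests very unevenly. The client request rates below are made up for illustration.

```python
from collections import Counter

SERVERS = ["apiserver-0", "apiserver-1", "apiserver-2"]

# Hypothetical long-lived client connections and their requests per second.
client_rps = [300, 5, 5, 200, 5, 5]

connections = Counter()
load_rps = Counter()
for i, rps in enumerate(client_rps):
    server = SERVERS[i % len(SERVERS)]  # balance by connection, round-robin
    connections[server] += 1
    load_rps[server] += rps

# connections -> 2 per server, load_rps -> 500 / 10 / 10
```

Occasionally closing connections forces clients to reconnect through the load balancer, so a chatty client does not stay pinned to the same apiserver forever.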
|
||||
|
||||
### Architectural mistakes
|
||||
|
||||
- Large number of secrets per workload -> List, Encode/Decode overhead
|
||||
- No caching -> Too many list calls
|
||||
|
||||
### Preview
|
||||
|
||||
- There are a bunch of sig api-machinery improvements planned
|
||||
|
||||
## The future
|
||||
|
||||
- The switch from NUMA GPU-Devices to DRA
|
||||
- DRA is powerful enough to get rid of custom numa stuff
|
||||
|
||||
### The stack
|
||||
|
||||
- Currently:
|
||||
- CP: APIServer, Controller manager, Scheduler and Topology aware scheduler
|
||||
- Worker: Device Plugin, nfd topology updater
|
||||
- Future
|
||||
- CP: APIServer, Controller manager, Scheduler
|
||||
- Worker: Device Plugin
|
||||
|
||||
### Testing scaling
|
||||
|
||||
- Tool: KWOK (Kubernetes WithOut Kubelet) - used to simulate gpu workloads
|
||||
- Env: K8S 1.32 with scaling from 0 to 4000 Workloads
|
||||
- Metrics:
|
||||
- Scheduling Latency: Topo aware was way more latency-affected
|
||||
- Scheduler Memory util: 30% of memory saved with dra
|
||||
- API-Server Memory: Another 20% of memory saved
|
||||
- Result: They are confident that DRA will be stable and even save memory and cpu util
|
||||
81
content/day1/02_migrations.md
Normal file
@@ -0,0 +1,81 @@
|
||||
---
|
||||
title: Day 2000 - Migrating from kubeadm + ansible to clusterapi+talos
|
||||
weight: 2
|
||||
tags:
|
||||
- kubecon
|
||||
- platform
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/uQ_WN1kuDo0" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/fd/day2000-migration-ClusterAPI-talos.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
## Background
|
||||
|
||||
- They use large, shared clusters
|
||||
- The oldest cluster is 2099 days (5.8 years) old
|
||||
- Onprem hosted on vSphere with vanilla kubeadm
|
||||
- Fun fact: They run chaosmonkey on all clusters -> Automatically prepares for updates
|
||||
|
||||
### Legacy provisioning
|
||||
|
||||
1. Terraform create debian vm
|
||||
2. Deploy base tools with puppet
|
||||
3. Register nodes in inventory yaml file
|
||||
4. run ansible playbook -> Renders configs and runs kubeadm
|
||||
5. Configure ArgoCD
|
||||
|
||||
### Target
|
||||
|
||||
- Use Clusterapi to manage the workload-clusters
|
||||
- Basic CRDS: Cluster, MachineDeployment, Machine
|
||||
- Talos: Immutable, minimal, ephemeral with declarative config via grpc api
|
||||
|
||||

|
||||
|
||||
|
||||
## Migration
|
||||
|
||||
1. Config matching between kubeadm and talos+capi
|
||||
2. Import PKI/Certs
|
||||
3. Create ClusterAPI CRDs
|
||||
4. Add ClusterAPI Nodes
|
||||
5. Remove kubeadm nodes
|
||||
|
||||
### 1. Config matching
|
||||
|
||||
1. Serviceaccount Issuer: Talos has its own default
|
||||
2. etcd encryption key names are hardcoded in talos
|
||||
3. Re-Encrypt all secrets (get secrets, replace secrets)
|
||||
|
||||
### 2. PKI
|
||||
|
||||
1. Talos includes some logic that can generate a secrets bundle from an existing API
|
||||
2. Import: The etcd, k8s, serviceaccount and os (talos specific, used for the talos api auth) certificates
|
||||
|
||||
### 3. CRDs
|
||||
|
||||
- One namespace per workload cluster
|
||||
- Cluster-CRD: Ref to CP and Infrastructure
|
||||
- ControlPlane-CRD: Create cp MDs
|
||||
- Infrastructure: References template for worker-MDs
|
||||
|
||||

|
||||
|
||||
### 4. Add ClusterAPI Nodes
|
||||
|
||||
- Add new CP and Worker Nodes to the cluster that are managed by CAPI (slowly, stuff will break)
|
||||
- Remove the old nodes one by one over weeks or months
|
||||
- Potential Problems:
|
||||
- Mismatched serviceaccountissuer
|
||||
- Missing etcd encryption key
|
||||
- Wrong etcd encryption key
|
||||
- Loss of quorum: `--force-new-cluster` can force recovery on one node of the etcd cluster
|
||||
|
||||
## Demo
|
||||
|
||||
I recommend watching the demo.
|
||||
Talos seems pretty cool.
|
||||
|
||||
## Bootstrapping
|
||||
|
||||
- Kind cluster in github action or on local device
|
||||
79
content/day1/03_operator-mistakes.md
Normal file
@@ -0,0 +1,79 @@
|
||||
---
|
||||
title: "Don't write controllers like charlie don't does: Avoiding common kubernetes controller mistakes"
|
||||
weight: 3
|
||||
tags:
|
||||
- kubecon
|
||||
- operator
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/tnSraS9JqZ8" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/53/Don%27t%20write%20controllers%20like%20Charlie%20Don%27t%20does_%20avoiding%20common%20Kubernetes%20controller%20mistakes.pptx.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
## Common mistake
|
||||
|
||||
### Not using a simple client but directly talk to the api server
|
||||
|
||||
- Problem: A
|
||||
- Problem: Updates send in the whole object -> Noop updates waste apiserver resources
|
||||
- Fix: Use a cache client
|
||||
- Problem: Caching validation
|
||||
|
||||
### Don't use custom caching
|
||||
|
||||
- Problem: Good Luck dealing with concurrency
|
||||
- Hard: Controllers must maintain a per kind cache
|
||||
- Problem: Eventual consistency makes everything more complicated
|
||||
- Fix: Use a framework
|
||||
|
||||
### Predicates only apply to the current
|
||||
|
||||
- A predicate passed to `For(...)` only applies to that call, not to other watchers
|
||||
- Also check whether you should be reconciling your low-level object or whether reconciling the higher-level ones that reference it is better
|
||||
|
||||
## Tools
|
||||
|
||||
### KRT
|
||||
|
||||
> Still under development
|
||||
|
||||
- Operations in collections (kubernetes objects with state tracking)
|
||||
- Fetch function that handles transformation
|
||||
|
||||
### StateDB
|
||||
|
||||
- In-memory database for go with watch channels
|
||||
- You can setup a table that stores all objects of a kind (provided by the client)
|
||||
- Triggers hooks when changes happen in the database that you can react to
|
||||
|
||||
### Controller-Runtime
|
||||
|
||||
> The kubebuilder one
|
||||
|
||||
- Includes a cached client
|
||||
- Works on the reconciler pattern -> Makes triggers simple
|
||||
|
||||
## Tips
|
||||
|
||||
- Limit the number of api server updates
|
||||
- Check for a diff yourself and don't send updates if there is nothing new
|
||||
- Use patch instead of update just with changed fields -> Especially for `.status`
|
||||
- Use a framework that handles watching, coalescing and caching (krt, statedb, controller-runtime)
|
||||
- Use predicates if you're using controller-runtime, this helps you filter out no-op events by checking them against the cache and filters
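The "diff first, then patch only what changed" tips can be sketched without any Kubernetes library. The example compares two plain dicts and only calls a hypothetical `send_patch` function when something differs; it is a simplification that covers added or changed fields only and ignores deletions and list merging.

```python
def changed_fields(current: dict, desired: dict) -> dict:
    """Return only the fields of desired that differ from current."""
    patch = {}
    for key, want in desired.items():
        have = current.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            nested = changed_fields(have, want)
            if nested:
                patch[key] = nested
        elif have != want:
            patch[key] = want
    return patch


def reconcile(current: dict, desired: dict, send_patch) -> bool:
    """Send a patch only if there is something to change.

    send_patch is a hypothetical callable standing in for the API call.
    Returns True if a request was sent.
    """
    patch = changed_fields(current, desired)
    if not patch:
        return False  # no-op: spare the apiserver the write
    send_patch(patch)
    return True
```

With a status of `{"ready": True, "replicas": 3}` and a desired `replicas: 4`, the request body shrinks to `{"status": {"replicas": 4}}`, and an unchanged object produces no request at all.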
|
||||
|
||||
## Q&A
|
||||
|
||||
- Do you know where your reconciliations are coming from:
|
||||
- Counts: Yes the frameworks provide metrics and you can implement your own
|
||||
- But controller runtime abstracts the patch source so you have to compare before and after state yourself - but you should not do that
|
||||
- What about state sharing across multiple threads?
|
||||
- Controller runtime handles each reconcile as idempotent, so you can just multithread
|
||||
- But handling consistency can still be hard because you have to design all of your operations as idempotent by rebuilding the state each time
|
||||
- What are your thoughts on controllers that do stuff in the real world (especially b/c it takes longer and there are no native observers)
|
||||
- Do something like the krt project by keeping the state separately
|
||||
- What if someone changes things at the cloud provider
|
||||
- A question of philosophy -> Usually just treat the operator as the source of truth
|
||||
- How do you test your operators?
|
||||
- Depends on your output (kubernetes objects make stuff simple)
|
||||
- For cilium: Simple b/c it's just creating kubernetes objects
|
||||
- With outside interaction: In-memory state representation or mocking
|
||||
- For complex controllers split the operator into: Ingestion, data model and transformation
|
||||
56
content/day1/04_gpus-go-round.md
Normal file
@@ -0,0 +1,56 @@
|
||||
---
|
||||
title: The GPUs on the bus go round and round
|
||||
weight: 4
|
||||
tags:
|
||||
- kubecon
|
||||
- gpu
|
||||
- nvidia
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/cLJRh4y4vXg" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
## Background
|
||||
|
||||
- They are the GeForce NOW folks
|
||||
- Large fleet of clusters all over the world (60.000+ GPUs)
|
||||
- They use kubevirt to pass through GPUs (vfio driver) or vGPUs
|
||||
- Devices fail from time to time
|
||||
- Sometimes failures need restarts
|
||||
|
||||
## Failure discovery
|
||||
|
||||
- Goal: Maintain capacity
|
||||
- Failure reasons: Overheating, insufficient power, driver issues, hardware faults, ...
|
||||
- Problem: They only detected failure by detecting capacity decreasing or not being able to switch drivers
|
||||
- Fix: First detect failure, then remediate
|
||||
- GPU Problem detector as part of their internal device plugin
|
||||
- Node Problem detector -> triggers remediation through maintenance
|
||||
|
||||
## Remediation approaches
|
||||
|
||||
- Reboot: Works every time, but has workload related downsides -> Legit solution, but drain can take very long
|
||||
- Discovery of remediation loops -> Too many reboots indicate something being not quite right
|
||||
- Optimized drain: Prioritize draining of nodes with failed devices before other maintenance
|
||||
- The current workflow is: Reboot (automated) -> Power cycle (automated) -> Rebuild Node (automated) -> Manual intervention / RMA
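The escalation workflow and the loop detection can be sketched in a few lines. This is not their tooling (it is not public); the step names, the `step_fixes_node` callback and the window and threshold defaults are assumptions picked for illustration.

```python
# Escalation ladder from the notes; the identifiers are made up.
ESCALATION = ["reboot", "power_cycle", "rebuild_node", "manual_intervention"]


def remediate(step_fixes_node):
    """Walk the ladder until a step fixes the node; return the steps tried."""
    tried = []
    for step in ESCALATION:
        tried.append(step)
        # The last rung is a human, so there is nothing left to escalate to.
        if step == ESCALATION[-1] or step_fixes_node(step):
            break
    return tried


def in_remediation_loop(reboot_times, now, window=7 * 24 * 3600, max_reboots=3):
    """Flag a node that was rebooted suspiciously often within the window.

    The one week window and the limit of 3 reboots are hypothetical defaults.
    """
    recent = [t for t in reboot_times if now - t <= window]
    return len(recent) > max_reboots
```

A node flagged by `in_remediation_loop` would be a candidate for skipping the reboot step and escalating directly.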
|
||||
|
||||
## Prevention
|
||||
|
||||
> Problems should not affect workload
|
||||
|
||||
- Healthchecks with alerts
|
||||
- Firmware & Driver updates
|
||||
- Thermal & power management
|
||||
|
||||
## Future Challenges
|
||||
|
||||
- What if a high density node with 8 GPUs has one failure?
|
||||
- What is an acceptable ratio of working to broken GPUs per node?
|
||||
- If there is a problematic node that has to be rebooted every couple of days, should the scheduler avoid this node?
|
||||
|
||||
## Q&A
|
||||
|
||||
- Are there any plans to open source the GPU problem detection: We could certainly do it, but it's not on the roadmap right now
|
||||
- Are the failure rates representative and what is counted as failure:
|
||||
- Failure is not being able to run a workload on a node (could be hardware or driver failure)
|
||||
- The failure rate is 0.6% but the affected capacity is 1.2% (with 2 GPUs per node)
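One way to read the two numbers from the last answer: if a single failed GPU takes the whole node out of service, every failure costs as many GPUs as the node has. The sketch below assumes exactly that, plus independent failures; both function names are made up.

```python
def affected_capacity_approx(gpu_failure_rate: float, gpus_per_node: int) -> float:
    """Linear approximation: every failed GPU blocks its whole node.

    Assumption: two GPUs of the same node failing at once is rare enough to ignore.
    """
    return gpu_failure_rate * gpus_per_node


def affected_capacity_exact(gpu_failure_rate: float, gpus_per_node: int) -> float:
    """Share of nodes (and therefore capacity) with at least one failed GPU.

    Assumption: GPU failures are independent of each other.
    """
    return 1 - (1 - gpu_failure_rate) ** gpus_per_node
```

With a rate of `0.006` and 2 GPUs per node the approximation gives 0.012, which matches the 1.2% from the talk. The same functions also show why the 8 GPU nodes from the future challenges hurt more: one failure there blocks eight devices.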
|
||||
64
content/day1/05_ressource-submission-bookkeeping.md
Normal file
@@ -0,0 +1,64 @@
|
||||
---
|
||||
title: "Reliable k8s resource Submission & Bookkeeping"
|
||||
weight: 5
|
||||
tags:
|
||||
- kubecon
|
||||
- platform
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/NCkHrvqFMl8" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/0d/Reliable%20K8S%20Resource%20Submission%20and%20Bookkeeping.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
## Service offerings
|
||||
|
||||
- Product: HA Container Platform for general utility with a focus on run-to-complete
|
||||
- Use-Cases: ML Orchestration, CI/CD, Machine maintenance, Financial analysis, Data Processing pipeline
|
||||
- Requirements: Observability, Scheduling Events, Approval process, Bookkeeping, Datacenter Resiliency
|
||||
- Focus: Resiliency (HA with datacenter failover)
|
||||
- What the user needs: Workflow (e.g. generate report, persist report, notify)
|
||||
- What we need for the user: ConfigMaps + Secrets, Workflow templates for the steps
|
||||
|
||||
## Challenges
|
||||
|
||||
- Read after modify across multiple datacenters
|
||||
- Many reads against kubeapi that could overload the apiserver
|
||||
- No native approval flows and limited audit
|
||||
|
||||
## Submission flows from a user's perspective
|
||||
|
||||
### Submission of runnables
|
||||
|
||||
- User: Submits runnable to submitter with audit
|
||||
- Submitter: Handles retry, verification, ...
|
||||
- Submitter: Configures workload on workload clusters
|
||||
|
||||

|
||||
|
||||
### Submission of deployables
|
||||
|
||||
- User: Deploys mutation to audit/source of truth
|
||||
- Syncer: Syncs deployables to workload clusters
|
||||
|
||||

|
||||
|
||||
## Reporting
|
||||
|
||||
- User wants: UI with latest status for all jobs
|
||||
- Compliance wants: Transactions on given resource for auditing
|
||||
- Implementation: Highly available inventory as single source of truth
|
||||
|
||||
```mermaid
|
||||
graph
|
||||
WorkflowAPI-->|reads|inventory
|
||||
Consumer-->|updates|inventory
|
||||
Producer-->|publishes events to|Consumer
|
||||
```
|
||||
|
||||
### Potential Problems
|
||||
|
||||
- Problem: Delete event does not get propagated from syncer to producer leading to zombie resources
|
||||
- Fix: Periodic Cleanup
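The periodic cleanup boils down to comparing the inventory against the source of truth and dropping everything that only exists in the inventory. A minimal sketch, assuming both sides can be listed by resource name; `periodic_cleanup` and its data shapes are hypothetical and not taken from the talk.

```python
def periodic_cleanup(inventory: dict, source_of_truth: set) -> list:
    """Remove inventory entries whose resource no longer exists upstream.

    inventory: resource name -> last known status (hypothetical shape)
    source_of_truth: names of resources that are still supposed to exist
    Returns the names of the removed zombies for auditing.
    """
    zombies = sorted(name for name in inventory if name not in source_of_truth)
    for name in zombies:
        # A lost delete event is repaired here, one cleanup interval late.
        del inventory[name]
    return zombies
```

Running this on a timer trades a bit of staleness (zombies live until the next run) for not depending on every single delete event being delivered.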
|
||||
|
||||
### Overview
|
||||
|
||||

|
||||
BIN
content/day1/_img/capi.png
Normal file
|
After Width: | Height: | Size: 75 KiB |
BIN
content/day1/_img/clusterapi-crd.png
Normal file
|
After Width: | Height: | Size: 112 KiB |
BIN
content/day1/_img/deployables.png
Normal file
|
After Width: | Height: | Size: 220 KiB |
BIN
content/day1/_img/runnables.png
Normal file
|
After Width: | Height: | Size: 266 KiB |
BIN
content/day1/_img/submission.png
Normal file
|
After Width: | Height: | Size: 297 KiB |
@@ -4,8 +4,26 @@ title: Day 1
|
||||
weight: 5
|
||||
---
|
||||
|
||||
TODO:
|
||||
Day 1 of the main KubeCon event started with a bunch of keynotes from the CNCF themselves (announcing the next locations for KubeCon - Amsterdam and Barcelona).
|
||||
They also announced a new sovereign cloud edge initiative (CNCF/LF meets EU and some German ministry) called "NeoNephos" with members like SAP, StackIt or T-Systems.
|
||||
|
||||
This is also the day the sponsor showcase opened - so expect more talking to people and meetings or demos and less straight up talks.
|
||||
|
||||
## Talk recommendations
|
||||
|
||||
* TODO:
|
||||
- Not that much about gpus with good control plane scaling advice: [Scaling GPU Clusters without melting down](./01_scaling-gpu)
|
||||
- Migrate a cluster to ClusterAPI without downtime: [Day 2000 - Migrating from kubeadm + ansible to clusterapi+talos](./02_migrations)
|
||||
- Some basic operator tips with good Q&A questions: [Don't write controllers like charlie don't does: Avoiding common kubernetes controller mistakes](./03_operator-mistakes)
|
||||
|
||||
## Other stuff I learned or people I talked to
|
||||
|
||||
- The crossplane maintainers (Upbound)
|
||||
- Anynines
|
||||
- Cloudfoundry/Korifi
|
||||
- FlatCar
|
||||
- Cert-Manager
|
||||
- Flux maintainers
|
||||
- OVH
|
||||
- Kubermatic
|
||||
- Isovalent
|
||||
- Spacelift: They employ some of the opentofu core maintainers
|
||||
38
content/day2/01_chance-of-kubernetes.md
Normal file
@@ -0,0 +1,38 @@
|
||||
---
|
||||
title: "Cloudy with a chance of kubernetes"
|
||||
weight: 1
|
||||
tags:
|
||||
- kubecon
|
||||
- platform
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/iCAFXF5ECto" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/bc/KubeCon%20EU%202025%20-%20Cloudy%20with%20a%20chance%20of%20Kubernetes_%20Going%20from%20one%20to%20three%20cloud%20providers%20-%20Laurent%20Bernaille%20%26%20Maxime%20Visonneau,%20Datadog.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
## Background
|
||||
|
||||
- Scale: 100s of clusters
|
||||
- Cloud: Azure, AWS, GCP
|
||||
- The baseline: Single AWS Region and applications on VMs
|
||||
- Goal: Operate on different locations
|
||||
- History: They added more and more regions - 6 Providers in 6 Regions across 29 locations
|
||||
- Problem: Different tooling across different cloud providers
|
||||
- Idea: Kubernetes abstracts the specific cloud provider infra
|
||||
|
||||
## The way
|
||||
|
||||
- Idea: Use managed kubernetes
|
||||
- Problem: In 2018 the managed offerings were in beta or very limited
|
||||
- Challenge: Opinionated cloud specific stuff
|
||||
|
||||
### Iterations
|
||||
|
||||
1. Clusters based on vms created by terraform and other automation tools -> They realized that they need multiple clusters per region
|
||||
2. Their own application delivery platform that deployed to the right clusters across regions for better DevEx
|
||||
3. k8s on k8s (hosted cp) -> Current setup with a terraform managed parent cluster
|
||||
4. Idea: Host the Parent-Cluster on managed kubernetes -> They need to abstract some things away
|
||||
5. Solution: Use their good old application delivery platform
|
||||
|
||||
### Abstractions
|
||||
|
||||
- Use custom CRDs to abstract the same behaviour across providers
|
||||
@@ -4,8 +4,21 @@ title: Day 2
|
||||
weight: 6
|
||||
---
|
||||
|
||||
TODO:
|
||||
The second day of kubecon was my main "meeting day" this year - aka there were a bunch of scheduled meetings with manufacturers, partners, potential partners or just to get to know someone/a project.
|
||||
What does this mean for you? Another day with only a few sessions (I only managed to attend two and only one was worthy of note taking) - the meeting notes are not available online.
|
||||
|
||||
## Talk recommendations
|
||||
In the evening we attended the "German Community Stammtisch".
|
||||
|
||||
* TODO:
|
||||
## Other stuff I learned or people I talked to
|
||||
|
||||
- Isovalent
|
||||
- Kubermatic
|
||||
- Portworx
|
||||
- Fastly
|
||||
- Syseleven
|
||||
- Netbird
|
||||
- VMware
|
||||
- Stackit
|
||||
- Harness
|
||||
- Mia Platform
|
||||
- and many, many more...
|
||||
53
content/day3/01_day-two.md
Normal file
@@ -0,0 +1,53 @@
|
||||
---
|
||||
title: "Surviving Day2: Picking the right tool to secure your kubernetes habitat"
|
||||
weight: 1
|
||||
tags:
|
||||
- kubecon
|
||||
- security
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/FqUPqroF-Rw" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/a1/Surviving%20Day2%20-%20Picking%20the%20Right%20Tool%20To%20Secure%20Your%20Kubernetes%20Habitat.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
Premise: The CNCF landscape includes a huuuge number (80+) of security(related) projects.
|
||||
Analogy: Animal kingdom (includes similar-ish animals that might do some of the same stuff but not entirely the same)
|
||||
|
||||
## Build Phase
|
||||
|
||||
- How can I scan my container for vulnerabilities? -> Well you probably mean your image
|
||||
- The image itself is just a bunch of static layers and we kinda have to trust the layers you didn't build yourself
|
||||
- The main tool used is still trivy with some easy steps
|
||||
1. Extract layers
|
||||
2. Build FS
|
||||
3. Identify OS and Non-OS Packages
|
||||
4. Compare with vuln-db
|
||||
- The animal in our analogy: Raccoon
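The four scan steps can be illustrated with a toy model. This is not how trivy is implemented: the layer format (dicts of path to content, `None` as deletion marker), the `/var/lib/pkgs` package list and the `DEMO-` vulnerability IDs are all invented for the example.

```python
def build_fs(layers):
    """Steps 1 + 2: apply the extracted layers in order to get the final filesystem."""
    fs = {}
    for layer in layers:
        for path, content in layer.items():
            if content is None:
                fs.pop(path, None)  # the layer deletes this file
            else:
                fs[path] = content  # later layers override earlier ones
    return fs


def installed_packages(fs):
    """Step 3: read a made-up package database with 'name version' per line."""
    packages = {}
    for line in fs.get("/var/lib/pkgs", "").splitlines():
        name, version = line.split()
        packages[name] = version
    return packages


def scan(layers, vuln_db):
    """Step 4: compare the installed packages with the vulnerability database."""
    packages = installed_packages(build_fs(layers))
    return sorted(
        vuln_id
        for (name, version), vuln_ids in vuln_db.items()
        if packages.get(name) == version
        for vuln_id in vuln_ids
    )
```

The sketch also shows why scanning the final filesystem matters: a package that was vulnerable in a base layer but upgraded in a later layer does not show up in the result.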
|
||||
|
||||
## Deploy Phase
|
||||
|
||||
- Kubernetes Native: Admission Controller
|
||||
- Tool used: Kyverno (integrates as an admission controller with yaml/crd based configuration)
|
||||
1. Modify (e.g. add default resource limits)
|
||||
2. Validate (check policies)
|
||||
- The animal is actually a human: The forest guard
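The "modify, then validate" order can be sketched without any cluster. Real Kyverno policies are YAML resources evaluated by the admission controller; the Python below only mimics the flow, and the default limits and the "no `:latest` tag" rule are example policies picked for illustration.

```python
import copy

# Hypothetical defaults a mutating policy could inject.
DEFAULT_LIMITS = {"cpu": "500m", "memory": "256Mi"}


def mutate(pod: dict) -> dict:
    """Step 1: add default resource limits where the user set none."""
    pod = copy.deepcopy(pod)  # never change the submitted object in place
    for container in pod["spec"]["containers"]:
        limits = container.setdefault("resources", {}).setdefault("limits", {})
        for key, value in DEFAULT_LIMITS.items():
            limits.setdefault(key, value)
    return pod


def validate(pod: dict) -> list:
    """Step 2: check policies on the already mutated object."""
    errors = []
    for container in pod["spec"]["containers"]:
        if container["image"].endswith(":latest"):
            errors.append(f"{container['name']}: image tag 'latest' is not allowed")
    return errors


def admit(pod: dict):
    """Run both steps and return (allowed, mutated pod, errors)."""
    mutated = mutate(pod)
    errors = validate(mutated)
    return (not errors, mutated, errors)
```

Validation runs on the mutated object, so a policy that requires limits would pass for a pod that only got them through the defaults.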
|
||||
|
||||
## Start Phase
|
||||
|
||||
- Before the pod itself is running, CSI, CNI and secret related processes (the ones we want to look into) happen
|
||||
- Problems: Secrets have no rotation or versioning mechanism, there is no default integration for external kms
|
||||
- Project: External Secrets -> Get secrets from external kms, automatically sync (e.g. new versions)
|
||||
- The chosen animal: Capricorn
|
||||
|
||||
## Run Phase
|
||||
|
||||
- Goal: Runtime scanning without including specialized instrumentation in each application
|
||||
- Tool: Falco utilizing eBPF to check system calls against rules
|
||||
- Idea: Detect dangerous behaviour (e.g. check for someone trying to exploit a fresh CVE)
|
||||
- The analogy: Falcon
|
||||
|
||||
## TL;DR
|
||||
|
||||
1. Scan images (trivy)
|
||||
2. Enforce best practices (kyverno)
|
||||
3. Use an external kms (external secrets)
|
||||
4. Scan at runtime (falco)
|
||||
30
content/day3/02_open-feature.md
Normal file
@@ -0,0 +1,30 @@
|
||||
---
|
||||
title: "Type-safe feature flagging in openfeature: Lessons learned from using feature flags at google"
|
||||
weight: 2
|
||||
tags:
|
||||
- kubecon
|
||||
- dev
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/mewXGSwDCE4" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/f6/Type-safe%20Feature%20Flagging%20in%20OpenFeature.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||
|
||||
## Feature flags?
|
||||
|
||||
- Idea: Change the behaviour of an application without rebuilding it
|
||||
- Goal: Control rollout, reduce risk, experiment (a/b)
|
||||
- At google: A huge number of feature flags (150k+) but that's because people forget to turn them off
|
||||
|
||||
## Where does the flag come from
|
||||
|
||||
- Lifecycle of a flag: Create, Manage, Deprecate, Delete -> But will it be created first in code or in the service
|
||||
- Classic implementation: Just an if/else that uses a function to get the flag
|
||||
- Problem: What if the flag names mismatch between the code and the flag service -> Multiple sources of truth
|
||||
- Solution: Require use of auto-generated flag bindings (codegen from the management system) to mitigate typos, etc.
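The difference between the classic lookup and generated bindings is easiest to see side by side. A minimal sketch, assuming a dict as stand-in for the flag service; `FLAG_SERVICE`, `get_flag` and the `Flags` class are hypothetical and not what Google's or OpenFeature's codegen emits.

```python
# Stand-in for the flag management service.
FLAG_SERVICE = {"new_checkout": True}


def get_flag(name: str, default: bool = False) -> bool:
    """Classic lookup by string: unknown names silently return the default."""
    return FLAG_SERVICE.get(name, default)


# Classic usage: the typo below is not an error, the feature just stays off.
checkout_enabled_with_typo = get_flag("new_chekout")


class Flags:
    """What generated bindings could look like: one accessor per known flag."""

    @staticmethod
    def new_checkout() -> bool:
        return get_flag("new_checkout")
```

With the bindings, a typo like `Flags.new_chekout()` fails loudly (an `AttributeError` at runtime, or earlier in a type checker or IDE) instead of quietly falling back to the default, and the flag name exists in exactly one place.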
|
||||
|
||||
## OpenFeature
|
||||
|
||||
- Goal: Vendor agnostic, standardized, open source
|
||||
- Basic setup: Register provider (once per app), create a client, use client to get flags
|
||||
- CLI: Integrate into management system, keep a local manifest of all flags and generate code (generates the client)
|
||||
- Now: Just call the client's method instead of hard-coding feature flag names
|
||||
43
content/day3/03_etcd-reliability.md
Normal file
@@ -0,0 +1,43 @@
|
||||
---
|
||||
title: "Don't let your kubernetes cluster go wild: Ensuring etcd reliability"
|
||||
weight: 3
|
||||
tags:
|
||||
- kubecon
|
||||
- etcd
|
||||
---
|
||||
|
||||
{{% button href="https://youtu.be/J93U9n_qxSI" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||
|
||||
Fair warning: This talk was very technical and pretty interesting - but don't even try to understand it if you're tired (or if it's the third to last session on the last day of a long conference).
|
||||
|
||||
## Baseline
|
||||
|
||||
- Standard example: Write and read KV-Data, `put(A, 2) -> get(A)`
|
||||
- Problem: Concurrency
|
||||
|
||||
TODO: Steal image from intuition of correctness
|
||||
|
||||
## Correctness
|
||||
|
||||
- Correctness: Kinda funky when it comes to time
|
||||
- Fix: Define a serialization that executes parallel requests one after another to bring them into an order
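To make the serialization idea concrete: a history of concurrent requests is fine if there is at least one sequential order that respects real time (a request that finished before another one started comes first) and in which every read returns the last written value. The brute force check below is only a sketch of that definition and not the etcd tooling; the `Op` type and all function names are made up, and trying all permutations only works for tiny histories.

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Op(NamedTuple):
    kind: str              # "put" or "get"
    key: str
    value: Optional[int]   # value written by put / value observed by get
    start: float           # time the request was sent
    end: float             # time the response arrived


def respects_real_time(order) -> bool:
    """An op that finished before another one started must be ordered first."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True


def is_valid_sequential(order) -> bool:
    """Replay the ops one after another against a plain dict."""
    store = {}
    for op in order:
        if op.kind == "put":
            store[op.key] = op.value
        elif store.get(op.key) != op.value:
            return False  # the read saw something the order cannot explain
    return True


def is_correct(history) -> bool:
    """True if at least one serialization explains all observed results."""
    return any(
        respects_real_time(order) and is_valid_sequential(order)
        for order in permutations(history)
    )
```

A `get(A)` that starts after `put(A, 2)` has finished but still returns nothing has no valid order and is flagged as a stale read; if the two requests overlap in time, both results are acceptable.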
|
||||
|
||||
## Failures
|
||||
|
||||
- What happens if connections between etcd nodes go down -> Serving stale data
|
||||
- What happens if data corrupts -> If enough members are online, it can repair itself
|
||||
- And many more that can happen at random times -> Hard to test
|
||||
|
||||
TODO: Steal "in a concurrent world"
|
||||
|
||||
## Robustness framework
|
||||
|
||||
- Automates tests for failures
|
||||
- Includes reliable reproductions of past (seemingly random) errors
|
||||
- Currently a mixture of existing go debugging tools
|
||||
|
||||
## Future
|
||||
|
||||
- Reproduce more bugs consistently
|
||||
- Run additional consistency checks
|
||||
@@ -4,8 +4,15 @@ title: Day 3
|
||||
weight: 7
|
||||
---
|
||||
|
||||
TODO:
|
||||
The last day of KubeCon - aka the day everyone leaves early.
|
||||
But not me and I had no meetings scheduled for this day -> More talks for me and notes for you.
|
||||
|
||||
This being my 7th day of the trip and 6th day of non-stop conferences took a bit of a toll on my note taking skills (expect more spelling mistakes).
|
||||
|
||||
## Talk recommendations
|
||||
|
||||
* TODO:
|
||||
- Intro to feature flags and related tips: [Type-safe feature flagging in openfeature: Lessons learned from using feature flags at google](./02_open-feature)
|
||||
|
||||
## Other stuff I learned or people I talked to
|
||||
|
||||
- TODO:
|
||||
@@ -4,4 +4,6 @@ title: Lessons Learned
|
||||
weight: 8
|
||||
---
|
||||
|
||||
Not related to any talk directly, but I can recommend this [Blog Post](https://smudge.ai/blog/ratelimit-algorithms) and [Video](https://www.youtube.com/watch?v=8QyygfIloMc&) about rate limiting.
|
||||
|
||||
TODO:
|
||||
|
||||