Compare commits
53 Commits
936a4c8c3a
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| b9060af72d | |||
| 3afb07e4c1 | |||
| 4becb06ad3 | |||
| 0e24bf4fd6 | |||
| f06c486182 | |||
| f71971e793 | |||
| a7a3817a03 | |||
| 47f7869257 | |||
| b2fd7a4c81 | |||
| 1213be7c30 | |||
| 1f49a42edc | |||
| c6f716ced1 | |||
| 09ac5a9051 | |||
| 5ed623d0ca | |||
| f8ca21416b | |||
| dc4dd2d883 | |||
| 957bc94344 | |||
| 44a3653c84 | |||
| 6bf47e49c5 | |||
| 39d92acdb4 | |||
| 4d528bf5de | |||
| d2f3f5f95d | |||
| 6d0c95a8ac | |||
| 3e4fbb616b | |||
| d9605d602e | |||
| 745e8f5896 | |||
| 78ca5973b8 | |||
| 77f34ed1ab | |||
| a36f562cf4 | |||
| 9ad9af0f9c | |||
| 4f39c1102c | |||
| df93624814 | |||
| 46b06c66fd | |||
| b4d8aa29c3 | |||
| 4cec1917bf | |||
| bd7d9fe87d | |||
| f4858d81a8 | |||
| bfcfe88cea | |||
| 45a26383e0 | |||
| 8dbdfd938f | |||
| 8941108720 | |||
| f8512dc6ae | |||
| c09bf8f637 | |||
| d90d5b8eab | |||
| 8b78108a60 | |||
| d09e3ff3d1 | |||
| 8ddf87d2f4 | |||
| 720d68803d | |||
| f0229abafd | |||
| 723051c498 | |||
| 7e6d0fc47f | |||
| fe8fa9693a | |||
| 8aab9217fe |
@@ -1,4 +1,4 @@
|
|||||||
FROM registry.odit.services/hub/hugomods/hugo:exts AS build
|
FROM registry.odit.services/hub/hugomods/hugo:exts-0.145.0 AS build
|
||||||
WORKDIR /app
|
WORKDIR /app
|
||||||
|
|
||||||
COPY . /app/
|
COPY . /app/
|
||||||
|
|||||||
@@ -6,5 +6,6 @@ tags:
|
|||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
TODO:
|
TODO:
|
||||||
@@ -10,11 +10,12 @@ This current version is probably full of typos - will fix later. This is what ty
|
|||||||
## How did I get there?
|
## How did I get there?
|
||||||
|
|
||||||
I attended Cloud Native Rejekts and KubeCon + CloudNativeCon Europe 2025 in London.
|
I attended Cloud Native Rejekts and KubeCon + CloudNativeCon Europe 2025 in London.
|
||||||
|
This year I was sent there by my employer [DATEV eG](https://datev.de) - thanks again to everyone who helped me with getting this trip approved (you know who you are).
|
||||||
|
|
||||||
Why? Because learning about all the new things in the world of cloud is really important, and war stories help to avoid mistakes that others already made.
|
Why? Because learning about all the new things in the world of cloud is really important, and war stories help to avoid mistakes that others already made.
|
||||||
And [last year's experience](https://kubecon24.nicolai-ort.com) was really good, so I wanted to go again.
|
And [last year's experience](https://kubecon24.nicolai-ort.com) was really good, so I wanted to go again.
|
||||||
|
|
||||||
Plus I actually presented a talk at Cloud Native Rejekts.
|
Plus I actually presented a talk at Cloud Native Rejekts 🥳.
|
||||||
|
|
||||||
## And how does this website get its content
|
## And how does this website get its content
|
||||||
|
|
||||||
@@ -24,9 +25,22 @@ graph LR
|
|||||||
Nicolai-->|"Takes notes (and typos) + commits"|Repo
|
Nicolai-->|"Takes notes (and typos) + commits"|Repo
|
||||||
Repo-->|Triggers|Actions
|
Repo-->|Triggers|Actions
|
||||||
Actions-->|Builds image and pushes to|Registry
|
Actions-->|Builds image and pushes to|Registry
|
||||||
Kubernetes-->|Pulls latest image|Registry
|
Flux-->|Detects new image|Registry
|
||||||
|
Flux-->|Rolls out new image|Kubernetes
|
||||||
```
|
```
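
The "Flux detects new image, then rolls it out" step in the diagram above can be sketched roughly as follows. This is only an illustration of the idea, not Flux's implementation: the tag format (`main-<short sha>-<unix timestamp>`) and the function names `latest_tag` and `reconcile` are assumptions made for this example.

```python
from __future__ import annotations

import re
from typing import Optional

# Assumed (hypothetical) tag format: "main-<short sha>-<unix timestamp>".
TAG_PATTERN = re.compile(r"^main-[a-f0-9]+-(?P<ts>\d+)$")


def latest_tag(tags: list[str]) -> Optional[str]:
    """Return the matching tag with the highest timestamp, or None."""
    candidates = []
    for tag in tags:
        match = TAG_PATTERN.match(tag)
        if match:
            candidates.append((int(match.group("ts")), tag))
    if not candidates:
        return None
    return max(candidates)[1]


def reconcile(deployed: str, registry_tags: list[str]) -> str:
    """Return the tag that should be running after one reconciliation pass."""
    newest = latest_tag(registry_tags)
    if newest is None or newest == deployed:
        return deployed  # nothing new in the registry
    return newest  # a real setup would update the manifest and roll it out
```

Tags that do not match the pattern (for example `latest`) are ignored, so only images built from the main branch would ever be rolled out in this sketch.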
|
||||||
|
|
||||||
|
## Changelog™️
|
||||||
|
|
||||||
|
- 2025-03-28: Initial repo and deployment setup
|
||||||
|
- 2025-03-30: First day of Cloud Native Rejekts
|
||||||
|
- 2025-03-31: Second day of Cloud Native Rejekts
|
||||||
|
- 2025-04-01: First day of KubeCon/CloudNativeCon
|
||||||
|
- 2025-04-02: Second day of KubeCon/CloudNativeCon
|
||||||
|
- 2025-04-03: Added video links for Cloud Native Rejekts
|
||||||
|
- 2025-04-03: Third day of KubeCon/CloudNativeCon
|
||||||
|
- 2025-04-04: Fourth day of KubeCon/CloudNativeCon
|
||||||
|
- 2025-04-07: Added missing images and slide links for KubeCon/CloudNativeCon
|
||||||
|
|
||||||
## Style Guide
|
## Style Guide
|
||||||
|
|
||||||
The basic structure is as follows: `day/event-or-session`.
|
The basic structure is as follows: `day/event-or-session`.
|
||||||
@@ -6,7 +6,8 @@ tags:
|
|||||||
- security
|
- security
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=JAy6Ra0ulSw" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
## Baseline
|
## Baseline
|
||||||
|
|
||||||
|
|||||||
@@ -2,10 +2,12 @@
|
|||||||
title: "The Hidden Brains of Kubernetes: Meet Controllers Powering the Cloud"
|
title: "The Hidden Brains of Kubernetes: Meet Controllers Powering the Cloud"
|
||||||
weight: 2
|
weight: 2
|
||||||
tags:
|
tags:
|
||||||
- <tag>
|
- rejekts
|
||||||
|
- operator
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=PciVvE02L2w" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
## Big Picture
|
## Big Picture
|
||||||
|
|
||||||
|
|||||||
12
content/day-1/02_gslb.md
Normal file
@@ -0,0 +1,12 @@
|
|||||||
|
---
|
||||||
|
title: Evaluating Global Load Balancing Options for Kubernetes in Practice
|
||||||
|
weight: 2
|
||||||
|
tags:
|
||||||
|
- rejekts
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/RBMRU8rtxfI" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://github.com/nicolaiort/rejekts2025-gslb" style="tip" icon="code" %}}Demo-Code and more{{% /button %}}
|
||||||
|
{{% button href="https://de.slideshare.net/slideshow/evaluating-global-load-balancing-options-for-kubernetes-in-practice-kubermatic-datev/277640385" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
My talk, notes will be released soon
|
||||||
@@ -5,7 +5,8 @@ tags:
|
|||||||
- rejekts
|
- rejekts
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=DdQzGsiounY" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
## The clans (popular solutions)
|
## The clans (popular solutions)
|
||||||
|
|
||||||
|
|||||||
@@ -2,11 +2,11 @@
|
|||||||
title: Understanding and Debugging DNS in Kubernetes Clusters
|
title: Understanding and Debugging DNS in Kubernetes Clusters
|
||||||
weight: 4
|
weight: 4
|
||||||
tags:
|
tags:
|
||||||
- <tag>
|
- rejekts
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=awXjABDknww" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
{{% button href="https://github.com/mqasimsarfraz/talks/tree/main/CloudNativeRejekts-2025" style="transparent" icon="person-chalkboard" %}}Slides{{% /button %}}
|
{{% button href="https://github.com/mqasimsarfraz/talks/tree/main/CloudNativeRejekts-2025" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|||||||
@@ -6,7 +6,8 @@ tags:
|
|||||||
- edge
|
- edge
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=jywpFlOH3z0" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
## The far edge
|
## The far edge
|
||||||
|
|
||||||
|
|||||||
@@ -6,7 +6,8 @@ tags:
|
|||||||
- multicluster
|
- multicluster
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=w8rDxtrMGG8" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
## Baseline Infra
|
## Baseline Infra
|
||||||
|
|
||||||
@@ -48,4 +49,11 @@ TODO: Steal diagram from slides
|
|||||||
|
|
||||||
## Demo
|
## Demo
|
||||||
|
|
||||||
Pretty interesting, watch the video to find out
|
Pretty interesting, watch the video to find out
|
||||||
|
|
||||||
|
|
||||||
|
## Q&A
|
||||||
|
|
||||||
|
- Do you need a flat network: No, just expose the TCP LB
|
||||||
|
- Did you think about using etcd to implement the leases instead of objects: They use managed hostplanes and don't want another etcd
|
||||||
|
- Have you tried to commit upstream: Nope, pretty much not an option thanks to the managed control-plane not being able to set appropriate flags
|
||||||
|
|||||||
@@ -9,10 +9,10 @@ This was another very interesting day and I can only recommend attending cloud n
|
|||||||
|
|
||||||
## Talk recommendations
|
## Talk recommendations
|
||||||
|
|
||||||
- My Talk: [Evaluating Global Load Balancing Options for Kubernetes in Practice](todo:)
|
- My Talk: [Evaluating Global Load Balancing Options for Kubernetes in Practice](./02_gslb)
|
||||||
- Service Mesh Intro + Comparison: [The service mesh wars - a new hope for kubernetes](../03_service-mesh)
|
- Service Mesh Intro + Comparison: [The service mesh wars - a new hope for kubernetes](./03_service-mesh)
|
||||||
- How to handle eviction and statefulness across clusters: [Scaling PDBs: Introducing Multi-Cluster Resilience with x-pdb](../06_scaling-pdbs)
|
- How to handle eviction and statefulness across clusters: [Scaling PDBs: Introducing Multi-Cluster Resilience with x-pdb](./06_scaling-pdbs)
|
||||||
- Intro to operators: [The Hidden Brains of Kubernetes: Meet Controllers Powering the Cloud](../02_controllers)
|
- Intro to operators: [The Hidden Brains of Kubernetes: Meet Controllers Powering the Cloud](./02_controllers)
|
||||||
|
|
||||||
## Other stuff I learned or people i talk to
|
## Other stuff I learned or people i talk to
|
||||||
|
|
||||||
|
|||||||
@@ -7,5 +7,6 @@ tags:
|
|||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
Short opening keynote thanking volunteers and attendees.
|
Short opening keynote thanking volunteers and attendees.
|
||||||
@@ -8,7 +8,8 @@ tags:
|
|||||||
- multicluster
|
- multicluster
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=r0W6cCJAGro" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
The talk started with a base introduction of ClusterAPI and the operations at gigantswarm.
|
The talk started with a base introduction of ClusterAPI and the operations at gigantswarm.
|
||||||
|
|
||||||
|
|||||||
@@ -6,7 +6,8 @@ tags:
|
|||||||
- keynote
|
- keynote
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=m9NRk-6MSvY" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
A short keynote from Microsoft about their contributions to open source and used tools:
|
A short keynote from Microsoft about their contributions to open source and used tools:
|
||||||
- infra (kubernetes, istio, hyperlight)
|
- infra (kubernetes, istio, hyperlight)
|
||||||
|
|||||||
@@ -6,7 +6,8 @@ tags:
|
|||||||
- multicluster
|
- multicluster
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=e1BmT0jc_Fs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
## Background
|
## Background
|
||||||
|
|
||||||
|
|||||||
@@ -5,7 +5,8 @@ tags:
|
|||||||
- rejekts
|
- rejekts
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=CAPtQnH4rPY" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
## Recruitment & Staffing
|
## Recruitment & Staffing
|
||||||
|
|
||||||
|
|||||||
@@ -5,7 +5,8 @@ tags:
|
|||||||
- rejekts
|
- rejekts
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=qNShvqSTKCU" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
## Background: The state of cloud in mauritius
|
## Background: The state of cloud in mauritius
|
||||||
|
|
||||||
|
|||||||
@@ -6,7 +6,8 @@ tags:
|
|||||||
- performance
|
- performance
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=EYipC5y-8rM" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
There were more details in the talk than I copied into these notes.
|
There were more details in the talk than I copied into these notes.
|
||||||
Most of them were just too much to write down or application specific.
|
Most of them were just too much to write down or application specific.
|
||||||
|
|||||||
@@ -6,7 +6,8 @@ tags:
|
|||||||
- crossplane
|
- crossplane
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=D4bKe4rAasc" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
Joint effort of novo-nordik and upbound.
|
Joint effort of novo-nordik and upbound.
|
||||||
|
|
||||||
|
|||||||
@@ -6,7 +6,8 @@ tags:
|
|||||||
- security
|
- security
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=rJacyDygVi0" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
## Why does e2e authenticity matter?
|
## Why does e2e authenticity matter?
|
||||||
|
|
||||||
|
|||||||
@@ -5,7 +5,8 @@ tags:
|
|||||||
- rejekts
|
- rejekts
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
{{% button href="https://www.youtube.com/watch?v=1US_-3udMDo" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
## Hypothesis
|
## Hypothesis
|
||||||
|
|
||||||
|
|||||||
@@ -10,12 +10,12 @@ This is the first day of Cloud Native Rejekts and the first time of me attending
|
|||||||
|
|
||||||
> Ranked by should watch to could watch
|
> Ranked by should watch to could watch
|
||||||
|
|
||||||
- How to hire, manage and develop engineers: [Tech is broken and AI won't fix it](../05_broken-tech)
|
- How to hire, manage and develop engineers: [Tech is broken and AI won't fix it](./05_broken-tech)
|
||||||
- What if my homelab is an african island: [Geographically Distributed Clusters: Resilient Distributed Compute on the Edge](../06_geo-distributed-clusters)
|
- What if my homelab is an african island: [Geographically Distributed Clusters: Resilient Distributed Compute on the Edge](./06_geo-distributed-clusters)
|
||||||
- Bootstrap and CI/CD with crossplane: [Building air-gapped control planes for a global pharma leader using crossplane and argo](../08_airgapped-cp)
|
- Bootstrap and CI/CD with crossplane: [Building air-gapped control planes for a global pharma leader using crossplane and argo](./08_airgapped-cp)
|
||||||
- Handling large number of clusters: [CRD Data Architecture for Multi-Cluster Kubernetes](../04_multicluster-crd)
|
- Handling large number of clusters: [CRD Data Architecture for Multi-Cluster Kubernetes](./04_multicluster-crd)
|
||||||
- Handling large scale migrations: [The Cluster API Migration Retrospective: Live migrating hundreds of clusters to Cluster API](../02_clusterapi)
|
- Handling large scale migrations: [The Cluster API Migration Retrospective: Live migrating hundreds of clusters to Cluster API](./02_clusterapi)
|
||||||
|
|
||||||
## Other stuff I learned or people i talk to
|
## Other stuff I learned or people i talk to
|
||||||
|
|
||||||
- Throughout the lunch break I talked to a nice guy who heard my government question during the [Tech is broken and AI won't fix it](../05_broken-tech)-Talk, we talked
|
- Throughout the lunch break I talked to a nice guy who heard my government question during the [Tech is broken and AI won't fix it](./05_broken-tech)-Talk, we talked
|
||||||
27
content/day0/01_project-update.md
Normal file
@@ -0,0 +1,27 @@
|
|||||||
|
---
|
||||||
|
title: Project update
|
||||||
|
weight: 1
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
---
|
||||||
|
|
||||||
|
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/70/Platforms%20WG%20Update%20slides%20-%20Kubecon%20EU%202025.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
An update from the platform working group which will be renamed to the CNCF Platform Engineering Community.
|
||||||
|
Alongside the new name a bit of restructuring will take place because the working group outgrew the working group label.
|
||||||
|
|
||||||
|
## Initiatives
|
||||||
|
|
||||||
|
### Supported initiatives
|
||||||
|
|
||||||
|
- Platform Glossary and Whitepaper: What is a platform
|
||||||
|
- Platform Maturity Model & Assessment: A Platform is a living thing that evolves
|
||||||
|
- Platform as a Product: Currently in the research stage
|
||||||
|
- Platform Community Formation: The - above mentioned - restructuring
|
||||||
|
|
||||||
|
### Monitored initiatives
|
||||||
|
|
||||||
|
- Cloud Native Platform Engineering Associate (CNPA): Certification is being formed
|
||||||
|
- Cloud Native Platform Engineer (CNPE): Will follow after CNPA
|
||||||
30
content/day0/02_sponsored-stbsdw.md
Normal file
@@ -0,0 +1,30 @@
|
|||||||
|
---
|
||||||
|
title: Stop building, start delivering workloads
|
||||||
|
weight: 2
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
- sponsored
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/7tbs3J7mgE0" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
|
## States of platform
|
||||||
|
|
||||||
|
1. Platform is being built and getting delayed
|
||||||
|
2. Platform finished and not adopted
|
||||||
|
3. Re-Platforming and guessing if the new platform will meet the same end
|
||||||
|
4. Platform is low maintenance and devs are happy (nice story bro)
|
||||||
|
|
||||||
|
Failure should be fine but it's no longer an option in most cases
|
||||||
|
|
||||||
|
## What do we want?
|
||||||
|
|
||||||
|
> Wishlist
|
||||||
|
|
||||||
|
- Support for all workloads
|
||||||
|
- Consistent experiences across ui, api, cli and gitops
|
||||||
|
- Pathway from preview to prod
|
||||||
|
- Multi-cloud and onprem
|
||||||
|
- Abstract infra
|
||||||
32
content/day0/03_sponsored-cortex.md
Normal file
@@ -0,0 +1,32 @@
|
|||||||
|
---
|
||||||
|
title: "Platform Engineering with a Product Management Mindset: 10x your DevEx"
|
||||||
|
weight: 3
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
- sponsored
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/MFLXFNlmMMI" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
|
This whole talk is pretty much a product manager's view on platform engineering.
|
||||||
|
|
||||||
|
## Where can it go wrong
|
||||||
|
|
||||||
|
- Assuming customer needs - build for hypothetical developers
|
||||||
|
- Output > Outcome
|
||||||
|
- Ignore stakeholder ecosystem
|
||||||
|
|
||||||
|
TODO: Steal slide
|
||||||
|
|
||||||
|
## PaaP (Platform as a product)
|
||||||
|
|
||||||
|
- Anticipate developer needs: Don't just fulfill requests
|
||||||
|
- Design for all personas and survey related teams
|
||||||
|
- Prioritize Features according to research themes
|
||||||
|
- Deliver incremental value with feedback loops
|
||||||
|
|
||||||
|
## Hierarchy of goals and baselines
|
||||||
|
|
||||||
|
TODO: Copy slide over
|
||||||
27
content/day0/04_sponsored-gitpod.md
Normal file
@@ -0,0 +1,27 @@
|
|||||||
|
---
|
||||||
|
title: "The platform Engineer gauntlet: Three defining challenges in the AI era"
|
||||||
|
weight: 4
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
- sponsored
|
||||||
|
---
|
||||||
|
|
||||||
|
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
|
## Conviction
|
||||||
|
|
||||||
|
- Background: There is an absence of platform leadership
|
||||||
|
- Reason: Most "leaders" don't push services or features to developers with conviction
|
||||||
|
- Solution: Be proud and use your leadership role with courage
|
||||||
|
|
||||||
|
## Focus
|
||||||
|
|
||||||
|
- Focus on developers
|
||||||
|
- Don't only focus on the production ecosystem (observability, ci/cd) but also the path to this end
|
||||||
|
|
||||||
|
## Foundations
|
||||||
|
|
||||||
|
- Problem: Many companies are falling behind their AI goals because of missing baseline automation
|
||||||
|
- Solution: Embrace the AI
|
||||||
13
content/day0/05_sponsored-vultr.md
Normal file
@@ -0,0 +1,13 @@
|
|||||||
|
---
|
||||||
|
title: "Containerization beyond CPUs - A Kubernetes based serverless platform for ai native applications"
|
||||||
|
weight: 5
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
- sponsored
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/XrMsJIL35Oc" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
|
Hypothesis: We are at the beginning of a 10 year cycle that is moving towards ai-native applications.
|
||||||
61
content/day0/06_hire-engineers.md
Normal file
@@ -0,0 +1,61 @@
|
|||||||
|
---
|
||||||
|
title: So you want to hire for platform engineering
|
||||||
|
weight: 6
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/cl-MO7j7MHY" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
|
Hypothesis: The bar for good interviewing is somewhere near the earth's core and we need to improve this (because we need more engineers)
|
||||||
|
|
||||||
|
## Resilience engineering
|
||||||
|
|
||||||
|
> The overarching concepts that apply to platforms or just "how to make code work"
|
||||||
|
|
||||||
|
Idea: Four main goals that align with different roles under the mothership "resilience engineering"
|
||||||
|
|
||||||
|
- Rebound: SRE
|
||||||
|
- Robustness: Infra
|
||||||
|
- Graceful extensibility: Platform Engineering
|
||||||
|
- Sustained adaptability: DevEx (often pulled out into something else)
|
||||||
|
|
||||||
|
Bonus things to look out for
|
||||||
|
|
||||||
|
- Intellectual Humility: The ability to learn new things and to accept that you might know much but not everything
|
||||||
|
- Ecological awe: The awe experienced when looking at beautiful nature and feeling small or just looking at the CNCF landscape
|
||||||
|
|
||||||
|
## What do you need for the first team
|
||||||
|
|
||||||
|
- People who are able to hire new people and willing to step up to leadership in the long term
|
||||||
|
- Generalists
|
||||||
|
|
||||||
|
## The process and what to do
|
||||||
|
|
||||||
|
What should happen before we hire someone (either in one or multiple interviews).
|
||||||
|
|
||||||
|
1. Learn about each other
|
||||||
|
2. Solve a technical problem together
|
||||||
|
3. Solve a sociological problem together
|
||||||
|
4. How do you and your future coworkers/stakeholders get along
|
||||||
|
|
||||||
|
Make sure the end-to-end time (first interview to yes or no) is low (ideally under two weeks)
|
||||||
|
All of your current engineers should be able to pass the interview without studying in advance (no stupid)
|
||||||
|
|
||||||
|
## Potential Failures and fallacies
|
||||||
|
|
||||||
|
- The fallacy of demographics in = demographics out
|
||||||
|
- Treating interviews like hazing
|
||||||
|
- You don't track after-hire indicators
|
||||||
|
- Whiteboard interviews: They are stupid repetition and regurgitation and have zero relation to real-world work
|
||||||
|
- There are no real studies on how to assess and hire talent
|
||||||
|
|
||||||
|
### Flags
|
||||||
|
|
||||||
|
- Passion is usually interpreted as "puts up with abuse" and should not be mistaken for caring -> See "Ecological awe"
|
||||||
|
- Side projects probably indicate a lack of family/social time ("I make my wife raise the kids") -> Side projects are not a good indicator, maybe they are brilliant at their job but love their free time
|
||||||
|
- A Moneyball-like process (data-driven decision) completely counters how talent is perceived -> Expand the hiring pool to anybody and ignore the classical "indicators of talent"
|
||||||
|
- Discriminated demographics probably have a better grip on systems thinking (due to being forced to make choices)
|
||||||
|
- Systems thinking is more important than platform knowledge (If you can think in terms of organization and dependencies you can work on platforms)
|
||||||
62
content/day0/07_past-present-future.md
Normal file
@@ -0,0 +1,62 @@
|
|||||||
|
---
|
||||||
|
title: The past, the present and the future of platform engineering
|
||||||
|
weight: 7
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
- viktor
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/uwDoHm-AxTM" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
|
The good old baseline is "I am a developer, I write code - now I have to do stuff to continue writing code".
|
||||||
|
Most developers will continue on to "now I have to write scripts" in order to just do their jobs instead of working on infra.
|
||||||
|
|
||||||
|
These scripts evolve to tools which evolve into an internal platform (if everyone starts using it).
|
||||||
|
Other base components can also feel like platforms (for example application servers).
|
||||||
|
|
||||||
|
## The early day evolution
|
||||||
|
|
||||||
|
- Hudson
|
||||||
|
- Docker: Not really building platforms, rather standardized application packaging
|
||||||
|
- Kubernetes (and Nomad + Swarm): A new concept of scheduling instead of just running the application in a container
|
||||||
|
|
||||||
|
=> We've been building platforms (or failing to build them) for years and years but now we kinda agree about what a platform is
|
||||||
|
|
||||||
|
## Present
|
||||||
|
|
||||||
|
We have the base idea of a platform
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
graph LR
|
||||||
|
ServiceConsumers-->|Consume through|HTTPAPI-->|Trigger work on|Controllers-->|So|Services
|
||||||
|
ServiceOwner-->|Manages|Services
|
||||||
|
```
|
||||||
|
|
||||||
|
- The first question: Do we use public controllers (e.g. the CNCF landscape projects) or build our own?
|
||||||
|
- Result: Mostly a mix starting with public, realizing needs and expanding
|
||||||
|
|
||||||
|
## Make it your own
|
||||||
|
|
||||||
|
- Goal: Make the platform domain specific for your developers
|
||||||
|
- Evolution: Tools like DAPR for developers or Crossplane for api-building
|
||||||
|
- Build the API and Controllers first - dashboard, gitops, observability, ... second
|
||||||
|
- Remember that kubernetes can manage anything - not just containers
|
||||||
|
|
||||||
|
TODO: Steal image
|
||||||
|
|
||||||
|
## Blueprints
|
||||||
|
|
||||||
|
Take all of the projects you need, combine them and hide the complexity
|
||||||
|
High level architecture of internal platforms is the same as public ones (aws, ...) but internal and built on kubernetes.
|
||||||
|
|
||||||
|
TODO: Steal images for platform blueprints (3 slides)
|
||||||
|
|
||||||
|
## Future
|
||||||
|
|
||||||
|
- Platform Engineering certification by the CNCF is on the horizon
|
||||||
|
- Do we need to hide kubernetes from developers? Maybe -> The CNCF is starting groups to get app devs closer to platform engineers
|
||||||
|
- More specialized multi-cluster tools have sprung up in the last year (scheduling, discovery, management)
|
||||||
|
- AI things are happening and we should utilize them, but not just by calling an LLM directly and calling it a day -> e.g. the Dapr LLM abstraction API
|
||||||
|
- Platforms are not built in isolation, we need to help each other
|
||||||
75
content/day0/08_product-thinking.md
Normal file
@@ -0,0 +1,75 @@
|
|||||||
|
---
|
||||||
|
title: Product thinking for cloud native engineers
|
||||||
|
weight: 8
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/8_pB9RAfzrY" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/48/Product%20Thinking%20for%20Cloud%20Native%20Engineers%20PlatformEngineeringDay-EU-25.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
## How & Why
|
||||||
|
|
||||||
|
- IT was a cost center for a long time - now it's critical but still treated as a cost center
|
||||||
|
- Why is it important: Too much focus on the technical aspects instead of value delivery
|
||||||
|
- Importance: Show the value of your work (which means your work has to provide value)
|
||||||
|
- Operations and coordination work is not easily visible, but very important
|
||||||
|
|
||||||
|
## Principles
|
||||||
|
|
||||||
|
- Focus on user value: User problems > Solutions
|
||||||
|
- Outcome (Value) > Output (Tickets closed)
|
||||||
|
- Products (lifecycle and ownership) before projects (just setting stuff up)
|
||||||
|
|
||||||
|
### User value
|
||||||
|
|
||||||
|
- "Who is the user": Builders, Enablers, Regulatory, "Viewers"
|
||||||
|
- "What is the value": Make the organization more efficient while avoiding risks
|
||||||
|
|
||||||
|
## How to start?
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
### Exploring the Problem Space
|
||||||
|
|
||||||
|
Goals:
|
||||||
|
- Identify top pains
|
||||||
|
- Build empathy and understanding
|
||||||
|
- Investigate key business aims
|
||||||
|
|
||||||
|
Techniques:
|
||||||
|
- Customer and stakeholder interviews: Talk to people, they will probably tell you about their pain
|
||||||
|
- Data/Process analysis: Where are our bottlenecks?
|
||||||
|
- Shadowing: Really see how the day to day works
|
||||||
|
- Ask "Why"
|
||||||
|
- Read business updates (current goals)
|
||||||
|
- Build dashboards that show progress and value
|
||||||
|
|
||||||
|
### Defining the problem space
|
||||||
|
|
||||||
|
Goals:
|
||||||
|
- Identify opportunities
|
||||||
|
- Prioritise
|
||||||
|
- Gather insights and data
|
||||||
|
|
||||||
|
Techniques:
|
||||||
|
- Value stream mapping
|
||||||
|
- RICE, Value vs Effort or other cost-benefit analysis
|
||||||
|
- Analyse your exploration process
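To make the RICE technique mentioned above concrete, here is a minimal sketch. It assumes the commonly used formula (Reach x Impact x Confidence / Effort). The backlog items and all numbers are made up for illustration and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    name: str
    reach: float       # e.g. users affected per quarter
    impact: float      # e.g. 0.25 (minimal) up to 3 (massive)
    confidence: float  # 0.0 to 1.0
    effort: float      # e.g. person-months, must be > 0

    def rice(self) -> float:
        # RICE score: higher means more value per unit of effort
        return self.reach * self.impact * self.confidence / self.effort


def prioritise(items):
    """Return the opportunities sorted by RICE score, best first."""
    return sorted(items, key=lambda o: o.rice(), reverse=True)


# Hypothetical backlog, purely illustrative
backlog = [
    Opportunity("self-service databases", reach=40, impact=2, confidence=0.8, effort=4),
    Opportunity("faster CI runners", reach=200, impact=1, confidence=0.5, effort=2),
]
ranked = prioritise(backlog)
for item in ranked:
    print(f"{item.name}: {item.rice():.1f}")
```

The absolute scores mean little on their own; the point is a consistent ordering that can be discussed with stakeholders.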
|
||||||
|
|
||||||
|
## Did we reach our goal?
|
||||||
|
|
||||||
|
### Product metrics
|
||||||
|
|
||||||
|
- Someone will measure your work, hope they do it right or rather do it yourself to show how you provide value
|
||||||
|
- Product metrics should measure outcome not output (or performance metrics)
|
||||||
|
- Baseline: You need to know the desired outcome
|
||||||
|
|
||||||
|
|
||||||
|
### Frameworks
|
||||||
|
|
||||||
|
- DevEx: Triangle of flow state (build&test speed), feedback loops () and cognitive load (code complexity, docs clarity)
|
||||||
|
- DORA
|
||||||
|
- SPACE
|
||||||
|
- DX Core 4
|
||||||
129
content/day0/09_promotions.md
Normal file
@@ -0,0 +1,129 @@
|
|||||||
|
---
|
||||||
|
title: A million ways to promote changes between environments
|
||||||
|
weight: 9
|
||||||
|
tags:
|
||||||
|
- argo
|
||||||
|
- cloudnativecon
|
||||||
|
- viktor
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/iCTgRC3AQQk" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
## Baseline
|
||||||
|
|
||||||
|
- Promotion: Move things from one env to another
|
||||||
|
- Options: Sequentially or both
|
||||||
|
- Challenge: Env differences
|
||||||
|
- Challenge: How do we link our promotion tasks?
|
||||||
|
|
||||||
|
### GitOps
|
||||||
|
|
||||||
|
- Declarative: YAML, JSON, XML (Not helm or kcl or anything else)
|
||||||
|
- Versioned and immutable: Git
|
||||||
|
- Pulled automatically: No write access from cluster
|
||||||
|
- Continuously reconciled: Maintain parity between desired and actual state
|
||||||
|
|
||||||
|
### Rules
|
||||||
|
|
||||||
|
- Part of SDLC
|
||||||
|
- Declarative
|
||||||
|
- Versioned and immutable
|
||||||
|
- Pulled automatically
|
||||||
|
- Continuously reconciled
|
||||||
|
|
||||||
|
## Workflows
|
||||||
|
|
||||||
|
### Manual
|
||||||
|
|
||||||
|
1. Deploy
|
||||||
|
2. Run tests
|
||||||
|
3. Push to next stage
|
||||||
|
4. Test again or roll back
|
||||||
|
|
||||||
|
### Manual with gitops
|
||||||
|
|
||||||
|
1. Update manifest
|
||||||
|
2. Push to git
|
||||||
|
3. Test
|
||||||
|
4. Next stage
|
||||||
|
|
||||||
|
Problem: Eventual consistency makes the process async instead of sync (important for tests)
|
||||||
|
|
||||||
|
### Generic workflows
|
||||||
|
|
||||||
|
1. Dev: Bump, push
|
||||||
|
2. QS: Wait for success of 1 (how?), do the same
|
||||||
|
3. Prod: Wait for success of 2 (how?)
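The "wait for success (how?)" part of the generic workflow can be sketched as a polling loop. This is only an illustration of the idea, not how any of the tools below implement it: `bump` and `is_healthy` are hypothetical callbacks (for example "commit the new version to the stage's folder" and "ask the GitOps controller whether the stage is synced and healthy").

```python
import time


def wait_until_healthy(is_healthy, attempts=60, interval_s=5.0, sleep=time.sleep):
    """Poll is_healthy() up to `attempts` times; GitOps is async, so we have to wait."""
    for attempt in range(attempts):
        if is_healthy():
            return True
        if attempt < attempts - 1:
            sleep(interval_s)
    return False


def promote(stages, bump, is_healthy, attempts=60, interval_s=5.0, sleep=time.sleep):
    """Bump each stage in order and stop at the first stage that never gets healthy.

    Returns the list of stages that were promoted successfully.
    """
    done = []
    for stage in stages:
        bump(stage)  # hypothetical: update the manifest for this stage and push
        if not wait_until_healthy(lambda: is_healthy(stage), attempts, interval_s, sleep):
            break  # do not touch later stages
        done.append(stage)
    return done
```

A call could look like `promote(["dev", "qs", "prod"], bump, is_healthy)`. The fixed number of attempts is a design choice to avoid waiting forever on a stage that never converges.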
|
||||||
|
|
||||||
|
TODO: Steal code screenshots from slides
|
||||||
|
|
||||||
|
## Tools
|
||||||
|
|
||||||
|
### Extend your standard CI
|
||||||
|
|
||||||
|
|
||||||
|
Not async, risk of flapping, either blindly trust the state or break the pull-principle by running argo sync or kubectl apply
|
||||||
|
|
||||||
|
### AppSets Progressive Sync
|
||||||
|
|
||||||
|
- Built in to Application Sets (alpha)
|
||||||
|
- Targeting by label, promotes everything
|
||||||
|
- Not supported with autosync, because it basically manually triggers sync one after another
|
||||||
|
- Changes from git have to be manually triggered
|
||||||
|
|
||||||
|
### Image updater
|
||||||
|
|
||||||
|
- Subscribe to semver based image updates and write them to kubernetes and/or git
|
||||||
|
- You have to implement promotions via image naming schemes
|
||||||
|
|
||||||
|
TODO: Steal flowchart
|
||||||
|
|
||||||
|
### Kargo
|
||||||
|
|
||||||
|
- Freight: Artifact or manifest versions to promote
|
||||||
|
- Stage: ArgoCD Apps
|
||||||
|
|
||||||
|
TODO: Steal flowchart
|
||||||
|
|
||||||
|
### Telefonistka
|
||||||
|
|
||||||
|
- IaC Agnostic tooling
|
||||||
|
- Idea: Watch folder contents and copy contents to new folder
|
||||||
|
- Pretty much a bundled CI script
|
||||||
|
|
||||||
|
TODO: Draw your own chart
|
||||||
|
|
||||||
|
### Codefresh GitOps
|
||||||
|
|
||||||
|
> This is one of the speaker's tools
|
||||||
|
|
||||||
|
- Product: Applications with relationships
|
||||||
|
- Env: Any cluster and/or namespace
|
||||||
|
- Promotion: CRD for policy (when does it happen, what gets validated)
|
||||||
|
- Promotions can happen manually or automated via commit/pr
|
||||||
|
- Based on Argo Workflows
|
||||||
|
|
||||||
|
### GitOps Promoter (Intuit)
|
||||||
|
|
||||||
|
- Define Manifests once and hydrate them later
|
||||||
|
- Source hydrator: Argo CD feature that handles the rendering and commits it to a new dedicated branch (one branch per stage)
|
||||||
|
- These branches are the ones used by Argo, e.g. `environments/dev` gets watched by the dev cluster
|
||||||
|
- Changes result in environment proposal branches: a PR gets opened, PR checks run, and when the PR requirements (tests) are met, it gets merged into the real env branches
|
||||||
|
|
||||||
|
TODO: Steal Pattern
|
||||||
|
|
||||||
|
## Overview of the philosophies
|
||||||
|
|
||||||
|
Artifact Oriented: Imageupdater, Kargo
|
||||||
|
Define Manifests once: AppSets Progressive Sync, GitOps Promoter
|
||||||
|
Deff and workflow: CI, Codefresh
|
||||||
|
|
||||||
|
TODO: Steal from slides
|
||||||
|
|
||||||
|
## Best practices
|
||||||
|
|
||||||
|
- Can you recover from git at any point? No -> Do better
|
||||||
|
- Does git reflect what's deployed without looking?
|
||||||
|
- Does this enable SDLC?
|
||||||
|
- Interfaces in folders, not branches? -> Branches may get crowded
|
||||||
89
content/day0/10_abstractions.md
Normal file
@@ -0,0 +1,89 @@
|
|||||||
|
---
|
||||||
|
title: "Platform abstractions: Asset or liability"
|
||||||
|
weight: 10
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/M5X5NCzlzIA" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/52/atul-talk-platform-engineering-kubecon-london-2025_final.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
Fair warning: Food analogies incoming
|
||||||
|
|
||||||
|
## Baseline
|
||||||
|
|
||||||
|
### What do abstractions achieve
|
||||||
|
|
||||||
|
- Structure through simplification
|
||||||
|
- Complexity made simple
|
||||||
|
- Hidden details, visible value
|
||||||
|
|
||||||
|
### Dilemma
|
||||||
|
|
||||||
|
1. Platform team creates abstraction
|
||||||
|
2. Abstraction works for 10 Teams
|
||||||
|
3. Other team requests extension
|
||||||
|
4. Question: How do we deal with this
|
||||||
|
|
||||||
|
### Possible Solutions
|
||||||
|
|
||||||
|
- Add Config Options: Increases complexity of abstraction
|
||||||
|
- Make One-off exceptions: Breaks standardization, introduces inconsistency
|
||||||
|
- Require conformity: Hinders innovation, creates enemies
|
||||||
|
- Allow bypassing: Creates shadow IT, risking security and resource control
|
||||||
|
|
||||||
|
=> Debt trap: The cost of maintaining a stable platform rises and rises
|
||||||
|
|
||||||
|
## The debt cycle
|
||||||
|
|
||||||
|
### The abstraction cycle
|
||||||
|
|
||||||
|
1. Simplify
|
||||||
|
2. Adopt
|
||||||
|
3. New Requirements
|
||||||
|
4. Add complexity
|
||||||
|
5. Repeat
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
### Warning signs
|
||||||
|
|
||||||
|
- Rising customization requests
|
||||||
|
- Workarounds
|
||||||
|
- Shadow IT
|
||||||
|
|
||||||
|
### Impact
|
||||||
|
|
||||||
|
- Each new feature becomes harder to implement
|
||||||
|
- Teams lose trust in the platform capabilities
|
||||||
|
- Platform evolution slows down
|
||||||
|
- New tech is difficult to incorporate
|
||||||
|
|
||||||
|
## Abstraction elasticity
|
||||||
|
|
||||||
|
> The abstraction should stretch a bit to accommodate change without breaking
|
||||||
|
|
||||||
|
- Adaptability: Ease of handling new requirements
|
||||||
|
- Transparency: Understand what your user wants and why
|
||||||
|
- Extension patterns: Document ways to customize the platform behavior
|
||||||
|
- Migration Paths: Ease of moving away from the platform abstraction
|
||||||
|
|
||||||
|
### Elasticity
|
||||||
|
|
||||||
|
- Can teams access lower level controls (when needed) while staying with the abstraction
|
||||||
|
- Do users understand what happens underneath (when needed)
|
||||||
|
- Are there documented extension/customization points?
|
||||||
|
|
||||||
|
## Patterns to break the debt trap
|
||||||
|
|
||||||
|
- Layered abstraction patterns: start with low-level abstractions that get abstracted on higher levels to allow users to choose the right abstraction level for themselves without having to configure everything themselves
|
||||||
|
- Expert API: Additional API parameters that are not required but can be set
|
||||||
|
- Policy based guard rails: Change the guardrails based on the environment (e.g. deep access in dev, not in prod)
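A tiny sketch of what environment-dependent guardrails could look like. The rule names, the request fields and the `violations` helper are all hypothetical and only meant to illustrate "deep access in dev, not in prod". A real platform would more likely express this in a policy engine.

```python
# Hypothetical guardrail table: field names and values are made up for illustration.
GUARDRAILS = {
    "dev": {"allow_host_network": True, "allow_expert_params": True},
    "prod": {"allow_host_network": False, "allow_expert_params": False},
}


def violations(env: str, request: dict) -> list:
    """Return a list of human-readable guardrail violations for a request."""
    rules = GUARDRAILS[env]
    found = []
    if request.get("host_network") and not rules["allow_host_network"]:
        found.append("host_network is not allowed in " + env)
    if request.get("expert_params") and not rules["allow_expert_params"]:
        found.append("expert parameters are not allowed in " + env)
    return found


# The same request passes in dev and is rejected in prod.
request = {
    "name": "checkout",
    "host_network": True,
    "expert_params": {"sysctls": ["net.core.somaxconn=1024"]},
}
print(violations("dev", request))
print(violations("prod", request))
```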
|
||||||
|
|
||||||
|
## The end goal
|
||||||
|
|
||||||
|
- Increase adoption
|
||||||
|
- Eliminate shadow IT
|
||||||
|
- Improved satisfaction
|
||||||
|
- Reduced overhead
|
||||||
43
content/day0/11_t-env.md
Normal file
@@ -0,0 +1,43 @@
|
|||||||
|
---
|
||||||
|
title: "The story of t-env: Scaling a platform to improve the velocity of hundreds of developers"
|
||||||
|
weight: 11
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/qXRHpIYxU_c" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/da/KubeCon%20Talk_%20Lemonade%27s%20t-env.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
Okteto: Ephemeral environments for testing
|
||||||
|
|
||||||
|
## History
|
||||||
|
|
||||||
|
- Starting point: Local Dev -> Setup for new devices or devs is really slow (on average 10hrs a week)
|
||||||
|
- Next Idea: EC2 Instances with a fancy docker-compose and scripts -> No more local dev
|
||||||
|
- Problems: Still complex - just in the cloud, manual updates, always-on connection required (no working on the train)
|
||||||
|
- Risks: Developers will just create workarounds and shadow IT
|
||||||
|
|
||||||
|
## T-Env
|
||||||
|
|
||||||
|
- Baseline: Setup an environment on kubernetes for each dev with ci/cd
|
||||||
|
- Okteto: A single command to enter dev mode `t dev start` with file sync from local
|
||||||
|
- Implementation: Wrapper around the okteto cli
|
||||||
|
- Why: Because devs seem to love the cli
|
||||||
|
- Self service observability for troubleshooting in your env
|
||||||
|
|
||||||
|
Open source tools used: Pulumi, Grafana, Okteto, K8s
|
||||||
|
|
||||||
|
### Did it work?
|
||||||
|
|
||||||
|
- The time to test is way faster
|
||||||
|
- The path was clear
|
||||||
|
- The environments should be ephemeral but devs don't like that -> They decided to allow for long lived envs
|
||||||
|
- Cloud cost is relatively high with long living envs -> They implemented a sleep system based on dev timezone
|
||||||
|
(or manual wake-up)
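The timezone-based sleep system could be sketched roughly like this. The talk did not show the implementation, so the working hours, the weekday rule and the `manual_wake_until` override are assumptions. A fixed UTC offset is used instead of a named timezone to keep the example self-contained.

```python
from datetime import datetime, time, timedelta, timezone


def should_be_awake(now_utc, utc_offset_hours, start=time(8, 0), end=time(19, 0),
                    manual_wake_until=None):
    """Decide whether a dev environment should be running right now.

    now_utc: timezone-aware datetime in UTC
    utc_offset_hours: the developer's offset from UTC (assumed to be known per dev)
    manual_wake_until: optional UTC datetime until which a manual wake-up is valid
    """
    if manual_wake_until is not None and now_utc < manual_wake_until:
        return True  # manual wake-up overrides the schedule
    local = now_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    is_weekday = local.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and start <= local.time() < end


now = datetime(2025, 4, 2, 6, 30, tzinfo=timezone.utc)  # a Wednesday
print(should_be_awake(now, utc_offset_hours=2))   # 08:30 local time
print(should_be_awake(now, utc_offset_hours=-4))  # 02:30 local time
```

A controller could run such a check periodically and scale the environment down to zero whenever it returns `False`.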
|
||||||
|
|
||||||
|
## The futuuuuure
|
||||||
|
|
||||||
|
- The company is not getting smaller -> More devs and more services
|
||||||
|
- AI agents will write some of the code in the future
|
||||||
|
- Idea: Only run modified code in env instead of everything
|
||||||
50
content/day0/12_many-clusters.md
Normal file
@@ -0,0 +1,50 @@
|
|||||||
|
---
|
||||||
|
title: "Performance perseverance: Taming 1000 kubernetes clusters"
|
||||||
|
weight: 12
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/ZTT8M74RD1M" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/d5/kubecon_2025_v4.2.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
## History
|
||||||
|
|
||||||
|
- They started with upstream kubernetes - the hard way
|
||||||
|
- Env grew to over 200 prod apps
|
||||||
|
- Pains: Single Cluster, single point of failure and complexity
|
||||||
|
- What worked: Dev adoption and autonomy, no vendor
|
||||||
|
|
||||||
|
## Challenges
|
||||||
|
|
||||||
|
> Based on stakeholder expectations
|
||||||
|
|
||||||
|
- One tenant per cluster -> Over 1000 Clusters
|
||||||
|
- Release management
|
||||||
|
- Small team (3 Engineers)
|
||||||
|
|
||||||
|
## Guiding principles
|
||||||
|
|
||||||
|
- Platform as a product
|
||||||
|
- Stability: trust
|
||||||
|
- Standardization -> Scalability and inter team collab
|
||||||
|
- Day 2 support
|
||||||
|
- Dogfooding
|
||||||
|
|
||||||
|
## Tenancy
|
||||||
|
|
||||||
|
- One cluster per product
|
||||||
|
- Own CLI, devs like cli
|
||||||
|
- Custom operator and crds
|
||||||
|
|
||||||
|
## Stack
|
||||||
|
|
||||||
|
- Keopsctl? Pretty much their own cluster operator
|
||||||
|
- A Simple Cluster CRD
|
||||||
|
|
||||||
|
## Migration
|
||||||
|
|
||||||
|
1. Build trust in platform
|
||||||
|
2. Support with docs, onboarding, q&a
|
||||||
|
3. Co-create with devs while keeping an eye on day2 -> Feature-Flag based rollout
|
||||||
56
content/day0/13_paap.md
Normal file
@@ -0,0 +1,56 @@
|
|||||||
|
---
|
||||||
|
title: Platform as a Product
|
||||||
|
weight: 13
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/DoiaHfl9Y7Y" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
|
The CNCF's research into product thinking for platforms.
|
||||||
|
|
||||||
|
## But why
|
||||||
|
|
||||||
|
- Get insights into the current product thinking practices of platform builders
|
||||||
|
- Topics: Needs/Painpoints/Behaviour
|
||||||
|
- Target: Create personas based on insights
|
||||||
|
- Find out what people are doing, not how they are doing
|
||||||
|
|
||||||
|
## How?
|
||||||
|
|
||||||
|
- Survey for quantity
|
||||||
|
- Interviews for quality
|
||||||
|
|
||||||
|
## Challenges
|
||||||
|
|
||||||
|
- Asking questions without suggesting answers
|
||||||
|
- Consensus on research goals
|
||||||
|
- Motivation and time investment (on interviewer and interviewee side) + Non-Responses
|
||||||
|
- Tooling: There is no standard tooling at the CNCF for this kind of research
|
||||||
|
- Small sample size -> No real research insights, just signals/hints
|
||||||
|
|
||||||
|
## Analysis
|
||||||
|
|
||||||
|
- Working with assumptions was hard in combination with the small sample size
|
||||||
|
- Survey: Survey Tool (Google Forms) combined with a whiteboard tool for clustering and analysis
|
||||||
|
- Interviews: They used AI for time efficiency, but the prompt escalated a bit, leading to no real time gain -> But you can scale the same prompt to infinite sample sizes
|
||||||
|
- Challenge: AI confidently churns out wrong answers -> Use source links to verify and scoping
|
||||||
|
|
||||||
|
TODO: Steal workflow from slides
|
||||||
|
|
||||||
|
## Outcome/Signals
|
||||||
|
|
||||||
|
- Platform Orgs use Prioritization Frameworks unconsciously: "We don't use product management and tools like that" -> Well you do, you just don't call it PM and are a bit unstructured
|
||||||
|
- Structured Activities: Interviews (talking to each other), Focus groups, quantitative data, ...
|
||||||
|
- Roadmap influence: Insight, prioritization, painpoints, backlogs
|
||||||
|
- Regular planning meetings
|
||||||
|
- Platform orgs struggle to define and actually implement measures of success: Measure activity over impact, success is often felt instead of proved
|
||||||
|
- Platform teams have varied control over their work: Depending on company size and business relationships
|
||||||
|
|
||||||
|
## Future
|
||||||
|
|
||||||
|
- Baseline: They have some signals
|
||||||
|
- Question: Are these patterns successful?
|
||||||
|
- Needed: More data and better organization
|
||||||
58
content/day0/14_lego.md
Normal file
@@ -0,0 +1,58 @@
|
|||||||
|
---
|
||||||
|
title: Building Platforms with empathy and yaml at the lego group
|
||||||
|
weight: 14
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/8FmJWd7vRt4" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
|
Very nice "kids playing with lego" intro analogy about creativity, sharing and collaboration.
|
||||||
|
|
||||||
|
## The golden brick
|
||||||
|
|
||||||
|
- The brick could get picked up and sometimes picking it up is mandatory
|
||||||
|
- Development in close collaboration and trust with users
|
||||||
|
- Focus on "good enough" instead of "perfect, but everyone is unhappy"
|
||||||
|
|
||||||
|
### Guidelines
|
||||||
|
|
||||||
|
- API first: Define a separation between users and services with abstractions
|
||||||
|
- Self services: Freedom of choice and combination
|
||||||
|
- Constraints that are soft and can be modified on feedback
|
||||||
|
|
||||||
|
### Offers
|
||||||
|
|
||||||
|
- Kubernetes as a service
|
||||||
|
- Runtime as a Service: Namespace as a service with argo and without cluster access
|
||||||
|
- Problem: Users want kubeapi access
|
||||||
|
- Method: Talk with the users
|
||||||
|
- Solution: Zero Trust proxy that provides operational access to kubeapi via OIDC
|
||||||
|
- There are multiple APIs that can be combined -> You need constraints
|
||||||
|
|
||||||
|
### What's needed
|
||||||
|
|
||||||
|
- Conversation
|
||||||
|
- Trust
|
||||||
|
- Striking a balance
|
||||||
|
|
||||||
|
## The human aspect
|
||||||
|
|
||||||
|
- Treat people as colleagues instead of customers
|
||||||
|
- Build empathy to reach a balanced "good enough"
|
||||||
|
- Lead with transparency: Publish your metrics
|
||||||
|
- Visit their context
|
||||||
|
- Explore unknowns together
|
||||||
|
- Create a shared understanding of challenges
|
||||||
|
|
||||||
|
### Team culture
|
||||||
|
|
||||||
|
- Know who you are helping and who helps you
|
||||||
|
- Empower them to shine by getting to know their context
|
||||||
|
- Hear them out in small meetings or in person
|
||||||
|
|
||||||
|
## Platform maturity
|
||||||
|
|
||||||
|
TODO: Steal maturity chart
|
||||||
29
content/day0/15_internal-marketing.md
Normal file
@@ -0,0 +1,29 @@
|
|||||||
|
---
|
||||||
|
title: 10 Quick tips on how to internally market your platform
|
||||||
|
weight: 15
|
||||||
|
tags:
|
||||||
|
- platform
|
||||||
|
- cloudnativecon
|
||||||
|
- lightning
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/kiUV8En8Co4" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/colocatedeventseu2025/42/2025-PE-Day-10-Tips.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
## Baseline
|
||||||
|
|
||||||
|
- Even great tech does not sell itself - you need marketing
|
||||||
|
- We don't have a big marketing budget for our internal platform
|
||||||
|
- No adoption -> No Trust -> No new users -> No adoption
|
||||||
|
|
||||||
|
## Tips
|
||||||
|
|
||||||
|
- Define personas and a value proposition map
|
||||||
|
- Build a brand: Name, logo, story, swag
|
||||||
|
- Have a launch party or milestone parties
|
||||||
|
- Provide clear accessible communication (with clear channels, docs, ...)
|
||||||
|
- Build a community whose members can help each other (and don't separate yourself from the community)
|
||||||
|
- Capture metrics for success for yourself and from a user's perspective
|
||||||
|
- Provide a 5-minute wow moment/demo where the user can feel like they achieved something
|
||||||
|
- Level up with gamification
|
||||||
|
- Leverage external events for internal visibility
|
||||||
BIN
content/day0/_img/abstraction-cycle.png
Normal file
|
After Width: | Height: | Size: 572 KiB |
BIN
content/day0/_img/product-compass.png
Normal file
|
After Width: | Height: | Size: 270 KiB |
@@ -4,8 +4,27 @@ title: Day 0
|
|||||||
weight: 4
|
weight: 4
|
||||||
---
|
---
|
||||||
|
|
||||||
TODO:
|
Day 0 of KubeCon aka CloudNativeCon aka the day on which the co-located events happen.
|
||||||
|
This year I spent most of my time at the platform engineering day (with a short visit to argocon).
|
||||||
|
The emerging motto of platform engineering day was "platform as a product".
|
||||||
|
|
||||||
|
This was the third conference day (fourth travel day) and in the afternoon I started to feel the brain-overflow.
|
||||||
|
But powering through I ended up attending two keynotes (no notes, they were pretty much a welcome and goodbye) and 14 talks.
|
||||||
|
|
||||||
|
And most importantly: This is the day my friends and coworkers joined (they are only in town for kubecon, not for rejekts).
|
||||||
|
Sometimes we ended up in the same talks, sometimes in different talks, which led to a rich set of talk notes.
|
||||||
|
|
||||||
## Talk recommendations
|
## Talk recommendations
|
||||||
|
|
||||||
* TODO:
|
- How to design a good hiring process: [So you want to hire for platform engineering](./06_hire-engineers)
|
||||||
|
- Evolution of Platforms and Platform Engineering: [The past, the present and the future of platform engineering](./07_past-present-future)
|
||||||
|
- How to design a good product: [Product thinking for cloud native engineers](./08_product-thinking)
|
||||||
|
- Staging with gitops: [A million ways to promote changes between environments](./09_promotions)
|
||||||
|
- How to handle abstractions and new requirements: [Platform abstractions: Asset or liability](./10_abstractions)
|
||||||
|
- Very nice slides: [Building Platforms with empathy and yaml at the lego group](./14_lego)
|
||||||
|
|
||||||
|
## Other stuff I learned or people I talked to
|
||||||
|
|
||||||
|
- Talked to the Vultr people - they have a manifesto for ai with amd and nvidia gpus
|
||||||
|
- Talked to Meshcloud: They build developer platform tooling (currently mostly integrated with cloud providers)
|
||||||
|
- Want to look into Okteto for dev envs: <https://github.com/okteto/okteto>
|
||||||
77
content/day1/01_scaling-gpu.md
Normal file
@@ -0,0 +1,77 @@
|
|||||||
|
---
|
||||||
|
title: Scaling GPU Clusters without melting down
|
||||||
|
weight: 1
|
||||||
|
tags:
|
||||||
|
- ml
|
||||||
|
- nvidia
|
||||||
|
- ai
|
||||||
|
- apiserver
|
||||||
|
- go
|
||||||
|
- kubecon
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/dUfp3j1j-mg" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/50/Scaling%20GPU%20Clusters%20Without%20Melting%20Down%21%20%281%29.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
## Baseline
|
||||||
|
|
||||||
|
- We need more and more gpus -> Control Plane needs to keep track of more objects
|
||||||
|
- Goal: Scale Workers without scaling control plane
|
||||||
|
|
||||||
|
## Current Problems
|
||||||
|
|
||||||
|
### Secret list calls go up and control plane goes down
|
||||||
|
|
||||||
|
- Scenario: High number of list calls with larger secrets
|
||||||
|
- Problem: The apiserver OOMs because of its cache
|
||||||
|
- Fix: API Priority & Fairness (only allow two concurrent list calls, queue the rest)
|
||||||
|
- Result: Decreased number of oom crashes
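The effect of "only allow two concurrent list calls, queue the rest" can be illustrated with a small concurrency limiter. This is not how API Priority & Fairness is implemented in the apiserver; it is only a sketch of the underlying idea, and `ListLimiter` and `list_secrets` are made-up names.

```python
import threading
import time


class ListLimiter:
    """At most `limit` expensive calls run at once; the rest block until a seat frees up."""

    def __init__(self, limit=2):
        self._seats = threading.BoundedSemaphore(limit)
        self._lock = threading.Lock()
        self.in_flight = 0
        self.peak = 0  # highest number of calls that ever ran concurrently

    def run(self, handler):
        with self._seats:  # waits here while all seats are taken
            with self._lock:
                self.in_flight += 1
                self.peak = max(self.peak, self.in_flight)
            try:
                return handler()
            finally:
                with self._lock:
                    self.in_flight -= 1


def list_secrets():
    time.sleep(0.01)  # stand-in for an expensive LIST that allocates a lot of memory
    return "ok"


limiter = ListLimiter(limit=2)
results = []
threads = [
    threading.Thread(target=lambda: results.append(limiter.run(list_secrets)))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results), "calls served, peak concurrency:", limiter.peak)
```

Capping concurrency bounds the memory that expensive calls can hold at the same time, at the price of higher latency for the queued callers.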
|
||||||
|
|
||||||
|
### High memory usage until we restart the apiserver
|
||||||
|
|
||||||
|
- Scenario: API-Server frees up to 40% of its memory util when restarted
|
||||||
|
- Main suspect: Memory collection
|
||||||
|
- Idea: Tune GOGC (ENV Var `GOGC`) -> They lowered it from the default 100 to 50
|
||||||
|
- Result: Decrease in memory util and no more growing util over time
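Why lowering `GOGC` reduces memory can be shown with a back-of-the-envelope calculation. The sketch assumes the simplified model from the Go GC guide, where the next collection is triggered once the heap reaches roughly `live heap * (1 + GOGC/100)`; it ignores GC roots and any memory limit. The 4 GiB live heap is a made-up figure, not a number from the talk.

```python
def heap_target_mib(live_heap_mib: float, gogc: int = 100) -> float:
    """Approximate heap size at which the Go runtime starts the next GC cycle."""
    return live_heap_mib * (1 + gogc / 100)


live = 4096  # hypothetical live heap of an apiserver in MiB
for gogc in (100, 50):
    print(f"GOGC={gogc}: heap may grow to about {heap_target_mib(live, gogc):.0f} MiB")
```

The trade-off is more frequent garbage collection, so CPU usage is expected to go up somewhat in exchange for the lower peak heap.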
|
||||||
|
|
||||||
|
### Large skew in memory utilization
|
||||||
|
|
||||||
|
- Scenario: Skew in memory utilization across api-server pods
|
||||||
|
- Problem: If a pod with high util gets hit with a list, the api-server will oom -> The LB redirects to the other 2 -> Those OOM
|
||||||
|
- Observation: The lb in front of the api server pods also shows some skew -> Explains the skew
|
||||||
|
- Root cause: lb has long living tcp connections to the servers and balances based on connections and not requests
|
||||||
|
- Idea: Switch up the lb configuration -> Not quite the right angle
|
||||||
|
- Fix: `--goaway-chance` param in apiserver - a random HTTP/2 `GOAWAY` frame gets sent -> The connection is torn down gracefully and the client reconnects
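The connection skew and the effect of randomly closing connections can be reproduced with a toy simulation. Everything here is an assumption for illustration: 100 clients that all start pinned to the same backend, a load balancer that picks a random backend on reconnect, and a GOAWAY probability of 0.001 per request.

```python
import random
from collections import Counter


def simulate(goaway_chance, clients=100, backends=3, requests_per_client=5000, seed=1):
    """Return the number of client connections per backend after all requests."""
    rng = random.Random(seed)
    # Worst case: every long-lived connection is pinned to backend 0
    conn = [0] * clients
    for _ in range(requests_per_client):
        for c in range(clients):
            # The request itself goes to conn[c]; afterwards the server may send GOAWAY
            if rng.random() < goaway_chance:
                conn[c] = rng.randrange(backends)  # client reconnects through the LB
    final = Counter(conn)
    return [final[b] for b in range(backends)]


pinned = simulate(goaway_chance=0.0)
rebalanced = simulate(goaway_chance=0.001)
print("without GOAWAY:", pinned)
print("with GOAWAY:   ", rebalanced)
```

Without GOAWAY the connections never move, so a connection-based load balancer cannot fix the skew. With a small chance per request they spread out over time without touching the load balancer configuration.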
|
||||||
|
|
||||||
|
### Architectural mistakes
|
||||||
|
|
||||||
|
- Large number of secrets per workload -> List, Encode/Decode overhead
|
||||||
|
- No caching -> Too many list calls
|
||||||
|
|
||||||
|
### Preview
|
||||||
|
|
||||||
|
- There are a bunch of sig api-machinery improvements planned
|
||||||
|
|
||||||
|
## The future
|
||||||
|
|
||||||
|
- The switch from NUMA GPU-Devices to DRA
|
||||||
|
- DRA is powerful enough to get rid of custom numa stuff
|
||||||
|
|
||||||
|
### The stack
|
||||||
|
|
||||||
|
- Currently:
|
||||||
|
- CP: APIServer, Controller manager, Scheduler and Topology aware scheduler
|
||||||
|
- Worker: Device Plugin, nfd topology updater
|
||||||
|
- Future
|
||||||
|
- CP: APIServer, Controller manager, Scheduler
|
||||||
|
- Worker: Device Plugin
|
||||||
|
|
||||||
|
### Testing scaling
|
||||||
|
|
||||||
|
- Tool: KWOK (Kubernetes WithOut Kubelet) - used to simulate gpu workloads
|
||||||
|
- Env: K8S 1.32 with scaling from 0 to 4000 Workloads
|
||||||
|
- Metrics:
|
||||||
|
- Scheduling Latency: Topo aware was way more latency-affected
|
||||||
|
- Scheduler Memory util: 30% of memory saved with dra
|
||||||
|
- API-Server Memory: Another 20% of memory saved
|
||||||
|
- Result: They are confident that DRA will be stable and even save memory and cpu util
|
||||||
81
content/day1/02_migrations.md
Normal file
@@ -0,0 +1,81 @@
|
|||||||
|
---
|
||||||
|
title: Day 2000 - Migrating from kubeadm + ansible to clusterapi+talos
|
||||||
|
weight: 2
|
||||||
|
tags:
|
||||||
|
- kubecon
|
||||||
|
- platform
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/uQ_WN1kuDo0" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/fd/day2000-migration-ClusterAPI-talos.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
## Background
|
||||||
|
|
||||||
|
- They use large, shared clusters
|
||||||
|
- The oldest cluster is 2099 days (5.8 years) old
|
||||||
|
- Onprem hosted on vSphere with vanilla kubeadm
|
||||||
|
- Fun fact: They run chaosmonkey on all clusters -> Automatically prepares for updates
|
||||||
|
|
||||||
|
### Legacy provisioning
|
||||||
|
|
||||||
|
1. Terraform creates a debian vm
|
||||||
|
2. Deploy base tools with puppet
|
||||||
|
3. Register nodes in inventory yaml file
|
||||||
|
4. Run ansible playbook -> Renders configs and runs kubeadm
|
||||||
|
5. Configure ArgoCD
|
||||||
|
|
||||||
|
### Target
|
||||||
|
|
||||||
|
- Use Clusterapi to manage the workload-clusters
|
||||||
|
- Basic CRDs: Cluster, MachineDeployment, Machine
|
||||||
|
- Talos: Immutable, minimal, ephemeral with declarative config via gRPC API
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
|
||||||
|
## Migration
|
||||||
|
|
||||||
|
1. Config matching between kubeadm and talos+capi
|
||||||
|
2. Import PKI/Certs
|
||||||
|
3. Create ClusterAPI CRDs
|
||||||
|
4. Add ClusterAPI Nodes
|
||||||
|
5. Remove kubeadm nodes
|
||||||
|
|
||||||
|
### 1. Config matching
|
||||||
|
|
||||||
|
1. Serviceaccount Issuer: Talos has its own default
|
||||||
|
2. etcd encryption key names are hardcoded in talos
|
||||||
|
3. Re-Encrypt all secrets (get secrets, replace secrets)
|
||||||
|
|
||||||
|
### 2. PKI
|
||||||
|
|
||||||
|
1. Talos includes some logic that can generate a secrets bundle from an existing API
|
||||||
|
2. Import: The etcd, k8s, serviceaccount and os (talos specific, used for the talos api auth) certificates
|
||||||
|
|
||||||
|
### 3. CRDs
|
||||||
|
|
||||||
|
- One namespace per workload cluster
|
||||||
|
- Cluster-CRD: Ref to CP and Infrastructure
|
||||||
|
- ControlPlane-CRD: Create cp MDs
|
||||||
|
- Infrastructure: References template for worker-MDs
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
### 4. Add ClusterAPI Nodes
|
||||||
|
|
||||||
|
- Add new CP and Worker Nodes to the cluster that are managed by CAPI (slowly, stuff will break)
|
||||||
|
- Remove the old nodes one by one over weeks or months
|
||||||
|
- Potential Problems:
|
||||||
|
- Mismatched serviceaccountissuer
|
||||||
|
- Missing etcd encryption key
|
||||||
|
- Wrong etcd encryption key
|
||||||
|
- Loss of quorum: `--force-new-cluster` can force recovery on one node of the etcd cluster
|
||||||
|
|
||||||
|
## Demo
|
||||||
|
|
||||||
|
I recommend watching the demo
|
||||||
|
Talos seems pretty cool.
|
||||||
|
|
||||||
|
## Bootstrapping
|
||||||
|
|
||||||
|
- Kind cluster in a GitHub Action or on a local device
|
||||||
79
content/day1/03_operator-mistakes.md
Normal file
@@ -0,0 +1,79 @@
|
|||||||
|
---
|
||||||
|
title: "Don't write controllers like charlie don't does: Avoiding common kubernetes controller mistakes"
|
||||||
|
weight: 3
|
||||||
|
tags:
|
||||||
|
- kubecon
|
||||||
|
- operator
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/tnSraS9JqZ8" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/53/Don%27t%20write%20controllers%20like%20Charlie%20Don%27t%20does_%20avoiding%20common%20Kubernetes%20controller%20mistakes.pptx.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
## Common mistakes
|
||||||
|
|
||||||
|
### Talking directly to the API server instead of using a cached client
|
||||||
|
|
||||||
|
- Problem: A
|
||||||
|
- Problem: Updates send the whole object -> No-op updates waste API server resources
|
||||||
|
- Fix: Use a cached client
|
||||||
|
- Problem: Caching validation
|
||||||
|
|
||||||
|
### Don't use custom caching
|
||||||
|
|
||||||
|
- Problem: Good Luck dealing with concurrency
|
||||||
|
- Hard: Controllers must maintain a per-kind cache
|
||||||
|
- Problem: Eventual consistency makes everything more complicated
|
||||||
|
- Fix: Use a framework
|
||||||
|
|
||||||
|
### Predicates only apply to the current watch
|
||||||
|
|
||||||
|
- A predicate passed in the `For` call only applies to this call, not to other watchers
|
||||||
|
- Also check whether you should be reconciling your low-level objects or whether reconciling the higher-level ones that reference them is better
|
||||||
|
|
||||||
|
## Tools
|
||||||
|
|
||||||
|
### KRT
|
||||||
|
|
||||||
|
> Still under development
|
||||||
|
|
||||||
|
- Operations on collections (Kubernetes objects with state tracking)
|
||||||
|
- Fetch function that handles transformation
|
||||||
|
|
||||||
|
### StateDB
|
||||||
|
|
||||||
|
- In-memory database for Go with watch channels
|
||||||
|
- You can set up a table that stores all objects of a kind (provided by the client)
|
||||||
|
- Triggers hooks when changes happen in the database that you can react to
|
||||||
|
|
||||||
|
### Controller-Runtime
|
||||||
|
|
||||||
|
> The kubebuilder one
|
||||||
|
|
||||||
|
- Includes a cached client
|
||||||
|
- Works on the reconciler pattern -> Makes triggers simple
|
||||||
|
|
||||||
|
## Tips
|
||||||
|
|
||||||
|
- Limit the number of api server updates
|
||||||
|
- Check for a diff yourself and don't send updates if there is nothing new
|
||||||
|
- Use patch (with only the changed fields) instead of update -> Especially for `.status`
|
||||||
|
- Use a framework that handles watching, coalescing and caching (krt, statedb, controller-runtime)
|
||||||
|
- Use predicates if you're using controller-runtime, this helps you filter out no-op events by checking them against the cache and filters
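
The "check the diff yourself" and "patch only changed fields" tips can be illustrated without any Kubernetes dependency. This is a minimal sketch in plain Python, assuming objects are represented as nested dicts. `merge_patch`, `update_status` and `send_patch` are hypothetical names and not part of any client library. The sketch builds a JSON-merge-patch-style diff that contains only the changed fields and skips the API call when nothing changed:

```python
from typing import Any, Callable


def merge_patch(current: dict, desired: dict) -> dict:
    """Return only the fields of `desired` that differ from `current`.

    Keys present in `current` but missing in `desired` are set to None,
    which is how JSON merge patch expresses a deletion.
    """
    patch: dict = {}
    for key, want in desired.items():
        have = current.get(key)
        if isinstance(want, dict) and isinstance(have, dict):
            nested = merge_patch(have, want)
            if nested:
                patch[key] = nested
        elif key not in current or have != want:
            patch[key] = want
    for key in current:
        if key not in desired:
            patch[key] = None
    return patch


def update_status(current: dict, desired: dict,
                  send_patch: Callable[[dict], Any]) -> bool:
    """Send a status patch only if something actually changed.

    `send_patch` is a hypothetical callback standing in for the API call.
    """
    patch = merge_patch(current.get("status", {}), desired.get("status", {}))
    if not patch:
        return False  # no-op: do not bother the API server
    send_patch({"status": patch})
    return True


if __name__ == "__main__":
    live = {"status": {"ready": False, "replicas": 3}}
    wanted = {"status": {"ready": True, "replicas": 3}}
    sent: list = []
    update_status(live, wanted, sent.append)
    print(sent)  # [{'status': {'ready': True}}]
```

Lists are compared as a whole here, which matches merge patch semantics but may be too coarse for some objects.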
|
||||||
|
|
||||||
|
## Q&A
|
||||||
|
|
||||||
|
- Do you know where your reconciliations are coming from?
|
||||||
|
- Counts: Yes the frameworks provide metrics and you can implement your own
|
||||||
|
- But controller runtime abstracts the patch source so you have to compare before and after state yourself - but you should not do that
|
||||||
|
- What about state sharing across multiple threads?
|
||||||
|
- Controller runtime handles each reconcile as idempotent, so you can just multithread
|
||||||
|
- But handling consistency can still be hard because you have to design all of your operations as idempotent by rebuilding the state each time
|
||||||
|
- What are your thoughts on controllers that do stuff in the real world (especially b/c it takes longer and there are no native observers)?
|
||||||
|
- Do something like the krt project by keeping the state separately
|
||||||
|
- What if someone changes things at the cloud provider
|
||||||
|
- A question of philosophy -> Usually just treat the operator as the source of truth
|
||||||
|
- How do you test your operators?
|
||||||
|
- Depends on your output (Kubernetes objects make stuff simple)
|
||||||
|
- For cilium: Simple b/c it's just creating Kubernetes objects
|
||||||
|
- With outside interaction: In-memory state representation or mocking
|
||||||
|
- For complex controllers split the operator into: Ingestion, data model and transformation
|
||||||
56
content/day1/04_gpus-go-round.md
Normal file
@@ -0,0 +1,56 @@
|
|||||||
|
---
|
||||||
|
title: The GPUs on the bus go round and round
|
||||||
|
weight: 4
|
||||||
|
tags:
|
||||||
|
- kubecon
|
||||||
|
- gpu
|
||||||
|
- nvidia
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/cLJRh4y4vXg" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
|
## Background
|
||||||
|
|
||||||
|
- They are the GeForce Now folks
|
||||||
|
- Large fleet of clusters all over the world (60,000+ GPUs)
|
||||||
|
- They use kubevirt to pass through GPUs (vfio driver) or vGPUs
|
||||||
|
- Devices fail from time to time
|
||||||
|
- Sometimes failures need restarts
|
||||||
|
|
||||||
|
## Failure discovery
|
||||||
|
|
||||||
|
- Goal: Maintain capacity
|
||||||
|
- Failure reasons: Overheating, insufficient power, driver issues, hardware faults, ...
|
||||||
|
- Problem: They only detected failure by detecting capacity decreasing or not being able to switch drivers
|
||||||
|
- Fix: First detect failure, then remediate
|
||||||
|
- GPU Problem detector as part of their internal device plugin
|
||||||
|
- Node Problem detector -> triggers remediation through maintenance
|
||||||
|
|
||||||
|
## Remediation approaches
|
||||||
|
|
||||||
|
- Reboot: Works every time, but has workload related downsides -> Legit solution, but drain can take very long
|
||||||
|
- Discovery of remediation loops -> Too many reboots indicate something being not quite right
|
||||||
|
- Optimized drain: Prioritize draining of nodes with failed devices before other maintenance
|
||||||
|
- The current workflow is: Reboot (automated) -> Power cycle (automated) -> Rebuild Node (automated) -> Manual intervention / RMA
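
The escalation ladder from the last bullet can be sketched as a small lookup over the remediation history of a node. This is an illustration only: the step order comes from the notes, while `next_step`, the per-step attempt limit and the idea of counting attempts are assumptions made to show how a remediation loop could be cut short. It does not describe their internal tooling.

```python
# Step order taken from the notes; the last step is not automated.
LADDER = ["reboot", "power-cycle", "rebuild-node", "manual-intervention"]

# Hypothetical threshold: how often a step is tried before escalating.
MAX_ATTEMPTS_PER_STEP = 2


def next_step(history: list[str]) -> str:
    """Pick the next remediation step for a node based on past attempts."""
    for step in LADDER[:-1]:
        if history.count(step) < MAX_ATTEMPTS_PER_STEP:
            return step
    # Every automated step was exhausted: hand over to a human / RMA.
    return LADDER[-1]


if __name__ == "__main__":
    history: list[str] = []
    for _ in range(7):
        history.append(next_step(history))
    print(history)
```

Counting attempts per step is one simple way to notice a node that keeps coming back, which is the "remediation loop" case mentioned above.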
|
||||||
|
|
||||||
|
## Prevention
|
||||||
|
|
||||||
|
> Problems should not affect workload
|
||||||
|
|
||||||
|
- Healthchecks with alerts
|
||||||
|
- Firmware & Driver updates
|
||||||
|
- Thermal & power management
|
||||||
|
|
||||||
|
## Future Challenges
|
||||||
|
|
||||||
|
- What if a high-density node with 8 GPUs has one failure?
|
||||||
|
- What is an acceptable ratio of working to broken GPUs per node?
|
||||||
|
- If there is a problematic node that has to be rebooted every couple of days, should the scheduler avoid this node?
|
||||||
|
|
||||||
|
## Q&A
|
||||||
|
|
||||||
|
- Are there any plans to open source the GPU problem detection: We could certainly do it, but it's not on the roadmap right now
|
||||||
|
- Are the failure rates representative and what is counted as failure:
|
||||||
|
- Failure is not being able to run a workload on a node (could be hardware or driver failure)
|
||||||
|
- The failure rate is 0.6% but the affected capacity is 1.2% (with 2 GPUs per node)
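
The relation between the two numbers in the last answer can be reproduced with a bit of arithmetic. A minimal sketch, assuming that GPU failures are independent and that a single failed GPU takes the whole node out of capacity. Neither assumption was stated explicitly, and `affected_capacity` is a made-up name:

```python
def affected_capacity(gpu_failure_rate: float, gpus_per_node: int) -> float:
    """Fraction of nodes (and thus capacity) with at least one failed GPU.

    Assumes independent failures and that one failed GPU makes the
    whole node unavailable.
    """
    return 1.0 - (1.0 - gpu_failure_rate) ** gpus_per_node


if __name__ == "__main__":
    print(f"{affected_capacity(0.006, 2):.2%}")  # prints 1.20%
```

For small failure rates this is close to `gpu_failure_rate * gpus_per_node`, which is why the affected capacity is roughly double the failure rate with 2 GPUs per node.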
|
||||||
64
content/day1/05_ressource-submission-bookkeeping.md
Normal file
@@ -0,0 +1,64 @@
|
|||||||
|
---
|
||||||
|
title: "Reliable k8s resource Submission & Bookkeeping"
|
||||||
|
weight: 5
|
||||||
|
tags:
|
||||||
|
- kubecon
|
||||||
|
- platform
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/NCkHrvqFMl8" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/0d/Reliable%20K8S%20Resource%20Submission%20and%20Bookkeeping.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
## Service offerings
|
||||||
|
|
||||||
|
- Product: HA Container Platform for general utility with a focus on run-to-complete
|
||||||
|
- Use-Cases: ML Orchestration, CI/CD, Machine maintenance, Financial analysis, Data Processing pipeline
|
||||||
|
- Requirements: Observability, Scheduling Events, Approval process, Bookkeeping, Datacenter Resiliency
|
||||||
|
- Focus: Resiliency (HA with datacenter failover)
|
||||||
|
- What the user needs: Workflow (e.g. generate report, persist report, notify)
|
||||||
|
- What we need for the user: ConfigMaps + Secrets, Workflow templates for the steps
|
||||||
|
|
||||||
|
## Challenges
|
||||||
|
|
||||||
|
- Read after modify across multiple datacenters
|
||||||
|
- Many reads against kubeapi that could overload the apiserver
|
||||||
|
- No native approval flows and limited audit
|
||||||
|
|
||||||
|
## Submission flows from a user's perspective
|
||||||
|
|
||||||
|
### Submission of runnables
|
||||||
|
|
||||||
|
- User: Submits runnable to submitter with audit
|
||||||
|
- Submitter: Handles retries, verification, ...
|
||||||
|
- Submitter: Configures workload on workload clusters
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
### Submission of deployables
|
||||||
|
|
||||||
|
- User: Deploys mutation to audit/source of truth
|
||||||
|
- Syncer: Syncs deployables to workload clusters
|
||||||
|
|
||||||
|

|
||||||
|
|
||||||
|
## Reporting
|
||||||
|
|
||||||
|
- User wants: UI with latest status for all jobs
|
||||||
|
- Compliance wants: Transactions on given resource for auditing
|
||||||
|
- Implementation: Highly available inventory as single source of truth
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
graph
|
||||||
|
WorkflowAPI-->|reads|inventory
|
||||||
|
Consumer-->|updates|inventory
|
||||||
|
Producer-->|publishes events to|Consumer
|
||||||
|
```
|
||||||
|
|
||||||
|
### Potential Problems
|
||||||
|
|
||||||
|
- Problem: Delete event does not get propagated from syncer to producer leading to zombie resources
|
||||||
|
- Fix: Periodic Cleanup
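
A periodic cleanup boils down to a set difference between what the inventory believes exists and what the workload clusters actually report. A minimal sketch, assuming that a zombie is an inventory entry without a live counterpart and that both sides can be listed by resource ID. `find_zombies`, `cleanup` and the ID format are hypothetical:

```python
def find_zombies(inventory_ids: set[str], cluster_ids: set[str]) -> set[str]:
    """IDs still tracked in the inventory but no longer present in the clusters."""
    return inventory_ids - cluster_ids


def cleanup(inventory: dict[str, dict], cluster_ids: set[str]) -> list[str]:
    """Drop zombie entries from the inventory and report what was removed."""
    removed = sorted(find_zombies(set(inventory), cluster_ids))
    for resource_id in removed:
        del inventory[resource_id]
    return removed


if __name__ == "__main__":
    inventory = {"dc1/job-a": {"state": "running"}, "dc1/job-b": {"state": "running"}}
    live = {"dc1/job-a"}  # job-b was deleted, but the event got lost
    print(cleanup(inventory, live))  # ['dc1/job-b']
```

Run on a timer, this catches missed delete events at the cost of one full listing per run.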
|
||||||
|
|
||||||
|
### Overview
|
||||||
|
|
||||||
|

|
||||||
BIN
content/day1/_img/capi.png
Normal file
|
After Width: | Height: | Size: 75 KiB |
BIN
content/day1/_img/clusterapi-crd.png
Normal file
|
After Width: | Height: | Size: 112 KiB |
BIN
content/day1/_img/deployables.png
Normal file
|
After Width: | Height: | Size: 220 KiB |
BIN
content/day1/_img/runnables.png
Normal file
|
After Width: | Height: | Size: 266 KiB |
BIN
content/day1/_img/submission.png
Normal file
|
After Width: | Height: | Size: 297 KiB |
@@ -4,8 +4,26 @@ title: Day 1
|
|||||||
weight: 5
|
weight: 5
|
||||||
---
|
---
|
||||||
|
|
||||||
TODO:
|
Day 1 of the main KubeCon event started with a bunch of keynotes from the CNCF themselves (announcing the next locations for KubeCon - Amsterdam and Barcelona).
|
||||||
|
They also announced a new sovereign cloud edge initiative (CNCF/LF meets EU and some German ministry) called "NeoNephos" with members like SAP, StackIt or T-Systems.
|
||||||
|
|
||||||
|
This is also the day the sponsor showcase opened - so expect more talking to people and meetings or demos and less straight up talks.
|
||||||
|
|
||||||
## Talk recommendations
|
## Talk recommendations
|
||||||
|
|
||||||
* TODO:
|
- Not that much about gpus with good control plane scaling advice: [Scaling GPU Clusters without melting down](./01_scaling-gpu)
|
||||||
|
- Migrate a cluster to ClusterAPI without downtime: [Day 2000 - Migrating from kubeadm + ansible to clusterapi+talos](./02_migrations)
|
||||||
|
- Some basic operator tips with good Q&A questions: [Don't write controllers like charlie don't does: Avoiding common kubernetes controller mistakes](./03_operator-mistakes)
|
||||||
|
|
||||||
|
## Other stuff I learned or people I talked to
|
||||||
|
|
||||||
|
- The crossplane maintainers (Upbound)
|
||||||
|
- Anynines
|
||||||
|
- Cloudfoundry/Korifi
|
||||||
|
- FlatCar
|
||||||
|
- Cert-Manager
|
||||||
|
- Flux maintainers
|
||||||
|
- OVH
|
||||||
|
- Kubermatic
|
||||||
|
- Isovalent
|
||||||
|
- Spacelift: They employ some of the opentofu core maintainers
|
||||||
38
content/day2/01_chance-of-kubernetes.md
Normal file
@@ -0,0 +1,38 @@
|
|||||||
|
---
|
||||||
|
title: "Cloudy with a chance of kubernetes"
|
||||||
|
weight: 1
|
||||||
|
tags:
|
||||||
|
- kubecon
|
||||||
|
- platform
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/iCAFXF5ECto" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/bc/KubeCon%20EU%202025%20-%20Cloudy%20with%20a%20chance%20of%20Kubernetes_%20Going%20from%20one%20to%20three%20cloud%20providers%20-%20Laurent%20Bernaille%20%26%20Maxime%20Visonneau,%20Datadog.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
## Background
|
||||||
|
|
||||||
|
- Scale: 100s of clusters
|
||||||
|
- Cloud: Azure, AWS, GCP
|
||||||
|
- The baseline: Single AWS Region and applications on vms
|
||||||
|
- Goal: Operate on different locations
|
||||||
|
- History: They added more and more regions - 6 Providers in 6 Regions across 29 locations
|
||||||
|
- Problem: Different tooling across different cloud providers
|
||||||
|
- Idea: Kubernetes abstracts the specific cloud provider infra
|
||||||
|
|
||||||
|
## The way
|
||||||
|
|
||||||
|
- Idea: Use managed kubernetes
|
||||||
|
- Problem: In 2018 the managed offerings were in beta or very limited
|
||||||
|
- Challenge: Opinionated cloud specific stuff
|
||||||
|
|
||||||
|
### Iterations
|
||||||
|
|
||||||
|
1. Clusters based on vms created by terraform and other automation tools -> They realized that they need multiple clusters per region
|
||||||
|
2. Their own application delivery platform that deployed to the right clusters across regions for better DevEx
|
||||||
|
3. k8s on k8s (hosted cp) -> Current setup with a terraform managed parent cluster
|
||||||
|
4. Idea: Host the Parent-Cluster on managed kubernetes -> They need to abstract some things away
|
||||||
|
5. Solution: Use their good old application delivery platform
|
||||||
|
|
||||||
|
### Abstractions
|
||||||
|
|
||||||
|
- Use custom CRDs to abstract the same behaviour across providers
|
||||||
@@ -4,8 +4,21 @@ title: Day 2
|
|||||||
weight: 6
|
weight: 6
|
||||||
---
|
---
|
||||||
|
|
||||||
TODO:
|
The second day of kubecon was my main "meeting day" this year - aka there were a bunch of scheduled meetings with manufacturers, partners, potential partners or just to get to know someone/a project.
|
||||||
|
What does this mean for you? Another day with only a few sessions (I only managed to attend two and only one was worthy of note taking) - the meeting notes are not available online.
|
||||||
|
|
||||||
## Talk recommendations
|
In the evening we attended the "German Community Stammtisch".
|
||||||
|
|
||||||
* TODO:
|
## Other stuff I learned or people I talked to
|
||||||
|
|
||||||
|
- Isovalent
|
||||||
|
- Kubermatic
|
||||||
|
- Portworx
|
||||||
|
- Fastly
|
||||||
|
- Syseleven
|
||||||
|
- Netbird
|
||||||
|
- VMware
|
||||||
|
- Stackit
|
||||||
|
- Harness
|
||||||
|
- Mia Platform
|
||||||
|
- and many, many more...
|
||||||
53
content/day3/01_day-two.md
Normal file
@@ -0,0 +1,53 @@
|
|||||||
|
---
|
||||||
|
title: "Surviving Day2: Picking the right tool to secure your kubernetes habitat"
|
||||||
|
weight: 1
|
||||||
|
tags:
|
||||||
|
- kubecon
|
||||||
|
- security
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/FqUPqroF-Rw" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/a1/Surviving%20Day2%20-%20Picking%20the%20Right%20Tool%20To%20Secure%20Your%20Kubernetes%20Habitat.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
Premise: The CNCF landscape includes a huuuge number (80+) of security(related) projects.
|
||||||
|
Analogy: Animal kingdom (includes similar-ish animals that might do some of the same stuff but not entirely the same)
|
||||||
|
|
||||||
|
## Build Phase
|
||||||
|
|
||||||
|
- How can I scan my container for vulnerabilities? -> Well you probably mean your image
|
||||||
|
- The image itself is just a bunch of static layers and we kinda have to trust the layers you didn't build yourself
|
||||||
|
- The main tool used is still trivy with some easy steps
|
||||||
|
1. Extract layers
|
||||||
|
2. Build FS
|
||||||
|
3. Identify OS and Non-OS Packages
|
||||||
|
4. Compare with vuln-db
|
||||||
|
- The animal in our analogy: Racoon
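
The last two scanner steps (identify packages, compare with a vulnerability database) can be sketched as a simple lookup. This is a toy model and not how trivy is implemented: the package list, the advisory IDs and the fixed versions are made up, and real distribution version strings need a far more careful comparison than the dotted-integer parsing assumed here.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.2.10' into (1, 2, 10). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in version.split("."))


def find_vulnerable(
    installed: dict[str, str],
    vuln_db: dict[str, list[tuple[str, str]]],
) -> list[tuple[str, str]]:
    """Return (package, advisory id) pairs for packages older than the fixed version.

    `vuln_db` maps a package name to (advisory id, fixed-in version) entries.
    """
    findings = []
    for name, version in installed.items():
        for advisory_id, fixed_in in vuln_db.get(name, []):
            if parse_version(version) < parse_version(fixed_in):
                findings.append((name, advisory_id))
    return findings


if __name__ == "__main__":
    # Hypothetical data: what step 3 might have found in the image filesystem.
    installed = {"openssl": "3.0.1", "zlib": "1.3.1"}
    vuln_db = {
        "openssl": [("EXAMPLE-0001", "3.0.7")],
        "zlib": [("EXAMPLE-0002", "1.2.12")],
    }
    print(find_vulnerable(installed, vuln_db))  # [('openssl', 'EXAMPLE-0001')]
```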
|
||||||
|
|
||||||
|
## Deploy Phase
|
||||||
|
|
||||||
|
- Kubernetes Native: Admission Controller
|
||||||
|
- Tool used: Kyverno (integrates as an admission controller with yaml/crd based configuration)
|
||||||
|
1. Modify (e.g. add default resource limits)
|
||||||
|
2. Validate (check policies)
|
||||||
|
- The animal is actually a human: The forest guard
|
||||||
|
|
||||||
|
## Start Phase
|
||||||
|
|
||||||
|
- Before the pod itself is running, CSI, CNI and secret related processes (the ones we want to look into) happen
|
||||||
|
- Problems: Secrets have no rotation or versioning mechanism, there is no default integration for external kms
|
||||||
|
- Project: External Secrets -> Get secrets from external kms, automatically sync (e.g. new versions)
|
||||||
|
- The chosen animal: Capricorn
|
||||||
|
|
||||||
|
## Run Phase
|
||||||
|
|
||||||
|
- Goal: Runtime scanning without including specialized instrumentation in each application
|
||||||
|
- Tool: Falco utilizing eBPF to check system calls against rules
|
||||||
|
- Idea: Detect dangerous behaviour (e.g. check for someone trying to exploit a fresh CVE)
|
||||||
|
- The analogy: Falcon
|
||||||
|
|
||||||
|
## TL;DR
|
||||||
|
|
||||||
|
1. Scan images (trivy)
|
||||||
|
2. Enforce best practices (kyverno)
|
||||||
|
3. Use an external kms (external secrets)
|
||||||
|
4. Scan at runtime (falco)
|
||||||
30
content/day3/02_open-feature.md
Normal file
@@ -0,0 +1,30 @@
|
|||||||
|
---
|
||||||
|
title: "Type-safe feature flagging in openfeature: Lessons learned from using feature flags at google"
|
||||||
|
weight: 2
|
||||||
|
tags:
|
||||||
|
- kubecon
|
||||||
|
- dev
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/mewXGSwDCE4" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
{{% button href="https://static.sched.com/hosted_files/kccnceu2025/f6/Type-safe%20Feature%20Flagging%20in%20OpenFeature.pdf" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
|
||||||
|
|
||||||
|
## Featureflags?
|
||||||
|
|
||||||
|
- Idea: Change the behaviour of an application without rebuilding it
|
||||||
|
- Goal: Control rollout, reduce risk, experiment (a/b)
|
||||||
|
- At google: A huge number of feature flags (150k+) but that's because people forget to turn them off
|
||||||
|
|
||||||
|
## Where does the flag come from
|
||||||
|
|
||||||
|
- Lifecycle of a flag: Create, Manage, Deprecate, Delete -> But will it be created first in code or in the service?
|
||||||
|
- Classic implementation: Just an if/else that uses a function to get the flag
|
||||||
|
- Problem: What if the flag names mismatch between the code and the flag service -> Multiple sources of truth
|
||||||
|
- Solution: Require use of auto-generated flag bindings (codegen from the management system) to mitigate typos, etc.
|
||||||
|
|
||||||
|
## OpenFeature
|
||||||
|
|
||||||
|
- Goal: Vendor agnostic, standardized, open source
|
||||||
|
- Basic setup: Register provider (once per app), create a client, use client to get flags
|
||||||
|
- CLI: Integrate into management system, keep a local manifest of all flags and generate code (generates the client)
|
||||||
|
- Now: Just call the client's method instead of hard-coding feature flag names
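
The difference between string lookups and generated bindings can be sketched in a few lines. This is not the OpenFeature SDK and not the output of its CLI: `FlagClient`, `GeneratedFlags` and the flag name `new-checkout` are hypothetical stand-ins that only show the idea of one generated accessor per flag.

```python
from dataclasses import dataclass, field


@dataclass
class FlagClient:
    """Toy flag client backed by a dict (stand-in for a real provider)."""

    values: dict[str, bool] = field(default_factory=dict)

    def get_boolean(self, name: str, default: bool) -> bool:
        # A typo in `name` silently falls back to the default.
        return self.values.get(name, default)


class GeneratedFlags:
    """What code generated from a flag manifest could look like."""

    def __init__(self, client: FlagClient) -> None:
        self._client = client

    def new_checkout(self) -> bool:
        # The flag name lives in exactly one (generated) place.
        return self._client.get_boolean("new-checkout", False)


if __name__ == "__main__":
    client = FlagClient({"new-checkout": True})
    print(client.get_boolean("new-chekout", False))  # typo goes unnoticed: False
    print(GeneratedFlags(client).new_checkout())     # True
```

With the generated accessor, a misspelled flag becomes a missing method that a type checker or the first test run reports, instead of a silent default.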
|
||||||
43
content/day3/03_etcd-reliability.md
Normal file
@@ -0,0 +1,43 @@
|
|||||||
|
---
|
||||||
|
title: "Don't let your kubernetes cluster go wild: Ensuring etcd reliability"
|
||||||
|
weight: 3
|
||||||
|
tags:
|
||||||
|
- kubecon
|
||||||
|
- etcd
|
||||||
|
---
|
||||||
|
|
||||||
|
{{% button href="https://youtu.be/J93U9n_qxSI" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}
|
||||||
|
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
|
||||||
|
|
||||||
|
Fair warning: This talk was very technical and pretty interesting - but don't even try to understand it if you're tired (or if it's the third to last session on the last day of a long conference).
|
||||||
|
|
||||||
|
## Baseline
|
||||||
|
|
||||||
|
- Standard example: Write and read KV-Data, `put(A,2) -> Get (A)`
|
||||||
|
- Problem: Concurrency
|
||||||
|
|
||||||
|
TODO: Steal image from intuition of correctness
|
||||||
|
|
||||||
|
## Correctness
|
||||||
|
|
||||||
|
- Correctness: Kinda funky when it comes to time
|
||||||
|
- Fix: Define a serialization that executes parallel requests one after another to bring them into an order
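
The serialization idea can be made concrete with a brute-force check for a single key. This is a sketch under assumptions and not the etcd robustness framework: `Op`, the timestamps and the helper names are invented, and trying every permutation only works for tiny histories. A history counts as correct here if some one-after-another order exists that respects real time (an operation that finished before another one started must come first) and in which every read returns the most recent write.

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Op(NamedTuple):
    start: float          # time the request was sent
    end: float            # time the response was received
    kind: str             # "put" or "get"
    value: Optional[int]  # value written, or value the read returned


def respects_real_time(order: tuple[Op, ...]) -> bool:
    """An op that finished before another started must be ordered first."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True


def replays_correctly(order: tuple[Op, ...]) -> bool:
    """Execute the ops one after another against a single register."""
    current = None
    for op in order:
        if op.kind == "put":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(ops: list[Op]) -> bool:
    """Brute force: try every serialization of the history."""
    return any(
        respects_real_time(order) and replays_correctly(order)
        for order in permutations(ops)
    )


if __name__ == "__main__":
    # The read overlaps with put(A,2), so returning the old value 1 is fine.
    overlapping = [Op(0, 1, "put", 1), Op(2, 4, "put", 2), Op(3, 5, "get", 1)]
    # The read starts after put(A,2) finished, so returning 1 is a stale read.
    stale = [Op(0, 1, "put", 1), Op(2, 3, "put", 2), Op(4, 5, "get", 1)]
    print(is_linearizable(overlapping), is_linearizable(stale))  # True False
```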
|
||||||
|
|
||||||
|
## Failures
|
||||||
|
|
||||||
|
- What happens if connections between etcd nodes go down -> Serving stale data
|
||||||
|
- What happens if data corrupts -> If enough members are online, it can repair itself
|
||||||
|
- And many more that can happen at random times -> Hard to test
|
||||||
|
|
||||||
|
TODO: Steal "in a concurrent world"
|
||||||
|
|
||||||
|
## Robustness framework
|
||||||
|
|
||||||
|
- Automates tests for failures
|
||||||
|
- Includes reliable reproductions of past (seemingly random) errors
|
||||||
|
- Currently a mixture of existing Go debugging tools
|
||||||
|
|
||||||
|
## Future
|
||||||
|
|
||||||
|
- Reproduce more bugs consistently
|
||||||
|
- Run additional consistency checks
|
||||||
@@ -4,8 +4,15 @@ title: Day 3
|
|||||||
weight: 7
|
weight: 7
|
||||||
---
|
---
|
||||||
|
|
||||||
TODO:
|
The last day of KubeCon - aka the day everyone leaves early.
|
||||||
|
But not me and I had no meetings scheduled for this day -> More talks for me and notes for you.
|
||||||
|
|
||||||
|
This being my 7th day of the trip and 6th day of non-stop conferences took a bit of a toll on my note taking skills (expect more spelling mistakes).
|
||||||
|
|
||||||
## Talk recommendations
|
## Talk recommendations
|
||||||
|
|
||||||
* TODO:
|
- Intro to feature flags and related tips: [Type-safe feature flagging in openfeature: Lessons learned from using feature flags at google](./02_open-feature)
|
||||||
|
|
||||||
|
## Other stuff I learned or people I talked to
|
||||||
|
|
||||||
|
- TODO:
|
||||||
@@ -4,4 +4,6 @@ title: Lessons Learned
|
|||||||
weight: 8
|
weight: 8
|
||||||
---
|
---
|
||||||
|
|
||||||
|
Not related to any talk directly, but I can recommend this [Blog Post](https://smudge.ai/blog/ratelimit-algorithms) and [Video](https://www.youtube.com/watch?v=8QyygfIloMc&) about rate limiting.
|
||||||
|
|
||||||
TODO:
|
TODO:
|
||||||
|
|||||||