Compare commits

...

10 Commits

14 changed files with 636 additions and 7 deletions

View File

@ -1,6 +1,6 @@
# @niggl/kubecon25
# @niggl/cnsmunich25
My experiences at Cloud Native Rejekts and KubeCon + CloudNativeCon Europe 2025 in London.
My experiences at Cloud Native Summit 2025 in Munich.
## Quickstart 🐳

View File

@ -5,12 +5,17 @@ title: Cloud Native Summit Munich 2025
All about the things I did and sessions I attended at Cloud Native Summit 2025 in Munich.
This current version is probably full of typos - will fix later. This is what typing the notes blindly in real time get's you.
This current version is probably full of typos - might fix later (prbly won't tbh). This is what typing the notes blindly in real time gets you.
## How did I get there?
I attended Cloud Native Rejekts and KubeCon + CloudNativeCon Europe 2025 in London, and some of the attendees recommended checking out CNS Munich as another event in the same spirit as Cloud Native Rejekts.
After a short talk with my boss, I there by my employer [DATEV eG](https://datev.de) alongside two of my coworkers.
After a short talk with my boss, I got sent there by my employer [DATEV eG](https://datev.de) alongside two of my coworkers.
## And how was it?
I'd say that attending CNS Munich 2025 was worth it. The event is pretty close to my place of employment (2 hrs by car or train) and relatively small (400 attendees). The talks varied a bit - the first day had a bunch of interesting talks, but the second day indulged in AI-related talks (not quite my cup of tea). This might be fine for others, but I've heard enough about AI use cases for the coming years at the last events I attended (or just on Reddit).
Maybe distributing the AI talks over the two days - while always providing an interesting alternative - would be the right move for next time.
## And how does this website get its content

78
content/day1/07_devex.md Normal file
View File

@ -0,0 +1,78 @@
---
title: What going cloud native taught us about developer experience
weight: 7
tags:
- devex
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
## History/Base on-prem
- Monolith
- High autonomy regarding releases into prod (auto gen)
- Comfort features -> It just works™
## Goals
- Microservices
- Accelerated processes
- Better dev experiences
## The road
### Expectations
- Expected new work for developers: CI/CD, GitOps, Monitoring, Security, Resilience, Connect to other services
- New for developers: Kubernetes with a bunch of surrounding tech
### Journey
1. Start of journey: Usually a transition towards the cloud with some Kubernetes, deployment templates and less legacy stuff
2. Result:
- Implementation: Gigantic base manifests with maybe some overlay abstraction (Kustomize or Helm)
- Expectation: Works in all environments
3. Considerations:
- Developers will change config (it's only a question of when, not if)
- The migration from an env file to kubernetes compliant yaml can be a hard one
4. Iteration: The developer friendly config is our new goal
### The developer friendly config
> e.g. in a helm values file
- Easy to understand and configure
- Think about the dev experience (sane defaults)
- Allow templating
- Provide documentation
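A minimal sketch of what such a values file could look like (all names and defaults below are made up for illustration, not from the talk):
```yaml
# Hypothetical values.yaml for a service chart - sane defaults, documented knobs
replicaCount: 2                  # safe default, override per environment

image:
  repository: registry.example.com/team/my-service   # placeholder registry/image
  tag: ""                        # empty = use the chart's appVersion

ingress:
  enabled: false                 # off by default, one switch to expose the service
  host: ""                       # e.g. my-service.dev.example.com

resources:                       # conservative defaults that run in every environment
  requests:
    cpu: 100m
    memory: 128Mi

extraEnv: []                     # escape hatch for app-specific config
  # - name: FEATURE_FLAG_X
  #   value: "true"
```
The point is less the exact keys and more that a developer can read it top to bottom and immediately see which two or three values they actually need to touch.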
## Developer centric approach to cloud native
> There are not many technical problems in cloud native, most are experience related
### Remember
- Your users won't react the way you expect them to
- The platform should serve the needs of your users, not the other way around
- Users will come to you if you build a nice environment for them
### What do your users need
- Every service is like its own area -> What connections does it need to the outside, and how do I ensure its health?
- Reduced cognitive load: Avoid developers being occupied with foundational work instead of delivering value
- The new env needs to be as nice or nicer than the old env
### How to help them
- Bootstrapping: Define blueprints (including dependency services like databases) and automate stuff with defined goals (e.g. my service should be deployed 5 mins after bootstrapping) - see the template sketch after this list
- Tooling: Backstage (yay)
- Establish training programs and communities
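The bootstrapping bullet above maps pretty directly onto Backstage software templates. A rough skeleton of one (names, the skeleton path and the repo URL are placeholders, and it assumes the standard scaffolder actions are installed):
```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: service-with-database          # placeholder blueprint name
  title: Service with database
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Basics
      required: [name]
      properties:
        name:
          title: Service name
          type: string
  steps:
    - id: fetch
      action: fetch:template           # copy a skeleton repo and fill in the values
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
    - id: publish
      action: publish:github           # assumes the GitHub integration is configured
      input:
        repoUrl: github.com?owner=my-org&repo=${{ parameters.name }}
```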
## Wrap up
- Interfaces are important
- Ensure roads to your service are well maintained and documented
- Build standards and contracts to ensure that others can rely on your service
- Build an example project that is not too big but tackles real-world challenges
- Remember that developers are used to the old way of working which has a bunch of creature comforts

83
content/day1/08_auth.md Normal file
View File

@ -0,0 +1,83 @@
---
title: How Google Built a Consistent, Global, Authorization System with Zanzibar and you can too
weight: 8
tags:
- auth
- security
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
Challenge: You send an email via Gmail that has a Google Drive attachment -> Those are two separate apps, but a central auth check needs to take place to grant access to the recipient.
## Access control types
- ACL (access control list): Pretty basic
- RBAC (Role based access control): The de facto standard for a long time
- ABAC (Attribute based access control): Check attributes (user ID, IP address, ...) at access time to make a decision
- ReBAC (Relationship based access control)
## ReBAC
### Baseline
```mermaid
graph LR
document-->|Is part of|folder-->|was created by|user
```
### Relation Tuple
- `document:123#owner@user:3` -> User 3 is owner of document 123
- `group:engineering#member@group:security` -> Group security is a member of the group engineering
### Graph representation (DAG)
```mermaid
graph LR
somedocument-->reader
somedocument-->writer
reader-.->|is also available via|writer
reader-->UserA
reader-->UserB
writer-->UserC
writer-->UserD
```
Then check if there is a directed path from somedocument to UserA via writer -> No = no access
## Zanzibar
- Globally distributed
- ReBAC based
- Central API
### Hotspots
- Problem: Some checks need to happen often
- Solution: Distributed caching
- Cache validity: Timestamp optimization by rounding (e.g. to a second or 50 ms)
- Improvement: Internal use of gRPC
- Lock table: If the same query gets executed multiple times at once, calculate it once and return the cached response to all waiting queries
- Improve cache population: Don't kill sub-checks instantly, but with a delay
### Zookies
- Specify a specific point in time (e.g. to bypass cache with "give me the latest")
- Allows control over the latency vs real-time trade-off
- Solves the new enemy problem: Someone loses access at the same time the content gets changed -> stale cached permissions may result in phantom access to the new version
### Implementations
> Some of the popular open source implementations, just for later - a small SpiceDB-flavored sketch follows the list
- SpiceDB
- ORY
- Permify
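To connect this back to the relation tuples above: in SpiceDB the same idea can be written down as a schema plus relationships and checked with assertions. A toy sketch in its validation file format (my own example, not from the talk):
```yaml
schema: |
  definition user {}

  definition document {
    relation owner: user
    relation reader: user
    // owners can do everything readers can
    permission view = reader + owner
  }
relationships: |
  document:123#owner@user:3
  document:123#reader@user:7
assertions:
  assertTrue:
    - "document:123#view@user:3"
  assertFalse:
    - "document:123#view@user:42"
```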
### Pro
- Low latency with high throughput
- Global consistency
- Composable and hierarchical permission models

View File

@ -0,0 +1,58 @@
---
title: Building a Confidential AI Inference Platform on Kubernetes
weight: 9
tags:
- security
- ai
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
> Felt a bit like a showcase of their product's architecture - not bad, just nothing really to take home
Background: How do we protect the data flowing into and out of our AI models?
## Goals
- Cloud-based inference API
- E2E Encryption
- E2E Attestation
## Encryption Mechanisms
- Idea: Combine data at rest with data in transit and data in use encryption (encrypted memory)
- Attestation: CPU has a private key and issues certificates
## Confidential Containers
- Traditional: Full VM-based isolation
- Kubernetes: Advanced container isolation using virtual sockets and much more
- Implementation: Frameworks like Contrast - see the RuntimeClass sketch below
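Mechanically, confidential containers on Kubernetes mostly boil down to scheduling pods onto a confidential runtime via a RuntimeClass. A generic sketch - the handler name and image are pure assumptions, the real values depend on the CoCo/Contrast setup:
```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: confidential
handler: kata-cc                        # assumed: a CoCo-style kata handler configured on the nodes
---
apiVersion: v1
kind: Pod
metadata:
  name: confidential-inference
spec:
  runtimeClassName: confidential        # pod runs inside a confidential VM-backed sandbox
  containers:
    - name: model-server
      image: registry.example.com/ai/model-server:latest   # placeholder image
```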
### Threat model
- Isolated: Container
- Shared: Kubernetes, Hypervisor, Cloud Infra, Hardware
### Architecture
```mermaid
graph LR
User
User-->|Accesses with trust|AICode
User-->|Key exchange|SecretService-->|Key exchange|AICode
Manifest-->|Configure|ContrastCoordinator
subgraph Cluster
ContrastCoordinator(Contrast Coordinator)
ContrastCoordinator-->|Verify|Worker
subgraph Worker
AICode(AI Code)
AttestationAgent
end
AICode-->|Accesses|GPU
AttestationAgent-->|Verify|GPU
SecretService
end
ContrastCoordinator-->|Attest|User
```

View File

@ -0,0 +1,51 @@
---
title: "Think Big: Monitoring Stack was yesterday - Observability Platform at scale!"
weight: 10
tags:
- monitoring
- observability
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
## Where do you start with monitoring
- The cloud standard solution: Prometheus
- But: What if we don't just monitor one app but a cluster or multiple clusters?
- Problem: Prometheus isn't quite the best when it comes to scaling
- And: We want Dashboards, Traces, Alerting, Logs, Auditing, ...
## Trying to build the master monitoring by just adding stuff on the side
- Add custom stuff
- More complex setups
- Less and less documentation and standardization
## But how do we regain control
- Product Thinking: Let's collect the problems
- Result: No clear separation of the product, no vision (just firefighting); we want better releases and improved resource usage
### Transition
1. Overview of the current stack -> Just list all components -> We're no longer just a monitoring stack, we do observability
2.
   1. Long term goals and vision -> Add clear interfaces and contracts (hey platform mindset, we've heard that one before) based on expectations
   2. Target groups and journeys -> Clear responsibility cut-off between platform <-> users
3. Improve the platform -> Needs full buy-in to be the **central**, **open** and **self-service** platform
- In their case: Focus on Mimir (instead of Prometheus) and Alloy, but keep Grafana and Loki - a minimal datasource-provisioning sketch follows this list
- Define everything else as out of scope (for now)
- Expand scope by improving the experience instead of just "adding tools"
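As a purely illustrative example of the "Mimir behind Grafana" setup mentioned above, the datasource provisioning could look roughly like this (service names and URLs are assumptions):
```yaml
# Grafana datasource provisioning, e.g. mounted under /etc/grafana/provisioning/datasources
apiVersion: 1
datasources:
  - name: Mimir                  # Mimir speaks the Prometheus query API
    type: prometheus
    access: proxy
    url: http://mimir-nginx.mimir.svc/prometheus   # assumed in-cluster service
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki-gateway.loki.svc              # assumed in-cluster service
```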
## Pillars of Observability
- Data management: Ingest, Query
- Dashboard Management: Create, Update, Export
- Alert Management: Rules, Routing, Analytics, Silence
## Wrap up
- Do I need monitoring or more (both answers are fine)?
- Identify the target audience and their journey (not just the tools they want to use)
- Improve the experience and say no if a user requests something that would not improve it

View File

@ -10,4 +10,5 @@ The first day started with the usual organizational topics (schedule, sponsors a
- For everyone: [IT-Grundschutz trifft Kubernetes: Praxisnahe Umsetzung sicherheitsrelevanter Anforderungen](./03_grundschutz) (it was presented in an engaging way)
- If you're interested in metal³: [Bringing Cloud-Native Agility to Bare-Metal Kubernetes with Cluster API and Metal³](./05_baremetal)
- DevEx: [What going cloud native taught us about developer experience](./07_devex) (and honestly worth the speaker's accent and city skylines metaphor)
- If you're interested in different access control patterns: [How Google Built a Consistent, Global, Authorization System with Zanzibar and you can too](./08_auth)

View File

@ -0,0 +1,48 @@
---
title: "Beyond MicroserVices: Running VMS, WASM, and AI WOrkloads on Kubernetes"
weight: 1
tags:
- wasm
- vm
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
This is more of an "overview" talk and less actual new knowledge or specialized stuff.
## Baseline
We all know
- Deployments
- Statefulsets
- Functions
- and so on
## Strange new World: VMs on Kubernetes
- Why VM? Legacy! (and VDI and some testing envs)
- The cool thing: VMs are basically Pods with virtualization powered by KVM/QEMU/libvirt
- Demo: Kubernetes on GCP with KubeVirt installed and deployment of a VM with guest tool access
- TL;DR: KubeVirt makes the VM management UX pretty good - minimal manifest sketch below
TODO: Steal vm vs container vs kubevirt layers illustration
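For reference, a minimal KubeVirt VirtualMachine manifest looks roughly like this (image and sizing are placeholders, not the demo's values):
```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: demo-vm
spec:
  running: true                   # start the VM right away
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 1Gi
      volumes:
        - name: rootdisk
          containerDisk:          # boot disk shipped as a container image
            image: quay.io/containerdisks/fedora:latest
```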
## Kind of a different universe: WASM
- WASM: Low-level typed intermediate machine code
- WASI: System interface for external functions (fs, network, ...)
- Pro: Secure, Portable and performant
- Con: Bleeding-edge, complex, and not feature-complete
### Now on Kubernetes (with SpinKube)
- Still executed on a node inside a pod, but the pod does not contain a regular container; instead a Spin app runs the service as a WASM workload (via the containerd WASM shim) - SpinApp sketch below
- Up to 10x faster spin-up than a traditional container
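A SpinApp as deployed by the SpinKube operator looks roughly like this - API group and field names are from memory and the image is a placeholder, so double-check against the SpinKube docs:
```yaml
apiVersion: core.spinkube.dev/v1alpha1   # assumed current SpinKube API group
kind: SpinApp
metadata:
  name: hello-wasm
spec:
  image: ttl.sh/hello-wasm:latest        # placeholder OCI image containing the Spin app
  replicas: 2
  executor: containerd-shim-spin         # runs the WASM module via the containerd shim
```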
## And how about AI?
- Goal: Host it yourself, or at least in the EU
- Simple quickstart: Ollama - minimal Deployment sketch below
- Challenge: Cost planning
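The Ollama quickstart really can be as small as a plain Deployment - a minimal sketch (not from the talk; storage, GPU scheduling and a Service are left out):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest   # official image
          ports:
            - containerPort: 11434      # Ollama's default API port
```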

45
content/day2/02_agent.md Normal file
View File

@ -0,0 +1,45 @@
---
title: "Works on my LLM: Building your own ai code assistant that isn't completely useless"
weight: 2
tags:
- ai
- vibecoding
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
Build or improve your own AI coding agent (well, mostly improve).
## Baseline
- AI enables us to produce useless code 10x faster
- Problem: Traditional vibe coding is just a short instruction "build me a web app"
- Solution: Context Engineering to support the next step with the right information
- An agent has multiple parts: LLM, context window, external context (files), MCP
## Set up the bootloader
- Rule file: Coding style, conventions, best practices -> "always do this"
- Workflows: Helpers like scripts, etc
- e.g.: Gather Requirements -> Clarify -> Create specification
- Can be written in plain English and maybe annotated using agent-specific tags
## Load domain specific knowledge
- Useful: Add questions regarding approach/architecture to your workflows
- This is where mcp servers can come in
- Challenge: Picking the right information - and the right amount of it - to provide to the agent
## Micro context strategy
- Problem: A monolithic context that can fill up and even get truncated
- Idea: Split into multiple smaller contexts that will be combined before sending to the ai
- Implementation: Save the context into different files and chunk the results into files as well
- Pro: Can be used for stateless interaction
## State Management
- Memory Bank: Always keep updated documents with summaries for the implementation task
- The rabbit hole problem: Trying workaround after workaround, resulting in a full context of useless, non-working workarounds
- Checkpoint Restoration: Create checkpoints and recreate contexts from them instead of trying to force the ai back on track

View File

@ -0,0 +1,56 @@
---
title: "Brains on the edge - running ai workloads with k3s and gpu nodes"
weight: 3
tags:
- ai
- gpu
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
I decided not to note down the usual "typical challenges on the edge" slides (about 10 mins of the talk)
## Baseline
- Edge can be split up: Near Edge, Far Edge, Device Edge
- They use k3s for all edge clusters
## Prerequisites
- Software: GPU Driver, Container Toolkit, Device Plugin
- Hardware: NVIDIA GPU with a supported distro
- Runtime: Not all runtimes support GPUs (containerd and CRI-O do)
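Once driver, container toolkit and device plugin are in place, requesting a GPU from a pod is just a resource limit plus (usually) the NVIDIA runtime class - a quick sketch, with the RuntimeClass name and image tag as assumptions:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  runtimeClassName: nvidia              # assumed: created by the NVIDIA GPU Operator / k3s setup
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed tag
      command: ["nvidia-smi"]           # prints the visible GPU if everything is wired up
      resources:
        limits:
          nvidia.com/gpu: 1             # exposed by the device plugin
```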
## Architecture
```mermaid
graph LR
subgraph Edge
MQTT
Kafka
Analytics
MQTT-->|Publish collected sensor data|Kafka
Kafka-->|Provide data to run|Analytics
end
subgraph Azure
Storage
Monitoring
MLFlow
Storage-->|Provide long term analytics|MLFlow
end
Analytics<-->|Sync models|MLFlow
Kafka-->|Save to long term|Storage
Monitoring-.->|Observe|Storage
Monitoring-.->|Observe|MLFlow
```
## Q&A
- Did you use the nvidia gpu operator: Yes
- Which runtime did you use: ContainerD via K3S
- Why k3s over k0s: Because we used it
- Were you power limited: Nope, the edge was on a large ship

View File

@ -0,0 +1,91 @@
---
title: "Many Cooks, One Platform: Balancing Ownership and Contribution for the Perfect Broth"
weight: 4
tags:
- platform
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
{{% button href="https://docs.google.com/presentation/d/104LXd5-aPQIs4By6ftnyNWFhi4fqGLHZ4uVwpAE3RMo/mobilepresent?slide=id.g370dcb83b32_0_1" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}}
Not unlike a war-story talk about trying to make a product work at a gov organization in the Netherlands.
## Tech stack
- OpenShift migrating to a new WIP platform (WIP for 2 years)
- 4 Clusters (2x DEV, 1x PROD, 1x MGMT)
- Azure DevOps for application delivery
## Org
- IT service provider for the Dutch judiciary - mostly building web apps
- Value Teams: Cross-functional (Dev, Testing, Ops) with a focus on Java and C#
- Bases (aka Workstreams): The Container Management Platform base is the focus of this story
## Why platform?
### Where we came from
- The old days: Deployment Scripts
- Evolution: Loosely coupled services like Jenkins -> Loose interactions make for fun problems and diverging standards
- The new hot stuff: Platform that solves the entire lifecycle
### The golden Path
```mermaid
graph LR
subgraph IDP
DeployService-->|triggers|BuildService
BuildService<-->|Interact with code|RepositoryService
BuildService-->|pushes image to|Registry
end
DeployService-->|deploy to|Prod
```
TODO: Steal image from slides
### Bricks vs Builds
- Brick: Do it yourself
- Build: Ready to use but needs diverse "implementations"
## Vision and scope
- Problem (scope): An undefined scope results in feature-wish creep
- Problem (scope): Things being excluded that feel like they should be part of a platform -> You now have the pleasure of talking to multiple departments
- Old platform used an internal registry -> Business decided we want Artifactory -> HA Artifactory costs as much as the rest of the platform
- The company decides that builds now run in Azure DevOps
- Problem (vision): It's easy to call a bunch of services "a platform" without actually integrating them with each other
## DevOps is an Antipattern
- Classic: Developers are separated from Ops by a wall of confusion
- Modern™: Just run it yourself! How? I'm not gonna tell you
- Solution™: Add an Enabling Team between Dev and Platform
- Problem: This usually results in more work for the platform team, which now also has to support the enabling team
- Solution: The enabling team should be created out of both dev and ops people to create a deep understanding -> Just build a community
## Community building
- Cornerstones: Consistency (same time, same community), safe to ask questions (Vegas rule), acknowledge both the good and the bad/rants, follow up on discussions
### Real World Example: SRE Meetup
> Spoiler: This failed
- Every team was asked to send one SRE
- Meeting tends to get cancelled one minute before due to "nothing to discuss"
- Feels like the SREs have ideas or grievances and the platform team defends itself or attacks the person asking the question
- Was replaced
### Real World Example: The Microservice Guild
- Contribution via Invitation: Hey I heard you built something cool, please tell us about it
- Agenda always shared in advance
- Focus on solutions instead of offense/defense
## Summary
- A platform is a collaborative effort
- Scope has to be communicated early and often
- Build a community
- Sometimes you need to let things go if they don't work out

104
content/day2/05_kcp.md Normal file
View File

@ -0,0 +1,104 @@
---
title: Building a Platform Engineering API Layer with KCP
weight: 5
tags:
- kcp
- platform
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
## Baseline
- The platform is automated and self-service
- We always have a bunch of consumers and service providers that get connected via an internal developer platform
```mermaid
graph TD
subgraph Consume
A
B
end
subgraph Provider
Cert
DB
end
IDP
A-->|discover available services|IDP
A-->|order db|IDP
IDP-->|Notify|DB
DB-->|fulfill|A
```
## Why the Kube API?
- We have it all: Groups, Versions, Optionally Namespaced, ...
- It is extendable via CRDs
- Challenges: CRDs are cluster-scoped -> Everyone shares them across namespaces
- Idea: Everyone gets their own cluster
- Problem: Spinning up clusters is slow and resource intensive
- Idea: "Lightweight clusters" aka Hosted Control Plane
- Problem: Now we have to share CRDs across clusters
## WTF is KCP?
- Idea: What if we had separate control planes but with a shared datastore
- Goal: Horizontally scalable control plane for extendable APIs
- You don't need Kubernetes to run KCP (it's a standalone binary)
- It does not spin up a real API server but a workspace with a low memory footprint
- It does not implement all of the container related stuff (Pod, Deployment, ...)
### Access and setup
```mermaid
graph LR
User-->|Create APIServer Team A|KCP
KCP-->|Kubeconfig|User
subgraph KCP
APIA(Workspace Team A)
APIB(Workspace Team B)
Datastore
end
User-->|Kubectl get ns|APIA
APIA-->|Return NS for Workspace A|User
```
### Internal Organization
- Workspaces are organized in a tree
- Possibility of nested fun: `/clusters/root:org-a:team-a`
- Sub-workspaces can't access resources from the root workspace
```mermaid
graph TD
Root
Root-->OrgA
Root-->OrgB
OrgA-->TeamA
```
### Sharing
- KCP owns all workspaces -> It can share stuff across them
- To share: APIExport (can share multiple CRDs in one package)
- To use: APIBinding (just reference the exported API by workspace path and name) - sketch below
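Roughly what that looks like as manifests - API group and field names are from memory and kcp's API has shifted between releases, so treat this as a sketch only:
```yaml
# In the provider workspace: export one or more API resource schemas
apiVersion: apis.kcp.io/v1alpha1
kind: APIExport
metadata:
  name: databases
spec:
  latestResourceSchemas:
    - v1.databases.example.io        # placeholder APIResourceSchema name
---
# In the consumer workspace: bind to that export by workspace path + name
apiVersion: apis.kcp.io/v1alpha1
kind: APIBinding
metadata:
  name: databases
spec:
  reference:
    export:
      path: root:providers           # placeholder path to the provider workspace
      name: databases
```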
### Order fulfillment
- Classic Kubernetes: Controller -> But they are isolated, aren't they?
- Virtual Workspace: Provides a computed view of parts of a workspace -> Basically a URL that you give to the controller, which can be used to watch objects across workspaces
- Part of KCP's magic -> You don't create it; it gets managed for each APIExport
## Notes from the demo
- Spin up locally is near instant
- Switching to the namespace can be achieved with a simple API command or ...
## But why do we even need a universal API layer
- Service providers should not be responsible for making things discoverable; the platform should be
- The internal platform can be bought, customized or DIYed, but the API layer does not change -> the backend becomes interchangeable
- Kubernetes is already widespread and makes it easy to use different projects
- Backed by the CNCF, flat learning curve

View File

@ -4,3 +4,11 @@ title: Day 2
weight: 2
---
The schedule on day 2 was pretty AI-platform focused.
Sadly, all of the AI-focused talks were about building workflows and platforms with GitOps and friends, not about actually building the base (GPU scheduling and so on).
We also had some "normal" work tasks, resulting in fewer talks attended and more "normal" work + networking.
## Recommended talks
- Good speaker: [Many Cooks, One Platform: Balancing Ownership and Contribution for the Perfect Broth](./04_many-cooks)
- Good intro to kcp: [Building a Platform Engineering API Layer with KCP](./05_kcp)

View File

@ -4,6 +4,7 @@ title: Lessons Learned
weight: 3
---
## Mal anschauen
## Maybe look into
- Otterize for network policies
- SpinKube/wasmCloud for optimized WASM on Kubernetes