kubecon25/02_migrations.md at 46b06c66fd5bdec3c6bb39c097e3fbb668739b64 - kubecon25 - ODIT.Services

niggl/kubecon25

Nicolai Ort 46b06c66fd


2025-04-02 13:21:27 +02:00


title: Day 2000 - Migrating from kubeadm + ansible to clusterapi+talos
weight: 2
tags: kubecon, platform

Background

They use large, shared clusters
The oldest cluster is 2099 days (about 5.8 years) old
Hosted on-prem on vSphere with vanilla kubeadm
Fun fact: They run Chaos Monkey on all clusters -> Workloads are automatically prepared for the disruption caused by updates

Legacy provisioning

Terraform creates a Debian VM
Deploy base tools with Puppet
Register the nodes in an inventory YAML file
Run the Ansible playbook -> Renders configs and runs kubeadm
Configure ArgoCD

Target

Use Cluster API (CAPI) to manage the workload clusters
- Basic CRDs: Cluster, MachineDeployment, Machine
Talos: Immutable, minimal, ephemeral, with declarative config via a gRPC API

TODO: Steal diagrams from slides

Migration

Config matching between kubeadm and talos+capi
Import PKI/Certs
Create ClusterAPI CRDs
Add ClusterAPI Nodes
Remove kubeadm nodes

1. Config matching

Serviceaccount issuer: Talos has its own default
etcd encryption key names are hardcoded in Talos
Re-Encrypt all secrets (get secrets, replace secrets)
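
The config matching step can be sketched as a simple diff between the settings of the old and the new control plane. This is only a minimal illustration: the setting names (`service_account_issuer`, `etcd_encryption_key_names`) and all example values are hypothetical and not taken from the talk.

```python
from typing import Any, Mapping

# Hypothetical setting names chosen for this sketch; adapt them to your own config layout.
SETTINGS_TO_MATCH = ("service_account_issuer", "etcd_encryption_key_names")


def find_mismatches(
    kubeadm: Mapping[str, Any],
    talos: Mapping[str, Any],
    settings: tuple = SETTINGS_TO_MATCH,
) -> dict:
    """Return {setting: (kubeadm_value, talos_value)} for every setting that differs."""
    mismatches = {}
    for name in settings:
        old, new = kubeadm.get(name), talos.get(name)
        if old != new:
            mismatches[name] = (old, new)
    return mismatches


if __name__ == "__main__":
    # Made-up example values, only meant to show the shape of the comparison.
    kubeadm_cfg = {
        "service_account_issuer": "https://kubernetes.default.svc.cluster.local",
        "etcd_encryption_key_names": ["legacy-key"],
    }
    talos_cfg = {
        "service_account_issuer": "https://cluster.example.internal:6443",
        "etcd_encryption_key_names": ["talos-key"],
    }
    for setting, (old, new) in find_mismatches(kubeadm_cfg, talos_cfg).items():
        print(f"{setting}: kubeadm={old!r} talos={new!r}")
```

Every mismatch reported here would have to be resolved before nodes of both kinds can share one cluster.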

2. PKI

Talos includes some logic that can generate a secrets bundle from an existing API
Import: The etcd, k8s, serviceaccount and os (Talos-specific, used for Talos API auth) certificates
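
Before importing, it can help to check that the kubeadm PKI directory is complete. The following is a minimal pre-flight sketch, not part of Talos: the file names follow the standard kubeadm layout under `/etc/kubernetes/pki`, while the selection of files that an import needs is an assumption made for illustration.

```python
from pathlib import Path

# Standard kubeadm file names; which of them the import really needs is an assumption.
REQUIRED_PKI_FILES = (
    "ca.crt", "ca.key",            # Kubernetes CA
    "etcd/ca.crt", "etcd/ca.key",  # etcd CA
    "sa.key", "sa.pub",            # serviceaccount signing key pair
)


def missing_pki_files(pki_dir, required=REQUIRED_PKI_FILES) -> list:
    """Return the relative paths from `required` that are not regular files in pki_dir."""
    root = Path(pki_dir)
    return [rel for rel in required if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_pki_files("/etc/kubernetes/pki")
    if missing:
        print("Missing PKI files:", ", ".join(missing))
    else:
        print("All expected PKI files are present")
```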

3. CRDs

One namespace per workload cluster
Cluster CRD: References the control plane (CP) and the infrastructure
ControlPlane CRD: Creates the CP MachineDeployments (MDs)
Infrastructure: References the template for the worker MDs
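
The two references on the Cluster object can be sketched as follows. This is a hedged illustration, not the speakers' manifests: the provider kinds (`TalosControlPlane`, `VSphereCluster`), their API versions and the naming scheme are assumptions based on the Talos and vSphere providers; verify them against the provider versions you actually run.

```python
import json


def cluster_manifest(name: str) -> dict:
    """Build a Cluster object that points at a control plane and an infrastructure object."""
    return {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        # One namespace per workload cluster; reusing the cluster name is an assumption.
        "metadata": {"name": name, "namespace": name},
        "spec": {
            "controlPlaneRef": {
                # Assumed Talos control plane provider kind and version.
                "apiVersion": "controlplane.cluster.x-k8s.io/v1alpha3",
                "kind": "TalosControlPlane",
                "name": f"{name}-cp",
            },
            "infrastructureRef": {
                # Assumed vSphere infrastructure provider kind and version.
                "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",
                "kind": "VSphereCluster",
                "name": name,
            },
        },
    }


if __name__ == "__main__":
    # JSON is valid YAML, so this output can be piped into `kubectl apply -f -`.
    print(json.dumps(cluster_manifest("team-a"), indent=2))
```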

TODO: Steal image

4. Add ClusterAPI Nodes

Add new CP and worker nodes that are managed by CAPI to the cluster (slowly, stuff will break)
Remove the old nodes one by one over weeks or months
Potential Problems:
- Mismatched serviceaccountissuer
- Missing etcd encryption key
- Wrong etcd encryption key
- Loss of quorum: --force-new-cluster can force recovery on one node of the etcd cluster
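
The quorum risk comes from etcd's majority rule: a cluster of n members needs floor(n/2) + 1 healthy members to accept writes. A small sketch of that arithmetic (the function names are made up for this example) shows why adding and removing CP nodes one at a time matters:

```python
def quorum(members: int) -> int:
    """Number of healthy etcd members needed for a majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Number of members that can fail before the cluster loses quorum."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(3, 7):
        print(f"{n} members: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Growing from 3 to 4 members raises the quorum from 2 to 3 while still tolerating only one failure, so a mixed cluster with an even member count is no more resilient than before the new node joined.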

Demo

I recommend watching the demo