From 4cec1917bfd90cf3026430d226f4844152b3458e Mon Sep 17 00:00:00 2001
From: Nicolai Ort
Date: Wed, 2 Apr 2025 13:17:43 +0200
Subject: [PATCH] docs(day1): Added migration talk

---
 content/day1/01_scaling-gpu.md |  1 +
 content/day1/02_migrations.md  | 75 ++++++++++++++++++++++++++++++++++
 content/day1/_index.md         |  6 ++-
 3 files changed, 80 insertions(+), 2 deletions(-)
 create mode 100644 content/day1/02_migrations.md

diff --git a/content/day1/01_scaling-gpu.md b/content/day1/01_scaling-gpu.md
index fab9b41..9f14ec0 100644
--- a/content/day1/01_scaling-gpu.md
+++ b/content/day1/01_scaling-gpu.md
@@ -7,6 +7,7 @@ tags:
   - ai
   - apiserver
   - go
+  - kubecon
 ---
 
 
diff --git a/content/day1/02_migrations.md b/content/day1/02_migrations.md
new file mode 100644
index 0000000..880a6ef
--- /dev/null
+++ b/content/day1/02_migrations.md
@@ -0,0 +1,75 @@
+---
+title: Day 2000 - Migrating from kubeadm + ansible to clusterapi+talos
+weight: 2
+tags:
+  - kubecon
+  - platform
+---
+
+
+
+## Background
+
+- They use large, shared clusters
+- The oldest cluster is 2099 days (5.8 years) old
+- On-prem, hosted on vSphere with vanilla kubeadm
+- Fun fact: They run chaosmonkey on all clusters -> Automatically prepares for updates
+
+### Legacy provisioning
+
+1. Terraform creates a Debian VM
+2. Deploy base tools with Puppet
+3. Register nodes in an inventory YAML file
+4. Run the Ansible playbook -> Renders configs and runs kubeadm
+5. Configure ArgoCD
+
+### Target
+
+- Use ClusterAPI to manage the workload clusters
+  - Basic CRDs: Cluster, MachineDeployment, Machine
+- Talos: Immutable, minimal, ephemeral, with declarative config via gRPC API
+
+TODO: Steal diagrams from slides
+
+
+## Migration
+
+1. Config matching between kubeadm and talos+capi
+2. Import PKI/Certs
+3. Create ClusterAPI CRDs
+4. Add ClusterAPI Nodes
+5. Remove kubeadm nodes
+
+### 1. Config matching
+
+1. Serviceaccount issuer: Talos has its own default
+2. etcd encryption key names are hardcoded in Talos
+3. Re-encrypt all secrets (get secrets, replace secrets)
+
+### 2. PKI
+
+1. Talos includes some logic that can generate a secrets bundle from an existing API
+2. Import: The etcd, k8s, serviceaccount and os (Talos-specific, used for the Talos API auth) certificates
+
+### 3. CRDs
+
+- One namespace per workload cluster
+- Cluster-CRD: Ref to CP and Infrastructure
+- ControlPlane-CRD: Creates CP MDs
+- Infrastructure: References template for worker-MDs
+
+TODO: Steal image
+
+### 4. Add ClusterAPI Nodes
+
+- Add new CP and worker nodes to the cluster that are managed by CAPI (slowly, stuff will break)
+- Remove the old nodes one by one over weeks or months
+- Potential problems:
+  - Mismatched serviceaccount issuer
+  - Missing etcd encryption key
+  - Wrong etcd encryption key
+  - Loss of quorum: `--force-new-cluster` can force recovery on one node of the etcd cluster
+
+## Demo
+
+I recommend watching the demo
\ No newline at end of file
diff --git a/content/day1/_index.md b/content/day1/_index.md
index 15bfc9b..c504db2 100644
--- a/content/day1/_index.md
+++ b/content/day1/_index.md
@@ -4,11 +4,13 @@ title: Day 1
 weight: 5
 ---
 
-Day 1 of the main KubeCon event startet with a bunch of keynotes from the cncf themselfes (anouncing the next )
+Day 1 of the main KubeCon event started with a bunch of keynotes from the CNCF themselves (announcing the next locations for KubeCon - Amsterdam and Barcelona).
+They also announced a new sovereign cloud edge initiative (CNCF/LF meets EU and some German ministry) called "NeoNephos" with members like SAP, StackIt or T-Systems.
 
 ## Talk recommendations
 
-- TODO:
+- Not that much about GPUs, but good control plane scaling advice: [Scaling GPU Clusters without melting down](../01_scaling-gpu)
+- Migrate a cluster to ClusterAPI without downtime: [Day 2000 - Migrating from kubeadm + ansible to clusterapi+talos](../02_migrations)
 
 ## Other stuff I learned or people i talk to
 