kubecon25/content/day1/02_migrations.md
Nicolai Ort 46b06c66fd
2025-04-02 13:21:27 +02:00

title: Day 2000 - Migrating from kubeadm + ansible to clusterapi+talos
weight: 2
tags: kubecon, platform

Background

  • They use large, shared clusters
  • The oldest cluster is 2099 days (about 5.8 years) old
  • Hosted on-prem on vSphere with vanilla kubeadm
  • Fun fact: They run Chaos Monkey on all clusters, so workloads are automatically prepared for the disruption that updates cause

Legacy provisioning

  1. Terraform creates a Debian VM
  2. Deploy base tools with Puppet
  3. Register the nodes in an inventory YAML file
  4. Run the Ansible playbook, which renders the configs and runs kubeadm
  5. Configure ArgoCD

Target

  • Use Cluster API (CAPI) to manage the workload clusters
    • Basic CRDs: Cluster, MachineDeployment, Machine
  • Talos: Immutable, minimal, ephemeral, with declarative config via a gRPC API

TODO: Steal diagrams from slides

Migration

  1. Config matching between kubeadm and Talos + CAPI
  2. Import PKI/certs
  3. Create ClusterAPI CRDs
  4. Add ClusterAPI Nodes
  5. Remove kubeadm nodes

1. Config matching

  1. Service account issuer: Talos has its own default, so it has to be set to match the existing cluster
  2. The etcd encryption key names are hardcoded in Talos
  3. Re-encrypt all secrets (get the secrets, then replace them)

2. PKI

  1. Talos includes some logic that can generate a secrets bundle from an existing API
  2. Import the etcd, Kubernetes, service account and OS certificates (the OS one is Talos-specific and used for Talos API auth)
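
Before importing, it can help to confirm that the expected key material is actually present on a kubeadm control plane node. The sketch below assumes the default kubeadm layout under `/etc/kubernetes/pki`; the grouping into `k8s`, `etcd` and `serviceaccount` mirrors the list above, and the helper name `missing_pki_files` is made up for illustration. The Talos-specific OS certificate does not exist on a kubeadm node, so it is not checked here.

```python
import tempfile
from pathlib import Path

# Default kubeadm file names, relative to the PKI directory (assumed layout).
KUBEADM_PKI_FILES = {
    "k8s": ["ca.crt", "ca.key"],
    "etcd": ["etcd/ca.crt", "etcd/ca.key"],
    "serviceaccount": ["sa.key", "sa.pub"],
}


def missing_pki_files(pki_dir: str) -> list[str]:
    """Return the relative paths of expected PKI files that are absent."""
    root = Path(pki_dir)
    return sorted(
        rel
        for files in KUBEADM_PKI_FILES.values()
        for rel in files
        if not (root / rel).is_file()
    )


if __name__ == "__main__":
    # Demo against an empty temporary directory instead of a real node.
    with tempfile.TemporaryDirectory() as demo_dir:
        for rel in missing_pki_files(demo_dir):
            print("missing:", rel)
```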

3. CRDs

  • One namespace per workload cluster
  • Cluster CRD: References the control plane (CP) and the infrastructure
  • ControlPlane CRD: Creates the CP MDs
  • Infrastructure: References the template for the worker MDs
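
To make the reference structure concrete, here is a rough sketch of a Cluster object built as a Python dict. The kinds `TalosControlPlane` and `VSphereCluster` and all API versions are assumptions based on the Talos and vSphere providers for Cluster API; the talk notes do not confirm the exact resources, and the name `demo` is hypothetical.

```python
import json


def cluster_manifest(name: str) -> dict:
    """Build a Cluster object that points at a control plane and an infrastructure object."""
    return {
        "apiVersion": "cluster.x-k8s.io/v1beta1",  # assumed API version
        "kind": "Cluster",
        # One namespace per workload cluster, here simply named after the cluster.
        "metadata": {"name": name, "namespace": name},
        "spec": {
            "controlPlaneRef": {
                "apiVersion": "controlplane.cluster.x-k8s.io/v1alpha3",  # assumed
                "kind": "TalosControlPlane",  # assumed provider kind
                "name": f"{name}-cp",
            },
            "infrastructureRef": {
                "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta1",  # assumed
                "kind": "VSphereCluster",  # assumed provider kind
                "name": name,
            },
        },
    }


if __name__ == "__main__":
    # JSON is valid input for kubectl, so the output could be piped to `kubectl apply -f -`.
    print(json.dumps(cluster_manifest("demo"), indent=2))
```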

TODO: Steal image

4. Add ClusterAPI Nodes

  • Add new CAPI-managed CP and worker nodes to the cluster (slowly, stuff will break)
  • Remove the old nodes one by one over weeks or months
  • Potential Problems:
    • Mismatched service account issuer
    • Missing etcd encryption key
    • Wrong etcd encryption key
    • Loss of quorum: --force-new-cluster can force recovery on one node of the etcd cluster
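
The quorum risk while mixing old and new control plane nodes follows from simple majority arithmetic. The helpers below are an illustrative sketch (the function names are made up) of how many members an etcd cluster of a given size needs for quorum and how many failures it tolerates.

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Number of members that can fail while a majority remains."""
    return members - quorum(members)


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members still form a majority."""
    return healthy >= quorum(members)


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(size, quorum(size), failures_tolerated(size))
```

Growing from three to four members while adding a new node does not raise the failure tolerance (it stays at one), which is one reason to add and remove control plane nodes one at a time and verify health in between.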

Demo

I recommend watching the demo.