kubecon25/content/day3/03_etcd-reliability.md
Nicolai Ort 0e24bf4fd6
2025-05-07 07:07:48 +02:00

title: Don't let your kubernetes cluster go wild: Ensuring etcd reliability
weight: 3
tags: kubecon, etcd

{{% button href="https://youtu.be/J93U9n_qxSI" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}

Fair warning: This talk was very technical and pretty interesting, but don't even try to understand it if you're tired (or if it's the third-to-last session on the last day of a long conference).

Baseline

  • Standard example: Write and read KV data, e.g. Put(A, 2) followed by Get(A) returns 2
  • Problem: Concurrency
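
To make the concurrency problem a bit more tangible, here is a minimal sketch. It uses a hypothetical in-memory `Store` (an assumption for illustration, not etcd's client API): a sequential put/get has exactly one sensible result, while two concurrent puts to the same key can legitimately end with either value.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical in-memory KV store, used only for illustration.
type Store struct {
	mu   sync.Mutex
	data map[string]int
}

func NewStore() *Store { return &Store{data: make(map[string]int)} }

// Put writes value under key.
func (s *Store) Put(key string, value int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

// Get returns the value stored under key and whether the key exists.
func (s *Store) Get(key string) (int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	return v, ok
}

// concurrentPuts writes all values to the same key at once and returns
// whatever value is stored after every writer has finished.
func concurrentPuts(s *Store, key string, values []int) int {
	var wg sync.WaitGroup
	for _, v := range values {
		wg.Add(1)
		go func(v int) {
			defer wg.Done()
			s.Put(key, v)
		}(v)
	}
	wg.Wait()
	got, _ := s.Get(key)
	return got
}

func main() {
	s := NewStore()

	// Sequential case: the read can only see the write before it.
	s.Put("A", 2)
	v, _ := s.Get("A")
	fmt.Println("sequential:", v)

	// Concurrent case: 3 and 4 are both legitimate results.
	fmt.Println("concurrent:", concurrentPuts(s, "A", []int{3, 4}))
}
```

The mutex keeps each individual operation safe, but it does not decide which of the two writers wins. That open question is what the correctness discussion is about.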

TODO: Steal image from intuition of correctness

Correctness

  • Correctness: Kinda funky when it comes to time, since concurrent requests have no obvious order
  • Fix: Define a serialization that orders parallel requests as if they had been executed one after another
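
The serialization idea can be sketched as a tiny brute-force check. This is a simplified illustration under stated assumptions, not the checker shown in the talk: there is a single key that starts at 0, every operation carries a start and end time, and the type `Op` and all function names are made up for this example. A history counts as correct if some one-after-another order exists that respects real time (an operation that finished before another started must come first) and in which every read sees the latest preceding write.

```go
package main

import "fmt"

// Op is one client request on a single key, with the interval it ran in.
type Op struct {
	Name       string
	Start, End int  // invocation and response time
	IsPut      bool // true for a write, false for a read
	Value      int  // value written (Put) or value observed (Get)
}

// canGoNext reports whether c may be ordered before all remaining ops:
// no remaining op may have finished before c even started.
func canGoNext(c Op, remaining []Op) bool {
	for _, r := range remaining {
		if r.End < c.Start {
			return false
		}
	}
	return true
}

// findOrder tries every real-time-respecting order via backtracking.
// current is the value of the key after the ops in prefix were applied.
func findOrder(remaining []Op, current int, prefix []Op) ([]Op, bool) {
	if len(remaining) == 0 {
		return prefix, true
	}
	for i, c := range remaining {
		if !canGoNext(c, remaining) {
			continue
		}
		next := current
		if c.IsPut {
			next = c.Value
		} else if c.Value != current {
			continue // this read would not see the latest write
		}
		rest := append(append([]Op{}, remaining[:i]...), remaining[i+1:]...)
		extended := append(append([]Op{}, prefix...), c)
		if order, ok := findOrder(rest, next, extended); ok {
			return order, true
		}
	}
	return nil, false
}

// linearizable returns a valid sequential order for the history, if any.
func linearizable(history []Op) ([]Op, bool) {
	return findOrder(history, 0, nil)
}

func main() {
	// The read overlaps the write, so "write first, then read" is allowed.
	overlapping := []Op{
		{Name: "Put(A,2)", Start: 0, End: 3, IsPut: true, Value: 2},
		{Name: "Get(A)=2", Start: 1, End: 2, Value: 2},
	}
	// The write finished before the read started, yet the read saw 0.
	stale := []Op{
		{Name: "Put(A,2)", Start: 0, End: 1, IsPut: true, Value: 2},
		{Name: "Get(A)=0", Start: 2, End: 3, Value: 0},
	}
	for _, h := range [][]Op{overlapping, stale} {
		order, ok := linearizable(h)
		fmt.Println("valid order found:", ok, order)
	}
}
```

Trying every order is exponential in the number of operations, so this only works for toy histories.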

Failures

  • What happens if connections between etcd nodes go down -> Serving stale data
  • What happens if data gets corrupted -> If enough members are online, it can repair itself
  • And many more failures that can happen at random times -> Hard to test
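
For a rough feeling of what "enough members" can mean, here is a small sketch. It assumes that "enough" refers to a majority quorum, as in the Raft consensus protocol that etcd is built on (the notes themselves don't spell this out). The function names are made up for this example.

```go
package main

import "fmt"

// quorum returns the number of members that form a strict majority.
func quorum(members int) int { return members/2 + 1 }

// tolerated returns how many members may fail while a majority remains.
func tolerated(members int) int { return members - quorum(members) }

// hasQuorum reports whether the reachable members still form a majority.
func hasQuorum(members, reachable int) bool {
	return reachable >= quorum(members)
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("members=%d quorum=%d tolerated=%d\n", n, quorum(n), tolerated(n))
	}
	// A 5-member cluster split 3/2: only the larger side keeps a majority.
	fmt.Println(hasQuorum(5, 3), hasQuorum(5, 2))
}
```

In a network split, the side without a majority can't commit new writes, so anything it still serves may be out of date.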

TODO: Steal "in a concurrent world"

Robustness framework

  • Automates tests for failures
  • Includes reliable reproductions of past (seemingly random) errors
  • Currently a mixture of existing Go debugging tools
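
One generic way to make "seemingly random" failures reproducible is to derive the whole fault schedule from a seed. The sketch below shows only that idea. It is an assumption for illustration and not how the etcd robustness framework is implemented, and the fault names are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Fault is a hypothetical kind of failure a test could inject.
type Fault string

const (
	KillMember  Fault = "kill-member"
	DropNetwork Fault = "drop-network"
	SlowDisk    Fault = "slow-disk"
)

// schedule derives a fault sequence purely from the seed, so a failing
// run can be replayed by reusing the same seed.
func schedule(seed int64, steps int) []Fault {
	rng := rand.New(rand.NewSource(seed))
	faults := []Fault{KillMember, DropNetwork, SlowDisk}
	out := make([]Fault, steps)
	for i := range out {
		out[i] = faults[rng.Intn(len(faults))]
	}
	return out
}

// sameSchedule reports whether two schedules are identical.
func sameSchedule(a, b []Fault) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	first := schedule(42, 5)
	replay := schedule(42, 5)
	fmt.Println(first)
	fmt.Println("replay identical:", sameSchedule(first, replay))
}
```

Logging the seed of every test run would then be enough to replay the same sequence of injected faults later.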

Future

  • Reproduce more bugs consistently
  • Run additional consistency checks