---
title: "Don't let your kubernetes cluster go wild: Ensuring etcd reliability"
weight: 3
tags:
  - kubecon
  - etcd
---

{{% button href="https://youtu.be/J93U9n_qxSI" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}

Fair warning: This talk was very technical and pretty interesting, but don't even try to understand it if you're tired (or if it's the third-to-last session on the last day of a long conference).

## Baseline

- Standard example: Write and read KV data, `put(A, 2) -> get(A)`
- Problem: Concurrency

TODO: Steal image from intuition of correctness

## Correctness

- Correctness: Kinda funky when it comes to time
- Fix: Define a serialization that executes parallel requests one after another to bring them into an order

## Failures

- What happens if connections between etcd nodes go down -> Stale data gets served
- What happens if data gets corrupted -> If enough members are online, the cluster can repair itself
- And many more that can happen at random times -> Hard to test

TODO: Steal "in a concurrent world"

## Robustness framework

- Automates tests for failures
- Includes reliable reproductions of past (seemingly random) errors
- Currently a mixture of existing Go debugging tools

## Future

- Reproduce more bugs consistently
- Run additional consistency checks
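
To make the serialization idea from the Correctness section a bit more concrete, here is a minimal Go sketch of such a consistency check for a single key. It is not the code of the robustness framework from the talk: the `Op` type, the `Linearizable` function and the integer timestamps are all hypothetical names chosen for illustration. The check brute-forces every sequential order that respects real time (an operation that finished before another one started must come first) and accepts the history if, in at least one such order, every get returns the latest put.

```go
package main

import "fmt"

// Op is one client operation on a single key, together with the time
// interval in which it was in flight. Hypothetical type for illustration.
type Op struct {
	Put        bool // true for put, false for get
	Value      int  // value written (put) or value observed (get)
	Start, End int  // invocation and response time
}

// Linearizable reports whether the history can be put into some
// sequential order that respects real time and in which every get
// observes the most recent put. Brute force, so only for tiny histories.
func Linearizable(history []Op, initial int) bool {
	used := make([]bool, len(history))
	return search(history, used, len(history), initial)
}

// search tries every operation that is allowed to go next and recurses.
func search(h []Op, used []bool, remaining, current int) bool {
	if remaining == 0 {
		return true
	}
	for i, op := range h {
		if used[i] || !mayGoNext(h, used, i) {
			continue
		}
		next := current
		if op.Put {
			next = op.Value
		} else if op.Value != current {
			continue // this get would not match the current value
		}
		used[i] = true
		if search(h, used, remaining-1, next) {
			return true
		}
		used[i] = false
	}
	return false
}

// mayGoNext reports whether no other pending operation finished
// before operation i started.
func mayGoNext(h []Op, used []bool, i int) bool {
	for j, other := range h {
		if j != i && !used[j] && other.End < h[i].Start {
			return false
		}
	}
	return true
}

func main() {
	// The get overlaps with put(A, 2), so still seeing the old value 0 is fine.
	overlapping := []Op{
		{Put: true, Value: 2, Start: 0, End: 10},
		{Put: false, Value: 0, Start: 1, End: 5},
	}
	// The get starts after put(A, 2) has completed but returns 0: a stale read.
	stale := []Op{
		{Put: true, Value: 2, Start: 0, End: 10},
		{Put: false, Value: 0, Start: 11, End: 15},
	}
	fmt.Println(Linearizable(overlapping, 0), Linearizable(stale, 0)) // prints "true false"
}
```

The second history is the stale-read case from the Failures section: no ordering of the two requests can explain the result, so the check rejects it. A real checker has to handle many keys, failed requests and far larger histories, which this sketch deliberately ignores.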