From 8b1edb32c317c522ad6b2912e6cbcd7cc8cd51c3 Mon Sep 17 00:00:00 2001
From: Nicolai Ort
Date: Sat, 21 Mar 2026 12:52:23 +0100
Subject: [PATCH] docs(day-2): Added telemetry talk

---
 ...bility.md => 04_operator-estensibility.md} |  0
 content/day-2/05_selvimproving.md             | 68 +++++++++++++++++++
 content/day-2/_index.md                       |  3 +-
 3 files changed, 70 insertions(+), 1 deletion(-)
 rename content/day-2/{04_operator estensibility.md => 04_operator-estensibility.md} (100%)
 create mode 100644 content/day-2/05_selvimproving.md

diff --git a/content/day-2/04_operator estensibility.md b/content/day-2/04_operator-estensibility.md
similarity index 100%
rename from content/day-2/04_operator estensibility.md
rename to content/day-2/04_operator-estensibility.md
diff --git a/content/day-2/05_selvimproving.md b/content/day-2/05_selvimproving.md
new file mode 100644
index 0000000..3946526
--- /dev/null
+++ b/content/day-2/05_selvimproving.md
@@ -0,0 +1,68 @@
+---
+title: "The self-improving platform: Closing the Loop Between Telemetry and Tuning"
+weight: 5
+tags:
+  - rejekts
+  - telemetry
+---
+
+
+
+TODO: Copy repo link for samples
+
+The statistics in this talk are based on a survey of multiple companies, focused on ones that build and run applications
+
+## Baseline
+
+- Usually the golden path for devs only goes up to deploying their app and does not cover day-2 operations/monitoring
+- Most platform teams just provide the metrics and basic dashboards but no alerts or key health indicators
+
+## Observations regarding stakeholders
+
+- Stakeholders
+  - ~43% of companies have a dedicated platform team, the rest have a mixed team/shared effort
+  - Only ~18% have a dedicated SRE team that couples applications to platforms
+- Ownership: over 50% of companies use a shared ownership model -> "Not my problem"
+- Priorities
+  - Product team: Ship features fast (a dollar spent on R&D is worth more than one saved)
+  - SRE: Keep everything up (an hour of uptime is worth more than the cost of a buffer)
+  - FinOps: Reduce the bill (a dollar wasted is a dollar stolen from R&D)
+- Conflict: Cost saving (FinOps) vs safety (SRE) when it comes to overprovisioning
+- 75% of interviewees use Kubernetes, with over 50% using the JVM as the runtime
+
+## Pain points
+
+- Main focus: Cost vs performance
+- Side note: Reliability
+- Result: We need a flexible path that can discern between
+  - User-facing apps: Performance first
+  - Critical apps: Reliability first
+  - Non-critical apps: Reduce cost
+
+## Optimization
+
+
+- Tuning: Only 18% are tuning their container and runtime
+- We need a full-stack approach:
+  - Don't just increase pod resources but also update things like the heap size in your runtime
+  - Use HPA to scale once you have right-sized your pod and runtime
+  - Get to know your per-node usage to improve node autoscaling
+
+## Building a continuous automation layer
+
+- Telemetry: Import metrics
+- Analysis with tuning profiles (historic data) for optimizations
+- GitOps for automatic PR creation and previews
+- Sample architecture:
+  - Import: OTEL metrics into Prometheus
+  - Visualize: Grafana
+  - Analyze: CronJob that collects the last 30 minutes of metrics
+  - Optimize: Run the analyzed metrics against policies (like "I want 20% headroom for memory") that then act and create PRs (they did this through OPA)
+
+TODO: Steal image from slides
+
+## Wrap-up
+
+- Automated optimization with a human in the loop to keep the experts in touch and enable fast but secure changes
+- Optimization should be an invisible platform capability (like Renovate/Dependabot for dependencies)
+- Optimization is a domino effect: The right foundations enable better future decisions
\ No newline at end of file
diff --git a/content/day-2/_index.md b/content/day-2/_index.md
index c75184f..3e9a5f1 100644
--- a/content/day-2/_index.md
+++ b/content/day-2/_index.md
@@ -16,7 +16,8 @@ I have to admit that I'm very bad with names and don't always regocnize people b
 
 ## Talk recommendations
 
-- If you're building operators: [Solving Operator Extensibility: A gRPC Plugin Framework for kubernetes](./04_operator%20estensibility)
+- If you're building operators: [Solving Operator Extensibility: A gRPC Plugin Framework for kubernetes](./04_operator-estensibility)
+- The idea behind [The self-improving platform: Closing the Loop Between Telemetry and Tuning](./05_selvimproving) is very interesting, but the first half of the talk is kinda confusing as it discusses a study that could have been shortened drastically. But the way they automatically create PRs for resource utilization is cool
 
 ## Other stuff I learned or people i talk to