docs(day-2): Added telemetry talk

2026-03-21 12:52:23 +01:00
parent 0129ca03ad
commit 8b1edb32c3
3 changed files with 70 additions and 1 deletions
@@ -0,0 +1,68 @@
+---
+title: "The self-improving platform: Closing the Loop Between Telemetry and Tuning"
+weight: 5
+tags:
+ - rejekts
+ - telemetry
+---
+
+<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
+<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
+TODO: Copy repo link for samples
+
+The statistics in this talk are based on a survey of multiple companies, focused on those that build and run applications.
+
+## Baseline
+
+- Usually the golden path for devs only goes up to deploying their app and does not cover day 2/monitoring
+- Most platform teams just provide the metrics and basic dashboards but no alerts or key health indicators
+
+## Observations regarding stakeholders
+
+- Stakeholders
+  - ~43% of companies have a dedicated platform team, the rest have a mixed team/shared effort
+  - only ~18% have a dedicated SRE team that couples application to platforms
+- Ownership: over 50% of companies use a shared ownership model -> "Not my problem"
+- Priorities
+  - Product team: Ship features fast (a dollar spent on R&D is worth more than one saved)
+  - SRE: Keep everything up (an hour of uptime is worth more than the cost of a buffer)
+  - FinOps: Reduce the bill (a dollar wasted is a dollar stolen from R&D)
+- Conflict: Cost saving (FinOps) vs safety (SRE) when it comes to overprovisioning
+- 75% of interviewees use Kubernetes, with over 50% using the JVM as their runtime
+
+## Pain points
+
+- Main focus: Cost vs performance
+- Side note: Reliability
+- Result: We need a flexible path that can distinguish between
+  - User-facing apps: Performance first
+  - Critical apps: Reliability first
+  - Non-critical apps: Reduce cost
+
+## Optimization
+
+
+- Tuning: Only 18% tune their container and runtime settings
+- We need a full stack approach:
+  - Don't just increase pod resources but also update things like the heap size in your runtime
+  - Use HPA to scale once you have right-sized your pod and runtime
+  - Get to know your per-node usage to improve node autoscaling
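
As a rough illustration of the first point, the sketch below derives a JVM heap setting from the container memory limit, so that raising the pod limit also raises the heap. The function name, the 75% heap share, and the example limits are assumptions made for illustration and are not values from the talk.

```python
def jvm_heap_flag(container_limit_mib: int, heap_share: float = 0.75) -> str:
    """Return an -Xmx flag sized as a share of the container memory limit.

    heap_share is an assumed value; the remainder of the limit is left for
    metaspace, thread stacks, and other non-heap memory.
    """
    if container_limit_mib <= 0:
        raise ValueError("container_limit_mib must be positive")
    if not 0 < heap_share < 1:
        raise ValueError("heap_share must be between 0 and 1")
    heap_mib = int(container_limit_mib * heap_share)
    return f"-Xmx{heap_mib}m"


if __name__ == "__main__":
    # Doubling the pod limit from 1 GiB to 2 GiB doubles the heap as well.
    print(jvm_heap_flag(1024))
    print(jvm_heap_flag(2048))
```

On recent JVMs, a percentage-based flag such as `-XX:MaxRAMPercentage` can achieve a similar coupling without recomputing an absolute value.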
+
+## Building a continuous automation layer
+
+- Telemetry: Import metrics
+- Analysis with tuning profiles (historic data) for optimizations
+- GitOps for automatic PR creation and previews
+- Sample Architecture:
+  - Import: OTel metrics into Prometheus
+  - Visualize: Grafana
+  - Analyze: CronJob that collects the last 30 minutes of metrics
+  - Optimize: Run the analyzed metrics against policies (like "I want 20% headroom for memory") that then act and create PRs (they did this through OPA)
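
The "Optimize" step above can be sketched roughly as follows. This is plain Python standing in for the policy engine (the talk used OPA, which this does not reproduce). Using the peak of the window as the baseline, the 10% tolerance, and all names here are assumptions for illustration; only the 20% memory headroom comes from the notes.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Recommendation:
    current_mib: int
    proposed_mib: int


def recommend_memory(
    samples_mib: Sequence[float],
    current_mib: int,
    headroom_percent: int = 20,   # policy from the notes: 20% memory headroom
    tolerance_percent: int = 10,  # assumed: skip changes smaller than 10%
) -> Optional[Recommendation]:
    """Return a proposed memory request, or None if no change is needed."""
    if not samples_mib:
        return None  # no data in the analysis window, nothing to propose
    # Assumed baseline: peak usage in the window plus the policy headroom.
    target = math.ceil(max(samples_mib) * (100 + headroom_percent) / 100)
    if abs(target - current_mib) * 100 <= current_mib * tolerance_percent:
        return None  # close enough, avoid noisy PRs
    return Recommendation(current_mib=current_mib, proposed_mib=target)


if __name__ == "__main__":
    # Memory samples (MiB) as they might come out of the analysis step.
    print(recommend_memory([410, 455, 500], current_mib=1024))
```

A non-`None` result would be the trigger for opening a PR against the GitOps repository. Returning `None` for small deviations keeps the number of automated PRs manageable for the humans reviewing them.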
+
+TODO: Steal image from slides
+
+## Wrap-up
+
+- Automated optimization with human in the loop to keep the experts in touch and enable fast but secure changes
+- Optimization should be an invisible platform capability (like Renovate/Dependabot for dependencies)
+- Optimization is a domino effect: The right foundations enable better future decisions
@@ -16,7 +16,8 @@ I have to admit that I'm very bad with names and don't always regocnize people b

 ## Talk recommendations

- If you're building operators: [Solving Operator Extensibility: A gRPC Plugin Framework for kubernetes](./04_operator%20estensibility)
+- If you're building operators: [Solving Operator Extensibility: A gRPC Plugin Framework for kubernetes](./04_operator-estensibility)
+- The idea behind [The self-improving platform: Closing the Loop Between Telemetry and Tuning](./05_selvimproving) is very interesting, but the first half of the talk is kinda confusing because it discusses a study that could have been shortened drastically. Still, the way they automatically create PRs for resource utilization is cool

 ## Other stuff I learned or people i talk to