docs(day-2): Added telemetry talk
2026-03-21 12:52:23 +01:00
parent 0129ca03ad
commit 8b1edb32c3
3 changed files with 70 additions and 1 deletions

@@ -0,0 +1,68 @@
---
title: "The self-improving platform: Closing the Loop Between Telemetry and Tuning"
weight: 5
tags:
- rejekts
- telemetry
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
TODO: Copy repo link for samples

The statistics in this talk are based on a survey of multiple companies, focused on ones that build and run applications.
## Baseline
- Usually the golden path for devs only goes up to deploying their app and does not cover day-2 operations/monitoring
- Most platform teams just provide the metrics and basic dashboards but no alerts or key health indicators
## Observations regarding stakeholders
- Stakeholders
  - ~43% of companies have a dedicated platform team, the rest have a mixed team/shared effort
  - Only ~18% have a dedicated SRE team that connects applications to the platform
- Ownership: over 50% of companies use a shared ownership model -> "Not my problem"
- Priorities
  - Product team: Ship features fast (a dollar spent on R&D is worth more than a dollar saved)
  - SRE: Keep everything up (an hour of uptime is worth more than the cost of a buffer)
  - FinOps: Reduce the bill (a dollar wasted is a dollar stolen from R&D)
- Conflict: Cost saving (FinOps) vs safety (SRE) when it comes to overprovisioning
- 75% of interviewees use Kubernetes, with over 50% using the JVM as the runtime
## Pain points
- Main focus: Cost vs performance
- Side-note: Reliability
- Result: We need a flexible path that can distinguish between
  - User-facing apps: Performance first
  - Critical apps: Reliability first
  - Non-critical apps: Reduce cost
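
The three priorities above could be encoded as a lookup that an automation layer consults before it tunes a workload. This is only a minimal sketch in Python: the `AppClass` names and the `tuning_priority` helper are hypothetical and were not shown in the talk.

```python
from enum import Enum


class AppClass(Enum):
    """Hypothetical workload classes, named after the three cases in the notes."""
    USER_FACING = "user-facing"
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"


# Which goal wins when cost, performance and reliability conflict.
PRIORITY_BY_CLASS = {
    AppClass.USER_FACING: "performance",
    AppClass.CRITICAL: "reliability",
    AppClass.NON_CRITICAL: "cost",
}


def tuning_priority(app_class: AppClass) -> str:
    """Return the optimization goal for a workload class."""
    return PRIORITY_BY_CLASS[app_class]


print(tuning_priority(AppClass.NON_CRITICAL))
```

How a workload gets its class (a label, an annotation, a service catalog entry) is left open here on purpose.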
## Optimization
- Tuning: Only 18% are tuning their containers and runtime
- We need a full stack approach:
  - Don't just increase pod resources but also update things like the heap size in your runtime
  - Use the HPA to scale once you have right-sized your pod and runtime
  - Get to know your per-node usage to improve node autoscaling
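
To illustrate the full-stack point above: when a container memory limit changes, the JVM heap setting should move with it. The following is a minimal sketch, assuming a fixed share of the limit goes to the heap. The 75% default and the helper name `jvm_heap_flag` are illustrative assumptions, not numbers or code from the talk.

```python
def jvm_heap_flag(container_limit_mib: int, heap_fraction: float = 0.75) -> str:
    """Derive a -Xmx flag from the container memory limit.

    heap_fraction is an assumed share of the limit reserved for the heap;
    the remainder is left for metaspace, thread stacks and other
    off-heap memory.
    """
    if container_limit_mib <= 0:
        raise ValueError("container_limit_mib must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    heap_mib = int(container_limit_mib * heap_fraction)
    return f"-Xmx{heap_mib}m"


# Raising the pod limit from 1 GiB to 2 GiB also raises the heap.
print(jvm_heap_flag(1024))
print(jvm_heap_flag(2048))
```

HotSpot's `-XX:MaxRAMPercentage` flag achieves a similar effect by letting the JVM derive its heap size from the container limit itself, which avoids keeping two numbers in sync.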
## Building a continuous automation layer
- Telemetry: Import metrics
- Analysis with tuning profiles (historic data) for optimizations
- GitOps for automatic PR creation and previews
- Sample Architecture:
  - Import: OTEL metrics into Prometheus
  - Visualize: Grafana
  - Analyze: CronJob that collects the last 30 minutes of metrics
  - Optimize: Run the analyzed metrics against policies (like "I want 20% headroom for memory") that then act and create PRs (they did this through OPA)

TODO: Steal image from slides
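
The "Optimize" step above can be sketched as a small policy check. The speakers implemented their policies in OPA, so the Python below is a stand-in that only shows the idea, not their implementation. It reads "20% headroom" as "peak usage plus 20%", which is an assumption, and the 10% tolerance that suppresses tiny PRs as well as all names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Recommendation:
    current_mib: int
    proposed_mib: int
    reason: str


def recommend_memory(
    samples_mib: Sequence[float],
    current_mib: int,
    headroom_percent: int = 20,   # policy from the notes: 20% memory headroom
    tolerance_percent: int = 10,  # assumed: ignore changes smaller than this
) -> Optional[Recommendation]:
    """Return a proposed memory request, or None if no PR is warranted."""
    if not samples_mib:
        return None  # no data in the window, nothing to decide
    peak = max(samples_mib)
    proposed = math.ceil(peak * (100 + headroom_percent) / 100)
    # Skip the PR when the change is within the tolerance band.
    if abs(proposed - current_mib) * 100 <= current_mib * tolerance_percent:
        return None
    reason = f"peak {peak:.0f} MiB plus {headroom_percent}% headroom"
    return Recommendation(current_mib, proposed, reason)


# Samples stand in for the last 30 minutes of usage metrics.
print(recommend_memory([400, 500, 450], current_mib=1024))
```

A non-`None` result would be handed to the GitOps step, which opens a PR with the new value so that a human can review it.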
## Wrap-up
- Automated optimization with a human in the loop to keep the experts in touch and enable fast but secure changes
- Optimization should be an invisible platform capability (like Renovate/Dependabot for dependencies)
- Optimization is a domino effect: The right foundations enable better future decisions

@@ -16,7 +16,8 @@ I have to admit that I'm very bad with names and don't always regocnize people b
## Talk recommendations
- If you're building operators: [Solving Operator Extensibility: A gRPC Plugin Framework for kubernetes](./04_operator%20estensibility)
- If you're building operators: [Solving Operator Extensibility: A gRPC Plugin Framework for kubernetes](./04_operator-estensibility)
- The idea behind [The self-improving platform: Closing the Loop Between Telemetry and Tuning](./05_selvimproving) is very interesting, but the first half of the talk is kinda confusing as it discusses a study that could have been shortened drastically. But the way they automatically create PRs for resource utilization is cool
## Other stuff I learned or people I talked to