---
title: "The self-improving platform: Closing the Loop Between Telemetry and Tuning"
weight: 5
tags:
  - rejekts
  - telemetry
---

{{% button href="https://github.com/graz-dev/automatic-reosurce-optimization-loop" style="info" icon="code" %}}Code/Demo{{% /button %}}

The statistics in this talk are based on a survey of multiple companies, focused on those that build and run applications.

## Baseline

- Usually the golden path for devs only goes up to deploying their app, not day 2/monitoring
- Most platform teams just provide the metrics and basic dashboards but no alerts or key health indicators

## Observations regarding stakeholders

- Stakeholders
  - ~43% of companies have a dedicated platform team, the rest have a mixed team/shared effort
  - Only ~18% have a dedicated SRE team that couples applications to platforms
- Ownership: over 50% of companies use a shared ownership model -> Not my problem
- Priorities
  - Product team: Ship features fast (a dollar spent on R&D is worth more than one saved)
  - SRE: Keep everything up (an hour of uptime is worth more than the cost of a buffer)
  - FinOps: Reduce the bill (a dollar wasted is a dollar stolen from R&D)
- Conflict: Cost saving (FinOps) vs safety (SRE) when it comes to overprovisioning
- 75% of interviewees use Kubernetes, with over 50% using the JVM as the runtime

## Pain points

- Main focus: Cost vs performance
- Side note: Reliability
- Result: We need a flexible path that can discern between
  - User-facing apps: Performance first
  - Critical apps: Reliability first
  - Non-critical apps: Reduce cost

## Optimization

- Tuning: Only 18% are tuning their container and runtime
- We need a full-stack approach:
  - Don't just increase pod resources but also update things like the heap size in your runtime
  - Use HPA to scale if you already right-sized your pod+runtime
  - Get to know your per-node usage to improve node autoscaling

## Building a continuous automation layer

- Telemetry: Import metrics
- Analysis with tuning profiles (historic data) for optimizations
- GitOps for automatic PR creation and previews
- Sample architecture:
  - Import: OTEL metrics into Prometheus
  - Visualize: Grafana
  - Analyze: CronJob that collects the last 30 mins of metrics
  - Optimize: Run the analyzed metrics against policies (like "I want 20% headroom for memory") that then act and create PRs (they did this through OPA)

TODO: Steal image from slides

## Wrap-up

- Automated optimization with a human in the loop to keep the experts in touch and enable fast but secure changes
- Optimization should be an invisible platform capability (like Renovate/Dependabot for dependencies)
- Optimization is a domino effect: The right foundations enable better future decisions
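
To make the "Optimize" step of the sample architecture more concrete, here is a minimal sketch of a memory headroom policy combined with the full-stack idea of adjusting the JVM heap together with the pod. It is plain Python standing in for the policy engine (the talk used OPA), not the speakers' implementation. The function name `recommend_memory`, the 75% heap share, the 10% drift tolerance, and the reading of "20% headroom" as "peak usage plus 20%" are all assumptions made for illustration.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class Recommendation:
    memory_request_mib: int   # proposed pod memory request
    jvm_max_heap_flag: str    # matching JVM heap flag, e.g. "-Xmx720m"
    open_pr: bool             # whether the change is large enough to propose


def recommend_memory(
    samples_bytes: list[int],
    current_request_mib: int,
    headroom_pct: int = 20,    # policy from the talk: 20% headroom for memory
    heap_pct: int = 75,        # assumption: share of the request given to the heap
    tolerance_pct: int = 10,   # assumption: ignore drift smaller than this
) -> Recommendation:
    """Turn a window of memory usage samples into a right-sizing proposal."""
    if not samples_bytes:
        raise ValueError("no samples in the analysis window")
    if current_request_mib <= 0:
        raise ValueError("current request must be positive")

    peak_bytes = max(samples_bytes)

    # Peak usage plus headroom, rounded up to whole MiB (integer ceil division).
    numerator = peak_bytes * (100 + headroom_pct)
    target_mib = -(-numerator // (100 * MIB))

    # Keep the runtime in step with the pod: the heap is a fixed share of the request.
    heap_mib = target_mib * heap_pct // 100

    # Only propose a PR when the request drifts beyond the tolerance.
    drift = abs(target_mib - current_request_mib)
    open_pr = drift * 100 > tolerance_pct * current_request_mib

    return Recommendation(target_mib, f"-Xmx{heap_mib}m", open_pr)
```

With samples peaking at 800 MiB and a current request of 2048 MiB, this proposes a 960 MiB request with `-Xmx720m` and flags the change for a PR. With a current request of 1000 MiB the drift stays within tolerance and no PR is proposed, which leaves the human in the loop to review only meaningful changes.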