---
title: "The self-improving platform: Closing the Loop Between Telemetry and Tuning"
weight: 5
tags:
 - rejekts
 - telemetry
---

<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
{{% button href="https://github.com/graz-dev/automatic-reosurce-optimization-loop" style="info" icon="code" %}}Code/Demo{{% /button %}}

The statistics in this talk are based on a survey of multiple companies, focused on those that build and run applications.

## Baseline

- Usually the golden path for devs only goes up to deploying their app, not day-2 operations/monitoring
- Most platform teams just provide the metrics and basic dashboards but no alerts or key health indicators

## Observations regarding stakeholders

- Stakeholders
  - ~43% of companies have a dedicated platform team, the rest have a mixed team/shared effort
  - Only ~18% have a dedicated SRE team that connects applications to the platform
- Ownership: over 50% of companies use a shared ownership model -> "Not my problem"
- Priorities
  - Product Team: Ship features fast (a dollar spent on R&D is worth more than a dollar saved)
  - SRE: Keep everything up (an hour of uptime is worth more than the cost of a buffer)
  - FinOps: Reduce the bill (a dollar wasted is a dollar stolen from R&D)
- Conflict: Cost saving (FinOps) vs Safety (SRE) when it comes to overprovisioning
- 75% of interviewees use Kubernetes, with over 50% using the JVM as their runtime

## Pain points

- Main focus: Cost vs performance
- Side-note: Reliability
- Result: We need a flexible path that can distinguish between
  - User facing app: Performance first
  - Critical app: Reliability first
  - Non-critical apps: Reduce cost
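
One way to make that distinction explicit is a small lookup from an app's class to the goal its tuning should optimize for. A minimal sketch in Python, assuming a hypothetical `app-class` label on the workload; the label name, its values, and the reliability-first fallback for unlabeled apps are illustrative assumptions, not something shown in the talk:

```python
from typing import Mapping

# Tuning goal per app class, following the three classes listed above.
GOAL_BY_CLASS = {
    "user-facing": "performance",
    "critical": "reliability",
    "non-critical": "cost",
}

# Assumption: unlabeled or unknown apps are treated as critical,
# so the loop never trims resources from something it cannot classify.
DEFAULT_CLASS = "critical"


def tuning_goal(labels: Mapping[str, str]) -> str:
    """Return the tuning goal for a workload from its (hypothetical) app-class label."""
    app_class = labels.get("app-class", DEFAULT_CLASS)
    return GOAL_BY_CLASS.get(app_class, GOAL_BY_CLASS[DEFAULT_CLASS])
```

Falling back to the most conservative goal is a design choice: a misclassified app then costs a bit more instead of losing its safety buffer.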

## Optimization

- Tuning: Only 18% tune both their container and their runtime
- We need a full stack approach:
  - Don't just increase pod resources but also update things like the heap size in your runtime
  - Use HPA to scale once you have right-sized your pod and runtime
  - Get to know your per-node usage to improve node autoscaling
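
To illustrate why pod resources and runtime settings have to move together, here is a minimal Python sketch that derives a container memory limit and a matching JVM `-Xmx` value from the same observed peak. The function name, the 256 MiB non-heap allowance (metaspace, thread stacks and similar), and the 20% headroom default are assumptions for illustration; real values depend on the workload.

```python
def size_pod_and_heap(peak_heap_mib: int,
                      non_heap_mib: int = 256,
                      headroom_pct: int = 20) -> dict:
    """Derive the container memory limit and the JVM max heap from one measurement."""
    if peak_heap_mib <= 0 or non_heap_mib < 0 or headroom_pct < 0:
        raise ValueError("sizes must be positive and headroom non-negative")

    # Heap = observed peak plus headroom; integer ceiling division avoids float rounding.
    heap_mib = -(-peak_heap_mib * (100 + headroom_pct) // 100)

    # The container must also fit non-heap JVM memory (assumed allowance).
    limit_mib = heap_mib + non_heap_mib

    return {
        "memory_limit": f"{limit_mib}Mi",          # Kubernetes resource quantity
        "java_tool_options": f"-Xmx{heap_mib}m",   # JVM max heap flag
    }
```

For example, `size_pod_and_heap(1000)` yields a `1456Mi` limit with `-Xmx1200m`. Raising only the limit would leave the heap capped at its old size, and raising only the heap would risk the container being OOM-killed.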

## Building a continuous automation layer

- Telemetry: Import metrics
- Analysis with tuning profiles (historic data) for optimizations
- GitOps for automatic PR creation and previews
- Sample Architecture:
  - Import: OTel metrics into Prometheus
  - Visualize: Grafana
  - Analyze: CronJob that collects the last 30 minutes of metrics
  - Optimize: Run the analyzed metrics against policies (e.g. "I want 20% headroom for memory") that then act and create PRs (they did this through OPA)
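
The analyze and optimize steps can be sketched as a pure function that takes the collected samples plus a policy and returns a recommendation, or nothing if no PR is warranted. This is a minimal sketch in plain Python standing in for the OPA policies used in the talk; the `Policy` fields, the `min_change_pct` noise threshold, and the function name are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Policy:
    headroom_pct: int = 20    # e.g. "I want 20% headroom for memory"
    min_change_pct: int = 10  # assumption: skip small changes to avoid PR noise


def recommend_memory_mib(samples_mib: Sequence[int],
                         current_request_mib: int,
                         policy: Policy = Policy()) -> Optional[int]:
    """Return a new memory request in MiB, or None if no change should be proposed."""
    if not samples_mib or current_request_mib <= 0:
        return None  # no data in the window: do not guess

    peak = max(samples_mib)
    # Peak usage plus headroom, rounded up with integer arithmetic.
    target = -(-peak * (100 + policy.headroom_pct) // 100)

    # Only propose a change if it differs enough from the current request.
    if abs(target - current_request_mib) * 100 < policy.min_change_pct * current_request_mib:
        return None
    return target
```

With samples of 400, 450 and 500 MiB and a current request of 1024 MiB, the sketch recommends 600 MiB. A non-`None` result is what would be turned into a pull request for a human to review.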

TODO: Steal image from slides

## Wrap-up

- Automated optimization with a human in the loop keeps the experts in touch and enables fast but safe changes
- Optimization should be an invisible platform capability (like Renovate/Dependabot for dependencies)
- Optimization is a domino effect: The right foundations enable better future decisions