---
title: "The self-improving platform: Closing the Loop Between Telemetry and Tuning"
weight: 5
tags:
  - rejekts
  - telemetry
---

<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
{{% button href="https://github.com/graz-dev/automatic-reosurce-optimization-loop" style="info" icon="code" %}}Code/Demo{{% /button %}}

The statistics in this talk are based on a survey of multiple companies, focused on those that build and run applications.

## Baseline

- Usually the golden path for devs only goes up to deploying their app and does not cover day-2 operations/monitoring
- Most platform teams just provide the metrics and basic dashboards but no alerts or key health indicators

## Observations regarding stakeholders

- Stakeholders
  - ~43% of companies have a dedicated platform team, the rest have a mixed team/shared effort
  - Only ~18% have a dedicated SRE team that couples applications to platforms
  - Ownership: over 50% of companies use a shared ownership model -> "Not my problem"
- Priorities
  - Product team: Ship features fast (a dollar spent on R&D is worth more than one saved)
  - SRE: Keep everything up (an hour of uptime is worth more than the cost of a buffer)
  - FinOps: Reduce the bill (a dollar wasted is a dollar stolen from R&D)
  - Conflict: Cost saving (FinOps) vs safety (SRE) when it comes to overprovisioning
- 75% of interviewees use Kubernetes, with over 50% using the JVM as the runtime

## Pain points

- Main focus: Cost vs performance
- Side-note: Reliability
- Result: We need a flexible path that can discern between
  - User-facing apps: Performance first
  - Critical apps: Reliability first
  - Non-critical apps: Reduce cost

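
The three app classes above could be encoded as a small lookup that an optimization job consults before touching a workload. This is only a sketch: the class names and the idea of reading the class from a workload label are assumptions for illustration, while the class-to-priority mapping is the one from the talk.

```python
from enum import Enum


class AppClass(Enum):
    """App classes from the list above; the string values are hypothetical label values."""
    USER_FACING = "user-facing"
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"


# What each class optimizes for first, as stated in the talk.
PRIORITY = {
    AppClass.USER_FACING: "performance",
    AppClass.CRITICAL: "reliability",
    AppClass.NON_CRITICAL: "cost",
}


def optimization_goal(app_class_label: str) -> str:
    """Map a workload's class label to its primary optimization goal.

    Raises ValueError for an unknown label so that unclassified apps
    are never tuned by accident.
    """
    return PRIORITY[AppClass(app_class_label)]
```

Failing loudly on unknown labels is a deliberate choice here: an app without a declared class should be skipped rather than silently treated as non-critical.
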
## Optimization

- Tuning: Only 18% are tuning their container and runtime
- We need a full-stack approach:
  - Don't just increase pod resources but also update things like the heap size in your runtime
  - Use HPA to scale once you have right-sized your pod and runtime
  - Get to know your per-node usage to improve node autoscaling

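
The "pod plus runtime" point can be illustrated with a helper that derives the JVM heap setting from the container memory limit, so both change together. This is a minimal sketch, not the speakers' tooling: the function names and the 75% heap fraction are assumptions, and the right fraction depends on how much non-heap memory (metaspace, threads, direct buffers) the app needs.

```python
def heap_size_mib(container_limit_mib: int, heap_fraction: float = 0.75) -> int:
    """Return a heap size in MiB that leaves room for non-heap JVM memory.

    heap_fraction is an assumed default, not a recommendation from the talk.
    """
    if container_limit_mib <= 0:
        raise ValueError("container_limit_mib must be positive")
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    return int(container_limit_mib * heap_fraction)


def right_size(container_limit_mib: int, heap_fraction: float = 0.75) -> dict:
    """Produce the container limit and the matching JVM option as one change."""
    heap = heap_size_mib(container_limit_mib, heap_fraction)
    return {
        # Kubernetes memory quantity for resources.limits.memory
        "memory_limit": f"{container_limit_mib}Mi",
        # Value for the JAVA_TOOL_OPTIONS environment variable
        "java_tool_options": f"-Xmx{heap}m",
    }


if __name__ == "__main__":
    print(right_size(1024))
```

An alternative under the same assumption is to set `-XX:MaxRAMPercentage` once and let the JVM follow the container limit, which avoids recomputing `-Xmx` on every resize.
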
## Building a continuous automation layer

- Telemetry: Import metrics
- Analysis with tuning profiles (historic data) for optimizations
- GitOps for automatic PR creation and previews
- Sample architecture:
  - Import: OTel metrics into Prometheus
  - Visualize: Grafana
  - Analyze: CronJob that collects the last 30 minutes of metrics
  - Optimize: Run the analyzed metrics against policies (e.g. "I want 20% headroom for memory") that then act and create PRs (they did this through OPA)

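
The analyze and optimize steps above can be sketched as a headroom check over a window of memory samples. The talk implemented its policies in OPA; the Python below is a stand-in to show the arithmetic, not their implementation. The function names, the use of the window's peak as the baseline, and the 10% tolerance before a PR is proposed are all assumptions.

```python
from typing import Optional, Sequence

MIB = 1024 * 1024


def recommend_memory_request_mib(samples_bytes: Sequence[int], headroom_percent: int = 20) -> int:
    """Peak usage in the window plus headroom, rounded up to whole MiB.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    if not samples_bytes:
        raise ValueError("no samples in the analysis window")
    peak = max(samples_bytes)
    # Ceiling division: -(-a // b) rounds up for positive integers.
    return -(-peak * (100 + headroom_percent) // (100 * MIB))


def propose_change(
    current_request_mib: int,
    samples_bytes: Sequence[int],
    headroom_percent: int = 20,
    tolerance: float = 0.10,  # assumed: ignore drift below 10% to avoid noisy PRs
) -> Optional[int]:
    """Return the new request in MiB if it drifts beyond tolerance, else None."""
    recommended = recommend_memory_request_mib(samples_bytes, headroom_percent)
    drift = abs(recommended - current_request_mib) / current_request_mib
    return recommended if drift > tolerance else None


if __name__ == "__main__":
    # Hypothetical 30-minute window with a 500 MiB peak
    window = [400 * MIB, 500 * MIB, 450 * MIB]
    print(propose_change(1024, window))
```

A returned value would then feed the GitOps step (opening a PR with the new request), while `None` means the workload already satisfies the policy and no PR is needed.
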
TODO: Steal image from slides

## Wrap-up

- Automated optimization with a human in the loop keeps the experts in touch and enables fast but secure changes
- Optimization should be an invisible platform capability (like Renovate/Dependabot for dependencies)
- Optimization is a domino effect: The right foundations enable better future decisions