docs(day-2): Added telemetry talk
All checks were successful
Build latest image / build-container (push) Successful in 55s
68
content/day-2/05_selvimproving.md
Normal file
@@ -0,0 +1,68 @@
---
title: "The self-improving platform: Closing the Loop Between Telemetry and Tuning"
weight: 5
tags:
  - rejekts
  - telemetry
---

<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
TODO: Copy repo link for samples

The statistics in this talk are based on a survey across multiple companies, focused on ones that build and run applications.

## Baseline

- Usually the golden path for devs only covers deploying their app, not day-2 operations/monitoring
- Most platform teams just provide the metrics and basic dashboards but no alerts or key health indicators

## Observations regarding stakeholders

- Stakeholders
  - ~43% of companies have a dedicated platform team, the rest have a mixed team/shared effort
  - Only ~18% have a dedicated SRE team that couples applications to platforms
  - Ownership: over 50% of companies use a shared ownership model -> "Not my problem"
- Priorities
  - Product team: Ship features fast (a dollar spent on R&D is worth more than one saved)
  - SRE: Keep everything up (an hour of uptime is worth more than the cost of a buffer)
  - FinOps: Reduce the bill (a dollar wasted is a dollar stolen from R&D)
  - Conflict: Cost saving (FinOps) vs safety (SRE) when it comes to overprovisioning
- 75% of interviewees use Kubernetes, with over 50% using the JVM as the runtime

## Pain points

- Main focus: Cost vs performance
- Side note: Reliability
- Result: We need a flexible path that can discern between
  - User-facing apps: Performance first
  - Critical apps: Reliability first
  - Non-critical apps: Reduce cost

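
One way to picture the "flexible path" is a small lookup from app class to tuning profile. This is only a sketch and not something shown in the talk: the class names mirror the three categories in the notes, while `TuningProfile`, `profile_for` and the headroom numbers are hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class AppClass(Enum):
    USER_FACING = "user-facing"
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"


@dataclass(frozen=True)
class TuningProfile:
    priority: str           # what the tuning optimizes for first
    memory_headroom: float  # buffer on top of observed peak usage (placeholder)


# Priorities come from the notes; the headroom values are made-up examples.
PROFILES = {
    AppClass.USER_FACING: TuningProfile("performance", 0.30),
    AppClass.CRITICAL: TuningProfile("reliability", 0.40),
    AppClass.NON_CRITICAL: TuningProfile("cost", 0.10),
}


def profile_for(app_class: str) -> TuningProfile:
    """Look up the tuning profile for a label such as 'critical'."""
    return PROFILES[AppClass(app_class)]
```

An unknown label raises a `ValueError` from the enum lookup, so an unclassified app fails loudly instead of silently getting the cheapest profile.
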
## Optimization

- Tuning: Only 18% are tuning their container and runtime
- We need a full-stack approach:
  - Don't just increase pod resources but also update things like the heap size in your runtime
  - Use HPA to scale once you have right-sized your pod + runtime
  - Get to know your per-node usage to improve node autoscaling

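
The "pod resources plus heap size" point can be sketched as a helper that derives the JVM heap flag from the container memory limit, so both always change together. The function name and the 75% heap share are assumptions for illustration, not values from the talk; `-Xmx` is the standard JVM maximum heap flag.

```python
def jvm_heap_settings(memory_limit_mib: int, heap_percent: int = 75) -> dict:
    """Derive a matching JVM heap flag from a container memory limit.

    heap_percent is an assumed default; the remainder is left for
    metaspace, thread stacks and other non-heap memory.
    """
    if memory_limit_mib <= 0:
        raise ValueError("memory_limit_mib must be positive")
    if not 0 < heap_percent < 100:
        raise ValueError("heap_percent must be between 1 and 99")
    heap_mib = memory_limit_mib * heap_percent // 100
    return {
        "memory_limit": f"{memory_limit_mib}Mi",  # Kubernetes-style quantity
        "max_heap": f"-Xmx{heap_mib}m",           # JVM flag for the same pod
        "non_heap_mib": memory_limit_mib - heap_mib,
    }


print(jvm_heap_settings(1024))
# {'memory_limit': '1024Mi', 'max_heap': '-Xmx768m', 'non_heap_mib': 256}
```

Raising the limit to 2048 MiB then yields `-Xmx1536m` instead of leaving the heap at its old size, which is the mismatch the notes warn about.
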
## Building a continuous automation layer

- Telemetry: Import metrics
- Analysis with tuning profiles (historic data) for optimizations
- GitOps for automatic PR creation and previews
- Sample architecture:
  - Import: OTel metrics into Prometheus
  - Visualize: Grafana
  - Analyze: CronJob that collects the last 30 minutes of metrics
  - Optimize: Run the analyzed metrics against policies (like "I want 20% headroom for memory") that then act and create PRs (they did this through OPA)

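
The "20% headroom for memory" policy from the Optimize step could look roughly like the sketch below. The speakers evaluated their policies with OPA; this is a plain Python stand-in, not their implementation. `recommend_memory`, `Recommendation` and the 10% tolerance are hypothetical, and the usage samples are assumed to be already fetched for the analysis window.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Recommendation:
    current_mib: int
    proposed_mib: int


def recommend_memory(
    usage_mib: Sequence[float],
    current_request_mib: int,
    headroom_percent: int = 20,   # policy from the notes: 20% headroom
    tolerance_percent: int = 10,  # assumed: ignore small drifts to avoid PR noise
) -> Optional[Recommendation]:
    """Return a proposed memory request, or None if no change is needed."""
    if current_request_mib <= 0:
        raise ValueError("current_request_mib must be positive")
    if not usage_mib:
        return None  # no data in the window, so no recommendation
    # Target = observed peak plus the configured headroom, rounded up.
    target = math.ceil(max(usage_mib) * (100 + headroom_percent) / 100)
    drift_percent = abs(target - current_request_mib) * 100 / current_request_mib
    if drift_percent <= tolerance_percent:
        return None
    return Recommendation(current_request_mib, target)


print(recommend_memory([400, 450, 500], current_request_mib=1024))
# Recommendation(current_mib=1024, proposed_mib=600)
```

A non-`None` result is the point where the GitOps step would open a PR, which keeps the human in the loop.
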
TODO: Steal image from slides

## Wrap-up

- Automated optimization with a human in the loop to keep the experts in touch and enable fast but secure changes
- Optimization should be an invisible platform capability (like Renovate/Dependabot for dependencies)
- Optimization is a domino effect: The right foundations enable better future decisions

@@ -16,7 +16,8 @@ I have to admit that I'm very bad with names and don't always regocnize people b
## Talk recommendations

- If you're building operators: [Solving Operator Extensibility: A gRPC Plugin Framework for Kubernetes](./04_operator%20estensibility)
- If you're building operators: [Solving Operator Extensibility: A gRPC Plugin Framework for Kubernetes](./04_operator-estensibility)
- The idea behind [The self-improving platform: Closing the Loop Between Telemetry and Tuning](./05_selvimproving) is very interesting, but the first half of the talk is kinda confusing as it discusses a study that could have been shortened drastically. But the way they automatically create PRs for resource utilization is cool

## Other stuff I learned or people I talked to
