---
title: "When Platform Engineers Lead FinOps: Driving Reliability and $20M in Savings"
weight: 14
tags:
- platformengineeringday
- finops
- legacy
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
{{% button href="https://colocatedeventseu2026.sched.com/event/2DY3O" style="error" icon="calendar" %}}Sched Link{{% /button %}}
<!-- {{% button href="https://github.com/graz-dev/automatic-reosurce-optimization-loop" style="info" icon="code" %}}Code/Demo{{% /button %}} -->
<!-- {{% button href="https://cloudnativeplatforms.com" style="info" icon="link" %}}Website/Homepage{{% /button %}} -->
A case study from Expedia about FinOps.
## The Cost-Reliability disconnect
- Background: Modern infrastructure is complex and large (1000s of clusters, multi-region, ...) with huge operational responsibilities (SLAs, SLOs, scalability, ...)
- Platform Team: Reliability, Performance, Stability
- FinOps Team: Cloud resource reduction, budget adherence, efficiency
- Problem: Conflicting goals, and the teams are often organizationally separated
- Blind cost optimization can lead to unintentional stability/performance problems that can quickly spiral
- Blind stability optimization quickly leads to large overhead/overprovisioning and huge costs
## Patterns
- Establish views & baselines: Understand cost per cluster/workload and utilization patterns
- Revisit legacy: Old configs like static sizing, huge buffers, ...
- Embrace rearchitecture without fear: Consolidation, instance optimization, and infrastructure redesign should all be on the table
### Views & baselines
> General recommendations
- **Problem:** Lack of cost attribution for shared infrastructure
- **Problem:** Lack of insight into which clusters are generating costs
- **Problem:** No transparency into which teams are consuming resources
- **Solution:** Bring cost generation and cost ownership together, i.e. attribute costs to the teams and workloads that create them
- **Solution:** Identify a safe operating range that wraps the "optimal zone" with a buffer for over- and underutilization -> baseline for automatic scaling
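The "safe operating range" idea can be sketched in a few lines. This is a hypothetical illustration, not the talk's actual implementation: the percentile choice and the 10% buffer are assumptions.

```python
# Hypothetical sketch: derive a "safe operating range" for a workload from
# historical CPU utilization samples (fractions of requested CPU, 0..1).
# The p10-p90 "optimal zone" and the 10% buffer are illustrative choices.
from statistics import quantiles

def safe_operating_range(samples, buffer=0.10):
    """Return (low, high) utilization bounds wrapping the p10-p90
    optimal zone, padded with a buffer for under- and overutilization."""
    qs = quantiles(samples, n=10)   # nine decile cut points
    p10, p90 = qs[0], qs[-1]
    low = max(0.0, p10 - buffer)
    high = min(1.0, p90 + buffer)
    return low, high

# Example: a workload hovering around 40-60% utilization
history = [0.42, 0.55, 0.48, 0.61, 0.39, 0.52, 0.58, 0.45, 0.50, 0.47]
low, high = safe_operating_range(history)
```

Utilization staying inside `(low, high)` means no action; crossing either bound becomes the signal for the automatic scaling loop.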
### Revisiting legacy
> General recommendations
- **Problem children**: Idle clusters ("just in case I need one fast"), oversized compute (overdone safety buffers), and underutilized clusters
- **Challenge**: No one wants to touch a running system
1. Analyze historical utilization (identify spikes/traffic patterns)
2. Identify safe optimization opportunities
3. Roll out changes gradually
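Steps 1 and 2 above can be sketched as a simple filter over utilization history. All names and the 40% peak threshold are illustrative assumptions, not values from the talk:

```python
# Minimal sketch of steps 1-2: flag clusters whose historical *peak*
# utilization never justifies their current size. Using the peak (not the
# mean) ensures downsizing cannot clip a real traffic spike.

def optimization_candidates(utilization_by_cluster, peak_threshold=0.4):
    """utilization_by_cluster: {cluster_name: [hourly CPU utilization 0..1]}.
    Returns clusters that are safe rightsizing candidates."""
    candidates = []
    for name, samples in utilization_by_cluster.items():
        if samples and max(samples) < peak_threshold:
            candidates.append(name)
    return candidates

clusters = {
    "checkout-prod": [0.55, 0.80, 0.62],   # real spikes: leave alone
    "legacy-batch": [0.08, 0.12, 0.10],    # oversized "just in case"
    "staging-idle": [0.02, 0.01, 0.03],    # idle cluster
}
# → ["legacy-batch", "staging-idle"]
```

Step 3 then applies the change to one flagged cluster at a time, watching the safe operating range before moving on.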
### Rearchitecture without fear
> What they did in their legacy systems
- Find out if your current workloads actually need the currently selected node types
- Optimize jobs into batches
- Even if the size is right: Check if you can switch to newer node generations with a better price-to-performance ratio
- Customize autoscaling with tools like KEDA to scale on actual load instead of diffuse side effects
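The KEDA point can be illustrated with a `ScaledObject` that scales on a real load signal (requests per second from Prometheus) rather than CPU side effects. This is a hedged config sketch: workload name, query, and thresholds are all hypothetical.

```yaml
# Illustrative KEDA ScaledObject: scale the (hypothetical) "checkout"
# Deployment on actual request rate instead of CPU utilization.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: checkout-scaler          # hypothetical name
spec:
  scaleTargetRef:
    name: checkout               # Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(http_requests_total{app="checkout"}[2m]))
        threshold: "100"         # target RPS per replica
```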