diff --git a/content/day0/14_finops.md b/content/day0/14_finops.md
new file mode 100644
index 0000000..50adcc3
--- /dev/null
+++ b/content/day0/14_finops.md
@@ -0,0 +1,61 @@
+---
+title: "When Platform Engineers Lead FinOps: Driving Reliability and $20M in Savings"
+weight: 14
+tags:
+  - platformengineeringday
+  - finops
+  - legacy
+---
+
+{{% button href="https://colocatedeventseu2026.sched.com/event/2DY3O" style="error" icon="calendar" %}}Sched Link{{% /button %}}
+
+A case study from Expedia about FinOps.
+
+## The cost-reliability disconnect
+
+- Background: Modern infrastructure is complex and large (1000s of clusters, multi-region, ...) with huge operational responsibilities (SLAs, SLOs, scalability, ...)
+- Platform team: Reliability, performance, stability
+- FinOps team: Cloud resource reduction, budget adherence, efficiency
+- Problem: Conflicting goals, and the teams are often organizationally separated
+  - Blind cost optimization can lead to unintentional stability/performance problems that can quickly spiral
+  - Blind stability optimization quickly leads to large overhead/overprovisioning and huge costs
+
+## Patterns
+
+- Establish views & baselines: Understand cost per cluster/workload and utilization patterns
+- Revisit legacy: Old configs like static sizing, huge buffers, ...
+- Embrace rearchitecture without fear: Consolidation, instance optimization, and infra redesign should all be on the table
+
+### Views & baselines
+
+> General recommendations
+
+- **Problem:** Lack of cost attribution for shared infrastructure
+- **Problem:** Lack of insight into which clusters are generating costs
+- **Problem:** No transparency into which teams are consuming resources
+- **Solution:** Bring the teams that generate costs together with visibility into those costs
+- **Solution:** Identify a safe operating range that wraps the "optimal zone" with a buffer for over- and underutilization -> baseline for automatic scaling
+
+### Revisiting legacy
+
+> General recommendations
+
+- **Problem children:** Idle clusters ("just in case I need one fast"), oversized compute (safety buffers overdone), and underutilized clusters
+- **Challenge:** No one wants to touch a running system
+
+1. Analyze historical utilization (identify spikes/traffic patterns)
+2. Identify safe optimization opportunities
+3. Roll out changes gradually
+
+### Rearchitecture without fear
+
+> What they did in their legacy systems
+
+- Find out whether your workloads actually need the currently selected node types
+- Optimize jobs into batches
+- Even if the size is right: Check whether you can switch to newer nodes with better price-to-performance
+- Customize autoscaling with tools like KEDA to scale on actual load instead of diffuse side effects
\ No newline at end of file
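
The "safe operating range" baseline from the views & baselines section can be sketched in a few lines of Python. The percentile cut-offs and buffer factor below are illustrative assumptions, not values from the talk:

```python
# Sketch: derive a safe operating range from historical utilization
# samples (fractions between 0 and 1). Percentiles and buffer are
# illustrative assumptions, not Expedia's actual parameters.

def safe_operating_range(samples, low_pct=10, high_pct=95, buffer=0.10):
    """Return (lower, upper) utilization bounds for autoscaling.

    The "optimal zone" is taken as the spread between the low and high
    percentiles of observed utilization; the buffer widens it so that
    normal jitter does not trigger scaling churn.
    """
    xs = sorted(samples)

    def pct(p):
        # nearest-rank percentile, good enough for a sketch
        idx = min(len(xs) - 1, max(0, round(p / 100 * (len(xs) - 1))))
        return xs[idx]

    lower = max(0.0, pct(low_pct) * (1 - buffer))
    upper = min(1.0, pct(high_pct) * (1 + buffer))
    return lower, upper

# Example: a day of 5-minute CPU samples with a short traffic spike
import random
random.seed(42)
samples = [random.uniform(0.2, 0.5) for _ in range(280)] + [0.85] * 8
lo, hi = safe_operating_range(samples)
```

Utilization outside the returned range would then be a candidate trigger for scaling up or down, while normal variation inside the range is ignored.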