---
title: When Platform Engineers Lead FinOps: Driving Reliability and $20M in Savings
weight: 14
tags:
  - platformengineeringday
  - finops
  - legacy
---

{{% button href="https://colocatedeventseu2026.sched.com/event/2DY3O" style="error" icon="calendar" %}}Sched Link{{% /button %}}

A case study from Expedia about FinOps.

## The Cost-Reliability disconnect

- Background: Modern infrastructure is complex and large (1000s of clusters, multi-region, ...) with huge operational responsibilities (SLA, SLO, scalability, ...)
- Platform Team: Reliability, Performance, Stability
- FinOps Team: Cloud resource reduction, budget adherence, efficiency
- Problem: Conflicting goals, and the two teams are often organizationally separated
  - Blind cost optimization can lead to unintentional stability/performance problems that can quickly spiral
  - Blind stability optimization quickly leads to large overhead/overprovisioning and huge costs

## Patterns

- Establish views & baselines: Understand cost per cluster/workload and utilization patterns
- Revisit legacy: Old configs like static sizing, huge buffers, ...
- Embrace rearchitecture without fear: Consolidation, instance optimization, infra redesign should all be on the table

### Views & baselines

> General recommendations

- **Problem:** Lack of cost attribution for shared infrastructure
- **Problem:** Lack of insight into which clusters are generating costs
- **Problem:** No transparency into which teams are consuming resources
- **Solution:** Tie costs back to where they are generated, so every cost can be attributed to the cluster, workload and team that causes it
- **Solution:** Identify a safe operating range that wraps the "optimal zone" with a buffer for over- and underutilization -> Baseline for automatic scaling

### Revisiting legacy

> General recommendations

- **Problem children:** Idle clusters ("just in case I need one fast"), oversized compute (overdone safety buffers) and underutilized clusters
- **Challenge:** No one wants to touch a running system

1. Analyze historical utilization (identify spikes/traffic patterns)
2. Identify safe optimization opportunities
3. Roll out changes gradually

### Rearchitecture without fear

> What they did in their legacy systems

- Find out if your current workloads actually need the currently selected node types
- Optimize jobs into batches
- Even if the size is right: Check if you can switch to newer nodes with better price-to-performance
- Customize autoscaling with tools like KEDA to scale on actual load instead of diffuse side effects
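
To make "scale on actual load" a bit more concrete, here is a minimal sketch that assumes a queue-backed workload. The function name, parameters and numbers are hypothetical and not from the talk. The calculation mirrors the `ceil(metric / target)` rule the Kubernetes Horizontal Pod Autoscaler applies to average-value targets (KEDA drives an HPA under the hood), but it is only an illustration, not KEDA's implementation.

```python
import math


def desired_replicas(backlog: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 10) -> int:
    """Return how many replicas are needed to work off `backlog` items.

    Hypothetical helper: `backlog` stands in for a real load signal such as
    queue length, `target_per_replica` for the backlog one replica may hold.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if backlog <= 0:
        # No load: fall back to the configured floor (possibly zero).
        return min_replicas
    wanted = math.ceil(backlog / target_per_replica)
    # Clamp to the configured bounds so a spike cannot scale without limit.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    for backlog in (0, 5, 120, 5000):
        print(backlog, desired_replicas(backlog, target_per_replica=50))
```

Scaling on the backlog itself, rather than on a side effect such as CPU usage, means the replica count follows the work that is actually waiting, and the `max_replicas` bound keeps the cost ceiling explicit.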