---
title: When Platform Engineers Lead FinOps: Driving Reliability and $20M in Savings
weight: 14
tags:
 - platformengineeringday
 - finops
 - legacy
---

<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
{{% button href="https://colocatedeventseu2026.sched.com/event/2DY3O" style="error" icon="calendar" %}}Sched Link{{% /button %}}
<!-- {{% button href="https://github.com/graz-dev/automatic-reosurce-optimization-loop" style="info" icon="code" %}}Code/Demo{{% /button %}}  -->
<!-- {{% button href="https://cloudnativeplatforms.com" style="info" icon="link" %}}Website/Homepage{{% /button %}}  -->

A case study from Expedia about FinOps.

## The Cost-Reliability disconnect

- Background: Modern infrastructure is complex and large (1000s of clusters, multi-region, ...) with huge operational responsibilities (SLA, SLO, scalability, ...)
- Platform Team: Reliability, performance, stability
- FinOps Team: Cloud resource reduction, budget adherence, efficiency
- Problem: The goals conflict and the teams are often organizationally separated
  - Blind cost optimization can lead to unintentional stability/performance problems that can quickly spiral
  - Blind stability optimization quickly leads to large overhead/overprovisioning and huge costs

## Patterns

- Establish views & baselines: Understand cost per cluster/workload and utilization patterns
- Revisit legacy: Old configs like static sizing, huge buffers, ...
- Embrace rearchitecture without fear: Consolidation, instance optimization and infra redesign should all be on the table

### Views & baselines

> General recommendations

- **Problem:** Lack of cost attribution for shared infra
- **Problem:** Lack of insight into which clusters are generating costs
- **Problem:** No transparency into which teams are consuming resources
- **Solution:** Bring the generation of costs together with the existence of costs (make it visible where each cost is generated)
- **Solution:** Identify a safe operating range that wraps the "optimal zone" with a buffer for over- and underutilization -> Baseline for automatic scaling
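
The safe operating range can be pictured as an optimal utilization band plus a buffer on each side. The following is only a minimal sketch of that idea, not something shown in the talk: the class name `OperatingRange`, the thresholds and the action names are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingRange:
    """Hypothetical utilization band; all values are fractions of capacity (0.0 to 1.0)."""

    optimal_low: float   # lower edge of the "optimal zone"
    optimal_high: float  # upper edge of the "optimal zone"
    buffer: float        # tolerated over-/underutilization around the optimal zone

    @property
    def safe_low(self) -> float:
        # Below this value the cluster counts as underutilized.
        return max(0.0, self.optimal_low - self.buffer)

    @property
    def safe_high(self) -> float:
        # Above this value the cluster counts as overutilized.
        return min(1.0, self.optimal_high + self.buffer)

    def classify(self, utilization: float) -> str:
        """Return a scaling hint for one observed utilization value."""
        if utilization < self.safe_low:
            return "scale_down"
        if utilization > self.safe_high:
            return "scale_up"
        return "hold"


# Assumed example values: optimal zone 50-70 %, buffer of 10 percentage points.
cpu_range = OperatingRange(optimal_low=0.5, optimal_high=0.7, buffer=0.1)
for observed in (0.2, 0.6, 0.95):
    print(observed, cpu_range.classify(observed))
```

Anything inside the buffered range is left alone, which avoids scaling back and forth on small fluctuations.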

### Revisiting legacy

> General recommendations

- **Problem children**: Idle clusters ("just in case I need one fast"), oversized compute (overdone safety buffers) and underutilized clusters
- **Challenge**: No one wants to touch a running system

1. Analyze historical utilization (identify spikes/traffic patterns)
2. Identify safe optimization opportunities
3. Roll out changes gradually
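
The three steps above could look roughly like this in code. This is a sketch under assumptions, not the speakers' tooling: sizing by the 95th percentile, the 20 % headroom and the 25 % maximum step size are illustrative values, and all function names are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of the given samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_cores(usage_cores: list[float], headroom: float = 0.2) -> int:
    """Step 1 + 2: size for the 95th percentile of historical usage plus headroom."""
    p95 = percentile(usage_cores, 95)
    return math.ceil(p95 * (1 + headroom))


def next_step(current: int, target: int, max_step_fraction: float = 0.25) -> int:
    """Step 3: move toward the target, changing at most a fraction of the current size."""
    max_delta = max(1, math.floor(current * max_step_fraction))
    if target < current:
        return max(target, current - max_delta)
    return min(target, current + max_delta)


# Assumed sample data: CPU cores used per interval, with two spikes.
history = [4.0] * 18 + [6.0, 12.0]
target = recommend_cores(history)

size = 16  # assumed currently provisioned cores
while size != target:
    size = next_step(size, target)
    print("resize to", size)
```

Using a percentile instead of the maximum keeps a single outlier from dictating the size. Whether that is acceptable depends on how critical the spikes are for the workload.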

### Rearchitecture without fear

> What they did in their legacy systems

- Find out if your current workloads actually need the currently selected node types
- Optimize jobs into batches
- Even if the size is right: Check if you can switch to newer nodes with better price-to-performance
- Customize autoscaling with tools like KEDA to scale on actual load instead of diffuse side effects
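
To illustrate what "scaling on actual load" means, here is a minimal sketch of a target-per-replica calculation driven by queue depth instead of an indirect signal like CPU. It only shows the general shape of such a rule and is not KEDA's implementation; the function name, the queue metric and all default values are assumptions.

```python
import math


def desired_replicas(
    queue_length: int,
    target_per_replica: int,
    min_replicas: int = 0,
    max_replicas: int = 50,
) -> int:
    """Replicas needed so that each one handles at most target_per_replica messages."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if queue_length < 0:
        raise ValueError("queue_length must not be negative")
    raw = math.ceil(queue_length / target_per_replica)
    # Clamp to the configured bounds so a burst cannot scale without limit.
    return max(min_replicas, min(max_replicas, raw))


# Assumed example: each replica should work on at most 100 queued messages.
for backlog in (0, 250, 10_000):
    print(backlog, desired_replicas(backlog, target_per_replica=100))
```

With `min_replicas=0` an empty queue scales the workload to zero, which is where the savings for bursty jobs come from.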