---
title: "When Platform Engineers Lead FinOps: Driving Reliability and $20M in Savings"
weight: 14
tags:
- platformengineeringday
- finops
- legacy
---
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
{{% button href="https://colocatedeventseu2026.sched.com/event/2DY3O" style="error" icon="calendar" %}}Sched Link{{% /button %}}
<!-- {{% button href="https://github.com/graz-dev/automatic-reosurce-optimization-loop" style="info" icon="code" %}}Code/Demo{{% /button %}} -->
<!-- {{% button href="https://cloudnativeplatforms.com" style="info" icon="link" %}}Website/Homepage{{% /button %}} -->
A case study from Expedia about FinOps.
## The Cost-Reliability disconnect
- Background: Modern infrastructure is complex and large (1000s of clusters, multi-region, ...) with huge operational responsibilities (SLAs, SLOs, scalability, ...)
- Platform Team: Reliability, Performance, Stability
- FinOps Team: Cloud resource reduction, budget adherence, efficiency
- Problem: Conflicting goals, and the teams are often organizationally separated
- Blind cost optimization can lead to unintentional stability/performance problems that can quickly spiral
- Blind stability optimization quickly leads to large overhead/overprovisioning and huge costs
## Patterns
- Establish views & baselines: Understand cost per cluster/workload and utilization patterns
- Revisit legacy: Old configs like static sizing, huge buffers, ...
- Embrace rearchitecture without fear: Consolidation, instance optimization, and infrastructure redesign should all be on the table
### Views & baselines
> General recommendations
- **Problem:** Lack of cost attribution for shared infrastructure
- **Problem:** Lack of insight into which clusters are generating costs
- **Problem:** No transparency into which teams are consuming resources
- **Solution:** Bring cost generation and cost ownership together, i.e. attribute costs to the teams and workloads that create them
- **Solution:** Identify a safe operating range that wraps the "optimal zone" with a buffer for over- and underutilization -> baseline for automatic scaling
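The "safe operating range" idea can be sketched in a few lines. This is a hypothetical illustration, not the talk's actual implementation: the percentile choice and the 10% buffer are assumptions.

```python
# Hypothetical sketch: derive a "safe operating range" for a workload from
# historical CPU utilization samples (fractions of requested CPU, 0..1).
# The p10-p90 "optimal zone" and the 10% buffer are illustrative choices.
from statistics import quantiles

def safe_operating_range(samples, buffer=0.10):
    """Return (low, high) utilization bounds wrapping the p10-p90
    optimal zone, padded with a buffer for under- and overutilization."""
    qs = quantiles(samples, n=10)   # nine decile cut points
    p10, p90 = qs[0], qs[-1]
    low = max(0.0, p10 - buffer)
    high = min(1.0, p90 + buffer)
    return low, high

# Example: a workload hovering around 40-60% utilization
history = [0.42, 0.55, 0.48, 0.61, 0.39, 0.52, 0.58, 0.45, 0.50, 0.47]
low, high = safe_operating_range(history)
```

Utilization staying inside `(low, high)` means no action; crossing either bound becomes the signal for the automatic scaling loop.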
### Revisiting legacy
> General recommendations
- **Problem children**: Idle clusters ("just in case I need one fast"), oversized compute (overdone safety buffers), and underutilized clusters
- **Challenge**: No one wants to touch a running system
1. Analyze historical utilization (identify spikes/traffic patterns)
2. Identify safe optimization opportunities
3. Roll out changes gradually
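Steps 1 and 2 above can be sketched as a simple filter over utilization history. All names and the 40% peak threshold are illustrative assumptions, not values from the talk:

```python
# Minimal sketch of steps 1-2: flag clusters whose historical *peak*
# utilization never justifies their current size. Using the peak (not the
# mean) ensures downsizing cannot clip a real traffic spike.

def optimization_candidates(utilization_by_cluster, peak_threshold=0.4):
    """utilization_by_cluster: {cluster_name: [hourly CPU utilization 0..1]}.
    Returns clusters that are safe rightsizing candidates."""
    candidates = []
    for name, samples in utilization_by_cluster.items():
        if samples and max(samples) < peak_threshold:
            candidates.append(name)
    return candidates

clusters = {
    "checkout-prod": [0.55, 0.80, 0.62],   # real spikes: leave alone
    "legacy-batch": [0.08, 0.12, 0.10],    # oversized "just in case"
    "staging-idle": [0.02, 0.01, 0.03],    # idle cluster
}
# → ["legacy-batch", "staging-idle"]
```

Step 3 then applies the change to one flagged cluster at a time, watching the safe operating range before moving on.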
### Rearchitecture without fear
> What they did in their legacy systems
- Find out if your current workloads actually need the currently selected node types
- Optimize jobs into batches
- Even if the size is right: Check if you can switch to newer node generations with a better price-to-performance ratio
- Customize autoscaling with tools like KEDA to scale on actual load instead of diffuse side effects
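The KEDA point can be illustrated with a `ScaledObject` that scales on a real load signal (requests per second from Prometheus) rather than CPU side effects. This is a hedged config sketch: workload name, query, and thresholds are all hypothetical.

```yaml
# Illustrative KEDA ScaledObject: scale the (hypothetical) "checkout"
# Deployment on actual request rate instead of CPU utilization.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: checkout-scaler          # hypothetical name
spec:
  scaleTargetRef:
    name: checkout               # Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(http_requests_total{app="checkout"}[2m]))
        threshold: "100"         # target RPS per replica
```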