docs(day0): Added finops talk
Some checks failed
Build latest image / build-container (push) Failing after 41s
61
content/day0/14_finops.md
Normal file
@@ -0,0 +1,61 @@
---
title: "When Platform Engineers Lead FinOps: Driving Reliability and $20M in Savings"
weight: 14
tags:
- platformengineeringday
- finops
- legacy
---

<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
{{% button href="https://colocatedeventseu2026.sched.com/event/2DY3O" style="error" icon="calendar" %}}Sched Link{{% /button %}}
<!-- {{% button href="https://github.com/graz-dev/automatic-reosurce-optimization-loop" style="info" icon="code" %}}Code/Demo{{% /button %}} -->
<!-- {{% button href="https://cloudnativeplatforms.com" style="info" icon="link" %}}Website/Homepage{{% /button %}} -->

A case study from Expedia about FinOps.

## The Cost-Reliability disconnect

- Background: Modern infrastructure is complex and large (1000s of clusters, multi-region, ...) with huge operational responsibilities (SLAs, SLOs, scalability, ...)
- Platform team: Reliability, performance, stability
- FinOps team: Cloud resource reduction, budget adherence, efficiency
- Problem: The goals conflict, and the two teams are often organizationally separated
- Blind cost optimization can cause unintended stability/performance problems that quickly spiral
- Blind stability optimization quickly leads to large overhead/overprovisioning and huge costs

## Patterns

- Establish views & baselines: Understand cost per cluster/workload and utilization patterns
- Revisit legacy: Old configs like static sizing, huge buffers, ...
- Embrace rearchitecture without fear: Consolidation, instance optimization, and infrastructure redesign should all be on the table
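
As a rough illustration of "cost per cluster/workload", the sketch below splits the bill of a shared cluster across teams in proportion to their usage. The proportional split, the function name `attribute_cost`, and the CPU-hour metric are assumptions made for this example; the talk notes do not say how attribution was actually done.

```python
def attribute_cost(cluster_cost: float, cpu_hours_by_team: dict[str, float]) -> dict[str, float]:
    """Split a shared cluster's cost across teams, proportional to CPU-hours used.

    Hypothetical helper: a real setup would likely also weigh memory,
    storage, and idle capacity.
    """
    total_hours = sum(cpu_hours_by_team.values())
    if total_hours <= 0:
        raise ValueError("no usage recorded, cannot attribute cost")
    return {
        team: cluster_cost * hours / total_hours
        for team, hours in cpu_hours_by_team.items()
    }


if __name__ == "__main__":
    # Made-up numbers: a 1000 USD cluster shared by two teams.
    print(attribute_cost(1000.0, {"checkout": 30.0, "search": 70.0}))
```

A proportional split keeps the per-team numbers summing to the real bill, which makes the resulting view easy to reconcile with the invoice.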

### Views & baselines

> General recommendations

- **Problem:** Lack of cost attribution for shared infrastructure
- **Problem:** Lack of insight into which clusters are generating costs
- **Problem:** No transparency into which teams are consuming resources
- **Solution:** Connect where costs are generated with where they show up, so that each cost can be traced to a cluster and a team
- **Solution:** Identify a safe operating range that wraps the "optimal zone" with a buffer for over- and underutilization -> Baseline for automatic scaling
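
The safe operating range can be pictured as an optimal zone plus a tolerance buffer on both sides. The following is a minimal sketch under assumed thresholds; the class name `OperatingRange`, the decision labels, and all numbers are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingRange:
    """Optimal utilization zone with a buffer on both sides (fractions from 0 to 1)."""

    optimal_low: float
    optimal_high: float
    buffer: float

    def classify(self, utilization: float) -> str:
        """Return a scaling hint for an observed utilization value."""
        if utilization < self.optimal_low - self.buffer:
            return "scale_down"  # clearly underutilized, capacity is wasted
        if utilization > self.optimal_high + self.buffer:
            return "scale_up"  # clearly overutilized, reliability is at risk
        return "hold"  # inside the safe operating range


if __name__ == "__main__":
    # Assumed values: optimal zone 50-70 %, buffer of 10 percentage points.
    safe_range = OperatingRange(optimal_low=0.5, optimal_high=0.7, buffer=0.1)
    for value in (0.35, 0.45, 0.85):
        print(value, safe_range.classify(value))
```

Keeping a "hold" band around the optimal zone avoids scaling back and forth on small fluctuations.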

### Revisiting legacy

> General recommendations

- **Problem children**: Idle clusters ("just in case I need one fast"), oversized compute (overdone safety buffers), and underutilized clusters
- **Challenge**: No one wants to touch a running system

1. Analyze historical utilization (identify spikes/traffic patterns)
2. Identify safe optimization opportunities
3. Roll out changes gradually
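
The three steps above can be sketched in a few lines: take a high percentile of historical usage, add headroom to get a target size, and shrink towards it in small steps. The percentile, the 20 % headroom, and the 25 % step limit are illustrative assumptions, as are all function names.

```python
import math


def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile of the observed samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def recommend_nodes(nodes_in_use: list[int], headroom: float = 0.2, pct: int = 99) -> int:
    """Step 1 and 2: size for the historical peak plus a safety headroom."""
    peak = percentile(nodes_in_use, pct)
    return math.ceil(peak * (1 + headroom))


def rollout_steps(current: int, target: int, max_step_fraction: float = 0.25) -> list[int]:
    """Step 3: shrink gradually, removing at most a fraction of nodes per step."""
    steps = []
    size = current
    while size > target:
        step = max(1, math.floor(size * max_step_fraction))
        size = max(target, size - step)
        steps.append(size)
    return steps


if __name__ == "__main__":
    history = [4, 5, 6, 5, 4, 10, 5, 6, 5, 4]  # made-up node counts in use
    target = recommend_nodes(history, headroom=0.2, pct=90)
    print(target, rollout_steps(current=20, target=target))
```

Each intermediate size gives a checkpoint at which the change can be observed and rolled back before the next reduction.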

### Rearchitecture without fear

> What they did in their legacy systems

- Find out whether your current workloads actually need the currently selected node types
- Group jobs into batches
- Even if the size is right: Check whether you can switch to newer nodes with better price-to-performance
- Customize autoscaling with tools like KEDA to scale on actual load instead of diffuse side effects
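
Comparing node types by price-to-performance can be as simple as dividing the hourly price by a benchmark score and picking the cheapest type that still meets the workload's minimum requirement. The node names, prices, and scores below are invented for illustration and do not refer to any real instance type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    name: str
    hourly_price: float  # price per node hour, any currency
    benchmark_score: float  # higher is faster; the benchmark itself is assumed

    @property
    def price_per_score(self) -> float:
        return self.hourly_price / self.benchmark_score


def best_price_performance(candidates: list[NodeType], min_score: float) -> NodeType:
    """Pick the node type with the lowest price per benchmark point
    among those that are fast enough for the workload."""
    eligible = [node for node in candidates if node.benchmark_score >= min_score]
    if not eligible:
        raise ValueError("no node type satisfies the minimum score")
    return min(eligible, key=lambda node: node.price_per_score)


if __name__ == "__main__":
    candidates = [
        NodeType("old-gen", hourly_price=1.00, benchmark_score=100),
        NodeType("new-gen", hourly_price=0.90, benchmark_score=120),
        NodeType("small", hourly_price=0.30, benchmark_score=50),
    ]
    print(best_price_performance(candidates, min_score=80).name)
```

The minimum score keeps a cheap but too-small node type from winning purely on ratio.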