Compare commits
3 Commits
9ad9af0f9c
...
78ca5973b8
Author | SHA1 | Date | |
---|---|---|---|
78ca5973b8 | |||
77f34ed1ab | |||
a36f562cf4 |
@@ -1,8 +1,9 @@
|
|||||||
---
|
---
|
||||||
title: Title
|
title: Don't write controllers like charlie don't does: Avoiding common kubernetes controller mistakes
|
||||||
weight: 3
|
weight: 3
|
||||||
tags:
|
tags:
|
||||||
- <tag>
|
- kubecon
|
||||||
|
- operator
|
||||||
---
|
---
|
||||||
|
|
||||||
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
|
||||||
|
56
content/day1/04_gpus-go-round.md
Normal file
@@ -0,0 +1,56 @@
---
title: The GPUs on the bus go round and round
weight: 4
tags:
- kubecon
- gpu
- nvidia
---

<!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
<!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->

## Background

- The speakers are the GeForce NOW folks (NVIDIA)
- Large fleet of clusters all over the world (60,000+ GPUs)
- They use KubeVirt to pass through GPUs (vfio driver) or vGPUs
- Devices fail from time to time
- Sometimes failures need restarts

## Failure discovery

- Goal: Maintain capacity
- Failure reasons: Overheating, insufficient power, driver issues, hardware faults, ...
- Problem: They only detected failures indirectly, by noticing that capacity decreased or that drivers could not be switched
- Fix: First detect failure, then remediate
- GPU problem detector as part of their internal device plugin
- Node Problem Detector -> triggers remediation through maintenance

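To make the "first detect, then remediate" idea more concrete, here is a minimal Python sketch. It is not the detector from the talk (that one is internal and no code was shown); `GpuStatus`, `node_condition` and the `GpuUnhealthy` condition name are hypothetical names picked for illustration. It only shows the shape of the idea: report per-device health explicitly instead of inferring failures from shrinking capacity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GpuStatus:
    """Health of a single GPU as seen by a (hypothetical) detector."""
    device_id: str
    healthy: bool
    reason: str = ""  # e.g. "overheating", "driver issue"


def node_condition(node: str, gpus: List[GpuStatus]) -> Dict[str, object]:
    """Summarize per-GPU health into one node-level condition.

    The returned dict is an illustrative structure, not a real Kubernetes
    API object. A remediation controller could watch for
    needs_maintenance == True and put the node into maintenance.
    """
    failed = [g for g in gpus if not g.healthy]
    return {
        "node": node,
        "type": "GpuUnhealthy",  # hypothetical condition name
        "status": "True" if failed else "False",
        "message": ", ".join(f"{g.device_id}: {g.reason}" for g in failed),
        "needs_maintenance": bool(failed),
    }
```

With one failed device out of two, the condition flips to `"True"` and the message names the broken GPU, so the failure is visible before anyone notices missing capacity.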
## Remediation approaches

- Reboot: Works every time, but has workload-related downsides -> Legit solution, but the drain can take very long
- Discovery of remediation loops -> Too many reboots indicate that something is not quite right
- Optimized drain: Prioritize draining nodes with failed devices before other maintenance
- The current workflow is: Reboot (automated) -> Power cycle (automated) -> Rebuild node (automated) -> Manual intervention / RMA

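The escalation workflow, the loop detection and the drain prioritization can be sketched in a few lines of Python. This is only an illustration of the ideas in the list above: the function names, the one-attempt-per-step default, the one-week window and the reboot threshold are all assumptions, since the talk did not give concrete numbers or code.

```python
from enum import IntEnum
from typing import Dict, List, Sequence


class Step(IntEnum):
    """Escalation ladder from the talk, in order."""
    REBOOT = 0
    POWER_CYCLE = 1
    REBUILD_NODE = 2
    MANUAL_RMA = 3  # manual intervention / RMA, the only non-automated step


def next_step(failed_attempts: int, attempts_per_step: int = 1) -> Step:
    """Pick the next remediation step after N unsuccessful remediations.

    attempts_per_step is an assumed knob: how often a step is retried
    before escalating to the next one.
    """
    index = failed_attempts // attempts_per_step
    return Step(min(index, int(Step.MANUAL_RMA)))


def in_remediation_loop(reboot_times: Sequence[float], now: float,
                        window_s: float = 7 * 24 * 3600,
                        max_reboots: int = 3) -> bool:
    """True if the node was rebooted more than max_reboots times in the window.

    Window and threshold are assumed values, not numbers from the talk.
    """
    recent = [t for t in reboot_times if 0 <= now - t <= window_s]
    return len(recent) > max_reboots


def drain_order(nodes: List[Dict]) -> List[Dict]:
    """Drain nodes with failed GPUs first, keep the original order otherwise.

    Each node is a dict with a "failed_gpus" count (hypothetical structure).
    sorted() is stable, so nodes within each group keep their relative order.
    """
    return sorted(nodes, key=lambda n: n["failed_gpus"] == 0)
```

A node stuck in a remediation loop would be a natural candidate for skipping straight to a later step of the ladder, but how the speakers combine these signals was not covered in the talk.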
## Prevention

> Problems should not affect workloads

- Health checks with alerts
- Firmware & driver updates
- Thermal & power management

## Future Challenges

- What if a high-density node with 8 GPUs has one failure?
- What is an acceptable ratio of working to broken GPUs per node?
- If there is a problematic node that has to be rebooted every couple of days, should the scheduler avoid this node?

## Q&A

- Are there any plans to open source the GPU problem detection: We could certainly do it, but it is not on the roadmap right now
- Are the failure rates representative and what is counted as a failure:
  - A failure is not being able to run a workload on a node (could be a hardware or driver failure)
  - The failure rate is 0.6% but the affected capacity is 1.2% (with 2 GPUs per node)

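One way to read the 0.6% vs. 1.2% numbers from the last answer: if a single failed GPU takes the whole node out of capacity and GPUs fail independently (both are my assumptions, the speakers did not spell out the model), the affected share of capacity is the probability that at least one GPU on a node has failed. The function name below is made up for this sketch.

```python
def affected_capacity(gpu_failure_rate: float, gpus_per_node: int) -> float:
    """Share of capacity lost if one failed GPU disables its whole node.

    Assumes independent GPU failures: the node is affected unless all of
    its GPUs are healthy, so the share is 1 - (1 - p) ** n.
    """
    return 1.0 - (1.0 - gpu_failure_rate) ** gpus_per_node


if __name__ == "__main__":
    # 2 GPUs per node, 0.6% failure rate per GPU (numbers from the Q&A)
    print(f"{affected_capacity(0.006, 2):.2%}")
```

For small failure rates this is close to `gpus_per_node * gpu_failure_rate`, which matches the roughly doubled number quoted for 2 GPUs per node. Under the same assumptions, an 8-GPU node would lose somewhere around 4.7% of capacity, which is why the high-density question in the future challenges matters.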
@@ -7,15 +7,23 @@ weight: 5
|
|||||||
Day 1 of the main KubeCon event started with a bunch of keynotes from the CNCF themselves (announcing the next locations for KubeCon - Amsterdam and Barcelona).
|
Day 1 of the main KubeCon event started with a bunch of keynotes from the CNCF themselves (announcing the next locations for KubeCon - Amsterdam and Barcelona).
|
||||||
They also announced a new sovereign cloud edge initiative (CNCF/LF meets EU and some German ministry) called "NeoNephos" with members like SAP, StackIt or T-Systems.
|
They also announced a new sovereign cloud edge initiative (CNCF/LF meets EU and some German ministry) called "NeoNephos" with members like SAP, StackIt or T-Systems.
|
||||||
|
|
||||||
This is also the day the sponsor showcase opened - so expect more talking to people and meetings or demos and less straight up talks
|
This is also the day the sponsor showcase opened - so expect more talking to people and meetings or demos and less straight up talks.
|
||||||
|
|
||||||
## Talk recommendations
|
## Talk recommendations
|
||||||
|
|
||||||
- Not that much about GPUs, but good control plane scaling advice: [Scaling GPU Clusters without melting down](../01_scaling-gpu)
|
- Not that much about GPUs, but good control plane scaling advice: [Scaling GPU Clusters without melting down](../01_scaling-gpu)
|
||||||
- Migrate a cluster to ClusterAPI without downtime: [Day 2000 - Migrating from kubeadm + ansible to clusterapi+talos](../02_migrations)
|
- Migrate a cluster to ClusterAPI without downtime: [Day 2000 - Migrating from kubeadm + ansible to clusterapi+talos](../02_migrations)
|
||||||
|
- Some basic operator tips with good Q&A questions: [Don't write controllers like charlie don't does: Avoiding common kubernetes controller mistakes](../03_operator-mistakes)
|
||||||
|
|
||||||
## Other stuff I learned or people I talked to
|
## Other stuff I learned or people I talked to
|
||||||
|
|
||||||
- The crossplane maintainers
|
- The crossplane maintainers (Upbound)
|
||||||
- Anynines
|
- Anynines
|
||||||
- Cloudfoundry/Korifi
|
- Cloudfoundry/Korifi
|
||||||
|
- FlatCar
|
||||||
|
- Cert-Manager
|
||||||
|
- Flux maintainers
|
||||||
|
- OVH
|
||||||
|
- Kubermatic
|
||||||
|
- Isovalent
|
||||||
|
- Spacelift: They employ some of the opentofu core maintainers
|
@@ -4,7 +4,8 @@ title: Day 2
|
|||||||
weight: 6
|
weight: 6
|
||||||
---
|
---
|
||||||
|
|
||||||
TODO:
|
The second day of KubeCon was my main "meeting day" this year - aka there were a bunch of scheduled meetings with manufacturers, partners and potential partners, or just to get to know someone or a project.
|
||||||
|
What does this mean for you? Another day with only a few sessions - the meeting notes are not available online.
|
||||||
|
|
||||||
## Talk recommendations
|
## Talk recommendations
|
||||||
|
|
||||||
|
@@ -4,7 +4,7 @@ title: Day 3
|
|||||||
weight: 7
|
weight: 7
|
||||||
---
|
---
|
||||||
|
|
||||||
TODO:
|
The last day of KubeCon - aka the day everyone leaves early.
|
||||||
|
|
||||||
## Talk recommendations
|
## Talk recommendations
|
||||||
|
|
||||||
|