docs: Updated day notes

docs(day1): GPU Talk
docs(day1): Formatted notes
2025-04-02 17:43:43 +02:00 · 2025-04-02 17:43:21 +02:00 · 2025-04-02 17:15:55 +02:00
5 changed files with 73 additions and 7 deletions
--- a/content/day1/03_operator-mistakes.md
+++ b/content/day1/03_operator-mistakes.md
@@ -1,8 +1,9 @@
 ---
-title: Title
+title: Don't write controllers like charlie don't does: Avoiding common kubernetes controller mistakes
 weight: 3
 tags:
- - <tag>
+ - kubecon
 - operator
 ---
 <!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
--- a/content/day1/04_gpus-go-round.md
+++ b/content/day1/04_gpus-go-round.md
@@ -0,0 +1,56 @@
 ---
 title: THE GPUs on the bus go round and round
 weight: 4
 tags:
 - kubecon
 - gpu
 - nvidia
 ---
 <!-- {{% button href="https://youtu.be/rkteV6Mzjfs" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}} -->
 <!-- {{% button href="https://docs.google.com/presentation/d/1nEK0CVC_yQgIDqwsdh-PRihB6dc9RyT-" style="tip" icon="person-chalkboard" %}}Slides{{% /button %}} -->
 ## Background
 - They are the GeForce NOW folks
 - Large fleet of clusters all over the world (60,000+ GPUs)
 - They use KubeVirt to pass through GPUs (vfio driver) or vGPUs
 - Devices fail from time to time
 - Sometimes failures need restarts
 ## Failure discovery
 - Goal: Maintain capacity
 - Failure reasons: Overheating, insufficient power, driver issues, hardware faults, ...
 - Problem: They only noticed failures when capacity decreased or when switching drivers failed
 - Fix: First detect the failure, then remediate
    - GPU problem detector as part of their internal device plugin
    - Node problem detector -> triggers remediation through maintenance
 ## Remediation approaches
 - Reboot: Works every time, but has workload-related downsides -> Legitimate solution, but the drain can take a very long time
 - Discovery of remediation loops -> Too many reboots indicate that something is not quite right
 - Optimized drain: Prioritize draining nodes with failed devices before other maintenance
 - The current workflow is: Reboot (automated) -> Power cycle (automated) -> Rebuild Node (automated) -> Manual intervention / RMA
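The escalation order above could be sketched as a small ladder with a loop check. This is only an illustration: the step names come from the notes, while the function names, the reboot threshold and the time window are assumptions and not the speakers' implementation.

```python
from enum import Enum


class Step(Enum):
    REBOOT = "reboot"
    POWER_CYCLE = "power cycle"
    REBUILD_NODE = "rebuild node"
    MANUAL_RMA = "manual intervention / RMA"


# Order as described in the notes; all names below are hypothetical.
LADDER = [Step.REBOOT, Step.POWER_CYCLE, Step.REBUILD_NODE, Step.MANUAL_RMA]


def next_step(failed_attempts: int) -> Step:
    """Pick the step to try after `failed_attempts` unsuccessful steps."""
    # Clamp so that anything past the last automated step ends in manual work.
    index = max(0, min(failed_attempts, len(LADDER) - 1))
    return LADDER[index]


def in_remediation_loop(reboot_times, now, window_s=7 * 24 * 3600, max_reboots=3):
    """Return True if more than `max_reboots` reboots fall inside the window.

    The window and threshold are assumed values, not from the talk.
    """
    recent = [t for t in reboot_times if 0 <= now - t <= window_s]
    return len(recent) > max_reboots
```

A node flagged by `in_remediation_loop` would be a candidate for skipping straight to a later step instead of rebooting again.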
 ## Prevention
 > Problems should not affect workload
 - Health checks with alerts
 - Firmware & driver updates
 - Thermal & power management
 ## Future Challenges
 - What if a high-density node with 8 GPUs has one failure?
 - What is an acceptable ratio of working to broken GPUs per node?
 - If there is a problematic node that has to be rebooted every couple of days, should the scheduler avoid this node?
 ## Q&A
 - Are there any plans to open-source the GPU problem detection? We could certainly do it, but it is not on the roadmap right now
 - Are the failure rates representative and what is counted as a failure:
    - A failure is not being able to run a workload on a node (could be a hardware or driver failure)
    - The failure rate is 0.6% but the affected capacity is 1.2% (with 2 GPUs per node)
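One plausible reading of the last two numbers: a single failed GPU takes its whole node out of capacity, so with 2 GPUs per node the affected share roughly doubles. The sketch below assumes independent GPU failures and whole-node impact; both are assumptions, not statements from the talk, and the function name is made up.

```python
def affected_capacity(gpu_failure_rate: float, gpus_per_node: int) -> float:
    """Share of nodes (and thus of capacity) with at least one failed GPU.

    Assumes GPUs fail independently and that one failed GPU takes the whole
    node out of capacity. Both are assumptions for illustration.
    """
    return 1 - (1 - gpu_failure_rate) ** gpus_per_node


# 0.6% per-GPU failure rate with 2 GPUs per node
print(f"{affected_capacity(0.006, 2):.1%}")  # prints 1.2%
```

Under the same assumptions, the function can also be called with `gpus_per_node=8` to estimate the impact for the high-density nodes mentioned under future challenges.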
--- a/content/day1/_index.md
+++ b/content/day1/_index.md
@@ -7,15 +7,23 @@ weight: 5
 Day 1 of the main KubeCon event started with a bunch of keynotes from the CNCF themselves (announcing the next locations for KubeCon - Amsterdam and Barcelona).
 They also announced a new sovereign cloud edge initiative (CNCF/LF meets EU and some German ministry) called "NeoNephos" with members like SAP, StackIt or T-Systems.
-This is also the day the sponsor showcase opened - so expect more talking to people and meetings or demos and less straight up talks
+This is also the day the sponsor showcase opened - so expect more talking to people and meetings or demos and less straight up talks.
 ## Talk recommendations
 - Not that much about gpus with good control plane scaling advice: [Scaling GPU Clusters without melting down](../01_scaling-gpu)
 - Migrate a cluster to ClusterAPI without downtime: [Day 2000 - Migrating from kubeadm + ansible to clusterapi+talos](../02_migrations)
 - Some basic operator tips with good Q&A questions: [Don't write controllers like charlie don't does: Avoiding common kubernetes controller mistakes](../03_operator-mistakes)
 ## Other stuff I learned or people I talked to
- The crossplane maintainers
+- The crossplane maintainers (Upbound)
 - Anynines
- Cloudfoundry/Korifi
+- Cloudfoundry/Korifi
 - FlatCar
 - Cert-Manager
 - Flux maintainers
 - OVH
 - Kubermatic
 - Isovalent
 - Spacelift: They employ some of the opentofu core maintainers
--- a/content/day2/_index.md
+++ b/content/day2/_index.md
@@ -4,7 +4,8 @@ title: Day 2
 weight: 6
 ---
-TODO:
+The second day of KubeCon was my main "meeting day" this year - aka there were a bunch of scheduled meetings with manufacturers, partners and potential partners, or just to get to know someone or a project.
 What does this mean for you? Another day with only a few sessions - the meeting notes are not available online.
 ## Talk recommendations
--- a/content/day3/_index.md
+++ b/content/day3/_index.md
@@ -4,7 +4,7 @@ title: Day 3
 weight: 7
 ---
-TODO:
+The last day of KubeCon - aka the day everyone leaves early.
 ## Talk recommendations
Author	SHA1	Message	Date
Nicolai Ort	78ca5973b8	docs: Updated day notes	2025-04-02 17:43:43 +02:00
Nicolai Ort	77f34ed1ab	docs(day1): GPU Talk	2025-04-02 17:43:21 +02:00
Nicolai Ort	a36f562cf4	docs(day1): Formatted notes	2025-04-02 17:15:55 +02:00