---
title: Optimizing performance and sustainability for AI
weight: 5
tags:
  - keynote
  - panel
---

{{% button href="https://youtu.be/VcMOr1DtTWM" style="warning" icon="video" %}}Watch talk on YouTube{{% /button %}}

A panel discussion moderated by Google, with participants from Google, Alluxio, Ampere and CERN.
It was pretty scripted, with prepared (sponsor-specific) slides for each question.

## Takeaways

* Deploying an ML model should become as routine as deploying a web app
* The hardware should be fully utilized -> Better resource sharing and scheduling
* Running smaller LLMs on CPU only is pretty cost-efficient
* Better scheduling by splitting into storage + CPU (prepare) and GPU (run) nodes to create a just-in-time flow
* Software acceleration is cool, but we should use more specialized hardware and models to run on CPUs
* We should be flexible regarding hardware, multi-cluster workloads and hybrid (on-prem, burst to cloud) workloads
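
The prepare/run split from the takeaways can be illustrated with a small sketch. This is not something shown in the talk: it is a minimal in-process analogy, assuming a bounded queue stands in for the hand-off between storage + CPU nodes and GPU nodes. The names `prepare`, `run`, `pipeline` and `max_ahead` are hypothetical, and real cluster scheduling would involve far more than this.

```python
import queue
import threading


def prepare(batch_id):
    # Stand-in for the storage + CPU stage: load and preprocess one batch.
    return [batch_id * 10 + i for i in range(3)]


def run(batch):
    # Stand-in for the GPU stage: a plain sum instead of a model call.
    return sum(batch)


def pipeline(num_batches, max_ahead=2):
    # Bounded queue: the prepare stage blocks once max_ahead batches are
    # waiting, so data arrives "just in time" instead of piling up.
    ready = queue.Queue(maxsize=max_ahead)
    results = []

    def producer():
        for batch_id in range(num_batches):
            ready.put(prepare(batch_id))
        ready.put(None)  # sentinel: no more batches

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        batch = ready.get()
        if batch is None:
            break
        results.append(run(batch))
    worker.join()
    return results


if __name__ == "__main__":
    print(pipeline(4))
```

The point of the bounded queue is that the expensive consumer (the GPU stage) is never left waiting for data, while the cheap producer never prepares much more than is about to be consumed.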