Shivay Lamba

remote

Multi Cluster GPU Allocation for AI Research

As LLMs and generative models become more and more complex, they can no longer be trained on a CPU or a single GPU: training requires multiple GPUs, and managing them can be complicated. GPU partitioning in the cloud is often perceived as a complicated, resource-consuming process reserved for narrowly specialized teams or large enterprises. But what if the truth is exactly the opposite? This talk explores why GPU partitioning is necessary for running Python AI workloads and how it can be done efficiently using open source tooling.

The talk will also address some common myths, for example that GPU partitioning requires advanced hardware configurations or prohibitive costs, assumptions that often accompany large-scale distributed systems like Kubernetes.

In this talk, we will illustrate how modern frameworks like NVIDIA MIG, combined with vCluster, enable seamless sharing of GPUs across different teams, leading to more efficient resource utilization, higher throughput, and broader accessibility for workloads like LLM fine-tuning and inference. The talk aims to give developers and engineers the key techniques for efficient GPU scheduling and resource sharing across multiple GPU clusters with open source platform tooling like vCluster.
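To make the scheduling story concrete, here is a minimal sketch of how a tenant might request a MIG slice from Python using the official Kubernetes client. It assumes a cluster (or vCluster) whose nodes run the NVIDIA device plugin in MIG mode; the MIG resource name, container image, and namespace below are illustrative placeholders, not material from the talk.

```python
# Minimal sketch: submitting a fine-tuning pod that asks for a single MIG slice
# rather than a whole GPU. Assumes the NVIDIA device plugin is running in MIG
# mode; the exact resource name depends on how the GPUs were partitioned.
from kubernetes import client, config

MIG_RESOURCE = "nvidia.com/mig-1g.5gb"  # e.g. one 1g.5gb slice of an A100 (assumption)

def submit_finetune_pod(namespace: str = "default") -> None:
    """Create a pod that is scheduled onto a MIG partition instead of a full GPU."""
    # Point KUBECONFIG at a vCluster to keep the request scoped to one tenant.
    config.load_kube_config()
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="llm-finetune"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="trainer",
                    image="ghcr.io/example/llm-finetune:latest",  # placeholder image
                    command=["python", "train.py"],
                    resources=client.V1ResourceRequirements(limits={MIG_RESOURCE: "1"}),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace=namespace, body=pod)

if __name__ == "__main__":
    submit_finetune_pod()
```

Because each team submits work through its own vCluster, the same physical GPU can serve several tenants, with each tenant seeing only the slice it requested.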

Bringing Container-Native Simplicity to AI/ML

The deployment of AI projects often faces significant hurdles due to the fragmented nature of their components: datasets, models, and model weights are frequently stored in separate repositories, formats, or locations. In this talk, we will explore the critical challenges posed by these disjointed workflows and how to address them with a universal, open-source packaging and versioning solution based on the Open Container Initiative (OCI) standard, which bundles all AI components into a single image. We will share production examples from projects such as KitOps and Project Harbour that focus on streamlining AI packaging (see the sketch after the list below). This enables seamless integration of AI/ML components into cloud-native infrastructure, ensuring secure, auditable, and efficient deployment. Attendees will discover practical strategies for standardizing AI/ML packaging and learn how to adopt it in their CI/CD pipelines. This brings tremendous benefits for AI/ML and software teams, such as:

Simplified Deployment Workflows: Eliminates the need to manage multiple formats and repositories

Reduced Infrastructure Complexity: Single unified pipeline for both traditional and AI workloads

Improved Reproducibility: All components (models, datasets, weights) are versioned together

Enhanced Auditability: Complete traceability of AI/ML components through OCI-based packaging

Easier Compliance Management: Standardized format makes security scanning and policy enforcement simpler
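As a rough illustration of the packaging workflow described above, the sketch below writes a Kitfile-style manifest and drives the KitOps `kit` CLI from Python. The manifest fields, file paths, and registry address are illustrative assumptions rather than examples from the talk; consult the KitOps documentation for the exact schema and commands.

```python
# Hypothetical sketch: bundling model weights, dataset, and code into one
# OCI artifact with the KitOps CLI. Paths, manifest fields, and the registry
# address are placeholders.
import subprocess
from pathlib import Path

KITFILE = """\
manifestVersion: "1.0"
package:
  name: sentiment-classifier
  version: 0.1.0
  description: model weights, training data, and code versioned together
model:
  name: sentiment-classifier
  path: ./model            # e.g. safetensors or ONNX weights
datasets:
  - name: training-data
    path: ./data/train.csv
code:
  - path: ./src            # training and inference scripts
"""

def pack_and_push(workdir: str, tag: str) -> None:
    """Write the manifest, build the ModelKit, and push it to an OCI registry."""
    Path(workdir, "Kitfile").write_text(KITFILE)
    subprocess.run(["kit", "pack", workdir, "-t", tag], check=True)  # build the artifact
    subprocess.run(["kit", "push", tag], check=True)                 # push to the registry

if __name__ == "__main__":
    # Any OCI-compliant registry (Harbor, Docker Hub, ...) can host the artifact.
    pack_and_push(".", "registry.example.com/ml-team/sentiment-classifier:0.1.0")
```

Because the result is a standard OCI artifact, the same scanning, signing, and promotion steps that a CI/CD pipeline already applies to container images can be applied to the packaged model, data, and code.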

AI has become a prominent force in the cloud native ecosystem, and adoption in this emerging field continues to grow rapidly. As new frameworks and approaches are introduced, a pattern has emerged that threatens the ability to manage them at scale: each implementation brings its own format, runtime, and ways of working, fragmenting the ecosystem. Open standards, on the other hand, are the backbone of cohesive and scalable ecosystems. This talk explores the importance of defining standards within the CNCF ecosystem, particularly for AI/ML artifacts. Beyond the advantages of a standard in facilitating integration with existing cloud native tools, the conversation will delve into how standards can serve as a foundation for innovation. Join us to understand how standardization, combined with innovative approaches, can advance the cloud native AI landscape.
