Technically Speaking

Inside distributed inference with llm-d ft. Carlos Costa

Technically Speaking Team
Artificial intelligence

Explore the next frontier in AI: production-grade distributed inference at scale. In this episode, we're joined by Carlos Costa, Distinguished Engineer at IBM Research, a leader in large-scale compute and a key figure in the llm-d project. We discuss how to move beyond single-server deployments and build the intelligent, AI-aware infrastructure needed to manage complex workloads efficiently.

The conversation dives deep into the llm-d project, an open-source initiative born from a shared vision between IBM, Red Hat, and other industry leaders like Google and NVIDIA. Listeners will learn about the core technical challenges and innovative solutions being developed within llm-d, such as the disaggregation of prefill and decode stages, creating state-aware inference requests, and supporting wide parallelism for Mixture of Experts (MOE) models.

Transcript

00:00 - Chris Wright
So we've mastered deploying models on a single server with tools like vLLM. But the next frontier is production-grade distributed inference at scale. This isn't just about adding more machines. It requires a new kind of intelligence in the infrastructure itself. So today on the podcast, we've got Carlos Costa, a distinguished engineer at IBM Research, a leader in large-scale compute and a key figure in the llm-d project. Welcome to Technically Speaking, where we explore how open source is shaping the future of technology. I'm your host, Chris Wright.

00:36 - Chris Wright
So, Carlos, you've got a really interesting background working in HPC, thinking about, you know, distributed systems in the context of ML and AI. But I'm really curious to understand what drew you in, and what perspective were you coming from as you got into the llm-d project?

00:54 - Carlos Costa
Yeah. So I actually come from an HPC, distributed computing background, right? I've been developing system software and distributed runtimes for a little over ten years now. And it was a very interesting time, because we were just starting to get accelerators, GPUs, into large systems, right? This was the early days; IBM Research was developing the very first supercomputers to use GPUs, NVIDIA GPUs, back then. A very different time, right? And what was interesting, though, is that HPC was all about standards, right? Standard programming interfaces and all of that. And then along comes an accelerator, and it kind of breaks the whole paradigm.

01:36 - Carlos Costa
So I kind of watched this first-hand, and all the friction that it caused, right? And how this whole community started to react, like, "Hey, we have to protect ourselves from being locked into a single vendor," and all of that, right? That was one side of the house. The other side, which was actually my main focus, was how you start to scale ML and data analytics on these systems. And remember, the whole scale-out world on commodity clusters was just kind of emerging, think of the Hadoops and Sparks of the world. I got really passionate about the idea of making it simple to use for users. If you look at that other camp, the scale-out side, people didn't know anything about where a specific task lands, all of that; there was a lot of runtime machinery to help you. You probably were not hitting super-high performance, but it was easy. It was accessible, right? As opposed to HPC, which was traditionally very specialized.

02:37 - Carlos Costa
So I've been advocating for this convergence of worlds, especially around standards like Kubernetes: how we could evolve Kubernetes to better address some of these workloads that now require specialization at many layers, right? And that's how I got to this. So I think the driving principle here is really open source, right? How we drive standards, how this can help us democratize access. And of course the whole new wave of AI workloads makes this even harder, because now we're reusing some of the same building blocks.

03:14 - Chris Wright
Right, yeah. I think there's a lot of learning that comes from HPC, from distributed compute. Certainly from the application point of view, we put a lot of effort into Kubernetes and actually making it easier to build scale-out applications. But the specialized use case of embarrassingly parallel workloads, and then introducing an accelerator, a GPU in this case, there were some interesting challenges right there with, say, the network connectivity and building this large distributed system. I love that you had a focus on ease of use, because often these are scientific systems built and run by scientists. So, making it easy to use. What did you find were some of the challenges in that first phase, before we got into this scale-out inference?

04:02 - Carlos Costa
Yeah, it's just amazing, even from a packaging perspective, right? Remember, I was referring to these kinds of early systems, and CUDA was just coming into the picture. Now you have to worry about packaging something that requires specialized drivers and all these things, right? So that, to me, created a lot of challenges that broke, a little bit, the paradigm of how you do things. And even though some of these things had been resolved on the cloud side, along comes this new kind of specialized hardware, and it was just hard to evolve. I think for the most part it's still a big problem today; we still kind of get it wrong. But one key element is abstraction, right? Design the right levels of abstraction, have many of them; that's what helps us create that ease of use. So we have different entry points, and then we can abstract all the way up to the level where users don't really need to know a lot of this machinery.

05:01 - Chris Wright
That's the holy grail. So I'll make an assumption that there was a first phase, I'll just call it traditional HPC, and a subsequent phase probably more focused on things like scale-out training, where we're already trying to do these large-scale tasks. Where did you start to see that transition from scale-out training to recognizing that we could use that same kind of mindset and infrastructure to do scale-out inferencing, which we had sort of mastered on a single node?

05:30 - Carlos Costa
Yeah, I think, for inference, things started to get more interesting when we moved away from very simple request models; models became much, much larger, right? And of course now, with MoE (Mixture of Experts), you have parallelism at a different level. So communication and distribution of tasks across nodes became something very critical for inference too. The lines between training and inference, in terms of the type of constructs that you use, become kind of blurred, right? And that's how we got here. And also, inference itself, I think we'll talk more about this, at first we thought about it as one big monolith: you do prefill and decode as one thing. But from a compute perspective, those are very distinct steps. So now we have much better granularity, and the problem of orchestrating different tasks with different compute requirements became very central. Yeah.
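
To make the prefill/decode distinction concrete, here is a back-of-the-envelope Python sketch of why the two phases stress hardware differently. The model size, prompt length, and the rough "2 x parameters x tokens" FLOPs rule of thumb are assumptions chosen for illustration; they are not llm-d code or measurements.

```python
# Back-of-the-envelope illustration (not llm-d code) of why prefill and
# decode behave differently on the same hardware.

def transformer_flops(num_params: float, tokens: int) -> float:
    """Rough forward-pass FLOPs: about 2 * parameters * tokens."""
    return 2.0 * num_params * tokens

PARAMS = 70e9                 # hypothetical 70B-parameter dense model
WEIGHT_BYTES = PARAMS * 2     # fp16 weights that must be read from memory

# Prefill: the whole prompt is processed in one large batched pass.
prompt_tokens = 2048
prefill_intensity = transformer_flops(PARAMS, prompt_tokens) / WEIGHT_BYTES

# Decode: one token per step, with the weights re-read on every step.
decode_intensity = transformer_flops(PARAMS, 1) / WEIGHT_BYTES

print(f"prefill: ~{prefill_intensity:,.0f} FLOPs per weight byte (compute-bound)")
print(f"decode:  ~{decode_intensity:,.0f} FLOPs per weight byte (bandwidth-bound)")
```

The orders-of-magnitude gap in arithmetic intensity is the reason treating inference as "one big monolith" leaves efficiency on the table, and why scheduling the two phases separately is worth the added complexity discussed below.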

06:38 - Chris Wright
Well, that sort of lays the foundation for the launch of llm-d. As we were looking, from a Red Hat point of view, at scaling out inference, we were looking at the core building blocks, those primitives of disaggregated prefill and decode, and then how do you scale that across the system? Our interest was in connecting that with Kubernetes, so it was about building off of this well-understood, call it industry-standard, kind of infrastructure. What did you see in terms of the problems you were hoping to solve, and what drew you into the llm-d project at its inception?

07:20 - Carlos Costa
Yeah, maybe we take a step back and tell a little bit of the story behind it, because it was this kind of match made in heaven, actually, that led to llm-d. I'm of course coming from IBM Research, and the mission of research is really to look at inflection points, so we can help the company identify technologies that are important to our business before these things fully play out. For the past year or so it was very clear to us in research that inference was where a lot of the pain, a lot of the volume, and a lot of the opportunity actually was. So we started to look across the stack, and we bumped into a lot of challenges, things like, "Hey, how do you do better scheduling for inference requests?"

08:09 - Carlos Costa
How do you do better cache management, and all these things that are very central to making this platform efficient? How do you do faster model loading? A lot of challenges, and it touches the whole stack. And we started to see that the whole community was going in that direction, a lot of open source projects popping up, but we couldn't find the realization of that vision, that full-stack vision. We felt like, "Hey, we really need a common AI stack. There are some building blocks popping up, but the control plane is really what's missing." So we got involved in many communities, like some of the Kubernetes working groups that were starting to react to this new reality, like the Serving working group, right? And we debated what the right place was to bring some of these innovations, because it's all about building momentum, not reinventing the wheel.

09:06 - Carlos Costa
And then there was this, now famous, March 25th meeting, where research came here to meet with the Neural Magic team that had just joined, right? I still remember we got in the room and I started to describe some of that vision, and at some point, you know, friends like Brian Stevens and Rob Shaw, we look at Rob, and he's like, "Yes." Because they were, of course, thinking about this as well; it was kind of brewing already in the community. And at that point we decided, hey, there is an opportunity here to really bring a lot of this together. And it's not only expertise, right? We had the expertise, we had actually built some prototypes of these things, some of them internal to IBM, and with that, the insight to bring this to the open source side. But also connections, right, to connect a lot of these communities. The Neural Magic team, of course, brought a lot of the connections. We felt like, "Hey, we are really well positioned to go after this." And that's how the impetus to actually start llm-d came about. So llm-d was born there, at least as a concept.

10:19 - Chris Wright
Yeah. I'm picturing you in a room with Rob and Brian, almost finishing each other's sentences. The shared vision, I think, is probably the most important part of a successful open source community project. There are hardware providers, there are cloud providers, all in this community, each bringing their different perspectives. How did you see this set of kindred spirits coming together and embarking on this ambitious journey?

10:47 - Carlos Costa
I think that's actually a very key point, this shared vision, right? Talking to the players in the community, it had become very clear that there were also shared pain points, right? Pain points that were actually driving people to start from scratch, sometimes reinventing the wheel. And that's not efficient; it creates choke points that don't benefit anyone. So from the very early days, when we started to share the vision, and actually the passion behind the vision, that's something that got people engaged very quickly, and that's what we experienced, right?

11:26 - Carlos Costa
If you look at how everything started, once we reached that point of, "Hey, yeah, we agree, we have a vision across the stack," we started to reach out, and very quickly Google joined, right? Because they were, of course, building some of the building blocks. And the NVIDIA team also joined as very good partners, already building some of these things, right? From there we went to AMD and CoreWeave, and very quickly folks realized that here is an opportunity to realize a shared vision that helps us address some of the pain points. We could build something that benefits all. This is a growing pie, right? Something where you can specialize, but you all kind of win. And that's how things started.

12:12 - Chris Wright
I love that. It shows the power of open source and that collaborative model, how we can take each other's experiences, you know, each with the ability to sort of scratch their own itch, but working on this common vision. Initially we had proposed calling the project vLLM-d; ultimately it was llm-d. I think one of the motivations there was signaling that, of course, we care a lot about the vLLM inference runtime, but also that this control plane could manage other types of inference engines. Do you think there are other aspects of that beginning phase that were important to draw in this broader community?

12:54 - Carlos Costa
I think, again, it goes back to the point of a control plane, right? One of the challenges, if you look at these other efforts, is that they all tie in some way to a specific technology or to one player, right? So from the beginning we said, "Hey, we're actually very open." Even though we very much need vLLM, I mean, a lot of the hooks that we need to orchestrate this efficiently actually require changes to vLLM, and we are very close to that community in many ways; they've been amazing in terms of reacting and helping us. But vLLM is one engine, and we acknowledged from the beginning that, hey, we could embrace others, because they have similar challenges. You can look at LangChain and others, right? The orchestration layer, if you're really serious about this, is agnostic, not only at the hardware level but also at the engine level. Making that very clear from the beginning was very important.

13:50 - Carlos Costa
And it's not just lip service, to be honest. If you look at the abstractions and the modules, they are actually agnostic to the hardware. In fact, when we released llm-d, day one, it was already running on NVIDIA, on AMD, and the Google folks had it on TPU. So it's very much a core value proposition. Making sure that we were building modules that were agnostic, I think, was key to the beginning of llm-d.

14:22 - Chris Wright
Now, we have Kubernetes already. We're already running AI workloads on Kubernetes. We can launch a pod with an inference engine in it. There are load balancers, there's scaling. A lot of the building blocks are there, but they don't quite fit the AI workloads.

14:39 - Carlos Costa
That's actually a very central design point, right? Kubernetes has been around for some time; it's proven for scaling. The problem, though, is that inference requests are a different beast. They come in different shapes, they're not uniform, and they require hardware affinity, right? In many ways, Kubernetes was not really designed for this, but it can evolve, and it is evolving. There was already this realization in the Kubernetes community, like how we can extend things like the Gateway API to also include more model awareness, right? And that's exactly what we started to build on. Very central to llm-d is the fact that, hey, we build around Kubernetes, and we add the specialization to create inference requests with state awareness: things like my KV-cache state, locality, hardware affinity, all of that. That gives you a much better ability to do smarter scheduling, because in this world, if you make mistakes, you pay a big price. You don't want to over-provision, and you don't want to land on the wrong hardware, and so on, right?
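
As a rough illustration of what "state-aware" scheduling can mean in practice, here is a hypothetical Python sketch that scores serving pods by approximate prefix-cache reuse and current load instead of plain round-robin. The class names, block size, hashing scheme, and weights are invented for this sketch; they are not the llm-d or Gateway API surface.

```python
# Hypothetical sketch: prefer the pod most likely to already hold the
# KV cache for this prompt's prefix, discounted by how busy it is.
from dataclasses import dataclass, field

BLOCK = 16  # tokens per KV-cache block (an assumption for this sketch)

@dataclass
class Pod:
    name: str
    cached_blocks: set = field(default_factory=set)  # hashes of cached blocks
    queue_depth: int = 0                             # crude load signal

def block_hashes(tokens: list[int]) -> list[int]:
    """Hash the prompt block by block, chaining each hash on the previous one."""
    hashes, prev = [], 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        prev = hash((prev, tuple(tokens[i:i + BLOCK])))
        hashes.append(prev)
    return hashes

def score(pod: Pod, prompt: list[int]) -> float:
    """Higher is better: contiguous prefix-cache hits minus a load penalty."""
    hits = 0
    for h in block_hashes(prompt):
        if h not in pod.cached_blocks:
            break  # prefix reuse stops at the first miss
        hits += 1
    return hits - 0.5 * pod.queue_depth

def pick(pods: list[Pod], prompt: list[int]) -> Pod:
    return max(pods, key=lambda p: score(p, prompt))

pods = [Pod("decode-a", queue_depth=3), Pod("decode-b", queue_depth=1)]
prompt = list(range(64))
pods[0].cached_blocks.update(block_hashes(prompt)[:2])  # decode-a has seen this prefix
print(pick(pods, prompt).name)  # -> decode-a, despite its deeper queue
```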

15:50 - Carlos Costa
So very central to llm-d is the idea of using prefill/decode disaggregation to do better scheduling and allocation. It makes perfect sense, but it adds a lot of complexity, because now we have to coordinate two different pods: there's a handshake, how you transfer the KV cache, and all of that. So in llm-d we basically have this Inference Scheduler that is aware of prefill and decode, orchestrates that handshake, and gives you this ability to do a much better allocation. So that's one very central piece.
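
The sketch below is a simplified, hypothetical picture of what that prefill/decode coordination involves: pick a prefill worker, pick a decode worker, and hand over a reference to the KV cache. The worker fields, selection policy, and "kv://" handle are placeholders for illustration, not the actual llm-d Inference Scheduler.

```python
# Hypothetical sketch of a disaggregated prefill/decode request path.
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    role: str            # "prefill" or "decode"
    free_kv_blocks: int  # crude capacity signal

def pick(workers: list[Worker], role: str) -> Worker:
    """Least-loaded worker of the requested role (placeholder policy)."""
    candidates = [w for w in workers if w.role == role]
    return max(candidates, key=lambda w: w.free_kv_blocks)

def serve(prompt: str, workers: list[Worker]) -> str:
    prefill = pick(workers, "prefill")
    decode = pick(workers, "decode")

    # 1. The prefill worker processes the full prompt and builds the KV cache.
    kv_handle = f"kv://{prefill.name}/{abs(hash(prompt))}"  # stand-in handle

    # 2. Handshake: the decode worker is told where to pull the KV cache from
    #    (in a real system this is a point-to-point transfer over a fast fabric).
    print(f"{decode.name} pulls {kv_handle} from {prefill.name}")

    # 3. The decode worker then generates tokens against the transferred cache.
    return f"<tokens generated by {decode.name}>"

workers = [Worker("prefill-0", "prefill", 900), Worker("decode-0", "decode", 400)]
print(serve("Explain disaggregated inference.", workers))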

16:22 - Carlos Costa
The other one is actually related to this trend of MoEs, or sparse models, which is very important because we've all seen that the super-large dense models, like Llama 405B, are super expensive to run, right? That's the reason they're not getting a lot of traction. The other end of the spectrum is sparse models, Mixture of Experts. The caveat, though, is that they require very large batches and parallelism across many GPUs. So now we have another very interesting scheduling problem, because it's not only prefill and decode, it's the orchestration of many pods for one single inference request.

17:02 - Carlos Costa
Very central to llm-d is also how we support that, what we call wide parallelism, and how we leverage Kubernetes constructs, emerging ones like the LeaderWorkerSet, to orchestrate pods in gangs. So this builds on things like a Deployment, a StatefulSet, and then a LeaderWorkerSet, and what it gives you is the ability to deploy things as a gang, right? We're using this to enable expert parallelism in llm-d, very wide, so we can go across many, many GPUs. That's a very big and important feature that is coming. And when we pair this with prefill/decode disaggregation, we get very fine-grained control: a level of parallelism for prefill, a level of parallelism for decode, and all this orchestration across what is now a very complex scenario.
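
For readers unfamiliar with the LeaderWorkerSet construct Carlos mentions, here is a hedged sketch of the kind of gang-scheduled resource involved: one leader pod plus N-1 workers that are created, scheduled, and restarted as a unit, so a wide multi-GPU replica either exists completely or not at all. The field names follow the upstream LeaderWorkerSet API (leaderworkerset.x-k8s.io) as best understood here; the image, GPU counts, and names are placeholders, not the actual llm-d deployment artifacts.

```python
# Hypothetical LeaderWorkerSet manifest expressed as a Python dict.
manifest = {
    "apiVersion": "leaderworkerset.x-k8s.io/v1",
    "kind": "LeaderWorkerSet",
    "metadata": {"name": "decode-wide-ep"},
    "spec": {
        "replicas": 2,                      # two independent model replicas
        "leaderWorkerTemplate": {
            "size": 4,                      # 1 leader + 3 workers per replica
            "leaderTemplate": {
                "spec": {"containers": [{
                    "name": "vllm-leader",
                    "image": "example.com/vllm:latest",   # placeholder image
                    "resources": {"limits": {"nvidia.com/gpu": "8"}},
                }]}
            },
            "workerTemplate": {
                "spec": {"containers": [{
                    "name": "vllm-worker",
                    "image": "example.com/vllm:latest",   # placeholder image
                    "resources": {"limits": {"nvidia.com/gpu": "8"}},
                }]}
            },
        },
    },
}

# Applying it would require a cluster with the LeaderWorkerSet operator
# installed, e.g. via the Kubernetes Python client:
# from kubernetes import client, config
# config.load_kube_config()
# client.CustomObjectsApi().create_namespaced_custom_object(
#     "leaderworkerset.x-k8s.io", "v1", "default", "leaderworkersets", manifest)
```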

17:55 - Chris Wright
A couple of different thoughts in here. Do you see the potential for taking that fine-grained approach, recognizing that different requests require different types of resources, as a building block to creating heterogeneous infrastructure, where portions of the infrastructure could be optimized? You know, the memory-bandwidth-constrained part of decode versus the compute-constrained part of prefill. Is that something that comes into play?

18:24 - Carlos Costa
Yeah, that's a great point, Chris. I mean, one big motivation to have more control, to make it more granular, to decouple, is exactly that: because now we understand the requirements much better, we have the ability to score these things when they come in, and we can send them different ways, right? So the idea of, "Hey, we can even have different hardware for these phases," is absolutely a goal; it's a design principle. We're building abstractions in layers in a way that helps that, something we did from the get-go to enable it. There's a lot of work ahead, of course; this is easier said than done. But it's absolutely the direction. I think we have the core building blocks to get there. It's part of our shared roadmap with the community, and it's absolutely something needed to get to that point.
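
A minimal sketch of the heterogeneous-pool idea being discussed: steer the compute-bound prefill phase and the bandwidth-bound decode phase to pools built from different hardware, with a fallback heuristic for short prompts. The pool names, threshold, and policy are invented for illustration, under the assumption that phase-level routing is exposed to the scheduler.

```python
# Hypothetical phase-to-pool routing policy.
POOLS = {
    "prefill": ["compute-pool-a"],    # e.g. parts with high compute throughput
    "decode":  ["bandwidth-pool-b"],  # e.g. parts with high memory bandwidth
}

def route(phase: str, prompt_tokens: int) -> str:
    # Short prompts may not be worth a cross-pod KV transfer at all, so run
    # both phases on the decode pool in that case (a common heuristic).
    if phase == "prefill" and prompt_tokens < 256:
        return POOLS["decode"][0]
    return POOLS[phase][0]

print(route("prefill", 4096))  # -> compute-pool-a
print(route("decode", 4096))   # -> bandwidth-pool-b
print(route("prefill", 64))    # -> bandwidth-pool-b (not worth disaggregating)
```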

19:20 - Chris Wright
And the value there would be maybe reducing the overall cost of the hardware required to support these large models and, you know, heavily saturated workloads. Is there a connection between that intelligent scheduler we've been talking about and the MoE-level scheduler? I mean, some of the aspects of DeepSeek that were talked about when it first came out were MLA, its reasoning capabilities, and also clever routing. To me, they're connected, but maybe that's misunderstanding some of the building blocks of llm-d.

20:01 - Carlos Costa
Yeah, it's a great question. I think we get this question all the time, and that's maybe the challenge in this field: there are so many layers, and sometimes the terms overlap. When we say "router," at some point people think, oh, I'm routing to different models. That's one level of routing, and that's not what we're talking about; that's why we prefer to say "scheduler" these days. But then there's the routing at the MoE level, which is a different thing, something that happens at the model level. So there's no direct connection. But of course there are hints, because routing is all about where you send compute. The hints about where things run could be used for the high-level scheduling. So probably not a direct connection, but this is a cross-layer type of thing: every time you can provide hints across the layers, it's an opportunity to optimize. So in that sense it's actually a good direction, something we should explore.

21:08 - Chris Wright
Yeah. There's just so much excitement around bringing inference into scale-out production environments. I mean, a lot of the world was focused on pre-training and the large systems required to build large models. Now we have all of these great models, and we need to make use of them. That means running them in a production context and doing inference. There must be a vision for where we're going, so what are some of the next critical milestones that you see in llm-d?

21:39 - Carlos Costa
Yes. So in llm-d, when we started, one of the values we established was this idea of well-lit paths: the sense of creating blueprints for how you actually deploy some of these common patterns, what actually works, and what we've solved, so you don't have to go through the pain yourself. We bumped into a lot of inefficiencies and opportunities for better optimization of KV transfer, and this whole parallelism business was also something that was not super clean; the path was a little bit dim. So the big milestone for our next release is really a clear path for prefill/decode and multi-node, making that very solid. Something to keep in mind is that we're trying to really nail core features before we spread too far, trying to avoid a risk that many of these projects fell into, which is, let's boil the ocean: they take on too much and get nothing done. Right, exactly.

22:45 - Carlos Costa
Also better support for things like cache offloading, all of that, right? But very central also is support for SLOs. At the end of the day, anyone running production cares about SLOs, right? So better support in the inference scheduler for dealing with SLOs, how you extend the scheduler. Those are the very big things. We also have this idea of incubating features, things that are not in the core, but where we're getting a very strong signal from the community that people care about them. In this space there are really two. Autoscaling is one of them, because people want to react to demand. And the other one is this idea of "cold start": how you minimize or amortize the price of loading a model, which is very important when we have multi-model scenarios. That's another one that is incubating that we're super excited about.
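
To illustrate what SLO-aware scheduling can look like, here is a hedged Python sketch: drop candidate pods whose predicted latency would violate the request's targets before scoring the rest. The latency estimate is a made-up linear model and the field names are invented for this sketch; they are not llm-d's SLO mechanism.

```python
# Hypothetical SLO-aware candidate filtering.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    queue_depth: int
    per_token_ms: float   # observed inter-token latency on this pod

@dataclass
class Slo:
    ttft_ms: float        # time-to-first-token target
    itl_ms: float         # inter-token latency target

def admissible(c: Candidate, prompt_tokens: int, slo: Slo) -> bool:
    # Toy latency model: TTFT grows with queue depth and prompt length.
    predicted_ttft = (c.queue_depth + 1) * 0.05 * prompt_tokens
    return predicted_ttft <= slo.ttft_ms and c.per_token_ms <= slo.itl_ms

def schedule(cands: list[Candidate], prompt_tokens: int, slo: Slo):
    ok = [c for c in cands if admissible(c, prompt_tokens, slo)]
    if not ok:
        return None  # shed load or queue rather than blow the SLO
    return min(ok, key=lambda c: c.queue_depth)

cands = [Candidate("pod-a", 4, 18.0), Candidate("pod-b", 1, 35.0)]
print(schedule(cands, 512, Slo(ttft_ms=500, itl_ms=25)))  # -> pod-a
```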

23:43 - Chris Wright
What do you say about getting involved? Like, what's your call to action? What would be the thing for people listening and getting interested in llm-d to take on?

23:53 - Carlos Costa
Yeah, we are absolutely just starting, which means many, many opportunities to get involved. This is an alpha project, right? Meaning there are things to fix. So come and help us. If you are running an inference platform and you're feeling these pains, we have a home for you.

24:12 - Carlos Costa
llm-d is moving very fast, and I think we're listening to this community. Come bring your requirements, come take a look and see what you can improve. We very much want to hear how people are using this in the field and what the things are that they care about the most, so we can adjust. We are a very open community. We run weekly community calls, and we have seven special interest groups, very active actually, with very nice discussions. So if you want to get a feel for what is going on in llm-d and how you can influence it, do join these calls. We have the llm-d Slack, where you can chime in and ask questions, drop requirements, complaints, things that don't work; we very much want to hear those as well. That's how we get better and improve. We absolutely want to hear from you, so please do join this community.

25:06 - Chris Wright
That's awesome. Well, Carlos, I really appreciate you spending the time and helping us understand what's going on, not just broadly within llm-d, but really digging in and giving us the insights into what it means to do production-grade, scale-out inferencing.

25:23 - Carlos Costa
Yeah. Great. Thanks a lot, Chris. Yeah. Thanks. Thanks so much.

25:27 - Chris Wright
My conversation with Carlos explored a lot of the work going into the core building blocks for delivering AI inference to production at scale. Beyond the AI model itself, llm-d is focusing on the intelligent, AI-aware infrastructure needed to manage it, a system that can efficiently route, schedule, and scale these complex workloads. And what makes me optimistic is how this is being built not in a silo, but through an open, community-driven coalition. It proves the next leap for AI won't just be a new algorithm, but the open-standard platforms we build together to make this technology practical for everyone. Thanks for joining the conversation. I'm Chris Wright and I can't wait to see what we explore next on Technically Speaking.
