Building more efficient AI with vLLM ft. Nick Hill

Technically Speaking with Chris Wright

00:00 — Chris Wright
Here's something wild to think about. Right now, there are massive language models that can write poetry, create software, and analyze complex data, but getting them to run efficiently, that's where things get really interesting. Today we're talking about the unsung heroes of AI, the people figuring out how to make these incredible models actually work in the real world. And that's where vLLM comes in, completely rethinking how we manage this process. To help us understand how this actually works, I'm joined by Nick Hill, who's been right in the middle of this innovation. Welcome to "Technically Speaking," where we explore how open source is shaping the future of technology. I'm your host, Chris Wright.

00:44 — Chris Wright
Before we jump into like, the deep technical stuff, I thought it'd be awesome just to get a little background. How did you arrive at working on inferencing? You know, what's your history? What brought you to where you are today?

01:00 — Nick Hill
So my first exposure, I guess, to working on AI infrastructure was a long time ago now, if you remember the IBM Watson "Jeopardy" win, where I think it was Ken Jennings that Watson beat on television. I wasn't part of the research team that did that, but I was helping to commercialize it afterwards. They wanted to see how they could exploit the work that had been done there and apply it to different domains. I had no idea I'd be working on, you know, cutting-edge generative AI inferencing many years later. But that was, at the time, that was kind of a form of inferencing, trying to optimize the large room full of servers that was IBM Watson and condense it down to be able to run on a single server. So that was, at the time, a much more primitive, sort of handcrafted AI that I was optimizing. And I didn't really understand much of it. It was more the systems side that I knew about, treating it as a black box, but looking at, okay, what are the pieces that need to talk to each other, and what's redundant, what's being done more than once that we can eliminate, those kinds of things.

02:20 — Chris Wright
So you came from a systems background, pre-transformers, pre-LLMs, and you're figuring out how to apply your systems knowledge to optimize Watson. I love the "Jeopardy" connection. I'm trying to think everything I'll say will be an answer and then you give me back the question.

02:39 — Nick Hill
Yep, yep. So I remember there were special things they had to do to, like, make it, you know, converse in that way, or understand the sort of reverse, yeah, question and answer thing.

02:51 — Chris Wright
So you're working on vLLM, you had a little bit of history leading you to vLLM. We should dig into your experience. Maybe what brought you to vLLM specifically?

03:04 — Nick Hill
So I was working on some of the early language model deployment work within IBM Research before anyone had sort of thought much about, "Okay, how do we run these models efficiently in production for lots of users at the same time?" And there was some early work from Hugging Face that I got involved with and started contributing to, and we had an internal fork that we were building our own platform on within IBM Research. And I noticed a new project, vLLM, which came from Berkeley, and it was doing things a little bit differently. They had a different approach. They were much more open to contributions from the community, and they somehow seemed to have, immediately, a much larger group of contributors that were interested in the project. And they were sort of very responsive to things that we were doing. So I, you know, felt this was a much better project to be involved with and sort of thought I could apply some of the stuff I'd already learned and help improve vLLM, and then benefit from all of the model implementations that they were doing and all the other things at that time.

04:20 — Chris Wright
So you brought your systems background, your "Jeopardy" background into this awesome open-source project. Came from the Sky Computing Lab, I think, is the UC Berkeley team that-

04:31 — Nick Hill
Yep.

04:33 — Chris Wright
Birthed vLLM. My understanding is it started off pretty simple as a KV cache and maybe a lot of thought around ease of use, where Triton was high performance but had a lot of usability challenges, and then quickly transformed into not just ease of use but performance. And then, you know, we saw performance parity from the vLLM world to the Triton world. There's all these different pieces of vLLM when we talk about the transformer architecture. I don't understand it as well as you do, but I know some of the challenges are in mapping, well, just doing all of the matrix multiplication as you're doing token generation and mapping that to hardware in a resource-efficient way. And some of that's memory bandwidth limited. Some of that's leveraging the GPU cycles. Why don't you just give us a little bit of a view of the KV cache, the decoding, some of the optimizations that you see in vLLM?

05:41 — Nick Hill
Sure. Yeah. So you mentioned the problem with low utilization; that's sort of something inherent to large language models that have to produce one token at a time. Each token depends on all of the computation that's been done so far, so there's no real way to parallelize it, it's sort of inherently sequential. And we get the compute by, like, doing a large matrix multiplication. And the trouble is, at each of these steps, you have to move all of the data from the GPU memory into the compute. And that becomes the limiting factor when you're not doing much on each step. So the KV cache itself helps reduce the compute, but then it's how you manage the cache and how you then try to soak up some of this extra compute to be able to do more in each step to get more throughput. There's a whole bunch of different techniques for helping with that. Some of these, like including prefill tokens, which are the sort of processing of the prompt, at the same time as you are doing decode steps, the memory-bandwidth-heavy decode steps, for the other sequences that are in flight. Or things like speculative decoding, where you can sort of spend some of this extra compute verifying predictions that you have about new tokens to come. And so you can sort of skip ahead, you know, skip past two steps at once, and that helps throughput when that's the constraint. So there's a lot of these different optimizations, and part of the interesting work in vLLM is how to make them work effectively together. Some are just quite difficult to combine.
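
To make the memory-bandwidth point concrete, here is a rough back-of-the-envelope sketch in Python. The model size, bandwidth, and FLOP figures are illustrative assumptions (roughly a 7B-parameter fp16 model on an A100-class GPU), not measured vLLM numbers; the point is only that a single decode step streams the full weights but does very little math, so batching more sequences per step soaks up the spare compute.

```python
# Illustrative arithmetic: why single-step decoding is memory-bandwidth bound.
# All figures are rough assumptions (7B-parameter fp16 model, A100-class GPU).

params = 7e9                         # model parameters
weight_bytes = params * 2            # fp16 -> ~14 GB of weights read per decode step

hbm_bandwidth = 2e12                 # ~2 TB/s HBM bandwidth (approximate)
peak_flops = 312e12                  # ~312 TFLOP/s fp16 tensor-core peak (approximate)

flops_per_token = 2 * params         # ~2 FLOPs per parameter per generated token

for batch_size in (1, 8, 64):
    # The weights are streamed from HBM once per step no matter how many
    # sequences are batched, so the load time stays fixed while the math scales.
    load_time = weight_bytes / hbm_bandwidth
    math_time = batch_size * flops_per_token / peak_flops
    bound = "bandwidth" if load_time > math_time else "compute"
    print(f"batch={batch_size:3d}: load {load_time*1e3:5.1f} ms, "
          f"math {math_time*1e3:5.2f} ms -> {bound}-bound")
```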

07:31 — Chris Wright
And the first initial implementation, the flagship capability was, I guess, the V in vLLM with PagedAttention, a virtual memory approach to the KV cache?

07:47 — Nick Hill
That's right.

07:48 — Chris Wright
I actually don't know the V in vLLM.

07:49 — Nick Hill
That's right. Well, I think they left it a little bit ambiguous, but I think it was, yeah, mainly the V for virtual memory, because PagedAttention is sort of conceptually analogous to virtual memory paging concepts. So, you know, the KV cache is the intermediate state that you have to compute that persists between iterations. So instead of having to recompute it on each new token, there's a lot of redundant data that you can store. The trouble with that is you don't know how long the sequence is gonna run on for. There's a thing called an end-of-sequence token that the model can choose to basically say, "I'm done." But you don't know, it could just give you a one-word answer, it could give you a paragraph. So the KV cache is actually very large, but there's a sort of slice of it that you have to append as the sequence grows, and if you don't reserve space, then you sort of run out of memory and you have to, you know, do a sort of defragmentation or something in real time or copy things around, which is very slow. Or you reserve, but then maybe that's wasted. You could have been using that GPU memory for other sequences' KV caches that you could process in parallel if that was the constraint. So PagedAttention basically was an implementation of some of the matrix multiplication operations that could operate on non-contiguous GPU memory. So you can then arrange your memory as blocks, like equivalent to a page table, and each sequence just points to a bunch of blocks that can be, you know, dispersed throughout the memory, and these kernels operate on that non-contiguous memory. So then it sort of solves the fragmentation problem. So that was like a key innovation that everyone uses now that came from vLLM.
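
As a rough illustration of the bookkeeping Nick describes, here is a minimal Python sketch of a block table: fixed-size physical blocks, a free list, and per-sequence tables that grow one block at a time. The class and method names are hypothetical and this is not vLLM's actual block manager; it only shows how non-contiguous allocation avoids both up-front reservation and defragmentation.

```python
# Minimal sketch of PagedAttention-style block-table bookkeeping.
# Hypothetical names; not vLLM's actual block manager implementation.

BLOCK_SIZE = 16  # number of tokens whose KV entries share one physical block

class BlockTable:
    """Maps each sequence's logical token positions onto scattered physical blocks."""

    def __init__(self, num_gpu_blocks: int):
        self.free_blocks = list(range(num_gpu_blocks))   # pool of physical block ids
        self.tables: dict[int, list[int]] = {}           # seq_id -> owned block ids
        self.lengths: dict[int, int] = {}                # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> int:
        """Reserve the slot for one new token, allocating a block only when needed."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                     # current block full, or first token
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: preempt or swap a sequence")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        # Physical slot the attention kernel would read/write for this token.
        return table[length // BLOCK_SIZE] * BLOCK_SIZE + length % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool; no defragmentation needed."""
        self.free_blocks.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

# Usage: two sequences grow independently while sharing one physical pool.
bt = BlockTable(num_gpu_blocks=4)
for _ in range(20):
    bt.append_token(seq_id=0)      # 20 tokens -> 2 blocks
bt.append_token(seq_id=1)          # 1 token  -> 1 block
bt.free(seq_id=0)                  # blocks go straight back onto the free list
```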

09:42 — Chris Wright
That first claim to fame, which is taking a page right out of systems research and systems understanding. Speculative decoding to me has some parallels to the speculative nature of branch prediction in CPUs, where there's, I think, a similar goal of keeping the CPU busy and not stalling the pipeline. The work that you're doing right now, what is your focus area in vLLM?

10:12 — Nick Hill
So a lot of what I'm doing is kind of in the middle layer. So I don't work so much on the kernels and optimizing the GPU operations, but there's a lot of sort of glue that needs to happen to make things work effectively: to get the inputs from the users into the GPUs, to sort of manage the results efficiently and stream them back to the users, and to do all of the batching and the scheduling and the logic to sort of sample the tokens that come out of the model to decide which ones to use, all those kinds of things. So a lot of that kind of thing is what I work on, and trying to work around some of the sort of inherent limitations of Python, where it's difficult to parallelize. So we end up splitting things into processes and then trying to efficiently move data between those processes, and that kind of thing.
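
As a loose illustration of that process split, here is a small Python sketch: an "engine" process loops over batched decode steps while the frontend process feeds it requests and streams results back over queues. The structure and names are hypothetical, not vLLM's actual architecture, and the model step is faked so the example stays self-contained.

```python
# Hypothetical sketch of splitting request handling and the engine loop into
# separate processes, moving data over queues so one interpreter's GIL
# doesn't stall the other. Not vLLM's actual process layout.

import multiprocessing as mp
import time

def engine_loop(requests: mp.Queue, results: mp.Queue) -> None:
    """Engine process: admit pending requests, then 'decode' one step per sequence."""
    active: dict[int, int] = {}            # request id -> tokens left to generate
    while True:
        while not requests.empty():        # admit newly arrived requests
            item = requests.get()
            if item is None:               # shutdown sentinel from the frontend
                return
            req_id, num_tokens = item
            active[req_id] = num_tokens
        if not active:                     # nothing to do; avoid a hot spin
            time.sleep(0.001)
            continue
        # A real engine would run one batched forward pass and sample here.
        for req_id in list(active):
            results.put((req_id, f"tok-for-{req_id}"))
            active[req_id] -= 1
            if active[req_id] == 0:        # sequence finished: stream an end marker
                results.put((req_id, None))
                del active[req_id]

if __name__ == "__main__":
    requests = mp.Queue()
    results = mp.Queue()
    engine = mp.Process(target=engine_loop, args=(requests, results), daemon=True)
    engine.start()

    requests.put((0, 3))                   # two concurrent requests of different lengths
    requests.put((1, 2))
    finished = 0
    while finished < 2:                    # frontend streams tokens back as they arrive
        req_id, token = results.get()
        if token is None:
            finished += 1
        else:
            print(f"request {req_id}: {token}")
    requests.put(None)                     # tell the engine to shut down
```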

11:12 — Chris Wright
So back to your systems roots, like bringing in that thinking to that layer of the vLLM stack, that initial challenge of: there's GPUs, here's a model, how do you make best use of that infrastructure? Sounds like a lot of the work is in maintaining the pipeline and saturating, you know, from a memory point of view and a cycles point of view. How do you see that performance enhancement making LLMs useful in general? Is this, you can do more with less hardware, you can do more with lower-powered hardware? Are there heterogeneous hardware opportunities, with having different classes of hardware for, say, prefilling and decoding? It feels like this would be the sweet spot for vLLM.

12:10 — Nick Hill
Yeah, so as you sort of try and scale up to larger deployments, there's lots of different ways you can distribute the inference between GPUs. Sometimes that's necessary because of the size of the model. Other times, you know, there's advantages 'cause it gives you additional parallelism. Something you're alluding to is sort of disaggregating the prefill part of the compute from the decode. So you can apply different sort of strategies to the sort of compute-heavy prefill, which has different performance characteristics than the memory-bandwidth-heavy decode stages, maybe even run them on different hardware. But then how you orchestrate these things and move the cache between them or share the KV cache, that becomes a whole other sort of systems challenge. Yeah, there's a whole sort of distributed side of it, and then dealing with very long context is becoming more and more of an issue with the reasoning models. People are talking about 128K context models, million-token context. And so that's like a whole area I think that's gonna become more and more important, and yeah, distributed inference, how you compress the KV cache, understanding the ways that you can do that and avoid losing anything, in as lossless a way as possible. There's papers coming out all the time with different ideas, and part of the problem is trying to decide which ones show the most promise, because it's, you know, a big investment sometimes to implement them.
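
To give a sense of the scale Nick is describing, here is a quick back-of-the-envelope calculation of KV-cache size versus context length. The configuration is a hypothetical Llama-style 8B model (32 layers, 8 KV heads, head dimension 128, fp16), not any specific deployment; the takeaway is simply that a single million-token sequence can outgrow a GPU's memory, which is why compression ideas like MLA and distributed cache management matter.

```python
# Back-of-the-envelope KV-cache footprint vs. context length.
# Hypothetical Llama-style 8B configuration; not a measured vLLM figure.

num_layers   = 32
num_kv_heads = 8        # grouped-query attention already shrinks this vs. the query heads
head_dim     = 128
bytes_per_el = 2        # fp16

# Both keys and values are cached for every layer and every KV head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el
print(f"{bytes_per_token / 1024:.0f} KiB of KV cache per token")

for context_len in (8_192, 128_000, 1_000_000):
    gib = bytes_per_token * context_len / 2**30
    print(f"{context_len:>9,} tokens -> {gib:7.1f} GiB for a single sequence")
```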

13:59 — Chris Wright
That brings up an interesting point, because it shows that the time window between research and implementation has shrunk. Historically, most systems research was largely academic, and systems-level innovation happened almost independently of the research. Now these are hand in glove: research papers come out, and implementations immediately follow. That is really challenging, it's a new dynamic. I think it's part of what makes AI feel so fast and furious. And everybody's evolving and learning.

14:40 — Nick Hill
It makes it so much more fun. It's almost like now, you know, the teams are integrated, there's no sort of gap between the researchers and engineers; the researchers become engineers, the engineers become researchers, and yeah, it's really fun. And because, you know, there was this AI winter, right? Where there wasn't much progress. And then as soon as this whole new paradigm came around, it was like this whole new frontier opened up. And there's so many sort of fertile areas to explore. It's like you can't just wait for, you know, research to happen. You wanna sort of incorporate it immediately.

15:21 — Chris Wright
Well, in that context, I know one of the conversations is scaling limitations, and we talked about data scaling as, you know, we've hit the upper limit of what we can do with adding more data to improve the output of a model. So now we're talking a lot about inference-time scaling, which puts something like vLLM right in the center of attention. Then there's the attention mechanism itself, with something like multi-head latent attention that was part of DeepSeek-V3. So you're in this great position to see all these different aspects of innovation and research come together. What are the areas, looking forward, that you are most excited about in vLLM?

16:04 — Nick Hill
Yeah, I think enabling this kind of inference-time scaling, where you are exploiting the inference to get better performance on tasks by trying lots of different responses and then performing some judgment as to, like, which one's the best. But obviously that's gonna multiply the computation needed. So things like MLA that you mentioned really help to compress the KV cache, which, for long sequences, really becomes the constraint.
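
As a sketch of the inference-time-scaling pattern Nick mentions, here is a minimal best-of-N loop in Python. The `generate` and `judge` functions are stand-ins (a real system would call the serving engine and a reward model or verifier); the point is that compute grows with the number of candidates, which is exactly why shrinking the KV cache with techniques like MLA raises how many candidates you can afford.

```python
# Minimal best-of-N sketch of inference-time scaling. The model call and the
# judge are hypothetical stand-ins, not a real vLLM or reward-model API.

import random

def generate(prompt: str, temperature: float) -> str:
    """Stand-in for one sampled completion from the serving engine."""
    return f"candidate-{random.randint(0, 999)} (T={temperature:.1f})"

def judge(prompt: str, response: str) -> float:
    """Stand-in for a reward model, verifier, or self-consistency score."""
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> str:
    # Sampling n candidates multiplies the generation compute (and KV cache) by n,
    # which is why cheaper per-sequence caches make this kind of scaling practical.
    candidates = [generate(prompt, temperature=0.8) for _ in range(n)]
    best_score, best = max((judge(prompt, c), c) for c in candidates)
    return best

print(best_of_n("Prove that the sum of two even numbers is even.", n=8))
```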

16:35 — Chris Wright
I know another key piece here is open source. The success of vLLM is because of a thriving community. And it seems like one of the things that drew you in was not just the thriving community, but also open source itself, where, not coming from this background, you could go inside, see how it works, learn, and then encourage others to contribute. Sounds like that's been a key part of your own journey. What do you think in terms of attracting people to vLLM? Is there something, a call to action, come get involved, here's an area to work on?

17:15 — Nick Hill
Yeah, I mean, we have, I think, 900 contributors now, and the community is very welcoming of new contributors. There's a lot of places to kind of get started and experiment. There's a lot of companies that sort of have internal forks of vLLM, but because of how fast it moves, they find it very difficult to keep up, and some of them have sort of just given up and started contributing instead, doing the work upstream. I think just 'cause there's so many people working on AI, we've got this sort of crowd; the power of the crowdsourcing has shown that we're sort of keeping up with a lot of the proprietary labs. Both on the model side, you know, people thought they were gonna run ahead and that hasn't turned out to be the case. But also, yeah, being open means a lot more people are interested from a kind of diverse hardware perspective, you know, not being locked in to just one hardware provider, right? That gives them optionality. They're not locked into a particular model, right? They have sort of more control of their own destiny. So, you know, it has a lot of attraction, both from, you know, people adopting it and then people contributing to it. Personally, you know, it's very nice to work in the open 'cause then everyone can see your work. Whereas working in a closed lab, you know, you could be doing the best work ever and the world wouldn't see it or, you know, benefit from it in a lot of cases, right? And it has sort of very narrow application. So yeah, that's something that attracted me to open source AI, and Red Hat as well, obviously, being, you know, that's central to the DNA of Red Hat.

18:58 — Chris Wright
I think that community angle and the hardware flexibility that comes with that. I mean, it's the obvious Nvidia and AMDs and Intels of the world, but then cloud providers have their own accelerators, the TPUs, Inferentia and Trainium, Maia on Azure, and then, you know, a proliferation of different types of acceleration techniques. Having this consistent piece of software that connects you to those hardware accelerators and gives you access to the models, plus the techniques to saturate the hardware with the workload presented by that model. I think vLLM is a really critical part of the overall stack. So it's been awesome to have you here and give us some insights into a really amazing and thriving open-source project. So thanks for your time, Nick.

19:50 — Nick Hill
It's been fun. Thanks, Chris.

19:52 — Chris Wright
While everyone's focused on what AI can do, we've just explored how it actually gets done. Nick showed us that sometimes the most important innovations aren't in the headlines; they're in making technology work better in the real world. What we're really talking about here is democratizing AI, making it practical and accessible without needing a warehouse full of GPUs or a small power plant. The solutions we discussed today, like vLLM's innovative approach to memory management and processing, are helping organizations of all sizes harness AI's potential. If you're interested in diving deeper into this technology, check out our show notes for links to the vLLM project and resources on AI infrastructure optimization. This is exactly the kind of open-source innovation that's shaping the future of enterprise technology. Thanks for joining the conversation. I'm Chris Wright, and I can't wait to see what we explore next on "Technically Speaking."


Nick Hill

Senior Principal Software Engineer, AI Engineering at Red Hat

Keep exploring

Join the Community Building Better AI

The power of vLLM is its open source community. Dive into the project on GitHub to see active development, follow discussions on the future of inference, and join the global team of engineers making AI more efficient for everyone.

Visit vLLM on GitHub

What is AI inference?

Before you can optimize it, you have to understand it. After a deep dive into the how of making inference better, get a clear, concise guide to the concept of inference and its central role in deploying successful, real-world AI applications.

Learn why AI inference matters

More like this

Technically Speaking with Chris Wright

Scaling AI inference with open source ft. Brian Stevens

How can enterprises truly scale AI for real-world impact? Explore the critical role of production-quality inference, the power of open source projects like vLLM, and the future of the enterprise AI stack.

Technically Speaking with Chris Wright

How open source can help with AI transparency

An efficient AI is one thing, but is it a fair one? After exploring the performance engine, delve into the trust engine and discover how open source is tackling the critical challenge of AI transparency and bias.

Compiler

Context And The True "Cost" Of AI

Sure, AI has made a splash. And it's on us to level up, learn the ropes, and roll with it. But how do we even do that? And what cool human stuff might we accidentally ditch along the way?

Share our shows

We are working hard to bring you new stories, ideas, and insights. Reach out to us on social media, use our show hashtags, and follow us for updates and announcements.

Presented by Red Hat

Sharing knowledge has defined us from the beginning–ever since co-founder Marc Ewing became known as “the helpful guy in the red hat.” Head over to the Red Hat Blog for expert insights and epic stories from the world of enterprise tech.