
Rob Gibbon
on 27 April 2026

Understanding disaggregated GenAI model serving with llm-d


What is llm-d?

llm-d is an open source solution for managing high-scale, high-performance Large Language Model (LLM) deployments. LLMs are at the heart of generative AI – so when you chat with ChatGPT or Gemini, you’re talking to an LLM.

Simple LLM deployments – where an LLM is deployed to a single server – can suffer from latency issues, even with just one user. This can be because of a lack of memory bandwidth on the server, or because of KV cache pressure on system memory. Either way, you’re kept waiting for the LLM to respond to your question or instruction, which can really drag – and nobody likes to be kept waiting.
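To see why KV cache pressure builds up so quickly, it helps to do a rough back-of-the-envelope estimate. This is just a sketch – the model dimensions below are illustrative stand-ins for a large model, not the specifications of any particular one:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_param=2):
    # Each token stores one key and one value vector (hence the factor 2)
    # per layer, per KV head, at fp16 precision (2 bytes per parameter).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_param

# Illustrative dimensions for a 70B-class model (assumed, not exact):
per_user = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=32_768)
print(f"{per_user / 2**30:.1f} GiB of KV cache for one 32k-token conversation")
# prints "10.0 GiB of KV cache for one 32k-token conversation"
```

Around 10 GiB of cache for a single long conversation – multiply that by concurrent users, and it’s easy to see how a single server runs out of fast memory.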

llm-d tries to solve this by splitting up the LLM deployment (disaggregating the deployment) and separating out the different components of the architecture onto dedicated hardware. This means that the various parts of the system can be managed and scaled independently. llm-d is a cloud-native system and uses Kubernetes as the orchestration engine for all of this, so that the necessary resources can be managed automatically, using Kubernetes’ automation features.

Why run your own LLM service?

For some organizations, sovereignty (that is, keeping things under your own control, governance and oversight) is imperative. That’s true for sensitive data, and also for sensitive data processing – like the things folks do with LLMs, such as building Retrieval Augmented Generation (RAG) systems or agentic workflows. For those organizations, there’s no question that they’re going to want to run an LLM service under their own watch, on their own systems, even in their own data center. With open weight large language models like Kimi-2.5 and GM-5 – models that can hold their own against Gemini Pro, Claude Sonnet and Grok – now available, there’s never been a better time to run a sovereign AI Factory. And that’s where llm-d comes into its own.

Architecture of llm-d

At a high level, llm-d is composed of four major components, all of which run on a Kubernetes cluster:

  1. Inference scheduler – this part is an adaptive load balancer, responsible for intelligently routing user questions to worker nodes that have already cached context relevant to the user’s question. It uses metrics pulled from a Prometheus metrics endpoint to make routing decisions.
  2. Cache manager – this part is responsible for coordinating LLM key-value (KV) caches. Getting caching right is a critical factor in getting the best possible LLM performance.
  3. Prefill worker – llm-d splits the actual LLM workload in two. The prefill component performs the heavy, compute intensive processing of prompts and can be scaled independently.
  4. Decode worker – the decode component performs the memory-bandwidth dependent task of generating tokens (this is the part that is responsible for writing the answer to the user’s question).
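The prefill/decode split can be illustrated with a toy sketch. This is pure Python with no real model behind it – the function names and stand-in tokens are mine, not llm-d’s – but it shows why the two phases have such different hardware profiles:

```python
def prefill(prompt_tokens):
    # Compute-bound phase: process the whole prompt in one parallel pass,
    # producing one KV cache entry per prompt token.
    return [f"kv({t})" for t in prompt_tokens]

def decode(kv_cache, max_new_tokens):
    # Memory-bandwidth-bound phase: generate one token at a time,
    # re-reading the entire KV cache at every step and appending to it.
    output = []
    for _ in range(max_new_tokens):
        token = f"tok{len(kv_cache)}"  # stand-in for a real forward pass
        kv_cache.append(f"kv({token})")
        output.append(token)
    return output

cache = prefill(["What", "is", "llm-d", "?"])   # one pass over 4 tokens
answer = decode(cache, max_new_tokens=3)        # 3 sequential steps
```

Prefill touches every prompt token once in parallel, while decode loops sequentially and grows the cache – which is why llm-d can put each phase on hardware suited to its bottleneck and scale them independently.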

llm-d is designed to work with the very high-performance hardware these setups need, like servers with enterprise-grade GPUs and InfiniBand network switching – a high-bandwidth, low-latency alternative to classic Ethernet.
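To make the inference scheduler’s cache-aware routing a little more concrete, here is a simplified sketch of the idea – not llm-d’s actual scoring logic, just the intuition: prefer the worker whose cached sequences share the longest prefix with the incoming prompt, since that work won’t need to be redone.

```python
def shared_prefix_len(a, b):
    # Count how many leading tokens two sequences have in common.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_worker(prompt_tokens, workers):
    # workers: worker name -> list of token sequences it has cached.
    # Score each worker by its longest cached prefix matching this prompt.
    def score(name):
        return max(
            (shared_prefix_len(prompt_tokens, seq) for seq in workers[name]),
            default=0,
        )
    return max(workers, key=score)

workers = {
    "decode-0": [["Tell", "me", "about", "Kubernetes"]],
    "decode-1": [["Tell", "me", "about", "llm-d", "caching"]],
}
best = pick_worker(["Tell", "me", "about", "llm-d"], workers)
# best == "decode-1": it already holds 4 matching prefix tokens, vs 3
```

A real scheduler would weigh this against load and latency metrics (llm-d pulls these from Prometheus), but the principle is the same: route to where the cache already is.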

Getting hands-on

I find that the best way to learn about something deeply is to work with it. So, to that end, I put together some Juju charms for Ubuntu to get a better understanding of how llm-d works for myself. They enable you to deploy an LLM to an llm-d setup in a clean, straightforward way – without needing to be a Kubernetes guru.

I’ve made the source code available on GitHub, along with some instructions on how to build the code and get things up and running. Note that I was just playing around with Juju charms when building these; there may be bugs, and they are not supported by Canonical – so use them at your own risk.

The diagram below illustrates how the various Juju charms that manage the system are integrated.

Juju charms offer a clean approach to DevOps, and the system has primitives for both cloud infrastructure and Kubernetes. If you’d like to dig in and learn more about how to develop or use Juju charms, head over to our Juju page.
