NVIDIA Dynamo Planner: Automating Multi-Node LLM Inference with SLO-Driven Automation (2026)

Microsoft and NVIDIA have released Part 2 of their joint work on optimizing large language model (LLM) inference on Azure Kubernetes Service (AKS). The first announcement targeted a raw throughput of 1.2 million tokens per second on distributed GPU systems; the latest release adds automated resource planning and dynamic scaling.

The release centers on two tools: the Dynamo Planner Profiler and the SLO-based Dynamo Planner. Together they address the problem of "rate matching" in disaggregated serving, where inference workloads are split into prefill operations (processing input context) and decode operations (generating output tokens), each running on a separate GPU pool. Without these tools, developers have to work out the right GPU allocation for each phase by hand, a process that can consume a great deal of time and GPU resources.
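The rate-matching problem described above can be reduced to simple arithmetic once per-worker capacity is known. The following is a minimal sketch, not Dynamo's actual logic: the function name `matched_workers` and all throughput figures are hypothetical, and real capacities would come from profiling.

```python
import math

def matched_workers(request_rate, prefill_rate_per_worker, decode_rate_per_worker):
    """Return (prefill_workers, decode_workers) needed to sustain request_rate.

    All rates are in requests per second. They are hypothetical inputs here;
    in practice they would be measured for a given model and GPU type.
    """
    prefill = math.ceil(request_rate / prefill_rate_per_worker)
    decode = math.ceil(request_rate / decode_rate_per_worker)
    # Keep at least one worker in each pool so the pipeline stays connected.
    return max(prefill, 1), max(decode, 1)

# Hypothetical figures: one prefill worker sustains 4 req/s, one decode worker 10 req/s.
workers = matched_workers(request_rate=9.0,
                          prefill_rate_per_worker=4.0,
                          decode_rate_per_worker=10.0)
print(workers)  # (3, 1)
```

With these made-up numbers the prefill pool is the bottleneck, so it needs three workers while a single decode worker keeps up. If either pool is undersized, the other sits idle, which is the mismatch that rate matching is meant to avoid.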

The Dynamo Planner Profiler is a pre-deployment simulation tool. It automates the search for the best configuration, so teams no longer have to test parallelization strategies and GPU counts by hand. Developers define their requirements in a DynamoGraphDeploymentRequest (DGDR) manifest, and the profiler runs an automated sweep of the configuration space, testing different tensor parallelism sizes for both the prefill and decode stages. The sweep selects settings that maximize throughput while staying within the specified latency limits.
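The kind of sweep described above can be sketched as a small grid search. This is an illustration only and assumes nothing about the profiler's internals: `toy_profile` is an invented stand-in for a real measurement, its formulas are made up, and `sweep` is a hypothetical helper name.

```python
from itertools import product

def toy_profile(prefill_tp, decode_tp):
    """Stand-in for a real measurement; the formulas are invented for illustration.

    Returns (ttft_ms, itl_ms, tokens_per_s_per_gpu).
    """
    ttft_ms = 1200.0 / prefill_tp   # more prefill parallelism -> faster first token
    itl_ms = 60.0 / decode_tp       # more decode parallelism -> faster token stream
    gpus = prefill_tp + decode_tp
    tokens_per_s_per_gpu = 1000.0 / gpus ** 0.5  # toy efficiency penalty for larger setups
    return ttft_ms, itl_ms, tokens_per_s_per_gpu

def sweep(tp_sizes, ttft_limit_ms, itl_limit_ms):
    """Try every (prefill_tp, decode_tp) pair and keep the most efficient one within limits."""
    best = None
    for prefill_tp, decode_tp in product(tp_sizes, repeat=2):
        ttft, itl, throughput = toy_profile(prefill_tp, decode_tp)
        if ttft > ttft_limit_ms or itl > itl_limit_ms:
            continue  # violates a latency limit, so it is not a candidate
        if best is None or throughput > best[2]:
            best = (prefill_tp, decode_tp, throughput)
    return best

best = sweep(tp_sizes=(1, 2, 4, 8), ttft_limit_ms=500.0, itl_limit_ms=30.0)
print(best[:2])  # (4, 2)
```

Under this toy model, the smallest configuration that satisfies both limits wins because per-GPU throughput falls as the GPU count grows. A real sweep would replace `toy_profile` with measured or simulated numbers.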

The profiler also offers an AI Configurator mode, which simulates performance in roughly 20 to 30 seconds using pre-measured performance data. This lets teams iterate on configurations without committing physical GPU resources. The output is a configuration tuned for "Goodput": the highest throughput the system can sustain while staying within the specified Time to First Token and Inter-Token Latency bounds.
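The article does not describe how the simulation uses its pre-measured data. One plausible approach, shown here purely as an assumption, is to interpolate between measured points. The table `MEASURED_TTFT` and the function `estimate_ttft_ms` are hypothetical, and the numbers are invented.

```python
from bisect import bisect_left

# Hypothetical pre-measured points: (input length in tokens, TTFT in ms).
MEASURED_TTFT = [(512, 80.0), (2048, 260.0), (8192, 1100.0)]

def estimate_ttft_ms(input_tokens, table=MEASURED_TTFT):
    """Estimate TTFT by linear interpolation between measured points.

    Inputs outside the measured range are clamped to the nearest endpoint.
    """
    xs = [x for x, _ in table]
    if input_tokens <= xs[0]:
        return table[0][1]
    if input_tokens >= xs[-1]:
        return table[-1][1]
    i = bisect_left(xs, input_tokens)          # index of first measured length >= input
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (input_tokens - x0) / (x1 - x0)

print(estimate_ttft_ms(1280))  # 170.0
```

Because an estimate like this is a table lookup rather than a GPU run, it returns almost immediately, which is consistent with the short simulation times reported for this mode.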

Once the system is in production, the SLO-based Dynamo Planner takes over as a runtime orchestration engine. Unlike a traditional load balancer, it is "LLM-aware": it monitors cluster state, including key-value (KV) cache load in the decode pool and prefill queue depth. Using the performance bounds produced by the profiler, the Planner scales prefill and decode workers up or down to keep meeting service level goals as traffic patterns change.
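A scaling decision driven by the two signals named above might look like the following. This is a minimal sketch under stated assumptions, not the Planner's algorithm: the `ClusterState` fields, the `plan` function, and every threshold are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ClusterState:
    prefill_workers: int
    decode_workers: int
    prefill_queue_depth: int     # requests waiting for prefill
    kv_cache_utilization: float  # fraction of decode KV cache in use, 0.0 to 1.0

def plan(state, max_queue_per_prefill_worker=8, kv_high=0.9, kv_low=0.4):
    """Return the target (prefill_workers, decode_workers) for the next interval.

    Thresholds are illustrative; a real planner would derive them from profiling.
    """
    prefill, decode = state.prefill_workers, state.decode_workers

    # Prefill pool: grow when the queue backs up, shrink when it is empty.
    if state.prefill_queue_depth > max_queue_per_prefill_worker * prefill:
        prefill += 1
    elif state.prefill_queue_depth == 0 and prefill > 1:
        prefill -= 1

    # Decode pool: grow when KV cache is nearly full, shrink when mostly idle.
    if state.kv_cache_utilization > kv_high:
        decode += 1
    elif state.kv_cache_utilization < kv_low and decode > 1:
        decode -= 1

    return prefill, decode

# A prefill backlog with moderate KV cache load scales prefill only.
print(plan(ClusterState(1, 1, prefill_queue_depth=20, kv_cache_utilization=0.55)))  # (2, 1)
```

Treating the two pools independently is the point of the design: a surge of long prompts stresses prefill without necessarily filling the decode pool's KV cache, so only one side needs to grow.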

To illustrate these capabilities, the announcement walks through a scenario involving an airline assistant powered by the Qwen3-32B-FP8 model. The deployment operates under strict service level agreements: 500 milliseconds for Time to First Token and 30 milliseconds for Inter-Token Latency. During normal operations with short passenger queries, the system runs with a single prefill worker and a single decode worker. When a weather disruption triggers a surge of complex rerouting requests from 200 users, the Planner scales up to two prefill workers while keeping one decode worker, so the system stays within its latency targets during the traffic spike.

This release builds on the framework introduced in the original Dynamo announcement, which InfoQ covered in December 2025. That article described how Dynamo's design separates compute-intensive and memory-bound tasks across multiple GPUs, letting teams optimize each phase independently and match resources to workload requirements. In an e-commerce app, for instance, the prefill task might process thousands of tokens, while the decode task generates concise descriptions.

The move from manual setup to automated, SLO-driven resource management is a significant step for running large language model deployments on Kubernetes. The Planner components translate latency requirements into GPU allocation and scaling decisions, reducing the operational burden of running disaggregated inference architectures. This automation is particularly useful for organizations working with reasoning-intensive or long-context LLMs, where it simplifies the management of complex multi-node GPU setups and helps keep service level goals met as traffic patterns vary.


Author: Arline Emard IV
