Optimize Performance

Use this guide when a deployment is too slow or can’t handle enough concurrent requests.

How the GPU spends its time

Every inference request runs in two phases. They stress the hardware in opposite ways, and almost every tuning decision maps back to one of them.

Prefill is when the model reads your prompt. All input tokens are processed together in large matrix multiplications, which keeps the GPU’s compute units busy. This phase is compute-bound (FLOPS-bound). The time it takes is what users experience as Time to First Token (TTFT).

Decode is when the model generates the reply, one token at a time. To produce each token, the GPU reads the full set of model weights from memory, does a small calculation, then waits for the next memory read. The math is fast. The memory transfer isn’t. Decode is memory-bandwidth-bound, and its speed is what users experience as Inter-Token Latency (ITL).

That split matters because the fixes are different:

Symptom | Bottleneck | Fix
First token arrives slowly | FLOPS | Reduce prompt length, use prefix caching, choose a higher-compute GPU class
Tokens stream slowly | Memory bandwidth | Quantize weights and KV-cache, pool bandwidth across more GPUs
Deployment crashes at startup | VRAM capacity | Reduce context length, quantize, or increase GPU count
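
To make the two bottlenecks concrete, here is a rough back-of-the-envelope sketch. It assumes the common approximations that decode reads every weight byte once per token and that prefill costs about 2 FLOPs per parameter per prompt token. The hardware figures (3.35e12 B/s of bandwidth, 1e15 FLOP/s of usable compute) are illustrative assumptions, not specifications of any GPU type offered here, and real deployments will land above these floors.

```python
def decode_floor_ms(weight_bytes: float, mem_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on time per generated token: every weight byte is read once."""
    return weight_bytes / mem_bandwidth_bytes_per_s * 1000.0


def prefill_floor_ms(params: float, prompt_tokens: int, flops_per_s: float) -> float:
    """Lower bound on prefill time, assuming ~2 FLOPs per parameter per prompt token."""
    return 2.0 * params * prompt_tokens / flops_per_s * 1000.0


# Assumed example: 70B parameters stored in FP16 (2 bytes each).
PARAMS = 70e9
ASSUMED_BANDWIDTH = 3.35e12  # bytes/s, illustrative only
ASSUMED_COMPUTE = 1e15       # FLOP/s, illustrative only

itl_floor = decode_floor_ms(PARAMS * 2, ASSUMED_BANDWIDTH)
ttft_floor = prefill_floor_ms(PARAMS, 4096, ASSUMED_COMPUTE)
print(f"decode floor: {itl_floor:.1f} ms/token")
print(f"prefill floor for a 4096-token prompt: {ttft_floor:.0f} ms")
```

The decode floor depends only on bytes and bandwidth, and the prefill floor only on prompt length and compute, which is why the fixes in the table differ.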

Context length is a strong lever

The KV-cache stores attention states for every token in the active context window. It lives in VRAM alongside the model weights, and it grows with every concurrent request. A model that looks small on a spec sheet can exhaust memory once you account for the cache.

For a 70B model in FP16, each request at 4K context uses roughly 1.25 GB of KV-cache. At 128K context, that grows to roughly 40 GB per request, while the model weights stay the same size.
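
The figures above can be reproduced with a short calculation. This sketch assumes a 70B-class architecture with 80 layers, 8 key/value heads (grouped-query attention), and a head dimension of 128. Those values are assumptions chosen for illustration, so check your model's configuration before relying on the result.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """KV-cache size for one request; the leading 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Assumed architecture for a 70B-class model (not taken from this guide).
ASSUMED_MODEL = dict(layers=80, kv_heads=8, head_dim=128)

for context in (4096, 131072):
    size_gib = kv_cache_bytes(context, **ASSUMED_MODEL) / GIB
    print(f"{context:>6} tokens: {size_gib:.2f} GiB per request")
```

Passing `bytes_per_value=1` models an 8-bit KV-cache and halves both numbers.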

If your workload doesn’t need the model’s full context window, cap it:

exo dedicated-inference deployment create "$DEPLOYMENT" \
  --model-name "$MODEL" \
  --gpu-type gpu3 \
  --gpu-count 2 \
  --replicas 1 \
  --inference-engine-params '--max-model-len=32768' \
  -z "$ZONE"
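
As a rough planning aid, the sketch below estimates how many full-length requests fit in memory for a given context cap. All inputs are assumptions for illustration: two GPUs with 80 GiB each, about 130 GiB of FP16 weights, 10% of memory held back for activations and overhead, and a per-token KV-cache cost of 327,680 bytes (the 70B-class figure used earlier in this section).

```python
GIB = 1024 ** 3


def max_full_length_requests(total_vram_gib: float, weights_gib: float,
                             kv_bytes_per_token: int, max_model_len: int,
                             reserve_fraction: float = 0.1) -> int:
    """Number of requests that fit if each one fills the whole context window."""
    usable_gib = total_vram_gib * (1 - reserve_fraction) - weights_gib
    if usable_gib <= 0:
        return 0  # the weights alone do not fit
    per_request_gib = kv_bytes_per_token * max_model_len / GIB
    return int(usable_gib // per_request_gib)


# Assumed deployment: 2 x 80 GiB GPUs, ~130 GiB of weights.
for cap in (32768, 131072):
    fits = max_full_length_requests(160, 130, 327_680, cap)
    print(f"max-model-len={cap}: {fits} full-length request(s)")
```

Under these assumptions the 32K cap leaves room for one worst-case request and the 128K window leaves room for none, which is the kind of gap that shows up as a startup crash or a tiny batch size.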

Quantization

Since decode is bottlenecked by memory reads, storing weights and KV-cache values in a smaller format directly translates to faster generation.

To quantize the KV-cache at deployment time:

--inference-engine-params '--kv-cache-dtype=fp8'

For weight quantization, if the model has already been published in AWQ or GPTQ format, you don’t need to pass any flag because vLLM can detect it automatically. If you want to request quantization at runtime:

--inference-engine-params '--quantization=awq'
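
To see why a smaller format helps, compare how many bytes each decode step has to read. This sketch assumes a 70B-parameter model and ignores the small overhead of quantization scales and zero points, so treat the output as approximate.

```python
# Approximate storage cost per parameter for common formats.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params: float, fmt: str) -> float:
    """Approximate weight footprint in GB, ignoring quantization metadata."""
    return params * BYTES_PER_PARAM[fmt] / 1e9


PARAMS = 70e9  # assumed model size
baseline = weight_gb(PARAMS, "fp16")
for fmt in BYTES_PER_PARAM:
    size = weight_gb(PARAMS, fmt)
    print(f"{fmt}: {size:.0f} GB, {size / baseline:.2f}x the bytes read per token")
```

Since decode time tracks bytes read, a 4-bit format such as AWQ reads about a quarter of the FP16 weight data per token, although the realized speedup is usually smaller once dequantization and KV-cache reads are included.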

Multi-GPU considerations

Raising --gpu-count does two things: it splits model weights across cards so larger models fit, and it pools memory bandwidth across all cards.

The default strategy is Tensor Parallelism. Each layer’s weight matrices get sliced across all GPUs. Every card processes every token, but only its slice of each layer. After each layer, all cards synchronize to merge their partial results before moving to the next layer.

That synchronization is a cost to weigh. On NVLink-connected GPUs, each synchronization runs at around 900 GB/s and costs single-digit milliseconds per pass. Over a PCIe connection, those same synchronizations can add more latency than the compute itself, so adding more GPUs can make things slower.
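
The slice-then-merge pattern can be illustrated without any GPU. The sketch below is a toy, pure-Python model of one layer: each simulated card holds a slice of the weight matrix, computes a partial result, and the partials are summed, which stands in for the all-reduce. It is not how an inference engine implements this, only an illustration of why the merge step is unavoidable.

```python
def matvec(weight, x):
    """Plain matrix-vector product: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def tensor_parallel_matvec(weight, x, num_gpus):
    """Split the input dimension across simulated GPUs, then sum the partials."""
    shard = len(x) // num_gpus  # assumes len(x) is divisible by num_gpus
    partials = []
    for gpu in range(num_gpus):
        lo, hi = gpu * shard, (gpu + 1) * shard
        weight_slice = [row[lo:hi] for row in weight]  # this card's slice
        partials.append(matvec(weight_slice, x[lo:hi]))
    # Stand-in for the all-reduce: every card needs the element-wise sum.
    return [sum(values) for values in zip(*partials)]


W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 0, 2, 1]
print(tensor_parallel_matvec(W, x, num_gpus=2))  # [11, 27]
print(matvec(W, x))                              # [11, 27]
```

No card holds a usable output until the sum completes, so the interconnect sits on the critical path of every layer.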

A strategy worth knowing for MoE models specifically is Expert Parallelism (--enable-expert-parallel). MoE models contain dozens to hundreds of expert networks, but only a subset of them fire per token. Rather than replicating all experts on every GPU, EP assigns whole experts to specific GPUs. GPU 0 owns experts 0-15, GPU 1 owns experts 16-31, and so on. When a token is routed to expert 20, it gets dispatched to GPU 1 and the result is returned. Each card stores and computes only its own experts. That cuts memory pressure, keeps matrix operations at a shape the hardware executes efficiently, and minimizes GPU-to-GPU communication.
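
The expert-to-GPU mapping described above can be sketched in a few lines. The contiguous 16-experts-per-GPU layout mirrors the example in the text; the function names are hypothetical, and real engines may place experts differently.

```python
from collections import defaultdict


def owner_gpu(expert_id: int, experts_per_gpu: int = 16) -> int:
    """Contiguous placement: GPU 0 owns experts 0-15, GPU 1 owns 16-31, and so on."""
    return expert_id // experts_per_gpu


def dispatch(token_routes, experts_per_gpu: int = 16):
    """Group (token, expert) pairs by the GPU that owns the chosen expert."""
    per_gpu = defaultdict(list)
    for token, expert_id in token_routes:
        per_gpu[owner_gpu(expert_id, experts_per_gpu)].append((token, expert_id))
    return dict(per_gpu)


# Three tokens, each already routed to one expert by the model's router.
routes = [("t0", 20), ("t1", 3), ("t2", 16)]
print(owner_gpu(20))     # 1
print(dispatch(routes))  # {1: [('t0', 20), ('t2', 16)], 0: [('t1', 3)]}
```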

Replicas

When the model fits and generates correctly but requests are queueing, add replicas:

exo dedicated-inference deployment scale "$DEPLOYMENT" 3 -z "$ZONE"

Replicas are fully independent. No synchronization, no routing, no communication overhead between them. Just more copies of the same deployment handling separate requests in parallel.

Prefix caching

If your application sends the same system prompt with every request, as most chatbots and RAG pipelines do, then without prefix caching the model recomputes the KV-cache for that prompt from scratch every single time.

Prefix caching fixes that. On the first request, vLLM computes the KV-cache for the shared prefix and stores it. Every subsequent request with the same prefix reuses those cached blocks, skipping prefill entirely for the shared portion. For lengthy prompts at scale, that’s a significant reduction in both TTFT and GPU time per request.

Enable it with:

--inference-engine-params '--enable-prefix-caching'
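
The idea can be sketched as a block-level cache keyed by a hash of each block plus everything before it. This is a simplified illustration, not vLLM's implementation: the class name, the block size of 4, and the hashing scheme are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per cache block; kept tiny for the demonstration


class PrefixCache:
    """Toy prefix cache that remembers which full token blocks it has seen."""

    def __init__(self):
        self.blocks = set()

    def prefill_tokens_needed(self, token_ids):
        """Return how many tokens still need prefill, registering new blocks."""
        parent = b""
        cached = 0
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for start in range(0, full, BLOCK_SIZE):
            block = token_ids[start:start + BLOCK_SIZE]
            # Chaining the parent hash means a hit implies the whole prefix matches.
            key = hashlib.sha256(parent + repr(block).encode()).digest()
            if key in self.blocks:
                cached += BLOCK_SIZE
            else:
                self.blocks.add(key)
            parent = key
        return len(token_ids) - cached


cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
print(cache.prefill_tokens_needed(system_prompt + [100, 101]))       # 10
print(cache.prefill_tokens_needed(system_prompt + [200, 201, 202]))  # 3
```

The second request only pays for its three new tokens. A request whose prefix differs by even one leading token misses every block, which is why a stable system prompt placed at the start of the request matters.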

Speculative decoding

The GPU’s compute units sit mostly idle during decode because the workload is memory-bound. Speculative decoding puts that spare compute to use.

A small draft model generates several tokens ahead. The main model validates all of them in a single forward pass. Validation uses roughly the same memory bandwidth as generating one token, so accepted draft tokens are nearly free.

--inference-engine-params '--speculative-config={"method":"mtp","num_speculative_tokens":3}'

The speedup depends on how often the draft guesses correctly. High-temperature sampling, diverse topics, and large batch sizes all reduce the acceptance rate and shrink the benefit.
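
A simple model shows how sensitive the benefit is to the acceptance rate. The sketch below assumes each draft token is accepted independently with the same probability, which is an idealization, and uses the resulting geometric-series estimate for tokens produced per verification pass. Real acceptance varies by position and prompt.

```python
def expected_tokens_per_pass(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per main-model pass under independent acceptance.

    Drafts are accepted left to right until the first rejection, and the main
    model always contributes one token of its own, so the range is 1 to k + 1.
    """
    a, k = acceptance_rate, num_speculative_tokens
    if a >= 1.0:
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)


for rate in (0.9, 0.8, 0.5, 0.2):
    tokens = expected_tokens_per_pass(rate, num_speculative_tokens=3)
    print(f"acceptance {rate:.0%}: {tokens:.2f} tokens per pass")
```

With three speculative tokens, this model gives close to 3 tokens per pass at 80% acceptance but under 2 at 50%, before subtracting the cost of running the draft.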

Measure before and after

You can use a tool like AIPerf to benchmark your dedicated inference endpoints. It drives concurrent load against any OpenAI-compatible endpoint and reports TTFT, ITL, and throughput metrics that map directly to the bottlenecks covered in this guide.

aiperf profile \
  --url "https://<your-endpoint-url>" \
  --endpoint "/v1/chat/completions" \
  --endpoint-type chat \
  --model "<your-model-name>" \
  --header "Authorization:Bearer <your-api-key>" \
  --tokenizer "<your-model-name>" \
  --tokenizer-trust-remote-code \
  --streaming \
  --concurrency 10 \
  --request-count 100

Flags and default values change between AIPerf releases. Read the AIPerf documentation for the current recommended way to profile endpoints before running any benchmarks.
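
If you want to sanity-check reported numbers, both metrics can be derived from token arrival timestamps on a streamed response. This sketch uses a hypothetical helper and plain Python lists of timestamps in seconds. It reflects the usual definitions of TTFT and ITL, and a benchmarking tool's exact definitions may differ in detail.

```python
from statistics import mean


def summarize_latency(request_start: float, token_times: list) -> dict:
    """Compute TTFT and mean ITL from the arrival time of each streamed token."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens; empty when only one token arrived.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {"ttft_s": ttft, "itl_s": mean(gaps) if gaps else 0.0}


# Request sent at t=0.0; the first token arrives at 0.5 s, then one every 0.25 s.
print(summarize_latency(0.0, [0.5, 0.75, 1.0, 1.25]))
# {'ttft_s': 0.5, 'itl_s': 0.25}
```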

Read the results against what you now know about the two phases:

  • TTFT high and rising with concurrency (prefill-bound). More concurrent requests means more prefill work competing for FLOPS. Trim prompts to reduce compute per request, enable prefix caching to skip recomputing shared prompt prefixes, or move to a higher-compute GPU class.
  • ITL high regardless of concurrency (memory-bandwidth-bound). Every decode step loads the full model weights from memory to generate a single token. Even one request hits this ceiling. Quantization shrinks weight size so each step reads less data. A GPU with higher memory bandwidth completes each read faster.
  • Throughput lower than expected (VRAM capacity-bound or GPU-to-GPU communication-bound). If the KV-cache is full, vLLM can’t grow the batch further. Quantize model weights to free VRAM for the cache, or quantize the KV-cache itself so each entry takes less room. Reduce context length to shrink each request’s cache footprint. If adding GPUs didn’t help or made things worse, the bottleneck may be synchronization overhead rather than capacity: every layer requires a synchronization across all cards, and over PCIe that cost compounds fast. On MoE models, consider Expert Parallelism to spread expert weights across cards.
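
The first two rules above can be turned into a rough triage helper that compares a low-concurrency run with a high-concurrency run. The function name, input shape, and both thresholds are hypothetical placeholders chosen for illustration. Tune them to your own latency targets rather than treating them as recommendations.

```python
def diagnose(low: dict, high: dict, ttft_growth: float = 1.5,
             itl_ceiling_ms: float = 50.0) -> list:
    """Label likely bottlenecks from two benchmark runs.

    `low` and `high` hold "ttft_ms" and "itl_ms" measured at low and high
    concurrency. Both thresholds are illustrative assumptions.
    """
    findings = []
    # TTFT that climbs with concurrency points at prefill competing for FLOPS.
    if high["ttft_ms"] > ttft_growth * low["ttft_ms"]:
        findings.append("prefill-bound")
    # ITL that is already high at low concurrency points at memory bandwidth.
    if low["itl_ms"] > itl_ceiling_ms:
        findings.append("memory-bandwidth-bound")
    return findings or ["no obvious bottleneck"]


print(diagnose({"ttft_ms": 200, "itl_ms": 20}, {"ttft_ms": 900, "itl_ms": 25}))
# ['prefill-bound']
print(diagnose({"ttft_ms": 200, "itl_ms": 80}, {"ttft_ms": 220, "itl_ms": 82}))
# ['memory-bandwidth-bound']
```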

Check logs after any change

After updating parameters, check that the deployment came back up:

exo dedicated-inference deployment show "$DEPLOYMENT" -z "$ZONE"

If it doesn’t reach ready, read the logs:

exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE" --tail 100

Look for invalid parameter errors, CUDA out of memory, model loading failures, and vLLM tracebacks. A deployment stuck in deploying might be loading a large model slowly, or it might have crashed during engine startup. The logs are the only way to tell the difference.

For more on diagnosing failed deployments, see Monitor and Troubleshoot.
