
Optimize Performance

This guide covers techniques to optimize inference performance for your Dedicated Inference deployments.

Performance Optimization Options

Several techniques can improve inference speed:

Technique | Benefit | Trade-off
Context length tuning | Reduced memory usage, faster processing | Limits input/output length
Quantization | Lower memory, faster inference | Slight quality reduction
KV-cache optimization | Improved TTFT by reusing cached prefixes | Configuration complexity
Compilation optimization | Faster CUDA execution | Longer startup time
Speculative decoding | 1.5-3× faster generation | Requires compatible model pair

Why memory footprint matters: Lower memory usage lets the inference engine handle more requests in parallel, improving overall throughput.

Understanding GPU Memory Usage

GPU memory consumption during inference has two main components:

Total GPU memory = model weights + KV-cache

  • Model weights are constant for a given model and precision. An 8B-parameter model in FP16 uses ~16 GB, in FP8 ~8 GB, and in FP4 ~4 GB.
  • KV-cache stores attention keys and values for the context window. It grows linearly with context length. Doubling the context roughly doubles the KV-cache memory.

This means that a model’s parameter count alone does not determine GPU requirements. Two critical factors also apply:

  1. Precision: Running a model in FP32 is never worth it for inference; FP16 is sufficient for most use cases. FP8 works well in practice if your hardware supports it, and FP4 offers significant cost and performance benefits, especially on newer hardware like the RTX Pro 6000, though with some quality loss.
  2. Context length: Modern models commonly support 128k–256k token context windows, with some reaching 1M+ (Qwen3-Coder) or even 10M (Llama 4 Scout). Serving the full context window can require far more memory than the model weights alone.
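
As a rough sketch, you can estimate both terms with a few lines of arithmetic. The architecture numbers below (layer count, KV heads, head dimension) are illustrative placeholders rather than any specific model's configuration; read the real values from the model's config.json:

GIB = 1024 ** 3

def weight_bytes(n_params, bytes_per_param):
    # Model weights: parameter count x bytes per parameter (FP16=2, FP8=1, FP4=0.5)
    return n_params * bytes_per_param

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    # Keys and values (factor 2) for every layer, KV head, and token in the window
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative 8B model: FP8 weights, FP16 KV-cache, 256k-token context window
weights = weight_bytes(8e9, bytes_per_param=1)
kv = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                    context_len=256 * 1024, bytes_per_elem=2)
print(f"weights ~{weights / GIB:.0f} GiB, "
      f"KV-cache ~{kv / GIB:.0f} GiB, "
      f"total ~{(weights + kv) / GIB:.0f} GiB")

Halving the context length or storing the KV-cache in FP8 halves the second term, which is why those two knobs are the first things to try when a deployment does not fit.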

Real-World Example: Ministral 8B on GPUA5000

mistralai/Ministral-3-8B-Instruct-2512 is an 8B model distributed in FP8 (~8 GB weights), which you might expect to fit easily on a single GPUA5000 (24 GB). However, its 256k default context length requires a KV-cache of ~32 GB in FP16, bringing the total to ~40 GB, well beyond what a single GPUA5000 can handle.

Several options can make it fit:

Strategy | Configuration | GPU Requirement
Increase GPUs | Default settings | 4× GPUA5000
Reduce context length | --max-model-len=60000 | 1× GPUA5000
FP8 KV-cache + moderate context | --kv-cache-dtype=fp8 --max-model-len=131072 | 1× GPUA5000

This illustrates the trade-offs between GPU cost, context window size, and precision. Always check a model's default precision and context length before choosing your GPU configuration.

All options are configured via --inference-engine-params. See available parameters:

exo dedicated-inference deployment create --inference-engine-parameter-help

Context Length Tuning

Limit the maximum context length to reduce memory usage and improve throughput:

exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--max-model-len=32768' \
  -z at-vie-2

A value of 32k-64k is a reasonable default for most workloads. Use this when your workload doesn’t need the model’s full context window.

Important: Most modern models default to context windows of 128k–256k tokens, with some supporting 1M+ tokens. If your deployment fails with out-of-memory errors despite the model appearing small enough, reducing the context length is often the fastest fix. See Understanding GPU Memory Usage for details.
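
To pick a sensible --max-model-len, measure how long your prompts actually are. A minimal sketch using the Hugging Face transformers tokenizer for the deployed model (the sample prompts are placeholders; gated models may require a Hugging Face token):

from transformers import AutoTokenizer

# Use the tokenizer of the model you deploy so counts match the engine's
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

# Placeholder: substitute a representative sample of your real prompts
sample_prompts = [
    "Summarize the following support ticket: ...",
    "Translate this paragraph into German: ...",
]

lengths = [len(tok.encode(p)) for p in sample_prompts]
print(f"longest prompt: {max(lengths)} tokens")
# Choose --max-model-len with headroom: longest prompt plus the number of
# completion tokens you expect to generate, rounded up.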

Quantization

Run quantized models for lower memory usage and faster inference:

--inference-engine-params '--quantization=awq'

Common quantization methods: awq, gptq, bitsandbytes. The model must be pre-quantized in that format.

You can also apply runtime quantization to models that are not pre-quantized. This is particularly useful for models distributed in FP16 or BF16:

--inference-engine-params '--quantization=fp8 --kv-cache-dtype=fp8'
  • --quantization=fp8: Quantizes model weights to FP8 at load time, halving memory compared to FP16.
  • --kv-cache-dtype=fp8: Stores the KV-cache in FP8, halving KV-cache memory and roughly doubling the number of tokens that can be held in memory.

Combining lower precision with a reduced context length is an effective way to fit large-context models on fewer GPUs. See the Ministral 8B example above.

For more details on supported quantization methods, see the vLLM quantization documentation.

KV-Cache Optimization

The KV-cache stores the attention keys and values of previously processed tokens so the model does not recompute them, which improves time-to-first-token (TTFT). Enable prefix caching to reuse cached computations across requests that share a prefix:

--inference-engine-params '--gpu-memory-utilization=0.9 --enable-prefix-caching'
  • --gpu-memory-utilization: Fraction of GPU memory the engine may use for weights and KV-cache (default 0.9)
  • --enable-prefix-caching: Reuse KV-cache for repeated prefixes (useful for system prompts)
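
Prefix caching pays off when many requests begin with the same tokens, such as a fixed system prompt. A minimal sketch of structuring requests that way, assuming the deployment exposes an OpenAI-compatible endpoint (typical for vLLM-based engines) and using placeholder endpoint and key values:

from openai import OpenAI

# Placeholders: use the endpoint from `deployment show` and the key from
# `deployment reveal-api-key`
client = OpenAI(base_url="https://<your-deployment-endpoint>/v1",
                api_key="<your-api-key>")

# Keep static instructions in a shared system prompt so every request starts
# with an identical token prefix that can be served from the cache
SYSTEM_PROMPT = "You are a support assistant for ACME Corp. Answer concisely."

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.3",  # model served by the deployment
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # cached prefix
            {"role": "user", "content": question},         # varies per request
        ],
    )
    return resp.choices[0].message.content

print(ask("How do I reset my password?"))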

Compilation Optimization

Enable CUDA graph compilation for faster execution after initial warmup:

--inference-engine-params '--compilation-config={"level":3}'

Compilation levels:

  • Level 0: No optimization (default in V0)
  • Level 3: Recommended for production, enables torch.compile and CUDA graph optimizations (default in V1)

You can also use the shorthand -O3 syntax. Higher levels increase startup time but improve inference speed.

Additional Optimizations

Other parameters to explore via --inference-engine-parameter-help:

  • --max-num-batched-tokens: Maximum tokens processed in a single iteration. Tune this to balance throughput and latency.
  • --scheduling-policy: Controls request scheduling order. Options: fcfs (first come first served, default) or priority (based on request priority).
  • --enable-chunked-prefill: Allows prefill requests to be chunked, improving latency for concurrent requests.
  • --cpu-offload-gb: Offload part of the model to CPU memory to run larger models on smaller GPUs (requires fast CPU-GPU interconnect).

Speculative Decoding

Speculative decoding provides the largest performance gains (1.5-3×) by using two models: a small “draft” model generates candidate tokens quickly, then the large “target” model validates them in a single pass.

How It Works

  1. Draft model generates 5-10 candidate tokens
  2. Target model validates all candidates in one forward pass
  3. Valid tokens are kept; invalid tokens are corrected
  4. Process repeats until completion

Since the draft model's candidates are accepted 70-90% of the time, each target forward pass typically yields several tokens instead of one, eliminating many expensive target-model passes.
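
The following toy sketch illustrates the loop. The draft and target functions are stand-ins for real models, and acceptance is simulated with a fixed probability rather than computed from model outputs:

import random

def draft_propose(prefix, k):
    # Stand-in for the small draft model: cheaply proposes k candidate tokens
    return [f"tok{len(prefix) + i}" for i in range(k)]

def target_validate(prefix, candidates):
    # Stand-in for the large target model: one forward pass scores all
    # candidates, keeps them until the first rejection, then emits its own token
    accepted = []
    for tok in candidates:
        if random.random() < 0.8:  # simulate ~80% acceptance
            accepted.append(tok)
        else:
            accepted.append(f"corrected{len(prefix) + len(accepted)}")
            break
    return accepted

prefix, target_passes = [], 0
while len(prefix) < 64:
    prefix += target_validate(prefix, draft_propose(prefix, k=5))
    target_passes += 1
print(f"{len(prefix)} tokens generated with only {target_passes} target passes")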

When to Use

Recommended for:

  • Long-form generation (articles, code, documentation)
  • High-throughput production APIs
  • Latency-sensitive applications

Not recommended for:

  • Very short responses (<50 tokens)
  • Classification tasks
  • Mismatched model families

Choosing Model Pairs

Use models from the same family with a 5-10× size difference:

Target Model | Draft Model | Expected Speedup
meta-llama/Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-8B-Instruct | 2-3×
mistralai/Mistral-Large-2 | mistralai/Mistral-7B-Instruct-v0.3 | 1.5-2.5×

Both models must use the same tokenizer.
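
One way to sanity-check a candidate pair before deploying is to compare their tokenizers directly. A minimal sketch with Hugging Face transformers (both Llama models are gated, so a configured Hugging Face token is assumed):

from transformers import AutoTokenizer

target = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
draft = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Identical vocabularies and identical encodings of a sample string are a
# good indication that the pair is usable for speculative decoding
sample = "Speculative decoding validates draft tokens in a single pass."
print("same vocab size:", target.vocab_size == draft.vocab_size)
print("same token ids: ", target.encode(sample) == draft.encode(sample))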

Deployment Steps

1. Create both models:

exo dedicated-inference model create meta-llama/Llama-3.1-70B-Instruct \
  --huggingface-token <token> \
  -z at-vie-2

exo dedicated-inference model create meta-llama/Llama-3.1-8B-Instruct \
  --huggingface-token <token> \
  -z at-vie-2

2. Create deployment with speculative decoding:

exo dedicated-inference deployment create fast-inference \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpurtx6000pro \
  --gpu-count 2 \
  --replicas 1 \
  --inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
  -z at-vie-2

Configuration parameters:

  • method: Speculative decoding method (e.g., "eagle3")
  • model: Draft model name
  • num_speculative_tokens: Tokens to generate per speculation (typically 5-10)

3. Wait for deployment and test:

exo dedicated-inference deployment show fast-inference -z at-vie-2
exo dedicated-inference deployment reveal-api-key fast-inference -z at-vie-2

GPU Memory Requirements

Both models must fit in GPU memory simultaneously.

Example: Llama-3.1-70B + Llama-3.1-8B

  • Target model: ~40 GB
  • Draft model: ~6 GB
  • Total: ~46 GB
  • Use: GPURTX6000pro (96 GB) or 2× GPUA5000

If deployment fails with “out of memory”, increase --gpu-count or use a larger GPU.

Monitoring Performance

Check logs for acceptance rate and throughput:

exo dedicated-inference deployment logs fast-inference -z at-vie-2

Target metrics:

  • Acceptance rate > 70%
  • 1.5-3× speedup vs baseline

If speedup is low:

  • Verify models are from the same family
  • Try a different draft model
  • Check GPU memory isn’t constrained
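
To quantify the speedup, you can also measure end-to-end generation throughput from the client side and compare a baseline deployment against the speculative one. A minimal sketch, assuming an OpenAI-compatible endpoint and placeholder endpoint/key values:

import time
from openai import OpenAI

client = OpenAI(base_url="https://<your-deployment-endpoint>/v1",
                api_key="<your-api-key>")

def tokens_per_second(prompt: str, max_tokens: int = 512) -> float:
    # Time a single non-streaming completion and divide generated tokens by wall time
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    return resp.usage.completion_tokens / elapsed

# Run the same prompt against both deployments and compare the two numbers
print(f"{tokens_per_second('Write a detailed tutorial on Rust ownership.'):.1f} tokens/s")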

Complete Example

# 1. Create models
exo dedicated-inference model create meta-llama/Llama-3.1-70B-Instruct \
  --huggingface-token hf_xxxxx -z at-vie-2

exo dedicated-inference model create meta-llama/Llama-3.1-8B-Instruct \
  --huggingface-token hf_xxxxx -z at-vie-2

# 2. Wait for models
exo dedicated-inference model list -z at-vie-2

# 3. Deploy with speculative decoding
exo dedicated-inference deployment create llama-fast \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpurtx6000pro \
  --gpu-count 2 \
  --replicas 1 \
  --inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
  -z at-vie-2

# 4. Test
exo dedicated-inference deployment show llama-fast -z at-vie-2
exo dedicated-inference deployment reveal-api-key llama-fast -z at-vie-2

# 5. Monitor
exo dedicated-inference deployment logs llama-fast -z at-vie-2
