Optimize Performance

This guide covers techniques to optimize inference performance for your Dedicated Inference deployments.

Performance Optimization Options

Several techniques can improve inference speed:

| Technique | Benefit | Trade-off |
| --- | --- | --- |
| Context length tuning | Reduce memory usage, faster processing | Limits input/output length |
| Quantization | Lower memory, faster inference | Slight quality reduction |
| KV-cache optimization | Improved TTFT by caching past queries | Configuration complexity |
| Compilation optimization | Faster CUDA execution | Longer startup time |
| Speculative decoding | 1.5-3× faster generation | Requires compatible model pair |

Why memory footprint matters: lower memory usage lets the inference engine run more requests in parallel, which improves overall throughput.

All options are configured via --inference-engine-params. To list the available parameters, run:

exo dedicated-inference deployment create --inference-engine-parameter-help

Context Length Tuning

Limit the maximum context length to reduce memory usage and improve throughput:

exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--max-model-len=32768' \
  -z at-vie-2

A value of 32k-64k is a reasonable default for most workloads. Use this when your workload doesn’t need the model’s full context window.

Quantization

Run quantized models for lower memory usage and faster inference:

--inference-engine-params '--quantization=awq'

Common quantization methods include awq, gptq, and bitsandbytes. The model you deploy must already be quantized in the chosen format.

For more details on supported quantization methods, see the vLLM quantization documentation.
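
For example, a complete deployment command for an AWQ-quantized model might look like the sketch below. The model name is a placeholder: use a checkpoint that has actually been quantized with AWQ and that you have registered with model create.

exo dedicated-inference deployment create my-quantized-app \
  --model-name <org>/<model>-AWQ \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--quantization=awq' \
  -z at-vie-2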

KV-Cache Optimization

The KV-cache stores key-value pairs from previous tokens, allowing the model to avoid recomputation and improve time-to-first-token (TTFT). Enable prefix caching to reuse cached computations across requests with shared prefixes:

--inference-engine-params '--gpu-memory-utilization=0.9 --enable-prefix-caching'

  • --gpu-memory-utilization: Fraction of GPU memory the engine may use for model weights and KV-cache (default 0.9)
  • --enable-prefix-caching: Reuse KV-cache for repeated prefixes (useful for shared system prompts)
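
Prefix caching pays off when many requests share the same leading tokens, such as a fixed system prompt. The request below is only an illustration: the endpoint URL and API key are placeholders (take them from deployment show and reveal-api-key), and it assumes the deployment exposes vLLM's OpenAI-compatible chat completions route.

# First request populates the KV-cache for the shared system prompt.
curl -s https://<deployment-endpoint>/v1/chat/completions \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
      {"role": "system", "content": "You are a concise support assistant."},
      {"role": "user", "content": "How do I reset my password?"}
    ]
  }'
# Subsequent requests that reuse the same system prompt hit the cached
# prefix and see a lower time-to-first-token.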

Compilation Optimization

Enable CUDA graph compilation for faster execution after initial warmup:

--inference-engine-params '--compilation-config={"level":3}'

Compilation levels:

  • Level 0: No optimization (default in V0)
  • Level 3: Recommended for production, enables torch.compile and CUDA graph optimizations (default in V1)

You can also use the shorthand -O3 syntax. Higher levels increase startup time but improve inference speed.
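
Using the shorthand mentioned above, the same setting can be passed as:

--inference-engine-params '-O3'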

Additional Optimizations

Other parameters to explore via --inference-engine-parameter-help:

  • --max-num-batched-tokens: Maximum tokens processed in a single iteration. Tune this to balance throughput and latency.
  • --scheduling-policy: Controls request scheduling order. Options: fcfs (first come first served, default) or priority (based on request priority).
  • --enable-chunked-prefill: Allows prefill requests to be chunked, improving latency for concurrent requests.
  • --cpu-offload-gb: Offload part of the model to CPU memory to run larger models on smaller GPUs (requires fast CPU-GPU interconnect).
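
These flags can be combined in a single --inference-engine-params string. The values below are illustrative, not recommendations; tune them against your own workload:

--inference-engine-params '--max-model-len=32768 --enable-prefix-caching --enable-chunked-prefill --max-num-batched-tokens=8192'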

Speculative Decoding

Speculative decoding provides the largest performance gains (1.5-3×) by using two models: a small “draft” model generates candidate tokens quickly, then the large “target” model validates them in a single pass.

How It Works

  1. Draft model generates 5-10 candidate tokens
  2. Target model validates all candidates in one forward pass
  3. Valid tokens are kept; invalid tokens are corrected
  4. Process repeats until completion

Since the draft model is correct 70-90% of the time, this eliminates many expensive target model forward passes.
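
As a rough, hedged illustration: with 5 draft tokens per round and an 80% per-token acceptance rate, the expected number of accepted draft tokens per verification step is about 0.8 + 0.8² + 0.8³ + 0.8⁴ + 0.8⁵ ≈ 2.7, so each target-model forward pass yields roughly three to four tokens instead of one (the exact count depends on how the method handles the bonus/correction token). The draft model adds its own overhead per round, which is why observed end-to-end speedups land in the 1.5-3× range rather than at this ceiling.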

When to Use

Recommended for:

  • Long-form generation (articles, code, documentation)
  • High-throughput production APIs
  • Latency-sensitive applications

Not recommended for:

  • Very short responses (<50 tokens)
  • Classification tasks
  • Mismatched model families

Choosing Model Pairs

Use models from the same family with a 5-10× size difference:

| Target Model | Draft Model | Expected Speedup |
| --- | --- | --- |
| meta-llama/Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-8B-Instruct | 2-3× |
| mistralai/Mistral-Large-2 | mistralai/Mistral-7B-Instruct-v0.3 | 1.5-2.5× |

Both models must use the same tokenizer.

Deployment Steps

1. Create both models:

exo dedicated-inference model create meta-llama/Llama-3.1-70B-Instruct \
  --huggingface-token <token> \
  -z at-vie-2

exo dedicated-inference model create meta-llama/Llama-3.1-8B-Instruct \
  --huggingface-token <token> \
  -z at-vie-2
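
Model downloads for large checkpoints can take some time. You can check that both models are ready before creating the deployment:

exo dedicated-inference model list -z at-vie-2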

2. Create deployment with speculative decoding:

exo dedicated-inference deployment create fast-inference \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpurtx6000pro \
  --gpu-count 2 \
  --replicas 1 \
  --inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
  -z at-vie-2

Configuration parameters:

  • method: Speculative decoding method (e.g., "eagle3")
  • model: Draft model name
  • num_speculative_tokens: Tokens to generate per speculation (typically 5-10)

3. Wait for deployment and test:

exo dedicated-inference deployment show fast-inference -z at-vie-2
exo dedicated-inference deployment reveal-api-key fast-inference -z at-vie-2
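
Once the deployment is running, send a test request. This sketch assumes the deployment exposes vLLM's OpenAI-compatible API; substitute the endpoint reported by deployment show and the key returned by reveal-api-key:

curl -s https://<deployment-endpoint>/v1/chat/completions \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Summarize speculative decoding in two sentences."}],
    "max_tokens": 128
  }'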

GPU Memory Requirements

Both models must fit in GPU memory simultaneously.

Example: Llama-3.1-70B + Llama-3.1-8B

  • Target model: ~40 GB
  • Draft model: ~6 GB
  • Total: ~46 GB
  • Use: GPURTX6000pro (96 GB) or 2× GPUA5000

If deployment fails with “out of memory”, increase --gpu-count or use a larger GPU.

Monitoring Performance

Check logs for acceptance rate and throughput:

exo dedicated-inference deployment logs fast-inference -z at-vie-2

Target metrics:

  • Acceptance rate > 70%
  • 1.5-3× speedup vs baseline

If speedup is low:

  • Verify models are from the same family
  • Try a different draft model
  • Check GPU memory isn’t constrained

Complete Example

# 1. Create models
exo dedicated-inference model create meta-llama/Llama-3.1-70B-Instruct \
  --huggingface-token hf_xxxxx -z at-vie-2

exo dedicated-inference model create meta-llama/Llama-3.1-8B-Instruct \
  --huggingface-token hf_xxxxx -z at-vie-2

# 2. Wait for models
exo dedicated-inference model list -z at-vie-2

# 3. Deploy with speculative decoding
exo dedicated-inference deployment create llama-fast \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpurtx6000pro \
  --gpu-count 2 \
  --replicas 1 \
  --inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
  -z at-vie-2

# 4. Test
exo dedicated-inference deployment show llama-fast -z at-vie-2
exo dedicated-inference deployment reveal-api-key llama-fast -z at-vie-2

# 5. Monitor
exo dedicated-inference deployment logs llama-fast -z at-vie-2

