Optimize Performance
This guide covers techniques to optimize inference performance for your Dedicated Inference deployments.
Performance Optimization Options
Several techniques can improve inference speed:
| Technique | Benefit | Trade-off |
|---|---|---|
| Context length tuning | Reduce memory usage, faster processing | Limits input/output length |
| Quantization | Lower memory, faster inference | Slight quality reduction |
| KV-cache optimization | Faster TTFT by reusing computation from earlier tokens and shared prefixes | Configuration complexity |
| Compilation optimization | Faster CUDA execution | Longer startup time |
| Speculative decoding | 1.5-3× faster generation | Requires compatible model pair |
Why memory footprint matters: lower memory usage lets the inference engine serve more requests in parallel, improving overall throughput.
All options are configured via --inference-engine-params. To list the available parameters, run:
exo dedicated-inference deployment create --inference-engine-parameter-help
Context Length Tuning
Limit the maximum context length to reduce memory usage and improve throughput:
exo dedicated-inference deployment create my-app \
--model-name mistralai/Mistral-7B-Instruct-v0.3 \
--gpu-type gpua5000 \
--gpu-count 1 \
--replicas 1 \
--inference-engine-params '--max-model-len=32768' \
-z at-vie-2
A value of 32k-64k is a reasonable default for most workloads. Use this when your workload doesn’t need the model’s full context window.
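As a back-of-the-envelope illustration of why this helps (assuming an fp16 KV-cache and Mistral-7B's architecture of 32 layers, 8 KV heads, and head dimension 128, which are not stated above), each token occupies about 2 × 32 × 8 × 128 × 2 bytes ≈ 128 KiB of KV-cache, so a single 32k-token sequence reserves roughly 4 GiB. Halving the maximum context length halves that per-sequence reservation, leaving room for more concurrent requests.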
Quantization
Run quantized models for lower memory usage and faster inference:
--inference-engine-params '--quantization=awq'
Common quantization methods: awq, gptq, bitsandbytes. The model must be pre-quantized in that format.
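As a sketch, a full deployment command with AWQ quantization might look like the following; the deployment name and model are placeholders, and the model repository must already contain AWQ weights:
exo dedicated-inference deployment create my-awq-app \
--model-name <organization>/<model>-AWQ \
--gpu-type gpua5000 \
--gpu-count 1 \
--replicas 1 \
--inference-engine-params '--quantization=awq' \
-z at-vie-2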
For more details on supported quantization methods, see the vLLM quantization documentation.
KV-Cache Optimization
The KV-cache stores key-value pairs from previous tokens, allowing the model to avoid recomputation and improve time-to-first-token (TTFT). Enable prefix caching to reuse cached computations across requests with shared prefixes:
--inference-engine-params '--gpu-memory-utilization=0.9 --enable-prefix-caching'
- --gpu-memory-utilization: Fraction of GPU memory for the model (default 0.9)
- --enable-prefix-caching: Reuse KV-cache for repeated prefixes (useful for system prompts)
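These flags are passed at deployment time like any other engine parameter; a sketch reusing the earlier Mistral deployment:
exo dedicated-inference deployment create my-app \
--model-name mistralai/Mistral-7B-Instruct-v0.3 \
--gpu-type gpua5000 \
--gpu-count 1 \
--replicas 1 \
--inference-engine-params '--gpu-memory-utilization=0.9 --enable-prefix-caching' \
-z at-vie-2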
Compilation Optimization
Enable CUDA graph compilation for faster execution after initial warmup:
--inference-engine-params '--compilation-config={"level":3}'
Compilation levels:
- Level 0: No optimization (default in V0)
- Level 3: Recommended for production; enables torch.compile and CUDA graph optimizations (default in V1)
You can also use the shorthand -O3 syntax. Higher levels increase startup time but improve inference speed.
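For example, assuming the shorthand is accepted inside the engine parameter string like any other flag, the configuration above can be written as:
--inference-engine-params '-O3'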
Additional Optimizations
Other parameters to explore via --inference-engine-parameter-help:
- --max-num-batched-tokens: Maximum tokens processed in a single iteration. Tune this to balance throughput and latency.
- --scheduling-policy: Controls request scheduling order. Options: fcfs (first come first served, default) or priority (based on request priority).
- --enable-chunked-prefill: Allows prefill requests to be chunked, improving latency for concurrent requests.
- --cpu-offload-gb: Offload part of the model to CPU memory to run larger models on smaller GPUs (requires fast CPU-GPU interconnect).
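Several engine parameters can be combined in one space-separated string, as in the prefix-caching example above. A sketch mixing a few of the options from this guide (the values are illustrative, not tuned recommendations):
--inference-engine-params '--max-model-len=32768 --enable-prefix-caching --enable-chunked-prefill --max-num-batched-tokens=8192'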
Speculative Decoding
Speculative decoding provides the largest performance gains (1.5-3×) by using two models: a small “draft” model generates candidate tokens quickly, then the large “target” model validates them in a single pass.
How It Works
- Draft model generates 5-10 candidate tokens
- Target model validates all candidates in one forward pass
- Valid tokens are kept; invalid tokens are corrected
- Process repeats until completion
Since the draft model's candidates are accepted 70-90% of the time, the target model emits several tokens per forward pass instead of one, cutting the number of expensive target-model passes.
When to Use
Recommended for:
- Long-form generation (articles, code, documentation)
- High-throughput production APIs
- Latency-sensitive applications
Not recommended for:
- Very short responses (<50 tokens)
- Classification tasks
- Mismatched model families
Choosing Model Pairs
Use models from the same family with a 5-10× size difference:
| Target Model | Draft Model | Expected Speedup |
|---|---|---|
| meta-llama/Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-8B-Instruct | 2-3× |
| mistralai/Mistral-Large-2 | mistralai/Mistral-7B-Instruct-v0.3 | 1.5-2.5× |
Both models must use the same tokenizer.
Deployment Steps
1. Create both models:
exo dedicated-inference model create meta-llama/Llama-3.1-70B-Instruct \
--huggingface-token <token> \
-z at-vie-2
exo dedicated-inference model create meta-llama/Llama-3.1-8B-Instruct \
--huggingface-token <token> \
-z at-vie-2
2. Create deployment with speculative decoding:
exo dedicated-inference deployment create fast-inference \
--model-name meta-llama/Llama-3.1-70B-Instruct \
--gpu-type gpurtx6000pro \
--gpu-count 2 \
--replicas 1 \
--inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
-z at-vie-2
Configuration parameters:
- method: Speculative decoding method (e.g., "eagle3")
- model: Draft model name
- num_speculative_tokens: Tokens to generate per speculation step (typically 5-10)
3. Wait for deployment and test:
exo dedicated-inference deployment show fast-inference -z at-vie-2
exo dedicated-inference deployment reveal-api-key fast-inference -z at-vie-2
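Once the deployment is ready, send a test request. A minimal sketch, assuming the deployment exposes vLLM's OpenAI-compatible chat completions API; replace the endpoint with the URL reported by the show command and the key with the value from reveal-api-key:
curl https://<deployment-endpoint>/v1/chat/completions \
-H "Authorization: Bearer <api-key>" \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-70B-Instruct", "messages": [{"role": "user", "content": "Say hello in five words."}]}'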
GPU Memory Requirements
Both models must fit in GPU memory simultaneously.
Example: Llama-3.1-70B + Llama-3.1-8B
- Target model: ~40 GB
- Draft model: ~6 GB
- Total: ~46 GB
- Use: GPURTX6000pro (96 GB) or 2× GPUA5000
If deployment fails with “out of memory”, increase --gpu-count or use a larger GPU.
Monitoring Performance
Check logs for acceptance rate and throughput:
exo dedicated-inference deployment logs fast-inference -z at-vie-2
Target metrics:
- Acceptance rate > 70%
- 1.5-3× speedup vs baseline
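To surface the acceptance rate without reading the full log stream, you can filter the output; the exact metric wording depends on the vLLM version, so the pattern here is an assumption:
exo dedicated-inference deployment logs fast-inference -z at-vie-2 | grep -i "acceptance"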
If speedup is low:
- Verify models are from the same family
- Try a different draft model
- Check GPU memory isn’t constrained
Complete Example
# 1. Create models
exo dedicated-inference model create meta-llama/Llama-3.1-70B-Instruct \
--huggingface-token hf_xxxxx -z at-vie-2
exo dedicated-inference model create meta-llama/Llama-3.1-8B-Instruct \
--huggingface-token hf_xxxxx -z at-vie-2
# 2. Wait for models
exo dedicated-inference model list -z at-vie-2
# 3. Deploy with speculative decoding
exo dedicated-inference deployment create llama-fast \
--model-name meta-llama/Llama-3.1-70B-Instruct \
--gpu-type gpurtx6000pro \
--gpu-count 2 \
--replicas 1 \
--inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
-z at-vie-2
# 4. Test
exo dedicated-inference deployment show llama-fast -z at-vie-2
exo dedicated-inference deployment reveal-api-key llama-fast -z at-vie-2
# 5. Monitor
exo dedicated-inference deployment logs llama-fast -z at-vie-2