Optimize Performance

This guide covers techniques to optimize inference performance for your Dedicated Inference deployments.

Performance Optimization Options

Several techniques can improve inference speed:

| Technique | Benefit | Trade-off |
| --- | --- | --- |
| Context length tuning | Reduce memory usage, faster processing | Limits input/output length |
| Quantization | Lower memory, faster inference | Slight quality reduction |
| KV-cache optimization | Improved TTFT by caching past queries | Configuration complexity |
| Compilation optimization | Faster CUDA execution | Longer startup time |
| Speculative decoding | 1.5-3× faster generation | Requires compatible model pair |

Why memory footprint matters: lower memory usage lets the inference engine run more requests in parallel, which improves overall throughput.

All options are configured via --inference-engine-params. To list the available parameters, run:

exo dedicated-inference deployment create --inference-engine-parameter-help

Context Length Tuning

Limit the maximum context length to reduce memory usage and improve throughput:

exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--max-model-len=32768' \
  -z at-vie-2

A value of 32k-64k is a reasonable default for most workloads. Use this when your workload doesn’t need the model’s full context window.

Quantization

Run quantized models for lower memory usage and faster inference:

--inference-engine-params '--quantization=awq'

Common quantization methods include awq, gptq, and bitsandbytes. The model you deploy must already be quantized in the chosen format.

For more details on supported quantization methods, see the vLLM quantization documentation.
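
For example, a complete deployment command for an AWQ-quantized model might look like the sketch below. The model name is a placeholder: use a checkpoint that has actually been quantized with AWQ and that you have registered with model create.

exo dedicated-inference deployment create my-quantized-app \
  --model-name <org>/<model>-AWQ \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--quantization=awq' \
  -z at-vie-2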

KV-Cache Optimization

The KV-cache stores key-value pairs from previous tokens, allowing the model to avoid recomputation and improve time-to-first-token (TTFT). Enable prefix caching to reuse cached computations across requests with shared prefixes:

--inference-engine-params '--gpu-memory-utilization=0.9 --enable-prefix-caching'

  • --gpu-memory-utilization: Fraction of GPU memory the engine may use for model weights and KV-cache (default 0.9)
  • --enable-prefix-caching: Reuse KV-cache for repeated prefixes (useful for shared system prompts)
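
Prefix caching pays off when many requests share the same leading tokens, such as a fixed system prompt. The request below is only an illustration: the endpoint URL and API key are placeholders (take them from deployment show and reveal-api-key), and it assumes the deployment exposes vLLM's OpenAI-compatible chat completions route.

# First request populates the KV-cache for the shared system prompt.
curl -s https://<deployment-endpoint>/v1/chat/completions \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [
      {"role": "system", "content": "You are a concise support assistant."},
      {"role": "user", "content": "How do I reset my password?"}
    ]
  }'
# Subsequent requests that reuse the same system prompt hit the cached
# prefix and see a lower time-to-first-token.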

Compilation Optimization

Enable CUDA graph compilation for faster execution after initial warmup:

--inference-engine-params '--compilation-config={"level":3}'

Compilation levels:

  • Level 0: No optimization (default in V0)
  • Level 3: Recommended for production, enables torch.compile and CUDA graph optimizations (default in V1)

You can also use the shorthand -O3 syntax. Higher levels increase startup time but improve inference speed.
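
Using the shorthand mentioned above, the same setting can be passed as:

--inference-engine-params '-O3'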

Additional Optimizations

Other parameters to explore via --inference-engine-parameter-help:

  • --max-num-batched-tokens: Maximum tokens processed in a single iteration. Tune this to balance throughput and latency.
  • --scheduling-policy: Controls request scheduling order. Options: fcfs (first come first served, default) or priority (based on request priority).
  • --enable-chunked-prefill: Allows prefill requests to be chunked, improving latency for concurrent requests.
  • --cpu-offload-gb: Offload part of the model to CPU memory to run larger models on smaller GPUs (requires fast CPU-GPU interconnect).
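
These flags can be combined in a single --inference-engine-params string. The values below are illustrative, not recommendations; tune them against your own workload:

--inference-engine-params '--max-model-len=32768 --enable-prefix-caching --enable-chunked-prefill --max-num-batched-tokens=8192'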

Speculative Decoding

Speculative decoding provides the largest performance gains (1.5-3×) by using two models: a small “draft” model generates candidate tokens quickly, then the large “target” model validates them in a single pass.

How It Works

  1. Draft model generates 5-10 candidate tokens
  2. Target model validates all candidates in one forward pass
  3. Valid tokens are kept; invalid tokens are corrected
  4. Process repeats until completion

Since the draft model is correct 70-90% of the time, this eliminates many expensive target model forward passes.
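
As a rough, hedged illustration: with 5 draft tokens per round and an 80% per-token acceptance rate, the expected number of accepted draft tokens per verification step is about 0.8 + 0.8² + 0.8³ + 0.8⁴ + 0.8⁵ ≈ 2.7, so each target-model forward pass yields roughly three to four tokens instead of one (the exact count depends on how the method handles the bonus/correction token). The draft model adds its own overhead per round, which is why observed end-to-end speedups land in the 1.5-3× range rather than at this ceiling.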

When to Use

Recommended for:

  • Long-form generation (articles, code, documentation)
  • High-throughput production APIs
  • Latency-sensitive applications

Not recommended for:

  • Very short responses (<50 tokens)
  • Classification tasks
  • Mismatched model families

Choosing Model Pairs

Use models from the same family with a 5-10× size difference:

| Target Model | Draft Model | Expected Speedup |
| --- | --- | --- |
| meta-llama/Llama-3.1-70B-Instruct | meta-llama/Llama-3.1-8B-Instruct | 2-3× |
| mistralai/Mistral-Large-2 | mistralai/Mistral-7B-Instruct-v0.3 | 1.5-2.5× |

Both models must use the same tokenizer.

Deployment Steps

1. Create both models:

exo dedicated-inference model create meta-llama/Llama-3.1-70B-Instruct \
  --huggingface-token <token> \
  -z at-vie-2

exo dedicated-inference model create meta-llama/Llama-3.1-8B-Instruct \
  --huggingface-token <token> \
  -z at-vie-2
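
Model downloads for large checkpoints can take some time. You can check that both models are ready before creating the deployment:

exo dedicated-inference model list -z at-vie-2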

2. Create deployment with speculative decoding:

exo dedicated-inference deployment create fast-inference \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpurtx6000pro \
  --gpu-count 2 \
  --replicas 1 \
  --inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
  -z at-vie-2

Configuration parameters:

  • method: Speculative decoding method (e.g., "eagle3")
  • model: Draft model name
  • num_speculative_tokens: Tokens to generate per speculation (typically 5-10)

3. Wait for deployment and test:

exo dedicated-inference deployment show fast-inference -z at-vie-2
exo dedicated-inference deployment reveal-api-key fast-inference -z at-vie-2
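
Once the deployment is running, send a test request. This sketch assumes the deployment exposes vLLM's OpenAI-compatible API; substitute the endpoint reported by deployment show and the key returned by reveal-api-key:

curl -s https://<deployment-endpoint>/v1/chat/completions \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Summarize speculative decoding in two sentences."}],
    "max_tokens": 128
  }'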

GPU Memory Requirements

Both models must fit in GPU memory simultaneously.

Example: Llama-3.1-70B + Llama-3.1-8B

  • Target model: ~40 GB
  • Draft model: ~6 GB
  • Total: ~46 GB
  • Use: GPURTX6000pro (96 GB) or 2× GPUA5000

If deployment fails with “out of memory”, increase --gpu-count or use a larger GPU.

Monitoring Performance

Check logs for acceptance rate and throughput:

exo dedicated-inference deployment logs fast-inference -z at-vie-2

Target metrics:

  • Acceptance rate > 70%
  • 1.5-3× speedup vs baseline

If speedup is low:

  • Verify models are from the same family
  • Try a different draft model
  • Check GPU memory isn’t constrained

Complete Example

# 1. Create models
exo dedicated-inference model create meta-llama/Llama-3.1-70B-Instruct \
  --huggingface-token hf_xxxxx -z at-vie-2

exo dedicated-inference model create meta-llama/Llama-3.1-8B-Instruct \
  --huggingface-token hf_xxxxx -z at-vie-2

# 2. Wait for models
exo dedicated-inference model list -z at-vie-2

# 3. Deploy with speculative decoding
exo dedicated-inference deployment create llama-fast \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpurtx6000pro \
  --gpu-count 2 \
  --replicas 1 \
  --inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
  -z at-vie-2

# 4. Test
exo dedicated-inference deployment show llama-fast -z at-vie-2
exo dedicated-inference deployment reveal-api-key llama-fast -z at-vie-2

# 5. Monitor
exo dedicated-inference deployment logs llama-fast -z at-vie-2

