Optimize Deployment Costs
Dedicated Inference billing is transparent and usage-based, giving you full control over costs. This guide covers practical strategies to optimize your spending while maintaining the performance your applications need.
Understanding the Cost Model
Dedicated Inference has two billing components:
- GPU Compute: Billed per second for each GPU running in a deployment. Costs depend on GPU type and number of replicas.
- Object Storage: Standard Exoscale Object Storage (SOS) costs for storing models downloaded from Hugging Face.
Key Principle: You only pay for GPU resources when deployments have active replicas. Scale to zero to stop GPU billing while preserving your URL and API key.
Cost Optimization Strategies
1. Right-Size Your GPU Selection
Choose the smallest GPU that can comfortably fit your model.
GPU Selection Guide:
| GPU Type | Memory | Best For | Relative Cost |
|---|---|---|---|
| GPUA5000 | 24 GB | Medium models (7-20B params), embeddings, RAG, text extraction | $ |
| GPURTX6000pro | 96 GB | Large models (up to 120B params), agents, coding tasks | $$ |
Use Cases by GPU:
- GPUA5000: Embeddings for RAG pipelines, text extraction and summarization, general-purpose inference
- GPURTX6000pro: Agents, coding tasks, validating concepts before fine-tuning smaller models, generating synthetic training data
Multi-GPU Scaling Examples (GPURTX6000pro):
| Configuration | Example Models |
|---|---|
| 1× GPU | gpt-oss-120b |
| 4× GPU | MiniMax-M1 with 400k context |
| 8× GPU | Qwen Coder 480B |
Example Savings:
Running a 7B model on GPUA5000 instead of GPURTX6000pro can save significantly on GPU costs with no performance difference.
How to Choose:
- Check the model size on Hugging Face (look for the total size in GB)
- Use the Model Memory Calculator for an estimate, or the rough rule of thumb sketched below
- Start with the GPU Selection Guide above
- If the deployment fails with “out of memory,” move up to the next GPU size
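As a rough rule of thumb (an approximation, not an official sizing formula), weight memory is about the parameter count times the bytes per parameter, plus headroom for the KV cache and activations:

```bash
# Rough VRAM estimate; an approximation, not an official sizing formula
PARAMS_B=7            # model size in billions of parameters
BYTES_PER_PARAM=2     # fp16/bf16 weights
OVERHEAD=1.3          # ~30% headroom for KV cache and activations (assumption)
echo "scale=1; $PARAMS_B * $BYTES_PER_PARAM * $OVERHEAD" | bc
# => 18.2 (GB), which fits on a 24 GB GPUA5000
```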
2. Scale Replicas Based on Demand
Start with minimal replicas and scale up only when needed.
Development/Testing
```bash
# Use 1 replica for development
exo dedicated-inference deployment create dev-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2
```

Production Scaling
```bash
# Scale up for production traffic
exo dedicated-inference deployment scale prod-app 3 -z at-vie-2

# Scale down during low-traffic periods
exo dedicated-inference deployment scale prod-app 1 -z at-vie-2
```

Cost Impact:
- 1 replica = 1× GPU cost
- 3 replicas = 3× GPU cost
- 5 replicas = 5× GPU cost
3. Scale to Zero During Idle Time
Scale to zero to stop GPU billing while preserving your deployment’s URL and API key.
When to Scale to Zero:
- Nights and weekends for development environments
- After demo presentations
- During planned maintenance windows
- When projects are on hold
How to Scale to Zero:
```bash
# Scale to zero (stops GPU billing immediately)
exo dedicated-inference deployment scale my-app 0 -z at-vie-2
```

Resume When Needed:
```bash
# Scale back up; URL and API key are preserved
exo dedicated-inference deployment scale my-app 1 -z at-vie-2
```

Cost Savings Example:
- Running 24/7: 730 hours/month
- Scaling to zero nights (18:00-08:00) and weekends: ~220 active hours/month
- Savings: ~70% on GPU costs (see the cron sketch below for one way to automate this)
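One way to automate this schedule is a pair of cron jobs calling the exo CLI. This is only a sketch: the deployment name, zone, and the cron user's exo authentication are assumptions you need to adapt.

```bash
# Hypothetical crontab entries automating the schedule above
# Scale to zero at 18:00, Monday through Friday (stays at zero over the weekend)
0 18 * * 1-5  exo dedicated-inference deployment scale dev-app 0 -z at-vie-2
# Scale back to 1 replica at 08:00, Monday through Friday
0 8 * * 1-5   exo dedicated-inference deployment scale dev-app 1 -z at-vie-2
```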
4. Minimize GPU Count
Only use multiple GPUs per replica when absolutely necessary.
Single GPU (Preferred):
```bash
--gpu-count 1 --replicas 3   # Total: 3 GPUs
```

Multiple GPUs (Only for Large Models):

```bash
--gpu-count 4 --replicas 1   # Total: 4 GPUs
```

Why This Matters:
- Multi-GPU deployments are for large models that don’t fit on one GPU
- If your model fits on 1 GPU, use replicas instead
- Same total GPU count, but replicas provide better fault tolerance
5. Clean Up Unused Models
Models in Object Storage incur minor SOS costs (typically < 1% of total cost). Still, delete models you no longer need:
```bash
exo dedicated-inference model list -z at-vie-2
exo dedicated-inference model delete <model-id> -z at-vie-2
```

6. Share Models Across Deployments
Create a model once and reuse it for multiple deployments.
Efficient Approach:
```bash
# Create model once
exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2

# Use for multiple deployments
exo dedicated-inference deployment create dev-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 --gpu-count 1 --replicas 1 -z at-vie-2

exo dedicated-inference deployment create prod-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 --gpu-count 1 --replicas 2 -z at-vie-2
```

Benefit: Only one copy stored in Object Storage, used by all deployments.
7. Use Speculative Decoding for Efficiency
Speculative decoding can improve throughput by roughly 1.5-3×, reducing the GPU time spent per request and improving cost efficiency.
Without Speculative Decoding:
- 100 requests/hour on 1 replica
With Speculative Decoding:
- 200-300 requests/hour on the same hardware
- Or serve 100 requests/hour with fewer replicas
See the Optimize Performance guide for implementation details.
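To translate the throughput figures above into replica count, here is a back-of-the-envelope sketch. The requests-per-hour numbers are illustrative assumptions; benchmark your own workload before relying on them.

```bash
# Illustrative figures only; benchmark your own deployment
TARGET_RPH=300               # requests per hour to serve
BASE_RPH_PER_REPLICA=100     # throughput without speculative decoding (assumption)
SPEC_RPH_PER_REPLICA=250     # throughput with speculative decoding, ~2.5x (assumption)

# Round up to the number of replicas needed to meet the target
echo "without spec. decoding: $(( (TARGET_RPH + BASE_RPH_PER_REPLICA - 1) / BASE_RPH_PER_REPLICA )) replicas"
echo "with spec. decoding:    $(( (TARGET_RPH + SPEC_RPH_PER_REPLICA - 1) / SPEC_RPH_PER_REPLICA )) replicas"
```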
8. Optimize Inference Parameters
Reduce unnecessary computation in your API requests:
Limit Token Generation:
```json
{
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 50 // Use only what you need
}
```

Enable Streaming:
```json
{
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": true // Better perceived performance
}
```

Cost Impact: Shorter responses = less GPU time per request = more requests per replica.
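For example, a capped, streaming request with curl might look like this. The endpoint path, header, and environment variables are assumptions based on the OpenAI-compatible request shape above; take the exact URL and API key from your deployment's details.

```bash
# DEPLOYMENT_URL and API_KEY are placeholders; copy both from your deployment's details
curl -s "$DEPLOYMENT_URL/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 50,
        "stream": true
      }'
```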
Cost Estimation
Use the Exoscale Advanced Calculator to estimate costs.
Example Calculation:
Scenario: Mistral-7B on GPUA5000, 2 replicas, 12 hours/day, 22 days/month
GPU Compute:
- GPUA5000 rate: Check GPU pricing
- Hours: 2 replicas × 12 hours/day × 22 days = 528 GPU-hours
- Cost: 528 × hourly rate
Object Storage:
- Model size: ~14 GB (Mistral-7B)
- Monthly SOS cost: ~$0.30-0.50 (varies by region)
Total: GPU compute + SOS costs
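The GPU portion of this example works out as follows. The hourly rate below is a placeholder, not a published price; substitute the current GPUA5000 rate.

```bash
# Placeholder rate; substitute the published GPUA5000 price per GPU-hour
RATE_PER_GPU_HOUR=1.00
GPU_HOURS=$((2 * 12 * 22))    # 2 replicas x 12 h/day x 22 days = 528 GPU-hours
echo "scale=2; $GPU_HOURS * $RATE_PER_GPU_HOUR" | bc    # monthly GPU compute cost
```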
Cost Monitoring
Track Deployment Hours
Keep a log of active deployment time:
```bash
#!/bin/bash
# Log the deployment status once per run (append to a local file)
DATE=$(date +%Y-%m-%d)
DEPLOYMENT="my-app"
STATUS=$(exo dedicated-inference deployment show "$DEPLOYMENT" -z at-vie-2 | grep Status)
echo "$DATE: $STATUS" >> deployment-log.txt
```

Use Exoscale Portal
Monitor costs in real-time:
- Log in to Exoscale Portal
- Navigate to Billing section
- Review GPU compute and Object Storage usage
- Set up budget alerts if available
Cost Optimization Workflow
For Development
```bash
# Start of workday - scale up
exo dedicated-inference deployment scale dev-app 1 -z at-vie-2

# End of workday - scale to zero
exo dedicated-inference deployment scale dev-app 0 -z at-vie-2
```

Daily Savings: ~14 hours × GPU rate (URL and API key preserved)
For Production
```bash
# Low traffic (1 replica)
exo dedicated-inference deployment scale prod-app 1 -z at-vie-2

# High traffic (3 replicas)
exo dedicated-inference deployment scale prod-app 3 -z at-vie-2

# Weekends - scale to zero
exo dedicated-inference deployment scale prod-app 0 -z at-vie-2
```

Weekend Savings: 48 hours × GPU rate × replica count
Cost Comparison: Strategies in Action
Scenario: Running Mistral-7B for a production application
| Strategy | Setup | Monthly GPU Hours | Relative Cost |
|---|---|---|---|
| Unoptimized | 3× GPURTX6000pro, 24/7 | 2,190 | 3.0× |
| Better GPU | 3× GPUA5000, 24/7 | 2,190 | 1.5× |
| Scaled | 3× GPUA5000, 12h/day | 1,095 | 0.75× |
| Optimized | 1× GPUA5000, 12h/day, spec. decoding | ~200* | 0.15× |
*With speculative decoding improving throughput by 1.5-3×, fewer GPU-hours are needed for the same request volume.
Quick Wins Checklist
- Use the smallest GPU that fits your model
- Start with 1 replica and scale only when needed
- Scale to zero during idle periods
- Clean up unused models in Object Storage
- Share models across multiple deployments
- Consider speculative decoding for high-throughput workloads
- Set `max_tokens` appropriately in API requests
- Monitor costs weekly in the Exoscale Portal