Optimize Deployment Costs
Dedicated Inference billing is transparent and usage-based, giving you full control over costs. This guide covers practical strategies to optimize your spending while maintaining the performance your applications need.
Understanding the Cost Model
Dedicated Inference has two billing components:
- GPU Compute: Billed per second for each GPU running in a deployment. Costs depend on GPU type and number of replicas.
- Object Storage: Standard Exoscale Object Storage (SOS) costs for storing models downloaded from Hugging Face.
Key Principle: You only pay for GPU resources when deployments have active replicas. Scale to zero to stop GPU billing while preserving your URL and API key.
Cost Optimization Strategies
1. Right-Size Your GPU Selection
Choose the smallest GPU that can comfortably fit your model.
GPU Selection Guide:
| GPU Type | Memory | Best For | Relative Cost |
|---|---|---|---|
| GPUA5000 | 24 GB | Medium models (7-20B params), embeddings, RAG, text extraction | $ |
| GPURTX6000pro | 96 GB | Large models (up to 120B params), agents, coding tasks | $$ |
Use Cases by GPU:
- GPUA5000: Embeddings for RAG pipelines, text extraction and summarization, general-purpose inference
- GPURTX6000pro: Agents, coding tasks, validating concepts before fine-tuning smaller models, generating synthetic training data
Multi-GPU Scaling Examples (GPURTX6000pro):
| Configuration | Example Models |
|---|---|
| 1× GPU | gpt-oss-120b |
| 4× GPU | MiniMax-M1 with 400k context |
| 8× GPU | Qwen Coder 480B |
Example Savings:
Running a 7B model on GPUA5000 instead of GPURTX6000pro can save significantly on GPU costs with no performance difference.
How to Choose:
- Check the model size on Hugging Face (look for the total size in GB)
- Use the Model Memory Calculator for an estimate, or the rough rule of thumb sketched below
- Start with the GPU Selection Guide above
- If the deployment fails with “out of memory,” move up to the next GPU size
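As a rough rule of thumb (an approximation, not an official sizing formula), weight memory is about the parameter count times the bytes per parameter, plus headroom for the KV cache and activations:

```bash
# Rough VRAM estimate; an approximation, not an official sizing formula
PARAMS_B=7            # model size in billions of parameters
BYTES_PER_PARAM=2     # fp16/bf16 weights
OVERHEAD=1.3          # ~30% headroom for KV cache and activations (assumption)
echo "scale=1; $PARAMS_B * $BYTES_PER_PARAM * $OVERHEAD" | bc
# => 18.2 (GB), which fits on a 24 GB GPUA5000
```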
2. Scale Replicas Based on Demand
Start with minimal replicas and scale up only when needed.
Development/Testing
```bash
# Use 1 replica for development
exo dedicated-inference deployment create dev-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2
```

Production Scaling
```bash
# Scale up for production traffic
exo dedicated-inference deployment scale prod-app 3 -z at-vie-2

# Scale down during low-traffic periods
exo dedicated-inference deployment scale prod-app 1 -z at-vie-2
```

Cost Impact:
- 1 replica = 1× GPU cost
- 3 replicas = 3× GPU cost
- 5 replicas = 5× GPU cost
3. Scale to Zero During Idle Time
Scale to zero to stop GPU billing while preserving your deployment’s URL and API key.
When to Scale to Zero:
- Nights and weekends for development environments
- After demo presentations
- During planned maintenance windows
- When projects are on hold
How to Scale to Zero:
```bash
# Scale to zero (stops GPU billing immediately)
exo dedicated-inference deployment scale my-app 0 -z at-vie-2
```

Resume When Needed:
```bash
# Scale back up; URL and API key are preserved
exo dedicated-inference deployment scale my-app 1 -z at-vie-2
```

Cost Savings Example:
- Running 24/7: 730 hours/month
- Scaling to zero nights (18:00-08:00) and weekends: ~220 active hours/month
- Savings: ~70% on GPU costs (see the cron sketch below for one way to automate this)
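One way to automate this schedule is a pair of cron jobs calling the exo CLI. This is only a sketch: the deployment name, zone, and the cron user's exo authentication are assumptions you need to adapt.

```bash
# Hypothetical crontab entries automating the schedule above
# Scale to zero at 18:00, Monday through Friday (stays at zero over the weekend)
0 18 * * 1-5  exo dedicated-inference deployment scale dev-app 0 -z at-vie-2
# Scale back to 1 replica at 08:00, Monday through Friday
0 8 * * 1-5   exo dedicated-inference deployment scale dev-app 1 -z at-vie-2
```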
4. Minimize GPU Count
Only use multiple GPUs per replica when absolutely necessary.
Single GPU (Preferred):
```bash
--gpu-count 1 --replicas 3   # Total: 3 GPUs
```

Multiple GPUs (Only for Large Models):

```bash
--gpu-count 4 --replicas 1   # Total: 4 GPUs
```

Why This Matters:
- Multi-GPU deployments are for large models that don’t fit on one GPU
- If your model fits on 1 GPU, use replicas instead
- Same total GPU count, but replicas provide better fault tolerance
5. Clean Up Unused Models
Models in Object Storage incur minor SOS costs (typically < 1% of total cost). Still, delete models you no longer need:
```bash
exo dedicated-inference model list -z at-vie-2
exo dedicated-inference model delete <model-id> -z at-vie-2
```

6. Share Models Across Deployments
Create a model once and reuse it for multiple deployments.
Efficient Approach:
```bash
# Create model once
exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2

# Use for multiple deployments
exo dedicated-inference deployment create dev-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 --gpu-count 1 --replicas 1 -z at-vie-2

exo dedicated-inference deployment create prod-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 --gpu-count 1 --replicas 2 -z at-vie-2
```

Benefit: Only one copy stored in Object Storage, used by all deployments.
7. Use Speculative Decoding for Efficiency
Speculative decoding can improve throughput by roughly 1.5-3×, reducing the GPU time spent per request and improving cost efficiency.
Without Speculative Decoding:
- 100 requests/hour on 1 replica
With Speculative Decoding:
- 200-300 requests/hour on the same hardware
- Or serve 100 requests/hour with fewer replicas
See the Optimize Performance guide for implementation details.
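To translate the throughput figures above into replica count, here is a back-of-the-envelope sketch. The requests-per-hour numbers are illustrative assumptions; benchmark your own workload before relying on them.

```bash
# Illustrative figures only; benchmark your own deployment
TARGET_RPH=300               # requests per hour to serve
BASE_RPH_PER_REPLICA=100     # throughput without speculative decoding (assumption)
SPEC_RPH_PER_REPLICA=250     # throughput with speculative decoding, ~2.5x (assumption)

# Round up to the number of replicas needed to meet the target
echo "without spec. decoding: $(( (TARGET_RPH + BASE_RPH_PER_REPLICA - 1) / BASE_RPH_PER_REPLICA )) replicas"
echo "with spec. decoding:    $(( (TARGET_RPH + SPEC_RPH_PER_REPLICA - 1) / SPEC_RPH_PER_REPLICA )) replicas"
```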
8. Optimize Inference Parameters
Reduce unnecessary computation in your API requests:
Limit Token Generation:
```json
{
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 50 // Use only what you need
}
```

Enable Streaming:
```json
{
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": true // Better perceived performance
}
```

Cost Impact: Shorter responses = less GPU time per request = more requests per replica.
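For example, a capped, streaming request with curl might look like this. The endpoint path, header, and environment variables are assumptions based on the OpenAI-compatible request shape above; take the exact URL and API key from your deployment's details.

```bash
# DEPLOYMENT_URL and API_KEY are placeholders; copy both from your deployment's details
curl -s "$DEPLOYMENT_URL/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.3",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 50,
        "stream": true
      }'
```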
Cost Estimation
Use the Exoscale Advanced Calculator to estimate costs.
Example Calculation:
Scenario: Mistral-7B on GPUA5000, 2 replicas, 12 hours/day, 22 days/month
GPU Compute:
- GPUA5000 rate: Check GPU pricing
- Hours: 2 replicas × 12 hours/day × 22 days = 528 GPU-hours
- Cost: 528 × hourly rate
Object Storage:
- Model size: ~14 GB (Mistral-7B)
- Monthly SOS cost: ~$0.30-0.50 (varies by region)
Total: GPU compute + SOS costs
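The GPU portion of this example works out as follows. The hourly rate below is a placeholder, not a published price; substitute the current GPUA5000 rate.

```bash
# Placeholder rate; substitute the published GPUA5000 price per GPU-hour
RATE_PER_GPU_HOUR=1.00
GPU_HOURS=$((2 * 12 * 22))    # 2 replicas x 12 h/day x 22 days = 528 GPU-hours
echo "scale=2; $GPU_HOURS * $RATE_PER_GPU_HOUR" | bc    # monthly GPU compute cost
```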
Cost Monitoring
Track Deployment Hours
Keep a log of active deployment time:
```bash
#!/bin/bash
# Log the deployment status once per run (append to a local file)
DATE=$(date +%Y-%m-%d)
DEPLOYMENT="my-app"
STATUS=$(exo dedicated-inference deployment show "$DEPLOYMENT" -z at-vie-2 | grep Status)
echo "$DATE: $STATUS" >> deployment-log.txt
```

Use Exoscale Portal
Monitor costs in real-time:
- Log in to Exoscale Portal
- Navigate to Billing section
- Review GPU compute and Object Storage usage
- Set up budget alerts if available
Cost Optimization Workflow
For Development
```bash
# Start of workday - scale up
exo dedicated-inference deployment scale dev-app 1 -z at-vie-2

# End of workday - scale to zero
exo dedicated-inference deployment scale dev-app 0 -z at-vie-2
```

Daily Savings: ~14 hours × GPU rate (URL and API key preserved)
For Production
```bash
# Low traffic (1 replica)
exo dedicated-inference deployment scale prod-app 1 -z at-vie-2

# High traffic (3 replicas)
exo dedicated-inference deployment scale prod-app 3 -z at-vie-2

# Weekends - scale to zero
exo dedicated-inference deployment scale prod-app 0 -z at-vie-2
```

Weekend Savings: 48 hours × GPU rate × replica count
Cost Comparison: Strategies in Action
Scenario: Running Mistral-7B for a production application
| Strategy | Setup | Monthly GPU Hours | Relative Cost |
|---|---|---|---|
| Unoptimized | 3× GPURTX6000pro, 24/7 | 2,190 | 3.0× |
| Better GPU | 3× GPUA5000, 24/7 | 2,190 | 1.5× |
| Scaled | 3× GPUA5000, 12h/day | 1,095 | 0.75× |
| Optimized | 1× GPUA5000, 12h/day, spec. decoding | ~200* | 0.15× |
*With speculative decoding improving throughput by 1.5-3×, fewer GPU-hours are needed for the same request volume.
Quick Wins Checklist
- Use the smallest GPU that fits your model
- Start with 1 replica and scale only when needed
- Scale to zero during idle periods
- Clean up unused models in Object Storage
- Share models across multiple deployments
- Consider speculative decoding for high-throughput workloads
- Set `max_tokens` appropriately in API requests
- Monitor costs weekly in the Exoscale Portal