Optimize Deployment Costs

Dedicated Inference billing is transparent and usage-based, giving you full control over costs. This guide covers practical strategies to optimize your spending while maintaining the performance your applications need.

Understanding the Cost Model

Dedicated Inference has two billing components:

GPU Compute
Billed per-second for each GPU running in a deployment. Costs depend on GPU type and number of replicas.
Object Storage
Standard Exoscale Object Storage (SOS) costs for storing models downloaded from Hugging Face.

Key Principle: You only pay for GPU resources when deployments have active replicas. Scale to zero to stop GPU billing while preserving your URL and API key.

Cost Optimization Strategies

1. Right-Size Your GPU Selection

Choose the smallest GPU that can comfortably fit your model.

GPU Selection Guide:

GPU Type      | Memory | Best For                                                        | Relative Cost
GPUA5000      | 24 GB  | Medium models (7-20B params), embeddings, RAG, text extraction | $
GPURTX6000pro | 96 GB  | Large models (up to 120B params), agents, coding tasks         | $$

Use Cases by GPU:

  • GPUA5000: Embeddings for RAG pipelines, text extraction and summarization, general-purpose inference
  • GPURTX6000pro: Agents, coding tasks, validating concepts before fine-tuning smaller models, generating synthetic training data

Multi-GPU Scaling Examples (GPURTX6000pro):

Configuration | Example Models
1× GPU        | gpt-oss-120b
4× GPU        | MiniMax-M1 with 400k context
8× GPU        | Qwen Coder 480B

Example Savings:

Running a 7B model on GPUA5000 instead of GPURTX6000pro can save significantly on GPU costs with no performance difference.

How to Choose:

  1. Check model size on Hugging Face (look for the total size in GB)
  2. Use the Model Memory Calculator for estimates, or the rough sketch after this list
  3. Start with the GPU Selection Guide above
  4. If deployment fails with “out of memory,” upgrade to the next GPU size
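
As a rough back-of-the-envelope check (not a substitute for the Model Memory Calculator), weight memory is roughly the parameter count times the bytes per parameter, plus headroom for the KV cache and activations. A minimal sketch:

# Rough VRAM estimate: weights ≈ params × bytes/param, plus ~30% headroom
PARAMS_B=7          # billions of parameters (e.g. a 7B model)
BYTES_PER_PARAM=2   # 2 for fp16/bf16, 1 for 8-bit, 0.5 for 4-bit quantization
awk -v p="$PARAMS_B" -v b="$BYTES_PER_PARAM" \
  'BEGIN { w = p * b; printf "weights ~%.0f GB, plan for ~%.0f GB\n", w, w * 1.3 }'

For a 7B model in fp16 this gives roughly 14 GB of weights and about 18 GB with headroom, which fits comfortably on the 24 GB GPUA5000.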

2. Scale Replicas Based on Demand

Start with minimal replicas and scale up only when needed.

Development/Testing

# Use 1 replica for development
exo dedicated-inference deployment create dev-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2
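
Once the deployment is created, it can help to confirm its status before pointing traffic at it (this reuses the show subcommand from the monitoring section later in this guide):

# Check that the deployment is running before sending requests
exo dedicated-inference deployment show dev-app -z at-vie-2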

Production Scaling

# Scale up for production traffic
exo dedicated-inference deployment scale prod-app 3 -z at-vie-2

# Scale down during low-traffic periods
exo dedicated-inference deployment scale prod-app 1 -z at-vie-2

Cost Impact:

  • 1 replica = 1× GPU cost
  • 3 replicas = 3× GPU cost
  • 5 replicas = 5× GPU cost

3. Scale to Zero During Idle Time

Scale to zero to stop GPU billing while preserving your deployment’s URL and API key.

When to Scale to Zero:

  • Nights and weekends for development environments
  • After demo presentations
  • During planned maintenance windows
  • When projects are on hold

How to Scale to Zero:

# Scale to zero (stops GPU billing immediately)
exo dedicated-inference deployment scale my-app 0 -z at-vie-2

Resume When Needed:

# Scale back up—URL and API key are preserved
exo dedicated-inference deployment scale my-app 1 -z at-vie-2

Cost Savings Example:

  • Running 24/7: 730 hours/month
  • Scale to zero overnight (18:00-08:00) and on weekends: ~350 active hours/month
  • Savings: ~52% on GPU costs

4. Minimize GPU Count

Only use multiple GPUs per instance when absolutely necessary.

Single GPU (Preferred):

--gpu-count 1 --replicas 3  # Total: 3 GPUs

Multiple GPUs (Only for Large Models):

--gpu-count 4 --replicas 1  # Total: 4 GPUs

Why This Matters:

  • Multi-GPU deployments are for large models that don’t fit on one GPU
  • If your model fits on 1 GPU, use replicas instead (full commands are sketched below)
  • Same total GPU count, but replicas provide better fault tolerance
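
For illustration, here are the two layouts as full commands, reusing the flags from the earlier examples; the large-model name and the gpurtx6000pro type value are placeholders, so substitute the values available to your account:

# Model fits on one GPU: prefer replicas for fault tolerance (3 GPUs total)
exo dedicated-inference deployment create small-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 --gpu-count 1 --replicas 3 -z at-vie-2

# Model too large for a single GPU: shard it across GPUs in one replica (4 GPUs total)
exo dedicated-inference deployment create large-app \
  --model-name <large-model> \
  --gpu-type gpurtx6000pro --gpu-count 4 --replicas 1 -z at-vie-2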

5. Clean Up Unused Models

Models in Object Storage incur minor SOS costs (typically < 1% of total cost). Still, delete models you no longer need:

exo dedicated-inference model list -z at-vie-2
exo dedicated-inference model delete <model-id> -z at-vie-2

6. Share Models Across Deployments

Create a model once and reuse it for multiple deployments.

Efficient Approach:

# Create model once
exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2

# Use for multiple deployments
exo dedicated-inference deployment create dev-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 --gpu-count 1 --replicas 1 -z at-vie-2

exo dedicated-inference deployment create prod-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 --gpu-count 1 --replicas 2 -z at-vie-2

Benefit: Only one copy stored in Object Storage, used by all deployments.

7. Use Speculative Decoding for Efficiency

Speculative decoding can improve throughput by 1.5-3× on the same hardware, reducing GPU time per request and improving cost efficiency.

Without Speculative Decoding:

  • 100 requests/hour on 1 replica

With Speculative Decoding:

  • 200-300 requests/hour on the same hardware
  • Or serve 100 requests/hour with fewer replicas

See Optimize Performance guide for implementation details.

8. Optimize Inference Parameters

Reduce unnecessary computation in your API requests:

Limit Token Generation:

{
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "messages": [{"role": "user", "content": "Hello"}],
  "max_tokens": 50  // Use only what you need
}

Enable Streaming:

{
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": true  // Better perceived performance
}

Cost Impact: Shorter responses = less GPU time per request = more requests per replica.
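
For reference, a request like this can be sent with curl, assuming the deployment exposes an OpenAI-compatible chat completions endpoint at its URL; the endpoint path and header names here are assumptions, so substitute your deployment's URL and API key:

curl -s https://<your-deployment-url>/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 50,
    "stream": true
  }'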

Cost Estimation

Use the Exoscale Advanced Calculator to estimate costs.

Example Calculation:

Scenario: Mistral-7B on GPUA5000, 2 replicas, 12 hours/day, 22 days/month

  1. GPU Compute:

    • GPUA5000 rate: Check GPU pricing
    • Hours: 2 replicas × 12 hours/day × 22 days = 528 GPU-hours
    • Cost: 528 × hourly rate
  2. Object Storage:

    • Model size: ~14 GB (Mistral-7B)
    • Monthly SOS cost: ~$0.30-0.50 (varies by region)
  3. Total: GPU compute + SOS costs (a scripted version of this arithmetic is sketched below)
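
The same arithmetic as a small script; the hourly rate is a placeholder rather than an actual price, so check the GPU pricing page for current rates:

# Rough monthly estimate for the scenario above
RATE_PER_GPU_HOUR=1.00   # placeholder, not an actual Exoscale price
REPLICAS=2; HOURS_PER_DAY=12; DAYS=22
GPU_HOURS=$((REPLICAS * HOURS_PER_DAY * DAYS))   # 2 × 12 × 22 = 528 GPU-hours
awk -v h="$GPU_HOURS" -v r="$RATE_PER_GPU_HOUR" \
  'BEGIN { printf "GPU compute: %d GPU-hours x %.2f/hour = %.2f per month\n", h, r, h * r }'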

Cost Monitoring

Track Deployment Hours

Keep a log of active deployment time:

#!/bin/bash
# Script to log deployment status
DATE=$(date +%Y-%m-%d)
DEPLOYMENT="my-app"
STATUS=$(exo dedicated-inference deployment show "$DEPLOYMENT" -z at-vie-2 | grep Status)
echo "$DATE: $STATUS" >> deployment-log.txt
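
To collect this log automatically, the script can be run on a schedule, for example from cron; the script path below is illustrative:

# Run the logging script once per hour
0 * * * * /usr/local/bin/log-deployment.sh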

Use Exoscale Portal

Monitor costs in real-time:

  1. Log in to Exoscale Portal
  2. Navigate to Billing section
  3. Review GPU compute and Object Storage usage
  4. Set up budget alerts if available

Cost Optimization Workflow

For Development

# Start of workday - scale up
exo dedicated-inference deployment scale dev-app 1 -z at-vie-2

# End of workday - scale to zero
exo dedicated-inference deployment scale dev-app 0 -z at-vie-2

Daily Savings: ~14 hours × GPU rate (URL and API key preserved)
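
This routine can also be automated with cron, assuming the exo CLI is on the PATH and configured with credentials for the cron user:

# Scale up at 08:00 on weekdays
0 8 * * 1-5  exo dedicated-inference deployment scale dev-app 1 -z at-vie-2
# Scale to zero at 18:00 on weekdays
0 18 * * 1-5 exo dedicated-inference deployment scale dev-app 0 -z at-vie-2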

For Production

# Low traffic (1 replica)
exo dedicated-inference deployment scale prod-app 1 -z at-vie-2

# High traffic (3 replicas)
exo dedicated-inference deployment scale prod-app 3 -z at-vie-2

# Weekends - scale to zero
exo dedicated-inference deployment scale prod-app 0 -z at-vie-2

Weekend Savings: 48 hours × GPU rate × replica count

Cost Comparison: Strategies in Action

Scenario: Running Mistral-7B for a production application

Strategy    | Setup                                | Monthly GPU Hours | Relative Cost
Unoptimized | 3× GPURTX6000pro, 24/7               | 2,190             | 3.0×
Better GPU  | 3× GPUA5000, 24/7                    | 2,190             | 1.5×
Scaled      | 3× GPUA5000, 12h/day                 | 1,095             | 0.75×
Optimized   | 1× GPUA5000, 12h/day, spec. decoding | ~200*             | 0.15×

*With speculative decoding improving throughput by 1.5-3×, fewer GPU-hours are needed for the same request volume.

Quick Wins Checklist

  • Use the smallest GPU that fits your model
  • Start with 1 replica and scale only when needed
  • Scale to zero during idle periods
  • Clean up unused models in Object Storage
  • Share models across multiple deployments
  • Consider speculative decoding for high-throughput workloads
  • Set max_tokens appropriately in API requests
  • Monitor costs weekly in Exoscale Portal

Next Steps
