Optimize Deployment Costs

Dedicated Inference cost comes mostly from running GPUs. The goal is simple: run the GPU capacity you need, only while you need it.

This guide covers cost decisions. For command syntax, see Use CLI Commands.

What You Pay For

There are two cost sources:

GPU compute for running replicas. This is the main driver.
Object Storage for model files kept after model creation. Smaller, but it adds up if you keep many large models.

GPU usage for one deployment is:

Total GPUs = gpu-count × replicas

So --gpu-count 2 --replicas 3 uses 6 GPUs. The same deployment at --replicas 0 uses none. Object Storage charges continue even at zero replicas, because the model files remain stored.
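As a quick check of the arithmetic above, here is a minimal shell sketch. The function name total_gpus is illustrative and is not part of the exo CLI.

```shell
# Illustrative helper (not an exo command): GPUs used by one deployment.
# Usage: total_gpus GPU_COUNT REPLICAS
total_gpus() {
  echo $(( $1 * $2 ))
}

total_gpus 2 3   # 2 GPUs per replica, 3 replicas: prints 6
total_gpus 2 0   # scaled to zero: prints 0
```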

Right-Size the GPU Setup

Start with the smallest GPU type and --gpu-count that runs your model with the context length and parameters your workload actually needs.

This is where deployments often become more expensive than expected. Model size alone doesn’t determine GPU memory use: numeric precision sets how much memory the weights take, and the KV-cache grows with the context window. A model that looks small can still need a large GPU if it defaults to a very long context window.

Before reaching for a bigger GPU, try trimming memory use. A lower --max-model-len shrinks the KV-cache, because the inference engine no longer reserves room for the model’s full context window:

--inference-engine-params '--max-model-len=32000'
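To see why context length matters, the sketch below estimates KV-cache size for one sequence, assuming a standard transformer that stores one key and one value vector per layer, per KV head, per token. The model dimensions are hypothetical (32 layers, 8 KV heads, head dimension 128, 2 bytes per cached value); read the real values from your model's configuration. Inference engines manage memory in their own way, so treat the result as a rough estimate rather than a precise figure.

```shell
# Rough KV-cache estimate in bytes for ONE sequence of a given length.
# Formula: 2 (key + value) x layers x kv_heads x head_dim x bytes_per_value x tokens
# Usage: kv_cache_bytes LAYERS KV_HEADS HEAD_DIM BYTES_PER_VALUE TOKENS
kv_cache_bytes() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 ))
}

# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, 2 bytes per value.
full=$(kv_cache_bytes 32 8 128 2 131072)     # a 131072-token context window
trimmed=$(kv_cache_bytes 32 8 128 2 32000)   # with --max-model-len=32000

echo "full context:    $(( full / 1048576 )) MiB"      # 16384 MiB
echo "trimmed context: $(( trimmed / 1048576 )) MiB"   # 4000 MiB
```

With these assumed dimensions, trimming the context from 131072 to 32000 tokens cuts the per-sequence cache from about 16 GiB to about 4 GiB, which can be the difference between needing a larger GPU and fitting on a smaller one.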

--gpu-count is primarily for model fit. Use more than one GPU per replica only when the model plus your context settings don’t fit on one. You can’t change --gpu-count after creation, so pick it deliberately: a value that is too low causes the deployment to fail at startup or leaves no memory headroom.

exo dedicated-inference deployment create "$DEPLOYMENT" \
  --model-name "$MODEL" \
  --gpu-type gpu3 \
  --gpu-count 2 \
  --replicas 1 \
  --inference-engine-params '--max-model-len=32000' \
  -z "$ZONE"

Scale Replicas for Traffic

--replicas is the traffic lever. Each replica is another full copy of the GPU allocation, so a deployment created with --gpu-count 2 uses 2 GPUs at 1 replica and 6 GPUs at 3 replicas.

exo dedicated-inference deployment scale "$DEPLOYMENT" 3 -z "$ZONE"

Scale to Zero When Idle

Scaling to zero stops GPU billing while keeping the deployment, its endpoint URL, and its API key.

exo dedicated-inference deployment scale "$DEPLOYMENT" 0 -z "$ZONE"

Bring it back when needed:

exo dedicated-inference deployment scale "$DEPLOYMENT" 1 -z "$ZONE"

This fits development deployments outside working hours, demos, paused projects, and scheduled idle periods. Coming back from zero takes a few minutes while the model reloads, so use it in production only when that downtime is acceptable.
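For scheduled idle periods, one option is to run the scale command from a scheduler such as cron. The sketch below is an assumption about how you might wire this up, not an Exoscale feature: desired_replicas is a hypothetical helper, and the Monday to Friday, 08:00 to 19:00 window is only an example.

```shell
# Hypothetical helper: print the replica count wanted at a given time.
# Usage: desired_replicas HOUR DAY_OF_WEEK
#   HOUR is 00-23 (as printed by `date +%H`), DAY_OF_WEEK is 1=Mon .. 7=Sun (`date +%u`).
desired_replicas() {
  # Strip a leading zero so that "08" is compared as the number 8.
  case $1 in
    0?) hour=${1#0} ;;
    *)  hour=$1 ;;
  esac
  dow=$2
  if [ "$dow" -le 5 ] && [ "$hour" -ge 8 ] && [ "$hour" -lt 19 ]; then
    echo 1   # working hours: keep one replica running
  else
    echo 0   # nights and weekends: scale to zero
  fi
}

# Example call for an hourly cron job (left commented out here):
# exo dedicated-inference deployment scale "$DEPLOYMENT" \
#   "$(desired_replicas "$(date +%H)" "$(date +%u)")" -z "$ZONE"
```

A job like this overrides any manual scaling each time it runs, so it suits development and test deployments better than production ones.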

Delete What You Won’t Reuse

Scale to zero is a pause. Delete is cleanup. Choose based on whether you still want that endpoint and API key.

Delete a deployment you no longer need:

exo dedicated-inference deployment delete "$DEPLOYMENT" -z "$ZONE"

This stops GPU billing and removes the endpoint. The registered model stays stored, so you can deploy it again later.

Registered models keep incurring Object Storage charges until you delete them. List and delete models you no longer use:

exo dedicated-inference model list -z "$ZONE"
exo dedicated-inference model delete <model-id> -z "$ZONE"

You can’t delete a model while a deployment still uses it. Delete the deployment first.

Reuse One Model Across Deployments

Create a model once, then deploy it more than once, for example one deployment for testing and one for production:

exo dedicated-inference model create "$MODEL" -z "$ZONE"

This avoids storing duplicate copies of the model files, and you can change or delete deployments without re-importing the model.
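A sketch of that pattern, using only the flags shown earlier in this guide. create_deployment is a hypothetical wrapper, and the deployment names and the GPU_TYPE and GPU_COUNT variables are placeholders to set for your own model.

```shell
# Hypothetical wrapper: create one deployment from the already registered "$MODEL".
# Usage: create_deployment DEPLOYMENT_NAME REPLICAS
# Expects MODEL, GPU_TYPE, GPU_COUNT and ZONE to be set in the environment.
create_deployment() {
  exo dedicated-inference deployment create "$1" \
    --model-name "$MODEL" \
    --gpu-type "$GPU_TYPE" \
    --gpu-count "$GPU_COUNT" \
    --replicas "$2" \
    -z "$ZONE"
}

# Example (left commented out here): two deployments, one stored model.
# create_deployment "myapp-test" 1
# create_deployment "myapp-prod" 2
```

Both deployments reference the same stored model, so deleting the test deployment later leaves the production one and the model untouched.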

Estimate Before You Commit

Use the Exoscale Advanced Calculator before creating long-running deployments. To estimate GPU usage for a month:

Monthly GPU hours = gpu-count × replicas × running hours per month
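A worked example of this formula as a shell sketch. The gpu_hours helper is illustrative, and the schedules are assumptions: roughly 730 hours in a month for an always-on deployment, and 10 hours on 22 working days for one that is scaled to zero otherwise.

```shell
# Illustrative helper: GPU hours for one deployment over a period.
# Usage: gpu_hours GPU_COUNT REPLICAS RUNNING_HOURS
gpu_hours() {
  echo $(( $1 * $2 * $3 ))
}

gpu_hours 2 1 730   # always on for a ~730-hour month:  prints 1460
gpu_hours 2 1 220   # 10 hours on 22 working days:      prints 440
```

Multiply the result by the hourly price of your GPU type from the calculator to get an approximate monthly GPU cost.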

Monitor Costs

Review billing and usage in the Exoscale Portal: GPU compute, Object Storage for model files, running deployments, and unused models.

Make it a habit to check development and test deployments weekly. Idle GPUs are easy to forget after experiments, demos, and benchmark runs.

Cost Checklist

Use the smallest GPU setup that fits the model and context length and meets your performance needs.
Reduce context length when the full window isn’t needed.
Use gpu-count for model fit, replicas for traffic.
Scale to zero when a deployment is idle.
Delete deployments and models you won’t reuse.
Reuse one registered model across deployments.
Review costs in the Portal regularly.

Last updated on June 15, 2026