Scaling Your Deployments

Exoscale Dedicated Inference provides flexible scaling options to match your inference workload requirements and optimize costs.

Understanding Scaling Parameters

When creating or scaling a deployment, two key parameters control resource allocation:

Replicas
The number of identical copies of your model deployment running in parallel. Replicas enable horizontal scaling to handle concurrent requests and provide high availability.
GPU Count
The number of GPUs assigned to a single model instance. This enables vertical scaling for large models that require multiple GPUs to run efficiently.

The total number of GPUs consumed by a deployment is calculated as:

Total GPUs = gpu-count × replicas

For example:

  • --gpu-count 2 --replicas 3 = 6 total GPUs
  • --gpu-count 1 --replicas 1 = 1 total GPU

Horizontal Scaling (Replicas)

Horizontal scaling adds more replicas of your deployment to handle increased inference load and improve availability.

Use horizontal scaling when:

  • You need to handle more concurrent requests
  • You want to improve availability and fault tolerance
  • Your model fits comfortably on a single GPU (or your chosen GPU count)

Example:

# Scale to 3 replicas for higher throughput
exo dedicated-inference deployment scale my-deployment 3 -z at-vie-2
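After scaling, you can confirm the result with the show command (the exact fields displayed depend on the CLI output):

# Inspect the deployment after scaling
exo dedicated-inference deployment show my-deployment -z at-vie-2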

Vertical Scaling (GPU Count)

Vertical scaling distributes a single large model across multiple GPUs, enabling you to run models that exceed the memory capacity of a single GPU.

Use vertical scaling when:

  • Your model is too large to fit on a single GPU
  • You need more GPU memory for a single inference instance
  • You’re deploying very large language models (LLMs)

Note: GPU count is set during deployment creation and cannot be changed afterward. To change GPU count, you must create a new deployment.
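Because GPU count can only be set at creation time, a multi-GPU deployment is requested with the --gpu-count flag when the deployment is created. The values below are illustrative; substitute a model that actually requires multiple GPUs:

# Create a deployment whose single instance spans 2 GPUs
exo dedicated-inference deployment create <deployment-name> \
  --model-name <large-model-name> \
  --gpu-type gpua5000 \
  --gpu-count 2 \
  --replicas 1 \
  -z at-vie-2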

Scaling Operations

Scale Up

Increase the number of replicas to handle more traffic:

exo dedicated-inference deployment scale <deployment-name> <new-replica-count> -z <zone>

Example:

# Scale from 1 to 3 replicas
exo dedicated-inference deployment scale demo 3 -z at-vie-2

Scale Down

Reduce the number of replicas during low-traffic periods:

# Scale down to 1 replica
exo dedicated-inference deployment scale demo 1 -z at-vie-2

Scale to Zero

Scale to zero replicas to stop GPU billing while keeping the deployment and its credentials:

exo dedicated-inference deployment scale demo 0 -z at-vie-2

When you scale to zero:

  • GPU billing stops immediately
  • The model remains stored in Object Storage (standard SOS costs apply)
  • Your URL and API key are preserved (deleting the deployment would discard them)
  • Scaling from zero takes 3-5 minutes (or longer for large models)

To resume the deployment:

exo dedicated-inference deployment scale demo 1 -z at-vie-2

Deployment Timing

Understanding deployment timing helps you plan scaling operations:

Initial Deployment
Creating a new deployment typically takes 3-5 minutes for standard models, or longer for very large models (100+ GB).
Scaling Up
Adding replicas to a running deployment takes 3-5 minutes per new replica.
Scaling from Zero
Resuming a deployment from zero replicas takes 3-5 minutes or longer, as the model must be loaded into GPU memory.
Scaling Down
Reducing replicas is nearly instantaneous, with billing stopping immediately.
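If you automate scaling, you may want to wait for the deployment to become ready before routing traffic to it. The sketch below simply re-runs the show command until its output mentions a running state; the exact status wording is an assumption, so adjust the pattern to whatever show prints in your environment:

# Poll until the deployment reports a ready/running state (illustrative;
# the "running" pattern is an assumption about the show output)
until exo dedicated-inference deployment show demo -z at-vie-2 | grep -qi running; do
  echo "waiting for deployment to become ready..."
  sleep 30
done
echo "deployment is ready"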

Monitoring and Cost Optimization

For monitoring, troubleshooting, and cost optimization strategies, refer to the dedicated documentation pages on those topics.

Example Workflow

Here’s a typical scaling workflow for a production deployment:

# 1. Create initial deployment with minimal resources
exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2

# 2. Monitor and test with low traffic
exo dedicated-inference deployment show my-app -z at-vie-2

# 3. Scale up for production traffic
exo dedicated-inference deployment scale my-app 3 -z at-vie-2

# 4. Scale down during lower traffic periods
exo dedicated-inference deployment scale my-app 1 -z at-vie-2

# 5. Scale to zero during extended downtime (nights, weekends)
exo dedicated-inference deployment scale my-app 0 -z at-vie-2

# 6. Scale back up when needed (URL and API key are preserved)
exo dedicated-inference deployment scale my-app 1 -z at-vie-2

Scaling Limitations

GPU Count is Immutable
The --gpu-count parameter cannot be changed after deployment creation. To use a different GPU count, create a new deployment.
Quota Limits
Total GPU usage is subject to your organization’s GPU quota. Ensure sufficient quota before scaling up.
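For example, a deployment created with --gpu-count 2 consumes 6 GPUs at 3 replicas and 10 GPUs at 5 replicas; the higher replica count is only possible if those 10 GPUs fit within the organization's quota.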