Scaling Your Deployments
Exoscale Dedicated Inference provides flexible scaling options to match your inference workload requirements and optimize costs.
Understanding Scaling Parameters
When creating or scaling a deployment, two key parameters control resource allocation:
- Replicas: The number of identical copies of your model deployment running in parallel. Replicas enable horizontal scaling for handling concurrent requests and providing high availability.
- GPU Count: The number of GPUs assigned to a single model instance. This enables vertical scaling for large models that require multiple GPUs to run efficiently.
The total number of GPUs consumed by a deployment is calculated as:
```
Total GPUs = gpu-count × replicas
```

For example:

- `--gpu-count 2 --replicas 3` = 6 total GPUs
- `--gpu-count 1 --replicas 1` = 1 total GPU
Horizontal Scaling (Replicas)
Horizontal scaling adds more replicas of your deployment to handle increased inference load and improve availability.
Use horizontal scaling when:
- You need to handle more concurrent requests
- You want to improve availability and fault tolerance
- Your model fits comfortably on a single GPU (or your chosen GPU count)
Example:
```
# Scale to 3 replicas for higher throughput
exo dedicated-inference deployment scale my-deployment 3 -z at-vie-2
```

Vertical Scaling (GPU Count)
Vertical scaling distributes a single large model across multiple GPUs, enabling you to run models that exceed the memory capacity of a single GPU.
Use vertical scaling when:
- Your model is too large to fit on a single GPU
- You need more GPU memory for a single inference instance
- You’re deploying very large language models (LLMs)
Note: GPU count is set during deployment creation and cannot be changed afterward. To change GPU count, you must create a new deployment.
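For example, a deployment that spreads a single model instance across two GPUs is created as follows. This is a minimal sketch reusing the create flags shown in the workflow later in this section; the model name, GPU type, and zone are illustrative.

```
# Create a deployment with 2 GPUs per model instance (vertical scaling)
exo dedicated-inference deployment create my-large-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 2 \
  --replicas 1 \
  -z at-vie-2
```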
Scaling Operations
Scale Up
Increase the number of replicas to handle more traffic:
```
exo dedicated-inference deployment scale <deployment-name> <new-replica-count> -z <zone>
```

Example:
```
# Scale from 1 to 3 replicas
exo dedicated-inference deployment scale demo 3 -z at-vie-2
```

Scale Down
Reduce the number of replicas during low-traffic periods:
```
# Scale down to 1 replica
exo dedicated-inference deployment scale demo 1 -z at-vie-2
```

Scale to Zero
Scale to zero replicas to stop GPU billing while keeping the deployment and its credentials:
```
exo dedicated-inference deployment scale demo 0 -z at-vie-2
```

When you scale to zero:
- GPU billing stops immediately
- The model remains stored in Object Storage (standard SOS costs apply)
- Your URL and API key are preserved (unlike deletion, which discards them)
- Scaling from zero takes 3-5 minutes (or longer for large models)
To resume the deployment:
```
exo dedicated-inference deployment scale demo 1 -z at-vie-2
```
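Because scaling from zero takes several minutes, it can help to poll the deployment status while it comes back up. A simple sketch using the `show` command from above and the standard `watch` utility (the interval is arbitrary):

```
# Re-run the status command every 30 seconds until the deployment reports ready
watch -n 30 exo dedicated-inference deployment show demo -z at-vie-2
```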
Deployment Timing

Understanding deployment timing helps you plan scaling operations:
- Initial Deployment: Creating a new deployment typically takes 3-5 minutes for standard models, or longer for very large models (100+ GB).
- Scaling Up: Adding replicas to a running deployment takes 3-5 minutes per new replica.
- Scaling from Zero: Resuming a deployment from zero replicas takes 3-5 minutes or longer, as the model must be loaded into GPU memory.
- Scaling Down: Reducing replicas is nearly instantaneous, with billing stopping immediately.
Monitoring and Cost Optimization
For monitoring, troubleshooting, and cost-optimization strategies, see the related documentation pages.
Example Workflow
Here’s a typical scaling workflow for a production deployment:
```
# 1. Create initial deployment with minimal resources
exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2

# 2. Monitor and test with low traffic
exo dedicated-inference deployment show my-app -z at-vie-2

# 3. Scale up for production traffic
exo dedicated-inference deployment scale my-app 3 -z at-vie-2

# 4. Scale down during lower-traffic periods
exo dedicated-inference deployment scale my-app 1 -z at-vie-2

# 5. Scale to zero during extended downtime (nights, weekends)
exo dedicated-inference deployment scale my-app 0 -z at-vie-2

# 6. Scale back up when needed (URL and API key are preserved)
exo dedicated-inference deployment scale my-app 1 -z at-vie-2
```
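If your traffic pattern is predictable, steps 5 and 6 can be automated with a scheduler. Below is a minimal sketch using cron, assuming the exo CLI is installed and authenticated on the machine running the jobs; the schedule times are illustrative.

```
# Illustrative crontab entries (edit with: crontab -e)
# Scale to zero every night at 22:00
0 22 * * * exo dedicated-inference deployment scale my-app 0 -z at-vie-2
# Scale back to 1 replica on weekday mornings at 07:00
0 7 * * 1-5 exo dedicated-inference deployment scale my-app 1 -z at-vie-2
```

Since the scale-up job runs only on weekdays, the deployment stays at zero over the weekend, matching the nights-and-weekends pattern in the workflow above.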
Scaling Limitations

- GPU Count is Immutable: The `--gpu-count` parameter cannot be changed after deployment creation. To use a different GPU count, create a new deployment.
- Quota Limits: Total GPU usage is subject to your organization's GPU quota. Ensure sufficient quota before scaling up.