# Scale Deployments

## Scaling Your Deployments

Exoscale Dedicated Inference provides flexible scaling options to match your inference workload requirements and optimize costs.

### Understanding Scaling Parameters

When creating or scaling a deployment, two key parameters control resource allocation:

**Replicas**
: The number of identical copies of your model deployment running in parallel. Replicas enable horizontal scaling for handling concurrent requests and providing high availability.

**GPU Count**
: The number of GPUs assigned to a single model instance. This enables vertical scaling for large models that require multiple GPUs to run efficiently.

The total number of GPUs consumed by a deployment is calculated as:

```
Total GPUs = gpu-count × replicas
```

For example:
- `--gpu-count 2 --replicas 3` = 6 total GPUs
- `--gpu-count 1 --replicas 1` = 1 total GPU
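
If you are planning capacity against a GPU quota, the same arithmetic can be scripted before you deploy. This is a minimal shell sketch; the variable names are illustrative and are not CLI flags:

```bash
# Quick capacity check before creating or scaling a deployment.
# GPU_COUNT and REPLICAS are illustrative shell variables, not CLI options.
GPU_COUNT=2
REPLICAS=3
echo "Total GPUs consumed: $((GPU_COUNT * REPLICAS))"   # prints 6
```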

### Horizontal Scaling (Replicas)

Horizontal scaling adds more replicas of your deployment to handle increased inference load and improve availability.

**Use horizontal scaling when:**
- You need to handle more concurrent requests
- You want to improve availability and fault tolerance
- Your model fits comfortably on a single GPU (or your chosen GPU count)

**Example:**
```bash
# Scale to 3 replicas for higher throughput
exo dedicated-inference deployment scale my-deployment 3 -z at-vie-2
```

### Vertical Scaling (GPU Count)

Vertical scaling distributes a single large model across multiple GPUs, enabling you to run models that exceed the memory capacity of a single GPU.

**Use vertical scaling when:**
- Your model is too large to fit on a single GPU
- You need more GPU memory for a single inference instance
- You're deploying very large language models (LLMs)

**Note:** GPU count is set during deployment creation and cannot be changed afterward. To change GPU count, you must create a new deployment.
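
As a sketch, a deployment that shards one model instance across two GPUs can be created as follows. The deployment name is a placeholder and the model shown is simply the one used elsewhere in this guide; substitute the large model you actually need to distribute:

```bash
# Create a deployment that runs one model instance across 2 GPUs
# (placeholder name and model; adjust to your workload)
exo dedicated-inference deployment create my-large-model \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 2 \
  --replicas 1 \
  -z at-vie-2
```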

### Scaling Operations

#### Scale Up

Increase the number of replicas to handle more traffic:

```bash
exo dedicated-inference deployment scale <deployment-name> <new-replica-count> -z <zone>
```

**Example:**
```bash
# Scale from 1 to 3 replicas
exo dedicated-inference deployment scale demo 3 -z at-vie-2
```

#### Scale Down

Reduce the number of replicas during low-traffic periods:

```bash
# Scale down to 1 replica
exo dedicated-inference deployment scale demo 1 -z at-vie-2
```

#### Scale to Zero

Scale to zero replicas to stop GPU billing while keeping the deployment and its credentials:

```bash
exo dedicated-inference deployment scale demo 0 -z at-vie-2
```

When you scale to zero:
- GPU billing stops immediately
- The model remains stored in Object Storage (standard SOS costs apply)
- **Your endpoint URL and API key are preserved** (unlike deleting the deployment, which discards them)
- Scaling from zero takes 3-5 minutes (or longer for large models)

To resume the deployment:

```bash
exo dedicated-inference deployment scale demo 1 -z at-vie-2
```

### Deployment Timing

Understanding deployment timing helps you plan scaling operations:

**Initial Deployment**
: Creating a new deployment typically takes 3-5 minutes for standard models, or longer for very large models (100+ GB).

**Scaling Up**
: Adding replicas to a running deployment takes 3-5 minutes per new replica.

**Scaling from Zero**
: Resuming a deployment from zero replicas takes 3-5 minutes or longer, as the model must be loaded into GPU memory.

**Scaling Down**
: Reducing replicas is nearly instantaneous, with billing stopping immediately.
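
Because scaling up and resuming from zero are not instantaneous, it can help to poll the deployment until it reports ready. A minimal sketch using the `show` command from this guide (the deployment name and zone are the examples used throughout):

```bash
# Re-run `show` every 30 seconds to watch the deployment come up
watch -n 30 exo dedicated-inference deployment show my-app -z at-vie-2
```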

### Monitoring and Cost Optimization

For monitoring, troubleshooting, and cost strategies, see:
- [Monitor and troubleshoot deployments]({{< ref "/product/concrete-ai/dedicated-inference/how-to/monitor-troubleshoot/" >}})
- [Optimize deployment costs]({{< ref "/product/concrete-ai/dedicated-inference/how-to/optimize-costs/" >}})

### Example Workflow

Here's a typical scaling workflow for a production deployment:

```bash
# 1. Create initial deployment with minimal resources
exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2

# 2. Monitor and test with low traffic
exo dedicated-inference deployment show my-app -z at-vie-2

# 3. Scale up for production traffic
exo dedicated-inference deployment scale my-app 3 -z at-vie-2

# 4. Scale down during lower traffic periods
exo dedicated-inference deployment scale my-app 1 -z at-vie-2

# 5. Scale to zero during extended downtime (nights, weekends)
exo dedicated-inference deployment scale my-app 0 -z at-vie-2

# 6. Scale back up when needed (URL and API key are preserved)
exo dedicated-inference deployment scale my-app 1 -z at-vie-2
```

### Scaling Limitations

**GPU Count is Immutable**
: The `--gpu-count` parameter cannot be changed after deployment creation. To use a different GPU count, create a new deployment.

**Quota Limits**
: Total GPU usage is subject to your organization's GPU quota. Ensure sufficient quota before scaling up.
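
Before scaling up, it can be useful to review what is already consuming quota in the target zone. A hedged sketch, assuming the CLI offers a `list` subcommand alongside `show` (confirm with `exo dedicated-inference deployment --help`):

```bash
# Inspect existing deployments in the zone before scaling up
# (assumes a `list` subcommand exists; verify with --help)
exo dedicated-inference deployment list -z at-vie-2

# Then check the deployment you plan to scale
exo dedicated-inference deployment show my-app -z at-vie-2
```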

