# Use CLI Commands

The Exoscale CLI (`exo`) provides comprehensive commands for managing Dedicated Inference models and deployments.

## Command Overview

All Dedicated Inference commands are accessed through the `exo dedicated-inference` namespace:

```bash
exo dedicated-inference --help
```

For detailed command reference and additional options, see the [Dedicated Inference CLI documentation]({{< ref "/reference/cli/exo/dedicated-inference/" >}}).

## Model Management

### Create a Model

Download and register a model from Hugging Face for use in deployments.

**Syntax:**
```bash
exo dedicated-inference model create <huggingface-model-id> \
  [--huggingface-token <token>] \
  -z <zone>
```

**Parameters:**
- `<huggingface-model-id>`: The model identifier from Hugging Face (e.g., `mistralai/Mistral-7B-Instruct-v0.3`)
- `--huggingface-token`: Hugging Face access token; required for gated models (for example, Llama)
- `-z, --zone`: Exoscale zone where the model will be stored

**Examples:**

Public model:
```bash
exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2
```

Gated model:
```bash
exo dedicated-inference model create meta-llama/Llama-3.2-1B-Instruct \
  --huggingface-token hf_YOUR_TOKEN \
  -z at-vie-2
```

> [!NOTE]
> - Model creation may take several minutes depending on model size
> - Models are stored in Exoscale Object Storage (SOS)
> - Only models in `safetensors` format are supported

### List Models

View all models available in a zone.

**Syntax:**
```bash
exo dedicated-inference model list -z <zone>
```

**Example:**
```bash
exo dedicated-inference model list -z at-vie-2
```

**Output includes:**
- Model ID
- Model name (Hugging Face identifier)
- Status (`creating`, `created`, `error`)
- Creation date

### Show Model Details

Display detailed information about a specific model.

**Syntax:**
```bash
exo dedicated-inference model show <model-id> -z <zone>
```

**Example:**
```bash
exo dedicated-inference model show 12345678-1234-1234-1234-123456789abc -z at-vie-2
```

### Delete a Model

Remove a model from Object Storage.

**Syntax:**
```bash
exo dedicated-inference model delete <model-id> -z <zone>
```

**Example:**
```bash
exo dedicated-inference model delete 12345678-1234-1234-1234-123456789abc -z at-vie-2
```

> [!WARNING]
> This permanently deletes the model.

## Deployment Management

### Create a Deployment

Deploy a model as an inference endpoint on dedicated GPUs.

**Syntax:**
```bash
exo dedicated-inference deployment create <deployment-name> \
  --model-name <huggingface-model-id> \
  --gpu-type <gpu-sku> \
  --gpu-count <count> \
  --replicas <count> \
  -z <zone>
```

**Parameters:**
- `<deployment-name>`: Unique name for your deployment
- `--model-name`: Hugging Face identifier of a model you have already created (alternatively, pass `--model-id` with the Exoscale model ID)
- `--gpu-type`: GPU type (e.g., `gpua5000`, `gpurtx6000pro`)
- `--gpu-count`: Number of GPUs per replica; increase it for models too large to fit on a single GPU
- `--replicas`: Number of model replicas (minimum 1 during creation)
- `-z, --zone`: Exoscale zone for deployment

**Optional Parameters:**
- `--inference-engine-params`: Advanced inference engine configuration (space-separated parameters passed to the vLLM engine)
  - Use for speculative decoding: `--speculative-config={JSON}`
  - Use for GPU memory tuning: `--gpu-memory-utilization=0.9`
  - Use for context length: `--max-model-len=32768`

**Examples:**

Basic deployment:
```bash
exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2
```

Multi-GPU deployment:
```bash
exo dedicated-inference deployment create large-model \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpua5000 \
  --gpu-count 4 \
  --replicas 1 \
  -z at-vie-2
```

With speculative decoding (the GPUs must have enough memory for both the target and the draft model):
```bash
exo dedicated-inference deployment create fast-inference \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpurtx6000pro \
  --gpu-count 2 \
  --replicas 1 \
  --inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
  -z at-vie-2
```

> [!NOTE]
> - Total GPUs used = `gpu-count × replicas`
> - Deployment typically takes 3-5 minutes
> - GPU billing starts when deployment is ready
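
To make the GPU arithmetic in the note concrete, the following sketch multiplies the two values. The `total_gpus` function is a hypothetical helper for illustration only; it is not part of the `exo` CLI.

```bash
# total_gpus GPU_COUNT REPLICAS
# Hypothetical helper: prints gpu-count multiplied by replicas.
total_gpus() {
  echo $(( $1 * $2 ))
}

# A deployment with --gpu-count 4 and --replicas 2 occupies 8 GPUs.
total_gpus 4 2
```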

### Inference Engine Parameters

Advanced vLLM configuration can be passed via `--inference-engine-params`. To see all available options:

```bash
exo dedicated-inference deployment create --inference-engine-parameter-help
```

For detailed documentation on vLLM parameters, see the [vLLM documentation](https://docs.vllm.ai/en/latest/).

**Example:**

```bash
exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--max-model-len=32768 --gpu-memory-utilization=0.85' \
  -z at-vie-2
```

### List Deployments

View all deployments in a zone.

**Syntax:**
```bash
exo dedicated-inference deployment list -z <zone>
```

**Example:**
```bash
exo dedicated-inference deployment list -z at-vie-2
```

**Output includes:**
- Deployment name
- Model name
- GPU type and count
- Replica count
- Status (`deploying`, `ready`, `scaling`)

### Show Deployment Details

Display detailed information about a deployment including endpoint URL.

**Syntax:**
```bash
exo dedicated-inference deployment show <deployment-name> -z <zone>
```

**Example:**
```bash
exo dedicated-inference deployment show my-app -z at-vie-2
```

**Output includes:**
- Deployment URL (including `/v1` path)
- Current status
- GPU configuration
- Replica count
- Creation date

### Scale a Deployment

Change the number of replicas for a deployment.

**Syntax:**
```bash
exo dedicated-inference deployment scale <deployment-name> <replica-count> -z <zone>
```

**Parameters:**
- `<deployment-name>`: Name of the deployment to scale
- `<replica-count>`: New number of replicas (0 to stop billing)
- `-z, --zone`: Exoscale zone

**Examples:**

Scale up to 3 replicas:
```bash
exo dedicated-inference deployment scale my-app 3 -z at-vie-2
```

Scale down to 1 replica:
```bash
exo dedicated-inference deployment scale my-app 1 -z at-vie-2
```

Scale to zero (stop GPU billing):
```bash
exo dedicated-inference deployment scale my-app 0 -z at-vie-2
```

> [!NOTE]
> - Scale to 0 to stop GPU billing while keeping the deployment
> - URL and API key are preserved when scaling to/from zero
> - Scaling takes 3-5 minutes per new replica
> - `--gpu-count` cannot be changed; create a new deployment instead

### Get Deployment API Key

Retrieve the API key for authenticating requests to a deployment endpoint.

**Syntax:**
```bash
exo dedicated-inference deployment reveal-api-key <deployment-name> -z <zone>
```

**Example:**
```bash
exo dedicated-inference deployment reveal-api-key my-app -z at-vie-2
```

**Security Note:** Treat this API key as a secret. The command reveals the existing key (it does not rotate it). Use it in the `Authorization: Bearer <key>` header for API requests. If you need a new key, delete and recreate the deployment.
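
As a minimal sketch of using the key, the helper below sends an authenticated GET request with `curl`. It assumes the deployment exposes vLLM's OpenAI-compatible routes under the `/v1` URL, so that a path such as `models` would list the served model. The function name `inference_get` and the placeholder URL and key are illustrative only.

```bash
# inference_get BASE_URL API_KEY PATH
# Hypothetical helper: GET <BASE_URL>/<PATH> with the deployment API key.
inference_get() {
  base_url="${1%/}"   # drop a trailing slash, if any
  api_key="$2"
  path="${3#/}"       # drop a leading slash, if any
  curl -sS -H "Authorization: Bearer ${api_key}" "${base_url}/${path}"
}

# Usage (placeholders; take the URL from `deployment show`
# and the key from `deployment reveal-api-key`):
# inference_get "https://<deployment-url>/v1" "<api-key>" models
```

Passing the key as an argument keeps it out of the script itself; in practice you may prefer to read it from an environment variable or a secrets manager.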

### View Deployment Logs

Access logs for troubleshooting and monitoring.

**Syntax:**
```bash
exo dedicated-inference deployment logs <deployment-name> -z <zone> [--tail <N>]
```

**Examples:**
```bash
# Show all logs
exo dedicated-inference deployment logs my-app -z at-vie-2

# Show last 100 lines
exo dedicated-inference deployment logs my-app -z at-vie-2 --tail 100
```

### Delete a Deployment

Remove a deployment and stop GPU billing.

**Syntax:**
```bash
exo dedicated-inference deployment delete <deployment-name> -z <zone>
```

**Example:**
```bash
exo dedicated-inference deployment delete my-app -z at-vie-2
```

> [!NOTE]
> - Stops GPU billing immediately
> - The underlying model remains in Object Storage
> - Endpoint URL becomes inactive
> - This action cannot be undone

### List Authorized Instance Types

List the instance types you are authorized to use for deployments in a zone. An instance type is considered authorized if there is sufficient capacity and you have the necessary permissions to create a deployment on it.

**Syntax:**
```bash
exo dedicated-inference deployment instance-type -z <zone>
```

**Example:**
```bash
exo dedicated-inference deployment instance-type -z at-vie-2
```

## Common Workflows

### End-to-End Deployment

Complete workflow from model to production endpoint:

```bash
# 1. Create model
exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2

# 2. Check the model status; repeat until it shows `created`
exo dedicated-inference model list -z at-vie-2

# 3. Create deployment
exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2

# 4. Check deployment status
exo dedicated-inference deployment show my-app -z at-vie-2

# 5. Get API credentials
exo dedicated-inference deployment reveal-api-key my-app -z at-vie-2
```
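
Steps 2 and 4 above are manual checks. One way to automate them is a small polling helper, sketched below. It assumes the status word (`created` for models, `ready` for deployments) appears verbatim in the command's text output; the function name `wait_until_contains` and the interval and attempt values are hypothetical.

```bash
# wait_until_contains WORD INTERVAL_SECONDS MAX_ATTEMPTS COMMAND [ARGS...]
# Re-runs COMMAND until its output contains WORD as a whole word.
# Returns 0 on a match, 1 if MAX_ATTEMPTS is reached first.
wait_until_contains() {
  word="$1"; interval="$2"; max_attempts="$3"
  shift 3
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -w matches whole words only, so "created" does not match "creating"
    if "$@" 2>/dev/null | grep -qw -- "$word"; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}

# Usage (check every 30 seconds, give up after 40 attempts):
# wait_until_contains created 30 40 exo dedicated-inference model show <model-id> -z at-vie-2
# wait_until_contains ready 30 40 exo dedicated-inference deployment show my-app -z at-vie-2
```

Matching on plain text keeps the sketch independent of any particular output format, at the cost of being less precise than parsing a structured field.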

### Cost Management

Manage resources to optimize costs:

```bash
# Scale down during off-hours
exo dedicated-inference deployment scale my-app 1 -z at-vie-2

# Scale to zero to stop GPU billing (preserves URL and API key)
exo dedicated-inference deployment scale my-app 0 -z at-vie-2

# Scale back up when needed
exo dedicated-inference deployment scale my-app 1 -z at-vie-2

# Clean up model completely (only if no longer needed)
exo dedicated-inference model delete <model-id> -z at-vie-2
```
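
Off-hours scaling can also be scripted. The sketch below picks a replica count from the current hour; the `replicas_for_hour` helper and the 07:00 to 20:00 working window are assumptions chosen for illustration. Run from a scheduler such as cron, the commented command would apply the result.

```bash
# replicas_for_hour HOUR
# Hypothetical helper: prints 1 during working hours (07-19), otherwise 0.
replicas_for_hour() {
  hour="${1#0}"   # strip one leading zero so "08" compares as 8
  if [ "$hour" -ge 7 ] && [ "$hour" -lt 20 ]; then
    echo 1
  else
    echo 0
  fi
}

# Usage (for example, from an hourly cron job):
# exo dedicated-inference deployment scale my-app "$(replicas_for_hour "$(date +%H)")" -z at-vie-2
```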

### Troubleshooting

See [Monitor and troubleshoot deployments]({{< ref "/product/concrete-ai/dedicated-inference/how-to/monitor-troubleshoot/" >}}) for troubleshooting steps and diagnostics.

## Zone Availability

GPU availability varies by zone and GPU type. See [GPU availability by zone](https://www.exoscale.com/gpu/#comparison-gpu) for the current GPU-by-zone matrix. Choose zones based on your data residency requirements and proximity to users.

