CLI Commands

The Exoscale CLI (exo) provides comprehensive commands for managing Dedicated Inference models and deployments.

Command Overview

All Dedicated Inference commands are accessed through the exo dedicated-inference namespace:

exo dedicated-inference --help

For detailed command reference and additional options, see the Dedicated Inference CLI documentation.

Model Management

Create a Model

Download and register a model from Hugging Face for use in deployments.

Syntax:

exo dedicated-inference model create <huggingface-model-id> \
  [--huggingface-token <token>] \
  -z <zone>

Parameters:

  • <huggingface-model-id>: The model identifier from Hugging Face (e.g., mistralai/Mistral-7B-Instruct-v0.3)
  • --huggingface-token: Required for gated models (e.g., the Llama family)
  • -z, --zone: Exoscale zone where the model will be stored

Examples:

Public model:

exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2

Gated model:

exo dedicated-inference model create meta-llama/Llama-3.2-1B-Instruct \
  --huggingface-token hf_YOUR_TOKEN \
  -z at-vie-2

Notes:

  • Model creation may take several minutes depending on model size (a polling sketch follows these notes)
  • Models are stored in Exoscale Object Storage (SOS)
  • Only models in the safetensors format are supported
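
Because model creation is asynchronous, scripts should wait for the model to become available before deploying. Below is a minimal polling sketch, assuming the CLI's global -O json output flag and jq are available; the name and state JSON fields are illustrative assumptions, not a documented schema:

# Wait until the model reaches a terminal state (JSON field names are assumptions)
while true; do
  state=$(exo dedicated-inference model list -z at-vie-2 -O json \
    | jq -r '.[] | select(.name == "mistralai/Mistral-7B-Instruct-v0.3") | .state')
  [ "$state" = "created" ] && break
  [ "$state" = "error" ] && { echo "model creation failed" >&2; exit 1; }
  sleep 30
done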

List Models

View all models available in a zone.

Syntax:

exo dedicated-inference model list -z <zone>

Example:

exo dedicated-inference model list -z at-vie-2

Output includes:

  • Model ID
  • Model name (Hugging Face identifier)
  • Status (creating, created, error)
  • Creation date

Show Model Details

Display detailed information about a specific model.

Syntax:

exo dedicated-inference model show <model-id> -z <zone>

Example:

exo dedicated-inference model show 12345678-1234-1234-1234-123456789abc -z at-vie-2

Delete a Model

Remove a model from Object Storage.

Syntax:

exo dedicated-inference model delete <model-id> -z <zone>

Example:

exo dedicated-inference model delete 12345678-1234-1234-1234-123456789abc -z at-vie-2

Warning: This permanently deletes the model. Any deployments using this model will be affected.
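
Before deleting, it is worth confirming that no deployment still references the model. A quick check, assuming the CLI's global -O json output flag and jq (the model field name in the JSON output is an assumption):

# List the model used by each deployment (the "model" field name is an assumption)
exo dedicated-inference deployment list -z at-vie-2 -O json | jq -r '.[].model'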

Deployment Management

Create a Deployment

Deploy a model as an inference endpoint on dedicated GPUs.

Syntax:

exo dedicated-inference deployment create <deployment-name> \
  --model-name <huggingface-model-id> \
  --gpu-type <gpu-sku> \
  --gpu-count <count> \
  --replicas <count> \
  -z <zone>

Parameters:

  • <deployment-name>: Unique name for your deployment
  • --model-name: The Hugging Face identifier used when the model was created (alternatively, pass --model-id with the Exoscale model ID)
  • --gpu-type: GPU type (e.g., gpua5000, gpurtx6000pro)
  • --gpu-count: Number of GPUs per model instance (increase for models too large to fit in a single GPU's memory)
  • --replicas: Number of model replicas (minimum 1 during creation)
  • -z, --zone: Exoscale zone for deployment

Optional Parameters:

  • --inference-engine-params: Advanced inference engine configuration (space-separated parameters passed to the vLLM engine)
    • Use for speculative decoding: --speculative-config={JSON}
    • Use for GPU memory tuning: --gpu-memory-utilization=0.9
    • Use for context length: --max-model-len=32768

Examples:

Basic deployment:

exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2

Multi-GPU deployment:

exo dedicated-inference deployment create large-model \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpua5000 \
  --gpu-count 4 \
  --replicas 1 \
  -z at-vie-2

With speculative decoding (the GPUs must have enough memory to host both the target and the draft model):

exo dedicated-inference deployment create fast-inference \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpurtx6000pro \
  --gpu-count 2 \
  --replicas 1 \
  --inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
  -z at-vie-2

Notes:

  • Total GPUs used = gpu-count × replicas (for example, --gpu-count 4 with --replicas 2 consumes 8 GPUs)
  • Deployment typically takes 3-5 minutes
  • GPU billing starts when deployment is ready

Inference Engine Parameters

Advanced vLLM configuration can be passed via --inference-engine-params. To see all available options:

exo dedicated-inference deployment create --inference-engine-parameter-help

For detailed documentation on vLLM parameters, see the vLLM documentation.

Example:

exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--max-model-len=32768 --gpu-memory-utilization=0.85' \
  -z at-vie-2

List Deployments

View all deployments in a zone.

Syntax:

exo dedicated-inference deployment list -z <zone>

Example:

exo dedicated-inference deployment list -z at-vie-2

Output includes:

  • Deployment name
  • Model name
  • GPU type and count
  • Replica count
  • Status (deploying, ready, scaling)

Show Deployment Details

Display detailed information about a deployment including endpoint URL.

Syntax:

exo dedicated-inference deployment show <deployment-name> -z <zone>

Example:

exo dedicated-inference deployment show my-app -z at-vie-2

Output includes:

  • Deployment URL (including the /v1 path; see the scripting sketch below)
  • Current status
  • GPU configuration
  • Replica count
  • Creation date
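
For scripting, the endpoint URL can be extracted from the command's JSON output. The sketch below assumes the CLI's global -O json flag and jq; the url field name is an assumption about the output schema:

# Capture the endpoint URL for use in scripts (the "url" field name is an assumption)
ENDPOINT=$(exo dedicated-inference deployment show my-app -z at-vie-2 -O json | jq -r '.url')
echo "$ENDPOINT"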

Scale a Deployment

Change the number of replicas for a deployment.

Syntax:

exo dedicated-inference deployment scale <deployment-name> <replica-count> -z <zone>

Parameters:

  • <deployment-name>: Name of the deployment to scale
  • <replica-count>: New number of replicas (set to 0 to stop GPU billing)
  • -z, --zone: Exoscale zone

Examples:

Scale up to 3 replicas:

exo dedicated-inference deployment scale my-app 3 -z at-vie-2

Scale down to 1 replica:

exo dedicated-inference deployment scale my-app 1 -z at-vie-2

Scale to zero (stop GPU billing):

exo dedicated-inference deployment scale my-app 0 -z at-vie-2

Notes:

  • Scale to 0 to stop GPU billing while keeping the deployment
  • URL and API key are preserved when scaling to/from zero
  • Scaling takes 3-5 minutes per new replica
  • --gpu-count cannot be changed after creation; create a new deployment instead

Get Deployment API Key

Retrieve the API key for authenticating requests to a deployment endpoint.

Syntax:

exo dedicated-inference deployment reveal-api-key <deployment-name> -z <zone>

Example:

exo dedicated-inference deployment reveal-api-key my-app -z at-vie-2

Security Note: Treat this API key as a secret. The command reveals the existing key (it does not rotate it). Use it in the Authorization: Bearer <key> header for API requests. If you need a new key, delete and recreate the deployment.
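
Once the key and endpoint URL are known, a request can be sent with curl. This sketch assumes the deployment exposes vLLM's OpenAI-compatible chat completions route under the /v1 path; replace the placeholders with your own values:

# <deployment-url> is the full URL from deployment show (already includes the /v1 path);
# <api-key> comes from reveal-api-key
curl -s <deployment-url>/chat/completions \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "messages": [{"role": "user", "content": "Hello"}]}'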

View Deployment Logs

Access logs for troubleshooting and monitoring.

Syntax:

exo dedicated-inference deployment logs <deployment-name> -z <zone> [--tail <N>]

Examples:

# Show all logs
exo dedicated-inference deployment logs my-app -z at-vie-2

# Show last 100 lines
exo dedicated-inference deployment logs my-app -z at-vie-2 --tail 100
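
Log output can be piped through standard shell tools to narrow a search, for example to surface errors and warnings in the most recent lines:

# Filter the last 500 log lines for errors and warnings
exo dedicated-inference deployment logs my-app -z at-vie-2 --tail 500 | grep -iE 'error|warn'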

Delete a Deployment

Remove a deployment and stop GPU billing.

Syntax:

exo dedicated-inference deployment delete <deployment-name> -z <zone>

Example:

exo dedicated-inference deployment delete my-app -z at-vie-2

Notes:

  • Stops GPU billing immediately
  • The underlying model remains in Object Storage
  • Endpoint URL becomes inactive
  • This action cannot be undone

Common Workflows

End-to-End Deployment

Complete workflow from model to production endpoint:

# 1. Create model
exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2

# 2. Wait for model to be available
exo dedicated-inference model list -z at-vie-2

# 3. Create deployment
exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2

# 4. Check deployment status
exo dedicated-inference deployment show my-app -z at-vie-2

# 5. Get API credentials
exo dedicated-inference deployment reveal-api-key my-app -z at-vie-2

Cost Management

Manage resources to optimize costs:

# Scale down during off-hours
exo dedicated-inference deployment scale my-app 1 -z at-vie-2

# Scale to zero to stop GPU billing (preserves URL and API key)
exo dedicated-inference deployment scale my-app 0 -z at-vie-2

# Scale back up when needed
exo dedicated-inference deployment scale my-app 1 -z at-vie-2

# Clean up model completely (only if no longer needed)
exo dedicated-inference model delete <model-id> -z at-vie-2
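
Scaling can also be scheduled. The crontab sketch below is an example under stated assumptions: the cron user has a configured exo account, and the workload tolerates the 3-5 minute warm-up when scaling back up:

# Example crontab: scale to zero overnight and back to one replica each morning (server-local time)
0 20 * * * exo dedicated-inference deployment scale my-app 0 -z at-vie-2
0 7 * * * exo dedicated-inference deployment scale my-app 1 -z at-vie-2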

Troubleshooting

See Monitor and troubleshoot deployments for troubleshooting steps and diagnostics.

Zone Availability

GPU availability varies by zone and GPU type. See GPU availability by zone for the current GPU-by-zone matrix. Choose zones based on your data residency requirements and proximity to users.
