CLI Commands

The Exoscale CLI (exo) provides comprehensive commands for managing Dedicated Inference models and deployments.

Command Overview

All Dedicated Inference commands are accessed through the exo dedicated-inference namespace:

exo dedicated-inference --help

For detailed command reference and additional options, see the Dedicated Inference CLI documentation.

Model Management

Create a Model

Download and register a model from Hugging Face for use in deployments.

Syntax:

exo dedicated-inference model create <huggingface-model-id> \
  [--huggingface-token <token>] \
  -z <zone>

Parameters:

  • <huggingface-model-id>: The model identifier from Hugging Face (e.g., mistralai/Mistral-7B-Instruct-v0.3)
  • --huggingface-token: Required for gated models (Llama, etc.)
  • -z, --zone: Exoscale zone where the model will be stored

Examples:

Public model:

exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2

Gated model:

exo dedicated-inference model create meta-llama/Llama-3.2-1B-Instruct \
  --huggingface-token hf_YOUR_TOKEN \
  -z at-vie-2

Notes:

  • Model creation may take several minutes depending on model size
  • Models are stored in Exoscale Object Storage (SOS)
  • Only models in the safetensors format are supported
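
Because creation is asynchronous, it can help to poll until the model leaves the creating state before deploying it. The loop below is a minimal sketch that relies only on the model list command described below plus standard shell tools; the model name and zone are examples, and the loop will not terminate if the model never appears in the list.

# Poll every 30 seconds until the model no longer reports "creating"
until exo dedicated-inference model list -z at-vie-2 \
  | grep Mistral-7B-Instruct-v0.3 \
  | grep -qv creating; do
  sleep 30
done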

List Models

View all models available in a zone.

Syntax:

exo dedicated-inference model list -z <zone>

Example:

exo dedicated-inference model list -z at-vie-2

Output includes:

  • Model ID
  • Model name (Hugging Face identifier)
  • Status (creating, created, error)
  • Creation date

Show Model Details

Display detailed information about a specific model.

Syntax:

exo dedicated-inference model show <model-id> -z <zone>

Example:

exo dedicated-inference model show 12345678-1234-1234-1234-123456789abc -z at-vie-2

Delete a Model

Remove a model from Object Storage.

Syntax:

exo dedicated-inference model delete <model-id> -z <zone>

Example:

exo dedicated-inference model delete 12345678-1234-1234-1234-123456789abc -z at-vie-2

Warning: This permanently deletes the model. Any deployments using this model will be affected.

Deployment Management

Create a Deployment

Deploy a model as an inference endpoint on dedicated GPUs.

Syntax:

exo dedicated-inference deployment create <deployment-name> \
  --model-name <huggingface-model-id> \
  --gpu-type <gpu-sku> \
  --gpu-count <count> \
  --replicas <count> \
  -z <zone>

Parameters:

  • <deployment-name>: Unique name for your deployment
  • --model-name: Hugging Face identifier of the model as created (alternatively, use --model-id with the Exoscale model ID)
  • --gpu-type: GPU type (e.g., gpua5000, gpurtx6000pro)
  • --gpu-count: Number of GPUs per model replica (use more than one when the model does not fit on a single GPU)
  • --replicas: Number of model replicas (minimum 1 during creation)
  • -z, --zone: Exoscale zone for deployment

Optional Parameters:

  • --inference-engine-params: Advanced inference engine configuration (space-separated parameters passed to the vLLM engine)
    • Use for speculative decoding: --speculative-config={JSON}
    • Use for GPU memory tuning: --gpu-memory-utilization=0.9
    • Use for context length: --max-model-len=32768

Examples:

Basic deployment:

exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2

Multi-GPU deployment:

exo dedicated-inference deployment create large-model \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpua5000 \
  --gpu-count 4 \
  --replicas 1 \
  -z at-vie-2

With speculative decoding (the GPUs must have enough memory for both the main and the draft model):

exo dedicated-inference deployment create fast-inference \
  --model-name meta-llama/Llama-3.1-70B-Instruct \
  --gpu-type gpurtx6000pro \
  --gpu-count 2 \
  --replicas 1 \
  --inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
  -z at-vie-2

Notes:

  • Total GPUs used = gpu-count × replicas (for example, --gpu-count 4 with --replicas 2 consumes 8 GPUs)
  • Deployment typically takes 3-5 minutes
  • GPU billing starts when deployment is ready

Inference Engine Parameters

Advanced vLLM configuration can be passed via --inference-engine-params. To see all available options:

exo dedicated-inference deployment create --inference-engine-parameter-help

For detailed documentation on vLLM parameters, see the vLLM documentation.

Example:

exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--max-model-len=32768 --gpu-memory-utilization=0.85' \
  -z at-vie-2

List Deployments

View all deployments in a zone.

Syntax:

exo dedicated-inference deployment list -z <zone>

Example:

exo dedicated-inference deployment list -z at-vie-2

Output includes:

  • Deployment name
  • Model name
  • GPU type and count
  • Replica count
  • Status (deploying, ready, scaling)

Show Deployment Details

Display detailed information about a deployment including endpoint URL.

Syntax:

exo dedicated-inference deployment show <deployment-name> -z <zone>

Example:

exo dedicated-inference deployment show my-app -z at-vie-2

Output includes:

  • Deployment URL (including the /v1 path)
  • Current status
  • GPU configuration
  • Replica count
  • Creation date

Scale a Deployment

Change the number of replicas for a deployment.

Syntax:

exo dedicated-inference deployment scale <deployment-name> <replica-count> -z <zone>

Parameters:

  • <deployment-name>: Name of the deployment to scale
  • <replica-count>: New number of replicas (0 to stop billing)
  • -z, --zone: Exoscale zone

Examples:

Scale up to 3 replicas:

exo dedicated-inference deployment scale my-app 3 -z at-vie-2

Scale down to 1 replica:

exo dedicated-inference deployment scale my-app 1 -z at-vie-2

Scale to zero (stop GPU billing):

exo dedicated-inference deployment scale my-app 0 -z at-vie-2

Notes:

  • Scale to 0 to stop GPU billing while keeping the deployment
  • URL and API key are preserved when scaling to/from zero
  • Scaling takes 3-5 minutes per new replica
  • --gpu-count cannot be changed; create a new deployment instead

Get Deployment API Key

Retrieve the API key for authenticating requests to a deployment endpoint.

Syntax:

exo dedicated-inference deployment reveal-api-key <deployment-name> -z <zone>

Example:

exo dedicated-inference deployment reveal-api-key my-app -z at-vie-2

Security Note: Treat this API key as a secret. The command reveals the existing key (it does not rotate it). Use it in the Authorization: Bearer <key> header for API requests. If you need a new key, delete and recreate the deployment.
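
As an illustration, the request below sends the key in the Authorization header. It is a minimal sketch: <deployment-url> and <key> are placeholders for the values returned by deployment show and reveal-api-key, and the /chat/completions route assumes the OpenAI-compatible API exposed by the vLLM engine.

# Placeholder values; the deployment URL already ends in /v1
curl <deployment-url>/chat/completions \
  -H "Authorization: Bearer <key>" \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "messages": [{"role": "user", "content": "Hello"}]}'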

View Deployment Logs

Access logs for troubleshooting and monitoring.

Syntax:

exo dedicated-inference deployment logs <deployment-name> -z <zone> [--tail <N>]

Examples:

# Show all logs
exo dedicated-inference deployment logs my-app -z at-vie-2

# Show last 100 lines
exo dedicated-inference deployment logs my-app -z at-vie-2 --tail 100
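
To narrow the output when diagnosing a failed start, the logs can be piped through standard shell tools. This is just a sketch combining the command above with grep:

# Show only lines mentioning errors in the most recent 200 log lines
exo dedicated-inference deployment logs my-app -z at-vie-2 --tail 200 | grep -i error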

Delete a Deployment

Remove a deployment and stop GPU billing.

Syntax:

exo dedicated-inference deployment delete <deployment-name> -z <zone>

Example:

exo dedicated-inference deployment delete my-app -z at-vie-2

Notes:

  • Stops GPU billing immediately
  • The underlying model remains in Object Storage
  • Endpoint URL becomes inactive
  • This action cannot be undone

Common Workflows

End-to-End Deployment

Complete workflow from model to production endpoint:

# 1. Create model
exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2

# 2. Wait for model to be available
exo dedicated-inference model list -z at-vie-2

# 3. Create deployment
exo dedicated-inference deployment create my-app \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2

# 4. Check deployment status
exo dedicated-inference deployment show my-app -z at-vie-2

# 5. Get API credentials
exo dedicated-inference deployment reveal-api-key my-app -z at-vie-2
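
Once the deployment reports a ready status, a quick request confirms the endpoint is reachable. This is a sketch with placeholder values: <deployment-url> and <key> come from steps 4 and 5, and the /models route assumes the vLLM OpenAI-compatible API (the deployment URL already includes /v1).

# 6. Verify the endpoint responds
curl <deployment-url>/models -H "Authorization: Bearer <key>"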

Cost Management

Manage resources to optimize costs:

# Scale down during off-hours
exo dedicated-inference deployment scale my-app 1 -z at-vie-2

# Scale to zero to stop GPU billing (preserves URL and API key)
exo dedicated-inference deployment scale my-app 0 -z at-vie-2

# Scale back up when needed
exo dedicated-inference deployment scale my-app 1 -z at-vie-2

# Clean up model completely (only if no longer needed)
exo dedicated-inference model delete <model-id> -z at-vie-2
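
If the load pattern is predictable, the scale commands can also be scheduled. The crontab entries below are a hypothetical sketch: they assume exo is on the PATH and that credentials are available to the cron environment; adjust the deployment name, times, and zone to your setup.

# Scale to zero every weekday at 20:00 and back to 1 replica at 07:00
0 20 * * 1-5 exo dedicated-inference deployment scale my-app 0 -z at-vie-2
0 7 * * 1-5 exo dedicated-inference deployment scale my-app 1 -z at-vie-2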

Troubleshooting

See Monitor and troubleshoot deployments for troubleshooting steps and diagnostics.

Zone Availability

GPU availability varies by zone and GPU type. See GPU availability by zone for the current GPU-by-zone matrix. Choose zones based on your data residency requirements and proximity to users.
