CLI Commands
The Exoscale CLI (exo) provides comprehensive commands for managing Dedicated Inference models and deployments.
Command Overview
All Dedicated Inference commands are accessed through the exo dedicated-inference namespace:
exo dedicated-inference --help
For detailed command reference and additional options, see the Dedicated Inference CLI documentation.
Model Management
Create a Model
Download and register a model from Hugging Face for use in deployments.
Syntax:
exo dedicated-inference model create <huggingface-model-id> \
[--huggingface-token <token>] \
-z <zone>
Parameters:
- <huggingface-model-id>: The model identifier from Hugging Face (e.g., mistralai/Mistral-7B-Instruct-v0.3)
- --huggingface-token: Required for gated models (Llama, etc.)
- -z, --zone: Exoscale zone where the model will be stored
Examples:
Public model:
exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2
Gated model:
exo dedicated-inference model create meta-llama/Llama-3.2-1B-Instruct \
--huggingface-token hf_YOUR_TOKEN \
-z at-vie-2
Notes:
- Model creation may take several minutes depending on model size
- Models are stored in Exoscale Object Storage (SOS)
- Only models in safetensors format are supported
List Models
View all models available in a zone.
Syntax:
exo dedicated-inference model list -z <zone>
Example:
exo dedicated-inference model list -z at-vie-2
Output includes:
- Model ID
- Model name (Hugging Face identifier)
- Status (creating, created, error)
- Creation date
Show Model Details
Display detailed information about a specific model.
Syntax:
exo dedicated-inference model show <model-id> -z <zone>
Example:
exo dedicated-inference model show 12345678-1234-1234-1234-123456789abc -z at-vie-2
Delete a Model
Remove a model from Object Storage.
Syntax:
exo dedicated-inference model delete <model-id> -z <zone>
Example:
exo dedicated-inference model delete 12345678-1234-1234-1234-123456789abc -z at-vie-2
Warning: This permanently deletes the model. Any deployments using this model will be affected.
Deployment Management
Create a Deployment
Deploy a model as an inference endpoint on dedicated GPUs.
Syntax:
exo dedicated-inference deployment create <deployment-name> \
--model-name <huggingface-model-id> \
--gpu-type <gpu-sku> \
--gpu-count <count> \
--replicas <count> \
-z <zone>
Parameters:
- <deployment-name>: Unique name for your deployment
- --model-name: Model name as created (alternative: --model-id for the Exoscale model ID)
- --gpu-type: GPU type (e.g., gpua5000, gpurtx6000pro)
- --gpu-count: Number of GPUs per model instance (for large models)
- --replicas: Number of model replicas (minimum 1 during creation)
- -z, --zone: Exoscale zone for deployment
Optional Parameters:
- --inference-engine-params: Advanced inference engine configuration (space-separated parameters passed to the vLLM engine)
  - Use for speculative decoding: --speculative-config={JSON}
  - Use for GPU memory tuning: --gpu-memory-utilization=0.9
  - Use for context length: --max-model-len=32768
Examples:
Basic deployment:
exo dedicated-inference deployment create my-app \
--model-name mistralai/Mistral-7B-Instruct-v0.3 \
--gpu-type gpua5000 \
--gpu-count 1 \
--replicas 1 \
-z at-vie-2
Multi-GPU deployment:
exo dedicated-inference deployment create large-model \
--model-name meta-llama/Llama-3.1-70B-Instruct \
--gpu-type gpua5000 \
--gpu-count 4 \
--replicas 1 \
-z at-vie-2
With speculative decoding (requires a larger GPU type to hold both the main and the draft model):
exo dedicated-inference deployment create fast-inference \
--model-name meta-llama/Llama-3.1-70B-Instruct \
--gpu-type gpurtx6000pro \
--gpu-count 2 \
--replicas 1 \
--inference-engine-params '--speculative-config={"method":"eagle3","model":"meta-llama/Llama-3.1-8B-Instruct","num_speculative_tokens":5}' \
-z at-vie-2
Notes:
- Total GPUs used = gpu-count × replicas (for example, --gpu-count 4 with --replicas 2 uses 8 GPUs)
- Deployment typically takes 3-5 minutes
- GPU billing starts when deployment is ready
Inference Engine Parameters
Advanced vLLM configuration can be passed via --inference-engine-params. To see all available options:
exo dedicated-inference deployment create --inference-engine-parameter-help
For detailed documentation on vLLM parameters, see the vLLM documentation.
Example:
exo dedicated-inference deployment create my-app \
--model-name mistralai/Mistral-7B-Instruct-v0.3 \
--gpu-type gpua5000 \
--gpu-count 1 \
--replicas 1 \
--inference-engine-params '--max-model-len=32768 --gpu-memory-utilization=0.85' \
-z at-vie-2
List Deployments
View all deployments in a zone.
Syntax:
exo dedicated-inference deployment list -z <zone>
Example:
exo dedicated-inference deployment list -z at-vie-2
Output includes:
- Deployment name
- Model name
- GPU type and count
- Replica count
- Status (deploying, ready, scaling)
Show Deployment Details
Display detailed information about a deployment including endpoint URL.
Syntax:
exo dedicated-inference deployment show <deployment-name> -z <zone>
Example:
exo dedicated-inference deployment show my-app -z at-vie-2
Output includes:
- Deployment URL (including the /v1 path)
- Current status
- GPU configuration
- Replica count
- Creation date
Scale a Deployment
Change the number of replicas for a deployment.
Syntax:
exo dedicated-inference deployment scale <deployment-name> <replica-count> -z <zone>
Parameters:
- <deployment-name>: Name of the deployment to scale
- <replica-count>: New number of replicas (0 to stop billing)
- -z, --zone: Exoscale zone
Examples:
Scale up to 3 replicas:
exo dedicated-inference deployment scale my-app 3 -z at-vie-2
Scale down to 1 replica:
exo dedicated-inference deployment scale my-app 1 -z at-vie-2
Scale to zero (stop GPU billing):
exo dedicated-inference deployment scale my-app 0 -z at-vie-2
Notes:
- Scale to 0 to stop GPU billing while keeping the deployment
- URL and API key are preserved when scaling to/from zero
- Scaling takes 3-5 minutes per new replica (see the wait-for-ready sketch after this list)
- --gpu-count cannot be changed; create a new deployment instead
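If you script scaling, you may want to wait until the deployment reports ready again before routing traffic to it. A minimal wait-for-ready sketch, assuming the ready status value appears in the plain-text output of deployment show (the grep pattern is an assumption, not a documented output contract):
# Poll every 30 seconds until the deployment status shows ready
until exo dedicated-inference deployment show my-app -z at-vie-2 | grep -q ready; do
  echo "waiting for my-app to become ready..."
  sleep 30
done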
Get Deployment API Key
Retrieve the API key for authenticating requests to a deployment endpoint.
Syntax:
exo dedicated-inference deployment reveal-api-key <deployment-name> -z <zone>
Example:
exo dedicated-inference deployment reveal-api-key my-app -z at-vie-2
Security Note: Treat this API key as a secret. The command reveals the existing key (it does not rotate it). Use it in the Authorization: Bearer <key> header for API requests. If you need a new key, delete and recreate the deployment.
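For example, you can verify the key against the endpoint returned by deployment show. This is an illustrative sketch: it assumes the deployment exposes an OpenAI-compatible API (as vLLM-based deployments typically do), and <deployment-url> is a placeholder for the URL shown by deployment show, which already includes the /v1 path:
# List the models served by the deployment, authenticating with the revealed API key
curl <deployment-url>/models \
  -H "Authorization: Bearer <api-key>"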
View Deployment Logs
Access logs for troubleshooting and monitoring.
Syntax:
exo dedicated-inference deployment logs <deployment-name> -z <zone> [--tail <N>]
Examples:
# Show all logs
exo dedicated-inference deployment logs my-app -z at-vie-2
# Show last 100 lines
exo dedicated-inference deployment logs my-app -z at-vie-2 --tail 100
Delete a Deployment
Remove a deployment and stop GPU billing.
Syntax:
exo dedicated-inference deployment delete <deployment-name> -z <zone>
Example:
exo dedicated-inference deployment delete my-app -z at-vie-2
Notes:
- Stops GPU billing immediately
- The underlying model remains in Object Storage
- Endpoint URL becomes inactive
- This action cannot be undone
Common Workflows
End-to-End Deployment
Complete workflow from model to production endpoint:
# 1. Create model
exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2
# 2. Wait for model to be available
exo dedicated-inference model list -z at-vie-2
# 3. Create deployment
exo dedicated-inference deployment create my-app \
--model-name mistralai/Mistral-7B-Instruct-v0.3 \
--gpu-type gpua5000 \
--gpu-count 1 \
--replicas 1 \
-z at-vie-2
# 4. Check deployment status
exo dedicated-inference deployment show my-app -z at-vie-2
# 5. Get API credentials
exo dedicated-inference deployment reveal-api-key my-app -z at-vie-2
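# 6. (Optional) Test the endpoint. This step is illustrative: <deployment-url> and <api-key>
#    are placeholders for the values from steps 4 and 5, and the request assumes an
#    OpenAI-compatible chat completions API, which vLLM-based deployments typically expose.
curl <deployment-url>/chat/completions \
  -H "Authorization: Bearer <api-key>" \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Mistral-7B-Instruct-v0.3", "messages": [{"role": "user", "content": "Hello"}]}'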
Cost Management
Manage resources to optimize costs:
# Scale down during off-hours
exo dedicated-inference deployment scale my-app 1 -z at-vie-2
# Scale to zero to stop GPU billing (preserves URL and API key)
exo dedicated-inference deployment scale my-app 0 -z at-vie-2
# Scale back up when needed
exo dedicated-inference deployment scale my-app 1 -z at-vie-2
# Clean up model completely (only if no longer needed)
exo dedicated-inference model delete <model-id> -z at-vie-2
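To automate the off-hours pattern, the scale commands can be scheduled, for example with cron. A minimal sketch, assuming the exo CLI is installed at /usr/local/bin/exo and configured with API credentials on the host running cron (the path and times are illustrative):
# Scale to zero at 20:00 and back to one replica at 07:00, server local time
0 20 * * * /usr/local/bin/exo dedicated-inference deployment scale my-app 0 -z at-vie-2
0 7 * * * /usr/local/bin/exo dedicated-inference deployment scale my-app 1 -z at-vie-2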
Troubleshooting
See Monitor and troubleshoot deployments for troubleshooting steps and diagnostics.
Zone Availability
GPU availability varies by zone and GPU type. See GPU availability by zone for the current GPU-by-zone matrix. Choose zones based on your data residency requirements and proximity to users.