Use CLI Commands
Use the Exoscale CLI (exo) to manage Dedicated Inference from your terminal.
Dedicated Inference has two main resources:
- A model is a model from Hugging Face that Exoscale downloads and stores for later use. Hugging Face is a public model registry where enterprises, AI labs and teams publish AI models.
- A deployment runs one of those models on dedicated GPUs and exposes it through an endpoint.
All Dedicated Inference commands are available under the exo dedicated-inference namespace:
exo dedicated-inference --help
For the full command reference, see the Dedicated Inference CLI documentation.
Model Management
Create a model before you create a deployment. This gives Exoscale a copy of the model files in the selected zone, so deployments don’t need to fetch the model directly from Hugging Face every time.
Create a Model
Register a model from Hugging Face.
exo dedicated-inference model create <huggingface-model-id> \
[--huggingface-token <token>] \
-z <zone>
Parameters:
- <huggingface-model-id>: Hugging Face model identifier, for example mistralai/Ministral-3-8B-Reasoning-2512
- --huggingface-token: Hugging Face access token. Required for gated models.
- -z, --zone: Exoscale zone where the model is stored
Example with a public model:
exo dedicated-inference model create mistralai/Ministral-3-8B-Reasoning-2512 -z de-fra-1
Example with a gated model:
exo dedicated-inference model create <gated-model-id> \
--huggingface-token hf_YOUR_TOKEN \
-z de-fra-1
A gated model is a Hugging Face model that requires authentication before you can download it.
Note
Model creation can take several minutes. Larger models have more files to download and store. Models are stored in Exoscale Object Storage (SOS), and only models using the safetensors format are supported.
List Models
List models available in a zone.
exo dedicated-inference model list -z <zone>
Example:
exo dedicated-inference model list -z de-fra-1
The output includes the model ID, model name, state, size, creation date and update date.
Show Model Details
Display details for a model.
exo dedicated-inference model show <model-id> -z <zone>
Example:
exo dedicated-inference model show 12345678-1234-1234-1234-123456789abc -z de-fra-1
Use the model ID when you want to avoid ambiguity, especially if you have several models with similar names.
Delete a Model
Delete a model from storage.
exo dedicated-inference model delete <model-id> -z <zone>
Example:
exo dedicated-inference model delete 12345678-1234-1234-1234-123456789abc -z de-fra-1
Warning
Deleting a model is permanent. You can’t delete a model while an active deployment is using it.
Deployment Management
A deployment runs a registered model on dedicated GPUs. Each deployment has its own endpoint URL and API key.
Create a Deployment
Create an inference deployment from a registered model.
exo dedicated-inference deployment create <deployment-name> \
--model-name <huggingface-model-id> \
--gpu-type <gpu-type> \
--gpu-count <count> \
--replicas <count> \
-z <zone>
Parameters:
- <deployment-name>: Name of the deployment
- --model-name: Model name as registered from Hugging Face
- --model-id: Exoscale model ID, as an alternative to --model-name
- --gpu-type: GPU type, for example gpua5000, gpu3 or gpurtx6000pro
- --gpu-count: Number of GPUs per replica
- --replicas: Number of replicas to create
- --inference-engine-params: Optional vLLM settings passed to the inference engine, such as context length, memory settings, text-only mode or model-specific parsers
- -z, --zone: Exoscale zone where the deployment is created
Choose gpu-count based on model size and memory needs. A larger model, a longer context window or heavier inference settings may require more GPU memory. Use replicas when you need more parallel capacity for the same model.
Basic deployment:
exo dedicated-inference deployment create ministral-3-8b-reasoning \
--model-name mistralai/Ministral-3-8B-Reasoning-2512 \
--gpu-type gpurtx6000pro \
--gpu-count 1 \
--replicas 1 \
--inference-engine-params '--enable-auto-tool-choice --tool-call-parser=mistral --reasoning-parser=mistral --max-model-len=128000 --language-model-only' \
-z de-fra-1
Note
Total GPU usage is gpu-count × replicas. The GPU count can’t be changed after deployment creation. To use a different GPU count, create a new deployment.
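As a quick check of the arithmetic in the note above, the following sketch computes how many GPUs a deployment consumes in total. The helper name total_gpus is hypothetical and not part of the exo CLI.

```shell
# Hypothetical helper (not part of the exo CLI).
# Usage: total_gpus <gpu-count-per-replica> <replicas>
total_gpus() {
  # Total GPU usage is gpu-count x replicas.
  echo $(( $1 * $2 ))
}

# Two GPUs per replica, three replicas.
total_gpus 2 3   # prints 6
```

Comparing this number with your quota before creating or scaling a deployment may help avoid a failed request.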
Configure Inference Engine Parameters
Use --inference-engine-params to pass settings to the inference engine.
These parameters matter. Many models won’t run well with the default settings on every GPU type. Some may fail during startup if the context length, memory usage, multimodal processing or model-specific options don’t fit the hardware you selected.
Before creating a deployment, check whether the model needs specific inference engine parameters for your GPU type and GPU count. This is especially important for large models, long-context models, quantized models, multimodal models and models that require custom loading options.
Examples:
- --max-model-len=128000 limits the maximum context length. Lower values use less memory and can prevent startup failures caused by insufficient GPU memory.
- --language-model-only disables multimodal inputs and runs a multimodal model in text-only mode. Use it when you don’t need image input.
- --gpu-memory-utilization=0.85 controls how much GPU memory the inference engine can use.
- --max-num-seqs=8 limits the number of concurrent sequences handled by one engine instance.
- --tool-call-parser=mistral enables Mistral tool-call parsing for compatible models.
- --reasoning-parser=mistral enables Mistral reasoning output parsing for compatible models.
- --speculative-config={JSON} configures speculative decoding, which can improve generation speed when the model and decoding method are compatible.
Dedicated Inference sets tensor parallelism from --gpu-count. For example, --gpu-count 2 allocates two GPUs to each replica and sets tensor parallelism for the vLLM deployment. You don’t need to pass --tensor-parallel-size in --inference-engine-params.
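Because tensor parallelism is derived from --gpu-count, a parameter string that also sets it is likely a mistake. The sketch below is one possible pre-flight check to run on the string before passing it to --inference-engine-params. The function name check_engine_params is hypothetical, and the check is a plain substring match, not a validation performed by the exo CLI.

```shell
# Hypothetical pre-flight check (not part of the exo CLI).
# Returns non-zero if the parameter string sets tensor parallelism explicitly.
check_engine_params() {
  case "$1" in
    *--tensor-parallel-size*)
      echo "remove --tensor-parallel-size: it is derived from --gpu-count" >&2
      return 1
      ;;
  esac
  return 0
}

# The parameters used in the examples on this page pass the check.
check_engine_params '--max-model-len=128000 --language-model-only' && echo "params ok"
```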
To list supported parameters for the selected inference engine version:
exo dedicated-inference deployment create --inference-engine-parameter-help
For more information, see the vLLM recipes. They are useful before deployment because they include model-specific runbooks. They often show which vLLM version is needed, which hardware has been tested and which parameters are required for popular open-weight models.
Example for mistralai/Ministral-3-8B-Reasoning-2512 in text-only mode on one RTX 6000 Pro GPU in de-fra-1:
exo dedicated-inference deployment create ministral-3-8b-reasoning \
--model-name mistralai/Ministral-3-8B-Reasoning-2512 \
--gpu-type gpurtx6000pro \
--gpu-count 1 \
--replicas 1 \
--inference-engine-params '--enable-auto-tool-choice --tool-call-parser=mistral --reasoning-parser=mistral --max-model-len=128000 --language-model-only' \
-z de-fra-1
This example runs mistralai/Ministral-3-8B-Reasoning-2512 on one RTX 6000 Pro GPU in de-fra-1. The model has a 256K context window, but the example sets --max-model-len=128000 to reduce memory use. The --language-model-only parameter disables multimodal inputs and runs the model in text-only mode, which is useful when you don’t need image input. The Mistral parser settings enable tool calling and reasoning output for this model family.
List Deployments
List deployments in a zone.
exo dedicated-inference deployment list -z <zone>
Example:
exo dedicated-inference deployment list -z de-fra-1
The output includes the deployment name, model, GPU type, GPU count, replica count, state and endpoint URL.
Show Deployment Details
Display details for a deployment.
exo dedicated-inference deployment show <deployment-name> -z <zone>
Example:
exo dedicated-inference deployment show ministral-3-8b-reasoning -z de-fra-1
Use this command to check whether the deployment is ready and to retrieve its endpoint URL, GPU configuration and replica count.
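If you script deployments, you may want to wait until the deployment reports a given state. The following is a minimal polling sketch, not an official workflow. The function name wait_for_state is hypothetical, and it assumes the state appears as plain text in the show output. This page does not list the exact state strings, so pass the one you see in your own output.

```shell
# Hypothetical helper (not part of the exo CLI).
# Usage: wait_for_state <deployment-name> <zone> <state-text> [interval-seconds] [max-attempts]
# Assumption: <state-text> appears verbatim (case-insensitive) in the "deployment show" output.
wait_for_state() {
  name=$1
  zone=$2
  pattern=$3
  interval=${4:-30}
  attempts=${5:-40}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Poll the deployment details and look for the expected state text.
    if exo dedicated-inference deployment show "$name" -z "$zone" | grep -qi -- "$pattern"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for '$pattern' on $name" >&2
  return 1
}

# Example call; replace <state-text> with the state shown in your own output:
#   wait_for_state ministral-3-8b-reasoning de-fra-1 <state-text>
```

The match is a simple text search over the whole output, so a state string that also occurs in the deployment name or another field would produce a false positive.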
Scale a Deployment
Change the number of replicas for a deployment.
exo dedicated-inference deployment scale <deployment-name> <replica-count> -z <zone>
Parameters:
- <deployment-name>: Name of the deployment to scale
- <replica-count>: New number of replicas. Use 0 to scale to zero.
- -z, --zone: Exoscale zone
Scale up:
exo dedicated-inference deployment scale ministral-3-8b-reasoning 3 -z de-fra-1
Scale down:
exo dedicated-inference deployment scale ministral-3-8b-reasoning 1 -z de-fra-1
Scale to zero (stop GPU billing):
exo dedicated-inference deployment scale ministral-3-8b-reasoning 0 -z de-fra-1
Scaling changes the number of running model replicas. It doesn’t change the model, endpoint URL, API key, GPU type or GPU count per replica.
Note
Scaling to zero stops GPU billing while keeping the deployment, endpoint URL and API key. Scaling back from zero can take several minutes because the model has to be loaded again.
Reveal a Deployment API Key
Reveal the API key used to authenticate requests to the deployment endpoint.
exo dedicated-inference deployment reveal-api-key <deployment-name> -z <zone>
Example:
exo dedicated-inference deployment reveal-api-key ministral-3-8b-reasoning -z de-fra-1
Warning
Treat this API key as a secret. The command reveals the existing key (it does not rotate it). Use it in the Authorization: Bearer <key> header for API requests. If you need a new key, delete and recreate the deployment.
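To illustrate the Authorization header, here is a hedged sketch of a request sent with curl. The function name chat_request is hypothetical. The /v1/chat/completions path is the one exposed by vLLM's OpenAI-compatible server, but whether it must be appended to your endpoint URL, and which model name the endpoint expects, are assumptions to verify against your deployment.

```shell
# Hypothetical helper (not part of the exo CLI).
# Usage: chat_request <endpoint-url> <api-key> <model-name> <prompt>
# Assumptions: the endpoint serves the OpenAI-compatible /v1/chat/completions route,
# and <prompt> contains no characters that need JSON escaping (quotes, backslashes, newlines).
chat_request() {
  endpoint_url=$1   # endpoint URL from "deployment show"
  api_key=$2        # key from "deployment reveal-api-key"
  model=$3          # model name expected by the endpoint (assumed)
  prompt=$4
  curl -sS "${endpoint_url}/v1/chat/completions" \
    -H "Authorization: Bearer ${api_key}" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"${prompt}\"}]}"
}

# Example call, with both values exported beforehand:
#   chat_request "$ENDPOINT_URL" "$API_KEY" mistralai/Ministral-3-8B-Reasoning-2512 "Hello"
```

Reading the key from an environment variable rather than typing it inline keeps it out of your shell history.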
View Deployment Logs
Retrieve deployment logs for troubleshooting.
exo dedicated-inference deployment logs <deployment-name> -z <zone> [--tail <lines>]
Example:
exo dedicated-inference deployment logs ministral-3-8b-reasoning -z de-fra-1 --tail 100
Logs are useful when a deployment doesn’t become ready, when a model fails to load or when inference requests return errors.
Delete a Deployment
Delete a deployment and stop GPU billing.
exo dedicated-inference deployment delete <deployment-name> -z <zone>
Example:
exo dedicated-inference deployment delete ministral-3-8b-reasoning -z de-fra-1
Note
Deleting a deployment stops GPU billing and makes the endpoint inactive. The registered model remains available in storage, so you can create another deployment from it later.
Check Available Instance Types
List GPU types available to your organization in a zone.
exo dedicated-inference deployment instance-type -z <zone>
Example:
exo dedicated-inference deployment instance-type -z de-fra-1
A GPU type is shown as authorized when your organization is allowed to use it and there is enough capacity in the selected zone.
Check instance types before creating a deployment. It helps you avoid failed deployment attempts caused by missing permissions, quota limits or lack of available GPU capacity.
If your quota is insufficient or the GPU type you want isn’t available in your preferred zone, open a support ticket to discuss options for making it available there.
Common Workflows
Deploy a Model End to End
This example deploys mistralai/Ministral-3-8B-Reasoning-2512, a reasoning model with a long context window and Mistral-specific inference settings.
# 1. Register the model
exo dedicated-inference model create mistralai/Ministral-3-8B-Reasoning-2512 -z de-fra-1
# 2. Check model state
exo dedicated-inference model list -z de-fra-1
# 3. Create a deployment
exo dedicated-inference deployment create ministral-3-8b-reasoning \
--model-name mistralai/Ministral-3-8B-Reasoning-2512 \
--gpu-type gpurtx6000pro \
--gpu-count 1 \
--replicas 1 \
--inference-engine-params '--enable-auto-tool-choice --tool-call-parser=mistral --reasoning-parser=mistral --max-model-len=128000 --language-model-only' \
-z de-fra-1
# 4. Check deployment state and endpoint URL
exo dedicated-inference deployment show ministral-3-8b-reasoning -z de-fra-1
# 5. Reveal the deployment API key
exo dedicated-inference deployment reveal-api-key ministral-3-8b-reasoning -z de-fra-1
This model has a 256K context window. The example sets --max-model-len=128000 to reduce memory usage on the selected GPU. It also passes text-only mode, tool-calling and reasoning parser settings used by this model family.
Manage Costs
# Scale down when demand is low
exo dedicated-inference deployment scale ministral-3-8b-reasoning 1 -z de-fra-1
# Scale to zero to stop GPU billing while keeping the deployment
exo dedicated-inference deployment scale ministral-3-8b-reasoning 0 -z de-fra-1
# Scale back up when needed
exo dedicated-inference deployment scale ministral-3-8b-reasoning 1 -z de-fra-1
# Delete the deployment when it is no longer needed
exo dedicated-inference deployment delete ministral-3-8b-reasoning -z de-fra-1
# Delete the model if it is no longer used
exo dedicated-inference model delete <model-id> -z de-fra-1
Scaling to zero is usually the right choice for temporary pauses, such as nights, weekends, demos or test environments. Delete the deployment when you no longer need the endpoint.
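Pausing on nights and weekends can be automated with a small scheduling rule. The sketch below is one hypothetical policy: one replica on weekdays between 08:00 and 19:59, zero otherwise. The function name desired_replicas and the hours are illustrative assumptions, not Exoscale recommendations.

```shell
# Hypothetical scheduling policy (not part of the exo CLI).
# Usage: desired_replicas <day-of-week> <hour>
#   day-of-week: 1 (Monday) to 7 (Sunday), as printed by `date +%u`
#   hour:        0 to 23, without a leading zero
desired_replicas() {
  if [ "$1" -le 5 ] && [ "$2" -ge 8 ] && [ "$2" -lt 20 ]; then
    echo 1   # weekday working hours: keep one replica running
  else
    echo 0   # nights and weekends: scale to zero to stop GPU billing
  fi
}

# Example, for instance run hourly from cron:
#   hour=$(date +%H); hour=${hour#0}
#   exo dedicated-inference deployment scale ministral-3-8b-reasoning \
#     "$(desired_replicas "$(date +%u)" "$hour")" -z de-fra-1
```

Scaling back from zero can take several minutes, so a schedule like this may need to start earlier than the first expected request.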
Troubleshooting
For deployment states, logs, model loading issues, GPU memory errors and inference request failures, see Monitor and troubleshoot deployments.
Zone Availability
GPU availability depends on the zone and GPU type. See GPU availability by zone for the current GPU-by-zone matrix.