Use CLI Commands

Use the Exoscale CLI (exo) to manage Dedicated Inference from your terminal.

Dedicated Inference has two main resources:

A model is a model from Hugging Face that Exoscale downloads and stores for later use. Hugging Face is a public model registry where enterprises, AI labs and teams publish AI models.
A deployment runs one of those models on dedicated GPUs and exposes it through an endpoint.

All Dedicated Inference commands are available under the exo dedicated-inference namespace:

exo dedicated-inference --help

For the full command reference, see the Dedicated Inference CLI documentation.

Model Management

Create a model before you create a deployment. This gives Exoscale a copy of the model files in the selected zone, so deployments don’t need to fetch the model directly from Hugging Face every time.

Create a Model

exo dedicated-inference model create <huggingface-model-id> \
  [--huggingface-token <token>] \
  -z <zone>

Parameters:

<huggingface-model-id>: Hugging Face model identifier, for example mistralai/Ministral-3-8B-Reasoning-2512
--huggingface-token: Hugging Face access token. Required for gated models.
-z, --zone: Exoscale zone where the model is stored

Example with a public model:

exo dedicated-inference model create mistralai/Ministral-3-8B-Reasoning-2512 -z de-fra-1

Example with a gated model:

exo dedicated-inference model create <gated-model-id> \
  --huggingface-token hf_YOUR_TOKEN \
  -z de-fra-1

A gated model is a Hugging Face model that requires authentication before you can download it.

Note

Model creation can take several minutes. Larger models have more files to download and store. Models are stored in Exoscale Object Storage (SOS), and only models using the safetensors format are supported.

List Models

List models available in a zone.

exo dedicated-inference model list -z <zone>

Example:

exo dedicated-inference model list -z de-fra-1

The output includes the model ID, model name, state, size, creation date and update date.

Show Model Details

Display details for a model.

exo dedicated-inference model show <model-id> -z <zone>

Example:

exo dedicated-inference model show 12345678-1234-1234-1234-123456789abc -z de-fra-1

Use the model ID when you want to avoid ambiguity, especially if you have several models with similar names.

Delete a Model

Delete a model from storage.

exo dedicated-inference model delete <model-id> -z <zone>

Example:

exo dedicated-inference model delete 12345678-1234-1234-1234-123456789abc -z de-fra-1

Warning

Deleting a model is permanent. You can’t delete a model while an active deployment is using it.

Deployment Management

A deployment runs a registered model on dedicated GPUs. Each deployment has its own endpoint URL and API key.

Create a Deployment

Create an inference deployment from a registered model.

exo dedicated-inference deployment create <deployment-name> \
  --model-name <huggingface-model-id> \
  --gpu-type <gpu-type> \
  --gpu-count <count> \
  --replicas <count> \
  -z <zone>

Parameters:

<deployment-name>: Name of the deployment
--model-name: Model name as registered from Hugging Face
--model-id: Exoscale model ID, as an alternative to --model-name
--gpu-type: GPU type, for example gpua5000, gpu3 or gpurtx6000pro
--gpu-count: Number of GPUs per replica
--replicas: Number of replicas to create
--inference-engine-params: Optional vLLM settings passed to the inference engine, such as context length, memory settings, text-only mode or model-specific parsers
-z, --zone: Exoscale zone where the deployment is created

Choose gpu-count based on model size and memory needs. A larger model, a longer context window or heavier inference settings may require more GPU memory. Use replicas when you need more parallel capacity for the same model.

Basic deployment:

exo dedicated-inference deployment create ministral-3-8b-reasoning \
  --model-name mistralai/Ministral-3-8B-Reasoning-2512 \
  --gpu-type gpurtx6000pro \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--enable-auto-tool-choice --tool-call-parser=mistral --reasoning-parser=mistral --max-model-len=128000 --language-model-only' \
  -z de-fra-1

Note

Total GPU usage is gpu-count × replicas. The GPU count can’t be changed after deployment creation. To use a different GPU count, create a new deployment.

Configure Inference Engine Parameters

Use --inference-engine-params to pass settings to the inference engine.

These parameters matter. Many models won’t run well with the default settings on every GPU type. Some may fail during startup if the context length, memory usage, multimodal processing or model-specific options don’t fit the hardware you selected.

Before creating a deployment, check whether the model needs specific inference engine parameters for your GPU type and GPU count. This is especially important for large models, long-context models, quantized models, multimodal models and models that require custom loading options.

Examples:

--max-model-len=128000 limits the maximum context length. Lower values use less memory and can prevent startup failures caused by insufficient GPU memory.
--language-model-only disables multimodal inputs and runs a multimodal model in text-only mode. Use it when you don’t need image input.
--gpu-memory-utilization=0.85 controls how much GPU memory the inference engine can use.
--max-num-seqs=8 limits the number of concurrent sequences handled by one engine instance.
--tool-call-parser=mistral enables Mistral tool-call parsing for compatible models.
--reasoning-parser=mistral enables Mistral reasoning output parsing for compatible models.
--speculative-config={JSON} configures speculative decoding, which can improve generation speed when the model and decoding method are compatible.

Dedicated Inference sets tensor parallelism from --gpu-count. For example, --gpu-count 2 allocates two GPUs to each replica and sets tensor parallelism for the vLLM deployment. You don’t need to pass --tensor-parallel-size in --inference-engine-params.

To list supported parameters for the selected inference engine version:

exo dedicated-inference deployment create --inference-engine-parameter-help

For more information, see:

The vLLM recipes are useful before deployment because they include model-specific runbooks. They often show which vLLM version is needed, which hardware has been tested and which parameters are required for popular open-weight models.

Example for mistralai/Ministral-3-8B-Reasoning-2512 in text-only mode on one RTX 6000 Pro GPU in de-fra-1:

exo dedicated-inference deployment create ministral-3-8b-reasoning \
  --model-name mistralai/Ministral-3-8B-Reasoning-2512 \
  --gpu-type gpurtx6000pro \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--enable-auto-tool-choice --tool-call-parser=mistral --reasoning-parser=mistral --max-model-len=128000 --language-model-only' \
  -z de-fra-1

This example runs mistralai/Ministral-3-8B-Reasoning-2512 on one RTX 6000 Pro GPU in de-fra-1. The model has a 256K context window, but the example sets --max-model-len=128000 to reduce memory use. The --language-model-only parameter disables multimodal inputs and runs the model in text-only mode, which is useful when you don’t need image input. The Mistral parser settings enable tool calling and reasoning output for this model family.

List Deployments

List deployments in a zone.

exo dedicated-inference deployment list -z <zone>

Example:

exo dedicated-inference deployment list -z de-fra-1

The output includes the deployment name, model, GPU type, GPU count, replica count, state and endpoint URL.

Show Deployment Details

Display details for a deployment.

exo dedicated-inference deployment show <deployment-name> -z <zone>

Example:

exo dedicated-inference deployment show ministral-3-8b-reasoning -z de-fra-1

Use this command to check whether the deployment is ready and to retrieve its endpoint URL, GPU configuration and replica count.

Scale a Deployment

Change the number of replicas for a deployment.

exo dedicated-inference deployment scale <deployment-name> <replica-count> -z <zone>

Parameters:

<deployment-name>: Name of the deployment to scale
<replica-count>: New number of replicas. Use 0 to scale to zero.
-z, --zone: Exoscale zone

Scale up:

exo dedicated-inference deployment scale ministral-3-8b-reasoning 3 -z de-fra-1

Scale down:

exo dedicated-inference deployment scale ministral-3-8b-reasoning 1 -z de-fra-1

Scale to zero (stop GPU billing):

exo dedicated-inference deployment scale ministral-3-8b-reasoning 0 -z de-fra-1

Scaling changes the number of running model replicas. It doesn’t change the model, endpoint URL, API key, GPU type or GPU count per replica.

Note

Scaling to zero stops GPU billing while keeping the deployment, endpoint URL and API key. Scaling back from zero can take several minutes because the model has to be loaded again.

Reveal a Deployment API Key

Reveal the API key used to authenticate requests to the deployment endpoint.

exo dedicated-inference deployment reveal-api-key <deployment-name> -z <zone>

Example:

exo dedicated-inference deployment reveal-api-key ministral-3-8b-reasoning -z de-fra-1

Warning

Treat this API key as a secret. The command reveals the existing key (it does not rotate it). Use it in the Authorization: Bearer <key> header for API requests. If you need a new key, delete and recreate the deployment.

View Deployment Logs

Retrieve deployment logs for troubleshooting.

exo dedicated-inference deployment logs <deployment-name> -z <zone> [--tail <lines>]

Example:

exo dedicated-inference deployment logs ministral-3-8b-reasoning -z de-fra-1 --tail 100

Logs are useful when a deployment doesn’t become ready, when a model fails to load or when inference requests return errors.

Delete a Deployment

Delete a deployment and stop GPU billing.

exo dedicated-inference deployment delete <deployment-name> -z <zone>

Example:

exo dedicated-inference deployment delete ministral-3-8b-reasoning -z de-fra-1

Note

Deleting a deployment stops GPU billing and makes the endpoint inactive. The registered model remains available in storage, so you can create another deployment from it later.

Check Available Instance Types

List GPU types available to your organization in a zone.

exo dedicated-inference deployment instance-type -z <zone>

Example:

exo dedicated-inference deployment instance-type -z de-fra-1

A GPU type is shown as authorized when your organization is allowed to use it and there is enough capacity in the selected zone.

Check instance types before creating a deployment. It helps you avoid failed deployment attempts caused by missing permissions, quota limits or lack of available GPU capacity.

If you lack a sufficient quota or the desired GPU type is not available in your preferred zone, open a support ticket so options for making that GPU available in that zone can be discussed and considered.

Common Workflows

Deploy a Model End to End

This example deploys mistralai/Ministral-3-8B-Reasoning-2512, a reasoning model with a long context window and Mistral-specific inference settings.

# 1. Register the model
exo dedicated-inference model create mistralai/Ministral-3-8B-Reasoning-2512 -z de-fra-1

# 2. Check model state
exo dedicated-inference model list -z de-fra-1

# 3. Create a deployment
exo dedicated-inference deployment create ministral-3-8b-reasoning \
  --model-name mistralai/Ministral-3-8B-Reasoning-2512 \
  --gpu-type gpurtx6000pro \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--enable-auto-tool-choice --tool-call-parser=mistral --reasoning-parser=mistral --max-model-len=128000 --language-model-only' \
  -z de-fra-1

# 4. Check deployment state and endpoint URL
exo dedicated-inference deployment show ministral-3-8b-reasoning -z de-fra-1

# 5. Reveal the deployment API key
exo dedicated-inference deployment reveal-api-key ministral-3-8b-reasoning -z de-fra-1

This model has a 256K context window. The example sets --max-model-len=128000 to reduce memory usage on the selected GPU. It also passes text-only mode, tool-calling and reasoning parser settings used by this model family.

Manage Costs

# Scale down when demand is low
exo dedicated-inference deployment scale ministral-3-8b-reasoning 1 -z de-fra-1

# Scale to zero to stop GPU billing while keeping the deployment
exo dedicated-inference deployment scale ministral-3-8b-reasoning 0 -z de-fra-1

# Scale back up when needed
exo dedicated-inference deployment scale ministral-3-8b-reasoning 1 -z de-fra-1

# Delete the deployment when it is no longer needed
exo dedicated-inference deployment delete ministral-3-8b-reasoning -z de-fra-1

# Delete the model if it is no longer used
exo dedicated-inference model delete <model-id> -z de-fra-1

Scaling to zero is usually the right choice for temporary pauses, such as nights, weekends, demos or test environments. Delete the deployment when you no longer need the endpoint.

Troubleshooting

For deployment states, logs, model loading issues, GPU memory errors and inference request failures, see Monitor and troubleshoot deployments.

Zone Availability

GPU availability depends on the zone and GPU type. See GPU availability by zone for the current GPU-by-zone matrix.

Next Steps

Last updated on June 15, 2026