Quick Start

In this quick start guide, you will deploy your first AI model using Exoscale Dedicated Inference and expose it as a production-ready inference endpoint running on dedicated GPUs.

This guide uses mistralai/Ministral-3-8B-Reasoning-2512 on one RTX 6000 Pro GPU. The model is imported from Hugging Face, stored in Exoscale Object Storage, and served through an OpenAI-compatible endpoint.

Before You Start

You need:

  • the Exoscale CLI (exo) installed and configured
  • an Exoscale API key with access to the compute and ai services
  • enough GPU quota for one RTX 6000 Pro GPU in the zone you deploy to (this guide uses de-fra-1)

Configure API Access

Dedicated Inference is managed through the Exoscale CLI and API.

Create a role with compute and ai access:

exo iam role create ai-role --policy '{
  "default-service-strategy": "deny",
  "services": {
    "compute": {"type": "allow", "rules": []},
    "ai": {"type": "allow", "rules": []}
  }
}'

Create an API key for the role:

exo iam api-key create ai-key ai-role

Add the key to your CLI configuration:

exo config add

Check GPU Availability

Check whether your organization can use the RTX 6000 Pro GPU type in de-fra-1. Besides quota and authorization, the GPU End User Certificate should also be signed for the RTX 6000 Pro:

exo dedicated-inference deployment instance-type -z de-fra-1

Look for gpurtx6000pro in the output.

A GPU type is shown as authorized when your organization can use it and there is enough capacity in the zone. If it is not available, choose another zone or request access or quota through the Exoscale Portal.
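
To script this check, a minimal sketch follows. It assumes only that the GPU type name appears somewhere in the command output. The helper name gpu_listed is hypothetical, and it does not report whether the type is marked as authorized, so still read the output itself.

```shell
# gpu_listed NAME: succeed if NAME appears anywhere on stdin.
# Hypothetical helper; it checks presence only, not the authorized status.
gpu_listed() {
  grep -qF -- "$1"
}

# Usage (requires a configured exo CLI):
#   exo dedicated-inference deployment instance-type -z de-fra-1 | gpu_listed gpurtx6000pro && echo "listed"
```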

Create the Model

Register the model in Exoscale:

exo dedicated-inference model create mistralai/Ministral-3-8B-Reasoning-2512 -z de-fra-1

This downloads the model files from Hugging Face and stores them in Exoscale Object Storage in de-fra-1. The deployment will use this stored copy later, so it does not need to fetch the model from Hugging Face at startup.

Check the model state:

exo dedicated-inference model list -z de-fra-1

Wait until the model is created.

Create the Deployment

Create a deployment on one RTX 6000 Pro GPU:

exo dedicated-inference deployment create ministral-3-8b-reasoning \
  --model-name mistralai/Ministral-3-8B-Reasoning-2512 \
  --gpu-type gpurtx6000pro \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--enable-auto-tool-choice --tool-call-parser=mistral --reasoning-parser=mistral --max-model-len=128000 --language-model-only' \
  -z de-fra-1

This model supports a long context window. The example caps the context length with --max-model-len=128000, which reduces memory use while still allowing a large context. --language-model-only runs the model in text-only mode, which is useful when you do not need image input. The two Mistral parser flags (--tool-call-parser=mistral and --reasoning-parser=mistral), together with --enable-auto-tool-choice, enable tool calling and reasoning output for this model family.
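
Because the engine flags are passed as one quoted string, a typo inside it is easy to miss. One way to keep them readable is to build the string one flag at a time, as in this sketch. The variable name engine_params is an assumption; the flags themselves are exactly those used in the command above.

```shell
# Build the --inference-engine-params value one flag at a time.
engine_params="--enable-auto-tool-choice"                  # tool calling
engine_params="$engine_params --tool-call-parser=mistral"  # Mistral tool-call parser
engine_params="$engine_params --reasoning-parser=mistral"  # Mistral reasoning output
engine_params="$engine_params --max-model-len=128000"      # cap context length to reduce memory use
engine_params="$engine_params --language-model-only"       # text-only mode, no image input

# Print the assembled string so it can be reviewed before use.
printf '%s\n' "$engine_params"

# Then pass "$engine_params" to the create command in place of the literal string:
#   --inference-engine-params "$engine_params"
```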

Check the deployment state:

exo dedicated-inference deployment show ministral-3-8b-reasoning -z de-fra-1

Wait until the deployment is ready.
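
Rather than re-running the command by hand, you could poll it. The sketch below is a generic retry loop, not part of the exo CLI. wait_for is a hypothetical helper, and the pattern ready is an assumption, because this guide does not list the exact status strings. Check the real output of deployment show and adjust the pattern accordingly.

```shell
# wait_for PATTERN ATTEMPTS DELAY CMD...
# Runs CMD up to ATTEMPTS times, DELAY seconds apart, and succeeds as soon as
# its output contains PATTERN (case-insensitive). Returns 1 if it never does.
# Plain (non-local) variables keep the function POSIX sh compatible.
wait_for() {
  pattern=$1
  attempts=$2
  delay=$3
  shift 3
  attempt=1
  while [ "$attempt" -le "$attempts" ]; do
    if "$@" 2>/dev/null | grep -qi -- "$pattern"; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only if another attempt will follow.
    if [ "$attempt" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Usage (assumed status string "ready"; 60 attempts, 10 seconds apart):
#   wait_for ready 60 10 exo dedicated-inference deployment show ministral-3-8b-reasoning -z de-fra-1
```

A substring match is crude: it also succeeds if the pattern appears in an unrelated field, so a more specific pattern is safer once you know the output format.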

Get the Endpoint and API Key

Show the deployment details:

exo dedicated-inference deployment show ministral-3-8b-reasoning -z de-fra-1

Copy the deployment URL. It should include the /v1 API path.

Reveal the deployment API key:

exo dedicated-inference deployment reveal-api-key ministral-3-8b-reasoning -z de-fra-1

Export both values:

export ENDPOINT_URL="https://<deployment-id>.inference.de-fra-1.exoscale-cloud.com/v1"
export API_KEY="<deployment-api-key>"

Treat the deployment API key as a secret. The reveal-api-key command shows the existing key; it does not rotate it.
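
Before sending requests, it can help to confirm that both variables are set and that the URL ends in /v1. The following is a minimal sketch; check_inference_env is a hypothetical helper name, not an exo command.

```shell
# check_inference_env: verify ENDPOINT_URL and API_KEY before calling the endpoint.
# Prints a short message to stderr and returns 1 on the first problem found.
# It never prints the key itself.
check_inference_env() {
  if [ -z "${ENDPOINT_URL:-}" ] || [ -z "${API_KEY:-}" ]; then
    echo "ENDPOINT_URL and API_KEY must both be set" >&2
    return 1
  fi
  case "$ENDPOINT_URL" in
    */v1) return 0 ;;
    *)
      echo "ENDPOINT_URL should end with /v1" >&2
      return 1
      ;;
  esac
}

# Usage:
#   check_inference_env && echo "environment looks complete"
```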

Send a Test Request

Dedicated Inference endpoints are OpenAI-compatible. You can call the chat completions route with a standard OpenAI-style request body.

curl -X POST "$ENDPOINT_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "mistralai/Ministral-3-8B-Reasoning-2512",
    "messages": [
      {
        "role": "user",
        "content": "Write a short poem about Switzerland."
      }
    ],
    "max_tokens": 300
  }'

If the request returns a completion, your deployment is running and serving requests.
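
The response is a JSON document. Assuming it follows the usual OpenAI chat completions shape, with the generated text at choices[0].message.content, a small helper can print just that text. This sketch uses jq when it is installed and falls back to python3 otherwise. extract_content is a hypothetical name, and the response may carry additional fields that this helper ignores.

```shell
# extract_content: read a chat completions JSON response on stdin and print
# the text of the first choice. Assumes the OpenAI-style response shape.
extract_content() {
  if command -v jq >/dev/null 2>&1; then
    jq -r '.choices[0].message.content'
  else
    python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["message"]["content"])'
  fi
}

# Usage: run the curl command above with -s added (to hide the progress meter)
# and pipe its output into extract_content.
```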

Clean Up

Scale to zero when you want to stop GPU billing but keep the deployment, endpoint URL, and API key:

exo dedicated-inference deployment scale ministral-3-8b-reasoning 0 -z de-fra-1

Scale back up when needed:

exo dedicated-inference deployment scale ministral-3-8b-reasoning 1 -z de-fra-1

Delete the deployment when you no longer need the endpoint:

exo dedicated-inference deployment delete ministral-3-8b-reasoning -z de-fra-1

Delete the model if you no longer need the stored model files:

exo dedicated-inference model delete <model-id> -z de-fra-1

Only delete the model when you are sure you will not need it again. If you delete it, you must import it again from Hugging Face before creating a new deployment.
