Quick Start
In this quick start guide, you will deploy your first AI model using Exoscale Dedicated Inference and expose it as a production-ready inference endpoint running on dedicated GPUs.
This guide uses mistralai/Ministral-3-8B-Reasoning-2512 on one RTX 6000 Pro GPU. The model is imported from Hugging Face, stored in Exoscale Object Storage, and served through an OpenAI-compatible endpoint.
Before You Start
You need:
- the Exoscale CLI (exo) installed and configured
- an Exoscale API key with access to the compute and ai services
- enough GPU quota
Configure API Access
Dedicated Inference is managed through the Exoscale CLI and API.
Create a role with compute and ai access:
exo iam role create ai-role --policy '{
  "default-service-strategy": "deny",
  "services": {
    "compute": {"type": "allow", "rules": []},
    "ai": {"type": "allow", "rules": []}
  }
}'
Create an API key for the role:
exo iam api-key create ai-key ai-role
Add the key to your CLI configuration:
exo config add
Check GPU Availability
Check whether your organization can use the RTX 6000 Pro GPU type in de-fra-1. Besides quota and authorization, the GPU End User Certificate should be signed for the RTX 6000 Pro.
exo dedicated-inference deployment instance-type -z de-fra-1
Look for gpurtx6000pro in the output.
A GPU type is shown as authorized when your organization can use it and there is enough capacity in the zone. If it is not available, choose another zone or request access or quota through the Exoscale Portal.
Create the Model
Register the model in Exoscale:
exo dedicated-inference model create mistralai/Ministral-3-8B-Reasoning-2512 -z de-fra-1
This downloads the model files from Hugging Face and stores them in Exoscale Object Storage in de-fra-1. The deployment will use this stored copy later, so it does not need to fetch the model from Hugging Face at startup.
Check the model state:
exo dedicated-inference model list -z de-fra-1
Wait until the model is created.
Create the Deployment
Create a deployment on one RTX 6000 Pro GPU:
exo dedicated-inference deployment create ministral-3-8b-reasoning \
  --model-name mistralai/Ministral-3-8B-Reasoning-2512 \
  --gpu-type gpurtx6000pro \
  --gpu-count 1 \
  --replicas 1 \
  --inference-engine-params '--enable-auto-tool-choice --tool-call-parser=mistral --reasoning-parser=mistral --max-model-len=128000 --language-model-only' \
  -z de-fra-1
This model supports a long context window. The example sets --max-model-len=128000 to reduce memory use while keeping a large context. --language-model-only runs the model in text-only mode, which is useful when you do not need image input. The Mistral parser settings enable tool calling and reasoning output for this model family.
Check the deployment state:
exo dedicated-inference deployment show ministral-3-8b-reasoning -z de-fra-1
Wait until the deployment is ready.
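Rather than re-running the show command by hand, the wait can be scripted. The following is a minimal sketch, not part of the Exoscale CLI: wait_until_match is a hypothetical helper that re-runs any command until its output contains a given pattern. The pattern "ready" is an assumption about how the CLI words the deployment state; check the actual output of the show command and adjust the pattern before relying on it.

```shell
# wait_until_match PATTERN INTERVAL_SECONDS MAX_ATTEMPTS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until its output contains PATTERN
# (case-insensitive) or MAX_ATTEMPTS tries have been made.
# Returns 0 on a match, 1 if the pattern never appeared.
wait_until_match() {
  local pattern="$1"
  local interval="$2"
  local max_attempts="$3"
  shift 3
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # A failing command simply produces no match and is retried.
    if "$@" 2>/dev/null | grep -qi -- "$pattern"; then
      return 0
    fi
    attempt=$((attempt + 1))
    # Sleep only when another attempt follows.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep "$interval"
    fi
  done
  return 1
}

# Example call (the "ready" pattern is an assumption, see above):
# wait_until_match ready 15 80 \
#   exo dedicated-inference deployment show ministral-3-8b-reasoning -z de-fra-1
```

A plain substring match is deliberately simple and can give false positives if the word also appears in another field, so a stricter pattern may be needed once you know the real output format.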
Get the Endpoint and API Key
Show the deployment details:
exo dedicated-inference deployment show ministral-3-8b-reasoning -z de-fra-1
Copy the deployment URL. It should include the /v1 API path.
Reveal the deployment API key:
exo dedicated-inference deployment reveal-api-key ministral-3-8b-reasoning -z de-fra-1
Export both values:
export ENDPOINT_URL="https://<deployment-id>.inference.de-fra-1.exoscale-cloud.com/v1"
export API_KEY="<deployment-api-key>"
Treat the deployment API key as a secret. The reveal-api-key command shows the existing key; it does not rotate it.
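A quick sanity check on the two exported values can catch copy-and-paste mistakes before any request is sent. The sketch below uses a hypothetical helper, check_inference_env, which only inspects the shape of the variables: it confirms that both are set and that the URL starts with https:// and ends with /v1. It does not contact the endpoint and never prints the key.

```shell
# check_inference_env: hypothetical helper that validates ENDPOINT_URL and
# API_KEY locally. Prints a message to stderr and returns 1 on failure.
check_inference_env() {
  if [ -z "${ENDPOINT_URL:-}" ]; then
    echo "ENDPOINT_URL is not set" >&2
    return 1
  fi
  if [ -z "${API_KEY:-}" ]; then
    echo "API_KEY is not set" >&2
    return 1
  fi
  case "$ENDPOINT_URL" in
    https://*/v1)
      # Expected shape: HTTPS URL ending in the /v1 API path.
      ;;
    *)
      echo "ENDPOINT_URL should start with https:// and end with /v1" >&2
      return 1
      ;;
  esac
  return 0
}
```

Calling check_inference_env && echo "environment looks usable" after the two export lines gives a quick yes-or-no answer.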
Send a Test Request
Dedicated Inference endpoints are OpenAI-compatible. You can call the chat completions route with a standard OpenAI-style request body.
curl -X POST "$ENDPOINT_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "mistralai/Ministral-3-8B-Reasoning-2512",
    "messages": [
      {
        "role": "user",
        "content": "Write a short poem about Switzerland."
      }
    ],
    "max_tokens": 300
  }'
If the request returns a completion, your deployment is running and serving requests.
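For scripts and health checks, it can be handier to look only at the HTTP status code of the same request. The sketch below wraps the chat completions call in a hypothetical helper, chat_status, which discards the response body and prints the status code using curl's -w '%{http_code}' option. The MODEL variable and the default prompt "ping" are illustrative names introduced here, and the prompt is inserted into the JSON body without escaping, so it must not contain double quotes or backslashes.

```shell
# Model name used in the request body; override by exporting MODEL first.
MODEL="${MODEL:-mistralai/Ministral-3-8B-Reasoning-2512}"

# chat_status [PROMPT]: hypothetical helper that sends a one-message chat
# completion request and prints only the HTTP status code.
# PROMPT must not contain double quotes or backslashes (no JSON escaping).
chat_status() {
  local prompt="${1:-ping}"
  curl -s -o /dev/null -w '%{http_code}' \
    -X POST "$ENDPOINT_URL/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $API_KEY" \
    -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"$prompt\"}], \"max_tokens\": 16}"
}
```

A result of 200 from chat_status indicates that the endpoint accepted the request. For any other code, re-run the full curl command above to see the error body.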
Clean Up
Scale to zero when you want to stop GPU billing but keep the deployment, endpoint URL, and API key:
exo dedicated-inference deployment scale ministral-3-8b-reasoning 0 -z de-fra-1
Scale back up when needed:
exo dedicated-inference deployment scale ministral-3-8b-reasoning 1 -z de-fra-1
Delete the deployment when you no longer need the endpoint:
exo dedicated-inference deployment delete ministral-3-8b-reasoning -z de-fra-1
Delete the model if you no longer need the stored model files:
exo dedicated-inference model delete <model-id> -z de-fra-1
Only delete the model when you are sure you will not need it again. If you delete it, you must import it again from Hugging Face before creating a new deployment.