Quick Start
In this quick start guide, you will deploy your first AI model using Exoscale Dedicated Inference and expose it as a production-ready inference endpoint running on dedicated GPUs.
Before You Start
Before deploying your first model, make sure the following prerequisites are met:
Exoscale CLI
Dedicated Inference is currently managed via the Exoscale CLI (exo) and API.
Install and configure the latest Exoscale CLI
The API key used by the Exoscale CLI needs access to the Compute and AI service APIs. Create a role with the necessary permissions:
# Create a role with compute and AI service access
exo iam role create ai-role --policy '{
"default-service-strategy": "deny",
"services": {
"compute": {"type": "allow", "rules": []},
"ai": {"type": "allow", "rules": []}
}
}'
# Create an API key for this role
exo iam api-key create ai-key ai-role
# Add the key to your CLI configuration
exo config add
GPU Quota
You must have sufficient GPU quota available in your organization. If needed, request a quota increase via the Exoscale Portal under Organization → Quotas, or by opening a support ticket.
Hugging Face Account (Optional)
Some models (for example Llama-based models) are gated on Hugging Face.
If you plan to deploy gated models:
Create a Hugging Face account
Accept the model license
Generate a read-access token
This token will be required when creating the model.
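If you plan to script these steps, you can keep the token in an environment variable and reference it when creating the model; the variable name HF_TOKEN below is just a convention, not something the CLI requires:
export HF_TOKEN="<your-hf-token>"
You would then pass --huggingface-token "$HF_TOKEN" when creating the gated model in the next step.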
Create a Model
The first step is to register a model with Exoscale. This downloads the model and stores it securely in Exoscale Object Storage, making it available for deployment.
You must specify the zone where the model will be stored. This should match the zone where your GPUs are available.
exo dedicated-inference model create <huggingface-model-id> -z <zone>
Example (Public Model)
exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2
Example (Gated Model)
exo dedicated-inference model create meta-llama/Llama-3.2-1B-Instruct \
--huggingface-token <hf-token> \
-z at-vie-2
Model creation may take several minutes. You can check the status with:
exo dedicated-inference model list
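If you prefer not to re-run the command by hand, the standard watch utility (available on most Linux systems) can poll it for you; the 30-second interval below is arbitrary:
watch -n 30 exo dedicated-inference model list -z at-vie-2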
Deploy the Model
Once the model is available, deploy it as an inference endpoint on dedicated GPUs.
This step provisions the required GPU resources and launches the model.
exo dedicated-inference deployment create <deployment-name> \
--model-name <model-name> \
--gpu-type <gpu-sku> \
--gpu-count <count> \
--replicas <count> \
-z <zone>
Example
exo dedicated-inference deployment create demo \
--model-name mistralai/Mistral-7B-Instruct-v0.3 \
--gpu-type gpua5000 \
--gpu-count 1 \
--replicas 1 \
-z at-vie-2
You can monitor deployment progress with:
exo dedicated-inference deployment list
Wait until the deployment status is ready.
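If you want the shell to block until the deployment comes up, a small polling loop is enough. The sketch below assumes the word ready appears in the show output once the deployment is up; adjust the grep pattern to whatever your CLI version actually prints:
until exo dedicated-inference deployment show demo -z at-vie-2 | grep -qi ready; do
  sleep 30
done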
Access the Inference Endpoint
After deployment, retrieve the endpoint URL and API key.
Get the Endpoint URL
exo dedicated-inference deployment show <deployment-name> -z <zone>
Look for the Deployment URL field.
The URL already includes the /v1 path required by the API.
Get the API Key
exo dedicated-inference deployment reveal-api-key <deployment-name> -z <zone>
This command reveals the existing API key (it does not rotate it). To issue a new key, delete and recreate the deployment.
Test the Endpoint
Dedicated Inference exposes an OpenAI-compatible API, making it easy to integrate with existing tools and SDKs.
Set Endpoint and API Key
Copy the deployment URL and API key from the previous commands, then export them:
export ENDPOINT_URL="https://<your-deployment-id>.inference.<zone>.exoscale-cloud.com/v1"
export API_KEY="<your-api-key>"
Example Request
curl -X POST "$ENDPOINT_URL/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d '{
"model": "<model-name>",
"messages": [
{ "role": "user", "content": "Write a short poem about Switzerland" }
],
"max_tokens": 3000
}'
Note the /v1 path: it is already included in the URL returned by the show command.
If the request succeeds, your model is now running in production.
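The response follows the standard OpenAI chat-completions shape, so if jq is installed you can extract just the generated text; the .choices[0].message.content path below assumes that standard format:
curl -s -X POST "$ENDPOINT_URL/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $API_KEY" \
-d '{ "model": "<model-name>", "messages": [ { "role": "user", "content": "Say hello" } ] }' \
| jq -r '.choices[0].message.content'
Many OpenAI-compatible servers also expose a model listing endpoint. If this deployment does, the call below returns the exact model name to use in the model field; treat it as an optional sanity check rather than a documented guarantee:
curl -H "Authorization: Bearer $API_KEY" "$ENDPOINT_URL/models"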
Clean Up and Cost Management
Scale to Zero
To stop GPU billing while preserving your deployment’s URL and API key, scale to zero:
exo dedicated-inference deployment scale <deployment-name> 0 -z <zone>
When you’re ready to resume, scale back up:
exo dedicated-inference deployment scale <deployment-name> 1 -z <zone>
Delete the Deployment
If you no longer need the deployment at all, delete it:
exo dedicated-inference deployment delete <deployment-name> -z <zone>
Note: Deleting a deployment discards its URL and API key. Prefer scaling to zero if you plan to use the deployment again.
Delete the Model
To remove the model from Object Storage and stop all associated costs:
exo dedicated-inference model delete <model-id> -z <zone>
Only delete models you’re certain you won’t need again, as they’ll need to be re-downloaded from Hugging Face.
Next Steps
Now that you’ve deployed your first model, explore more advanced topics: