Quick Start

In this quick start guide, you will deploy your first AI model using Exoscale Dedicated Inference and expose it as a production-ready inference endpoint running on dedicated GPUs.

Before You Start

Before deploying your first model, make sure the following prerequisites are met:

Exoscale CLI

Dedicated Inference is currently managed via the Exoscale CLI (exo) and API.

  • Install and configure the latest Exoscale CLI

  • The API key used by the Exoscale CLI needs access to the compute and ai services. Create a role with the necessary permissions:

# Create a role with compute and AI service access
exo iam role create ai-role --policy '{
  "default-service-strategy": "deny",
  "services": {
    "compute": {"type": "allow", "rules": []},
    "ai": {"type": "allow", "rules": []}
  }
}'

# Create an API key for this role
exo iam api-key create ai-key ai-role

# Add the key to your CLI configuration
exo config add
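
As a quick sanity check that the new key works, run any read-only command against the API, for example listing zones:

# Verify that the CLI can authenticate with the new key
exo zone list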

GPU Quota

You must have sufficient GPU quota available in your organization. If needed, request a quota increase via the Exoscale Portal under Organization → Quotas, or by opening a support ticket.

Hugging Face Account (Optional)

Some models (for example Llama-based models) are gated on Hugging Face.

If you plan to deploy gated models:

  • Create a Hugging Face account

  • Accept the model license

  • Generate a read-access token

This token will be required when creating the model.
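
If you want to confirm the token is valid before registering a model, one way (a minimal sketch; <hf-token> is a placeholder) is to query the Hugging Face whoami endpoint with it:

# Check that the Hugging Face token authenticates correctly
curl -s https://huggingface.co/api/whoami-v2 \
  -H "Authorization: Bearer <hf-token>"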

Create a Model

The first step is to register a model with Exoscale. This downloads the model and stores it securely in Exoscale Object Storage, making it available for deployment.

You must specify the zone where the model will be stored. This should match the zone where your GPUs are available.

exo dedicated-inference model create <huggingface-model-id> -z <zone>

Example (Public Model)

exo dedicated-inference model create mistralai/Mistral-7B-Instruct-v0.3 -z at-vie-2

Example (Gated Model)

exo dedicated-inference model create meta-llama/Llama-3.2-1B-Instruct \
  --huggingface-token <hf-token> \
  -z at-vie-2

Model creation may take several minutes. You can check the status with:

exo dedicated-inference model list
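
If you would rather not re-run the command by hand, a simple way to poll is with watch (the 30-second interval is arbitrary):

# Re-run the listing every 30 seconds to follow the model's status
watch -n 30 exo dedicated-inference model list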

Deploy the Model

Once the model is available, deploy it as an inference endpoint on dedicated GPUs.

This step provisions the required GPU resources and launches the model.

exo dedicated-inference deployment create <deployment-name> \
  --model-name <model-name> \
  --gpu-type <gpu-sku> \
  --gpu-count <count> \
  --replicas <count> \
  -z <zone>

Example

exo dedicated-inference deployment create demo \
  --model-name mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-type gpua5000 \
  --gpu-count 1 \
  --replicas 1 \
  -z at-vie-2

You can monitor deployment progress with:

exo dedicated-inference deployment list

Wait until the deployment status is ready.
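
If you want to script the wait, a minimal sketch is to poll the listing until it reports a ready status (the exact output format is an assumption, so adjust the grep pattern to match what your CLI prints):

# Poll the deployment list every 30 seconds until it reports "ready"
until exo dedicated-inference deployment list | grep -q "ready"; do
  sleep 30
done
echo "Deployment is ready"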

Access the Inference Endpoint

After deployment, retrieve the endpoint URL and API key.

Get the Endpoint URL

exo dedicated-inference deployment show <deployment-name> -z <zone>

Look for the Deployment URL field. The URL already includes the /v1 path required by the API.

Get the API Key

exo dedicated-inference deployment reveal-api-key <deployment-name> -z <zone>

This command reveals the existing API key (it does not rotate it). To issue a new key, delete and recreate the deployment.
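
If you prefer to script these two lookups, the CLI's JSON output (-O json) can be combined with jq; the field names used below are assumptions, so inspect the actual JSON keys returned by your CLI version first:

# Capture the endpoint URL and API key into environment variables
# (the .url and .api-key field names are assumptions, not confirmed keys)
export ENDPOINT_URL=$(exo dedicated-inference deployment show <deployment-name> -z <zone> -O json | jq -r '.url')
export API_KEY=$(exo dedicated-inference deployment reveal-api-key <deployment-name> -z <zone> -O json | jq -r '."api-key"')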

Test the Endpoint

Dedicated Inference exposes an OpenAI-compatible API, making it easy to integrate with existing tools and SDKs.

Set Endpoint and API Key

Copy the deployment URL and API key from the previous commands, then export them:

export ENDPOINT_URL="https://<your-deployment-id>.inference.<zone>.exoscale-cloud.com/v1"
export API_KEY="<your-api-key>"
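
A quick way to verify connectivity, and to see the exact model name the endpoint expects, is to list the models it serves (a standard route on OpenAI-compatible APIs):

# List the models served by the endpoint
curl -s "$ENDPOINT_URL/models" \
  -H "Authorization: Bearer $API_KEY"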

Example Request

curl -X POST "$ENDPOINT_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "<model-name>",
    "messages": [
      { "role": "user", "content": "Write a short poem about Switzerland" }
    ],
    "max_tokens": 3000
  }'

Note that the request URL must contain the /v1 path, as included in the deployment URL shown by the show command.
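
Since the response follows the standard OpenAI chat completion schema, you can also extract just the generated text by piping the same request through jq:

# Same request as above, keeping only the assistant's reply
curl -s -X POST "$ENDPOINT_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "<model-name>",
    "messages": [
      { "role": "user", "content": "Write a short poem about Switzerland" }
    ],
    "max_tokens": 3000
  }' | jq -r '.choices[0].message.content'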

If the request succeeds, your model is now running in production.

Clean Up and Cost Management

Scale to Zero

To stop GPU billing while preserving your deployment’s URL and API key, scale to zero:

exo dedicated-inference deployment scale <deployment-name> 0 -z <zone>

When you’re ready to resume, scale back up:

exo dedicated-inference deployment scale <deployment-name> 1 -z <zone>
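
In either case, you can check the deployment's current state with the show command:

# Inspect the deployment to confirm the scaling operation
exo dedicated-inference deployment show <deployment-name> -z <zone>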

Delete the Deployment

If you no longer need the deployment at all, delete it:

exo dedicated-inference deployment delete <deployment-name> -z <zone>

Note: Deleting the deployment permanently removes its URL and API key. Prefer scaling to zero if you plan to use the deployment again.

Delete the Model

To remove the model from Object Storage and stop all associated costs:

exo dedicated-inference model delete <model-id> -z <zone>

Only delete models you’re certain you won’t need again, as they’ll need to be re-downloaded from Hugging Face.

Next Steps

Now that you’ve deployed your first model, explore the rest of the Dedicated Inference documentation for more advanced topics.
