Overview

Exoscale Dedicated Inference is a managed AI service within the Concrete AI product suite that allows you to deploy and run AI models as production-ready inference endpoints on Exoscale’s dedicated, sovereign GPU infrastructure.

It removes the complexity of infrastructure setup, scaling, and GPU management, providing a simple and reliable way to expose AI models through an OpenAI-compatible API. Dedicated Inference is designed for teams that want to move from experimentation to production without building and operating their own AI platform.

Terminology

Understanding the key concepts used in Dedicated Inference will help you deploy models efficiently.

Model
A model is an AI artifact (for example from Hugging Face) that is imported into Exoscale and stored securely in Exoscale Object Storage for deployment.
Deployment
A deployment represents a running inference endpoint backed by dedicated GPU resources. Each deployment exposes an OpenAI-compatible API endpoint.
Replica
A replica is a running copy of a deployment used to scale inference capacity and handle concurrent requests.
GPU Count
Defines how many GPUs are assigned to a single model instance, enabling large models to run across multiple GPUs (see the sketch after this list).
Dedicated GPU
A physical GPU allocated exclusively to a deployment, ensuring predictable performance and isolation.
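
To make the relationship between these terms concrete, the sketch below shows a hypothetical deployment description. The field names are illustrative assumptions, not the actual Exoscale API schema.

```python
# Hypothetical illustration of how the terms above relate.
# Field names are assumptions for clarity, not the Exoscale API schema.
deployment = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # model artifact imported into Object Storage
    "gpu_count": 2,   # GPUs assigned to a single model instance (one replica)
    "replicas": 3,    # running copies of that instance serving requests in parallel
}

# Each replica gets its own dedicated GPUs, so the deployment reserves
# gpu_count * replicas physical GPUs in total.
total_dedicated_gpus = deployment["gpu_count"] * deployment["replicas"]  # 2 * 3 = 6
```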

Features

Managed Inference Endpoints
Deploy AI models as fully managed inference endpoints using simple API or CLI commands.
Dedicated GPU Performance
Each deployment runs on dedicated NVIDIA GPUs, ensuring consistent performance without resource contention.
OpenAI-Compatible API
Integrate deployed models directly into applications using a standard, OpenAI-compatible API (see the example after this list).
Bring Your Own Model
Use public or gated models from Hugging Face or upload your own models, keeping full control over model choice.
Flexible Scaling
Scale deployments horizontally using replicas or vertically using multiple GPUs. Scale to zero to stop GPU billing while preserving your URL and API key.
Speculative Decoding
An advanced performance optimization that reduces inference latency and improves throughput using a two-model approach; see Speculative Decoding (Advanced) below.
Sovereign & Secure
All deployments run in European data centers with strong isolation between organizations, supporting GDPR and data sovereignty requirements.
Transparent Pricing
Billing is based on per-second GPU usage and standard Object Storage costs, with no token-based pricing.
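
Because each deployment exposes an OpenAI-compatible API, it can typically be called with any OpenAI-style client. The snippet below is a minimal sketch; the endpoint URL, API key, and model name are placeholders, so substitute the values of your own deployment.

```python
from openai import OpenAI

# Placeholder endpoint URL, API key, and model name; use the values
# returned for your own deployment.
client = OpenAI(
    base_url="https://my-deployment.example.exoscale.com/v1",
    api_key="EXAMPLE_DEPLOYMENT_API_KEY",
)

response = client.chat.completions.create(
    model="my-deployed-model",
    messages=[{"role": "user", "content": "Summarize what dedicated inference means."}],
)
print(response.choices[0].message.content)
```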

Speculative Decoding (Advanced)

Speculative decoding is a performance optimization that pairs a small draft model with a larger target model to reduce inference latency and improve throughput. For details on model pairing, configuration, and best practices, see the Optimize Performance guide.
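
Conceptually, the draft model proposes several tokens cheaply and the target model then verifies them, keeping the longest prefix it agrees with. The sketch below illustrates a simplified greedy variant with two assumed callables, draft_next and target_next; it shows the general idea only, not Exoscale's implementation.

```python
def speculative_step(tokens, draft_next, target_next, k=4):
    """One simplified, greedy speculative-decoding step.

    draft_next and target_next are assumed callables that return the next
    token each model would pick for a given context. A real implementation
    verifies all k draft tokens with a single forward pass of the target model.
    """
    # 1. The small draft model cheaply proposes k candidate tokens.
    proposal, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. The large target model checks the proposal and keeps the longest
    #    agreeing prefix; on the first mismatch its own token is used instead.
    accepted = []
    for t in proposal:
        expected = target_next(tokens + accepted)
        if expected != t:
            accepted.append(expected)
            break
        accepted.append(t)
    return tokens + accepted
```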

Availability

GPU availability varies by zone and GPU type. See GPU availability by zone for the current GPU-by-zone matrix.

Zone       Country       City
at-vie-2   Austria       Vienna
ch-dk-2    Switzerland   Zurich
ch-gva-2   Switzerland   Geneva
de-fra-1   Germany       Frankfurt
hr-zag-1   Croatia       Zagreb

Limitations

The following limitations apply during the preview phase:

CLI-First Interface
Model creation, deployment, scaling, and deletion are currently performed via the API or CLI. Portal and Terraform support is coming soon.
Custom Domains
Custom domain configuration is not yet available. Each deployment receives a standard endpoint URL.
Service Level Tiers
Service level tiers are not available during preview. All deployments run at the same service level.
GPU Count Immutability
The --gpu-count parameter cannot be changed after deployment. To use a different GPU count, create a new deployment.
Model File Format
Only the safetensors format is supported. GGUF and other formats are not compatible.
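
As a quick sanity check before importing a model, you can verify locally that its weight files use the safetensors format. The snippet below is a generic illustration with a hypothetical local directory; it is not part of the Exoscale tooling.

```python
from pathlib import Path

# Generic local check (not part of the Exoscale tooling): flag weight files
# that are not in the safetensors format, such as .gguf or .bin files.
model_dir = Path("./my-model")  # hypothetical local model directory
weight_suffixes = {".safetensors", ".gguf", ".bin", ".pt"}

for f in sorted(model_dir.iterdir()):
    if f.suffix in weight_suffixes and f.suffix != ".safetensors":
        print(f"Unsupported weight file for Dedicated Inference: {f.name}")
```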