Overview
Exoscale Dedicated Inference is a managed AI service within the Concrete AI product suite that allows you to deploy and run AI models as production-ready inference endpoints on Exoscale’s dedicated, sovereign GPU infrastructure.
It removes the complexity of infrastructure setup, scaling, and GPU management, providing a simple and reliable way to expose AI models through an OpenAI-compatible API. Dedicated Inference is designed for teams that want to move from experimentation to production without building and operating their own AI platform.
Terminology
Understanding the key concepts used in Dedicated Inference will help you deploy models efficiently.
- Model
- A model is an AI artifact (for example from Hugging Face) that is imported into Exoscale and stored securely in Exoscale Object Storage for deployment.
- Deployment
- A deployment represents a running inference endpoint backed by dedicated GPU resources. Each deployment exposes an OpenAI-compatible API endpoint.
- Replica
- A replica is a running copy of a deployment used to scale inference capacity and handle concurrent requests.
- GPU Count
- Defines how many GPUs are assigned to a single model instance, enabling large models to run across multiple GPUs.
- Dedicated GPU
- A physical GPU allocated exclusively to a deployment, ensuring predictable performance and isolation.
Features
- Managed Inference Endpoints
- Deploy AI models as fully managed inference endpoints using simple API or CLI commands.
- Dedicated GPU Performance
- Each deployment runs on dedicated NVIDIA GPUs, ensuring consistent performance without resource contention.
- OpenAI-Compatible API
- Integrate deployed models directly into applications using a standard, OpenAI-compatible API (see the example after this list).
- Bring Your Own Model
- Use public or gated models from Hugging Face or upload your own models, keeping full control over model choice.
- Flexible Scaling
- Scale deployments horizontally using replicas or vertically using multiple GPUs. Scale to zero to stop GPU billing while preserving your URL and API key.
- Speculative Decoding
- An advanced performance optimization technique that significantly reduces inference latency and improves throughput using a two-model approach. See Speculative Decoding (Advanced) below.
- Sovereign & Secure
- All deployments run in European data centers with strong isolation between organizations, supporting GDPR and data sovereignty requirements.
- Transparent Pricing
- Billing is based on per-second GPU usage and standard Object Storage costs, with no token-based pricing.
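Because each deployment exposes an OpenAI-compatible endpoint, existing OpenAI client libraries can be pointed at it without code changes beyond the base URL and key. The sketch below uses the official OpenAI Python SDK; the endpoint URL, API key, and model name are placeholders, not values provided by this document.

```python
# Minimal sketch: calling a Dedicated Inference endpoint with the OpenAI Python SDK.
# The endpoint URL, API key, and model name are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-deployment-endpoint>/v1",  # placeholder endpoint URL
    api_key="<your-deployment-api-key>",               # placeholder API key
)

response = client.chat.completions.create(
    model="<your-model-name>",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize the benefits of dedicated GPUs."}],
)
print(response.choices[0].message.content)
```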
Speculative Decoding (Advanced)
Speculative decoding is a performance optimization that pairs a small draft model with a larger target model to reduce inference latency and improve throughput. For details on model pairing, configuration, and best practices, see the Optimize Performance guide.
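The mechanics are roughly as follows: the draft model cheaply proposes several tokens ahead, and the target model verifies them, keeping the accepted prefix plus one token of its own so every step makes progress. The toy sketch below illustrates the greedy-verification variant with stand-in functions instead of real models; it is a conceptual illustration only, not how Dedicated Inference implements the feature.

```python
# Conceptual sketch of greedy speculative decoding with stand-in "models".
# draft_next and target_next are toy functions; a real system would batch the
# target model's verification into a single forward pass.

def draft_next(context: list[str]) -> str:
    """Cheap draft model: guesses the next token (toy rule)."""
    return "la" if len(context) % 2 else "di"

def target_next(context: list[str]) -> str:
    """Expensive target model: the authoritative next token (toy rule)."""
    return "la" if len(context) % 2 else "da"

def speculative_step(context: list[str], k: int = 4) -> list[str]:
    """Draft k tokens, keep the prefix the target agrees with, then append one
    token from the target itself so at least one token is produced per step."""
    draft = []
    for _ in range(k):
        draft.append(draft_next(context + draft))
    accepted = []
    for tok in draft:
        if target_next(context + accepted) == tok:    # target verifies the draft token
            accepted.append(tok)
        else:
            break
    accepted.append(target_next(context + accepted))  # target supplies the next token
    return accepted

context = ["<start>"]
for _ in range(3):
    context += speculative_step(context)
print(context)
```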
Availability
GPU availability varies by zone and GPU type. See GPU availability by zone for the current GPU-by-zone matrix.
| Zone | Country | City | Availability |
|---|---|---|---|
| at-vie-2 | Austria | Vienna | |
| ch-dk-2 | Switzerland | Zurich | |
| ch-gva-2 | Switzerland | Geneva | |
| de-fra-1 | Germany | Frankfurt | |
| hr-zag-1 | Croatia | Zagreb | |
Limitations
The following limitations apply during the preview phase:
- CLI-First Interface
- Model creation, deployment, scaling, and deletion are currently performed via API or CLI. Portal and Terraform support is coming soon.
- Custom Domains
- Custom domain configuration is not yet available. Each deployment receives a standard endpoint URL.
- Service Level Tiers
- Service level tiers are not available during preview. All deployments run at the same service level.
- GPU Count Immutability
- The `--gpu-count` parameter cannot be changed after deployment. To use a different GPU count, create a new deployment.
- Model File Format
- Only the `safetensors` format is supported. GGUF and other formats are not compatible.
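Because only safetensors weights can be deployed, it can help to confirm that a model repository actually ships them before importing it. The sketch below uses the huggingface_hub library's list_repo_files function to inspect a repository; the repository ID is a placeholder, and gated repositories additionally require an access token.

```python
# Check whether a Hugging Face repository ships safetensors weights before
# importing it into Dedicated Inference. The repo ID below is a placeholder.
from huggingface_hub import list_repo_files

repo_id = "<org>/<model-name>"  # placeholder repository ID
files = list_repo_files(repo_id)

safetensors_files = [f for f in files if f.endswith(".safetensors")]
if safetensors_files:
    print("safetensors weights found:", safetensors_files)
else:
    print("No .safetensors files found; this model cannot be deployed as-is.")
```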