Overview

Exoscale Dedicated Inference is a managed AI service within the Concrete AI product suite that allows you to deploy and run AI models as production-ready inference endpoints on Exoscale’s dedicated, sovereign GPU infrastructure.

It removes the complexity of infrastructure setup, scaling, and GPU management, providing a simple and reliable way to expose AI models through an OpenAI-compatible API. Dedicated Inference is designed for teams that want to move from experimentation to production without building and operating their own AI platform.

Terminology

Understanding the key concepts used in Dedicated Inference will help you deploy models efficiently.

Model: A model is an AI artifact (for example from Hugging Face) that is imported into Exoscale and stored securely in Exoscale Object Storage for deployment.
Deployment: A deployment represents a running inference endpoint backed by dedicated GPU resources. Each deployment exposes an OpenAI-compatible API endpoint.
Replica: A replica is a running copy of a deployment used to scale inference capacity and handle concurrent requests.
GPU Count: Defines how many GPUs are assigned to a single model instance, enabling large models to run across multiple GPUs.
Dedicated GPU: A physical GPU allocated exclusively to a deployment rather than shared with other customers. This helps keep performance more predictable because a deployment is not competing with unrelated workloads for the same GPU.

Features

Dedicated GPU Performance: Each deployment runs on dedicated NVIDIA GPUs, ensuring consistent performance without resource contention.
Bring Your Own Model: Import public, gated, or private Hugging Face models, including models owned by your Hugging Face organization.
OpenAI-Compatible API: Integrate deployed models directly into applications using a standard OpenAI-compatible API
Flexible Scaling: Scale deployments by changing replicas. Scale to zero to stop GPU billing while keeping the endpoint URL and API key. The endpoint will not answer requests while scaled to zero, but you can scale it back up later.
Sovereign & Secure: All deployments run in European data centers with strong isolation between organizations, supporting GDPR and data sovereignty requirements.
Transparent Pricing: Billing is based on per-second GPU usage and standard Object Storage costs, with no token-based pricing.

Last updated on June 15, 2026