Service Boundaries

Dedicated Inference operates with the following constraints:

Safetensors Model File Format
Model weights must be in the safetensors format. GGUF and other formats are not supported.
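A safetensors file begins with an 8-byte little-endian header length followed by a JSON header, so you can sanity-check a weights file locally before uploading it. The sketch below is not part of the Dedicated Inference tooling; it is a minimal check based on the published safetensors layout.

```python
import json
import struct

def looks_like_safetensors(path):
    """Heuristic check that a file follows the safetensors layout:
    an 8-byte little-endian header length, then a JSON header."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) < 8:
            return False
        (header_len,) = struct.unpack("<Q", prefix)
        header = f.read(header_len)
    try:
        meta = json.loads(header)
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False
    # A valid safetensors header is a JSON object mapping tensor names
    # (plus optional "__metadata__") to dtype/shape/offset entries.
    return isinstance(meta, dict)
```

A GGUF file, for example, starts with the magic bytes `GGUF` rather than a header length, so this check rejects it.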
Customer-Managed Sizing
The right GPU type and count depend on your model and use case, so it is up to you to size your inference deployments.
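As a starting point for sizing, you can lower-bound the GPU memory a deployment needs from the parameter count and weight precision. The helper below is an illustrative back-of-the-envelope sketch, not a vendor sizing tool; the 1.2 overhead factor for activations and KV-cache headroom is an assumption.

```python
def estimate_min_vram_gb(num_params_b, bytes_per_param=2, overhead=1.2):
    """Rough lower bound on GPU memory (GB) needed to serve a model.

    num_params_b: parameter count in billions (e.g. 70 for a 70B model).
    bytes_per_param: 2 for fp16/bf16 weights, 1 for 8-bit quantized.
    overhead: headroom multiplier for activations and KV cache
    (the 1.2 default is an illustrative assumption, not a vendor figure).
    """
    return num_params_b * bytes_per_param * overhead

# A 70B model in bf16 needs roughly 168 GB for weights plus headroom,
# so it will not fit on a single 80 GB GPU.
print(estimate_min_vram_gb(70))  # → 168.0
```

Divide the estimate by the per-GPU memory of your chosen GPU type to get a candidate GPU count, then validate with real traffic before settling on a deployment size.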
GPU Count Immutability
The --gpu-count parameter cannot be changed after deployment. To use a different GPU count, create a new deployment.

Note

Web portal integration is pending. During the preview phase, Dedicated Inference is available via the CLI and API SDKs only.