Dedicated Inference

Dedicated Inference lets you run Large Language Models (LLMs) on Exoscale GPU infrastructure.

scale-deployment

[BETA] Scale Deployment

POST /ai/deployment/{id}/scale

Path parameters

  • id in path (required)

Request body

  • application/json
    • replicas (required) (integer): Number of replicas (>=0)

Responses

  • 200: 200
    • application/json
      • id (string): Operation ID
      • reason (string): Operation failure reason
      • reference (object): Related resource reference
      • message (string): Operation message
      • state (string): Operation status

SDK reference for scale-deployment: golang | Python | Java

CLI: exo api scale-deployment
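
The request body is a single field; a minimal Python sketch of building and validating it client-side (the helper name is illustrative, not part of any official SDK):

```python
def scale_request(deployment_id: str, replicas: int) -> tuple[str, dict]:
    """Build the path and JSON body for POST /ai/deployment/{id}/scale."""
    if replicas < 0:  # the schema requires replicas >= 0
        raise ValueError("replicas must be >= 0")
    return f"/ai/deployment/{deployment_id}/scale", {"replicas": replicas}
```

Since the schema allows a replica count of 0, the same call can scale a deployment down to zero replicas.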

create-deployment

[BETA] Create Deployment

POST /ai/deployment

Deploy a model on an inference server

Request body

  • application/json
    • model (object)
    • name (string): Deployment name
    • gpu-type (required) (string): GPU type family (e.g., gpua5000, gpu3080ti)
    • gpu-count (required) (integer): Number of GPUs (1-8)
    • replicas (required) (integer): Number of replicas (>=1)
    • inference-engine-parameters (array[string]): Optional extra inference engine server CLI args

Responses

  • 200: 200
    • application/json
      • id (string): Operation ID
      • reason (string): Operation failure reason
      • reference (object): Related resource reference
      • message (string): Operation message
      • state (string): Operation status
  • 400: 400
    • application/json
      • error (string): Error description

SDK reference for create-deployment: golang | Python | Java

CLI: exo api create-deployment
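
A sketch of assembling the request body with the documented bounds enforced client-side. The shape of the nested model object is not expanded in this reference, so the {"id": ...} form below is an assumption:

```python
def create_deployment_body(model_id, gpu_type, gpu_count, replicas,
                           name=None, engine_params=None):
    """Build the JSON body for POST /ai/deployment."""
    if not 1 <= gpu_count <= 8:
        raise ValueError("gpu-count must be between 1 and 8")
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    body = {
        "model": {"id": model_id},  # assumed shape; the model schema is not expanded here
        "gpu-type": gpu_type,       # e.g. "gpua5000" or "gpu3080ti"
        "gpu-count": gpu_count,
        "replicas": replicas,
    }
    if name is not None:
        body["name"] = name
    if engine_params:
        body["inference-engine-parameters"] = list(engine_params)
    return body
```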

get-deployment

[BETA] Get Deployment

GET /ai/deployment/{id}

Path parameters

  • id in path (required)

Responses

  • 200: 200
    • application/json
      • gpu-count (integer): Number of GPUs
      • updated-at (string): Update time
      • deployment-url (string): Deployment URL (nullable)
      • service-level (string): Service level
      • name (string): Deployment name
      • status-details (string): Deployment status details
      • gpu-type (string): GPU type family
      • status (string): Deployment status
      • id (string): Deployment ID
      • replicas (integer): Number of replicas (>=0)
      • created-at (string): Creation time
      • inference-engine-parameters (array[string]): Optional extra inference engine server CLI args
      • model (object)
  • 404: 404
    • application/json
      • error (string): Error description

SDK reference for get-deployment: golang | Python | Java

CLI: exo api get-deployment
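
Because deployment-url is nullable, a simple readiness check can key off that field rather than guessing at the status vocabulary, which this reference does not enumerate. A sketch:

```python
def deployment_ready(dep: dict) -> bool:
    """True once GET /ai/deployment/{id} reports a non-null deployment-url."""
    return dep.get("deployment-url") is not None
```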

delete-deployment

[BETA] Delete Deployment

DELETE /ai/deployment/{id}

Path parameters

  • id in path (required)

Responses

  • 200: 200
    • application/json
      • id (string): Operation ID
      • reason (string): Operation failure reason
      • reference (object): Related resource reference
      • message (string): Operation message
      • state (string): Operation status

SDK reference for delete-deployment: golang | Python | Java

CLI: exo api delete-deployment

create-model

[BETA] Create Model

POST /ai/model

Model files are downloaded from Hugging Face.

The name must be the exact model id on Hugging Face (e.g. openai/gpt-oss-120b or ggml-org/gpt-oss-120b-GGUF).

If the model is gated by a license, you must provide a Hugging Face access token for an account that has accepted the license agreement.

Request body

  • application/json
    • huggingface-token (string): Hugging Face access token (required only for gated models)
    • name (string): Model name

Responses

  • 200: 200
    • application/json
      • id (string): Operation ID
      • reason (string): Operation failure reason
      • reference (object): Related resource reference
      • message (string): Operation message
      • state (string): Operation status

SDK reference for create-model: golang | Python | Java

CLI: exo api create-model
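
The naming rule above lends itself to a small client-side check before submitting the request; a sketch (the helper is illustrative, not part of any SDK):

```python
def create_model_body(name: str, huggingface_token: str = "") -> dict:
    """Build the JSON body for POST /ai/model.

    name must be the exact Hugging Face model id, e.g. "openai/gpt-oss-120b".
    A token is only needed for gated models.
    """
    if "/" not in name:  # Hugging Face ids have the form "org/model"
        raise ValueError(f"not a full Hugging Face model id: {name!r}")
    body = {"name": name}
    if huggingface_token:
        body["huggingface-token"] = huggingface_token
    return body
```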

get-model

[BETA] Get Model

GET /ai/model/{id}

Path parameters

  • id in path (required)

Responses

  • 200: 200
    • application/json
      • id (string): Model ID
      • name (string): Model name
      • status (string): Model status
      • model-size (integer): Model size (nullable)
      • created-at (string): Creation time
      • updated-at (string): Update time
  • 404: 404
    • application/json
      • error (string): Error description

SDK reference for get-model: golang | Python | Java

CLI: exo api get-model

delete-model

[BETA] Delete Model

DELETE /ai/model/{id}

Path parameters

  • id in path (required)

Responses

  • 200: 200
    • application/json
      • id (string): Operation ID
      • reason (string): Operation failure reason
      • reference (object): Related resource reference
      • message (string): Operation message
      • state (string): Operation status
  • 412: 412
    • application/json
      • deployments (array[string]): Deployments still using this model

SDK reference for delete-model: golang | Python | Java

CLI: exo api delete-model
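
The 412 response is the interesting case: the model is still referenced by one or more deployments, which must be deleted first. A sketch of handling the two documented outcomes (the helper is illustrative):

```python
def check_delete_model_response(status: int, payload: dict) -> str:
    """Interpret DELETE /ai/model/{id} responses per the reference above."""
    if status == 200:
        # the response describes an asynchronous operation (id, state, ...)
        return f"operation {payload['id']}: {payload['state']}"
    if status == 412:
        # precondition failed: these deployments still use the model
        raise RuntimeError("model still in use by: " + ", ".join(payload["deployments"]))
    raise RuntimeError(f"unexpected status {status}")
```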


Other Operations

reveal-deployment-api-key

[BETA] Reveal Deployment API Key

GET /ai/deployment/{id}/api-key

Path parameters

  • id in path (required)

Responses

  • 200: 200
    • application/json
      • api-key (string)

SDK reference for reveal-deployment-api-key: golang | Python | Java

CLI: exo api reveal-deployment-api-key
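
The revealed key authenticates requests to the deployment's own endpoint (the deployment-url returned by get-deployment). A sketch, assuming the key is passed as a standard Bearer token; verify the expected header against the product documentation:

```python
def inference_headers(api_key: str) -> dict:
    """HTTP headers for calling a deployment endpoint.

    Assumption: the deployment accepts the key as a Bearer token.
    """
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```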

get-deployment-logs

[BETA] Get Deployment Logs

GET /ai/deployment/{id}/logs

Returns logs for the deployment's vLLM inference server. Append the optional query parameter ?stream=true to request streaming (may not be supported).

Path parameters

  • id in path (required)

Responses

  • 200: 200
    • application/json
  • 400: 400
    • application/json
      • error (string): Error description
  • 404: 404
    • application/json
      • error (string): Error description
  • 500: 500
    • application/json
      • error (string): Error description

SDK reference for get-deployment-logs: golang | Python | Java

CLI: exo api get-deployment-logs
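
A sketch of building the request path, including the optional stream query parameter described above:

```python
def logs_path(deployment_id: str, stream: bool = False) -> str:
    """Path for GET /ai/deployment/{id}/logs.

    ?stream=true requests streaming, which the server may not support.
    """
    path = f"/ai/deployment/{deployment_id}/logs"
    return path + "?stream=true" if stream else path
```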

list-deployments

[BETA] List Deployments

GET /ai/deployment

Responses

SDK reference for list-deployments: golang | Python | Java

CLI: exo api list-deployments

get-inference-engine-help

[BETA] Get inference-engine Help

GET /ai/help/inference-engine-parameters

Returns the list of allowed inference engine parameters, with their descriptions, types, allowed values, and defaults.

Responses

  • 200: 200
  • 500: 500
    • application/json
      • error (string): Error description

SDK reference for get-inference-engine-help: golang | Python | Java

CLI: exo api get-inference-engine-help

list-models

[BETA] List Models

GET /ai/model

Responses

SDK reference for list-models: golang | Python | Java

CLI: exo api list-models
