Monitor and Troubleshoot

Use this guide when a deployment doesn’t become ready, requests fail, or you need logs for support.

A deployment starts from model files stored in zone-local Object Storage. At startup, the files are copied into the runtime, loaded by vLLM, and exposed through the deployment URL. The deployment state tells you which phase the deployment is in. The logs tell you why a phase failed or stalled.

Set these variables for the examples:

export ZONE="de-fra-1"
export DEPLOYMENT="ministral-3-8b-reasoning"

Check State

List deployments:

exo dedicated-inference deployment list -z "$ZONE"

Show one deployment:

exo dedicated-inference deployment show "$DEPLOYMENT" -z "$ZONE"

Use `show` to check the state, endpoint URL, model, GPU type, GPU count, replicas, and state details.

State	Meaning	What to do
`creating`	The request was accepted	Wait, then check again
`preparing`	Runtime resources are being prepared	Check logs if it stays there
`deploying`	vLLM is starting and loading the model	Check logs if it takes too long
`ready`	The endpoint can accept requests	Test the client request
`scaling`	Replicas are changing	Wait for scaling to finish
`updating`	Deployment settings are being applied	Wait, then check state again
`error`	The deployment failed	Check state details and logs

Large models can spend more time in `deploying`, `updating`, or `scaling` when runtime replicas start or restart. This is expected while model files are copied and loaded into GPU memory.
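If you prefer to wait in a script rather than re-run `show` by hand, a polling loop along these lines may help. This is a sketch under assumptions: `fetch_state`, `wait_for_ready`, and `POLL_SECONDS` are hypothetical names, the `jq` tool is assumed to be installed, and the sketch assumes that `show` accepts the CLI's `--output-format json` flag and returns a top-level `state` field. Check the real output and adjust the filter if it differs.

```shell
# Hypothetical helper: print the current state of one deployment.
# Assumes JSON output with a top-level "state" field (not confirmed here).
fetch_state() {
  exo dedicated-inference deployment show "$1" -z "$2" --output-format json |
    jq -r '.state'
}

# Poll until the deployment is ready (status 0), reports an error (status 1),
# or the attempt budget is used up (status 2).
# Arguments: deployment name, zone, optional number of attempts (default 60).
wait_for_ready() (
  deployment=$1
  zone=$2
  attempts=${3:-60}
  interval=${POLL_SECONDS:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state=$(fetch_state "$deployment" "$zone")
    echo "state: $state" >&2
    case "$state" in
      ready) exit 0 ;;
      error) exit 1 ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  exit 2
)
```

For example, `wait_for_ready "$DEPLOYMENT" "$ZONE" && echo "ready"` polls every 10 seconds for up to 60 attempts. A status of 1 means it is time to read the state details and logs.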

View Logs

Show recent logs first:

exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE"

Fetch more logs when needed:

exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE" --tail 500

Look for:

CUDA out of memory
Python tracebacks
invalid vLLM parameters
model loading messages

Progress messages usually mean the deployment is still starting. Tracebacks, memory errors, and invalid parameter messages usually mean the vLLM inference engine parameters need to change.
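To pick failure lines out of a long log, a small filter can help. This is a sketch: `scan_logs` is a hypothetical helper, and apart from `CUDA out of memory` the patterns are assumptions about typical Python and argument-parser output, not a list of messages the service is known to emit. Extend it with whatever you see in your own logs.

```shell
# Read logs on stdin and print, with line numbers, only the lines that
# usually point to a failure rather than normal startup progress.
# The patterns are assumptions; adjust them to match your logs.
scan_logs() {
  grep -n -i -E 'CUDA out of memory|Traceback \(most recent call last\)|unrecognized arguments|invalid.*(argument|parameter)'
}
```

For example, `exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE" --tail 500 | scan_logs` prints the matching lines. No output means none of the patterns matched, which often just means the model is still loading.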

Startup Issues

Symptom	Likely cause	What to do
Stuck in `preparing`	Runtime setup, quota, capacity, or GPU authorization issue	Check state details and available instance types
Stuck in `deploying` with progress logs	Large model or slow model load	Wait, then check logs again
`CUDA out of memory`	Model, context length, or KV-cache doesn’t fit	Tune your vLLM engine parameters to fit the available CUDA memory, or use more GPUs or a GPU class with more VRAM
Invalid vLLM parameter	Parameter isn’t allowed for the selected engine version	Check parameter help and remove it
Quota or capacity error	Not enough GPU quota or capacity in the zone	Check quota and GPU availability
`error` without a clear log line	The failure happened before or outside vLLM startup	Collect state details and logs for support

If your quota is too low or the GPU type you need isn’t available in your preferred zone, open a support ticket to discuss options for that zone.

Check available GPU types and authorization:

exo dedicated-inference deployment instance-type -z "$ZONE"

Check allowed inference engine parameters for the default vLLM engine version:

exo dedicated-inference deployment create --inference-engine-parameter-help

GPU Memory Errors

A deployment can run out of memory even when the model weights fit. Long context windows also reserve memory for the KV-cache. That memory can be large.

Try a lower context length:

--inference-engine-params '--max-model-len=32768'

If the selected engine version, model, and GPU support it, a lower-precision KV-cache can also reduce memory use:

--inference-engine-params '--kv-cache-dtype=fp8'

If the model still doesn’t fit, create a new deployment with more GPUs or a GPU type with more memory. The GPU count (`--gpu-count`) and the GPU type are fixed after creation. Scaling replicas helps concurrency, but it doesn’t give one request more GPU memory.
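To see why context length and cache precision matter, you can estimate the KV-cache needed for one full-length sequence. This is a rough sketch, not a sizing tool: `kv_cache_bytes` is a hypothetical helper, the formula is the usual estimate for attention caches, and the model shape below (32 layers, 8 KV heads, head dimension 128) is made up for illustration. Take the real values from your model's configuration.

```shell
# KV-cache size for one sequence at full context length, in bytes:
#   2 (keys and values) * layers * kv_heads * head_dim * bytes_per_value * tokens
# Arguments: layers, kv_heads, head_dim, bytes_per_value, tokens
kv_cache_bytes() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 ))
}

# Hypothetical model shape, not taken from any specific model:
kv_cache_bytes 32 8 128 2 131072   # 16-bit cache, 131072 tokens
kv_cache_bytes 32 8 128 2 32768    # 16-bit cache, 32768 tokens
kv_cache_bytes 32 8 128 1 32768    # 8-bit (fp8) cache, 32768 tokens
```

For this made-up shape the three calls print 17179869184, 4294967296, and 2147483648 bytes, that is 16 GiB, 4 GiB, and 2 GiB. The cache comes on top of the model weights, and the engine typically needs room for more than one sequence, so treat the result as a lower bound.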

If the Model Is Not Created

A deployment needs a model in the `created` state.

Check models in the zone:

exo dedicated-inference model list -z "$ZONE"

If the model shows `creating`, `downloading`, or `error` instead, resolve the model import before you troubleshoot the deployment. Common causes of a failed import are a wrong Hugging Face model ID, missing access to a gated model, or unsupported model files.

For gated or private models, see Import Gated Models.

Request Errors

If the deployment is ready but requests fail, check the client setup.

Error	Likely cause	What to do
`401 Unauthorized`	Wrong deployment API key	Reveal the key again and update the client
`404 Not Found`	Wrong endpoint URL or missing `/v1` path	Check the URL with `deployment show`
`400 Bad Request`	Invalid request body	Check the JSON payload and model name
`429 Too Many Requests`	More traffic than the deployment can handle	Reduce traffic or add replicas
`500`	Runtime error during inference	Check state and logs

Reveal the deployment API key:

exo dedicated-inference deployment reveal-api-key "$DEPLOYMENT" -z "$ZONE"

Use it as a bearer token:

Authorization: Bearer <api-key>

Dedicated Inference endpoints are OpenAI-compatible. Use the deployment URL as the base URL and include the /v1 path.
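To rule out client-side mistakes, you can send one minimal request with curl and look only at the status code. This is a sketch under assumptions: `api_base` and `probe_endpoint` are hypothetical helpers, `API_KEY` and `MODEL` are variables you set yourself, and the `/chat/completions` route is assumed from the OpenAI-compatible API rather than confirmed for your deployment.

```shell
# Normalize a deployment URL so that it ends in exactly one /v1.
api_base() (
  url=${1%/}        # drop one trailing slash, if any
  url=${url%/v1}    # drop an existing /v1 so it is not doubled
  printf '%s/v1\n' "$url"
)

# Send a minimal chat completion and print only the HTTP status code.
# Assumes API_KEY holds the deployment API key and MODEL holds the model
# name the deployment expects (both set by you).
probe_endpoint() {
  curl -sS -o /dev/null -w '%{http_code}\n' \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"max_tokens\": 8}" \
    "$(api_base "$1")/chat/completions"
}
```

Call `probe_endpoint` with the endpoint URL from `deployment show`, then compare the printed status code with the table above.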

Slow Responses

Slow responses usually come from long prompts, high `max_tokens` values, too much traffic, or limited GPU memory headroom.

Add replicas when you need to handle more concurrent requests:

exo dedicated-inference deployment scale "$DEPLOYMENT" 3 -z "$ZONE"

Replicas add more running copies of the same deployment. They help with concurrency. They don’t make one heavy request faster.

Collect Details for Support

Before opening a ticket, collect state and logs:

exo dedicated-inference deployment show "$DEPLOYMENT" -z "$ZONE" > deployment-info.txt
exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE" --tail 500 > deployment-logs.txt

Don’t include deployment API keys, Hugging Face tokens, or other secrets.
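As a precaution, you can mask token-like strings before attaching the files, in case a key or token was echoed into the output. This is a sketch: `redact_secrets` is a hypothetical helper, and it only catches bearer tokens and strings with the Hugging Face `hf_` prefix. Review the files by hand as well.

```shell
# Read text on stdin and mask bearer tokens and hf_-prefixed tokens.
# Other kinds of secrets are NOT detected by this filter.
redact_secrets() {
  sed -E \
    -e 's|(Bearer[[:space:]]+)[A-Za-z0-9._~+/=-]+|\1[REDACTED]|g' \
    -e 's|hf_[A-Za-z0-9]+|[REDACTED]|g'
}
```

For example, `redact_secrets < deployment-logs.txt > deployment-logs.redacted.txt` writes a masked copy and leaves the original file untouched.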

Open a support ticket through the Exoscale Portal.

Last updated on June 15, 2026