Monitor and Troubleshoot
Use this guide when a deployment doesn’t become ready, requests fail, or you need logs for support.
A deployment starts from model files stored in zone-local Object Storage. At startup, the files are copied into the runtime, loaded by vLLM, and exposed through the deployment URL. The deployment state shows the phase. Logs show the cause.
Set these variables for the examples:
export ZONE="de-fra-1"
export DEPLOYMENT="ministral-3-8b-reasoning"
Check State
List deployments:
exo dedicated-inference deployment list -z "$ZONE"
Show one deployment:
exo dedicated-inference deployment show "$DEPLOYMENT" -z "$ZONE"
Use show to check the state, endpoint URL, model, GPU type, GPU count, replicas, and state details.
| State | Meaning | What to do |
|---|---|---|
| creating | The request was accepted | Wait, then check again |
| preparing | Runtime resources are being prepared | Check logs if it stays there |
| deploying | vLLM is starting and loading the model | Check logs if it takes too long |
| ready | The endpoint can accept requests | Test the client request |
| scaling | Replicas are changing | Wait for scaling to finish |
| updating | Deployment settings are being applied | Wait, then check state again |
| error | The deployment failed | Check state details and logs |
Large models can spend more time in deploying, updating, or scaling when runtime replicas start or restart. This is expected while model files are copied and loaded into GPU memory.
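The wait-and-check-again advice in the table can be scripted. The sketch below polls the show command until the deployment looks ready or failed. It assumes that the show output prints the state name as a whole word on a line containing "state". That layout is an assumption, not something this guide specifies, so adjust the grep patterns to match what your CLI prints. The function name wait_for_ready and its arguments are illustrative.

```shell
# Poll a deployment until it reports ready or error, or until attempts run out.
# ASSUMPTION: `deployment show` prints the state name as a whole word on a
# line that contains "state". Adjust the patterns if your output differs.
# Args: deployment name, zone, max attempts (default 60), seconds between checks (default 10).
wait_for_ready() {
  local deployment="$1" zone="$2" attempts="${3:-60}" interval="${4:-10}"
  local i=1 state_lines
  while [ "$i" -le "$attempts" ]; do
    # Keep only the lines that mention "state" (assumed output layout).
    state_lines="$(exo dedicated-inference deployment show "$deployment" -z "$zone" | grep -i 'state' || true)"
    if printf '%s\n' "$state_lines" | grep -qiw 'ready'; then
      echo "ready after $i check(s)"
      return 0
    fi
    if printf '%s\n' "$state_lines" | grep -qiw 'error'; then
      echo "deployment reported an error; check state details and logs" >&2
      return 1
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "not ready after $attempts checks" >&2
  return 2
}

# Usage: wait_for_ready "$DEPLOYMENT" "$ZONE" 60 10
```

The distinct return codes (0 ready, 1 error, 2 timed out) let a calling script decide whether to fetch logs or keep waiting.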
View Logs
Show recent logs first:
exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE"
Fetch more logs when needed:
exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE" --tail 500
Look for:
- CUDA out of memory
- Python tracebacks
- invalid vLLM parameters
- model loading messages
Progress messages usually mean the deployment is still starting. Tracebacks, memory errors, and invalid parameter messages usually mean you need to change the vLLM inference engine parameters.
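To pull the failure signatures listed above out of a long log, a small filter can help. This is a minimal sketch: the patterns are assumptions based on the usual PyTorch, Python, and argparse wording ("CUDA out of memory", "Traceback (most recent call last)", "unrecognized arguments"). The exact lines in your deployment logs may differ, and scan_logs is an illustrative name.

```shell
# Print log lines that match common failure signatures, prefixed with line numbers.
# Reads log text on stdin. Exit status is 0 if something matched, 1 otherwise.
# ASSUMPTION: the patterns follow typical PyTorch/Python/argparse messages and
# may need tuning for the messages your deployment actually emits.
scan_logs() {
  grep -n -i -E 'CUDA out of memory|Traceback \(most recent call last\)|unrecognized arguments'
}

# Usage:
#   exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE" --tail 500 | scan_logs
```

No output usually means the log only contains progress messages, in which case waiting and checking again is the next step.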
Startup Issues
| Symptom | Likely cause | What to do |
|---|---|---|
| Stuck in preparing | Runtime setup, quota, capacity, or GPU authorization issue | Check state details and available instance types |
| Stuck in deploying with progress logs | Large model or slow model load | Wait, then check logs again |
| CUDA out of memory | Model, context length, or KV-cache doesn’t fit | Tune your vLLM engine parameters to fit the available CUDA memory, or use more GPUs or a GPU class with more VRAM |
| Invalid vLLM parameter | Parameter isn’t allowed for the selected engine version | Check parameter help and remove it |
| Quota or capacity error | Not enough GPU quota or capacity in the zone | Check quota and GPU availability |
| error without a clear log line | The failure happened before or outside vLLM startup | Collect state details and logs for support |
If your quota is too low or the GPU type you need isn’t available in your preferred zone, open a support ticket to discuss options for making that GPU available there.
Check available GPU types and authorization:
exo dedicated-inference deployment instance-type -z "$ZONE"
Check allowed inference engine parameters for the default vLLM engine version:
exo dedicated-inference deployment create --inference-engine-parameter-help
GPU Memory Errors
A deployment can run out of memory even when the model weights fit. Long context windows also reserve memory for the KV-cache. That memory can be large.
Try a lower context length:
--inference-engine-params '--max-model-len=32768'
If supported in the context you’re in, a lower-precision KV-cache can also reduce memory use:
--inference-engine-params '--kv-cache-dtype=fp8'
If the model still doesn’t fit, create a new deployment with more GPUs or a GPU type with more memory. --gpu-count and GPU type are fixed after creation. Scaling replicas helps concurrency, but it doesn’t give one request more GPU memory.
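To get a feel for how large the KV-cache can be, a rough estimate helps. The sketch below uses the standard per-sequence formula for transformer attention: every token stores one key and one value vector per KV head in each layer. The model shape in the example (32 layers, 8 KV heads, head dimension 128) is hypothetical and not taken from any specific model. Real usage also depends on how the engine allocates cache blocks, so treat the result as an order-of-magnitude figure only.

```shell
# Estimate KV-cache size in bytes for one sequence at a given context length:
#   bytes = 2 (key + value) * layers * kv_heads * head_dim * bytes_per_element * tokens
# Args: layers, KV heads, head dimension, bytes per element, context length in tokens.
kv_cache_bytes() {
  local layers="$1" kv_heads="$2" head_dim="$3" elem_bytes="$4" tokens="$5"
  echo $((2 * layers * kv_heads * head_dim * elem_bytes * tokens))
}

# HYPOTHETICAL model shape, for illustration only: 32 layers, 8 KV heads, head_dim 128.
kv_cache_bytes 32 8 128 2 32768   # 16-bit cache, 32768 tokens -> 4294967296 (4 GiB)
kv_cache_bytes 32 8 128 1 32768   # 8-bit cache, 32768 tokens  -> 2147483648 (2 GiB)
```

In this example, halving either the context length or the bytes per cache element halves the estimate, which is why the two parameters above are the first things to try.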
If the Model Is Not Created
A deployment needs a model in the created state.
Check models in the zone:
exo dedicated-inference model list -z "$ZONE"
If the model is still in the creating or downloading state, or is in the error state, fix the model import first. Common causes are a wrong Hugging Face model ID, missing access to a gated model, or unsupported model files.
For gated or private models, see Import Gated Models.
Request Errors
If the deployment is ready but requests fail, check the client setup.
| Error | Likely cause | What to do |
|---|---|---|
| 401 Unauthorized | Wrong deployment API key | Reveal the key again and update the client |
| 404 Not Found | Wrong endpoint URL or missing /v1 path | Check the URL with deployment show |
| 400 Bad Request | Invalid request body | Check the JSON payload and model name |
| 429 Too Many Requests | More traffic than the deployment can handle | Reduce traffic or add replicas |
| 500 | Runtime error during inference | Check state and logs |
Reveal the deployment API key:
exo dedicated-inference deployment reveal-api-key "$DEPLOYMENT" -z "$ZONE"
Use it as a bearer token:
Authorization: Bearer <api-key>
Dedicated Inference endpoints are OpenAI-compatible. Use the deployment URL as the base URL and include the /v1 path.
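A quick test request from the command line can separate client problems from deployment problems. The sketch below assumes the endpoint serves the usual OpenAI-style chat completions route under /v1. ENDPOINT_URL, API_KEY, and MODEL_NAME are placeholders you set yourself from the show and reveal-api-key output, and both function names are illustrative.

```shell
# Build the chat-completions URL from a deployment URL, adding /v1 only if it is missing.
chat_completions_url() {
  local base="${1%/}"            # drop one trailing slash, if present
  case "$base" in
    */v1) ;;                     # already ends with /v1
    *) base="$base/v1" ;;
  esac
  printf '%s/chat/completions\n' "$base"
}

# Send a minimal chat request; the HTTP status code is printed on the last line.
# ASSUMPTION: ENDPOINT_URL, API_KEY, and MODEL_NAME are placeholder variables
# that you export beforehand. They are not defined by this guide.
send_test_request() {
  curl -sS -w '\n%{http_code}\n' "$(chat_completions_url "$ENDPOINT_URL")" \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL_NAME\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}], \"max_tokens\": 8}"
}

# Usage: send_test_request
```

Compare the printed status code with the error table above: a 401 points at the key, a 404 at the URL, and a 400 at the payload or model name.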
Slow Responses
Slow responses usually come from long prompts, high max_tokens, too much traffic, or limited GPU memory headroom.
Add replicas when you need to handle more concurrent requests:
exo dedicated-inference deployment scale "$DEPLOYMENT" 3 -z "$ZONE"
Replicas add more running copies of the same deployment. They help with concurrency. They don’t make one heavy request faster.
Collect Details for Support
Before opening a ticket, collect state and logs:
exo dedicated-inference deployment show "$DEPLOYMENT" -z "$ZONE" > deployment-info.txt
exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE" --tail 500 > deployment-logs.txt
Don’t include deployment API keys, Hugging Face tokens, or other secrets.
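A pattern-based pass can catch the most obvious secrets before you attach the files. This is only a sketch: it masks values that follow "Bearer" and strings that look like Hugging Face tokens (the hf_ prefix). The format of deployment API keys isn't described here, so keys that appear without a "Bearer" prefix won't be caught. Review the files manually as well. The name redact_secrets is illustrative.

```shell
# Mask obvious secrets in text read from stdin.
# ASSUMPTION: bearer values follow the word "Bearer", and Hugging Face tokens
# start with "hf_". Other secret formats are NOT detected by this filter.
redact_secrets() {
  sed -E \
    -e 's#(Bearer[[:space:]]+)[A-Za-z0-9._~+/=-]+#\1<redacted>#g' \
    -e 's#hf_[A-Za-z0-9]{8,}#<redacted-hf-token>#g'
}

# Usage:
#   redact_secrets < deployment-logs.txt > deployment-logs.redacted.txt
```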
Open a support ticket through the Exoscale Portal.