Monitor and Troubleshoot

Use this guide when a deployment doesn’t become ready, requests fail, or you need logs for support.

A deployment starts from model files stored in zone-local Object Storage. At startup, the files are copied into the runtime, loaded by vLLM, and exposed through the deployment URL. The deployment state shows the phase. Logs show the cause.

Set these variables for the examples:

export ZONE="de-fra-1"
export DEPLOYMENT="ministral-3-8b-reasoning"

Check State

List deployments:

exo dedicated-inference deployment list -z "$ZONE"

Show one deployment:

exo dedicated-inference deployment show "$DEPLOYMENT" -z "$ZONE"

Use show to check the state, endpoint URL, model, GPU type, GPU count, replicas, and state details.

| State | Meaning | What to do |
| --- | --- | --- |
| creating | The request was accepted | Wait, then check again |
| preparing | Runtime resources are being prepared | Check logs if it stays there |
| deploying | vLLM is starting and loading the model | Check logs if it takes too long |
| ready | The endpoint can accept requests | Test the client request |
| scaling | Replicas are changing | Wait for scaling to finish |
| updating | Deployment settings are being applied | Wait, then check state again |
| error | The deployment failed | Check state details and logs |

Large models can spend more time in deploying, updating, or scaling when runtime replicas start or restart. This is expected while model files are copied and loaded into GPU memory.
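To avoid re-running show by hand, the state check can be wrapped in a polling loop. The following is a minimal sketch, not an official tool. The helper names get_state and wait_for_ready are hypothetical, and the sketch assumes that the exo global -O json output option works for this command and that the JSON contains a top-level "state" field. Adjust the parsing if your output differs.

```shell
# Hypothetical helper: print the current deployment state.
# ASSUMPTION: `-O json` is accepted here and the output has a "state" field.
get_state() {
  exo dedicated-inference deployment show "$1" -z "$2" -O json \
    | sed -n 's/.*"state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    | head -n 1
}

# Poll until the deployment is ready (return 0), in error (return 1),
# or the attempts run out (return 2).
# Usage: wait_for_ready NAME ZONE [ATTEMPTS] [DELAY_SECONDS]
wait_for_ready() {
  name="$1"; zone="$2"; attempts="${3:-60}"; delay="${4:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    state="$(get_state "$name" "$zone")"
    case "$state" in
      ready) echo "ready"; return 0 ;;
      error) echo "error" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timeout" >&2
  return 2
}
```

Calling wait_for_ready "$DEPLOYMENT" "$ZONE" checks every 30 seconds, up to 60 times. Large models may need a higher attempt count.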

View Logs

Show recent logs first:

exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE"

Fetch more logs when needed:

exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE" --tail 500

Look for:

  • CUDA out of memory
  • Python tracebacks
  • invalid vLLM parameters
  • model loading messages

Progress messages usually mean the deployment is still starting. Tracebacks, memory errors, and invalid parameter messages usually mean the vLLM inference engine parameters need to change.
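The checks in the list above can be scripted as a first pass over the log output. This is a rough sketch: classify_logs is a hypothetical helper, and the match strings are assumptions based on the usual PyTorch out-of-memory message, the standard Python traceback header, and the error text Python's argparse prints for bad arguments. The platform may surface these differently, so treat a "no-known-error" result as a prompt to read the logs yourself.

```shell
# Hypothetical helper: read log text on stdin and print one label.
# The out-of-memory check runs first because OOM failures usually
# also include a traceback.
classify_logs() {
  logs="$(cat)"
  case "$logs" in
    *"CUDA out of memory"*)
      echo "oom" ;;
    *"Traceback (most recent call last)"*)
      echo "traceback" ;;
    *"unrecognized arguments"*|*"error: argument"*)
      echo "invalid-parameter" ;;
    *)
      echo "no-known-error" ;;
  esac
}
```

To use it, pipe the logs command into the function, for example by appending | classify_logs to the --tail 500 command shown earlier.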

Startup Issues

| Symptom | Likely cause | What to do |
| --- | --- | --- |
| Stuck in preparing | Runtime setup, quota, capacity, or GPU authorization issue | Check state details and available instance types |
| Stuck in deploying with progress logs | Large model or slow model load | Wait, then check logs again |
| CUDA out of memory | Model, context length, or KV-cache doesn’t fit | Tune your vLLM engine parameters to fit the available CUDA memory, or use more GPUs or a GPU class with more VRAM |
| Invalid vLLM parameter | Parameter isn’t allowed for the selected engine version | Check parameter help and remove it |
| Quota or capacity error | Not enough GPU quota or capacity in the zone | Check quota and GPU availability |
| error without a clear log line | The failure happened before or outside vLLM startup | Collect state details and logs for support |

If you don’t have enough quota, or the GPU type you want isn’t available in your preferred zone, open a support ticket to discuss options for making that GPU available in that zone.

Check available GPU types and authorization:

exo dedicated-inference deployment instance-type -z "$ZONE"

Check allowed inference engine parameters for the default vLLM engine version:

exo dedicated-inference deployment create --inference-engine-parameter-help

GPU Memory Errors

A deployment can run out of memory even when the model weights fit. Long context windows also need memory for the KV-cache, which grows with the context length and can take a large share of GPU memory.

Try a lower context length:

--inference-engine-params '--max-model-len=32768'

If the selected engine version and GPU support it, a lower-precision KV-cache can also reduce memory use:

--inference-engine-params '--kv-cache-dtype=fp8'
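The effect of both settings can be estimated with simple arithmetic. For a standard transformer, one sequence needs roughly 2 × layers × KV heads × head dimension × tokens × bytes per value for its KV-cache (the factor 2 covers keys and values). The sketch below uses a hypothetical helper, kv_cache_mib, and the model shape in the examples (32 layers, 8 KV heads, head dimension 128) is an illustrative assumption, not the shape of any specific model. vLLM manages its cache as a preallocated pool, so treat the result as a rough lower bound for one full-length sequence.

```shell
# Hypothetical helper: estimate KV-cache size in MiB for one sequence.
# Usage: kv_cache_mib LAYERS KV_HEADS HEAD_DIM TOKENS BYTES_PER_VALUE
kv_cache_mib() {
  # 2 = one key tensor plus one value tensor per layer
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 / 1048576 ))
}

# Illustrative shape only (assumed, not taken from a real model card):
# kv_cache_mib 32 8 128 131072 2   # 128k tokens, 16-bit cache
# kv_cache_mib 32 8 128 32768 2    # 32k tokens, 16-bit cache
# kv_cache_mib 32 8 128 32768 1    # 32k tokens, fp8 cache
```

With this assumed shape, the three calls print 16384, 4096, and 2048: lowering the context length from 131072 to 32768 tokens cuts the estimate to a quarter, and an fp8 cache halves it again.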

If the model still doesn’t fit, create a new deployment with more GPUs or a GPU type with more memory. --gpu-count and GPU type are fixed after creation. Scaling replicas helps concurrency, but it doesn’t give one request more GPU memory.

If the Model Is Not Created

A deployment needs a model in the created state.

Check models in the zone:

exo dedicated-inference model list -z "$ZONE"

If the model is still in the creating or downloading state, or is in the error state, resolve the model import before creating the deployment. Common causes of a failed import are a wrong Hugging Face model ID, missing access to a gated model, or unsupported model files.

For gated or private models, see Import Gated Models.

Request Errors

If the deployment is ready but requests fail, check the client setup.

| Error | Likely cause | What to do |
| --- | --- | --- |
| 401 Unauthorized | Wrong deployment API key | Reveal the key again and update the client |
| 404 Not Found | Wrong endpoint URL or missing /v1 path | Check the URL with deployment show |
| 400 Bad Request | Invalid request body | Check the JSON payload and model name |
| 429 Too Many Requests | More traffic than the deployment can handle | Reduce traffic or add replicas |
| 500 | Runtime error during inference | Check state and logs |

Reveal the deployment API key:

exo dedicated-inference deployment reveal-api-key "$DEPLOYMENT" -z "$ZONE"

Use it as a bearer token:

Authorization: Bearer <api-key>

Dedicated Inference endpoints are OpenAI-compatible. Use the deployment URL as the base URL and include the /v1 path.
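A quick way to rule out client library issues is to send one request with curl. The sketch below assumes the standard OpenAI-style chat completions route under /v1. chat_request is a hypothetical helper, and the model name to pass is whatever your deployment expects. The prompt is inserted into the JSON as-is, so keep it free of double quotes and backslashes.

```shell
# Hypothetical helper: send one small chat completion request.
# Usage: chat_request DEPLOYMENT_URL API_KEY MODEL PROMPT
chat_request() {
  base_url="$1"; api_key="$2"; model="$3"; prompt="$4"
  # Strip a trailing slash, then append the /v1 route.
  curl -sS "${base_url%/}/v1/chat/completions" \
    -H "Authorization: Bearer ${api_key}" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"${prompt}\"}], \"max_tokens\": 32}"
}
```

Pass the endpoint URL from deployment show as the first argument and the revealed key as the second. A JSON response with a choices field suggests the URL, key, and model name are correct. An error response can be matched against the table above.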

Slow Responses

Slow responses usually come from long prompts, high max_tokens, too much traffic, or limited GPU memory headroom.

Add replicas when you need to handle more concurrent requests:

exo dedicated-inference deployment scale "$DEPLOYMENT" 3 -z "$ZONE"

Replicas add more running copies of the same deployment. They help with concurrency. They don’t make one heavy request faster.

Collect Details for Support

Before opening a ticket, collect state and logs:

exo dedicated-inference deployment show "$DEPLOYMENT" -z "$ZONE" > deployment-info.txt
exo dedicated-inference deployment logs "$DEPLOYMENT" -z "$ZONE" --tail 500 > deployment-logs.txt

Don’t include deployment API keys, Hugging Face tokens, or other secrets.
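A simple filter can catch the most obvious secrets before you attach the files. This is a sketch under stated assumptions: redact_secrets is a hypothetical helper, it only matches values that follow "Bearer " and strings that start with the Hugging Face hf_ token prefix, and it can't recognize a deployment API key that appears in any other form. Review the files by hand as well.

```shell
# Hypothetical helper: mask bearer tokens and hf_ tokens on stdin.
# ASSUMPTION: these two patterns cover the secrets likely to appear;
# anything else must be removed manually.
redact_secrets() {
  sed -E \
    -e 's#(Bearer )[A-Za-z0-9._~+/=-]+#\1[REDACTED]#g' \
    -e 's#hf_[A-Za-z0-9]+#[REDACTED]#g'
}
```

For example, pipe the logs command through the filter before redirecting it to deployment-logs.txt.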

Open a support ticket through the Exoscale Portal.
