# Monitor and Troubleshoot Deployments
This guide covers how to monitor deployments, interpret logs, and resolve common issues.
## Monitoring Deployment Status

### List All Deployments
Get an overview of all deployments in a zone:
```
exo dedicated-inference deployment list -z at-vie-2
```

Output includes:
- Deployment name
- Model name
- GPU type and count
- Replica count
- Current status
### Check Specific Deployment
View detailed information about a single deployment:
```
exo dedicated-inference deployment show my-app -z at-vie-2
```

Key information:
- Deployment URL (endpoint for inference requests)
- Status (`deploying`, `ready`, `scaling`)
- GPU configuration (type, count per instance)
- Replica count (current and target)
- Creation and update timestamps
### Understanding Deployment States
| Status | Description | Expected Duration |
|---|---|---|
| `deploying` | Resources are being provisioned and the model is loading | 3-5 minutes (longer for large models) |
| `ready` | Deployment is running and accepting requests | Stable state |
| `scaling` | Replicas are being added or removed | 3-5 minutes per replica |
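If a script needs to wait for a deployment to reach `ready`, one option is to poll `deployment show` in a loop. This is a minimal sketch, assuming the status keyword appears in the plain-text output of `show`; adjust the match to the actual output format:

```
# Poll the deployment status every 30 seconds until it reports "ready".
# Assumes the status keyword appears in the plain-text output of `show`.
while ! exo dedicated-inference deployment show my-app -z at-vie-2 | grep -q ready; do
  echo "Not ready yet, waiting..."
  sleep 30
done
echo "Deployment is ready."
```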
## Viewing Deployment Logs
Logs provide detailed information about deployment health, model loading, and inference requests.
### Access Logs
```
exo dedicated-inference deployment logs my-app -z at-vie-2
```

### What Logs Show
- **Model Loading**: Track progress as the model downloads from Object Storage and loads into GPU memory.
- **GPU Memory Usage**: Identify if your model fits within available GPU memory.
- **Inference Requests**: Monitor incoming requests and response times.
- **Errors and Warnings**: Diagnose failures, misconfigurations, or resource constraints.
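When scanning long logs, you can pipe the output through standard shell tools to surface problems faster. A minimal sketch, assuming the logs are printed to stdout:

```
# Show only error and warning lines from the deployment logs
exo dedicated-inference deployment logs my-app -z at-vie-2 | grep -iE "error|warn"
```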
## Common Issues
Check logs first for any issue:
```
exo dedicated-inference deployment logs my-app -z at-vie-2

# Show last 100 lines
exo dedicated-inference deployment logs my-app -z at-vie-2 --tail 100
```

### Deployment stuck in `deploying`
| Error | Cause | Solution |
|---|---|---|
| Out of memory / CUDA out of memory | Model too large for GPU | Increase `--gpu-count` or use a larger GPU (requires a new deployment) |
| Failed to download model | Model name incorrect or connectivity issue | Verify model name, delete and recreate model |
| Quota exceeded | Insufficient GPU quota | Request increase via Portal (Organization → Quotas) |
### Inference requests failing
| Error | Cause | Solution |
|---|---|---|
| 401 Unauthorized | Invalid API key | Run `reveal-api-key` and update client |
| 404 Not Found | Wrong endpoint URL | Ensure URL includes `/v1` path |
| 400 Bad Request | Malformed request body | Check OpenAI API format |
| 500 errors | Deployment not ready | Wait for `ready` status |
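To rule out client-side issues, it can help to send a minimal request directly with curl. This is a sketch of an OpenAI-compatible chat completions call; `DEPLOYMENT_URL`, `API_KEY`, and the model name are placeholders, so substitute the values reported by `deployment show` and `reveal-api-key`:

```
# Minimal test request against the deployment endpoint.
# Placeholders: set DEPLOYMENT_URL and API_KEY, and use your deployed model's name.
curl "$DEPLOYMENT_URL/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

A successful completion confirms the endpoint, key, and request format; a 401 or 404 points back to the table above.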
### Slow performance
- High concurrency: Add replicas to handle more parallel requests:

  ```
  exo dedicated-inference deployment scale my-app 3 -z at-vie-2
  ```

- Large contexts or high parallelism: Increase `--gpu-count` per instance for a larger KV-cache (requires a new deployment).
- See the Optimize Performance guide.
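Before adding replicas or GPUs, it can be worth measuring how long a single request actually takes from your client. A rough sketch using curl's timing output; `DEPLOYMENT_URL`, `API_KEY`, and the model name are placeholders:

```
# Time a single inference request end to end (placeholder URL, key, and model name)
curl -o /dev/null -s -w "total: %{time_total}s\n" \
  "$DEPLOYMENT_URL/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'
```

If a single isolated request is already slow, adding replicas will not help; that points toward more GPU memory per instance instead.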
### Model creation failed
| Error | Cause | Solution |
|---|---|---|
| Format error | Model is not in safetensors format | Use a model in safetensors format |
| Access denied | Gated model, missing token | See Deploy Gated Model guide |
| Not found | Wrong model ID | Verify ID on Hugging Face |
To recover, delete the failed model and recreate it with the correct parameters.
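For illustration only, the recovery could look like the sketch below; the `model delete` and `model create` subcommand names and parameters are assumptions, so confirm the actual syntax with `exo dedicated-inference --help`:

```
# Assumed subcommand names and parameters -- verify with `exo dedicated-inference --help`
exo dedicated-inference model delete my-model -z at-vie-2
exo dedicated-inference model create my-model <corrected parameters> -z at-vie-2
```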
### High costs
Scale to zero when idle:
```
exo dedicated-inference deployment scale my-app 0 -z at-vie-2
```

See the Optimize Costs guide.
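If idle periods are predictable, the scale commands can also be scheduled, for example with cron on a machine where the exo CLI is configured. The schedule and replica counts below are purely illustrative:

```
# Illustrative crontab entries: scale to zero at 20:00, back to one replica at 07:00
0 20 * * * exo dedicated-inference deployment scale my-app 0 -z at-vie-2
0 7  * * * exo dedicated-inference deployment scale my-app 1 -z at-vie-2
```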
## Getting Help
Before contacting support, collect diagnostic info:
```
exo dedicated-inference deployment show my-app -z at-vie-2 > deployment-info.txt
exo dedicated-inference deployment logs my-app -z at-vie-2 > deployment-logs.txt
```

Open a support ticket via the Exoscale Portal.
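Optionally, bundle the collected files into a single archive to attach to the ticket:

```
# Package the diagnostic files for the support ticket
tar czf deployment-diagnostics.tar.gz deployment-info.txt deployment-logs.txt
```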
## Next Steps