Running LLMs on GPU instances

For a quick and easy setup, we’ll run a containerized vLLM server on one of Exoscale’s GPU instances. This guide shows an “advanced-but-practical” setup you can adapt for production.

Alternatively, if you just want something running quickly, you can install Ollama directly on the instance instead of vLLM; see the note in the Access the instance using SSH section below.

Provisioning a GPU Instance

Request GPU Quota

GPU instance types (e.g., GPU A30, A5000, or A40) require a granted quota before they can be launched.

You can request access via the Exoscale Portal:

  1. Go to the Create Instance view.
  2. Find your GPU instance type of choice.
  3. Select the zone where it is available.
  4. Click Enable for that type.

Make sure your overall GPU quota is sufficient; you can request additional quota under Organization → Quotas.

Once granted, the quota shows up under Organization → Quotas → GPUs.

Launch GPU instance

To list available GPU instance-types, execute the following command:

exo compute instance-type list | grep gpu

Create a GPU instance with 24 GB of VRAM (e.g., Small GPUA30 in ch-gva-2 or Small GPUA5000 in at-vie-2):

exo compute instance create llm-a30 \
 --instance-type gpua30.small \
 --zone ch-gva-2 \
 --ssh-key <my_ssh_key> \
 --security-group <my_security_group> \
 --template "Linux Ubuntu 24.04 LTS 64-bit"

Depending on your needs, adjust for A5000 or A40 by using types gpua5000.medium or gpu3.medium (and the corresponding zone).
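
For example, a variant of the same command for an A5000 instance in at-vie-2 might look like the following; the exact type name (assumed here to be gpua5000.small) should be confirmed against the instance-type listing above:

exo compute instance create llm-a5000 \
 --instance-type gpua5000.small \
 --zone at-vie-2 \
 --ssh-key <my_ssh_key> \
 --security-group <my_security_group> \
 --template "Linux Ubuntu 24.04 LTS 64-bit"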

Prepare VM Environment

Access the instance using SSH

ssh root@<public-ip>

For a simplified setup, the instance is ready to run Ollama at this point:

curl -fsSL https://ollama.com/install.sh | sh
ollama run llama3.3:latest

Note, however, that Ollama is not considered production-ready or scalable; the remainder of this guide therefore uses vLLM.
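
If you go this route, Ollama also exposes a local HTTP API (on port 11434 by default), so you can query the model without the interactive CLI:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3:latest",
  "prompt": "What is the capital of Switzerland?",
  "stream": false
}'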

Install Docker & NVIDIA support

# install docker
curl -fsSL https://get.docker.com | sh
# add nvidia production package repository
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | 
    gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && 
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# install nvidia drivers
apt update
apt install -y nvidia-driver-570 nvidia-container-toolkit
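# if nvidia-smi fails on the host after installation, a reboot may be required to load the new driver module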
systemctl restart docker
# Validate NVIDIA runtime is available:
docker run --rm --gpus all nvidia/cuda:12.9.1-base-ubuntu24.04 nvidia-smi

Run vLLM Inference Server

Three steps for running an OpenAI-compatible vLLM Inference Server:

  1. Get a Hugging Face Token
  2. Deploy via Docker Compose
  3. Test Inference

Get Hugging Face Token

Create/retrieve a Hugging Face Token with read access to your model: https://huggingface.co/settings/tokens
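
Optionally (not required by the setup below), you can avoid hard-coding the token in docker-compose.yml by exporting it in your shell and letting Docker Compose interpolate it:

export HF_TOKEN=<HF_TOKEN>   # the token created above

In that case, set HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} in the environment section instead of pasting the literal token.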

Deploy via Docker Compose

Write the following content to a file called docker-compose.yml, replacing <HF_TOKEN> with the token created above:

services:
  vllm:
    image: vllm/vllm-openai:gptoss
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ports:
      - "8000:8000"
    ipc: host
    environment:
      - HUGGING_FACE_HUB_TOKEN=<HF_TOKEN>
      - NVIDIA_VISIBLE_DEVICES=all
      - VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --host 0.0.0.0 --port 8000
      --model openai/gpt-oss-20b
      --dtype auto
      --trust-remote-code
      --tensor-parallel-size 1
      --max-num-seqs 64

Launch:

docker compose up -d
docker compose logs -f vllm

vLLM will then download the model and prepare it for your GPU, which can take a few minutes. Once you see log lines like the following, your server is ready:

(APIServer pid=1) INFO 08-21 08:31:01 [api_server.py:1857] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 08-21 08:31:01 [launcher.py:29] Available routes are:
(APIServer pid=1) INFO 08-21 08:31:01 [launcher.py:37] Route: /openapi.json, Methods: HEAD, GET
...
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
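
Before sending a full chat request, you can optionally verify that the server responds by listing the served models via the OpenAI-compatible /v1/models route:

curl localhost:8000/v1/models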

Test Inference

curl -X POST localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
  "model":"openai/gpt-oss-20b",
  "messages":[{"role":"user","content":"What is the capital of Switzerland?"}],
  "max_tokens":50
}'
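
The same request also works from your own machine against the instance's public IP, provided your security group allows inbound TCP traffic on port 8000 (see the considerations below):

curl -X POST http://<public-ip>:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{"model":"openai/gpt-oss-20b","messages":[{"role":"user","content":"Hello"}],"max_tokens":20}'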

Considerations

  • Use nvidia-smi to monitor utilization during load testing (see the example after this list).
  • Ensure you have enough GPU memory to avoid OOM.
  • Ports, instance stopping/starting, and SSH key handling are best automated via Terraform or scripts.
  • Be careful not to expose your endpoint to the public internet without additional security and protections (for example, restrictive security group rules, authentication, and TLS).
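
For the first point, one simple way to watch GPU utilization and memory while sending test traffic (one possible nvidia-smi invocation; adjust the interval as needed):

# refresh GPU utilization and memory figures every second
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1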