Deploy a Gated Model
Many popular AI models on Hugging Face, such as Meta’s Llama series, are “gated”—meaning you must accept license terms and authenticate before downloading them. This guide explains how to deploy gated models with Exoscale Dedicated Inference.
What Are Gated Models?
Gated models require users to:
- Create a Hugging Face account
- Visit the model page and accept its license terms
- Generate an access token for authentication
Common gated model families include:
- Meta Llama (Llama 3.2, Llama 3.1, Llama 2)
- Mistral Large models
- Stable Diffusion models
- Many research and enterprise models
Prerequisites
Before deploying a gated model, ensure you have:
- A Hugging Face account
- License acceptance for the specific model you want to deploy
- A Hugging Face read-access token
- Exoscale CLI configured with appropriate permissions
- Sufficient GPU quota in your Exoscale organization
Step 1: Create a Hugging Face Account
If you don’t already have one:
- Go to https://huggingface.co/join
- Complete the registration process
- Verify your email address
Step 2: Accept the Model License
Navigate to the model page on Hugging Face and accept its license:
- Visit the model page (e.g., meta-llama/Llama-3.2-1B-Instruct)
- Read the license terms
- Click the “Accept” or “Agree” button
- Wait for access to be granted (usually instant, but can take a few minutes)
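Whether access has actually been granted can also be checked programmatically via the public Hugging Face Hub API. The sketch below is an assumption about that API, not part of the Exoscale workflow: fetching a gated model's metadata returns 200 once you have access, and 401/403 otherwise.

```shell
# Interpret the HTTP status returned when fetching gated-model metadata.
# 401/403 typically means the license has not been accepted yet or the
# token is wrong; 200 means the model is visible to your account.
explain_hf_status() {
  case "$1" in
    200)     echo "access granted" ;;
    401|403) echo "no access yet: accept the license and check your token" ;;
    *)       echo "unexpected status: $1" ;;
  esac
}

# Real check (requires network and a token; model ID is an example):
# status=$(curl -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Bearer $HF_TOKEN" \
#   https://huggingface.co/api/models/meta-llama/Llama-3.2-1B-Instruct)
# explain_hf_status "$status"
```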
Step 3: Generate a Hugging Face Access Token
Create a read-access token for authentication:
- Log in to your Hugging Face account
- Go to Settings → Access Tokens
- Click New token
- Configure your token:
- Name: Give it a descriptive name (e.g., “exoscale-dedicated-inference”)
- Role: Select Read (write access is not needed)
- Click Generate token
- Copy the token immediately—you won’t be able to see it again
Your token will look like: hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Security Note: Treat this token as a password. Do not commit it to version control or share it publicly.
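Before wiring the token into automation, a quick local sanity check can catch copy-paste mistakes. The helper below is a hypothetical sketch: it only verifies the expected hf_ prefix and a plausible length, and cannot tell whether the token is actually valid.

```shell
# Hypothetical helper: check that a value at least looks like a Hugging Face
# token (starts with hf_ and is reasonably long) before passing it on.
# This is a format check only; it does not verify the token against the API.
looks_like_hf_token() {
  case "$1" in
    hf_??????????*) return 0 ;;
    *)              return 1 ;;
  esac
}

looks_like_hf_token "${HF_TOKEN:-}" || echo "HF_TOKEN does not look like a token" >&2
```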
Step 4: Create the Model in Exoscale
Use the --huggingface-token flag when creating the model:
exo dedicated-inference model create <huggingface-model-id> \
--huggingface-token <your-hf-token> \
-z <zone>

Example: Deploying Llama 3.2
exo dedicated-inference model create meta-llama/Llama-3.2-1B-Instruct \
--huggingface-token hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
-z at-vie-2

The model will be downloaded from Hugging Face and stored in Exoscale Object Storage. This may take several minutes depending on model size.
Check Model Status
Monitor the model creation progress:
exo dedicated-inference model list -z at-vie-2

Wait until the status shows created before proceeding.
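For scripting, the wait can be automated with a small polling loop. The helper below is a sketch (function name, interval, and timeout are arbitrary choices): it reruns a status command until its output contains the expected word.

```shell
# Poll a command until its output contains the wanted status word,
# checking every 10 seconds and giving up after 60 attempts (~10 minutes).
wait_for_status() {
  local want="$1"; shift
  local tries=0
  until "$@" 2>/dev/null | grep -q "$want"; do
    tries=$((tries + 1))
    if [ "$tries" -ge 60 ]; then
      return 1
    fi
    sleep 10
  done
}

# Usage (runs against your Exoscale organization):
# wait_for_status created exo dedicated-inference model list -z at-vie-2
```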
Step 5: Deploy the Model
Once the model is available, create a deployment as usual:
exo dedicated-inference deployment create llama-deployment \
--model-name meta-llama/Llama-3.2-1B-Instruct \
--gpu-type gpua5000 \
--gpu-count 1 \
--replicas 1 \
-z at-vie-2

The Hugging Face token is only needed during model creation, not for deployment.
Step 6: Verify the Deployment
Check deployment status:
exo dedicated-inference deployment show llama-deployment -z at-vie-2

Wait for the status ready, then retrieve the API key:
exo dedicated-inference deployment reveal-api-key llama-deployment -z at-vie-2

Step 7: Test the Endpoint
Test your gated model deployment:
curl -X POST <endpoint-url>/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <api-key>" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [
{"role": "user", "content": "Explain what a gated model is in one sentence."}
],
"max_tokens": 100
}'

Troubleshooting
Error: “Access Denied” or “401 Unauthorized”
Cause: Invalid or expired Hugging Face token, or license not accepted.
Solution:
- Verify you accepted the model license on Hugging Face
- Generate a new access token with Read permissions
- Ensure the token is correctly copied (no extra spaces)
- Wait a few minutes after accepting the license
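To rule out the token itself, the Hugging Face whoami endpoint reports which account a token belongs to. The parsing helper below is a sketch that assumes python3 is available; it is not part of the Exoscale tooling.

```shell
# Extract the account name from the JSON that the whoami endpoint returns;
# an invalid token produces an error object without a "name" field.
whoami_name() {
  python3 -c 'import json,sys; print(json.load(sys.stdin).get("name", "INVALID TOKEN"))'
}

# Real check (requires network):
# curl -s -H "Authorization: Bearer $HF_TOKEN" \
#   https://huggingface.co/api/whoami-v2 | whoami_name
```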
Error: “Model Not Found”
Cause: Model identifier is incorrect or model is private.
Solution:
- Verify the exact model ID from the Hugging Face page (e.g., meta-llama/Llama-3.2-1B-Instruct)
- Check that you have access to the model on Hugging Face
- Ensure the model is publicly available (even if gated)
Model Creation Takes Too Long
Cause: Large models can take significant time to download.
Solution:
- Check model status with exo dedicated-inference model list -z <zone>
- Very large models (100+ GB) may require extended download times
- If the model remains in creating for an extended period, delete and recreate it, or contact support
Best Practices
- Store Tokens Securely
- Use environment variables or secret management systems instead of hardcoding tokens.
# Store token in environment variable
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Use in CLI command
exo dedicated-inference model create meta-llama/Llama-3.2-1B-Instruct \
--huggingface-token "$HF_TOKEN" \
-z at-vie-2

- Rotate Tokens Regularly
- Generate new tokens periodically and revoke old ones from your Hugging Face settings.
- Use Read-Only Tokens
- Never use write or admin tokens for model downloads—read access is sufficient.
- Verify License Compliance
- Ensure your use case complies with the model’s license terms (commercial use, attribution, etc.).
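If scripts around these commands need to log anything about the token at all, printing only a masked form avoids leaking it. A minimal sketch, using bash-specific substring expansion:

```shell
# Print a masked form of a secret: first 6 and last 4 characters only.
mask_token() {
  local t="$1"
  printf '%s...%s\n' "${t:0:6}" "${t: -4}"
}

mask_token "hf_abcdefghijklmnopqrstuvwxyz1234"   # prints hf_abc...1234
```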