Deploy a Gated Model
Many popular AI models on Hugging Face, such as Meta’s Llama series, are “gated”—meaning you must accept license terms and authenticate before downloading them. This guide explains how to deploy gated models with Exoscale Dedicated Inference.
What Are Gated Models?
Gated models require users to:
- Create a Hugging Face account
- Visit the model page and accept its license terms
- Generate an access token for authentication
Common gated model families include:
- Meta Llama (Llama 3.2, Llama 3.1, Llama 2)
- Mistral Large models
- Stable Diffusion models
- Many research and enterprise models
Prerequisites
Before deploying a gated model, ensure you have:
- A Hugging Face account
- License acceptance for the specific model you want to deploy
- A Hugging Face read-access token
- Exoscale CLI configured with appropriate permissions
- Sufficient GPU quota in your Exoscale organization
Step 1: Create a Hugging Face Account
If you don’t already have one:
- Go to https://huggingface.co/join
- Complete the registration process
- Verify your email address
Step 2: Accept the Model License
Navigate to the model page on Hugging Face and accept its license:
- Visit the model page (e.g., meta-llama/Llama-3.2-1B-Instruct)
- Read the license terms
- Click the “Accept” or “Agree” button
- Wait for access to be granted (usually instant, but can take a few minutes)
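Whether access has actually been granted can also be checked programmatically via the public Hugging Face Hub API. The sketch below is an assumption about that API, not part of the Exoscale workflow: fetching a gated model's metadata returns 200 once you have access, and 401/403 otherwise.

```shell
# Interpret the HTTP status returned when fetching gated-model metadata.
# 401/403 typically means the license has not been accepted yet or the
# token is wrong; 200 means the model is visible to your account.
explain_hf_status() {
  case "$1" in
    200)     echo "access granted" ;;
    401|403) echo "no access yet: accept the license and check your token" ;;
    *)       echo "unexpected status: $1" ;;
  esac
}

# Real check (requires network and a token; model ID is an example):
# status=$(curl -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Bearer $HF_TOKEN" \
#   https://huggingface.co/api/models/meta-llama/Llama-3.2-1B-Instruct)
# explain_hf_status "$status"
```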
Step 3: Generate a Hugging Face Access Token
Create a read-access token for authentication:
- Log in to your Hugging Face account
- Go to Settings → Access Tokens
- Click New token
- Configure your token:
- Name: Give it a descriptive name (e.g., “exoscale-dedicated-inference”)
- Role: Select Read (write access is not needed)
- Click Generate token
- Copy the token immediately—you won’t be able to see it again
Your token will look like: hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Security Note: Treat this token as a password. Do not commit it to version control or share it publicly.
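Before wiring the token into automation, a quick local sanity check can catch copy-paste mistakes. The helper below is a hypothetical sketch: it only verifies the expected hf_ prefix and a plausible length, and cannot tell whether the token is actually valid.

```shell
# Hypothetical helper: check that a value at least looks like a Hugging Face
# token (starts with hf_ and is reasonably long) before passing it on.
# This is a format check only; it does not verify the token against the API.
looks_like_hf_token() {
  case "$1" in
    hf_??????????*) return 0 ;;
    *)              return 1 ;;
  esac
}

looks_like_hf_token "${HF_TOKEN:-}" || echo "HF_TOKEN does not look like a token" >&2
```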
Step 4: Create the Model in Exoscale
Use the --huggingface-token flag when creating the model:
exo dedicated-inference model create <huggingface-model-id> \
--huggingface-token <your-hf-token> \
-z <zone>

Example: Deploying Llama 3.2
exo dedicated-inference model create meta-llama/Llama-3.2-1B-Instruct \
--huggingface-token hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
-z at-vie-2

The model will be downloaded from Hugging Face and stored in Exoscale Object Storage. This may take several minutes depending on model size.
Check Model Status
Monitor the model creation progress:
exo dedicated-inference model list -z at-vie-2

Wait until the status shows created before proceeding.
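For scripting, the wait can be automated with a small polling loop. The helper below is a sketch (function name, interval, and timeout are arbitrary choices): it reruns a status command until its output contains the expected word.

```shell
# Poll a command until its output contains the wanted status word,
# checking every 10 seconds and giving up after 60 attempts (~10 minutes).
wait_for_status() {
  local want="$1"; shift
  local tries=0
  until "$@" 2>/dev/null | grep -q "$want"; do
    tries=$((tries + 1))
    if [ "$tries" -ge 60 ]; then
      return 1
    fi
    sleep 10
  done
}

# Usage (runs against your Exoscale organization):
# wait_for_status created exo dedicated-inference model list -z at-vie-2
```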
Step 5: Deploy the Model
Once the model is available, create a deployment as usual:
exo dedicated-inference deployment create llama-deployment \
--model-name meta-llama/Llama-3.2-1B-Instruct \
--gpu-type gpua5000 \
--gpu-count 1 \
--replicas 1 \
-z at-vie-2

The Hugging Face token is only needed during model creation, not for deployment.
Step 6: Verify the Deployment
Check deployment status:
exo dedicated-inference deployment show llama-deployment -z at-vie-2

Wait for the status ready, then retrieve the API key:
exo dedicated-inference deployment reveal-api-key llama-deployment -z at-vie-2

Step 7: Test the Endpoint
Test your gated model deployment:
curl -X POST <endpoint-url>/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <api-key>" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"messages": [
{"role": "user", "content": "Explain what a gated model is in one sentence."}
],
"max_tokens": 100
}'

Troubleshooting
Error: “Access Denied” or “401 Unauthorized”
Cause: Invalid or expired Hugging Face token, or license not accepted.
Solution:
- Verify you accepted the model license on Hugging Face
- Generate a new access token with Read permissions
- Ensure the token is correctly copied (no extra spaces)
- Wait a few minutes after accepting the license
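To rule out the token itself, the Hugging Face whoami endpoint reports which account a token belongs to. The parsing helper below is a sketch that assumes python3 is available; it is not part of the Exoscale tooling.

```shell
# Extract the account name from the JSON that the whoami endpoint returns;
# an invalid token produces an error object without a "name" field.
whoami_name() {
  python3 -c 'import json,sys; print(json.load(sys.stdin).get("name", "INVALID TOKEN"))'
}

# Real check (requires network):
# curl -s -H "Authorization: Bearer $HF_TOKEN" \
#   https://huggingface.co/api/whoami-v2 | whoami_name
```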
Error: “Model Not Found”
Cause: Model identifier is incorrect or model is private.
Solution:
- Verify the exact model ID from the Hugging Face page (e.g., meta-llama/Llama-3.2-1B-Instruct)
- Check that you have access to the model on Hugging Face
- Ensure the model is publicly available (even if gated)
Model Creation Takes Too Long
Cause: Large models can take significant time to download.
Solution:
- Check model status with exo dedicated-inference model list -z <zone>
- Very large models (100+ GB) may require extended download times
- If the model remains in creating for an extended period, delete and recreate it, or contact support
Best Practices
- Store Tokens Securely
- Use environment variables or secret management systems instead of hardcoding tokens.
# Store token in environment variable
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# Use in CLI command
exo dedicated-inference model create meta-llama/Llama-3.2-1B-Instruct \
--huggingface-token "$HF_TOKEN" \
-z at-vie-2

- Rotate Tokens Regularly
- Generate new tokens periodically and revoke old ones from your Hugging Face settings.
- Use Read-Only Tokens
- Never use write or admin tokens for model downloads—read access is sufficient.
- Verify License Compliance
- Ensure your use case complies with the model’s license terms (commercial use, attribution, etc.).
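If scripts around these commands need to log anything about the token at all, printing only a masked form avoids leaking it. A minimal sketch, using bash-specific substring expansion:

```shell
# Print a masked form of a secret: first 6 and last 4 characters only.
mask_token() {
  local t="$1"
  printf '%s...%s\n' "${t:0:6}" "${t: -4}"
}

mask_token "hf_abcdefghijklmnopqrstuvwxyz1234"   # prints hf_abc...1234
```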