vLLM Deployment on Bare Metal GPUs
1. Overview
This guide walks you through deploying an OpenAI-compatible LLM inference API using vLLM on SubstrateAI infrastructure.
You'll run the vLLM server in Docker with GPU acceleration, place NGINX in front as a reverse proxy, and optionally enable HTTPS with Certbot. The result is a production-style API endpoint you can hit with standard OpenAI SDKs or curl.
Objectives
- Launch vLLM with an OpenAI-compatible HTTP API
- Expose it via NGINX at a clean domain (e.g., api.substrateai.net)
- Optionally enable TLS with Let's Encrypt (Certbot)
Architecture Overview
| Layer | Purpose | Port |
|---|---|---|
| vLLM container | OpenAI-compatible inference server | 8000 (container) |
| Docker host | Port map to host | 8011 (host) |
| NGINX | Reverse proxy / public endpoint | 80 / 443 |
| Certbot (optional) | HTTPS certificates | — |
| Cloudflare DNS (optional) | Domain → server IP | — |
2. Environment Prerequisites
| Requirement | Recommended |
|---|---|
| OS | Ubuntu 20.04 / 22.04 |
| GPU Driver | NVIDIA ≥ 535 |
| Docker | Latest (supports --gpus all) |
| Internet | Required for image/model downloads |
| DNS | A-record (e.g., api.substrateai.net → server IP) |
Quick sanity check (GPU):
nvidia-smi
3. (If needed) Install Docker Engine
Skip if Docker is already installed.
sudo apt update -y
sudo apt install -y ca-certificates curl gnupg lsb-release
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release; echo $VERSION_CODENAME) stable" \
| sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt update -y
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl enable --now docker
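Optional quick check that the engine is working (hello-world is a tiny public test image):
sudo docker run --rm hello-world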
4. Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sudo sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Validate GPU inside containers:
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
5. Run the vLLM Inference Server
Launch your model as an OpenAI-compatible API in Docker:
sudo docker run -d --gpus all -p 8011:8000 \
--name vllm-mixtral \
vllm/vllm-openai:latest \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--tensor-parallel-size 8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.95
Key Flags Explained
| Flag | Description |
|---|---|
| -d | Run container detached (keeps running after logout) |
| --gpus all | Expose all GPUs to the container |
| -p 8011:8000 | Map container port 8000 → host 8011 |
| --tensor-parallel-size | Split model across N GPUs |
| --max-model-len | Maximum context window (tokens) |
| --gpu-memory-utilization | Fraction of GPU VRAM the server can use |
Check container:
sudo docker ps -a
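On first start the container downloads the model weights from Hugging Face, which can take a while; follow the startup logs until the server reports it is listening:
sudo docker logs -f vllm-mixtral
If the model repository requires authentication, you may also need to pass a Hugging Face token to the docker run above, for example with -e HUGGING_FACE_HUB_TOKEN=<your-token>.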
6. Verify API Locally
curl http://localhost:8011/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."}`,
{"role": "user", "content": "Explain how vLLM improves inference speed."}`
],
"temperature": 0.7,
"max_tokens": 200
}'
Expected: a JSON response with generated text.
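You can also confirm the served model ID, which must match the "model" field in requests:
curl http://localhost:8011/v1/models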
7. Configure NGINX Reverse Proxy (HTTP on 80)
Install & enable NGINX:
sudo apt update && sudo apt install -y nginx
sudo systemctl enable nginx && sudo systemctl start nginx
Create a proxy config:
sudo tee /etc/nginx/sites-available/vllm.conf >/dev/null <<'CONF'
server {
listen 80;
server_name api.substrateai.net;
# (Optional) Enforce JSON payload size/timeouts suitable for LLMs
client_max_body_size 32m;
location / {
proxy_pass http://127.0.0.1:8011;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_http_version 1.1;
proxy_read_timeout 3600;
}
}
CONF
sudo ln -s /etc/nginx/sites-available/vllm.conf /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
Your model is now reachable at: http://api.substrateai.net/v1/...
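A quick end-to-end check through the proxy before enabling HTTPS:
curl http://api.substrateai.net/v1/models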
8. Enable HTTPS (Recommended)
sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d api.substrateai.net
Certbot will:
- Issue Let's Encrypt certificates
- Update NGINX for HTTPS
- Install auto-renewal (systemd timer/cron)
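To confirm automatic renewal is in place, check for the certbot timer (a manual dry run is covered in Section 10):
systemctl list-timers | grep certbot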
9. External Testing (over HTTPS)
List models:
curl https://api.substrateai.net/v1/models
Send a chat completion:
curl https://api.substrateai.net/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."}`,
{"role": "user", "content": "Explain how vLLM improves inference speed."}`
],
"temperature": 0.7,
"max_tokens": 200
}'
Standard OpenAI SDKs also work: point the base URL at https://api.substrateai.net/v1 and supply a placeholder API key if your proxy does not enforce one (see the environment-variable example below).
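For example, the OpenAI Python SDK (and recent versions of the other official SDKs) picks up the base URL and key from environment variables, so a minimal shell setup, assuming no API-key enforcement at the proxy, is:
export OPENAI_BASE_URL=https://api.substrateai.net/v1
export OPENAI_API_KEY=placeholder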
10. Common Maintenance
| Task | Command |
|---|---|
| View logs | sudo docker logs -f vllm-mixtral |
| Restart container | sudo docker restart vllm-mixtral |
| Stop container | sudo docker stop vllm-mixtral |
| Renew SSL (manual test) | sudo certbot renew --dry-run |
11. Optional Enhancements
| Enhancement | Description |
|---|---|
| systemd service | Auto-start vLLM on reboot by wrapping the Docker container in a unit (see the sketch at the end of this section) |
| HTTPS-only redirect | Add return 301 https://$host$request_uri; to a port 80 server block |
| Authentication layer | NGINX auth_basic or API-key middleware upstream |
| Load balancing | Multiple vLLM nodes behind NGINX/HAProxy |
| Metrics & monitoring | Expose vLLM metrics; scrape via Prometheus; visualize in Grafana |
| Autoscaling | Kubernetes GPU nodes or Ray Serve |
| Caching | LMCache/Redis for KV-cache reuse |
| Web UI / Playground | Pair with Open WebUI / custom React dashboard |
| Multi-model serving | Run multiple vLLM instances on different host ports |
| Security hardening | Close unused ports, enable ufw, add Cloudflare firewall rules |
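As a sketch of the systemd-service option above, assuming the vllm-mixtral container from Section 5 already exists, a minimal unit could look like:
sudo tee /etc/systemd/system/vllm.service >/dev/null <<'UNIT'
[Unit]
Description=vLLM OpenAI-compatible server (Docker)
After=docker.service
Requires=docker.service

[Service]
# Attach to the existing container on boot; stop it cleanly on shutdown
ExecStart=/usr/bin/docker start -a vllm-mixtral
ExecStop=/usr/bin/docker stop vllm-mixtral
Restart=always

[Install]
WantedBy=multi-user.target
UNIT
sudo systemctl daemon-reload
sudo systemctl enable --now vllm.service
A simpler alternative is to add --restart unless-stopped to the docker run command in Section 5 and let Docker handle restarts itself.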
12. Troubleshooting
| Symptom | Check / Fix |
|---|---|
| 502 Bad Gateway | sudo nginx -t && sudo systemctl reload nginx; confirm vLLM is listening on 8011 (quick check below) |
| GPU not used | nvidia-smi inside container; confirm --gpus all and NVIDIA toolkit configured |
| Slow responses | Lower --max-model-len; tune --max-num-seqs and --gpu-memory-utilization |
| OOM / CUDA errors | Reduce batch size / concurrency; lower --gpu-memory-utilization |
| CORS issues | Add appropriate CORS headers in NGINX if calling from browsers |
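For the 502 case, a quick way to confirm that vLLM is actually listening on the host port NGINX proxies to:
sudo ss -ltnp | grep 8011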
Minimal CORS snippet (optional):
add_header Access-Control-Allow-Origin * always;
add_header Access-Control-Allow-Headers * always;
add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
if ($request_method = 'OPTIONS') { return 204; }
13. Architecture Summary
| Layer | Purpose | Port |
|---|---|---|
| vLLM container | Model inference engine | 8000 (internal) |
| Docker host | Mapped API endpoint | 8011 |
| NGINX | Reverse proxy | 80 / 443 |
| Certbot | HTTPS certificates | — |
| Cloudflare DNS | Maps domain → server | — |
14. Final Result
Your OpenAI-compatible LLM endpoint is live:
https://api.substrateai.net/v1/chat/completions
Use it directly via curl or standard OpenAI SDKs (with baseURL pointed to your domain).