vLLM Deployment on Bare Metal GPUs
1. Overview
This guide walks you through deploying an OpenAI-compatible LLM inference API using vLLM on SubstrateAI infrastructure.
You'll run the vLLM server in Docker with GPU acceleration, place NGINX in front as a reverse proxy, and optionally enable HTTPS with Certbot. The result is a production-style API endpoint you can hit with standard OpenAI SDKs or curl.
Objectives
- Launch vLLM with an OpenAI-compatible HTTP API
- Expose it via NGINX at a clean domain (e.g., api.substrateai.net)
- Optionally enable TLS with Let's Encrypt (Certbot)
Architecture Overview
| Layer | Purpose | Port |
|---|---|---|
| vLLM container | OpenAI-compatible inference server | 8000 (container) |
| Docker host | Port map to host | 8011 (host) |
| NGINX | Reverse proxy / public endpoint | 80 / 443 |
| Certbot (optional) | HTTPS certificates | — |
| Cloudflare DNS (optional) | Domain → server IP | — |
2. Environment Prerequisites
| Requirement | Recommended |
|---|---|
| OS | Ubuntu 20.04 / 22.04 |
| GPU Driver | NVIDIA ≥ 535 |
| Docker | Latest (supports --gpus all) |
| Internet | Required for image/model downloads |
| DNS | A-record (e.g., api.substrateai.net → server IP) |
Quick sanity check (GPU):
nvidia-smi
3. (If needed) Install Docker Engine
Skip if Docker is already installed.
sudo apt update -y
sudo apt install -y ca-certificates curl gnupg lsb-release
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release; echo $VERSION_CODENAME) stable" \
| sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt update -y
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl enable --now docker
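Optional quick check that the engine is working (hello-world is a tiny public test image):
sudo docker run --rm hello-world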
4. Install the NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sudo sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Validate GPU inside containers:
sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
5. Run the vLLM Inference Server
Launch your model as an OpenAI-compatible API in Docker:
sudo docker run -d --gpus all -p 8011:8000 \
--name vllm-mixtral \
vllm/vllm-openai:latest \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--tensor-parallel-size 8 \
--max-model-len 8192 \
--gpu-memory-utilization 0.95
Key Flags Explained
| Flag | Description |
|---|---|
| -d | Run container detached (keeps running after logout) |
| --gpus all | Expose all GPUs to the container |
| -p 8011:8000 | Map container port 8000 → host 8011 |
| --tensor-parallel-size | Split model across N GPUs |
| --max-model-len | Maximum context window (tokens) |
| --gpu-memory-utilization | Fraction of GPU VRAM the server can use |
Check container:
sudo docker ps -a
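On first start the container downloads the model weights from Hugging Face, which can take a while; follow the startup logs until the server reports it is listening:
sudo docker logs -f vllm-mixtral
If the model repository requires authentication, you may also need to pass a Hugging Face token to the docker run above, for example with -e HUGGING_FACE_HUB_TOKEN=<your-token>.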
6. Verify API Locally
curl http://localhost:8011/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."}`,
{"role": "user", "content": "Explain how vLLM improves inference speed."}`
],
"temperature": 0.7,
"max_tokens": 200
}'
Expected: a JSON response with generated text.
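You can also confirm the served model ID, which must match the "model" field in requests:
curl http://localhost:8011/v1/models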
7. Configure NGINX Reverse Proxy (HTTP on 80)
Install & enable NGINX:
sudo apt update && sudo apt install -y nginx
sudo systemctl enable nginx && sudo systemctl start nginx
Create a proxy config:
sudo tee /etc/nginx/sites-available/vllm.conf >/dev/null <<'CONF'
server {
listen 80;
server_name api.substrateai.net;
# (Optional) Enforce JSON payload size/timeouts suitable for LLMs
client_max_body_size 32m;
location / {
proxy_pass http://127.0.0.1:8011;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_http_version 1.1;
proxy_read_timeout 3600;
}
}
CONF
sudo ln -s /etc/nginx/sites-available/vllm.conf /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
Your model is now reachable at: http://api.substrateai.net/v1/...
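A quick end-to-end check through the proxy before enabling HTTPS:
curl http://api.substrateai.net/v1/models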
8. Enable HTTPS (Recommended)
sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d api.substrateai.net
Certbot will:
- Issue Let's Encrypt certificates
- Update NGINX for HTTPS
- Install auto-renewal (systemd timer/cron)
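To confirm automatic renewal is in place, check for the certbot timer (a manual dry run is covered in Section 10):
systemctl list-timers | grep certbot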
9. External Testing (over HTTPS)
List models:
curl https://api.substrateai.net/v1/models
Send a chat completion:
curl https://api.substrateai.net/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."}`,
{"role": "user", "content": "Explain how vLLM improves inference speed."}`
],
"temperature": 0.7,
"max_tokens": 200
}'
Standard OpenAI SDKs also work: point the base URL at https://api.substrateai.net/v1 and supply a placeholder API key if your proxy does not enforce one (see the environment-variable example below).
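For example, the OpenAI Python SDK (and recent versions of the other official SDKs) picks up the base URL and key from environment variables, so a minimal shell setup, assuming no API-key enforcement at the proxy, is:
export OPENAI_BASE_URL=https://api.substrateai.net/v1
export OPENAI_API_KEY=placeholder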
10. Common Maintenance
| Task | Command |
|---|---|
| View logs | sudo docker logs -f vllm-mixtral |
| Restart container | sudo docker restart vllm-mixtral |
| Stop container | sudo docker stop vllm-mixtral |
| Renew SSL (manual test) | sudo certbot renew --dry-run |
11. Optional Enhancements
| Enhancement | Description |
|---|---|
| systemd service | Auto-start vLLM on reboot by wrapping the Docker container in a unit (see the sketch at the end of this section) |
| HTTPS-only redirect | Add return 301 https://$host$request_uri; to a port 80 server block |
| Authentication layer | NGINX auth_basic or API-key middleware upstream |
| Load balancing | Multiple vLLM nodes behind NGINX/HAProxy |
| Metrics & monitoring | Expose vLLM metrics; scrape via Prometheus; visualize in Grafana |
| Autoscaling | Kubernetes GPU nodes or Ray Serve |
| Caching | LMCache/Redis for KV-cache reuse |
| Web UI / Playground | Pair with Open WebUI / custom React dashboard |
| Multi-model serving | Run multiple vLLM instances on different host ports |
| Security hardening | Close unused ports, enable ufw, add Cloudflare firewall rules |
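As a sketch of the systemd-service option above, assuming the vllm-mixtral container from Section 5 already exists, a minimal unit could look like:
sudo tee /etc/systemd/system/vllm.service >/dev/null <<'UNIT'
[Unit]
Description=vLLM OpenAI-compatible server (Docker)
After=docker.service
Requires=docker.service

[Service]
# Attach to the existing container on boot; stop it cleanly on shutdown
ExecStart=/usr/bin/docker start -a vllm-mixtral
ExecStop=/usr/bin/docker stop vllm-mixtral
Restart=always

[Install]
WantedBy=multi-user.target
UNIT
sudo systemctl daemon-reload
sudo systemctl enable --now vllm.service
A simpler alternative is to add --restart unless-stopped to the docker run command in Section 5 and let Docker handle restarts itself.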
12. Troubleshooting
| Symptom | Check / Fix |
|---|---|
| 502 Bad Gateway | sudo nginx -t && sudo systemctl reload nginx; confirm vLLM is listening on 8011 (quick check below) |
| GPU not used | nvidia-smi inside container; confirm --gpus all and NVIDIA toolkit configured |
| Slow responses | Lower --max-model-len; tune --max-num-seqs and --gpu-memory-utilization |
| OOM / CUDA errors | Reduce batch size / concurrency; lower --gpu-memory-utilization |
| CORS issues | Add appropriate CORS headers in NGINX if calling from browsers |
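For the 502 case, a quick way to confirm that vLLM is actually listening on the host port NGINX proxies to:
sudo ss -ltnp | grep 8011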
Minimal CORS snippet (optional):
add_header Access-Control-Allow-Origin * always;
add_header Access-Control-Allow-Headers * always;
add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
if ($request_method = 'OPTIONS') { return 204; }
13. Architecture Summary
| Layer | Purpose | Port |
|---|---|---|
| vLLM container | Model inference engine | 8000 (internal) |
| Docker host | Mapped API endpoint | 8011 |
| NGINX | Reverse proxy | 80 / 443 |
| Certbot | HTTPS certificates | — |
| Cloudflare DNS | Maps domain → server | — |
14. Final Result
Your OpenAI-compatible LLM endpoint is live:
https://api.substrateai.net/v1/chat/completions
Use it directly via curl or standard OpenAI SDKs (with baseURL pointed to your domain).