
vLLM Deployment on Bare Metal GPUs

1. Overview

This guide walks you through deploying an OpenAI-compatible LLM inference API using vLLM on SubstrateAI infrastructure. You'll run the vLLM server in Docker with GPU acceleration, place NGINX in front as a reverse proxy, and optionally enable HTTPS with Certbot. The result is a production-style API endpoint you can hit with standard OpenAI SDKs or curl.

Objectives

  • Launch vLLM with an OpenAI-compatible HTTP API
  • Expose it via NGINX at a clean domain (e.g., api.substrateai.net)
  • Optionally enable TLS with Let's Encrypt (Certbot)


Architecture Overview

Layer                      | Purpose                             | Port
vLLM container             | OpenAI-compatible inference server  | 8000 (container)
Docker host                | Port map to host                    | 8011 (host)
NGINX                      | Reverse proxy / public endpoint     | 80 / 443
Certbot (optional)         | HTTPS certificates                  | —
Cloudflare DNS (optional)  | Domain → server IP                  | —



2. Environment Prerequisites

Requirement | Recommended
OS          | Ubuntu 20.04 / 22.04
GPU Driver  | NVIDIA ≥ 535
Docker      | Latest (supports --gpus all)
Internet    | Required for image/model downloads
DNS         | A-record (e.g., api.substrateai.net → server IP)

Quick sanity check (GPU):

nvidia-smi

3. (If needed) Install Docker Engine

Skip if Docker is already installed.

sudo apt update -y
sudo apt install -y ca-certificates curl gnupg lsb-release

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release; echo $VERSION_CODENAME) stable" \
| sudo tee /etc/apt/sources.list.d/docker.list >/dev/null

sudo apt update -y
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl enable --now docker
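
A quick sanity check that the Docker engine is working before adding GPU support:

sudo docker --version
sudo docker run --rm hello-world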

4. Install the NVIDIA Container Toolkit

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sudo sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Validate GPU inside containers:

sudo docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
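
You can also confirm that Docker registered the NVIDIA runtime; if the previous step worked, the runtime list reported by docker info should mention nvidia:

sudo docker info | grep -i runtime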

5. Run the vLLM Inference Server

Launch your model as an OpenAI-compatible API in Docker:

sudo docker run -d --gpus all -p 8011:8000 \
  --name vllm-mixtral \
  vllm/vllm-openai:latest \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95
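
If the model you serve is gated on Hugging Face, the container needs a token to download weights, and mounting the Hugging Face cache avoids re-downloading them on every restart. A variant of the command above with those options (the token value is a placeholder; vLLM's Docker docs also suggest --ipc=host or a larger --shm-size for multi-GPU tensor parallelism):

sudo docker run -d --gpus all --ipc=host -p 8011:8000 \
  --name vllm-mixtral \
  -e HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxx \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.95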

Key Flags Explained

Flag                      | Description
-d                        | Run container detached (keeps running after logout)
--gpus all                | Expose all GPUs to the container
-p 8011:8000              | Map container port 8000 → host 8011
--tensor-parallel-size    | Split the model across N GPUs
--max-model-len           | Maximum context window (tokens)
--gpu-memory-utilization  | Fraction of GPU VRAM the server may use

Check container:

sudo docker ps -a
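
Downloading and loading the full model weights across 8 GPUs can take several minutes on first start, so it is worth tailing the logs until the server reports that it is up and listening on port 8000 inside the container:

sudo docker logs -f vllm-mixtral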

6. Verify API Locally

curl http://localhost:8011/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain how vLLM improves inference speed."}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'

Expected: a JSON response with generated text.
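
If jq is installed, you can pull out just the generated text; the .choices[0].message.content path below follows the standard OpenAI chat-completions response shape:

curl -s http://localhost:8011/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 50
  }' | jq -r '.choices[0].message.content'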


7. Configure NGINX Reverse Proxy (HTTP on 80)

Install & enable NGINX:

sudo apt update && sudo apt install -y nginx
sudo systemctl enable nginx && sudo systemctl start nginx

Create a proxy config:

sudo tee /etc/nginx/sites-available/vllm.conf >/dev/null <<'CONF'
server {
    listen 80;
    server_name api.substrateai.net;

    # (Optional) Enforce JSON payload size/timeouts suitable for LLMs
    client_max_body_size 32m;

    location / {
        proxy_pass http://127.0.0.1:8011;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_http_version 1.1;
        proxy_read_timeout 3600;
    }
}
CONF

sudo ln -s /etc/nginx/sites-available/vllm.conf /etc/nginx/sites-enabled/
sudo nginx -t
sudo systemctl reload nginx
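
If clients will request streaming completions ("stream": true), NGINX's default response buffering can hold the token stream back until the response finishes. A common adjustment is to disable buffering for this location (add inside the location / block of the config above, then reload NGINX):

proxy_buffering off;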

Your model is now reachable at: http://api.substrateai.net/v1/...


8. Enable HTTPS with Certbot (Optional)

Install Certbot and request a certificate for your domain:
sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d api.substrateai.net

Certbot will:

  • Issue Let's Encrypt certificates
  • Update NGINX for HTTPS
  • Install auto-renewal (systemd timer/cron)
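
If ufw or another firewall is active, ports 80 and 443 must be reachable both for the HTTP-01 challenge and for HTTPS traffic afterwards; with NGINX installed, ufw usually provides a ready-made application profile:

sudo ufw allow 'Nginx Full'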



9. External Testing (over HTTPS)

List models:

curl https://api.substrateai.net/v1/models

Send a chat completion:

curl https://api.substrateai.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "messages": [
      {"role": "system", "content": "You are a helpful AI assistant."},
      {"role": "user", "content": "Explain how vLLM improves inference speed."}
    ],
    "temperature": 0.7,
    "max_tokens": 200
  }'

The endpoint also works with standard OpenAI SDKs: point baseURL at https://api.substrateai.net/v1 and set apiKey to any placeholder value if your proxy does not enforce authentication, as in the example below.
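
For example, recent versions of the official OpenAI Python and Node SDKs read their base URL and key from environment variables, so a minimal shell setup looks like this (the key value is an arbitrary placeholder):

export OPENAI_BASE_URL="https://api.substrateai.net/v1"
export OPENAI_API_KEY="placeholder-key"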


10. Common Maintenance

Task                     | Command
View logs                | sudo docker logs -f vllm-mixtral
Restart container        | sudo docker restart vllm-mixtral
Stop container           | sudo docker stop vllm-mixtral
Renew SSL (manual test)  | sudo certbot renew --dry-run
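
Upgrading vLLM itself is not in the table; assuming you recreate the container with the same docker run command from section 5, the sequence is roughly:

sudo docker pull vllm/vllm-openai:latest
sudo docker stop vllm-mixtral && sudo docker rm vllm-mixtral
# re-run the docker run command from section 5 to start the new image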



11. Optional Enhancements

Enhancement           | Description
systemd service       | Auto-start vLLM on reboot (wrap the Docker run in a unit)
HTTPS-only redirect   | Add return 301 https://$host$request_uri; to the port 80 server block
Authentication layer  | NGINX auth_basic or API-key middleware upstream
Load balancing        | Multiple vLLM nodes behind NGINX/HAProxy
Metrics & monitoring  | Expose vLLM metrics; scrape with Prometheus; visualize in Grafana
Autoscaling           | Kubernetes GPU nodes or Ray Serve
Caching               | LMCache/Redis for KV-cache reuse
Web UI / Playground   | Pair with Open WebUI or a custom React dashboard
Multi-model serving   | Run multiple vLLM instances on different host ports
Security hardening    | Close unused ports, enable ufw, add Cloudflare firewall rules
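
The systemd service row above can be implemented as a small unit that starts and stops the existing vllm-mixtral container. This is only a sketch (unit name, restart policy, and binary path are assumptions to adapt):

sudo tee /etc/systemd/system/vllm-mixtral.service >/dev/null <<'UNIT'
[Unit]
Description=vLLM inference container (vllm-mixtral)
Requires=docker.service
After=docker.service

[Service]
ExecStart=/usr/bin/docker start -a vllm-mixtral
ExecStop=/usr/bin/docker stop vllm-mixtral
Restart=always

[Install]
WantedBy=multi-user.target
UNIT

sudo systemctl daemon-reload
sudo systemctl enable --now vllm-mixtral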



12. Troubleshooting

Symptom            | Check / Fix
502 Bad Gateway    | sudo nginx -t && sudo systemctl reload nginx; confirm vLLM is listening on 8011
GPU not used       | Run nvidia-smi inside the container; confirm --gpus all and that the NVIDIA toolkit is configured
Slow responses     | Lower --max-model-len; tune --max-num-seqs and --gpu-memory-utilization
OOM / CUDA errors  | Reduce batch size / concurrency; lower --gpu-memory-utilization
CORS issues        | Add appropriate CORS headers in NGINX if calling from browsers

Minimal CORS snippet (optional):

add_header Access-Control-Allow-Origin * always;
add_header Access-Control-Allow-Headers * always;
add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
if ($request_method = OPTIONS) { return 204; }

13. Architecture Summary

Layer           | Purpose                 | Port
vLLM container  | Model inference engine  | 8000 (internal)
Docker host     | Mapped API endpoint     | 8011
NGINX           | Reverse proxy           | 80 / 443
Certbot         | HTTPS certificates      | —
Cloudflare DNS  | Maps domain → server    | —



14. Final Result

Your OpenAI-compatible LLM endpoint is live at https://api.substrateai.net/v1/chat/completions. Use it directly via curl or with standard OpenAI SDKs (with baseURL pointed at your domain).