GPU Bare Metal Clusters
Deploy dedicated, non-virtualized GPU servers with direct hardware access for maximum performance.
Overview
GPU Bare Metal Clusters deliver dedicated GPU servers for AI training, inference, 3D rendering, and HPC workloads. Key advantages:
- 100% dedicated GPUs with no hypervisor overhead
- Direct PCIe and NVLink access
- Custom OS and driver management
- Full control over networking and storage
- Multi-node scaling for distributed training
Prerequisites
- Dashboard or API access with compute provisioning rights
- SSH key pairs for all cluster nodes (see the key-generation example after this list)
- At least one private network or VLAN configured
- Driver and CUDA version plan for your workload
- (Optional) Container or orchestration setup (Docker, Kubernetes, Slurm)
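If you still need a key pair for the nodes, a dedicated key can be generated as in the sketch below (the file name ~/.ssh/gpu_cluster is only an example):
# Generate an Ed25519 key pair dedicated to the cluster; upload the .pub file during provisioning
ssh-keygen -t ed25519 -f ~/.ssh/gpu_cluster -C "gpu-cluster"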
Key Features
| Feature | Description |
|---|---|
| Dedicated GPUs | Direct access to NVIDIA A100, H100, RTX 4090, L40S with full PCIe bandwidth |
| Bare Metal Control | Full OS/root access for drivers, frameworks, monitoring |
| NVLink and RDMA Fabric | Multi-node interconnect for distributed training |
| Cluster Networking | 25/50/100 GbE private connections |
| Block and Object Storage | High-performance NVMe or S3-compatible storage |
| Custom OS Images | Ubuntu, Rocky, CentOS, Debian |
| GPU Partitioning (MIG) | Isolated GPU instances for concurrent workloads (see the example after this table) |
| Automation and APIs | REST API, CLI, CI/CD workflows |
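As a sketch of the MIG row above, partitions are created with nvidia-smi on the node itself; profile IDs vary by GPU model, so take them from the list command:
# Enable MIG mode on GPU 0 (may require stopping GPU workloads and resetting the GPU)
sudo nvidia-smi -i 0 -mig 1
# List the GPU instance profiles this GPU supports
sudo nvidia-smi mig -lgip
# Example: create two GPU instances from a profile ID shown by -lgip, plus their compute instances
sudo nvidia-smi mig -cgi 9,9 -C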
Step 1: Provision GPU Nodes
- Go to Compute > Bare Metal > GPU Nodes
- Click Create Node or Create Cluster
- Configure:
| Setting | Options |
|---|---|
| GPU Model | H200, H100, A100, RTX 4090 |
| CPU and Memory | e.g., AMD Ryzen 9 7950X + 256 GB RAM |
| Storage | NVMe (2-8 TB) or external volume |
| Networking | 25 Gbps (standard) or 100 Gbps |
| Region | Preferred data center location |
- Upload SSH key and specify OS image (Ubuntu 22.04 LTS recommended)
- Click Deploy Cluster
Provisioning typically takes 10-15 minutes.
Step 2: Connect to Cluster
SSH Access
ssh -i ~/.ssh/id_rsa ubuntu@<NODE_IP>
Verify GPU visibility:
nvidia-smi
Multi-Node Access
Access other nodes over the internal cluster network:
ssh node02
ssh node03
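Short names such as node02 assume matching entries in DNS, /etc/hosts, or your SSH configuration. A minimal ~/.ssh/config sketch on the head node (internal addresses are placeholders) looks like:
# ~/.ssh/config on the head node; replace the example internal IPs
Host node02
    HostName 10.0.0.12
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
Host node03
    HostName 10.0.0.13
    User ubuntu
    IdentityFile ~/.ssh/id_rsa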
Step 3: Install Drivers and Frameworks
NVIDIA Driver (Ubuntu 22.04)
sudo apt update
sudo apt install -y build-essential dkms
wget https://us.download.nvidia.com/tesla/550.90.12/NVIDIA-Linux-x86_64-550.90.12.run
sudo bash NVIDIA-Linux-x86_64-550.90.12.run
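After the installer finishes (a reboot may be needed if an older driver was loaded), confirm the driver version and, on dedicated GPU servers, enable persistence mode so the driver stays loaded between jobs:
# Confirm the installed driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Enable persistence mode (recommended for long-running GPU servers)
sudo nvidia-smi -pm 1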
CUDA and cuDNN
sudo apt install -y cuda-toolkit-12-4
nvcc --version
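The cuda-toolkit-12-4 package comes from NVIDIA's apt repository, which is not enabled by default on Ubuntu 22.04. A typical way to register it, and to pull in cuDNN from the same repository, is sketched below (package names follow NVIDIA's current cuDNN 9 instructions and may change):
# Register NVIDIA's CUDA apt repository (Ubuntu 22.04, x86_64)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# cuDNN built against CUDA 12
sudo apt install -y cudnn-cuda-12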
PyTorch
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu124
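A quick check that PyTorch sees the GPUs:
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"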
Step 4: Distributed Training Setup
Orchestrate workloads using Slurm, Kubernetes, or PyTorch DDP.
PyTorch Distributed Example
import os
import torch
import torch.distributed as dist
# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
print(f"Rank {dist.get_rank()} / {dist.get_world_size()}")
dist.destroy_process_group()
Start training:
torchrun --nproc_per_node=8 train.py
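For a multi-node run, launch the same command on every node and point all of them at a rendezvous endpoint on the first node (the node01 host name and port 29500 are examples):
# Run on each of the two nodes; adjust --nnodes and the endpoint for your cluster
torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=node01:29500 \
  train.py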
Step 5: GitHub Actions Integration (Optional)
name: Train on GPU Cluster
on:
  push:
    branches: [ "main" ]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Connect to Bare Metal Node
        run: |
          ssh -o StrictHostKeyChecking=no ubuntu@${{ secrets.GPU_NODE_IP }} \
            'cd /workspace && git pull && python3 train.py'
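The workflow above assumes the runner can already authenticate to the node. One sketch, with SSH_PRIVATE_KEY as an assumed repository secret, is an extra step placed before the SSH call:
      # Hypothetical secret holding the private key matching the node's authorized key
      - name: Configure SSH key
        run: |
          mkdir -p ~/.ssh
          echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
          chmod 600 ~/.ssh/id_rsa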
Step 6: Monitoring
GPU Health Metrics
Install DCGM Exporter:
sudo docker run -d --gpus all --rm \
--name dcgm-exporter -p 9400:9400 \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.1
View metrics at http://<NODE_IP>:9400/metrics
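To pull these metrics into Prometheus, a minimal scrape configuration (job name and interval are only examples) could be:
# prometheus.yml fragment scraping the DCGM exporter on each node
scrape_configs:
  - job_name: dcgm-exporter
    scrape_interval: 15s
    static_configs:
      - targets: ["<NODE_IP>:9400"]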
Recommended Monitoring
- Stream logs via Grafana Agent or Promtail
- Enable alerts for temperature, ECC errors, utilization
- Schedule monthly burn-in tests
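For the burn-in tests mentioned above, DCGM's built-in diagnostics are one option; this assumes the datacenter-gpu-manager package is installed and the DCGM host engine is running:
# Extended GPU diagnostic; use -r 1 for a quick health check
sudo dcgmi diag -r 3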
Step 7: Scaling and Management
| Action | Description |
|---|---|
| Add Nodes | Extend capacity from dashboard or API |
| Load Balancing | Use HAProxy or internal scheduler |
| Storage Expansion | Attach new NVMe volumes or NFS mounts (see the example after this table) |
| Snapshots | Capture full OS + driver state |
| Decommissioning | Stop or wipe nodes securely |
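As a sketch of the Storage Expansion row, an NFS share can be attached on a node like this (server address and export path are placeholders):
# Mount a placeholder NFS export for shared datasets
sudo apt install -y nfs-common
sudo mkdir -p /mnt/datasets
sudo mount -t nfs 10.0.0.50:/export/datasets /mnt/datasets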
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| GPU not detected | Driver mismatch | Reinstall NVIDIA driver; reboot |
| Slow inter-node communication | RDMA not configured | Check InfiniBand/ROCE setup |
| CUDA out of memory | Batch size too large | Reduce the batch size, use gradient accumulation, or shrink the model |
| Training divergence | Clock sync issue | Enable NTP or chrony on all nodes |
| SSH latency | Traffic routed over the public network | Use the internal 10.x.x.x network |