GPU Bare Metal Clusters

Deploy dedicated, non-virtualized GPU servers with direct hardware access for maximum performance.

Overview

GPU Bare Metal Clusters deliver dedicated GPU servers for AI training, inference, 3D rendering, and HPC workloads. Key advantages:

  • 100% dedicated GPUs with no hypervisor overhead
  • Direct PCIe and NVLink access
  • Custom OS and driver management
  • Full control over networking and storage
  • Multi-node scaling for distributed training

Prerequisites

  • Dashboard or API access with compute provisioning rights
  • SSH key pairs for all cluster nodes
  • At least one private network or VLAN configured
  • Driver and CUDA version plan for your workload
  • (Optional) Container or orchestration setup (Docker, Kubernetes, Slurm)
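
If you still need a key pair for the SSH prerequisite above, one way to generate and inspect one (an ed25519 key is a reasonable default; adjust the file path and comment to your own conventions):

ssh-keygen -t ed25519 -f ~/.ssh/gpu_cluster -C "gpu-cluster-admin"
cat ~/.ssh/gpu_cluster.pub   # paste this public key when provisioning nodes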

Key Features

Feature | Description
Dedicated GPUs | Direct access to NVIDIA A100, H100, RTX 4090, L40S with full PCIe bandwidth
Bare Metal Control | Full OS/root access for drivers, frameworks, monitoring
NVLink and RDMA Fabric | Multi-node interconnect for distributed training
Cluster Networking | 25/50/100 GbE private connections
Block and Object Storage | High-performance NVMe or S3-compatible storage
Custom OS Images | Ubuntu, Rocky, CentOS, Debian
GPU Partitioning (MIG) | Isolated GPU instances for concurrent workloads
Automation and APIs | REST API, CLI, CI/CD workflows
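
As a sketch of the GPU Partitioning (MIG) feature in the table above, the commands below enable MIG mode and create GPU instances on a supported card (A100/H100). The profile IDs depend on the GPU model, so treat the values here as placeholders:

sudo nvidia-smi -i 0 -mig 1            # enable MIG mode on GPU 0 (may require a GPU reset or reboot)
sudo nvidia-smi mig -i 0 -cgi 9,9 -C   # create two GPU instances from profile ID 9 plus matching compute instances
nvidia-smi mig -lgi                    # list the resulting GPU instances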

Step 1: Provision GPU Nodes

  1. Go to Compute > Bare Metal > GPU Nodes
  2. Click Create Node or Create Cluster
  3. Configure:

     Setting | Options
     GPU Model | H200, H100, A100, RTX 4090
     CPU and Memory | e.g., AMD Threadripper 7950X + 256 GB RAM
     Storage | NVMe (2-8 TB) or external volume
     Networking | 25 Gbps (standard) or 100 Gbps
     Region | Preferred data center location

  4. Upload SSH key and specify OS image (Ubuntu 22.04 LTS recommended)
  5. Click Deploy Cluster

Provisioning takes 10-15 minutes.
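
Nodes can also be provisioned through the REST API listed under Key Features. The endpoint and fields below are purely illustrative placeholders; substitute the actual paths and field names from the API reference:

# hypothetical endpoint and payload -- replace with the real ones from the API reference
curl -X POST https://api.example.com/v1/baremetal/gpu-nodes \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"gpu_model": "H100", "region": "us-east", "os_image": "ubuntu-22.04", "ssh_key": "gpu-cluster-admin"}'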


Step 2: Connect to Cluster

SSH Access

ssh -i ~/.ssh/id_rsa ubuntu@<NODE_IP>

Verify GPU visibility:

nvidia-smi
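
For a machine-readable check of each GPU (useful in provisioning scripts), nvidia-smi's query mode can be used:

nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv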

Multi-Node Access

Access other nodes via the internal cluster network:

ssh node02
ssh node03
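
The short host names above assume entries in /etc/hosts or ~/.ssh/config. A minimal sketch, assuming internal addresses in a 10.0.0.0/24 range (substitute your cluster's actual IPs and key path):

cat >> ~/.ssh/config <<'EOF'
Host node02
    HostName 10.0.0.12
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
Host node03
    HostName 10.0.0.13
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
EOF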

Step 3: Install Drivers and Frameworks

NVIDIA Driver (Ubuntu 22.04)

sudo apt update
sudo apt install -y build-essential dkms
wget https://us.download.nvidia.com/tesla/550.90.12/NVIDIA-Linux-x86_64-550.90.12.run
sudo bash NVIDIA-Linux-x86_64-550.90.12.run

CUDA and cuDNN

# requires the NVIDIA CUDA apt repository (cuda-keyring) to be configured first
sudo apt install -y cuda-toolkit-12-4
export PATH=/usr/local/cuda/bin:$PATH
nvcc --version

PyTorch

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu124
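
To confirm that the installed wheel can actually see the GPUs before launching a job:

python3 -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"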

Step 4: Distributed Training Setup

Orchestrate workloads using Slurm, Kubernetes, or PyTorch DDP.

PyTorch Distributed Example

import torch
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK; NCCL is the backend for GPU collectives
dist.init_process_group("nccl")
print(f"Rank {dist.get_rank()} / {dist.get_world_size()}")
dist.destroy_process_group()

Start training:

torchrun --nproc_per_node=8 train.py
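
For multi-node jobs, torchrun needs a rendezvous endpoint reachable from every node. A sketch for two 8-GPU nodes, assuming node01 is resolvable on the internal network and port 29500 is open:

# on node01 (rank 0)
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=0 \
  --rdzv_backend=c10d --rdzv_endpoint=node01:29500 train.py

# on node02 (rank 1)
torchrun --nnodes=2 --nproc_per_node=8 --node_rank=1 \
  --rdzv_backend=c10d --rdzv_endpoint=node01:29500 train.py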

Step 5: GitHub Actions Integration (Optional)

name: Train on GPU Cluster
on:
  push:
    branches: [ "main" ]

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      # Assumes the runner can reach the node and holds an authorized SSH key
      # (for example, loaded into ssh-agent from a repository secret).
      - name: Connect to Bare Metal Node
        run: |
          ssh -o StrictHostKeyChecking=no ubuntu@${{ secrets.GPU_NODE_IP }} \
            'cd /workspace && git pull && python3 train.py'

Step 6: Monitoring

GPU Health Metrics

Install DCGM Exporter:

sudo docker run -d --gpus all --rm \
--name dcgm-exporter -p 9400:9400 \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.1

View metrics at http://<NODE_IP>:9400/metrics
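
To spot-check the exporter from any node on the private network (DCGM_FI_DEV_GPU_TEMP and DCGM_FI_DEV_GPU_UTIL are standard DCGM field names):

curl -s http://<NODE_IP>:9400/metrics | grep -E 'DCGM_FI_DEV_GPU_TEMP|DCGM_FI_DEV_GPU_UTIL'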

  • Stream logs via Grafana Agent or Promtail
  • Enable alerts for temperature, ECC errors, utilization
  • Schedule monthly burn-in tests
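
For the burn-in tests mentioned above, DCGM's diagnostic tool is one option. This assumes the NVIDIA apt repository configured for the CUDA install is available and the DCGM host engine can be started as a service:

sudo apt install -y datacenter-gpu-manager
sudo systemctl enable --now nvidia-dcgm   # start the DCGM host engine
dcgmi diag -r 3                           # level-3 diagnostic (longer, stress-style run)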

Step 7: Scaling and Management

Action | Description
Add Nodes | Extend capacity from the dashboard or API
Load Balancing | Use HAProxy or the internal scheduler
Storage Expansion | Attach new NVMe volumes or NFS mounts
Snapshots | Capture full OS + driver state
Decommissioning | Stop or wipe nodes securely

Troubleshooting

Issue | Cause | Solution
GPU not detected | Driver mismatch | Reinstall the NVIDIA driver and reboot
Slow inter-node communication | RDMA not configured | Check InfiniBand/RoCE setup
CUDA out of memory | Batch size too large | Reduce the batch size, shrink the model, or use gradient accumulation
Training divergence | Clock sync issue | Enable NTP or chrony on all nodes
SSH latency | Traffic routed over the public path | Use the internal 10.x.x.x network
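
For the slow inter-node communication row above, NVIDIA's nccl-tests are a common way to verify NCCL/RDMA bandwidth before suspecting the training code. A sketch assuming CUDA and NCCL are installed under the standard paths:

git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make CUDA_HOME=/usr/local/cuda
./build/all_reduce_perf -b 8 -e 1G -f 2 -g 8   # all-reduce bandwidth across the 8 local GPUs

Multi-node runs additionally require an MPI-enabled build (make MPI=1) and an mpirun launch across the nodes.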