GPU Bare Metal Clusters
Deploy dedicated, non-virtualized GPU servers with direct hardware access for maximum performance.
Overview
GPU Bare Metal Clusters deliver dedicated GPU servers for AI training, inference, 3D rendering, and HPC workloads. Key advantages:
- 100% dedicated GPUs with no hypervisor overhead
- Direct PCIe and NVLink access
- Custom OS and driver management
- Full control over networking and storage
- Multi-node scaling for distributed training
Prerequisites
- Dashboard or API access with compute provisioning rights
- SSH key pairs for all cluster nodes (see the key-generation example after this list)
- At least one private network or VLAN configured
- Driver and CUDA version plan for your workload
- (Optional) Container or orchestration setup (Docker, Kubernetes, Slurm)
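If you still need a key pair for the nodes, a dedicated key can be generated as in the sketch below (the file name ~/.ssh/gpu_cluster is only an example):
# Generate an Ed25519 key pair dedicated to the cluster; upload the .pub file during provisioning
ssh-keygen -t ed25519 -f ~/.ssh/gpu_cluster -C "gpu-cluster"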
Key Features
| Feature | Description |
|---|---|
| Dedicated GPUs | Direct access to NVIDIA A100, H100, RTX 4090, L40S with full PCIe bandwidth |
| Bare Metal Control | Full OS/root access for drivers, frameworks, monitoring |
| NVLink and RDMA Fabric | Multi-node interconnect for distributed training |
| Cluster Networking | 25/50/100 GbE private connections |
| Block and Object Storage | High-performance NVMe or S3-compatible storage |
| Custom OS Images | Ubuntu, Rocky, CentOS, Debian |
| GPU Partitioning (MIG) | Isolated GPU instances for concurrent workloads (see the example after this table) |
| Automation and APIs | REST API, CLI, CI/CD workflows |
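As a sketch of the MIG row above, partitions are created with nvidia-smi on the node itself; profile IDs vary by GPU model, so take them from the list command:
# Enable MIG mode on GPU 0 (may require stopping GPU workloads and resetting the GPU)
sudo nvidia-smi -i 0 -mig 1
# List the GPU instance profiles this GPU supports
sudo nvidia-smi mig -lgip
# Example: create two GPU instances from a profile ID shown by -lgip, plus their compute instances
sudo nvidia-smi mig -cgi 9,9 -C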
Step 1: Provision GPU Nodes
- Go to Compute > Bare Metal > GPU Nodes
- Click Create Node or Create Cluster
- Configure:
| Setting | Options |
|---|---|
| GPU Model | H200, H100, A100, RTX 4090 |
| CPU and Memory | e.g., AMD Ryzen 9 7950X + 256 GB RAM |
| Storage | NVMe (2-8 TB) or external volume |
| Networking | 25 Gbps (standard) or 100 Gbps |
| Region | Preferred data center location |
- Upload SSH key and specify OS image (Ubuntu 22.04 LTS recommended)
- Click Deploy Cluster
Provisioning typically takes 10-15 minutes.
Step 2: Connect to Cluster
SSH Access
ssh -i ~/.ssh/id_rsa ubuntu@<NODE_IP>
Verify GPU visibility:
nvidia-smi
Multi-Node Access
Access other nodes over the internal cluster network:
ssh node02
ssh node03
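Short names such as node02 assume matching entries in DNS, /etc/hosts, or your SSH configuration. A minimal ~/.ssh/config sketch on the head node (internal addresses are placeholders) looks like:
# ~/.ssh/config on the head node; replace the example internal IPs
Host node02
    HostName 10.0.0.12
    User ubuntu
    IdentityFile ~/.ssh/id_rsa
Host node03
    HostName 10.0.0.13
    User ubuntu
    IdentityFile ~/.ssh/id_rsa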
Step 3: Install Drivers and Frameworks
NVIDIA Driver (Ubuntu 22.04)
sudo apt update
sudo apt install -y build-essential dkms
wget https://us.download.nvidia.com/tesla/550.90.12/NVIDIA-Linux-x86_64-550.90.12.run
sudo bash NVIDIA-Linux-x86_64-550.90.12.run
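After the installer finishes (a reboot may be needed if an older driver was loaded), confirm the driver version and, on dedicated GPU servers, enable persistence mode so the driver stays loaded between jobs:
# Confirm the installed driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Enable persistence mode (recommended for long-running GPU servers)
sudo nvidia-smi -pm 1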
CUDA and cuDNN
sudo apt install -y cuda-toolkit-12-4
nvcc --version
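The cuda-toolkit-12-4 package comes from NVIDIA's apt repository, which is not enabled by default on Ubuntu 22.04. A typical way to register it, and to pull in cuDNN from the same repository, is sketched below (package names follow NVIDIA's current cuDNN 9 instructions and may change):
# Register NVIDIA's CUDA apt repository (Ubuntu 22.04, x86_64)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
# cuDNN built against CUDA 12
sudo apt install -y cudnn-cuda-12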
PyTorch
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu124
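A quick check that PyTorch sees the GPUs:
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"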
Step 4: Distributed Training Setup
Orchestrate workloads using Slurm, Kubernetes, or PyTorch DDP.
PyTorch Distributed Example
import os
import torch
import torch.distributed as dist
# torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
print(f"Rank {dist.get_rank()} / {dist.get_world_size()}")
dist.destroy_process_group()
Start training:
torchrun --nproc_per_node=8 train.py
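For a multi-node run, launch the same command on every node and point all of them at a rendezvous endpoint on the first node (the node01 host name and port 29500 are examples):
# Run on each of the two nodes; adjust --nnodes and the endpoint for your cluster
torchrun --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=node01:29500 \
  train.py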
Step 5: GitHub Actions Integration (Optional)
name: Train on GPU Cluster
on:
  push:
    branches: [ "main" ]
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Connect to Bare Metal Node
        run: |
          ssh -o StrictHostKeyChecking=no ubuntu@${{ secrets.GPU_NODE_IP }} \
            'cd /workspace && git pull && python3 train.py'
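The workflow above assumes the runner can already authenticate to the node. One sketch, with SSH_PRIVATE_KEY as an assumed repository secret, is an extra step placed before the SSH call:
      # Hypothetical secret holding the private key matching the node's authorized key
      - name: Configure SSH key
        run: |
          mkdir -p ~/.ssh
          echo "${{ secrets.SSH_PRIVATE_KEY }}" > ~/.ssh/id_rsa
          chmod 600 ~/.ssh/id_rsa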
Step 6: Monitoring
GPU Health Metrics
Install DCGM Exporter:
sudo docker run -d --gpus all --rm \
--name dcgm-exporter -p 9400:9400 \
nvcr.io/nvidia/k8s/dcgm-exporter:3.3.1
View metrics at http://<NODE_IP>:9400/metrics
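To pull these metrics into Prometheus, a minimal scrape configuration (job name and interval are only examples) could be:
# prometheus.yml fragment scraping the DCGM exporter on each node
scrape_configs:
  - job_name: dcgm-exporter
    scrape_interval: 15s
    static_configs:
      - targets: ["<NODE_IP>:9400"]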
Recommended Monitoring
- Stream logs via Grafana Agent or Promtail
- Enable alerts for temperature, ECC errors, utilization
- Schedule monthly burn-in tests
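For the burn-in tests mentioned above, DCGM's built-in diagnostics are one option; this assumes the datacenter-gpu-manager package is installed and the DCGM host engine is running:
# Extended GPU diagnostic; use -r 1 for a quick health check
sudo dcgmi diag -r 3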
Step 7: Scaling and Management
| Action | Description |
|---|---|
| Add Nodes | Extend capacity from dashboard or API |
| Load Balancing | Use HAProxy or internal scheduler |
| Storage Expansion | Attach new NVMe volumes or NFS mounts (see the example after this table) |
| Snapshots | Capture full OS + driver state |
| Decommissioning | Stop or wipe nodes securely |
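As a sketch of the Storage Expansion row, an NFS share can be attached on a node like this (server address and export path are placeholders):
# Mount a placeholder NFS export for shared datasets
sudo apt install -y nfs-common
sudo mkdir -p /mnt/datasets
sudo mount -t nfs 10.0.0.50:/export/datasets /mnt/datasets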
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| GPU not detected | Driver mismatch | Reinstall NVIDIA driver; reboot |
| Slow inter-node communication | RDMA not configured | Check InfiniBand/ROCE setup |
| CUDA out of memory | Batch size too large | Reduce the batch size, use gradient accumulation, or shrink the model |
| Training divergence | Clock sync issue | Enable NTP or chrony on all nodes |
| SSH latency | Traffic routed over the public network | Use the internal 10.x.x.x network |