Troubleshooting

Quick solutions to common issues. Use Ctrl+F to search for specific error messages.

Instance Launch Issues

GPU Unavailable

Error: "This GPU configuration is no longer available"

Cause: Another user claimed the GPU while you were configuring.

Solutions:

  • Try a different region
  • Choose a different GPU type
  • Wait 15-30 minutes and retry (GPUs become available as instances terminate)

Insufficient Balance

Error: "Insufficient wallet balance"

Solution:

  1. Click "Add Funds" in wallet section
  2. Add minimum required amount (shown in error)
  3. Retry launch

Note: First hour is always charged immediately.


No SSH Keys

Error: "No SSH keys found"

Solution:

  1. Navigate to SSH Keys page
  2. Upload an existing key or generate a new one (see the sketch after this list)
  3. Return to launch page
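
If you need to generate a key pair locally first, a minimal sketch (the file path and comment are example placeholders):

# Generate a new ed25519 key pair (path and comment are examples)
ssh-keygen -t ed25519 -f ~/.ssh/substrate_key -C "substrate-instance"

# Print the public key to paste into the SSH Keys page
cat ~/.ssh/substrate_key.pub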

Stuck in "Creating" Status

Normal duration: 5-10 minutes

If exceeds 15 minutes:

  • Status will change to failed
  • First hour charge automatically refunded
  • Try different region or GPU type
  • Contact support if recurring

Launch Failed

Status: failed

Common causes:

  • Cloud provider capacity issue
  • Configuration error
  • Network problem

Troubleshooting:

  1. Review error message
  2. Try different region
  3. Simplify configuration (remove startup script and volumes temporarily)
  4. Contact support if persistent

SSH Connection Issues

Connection Timed Out

Verify before debugging:

  • Instance status is active (not creating/pending)
  • Using correct IP address from Resources page
  • Using correct private key
  • Key has proper permissions: chmod 600 key.pem
  • Firewall allows SSH (port 22)
  • Waited at least 1-2 minutes after instance became active

Debug with verbose output:

ssh -v -i /path/to/key.pem root@YOUR_IP_ADDRESS
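
To check whether port 22 on the instance is reachable at all (assuming netcat is installed on your local machine):

# Test TCP connectivity to the SSH port
nc -zv YOUR_IP_ADDRESS 22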

Permission Denied

Error: "Permission denied (publickey)"

Common cause: Wrong private key or incorrect permissions

Solution:

# Set correct permissions
chmod 600 /path/to/key.pem

# Verify key is valid
ssh-keygen -lf /path/to/key.pem

# Add to SSH agent (optional)
ssh-add /path/to/key.pem

# Connect
ssh -i /path/to/key.pem root@YOUR_IP_ADDRESS

Note: Use the private key file (without .pub extension), not the public key.


Connection Refused

Possible causes:

  • SSH service hasn't started yet
  • Wrong port number
  • Instance suspended due to low balance

Solution:

  1. Wait 2-3 minutes after instance shows active status
  2. Verify instance status on Resources page
  3. If suspended: add funds to wallet

Host Key Changed

Error: "REMOTE HOST IDENTIFICATION HAS CHANGED"

Cause: Connecting to new instance with same IP as previous instance.

Solution:

# Remove old host key
ssh-keygen -R YOUR_IP_ADDRESS

# Connect normally
ssh -i /path/to/key.pem root@YOUR_IP_ADDRESS

Type yes when prompted to accept the new host key.


GPU Issues

GPU Not Detected

Check GPU availability:

nvidia-smi

Errors:

  • "command not found"
  • "No devices found"

Troubleshooting:

  1. Wait 2-3 minutes after boot (drivers initializing)

  2. Reinstall NVIDIA drivers:

apt-get update
apt-get install -y --reinstall nvidia-driver-525
reboot
  3. Verify CUDA-enabled OS was selected during launch

  4. Contact support with instance ID if unresolved
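
As an additional check before contacting support, you can confirm the GPU is visible on the PCI bus even when the driver is not loaded (lspci is usually preinstalled as part of pciutils):

# List NVIDIA devices on the PCI bus; no output means no GPU is exposed to the instance
lspci | grep -i nvidia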


CUDA Version Mismatch

Error: "CUDA version mismatch" or "CUDA driver version is insufficient"

Solutions:

  • Install software compatible with CUDA version
  • Install different CUDA toolkit version
  • Use a Docker container with the required CUDA version (recommended; see the sketch below)

Check CUDA version:

nvcc --version
nvidia-smi # Shows max supported CUDA version
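
If you use the container route, a minimal sketch of running an image with a specific CUDA version (assuming Docker and the NVIDIA Container Toolkit are installed; the image tag is only an example):

# Run an example CUDA 11.8 container with GPU access and print the driver info
docker run --rm --gpus all nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi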

Out of Memory (OOM)

Error: "CUDA out of memory" or "RuntimeError: CUDA OOM"

Quick fix:

import torch
torch.cuda.empty_cache()

Solutions:

Solution                        When to Use
Reduce batch size               First thing to try
Enable gradient accumulation    Need large effective batch sizes
Use mixed precision (FP16)      Reduces memory ~50%
Clear unused tensors            After large operations
Use larger GPU                  Model too large for current GPU

Example: Mixed precision training

from torch.cuda.amp import autocast, GradScaler

# model, criterion, and optimizer are assumed to be defined elsewhere
scaler = GradScaler()

with autocast():
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Need more VRAM? Terminate the instance and launch one with a larger GPU.


Volume Issues

Volume Not Listed When Attaching

Possible causes:

  • Region mismatch (most common)
  • Volume already attached to another instance
  • Volume status is not active

Solution:

  1. Verify volume region exactly matches instance region
  2. Check volume status is active (not creating/deleting)
  3. Ensure volume not attached to another instance

Note: Region must match exactly. "Stockholm-1" ≠ "Stockholm-2"


Volume Not Mounted

Check if mounted:

df -h | grep /mnt/volume

No output means the volume is attached but not mounted.

Mount the volume:

# Verify disk exists
lsblk

# Create mount point
mkdir -p /mnt/volume

# Mount
mount /dev/sdb /mnt/volume

If filesystem errors:

# Check and repair
fsck /dev/sdb

# Auto-repair
fsck -y /dev/sdb

# Mount again
mount /dev/sdb /mnt/volume
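
If you want the mount to survive reboots, one common approach is an /etc/fstab entry keyed by filesystem UUID. This is only a sketch; the UUID and filesystem type below are placeholders, and you should confirm the platform does not already manage mounts for you:

# Find the filesystem UUID (device name may differ; check lsblk)
blkid /dev/sdb

# Example fstab entry; replace the UUID and filesystem type with the values from blkid
echo 'UUID=replace-with-uuid  /mnt/volume  ext4  defaults,nofail  0  2' >> /etc/fstab

# Mount everything listed in fstab without rebooting
mount -a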

Cannot Unmount Volume

Error: "device is busy" or "target is busy"

Cause: Process is using files on volume.

Solution:

# Find processes using volume
lsof /mnt/volume
# or
fuser -m /mnt/volume

# Change directory if inside volume
cd /root

# Stop processes using volume
fuser -km /mnt/volume

# Unmount
umount /mnt/volume

Note: You must cd out of the volume directory before unmounting.


Volume Full

Error: "No space left on device"

Find what's using space:

du -ah /mnt/volume | sort -rh | head -20

Free up space:

# Delete unnecessary files
rm -rf /mnt/volume/old-data/

# Compress large directories
tar czf archive.tar.gz /mnt/volume/large-directory/
rm -rf /mnt/volume/large-directory/

# Check space
df -h /mnt/volume

Need more space? Create a larger volume, transfer the data, then detach the old volume.
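
A minimal sketch of copying data onto a new volume before detaching the old one (the mount points are example paths and assume both volumes are attached to the same instance):

# Copy data from the old volume to the new one, preserving permissions and timestamps
rsync -a --progress /mnt/volume/ /mnt/new-volume/

# Verify the sizes roughly match before deleting anything
du -sh /mnt/volume /mnt/new-volume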


Startup Script Issues

Script Not Running

Check logs:

# Watch real-time
tail -f /var/log/cloud-init-output.log

# Search errors
grep -i error /var/log/cloud-init-output.log

Common problems:

Problem              Fix
Missing shebang      Add #!/bin/bash as the first line
Syntax error         Test locally with: bash -n script.sh
Package not found    Add apt-get update before installing
Network timeout      Add retry logic or increase the timeout (see the retry sketch below)

Example well-formed script:

#!/bin/bash
set -e # Exit on error

apt-get update
apt-get install -y python3-pip
pip3 install torch transformers
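
For the network-timeout case above, a minimal retry wrapper might look like this (the URL, filename, and retry counts are placeholders):

#!/bin/bash
# Retry a flaky download up to 5 times with a 10-second pause between attempts
for i in 1 2 3 4 5; do
    curl -fSL -o model.bin https://example.com/model.bin && break
    echo "Attempt $i failed, retrying in 10s..." >&2
    sleep 10
done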

Script Takes Too Long

Problem: Long-running commands block initialization.

Solution: Run long tasks in background.

Bad (blocks):

#!/bin/bash
python train.py

Good (background):

#!/bin/bash
# Option 1: nohup
nohup python train.py > /var/log/training.log 2>&1 &

# Option 2: tmux
tmux new -d -s training 'python train.py'

# Option 3: systemd service
cat > /etc/systemd/system/training.service << EOF
[Service]
ExecStart=/usr/bin/python train.py
WorkingDirectory=/root
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable training
systemctl start training
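
To check on a job started with any of the options above:

# nohup: follow the log file
tail -f /var/log/training.log

# tmux: attach to the session (detach again with Ctrl+B then D)
tmux attach -t training

# systemd: check service status and recent logs
systemctl status training
journalctl -u training -f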

Billing Issues

Balance Not Updating

Steps:

  1. Wait 60 seconds (processing delay)
  2. Check email for payment confirmation
  3. Refresh wallet page
  4. Contact support with transaction ID if unresolved

Unexpected Charges

Investigation checklist:

  • Check Resources page for active instances
  • Review Volumes tab for unattached volumes (still billed)
  • Check transaction history

Common causes:

  1. Forgot to terminate instance
  2. Volume left running when not needed
  3. Multiple instances active

Tip: Always terminate instances when not in use.


Payment Failed

Troubleshooting:

  1. Try different credit card
  2. Verify billing address matches card
  3. Contact bank (may have blocked international transaction)
  4. Ensure international payments enabled
  5. Try alternative payment method

Note: Charges appear as "Substrate AI" or "Stripe" on statements.


Emergency Situations

Instance Unresponsive

No restart feature available.

Solution: Terminate and launch new instance

Data impact:

  • Volumes: Preserved, can attach to new instance
  • Instance storage: Lost permanently

Important: Always use volumes for persistent data.


Accidental Termination

Cannot be undone.

  • Instance storage: Permanently lost
  • Attached volumes: Preserved

Prevention: Read confirmation prompts carefully and use descriptive instance names.


Compromised Instance

Immediate actions:

  1. Terminate instance immediately
  2. Delete compromised SSH key
  3. Generate and add new SSH key
  4. Review billing for unauthorized usage
  5. Contact support with details

Warning: Unauthorized usage can result in significant charges.


Support

Email: support@substrate.ai

Include in message:

  • Instance ID or Volume ID
  • Exact error message (copy/paste or screenshot)
  • Steps to reproduce
  • Relevant log files or command output

Response times:

Priority    Response Time
Critical    Within 1 hour
High        Within 4 hours
Normal      Within 24 hours

Tip: More details = faster resolution. Include screenshots, error messages, and IDs.