Troubleshooting

Quick solutions to common issues. Use Ctrl+F to search for specific error messages.

Instance Launch Issues

GPU Unavailable

Error: "This GPU configuration is no longer available"

Cause: Another user claimed the GPU while you were configuring.

Solutions:

  • Try a different region
  • Choose a different GPU type
  • Wait 15-30 minutes and retry (GPUs become available as instances terminate)

Insufficient Balance

Error: "Insufficient wallet balance"

Solution:

  1. Click "Add Funds" in wallet section
  2. Add minimum required amount (shown in error)
  3. Retry launch

Note: First hour is always charged immediately.


No SSH Keys

Error: "No SSH keys found"

Solution:

  1. Navigate to SSH Keys page
  2. Upload an existing key or generate a new one (see the sketch after this list)
  3. Return to launch page
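
If you need to generate a key pair locally first, a minimal sketch (the file path and comment are example placeholders):

# Generate a new ed25519 key pair (path and comment are examples)
ssh-keygen -t ed25519 -f ~/.ssh/substrate_key -C "substrate-instance"

# Print the public key to paste into the SSH Keys page
cat ~/.ssh/substrate_key.pub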

Stuck in "Creating" Status

Normal duration: 5-10 minutes

If exceeds 15 minutes:

  • Status will change to failed
  • First hour charge automatically refunded
  • Try different region or GPU type
  • Contact support if recurring

Launch Failed

Status: failed

Common causes:

  • Cloud provider capacity issue
  • Configuration error
  • Network problem

Troubleshooting:

  1. Review error message
  2. Try different region
  3. Simplify configuration (remove startup script and volumes temporarily)
  4. Contact support if persistent

SSH Connection Issues

Connection Timed Out

Verify before debugging:

  • Instance status is active (not creating/pending)
  • Using correct IP address from Resources page
  • Using correct private key
  • Key has proper permissions: chmod 600 key.pem
  • Firewall allows SSH (port 22)
  • Waited at least 1-2 minutes after instance became active

Debug with verbose output:

ssh -v -i /path/to/key.pem root@YOUR_IP_ADDRESS
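
To check whether port 22 on the instance is reachable at all (assuming netcat is installed on your local machine):

# Test TCP connectivity to the SSH port
nc -zv YOUR_IP_ADDRESS 22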

Permission Denied

Error: "Permission denied (publickey)"

Common cause: Wrong private key or incorrect permissions

Solution:

# Set correct permissions
chmod 600 /path/to/key.pem

# Verify key is valid
ssh-keygen -lf /path/to/key.pem

# Add to SSH agent (optional)
ssh-add /path/to/key.pem

# Connect
ssh -i /path/to/key.pem root@YOUR_IP_ADDRESS

Note: Use the private key file (without .pub extension), not the public key.


Connection Refused

Possible causes:

  • SSH service hasn't started yet
  • Wrong port number
  • Instance suspended due to low balance

Solution:

  1. Wait 2-3 minutes after instance shows active status
  2. Verify instance status on Resources page
  3. If suspended: add funds to wallet

Host Key Changed

Error: "REMOTE HOST IDENTIFICATION HAS CHANGED"

Cause: Connecting to new instance with same IP as previous instance.

Solution:

# Remove old host key
ssh-keygen -R YOUR_IP_ADDRESS

# Connect normally
ssh -i /path/to/key.pem root@YOUR_IP_ADDRESS

Type yes when prompted to accept the new host key.


GPU Issues

GPU Not Detected

Check GPU availability:

nvidia-smi

Errors:

  • "command not found"
  • "No devices found"

Troubleshooting:

  1. Wait 2-3 minutes after boot (drivers initializing)

  2. Reinstall NVIDIA drivers:

apt-get update
apt-get install -y --reinstall nvidia-driver-525
reboot
  3. Verify CUDA-enabled OS was selected during launch

  4. Contact support with instance ID if unresolved
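
As an additional check before contacting support, you can confirm the GPU is visible on the PCI bus even when the driver is not loaded (lspci is usually preinstalled as part of pciutils):

# List NVIDIA devices on the PCI bus; no output means no GPU is exposed to the instance
lspci | grep -i nvidia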


CUDA Version Mismatch

Error: "CUDA version mismatch" or "CUDA driver version is insufficient"

Solutions:

  • Install software compatible with CUDA version
  • Install different CUDA toolkit version
  • Use a Docker container with the required CUDA version (recommended; see the sketch below)

Check CUDA version:

nvcc --version
nvidia-smi # Shows max supported CUDA version
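
If you use the container route, a minimal sketch of running an image with a specific CUDA version (assuming Docker and the NVIDIA Container Toolkit are installed; the image tag is only an example):

# Run an example CUDA 11.8 container with GPU access and print the driver info
docker run --rm --gpus all nvidia/cuda:11.8.0-runtime-ubuntu22.04 nvidia-smi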

Out of Memory (OOM)

Error: "CUDA out of memory" or "RuntimeError: CUDA OOM"

Quick fix:

import torch
torch.cuda.empty_cache()

Solutions:

Solution                        When to Use
Reduce batch size               First thing to try
Enable gradient accumulation    Need large effective batch sizes
Use mixed precision (FP16)      Reduces memory ~50%
Clear unused tensors            After large operations
Use larger GPU                  Model too large for current GPU

Example: Mixed precision training

from torch.cuda.amp import autocast, GradScaler

# model, criterion, and optimizer are assumed to be defined elsewhere
scaler = GradScaler()

with autocast():
    output = model(input)
    loss = criterion(output, target)

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

Need more VRAM? Terminate the instance and launch one with a larger GPU.


Volume Issues

Volume Not Listed When Attaching

Possible causes:

  • Region mismatch (most common)
  • Volume already attached to another instance
  • Volume status is not active

Solution:

  1. Verify volume region exactly matches instance region
  2. Check volume status is active (not creating/deleting)
  3. Ensure volume not attached to another instance

Note: Region must match exactly. "Stockholm-1" ≠ "Stockholm-2"


Volume Not Mounted

Check if mounted:

df -h | grep /mnt/volume

No output means the volume is attached but not mounted.

Mount the volume:

# Verify disk exists
lsblk

# Create mount point
mkdir -p /mnt/volume

# Mount
mount /dev/sdb /mnt/volume

If filesystem errors:

# Check and repair
fsck /dev/sdb

# Auto-repair
fsck -y /dev/sdb

# Mount again
mount /dev/sdb /mnt/volume
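
If you want the mount to survive reboots, one common approach is an /etc/fstab entry keyed by filesystem UUID. This is only a sketch; the UUID and filesystem type below are placeholders, and you should confirm the platform does not already manage mounts for you:

# Find the filesystem UUID (device name may differ; check lsblk)
blkid /dev/sdb

# Example fstab entry; replace the UUID and filesystem type with the values from blkid
echo 'UUID=replace-with-uuid  /mnt/volume  ext4  defaults,nofail  0  2' >> /etc/fstab

# Mount everything listed in fstab without rebooting
mount -a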

Cannot Unmount Volume

Error: "device is busy" or "target is busy"

Cause: Process is using files on volume.

Solution:

# Find processes using volume
lsof /mnt/volume
# or
fuser -m /mnt/volume

# Change directory if inside volume
cd /root

# Stop processes using volume
fuser -km /mnt/volume

# Unmount
umount /mnt/volume

Note: You must cd out of the volume directory before unmounting.


Volume Full

Error: "No space left on device"

Find what's using space:

du -ah /mnt/volume | sort -rh | head -20

Free up space:

# Delete unnecessary files
rm -rf /mnt/volume/old-data/

# Compress large directories
tar czf archive.tar.gz /mnt/volume/large-directory/
rm -rf /mnt/volume/large-directory/

# Check space
df -h /mnt/volume

Need more space? Create a larger volume, transfer the data, then detach the old volume.
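
A minimal sketch of copying data onto a new volume before detaching the old one (the mount points are example paths and assume both volumes are attached to the same instance):

# Copy data from the old volume to the new one, preserving permissions and timestamps
rsync -a --progress /mnt/volume/ /mnt/new-volume/

# Verify the sizes roughly match before deleting anything
du -sh /mnt/volume /mnt/new-volume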


Startup Script Issues

Script Not Running

Check logs:

# Watch real-time
tail -f /var/log/cloud-init-output.log

# Search errors
grep -i error /var/log/cloud-init-output.log

Common problems:

Problem              Fix
Missing shebang      Add #!/bin/bash as the first line
Syntax error         Test locally with: bash -n script.sh
Package not found    Add apt-get update before installing
Network timeout      Add retry logic or increase the timeout (see the retry sketch below)

Example well-formed script:

#!/bin/bash
set -e # Exit on error

apt-get update
apt-get install -y python3-pip
pip3 install torch transformers
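
For the network-timeout case above, a minimal retry wrapper might look like this (the URL, filename, and retry counts are placeholders):

#!/bin/bash
# Retry a flaky download up to 5 times with a 10-second pause between attempts
for i in 1 2 3 4 5; do
    curl -fSL -o model.bin https://example.com/model.bin && break
    echo "Attempt $i failed, retrying in 10s..." >&2
    sleep 10
done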

Script Takes Too Long

Problem: Long-running commands block initialization.

Solution: Run long tasks in background.

Bad (blocks):

#!/bin/bash
python train.py

Good (background):

#!/bin/bash
# Option 1: nohup
nohup python train.py > /var/log/training.log 2>&1 &

# Option 2: tmux
tmux new -d -s training 'python train.py'

# Option 3: systemd service
cat > /etc/systemd/system/training.service << EOF
[Service]
ExecStart=/usr/bin/python train.py
WorkingDirectory=/root
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable training
systemctl start training
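
To check on a job started with any of the options above:

# nohup: follow the log file
tail -f /var/log/training.log

# tmux: attach to the session (detach again with Ctrl+B then D)
tmux attach -t training

# systemd: check service status and recent logs
systemctl status training
journalctl -u training -f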

Billing Issues

Balance Not Updating

Steps:

  1. Wait 60 seconds (processing delay)
  2. Check email for payment confirmation
  3. Refresh wallet page
  4. Contact support with transaction ID if unresolved

Unexpected Charges

Investigation checklist:

  • Check Resources page for active instances
  • Review Volumes tab for unattached volumes (still billed)
  • Check transaction history

Common causes:

  1. Forgot to terminate instance
  2. Volume left running when not needed
  3. Multiple instances active

Tip: Always terminate instances when not in use.


Payment Failed

Troubleshooting:

  1. Try different credit card
  2. Verify billing address matches card
  3. Contact bank (may have blocked international transaction)
  4. Ensure international payments enabled
  5. Try alternative payment method

Note: Charges appear as "Substrate AI" or "Stripe" on statements.


Emergency Situations

Instance Unresponsive

No restart feature available.

Solution: Terminate and launch new instance

Data impact:

  • Volumes: Preserved, can attach to new instance
  • Instance storage: Lost permanently

Important: Always use volumes for persistent data.


Accidental Termination

Cannot be undone.

  • Instance storage: Permanently lost
  • Attached volumes: Preserved

Prevention: Read confirmation prompts carefully and use descriptive instance names.


Compromised Instance

Immediate actions:

  1. Terminate instance immediately
  2. Delete compromised SSH key
  3. Generate and add new SSH key
  4. Review billing for unauthorized usage
  5. Contact support with details

Warning: Unauthorized usage can result in significant charges.


Support

Email: support@substrate.ai

Include in message:

  • Instance ID or Volume ID
  • Exact error message (copy/paste or screenshot)
  • Steps to reproduce
  • Relevant log files or command output

Response times:

Priority    Response Time
Critical    Within 1 hour
High        Within 4 hours
Normal      Within 24 hours

Tip: More details = faster resolution. Include screenshots, error messages, and IDs.