Troubleshooting
Quick solutions to common issues. Use Ctrl+F to search for specific error messages.
Instance Launch Issues
GPU Unavailable
Error: "This GPU configuration is no longer available"
Cause: Another user claimed the GPU while you were configuring.
Solutions:
- Try a different region
- Choose a different GPU type
- Wait 15-30 minutes and retry (GPUs become available as instances terminate)
Insufficient Balance
Error: "Insufficient wallet balance"
Solution:
- Click "Add Funds" in wallet section
- Add minimum required amount (shown in error)
- Retry launch
Note: First hour is always charged immediately.
No SSH Keys
Error: "No SSH keys found"
Solution:
- Navigate to SSH Keys page
- Upload existing key or generate new one
- Return to launch page
Stuck in "Creating" Status
Normal duration: 5-10 minutes
If it exceeds 15 minutes:
- Status will change to `failed`
- First hour charge is automatically refunded
- Try different region or GPU type
- Contact support if recurring
Launch Failed
Status: failed
Common causes:
- Cloud provider capacity issue
- Configuration error
- Network problem
Troubleshooting:
- Review error message
- Try different region
- Simplify configuration (remove startup script and volumes temporarily)
- Contact support if persistent
SSH Connection Issues
Connection Timed Out
Verify before debugging:
- Instance status is `active` (not creating/pending)
- Using the correct IP address from the Resources page
- Using the correct private key
- Key has proper permissions: `chmod 600 key.pem`
- Firewall allows SSH (port 22)
- Waited at least 1-2 minutes after the instance became active
Debug with verbose output:
ssh -v -i /path/to/key.pem root@YOUR_IP_ADDRESS
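When a timeout gives no clues, it helps to first confirm that port 22 is reachable at all, separately from any key problems. A minimal sketch using bash's built-in `/dev/tcp` (the IP address is a placeholder):

```shell
#!/bin/bash
# Sketch: fast reachability probe for port 22, so you are not stuck
# waiting for the full SSH timeout. The IP below is a placeholder.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

if port_open 203.0.113.10 22; then
    echo "port 22 reachable - debug keys/auth next"
else
    echo "port 22 blocked or instance not ready yet"
fi
```

If the port is closed, recheck the firewall and instance status before touching keys at all.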
Permission Denied
Error: "Permission denied (publickey)"
Common cause: Wrong private key or incorrect permissions
Solution:
# Set correct permissions
chmod 600 /path/to/key.pem
# Verify key is valid
ssh-keygen -lf /path/to/key.pem
# Add to SSH agent (optional)
ssh-add /path/to/key.pem
# Connect
ssh -i /path/to/key.pem root@YOUR_IP_ADDRESS
Note: Use the private key file (without .pub extension), not the public key.
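If you reconnect often, a `Host` entry in `~/.ssh/config` saves retyping the key path and IP. A sketch; the alias, IP, and key path are all placeholders:

```shell
# Append a host alias to your SSH config (all values are placeholders)
mkdir -p ~/.ssh
cat >> ~/.ssh/config << 'EOF'
Host gpu-box
    HostName 203.0.113.10
    User root
    IdentityFile ~/.ssh/key.pem
EOF
```

After this, `ssh gpu-box` replaces the full `ssh -i ... root@...` command.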
Connection Refused
Possible causes:
- SSH service hasn't started yet
- Wrong port number
- Instance suspended due to low balance
Solution:
- Wait 2-3 minutes after the instance shows `active` status
- Verify instance status on the Resources page
- If suspended: add funds to wallet
Host Key Changed
Error: "REMOTE HOST IDENTIFICATION HAS CHANGED"
Cause: Connecting to new instance with same IP as previous instance.
Solution:
# Remove old host key
ssh-keygen -R YOUR_IP_ADDRESS
# Connect normally
ssh -i /path/to/key.pem root@YOUR_IP_ADDRESS
Type `yes` when prompted to accept the new host key.
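If your provider recycles IPs often, OpenSSH (7.6+) offers a middle ground: the `accept-new` policy trusts first-seen hosts automatically but still refuses hosts whose key has changed. A sketch; the address pattern is a placeholder for your provider's IP range:

```shell
# Auto-accept keys for brand-new hosts in a placeholder IP range,
# while still rejecting changed keys (requires OpenSSH 7.6+)
mkdir -p ~/.ssh
cat >> ~/.ssh/config << 'EOF'
Host 203.0.113.*
    StrictHostKeyChecking accept-new
EOF
```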
GPU Issues
GPU Not Detected
Check GPU availability:
nvidia-smi
Errors: "command not found" or "No devices found"
Troubleshooting:
- Wait 2-3 minutes after boot (drivers initializing)
- Reinstall NVIDIA drivers:
apt-get update
apt-get install -y --reinstall nvidia-driver-525
reboot
- Verify a CUDA-enabled OS was selected during launch
- Contact support with the instance ID if unresolved
CUDA Version Mismatch
Error: "CUDA version mismatch" or "CUDA driver version is insufficient"
Solutions:
- Install software compatible with CUDA version
- Install different CUDA toolkit version
- Use Docker container with required CUDA version (recommended)
Check CUDA version:
nvcc --version
nvidia-smi # Shows max supported CUDA version
Out of Memory (OOM)
Error: "CUDA out of memory" or "RuntimeError: CUDA OOM"
Quick fix:
import torch
torch.cuda.empty_cache()  # releases cached blocks; tensors still referenced stay allocated
Solutions:
| Solution | When to Use |
|---|---|
| Reduce batch size | First thing to try |
| Enable gradient accumulation | Need large effective batch sizes |
| Use mixed precision (FP16) | Reduces memory ~50% |
| Clear unused tensors | After large operations |
| Use larger GPU | Model too large for current GPU |
Example: Mixed precision training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()  # scaled loss avoids FP16 gradient underflow
scaler.step(optimizer)
scaler.update()
Need more VRAM? Terminate instance and launch with larger GPU.
Volume Issues
Volume Not Listed When Attaching
Possible causes:
- Region mismatch (most common)
- Volume already attached to another instance
- Volume status is not `active`
Solution:
- Verify volume region exactly matches instance region
- Check volume status is `active` (not creating/deleting)
- Ensure the volume is not attached to another instance
Note: Region must match exactly. "Stockholm-1" ≠ "Stockholm-2"
Volume Not Mounted
Check if mounted:
df -h | grep /mnt/volume
No output means the volume is attached but not mounted.
Mount the volume:
# Verify disk exists
lsblk
# Create mount point
mkdir -p /mnt/volume
# Brand-new volume with no filesystem yet? Create one first (erases all data):
# mkfs.ext4 /dev/sdb
# Mount
mount /dev/sdb /mnt/volume
If filesystem errors:
# Check and repair
fsck /dev/sdb
# Auto-repair
fsck -y /dev/sdb
# Mount again
mount /dev/sdb /mnt/volume
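A manual mount disappears on reboot. To persist it, add an fstab entry; the device, mount point, and ext4 filesystem type are assumptions carried over from the commands above, and `nofail` prevents boot from hanging if the volume is detached:

```shell
# Persist the mount across reboots (run as root; ext4 is an assumption)
echo '/dev/sdb /mnt/volume ext4 defaults,nofail 0 2' >> /etc/fstab
mount -a    # re-reads fstab; an error here means the entry is wrong
```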
Cannot Unmount Volume
Error: "device is busy" or "target is busy"
Cause: Process is using files on volume.
Solution:
# Find processes using volume
lsof /mnt/volume
# or
fuser -m /mnt/volume
# Change directory if inside volume
cd /root
# Stop processes using volume
fuser -km /mnt/volume
# Unmount
umount /mnt/volume
Note: You must cd out of volume directory before unmounting.
Volume Full
Error: "No space left on device"
Find what's using space:
du -ah /mnt/volume | sort -rh | head -20
Free up space:
# Delete unnecessary files
rm -rf /mnt/volume/old-data/
# Compress large directories
tar czf archive.tar.gz /mnt/volume/large-directory/
rm -rf /mnt/volume/large-directory/
# Check space
df -h /mnt/volume
Need more space? Create a larger volume, transfer the data, then detach the old volume.
Startup Script Issues
Script Not Running
Check logs:
# Watch real-time
tail -f /var/log/cloud-init-output.log
# Search errors
grep -i error /var/log/cloud-init-output.log
Common problems:
| Problem | Fix |
|---|---|
| Missing shebang | Add #!/bin/bash as first line |
| Syntax error | Test locally: bash -n script.sh |
| Package not found | Add apt-get update before installing |
| Network timeout | Add retry logic or increase timeout |
Example well-formed script:
#!/bin/bash
set -e # Exit on error
apt-get update
apt-get install -y python3-pip
pip3 install torch transformers
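For the "network timeout" row in the table above, a small retry wrapper keeps one flaky step from failing the whole script. A sketch; the commands in the usage comments are examples:

```shell
#!/bin/bash
# Sketch: retry a command up to 5 times with growing delays
retry() {
    local n=0 max=5
    until "$@"; do
        n=$((n + 1))
        [ "$n" -ge "$max" ] && return 1
        sleep "$n"          # 1s, 2s, 3s, ... between attempts
    done
}

# In a startup script:
# retry apt-get update
# retry pip3 install torch
```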
Script Takes Too Long
Problem: Long-running commands block initialization.
Solution: Run long tasks in background.
Bad (blocks):
#!/bin/bash
python train.py
Good (background):
#!/bin/bash
# Option 1: nohup
nohup python train.py > /var/log/training.log 2>&1 &
# Option 2: tmux
tmux new -d -s training 'python train.py'
# Option 3: systemd service
cat > /etc/systemd/system/training.service << EOF
[Service]
ExecStart=/usr/bin/python train.py
WorkingDirectory=/root
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable training
systemctl start training
Billing Issues
Balance Not Updating
Steps:
- Wait 60 seconds (processing delay)
- Check email for payment confirmation
- Refresh wallet page
- Contact support with transaction ID if unresolved
Unexpected Charges
Investigation checklist:
- Check Resources page for active instances
- Review Volumes tab for unattached volumes (still billed)
- Check transaction history
Common causes:
- Forgot to terminate instance
- Volume left running when not needed
- Multiple instances active
Tip: Always terminate instances when not in use.
Payment Failed
Troubleshooting:
- Try different credit card
- Verify billing address matches card
- Contact bank (may have blocked international transaction)
- Ensure international payments enabled
- Try alternative payment method
Note: Charges appear as "Substrate AI" or "Stripe" on statements.
Emergency Situations
Instance Unresponsive
No restart feature available.
Solution: Terminate and launch new instance
Data impact:
- Volumes: Preserved, can attach to new instance
- Instance storage: Lost permanently
Important: Always use volumes for persistent data.
Accidental Termination
Cannot be undone.
- Instance storage: Permanently lost
- Attached volumes: Preserved
Prevention: Read confirmation prompts carefully and use descriptive instance names.
Compromised Instance
Immediate actions:
- Terminate instance immediately
- Delete compromised SSH key
- Generate and add new SSH key
- Review billing for unauthorized usage
- Contact support with details
Warning: Unauthorized usage can result in significant charges.
Support
Email: support@substrate.ai
Include in message:
- Instance ID or Volume ID
- Exact error message (copy/paste or screenshot)
- Steps to reproduce
- Relevant log files or command output
Response times:
| Priority | Response Time |
|---|---|
| Critical | Within 1 hour |
| High | Within 4 hours |
| Normal | Within 24 hours |
Tip: More details = faster resolution. Include screenshots, error messages, and IDs.