Advanced Configuration
Advanced features for automating deployments and customizing instances.
Environment Variables
Environment variables are key-value pairs available in your instance's shell environment.
Common uses:
- API keys and credentials
- Configuration parameters
- Application settings
- Service endpoints
Setting variables:
Add them during instance launch in the Advanced Configuration section by entering key-value pairs.
Accessing in instance:
Bash:
echo $VARIABLE_NAME
Python:
import os
value = os.environ.get('VARIABLE_NAME')
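For example, a startup script can check that a required variable was actually set at launch before using it. A minimal sketch; the names API_KEY and API_ENDPOINT are placeholders:
#!/bin/bash
# API_KEY and API_ENDPOINT are hypothetical variables set at launch
if [ -z "${API_KEY:-}" ]; then
    echo "API_KEY is not set" >&2
    exit 1
fi
curl -H "Authorization: Bearer $API_KEY" "$API_ENDPOINT/status"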
Security note: Values are encrypted in storage but visible to organization members in the console.
Startup Scripts
Bash scripts executed automatically on first boot as the root user.
Common uses:
- Install software packages
- Download datasets or models
- Configure services
- Start background jobs
Basic example:
#!/bin/bash
# Install system packages
apt-get update
apt-get install -y htop git
# Install Python dependencies
pip3 install torch torchvision
# Clone the project into /root
cd /root
git clone https://github.com/your-org/project.git
Best practices:
- Start with #!/bin/bash
- Use set -e to exit on errors
- Keep scripts under 5 minutes
- Use tmux or nohup for long-running tasks
For long-running processes:
#!/bin/bash
# Don't block instance startup
tmux new-session -d -s training 'python3 train.py'
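After SSHing into the instance, you can reattach to that session to check progress (standard tmux usage; the session name matches the example above):
# Reattach to the background training session
tmux attach -t training
# Detach again with Ctrl-b d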
View script output:
tail -f /var/log/cloud-init-output.log
Multi-GPU Configuration
For instances with multiple GPUs (2x, 4x, 8x):
Verify GPUs:
nvidia-smi --list-gpus
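It also helps to check how the GPUs are interconnected, since NVLink versus PCIe links affects multi-GPU scaling:
# Show the interconnect topology between GPUs (NVLink, PCIe, etc.)
nvidia-smi topo -m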
Environment variables:
# Use all GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Use specific GPUs
export CUDA_VISIBLE_DEVICES=0,2
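The variable can also be set for a single command instead of the whole shell, which is handy when running several jobs on different GPUs (train.py is a placeholder script name):
# Restrict just this process to GPU 1
CUDA_VISIBLE_DEVICES=1 python3 train.py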
PyTorch example:
import torch
# Check GPU count
gpu_count = torch.cuda.device_count()
# DataParallel: single process, simplest to set up
model = torch.nn.DataParallel(model)
# DistributedDataParallel: one process per GPU, better scaling (launch with torchrun)
torch.distributed.init_process_group(backend='nccl')
model = torch.nn.parallel.DistributedDataParallel(model)
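DistributedDataParallel expects one process per GPU; a common way to launch it is with torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker (train.py is a placeholder; adjust --nproc_per_node to your GPU count):
# Launch one worker process per GPU on a 4-GPU instance
torchrun --nproc_per_node=4 train.py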
System Optimization
GPU performance settings:
# Set persistence mode
nvidia-smi -pm 1
# Set power limit (if supported)
nvidia-smi -pl 300 # 300W
# Set compute mode
nvidia-smi -c EXCLUSIVE_PROCESS
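To confirm the settings took effect, you can query them back (these nvidia-smi query fields are standard, though availability varies by GPU and driver):
# Verify persistence and compute mode
nvidia-smi --query-gpu=persistence_mode,compute_mode --format=csv
# Verify the current power limit
nvidia-smi -q -d POWER | grep -i "power limit"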
Network buffer sizes:
echo "net.core.rmem_max = 134217728" >> /etc/sysctl.conf
echo "net.core.wmem_max = 134217728" >> /etc/sysctl.conf
sysctl -p
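You can confirm the new values are active with sysctl:
# Print the current values
sysctl net.core.rmem_max net.core.wmem_max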
Example Configurations
Machine Learning Training:
#!/bin/bash
set -e
apt-get update
pip3 install torch transformers accelerate wandb
cd /root
git clone https://github.com/your-org/ml-project.git
cd ml-project
pip3 install -r requirements.txt
# Login to W&B
wandb login $WANDB_API_KEY
# Start training in background
tmux new-session -d -s training 'python3 train.py'
Jupyter Notebook Server:
#!/bin/bash
pip3 install jupyter
jupyter notebook --generate-config
echo "c.NotebookApp.password = 'YOUR_HASH'" >> ~/.jupyter/jupyter_notebook_config.py
nohup jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root &
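To generate the password hash used above, one option is the passwd helper from the classic notebook package (a sketch; assumes notebook < 7, where the helper lives in notebook.auth):
# Print a hashed password to paste into the config file
python3 -c "from notebook.auth import passwd; print(passwd())"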
Auto-shutdown After Job:
#!/bin/bash
# Run job and shutdown when complete
python3 train.py && shutdown -h now
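If you want the instance to power off even when the job fails, so you are not billed for an idle machine, one variant is to shut down unconditionally and keep a log (the log path is illustrative):
#!/bin/bash
# Shut down whether the job succeeds or fails; keep output for later review
python3 train.py > /root/train.log 2>&1
shutdown -h now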
Troubleshooting
Script didn't run: Check logs:
tail -f /var/log/cloud-init-output.log
grep -i error /var/log/cloud-init-output.log
Common issues:
- Missing shebang (#!/bin/bash)
- Syntax errors (test with bash -n script.sh)
- Package not found (run apt-get update first)
- Network timeouts (add retry logic; see the sketch below)
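A minimal retry sketch for flaky downloads (the URL, attempt count, and delay are placeholders):
#!/bin/bash
# Retry a download up to 5 times with a short pause between attempts
for i in 1 2 3 4 5; do
    curl -fLO https://example.com/dataset.tar.gz && break
    echo "Download failed (attempt $i), retrying in 10s..." >&2
    sleep 10
done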
Environment variables not available:
# List all variables
env
# Persist for new sessions (/etc/environment uses plain KEY=value, no "export")
echo "VAR=value" >> /etc/environment