Advanced Configuration

Advanced features for automating deployments and customizing instances.

Environment Variables

Environment variables are key-value pairs available in your instance's shell environment.

Common uses:

  • API keys and credentials
  • Configuration parameters
  • Application settings
  • Service endpoints

Setting variables:

Add variables during instance launch, in the Advanced Configuration section, by entering them as key-value pairs.

Accessing in instance:

Bash:

echo $VARIABLE_NAME

Python:

import os

# Returns None if VARIABLE_NAME is not set
value = os.environ.get('VARIABLE_NAME')
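
In shell scripts you can also supply a fallback for a variable that might not have been set at launch (MODEL_NAME and its default here are placeholders):

# Fall back to a default if MODEL_NAME was not set
MODEL_NAME="${MODEL_NAME:-resnet50}"
echo "Using model: $MODEL_NAME"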

Security note: Values are encrypted in storage but visible to organization members in the console.

Startup Scripts

Bash scripts that run automatically on first boot as the root user.

Common uses:

  • Install software packages
  • Download datasets or models
  • Configure services
  • Start background jobs

Basic example:

#!/bin/bash
apt-get update
apt-get install -y htop git
pip3 install torch torchvision
cd /root
git clone https://github.com/your-org/project.git

Best practices:

  • Start with #!/bin/bash
  • Use set -e to exit on errors
  • Keep scripts under 5 minutes
  • Use tmux or nohup for long-running tasks

For long-running processes:

#!/bin/bash
# Don't block instance startup
tmux new-session -d -s training 'python3 train.py'
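
Equivalently with nohup, which the best practices above also mention (the script name and log path are placeholders):

#!/bin/bash
# Don't block instance startup; capture output in a log file
nohup python3 train.py > /root/train.log 2>&1 &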

View script output:

tail -f /var/log/cloud-init-output.log

Multi-GPU Configuration

For instances with multiple GPUs (2x, 4x, 8x):

Verify GPUs:

nvidia-smi --list-gpus

Environment variables:

# Use all GPUs
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Use specific GPUs
export CUDA_VISIBLE_DEVICES=0,2
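
Frameworks renumber the visible devices starting from zero; a quick check (assuming PyTorch is installed):

# With two physical GPUs exposed, PyTorch reports 2 devices (indices 0 and 1)
CUDA_VISIBLE_DEVICES=0,2 python3 -c "import torch; print(torch.cuda.device_count())"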

PyTorch example:

import torch

# Check how many GPUs are visible to this process
gpu_count = torch.cuda.device_count()

# Option 1: DataParallel (single process; simplest to enable)
# 'model' is your torch.nn.Module, already moved to a GPU
model = torch.nn.DataParallel(model)

# Option 2: DistributedDataParallel (one process per GPU; better scaling)
# Requires a distributed launcher to set rank and world-size variables
torch.distributed.init_process_group(backend='nccl')
model = torch.nn.parallel.DistributedDataParallel(model)
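
DistributedDataParallel needs one process per GPU; with a recent PyTorch it is typically launched via torchrun (train.py is a placeholder):

# Start one process per GPU on a single 4-GPU instance
torchrun --nproc_per_node=4 train.py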

System Optimization

GPU performance settings:

# Set persistence mode
nvidia-smi -pm 1

# Set power limit (if supported)
nvidia-smi -pl 300 # 300W

# Set compute mode
nvidia-smi -c EXCLUSIVE_PROCESS
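
To confirm the settings took effect (these query fields should be available in recent nvidia-smi versions):

# Report persistence mode, power limit, and compute mode per GPU
nvidia-smi --query-gpu=name,persistence_mode,power.limit,compute_mode --format=csv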

Network buffer sizes:

echo "net.core.rmem_max = 134217728" >> /etc/sysctl.conf
echo "net.core.wmem_max = 134217728" >> /etc/sysctl.conf
sysctl -p
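
Verify the new limits before starting network-heavy jobs:

sysctl net.core.rmem_max net.core.wmem_max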

Example Configurations

Machine Learning Training:

#!/bin/bash
set -e

apt-get update
pip3 install torch transformers accelerate wandb

cd /root
git clone https://github.com/your-org/ml-project.git
cd ml-project
pip3 install -r requirements.txt

# Log in to W&B using the WANDB_API_KEY environment variable set at launch
wandb login $WANDB_API_KEY

# Start training in background
tmux new-session -d -s training 'python3 train.py'

Jupyter Notebook Server:

#!/bin/bash
pip3 install jupyter

jupyter notebook --generate-config
echo "c.NotebookApp.password = 'YOUR_HASH'" >> ~/.jupyter/jupyter_notebook_config.py

nohup jupyter notebook --ip=0.0.0.0 --port=8888 --allow-root &
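
YOUR_HASH is a hashed password. With the classic notebook package (pre-7.x) it can usually be generated like this; adjust for your notebook version:

# Prints a password hash to paste into the config file
python3 -c "from notebook.auth import passwd; print(passwd('choose-a-password'))"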

Auto-shutdown After Job:

#!/bin/bash
# Run job and shutdown when complete
python3 train.py && shutdown -h now
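
To shut the instance down even when the job fails, a trap is one option (a sketch):

#!/bin/bash
# Shut down when the script exits, whether train.py succeeded or not
trap 'shutdown -h now' EXIT
python3 train.py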

Troubleshooting

Script didn't run: check the cloud-init logs:

tail -f /var/log/cloud-init-output.log
grep -i error /var/log/cloud-init-output.log

Common issues:

  • Missing shebang (#!/bin/bash)
  • Syntax errors (test with bash -n script.sh)
  • Package not found (run apt-get update first)
  • Network timeouts (add retry logic; see the sketch below)
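
A minimal retry sketch for flaky network steps (the repository URL is the same placeholder used above):

# Retry a network-dependent command up to 5 times before giving up
for attempt in 1 2 3 4 5; do
    git clone https://github.com/your-org/project.git && break
    echo "Attempt $attempt failed; retrying in 10s" >&2
    sleep 10
done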

Environment variables not available:

# List all variables
env

# Persist via /etc/environment (plain KEY=value lines, no "export";
# takes effect for new login sessions)
echo "VAR=value" >> /etc/environment