Small example of using PyTorch Lightning with PBS

#!/bin/bash

## Required PBS Directives --------------------------------------
#PBS -A <account>
#PBS -q <queue>
#PBS -l select=4:ncpus=40:mpiprocs=1:nmlas=2
#PBS -l walltime=24:00:00

# Configure Cluster
export MASTER_PORT=8148
NODES=$(sort -u "$PBS_NODEFILE")
MASTER_ADDR=$(echo $NODES | cut -d ' ' -f 1)
export WORLD_SIZE=$(wc -l < $PBS_NODEFILE) 

# Debugging flags
export NCCL_P2P_DISABLE=0
export NCCL_DEBUG=WARN

echo "  MASTER_ADDR is: $MASTER_ADDR"
echo "  MASTER_PORT is: $MASTER_PORT"
echo "  WORLD_SIZE is : $WORLD_SIZE"

# Loop through all nodes and start the training.
NODE_RANK=0
for NODE in $NODES; do
    echo "Starting training on $NODE with NODE_RANK $NODE_RANK, MASTER_ADDR $MASTER_ADDR, MASTER_PORT $MASTER_PORT"
    # Ensure no spaces after the backslash ending each line.
    ssh $NODE " \
        export MASTER_ADDR=$MASTER_ADDR; \
        export MASTER_PORT=$MASTER_PORT; \
        export WORLD_SIZE=$WORLD_SIZE; \
        export NODE_RANK=$NODE_RANK; \
        source ~/miniconda3/bin/activate <your env>; \
        cd <your dir>; \
        echo \"   On $NODE with NODE_RANK $NODE_RANK, MASTER_ADDR $MASTER_ADDR, MASTER_PORT $MASTER_PORT, WORLD_SIZE $WORLD_SIZE\"; \
        python your_code.py fit --config configs/your_yaml.yaml --trainer.strategy=ddp --trainer.num_nodes=$WORLD_SIZE; \
    " &
    NODE_RANK=$((NODE_RANK + 1))
done

# Wait for all child processes to complete.
wait

This script fetches the node information from PBS; that information is only available on the first node of the allocation, where the batch script runs. It then ssh's into each node allocated to your job, sets the environment variables that tell PyTorch Lightning how to reach the master node, and launches your training script. The node you assign NODE_RANK=0 acts as the master node.
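For reference, the cluster-setup lines at the top of the script can be exercised in isolation. The sketch below fakes a four-node $PBS_NODEFILE (the real file is created by PBS when the job starts, and the hostnames here are made up) and derives the same values the launcher uses:

```shell
# Fake a nodefile to illustrate; with mpiprocs=1 PBS writes one line per node.
PBS_NODEFILE=$(mktemp)
printf 'node01\nnode02\nnode03\nnode04\n' > "$PBS_NODEFILE"

NODES=$(sort -u "$PBS_NODEFILE")              # unique node names, one per line
MASTER_ADDR=$(echo $NODES | cut -d ' ' -f 1)  # first node acts as the master
WORLD_SIZE=$(wc -l < "$PBS_NODEFILE")         # total ranks in the job

echo "MASTER_ADDR=$MASTER_ADDR"               # node01
echo "WORLD_SIZE=$WORLD_SIZE"                 # 4

rm -f "$PBS_NODEFILE"
```

Note that because mpiprocs=1, the line count of the nodefile equals the node count, which is why the script can pass $WORLD_SIZE to --trainer.num_nodes.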

I left in the NCCL debugging flags, which are helpful when the script hangs or nodes can't communicate with each other. They can easily be commented out.
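If WARN-level output isn't enough to diagnose a hang, NCCL's verbosity can be dialed up. A sketch (these are standard NCCL environment variables; the file path is just an example):

```shell
# Much noisier than WARN: logs every communicator init and transport choice.
export NCCL_DEBUG=INFO
# Optionally restrict the output to particular subsystems (e.g. INIT, NET, COLL).
export NCCL_DEBUG_SUBSYS=INIT,NET
# Send logs to a per-host, per-process file instead of stdout
# (%h expands to the hostname, %p to the process ID).
export NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log

echo "NCCL_DEBUG=$NCCL_DEBUG"
```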