{"id":1179,"date":"2023-10-24T07:34:47","date_gmt":"2023-10-24T12:34:47","guid":{"rendered":"https:\/\/www.gergltd.com\/home\/?p=1179"},"modified":"2023-10-24T07:34:47","modified_gmt":"2023-10-24T12:34:47","slug":"small-example-of-using-pytorch-lightning-with-pbs","status":"publish","type":"post","link":"https:\/\/www.gergltd.com\/home\/small-example-of-using-pytorch-lightning-with-pbs\/","title":{"rendered":"Small example of using PyTorch Lightning with PBS"},"content":{"rendered":"\n<pre class=\"wp-block-code\"><code>#!\/bin\/bash\n\n## Required PBS Directives --------------------------------------\n#PBS -A &lt;account>\n#PBS -q &lt;queue>\n#PBS -l select=4:ncpus=40:mpiprocs=1:nmlas=2\n#PBS -l walltime=24:00:00\n\n# Configure Cluster\nexport MASTER_PORT=8148\nNODES=$(cat $PBS_NODEFILE | sort | uniq)\nMASTER_ADDR=$(echo $NODES | cut -d ' ' -f 1)\nexport WORLD_SIZE=$(wc -l &lt; $PBS_NODEFILE)\n\n# Debugging flags\nexport NCCL_P2P_DISABLE=0\nexport NCCL_DEBUG=WARN\n\necho \"  MASTER_ADDR is: $MASTER_ADDR\"\necho \"  MASTER_PORT is: $MASTER_PORT\"\necho \"  WORLD_SIZE is : $WORLD_SIZE\"\n\n# Loop through all nodes and start the training.\nNODE_RANK=0\nfor NODE in $NODES; do\n    echo \"Starting training on $NODE with NODE_RANK $NODE_RANK, MASTER_ADDR $MASTER_ADDR, MASTER_PORT $MASTER_PORT\"\n    # Ensure no spaces after the backslash ending each line.\n    ssh $NODE \"bash; \\\n        export MASTER_ADDR=$MASTER_ADDR; \\\n        export MASTER_PORT=$MASTER_PORT; \\\n        export WORLD_SIZE=$WORLD_SIZE; \\\n        export NODE_RANK=$NODE_RANK; \\\n        source ~\/miniconda3\/bin\/activate &lt;your env>; \\\n        cd &lt;your dir>; \\\n        echo \\\"   On $NODE with NODE_RANK $NODE_RANK, MASTER_ADDR $MASTER_ADDR, MASTER_PORT $MASTER_PORT, WORLD_SIZE $WORLD_SIZE\\\"; \\\n        python your_code.py fit --config configs\/your_yaml.yaml --trainer.strategy=ddp --trainer.num_nodes=$WORLD_SIZE; \\\n    \" &amp;\n    NODE_RANK=$((NODE_RANK + 1))\ndone\n\n# 
Wait for all child processes to complete.\nwait\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This script fetches the node information from PBS, which is only available on the master node. Next, it uses ssh to connect to each node allocated to your job, sets the appropriate environment variables (these tell PyTorch Lightning how to communicate with the master node), and then launches your training script. The master node is whichever node is started with NODE_RANK=0; in this script that is the first node in the sorted list, the same node used for MASTER_ADDR.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I left in the NCCL debugging flags, which are helpful when the script hangs or nodes can&#8217;t communicate with each other. They can easily be commented out.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This script fetches the node information from PBS, which is only available on the master node. Next, it uses ssh to connect to each node allocated to your job, sets the appropriate environment variables (these tell PyTorch Lightning how to communicate with the master node), and then launches your training script. 
The master node [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1179","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/www.gergltd.com\/home\/wp-json\/wp\/v2\/posts\/1179","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.gergltd.com\/home\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.gergltd.com\/home\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.gergltd.com\/home\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.gergltd.com\/home\/wp-json\/wp\/v2\/comments?post=1179"}],"version-history":[{"count":1,"href":"https:\/\/www.gergltd.com\/home\/wp-json\/wp\/v2\/posts\/1179\/revisions"}],"predecessor-version":[{"id":1180,"href":"https:\/\/www.gergltd.com\/home\/wp-json\/wp\/v2\/posts\/1179\/revisions\/1180"}],"wp:attachment":[{"href":"https:\/\/www.gergltd.com\/home\/wp-json\/wp\/v2\/media?parent=1179"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.gergltd.com\/home\/wp-json\/wp\/v2\/categories?post=1179"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.gergltd.com\/home\/wp-json\/wp\/v2\/tags?post=1179"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}