- Which scheduler is your cluster running: MJS, HPC Server, PBS, Slurm, or something else?
- It sounds like every validation stage passes except the last one (parpool), and that stage hangs only when the workers span more than one node. With all workers on a single node, the last stage passes too. Do I have that right?
Matlab cluster - validation - parpool stuck
I am testing Matlab Parallel for the first time. I have a cluster on my local network with 8 cluster nodes, which can run a maximum of 32 workers.
I created the cluster profile and ran "validate." If I validate only 1 of the cluster nodes (4 workers), everything is fine, and the validation finishes in a couple of minutes or less. However, if I try with 2 cluster nodes, the validation stops at "Parallel pool test (parpool)" and keeps running indefinitely; I waited up to 1 hour. If I try to "Stop" the validation, nothing happens, and I have to kill Matlab from the task manager. I also tried running the 2 cluster nodes with only 1 worker each, but the same thing happens. I get no error and no report either because, as I said, the parpool test just keeps running. If I try cluster node 2 alone, everything works fine again. The problem arises only when I start both cluster nodes 1 and 2 (and, of course, also when I start all 8 cluster nodes).
I am new to Matlab parallel, and I currently have no clue where to start looking. Could someone give me some hint on what the problem could be?
Raymond Norris on 20 May 2021
A couple of questions
Without knowing more, I'm betting that you don't have password-less SSH between the compute nodes and that mpiexec is hung. Get onto a compute node and ssh to another. Are you prompted for your password? If so, you've reproduced the issue.
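If you would rather script that check than try each pair by hand, here is a minimal sketch. It assumes the compute nodes have an OpenSSH client and Python 3 available, which may not hold on your cluster. The function names and the host names `node1` and `node2` are hypothetical placeholders. The `BatchMode=yes` option makes `ssh` fail instead of prompting for a password, so a non-zero exit suggests password-less login is not set up to that host.

```python
import subprocess
import sys


def build_ssh_check(host, timeout_s=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",                  # never prompt; fail if a password is needed
        "-o", f"ConnectTimeout={timeout_s}",    # give up quickly on unreachable hosts
        host,
        "true",                                 # trivial remote command
    ]


def passwordless_ssh_ok(host, timeout_s=5):
    """Return True if the ssh command to `host` exits with status 0."""
    try:
        result = subprocess.run(
            build_ssh_check(host, timeout_s),
            capture_output=True,
            text=True,
            timeout=timeout_s + 10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the call itself hung
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical usage: python check_ssh.py node1 node2
    for host in sys.argv[1:]:
        status = "OK" if passwordless_ssh_ok(host) else "FAILED (password prompt or unreachable)"
        print(f"{host}: {status}")
```

Run it from each compute node against the others; a failure on any pair would be consistent with the mpiexec hang described above, though it does not rule out other causes such as a firewall.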