Why does MATLAB Parallel Server validation fail or stall at the SPMD/Pool job test stage (communicating batch jobs)?

Accepted Answer

MathWorks Support Team on 2 Oct 2024 at 0:00
Edited: MathWorks Support Team on 2 Oct 2024 at 12:42
Validation can fail or stall at this stage for several reasons, including but not limited to the following:
  • Network connectivity issues
  • Insufficient computer resources, or restrictions placed on those resources
  • Licensing issues
  • The job storage location is not set in a shared filesystem
  • The job storage location needs to be cleared or changed

 

Network connectivity issues

Make sure that all workers can communicate with one another over the network and that the required ports are open. If you don't know which ports need to be open, see the link below.
How do I configure MATLAB Parallel Server using the MATLAB Job Scheduler to work within a firewall?
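
The reachability part of this check can be sketched in a few lines of Python, run from each node against the others. This is a minimal sketch, not a MathWorks tool: the function names are hypothetical, and the host names and port numbers must come from your own cluster and the article linked above. It only confirms that a TCP connection can be opened. It does not confirm that the service listening on the port is the expected one.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and host name resolution failures.
        return False


def check_nodes(hosts, ports, timeout=3.0):
    """Return a dict mapping every (host, port) pair to its reachability."""
    return {
        (host, port): port_is_reachable(host, port, timeout)
        for host in hosts
        for port in ports
    }


# Example use (placeholder names; substitute your own nodes and ports):
# for (host, port), ok in check_nodes(["node01", "node02"], [12345]).items():
#     print(host, port, "reachable" if ok else "NOT reachable")
```

A pair reported as not reachable points to a firewall rule, a service that is not running, or a host name that does not resolve from that node.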
Check your hosts file to make sure that any manually added entries are correct. Incorrect entries can cause network connectivity issues.
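
As a rough aid for reviewing a hosts file, the following Python sketch flags entries whose first field is not a valid IP address or that have no host name after the address. The function name is hypothetical, and the two rules are assumptions about what "added incorrectly" commonly means. It cannot tell whether a well-formed entry maps a name to the wrong address, so the flagged lines are only a starting point.

```python
import ipaddress


def find_bad_hosts_entries(text):
    """Return (line_number, original_line, reason) for malformed hosts entries."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        # Drop comments and surrounding whitespace; skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # Ignore an IPv6 zone suffix such as "%lo0" before validating.
        address = fields[0].split("%", 1)[0]
        try:
            ipaddress.ip_address(address)
        except ValueError:
            problems.append((lineno, raw, "first field is not a valid IP address"))
            continue
        if len(fields) < 2:
            problems.append((lineno, raw, "no host name after the IP address"))
    return problems


# Example use (the path shown is the usual Linux/macOS location):
# with open("/etc/hosts", encoding="utf-8") as f:
#     for lineno, raw, reason in find_bad_hosts_entries(f.read()):
#         print(f"line {lineno}: {reason}: {raw}")
```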

 

Insufficient computer resources

Make sure that MATLAB has access to at least the resources listed in the minimum system requirements when you validate your cluster. If you're unsure what the minimum system requirements are, see the link below.

 

System Requirements

 

Licensing issues

It is also possible that there is an issue with the Network License Manager. Possible problems include the Network License Manager being misconfigured, not running, or having its ports blocked. Check the Network License Manager for any faults. Otherwise, create a full validation report and look in it for a License Manager error or a log file.
 

The job storage location is not set in a shared filesystem

The location where job data is stored must be on a filesystem that is shared across the workers. If you're submitting a job on a compute or head node, this is the JobStorageLocation. If you're submitting remotely to a cluster that uses Slurm, PBS, LSF, HTCondor, or Grid Engine and you don't have a shared filesystem with the cluster, then this is the RemoteJobStorageLocation.
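
One way to confirm that a storage folder really is shared is to write a small probe file from one node and read it back from another. The Python sketch below illustrates the idea. The function names and the probe file name are hypothetical, the folder path is whatever your cluster profile uses, and the sketch assumes that the folder is mounted at a path you know on each node.

```python
import os
import uuid

PROBE_NAME = "shared_fs_probe.txt"  # hypothetical file name used only by this sketch


def write_probe(storage_dir):
    """Write a random token into storage_dir and return the token."""
    token = uuid.uuid4().hex
    with open(os.path.join(storage_dir, PROBE_NAME), "w", encoding="utf-8") as f:
        f.write(token)
    return token


def probe_visible(storage_dir, token):
    """Return True if the probe file in storage_dir contains the given token."""
    try:
        with open(os.path.join(storage_dir, PROBE_NAME), encoding="utf-8") as f:
            return f.read().strip() == token
    except OSError:
        # A missing or unreadable file means this node does not see the probe.
        return False


def remove_probe(storage_dir):
    """Delete the probe file if it exists."""
    try:
        os.remove(os.path.join(storage_dir, PROBE_NAME))
    except FileNotFoundError:
        pass
```

Run write_probe on the node that submits the job, pass the returned token to probe_visible on each worker node, and call remove_probe afterwards. If any node returns False, that node does not see the same folder contents.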
 

The job storage location needs to be cleared or changed

For a variety of reasons, the job storage location may need to be cleared. Try clearing the JobStorageLocation set in your cluster profile or, if you don't want to clear it, choose a different location.
