The parallel cluster becomes unresponsive, while the program runs normally on local pool workers. How can this issue be resolved?

I have developed a program that utilizes parfor to run parallel workers. The structure of the program is as follows:
parfor ix = 1:numel(param_files)
    % Read the parameter files
    % Run fmincon optimization
end
The program functioned normally with all param_files when executed using local workers. However, when I attempted to run it on a parallel cluster, it became unresponsive for many hours while processing one of the parameter files during the optimization phase. As a result, I had to manually stop the parallel cluster. (Note: The hardware capabilities of the parallel cluster server are equivalent to those of my local PC.)
  1 Comment
Sam Marshalik on 7 Oct 2024
Hey Dung, if the remote workers become unresponsive, it is generally due to some resource issue. Have you had a chance to monitor the resources on your cluster when the problematic parameter file is being processed? Is it maxing out CPU or Memory?
Something to consider is that you may be starting more workers on your cluster than it can support, so each worker may not have access to sufficient CPU or memory. We suggest 1 worker per physical CPU core, but you may need to run fewer if your work is resource intensive.
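A minimal sketch of how the worker count could be checked and reduced, assuming the default cluster profile stands in for your cluster profile and that 2 is an arbitrary, deliberately small pool size chosen only for illustration:

```matlab
% Default profile is used here; pass your own cluster profile name instead.
c = parcluster();
fprintf('Profile allows up to %d workers\n', c.NumWorkers);

% Assumption: 2 is an arbitrary small value, not a recommendation.
nWorkers = min(2, c.NumWorkers);

% Reuse an existing pool if one is open, otherwise start a small one.
pool = gcp('nocreate');
if isempty(pool)
    pool = parpool(c, nWorkers);
end

% Ask each worker which host it runs on and how many computational threads it may use.
spmd
    [~, host] = system('hostname');
    info = sprintf('%s, %d thread(s)', strtrim(host), maxNumCompThreads);
end

% info is a Composite with one entry per worker.
for k = 1:numel(info)
    disp(info{k});
end
```

If several workers report the same host, compare that count with the number of physical cores on that machine while watching CPU and memory use during the problematic parameter file.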


Answers (1)

Venkat Siddarth Reddy on 5 Oct 2024
Hi Dung,
I understand that the parallel cluster becomes unresponsive when you run the program on the cluster's parallel workers.
To troubleshoot the issue further, please consider the following steps:
  • Cluster Configuration: Ensure that the cluster is properly configured to handle the workload. Check the settings for the number of workers, memory allocation, and time limits. Verify that the cluster has access to all necessary files and resources. Sometimes file paths or dependencies are not set up correctly on the cluster.
  • I/O Access and Speed: If the parameter files are large or numerous, ensure that data transfer to and from the cluster is not a bottleneck. Consider using distributed data storage that the cluster can access efficiently.
  • Parallel Overhead: While local execution might handle overhead seamlessly, clusters can introduce additional overhead due to communication between nodes. This can be especially problematic if the tasks are not large or complex enough to justify parallel execution. Please check whether the parallel overhead is a significant share of the total run time.
  • Optimization Behavior: The fmincon optimization might behave differently on the cluster due to differences in floating-point arithmetic or other environmental factors. Check if the optimization problem is well-conditioned and robust to such changes.
  • Debugging and Logging: Implement logging within the parfor loop to track progress and identify which parameter file causes the hang-up. Use MATLAB’s debugging tools to isolate the issue. Consider running a smaller subset of parameter files to see if the problem is specific to certain inputs.
  • Cluster-Specific Issues: Check for any cluster-specific issues such as network latency, node failures, or resource contention.
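The logging and optimization-behavior suggestions above can be sketched as follows. This is a minimal, hypothetical example and not the original program: the `targets` array and the quadratic objective are toy stand-ins for the parameter files and the real objective, and the iteration, evaluation, and time limits are arbitrary values chosen for illustration. Each iteration writes its own log file and fmincon is capped so that a single difficult input cannot run indefinitely.

```matlab
% Toy stand-in for the real parameter files (assumption): one target value per "file".
targets = [1 2 3];
n       = numel(targets);
maxSecs = 60;   % arbitrary wall-clock budget per optimization

% Cap the work fmincon may do; the limits are illustrative values.
baseOpts = optimoptions('fmincon', 'Display', 'off', ...
    'MaxIterations', 200, 'MaxFunctionEvaluations', 2000);

xs    = zeros(1, n);
flags = zeros(1, n);

parfor ix = 1:n
    t0 = tic;

    % One log file per iteration. tempdir is local to the worker's machine;
    % on a real cluster a shared folder is easier to inspect.
    fid = fopen(fullfile(tempdir, sprintf('task_%03d.log', ix)), 'a');
    fprintf(fid, '%s start\n', char(datetime('now')));

    target = targets(ix);
    fun    = @(x) (x - target)^2;

    % Output function: returning true asks fmincon to stop (exitflag -1).
    opts = optimoptions(baseOpts, 'OutputFcn', ...
        @(x, optimValues, state) toc(t0) > maxSecs);

    [x, ~, exitflag, output] = fmincon(fun, 0, [], [], [], [], -10, 10, [], opts);

    fprintf(fid, '%s done: exitflag=%d, iterations=%d, %.1f s\n', ...
        char(datetime('now')), exitflag, output.iterations, toc(t0));
    fclose(fid);

    xs(ix)    = x;
    flags(ix) = exitflag;
end
```

A log file with a start line but no done line points to the input that is stuck. An exitflag of 0 means an iteration or evaluation limit was reached, and -1 means the output function stopped the run, so either value marks a parameter file worth rerunning on its own.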
I hope the above steps help you resolve the issue!
