Matlab 2021b constantly fails to invoke parpool

Hi,
I have a problem in Matlab 2021b while invoking parpool.
My script calls parpool in several places (4, to be exact).
I invoke parpool using the following snippet:
test_p = gcp('nocreate');
if isempty(test_p)
    myPool = parpool('local',64);
end
The first 3 pools open without a problem, but the 4th call to parpool crashes with the following error:
Error using parpool (line 146)
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager.
Caused by:
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
Failed to initialize the interactive session.
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399)
The interactive communicating job failed with no message.
This unstable behaviour has happened multiple times, and not only with this script.
Sometimes the pool opens; other times it crashes.
To work around this, I always restart Matlab and delete ~/.matlab/local_cluster_jobs, but this is only a temporary remedy. The issue persists.
Running Validate in the Cluster Profile Manager failed at the parpool stage and produced the following report:
Start Time: Fri Apr 29 01:19:58 EDT 2022
Finish Time: Fri Apr 29 01:20:17 EDT 2022
Running Duration: 0 min 19 sec
Description: Job ran with 64 workers.
Error Report:
Command Line Output:
Debug Log:
Stage: Pool job test (createCommunicatingJob)
Status: Passed
Start Time: Fri Apr 29 01:20:17 EDT 2022
Finish Time: Fri Apr 29 01:20:36 EDT 2022
Running Duration: 0 min 19 sec
Description: Job ran with 64 workers.
Error Report:
Command Line Output:
Debug Log:
Stage: Parallel pool test (parpool)
Status: Failed
Start Time: Fri Apr 29 01:20:36 EDT 2022
Finish Time: Fri Apr 29 01:24:10 EDT 2022
Running Duration: 3 min 34 sec
Description: Failed to initialize the interactive session.
Error Report: Failed to initialize the interactive session.
Caused by:
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399)
The interactive communicating job failed with no message.
Command Line Output:
Debug Log: CLIENT LOG OUTPUT
Currently connected to: 1
Checking communicating job status.
Session failed to start when creating InteractiveClient. Error: Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
Failed to initialize the interactive session.
Error in parallel.internal.pool.AbstractInteractiveClient/start (line 142)
iThrowWithCause( 'parallel:convenience:FailedToInitializeInteractiveSession', err );
Error in parallel.internal.pool.AbstractClusterPool>iStartClient (line 831)
spmdInitialized = client.start(sessionBuildFcn, sessionInfo, numWorkers, cluster, ...
Error in parallel.internal.pool.AbstractClusterPool.hBuildPool (line 585)
iStartClient(client, sessionInfo, forceSpmdEnabled, cluster, supportRestart, argsList);
Error in parallel.internal.types.ValidationStages>iOpenPoolForCluster (line 456)
aPool = parallel.internal.pool.AbstractClusterPool.hBuildPool('Cluster', cluster, 'NumWorkers', numWorkers);
Error in parallel.internal.types.ValidationStages>@()iOpenPoolForCluster(runInfo)
Error in parallel.internal.types.ValidationStages>iCallWithNoHotlinks (line 336)
[varargout{1:nargout}] = fcn();
Error in parallel.internal.types.ValidationStages>iRunParpoolStage (line 247)
[commandWindowOutput, aPool] = evalc(iWrapForEvalc(openPoolFcn));
Error in parallel.internal.types.ValidationStages/run (line 68)
[eventData, runInfo] = obj.RunFunction(obj, runInfo);
Error in parallel.internal.validator.Validator/runValidationSuite (line 191)
[eventData, stageRunInfo] = currentStage.run(stageRunInfo);
Error in parallel.internal.validator.Validator/validate (line 103)
status = obj.runValidationSuite(profileName, suite);
Error in parallel.internal.ui.AbstractValidationManager/validate (line 36)
obj.Validator.validate(profileName, validationSuite);
Error in parallel.internal.ui.ValidationManager.validateProfile (line 36)
parallel.internal.ui.ValidationManager.getOrCreateInstance().validate(profileName, suite);
Caused by:
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399)
The interactive communicating job failed with no message.
Failed to run the DisarmableOncleanup callback due to the following error:
Dot indexing is not supported for variables of this type.
What exactly is the problem here?
I am running Matlab on a CentOS 7 machine with two "Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz" processors (64 physical / 128 logical cores in total) and 1.5TB of RAM.
I would really appreciate your help here as this is severely impacting my work.
Thank you in advance for your help and time!

8 Comments

Sometimes problems such as these are caused by internal security firewalls. MATLAB needs to use TCP sockets between the controller and the workers.
Thank you for your response!
I am not sure I understand what you mean by connection "between the controller and the workers".
The script is running locally on the CentOS 7 machine, using the cores that the machine offers.
So to my understanding, both the "controller" and the "workers" are on the same machine.
How can TCP sockets affect the invocation of parpool?
Kind regards,
anm
The controller and the worker operate as separate processes on the machine, getting scheduled independently.
Unix-type systems offer several different ways to communicate between processes, including
  • shared memory -- typically this is restricted to computations on the same node (but sometimes you find cluster systems with unified shared memory)
  • pipes -- typically these are restricted to computations on the same node (but sometimes clusters can handle them). There is no direct way to forward pipes between different systems, but there are work-arounds, typically involving rsh and named pipes
  • IP means such as TCP or UDP -- these are not restricted to computations on the same node
Pipes do not work in nearly the same way on Windows, so since MATLAB wants the same communications mechanisms on all operating systems, pipes are not used.
Shared memory does exist on Windows, with a bit of a different implementation. There is a third-party standardized interface layer, MPI (Message Passing Interface) that can hide the details, so using shared memory for the communications within a node is not out of the question as an approach.
TCP works the same on Windows and Unix systems, so exactly the same code can be used between Windows and Unix (except perhaps setting different buffer sizes or socket options.) And using TCP gives the advantage of permitting the workers to be on a remote system -- requiring no change for using MATLAB Parallel Server ("Distributed Computing") to talk to a cluster.
So, MathWorks chose to use the more portable TCP-based implementation.
But in order for two processes to be able to talk to each other by TCP, the host's internal firewall must not be configured to block that possibility. CentOS systems would typically be configured to permit two processes on the same host to talk to each other, but sometimes systems administrators or security administrators lock down communications for security reasons.
Thank you very much for your informative post.
Again, I cannot understand how two processes running on the same computing node, would utilize TCP to communicate with each other.
I set up the CentOS machine myself, and I can verify that I haven't made any security-related changes that restrict communication in any way.
How do you explain parpool working on some occasions but not on others?
Also, what do you mean by "internal firewall"? How can a firewall intervene between processes in the same computing node?
Firewalls on CentOS are typically implemented with iptables, but it would not be uncommon to use something like FirewallD to manage them: https://www.digitalocean.com/community/tutorials/how-to-set-up-a-firewall-using-firewalld-on-centos-7
Using TCP to communicate between different processes on the same host is very common.
Example:
port_to_use = next_available_port;
socket_to_use = create_tcp_listener(ANY_HOST, port_to_use);
next_available_port = next_available_port + 1;
cmd(1) = "/opt/MATLAB/toolbox/distcomp/startworker";
cmd(2) = HOSTNAME;
cmd(3) = string(port_to_use);
execv(cmd);
connection = wait_for_connection(socket_to_use);
%now read and write to connection
and meanwhile in /opt/MATLAB/toolbox/distcomp/worker it would start up and examine its argument list, and see the hostname and port sitting there, and would ask to open a TCP connection to the server on the given hostname and port.
The mechanisms involved might be slightly different than this. For example, creating a TCP listener might return the port number and then you would pass that port number to the new executable. The basic mechanisms only require that the server can find out (or control) a port number to use, and then embed the host and port number in the argument list when a new process is created, and then the server waits for the connection.
It is common for the TCP implementation to have optimizations for the case where the source and destination are on the same host; for example, loopback traffic stays inside the operating system's network stack and never reaches a physical network interface.
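As an illustration of two endpoints on the same host talking over TCP, here is a minimal sketch that opens a server and a client inside one MATLAB session. It is not how parpool is implemented internally; it only shows a loopback round trip. It assumes R2021a or later (for tcpserver) and that port 30000 is free; the port number is an arbitrary, hypothetical choice.

```matlab
% Minimal loopback check: a TCP server and client in one MATLAB session.
% Assumptions: R2021a or later, and port 30000 (hypothetical) is unused.
port = 30000;
server = tcpserver("localhost", port);   % listen on the loopback address
client = tcpclient("localhost", port);   % connect from the same host

payload = uint8(1:8);
write(client, payload);                  % send 8 bytes client -> server

% Blocks until 8 bytes arrive or the server's Timeout elapses.
received = read(server, numel(payload), "uint8");

if isequal(received, payload)
    disp("Loopback TCP round trip succeeded.");
else
    disp("Loopback TCP round trip failed or timed out.");
end

clear client server                      % close both ends of the connection
```

If this round trip fails on a machine where nothing else is using the chosen port, that would be one hint that local TCP traffic is being blocked; if it succeeds, it does not rule out other causes of the parpool failure.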
Hi, I have exactly the same problem with parpool in R2021b but on a CentOS 8 machine. No iptables/nftables is used. My script sometimes works but sometimes doesn't. It would be great if anyone could help to solve the problem. Thank you.
I think I understand your reasoning now, thank you for your informative posts.
So is there a way to verify whether the CentOS firewall is causing problems?
And how can this be remedied?
Thank you in advance for your help!
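One partial check, offered as a sketch rather than a full diagnosis, is to ask systemd from within MATLAB whether the firewalld service is running at all. This assumes firewalld is the firewall front end in use, as suggested earlier in the thread; it says nothing about rules installed directly through iptables, and a running firewalld does not by itself mean that local connections are blocked.

```matlab
% Ask systemd whether the firewalld service is running.
% Assumption: a systemd-based system such as CentOS 7 with systemctl on the PATH.
[status, out] = system('systemctl is-active firewalld');
state = strtrim(out);   % e.g. 'active', 'inactive', or 'unknown'

% systemctl is-active exits with status 0 only when the unit is active.
if status == 0
    fprintf('firewalld is running (%s).\n', state);
else
    fprintf('firewalld is not running (%s).\n', state);
end
```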
The issue persists across all versions of Matlab released since this post was created. We have spent many hours with MathWorks support on this, to no practical effect. In our case, the pool always starts on the second attempt.
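Given that a second attempt reportedly succeeds, one possible stopgap is to wrap the pool start in a retry loop. The sketch below is a workaround under stated assumptions, not a fix for the underlying problem: the helper name startPoolWithRetry, the 5-second pause, and the choice to delete leftover jobs between attempts (mirroring the manual cleanup of ~/.matlab/local_cluster_jobs described in the question) are all hypothetical choices.

```matlab
function pool = startPoolWithRetry(numWorkers, maxAttempts)
% startPoolWithRetry  Hypothetical helper: reuse or open a 'local' pool,
% retrying a limited number of times if parpool throws.
    pool = gcp('nocreate');               % reuse an existing pool if any
    if ~isempty(pool)
        return
    end
    c = parcluster('local');              % same profile as in the question
    for attempt = 1:maxAttempts
        try
            pool = parpool(c, numWorkers);
            return                        % success
        catch err
            warning('startPoolWithRetry:failed', ...
                'Attempt %d of %d failed: %s', ...
                attempt, maxAttempts, err.message);
            delete(gcp('nocreate'));      % close any half-started pool
            if ~isempty(c.Jobs)
                delete(c.Jobs);           % clear leftover jobs for this profile
            end
            pause(5);                     % arbitrary delay before retrying
        end
    end
    error('startPoolWithRetry:giveUp', ...
        'Pool did not start after %d attempts.', maxAttempts);
end
```

It would replace the original snippet with a call such as myPool = startPoolWithRetry(64, 3);. Note that deleting c.Jobs removes all jobs stored under that profile, so this is only appropriate when no other batch jobs rely on it.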

Answers (1)

Hi,
When operating on Windows with MATLAB R2021b, users with non-ASCII characters in their usernames, such as extended ASCII characters, encounter difficulties with the local cluster's functionality. Specifically, starting parallel pools or running independent jobs using commands like parpool('local') leads to vague failure messages, such as "Failed to initialize the interactive session". This issue has been identified in the External Bug Report here: https://www.mathworks.com/support/bugreports/details/2619526
This issue was fixed in R2021b Update 3 and R2022a. The bug report also provides a workaround that you can try as a fix.
Hope this helps!

5 Comments

The issue at hand is on CentOS, which is Linux.
Although the issue was identified for Windows, the workaround provided in the bug report is not OS-specific.
The workaround provided in the bug report is very OS specific.
The workaround says to use the "-c" startup flag to override MATLAB's default license path with one that contains only ASCII characters. The steps it lists are for Windows, but at the end of the bug report there is this link: https://uk.mathworks.com/matlabcentral/answers/102520-how-do-i-change-the-license-search-location-for-matlab
That page has the steps for Windows, macOS and Linux for the same workaround.
The workaround provided in the bug report is very OS specific. It is mostly accidental that it happens to mention a link that can be used for Linux.

Release: R2021b

Asked: 29 Apr 2022

Commented: 22 Feb 2026 at 7:06