Matlab2021b constantly fails to invoke parpool
Hi,
I have a problem in Matlab 2021b while invoking parpool.
I have a script that uses parpool in several places (4, to be exact).
I invoke parpool using the following snippet:
test_p = gcp('nocreate');
if isempty(test_p)
myPool = parpool('local',64);
end
While the first 3 parpools are opening without a problem, the 4th time the parpool crashes with the following error:
Error using parpool (line 146)
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'local' in the Cluster Profile Manager.
Caused by:
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
Failed to initialize the interactive session.
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399)
The interactive communicating job failed with no message.
This unstable behaviour has happened multiple times and not only with this script.
Sometimes the parpool will open, some others it will crash.
To work around this, I always restart Matlab and delete ~/.matlab/local_cluster_jobs, but this is only a temporary remedy. The issue persists.
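The same cleanup can be tried from inside MATLAB instead of deleting the folder by hand. The following is only a sketch. It assumes the 'local' profile stores its job data in the default location under ~/.matlab/local_cluster_jobs, and it is not a confirmed fix for the failure described here.

```matlab
% Sketch: clear stale job data for the 'local' profile without leaving MATLAB.
c = parcluster('local');          % cluster object for the profile used above
disp(c.JobStorageLocation);       % folder where this profile keeps its job data

pool = gcp('nocreate');           % query the current pool without creating one
if ~isempty(pool)
    delete(pool);                 % shut down a pool that is still open
end

if ~isempty(c.Jobs)
    delete(c.Jobs);               % remove every job stored for this profile
end
```

Note that delete(c.Jobs) removes all jobs stored for the profile, including any batch jobs whose results have not yet been fetched.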
Running Validate in the Cluster Profile Manager, failed on the invocation of parpool, producing the following report:
Start Time: Fri Apr 29 01:19:58 EDT 2022
Finish Time: Fri Apr 29 01:20:17 EDT 2022
Running Duration: 0 min 19 sec
Description: Job ran with 64 workers.
Error Report:
Command Line Output:
Debug Log:
Stage: Pool job test (createCommunicatingJob)
Status: Passed
Start Time: Fri Apr 29 01:20:17 EDT 2022
Finish Time: Fri Apr 29 01:20:36 EDT 2022
Running Duration: 0 min 19 sec
Description: Job ran with 64 workers.
Error Report:
Command Line Output:
Debug Log:
Stage: Parallel pool test (parpool)
Status: Failed
Start Time: Fri Apr 29 01:20:36 EDT 2022
Finish Time: Fri Apr 29 01:24:10 EDT 2022
Running Duration: 3 min 34 sec
Description: Failed to initialize the interactive session.
Error Report: Failed to initialize the interactive session.
Caused by:
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399)
The interactive communicating job failed with no message.
Command Line Output:
Debug Log: CLIENT LOG OUTPUT
Currently connected to: 1
Checking communicating job status.
Session failed to start when creating InteractiveClient. Error: Error using parallel.internal.pool.AbstractInteractiveClient>iThrowWithCause (line 305)
Failed to initialize the interactive session.
Error in parallel.internal.pool.AbstractInteractiveClient/start (line 142)
iThrowWithCause( 'parallel:convenience:FailedToInitializeInteractiveSession', err );
Error in parallel.internal.pool.AbstractClusterPool>iStartClient (line 831)
spmdInitialized = client.start(sessionBuildFcn, sessionInfo, numWorkers, cluster, ...
Error in parallel.internal.pool.AbstractClusterPool.hBuildPool (line 585)
iStartClient(client, sessionInfo, forceSpmdEnabled, cluster, supportRestart, argsList);
Error in parallel.internal.types.ValidationStages>iOpenPoolForCluster (line 456)
aPool = parallel.internal.pool.AbstractClusterPool.hBuildPool('Cluster', cluster, 'NumWorkers', numWorkers);
Error in parallel.internal.types.ValidationStages>@()iOpenPoolForCluster(runInfo)
Error in parallel.internal.types.ValidationStages>iCallWithNoHotlinks (line 336)
[varargout{1:nargout}] = fcn();
Error in parallel.internal.types.ValidationStages>iRunParpoolStage (line 247)
[commandWindowOutput, aPool] = evalc(iWrapForEvalc(openPoolFcn));
Error in parallel.internal.types.ValidationStages/run (line 68)
[eventData, runInfo] = obj.RunFunction(obj, runInfo);
Error in parallel.internal.validator.Validator/runValidationSuite (line 191)
[eventData, stageRunInfo] = currentStage.run(stageRunInfo);
Error in parallel.internal.validator.Validator/validate (line 103)
status = obj.runValidationSuite(profileName, suite);
Error in parallel.internal.ui.AbstractValidationManager/validate (line 36)
obj.Validator.validate(profileName, validationSuite);
Error in parallel.internal.ui.ValidationManager.validateProfile (line 36)
parallel.internal.ui.ValidationManager.getOrCreateInstance().validate(profileName, suite);
Caused by:
Error using parallel.internal.pool.AbstractInteractiveClient>iThrowIfBadParallelJobStatus (line 399)
The interactive communicating job failed with no message.
Failed to run the DisarmableOncleanup callback due to the following error:
Dot indexing is not supported for variables of this type.
What exactly is the problem here?
I am running Matlab on a CentOS 7 machine with two "Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz" processors (64 physical / 128 logical cores in total) and 1.5TB of RAM.
I would really appreciate your help here as this is severely impacting my work.
Thank you in advance for your help and time!
8 Comments
Walter Roberson
on 29 Apr 2022
Sometimes problems such as these are caused by internal security firewalls. MATLAB needs to use TCP sockets between the controller and the workers.
nassos
on 29 Apr 2022
Walter Roberson
on 29 Apr 2022
The controller and the worker operate as separate processes on the machine, getting scheduled independently.
Unix-type systems offer several different ways to communicate between processes, including
- shared memory -- typically this is restricted to computations on the same node (but sometimes you find cluster systems with unified shared memory)
- pipes -- typically these are restricted to computations on the same node (but sometimes clusters can handle them). There is no direct way to forward pipes between different systems, but there are work-arounds, typically involving rsh and named pipes.
- IP means such as TCP or UDP -- these are not restricted to computations on the same node
Pipes do not work in nearly the same way on Windows, so since MATLAB wants the same communications mechanisms on all operating systems, pipes are not used.
Shared memory does exist on Windows, with a bit of a different implementation. There is a third-party standardized interface layer, MPI (Message Passing Interface) that can hide the details, so using shared memory for the communications within a node is not out of the question as an approach.
TCP works the same on Windows and Unix systems, so exactly the same code can be used between Windows and Unix (except perhaps setting different buffer sizes or socket options.) And using TCP gives the advantage of permitting the workers to be on a remote system -- requiring no change for using MATLAB Parallel Server ("Distributed Computing") to talk to a cluster.
So, MathWorks chose to use the more-portable TCP based implementation.
But in order for two processes to be able to talk to each other by TCP, the internal firewall must not be configured to block that possibility. CentOS systems would typically be configured to permit two processes on the same host to talk to each other, but sometimes systems administrators or security administrators lock down communications for security reasons.
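A rough way to check whether loopback TCP works at all on the machine is to open a listener and a client in the same MATLAB session. This is a sketch under assumptions: it needs R2021a or later for tcpserver, the port number 30000 is an arbitrary placeholder that may already be in use, and it does not reproduce the ports or the mechanism that parpool itself uses, so a pass here does not rule out a firewall problem.

```matlab
% Sketch: loopback TCP round trip inside one MATLAB session.
port = 30000;                              % hypothetical free port; change if taken
srv  = tcpserver("127.0.0.1", port);       % listen on the loopback interface
cli  = tcpclient("127.0.0.1", port);       % connect to that listener

payload = uint8(1:5);
write(cli, payload);                       % send five bytes client -> server
received = read(srv, 5, "uint8");          % waits for 5 bytes or until Timeout expires

disp(isequal(received, payload));          % 1 if the bytes arrived intact
clear cli srv                              % close both ends of the connection
```

If the read times out or the client cannot connect, that points to something blocking local TCP traffic and is worth raising with whoever administers the machine.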
nassos
on 29 Apr 2022
Walter Roberson
on 29 Apr 2022
The internal implementation of firewalls in CentOS typically uses iptables, but it would not be uncommon to use something like FirewallD to manage it: https://www.digitalocean.com/community/tutorials/how-to-set-up-a-firewall-using-firewalld-on-centos-7
Using TCP to communicate between different processes on the same host is very common.
Example:
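% Pseudocode sketch, not runnable MATLAB
port_to_use = next_available_port;
socket_to_use = create_tcp_listener(ANY_HOST, port_to_use);  % server starts listening first
next_available_port = next_available_port + 1;
cmd(1) = "/opt/MATLAB/toolbox/distcomp/startworker";
cmd(2) = HOSTNAME;               % tell the worker which host to call back
cmd(3) = string(port_to_use);    % ... and on which port
execv(cmd);                      % launch the worker process
connection = wait_for_connection(socket_to_use);
% now read and write to connection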
and meanwhile in /opt/MATLAB/toolbox/distcomp/worker it would start up and examine its argument list, and see the hostname and port sitting there, and would ask to open a TCP connection to the server on the given hostname and port.
The mechanisms involved might be slightly different than this. For example, creating a TCP listener might return the port number and then you would pass that port number to the new executable. The basic mechanisms only require that the server can find out (or control) a port number to use, and then embed the host and port number in the argument list when a new process is created, and then the server waits for the connection.
It is common for the TCP implementation to have optimizations for the case where the source and destination are on the same host; sometimes data can be flipped between processes instead of going into the kernel.
Lin
on 29 Apr 2022
Hi, I have exactly the same problem with parpool in R2021b but on a CentOS 8 machine. No iptables/nftables is used. My script sometimes works but sometimes doesn't. It would be great if anyone could help to solve the problem. Thank you.
nassos
on 29 Apr 2022
Ilya Kuprov
on 22 Feb 2026 at 7:06
The issue persists across all versions of Matlab released since this post was created. We have spent many hours with MathWorks support on this, to no practical effect. In our case, the pool always starts on the second attempt.
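Given that observation, one pragmatic workaround is to wrap the pool start in a retry loop. The sketch below is an assumption-laden illustration rather than a fix for the underlying problem: the function name startPoolWithRetry, the attempt limit and the 5 second pause are all hypothetical choices.

```matlab
function pool = startPoolWithRetry(numWorkers, maxAttempts)
% Sketch: try to open a 'local' pool up to maxAttempts times.
% startPoolWithRetry is a hypothetical helper, not part of any toolbox.
pool = gcp('nocreate');                    % reuse an existing pool if there is one
attempt = 0;
while isempty(pool) && attempt < maxAttempts
    attempt = attempt + 1;
    try
        pool = parpool('local', numWorkers);
    catch err
        warning('Pool start attempt %d of %d failed: %s', ...
                attempt, maxAttempts, err.message);
        pause(5);                          % arbitrary wait before the next attempt
    end
end
if isempty(pool)
    error('Could not start a parallel pool after %d attempts.', maxAttempts);
end
end
```

In the original script, each of the four parpool calls could then be replaced by something like myPool = startPoolWithRetry(64, 3);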
Answers (1)
Yash
on 17 Jan 2024
0 votes
Hi,
When operating on Windows with MATLAB R2021b, users with non-ASCII characters in their usernames, such as extended ASCII characters, encounter difficulties with the local cluster's functionality. Specifically, starting parallel pools or running independent jobs using commands like parpool('local') leads to vague failure messages, such as "Failed to initialize the interactive session". This issue has been identified in the External Bug Report here: https://www.mathworks.com/support/bugreports/details/2619526
This issue was fixed in R2021b Update 3 and R2022a. The bug report also provides a workaround that you can try as a fix.
Hope this helps!
5 Comments
Walter Roberson
on 17 Jan 2024
The issue at hand is on CentOS, which is Linux.
Yash
on 19 Jan 2024
Although the issue was identified for Windows, the workaround provided in EBR is not OS specific.
Walter Roberson
on 19 Jan 2024
The workaround provided in the bug report is very OS specific.
Yash
on 8 Feb 2024
Edited: Walter Roberson
on 8 Feb 2024
The workaround says to use the "-c" startup flag to override MATLAB's default license path with one that contains only ASCII characters. The steps given are for Windows, but at the end of the bug report there is this link: https://uk.mathworks.com/matlabcentral/answers/102520-how-do-i-change-the-license-search-location-for-matlab
This has the steps for Windows, MacOS and Linux for the same workaround.
Walter Roberson
on 8 Feb 2024
The workaround provided in the bug report is very OS specific. It is mostly accidental that it happens to mention a link that can be used for Linux.