I have been working for a year now to solve a puzzling problem with text output produced during parallel computation with Matlab.
I am using the parpool function from the Parallel Computing Toolbox (Matlab R2020b) to run a set of jobs on my university’s HPC system. The computation is a standard controlled search for optimization, where the goal is to find a solution vector xMin that minimizes an objective function S = f(x), where x is an arbitrary candidate solution and S is the misfit value for that candidate. Parallel computing suits this problem well because the objective function can be calculated independently for each candidate solution.
I have set up the computation to avoid any obvious clashes between the workers and the master. I start Matlab on one of the 28 cores available on a single node and run a master program on that core, which uses Matlab’s parpool function to initialize a parallel pool of 27 workers, one for each of the 27 concurrent jobs. The idea is to ensure that the master program and the job programs are each isolated to their own core.
The master then starts the 27 jobs with a different solution vector for each job. The master scans for finished jobs, and processes each one independently, which involves invoking fprintf to write a line of text (about 200 characters long) to a text file. The text states the solution vector x and associated objective value S for that job. At this point, the master program replaces the finished job with a new job. All of this is designed to occur sequentially within the master program, so there should be no collisions.
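A minimal sketch of a master loop of this kind is given below. It assumes the jobs are submitted with parfeval and collected with fetchNext, which the description above does not confirm. The names objectiveFcn, nParams, nTotal, candidateInSlot, and results.txt are hypothetical placeholders, and the objective function is a stand-in.

```matlab
% Sketch only: assumes parfeval/fetchNext; all names here are hypothetical.
objectiveFcn = @(x) sum(x.^2);      % placeholder for the misfit S = f(x)
nParams  = 6;                       % length of each candidate vector
nWorkers = 27;                      % one core is left for the master
nTotal   = 100;                     % total number of candidates to evaluate
X = rand(nTotal, nParams);          % candidate solutions, one per row

pool = gcp('nocreate');             % reuse an existing pool if there is one
if isempty(pool)
    pool = parpool(nWorkers);
end

fid = fopen('results.txt', 'w');    % only the master holds this file handle

% Fill the pool with the first batch of jobs.
nSubmitted = min(nWorkers, nTotal);
futures(1:nSubmitted) = parallel.FevalFuture;
candidateInSlot = zeros(1, nSubmitted);
for k = 1:nSubmitted
    futures(k) = parfeval(pool, objectiveFcn, 1, X(k, :));
    candidateInSlot(k) = k;
end

% Collect results one at a time; all file writes happen here, sequentially.
for nDone = 1:nTotal
    [slot, S] = fetchNext(futures);         % blocks until some job finishes
    id = candidateInSlot(slot);
    % Build the whole record first, then write it with a single call.
    record = sprintf('%d%s %.6g\n', id, sprintf(' %.6g', X(id, :)), S);
    fprintf(fid, '%s', record);
    if nSubmitted < nTotal                  % replace the finished job
        nSubmitted = nSubmitted + 1;
        futures(slot) = parfeval(pool, objectiveFcn, 1, X(nSubmitted, :));
        candidateInSlot(slot) = nSubmitted;
    end
end
fclose(fid);
```

In this arrangement the workers never touch the output file, so any corruption would have to arise in the master's own write path or in the file system underneath it.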
The problem is that there are rare instances where the text output fails. This is generally isolated to a single solution record and is marked by a long string of null characters, as illustrated by the following schematic example (null characters are replaced here by “?”):
112 58.1453 58.1453 152 11.4779 38.8290
114 13.4881 13.4881 30 19.1352 118.670
This error shows up at a rate of about 1 out of 5,000 solutions, which makes it very hard to isolate. I have consulted with our IT people and they have indicated that they have not seen this error with others who are running on our HPC. The string of nulls is not necessarily specific to a single solution record. For example, the nulls might start towards the end of the previous solution record. In addition, the number of nulls in each string can vary.
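Because the failures are so rare, it may help to scan the output file for runs of null bytes and record where each run starts and how long it is. The following is a sketch of such a scan. The file name null_scan_demo.txt and its contents are made up for illustration, and the first few lines only create a small test file with five injected nulls.

```matlab
% Sketch: locate runs of null bytes in a text file.
% The demo file below is fabricated for illustration only.
demoFile = 'null_scan_demo.txt';
fid = fopen(demoFile, 'w');
fwrite(fid, [uint8(sprintf('1 2.5 3.5\n')), ...
             zeros(1, 5, 'uint8'), ...
             uint8(sprintf('2 4.5 5.5\n'))]);
fclose(fid);

% Read the file as raw bytes so that nulls are preserved.
fid = fopen(demoFile, 'r');
bytes = fread(fid, Inf, '*uint8')';     % row vector of uint8
fclose(fid);

% Find the start and end of every run of consecutive null bytes.
isNull   = (bytes == 0);
edges    = diff([false, isNull, false]);
runStart = find(edges == 1);            % 1-based offset of first null in a run
runEnd   = find(edges == -1) - 1;       % 1-based offset of last null in a run
runLen   = runEnd - runStart + 1;

% Report each run with the line it falls on (newline is byte value 10).
lineNo = zeros(size(runStart));
for k = 1:numel(runStart)
    lineNo(k) = sum(bytes(1:runStart(k) - 1) == 10) + 1;
    fprintf('null run: line %d, byte offset %d, length %d\n', ...
            lineNo(k), runStart(k), runLen(k));
end
```

Tabulating the run lengths and offsets over many failures might show whether they line up with a buffer or file-system block size, which would point away from fprintf itself.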
I have tried to fix this problem using the buffering options provided by fopen. More specifically, when the text file is first opened with fopen, one can pass the permission 'w', which flushes the output to the file with each call of fprintf, or 'W', which turns off automatic flushing so that output is buffered (a 4 kB buffer, as I understand it) and file writes occur less frequently. Neither approach has solved the problem.
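The two fopen modes can be compared with a sketch like the one below, which also checks the byte count that fprintf returns. The file names and the record contents are hypothetical, and this is only an illustration of the two permissions, not a confirmed fix.

```matlab
% Sketch: write the same record with automatic flushing ('w') and
% without it ('W'). All names and values are made up for illustration.
x = [58.1453 152 11.4779];              % hypothetical solution vector
S = 38.8290;                            % hypothetical misfit value
record = sprintf('%d%s %.6g\n', 112, sprintf(' %.6g', x), S);

% 'w': output is flushed automatically after each write.
fidFlush = fopen('results_flush.txt', 'w');
nFlushed = fprintf(fidFlush, '%s', record);   % fprintf returns bytes written
fclose(fidFlush);

% 'W': no automatic flushing; data reaches the file when the buffer is
% flushed or the file is closed.
fidBuf = fopen('results_buffered.txt', 'W');
nBuffered = fprintf(fidBuf, '%s', record);
fclose(fidBuf);

% For an ASCII record the byte count should equal the character count.
if nFlushed ~= numel(record) || nBuffered ~= numel(record)
    warning('Short write: expected %d bytes.', numel(record));
end
```

Logging the returned byte count next to each record would at least show whether fprintf itself reports a short write when a corrupted record appears.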
My guess is that the write process in fprintf is suffering from some kind of timing problem. All of the computations are done using the default “multithreaded” mode, and maybe that mode is a factor.
I am hoping that others may have seen this problem, and might be able to provide evidence and/or ideas to fix it.