Distributed Computing with RamDisk
Hello,
One of my projects involves heavy computational tasks which, luckily, can be parallelized. After optimizing the code for 'parfor' usage, I get a nice speed-up factor of roughly Ncores/3. I am almost sure the bottleneck to squeezing out more lies in the following: although the code optimization reduced the amount of data sent to each worker, the calculation's output is large (100 MB - 1 GB) and probably cannot be dramatically reduced. So large overheads occur, stemming from data transfer (an indication of this is that the workers seem to be finished, but one processor still runs at full capacity for a long time before the parfor loop is exited). My questions:
1. Is my hypothesis correct? That is, is this because parfor writes temporary data files to the hard disk?
2. If so, a RamDisk could prove beneficial (I have A LOT of RAM: 512 GB). How do I 'tell' the distributed computing machinery to use the RAM virtual drive?
Much appreciated, Yanir H.
Answers (2)
Anthony Barone
on 9 Dec 2017
This may or may not be related and is more of a general "this might be worth looking into" suggestion than something specific, but...
In another reply you mention that you are using a 32-core machine, which means you are using a machine with NUMA (the largest UMA CPU I know of is the 28-core Skylake Xeon; EPYC chips can have 32 cores but are inherently NUMA, even though they only use a single socket). In my experience, MATLAB is completely blind to NUMA, and this can lead to a huge amount of overhead and slow things down quite a bit.
If I were you, I would try locking MATLAB to the memory and CPU cores of a single NUMA node and see how fast it runs. If you have N nodes and it runs at 1/N of the speed you are currently getting, then NUMA isn't the problem; if it runs dramatically faster than that, then NUMA is (at least in part) to blame. On Linux this is easy to do by starting the MATLAB instance under "numactl" with both the CPU core and memory affinities set to a single NUMA node (see the example launch below); if it is a Windows or Mac server, I'm sure there are ways to do this as well. If the problem is NUMA, the easiest fix (if possible) is to break the data into N equal parts, run each part on its own NUMA node, and then recombine the results.
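A minimal sketch of such a launch on Linux, assuming numactl is installed and using node 0 purely as an example (adjust the node index and the MATLAB startup flags to your setup):
    numactl --cpunodebind=0 --membind=0 matlab -nodesktop -nosplash
Running your existing benchmark inside that instance and comparing it against an unrestricted run gives the 1/N test described above.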
For reference, I was using a machine with 2 NUMA nodes and 8 cores per node, and I was only getting ~1.1x the performance on both nodes that I got on a single node. Assuming your 32 cores are a 4 x 8-core design, this is roughly in line with the Ncores/3 speed-up you were seeing.
Walter Roberson
on 20 Feb 2017
Edited: Walter Roberson
on 20 Feb 2017
As you specifically said Distributed Computing rather than Parallel Computing, my understanding is that the data is not written to disk but rather transferred over TCP.
For parallel computing (same system) I do not have any information about whether more efficient transfer methods such as DMA are used.
I would tend to doubt DMA itself, directly, as that requires driver mode access. However, that would not rule out the use of shared memory segments, and a kernel implementation of those might use DMA. On the other hand, within a single system shared memory aligned on a page boundary can be transferred just by inserting the appropriate memory descriptor into the hardware virtual memory map.
I can say, though, that the programming model used between workers is the transfer of structured data, much like the serialization process for writing to disk: an encapsulation that might use offsets but not addresses. The question is just whether that serialized data is always passed over TCP or by some other message-passing implementation, such as swapping control of buffers in shared memory. MATLAB is probably written not to care about that, handing the decision off to a lower layer that does the best it can.
Simply providing access into the virtual memory of the other process is not done, and neither is the strategy of allocating a shared segment at a common address and having the C/C++ dynamic memory allocator work out of it so that the other process can use the very same pointers. I can say that because that strategy requires code written with a lot of attention to thread safety and a deliberate design about which process is responsible for deallocating the memory afterwards, and MATLAB lags a fair bit behind in thread safety, with memory allocation and deallocation between threads almost always causing problems.
Anyhow, in short, ramdisk probably will not help.
2 Comments
Walter Roberson
on 20 Feb 2017
I would continue to be certain that it is not writing to disk.
"The principal MPI-1 model has no shared memory concept, and MPI-2 has only a limited distributed shared memory concept. Nonetheless, MPI programs are regularly run on shared memory computers, and both MPICH and Open MPI can use shared memory for message transfer if it is available. Designing programs around the MPI model (contrary to explicit shared memory models) has advantages over NUMA architectures since MPI encourages memory locality. Explicit shared memory programming was introduced in MPI-3."
MATLAB probably invokes (Open) MPI routines, which are responsible for doing the best they can on the target system, with the library having been compiled according to the operating system and hardware facilities.
RAMDISK is unlikely to help.
A relevant question is whether it is feasible to compress the data for transfer.
For example, if you use Fast Serialize/Deserialize from the File Exchange, apply a Java zip routine, transfer the result, then unzip and deserialize, you might reduce your transfer times enough for it to be worthwhile (a rough sketch follows below). This tends to require large intermediate memory structures, but you have the memory.
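A rough, untested sketch of that idea: the function names hlp_serialize / hlp_deserialize are assumed to match the "Fast Serialize/Deserialize" File Exchange entry (check your copy), and InterruptibleStreamCopier is an undocumented MATLAB-internal Java helper commonly used for this kind of stream copy.
    % On the worker: serialize the result and deflate it before sending it back.
    bytes = hlp_serialize(result);                      % uint8 vector (assumed FEX function name)
    baos  = java.io.ByteArrayOutputStream();
    dos   = java.util.zip.DeflaterOutputStream(baos);
    dos.write(bytes);                                   % compress the serialized bytes
    dos.close();
    compressed = typecast(baos.toByteArray(), 'uint8'); % this is what leaves the worker
    % On the client: inflate and deserialize.
    bais   = java.io.ByteArrayInputStream(compressed);
    iis    = java.util.zip.InflaterInputStream(bais);
    out    = java.io.ByteArrayOutputStream();
    copier = com.mathworks.mlwidgets.io.InterruptibleStreamCopier.getInterruptibleStreamCopier();
    copier.copyStream(iis, out);                        % undocumented MATLAB helper class
    result2 = hlp_deserialize(typecast(out.toByteArray(), 'uint8'));
Whether this wins depends on how compressible the output is; for nearly incompressible numeric data the Deflater time can easily exceed the transfer time it saves.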