Choose How to Manage Data in Parallel Computing
To perform parallel computations, you need to manage data access and transfer between your MATLAB® client and the parallel workers. Use this page to decide how to transfer data between the client and workers. You can manage data such as files, MATLAB variables, and handle-type resources.
Determine Your Data Management Approach
The best techniques for managing data depend on your parallel application. Use the following tables to find your goals and discover appropriate data management functions and their key features. In some cases, more than one object or function might meet your requirements; choose based on your preferences.
Transfer Data from Client to Workers
Use this table to identify some goals for transferring data from the client to workers and discover recommended workflows.
Goal | Recommended Workflow
---|---
Use variables in your MATLAB workspace in an interactive parallel pool. | The parfor, parfeval, and spmd constructs automatically copy the variables they reference from the client workspace to the workers.
Transfer variables in your MATLAB workspace to workers on a cluster in a batch workflow. | Pass variables as inputs into the batch function.
Give workers access to large data stored on your desktop. | Use a parallel.pool.Constant object to transfer the data to the workers once and reuse it across multiple computations.
Access large amounts of data or large files stored in the cloud and process it in an onsite or cloud cluster. | Use a datastore to read and process the data on the workers.
Give workers access to files stored on the client computer. | For workers in a parallel pool: attach the files with the AttachedFiles pool property or the addAttachedFiles function. For workers running batch jobs: set the AttachedFiles property of the job.
Access custom MATLAB functions or libraries that are stored on the cluster. | Specify paths to the libraries or functions using the AdditionalPaths property.
Allow workers in a parallel pool to access non-copyable resources such as database connections or file handles. | Use a parallel.pool.Constant object (see the sketch after this table).
Send a message to a worker in an interactive pool running a function. | Create a parallel.pool.PollableDataQueue object on the worker, send the queue back to the client, and then call the send function at the client. The worker retrieves the message with the poll function.
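For example, the following minimal sketch gives pool workers a non-copyable resource through a parallel.pool.Constant object. The temporary file handle stands in for a resource such as a database connection; the cleanup function closes it when the constant is cleared.

```matlab
% Minimal sketch: each worker opens its own temporary file once and
% reuses the handle across parfor iterations.
c = parallel.pool.Constant(@() fopen(tempname,"wt"),@fclose);
parfor i = 1:10
    fprintf(c.Value,"result %d\n",i);  % c.Value is this worker's file handle
end
clear c  % destroys the Constant and runs the cleanup function on each worker
```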
Transfer Data Between Workers
Use this table to identify some goals for transferring data between workers and discover recommended workflows.
Goal | Recommended Workflow
---|---
Transfer data between workers running an spmd block. | Use the spmdSend, spmdReceive, and spmdSendReceive functions, as shown in the sketch after this table. To synchronize the workers, use the spmdBarrier function.
Offload results from workers, which another worker can process. | Store the data in the ValueStore object of the pool or cluster, where other workers can retrieve and process it.
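For example, this sketch passes a matrix from worker 1 to worker 2 inside an spmd block.

```matlab
spmd(2)
    if spmdIndex == 1
        spmdSend(magic(4),2);   % worker 1 sends data to worker 2
    else
        A = spmdReceive(1);     % worker 2 blocks until the data arrives
        disp(sum(A(:)))
    end
end
```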
Transfer Data from Workers to Client
Use this table to identify some goals for transferring data from a worker to a client and discover recommended workflows.
Goal | Recommended Workflow
---|---
Retrieve results from a parfeval computation. | Apply the fetchOutputs function to the Future object.
Retrieve large results at the client. | Store the data in the ValueStore object of the pool or cluster and retrieve it when you need it.
Retrieve large files at the client. | Use the FileStore object of the pool or cluster.
Fetch the results from a parallel job. | Apply the fetchOutputs (Jobs) function to the job object.
Load the workspace variables from a batch job that runs a script or expression. | Apply the load function to the job object (see the sketch after this table).
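For example, a minimal sketch of the batch workflow, assuming a script named myScript.m that creates a variable named result (both names are hypothetical):

```matlab
job = batch("myScript");   % run the script on a worker
wait(job)                  % block until the job finishes
load(job,"result")         % copy 'result' from the job workspace to the client
```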
Transfer Data from Workers to Client During Execution
Use this table to identify some goals for transferring data from a worker during execution and discover recommended workflows.
Goal | Recommended Workflow
---|---
Inspect results from parfeval computations as they become available. | Use a Future object with the afterEach or fetchNext function.
Update a plot, progress bar, or other user interface with data from a function running in an interactive parallel pool. | Send the data to the client with a parallel.pool.DataQueue object, as shown in the sketch after this table. For very large computations with 1000s of calls to the send function, consider batching the updates to reduce the transfer overhead.
Collect data asynchronously to update a plot, progress bar, or other user interface with data from a batch workflow. | Use the ValueStore object of the job and respond to updates with its KeyUpdatedFcn callback.
Retrieve intermediate results at the client while the computation runs. | Store the data in the ValueStore object of the pool or cluster.
Share files with the client while the computation runs. | Store the files in the FileStore object of the pool or cluster.
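For example, this sketch streams progress messages from a parfor loop to the client with a DataQueue; the afterEach callback runs on the client as each message arrives.

```matlab
q = parallel.pool.DataQueue;
afterEach(q,@(i) fprintf("Finished iteration %d\n",i));  % runs on the client
parfor i = 1:20
    pause(0.1*rand);  % stand-in for real work
    send(q,i);        % notify the client
end
```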
Compare Data Management Functions and Objects
Some parallel computing objects and functions that manage data have similar features. This section compares those objects and functions to help you choose between them.
DataQueue vs. ValueStore
DataQueue and ValueStore are two objects in Parallel Computing Toolbox™ that you can use to transfer data between the client and workers. The DataQueue object passes data from workers to the client in first-in, first-out (FIFO) order, while ValueStore stores data that multiple workers, as well as the client, can access and update. You can use both objects for asynchronous data transfer to the client. However, DataQueue is only supported on interactive parallel pools.
The choice between DataQueue and ValueStore depends on the data access pattern your parallel application requires. If you have many independent tasks that workers can execute in any order, and you want to pass data to the client in a streaming fashion, use a DataQueue object. However, if you want to store values, share them with multiple workers, and access or update them at any time, use ValueStore instead.
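As a sketch of the ValueStore access pattern, the following function has a parfeval worker write a key-value pair to the pool ValueStore and the client read it back. The function names and the key are illustrative.

```matlab
function storeDemo()
% Sketch: a worker writes to the pool ValueStore; the client reads it back.
p = gcp;                          % current (or new) interactive pool
f = parfeval(@writeToStore,0);    % no output arguments requested
wait(f)
store = p.ValueStore;
disp(store("intermediate"))       % read the value at the client
end

function writeToStore()
store = getCurrentValueStore();   % worker-side handle to the same store
store("intermediate") = rand(5);  % add or update a key-value pair
end
```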
fetchOutputs (parfeval) vs. ValueStore
Use the fetchOutputs function to retrieve the output arguments of a Future object, which the software returns when you run a parfeval or parfevalOnAll computation. fetchOutputs blocks the client until the computation is complete, then sends the results of the parfeval or parfevalOnAll computation to the client. In contrast, you can use ValueStore to store and retrieve values from any parallel computation, and to retrieve intermediate results as workers produce them without blocking the program. Additionally, the ValueStore object is not held in system memory, so you can store large results in the ValueStore. However, be careful when storing large amounts of data to avoid filling up the disk space on the cluster.
If you only need to retrieve the output of a parfeval or parfevalOnAll computation, fetchOutputs is the simpler option. However, if you want to store and access the results of multiple independent parallel computations, use ValueStore. When you have multiple parfeval computations generating large amounts of data, the pool ValueStore object can help you avoid memory issues on the client: temporarily save the results in the ValueStore and retrieve them when you need them.
load and fetchOutputs (Jobs) vs. ValueStore
load, fetchOutputs (Jobs), and ValueStore provide different ways of transferring data from jobs back to the client.
load retrieves the variables related to a job you create when you use the batch function to run a script or an expression, including any input arguments you provide and temporary variables the workers create during the computation. load does not retrieve the variables from batch jobs that run a function, and you cannot retrieve results while the job is running.
fetchOutputs (Jobs) retrieves the output arguments contained in the tasks of a finished job you create using the batch, createJob, or createCommunicatingJob functions. If the job is still running when you call it, the fetchOutputs (Jobs) function returns an error.
When you create a job on a cluster, the software automatically creates a ValueStore object for the job, which you can use to store data generated during job execution. Unlike the load and fetchOutputs functions, the ValueStore object does not store data automatically. Instead, you must manually add data as key-value pairs to the ValueStore object. Workers can store data in the ValueStore object that the MATLAB client can retrieve during job execution. Additionally, the ValueStore object is not held in system memory, so you can store large results in the store.
To retrieve the results of a job after the job has finished, use the load or fetchOutputs (Jobs) function. To access the results or track the progress of a job while it is still running, or to store potentially high-memory results, use the ValueStore object.
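For example, this sketch monitors a running batch job through its ValueStore, assuming the job runs a hypothetical script longScript.m that writes progress values to the store; the callback prints each key as the workers update it.

```matlab
job = batch("longScript");    % hypothetical script that writes to its ValueStore
store = job.ValueStore;
store.KeyUpdatedFcn = @(s,key) fprintf("%s updated: %g\n",key,s(key));
```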
AdditionalPaths vs. AttachedFiles vs. AutoAttachedFiles
AdditionalPaths, AttachedFiles, and AutoAttachedFiles are parallel job properties that you can use to specify additional files and folders the workers need to run your parallel code.
AdditionalPaths is a property you can use to add cluster file locations to the MATLAB path on all workers running your job. This is useful if the cluster storage holds files with large data, or functions or libraries the workers require, that are not on the MATLAB path by default.
The AttachedFiles property allows you to specify files or folders that the workers require but that are not stored on the cluster storage. The software copies these files to a temporary directory on each worker before the parallel code runs. The files can be scripts, functions, or data files, and must be located within the directory structure of the client.
Use the AutoAttachedFiles property to automatically attach the files the workers need to the job. When you submit a job or task, MATLAB performs dependency analysis on all the task functions, or on the batch job script or function, and then automatically adds the required files to the job or task object so they transfer to the workers. Set the AutoAttachedFiles property to false only if you know that you do not need the software to identify the files for you, for example, if the files your job uses are already present on the cluster, perhaps inside one of the AdditionalPaths locations.
Use AdditionalPaths when you have functions and libraries stored on the cluster that all workers require. Use AttachedFiles when you have small files that are required to run your code. To let MATLAB automatically determine whether a job requires additional files to run, set the AutoAttachedFiles property to true.
See Also
ValueStore | FileStore | parallel.pool.Constant | parallel.pool.PollableDataQueue | spmdSend | spmdReceive | spmdSendReceive | spmdBarrier | fetchOutputs | fetchOutputs (Jobs) | load | parallel.pool.DataQueue