write
Write distributed data to an output location
Description
write(location,D) writes the values in the distributed array D to files in the folder location. The data is stored in an efficient binary format suitable for reading back using datastore(location). If not distributed along the first dimension, MATLAB® redistributes the data before writing, so that the resulting files can be reread using datastore.
write(filepattern,D) uses the file extension from filepattern to determine the output format. filepattern must include a folder to write the files into followed by a file name that includes a wildcard *. The wildcard represents incremental numbers for generating unique file names, for example write('folder/myfile_*.csv',D).
write(___,Name,Value) specifies additional options with one or more name-value pair arguments using any of the previous syntaxes. For example, you can specify the file type with 'FileType' and a valid file type ('mat', 'seq', 'parquet', 'text', or 'spreadsheet'), or you can specify a custom write function to process the data with 'WriteFcn' and a function handle.
Examples
Write Distributed Arrays
This example shows how to write a distributed array to a file system, then read it back using a datastore.
Create a distributed array and write it to an output folder.
d = distributed.rand(5000,1);
location = 'hdfs://myHadoopCluster/some/output/folder';
write(location, d);
Recreate the distributed array from the written files.
ds = datastore(location); d1 = distributed(ds);
Write Distributed Arrays Using File Patterns
This example shows how to write distributed arrays to different formats using a file pattern.
Create a distributed table and write it to a simple text-based format that many applications can read.
dt = distributed(array2table(rand(5000,3)));
location = "/tmp/CSVData/dt_*.csv";
write(location, dt);
Recreate the distributed table from the written files.
ds = datastore(location); dt1 = distributed(ds);
Write and Read Back Tall and Distributed Data
You can write distributed data and read it back as tall data and vice versa.
Create a distributed table and write it to disk.
dt = distributed(array2table(rand(5000,3)));
location = "/tmp/CSVData/dt_*.csv";
write(location, dt);
Build a tall table from the written files.
ds = datastore(location); tt = tall(ds);
Alternatively, you can read data written from tall data into distributed data. Create a tall table and write it to disk.
tt = tall(array2table(rand(5000,3)));
location = "/tmp/CSVData/dt_*.csv";
write(location, tt);
Read back into a distributed table.
ds = datastore(location); dt = distributed(ds);
Write Distributed Arrays Using a Write Function
This example shows how to write distributed arrays to a file system using a custom write function.
Create a simple write function that writes out spreadsheet files.
function dataWriter(info, data)
    filename = info.SuggestedFilename;
    writetable(data, filename, "FileType", "spreadsheet");
end
Create a distributed table and write it to disk using the custom write function.
dt = distributed(array2table(rand(5000,3)));
location = "/tmp/MyData/tt_*.xlsx";
write(location, dt, "WriteFcn", @dataWriter);
Input Arguments
location — Folder location to write data
character vector | string
Folder location to write data, specified as a character vector or string. location can specify a full or relative path. The specified folder can be either of these options:
Existing empty folder that contains no other files
New folder that write creates
You can write data to local folders on your computer, folders on a shared network, or to remote locations, such as Amazon S3™, Windows Azure® Storage Blob, or a Hadoop® Distributed File System (HDFS™). For more information about reading and writing data to remote locations, see Work with Remote Data.
Example: location = '../../dir/data' specifies a relative file path.
Example: location = 'C:\Users\MyName\Desktop\data' specifies an absolute path to a Windows® desktop folder.
Example: location = 'file:///path/to/data' specifies an absolute URI path to a folder.
Example: location = 'hdfs://myHadoopCluster/some/output/folder' specifies an HDFS URL.
Example: location = 's3://bucketname/some/output/folder' specifies an Amazon S3 location.
Data Types: char | string
D — Input array
distributed array
Input array, specified as a distributed array.
filepattern — File naming pattern
string | character vector
File naming pattern, specified as a string or a character vector. The file naming pattern must contain a folder to write the files into followed by a file name that includes a wildcard *. write replaces the wildcard with sequential numbers to ensure unique file names.
Example: write('folder/data_*.txt',D) writes the distributed array D as a series of .txt files in folder with the file names data_1.txt, data_2.txt, and so on.
Data Types: char | string
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: write('C:\myData', D, 'FileType', 'text', 'WriteVariableNames', false) writes the distributed array D to C:\myData as a collection of text files that do not use variable names as column headings.
FileType — Type of file
'auto' (default) | 'mat' | 'parquet' | 'seq' | 'text' | 'spreadsheet'
Type of file, specified as the comma-separated pair consisting of 'FileType' and one of the allowed file types: 'auto', 'mat', 'parquet', 'seq', 'text', or 'spreadsheet'.
Use the 'FileType' name-value pair with the location argument to specify what type of files to write. By default, write attempts to automatically detect the proper file type. You do not need to specify the 'FileType' name-value pair argument if write can determine the file type from an extension in the location or filepattern arguments. write can determine the file type from these extensions:
.mat for MATLAB data files
.parquet or .parq for Parquet files
.seq for sequence files
.txt, .dat, or .csv for delimited text files
.xls, .xlsx, .xlsb, .xlsm, .xltx, or .xltm for spreadsheet files
Example: write('C:\myData', D, 'FileType', 'text')
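The extension rules above can be pictured as a simple lookup. The following is a minimal sketch and not the implementation of write: the variable names (filepattern, fileType) and the fallback value 'unrecognized' are assumptions made purely for illustration.

```matlab
% Illustration only: mirrors the documented extension list.
% It does not call write and needs no toolbox.
filepattern = 'folder/myfile_*.csv';     % example pattern from this page
[~, ~, ext] = fileparts(filepattern);    % ext is '.csv'

switch lower(ext)
    case '.mat'
        fileType = 'mat';
    case {'.parquet', '.parq'}
        fileType = 'parquet';
    case '.seq'
        fileType = 'seq';
    case {'.txt', '.dat', '.csv'}
        fileType = 'text';
    case {'.xls', '.xlsx', '.xlsb', '.xlsm', '.xltx', '.xltm'}
        fileType = 'spreadsheet';
    otherwise
        % Hypothetical fallback: extension not in the documented list,
        % so 'FileType' would have to be passed explicitly.
        fileType = 'unrecognized';
end
disp(fileType)
```

If the extension is not in the list, passing 'FileType' explicitly removes any ambiguity.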
WriteFcn — Custom writing function
function handle
Custom writing function, specified as the comma-separated pair consisting of 'WriteFcn' and a function handle. The specified function receives blocks of data from D and is responsible for creating the output files. You can use the 'WriteFcn' name-value pair argument to write data in a variety of formats, even if the output format is not directly supported by write.
Functional Signature
The custom writing function must accept two input arguments, info and data:
function myWriter(info, data)
data contains a block of data from D.
info is a structure with fields that contain information about the block of data. You can use the fields to build a new file name that is globally unique within the final location. The structure fields are:
Field | Description |
---|---|
RequiredLocation | Fully qualified path to a temporary output folder. All output files must be written to this folder. |
RequiredFilePattern | The file pattern required for output file names. This field is empty if only a folder name is specified. |
SuggestedFilename | A fully qualified, globally unique file name that meets the location and naming requirements. |
PartitionIndex | Index of the distributed array partition being written. |
NumPartitions | Total number of partitions in the distributed array. |
BlockIndexInPartition | Position of current data block within the partition. |
IsFinalBlock | true if current block is the final block of the partition. |
File Naming
The file name used for the output files determines the order that the files are read back in later by datastore. If the order of the files matters, then the best practice is to use the SuggestedFilename field to name the files since the suggested name guarantees the file order. If you do not use the suggested file name, the custom writing function must create globally unique, correctly ordered file names. The file names should follow the naming pattern outlined in RequiredFilePattern. The file names must be unique and correctly ordered between workers, even though each worker writes to its own local folder.
Arrays with Multiple Partitions
A distributed array is divided into partitions to facilitate running calculations on the array in parallel with Parallel Computing Toolbox™. When writing a distributed array, each of the partitions is divided into smaller blocks.
info contains several fields related to partitions: PartitionIndex, NumPartitions, BlockIndexInPartition, and IsFinalBlock. These fields are useful when you are writing out a single file and appending to it, which is a common task for arrays with large partitions that have been split into many blocks. The custom writing function is called once per block, and the blocks in one partition are always written in order on one worker. However, different partitions can be written by different workers.
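The append-to-one-file pattern described above might look like the following sketch. It is not taken from the MathWorks documentation. The function name appendingWriter and the file name pattern part_%05d.csv are hypothetical. The sketch assumes that only a folder was given (so RequiredFilePattern is empty), that BlockIndexInPartition is 1 for the first block of a partition, and that the output folder is on a local or shared file system. The short driver at the top simulates two blocks of one partition so the function can be tried without a parallel pool.

```matlab
% Driver (illustration only): simulate two blocks of partition 1
% arriving in order, as they would on a single worker.
info = struct('RequiredLocation', tempname, ...
              'PartitionIndex', 1, ...
              'BlockIndexInPartition', 1);
mkdir(info.RequiredLocation);
appendingWriter(info, array2table(rand(3,2)));   % first block creates the file
info.BlockIndexInPartition = 2;
appendingWriter(info, array2table(rand(4,2)));   % second block is appended

function appendingWriter(info, data)
    % Hypothetical custom write function: one CSV file per partition.
    % The zero-padded partition index keeps names unique across workers.
    name = sprintf('part_%05d.csv', info.PartitionIndex);
    filename = fullfile(info.RequiredLocation, name);
    if info.BlockIndexInPartition == 1
        % First block of the partition: create the file with a header row.
        writetable(data, filename);
    else
        % Later blocks: append rows only, without repeating the header.
        writetable(data, filename, 'WriteMode', 'append', ...
                   'WriteVariableNames', false);
    end
end
```

With a distributed array D, the same function would be passed as write(location, D, 'WriteFcn', @appendingWriter). The 'WriteMode' option of writetable requires R2020a or later.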
Example Function
A simple writing function that writes out spreadsheet files is:
function dataWriter(info, data)
    filename = info.SuggestedFilename;
    writetable(data, filename, 'FileType', 'spreadsheet')
end
To invoke dataWriter as the writing function for some data D, use the commands:
D = distributed(array2table(rand(5000,3)));
location = '/tmp/MyData/D_*.xlsx';
write(location, D, 'WriteFcn', @dataWriter);
The dataWriter function uses the suggested file name in the info structure and calls writetable to write out a spreadsheet file. The suggested file name takes into account the file naming pattern that is specified in the location argument.
Data Types: function_handle
WriteVariableNames — Indicator for writing variable names as column headings
true or 1 (default) | false or 0
Indicator for writing variable names as column headings, specified as the comma-separated pair consisting of 'WriteVariableNames' and a numeric or logical 1 (true) or 0 (false).
Indicator | Behavior |
---|---|
true | Variable names are included as the column headings of the output. This is the default behavior. |
false | Variable names are not included in the output. |
DateLocale — Locale for writing dates
character vector | string scalar
Locale for writing dates, specified as the comma-separated pair consisting of 'DateLocale' and a character vector or a string scalar. When writing datetime values to the file, use DateLocale to specify the locale in which write should write month and day-of-week names and abbreviations. The character vector or string takes the form xx_YY, where xx is a lowercase ISO 639-1 two-letter code indicating a language, and YY is an uppercase ISO 3166-1 alpha-2 code indicating a country. For a list of common values for the locale, see the Locale name-value pair argument for the datetime function.
For Excel® files, write writes variables containing datetime arrays as Excel dates and ignores the 'DateLocale' parameter value. If the datetime variables contain years prior to either 1900 or 1904, then write writes the variables as text. For more information on Excel dates, see Differences between the 1900 and the 1904 date system in Excel.
Example: 'DateLocale','ja_JP' or 'DateLocale',"ja_JP"
Data Types: char | string
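As a quick sanity check of the xx_YY form described above, a locale string can be matched against a pattern before it is passed to write. This is only an illustrative sketch: the variable names are hypothetical, and the pattern checks the shape of the string, not whether the language or country code is actually defined.

```matlab
% Illustration only: check that a locale string has the xx_YY shape
% (two lowercase letters, underscore, two uppercase letters).
locale = 'ja_JP';
isWellFormed = ~isempty(regexp(locale, '^[a-z]{2}_[A-Z]{2}$', 'once'));

% A string with the cases swapped does not match the documented form.
badLocale = 'JA_jp';
isBadWellFormed = ~isempty(regexp(badLocale, '^[a-z]{2}_[A-Z]{2}$', 'once'));
```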
Delimiter — Field delimiter character
',' or 'comma' | ' ' or 'space' | ...
Field delimiter character, specified as the comma-separated pair consisting of 'Delimiter' and one of these specifiers:
Specifier | Field Delimiter |
---|---|
',' or 'comma' | Comma. This is the default behavior. |
' ' or 'space' | Space |
'\t' or 'tab' | Tab |
';' or 'semi' | Semicolon |
'bar' (or the vertical bar character) | Vertical bar |
You can use the 'Delimiter' name-value pair argument only for delimited text files.
Example: 'Delimiter','space' or 'Delimiter',"space"
QuoteStrings — Indicator for writing quoted text
false (default) | true
Indicator for writing quoted text, specified as the comma-separated pair consisting of 'QuoteStrings' and either false or true. If 'QuoteStrings' is true, then write encloses the text in double quotation marks, and replaces any double-quote characters that appear as part of that text with two double-quote characters. For an example, see Write Quoted Text to CSV File.
You can use the 'QuoteStrings' name-value pair argument only with delimited text files.
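The quoting rule can be reproduced by hand on a single value to see what the written field looks like. This is a sketch of the rule as stated above and not the code that write runs. The variable names txt and quoted are illustrative.

```matlab
% Illustration only: apply the documented quoting rule to one text value.
txt = 'He said "hi"';

% Double every embedded double-quote, then wrap the result in double quotes.
quoted = ['"' strrep(txt, '"', '""') '"'];

disp(quoted)
```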
Encoding — Character encoding scheme
'UTF-8' | 'ISO-8859-1' | 'windows-1251' | 'windows-1252' | ...
Character encoding scheme associated with the file, specified as the comma-separated pair consisting of 'Encoding' and 'system' or a standard character encoding scheme name, such as 'UTF-8', 'ISO-8859-1', 'windows-1251', or 'windows-1252'. When you do not specify any encoding or specify encoding as 'system', the write function uses your system default encoding to write the file.
Example: 'Encoding','system' or 'Encoding',"system" uses the system default encoding.
Sheet — Target worksheet
character vector | string scalar | positive integer
Target worksheet, specified as the comma-separated pair consisting of 'Sheet' and a character vector or a string scalar containing the worksheet name or a positive integer indicating the worksheet index. The worksheet name cannot contain a colon (:). To determine the names of sheets in a spreadsheet file, use [status,sheets] = xlsfinfo(filename).
If the sheet does not exist, then write adds a new sheet at the end of the worksheet collection. If the sheet is an index larger than the number of worksheets, then write appends empty sheets until the number of worksheets in the workbook equals the sheet index. In either case, write generates a warning indicating that it has added a new worksheet.
You can use the 'Sheet' name-value pair argument only with spreadsheet files.
Example: 'Sheet',2
Example: 'Sheet','MySheetName'
Data Types: char | string | single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
VariableCompression — Parquet compression algorithm
'snappy' (default) | 'brotli' | 'gzip' | 'uncompressed' | cell array of character vectors | string vector
Parquet compression algorithm, specified as one of these values:
'snappy', 'brotli', 'gzip', or 'uncompressed'. If you specify one compression algorithm, then write compresses all variables using the same algorithm.
Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the compression algorithms to use for each variable.
In general, 'snappy' has better performance for reading and writing, 'gzip' has a higher compression ratio at the cost of more CPU processing time, and 'brotli' typically produces the smallest file size at the cost of compression speed.
Example: write('C:\myData',D,'FileType','parquet','VariableCompression','brotli')
Example: write('C:\myData', D, 'FileType', 'parquet', 'VariableCompression', {'brotli' 'snappy' 'gzip'})
VariableEncoding — Encoding scheme names
'auto' (default) | 'dictionary' | 'plain' | cell array of character vectors | string vector
Encoding scheme names, specified as one of these values:
'auto' — write uses 'plain' encoding for logical variables, and 'dictionary' encoding for all others.
'dictionary', 'plain' — If you specify one encoding scheme, then write encodes all variables with that scheme.
Alternatively, you can specify a cell array of character vectors or a string vector containing the names of the encoding scheme to use for each variable.
In general, 'dictionary' encoding results in smaller file sizes, but 'plain' encoding can be faster for variables that do not contain many repeated values. If the size of the dictionary or number of unique values grows to be too big, then the encoding automatically reverts to plain encoding. For more information on Parquet encodings, see Parquet encoding definitions.
Example: write('myData.parquet', D, 'FileType', 'parquet', 'VariableEncoding', 'plain')
Example: write('myData.parquet', D, 'FileType', 'parquet', 'VariableEncoding', {'plain' 'dictionary' 'plain'})
Version — Parquet version to use
'2.0' (default) | '1.0'
Parquet version to use, specified as either '1.0' or '2.0'. By default, '2.0' offers the most efficient storage, but you can select '1.0' for the broadest compatibility with external applications that support the Parquet format.
Limitations
In some cases, write(location, D, 'FileType', type) creates files that do not represent the original array D exactly. If you use datastore(location) to read the checkpoint files, then the result might not have the same format or contents as the original distributed table.
For the 'text' and 'spreadsheet' file types, write uses these rules:
write outputs numeric variables using longG format, and categorical, character, or string variables as unquoted text.
For non-text variables that have more than one column, write outputs multiple delimiter-separated fields on each line, and constructs suitable column headings for the first line of the file.
write outputs variables with more than two dimensions as two-dimensional variables, with trailing dimensions collapsed.
For cell-valued variables, write outputs the contents of each cell as a single row, in multiple delimiter-separated fields, when the contents are numeric, logical, character, or categorical, and outputs a single empty field otherwise.
Do not use the 'text' or 'spreadsheet' file types if you need to write an exact checkpoint of the distributed array.
Tips
Use the write function to create checkpoints or snapshots of your data as you work. This practice allows you to reconstruct distributed arrays directly from files on disk rather than re-executing all of the commands that produced the distributed array.
Version History
Introduced in R2017a