Apache Parquet Data Type Mappings

MATLAB® represents column-oriented data with tables and timetables. Each variable in a table or timetable can have a different data type and any number of columns. Column vectors are the most common shape of table and timetable variables.

The Apache® Parquet file format is used for column-oriented heterogeneous data. Similar to MATLAB tables and timetables, each of the columns in a Parquet file can have different data types. The MATLAB Parquet functions use Apache Arrow functionality to read and write Parquet files. MATLAB stores the original Arrow table schema in the Parquet file as custom metadata. Arrow uses the original table schema to roundtrip certain data types.

Despite their similarity, the permitted data types in MATLAB tables and timetables sometimes do not map exactly to the permitted data types in Parquet files. In some cases, it is necessary for MATLAB to perform data type conversions to retain information in the data (such as missing values). This conversion can sometimes result in a loss of precision in the data.

In general, MATLAB tables and timetables have these behaviors when they are converted to Parquet files:

  • Table properties set on the original table are not saved.

  • Table row names or timetable row times are converted into a new table variable before being written.

  • When reading a variable name from a Parquet file, invalid table variable names are converted to valid table variable names.
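The behaviors listed above can be observed with a short round trip. This is a minimal sketch, not a definitive recipe: the file name `roundtrip_demo.parquet`, the variable names, and the table contents are hypothetical, and the name that MATLAB gives the variable created from the row names is printed rather than assumed.

```matlab
% Build a small table with row names and a table property (illustrative data).
T = table([1; 2; 3], ["a"; "b"; "c"], ...
    'VariableNames', {'Value', 'Label'}, ...
    'RowNames', {'r1', 'r2', 'r3'});
T.Properties.Description = 'demo table';

% Hypothetical file location used only for this sketch.
filename = fullfile(tempdir, 'roundtrip_demo.parquet');
parquetwrite(filename, T);

% Inspect what was stored, then read the data back.
info = parquetinfo(filename);
disp(info.VariableNames)             % variables stored in the file
T2 = parquetread(filename);
disp(T2.Properties.VariableNames)    % row names come back as a table variable
disp(isempty(T2.Properties.Description))  % table properties are not saved
```

Comparing `T.Properties` with `T2.Properties` after the round trip is a quick way to see which metadata needs to be stored separately.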

Parquet files use a small number of primitive (or physical) data types. The logical types extend the physical types by specifying how they should be interpreted. Parquet data types not covered here (JSON, BSON, binary, and so on) are not supported for reading from or writing to Parquet files.

The following tables summarize the representable data types in MATLAB tables and timetables, as well as how they map to corresponding types in Apache Arrow and Parquet files.

Numeric Data Types

Reading Numeric Data Types from Apache Parquet to MATLAB

| Parquet Logical Type | Parquet Physical Type | Apache Arrow Data Type | MATLAB Table or Timetable Variable Type | Notes |
| --- | --- | --- | --- | --- |
| None | DOUBLE | double | double | Parquet data type is a 64-bit floating-point value. MATLAB converts any null floating-point values in Parquet files to NaN values. |
| None | FLOAT | float | single | Parquet data type is a 32-bit floating-point value. MATLAB converts any null floating-point values in Parquet files to NaN values. |
| INT (bitWidth=8, isSigned=true) | INT32 | int8 | int8 | Applies to the integer types: parquetread converts an array that contains null values to a MATLAB double array and sets the null values to NaN. 64-bit integers are truncated when converted to doubles if values are larger in magnitude than flintmax. parquetDatastore imports null values as the sentinel value 0. int96 values are truncated to int64 values when read. |
| INT (bitWidth=8, isSigned=false) | INT32 | uint8 | uint8 | See the int8 notes. |
| INT (bitWidth=16, isSigned=true) | INT32 | int16 | int16 | See the int8 notes. |
| INT (bitWidth=16, isSigned=false) | INT32 | uint16 | uint16 | See the int8 notes. |
| None | INT32 | int32 | int32 | See the int8 notes. |
| INT (bitWidth=32, isSigned=false) | INT32 | uint32 | uint32 | See the int8 notes. |
| INT (bitWidth=64, isSigned=true) | INT64 | int64 | int64 | See the int8 notes. |
| INT (bitWidth=64, isSigned=false) | INT64 | uint64 | uint64 | See the int8 notes. |
| None | BOOLEAN | boolean | logical | If a BOOLEAN array contains null values, parquetread converts the array to the MATLAB double data type and sets the null values to NaN. parquetDatastore imports null values as the sentinel value false. |
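The flintmax note for 64-bit integers can be checked without a Parquet file. The sketch below only illustrates the underlying double-precision limit that applies when an integer column containing null values is converted to double; the variable names are hypothetical.

```matlab
% flintmax is 2^53, the largest consecutive integer a double can represent.
big = int64(flintmax) + 1;      % 2^53 + 1, exact in int64 arithmetic

% Converting to double (as happens for integer columns with nulls) loses the
% low-order bit, so the value no longer survives a trip back to int64.
asDouble = double(big);
fprintf('%d\n', big);                 % original int64 value
fprintf('%d\n', int64(asDouble));     % value after the double conversion
disp(int64(asDouble) == big)          % displays 0 (false)
```

If exact 64-bit values matter, one option is to confirm that the column has no null values before relying on the imported data.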

Writing Numeric Data Types from MATLAB to Apache Parquet

| MATLAB Table or Timetable Variable Type | Apache Arrow Data Type | Parquet Logical Type | Parquet Physical Type | Notes |
| --- | --- | --- | --- | --- |
| double | double | None | DOUBLE | Applies to the floating-point types: MATLAB converts NaN values to null values in the Parquet file. These workflows do not support complex values or sparse arrays. |
| single | float | None | FLOAT | See the double notes. |
| int8 | int8 | INT (bitWidth=8, isSigned=true) | INT32 | Applies to the integer types: These workflows do not support complex values. |
| uint8 | uint8 | INT (bitWidth=8, isSigned=false) | INT32 | See the int8 notes. |
| int16 | int16 | INT (bitWidth=16, isSigned=true) | INT32 | See the int8 notes. |
| uint16 | uint16 | INT (bitWidth=16, isSigned=false) | INT32 | See the int8 notes. |
| int32 | int32 | None | INT32 | See the int8 notes. |
| uint32 | uint32 | INT (bitWidth=32, isSigned=false) | INT32 | See the int8 notes. |
| int64 | int64 | INT (bitWidth=64, isSigned=true) | INT64 | See the int8 notes. |
| uint64 | uint64 | INT (bitWidth=64, isSigned=false) | INT64 | See the int8 notes. |
| logical | boolean | None | BOOLEAN | This workflow does not support sparse arrays. |
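As an illustration of the numeric write mappings, the following sketch writes a double, an int8, and a logical variable and reads them back. The file name and variable names are hypothetical. The NaN in the double variable is stored as a null value and, per the reading table, is imported as NaN again.

```matlab
% Illustrative table with one variable per numeric category.
T = table([1.5; NaN; 3], int8([1; 2; 3]), [true; false; true], ...
    'VariableNames', {'D', 'I', 'B'});

% Hypothetical file location used only for this sketch.
filename = fullfile(tempdir, 'numeric_demo.parquet');
parquetwrite(filename, T);
T2 = parquetread(filename);

% Inspect the imported types and the round-tripped NaN.
disp(class(T2.D))
disp(class(T2.I))
disp(class(T2.B))
disp(isnan(T2.D(2)))
```

Because the integer and logical variables here contain no null values, they are not converted to double on import.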

Binary Data Types

Reading Binary Data Types from Apache Parquet to MATLAB

| Parquet Logical Type | Parquet Physical Type | Apache Arrow Data Type | MATLAB Table or Timetable Variable Type | Notes |
| --- | --- | --- | --- | --- |
| String | BYTE_ARRAY | String | string | |
| String | BYTE_ARRAY | Dictionary (index_type=int32, value_type=string, ordered based on metadata) | categorical | Arrow reads the data as a dictionary only if the Parquet file contains the original Arrow schema. |
| None | FIXED_LEN_BYTE_ARRAY | FixedSizeBinary(byte_width) | cell of uint8 values | |
| None | BYTE_ARRAY | Binary | cell of uint8 values | |

Writing Binary Data Types from MATLAB to Apache Parquet

| MATLAB Table or Timetable Variable Type | Apache Arrow Data Type | Parquet Logical Type | Parquet Physical Type | Notes |
| --- | --- | --- | --- | --- |
| string | LargeString | String | BYTE_ARRAY | MATLAB converts string arrays to Arrow LargeString arrays. Other Parquet readers (such as PyArrow) may import Parquet String columns as LargeString arrays based on the original Arrow table schema stored in Parquet files written by MATLAB. string(missing) values are written as null values in the Parquet file. |
| char | LargeString | String | BYTE_ARRAY | |
| cellstr | LargeString | String | BYTE_ARRAY | |
| categorical | Dictionary (values=string; indices can be int8, int16, int32, or int64; ordered is true or false based on the MATLAB Ordinal property) | String | BYTE_ARRAY | |
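A short sketch of the text mappings: a string variable with a missing value and a categorical variable are written and read back. The file name, variable names, and values are hypothetical, and the results are displayed rather than assumed.

```matlab
% Illustrative text data: one missing string and a categorical variable.
Name  = ["red"; string(missing); "blue"];
Level = categorical({'lo'; 'hi'; 'lo'});
T = table(Name, Level);

% Hypothetical file location used only for this sketch.
filename = fullfile(tempdir, 'text_demo.parquet');
parquetwrite(filename, T);   % string(missing) is written as a null value
T2 = parquetread(filename);

% Inspect how the variables were imported.
disp(class(T2.Name))
disp(ismissing(T2.Name))
disp(class(T2.Level))        % categorical relies on the stored Arrow schema
```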

Date and Time Data Types

Reading Date and Time Data Types from Apache Parquet to MATLAB

| Parquet Logical Type | Parquet Physical Type | Apache Arrow Data Type | MATLAB Table or Timetable Variable Type | Notes |
| --- | --- | --- | --- | --- |
| DATE | INT32 | date32 | datetime | The INT32 value represents the number of days since the Unix epoch (January 1, 1970). Null values are imported as NaT. |
| Timestamp (isAdjustedToUTC can be true or false; TimeUnit can be milliseconds, microseconds, or nanoseconds) | INT64 | timestamp (unit; tz=None) | datetime | When the Parquet file contains the original Arrow table schema as metadata: if the timestamp data has been adjusted to UTC, the time zone is determined by the original schema; if no time zone is present in the original schema and isAdjustedToUTC is true, MATLAB sets the TimeZone property of the imported datetime array to UTC. Null Timestamp and Date values are imported as NaT values. |
| Time (isAdjustedToUTC can be true or false; unit=milliseconds) | INT32 | time32 [ms] (unit=ms) | duration | Null time32 values are read in MATLAB as NaN sec. |
| Time (isAdjustedToUTC can be true or false; unit can be microseconds or nanoseconds) | INT64 | time64 [us/ns] (unit can be microseconds or nanoseconds) | duration | Null time64 values are read in MATLAB as NaN sec. |
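The interpretation of the stored integers can be sketched directly in MATLAB. This is only an illustration of the arithmetic described in the notes, using hypothetical raw values; it is not a description of how parquetread is implemented.

```matlab
% A DATE value is a count of days since the Unix epoch (January 1, 1970).
daysSinceEpoch = int32(18993);                 % hypothetical raw DATE value
d = datetime(1970, 1, 1) + days(double(daysSinceEpoch));
disp(d)                                        % January 1, 2022

% A time32 value with unit=ms is a count of milliseconds.
rawMillis = int32(1500);                       % hypothetical raw Time value
t = milliseconds(double(rawMillis));
disp(seconds(t))                               % 1.5 seconds
```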

Writing Date and Time Data Types from MATLAB to Apache Parquet

| MATLAB Table or Timetable Variable Type | Apache Arrow Data Type | Parquet Logical Type | Parquet Physical Type | Notes |
| --- | --- | --- | --- | --- |
| datetime (TimeZone is set to a nonempty value) | timestamp (unit=microseconds; tz is set to the value of the TimeZone property of the input datetime) | Timestamp (TimeUnit=microseconds; isAdjustedToUTC=true) | INT64 | This type represents an instant in time. MATLAB datetime precision is truncated to 1 microsecond. Display format settings are not saved. Date represented is unzoned. tz is based on the TimeZone property. If the TimeZone property is '', isAdjustedToUTC is set to false. |
| duration | time64 (unit=microseconds) | Time (unit=microseconds; isAdjustedToUTC=true) | INT64 | Precision is truncated to 1 microsecond. Display format settings are not saved. NaN sec values are written as null values in the Parquet file. |
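The microsecond truncation can be observed by writing values with sub-microsecond detail and comparing them after a round trip. This is a minimal sketch with a hypothetical file name and illustrative values; the size of the differences is displayed rather than assumed.

```matlab
% Values with nanosecond-level detail (illustrative).
When    = datetime(2020, 1, 1, 12, 0, 0.123456789, 'TimeZone', 'UTC');
Elapsed = seconds(1.123456789);
T = table(When, Elapsed);

% Hypothetical file location used only for this sketch.
filename = fullfile(tempdir, 'time_demo.parquet');
parquetwrite(filename, T);
T2 = parquetread(filename);

% Differences should stay below one microsecond after the round trip.
disp(abs(seconds(T2.Elapsed - T.Elapsed)))
disp(abs(seconds(T2.When - T.When)))
disp(T2.When.TimeZone)
```

Display formats are not stored in the file, so any custom `Format` setting needs to be reapplied after reading.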

Nested Data

To write nested tables and nested timetables to Parquet files, use parquetwrite. To import nested structured Parquet file data, use parquetread.

Reading Nested Types from Apache Parquet to MATLAB

| Parquet Logical Type | Parquet Physical Type | Apache Arrow Data Type | MATLAB Table or Timetable Variable Type | Notes |
| --- | --- | --- | --- | --- |
| LIST | Any | LIST | cell | MATLAB cells (excluding cellstrs) are converted to Arrow LargeList arrays. Other Parquet readers (such as PyArrow) may import Parquet LIST columns as LargeList arrays based on the original Arrow table schema. |
| LIST with n-tuple organization | Any | STRUCT | nested table | If the child of a LIST is an n-tuple, then the LIST is interpreted as a struct array. For more information on the n-tuple organization, see Parquet Logical Type Lists. |
| MAP | Any | MAP | cell array of tables | Each table has two variables: Key and Value. |

Writing Nested Types from MATLAB to Parquet

| MATLAB Table or Timetable Variable Type | Apache Arrow Data Type | Parquet Logical Type | Parquet Physical Type | Notes |
| --- | --- | --- | --- | --- |
| cell | LargeList | LIST | Any | MATLAB cells (excluding cellstrs) are converted to Arrow LargeList arrays. Other Parquet readers (such as PyArrow) may import Parquet LIST columns as LargeList arrays based on the original Arrow table schema. |
| nested table | Struct | NONE | Any | Arrow writes Struct arrays as Parquet group-annotated columns. If they exist, MATLAB table RowNames are added as a field to the exported Arrow Struct. |
| nested timetable | Struct | NONE | Any | MATLAB timetable RowTimes are added to the exported Arrow Struct array. |
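A sketch of a nested round trip, assuming a MATLAB release in which parquetwrite and parquetread support nested data as described above. The file name, variable names, and values are hypothetical, and column vectors are used inside the cell array as a conservative choice.

```matlab
% A nested table and a cell variable holding numeric vectors (illustrative).
Inner = table([1; 2], ["x"; "y"], 'VariableNames', {'A', 'B'});
Lists = {[1; 2; 3]; [4; 5]};
T = table(Lists, Inner);

% Hypothetical file location used only for this sketch.
filename = fullfile(tempdir, 'nested_demo.parquet');
parquetwrite(filename, T);
T2 = parquetread(filename);

% Inspect how the nested variables were imported.
disp(class(T2.Lists))   % cell, per the LIST mapping
disp(class(T2.Inner))   % table, per the Struct mapping
```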
