Main Content

wordDocumentImportOptions

Import options object for Microsoft Word document files

Since R2021b

    Description

    A WordDocumentImportOptions object enables you to specify how MATLAB® imports structured, tabular data from Microsoft® Word document files. The object contains properties that control the data import process, including the handling of errors and missing data.

    Creation

    You can create a WordDocumentImportOptions object using either the wordDocumentImportOptions function (described here) or the detectImportOptions function:

    • Use wordDocumentImportOptions to define the import properties based on your import requirements.

    • Use detectImportOptions to detect and populate the import properties based on the contents of the Microsoft Word document file specified in filename.

      opts = detectImportOptions(filename)

    Description

    opts = wordDocumentImportOptions creates a WordDocumentImportOptions object with one variable.

    opts = wordDocumentImportOptions('NumVariables',numVars) creates the object with the number of variables specified in numVars.

    opts = wordDocumentImportOptions(___,Name,Value) specifies additional properties for a WordDocumentImportOptions object using one or more name-value arguments.

    example

    Input Arguments

    expand all

    Number of variables, specified as a positive scalar integer.

    Properties

    expand all

    Variable Properties

    Variable names, specified as a cell array of character vectors or string array. The VariableNames property contains the names to use when importing variables.

    If the data contains N variables, but no variable names are specified, then the VariableNames property contains {'Var1','Var2',...,'VarN'}.

    To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of VariableNamingRule to 'preserve'.

    Example: opts.VariableNames returns the current (detected) variable names.

    Example: opts.VariableNames(3) = {'Height'} changes the name of the third variable to Height.

    Data Types: char | string | cell

    Flag to preserve variable names, specified as either "modify" or "preserve".

    • "modify" — Convert invalid variable names (as determined by the isvarname function) to valid MATLAB identifiers.

    • "preserve" — Preserve variable names that are not valid MATLAB identifiers such as variable names that include spaces and non-ASCII characters.

    Starting in R2019b, variable names and row names can include any characters, including spaces and non-ASCII characters. Also, they can start with any characters, not just letters. Variable and row names do not have to be valid MATLAB identifiers (as determined by the isvarname function). To preserve these variable names and row names, set the value of VariableNamingRule to "preserve". Variable names are not refreshed when the value of VariableNamingRule is changed from "modify" to "preserve".

    Data type of variable, specified as a cell array of character vectors, or string array containing a set of valid data type names. The VariableTypes property designates the data types to use when importing variables.

    To update the VariableTypes property, use the setvartype function.

    Example: opts.VariableTypes returns the current variable data types.

    Example: opts = setvartype(opts,'Height',{'double'}) changes the data type of the variable Height to double.

    Subset of variables to import, specified as a character vector, string scalar, cell array of character vectors, string array or an array of numeric indices.

    SelectedVariableNames must be a subset of names contained in the VariableNames property. By default, SelectedVariableNames contains all the variable names from the VariableNames property, which means that all variables are imported.

    Use the SelectedVariableNames property to import only the variables of interest. Specify a subset of variables using the SelectedVariableNames property and use readtable to import only that subset.

    To support invalid MATLAB identifiers as variable names, such as variable names containing spaces and non-ASCII characters, set the value of VariableNamingRule to 'preserve'.

    Example: opts.SelectedVariableNames = {'Height','LastName'} selects only two variables, Height and LastName, for the import operation.

    Example: opts.SelectedVariableNames = [1 5] selects only two variables, the first variable and the fifth variable, for the import operation.

    Example: T = readtable(filename,opts) returns a table containing only the variables specified in the SelectedVariableNames property of the opts object.

    Data Types: uint16 | uint32 | uint64 | char | string | cell

    Type specific variable import options, returned as an array of variable import options objects. The array contains an object corresponding to each variable specified in the VariableNames property. Each object in the array contains properties that support the importing of data with a specific data type.

    Variable options support these data types: numeric, text, logical, datetime, or categorical.

    To query the current (or detected) options for a variable, use the getvaropts function.

    To set and customize options for a variable, use the setvaropts function.

    Example: opts.VariableOptions returns a collection of VariableImportOptions objects, one corresponding to each variable in the data.

    Example: getvaropts(opts,'Height') returns the VariableImportOptions object for the Height variable.

    Example: opts = setvaropts(opts,'Height','FillValue',0) sets the FillValue property for the variable Height to 0.

    Location Properties

    Table data XPath expression, specified as a character vector or string scalar that the reading function uses to select the output table data. You must specify TableSelector as a valid XPath version 1.0 expression.

    This table shows some example XPath expressions for selecting tables in Microsoft Word document files.

    DescriptionTableSelector
    Table containing the text "Cash dividends""//w:tbl[contains(.,'Cash dividends')]"
    Second table with more than 10 rows"//w:tbl[count(w:tr)>10][2]"
    Table with a header cell exactly matching "Description""//w:tbl[w:tr[1]/w:tc='Description']"

    Example: 'TableSelector',"//w:tbl[contains(.,'Cash dividends')]"

    Data location, specified as a positive scalar integer or a N-by-2 array of positive scalar integers. Specify DataRows using one of these forms.

    Specify as

    Description

    n

    Specify the first row that contains the data. Specifying the value using n sets the value of DataRows property to [n inf]. The importing function reads all rows between n and the end-of-file.

    n must be a positive integer greater than zero.

    [n1 n2]

    Specify the row range that contains the data. n1 is the first row that contains the data and the n2 is the last row that contains the data.

    Values in the array [n1 n2] must be nonzero positive integers and n2 must be greater than n1.

    [n1 n2; n3 n4;...]

    Specify multiple row ranges to read with an N-by-2 array containing N different row ranges.

    A valid array of multiple row ranges must:

    • Specify row ranges in an increasing order, that is the first row range specified in the array appears in the file before the other row ranges.

    • Contain only nonoverlapping row ranges.

    When specifying multiple row ranges, use Inf only when specifying the end of the last row range in the array. For example, [1 3; 5 6; 8 Inf].

    Example: opts.DataRows = 5 sets the DataRows property to the value [5 inf]. Read all rows of data starting from row 5 to the end-of-file.

    Example: opts.DataRows = [2 6] sets the property to read rows 2 through 6.

    Example: opts.DataRows = [1 3; 5 6; 8 inf] sets the property to read rows 1, 2, 3, 5, 6, and all rows between 8, and the end-of-file.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    Row names location, specified as a positive scalar integer. The RowNamesColumn property specifies the location of the column containing the row names.

    If RowNamesColumn is specified as 0, then do not import the row names. Otherwise, import the row names from the specified column.

    Example: opts.RowNamesColumn = 2;

    Data Types: single | double | uint8 | uint16 | uint32 | uint64

    Row containing variable names, specified as a nonnegative integer. The VariableNamesRow property specifies the row number where variable names are located.

    If VariableNamesRow is 0, then do not import the variable names. Otherwise, import the variable names from the specified row.

    Example: opts.VariableNamesRow = 6;

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    Row containing variable units, specified as a nonnegative integer.

    If VariableUnitsRow is 0, then the software does not import the variable units. Otherwise, the software imports the variable units from the specified row.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    Row containing variable descriptions, specified as a nonnegative integer.

    If VariableDescriptionsRow is 0, then the software does not import the variable descriptions. Otherwise, the software imports the variable descriptions from the specified row.

    Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

    Replacement Rules

    Procedure to manage missing data, specified as one of the values in this table.

    Missing RuleBehavior
    'fill'

    Replace missing data with the contents of the FillValue property.

    The FillValue property is specified in the VariableImportOptions object of the variable being imported. For more information on accessing the FillValue property, see setvaropts.

    'error'Stop importing and display an error message showing the missing record and field.
    'omitrow'Omit rows that contain missing data.
    'omitvar'Omit variables that contain missing data.

    Example: opts.MissingRule = 'omitrow';

    Data Types: char | string

    Procedure to handle empty rows in the data, specified as "skip", "read", or "error". The importing function interprets white space as empty.

    Empty Row RuleBehavior
    "skip"Skip the empty rows.
    "read"Import the empty rows. The importing function parses the empty row using the values specified in VariableOptions, MissingRule, and other relevant properties.
    "error"Display an error message and abort the import operation.

    Procedure to handle import errors, specified as one of the values in this table.

    Import Error RuleBehavior
    'fill'

    Replace the data where the error occurred with the contents of the FillValue property.

    The FillValue property is specified in the VariableImportOptions object of the variable being imported. For more information on accessing the FillValue property, see setvaropts.

    'error'Stop importing and display an error message showing the error-causing record and field.
    'omitrow'Omit rows where errors occur.
    'omitvar'Omit variables where errors occur.

    Example: opts.ImportErrorRule = 'omitvar';

    Data Types: char | string

    Procedure to handle extra columns in the data, specified as one of the values in this table.

    Extra Columns RuleBehavior
    'addvars'

    To import extra columns, create new variables. If there are N extra columns, then import new variables as 'ExtraVar1', 'ExtraVar2',..., 'ExtraVarN'. Extra columns of data are imported as if their VariableTypes are char.

    'ignore'Ignore the extra columns of data.
    'wrap'Wrap the extra columns of data to new records. This action does not change the number of variables.
    'error'Display an error message and abort the import operation.

    Data Types: char | string

    Procedure to handle cells with merged columns, specified as one of the values in this table.

    Import RuleBehavior
    "placeleft"

    Place the data in the left-most cell and fill the remaining cells with the contents of the FillValue property.

    The FillValue property is specified in the VariableImportOptions object of the variable being imported. For more information on accessing the FillValue property, see getvaropts.

    "placeright"

    Place the data in the right-most cell and fill the remaining cells with the contents of the FillValue property.

    The FillValue property is specified in the VariableImportOptions object of the variable being imported. For more information on accessing the FillValue property, see getvaropts.

    "duplicate"

    Duplicate the data in all cells.

    "omitrow"Omit rows where merged cells occur.
    "error"Display an error message and abort the import operation.

    Example: "MergedCellColumnRule","placeright"

    Procedure to handle cells with merged rows, specified as one of the values in this table.

    Import RuleBehavior
    "placetop"

    Place the data in the top cell and fill the remaining cells with the contents of the FillValue property.

    The FillValue property is specified in the VariableImportOptions object of the variable being imported. For more information on accessing the FillValue property, see getvaropts.

    "placebottom"

    Place the data in the bottom cell and fill the remaining cells with the contents of the FillValue property.

    The FillValue property is specified in the VariableImportOptions object of the variable being imported. For more information on accessing the FillValue property, see getvaropts.

    "duplicate"

    Duplicate the data in all cells.

    "omitvar"Omit variables where merged rows occur.
    "error"Display an error message and abort the import operation.

    Example: "MergedCellRowRule","duplicate"

    Object Functions

    Examples

    collapse all

    Create import options for a Microsoft Word document file, specify the table to import, and then read the data.

    The file MaintenanceReport.docx contains two tables. The last row of the second table contains a cell with merged columns that do not match the table variables.

    Create a WordDocumentImportOptions object:

    • Read five variables from the table.

    • Read from the first table containing the word "Description" using the XPath query "//w:tbl[contains(.,'Description')]".

    • Read variable names from the first row.

    • Skip reading rows that have cells with merged columns.

    opts = wordDocumentImportOptions( ...
        'NumVariables',5, ...
        'TableSelector',"//w:tbl[contains(.,'Description')]", ...
        'VariableNamesRow',1, ...
        'MergedCellColumnRule','omitrow')
    opts = 
      WordDocumentImportOptions with properties:
    
       Replacement Properties:
                    MissingRule: "fill"
                ImportErrorRule: "fill"
                   EmptyRowRule: "skip"
           MergedCellColumnRule: "omitrow"
              MergedCellRowRule: "placetop"
               ExtraColumnsRule: "addvars"
    
       Variable Import Properties: Set types by name using setvartype
                  VariableNames: ["Var1"    "Var2"    "Var3"    "Var4"    "Var5"]
                  VariableTypes: ["char"    "char"    "char"    "char"    "char"]
          SelectedVariableNames: ["Var1"    "Var2"    "Var3"    "Var4"    "Var5"]
                VariableOptions: [1-by-5 matlab.io.VariableImportOptions] 
    	Access VariableOptions sub-properties using setvaropts/getvaropts
             VariableNamingRule: "preserve"
    
       Location Properties:
                  TableSelector: "//w:tbl[contains(.,'Description')]"
                       DataRows: [1 Inf]
               VariableNamesRow: 1
               VariableUnitsRow: 0
        VariableDescriptionsRow: 0
                 RowNamesColumn: 0
    
    

    Read the table from the Microsoft Word document file using the readtable function with the options object.

    filename = "MaintenanceReport.docx";
    T = readtable(filename,opts)
    T=4×5 table
                                      Description                                         Category             Urgency           Resolution            Cost   
        _______________________________________________________________________    ______________________    ___________    ____________________    __________
    
        {'Description'                                                        }    {'Category'          }    {'Urgency'}    {'Resolution'      }    {'Cost'  }
        {'Items are occasionally getting stuck in the scanner spools.'        }    {'Mechanical Failure'}    {'Medium' }    {'Readjust Machine'}    {'$45'   }
        {'Loud rattling and banging sounds are coming from assembler pistons.'}    {'Mechanical Failure'}    {'Medium' }    {'Readjust Machine'}    {'$35'   }
        {'There are cuts to the power when starting the plant.'               }    {'Electronic Failure'}    {'High'   }    {'Full Replacement'}    {'$16200'}
    
    

    Version History

    Introduced in R2021b