readtable() ignores variable names and loses data when "extra" cells contain values

22 views (last 30 days)
I have an Excel file with data in a rectangular range A1:D601 on a sheet named Fred with nothing else on the sheet. The first row is a header with the column names. If I use
T = readtable('MyData.xlsx', 'Sheet','Fred');
all is well and my table's variable names come in as expected and there are 4 columns by 600 rows of data.
If I add anything in cells to the right of my data, readtable() gets confused. The table variables come in with default names (Var1, Var2, ...) and there are more than 4 columns of data and fewer than 600 rows. I tried adding
'ExpectedNumVariables', 4
and/or
'ReadVariableNames', true
but these didn't fix it. I also tried a column range
'Range', 'A:D'
as suggested by Star Strider. My table ended up with the right variable names but the wrong number of rows as my "extra" data extended past the end of my "real" data. The extra rows were all NaNs, which I could probably filter, but that feels wrong. And I don't want to give a specifc range because another sheet has a different number of rows.
How can I tell readtable() to ignore any extraneous cells to the right of and/or below the main data that starts in A1?
The attached Excel file has 3 sheets which demonstrate the issues I've oulined above. In summary, the range option with a column range will igore excess data to the right of the main array provided it doesn't extend below the end of the data.
T = readtable('MyData.xlsx', 'Sheet', 'Fred') % OK
T = readtable('MyData.xlsx', 'Sheet', 'Wide') % extra cols, missing rows
T = readtable('MyData.xlsx', 'Sheet', 'Wide', 'Range', 'A:D') % OK
T = readtable('MyData.xlsx', 'Sheet', 'Long') % extra rows and cols
T = readtable('MyData.xlsx', 'Sheet', 'Long', 'Range', 'A:D') % extra rows

Accepted Answer

Steven Martin
Steven Martin on 1 Mar 2024
I was able to get the data I wanted by adding
'MissingRule', 'omitrow'
along with the column constraint suggested by @Star Strider. The final command that worked for me is
T = readtable('MyData.xlsx', 'Sheet', 'Barney', 'Range', 'A:D', 'MissingRule', 'omitrow');

More Answers (1)

Star Strider
Star Strider on 1 Mar 2024
How can I tell readtable() to ignore any extraneous cells to the right of the main data that starts in A1?
Without a file to experiment with, I can only offer a suggestion.
In the Spreadsheet Files documentation section, the 'Range' name-value pair offers a column range option, so for four variables (columns) that might be 'Range','A:D' and no need to specify a specific row or block range (columns and rows). That might be worth experimenting with.
.
  4 Comments
Stephen23
Stephen23 on 1 Mar 2024
Edited: Stephen23 on 1 Mar 2024
"The readtable function should just stop where the data ends"
Most likely it already does. Excel stores a kind of bounding box around all used cells in each worksheet's UsedRange property, which for the three worksheets in the provided MyData.xlsx are:
  1. A1:D11
  2. A1:G11
  3. A1:G15
In case anyone is interested, the UsedRange property is stored in the XML worksheets' "dimension" value, so you can easily check these values yourself in the uploaded workbook, or simply by using VBA.
My guess is that READTABLE uses the UsedRange property to tell it the range of the used cells. Anything else would be computationally much more expensive (requiring checking the entire XML for all used cell locations and creating its own bounding box), rather than just reading one property already correctly provided by Excel. So most likely READTABLE imports the intersection of the UsedRange and the range specified by the user when calling READTABLE.
This is also IMHO a better approach (not just computationally) because it does not silently remove data without warning: empty cells may be significant. Consider the following worksheet:
X Y Z
A 12 13
B 22 23
C
D 42 43
E
F
Are the empty cells significant or not? It seems that they are > Just because some test cases have no values does not mean that their existence is not meta-data. So if we tell READTABLE to import columns B:C then (your proposal) getting this as an output would completely distort the data without any warning:
Y Z
12 13
22 23
42 43
What lines correspond to what from the spreadsheet? We have no way to know.
Whereas the undistorted, complete, expected, correct output would actually be:
Y Z
12 13
22 23
NaN NaN
42 43
NaN NaN
NaN NaN
Only the user can know if missing cells are significant or not. I for one would not want my rows to magically disappear without any warning, thus totally distorting my data (and doing it less efficiently to boot). From this point of view, removing missing data is easily achieved after importing, if required.
"...rather than filling with NaN values."
Numeric arrays cannot have holes in them.

Sign in to comment.

Products


Release

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!