readtable() ignores variable names and loses data when "extra" cells contain values

Question

Steven Martin on 1 Mar 2024

0
Link

Direct link to this question

https://in.mathworks.com/matlabcentral/answers/2088871-readtable-ignores-variable-names-and-loses-data-when-extra-cells-contain-values

Edited: Star Strider on 7 Mar 2024

MyData.xlsx

I have an Excel file with data in a rectangular range A1:D601 on a sheet named Fred with nothing else on the sheet. The first row is a header with the column names. If I use

T = readtable('MyData.xlsx', 'Sheet','Fred');

all is well and my table's variable names come in as expected and there are 4 columns by 600 rows of data.

If I add anything in cells to the right of my data, readtable() gets confused. The table variables come in with default names (Var1, Var2, ...) and there are more than 4 columns of data and fewer than 600 rows. I tried adding

'ExpectedNumVariables', 4

and/or

'ReadVariableNames', true

but these didn't fix it. I also tried a column range

'Range', 'A:D'

as suggested by Star Strider. My table ended up with the right variable names but the wrong number of rows as my "extra" data extended past the end of my "real" data. The extra rows were all NaNs, which I could probably filter, but that feels wrong. And I don't want to give a specifc range because another sheet has a different number of rows.

How can I tell readtable() to ignore any extraneous cells to the right of and/or below the main data that starts in A1?

The attached Excel file has 3 sheets which demonstrate the issues I've oulined above. In summary, the range option with a column range will igore excess data to the right of the main array provided it doesn't extend below the end of the data.

T = readtable('MyData.xlsx', 'Sheet', 'Fred')	                % OK
T = readtable('MyData.xlsx', 'Sheet', 'Wide')	                % extra cols, missing rows
T = readtable('MyData.xlsx', 'Sheet', 'Wide', 'Range', 'A:D')	% OK
T = readtable('MyData.xlsx', 'Sheet', 'Long')	                % extra rows and cols
T = readtable('MyData.xlsx', 'Sheet', 'Long', 'Range', 'A:D')	% extra rows

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Steven Martin on 1 Mar 2024

0
Link

Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/2088871-readtable-ignores-variable-names-and-loses-data-when-extra-cells-contain-values#answer_1419606

Open in MATLAB Online

I was able to get the data I wanted by adding

'MissingRule', 'omitrow'

along with the column constraint suggested by @Star Strider. The final command that worked for me is

T = readtable('MyData.xlsx', 'Sheet', 'Barney', 'Range', 'A:D', 'MissingRule', 'omitrow');

1 Comment
Show -1 older commentsHide -1 older comments

Star Strider on 1 Mar 2024

Edited: Star Strider on 7 Mar 2024

Noted.

A vote would be appreciated.

Sign in to comment.

Answer 2

Star Strider on 1 Mar 2024

1
Link

Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/2088871-readtable-ignores-variable-names-and-loses-data-when-extra-cells-contain-values#answer_1419186

‘How can I tell readtable() to ignore any extraneous cells to the right of the main data that starts in A1?’

Without a file to experiment with, I can only offer a suggestion.

In the Spreadsheet Files documentation section, the 'Range' name-value pair offers a column range option, so for four variables (columns) that might be 'Range','A:D' and no need to specify a specific row or block range (columns and rows). That might be worth experimenting with.

.

4 Comments
Show 2 older commentsHide 2 older comments

Stephen23 on 1 Mar 2024

Edited: Stephen23 on 1 Mar 2024

Open in MATLAB Online

MyData.xlsx

"The readtable function should just stop where the data ends"

Most likely it already does. Excel stores a kind of bounding box around all used cells in each worksheet's UsedRange property, which for the three worksheets in the provided MyData.xlsx are:

A1:D11
A1:G11
A1:G15

In case anyone is interested, the UsedRange property is stored in the XML worksheets' "dimension" value, so you can easily check these values yourself in the uploaded workbook, or simply by using VBA.

My guess is that READTABLE uses the UsedRange property to tell it the range of the used cells. Anything else would be computationally much more expensive (requiring checking the entire XML for all used cell locations and creating its own bounding box), rather than just reading one property already correctly provided by Excel. So most likely READTABLE imports the intersection of the UsedRange and the range specified by the user when calling READTABLE.

This is also IMHO a better approach (not just computationally) because it does not silently remove data without warning: empty cells may be significant. Consider the following worksheet:

Are the empty cells significant or not? It seems that they are > Just because some test cases have no values does not mean that their existence is not meta-data. So if we tell READTABLE to import columns B:C then (your proposal) getting this as an output would completely distort the data without any warning:

What lines correspond to what from the spreadsheet? We have no way to know.

Whereas the undistorted, complete, expected, correct output would actually be:

Y   Z
12  13
22  23
NaN NaN
42  43
NaN NaN
NaN NaN

Only the user can know if missing cells are significant or not. I for one would not want my rows to magically disappear without any warning, thus totally distorting my data (and doing it less efficiently to boot). From this point of view, removing missing data is easily achieved after importing, if required.

"...rather than filling with NaN values."

Numeric arrays cannot have holes in them.

Star Strider on 1 Mar 2024

@Stephen23 — I genuinely appreciate your thorough analysis.

Sign in to comment.

readtable() ignores variable names and loses data when "extra" cells contain values

0 Comments
Show -2 older commentsHide -2 older comments

Accepted Answer

1 Comment
Show -1 older commentsHide -1 older comments

More Answers (1)

4 Comments
Show 2 older commentsHide 2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

readtable() ignores variable names and loses data when "extra" cells contain values

0 Comments Show -2 older commentsHide -2 older comments

Accepted Answer

1 Comment Show -1 older commentsHide -1 older comments

More Answers (1)

4 Comments Show 2 older commentsHide 2 older comments

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

1 Comment
Show -1 older commentsHide -1 older comments

4 Comments
Show 2 older commentsHide 2 older comments