Error with new version of readtable (R2020a)

I am trying to import a .csv file into MATLAB R2020a using the readtable function. The input file can be found here:
https://depmap.org/portal/download/ > ALL DOWNLOADS > DepMap Public 20Q2 CCLE_expression_v2.csv 06/20 368.66 MB
  • When I use readtable on this file, I get an error message in the console that is not very explicit. I would love your insights into what could be wrong:
T = readtable('CCLE_expression.csv');
Failed to convert character code.
To note:
T = readtable(fn,'FileType','text','Delimiter',',','TextType','string','ReadVariableNames',1);
returns the same error.
  • The error is very difficult for me to track down because:
1) the error does not specify any function name, line of code, or error code I can refer to
2) although an error is returned in the console, execution does not pause anywhere when tracking errors (Run>pause when error)
3) if the readtable call is inside another function F, the error appears when I run F but not in debug mode within F: I put a breakpoint before the readtable line, then manually run the readtable line while in debug mode. In that case I get no error (not a behavior I am used to seeing), but it is as if readtable never ran: the variable is not present in my workspace.
  • I have tested readtable on previous MATLAB versions, as well as on R2020a with the 'auto' flag that restores the old behavior of readtable, and have no problems at all running it:
T = readtable(fn,'FileType','text','Delimiter',',','TextType','string','ReadVariableNames',1,'Format','auto');
  • I suspect a character encoding problem that causes the 'automatic import' of R2020a to fail to recognize the variable types (from the MATLAB R2020a documentation: 'Starting in R2020a, the readtable function read an input file as though it automatically called the detectImportOptions function on the file. It can detect data types, discard extra header lines, and fill in missing values.'). However, I do not know how to test that, because I cannot see which piece of the readtable function is failing (it calls an internally coded function). Line 195 in readtable is the one leading to the error:
t = func.validateAndExecute(filename,varargin{:});
Have any of you encountered this issue in the past? What was the cause? I am happy to change the input parameters of readtable in R2020a to use the 'auto' flag, but I would love to understand what the problem is here and whether I should be worried about the new behavior of readtable in R2020a. Also, out of curiosity, why is 'Format','auto' the argument pair needed to restore the previous behavior, rather than a term such as 'legacy'?
Thank you so much for your help!
Best
Sandrine
  2 Comments
jonas on 8 Jul 2020
I can confirm that I got the same error in R2018a when using detectImportOptions prior to readtable.
Sandrine on 8 Jul 2020
Edited: Sandrine on 8 Jul 2020
Thanks Jonas! I had indeed not tested that! I just added the tag for detectImportOptions to reflect that this is more likely a problem with detectImportOptions than with the readtable function itself.


Accepted Answer

Walter Roberson on 15 Jul 2020
This is an internal size limit that applies even when you explicitly specify the encoding.
The size limit is exactly 12976128 bytes, which is 0xC60000. If you have even 1 byte more, you will get the "Failed to convert character code" message.
It is not at all obvious to me why that particular limit would be true.
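As a quick sanity check on the figure quoted above, the decimal and hexadecimal forms can be compared directly. This is only an arithmetic check of the two numbers in the answer; it says nothing about why the limit exists.

```matlab
% Check that the decimal byte count matches the hexadecimal value quoted above.
limitBytes = 12976128;
assert(limitBytes == hex2dec('C60000'));

% Express the same limit in MiB (2^20 bytes) for readability.
limitMiB = limitBytes / 2^20;
fprintf('0x%X bytes = %.3f MiB\n', limitBytes, limitMiB);
```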
But!! The limit also depends heavily on the number of input columns!
31 rows of 58029 columns -> ok, 32 rows fail
63 rows of 58028 columns -> ok, 64 rows fail
64 rows of 58026 columns -> ok, 64 rows of 58027 columns fail
95 rows of 58026 columns -> ok (99223917 bytes), 96 rows fail (100268245 bytes)
The input file in question has 58677 columns, and by that time the limit is down to about 10 lines.
  3 Comments
Walter Roberson on 15 Jul 2020
I do not have access to the internal code; it is failing inside a built-in function.
I just did a whole bunch of trial and error to find those limits.
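The trial and error described here could be sketched roughly as follows: write a synthetic CSV with a given number of rows and columns, call readtable on it, and record whether it succeeds. The file contents, the variable names (v1, v2, ...), and the cell width are assumptions made for illustration, so the thresholds this probe finds need not match the figures listed in the answer.

```matlab
% Sketch: probe which synthetic CSV sizes readtable accepts.
nCols     = 58029;               % column count taken from the answer above
rowCounts = [31 32];             % row counts to try
fn        = [tempname '.csv'];   % scratch file (hypothetical location)

% Header of the form v1,v2,...,vN (hypothetical variable names).
header = sprintf('v%d,', 1:nCols);
header(end) = [];                % drop the trailing comma

rowFmt = [repmat('%.15g,', 1, nCols - 1) '%.15g\n'];
readOk = false(size(rowCounts));

for k = 1:numel(rowCounts)
    nRows = rowCounts(k);

    fid = fopen(fn, 'w');
    fprintf(fid, '%s\n', header);
    % fprintf consumes values column by column, so each column of this
    % nCols-by-nRows matrix becomes one CSV row.
    fprintf(fid, rowFmt, rand(nCols, nRows));
    fclose(fid);

    info = dir(fn);
    try
        T = readtable(fn); %#ok<NASGU>
        readOk(k) = true;
        fprintf('%d rows x %d cols (%d bytes): ok\n', nRows, nCols, info.bytes);
    catch err
        fprintf('%d rows x %d cols (%d bytes): %s\n', ...
            nRows, nCols, info.bytes, err.message);
    end
end
delete(fn);
```

Wrapping the readtable call in try/catch keeps the loop going after a failure, so a whole grid of row and column counts can be scanned in one run.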
Walter Roberson on 15 Jul 2020
Using textscan() works. You can construct the format as ['%s', repmat('%f',1,58676)]. You will need to handle the header line separately, perhaps with an fgetl() whose result you split with regexp(InputLine, ',', 'split'). When you call textscan, specify 'delimiter',','
data = textscan(fid, fmt, 'collectoutput',true,'headerlines',1,'delimiter',',');
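Put together, the textscan workaround might look like the sketch below. It assumes the layout implied above: one text column followed by numeric columns, with a single header line. A tiny stand-in file with made-up names (id, g1, r1, ...) is written first so the sketch runs on its own; in practice fn would point at the real CSV.

```matlab
% Stand-in file so the sketch is runnable; replace fn with the real CSV path.
fn = [tempname '.csv'];
fid = fopen(fn, 'w');
fprintf(fid, 'id,g1,g2,g3\nr1,1,2,3\nr2,4,5,6\n');
fclose(fid);

fid = fopen(fn, 'r');
assert(fid ~= -1, 'Could not open %s', fn);

% Read the header line ourselves and split it into column names.
headerLine = fgetl(fid);
varNames   = regexp(headerLine, ',', 'split');

% One text column, then numeric columns; the count comes from the header.
fmt = ['%s', repmat('%f', 1, numel(varNames) - 1)];

% The header has already been consumed by fgetl, so no 'HeaderLines' here.
data = textscan(fid, fmt, 'Delimiter', ',', 'CollectOutput', true);
fclose(fid);
delete(fn);

rowLabels = data{1};   % N-by-1 cell array of row labels
values    = data{2};   % N-by-(number of numeric columns) double matrix
```

Deriving the column count from the header avoids hard-coding 58676. Note the difference from the one-liner above: 'headerlines',1 is appropriate when textscan starts at the top of the file, whereas after an fgetl() it would skip the first data row.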


More Answers (1)

Aditya Patil on 15 Jul 2020
The issue is with the file encoding. However, I cannot tell which encoding is appropriate, as this should be stated by the file's creator.
You can either set the appropriate encoding with the Encoding option, or use the earlier version's settings by setting the Format option to auto. You can find more details regarding the compatibility differences in the readtable docs.
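The two options mentioned here correspond to the 'Encoding' and 'Format' name-value arguments of readtable. The sketch below shows both calls on a small stand-in file; 'UTF-8' is only an assumed encoding, and the file contents are made up. Elsewhere in this thread the asker reports that changing the encoding arguments did not fix the error for the file in question, so treat this as an illustration of the syntax rather than a confirmed fix.

```matlab
% Small stand-in CSV (hypothetical contents), written as UTF-8.
fn = [tempname '.csv'];
fid = fopen(fn, 'w', 'n', 'UTF-8');
fprintf(fid, 'id,a,b\nr1,1,2\nr2,3,4\n');
fclose(fid);

% Option 1: state the encoding explicitly (assumed here to be UTF-8).
T1 = readtable(fn, 'Encoding', 'UTF-8');

% Option 2: request the pre-R2020a import behavior.
T2 = readtable(fn, 'Format', 'auto');

delete(fn);
```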
  2 Comments
Sandrine on 15 Jul 2020
Edited: Sandrine on 15 Jul 2020
Thank you! Although I did not mention it here, I did try changing all the encoding-related arguments of readtable, which did not fix the problem.
Sandrine on 15 Jul 2020
Do you by any chance have access to the internal MATLAB code? If so, could you please tell us the reason for the size limit? Would you have an undocumented solution, for example a variable I could set with setvaropts, that would raise the size limit described by Walter?
The new version of readtable is interesting because it addresses a number of issues that previous versions had, such as the automatic replacement of symbols in variable names. However, it is frustrating that it is limited by the size or number of columns of an input file. Datasets with this many columns are not rare, and I suspect this issue will become more common as the number of R2020a users grows.
Thank you for the help!


Release

R2020a
