I'm dealing with very large CSV files. Reading them with readtable is fast enough, but I have found (and reported) a bug in readtable: a blank value in the first column (i.e. the line starts with the delimiter, e.g. ',') throws off all the data on that line. A lot of my files have blank values in the first column, due to the way the equipment I'm using records the data.
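For illustration (the numbers here are made up), a problematic line starts with the delimiter itself:

```
,2.5,3.1,4.0
1.2,2.5,3.1,4.0
```

The first line has a blank first value, and readtable mis-assigns everything after it.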
So I have to preprocess the files and fix these blank first-column values before handing the file to readtable. The most efficient method I've found is the following:
ch = fread(YGID, [1,chunksize], 'int8=>char');  % read a chunk as a char row vector
fprintf('Getting number of lines...\n');
nol = sum(ch == sprintf('\n'));                 % count newline characters
fprintf('Replacing final commas...\n');
cch = regexprep(ch,',(\r|\n)+','$1');           % drop trailing commas before a line break
fprintf('Getting line locations...\n');
hlocs = regexp(cch,'\n');                       % indices of the newlines
fprintf('Writing header file...\n');            % (header-file write omitted here)
fprintf('Replacing initial commas...\n');
ccch = regexprep(cch,'(\r|\n)+,','$1 ,');       % insert a space before a leading comma
YGID is the file identifier from an fopen call. Note that I'm purposely creating a new variable at each step (not memory efficient), since I have 16 GB of RAM on my machine and I find that building a completely new variable is faster. However, once the file is of a sufficient size (>20 MB; I have some over 200 MB), even this becomes very slow. The line it gets stuck on is "ccch = regexprep(cch,'(\r|\n)+,','$1 ,');". I suspect that with each additional space being inserted (there are hundreds of thousands of them) it reallocates memory for the variable. I tried to preallocate the new variable with "ccch = blanks(chunksize + nol);" beforehand, and it didn't seem to make a difference.
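For reference, my preallocation attempt looked roughly like this (a sketch, with `chunksize` and `nol` as above), which may be why it had no effect:

```matlab
ccch = blanks(chunksize + nol);            % preallocate worst-case output size
ccch = regexprep(cch,'(\r|\n)+,','$1 ,');  % but regexprep returns a brand-new array,
                                           % so the assignment just discards the
                                           % preallocated one
```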
Is there any more efficient way to do this task?