MATLAB Answers

Readtable for a huge (no good schema) CSV file!

33 views (last 30 days)
Savan Rangegowda
Savan Rangegowda on 11 Sep 2019
Commented: dpb on 17 Sep 2019
Hi All,
I have a tricky task of reading this huge csv file (1360x107037) . The example data is attached with this question and it just contains only few rows and columns. I tried importing it through the matlab function readtable.
What I did -
  1. I tried to get the opts throught detectImportOptions()
  2. As the delimiter is 'semi-colon', I assigned it to opts.Delimeter
  3. Then I used readtable(filename, opts)
  4. I got a table with 1360 rows and 107037 columns.
Problem -
  1. I am getting in every even row for cells that supposed to be a string or datestring NaN values but all numerical values are retained
  2. I tried to make the 7th line of the csv data as VariableNames but somehow I am getting Var1, Var2, ...... etc are the VariableNames
Does anyone know how could I get rid of these NaN values and obtain the actual string if exists or just an empty string and make the 7th line of the CSV file as the VariableNames.
Update 1
I guess it is expecting a number at that point but instead it is receiving a String!
Update 2
I did this -
opts.VariableTypes(1, 1:end) = {'char'};
Now I am getting all the values but now all the numbers are casted to 'char'!
I dropped all th eampty columns;
table (:, all(ismissing(table))) =[];
I thought of saving the Table as .mat for later use but (my stupidity) it contains now all 'char' values and the file is really huge!
Any help is appreciated!
Thanks & Cheers,
Savan

  10 Comments

Show 7 older comments
Savan Rangegowda
Savan Rangegowda on 16 Sep 2019
@dpb, I agree and I wil try to explain the csv file more concretely. The thing is that it contains 18 headerlines -
;;Single;Single;Single;Single;.....
;;from_timestamp;from_timestamp;from_timestamp;from_timestamp,.......
;;to_timestamp;to_timestamp;to_timestamp;to_timestamp;.......
;;version_timestamp;version_timestamp;version_timestamp;version_timestamp;......
;;unique_str_identifier;unique_str_identifier;unique_str_identifier;unique_str_identifier......
;;1;1;1;91;3;102;1;2;79;......... % I am not sure of these intergers
;;name_machine1;name_machine1;name_machine1;............ ;name_machine2;name_machine2;.......
;;;;;;;;;;;;;;;;;; % %empty row
;;max_capacity;max_capacity;.. ;min_capacity; min_capacity;.. ;minute_capacity;minute_capacity;.. % It contains several variable names for each machine states in the 7th line
;;manufacturer;manufacturer;manufacturer;.........
;;delivery_company; delivery_company; delivery_company;delivery_company;..........
;;max_voltage;max_voltage;max_voltage;max_voltage;.......
;;fuel_type;fuel_type;fuel_type;fuel_type;fuel_type;...........
;;machine_max_volt;machine_max_volt;machine_max_volt;machine_max_volt;...........
;;;;;;;;;;;;;;;;; %empty row
;;WITHOUT;WITHOUT;WITHOUT;WITHOUT;............
;;;;;;;;;;;;;;;;; %empty row
;;value;value;value;value;value;......... % just contains the string 'value' in this complete row and it is not required
Important rows for me here are row1, row2, row3, row4, row7, row8 and all other rows are not required.
Next it contains values in the rest of the remaining rows....
from_timestamp;to_timestamp;50;;;;;;;;;56...........
from_timestamp;to_timestamp;50;;;;;;;;;;56............
...............
As it contains a lot of columns(107037), I don't how to specify a format so huge so that I could use textscan!!
Savan Rangegowda
Savan Rangegowda on 16 Sep 2019
@Jeremy,
opts = detectImportOptions(FileName, 'Delimiter',';', 'NumHeaderLines', 18)
is apparently throwing an error -
Matrix index is out of range for deletion
per isakson
per isakson on 16 Sep 2019
"As it contains a lot of columns(107037), I don't how to specify a format so huge so that I could use textscan!!" One cannot take for granted that textscan or readtable for that matter can handle that huge number of columns. If all the columns hold numerical data:
format = [ '%{fmt}D%{fmt}D', repmat( '%f', 1, 107037-2 ) ];
where fmt shall be replaced by the correct date-format. See help on textscan.
The first 18 lines can be read line by line and parsed individually.

Sign in to comment.

Accepted Answer

Guillaume
Guillaume on 16 Sep 2019
As has been suggested you will have to parse the header and the values separately. It can't be done with the same readtable call as the data in your header is organised by rows whereas the values are organised by columns. In fact, the header is completely not suitable for readtable, so if it is indeed needed, reading the file line by line and using textscan is probably the best route:
fid = fopen('Example_CSV.csv', 'rt'); %open as text mode for automatic line ending conversion
%since textscan does not support timezones read everything in the first 7 rows as text
headers = cell(7, 1);
for l = 1:7
tline = fgetl(fid);
headers(l) = textscan(tline, '%s', 'Delimiter', ';', 'MultipleDelimAsOne', true); %not sure you want that last option for your actual file
end
fclose(fid);
%conversion to datetime
for l = 2:4
headers{l} = datetime(headers{l}, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ssZ', 'TimeZone', 'UTC+2'); %Choose whichever timezone you want for display
end
headers(5:6) = []; %discard unwanted row 5 and 6
As for the data itself, with your example file, the following requires no extra parsing afterwards but may need adapting for your file with more columns:
opts = detectImportOptions('Example_CSV.csv', 'NumHeaderLines', 18, 'Delimiter', ';'); %Basic detection. If on R2019a or later, add 'ReadVariableNames', false
%correction for the date columns:
opts = opts.setvartype(1:2, 'datetime'); %proper type
opts = opts.setvaropts(1:2, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ssZ', 'TimeZone', 'UTC+2'); %proper decoding. Note: Set the timezone to whatever timezone you want it to display in
%remove unneeded variables
opts.SelectedVariableNames = opts.VariableNames([1:3, 6]);
opts.ExtraColumnsRule = 'ignore';
%read the data
data = readtable('Example_CSV.csv', opts);

  4 Comments

Show 1 older comment
per isakson
per isakson on 17 Sep 2019
textscan reads half a million columns!
%%
chr = repmat( '1.2;', 1,5e5 );
chr(end) = [];
fmt = repmat( '%f', 1,5e5 );
%%
fid = fopen( 'long.txt', 'wt' );
n = fprintf( fid, '%s\n', chr, chr, chr );
fclose( fid );
%%
fid = fopen( 'long.txt' );
cac = textscan( fid, fmt, 'Delimiter',';', 'CollectOutput',true );
fclose( fid );
num = cac{:};
together with
>> whos num
Name Size Bytes Class Attributes
num 3x500000 12000000 double
shows that
Guillaume
Guillaume on 17 Sep 2019
I certainly wouldn't expect textscan to fall over for any length of text line (unless you run out of memory of course). What may very well fail or at least take a very long time is the detectImportOption.
The OP seem to have disappeared on us...
dpb
dpb on 17 Sep 2019
It certainly shouldn't for anything that isn't system or memory limited...don't know what the underlying C i/o runtime library limits might be, if any.
Indeed, so often one never knows if the poster just gives up or finds nirvana in what answers/hints are given or what.

Sign in to comment.

More Answers (1)

Jeremy Hughes
Jeremy Hughes on 13 Sep 2019
opts = detectImportOptions(FileName,'Delimiter',';','NumHeaderLines',18)
T = readtable(FileName,opts)

  0 Comments

Sign in to comment.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!