Readtable for a huge (no good schema) CSV file!

Hi All,
I have the tricky task of reading a huge CSV file (1360x107037). The example data attached to this question contains only a few rows and columns. I tried importing it with the MATLAB function readtable.
What I did -
  1. I got the import options through detectImportOptions()
  2. As the delimiter is a semicolon, I assigned it to opts.Delimiter
  3. Then I used readtable(filename, opts)
  4. I got a table with 1360 rows and 107037 columns.
Problem -
  1. In every even row, cells that are supposed to hold a string or date string come back as NaN values, but all numerical values are retained
  2. I tried to use the 7th line of the CSV data as VariableNames, but somehow I am getting Var1, Var2, ... etc. as the VariableNames
Does anyone know how I could get rid of these NaN values and obtain the actual string (or an empty string if there is none), and make the 7th line of the CSV file the VariableNames?
Update 1
I guess it is expecting a number at that point but instead it is receiving a string!
Update 2
I did this -
opts.VariableTypes(1, 1:end) = {'char'};
Now I am getting all the values, but all the numbers are cast to 'char'!
I dropped all the empty columns:
T(:, all(ismissing(T))) = [];
I thought of saving the table as .mat for later use but (my stupidity) it now contains all 'char' values and the file is really huge!
Any help is appreciated!
Thanks & Cheers,
Savan

10 Comments

The data (if that is actually data and not just sample "stuff" that you posted as example text) are irregular for the first 18 records in the file -- there is no way readtable or detectImportOptions can make heads or tails of what is supposed to be what.
You'll have to decide which pieces of which records are which variables, and read the header (if you need that information at all) with a series of calls to textscan with the appropriate format string for each record or set of records, saving those to separately (and hopefully well-chosen) named variables.
Then, once you get through that, you can read the remainder beginning at line 18ff as a set of variable names, maybe (although you wrote "Value", so maybe it's another set of numerics, who knows???), followed by the dates and a numeric value in the last column. Although those records are short a column on the end: the earlier lines contain six delimiters and a trailing value for seven columns, instead of just six delimiters with no trailing value.
Then again, this may be a fignewton of what you've done to the file for posting, and not be real.
The only hope is to attach a section of the real file so somebody can try to do anything specific.
Hi,
The attached CSV file is just a sample, and yes, its first 18 rows contain some irregular data. I am not allowed to share the real file, so I anonymized the data.
I was able to read this CSV file with
opts = detectImportOptions(FileName);
% It has a semicolon as delimiter
opts.Delimiter = {';'};
opts.DataLine = 2;
opts.VariableTypes(1,1:end) = {'char'};
T = readtable(FileName, opts);
I then get a table and I drop all the empty columns
T(:, all(ismissing(T))) = [];
This dropped all the empty columns and my table actually shrank to half its size.
Now I want to convert the rows that contain numbers stored as strings back to numbers. Is there any way to do it without a for loop?
I know this approach is tedious, but I still haven't found an optimal way to read this file!
Thanks in advance! Cheers.
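One sketch of that conversion, under the assumptions from the thread (everything was read as 'char', and the numeric columns are known by index, here columns 3 onward as a guess): str2double is vectorized over each column's cell array, so no loop over rows is needed, only a short loop over the columns to convert.

```matlab
% Sketch: convert 'char' table variables back to double without looping over rows.
% Assumes T is the all-'char' table from readtable, and that columns 3 onward
% should be numeric (adjust numCols for the real file).
numCols = 3:width(T);
for k = numCols                    % loops over columns only, not rows
    T.(k) = str2double(T.(k));     % str2double is vectorized over cell arrays
end
```

Non-numeric strings come back as NaN from str2double, so this is only safe on columns known to hold numbers.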
Why not read the first 18 rows and the rest in two separate steps?
Stubbornness, apparently...
Hi @per isakson,
I would try that now, thank you.
Hi @dpb,
Thank you for your comment. I am pretty new to MATLAB and trying to learn new things, and if I were you, I wouldn't call it stubbornness but rather asking for help, as I don't have a lot of time to complete this!
Thanks & Cheers,
Savan
Well, it was the very first suggestion made... and it seemed at the time to have fallen on deaf ears! Quicker to revamp the plan than to keep fighting upstream against swift currents. :)
Even if you can't actually post a real file (although is that truly a restriction where it would hurt anything at all to share a small section?), make the file you post reflect the real file sufficiently well that it is clear which data are which type(s) and what, precisely, you're looking for and must have. Ideally, it would be a file such that parsing it would return something meaningful in terms of values and names, whether or not they're the originals.
If there's a keyword that must be searched for to find the length of a variable-length header, we need to know it (or a surrogate name, if even that is truly proprietary, which would seem unlikely), or what data are needed from that section (if any).
In particular, the confusing part of the file as posted is whether those variable names are, in fact, the names of the variables in the file, or another set of numerics?
;;Value;Value;Value;Value;Value
2019-07-01T00:00:00+02;2019-07-01T00:15:00+02;20;;;40;
2019-07-01T00:15:00+02;2019-07-01T00:30:00+02;20;;;40;
...
Are the "Value" entries variable names or numbers? If the former, why are the first two missing? Is that really what the real file looks like?
It's somewhat unfortunate that readtable can't operate on a file handle as textscan can, so it can't reposition inside a call, but it's still possible to work around without too much trouble; when the header is only 18-20 lines or so, the time won't be much of an issue if you do have to scan it first and then read the rest separately. If, of course, the header really is a fixed number of lines, then all those problems pale and all one has to do is decide what is needed from that section of the file.
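A minimal sketch of that two-step workaround, assuming the sample file name Example_CSV.csv and the 18-line header described in the thread:

```matlab
% Step 1: grab the header lines as raw text for separate parsing later.
fid = fopen('Example_CSV.csv', 'rt');   % text mode for line-ending conversion
header = cell(18, 1);
for k = 1:18
    header{k} = fgetl(fid);
end
fclose(fid);

% Step 2: let readtable start after the header.
opts = detectImportOptions('Example_CSV.csv', 'Delimiter', ';', ...
                           'NumHeaderLines', 18);
T = readtable('Example_CSV.csv', opts);
```

Each header cell can then be split on ';' (e.g. with textscan or strsplit) only for the rows that are actually needed.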
This block of code undoes pretty much everything detectImportOptions has done.
opts = detectImportOptions(FileName);
% It has a semicolon as delimiter
opts.Delimiter = {';'};
opts.DataLine = 2;
opts.VariableTypes(1,1:end) = {'char'};
T = readtable(FileName, opts);
You could simply create the options directly; that would be a bit faster and a little clearer.
opts = delimitedTextImportOptions('Delimiter',';','DataLine',19,'NumVariables',7);
T = readtable(FileName, opts);
But if you pass some of those parameters into detectImportOptions, it gets better at figuring out the rest (like the datatypes, number of variables, ...).
opts = detectImportOptions(FileName,'Delimiter',';','NumHeaderLines',18)
T = readtable(FileName,opts)
@dpb, I agree and I will try to explain the CSV file more concretely. The thing is that it contains 18 header lines -
;;Single;Single;Single;Single;.....
;;from_timestamp;from_timestamp;from_timestamp;from_timestamp,.......
;;to_timestamp;to_timestamp;to_timestamp;to_timestamp;.......
;;version_timestamp;version_timestamp;version_timestamp;version_timestamp;......
;;unique_str_identifier;unique_str_identifier;unique_str_identifier;unique_str_identifier......
;;1;1;1;91;3;102;1;2;79;......... % I am not sure of these integers
;;name_machine1;name_machine1;name_machine1;............ ;name_machine2;name_machine2;.......
;;;;;;;;;;;;;;;;;; % %empty row
;;max_capacity;max_capacity;.. ;min_capacity; min_capacity;.. ;minute_capacity;minute_capacity;.. % It contains several variable names for each machine state in the 7th line
;;manufacturer;manufacturer;manufacturer;.........
;;delivery_company; delivery_company; delivery_company;delivery_company;..........
;;max_voltage;max_voltage;max_voltage;max_voltage;.......
;;fuel_type;fuel_type;fuel_type;fuel_type;fuel_type;...........
;;machine_max_volt;machine_max_volt;machine_max_volt;machine_max_volt;...........
;;;;;;;;;;;;;;;;; %empty row
;;WITHOUT;WITHOUT;WITHOUT;WITHOUT;............
;;;;;;;;;;;;;;;;; %empty row
;;value;value;value;value;value;......... % just contains the string 'value' across this complete row and it is not required
The important rows for me here are rows 1, 2, 3, 4, 7, and 8; all the other rows are not required.
Next, it contains values in the remaining rows....
from_timestamp;to_timestamp;50;;;;;;;;;56...........
from_timestamp;to_timestamp;50;;;;;;;;;;56............
...............
As it contains a lot of columns (107037), I don't know how to specify a format that huge so that I could use textscan!!
@Jeremy,
opts = detectImportOptions(FileName, 'Delimiter',';', 'NumHeaderLines', 18)
is apparently throwing an error -
Matrix index is out of range for deletion
"As it contains a lot of columns (107037), I don't know how to specify a format that huge so that I could use textscan!!" One cannot take for granted that textscan, or readtable for that matter, can handle that huge a number of columns. If all the columns hold numerical data:
format = [ '%{fmt}D%{fmt}D', repmat( '%f', 1, 107037-2 ) ];
where fmt shall be replaced by the correct date format. See the help for textscan.
The first 18 lines can be read line by line and parsed individually.
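Putting those pieces together, a sketch of the full textscan call. Since textscan has no time zone support, the two timestamp columns are read as text here and converted afterwards; the file name, column count, and date format are assumptions taken from the thread:

```matlab
% Read the data block in one textscan call with a repmat-built format string.
fmt = ['%s%s', repmat('%f', 1, 107037 - 2)];   % 2 timestamps + numeric columns
fid = fopen('Example_CSV.csv');
cac = textscan(fid, fmt, 'Delimiter', ';', 'HeaderLines', 18, ...
               'CollectOutput', true);
fclose(fid);
% cac{1} holds the two timestamp columns as text, cac{2} the numeric matrix.
ts   = datetime(cac{1}(:,1), 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ssZ', ...
                'TimeZone', 'UTC+2');          % from_timestamp column
vals = cac{2};                                 % all numeric columns as one matrix
```

'CollectOutput' groups the consecutive same-type columns, so the result is two arrays instead of 107037 separate cells.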


 Accepted Answer

As has been suggested, you will have to parse the header and the values separately. It can't be done with the same readtable call, as the data in your header is organised by rows whereas the values are organised by columns. In fact, the header is completely unsuitable for readtable, so if it is indeed needed, reading the file line by line and using textscan is probably the best route:
fid = fopen('Example_CSV.csv', 'rt'); %open in text mode for automatic line ending conversion
%since textscan does not support time zones, read everything in the first 7 rows as text
headers = cell(7, 1);
for l = 1:7
    tline = fgetl(fid);
    headers(l) = textscan(tline, '%s', 'Delimiter', ';', 'MultipleDelimsAsOne', true); %not sure you want that last option for your actual file
end
fclose(fid);
%conversion to datetime
for l = 2:4
    headers{l} = datetime(headers{l}, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ssZ', 'TimeZone', 'UTC+2'); %choose whichever time zone you want for display
end
headers(5:6) = []; %discard unwanted rows 5 and 6
As for the data itself, with your example file the following requires no extra parsing afterwards, but it may need adapting for your file with more columns:
opts = detectImportOptions('Example_CSV.csv', 'NumHeaderLines', 18, 'Delimiter', ';'); %Basic detection. If on R2019a or later, add 'ReadVariableNames', false
%correction for the date columns:
opts = opts.setvartype(1:2, 'datetime'); %proper type
opts = opts.setvaropts(1:2, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ssZ', 'TimeZone', 'UTC+2'); %proper decoding. Note: Set the timezone to whatever timezone you want it to display in
%remove unneeded variables
opts.SelectedVariableNames = opts.VariableNames([1:3, 6]);
opts.ExtraColumnsRule = 'ignore';
%read the data
data = readtable('Example_CSV.csv', opts);

4 Comments

Good on you, G, for taking the time to write code... now if the OP would just take the first 40 lines or so of the file and attach them as a .mat file so folks could actually test something, we might really get somewhere.
It's an interesting question whether there's an internal limit in the ML I/O libraries that couldn't handle records as long as those brought up above. But I don't have the patience to build a sample file for testing... :(
textscan reads half a million columns!
%%
chr = repmat( '1.2;', 1,5e5 );
chr(end) = [];
fmt = repmat( '%f', 1,5e5 );
%%
fid = fopen( 'long.txt', 'wt' );
n = fprintf( fid, '%s\n', chr, chr, chr );
fclose( fid );
%%
fid = fopen( 'long.txt' );
cac = textscan( fid, fmt, 'Delimiter',';', 'CollectOutput',true );
fclose( fid );
num = cac{:};
together with
>> whos num
  Name      Size             Bytes    Class     Attributes
  num       3x500000      12000000    double
shows that it does.
I certainly wouldn't expect textscan to fall over for any length of text line (unless you run out of memory, of course). What may very well fail, or at least take a very long time, is detectImportOptions.
The OP seems to have disappeared on us...
It certainly shouldn't fail for anything that isn't system- or memory-limited... I don't know what the underlying C I/O runtime library limits might be, if any.
Indeed, so often one never knows whether the poster just gave up, found nirvana in the answers/hints given, or what.


More Answers (1)

opts = detectImportOptions(FileName,'Delimiter',';','NumHeaderLines',18)
T = readtable(FileName,opts)


Asked: on 11 Sep 2019
Commented: dpb on 17 Sep 2019
