Readtable for a huge (no good schema) CSV file!

Question

Savan Rangegowda on 11 Sep 2019

0
Link

Direct link to this question

https://in.mathworks.com/matlabcentral/answers/479879-readtable-for-a-huge-no-good-schema-csv-file

Commented: dpb on 17 Sep 2019

Example_CSV.csv

Hi All,

I have a tricky task of reading this huge csv file (1360x107037) . The example data is attached with this question and it just contains only few rows and columns. I tried importing it through the matlab function readtable.

What I did -

I tried to get the opts throught detectImportOptions()
As the delimiter is 'semi-colon', I assigned it to opts.Delimeter
Then I used readtable(filename, opts)
I got a table with 1360 rows and 107037 columns.

Problem -

I am getting in every even row for cells that supposed to be a string or datestring NaN values but all numerical values are retained
I tried to make the 7th line of the csv data as VariableNames but somehow I am getting Var1, Var2, ...... etc are the VariableNames

Does anyone know how could I get rid of these NaN values and obtain the actual string if exists or just an empty string and make the 7th line of the CSV file as the VariableNames.

Update 1

I guess it is expecting a number at that point but instead it is receiving a String!

Update 2

I did this -

opts.VariableTypes(1, 1:end) = {'char'};

Now I am getting all the values but now all the numbers are casted to 'char'!

I dropped all th eampty columns;

table (:, all(ismissing(table))) =[];

I thought of saving the Table as .mat for later use but (my stupidity) it contains now all 'char' values and the file is really huge!

Any help is appreciated!

Thanks & Cheers,

Savan

10 Comments
Show 8 older commentsHide 8 older comments

dpb on 11 Sep 2019

The data (if that is actually data and not just sample "stuff" that you posted as example text) are irregular for the first 18 records in the file -- no way readtable or detectimportoptions can make heads nor tails out of what is supposed to be what.

You'll have to decide what pieces of which records are what variables and read the header (if you need that information at all) with a series of calls to textscan with the appropriate format string for each record or set of records and save those to separately (and hopefully well-chosen) named variables.

Then, one you get through that, you can read the remainder begining at 18ff as a set of variable names, maybe (altho you wrote "Value" so maybe it's another set of numerics, who knows???) followed by the dates and a numeric value in last column. Altho those records are short a column on the end based on the earlier lines which contain six delimiters and a trailing value for seven columns instead of just six delimiters with no trailing value for those.

This again, may be a fignewton of what you've done to the file for posting, and not be real.

Only hope is to attach a section of the real file for somebody to be able to try to do anything specific.

dpb on 13 Sep 2019

Edited: dpb on 13 Sep 2019

Well, was the very first suggestion made...and seemed at the time to have fallen on deaf ears! Quicker to revamp the plan than to keep fighting upstream against swift currents. :)

Even if you can't actually post a real file (altho is that truly a restriction that would hurt anything at all to have a small section?), make the file you post actually reflect the real file sufficiently well that it is clear what data are what type(s) and what, precisely it is that you're looking for and must have. Ideally, it would be a file such that parsing it would return something meaningful in terms of values and names, whether they're actually original or not.

If there's a keyword that must be searched for to find the length of a variable-length header, we need to know it (or a surrogate name for it if even it is truly proprietary which would seem unlikely) or, what data are needed from that section (if any).

In particular, the confusing part with the file as posted is whether those variable names are, in fact, the names of the variables in the files or are those another set of numerics?

;;Value;Value;Value;Value;Value
2019-07-01T00:00:00+02;2019-07-01T00:15:00+02;20;;;40;
2019-07-01T00:15:00+02;2019-07-01T00:30:00+02;20;;;40;
...

Are "Value" variable names or numbers? If the former, why missing the first two? Is that really real?

It's somewhat unfortunate that readtable doesn't have the option to operate on a file handle as does textscan so can reposition inside a call, but it's still possible to workaround without too much trouble and when the header is only 18-20 lines or so, the time won't be much of an issue if do have to scan it first and then read the rest separately. If, of course, the header really is fixed number of lines, then all those problems pale and all one has to do is to decide what it is that is needed from that section of the file.

Savan Rangegowda on 16 Sep 2019

@dpb, I agree and I wil try to explain the csv file more concretely. The thing is that it contains 18 headerlines -

;;Single;Single;Single;Single;.....
;;from_timestamp;from_timestamp;from_timestamp;from_timestamp,.......
;;to_timestamp;to_timestamp;to_timestamp;to_timestamp;.......
;;version_timestamp;version_timestamp;version_timestamp;version_timestamp;......
;;unique_str_identifier;unique_str_identifier;unique_str_identifier;unique_str_identifier......
;;1;1;1;91;3;102;1;2;79;......... % I am not sure of these intergers
;;name_machine1;name_machine1;name_machine1;............ ;name_machine2;name_machine2;.......
;;;;;;;;;;;;;;;;;; % %empty row
;;max_capacity;max_capacity;.. ;min_capacity; min_capacity;.. ;minute_capacity;minute_capacity;.. % It contains several variable names for each machine states in the 7th line
;;manufacturer;manufacturer;manufacturer;.........
;;delivery_company; delivery_company; delivery_company;delivery_company;..........
;;max_voltage;max_voltage;max_voltage;max_voltage;.......
;;fuel_type;fuel_type;fuel_type;fuel_type;fuel_type;...........
;;machine_max_volt;machine_max_volt;machine_max_volt;machine_max_volt;...........
;;;;;;;;;;;;;;;;; %empty row
;;WITHOUT;WITHOUT;WITHOUT;WITHOUT;............
;;;;;;;;;;;;;;;;; %empty row
;;value;value;value;value;value;......... % just contains the string 'value' in this complete row and it is not required

Important rows for me here are row1, row2, row3, row4, row7, row8 and all other rows are not required.

Next it contains values in the rest of the remaining rows....

from_timestamp;to_timestamp;50;;;;;;;;;56...........
from_timestamp;to_timestamp;50;;;;;;;;;;56............
...............

As it contains a lot of columns(107037), I don't how to specify a format so huge so that I could use textscan!!

Savan Rangegowda on 16 Sep 2019

@Jeremy,

opts = detectImportOptions(FileName, 'Delimiter',';', 'NumHeaderLines', 18)

is apparently throwing an error -

Matrix index is out of range for deletion

per isakson on 16 Sep 2019

Edited: per isakson on 16 Sep 2019

"As it contains a lot of columns(107037), I don't how to specify a format so huge so that I could use textscan!!" One cannot take for granted that textscan or readtable for that matter can handle that huge number of columns. If all the columns hold numerical data:

format = [ '%{fmt}D%{fmt}D', repmat( '%f', 1, 107037-2 ) ];

where fmt shall be replaced by the correct date-format. See help on textscan.

The first 18 lines can be read line by line and parsed individually.

Sign in to comment.

Sign in to answer this question.

Answer 1

Guillaume on 16 Sep 2019

0
Link

Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/479879-readtable-for-a-huge-no-good-schema-csv-file#answer_392064

As has been suggested you will have to parse the header and the values separately. It can't be done with the same readtable call as the data in your header is organised by rows whereas the values are organised by columns. In fact, the header is completely not suitable for readtable, so if it is indeed needed, reading the file line by line and using textscan is probably the best route:

fid = fopen('Example_CSV.csv', 'rt');  %open as text mode for automatic line ending conversion
%since textscan does not support timezones read everything in the first 7 rows as text
headers = cell(7, 1);
for l = 1:7
    tline = fgetl(fid);
    headers(l) = textscan(tline, '%s', 'Delimiter', ';', 'MultipleDelimAsOne', true);  %not sure you want that last option for your actual file
end
fclose(fid);
%conversion to datetime
for l = 2:4
    headers{l} = datetime(headers{l}, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ssZ', 'TimeZone', 'UTC+2');  %Choose whichever timezone you want for display
end
headers(5:6) = [];  %discard unwanted row 5 and 6

As for the data itself, with your example file, the following requires no extra parsing afterwards but may need adapting for your file with more columns:

opts = detectImportOptions('Example_CSV.csv', 'NumHeaderLines', 18, 'Delimiter', ';'); %Basic detection. If on R2019a or later, add 'ReadVariableNames', false
%correction for the date columns:
opts = opts.setvartype(1:2, 'datetime');  %proper type
opts = opts.setvaropts(1:2, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ssZ', 'TimeZone', 'UTC+2');  %proper decoding. Note: Set the timezone to whatever timezone you want it to display in
%remove unneeded variables
opts.SelectedVariableNames = opts.VariableNames([1:3, 6]);
opts.ExtraColumnsRule = 'ignore';
%read the data
data = readtable('Example_CSV.csv', opts);

4 Comments
Show 2 older commentsHide 2 older comments

Guillaume on 17 Sep 2019

I certainly wouldn't expect textscan to fall over for any length of text line (unless you run out of memory of course). What may very well fail or at least take a very long time is the detectImportOption.

The OP seem to have disappeared on us...

dpb on 17 Sep 2019

It certainly shouldn't for anything that isn't system or memory limited...don't know what the underlying C i/o runtime library limits might be, if any.

Indeed, so often one never knows if the poster just gives up or finds nirvana in what answers/hints are given or what.

Sign in to comment.

Answer 2

Jeremy Hughes on 13 Sep 2019

0
Link

Direct link to this answer

https://in.mathworks.com/matlabcentral/answers/479879-readtable-for-a-huge-no-good-schema-csv-file#answer_391780

opts = detectImportOptions(FileName,'Delimiter',';','NumHeaderLines',18)
T = readtable(FileName,opts)

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Readtable for a huge (no good schema) CSV file!

10 Comments
Show 8 older commentsHide 8 older comments

Accepted Answer

4 Comments
Show 2 older commentsHide 2 older comments

More Answers (1)

0 Comments
Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Community Treasure Hunt

Readtable for a huge (no good schema) CSV file!

10 Comments Show 8 older commentsHide 8 older comments

Accepted Answer

4 Comments Show 2 older commentsHide 2 older comments

More Answers (1)

0 Comments Show -2 older commentsHide -2 older comments

See Also

Categories

Tags

Community Treasure Hunt

10 Comments
Show 8 older commentsHide 8 older comments

4 Comments
Show 2 older commentsHide 2 older comments

0 Comments
Show -2 older commentsHide -2 older comments