Readtable for a huge (no good schema) CSV file!
Show older comments
Hi All,
I have a tricky task of reading this huge csv file (1360x107037) . The example data is attached with this question and it just contains only few rows and columns. I tried importing it through the matlab function readtable.
What I did -
- I tried to get the opts throught detectImportOptions()
- As the delimiter is 'semi-colon', I assigned it to opts.Delimeter
- Then I used readtable(filename, opts)
- I got a table with 1360 rows and 107037 columns.
Problem -
- I am getting in every even row for cells that supposed to be a string or datestring NaN values but all numerical values are retained
- I tried to make the 7th line of the csv data as VariableNames but somehow I am getting Var1, Var2, ...... etc are the VariableNames
Does anyone know how could I get rid of these NaN values and obtain the actual string if exists or just an empty string and make the 7th line of the CSV file as the VariableNames.
Update 1
I guess it is expecting a number at that point but instead it is receiving a String!
Update 2
I did this -
opts.VariableTypes(1, 1:end) = {'char'};
Now I am getting all the values but now all the numbers are casted to 'char'!
I dropped all th eampty columns;
table (:, all(ismissing(table))) =[];
I thought of saving the Table as .mat for later use but (my stupidity) it contains now all 'char' values and the file is really huge!
Any help is appreciated!
Thanks & Cheers,
Savan
10 Comments
dpb
on 11 Sep 2019
The data (if that is actually data and not just sample "stuff" that you posted as example text) are irregular for the first 18 records in the file -- no way readtable or detectimportoptions can make heads nor tails out of what is supposed to be what.
You'll have to decide what pieces of which records are what variables and read the header (if you need that information at all) with a series of calls to textscan with the appropriate format string for each record or set of records and save those to separately (and hopefully well-chosen) named variables.
Then, one you get through that, you can read the remainder begining at 18ff as a set of variable names, maybe (altho you wrote "Value" so maybe it's another set of numerics, who knows???) followed by the dates and a numeric value in last column. Altho those records are short a column on the end based on the earlier lines which contain six delimiters and a trailing value for seven columns instead of just six delimiters with no trailing value for those.
This again, may be a fignewton of what you've done to the file for posting, and not be real.
Only hope is to attach a section of the real file for somebody to be able to try to do anything specific.
Savan Rangegowda
on 13 Sep 2019
per isakson
on 13 Sep 2019
Why not read the first 18 rows and the rest in two separate steps?
dpb
on 13 Sep 2019
Stubborness, apparently...
Savan Rangegowda
on 13 Sep 2019
Well, was the very first suggestion made...and seemed at the time to have fallen on deaf ears! Quicker to revamp the plan than to keep fighting upstream against swift currents. :)
Even if you can't actually post a real file (altho is that truly a restriction that would hurt anything at all to have a small section?), make the file you post actually reflect the real file sufficiently well that it is clear what data are what type(s) and what, precisely it is that you're looking for and must have. Ideally, it would be a file such that parsing it would return something meaningful in terms of values and names, whether they're actually original or not.
If there's a keyword that must be searched for to find the length of a variable-length header, we need to know it (or a surrogate name for it if even it is truly proprietary which would seem unlikely) or, what data are needed from that section (if any).
In particular, the confusing part with the file as posted is whether those variable names are, in fact, the names of the variables in the files or are those another set of numerics?
;;Value;Value;Value;Value;Value
2019-07-01T00:00:00+02;2019-07-01T00:15:00+02;20;;;40;
2019-07-01T00:15:00+02;2019-07-01T00:30:00+02;20;;;40;
...
Are "Value" variable names or numbers? If the former, why missing the first two? Is that really real?
It's somewhat unfortunate that readtable doesn't have the option to operate on a file handle as does textscan so can reposition inside a call, but it's still possible to workaround without too much trouble and when the header is only 18-20 lines or so, the time won't be much of an issue if do have to scan it first and then read the rest separately. If, of course, the header really is fixed number of lines, then all those problems pale and all one has to do is to decide what it is that is needed from that section of the file.
Jeremy Hughes
on 13 Sep 2019
This block of code undoes pretty much everything detectImportOptions has done.
opts = detectImportOptions(FileName)
% It has semi colon as delimiter
opts.Delimeter = {';'};
opts. dataLine = 2;
opts.VariableTypes(1,1:end) = {'char'};
T = readtable(FileName, opts);
You could simply create the options directly, and that would be a bit faster, and a little clearer.
opts = delimitedTextImportOptions('Delimiter',';','DataLine',19,'NumVariables',7);
T = readtable(FileName, opts);
But if you pass some of those parameters into detectImportOptions, it gets better at figuring out the rest of the things (like the datatypes, number of variables... )
opts = detectImportOptions(FileName,'Delimiter',';','NumHeaderLines',18)
T = readtable(FileName,opts)
Savan Rangegowda
on 16 Sep 2019
Savan Rangegowda
on 16 Sep 2019
per isakson
on 16 Sep 2019
Edited: per isakson
on 16 Sep 2019
"As it contains a lot of columns(107037), I don't how to specify a format so huge so that I could use textscan!!" One cannot take for granted that textscan or readtable for that matter can handle that huge number of columns. If all the columns hold numerical data:
format = [ '%{fmt}D%{fmt}D', repmat( '%f', 1, 107037-2 ) ];
where fmt shall be replaced by the correct date-format. See help on textscan.
The first 18 lines can be read line by line and parsed individually.
Accepted Answer
More Answers (1)
Jeremy Hughes
on 13 Sep 2019
opts = detectImportOptions(FileName,'Delimiter',';','NumHeaderLines',18)
T = readtable(FileName,opts)
Categories
Find more on Text Data Preparation in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!