
Is there a way to efficiently read a .csv file into a dataset in MATLAB?

Ok, so here is the deal.
I have a 2.5 GB csv file. I'd like to load it as a dataset so that I can use the dataset indexing functionality (for example, grabbing the rows that contain a certain value).
Here are some sample lines:
rs180759811,1,83977,0.0078454,0.99052,0.512,'0000','1010',0.45,.,.,F,.,.,.,.,.,.,imputed,
rs188652299,1,84156,0.0012772,0.99851,0.50381,'0000','1100',0.65,.,.,R,.,.,.,1,.,.,imputed,
rs192830046,1,86282,0.00080435,0.99911,0.59506,'0000','1111',0,.,.,R,.,1,.,.,.,.,imputed,
rs146027550,1,88429,0.018998,0.97847,0.53261,'0000','1001',0.2,.,.,R,.,.,.,1,.,.,imputed,
rs187571096,1,114699,0.010444,0.98884,0.5583,'0000','1000',0.65,.,.,R,.,.,.,1,.,.,imputed,
rs191891026,1,171529,0.011039,0.98724,0.51818,'0000','1001',0.2,.,.,R,.,.,.,1,.,.,imputed,
But, as I see it, there is not a good way to go from csv --> dataset.
Here are the options I've been considering:
fgetl --> regexp --> cell array --> cell2dataset
I know I can get that to work, but it can't be the most efficient way.
textscan --> textscan allows me to specify commas as the delimiter, which is useful, but I am not even sure if I can read 1 line at a time with textscan.
csvread --> will not work because most of the values are not numeric.
Is there another option that will turn a csv directly into an array or dataset without having to treat it as strings, regexp it, the whole 9 yards?
Thanks very much.

Answers (1)

Walter Roberson on 11 Sep 2013
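You can read a line at a time with textscan(), by specifying a count of 1 right after the format. But why not read it all with textscan() and then cell2dataset() the result, after a horzcat() ?
cellinput = textscan(fid, '%s%f%f%f%f%f%s%s%f%s%s%s%s%s%s%s%s%s%s%s', 'delimiter', ',');
isnum = cellfun(@isnumeric, cellinput);   % the %f columns come back as double vectors
cellinput(isnum) = cellfun(@num2cell, cellinput(isnum), 'UniformOutput', false);   % wrap them as cell columns
cell2dataset( horzcat(cellinput{:}), 'ReadVarNames', false )   % the first row is data, not variable names
The num2cell() step is needed because horzcat() cannot join a cell column of strings with a plain numeric column of the same height. Once every member is a cell column vector, the horzcat() takes cellinput from being a cell row vector with each member being a cell column vector, into being a row-and-column cell array.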
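The change of shape that horzcat() produces can be checked on a small in-memory example. This is only a sketch: the two records are shortened to five fields, and the names txt, c, isNum and rows are made up for illustration. It assumes that textscan() returns each %f column as a double vector, so those columns are wrapped with num2cell() before the concatenation.

```matlab
% Two shortened sample records held in a string instead of a file
% (illustration only; the real file has 20 comma-separated fields per line).
txt = sprintf('%s\n%s\n', ...
    'rs180759811,1,83977,0.0078454,imputed', ...
    'rs188652299,1,84156,0.0012772,imputed');

% textscan also accepts text directly, which is handy for experiments.
c = textscan(txt, '%s%f%f%f%s', 'Delimiter', ',');
% c is a 1x5 cell row vector: c{1} and c{5} are 2x1 cell arrays of
% strings, c{2} to c{4} are 2x1 double columns.

% Wrap each numeric column element by element so that every member
% of c is a 2x1 cell column.
isNum = cellfun(@isnumeric, c);
c(isNum) = cellfun(@num2cell, c(isNum), 'UniformOutput', false);

% Now the columns can be placed side by side: one row per record,
% one column per field.
rows = horzcat(c{:});
disp(size(rows))
```

The resulting rows array is the kind of row-and-column cell array that cell2dataset() expects as input.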
For lack of better instruction, each column after the last consistent numeric column has been read in as a separate string. If you know that a certain column there will always be a useless ".", then switch the corresponding %s to %*s so that it is skipped. But for the column that is either 1 or ".", do not switch that to %f or %g, as a numeric format will not gracefully match a "." in that column.
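Here is a rough sketch of both points on a shortened record layout. The six-field layout and the names txt, c and flag are invented for the example. The column that is always "." is skipped with %*s. The column that is either 1 or "." is read as a string and converted afterwards with str2double(), which returns NaN for text that is not a number.

```matlab
% Shortened records: id, number, always ".", strand, 1-or-".", label
% (a made-up layout for illustration, not the real 20-field file).
txt = sprintf('%s\n%s\n', ...
    'rs180759811,1,.,F,.,imputed', ...
    'rs188652299,1,.,R,1,imputed');

% %*s reads the third field and throws it away, so only five
% columns are returned for the six fields in each record.
c = textscan(txt, '%s%f%*s%s%s%s', 'Delimiter', ',');

% The 1-or-"." column was read as strings; convert it after the fact.
% str2double gives NaN for '.', and 1 for '1'.
flag = str2double(c{4});
disp(flag)
```

Skipping columns this way also reduces memory use, which matters for a file of this size.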
