Is there a faster way of splitting a cell array into a numeric array while preserving NaN?

Greetings,
I am trying to split a set of data into rows and columns of numeric data while preserving the position of empty fields (as NaN or something similar).
The input data is a cell array in which each row is a string. The columns are delimited by a semicolon ' ; '. The first 8 columns are filled with garbage data, and there are many trailing columns with no data at all. Some rows contain no data. The attached sample is only 4,000 rows long, but my actual datasets have between 50,000 and 300,000 rows.
I have been using the code below, but the str2double step is incredibly slow. Can anyone suggest an alternative approach that cuts down the processing time?
% split data by the ' ; ' separator
data = cellfun(@(x) split(x,';'),data,'UniformOutput',false);
% get rid of preceding garbage data in columns 1 to 8
data = cellfun(@(x) x(9:end),data,'UniformOutput',false);
% convert data into double. This step is incredibly slow
data = cellfun(@str2double,data,'UniformOutput',false);
% example of next operations I wish to perform on this data
data_a = cellfun(@(x) x(1:2:end),data,'UniformOutput',false);
data_b = cellfun(@(x) x(2:2:end),data,'UniformOutput',false);
Thank you in advance for any help

 Accepted Answer

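Try this:
% remember which rows end with a semicolon (an empty last field)
endsWithSemicolon = cellfun(@(s) endsWith(s, ';'), data);
% parse each row as doubles; empty fields become NaN and * is treated as whitespace
x = cellfun(@(s) textscan(s, '%f', 'Delimiter', ';', 'EmptyValue', nan(), 'Whitespace', ' *\n\t\r\b'), data);
% get rid of the preceding garbage data in columns 1 to 8
x = cellfun(@(a) a(9:end), x, 'UniformOutput', false);
% append a NaN to those rows so the result matches the str2double output
x(endsWithSemicolon) = cellfun(@(a) [a; nan], x(endsWithSemicolon), 'UniformOutput', false);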
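
Since the stated goal is a numeric array, one possible follow-up is to pad the ragged per-row vectors into a single matrix, filling the missing trailing columns with NaN. This is only a minimal sketch: it assumes x is a non-empty cell array of numeric column vectors of varying length, and the sample x below is made up for illustration rather than taken from the attached data.

```matlab
% Hypothetical stand-in for the parsed output: ragged numeric column vectors,
% including one row with no data at all
x = {[1; NaN; 3]; zeros(0, 1); [4; 5]};

nCols = max(cellfun(@numel, x));   % length of the longest row
M = nan(numel(x), nCols);          % preallocate; unfilled entries stay NaN
for k = 1:numel(x)
    v = x{k};
    M(k, 1:numel(v)) = v(:).';     % copy the row's values into the matrix
end

% The odd/even column split from the question becomes plain indexing
data_a = M(:, 1:2:end);
data_b = M(:, 2:2:end);
```

Working on one padded matrix avoids further cellfun calls for the later steps, at the cost of storing NaN padding for the short rows.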

4 Comments

There was a bug initially, because I didn't realise that some of the strings use ****** for NaN values.
I added * to the whitespace characters to overcome that.
Now the result is identical to your output, and you'll see it's more efficient.
I'm not sure this is necessary:
endsWithSemicolon = cellfun(@(s) endsWith(s, ';'), data);
x(endsWithSemicolon) = cellfun(@(a) [a; nan], x(endsWithSemicolon), 'UniformOutput', false);
I just added it so that the output is identical to the output your code generated
Hi TADA,
Your solution works quite well for my purpose. Here is the difference in performance.
Original approach with str2double: 52.586311 seconds
Alternate approach with str2doubleq: 0.731596 seconds
Alternate approach with textscan: 0.343899 seconds
Both your solution and the one offered by Adam Danz improve my code significantly. Thank you.
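
For anyone who wants to run a similar comparison on their own data, a rough sketch using timeit is shown below. The sample rows are invented for illustration (numeric placeholders in the first 8 columns and one empty field), so the absolute timings will not match the figures above, and the str2doubleq variant is left out because it is not part of base MATLAB.

```matlab
% Hypothetical sample: 2000 identical rows with 8 placeholder columns
% followed by data containing one empty field
data = repmat({'0;0;0;0;0;0;0;0;1.5;;3;4'}, 2000, 1);

% Approach from the question: split each row, then convert with str2double
viaStr2double = @() cellfun(@(s) str2double(split(s, ';')), data, ...
    'UniformOutput', false);

% Approach from the answer: let textscan parse each row directly
viaTextscan = @() cellfun(@(s) textscan(s, '%f', 'Delimiter', ';', ...
    'EmptyValue', NaN), data);

% timeit calls each function handle several times and returns a typical time
tStr2double = timeit(viaStr2double);
tTextscan = timeit(viaTextscan);

fprintf('str2double: %.6f s\ntextscan:   %.6f s\n', tStr2double, tTextscan);
```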


Release

R2018b

Asked:

on 22 Aug 2019

Commented:

on 23 Aug 2019
