Finding duplicate strings in a cell array and their index

I have to convert a cell array with more than 100,000 elements into a structure array with four fields. Right now, I have something like:
% cell array = nameData
n = 1;
for j = 2:102
for i = 2:length(nameData)
S(n).name = nameData{i,j};
S(n).frequency = 1;
n = n+1;
end
end
However, I need to find duplicate strings in this array, and find information about them. Basically, I am collecting a database of strings and if I run across a duplicate, increase the frequency of that string rather than adding it to the structure.
I had been using loops within the previous two loops to achieve this:
for k = 1:n
if strcmpi(S(k).name, nameData{i,j})
S(k).frequency = S(k).frequency + 1;
end
end
However, I always just end up with all 100,000 structure elements. Any other solution I have gotten to work was entirely too slow, and this conversion from cell to structure array must happen in less than 20 seconds.
Thanks!
  2 Comments
Stephen23 on 12 Apr 2015
Edited: Stephen23 on 12 Apr 2015
You should avoid naming variables i and j as these are both names of the inbuilt imaginary unit.
Paul Wintz on 10 Sep 2021
The use of i and j as index variables is so ubiquitous in programming that I would say, instead, that you should avoid using i and j as the imaginary unit and use 1i or 1j, which cannot be overwritten.
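To illustrate the point about shadowing, here is a minimal sketch (the variable names and values are made up for this example). Once i is assigned as a variable, expressions that use i pick up that value, while a numeric literal with an i suffix is still treated as imaginary:

```matlab
% Illustration only: names and values are hypothetical.
i = 3;          % i is now an ordinary variable that shadows the imaginary unit
a = 2 + 4*i;    % uses the variable i, so a is the real number 14
b = 2 + 4i;     % the literal 4i is imaginary regardless of the variable i
```

Here a is real while b is complex, which is why writing 1i or 1j for the imaginary unit stays safe even when i and j are used as loop indices.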


Accepted Answer

Stephen23 on 12 Apr 2015
Edited: Stephen23 on 13 Apr 2015
Learn to write vectorized code to make your code neater, faster and more robust: in MATLAB, vectorization rather than loops is the first choice for solving problems like this.
This solution takes less than one second on my machine. First we generate an array of fake data, consisting of 100000 two-character strings of random characters:
N = 100000;
C = cellstr(char(32+randi(94,N,2)));
then we collect the unique ones into D and count their frequency in Y using hist:
tic
[D,~,X] = unique(C(:));
Y = hist(X,unique(X));
Z = struct('name',D,'freq',num2cell(Y(:)));
toc
The timer functions tic and toc print this to my command window:
Elapsed time is 0.379057 seconds.
And we can have a look at a random example of the output Z:
>> Z(5).name
ans =
!%
>> Z(5).freq
ans =
12
For newer versions you can use histogram (or histcounts, which returns the counts without drawing a plot) instead of hist. Note that vectorized code scales to larger array sizes much better than loops do: even for one million elements in array C this method only took 4.87 seconds on my machine.
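A small worked example may make it easier to see how the outputs of unique relate to the counts. This is only a sketch on made-up data: the names names, folded, uniq, idx, counts and S are hypothetical, lower is used to fold case because the question compared strings with strcmpi, and accumarray is swapped in for hist as one possible way to count the indices.

```matlab
% Toy data (made up for illustration); 'Bob' and 'bob' differ only in case.
names = {'bob'; 'amy'; 'Bob'; 'cat'; 'amy'; 'bob'};

% Fold case first so that matching is case-insensitive, as with strcmpi.
folded = lower(names);

% uniq lists each distinct string once (sorted); idx maps every element
% of folded to its position in uniq.
[uniq, ~, idx] = unique(folded);

% Count how often each index occurs by summing a 1 for every occurrence.
counts = accumarray(idx, 1);

% Build the structure array: one element per distinct string.
S = struct('name', uniq, 'frequency', num2cell(counts));
```

For this input, uniq is {'amy'; 'bob'; 'cat'} and counts is [2; 3; 1], so S(2).name is 'bob' and S(2).frequency is 3. Note that folding case with lower means the stored names are lowercase, which may or may not be what you want.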
