Finding duplicate strings in a cell array and their index
Jonathan Nastasi
on 11 Apr 2015
Commented: Paul Wintz
on 10 Sep 2021
I have to convert a cell array with more than 100,000 elements into a structure array with four fields. Right now, I have something like:
% cell array = nameData
n = 1;
for j = 2:102
    for i = 2:length(nameData)
        S(n).name = nameData{i,j};
        S(n).frequency = 1;
        n = n+1;
    end
end
However, I need to find duplicate strings in this array, and find information about them. Basically, I am collecting a database of strings and if I run across a duplicate, increase the frequency of that string rather than adding it to the structure.
I had been using loops within the previous two loops to achieve this:
for k = 1:n
    if strcmpi(S(k).name, nameData{i,j})
        S(k).frequency = S(k).frequency + 1;
    end
end
However, I always just end up with all 100,000 structure elements. Any other solution I have gotten to work was entirely too slow, and this conversion from cell to structure array must happen in less than 20 seconds.
Thanks!
2 Comments
Paul Wintz
on 10 Sep 2021
The use of i and j as index variables is so ubiquitous in programming that I would say, instead, that you should avoid using i and j as the imaginary unit, and use 1i or 1j, which cannot be overwritten.
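To illustrate the point about literals, here is a small sketch (the variable names z1, z2 and z3 are only for illustration) of how an index variable named i shadows the built-in imaginary unit, while a literal such as 4i is unaffected:

```matlab
i = 3;              % a loop-style assignment shadows the built-in imaginary unit
z1 = 2 + 4*i;       % uses the variable i, so this is the real number 14
z2 = 2 + 4i;        % the literal 4i is always imaginary, giving 2 + 4i
clear i             % removes the variable; i is the imaginary unit again
z3 = 2 + 4*i;       % now equal to z2
```

So as long as complex values are written with literals like 4i or 1i, using i and j as loop indices should not change their meaning.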
Accepted Answer
Stephen23
on 12 Apr 2015
Edited: Stephen23
on 13 Apr 2015
Learn to write vectorized code to make your code neater, faster and more robust: loops are not the first choice for solving problems in MATLAB, vectorization is!
This solution takes less than one second on my machine. First we generate an array of fake data, consisting of 100000 two-character strings of random characters, then we time the steps that find the unique strings and count them:
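N = 100000;
C = cellstr(char(32+randi(94,N,2)));        % N random two-character strings
tic
[D,~,X] = unique(C(:));                     % D: unique strings, X: index into D for each element of C
Y = hist(X,unique(X));                      % count how often each index occurs
Z = struct('name',D,'freq',num2cell(Y(:))); % one structure element per unique string
toc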
Elapsed time is 0.379057 seconds.
And we can have a look at a random example of the output Z:
>> Z(5).name
ans =
!%
>> Z(5).freq
ans =
12
For newer versions you can use histogram (or histcounts, which returns the counts without plotting) instead of hist. Note that vectorized code scales to larger arrays much better than loops do: even for one million elements in array C this method took only 4.87 seconds on my machine.
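As a rough sketch of how the same unique-and-count idea might be applied to the layout in the question, the example below makes a few assumptions that are not in the original posts: the tiny nameData is made-up stand-in data with a header row and header column (inferred from the loops starting at 2), lower is used to mimic the case-insensitive strcmpi comparison, and accumarray is swapped in for hist to do the counting.

```matlab
% Hypothetical stand-in for the asker's nameData (assumed layout: first row
% and first column are headers, as suggested by the loops starting at 2).
nameData = {'',   'c1',    'c2',    'c3';
            'r1', 'Alice', 'bob',   'ALICE';
            'r2', 'Bob',   'carol', 'alice'};

names = nameData(2:end, 2:end);   % drop the header row and column
names = lower(names(:));          % fold case so 'Alice' and 'ALICE' match, like strcmpi

[D, ~, X] = unique(names);        % D: sorted unique names, X: index into D per element
counts = accumarray(X(:), 1);     % number of occurrences of each unique name

% One structure element per unique name, using the field names from the question.
S = struct('name', D, 'frequency', num2cell(counts));
```

With this sample data S has three elements (alice, bob, carol) with frequencies 3, 2 and 1. Note that folding case with lower also changes the stored names; if the original capitalization matters, the second output of unique could be used to pick a representative from the unmodified strings.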