Loop through the unique values of a very large column and extract data
6 views (last 30 days)
Show older comments
This is more a speed question than a "how to" question.
Assume I have the following problem:
Three variables A, B and C.
A is a series of IDs and B and C are data (e.g. dates and a measurement).
For various reasons I want to seperate the data, so instead of three columns I have a structure with something like:
mystruct.First_ID_FROM_A = [B(indexFirstID,:) C(indexFirstID,:)]
Traditionally I just do the following:
[uA,IA,IB] = unique(A);
for i=1:length(uA)
ii = find(i==IB);
mystruct.(uA{i,1}) = [B(ii,:) C(ii,:)];
%sometimes I do other stuff here with some cross referencing so the index ii is useful.
end
Job done. I have tried other methods, but this is pretty fast, except now I have like crazy big data (e.g. A, B and C is like the best part of a billion rows). So this is my second attempt that I run on a server:
[uA,IA,IB] = unique(A);
N = length(uA);
temp = cell(N,1);
% do the indexing with a cell structure that can be cut.
parfor i=1:N
ii = find(i==IB);
temp{i,1} = [B(ii,:) C(ii,:)];
end
% do a second loop just to reallocate the data
for i=1:N
mystruct.(uA{i,1}) = temp{i,1};
end
So despite being two loops this can be quicker as the extraction is in parallel and the assignment is fast.
Is there a fancy way of using something like an array based version of a binary expansion function that can do this faster without the loop, in either step of the second process? Or should I make a C++ and a mex routine to speed this tedious thing up? I think a problem here is the output array is uncertain in terms of size.
If so does anyone have any experience or examples of how to create and map a Matlab structure in C++ so the output can be read by matlab? I use str2doubleq a lot, this takes cell array of strings and outputs doubles, which is quite vanilla, and I have made a few custom C and C++ codes, for fast date and time pulls, when datenum was too slow.
But this is annoying, me, I am sure there is a neater way to do it. Once the data is in the structure, it is reall fast to just use the fieldnames command and then loop through the sub data objects.
7 Comments
Sindar
on 16 Jun 2020
The point of tables is that they act like a more organized structure array. If you are naming each structure field, you already spend that memory. Depending on the shape of your data, something similar to:
mytable = array2table([B(IB,:) C(IB,:)],'RowNames',num2str(uA))
should work without any loops
Answers (0)
See Also
Categories
Find more on Matrix Indexing in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!