How many times does each number appear?
Show older comments
Let a
a =
1 4
3 5
4 7
6 8
be integer intervals
[1 2 3 4]
[3 4 5]
[4 5 6 7]
[6 7 8]
I'm looking for an output b to count the number of times each element appears. It is clear that 1 appears one time, 2 appears one time, 3 appears twice, and so on...
So the output b would list the element and how many times it appears:
b =
1 1
2 1
3 2
4 3
5 2
6 2
7 2
8 1
The challenge is, a is an extremely large nx2 array, so using find even with parallel looping is inefficient. Is there a fast way to yield b given a? If it helps, we can be sure that the elements down the left-hand column of a are sorted, but not necessarily so for the right-hand column. Thanks in advance!
2 Comments
Jan
on 2 Apr 2013
Some contributors spend a lot of time here to solve the problem. What about voting for this question to get the interst of others?
Accepted Answer
More Answers (6)
Like Cedric, I keep thinking of iterating over interval length. Another way to look at the vector b is as the sum of the outputs of a bunch of FIR filters with rectangular impulse responses to appropriate inputs. If you cut the filter length in half at each iteration, you can avoid iterating over all n interval lengths, and instead loop about log2(n) times.
Here I generate intervals of up to 1000, so the loop runs about 10 times.
N=1e8;
M=1e7;
a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
idx = 1:max(a(:));
d = diff(a, 1, 2) + 1 ; % interval length
s = a(:,1); % interval start point
L = ceil(max(d)/2);
b = 0*idx;
tic
while L>0
iL = (d>=L);
u = hist(s(iL),idx);
%b = b + conv(u, ones(1,L),'same');
b = b + filter(ones(1,L),1,u);
s(iL) = s(iL)+L;
d(iL) = d(iL)-L;
L = ceil(max(d)/2);
toc
end
This ran in 243 seconds on my machine; Cedric's solution exhausted my memory after 577 seconds....
EDIT: Replaced the call to conv() with a call to filter(), since conv() was returning the central part of the convolution, not the initial part, as we need.
1 Comment
Cedric
on 2 Apr 2013
+1 !
Are the values in the first and second column of a distinct? If so:
c = zeros(1, max(a(:, 2)) + 1);
c(a(:, 1)) = 1;
a2 = a(:, 2) + 1; % [EDITED: +1 inserted]
c(a2) = c(a2) - 1;
c = cumsum(c);
c(end) = [];
If the values are not unique, instead of 1 and -1 histc determines the values:
q = 1:max(a(:, 2));
c = histc(a(:, 1), q);
c = c - histc(a(:, 2) + 1, q); % [EDITED: +1 inserted]
c = cumsum(c);
How efficient is simple FOR loop?
n = max(a(:, 2));
c = zeros(1, n + 1);
for ia = 1:length(a)
a1 = a(ia, 1);
c(a1) = c(a1) + 1;
a2 = a(ia, 2) + 1; % [EDITED: +1 inserted]
c(a2) = c(a2) - 1;
end
c(end) = [];
c = cumsum(c);
Converting this to C would be interesting also. Most of all the limitation of CUMSUM to the type double would not matter, such that UINT16 might reduce the data size, if sufficient.
2 Comments
Interesting timings:
N = 1e7; M = 1e6;
a = [randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
FOR-loop: 1.90 seconds
HISTC: 9.64 seconds (see also Brian B's HISTC solution)
CUMSUM: 1.71 seconds (columns of [a] must be unique!)
CONV: 41.00 seconds (Brian B's loop with HIS and CONV)
ACCUMARRAY: 1.83 seconds (Teja's CUMSUM/ACCUMARRAY)
The relation my change for the real data. But for the test data, the simple FOR loop can compete with the vectorized methods (R2009a/64, Win7).
Hamad
on 2 Apr 2013
Image Analyst
on 1 Apr 2013
Edited: Image Analyst
on 1 Apr 2013
2 votes
Are they known to be integers? I'd use histc(). You could do this in about 4 lines of code -- 1 (or maybe a 3 line for loop) to construct a 1D list of all numbers you gave in the seconds array you showed, 1 to construct the hist bin edges, one to call histc() and one to construct "b" from the result of histc(). Give it a try and come back if you have trouble.
1 Comment
Hi Hamad,
Well, how large is "extremely large"? Image Analyst might have given you the optimal solution already. As an alternative, you could go for a solution involving ACCUMARRAY or SPARSE, where you would use these integers as indices and a vector of ones as values. This way you get counts per index/integer. For example:
>> buffer = arrayfun(@(r) a(r,1):a(r,2), 1:size(a,1), 'UniformOutput', false) ;
>> buffer = [buffer{:}].' ;
>> counts = accumarray(buffer, ones(numel(buffer),1))
counts =
1
1
2
3
2
2
2
1
There is still room for improvement though (in particular, I guess that a PARFOR loop would be more efficient than ARRAYFUN for building the vector of all integers, but I kept the latter because my point was to illustrate ACCUMARRAY). If not efficient enough after you improve it, it is probably something that you could optimize writing a few lines of C/MEX, in particular for building the vector of all integers.
7 Comments
Hamad
on 2 Apr 2013
I'll keep thinking about this issue, but my (cheap) guess is that you might have to C/MEX the creation of the full array of integers.
About the homework .. I guess that both your questions could be homework questions if they were not dealing with all these GB of data ;-) So before we read the comments were the size appears, we mentally tag them as homework.
Image Analyst
on 2 Apr 2013
Why do you need that kind of resolution anyway? Can't you get by with a histogram that has less than a hundred million bins? What are you going to do with that result anyway? I seldom, if ever, see a need for beyond a thousand or so bins.
Well, I had 5 minutes and I just tested this, which is not too troublesome with ~3e7 ranges on my poor laptop..
d = diff(a, 1, 2) + 1 ;
[d_sort, id_sort] = sort(d) ;
d_range = min(d) : max(d) ;
id_buffer = cell(numel(d_range), 1) ;
for k = 1 : numel(d_range)
d_select = d_range(k) ;
n_rows = sum(d_sort == d_select) ;
buffer = ones(n_rows, d_select) ;
buffer(:,1) = a(id_sort(d_sort == d_select)) ;
buffer = cumsum(buffer, 2) ;
id_buffer{k} = reshape(buffer, 1, []) ;
end
buffer = [id_buffer{:}].' ;
counts = accumarray(buffer, ones(numel(buffer),1)) ;
I still used a cell array so you can easily PARFOR it, but I changed the approach thinking that we could operate by blocks of ranges with same size. It allows building blocks of ranges in one shot (with e.g. CUMSUM here), and it should work well if you have significantly fewer range sizes than ranges..
No time to explain further as I really had 10 minutes and I must go now, but let me know if you test it and have questions.
Cheers,
Cedric
Hamad
on 2 Apr 2013
Hamad
on 3 Apr 2013
Jan Simon's solution is interesting because cumsum can be though of as an IIR filter, and so this idea can further improve the FIR approach above, without looping:
N=1e8;
M=1e7;
a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];
idx = 1:max(a(:));
b = hist(a(:,1),idx);
b = b - hist(a(:,2)+1,idx);
b = cumsum(b);
This works even if the values are not distinct.
2 Comments
Jan
on 2 Apr 2013
I've posted a very similar solution, but without the "+1" in the 2nd histc() call - I've fixed my bug now.
You can avoid saving all of the intermediate intervals (e.g., [1 2 3 4]) if you just want the count:
a = [1 4;3 5;4 7;6 8];
b = zeros(max(max(a)),1);
parfor k=1:length(b)
b(k) = sum( a(:,1)<=k & k<=a(:,2) );
end
3 Comments
Or, if the intervals tend to be small compared to max(max(a)), you could iterate over interval length:
a = [1 4;3 5;4 7;6 8];
idx = 1:max(max(a));
while ~isempty(a)
b = b + hist(a(:,1),idx);
% Then adjust the a matrix by increasing the lower bound
% by one and removing any rows of a which are no longer valid intervals.
end
I'll let you work out the details, since I just noticed this is tagged as homework.
Hamad
on 2 Apr 2013
Categories
Find more on Linear Predictive Coding in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!