# How many times does each number appear?

14 views (last 30 days)

Show older comments

Let a

a =

1 4

3 5

4 7

6 8

be integer intervals

[1 2 3 4]

[3 4 5]

[4 5 6 7]

[6 7 8]

I'm looking for an output b to count the number of times each element appears. It is clear that 1 appears one time, 2 appears one time, 3 appears twice, and so on...

So the output b would list the element and how many times it appears:

b =

1 1

2 1

3 2

4 3

5 2

6 2

7 2

8 1

The challenge is, a is an extremely large nx2 array, so using find even with parallel looping is inefficient. Is there a fast way to yield b given a? If it helps, we can be sure that the elements down the left-hand column of a are sorted, but not necessarily so for the right-hand column. Thanks in advance!

##### 2 Comments

### Accepted Answer

Teja Muppirala
on 2 Apr 2013

Assuming that the second column of a is always >= the first column of a, then this is an efficient solution.

mA = max(a(:));

b = (1:mA)';

b = [b cumsum(accumarray(a(:,1),1,[mA 1]) + ...

accumarray(min(a(:,2)+1,mA),-1,[mA 1]))];

b(end) = b(end) +1;

##### 1 Comment

Cedric Wannaz
on 2 Apr 2013

Edited: Cedric Wannaz
on 3 Apr 2013

+10! Just needs a small correction on the last element.

### More Answers (6)

Brian B
on 2 Apr 2013

Edited: Brian B
on 2 Apr 2013

Like Cedric, I keep thinking of iterating over interval length. Another way to look at the vector b is as the sum of the outputs of a bunch of FIR filters with rectangular impulse responses to appropriate inputs. If you cut the filter length in half at each iteration, you can avoid iterating over all n interval lengths, and instead loop about log2(n) times.

Here I generate intervals of up to 1000, so the loop runs about 10 times.

N=1e8;

M=1e7;

a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];

idx = 1:max(a(:));

d = diff(a, 1, 2) + 1 ; % interval length

s = a(:,1); % interval start point

L = ceil(max(d)/2);

b = 0*idx;

tic

while L>0

iL = (d>=L);

u = hist(s(iL),idx);

%b = b + conv(u, ones(1,L),'same');

b = b + filter(ones(1,L),1,u);

s(iL) = s(iL)+L;

d(iL) = d(iL)-L;

L = ceil(max(d)/2);

toc

end

This ran in 243 seconds on my machine; Cedric's solution exhausted my memory after 577 seconds....

EDIT: Replaced the call to conv() with a call to filter(), since conv() was returning the central part of the convolution, not the initial part, as we need.

Jan
on 2 Apr 2013

Edited: Jan
on 2 Apr 2013

Are the values in the first and second column of a distinct? If so:

c = zeros(1, max(a(:, 2)) + 1);

c(a(:, 1)) = 1;

a2 = a(:, 2) + 1; % [EDITED: +1 inserted]

c(a2) = c(a2) - 1;

c = cumsum(c);

c(end) = [];

If the values are not unique, instead of 1 and -1 histc determines the values:

q = 1:max(a(:, 2));

c = histc(a(:, 1), q);

c = c - histc(a(:, 2) + 1, q); % [EDITED: +1 inserted]

c = cumsum(c);

How efficient is simple FOR loop?

n = max(a(:, 2));

c = zeros(1, n + 1);

for ia = 1:length(a)

a1 = a(ia, 1);

c(a1) = c(a1) + 1;

a2 = a(ia, 2) + 1; % [EDITED: +1 inserted]

c(a2) = c(a2) - 1;

end

c(end) = [];

c = cumsum(c);

Converting this to C would be interesting also. Most of all the limitation of CUMSUM to the type double would not matter, such that UINT16 might reduce the data size, if sufficient.

Image Analyst
on 1 Apr 2013

Edited: Image Analyst
on 1 Apr 2013

Cedric Wannaz
on 1 Apr 2013

Edited: Cedric Wannaz
on 1 Apr 2013

Hi Hamad,

Well, how large is "extremely large"? Image Analyst might have given you the optimal solution already. As an alternative, you could go for a solution involving ACCUMARRAY or SPARSE, where you would use these integers as indices and a vector of ones as values. This way you get counts per index/integer. For example:

>> buffer = arrayfun(@(r) a(r,1):a(r,2), 1:size(a,1), 'UniformOutput', false) ;

>> buffer = [buffer{:}].' ;

>> counts = accumarray(buffer, ones(numel(buffer),1))

counts =

1

1

2

3

2

2

2

1

There is still room for improvement though (in particular, I guess that a PARFOR loop would be more efficient than ARRAYFUN for building the vector of all integers, but I kept the latter because my point was to illustrate ACCUMARRAY). If not efficient enough after you improve it, it is probably something that you could optimize writing a few lines of C/MEX, in particular for building the vector of all integers.

Brian B
on 2 Apr 2013

Edited: Brian B
on 2 Apr 2013

Jan Simon's solution is interesting because cumsum can be though of as an IIR filter, and so this idea can further improve the FIR approach above, without looping:

N=1e8;

M=1e7;

a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];

idx = 1:max(a(:));

b = hist(a(:,1),idx);

b = b - hist(a(:,2)+1,idx);

b = cumsum(b);

This works even if the values are not distinct.

##### 2 Comments

### See Also

### Categories

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!