# How many times does each number appear?

11 views (last 30 days)

Show older comments

Let a

a =

1 4

3 5

4 7

6 8

be integer intervals

[1 2 3 4]

[3 4 5]

[4 5 6 7]

[6 7 8]

I'm looking for an output b to count the number of times each element appears. It is clear that 1 appears one time, 2 appears one time, 3 appears twice, and so on...

So the output b would list the element and how many times it appears:

b =

1 1

2 1

3 2

4 3

5 2

6 2

7 2

8 1

The challenge is, a is an extremely large nx2 array, so using find even with parallel looping is inefficient. Is there a fast way to yield b given a? If it helps, we can be sure that the elements down the left-hand column of a are sorted, but not necessarily so for the right-hand column. Thanks in advance!

##### 2 Comments

Jan
on 2 Apr 2013

### Accepted Answer

Teja Muppirala
on 2 Apr 2013

Assuming that the second column of a is always >= the first column of a, then this is an efficient solution.

mA = max(a(:));

b = (1:mA)';

b = [b cumsum(accumarray(a(:,1),1,[mA 1]) + ...

accumarray(min(a(:,2)+1,mA),-1,[mA 1]))];

b(end) = b(end) +1;

##### 1 Comment

### More Answers (6)

Brian B
on 2 Apr 2013

Edited: Brian B
on 2 Apr 2013

Like Cedric, I keep thinking of iterating over interval length. Another way to look at the vector b is as the sum of the outputs of a bunch of FIR filters with rectangular impulse responses to appropriate inputs. If you cut the filter length in half at each iteration, you can avoid iterating over all n interval lengths, and instead loop about log2(n) times.

Here I generate intervals of up to 1000, so the loop runs about 10 times.

N=1e8;

M=1e7;

a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];

idx = 1:max(a(:));

d = diff(a, 1, 2) + 1 ; % interval length

s = a(:,1); % interval start point

L = ceil(max(d)/2);

b = 0*idx;

tic

while L>0

iL = (d>=L);

u = hist(s(iL),idx);

%b = b + conv(u, ones(1,L),'same');

b = b + filter(ones(1,L),1,u);

s(iL) = s(iL)+L;

d(iL) = d(iL)-L;

L = ceil(max(d)/2);

toc

end

This ran in 243 seconds on my machine; Cedric's solution exhausted my memory after 577 seconds....

EDIT: Replaced the call to conv() with a call to filter(), since conv() was returning the central part of the convolution, not the initial part, as we need.

Jan
on 2 Apr 2013

Edited: Jan
on 2 Apr 2013

Are the values in the first and second column of a distinct? If so:

c = zeros(1, max(a(:, 2)) + 1);

c(a(:, 1)) = 1;

a2 = a(:, 2) + 1; % [EDITED: +1 inserted]

c(a2) = c(a2) - 1;

c = cumsum(c);

c(end) = [];

If the values are not unique, instead of 1 and -1 histc determines the values:

q = 1:max(a(:, 2));

c = histc(a(:, 1), q);

c = c - histc(a(:, 2) + 1, q); % [EDITED: +1 inserted]

c = cumsum(c);

How efficient is simple FOR loop?

n = max(a(:, 2));

c = zeros(1, n + 1);

for ia = 1:length(a)

a1 = a(ia, 1);

c(a1) = c(a1) + 1;

a2 = a(ia, 2) + 1; % [EDITED: +1 inserted]

c(a2) = c(a2) - 1;

end

c(end) = [];

c = cumsum(c);

Converting this to C would be interesting also. Most of all the limitation of CUMSUM to the type double would not matter, such that UINT16 might reduce the data size, if sufficient.

##### 2 Comments

Jan
on 2 Apr 2013

Edited: Jan
on 2 Apr 2013

Interesting timings:

N = 1e7; M = 1e6;

a = [randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];

FOR-loop: 1.90 seconds

HISTC: 9.64 seconds (see also Brian B's HISTC solution)

CUMSUM: 1.71 seconds (columns of [a] must be unique!)

CONV: 41.00 seconds (Brian B's loop with HIS and CONV)

ACCUMARRAY: 1.83 seconds (Teja's CUMSUM/ACCUMARRAY)

The relation my change for the real data. But for the test data, the simple FOR loop can compete with the vectorized methods (R2009a/64, Win7).

Image Analyst
on 1 Apr 2013

Edited: Image Analyst
on 1 Apr 2013

Cedric
on 1 Apr 2013

Edited: Cedric
on 1 Apr 2013

Hi Hamad,

Well, how large is "extremely large"? Image Analyst might have given you the optimal solution already. As an alternative, you could go for a solution involving ACCUMARRAY or SPARSE, where you would use these integers as indices and a vector of ones as values. This way you get counts per index/integer. For example:

>> buffer = arrayfun(@(r) a(r,1):a(r,2), 1:size(a,1), 'UniformOutput', false) ;

>> buffer = [buffer{:}].' ;

>> counts = accumarray(buffer, ones(numel(buffer),1))

counts =

1

1

2

3

2

2

2

1

There is still room for improvement though (in particular, I guess that a PARFOR loop would be more efficient than ARRAYFUN for building the vector of all integers, but I kept the latter because my point was to illustrate ACCUMARRAY). If not efficient enough after you improve it, it is probably something that you could optimize writing a few lines of C/MEX, in particular for building the vector of all integers.

##### 7 Comments

Brian B
on 2 Apr 2013

Edited: Brian B
on 2 Apr 2013

Jan Simon's solution is interesting because cumsum can be though of as an IIR filter, and so this idea can further improve the FIR approach above, without looping:

N=1e8;

M=1e7;

a=[randi(M,N,1) randi(1e3,N,1)]*[1 1; 0 1];

idx = 1:max(a(:));

b = hist(a(:,1),idx);

b = b - hist(a(:,2)+1,idx);

b = cumsum(b);

This works even if the values are not distinct.

##### 2 Comments

Jan
on 2 Apr 2013

### See Also

### Categories

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!