Intersect() with with repetition
54 views (last 30 days)
Show older comments
The syntax [C,ia,ib] = intersect(A,B,'rows') returns elements without repetitions. However, I need every potential combination of ia and ib that gives C. For example:
A=[1,1;1,1;1,2];
B=[0,1;1,1;1,1];
I need the output to be:
C=[1,1;1,1;1,1;1,1];
ia=[1;1;2;2];
ib=[2;3;2;3];
The answer here generates indcies into A and B that are intersecting elements: https://www.mathworks.com/matlabcentral/answers/338649-arrays-intersection-with-repetition However there are no correspondace between ia and ib generated by this method. e.g., A(ia,:)~=B(ib,:). it also doesn't give every potential combination of indices.
Any ideas?
Thanks
2 Comments
Jan
on 18 Aug 2022
C=[1,1;1,1;1,1;1,1];
ia=[1;1;2;2];
ib=[1;2;1;2];
This output does not satisfy the definition of interesct: C = A(ia,:), C = B(ib,:) . Therefore it is unclear, what the realtion between your A,B and C and ia and ib is.
This does not match "every potential combination of ia and ib that gives C" also.
Accepted Answer
Jan
on 9 Sep 2022
If you really need the redundant information in iAB_C, iAB_Cprime, idx1, idx2, this is faster than the original version tested with the data:
A = randi([0, 200], 1e6, 2);
B = randi([0, 200], 1e6, 2);
I get 1.12 s instead of 1.86 s.
function [C, iAB_C, iAB_Cprime, idx1, idx2] = repintersect_1b(A2, B2)
% Join data for faster sorting:
A = A2 * [4294967296; 1];
B = B2 * [4294967296; 1];
[uA, ~, iAx] = unique(A, "stable");
[a, idx] = sort(iAx);
aV = 1:numel(A);
aV = aV(idx).';
aI = RunLen(a);
[uB,~,iBx] = unique(B, "stable");
[b, idx] = sort(iBx);
bV = 1:numel(B);
bV = bV(idx).';
bL = cumsum([1, RunLen(b)]);
[C2, iuA, iuB] = intersect(uA, uB, "stable");
% Split data for the output:
C = [floor(C2 / 4294967296), rem(C2, 4294967296)];
iAB_C = cell(numel(iuA), 1);
a0 = 1;
for ii = 1:numel(iuA)
a1 = a0 + aI(ii); % Easier indexing for A
aa = aV(a0:a1 - 1);
a0 = a1;
na = numel(aa);
b0 = bL(iuB(ii)); % Need accumulated RunLength for B
b1 = bL(iuB(ii) + 1) - 1;
bb = bV(b0:b1);
nb = numel(bb);
qa = repmat(aa.', nb, 1); % Replace MESHGRID
qb = repmat(bb, 1, na);
iAB_C{ii} = [qa(:), qb(:)];
end
iAB_Cprime = cell2mat(iAB_C);
idx2 = cumsum(cellfun('size', iAB_C, 1)); % Faster than cellfun(@(x) size(x,1), C)
idx1 = [1; idx2(1:end-1) + 1];
end
% Helper function:
function n = RunLen(x)
d = [true; diff(x(:)) ~= 0]; % TRUE if values change
n = diff(find([d.', true])); % Indices of changes
end
I've omitted the temporarily used cell arrays.
Further time could be saved, if you do not create iAB_Cprime and the set of iAB_C and the indices, because both contain the same information.
3 Comments
Jan
on 11 Sep 2022
Vectorizing is not faster, if large arrays must be stored temporarily. I assume that vectorizing is not an advantage here.
"e.g., for 95% of the time na<=2, and 5% 2<na<=5, same goes for nb" - I've asked repeatedly for typical input arguments, because the runtimes for different approachs depend on the number and size of the overlaps.
"why did you replace the meshgrid method?" - see above: I've moved the code, which is performed inside meshgrid directly to the code to avoid the overhead of the input checks. Now it matters, how often meshgrid is called. In the input data I have invented, I see at least a small advantage. But maybe this is different for the original input data.
Another option is, that I have Matlab R2018b only. Maybe the current version uses a faster implementation of meshgrid.
As said before, the posted function wastes time and memory with producing 2 sets of redundant outputs, but this is, what you are asking for. Having realistic input data would allow for adjusting the code to increase the speed. A general purpose algorithm cannot exploit the nature of specific inputs.
More Answers (3)
Jan
on 18 Aug 2022
Edited: Jan
on 18 Aug 2022
A simple loop approach:
A = [1,1; ...
1,1; ...
1,2];
B = [0,1; ...
1,1; ...
1,1];
[C, iA, iB] = RepIntersectRows(A, B)
function [C, iA, iB] = RepIntersectRows(A, B)
% Sizes of inputs:
nA = size(A, 1);
nB = size(B, 1);
w = size(A, 2);
% Pre-allocate output:
C = zeros(nA * nB, w);
iA = zeros(nA * nB, 1);
iB = zeros(nA * nB, 1);
% Find matchs:
iC = 0;
for kA = 1:nA
a = A(kA, :);
for kB = 1:nB
if isequal(a, B(kB, :))
iC = iC + 1;
C(iC, :) = a;
iA(iC) = kA;
iB(iC) = kB;
end
end
end
% Crop unused elements:
C = C(1:iC, :);
iA = iA(1:iC);
iB = iB(1:iC);
end
If A and B have 1e4 rows, the runtime is 4 seconds already. But it is thought to clarify,what you exactly need. The output of your approach does not match the original question exactly. Before I optimize my code, I want to know, if it matchs your needs at all.
4 Comments
Jan
on 18 Aug 2022
@Yi-xiao Liu: Matlab's unique and intersect are based on binary searchs, while the loops (e.g. in my prove of concept code) perform a linear search.
Binary search means, that the data are sorted at first, such that you do not have to compare all elements, but log2 of the elements only by dividing the inteval of interest by 2 in each step.
I'm try to find a shorter and faster solution, when I'm at home.
the cyclist
on 18 Aug 2022
I frankly have not quite figured out the logic of what you want as output, but I'm pretty sure that you can construct what you want by using the ismember command instead of intersect. You'll probably need both "directions", and possibly all outputs:
A=[1,1;1,1;1,2];
B=[0,1;1,1;1,1];
[tf_ab, loc_ab] = ismember(A,B,"rows")
[tf_ba, loc_ba] = ismember(B,A,"rows")
2 Comments
the cyclist
on 18 Aug 2022
Ah, I get it now. Also, in my mind I was thinking that ismember had that third output like unique does, that gives the mapping back to all elements of the original vector.
Yi-xiao Liu
on 18 Aug 2022
11 Comments
Jan
on 9 Sep 2022
Edited: Jan
on 9 Sep 2022
@Yi-xiao Liu: The shown code multiplies the first column by 2^24 and adds the 2nd column. Of course you can do this with 2^32 also.
unique(A, 'rows') has to process 2 columns, while unique(AA) operates on 1 column only. This reduces the memory access and comparisons by 50% and in consequence uses about half of the processing time. The overhead by adding the columns and splitting them afterwards is negligible in this case, because it is linear in time (double data size => double computing time), while the sorting and the binary search are more expensive with growing data sizes.
Again: Should I guess, that
A = randi([0, 2^24-1], 1e6, 2);
B = randi([0, 2^24-1], 1e6, 2);
are typical inputs or is there a higher rate of overlaps as in:
A = randi([0, 32], 1e6, 2);
B = randi([0, 32], 1e6, 2);
You have mentioned the output size, but what are typical input sizes?
It is a certain amount of work to obtain exact explanations from you.
See Also
Categories
Find more on Logical in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!