I have a n*n matrix and want to group the numbers together in different classes. I will explain it with a 5*5 matrix. The upper and lower diagonal values are same while the diagonal matrix is always 1. This is a correlation matrix of pairs where (1,2)=(2,1). I have a threshold of 0.9 so whichever value is greater than 0.9 then both the values are clustered in the same group. I will start with row 1. I want to select the values greater than 0.9 which in this case is only 4th. Now I will group (1,4) in group A. I will also in this same iteration find the least value which should be less than 0.9. Here it is 3rd entry which is 0.75. So for the column 3 I move to row 3 in the second iteration. Now I will repeat the same process but I have to only group remaining columns which are not yet grouped which is 2,3,5 by the same rule. In this iteration only 5th is greater than 0.9 so I have (3,5) as the pair and the least value is (3,1) but since I have to only check 2,3,5 (1,4 are already grouped in A) I will look for minimum out of these two (2,5) which is 2. Now I move to row 2. In my algorithm now 2 is the only last remaining factor I'd look for the maximum value in the second row which is (2,5) and will group this remaining 2 with (2,3,5). I will repeat this process till all the items are grouped in groups A, B, C.... and so on. So to group the last remaining row number I just group them with the best value they have row number.
There would not be any case where a factor is common i.e they are in two groups. For example after (1,4) so when we do iteration for 3rd row then 4th column would not have greater value than 0.9 (most likely) because the factors are from certain image features which associates to each other in a way that since 1 is not associated with 3 very well and good with 4 hence 3 would also be not very well associateed with 4.
Please let me know if you have any questions as I think this may confuse you pretty much. Please let me know if you can let me know a simple algorithm which works for this. Thanks in advance.
My final output group will be (1,4),(2,3,5)
0001 0.88 0.75 0.91 0.79
0.88 0001 0.76 0.74 0.97
0.75 0.76 0001 0.76 0.99
0.91 0.74 0.76 0001 0.80
0.79 0.97 0.99 0.80 0001

3 Comments

Is that matrix you show the input or the output? The explanation you give is very difficult to convert to an actual Matlab code. Can you try to describe input and output, and describe the rules in terms of Matlab of how to get from one to the other?
I didn't understand the meaning of So for the column 3 I move to row 3 in the second iteration. I think you need to explain what happens on the 2nd step as well.
It would also help if you showed what is the final result for the above matrix.
@Rik, @Guillaume I have updated the question with including the output and also the second iteration. Please let me know if you still could not get the question. Thank you.

Sign in to comment.

 Accepted Answer

I'm still not entirely clear on your whole algorithm, I think the following do what you want:
function clusters = cluster(m, threshold)
columns = 1:size(m, 2); %keep track of column indices remaining. clustered columns are removed from m when they are clustered
clusters = {};
currentrow = 1; %start on 1st row
while ~isempty(m)
tocluster = m(currentrow, :) > threshold; %cluster columns whose value is above threshold on current row. Note that since m(currentrow, currentrow) is 1, it's always included in the cluster
if sum(~tocluster) == 1 %if there's only one column left to cluster afterward
tocluster = true(size(tocluster)); %then include it in the current cluster
end
clusters{end + 1} = columns(tocluster); %#ok<AGROW> %get actual index of columns to cluster
[~, newrow] = min(m(currentrow, :)); %find next row to start clustering from
currentrow = columns(newrow);
m = m(:, ~tocluster); %get rid of columns that have been clustered.
columns = columns(~tocluster);
end
end
I certainly didn't understand what to do when there's only one column/row left to cluster. In the above it's included in the last cluster.

14 Comments

Thank you for this. I tried my algorithm with hand on a small dataset and when I reached the last column all of them were clustered but I agree with you that it is a glitch. I do not know how to solve it. I am trying your algorithm on a small dataset to see if it works. I believe I should pass the argument m as matrix and threshold which in my case is 0.9. Am I getting it right?
Indeed.
Okay. I am not able to understand the output. Here it is but what does it say?
ans =
1×6 cell array
{1×9 double} {[25]} {[4]} {[23]} {1×6 double} {1×7 double}
Call it with:
clusters = cluster(yourm, yourthreshold)
then
celldisp(clusters)
if you want to see what's in each cluster. The output is a cell array, clusters{1} is the 1st cluster, cluster{2} the 2nd one, etc.
In your case, the 1st cluster has 9 elements, the 2nd one only 1 element (row/column 25), 3rd one only 1 element (row/column 4), etc.
Yes, I see now. Thank you. It is working in my favor so far and the shortcomings of algorithm is foreseen as it was just my intuition which works for now. Do you think anything could be done to achieve the last row problem which you talked about? Also, is there a method in which when I import the images I do not lose any structural information as my images are becoming a bit hazy after reading them. And any way in which I could send this images according to clusters in different numbered folders? Nevertheless, this is also a great result. Can not thank you enough :)
Do you think anything could be done to achieve the last row problem which you talked about?
Possibly, but I've not understood what the actual problem is and what you want to do. As I didn't understand, I just put it in the last cluster as it was the easiest.
As for your other question, I have no idea how you go from images to a matrix. You're probably better off starting a new question for that anyway.
Sure. I will ask in a separate question for the second problem.
For the first, the matrix contains the ssim() numbers between two images hence the numbers above the diagonal and below the diagonal are equal. I wanted to group images with similar structure in one group and hence came up with this algorithm. So in this case whichever images have more than 0.9 ssim I was keeping them in the same group.
Suppose (1,2,3) are grouped and (1,4,5) are grouped then I can not decide which group would 1 go to. And if I keep iterating each row what happens if the coming row is already been grouped? So I came up with this idea that I will iterate with the least value in the next step. One thing I can think of is my algorithm could terminate if at the present row the least value is greater than 0.9. What do you think? Let me know if I am not clear.
Dear @Guillaume, do you think this algorithm converges? So far I ran the code with few image sets and it does but not sure if mathematically it converges too.
Hi @Guillaume, I was running the code with big matrix and it works fine however I tried with a small 4*4 matrix to see if its working as anticipated. Look at this matrix for example.
1.0000 0.9185 0.7956 0.8323
0.9185 1.0000 0.8004 0.8340
0.7956 0.8004 1.0000 0.8401
0.8323 0.8340 0.8401 1.0000
If my threshold is 0.9 then in the first iteration Img1 and Img2 are paired together and since the lowest value is Img3 it will go to row 3 in the next iteration. From here col1 and col2 are already removed from iteration so it won't find any value greater than 0.9 and hence will go to the lowest which is Img4 while not pairing with it and Img3 would be loner. In the next and last iteration Img 4 is the only one left so my final clusters should be (1,2),(3),(4).
But by the code I am getting (1,2) and (3,4). Could you please help me here? I can see that your question was the same that what to do with the last left one and you include it in the last cluster but I'd like it to be clustered separately since it had none of the pairs where threshold was met. Hope you understand this example. If not, please let me know.
^^ I tried making changes at this line so that it just takes the last if left alone in a single cluster otherwise it may so happen that in end there are two or more and they paired anyway. But it is not working
if sum(~tocluster) == 1
tocluster = true(1);
end
You just have to completely remove that if if you want the possible last row to be in its own cluster:
function clusters = cluster(m, threshold)
columns = 1:size(m, 2); %keep track of column indices remaining. clustered columns are removed from m when they are clustered
clusters = {};
currentrow = 1; %start on 1st row
while ~isempty(m)
tocluster = m(currentrow, :) > threshold; %cluster columns whose value is above threshold on current row. Note that since m(currentrow, currentrow) is 1, it's always included in the cluster
clusters{end + 1} = columns(tocluster); %#ok<AGROW> %get actual index of columns to cluster
[~, newrow] = min(m(currentrow, :)); %find next row to start clustering from
currentrow = columns(newrow);
m = m(:, ~tocluster); %get rid of columns that have been clustered.
columns = columns(~tocluster);
end
end
Great thank you so much. Just one last doubt. I have read that AGROW is included to ignore warnings like initialising array. Is it also here for the same purpose?
The %#ok<AGROW> suppresses a warning in the editor that the array cluster grows inside the loop. It is typically a valid warning, you can usually preallocate the array. In this case however, you don't in advance how many clusters there will be, so you do have to grow the array in the loop. It is a cell array however and the number of clusters is not likely to be big so the impact on performance will be negligible.
If there is going to be a significant amount of clusters such that the array growing is having an impact then the alternative is to preallocate an array large enough (max size it can be is the number of rows) and trim it at the end:
function clusters = cluster(m, threshold)
columns = 1:size(m, 2); %keep track of column indices remaining. clustered columns are removed from m when they are clustered
clusters = cell(1, size(m, 1)); %preallocate cell array. Probably too big.
currentrow = 1; %start on 1st row
clusteridx = 0;
while ~isempty(m)
clusteridx = clusteridx + clusteridx + 1;
tocluster = m(currentrow, :) > threshold; %cluster columns whose value is above threshold on current row. Note that since m(currentrow, currentrow) is 1, it's always included in the cluster
clusters{clusteridx} = columns(tocluster); %get actual index of columns to cluster
[~, newrow] = min(m(currentrow, :)); %find next row to start clustering from
currentrow = columns(newrow);
m = m(:, ~tocluster); %get rid of columns that have been clustered.
columns = columns(~tocluster);
end
cluster = cluset(1:clusteridx); %trim cell array to portion used
end
This will temporarily use more memory for a possible marginal speed gain.
Thanks a lot! In that case I will use the AGROW version of code if I may need it to be more faster for a lot of images. Thanks again.

Sign in to comment.

More Answers (1)

Bruno Luong
Bruno Luong on 6 Jul 2019
It seems like you you want to cluster indexes according to correlation. In this case why not threshold you matrix, that gives some sort of connectivity graph, then using some graph technique to get the connex components.
There a a bunch of such function in File Exchange.

Categories

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!