Find sets of consistent patterns with a variable pattern index

Suppose I have a matrix in which each row is one run of a clustering algorithm, each column is a data point, and each entry is the cluster label assigned to that data point in that run. The algorithm clusters the data points but does not use consistent labels across runs (i.e., data points that belong to the same cluster will typically be grouped together, with some probability, but, assuming there are 3 clusters, whether that cluster is tagged as 1, 2, or 3 varies from run to run).
For example:
X = [1 1 2 2 1 1 1 3 3 3;
1 1 2 2 1 1 1 3 3 3;
2 2 1 1 2 2 2 3 3 3;
3 3 2 2 3 3 3 1 1 1;
2 2 3 3 2 2 2 1 1 1];
In this matrix, columns [1 2 5 6 7] are always tagged with the same index number, columns [3 4] are always tagged with the same index number (but a different number than is used for the other clusters), and columns [8 9 10] are always tagged with the same index number (again different from the other two clusters).
Is there a way to identify which columns are consistently (or are probabilistically more likely to be) clustered together, ignoring the actual index that is used?
I've considered using find to index items within a row that are the same for each different cluster number and then using intersect to find the sets of column indexes which are consistent. I haven't, however, come up with an efficient method. Any suggestions would be greatly appreciated.
Thanks,
Dan
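One label-independent way to approach the question above is a co-association matrix: for every pair of columns, count the fraction of runs in which the two columns carry the same label, whatever that label is. The sketch below is only an illustration and not a method proposed in this thread. The variable names (C, thresh, groupId) and the 0.5 threshold are assumptions, and the greedy grouping step is a simple stand-in for a proper clustering of C.

```matlab
X = [1 1 2 2 1 1 1 3 3 3;
     1 1 2 2 1 1 1 3 3 3;
     2 2 1 1 2 2 2 3 3 3;
     3 3 2 2 3 3 3 1 1 1;
     2 2 3 3 2 2 2 1 1 1];

[nRuns, nPts] = size(X);

% C(i,j) = fraction of runs in which columns i and j share a label.
% Only equality within a row is tested, so the label values do not matter.
C = zeros(nPts);
for r = 1:nRuns
    row = X(r, :);
    C = C + bsxfun(@eq, row', row);   % nPts-by-nPts co-membership for run r
end
C = C / nRuns;

% Greedy grouping (assumed, simplistic): each still-unassigned column
% starts a new group and pulls in every unassigned column that co-occurs
% with it in more than thresh of the runs.
thresh = 0.5;                         % hypothetical cutoff
groupId = zeros(1, nPts);
nextId = 0;
for i = 1:nPts
    if groupId(i) == 0
        nextId = nextId + 1;
        groupId(C(i, :) > thresh & groupId == 0) = nextId;
    end
end

for g = 1:nextId
    fprintf('group %d: columns %s\n', g, mat2str(find(groupId == g)));
end
```

The entries of C can be read directly as the empirical probability that two columns are clustered together, which covers the "probabilistically more likely" case when the runs disagree.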
  5 Comments
Matt Kindig on 26 Jul 2012
Well, using my method would force column 1 to always be assigned to cluster 1, by definition. You can then count the number of rows that contain a 1 in each of columns 2-end to determine the probability of a match with column 1. Similarly, you could count the number of 2's that occur in columns 3-end to get the probability of a match with column 3, and so on.
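The method referred to in this comment is not shown in the thread. One relabeling with the stated property (column 1 always ends up in cluster 1) is to renumber the labels in each row by order of first appearance, and the sketch below assumes that is what is meant. The names Xc and pMatch1 are hypothetical, and the 'stable' option of unique requires a reasonably recent MATLAB release.

```matlab
X = [1 1 2 2 1 1 1 3 3 3;
     1 1 2 2 1 1 1 3 3 3;
     2 2 1 1 2 2 2 3 3 3;
     3 3 2 2 3 3 3 1 1 1;
     2 2 3 3 2 2 2 1 1 1];

% Relabel each row in order of first appearance, so that the label of
% column 1 becomes 1, the next new label becomes 2, and so on.
Xc = zeros(size(X));
for r = 1:size(X, 1)
    [~, ~, idx] = unique(X(r, :), 'stable');
    Xc(r, :) = idx(:).';              % force a row vector
end

% Fraction of runs in which each column shares the label of column 1.
pMatch1 = mean(Xc == 1, 1);
disp(Xc);
disp(pMatch1);
```

For this X every relabeled row comes out identical. When a run misassigns an early column, first-appearance relabeling shifts the numbering of the clusters that follow it, so the counts are best treated as a rough estimate.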
Image Analyst on 26 Jul 2012
Yes, Daniel, your explanation is what I was expecting, though your initial example didn't show that. However, your latest example does. So for X, if it did pick consistent cluster label numbers, it would have given Xmodified. But it doesn't. So the problem is: for any given row, let's say the last row, how do we know that the 3 in X should really be a 1, the 2 should stay a 2, and the 1 should really be a 3, versus already being the actual numbers they're supposed to be, as in the first 2 rows?
Or take the next-to-last row. It looks like the 1 is right but the 2 and 3 are swapped. Or are they all right, and the 2 and 3 are just misclassifications due to the probabilistic nature of your classification algorithm?

Answers (0)
