Maximize similarity of vectors

I have vectors that represent the categorical status of a set of "nodes" -- one vector entry per node. (The actual number value is meaningless-- only the fact that the nodes belong to the same versus different category.)
I produce output vectors as below, which are all the same length (i.e., 10 nodes) and each entry represents the category for a particular node. Note that v1 has 3 possible categories while v2 has 4.
v1 = [1 1 2 1 2 3 3 1 1 2]; %so nodes 1, 2, 4, 8, and 9 all belong to the same category, designated as 1
v2 = [2 2 4 2 4 1 4 2 2 3]; %again nodes 1, 2, 4, 8, and 9 all belong to the same category, but it just happens to be designated as a 2
I want v2 to be as similar to v1 as possible, for example by swapping around the categorical designations, while not losing the grouping structure within each vector. Obviously, there is not one right answer, but there should be an optimal one...
By eye, it would result in something like: output_v1 = [1 1 2 1 2 3 3 1 1 2]; output_v2 = [1 1 2 1 2 3 2 1 1 4];
...where for v2, all the 2s became 1s, all the 4s became 2s, all the 1s became 3s, and the 3 became a 4. Though, importantly, not all 3s in v1 will correspond to the same number in v2 (here, 3 or 2).
I've attempted this various ways (sorting and renaming based on category frequency, minimizing euclidean distance after partially permuting, completely randomly permuting), but my solutions are less good (and more time consuming) as I add more vectors and nodes. I feel as if someone has probably already worked out the best way to do this a long time ago. Any suggestions for algorithms or keywords to search for would be much appreciated!

Answers (1)

Cedric
Cedric on 27 Mar 2013
Edited: Cedric on 27 Mar 2013
I've never done anything like that, but you picked my curiosity so I gave it a quick try..
v1 = [1 1 2 1 2 3 3 1 1 2] ;
v2 = [2 2 4 2 4 1 4 2 2 3] ;
nCat = min(max(v1), max(v2)) ;
[~, lut1] = sort(accumarray(v1.', ones(size(v1))), 'descend') ;
[~, lut2] = sort(accumarray(v2.', ones(size(v2))), 'descend') ;
v2_reindexed = v2 ;
for k = 1 : nCat, v2_reindexed(v2 == lut2(k)) = lut1(k) ; end
for k = nCat+1 : max(v2), v2_reindexed(v2 == lut2(k)) = k ; end
This produces
v2_reindexed =
1 1 2 1 2 3 2 1 1 4
What we do here is to rank indices/categories in both v1 and v2 based on their occurrence (counts), and we build lookup tables ( lut1, lut2 ) that relate these rankings to actual category IDs. We then loop over categories in v2 from highest rank to lowest, and we reindex them using v1's categories taken from highest rank to lowest as well. Huh, difficult to explain clearly, but look..
>> accumarray(v1.', ones(size(v1)))
ans =
5
3
2
>> accumarray(v2.', ones(size(v2)))
ans =
1
5
1
3
here, calls to ACCUMARRAY generate counts of categories for both v1 and v2. Now if we SORT these vectors..
>> [~, lut1] = sort([5; 3; 2], 'descend')
lut1 =
1
2
3
>> [~, lut2] = sort([1; 5; 1; 3], 'descend')
lut2 =
2
4
1
3
.. and take the 2nd output arg (which gives the sort order), we can use it as a lookup table to get categories in descending order of counts for both v1 and v2. The first element of lut1 being 1 tells that the first category in v1 had the largest count. The first element of lut2 being 2 tells that the 2nd category in v2 had the largest count.
Now the only thing that is left to us is to change all entries being equal to the largest category in v2 and make them equal to the largets category in v1, which is done with the first FOR loop. Finally, remaining categories in v2 have to be reindexed with additional integers.

Products

Asked:

on 27 Mar 2013

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!