Find the K most orthogonal vectors in a set of vectors
    4 views (last 30 days)
  
       Show older comments
    
Hello All,
The context of this particular search is a step in tuning a spectral clustering routine a la Ng et al 2002. The purpose of this search is to give a initialization point for k-means clustering in higher dimensional space. In particular, well separated data ought to sit in K tight clusters on the surface of a hypersphere, so the purpose of this search is to find the locations of these cluster centroids with which to initialize k-means clustering of the same data.
I have a data matrix "Y" with O(10^4) rows and O(10^2-10^3) columns (the columns of this matrix are the [transformed and normalized a couple times] K largest eigenvectors of the affinity matrix).
Ng et al suggest "Briefly, we let the first cluster centroid be a randomly chosen row of Y, and then repeatedly choose as the next centroid the row of Y that is closest to being 90 degrees from all the centroids already picked." I translated this mathspeak to mean I need to take a bunch of dot products and look for values close to zero. This was quoted as computationally cheap (perhaps it is for clustering fewer points into say O(10^1) clusters or perhaps they meant it in the sense that it requires fewer iterations of k-means clustering once initialized), but my CPU is dragging ass at it.
So far I've tried 2 approaches: Approach #1 - Compute everything then search
% "cheap" initialization of k-means
dotProductY = zeros(length(Y)); %preallocate to make the parser turn green
% compute dot product of every row with every other row first
for k = 1:length(Y)
  dotProductY(:,k) = sum(bsxfun(@times,Y(k,:),Y),2);
end
dotProductY(logical(eye(length(Y)))) = nan; %exclude dot product of row with self
centroidIdx = randi(length(Y)); %initialize on a random row of Y
dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
dotProductY = abs(dotProductY); %use the absolute value because looking for closer to zero
for k = 2:K
  [~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid
  centroidIdx(k) = im; %reassign
  dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
end
Approach #2 - Simultaneous computation and search
    %try a cheaper one?
    centroidIdx = randi(length(Y)); %initialize on a random row of Y
    for k = 1:K-1
        dotProductY(:,k) = sum(bsxfun(@times,Y(centroidIdx(k),:),Y),2); %compute inner product 
        dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
        [~,im] = min(sum(abs(dotProductY),2)); %find next best centroid
        centroidIdx(k+1) = im; %reassign
    end
Neither of these approaches seems cheap to me. Anyone else take a stab at this before? Any suggestions?
0 Comments
Accepted Answer
  Matt J
      
      
 on 19 May 2016
        
      Edited: Matt J
      
      
 on 19 May 2016
  
      Your computation of dotProductY could be more efficient. The most vectorized may of computing it, I believe is
   dotProductY=abs(Y*Y.');
I expect that would have been the main bottleneck.
2 Comments
  Matt J
      
      
 on 19 May 2016
				
      Edited: Matt J
      
      
 on 22 May 2016
  
			This part
for k = 2:K
    [~,im] = min(sum(dotProductY(:,centroidIdx),2)); %find next best centroid
    centroidIdx(k) = im; %reassign
    dotProductY(centroidIdx,:) = nan; %dont pick the same row twice
  end
also looks like it could be incrementalized as follows
im=randi(size(Y,1));
centroidIdx(1)=im;
temp=dotProductY(:,im);
   for k = 2:K
      [~,im] = min(temp); %find next best centroid
      centroidIdx(k) = im; %reassign
      temp=temp+dotProductY(:,im); %update temp
        temp(im) = inf; %dont pick the same row twice
    end
More Answers (0)
See Also
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!
