pidst and sparse vectors/graphs

This topic has been permanently closed and transferred to MATLAB Answers.


The pdist function allows the user to specify the function coding the similarity between rows of matrices (called DISTFUN in the documentation).
With the increasing diffusion of large datasets, techniques to sparsify graphs are increasingly being explored and used in practical applications. It is easy to code one own's DISTFUN such that it returns a sparse vector. Unfortunately, pdist (and pdist2) will return a dense vector in the output, which for very large graphs will cause an out of memory error. The offending code is
(lines 434 etc.)
% Make the return have whichever numeric type the distance function
% returns, or logical.
if islogical(Y)
Y = false(1,n*(n-1)./2);
else % isnumeric
Y = zeros(1,n*(n-1)./2, class(Y));
end
To have pdist return a sparse vector, the only modification that is required is
if islogical(Y)
Y = false(1,n*(n-1)./2);
elseif issparse(Y)
Y = sparse(1,n*(n-1)./2);
else % isnumeric
Y = zeros(1,n*(n-1)./2, class(Y));
end
It is a bit more work to modify squareform to produce a sparse matrix, given a sparse vector produced by the modified pdist. Squareform includes several checks on the inputs, but the core functionality for sparse vectors would be given by something like
% given a sparse vector d returned by pdist, compute a sparse squareform
[~,j,v] = find(d);
[m,n] = pdist_ind2sub(j, nobs);
W = sparse(m, n, v, nobs, nobs);
W = W + W';
Here, pdist_ind2sub is a function that given a set of indices into a vector produced by pdist, returns the corresponding subscripts in a (triangular) matrix. Computing this requires information about the number of observations given to pdist, i.e. what was n in the preceding code. I could not figure out a way to use the function adjacency to accomplish this.