k-means with NaNs in input

Naively, I would have thought one could run k-means on data with missing elements without imputation: one would only need to calculate the distance between two vectors, e.g. by dropping the dimensions in which either vector has a missing element.
However, all the MATLAB implementations seem to require vectors without missing data.
Is there a version of k-means (hopefully k-means++) which allows missing values?
---
Update: I just found a thread on this on MathWorks, but it did not answer the question properly because it does not seem to recognize that I don't want to impute! (The thread: http://www.mathworks.com/matlabcentral/newsreader/view_thread/295929)
So let me try to explain. With "real world" data vectors, there are often missing elements. The information about similarity between two vectors lies in a small number of key differences, so imputing values distorts the answer. But the pairwise difference between two vectors can still be computed when elements are missing, for example by leaving out the dimensions that are not present in both vectors.
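A minimal sketch of that pairwise computation, assuming Euclidean distance. The vectors x and y are made-up values, and the rescaling step at the end is an optional assumption rather than anything established in this thread:

```matlab
% Two example vectors with missing entries (values are made up for illustration).
x = [1.0, NaN, 3.0, 4.0, NaN];
y = [2.0, 5.0, NaN, 6.0, 1.0];

% Dimensions that are observed in BOTH vectors.
shared = ~isnan(x) & ~isnan(y);

% Euclidean distance over the shared dimensions only.
d_raw = sqrt(sum((x(shared) - y(shared)).^2));

% Optional rescaling so that pairs with different numbers of shared
% dimensions are comparable. Assumption: the squared differences in the
% missing dimensions resemble those in the observed ones.
% If no dimension is shared, nnz(shared) is 0 and the result is NaN.
n = numel(x);
d_scaled = sqrt(n / nnz(shared) * sum((x(shared) - y(shared)).^2));
```

Without the rescaling, a pair that shares only a few dimensions tends to look closer than a pair that shares many, simply because fewer terms are summed.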
So, for example, you can build a dissimilarity matrix from a set of vectors that contain missing elements. You can also fit self-organizing maps (SOMs) to these vectors (the cluster vectors are complete, but the data vectors may have NaNs).
I would think there would be a flavor of k-means where the "mean vectors" are complete but the distance between the mean vectors and the data vectors is calculated as described above.
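A rough sketch of what such a variant could look like, using only base MATLAB. This is an illustration under stated assumptions, not an existing toolbox function: the toy data, the hand-picked starting centroids C, and the rule of keeping the old centroid value when a cluster has no observed value in some dimension are all made up here. It also assumes every data vector has at least one observed entry.

```matlab
% Toy data: rows are data vectors, NaN marks a missing entry (made-up values).
X = [0.0 0.1 NaN;
     0.2 NaN 0.1;
     NaN 0.0 0.2;
     5.0 5.1 NaN;
     5.2 NaN 4.9;
     NaN 5.0 5.1];
k = 2;
n = size(X, 1);
obs = ~isnan(X);           % mask of observed entries
X0 = X;  X0(~obs) = 0;     % zero-filled copy, only ever used together with the mask

% Deterministic start for this sketch: complete centroids chosen by hand.
% (k-means++ seeding would replace this line.)
C = [0 0 0; 1 1 1];

for iter = 1:100
    % Assignment step: mean squared difference over observed dimensions only.
    D = zeros(n, k);
    for j = 1:k
        diff2 = (X0 - repmat(C(j, :), n, 1)).^2;
        diff2(~obs) = 0;                      % missing entries contribute nothing
        D(:, j) = sum(diff2, 2) ./ sum(obs, 2);
    end
    [~, idx] = min(D, [], 2);

    % Update step: each centroid coordinate is the mean of the observed
    % values assigned to that cluster; keep the old value if none are observed.
    Cnew = C;
    for j = 1:k
        members = (idx == j);
        cnt = sum(obs(members, :), 1);
        tot = sum(X0(members, :), 1);
        Cnew(j, cnt > 0) = tot(cnt > 0) ./ cnt(cnt > 0);
    end

    if isequal(Cnew, C), break; end           % centroids stopped moving
    C = Cnew;
end
```

The centroids stay complete because each coordinate is averaged over whichever members have that coordinate observed. Dividing by the number of observed dimensions does not change which centroid is nearest for a given vector, since the divisor is the same for every centroid.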
---
I am not wedded to k-means; feel free to recommend alternative methods if k-means is incompatible with missing elements in the input.
---
Thanks,
Al.

Answers (2)

Drop the missing values then:
newdata = data(~isnan(data));
Oleg

1 Comment

Al R on 22 Jan 2011
Wouldn't this just drop all the dimensions that aren't complete?
My input data is 99 vectors of length 64. 63 of the dimensions have at least one missing element across the 99 vectors.
I probably don't understand, so please explain.
Thanks, Al
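To make the concern concrete, here is a small sketch on a made-up 3-by-4 matrix (rows are vectors, columns are dimensions). It shows what logical indexing with a full-size mask returns, and what is left when only complete columns or complete rows are kept. The variable names are illustrative.

```matlab
% Made-up data: 3 vectors of length 4, NaN marks a missing element.
data = [1   NaN  3    4;
        5   6    NaN  8;
        9   10   11   12];

% Logical indexing with a full-size mask flattens the result into a
% column vector, so the row/column structure is lost.
flat = data(~isnan(data));

% Keeping only complete columns discards every dimension that has a NaN
% in any vector.
completeCols = data(:, ~any(isnan(data), 1));

% Keeping only complete rows discards every vector that has any NaN.
completeRows = data(~any(isnan(data), 2), :);

disp(size(flat));          % 10 1
disp(size(completeCols));  % 3 2
disp(size(completeRows));  % 1 4
```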

Walter Roberson on 22 Jan 2011

You cannot calculate the distance between two vectors when you are missing information for one of the coordinates. You cannot simply omit the coordinate with the missing data, because the missing value could have been anything in the permitted range for that variable, so you can only calculate a range of values for the distance. If the missing coordinate could be infinite, the range of values for the vector distance could be infinite. The probability that the coordinate was very different from the others might be quite low, but it still might have happened.
If you have a model distribution for the missing coordinate, you could of course impute the mean of the distribution or the point of greatest likelihood in place of the missing data, but your question implies you do not want to do that. Still, a model for the missing data would allow you to set confidence intervals on the range of contributions of the missing data.
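As a small illustration of the "range of values" point, here is a sketch that bounds the Euclidean distance when one vector has a missing coordinate. The vectors and the permitted range [lo, hi] are made-up assumptions, and the sketch assumes only x has missing entries while y is complete.

```matlab
x = [1 NaN 4];       % made-up vector with one missing coordinate
y = [2 3   6];       % made-up complete vector
lo = 0;  hi = 10;    % assumed permitted range for the missing coordinate

known = ~isnan(x);
m = ~known;          % positions where x is missing

% Contribution of the coordinates that are known in both vectors.
base = sum((x(known) - y(known)).^2);

% Smallest possible gap: zero when y's value lies inside [lo, hi],
% otherwise the distance from y's value to the nearest end of the range.
gapMin = max(0, max(lo - y(m), y(m) - hi));

% Largest possible gap: distance from y's value to the farther end.
gapMax = max(abs(lo - y(m)), abs(hi - y(m)));

dMin = sqrt(base + sum(gapMin.^2));
dMax = sqrt(base + sum(gapMax.^2));
```

Omitting the missing coordinate reports the lower end of this interval whenever y's value lies inside the permitted range, which is why the omission can understate the distance.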

Asked:

on 22 Jan 2011
