k-means with NaNs in input
Naively, I would have thought one could run k-means on data with missing elements without imputation: one only needs to calculate the distance between two vectors (e.g. by dropping the dimensions with missing elements).
However, all the MATLAB implementations seem to require vectors without missing data.
Is there a version of k-means (hopefully k-means++) which allows missing values?
---
Update: I just found a thread on this at MathWorks, but it did not answer the question properly because it doesn't seem to understand that you don't want to impute! (The thread: http://www.mathworks.com/matlabcentral/newsreader/view_thread/295929 )
So let me try to explain. "Real world" data vectors often have missing elements. The information on similarity between two vectors lies in a small number of key differences, so imputing distorts the answer. But the pairwise difference between two vectors can be computed even if there are missing elements (for example, by leaving out the dimensions that are missing in either vector of the pair).
So for example you can make a dissimilarity matrix with a set of vectors which contain missing elements. You can do self-organizing maps (SOMs) on these vectors too (the cluster vectors are complete but the data vectors may have NaNs.)
I would think there would be a flavor of k-means where the "mean vectors" are complete but the distance between the mean vectors and the data vectors would be calculated as described above.
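One way such a variant might look, sketched in Python under stated assumptions: the distance is the mean squared difference over the dimensions observed in the data vector, and each mean-vector coordinate is updated from the observed values only. The names `nan_sq_dist` and `nan_kmeans` are hypothetical, the starting centroids are supplied by the caller (k-means++ seeding is not shown), and no convergence guarantee is claimed.

```python
import math

def nan_sq_dist(x, c):
    """Mean squared difference over the dimensions observed in x (c is complete)."""
    diffs = [(xi - ci) ** 2 for xi, ci in zip(x, c) if not math.isnan(xi)]
    return sum(diffs) / len(diffs) if diffs else math.inf

def nan_kmeans(data, centroids, n_iter=100):
    """Lloyd-style iterations with complete centroids and NaN-tolerant data.

    data: list of equal-length lists; NaN marks a missing element.
    centroids: list of complete starting vectors chosen by the caller.
    """
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(n_iter):
        # Assignment step: nearest centroid under the partial distance.
        new_labels = [min(range(len(centroids)),
                          key=lambda k: nan_sq_dist(x, centroids[k]))
                      for x in data]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: per-dimension mean over the observed values only.
        for k in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == k]
            for j in range(len(centroids[k])):
                observed = [x[j] for x in members if not math.isnan(x[j])]
                if observed:  # otherwise keep the previous coordinate
                    centroids[k][j] = sum(observed) / len(observed)
    return labels, centroids

nan = math.nan
data = [[1.0, 1.0], [1.2, nan], [nan, 0.8],
        [5.0, 5.0], [nan, 5.2], [4.8, nan]]
labels, centers = nan_kmeans(data, [[0.0, 0.0], [6.0, 6.0]])
```

The mean vectors stay complete as long as every dimension is observed at least once in each cluster; a coordinate with no observed members simply keeps its previous value.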
---
I am not wedded to k-means; feel free to recommend alternative methods if k-means is incompatible with missing elements in the input.
---
Thanks,
Al.
Answers (2)
Walter Roberson
on 22 Jan 2011
0 votes
You cannot calculate the distance between two vectors when you have missing information for one of the coordinates. You cannot simply omit the coordinate with the missing data, because the missing value could have been anything in the permitted range for that variable, so you can only calculate a range of values for the distance. If the permitted range of the missing coordinate is unbounded, the range of values for the distance could be infinite. The probability that the coordinate was very different from the others might be quite low, but it might still have happened.
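To make the "range of values" concrete, here is a small Python sketch that computes the smallest and largest possible Euclidean distance when each missing coordinate is known only to lie in a permitted interval. The function name `distance_range` and the `bounds` argument are assumptions introduced for this illustration.

```python
import math

def distance_range(a, b, bounds):
    """Smallest and largest Euclidean distance between a and b.

    bounds[j] = (low, high) is the permitted range of dimension j; it is
    used only where a[j] or b[j] is NaN.
    """
    min_sq = max_sq = 0.0
    for x, y, (low, high) in zip(a, b, bounds):
        x_missing, y_missing = math.isnan(x), math.isnan(y)
        if x_missing and y_missing:
            # Both unknown: they may coincide or sit at opposite ends.
            lo_d, hi_d = 0.0, high - low
        elif x_missing or y_missing:
            known = y if x_missing else x
            # Closest possible value is `known` clamped into [low, high].
            lo_d = max(low - known, known - high, 0.0)
            hi_d = max(abs(known - low), abs(known - high))
        else:
            lo_d = hi_d = abs(x - y)
        min_sq += lo_d ** 2
        max_sq += hi_d ** 2
    return math.sqrt(min_sq), math.sqrt(max_sq)

a = [0.0, math.nan]
b = [3.0, 4.0]
d_min, d_max = distance_range(a, b, [(0.0, 10.0), (0.0, 10.0)])
# With an unbounded second coordinate the upper end becomes infinite.
_, d_unbounded = distance_range(a, b, [(0.0, 10.0), (-math.inf, math.inf)])
```

Omitting the missing coordinate corresponds to reporting only the lower end of this range whenever the known value lies inside the permitted interval.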
If you have a model distribution for the missing coordinate, of course you could impute the mean of the distribution or the point of greatest likelihood in place of the missing data, but your question implies you do not want to do that. Still a model for the missing data would allow you to set confidence intervals against the range of contributions of the missing data.