k-means with NaNs in input

Question

0 votes

Naively, I would have thought one could run k-means on data with missing elements without imputation. One would just need to calculate the distance between two vectors over the dimensions they share (e.g. by dropping the dimensions with missing elements).

However, all the MATLAB implementations seem to require vectors without missing data.

Is there a version of k-means (hopefully k-means++) which allows missing values?

---

Update: I just found a thread on this at MathWorks, but it did not answer the question properly because it doesn't seem to understand that you don't want to impute! (The thread: http://www.mathworks.com/matlabcentral/newsreader/view_thread/295929)

So let me try to explain. "Real world" data vectors often have missing elements. The information about similarity between two vectors usually lies in a small number of key differences, so imputing distorts the answer. But the pairwise difference between two vectors can still be computed when elements are missing (for example, by leaving out the dimensions that are not present in both vectors of the pair).

So for example you can make a dissimilarity matrix with a set of vectors which contain missing elements. You can do self-organizing maps (SOMs) on these vectors too (the cluster vectors are complete but the data vectors may have NaNs.)
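A minimal sketch of such a dissimilarity matrix, assuming Euclidean distance over the dimensions observed in both vectors. The toy matrix `X` and the rescaling by the fraction of shared dimensions are assumptions added for illustration (the rescaling keeps pairs with few shared dimensions from looking artificially close); plain dropping, as described above, corresponds to omitting that factor.

```matlab
% Toy data: rows are observations, NaN marks a missing element.
X = [1 2   NaN;
     2 NaN 4;
     1 2   3];

n = size(X, 1);
D = NaN(n);                 % D(i,j) stays NaN if rows i and j share no dimension
for i = 1:n
    for j = 1:n
        ok = ~isnan(X(i,:)) & ~isnan(X(j,:));   % dimensions observed in both rows
        if any(ok)
            d = X(i,ok) - X(j,ok);
            % Rescale by (total dims / shared dims) so the result is
            % comparable across pairs with different numbers of shared dims.
            D(i,j) = sqrt(sum(d.^2) * numel(ok) / nnz(ok));
        end
    end
end
disp(D)
```

The resulting `D` can be passed to any method that works from a dissimilarity matrix rather than from raw coordinates.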

I would think there would be a flavor of k-means where the "mean vectors" are complete but the distance between the mean vectors and the data vectors would be calculated as described above.
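The idea in the paragraph above can be sketched as follows. This is not an existing MATLAB function, only an illustration under stated assumptions: the toy data `X`, the fixed starting centroids `C`, and the choice of mean squared difference over observed dimensions are all hypothetical, and every row is assumed to have at least one observed element. A k-means++-style seeding could replace the fixed start.

```matlab
% Toy data: two groups, some elements missing (NaN).
X = [0.0 0.1 NaN;
     0.2 NaN 0.1;
     NaN 0.0 0.2;
     5.0 5.1 NaN;
     5.2 NaN 4.9;
     NaN 5.0 5.1];
k = 2;
C = [0 0 0; 5 5 5];        % hypothetical starting centroids (complete vectors)

obs = ~isnan(X);           % mask of observed elements
X0  = X;  X0(~obs) = 0;    % zero-filled copy, only ever used together with the mask

for iter = 1:100
    % Assignment step: mean squared difference over each row's observed dims.
    D = zeros(size(X,1), k);
    for c = 1:k
        delta  = bsxfun(@minus, X0, C(c,:)) .* obs;   % missing dims contribute 0
        D(:,c) = sum(delta.^2, 2) ./ sum(obs, 2);
    end
    [~, idx] = min(D, [], 2);

    % Update step: each centroid coordinate is the mean of the observed values
    % assigned to it; the old value is kept if no member observes that coordinate.
    Cnew = C;
    for c = 1:k
        members = (idx == c);
        cnt = sum(obs(members,:), 1);
        tot = sum(X0(members,:), 1);
        upd = cnt > 0;
        Cnew(c,upd) = tot(upd) ./ cnt(upd);
    end
    if isequal(Cnew, C), break; end   % centroids stopped moving
    C = Cnew;
end
disp(idx')
disp(C)
```

The centroids stay complete throughout because each coordinate is averaged only over the members that actually observe it, which is the property the question asks for.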

---

I am not wedded to k-means; feel free to recommend alternative methods if k-means is incompatible with missing elements in the input.

---

Thanks,

Al.

Answer 1

Oleg Komarov on 22 Jan 2011

0 votes

Drop the missing values then:

newdata = data(~isnan(data));

Oleg

1 Comment

Al R on 22 Jan 2011

Wouldn't this just drop all the dimensions that aren't complete?

My input data is 99 vectors of length 64. 63 of the dimensions have at least one missing element across the 99 vectors.

I probably don't understand. Please explain.

thanks.....Al
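For reference, a small sketch of what the suggested indexing does on a matrix, next to the two usual ways of dropping incomplete columns or rows. The 2-by-3 matrix `data` is a made-up example, not the 99-by-64 data described above.

```matlab
data = [1 NaN 3;
        4 5   NaN];

% Logical indexing on a matrix returns a column vector of the non-NaN
% values in column-major order; the row/column structure is lost.
flat = data(~isnan(data));
disp(size(flat))

% Keep only the columns (dimensions) that have no NaN in any row.
completeCols = data(:, ~any(isnan(data), 1));
% Keep only the rows (vectors) that have no NaN in any column.
completeRows = data(~any(isnan(data), 2), :);
disp(size(completeCols))
disp(size(completeRows))
```

With 63 of 64 dimensions containing at least one NaN, the column-wise variant would leave a single dimension, which is the concern raised in this comment.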

Answer 2

Walter Roberson on 22 Jan 2011

0 votes

You cannot calculate the distance between two vectors when information is missing for one of the coordinates. You cannot simply omit the coordinate with the missing data: the missing value could have been anything in the permitted range for that variable, so you can only calculate a range of values for the distance. If the permitted range of the missing coordinate is unbounded, the range of values for the vector distance could be unbounded too. The probability that the coordinate was very different from the others might be quite low, but it still might have happened.

If you have a model distribution for the missing coordinate, you could of course impute the mean of the distribution or the point of greatest likelihood in place of the missing data, but your question implies you do not want to do that. Still, a model for the missing data would allow you to set confidence intervals on the range of contributions of the missing data.
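The range argument can be made concrete with a short sketch. The vectors `x` and `y` and the per-coordinate bounds `lo` and `hi` are hypothetical values chosen for illustration; the code assumes Euclidean distance and a known, finite permitted range for each variable.

```matlab
x  = [1 NaN 3];        % one coordinate missing
y  = [2 5   1];        % complete vector
lo = [0  0  0];        % hypothetical lower bound per coordinate
hi = [10 10 10];       % hypothetical upper bound per coordinate

miss  = isnan(x);
known = sum((x(~miss) - y(~miss)).^2);   % contribution of the observed coordinates

% Smallest gap: the missing value could coincide with y (clamped to the range).
nearest = min(max(y(miss), lo(miss)), hi(miss));
gapMin  = abs(nearest - y(miss));
% Largest gap: the missing value sits at whichever bound is farther from y.
gapMax  = max(abs(y(miss) - lo(miss)), abs(hi(miss) - y(miss)));

dMin = sqrt(known + sum(gapMin.^2));
dMax = sqrt(known + sum(gapMax.^2));
fprintf('distance lies in [%.4f, %.4f]\n', dMin, dMax);
```

Dropping the missing coordinate amounts to reporting `dMin` whenever `y` lies inside the permitted range, which is the optimistic end of the interval.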
