This is a super duper fast implementation of the kmeans clustering algorithm. The code is fully vectorized and extremely succinct. It is much much faster than the Matlab builtin kmeans function. The kmeans++ seeding algorithm is also included (kseeds.m) for good initialization. Therefore, this package is not only for coolness, it is indeed practical.
Please try the demo script in the package.
Detail explanation of this algorithm can be found in following blog post:
This function is now a part of the PRML toolbox (http://www.mathworks.com/matlabcentral/fileexchange/55826-pattern-recognition-and-machine-learning-toolbox).
'gamrnd' requires Statistics and Machine Learning Toolbox.
Error using *
MTIMES (*) is not supported for one sparse argument and one single
Error in kkmeans (line 17)
mu = X*normalize(sparse(idx,last,1),1); % compute cluster centers
How can I solve this error? Thanks
A little bit of substitution of bsxfun(@minus,x,y) in for the places where this fails in Matlab < R2016b makes this work just fine. And it is still faster or at least on par with the builtin kmeans (without the need for a Stats toolbox license - which we run out of sometimes...).
A little bit of conditional tests could make this work for pretty much every Matlab version, i.e. using an if statement or an anonymous function to perform those expansions that fail on older versions, but if performance/speed is your only metric, then...
I definitely appreciated the kmeans++ initialization.
@Stephane CHO Just like rating a windows software low because not working on linux, especially, it claims windows only.
Not working on R2015a
I can see visually the clusters are separated ,but how do I get to know the members of the clusters (i.e index of the main dataset )
@John, this is a new syntax of Matlab >=R2016b.
Same Problem as John. Running the demo throws an error:
Error using -
Matrix dimensions must agree.
Error in kseeds (line 15)
D = min(D,sum((X-mu(:,i-1)).^2,1));
Error in kmeans_demo (line 9)
y = kmeans(X,kseeds(X,k));
Does not work in MATLAB 7.12.0 (R2011a).
This tool does: https://www.mathworks.com/matlabcentral/fileexchange/28804-k-means++
I'm having trouble running kmeans_demo. In kmeans_demo, the line y = kmeans(X,kseeds(X,k)); calls kseed with X(2,5000) and k=3. In kseeds, the line D = min(D,sum((X-mu(:,i-1)).^2,1)); the term sum((X-mu(:,i-1)).^2 doesn't evaluated with X as a 2x5000 array and mu a 2x1 array.
HOW TO CREATE A DATABASE IN MATLAB?
There is a small bug for newer versions of MATLAB because the behavior of UNIQUE has changed so that the indices are always returned as column vectors. See demo below.
Though I'm not a guru on speed, one suggested change would be to add the following line after calls to UNIQUE:
label = label(:)';
clc; lb=[3 1 1 5]; szb=size(lb); [~,~,la] = unique(lb); sza=size(la);
fprintf('<<<<<< %s >>>>>>\nLabels before: %s, Labels after: %s\nSize before:%s, Size after:%s\n', version, mat2str(lb), mat2str(la), mat2str(szb), mat2str(sza));
Output on different versions of MATLAB:
<<<<<< 184.108.40.2062 (R2014a) >>>>>>
Labels before: [3 1 1 5], Labels after: [2;1;1;3]
Size before:[1 4], Size after:[4 1]
<<<<<< 220.127.116.115 (R2011a) >>>>>>
Labels before: [3 1 1 5], Labels after: [2 1 1 3]
Size before:[1 4], Size after:[1 4]
How to view the segmented result! pls help me!
how to run k-means for a input as a any dataset file
The algorithm randomly fails, probably due to the random initialization.
For instance, the vector of 1D data X=[9.13,2.68,2.33,9.41] with k=2 is sometimes clustered (labeled) as [1 1 1 1], instead of the right values [1 2 2 1] or [2 1 1 2];
Hi, i'm new to Matlab coding. Pls tell how to run this code...
how should i run this program ??
Works fine, almost always. However, besides the needed transpose mentioned by Matei, I noticed another problem. Suppose I want 9 clusters from a dataset with 100 elements. Quite often I only get 7 or 8 groups. The same thing happens with other cluster numbers, though I haven't seen it for fewer than wanting 5.
did you already implement the Replication Feature? (From judging your code no, but maybe I don't get the stuff correctly yet ;-))
And why the spread function can only get an input matrix X of 2 or 3 rows, while in your kmeans algorithm it can have any size in the first dimension?
@Zoe: I would rather just transpose instead of reshaping, i.e. in line 14
last = label';
Cuts down my calculation from 4h down to 19min, so very pleased. Had to reshape label as per suggestion by Zoe below to make it work with 2013. Otherwise very good!
are no. of seeds equal to no.of clusters??
and i need labels for all data points.Whereas this is giving label whose dimension is equal to no. of seeds
please help me in Em estimator for parameter estimation
Seems to be broken in R2013a due too a column/row vector mix-up. As a quick workaround I have added
label = reshape(label, length(label), 1);
after line 15 in litekmeans.m.
Thank you for sharing. Does the code support 3d data?
not working ..
I am looking for an implemented kmeans clustering algorithm to segment a full body 3D CT matrix into 4 classes in Matlab. This kmeans output will then be used as input to Potts model segmentation.
Is there anyone who can help med with this or give me some suggestions.
Thanks in advance
It is simply not for image.
does it works for rgb images?
Good, fast implementation but it should be possible to pass in the cluster centers. Also arguments are not checked (try k=1) and standard help is not provided.
It is normal behavior. This line of code remove empty clusters which does not happen very often when k is small.
I met problem when compiling the following sentence of your code,
[~,~,label] = unique(label);
As is known, the character '~' is not supported for representing non-used variables in recent releases of Matlab. However, if I change the '~' to some other variable names (not to be used), then this sentence seems unnecessary --- label is not changed after executing this sentence.
Any comments on this?
Even if we choose the cluster centers initially still there is no guarantee that we will get the same clusters every time.
There is one paperon this and a specific clustering technique to solve this problem But I am unable to code it properly. Can you have a look on this algorithm called Quality Threshold (QT) based clustering.
Heyer, L. J., Kruglyak, S., Yooseph, S. (1999). Exploring expression
% data: Identification and analysis of coexpressed genes. Genome Research
% 9, 1106–1115.
If you want deterministic behavior of kmeans, you can use some deterministic method to initialize the clusters. For example, you can first choose the sample which is fastest from the center of the whole data set as a cluster center, then iteratively choose sample which has the largest sum of distances from the chosen centers.
However, you are not guarantee to obtain a good result on any data set.
Thanks for the posting Michael. I have a small query is thery any clustering technique by which the cluster results will remain same everytime we run the code.
Please kindly tell.
This is a really nice function.
As Fen Xie pointed out, it sometimes returns empty clusters (esp if you set k, the desired number of clusters too high compare to the size of you data).
The solution is easy though. You only need two extra lines of code to get rid of the empty clusters:
% 1- dicsard empty clusters
centers = centers(:, unique(labels));
% 2- re-assign labels
[~,labels] = max(bsxfun(@minus,centers'*X,0.5*sum(centers.^2)'));
Thanks a lot for the nice code. It was very usefult for me.
Hi Yen, thanks, I've update the file that fixed the problem.
Hmm, the site submitted an empty comment for me when I left the page. Nice.
Anyways, thank you for the highly optimized code. I can tell that a lot of thought went into minimizing the memory footprint and the number of computations (e.g., sparse matrices, coefficient placement outside of summations, computing xc-0.5c^2 instead of the full Euclidean distance x^2-2xc+c^2). I'd like to point out a bug, though -- litekmeans.m, line 9: your sum command isn't equipped to handle data sets that are 1-by-N. Change it to: 0.5*sum(center.^2,1)' to force summation over rows.
I just implement the classical kmeans. there is no improvement in the algorithm aspect. The speed gain is obtained by coding tricks which is vectorization. For the question of whether this one is suitable for large k, I guess you have to try it. You might get a result with the number of clusters less than k, because some clusters might become empty during the iterations.
And is it suitable for problems that the number of clusters is large, e.g. k=256?
Hi, Michael, I tried some other kmeans algorithms and found that your method indeed fast in speed.Why you say that it could be faster than other fast kmeans method(such as kd-tree and linear time algorithm), what is the computation complexity of your method, while the classical is O(nk)? Thanks a lot!
Thanks for sharing.
I wonder the complexity of this thing. Do you know?
The results of kmeans algorithm can be different with different initializations. Actually the kmeans function in matlab is not a standard kmeans algorithm. It tries to get smaller energy by switching data points in different clusters after the standard kmeans procedure converged.
One purpose of the litekmeans is to be simple (only 10 lines of code), therefore I did not add extra code to handle empty cluster. It just discard the cluster if the cluster becomes empty. You can modify the code yourself if you want extra functionality.
this method produces empty clusters constantly, be careful dealing with these exceptions～
Sorry, I have compared the results of your program and the embedded program of matlab, the two results doesn't show the same, so what does it mean??
Yes, just call the litekmeans.m to get the clustering results. You cannot get a visualization in a simple way for the data whose dimensions are more than 3. The scatterd.m can only handle data of 2d or 3d.
Thank you for the share. I have two questions. Do we have to use both of the functions to cluster? I have 13x7000 matrix which I want to cluster. Should I just simply apply the matrice to litekmeans.m? And how can I plot the result as displayed in the picture?
Sorry for the inconvenience. Here's the answer to your questions.
The function takes two parameters. The first one is a d x n data matrix, of which each column is assumed to be a sample vector of d dimension. The second parameter is the number of clusters. The output is a 1 x n vector, of which each element is the label of the corresponding input sample vector.
The function handles data of arbitrary dimensions.
This gave a simple implementation to the problem I had.
A couple of notes:
- Reduction of code is good, reduction of comments is not. "litekmeans" is without any header line or comments describing what this (very useful) function does, what kind of input it takes, what it produces etc. Things like the dimension to arrange the input points are necessary to use the function properly, but are not explained in a header. Whether or not the function handles 1D, 2D, 3D, etc data isn't written anywhere.
- Providing data as a data.mat file is useful for functions requiring custom made data, but it tends to obscure the format of the data. It would be simpler and easier to understand by providing equally simple sample code that makes the data.
- Providing brief sample code on the file exchange is fine, but it should be included as a header to the file, so that people can find it when they need it, rather than when they first download it.
Otherwise, thanks! Like I said, it solved my immediate problem.
finalize, I'm done with kmeans.
tweak and require Matlab R2016b or later
fix incompatibility issue with newer version of Matlab, due to the stupid API change of function unique()
remove empty clusters according to suggestion
remove empty clusters according to suggestion
fix a bug for 1d data
update the files and description
Inspired by: Pattern Recognition and Machine Learning Toolbox