K-modes clustering algorithm for categorical data?

Has anyone come across a k-modes script for MATLAB? I've seen people respond with links to supervised learning algorithms, but I need an unsupervised one. Even pseudocode would be fine, so I can build it myself.
I'm using R2017b.
I'm really trying to avoid using R.

Answers (1)

I can't imagine why you'd use kmeans with categorical data. If it's categorical, you can simply use the category to classify the data point, right?

4 Comments

K-modes is for categorical data. I have multiple dimensions of categorical data, so I'd have to choose which category to classify by.
Now, since you're not clarifying anything, I'll pose an example. Let's say that you have car makes (manufacturers) and your data is categorical, like Ford, GM, VW, BMW, Toyota, Nissan, Kia, and Jaguar (8 makes). Now, how would you cluster that? Let's say you had anywhere from 1,000 to 10,000 counts for each car make and you wanted to find the "clusters". Well, how about 2 clusters? OK, then which makes would you group into each cluster? If you have no other info, then there is not really enough info to decide what makes a cluster. How about clustering by make, so that it would make sense to have 8 clusters, one for each unique make? Or, if you want, you could use categorical() and cluster based on some other factor, like the count in each category, so that you could have classes of "sold many" or "sold few".
I attach an example where I use kmeans to cluster an image. You could, if you want, consider that the gray levels are like a set of 256 categories and the clusters/classes are 2 (or however many you specify) gray level ranges.
I apologize for not stating my question clearly. I was just hoping that someone had come across a k-modes script, but I'll try to pose my question better.
I think this analogy is similar enough to my data set. I have 200 questionnaires, and each questionnaire has 40 categorical questions. I would like to cluster them so that similar questionnaires end up together. So even if 1-2 questions were answered differently, the distance between those two data points should not be too large.
Here is how my question differs from your reply (though perhaps I'm misinterpreting it): I can't simply cluster the questionnaires based on one arbitrary question (i.e. just Question 1, or just the car makes). I need to consider all of them.
k-means is appropriate for numerical data. There is no way of translating my categorical data into meaningful numeric data. The answers are currently stored as numbers in my matrix, but consecutive numbers are not related, so any distance computed from the numeric values (Euclidean, for example) is meaningless.
Does that make more sense?
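To make the point about distances concrete, here is a minimal sketch in plain Python, used as runnable pseudocode because the thread contains no code. The function names and the five-question answers are hypothetical. It counts the questions on which two questionnaires disagree (the simple matching, or Hamming, distance) and shows that relabeling the category codes changes a Euclidean distance but not the matching distance.

```python
def matching_distance(a, b):
    """Number of questions on which two questionnaires disagree."""
    if len(a) != len(b):
        raise ValueError("questionnaires must have the same number of answers")
    return sum(x != y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Euclidean distance on the raw codes (shown only for contrast)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical answers to 5 questions, stored as arbitrary integer codes.
q1 = [1, 3, 2, 5, 1]
q2 = [1, 3, 2, 5, 4]  # differs from q1 on one question
q3 = [2, 1, 4, 3, 3]  # differs from q1 on every question

d12 = matching_distance(q1, q2)  # 1
d13 = matching_distance(q1, q3)  # 5
e12 = euclidean_distance(q1, q2)

# Rename the categories: same answers, different (equally arbitrary) codes.
relabel = {1: 2, 2: 5, 3: 1, 4: 3, 5: 4}
r1 = [relabel[x] for x in q1]
r2 = [relabel[x] for x in q2]

d12_relabeled = matching_distance(r1, r2)   # unchanged by the relabeling
e12_relabeled = euclidean_distance(r1, r2)  # differs from e12
```

In MATLAB, pdist in Statistics and Machine Learning Toolbox accepts a 'hamming' distance, which reports the fraction of coordinates that differ rather than the count.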
I've found this, https://shapeofdata.wordpress.com/2014/03/04/k-modes/, which looks useful, and it is what I am planning to try. I would just rather avoid having to code it myself because of time constraints.
I would also welcome any other suggestions for clustering this data. I am not sold on k-modes.
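Since pseudocode was requested, here is a minimal sketch of the usual k-modes loop: assign each record to its nearest mode under the matching distance, recompute each cluster's per-column mode, and repeat until the assignments stop changing. It is written in plain Python as runnable pseudocode and is not taken from any toolbox. All names (k_modes, column_modes, matching_distance) are hypothetical, and the random initialization, the first-encountered tie-breaking, and the handling of empty clusters are assumptions, not part of any reference implementation.

```python
import random
from collections import Counter


def matching_distance(a, b):
    """Number of positions where two records disagree."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Most frequent category in each column of a non-empty list of records.

    Ties go to the category encountered first.
    """
    return [Counter(col).most_common(1)[0][0] for col in zip(*records)]


def k_modes(data, k, max_iter=100, seed=0):
    """Cluster categorical records; returns (labels, modes)."""
    rng = random.Random(seed)
    # Assumption: start from k records picked at random (without replacement).
    modes = [list(row) for row in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode under the matching distance.
        new_labels = [
            min(range(k), key=lambda j: matching_distance(row, modes[j]))
            for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes are too
        labels = new_labels
        # Update step: per-column mode of each cluster's members.
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old mode
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical toy data: 6 questionnaires, 4 categorical questions each.
data = [
    ["a", "a", "a", "a"],
    ["a", "a", "a", "b"],
    ["a", "a", "b", "a"],
    ["c", "c", "c", "c"],
    ["c", "c", "c", "d"],
    ["c", "c", "d", "c"],
]
labels, modes = k_modes(data, k=2)
# The first three records share one label and the last three the other.
```

Like k-means, this can settle in a poor local optimum on real data, so running it from several seeds and keeping the result with the smallest total within-cluster distance is a common precaution.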
I'm not an expert on questionnaires, though we have many statisticians in our company who spend their whole lives doing that. I'd suggest you try the Classification Learner app and pick whichever model works best. Check out this page: https://www.mathworks.com/help/stats/machine-learning-in-matlab.html. You have an unsupervised learning problem because you have data but no ground truth: you don't know the classes/groupings of any of the questionnaires in advance.

Asked: 10 Aug 2018

Commented: 12 Aug 2018
