How to use SVM in MATLAB for my binary feature vectors
Let's say I have a main feature set that combines six binary feature vectors. Together they are stored as a 105x6 logical array. E.g.:
1. 10100001000001111111100000000001..
2. 00001010101111000010101010110001..
3. 00101011101111111100001000000000..
4. 11111111110000101010101001010111..
5. 0000011110000101010101001010111..
6. 11111111110000101010101001010110..
Three of the feature vectors are benign and the other three are malware. How can I train a classifier on these feature vectors using svmtrain and svmclassify? I have no idea how to start; please guide me.
Answers (2)
Walter Roberson
on 8 Apr 2017
Do you mean you have 105 samples, each of which has feature vectors totaling 6 bits, or do you mean you have 6 samples, each of which has a total of 105 bits of features?
If you only have 6 samples with 105 bits of features per sample, then you do not have enough data to do classification.
2 Comments
Walter Roberson
on 9 Apr 2017
To do the calculations for classifications, you need at least as many samples as you have bits of features. More than that, actually.
user2030669, @cbeleites answer below is superb but as a rough rule of thumb: you need at least 6 times the number of cases (samples) as features. – BGreene Mar 7 '13 at 14:48
... in each class. I've also seen recommendations of 5p and 3p / class. – cbeleites Mar 7 '13 at 20:02
[...] but you need a minimum of 96 observations to accurately predict the probability of a binary outcome even if there are no features to be examined [this is to achieve a 0.95 confidence margin of error of 0.1 in estimating the actual marginal probability that Y=1].
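For readers wondering where the figure of 96 comes from, it is consistent with the usual normal-approximation sample-size formula for a proportion. The sketch below assumes the worst case p = 0.5 and the two-sided 95% quantile z = 1.96; neither value is stated in the quoted passage, so treat them as assumptions.

```latex
% Margin of error E for an estimated proportion p from n observations:
%   E = z * sqrt( p (1 - p) / n )
% Solving for n with z = 1.96, p = 0.5 (assumed worst case), E = 0.1:
\[
  n \;=\; \frac{z^{2}\,p\,(1-p)}{E^{2}}
    \;=\; \frac{(1.96)^{2}\,(0.5)(0.5)}{(0.1)^{2}}
    \;=\; \frac{3.8416 \times 0.25}{0.01}
    \;\approx\; 96
\]
```

This only covers estimating the overall rate of Y=1. It says nothing about how many samples are needed per feature.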
Ilya
on 11 Apr 2017
You most certainly do not need as many samples as you have features. Statements like "you need at least 6 times the number of cases (samples) as features" are sheer nonsense.
However, with so few observations (6) you will likely find that several, perhaps many, features individually give perfect separation between the two classes. For example, staring at the posted patterns, I observe that the 6th bit is 0 for the first three samples and 1 for the last three samples. So if the first three are benign and the last three are malware, the 6th bit is a perfect predictor. And there may be more.
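A minimal sketch of that check, using only base MATLAB. It makes several assumptions that are not in the post: only the first 8 bits of each posted pattern are used (the post truncates the rest with ".."), samples are stored one per row, and rows 1-3 are benign while rows 4-6 are malware. The variable names are illustrative.

```matlab
% First 8 bits of each posted pattern (assumed: rows 1-3 benign, rows 4-6 malware).
bits = ['10100001'
        '00001010'
        '00101011'
        '11111111'
        '00000111'
        '11111111'];
X = (bits == '1');                 % 6-by-8 logical matrix, one row per sample
isMalware = [false; false; false; true; true; true];

benign  = X(~isMalware, :);
malware = X(isMalware, :);

% A bit separates the classes perfectly when it is constant within each
% class and takes opposite values in the two classes.
zeroThenOne = all(~benign, 1) & all(malware, 1);
oneThenZero = all(benign, 1)  & all(~malware, 1);
perfectBits = find(zeroThenOne | oneThenZero);

disp(perfectBits)                  % displays 6 for this 8-bit prefix
```

On the full 105-bit patterns the same code would list every such bit. With three samples per class, many of them may separate the classes by chance alone.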
You do not need SVM or any clever classifier for this problem. Just find all such perfect predictors and see if they make sense. Passing data to smart black boxes shouldn't be the first step in your analysis. Think about what your data means first. See if you can get a simple classification model by hand. If you fail, proceed with sophisticated algorithms.
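If a hand-built rule is not enough and an SVM is still wanted, the following is a rough sketch rather than a recommended workflow. It assumes the Statistics and Machine Learning Toolbox is installed and uses fitcsvm and predict, which newer MATLAB releases provide in place of svmtrain and svmclassify. The data are the first 8 bits of each posted pattern, stored one sample per row, with rows 1-3 labeled benign and rows 4-6 malware. Both the layout and the labels are assumptions.

```matlab
% Requires the Statistics and Machine Learning Toolbox (assumed available).
bits = ['10100001'
        '00001010'
        '00101011'
        '11111111'
        '00000111'
        '11111111'];
X = double(bits == '1');           % numeric 6-by-8 predictor matrix
Y = {'benign'; 'benign'; 'benign'; 'malware'; 'malware'; 'malware'};

% Train a linear SVM on all six samples.
model = fitcsvm(X, Y, 'KernelFunction', 'linear');

% Predict labels for the training samples themselves. This is resubstitution
% only and is not an honest estimate of how the model would do on new data.
predicted = predict(model, X);
disp(predicted)
```

With six samples, agreement between `predicted` and `Y` shows only that the model can memorize the training set. An honest assessment would need held-out data or cross-validation on a much larger sample.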
11 Comments
Walter Roberson
on 14 Apr 2017
Ilya, this resource (MATLAB Answers) is not an academic journal: it is a resource in which people do the best they can in their spare time to help other people.
Cross-checking competing papers takes time, and might require years of background experience to know all the relevant factors, and to know things like which papers were later refuted. From time to time someone with a lot of deep knowledge in a topic wanders by here and helps out.
But... mostly topic experts do not wander here and help out. That leaves the volunteers with a choice:
A) Leave nearly all the questions here unanswered because we are not the topic experts; or
B) Do some surface-level research of appropriate papers and books, hoping that our S/T/E/M backgrounds are enough to guide us to something useful that we can interpret for the people asking the questions; or
C) Answer based upon our memory, and using past postings of how other people have answered similar questions in the past (people who might not have been topic experts either.)
I often end up answering questions that involve matters outside my topic expertise, including on topics that I may never have heard of before. I would prefer if there were experts on hand on every topic, ready to step in promptly... but those people simply are not available.
So I have a look; and I answer what I can, in the time I have available; to the extent that my health allows.
It is not the best of situations, but unfortunately a lot of the time I am the only help people have. It is, to be frank, a very heavy burden at times.
Ilya
on 15 Apr 2017
Walter, I appreciate this explanation.
I agree that this resource is not an academic journal, and the threshold for posting an answer is much lower than that for a publication. I also note that there are no consistent rules for people who answer on Answers (at least I am not aware of any), and for that reason you can choose any philosophy you like with respect to the quality/thoughtfulness of your answers. Yet I believe that answering questions outside your expertise without doing some verification first is dangerous and often produces plain wrong (not just somewhat incorrect) answers, which is worse than not giving any answer at all. Just like you said - "those people simply are not available", where "those people" means "experts". Because experts are not available, no one is there to refute a wrong answer, and the wrong answer stays on this site forever, serving as a source of confusion and support for similarly misguided future answers.
On my part, I choose to answer only questions for which I consider myself an expert. I doubt that by doing so I fail to provide critical help to people out there. Many people asking questions on this site are students, and they can certainly find other sources of help such as, for instance, their professors. This is especially true for questions such as this one, where the entire discussion revolves around theory and has nothing to do with MATLAB. It's just that submitting a question to Answers takes less effort than scheduling an appointment with faculty, and they resort to this easy way. If they knew the likelihood of getting a plain wrong answer was high, they would likely not resort to this easy way.
I appreciate your desire to help and am not asking you to apply the same level of scrutiny as that for an academic publication. I think though that raising the bar a bit higher would be a positive change toward improving quality, perhaps at the expense of reducing the overall number of answers; I think such a reduction would be acceptable since it would also lead to reduction of plain wrong answers. Also, doing more verification would allow you to learn the material at a deeper level and develop knowledge of new areas. I do not know to what extent you are interested in learning, of course.