Counting occurrences of a specific character in a cell array
Hi guys,
I want to count repeated occurrences of characters in a cell array,
e.g.
AAA AAT AAG AAT AGC ACG
I want something that automatically identifies and counts the occurrences.
Could anyone give me some help?
1 Comment
Walter Roberson
on 19 Jan 2013
So 'AAA' would be 3 'A's? Or do you mean that you want to count the number of 'AAA', the number of 'AAT', and so on?
The cell arrays: is there one entry per cell, or are they blank-separated strings that need to be broken up?
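To make the second reading concrete, here is a minimal sketch that counts how often each whole code appears. It assumes the data starts as one blank-separated string (the variable names s, codes and counts are made up for the illustration); if there is already one entry per cell, the regexp line can be skipped.

```matlab
% Hypothetical input: one blank-separated string of codes.
s = 'AAA AAT AAG AAT AGC ACG';

% Break the string into a cell array with one code per cell.
C = regexp(strtrim(s), '\s+', 'split');

% unique returns the sorted distinct codes and, in idx, the position of
% each element of C within that sorted list.
[codes, ~, idx] = unique(C);

% Tally how many times each distinct code occurs.
counts = accumarray(idx(:), 1);

% Print one line per distinct code.
for k = 1:numel(codes)
    fprintf('%s: %d\n', codes{k}, counts(k));
end
```

With the example data, only 'AAT' appears more than once.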
Answers (1)
Assuming that these are amino acids/codons (3 uppercase letters), here are three "not-very-orthodox" solutions, just for fun. Keep in mind that bioinformatics is a hot topic, so there are quite a few very specialized libraries out there (e.g. http://www.mathworks.com/help/bioinfo/functionlist.html) that would do the job in a much better fashion. You might also get a more orthodox version from someone else once you answer Walter's comment.
Take the following example (the solutions work for any cell array of codes made of 3 uppercase letters):
C = {'AAA','AAT','AAG','AAT','AGC','ACG'} ;
n = numel(C) ;
1. Probably the most efficient of these non-orthodox solutions (~0.58 s to process 1 million codons on my poor laptop):
D = accumarray([[C{:}]-64; reshape([1;1;1]*(1:n), 1, [])].', 1, [26 n]) ;
2. Closely followed by a "sparse" version:
D = sparse([C{:}]-64, reshape([1;1;1]*(1:n), 1, []), ones(1,3*n), 26, n) ;
3. And finally a much less efficient cell2mat/cellfun:
D = cell2mat(cellfun(@(code) accumarray(code.'-64, 1, [26,1]), C, ...
    'UniformOutput', false)) ;
All three produce a 26-by-n matrix (n = number of codes) whose columns hold the counts of the 26 letters of the alphabet for each code, with row index = letter ID (A=1, ..., Z=26). The sparse version returns the same counts as a sparse matrix; full(D) converts it to the dense form displayed here:
>> D
D =
3 2 2 2 1 1
0 0 0 0 0 0
0 0 0 0 1 1
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 1 0 1 1
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 1 0 1 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
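As an illustration of how this matrix can be read, here is a small sketch that rebuilds D with the first solution and then pulls out per-letter information. The names nT, totals and present are made up for the example.

```matlab
C = {'AAA','AAT','AAG','AAT','AGC','ACG'};
n = numel(C);
D = accumarray([[C{:}]-64; reshape([1;1;1]*(1:n), 1, [])].', 1, [26 n]);

% The row index of a letter is its character code minus 64 ('A' is 65).
nT = D('T'-64, :);     % occurrences of 'T' in each code

% Occurrences of each letter summed over all codes (26-by-1 vector).
totals = sum(D, 2);

% List only the letters that actually occur, with their totals.
present = find(totals);
for k = present.'
    fprintf('%s: %d\n', char(k+64), totals(k));
end
```

Summing along the other dimension, sum(D, 1), gives the length of each code, which is a quick sanity check.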
Note that the 3rd version doesn't assume 3-letter codes and works with codes of arbitrary length. The first two versions could be adapted to have this flexibility.
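One possible adaptation of the first two versions, as a rough sketch only: the fixed reshape([1;1;1]*(1:n), 1, []) column index is replaced by one built from the actual code lengths. It assumes every cell holds a non-empty row of uppercase letters; the variable names len, starts, col and row are made up for the illustration.

```matlab
% Hypothetical codes of unequal length.
C = {'AAA','AT','AAGC','T'};
n = numel(C);

% Number of letters in each code.
len = cellfun(@numel, C);

% Column index for every letter of the concatenated string: mark the
% position where each code starts, then take a cumulative sum.
% (Assumes no empty codes, otherwise two start positions would coincide.)
starts = cumsum([1, len(1:end-1)]);
col = zeros(1, sum(len));
col(starts) = 1;
col = cumsum(col);

% Row index for every letter (A=1, ..., Z=26).
row = [C{:}] - 64;

% Version 1 (accumarray) and version 2 (sparse) with the new column index.
D  = accumarray([row; col].', 1, [26 n]);
Ds = sparse(row, col, ones(1, numel(row)), 26, n);
```

Both D and full(Ds) should hold the same counts, one column per code.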
Cheers,
Cedric