Counting occurrences of a specific character in a cell array

Hi guys,
I want to count repeated occurrences of characters in a cell array,
e.g.
AAA AAT AAG AAT AGC ACG
I want something that automatically identifies and counts the occurrences.
Could anyone give me some help?

1 Comment

So 'AAA' would be 3 'A's? Or do you mean that you want to count the number of 'AAA', the number of 'AAT', and so on?
The cell arrays: is there one entry per cell, or are they blank-separated strings that need to be broken up?


Answers (1)

Assuming that these are codons (3 uppercase letters each), here are three "not-very-orthodox" solutions, just for fun. Keep in mind that, with bioinformatics being a hot topic, there are quite a few very specialized libraries out there (e.g. http://www.mathworks.com/help/bioinfo/functionlist.html) that would do the job much better. You might also get a more orthodox version from someone else once you answer Walter's comment.
For the example, assume the following (the solutions work for any cell array of 3-letter uppercase codes):
C = {'AAA','AAT','AAG','AAT','AGC','ACG'} ;
n = numel(C) ;
1. Probably the most efficient of these non-orthodox solutions (~0.58s for processing 1 million codons on my poor laptop):
D = accumarray([[C{:}]-64; reshape([1;1;1]*(1:n), 1, [])].', 1, [26 n]) ;
2. Closely followed by a "sparse" version:
D = sparse([C{:}]-64, reshape([1;1;1]*(1:n), 1, []), ones(1,3*n), 26, n) ;
3. And finally a much less efficient cell2mat/cellfun:
D = cell2mat(cellfun(@(code)accumarray(code.'-64, 1, [26,1]), C, ...
'UniformOutput', false)) ;
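The one-liner in solution 1 is fairly dense, so here is a minimal step-by-step sketch of the same idea. The variable names (letters, rowIdx, colIdx) are only illustrative, and it assumes every code is a row char vector of exactly 3 uppercase letters:

```matlab
% Step-by-step version of the accumarray approach (same result as solution 1).
C = {'AAA','AAT','AAG','AAT','AGC','ACG'};
n = numel(C);

letters = [C{:}];                           % 1 x 3n char vector: all codes concatenated
rowIdx  = double(letters) - 64;             % letter ID: 'A' -> 1, ..., 'Z' -> 26
colIdx  = reshape(repmat(1:n, 3, 1), 1, []); % code index per letter: 1 1 1 2 2 2 ...

% Each (rowIdx, colIdx) pair adds 1 to the matching element of a 26 x n matrix.
D = accumarray([rowIdx; colIdx].', 1, [26 n]);
```

The sparse version in solution 2 uses the same two index vectors; it just lets sparse do the accumulation of repeated (row, column) pairs.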
All three produce a 26 x #codes matrix whose columns are the distributions of the 26 letters of the alphabet for each code, with row index = letter ID (A=1, ..., Z=26). The sparse version returns a sparse matrix:
>> D
D =
3 2 2 2 1 1
0 0 0 0 0 0
0 0 0 0 1 1
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 1 0 1 1
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 1 0 1 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
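To read counts out of D, index its rows by letter ID. A short sketch, assuming D was built as in solution 1 (countA, totals and present are illustrative names):

```matlab
C = {'AAA','AAT','AAG','AAT','AGC','ACG'};
n = numel(C);
D = accumarray([[C{:}]-64; reshape([1;1;1]*(1:n), 1, [])].', 1, [26 n]);

countA  = D('A'-64, :);       % number of 'A's in each code
totals  = sum(D, 2);          % 26 x 1: count of each letter over all codes
present = find(totals > 0);   % letter IDs that occur at least once

% Print one line per letter that actually occurs.
for k = present.'
    fprintf('%c: %d\n', char(k + 64), totals(k));
end
```

For the example data, only the rows for A, C, G and T are nonzero, so only those four letters are printed.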
Note that the 3rd version doesn't assume 3-letter codes and works with codes of arbitrary length. The first two versions could be adapted to have the same flexibility.
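One possible way to adapt the first two versions, as a sketch only: build the column index from the actual code lengths instead of hard-coding 3. This assumes every cell holds a non-empty row char vector of uppercase letters; the example codes and variable names are made up for illustration:

```matlab
C = {'AAA','AT','AAGC','A'};            % hypothetical codes of unequal length
n = numel(C);

len    = cellfun(@numel, C);            % length of each code
starts = cumsum([1 len(1:end-1)]);      % position of each code's first letter
marks  = zeros(1, sum(len));
marks(starts) = 1;
colIdx = cumsum(marks);                 % code index per letter: 1 1 1 2 2 3 3 3 3 4
rowIdx = double([C{:}]) - 64;           % letter ID: 'A' -> 1, ..., 'Z' -> 26

D  = accumarray([rowIdx; colIdx].', 1, [26 n]);   % full matrix (version 1)
Ds = sparse(rowIdx, colIdx, 1, 26, n);            % sparse matrix (version 2)
```

Both D and Ds have one column per code, as before; an empty code would break the cumsum trick, which is why non-empty codes are assumed.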
Cheers,
Cedric

Asked: 19 Jan 2013
