How to see if characters are present in a string array.

I am trying to write some code that will take a short amino acid sequence, for example 'GSA', and then search through a string array of sequences to find the number and indices of matches, but I would like it to ignore the order of the characters. As long as each character is present, I would like to consider it a hit.
Here is the code I have so far, which kind of works. InputSeq is the sequence I would like to search for, and AAseq is the string array of sequences that I would be searching through. This code only produces a match if all characters are present AND the order is correct.
InputSeq = "GSA";
AAseq = [ SGD; SGS; SGA; SGV; SGS; SGA; SGD; SGS; SGS; SGY; SGD; SGS; SGI.........];
result = ismember(InputSeq, AAseq)
This kind of works, but it will not register a match if the order of the characters does not match.

 Accepted Answer

Assuming that all string elements contain exactly the same number of characters, you can do this easily with basic logical operations on character arrays:
A = "GSA";
B = ["SGD";"SGS";"SGA";"SGV";"SGS";"SGA";"SGD";"SGS";"SGS";"SGY";"SGD";"SGS";"SGI"]
B = 13×1 string array
"SGD" "SGS" "SGA" "SGV" "SGS" "SGA" "SGD" "SGS" "SGS" "SGY" "SGD" "SGS" "SGI"
X = all(sort(char(A))==sort(char(B),2),2)
X = 13×1 logical array
0 0 1 0 0 1 0 0 0 0 0 0 0
Or without sorting:
X = all(any(char(A)==permute(char(B),[1,3,2]),3),2)
X = 13×1 logical array
0 0 1 0 0 1 0 0 0 0 0 0 0
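The two expressions above agree on this data, but they are not the same test. The following is a minimal sketch, using hypothetical data that is not from the question, of where they can differ: the sorted comparison requires each row to contain exactly the same characters as the query (repeats included), while the any/all version only requires every query character to occur somewhere in the row. It also shows one way to get the number and indices of matches. The variable names Xsort, Xany, nMatches and matchIdx are made up for illustration.

```matlab
% Hypothetical data chosen to show where the two tests disagree.
A = "GSS";                       % query with a repeated character
B = ["SGA"; "SGS"; "SSG"];       % candidates, all the same length as A

% Sorted comparison: each row must hold exactly the same characters
% as A, including how often each one occurs.
Xsort = all(sort(char(A)) == sort(char(B), 2), 2);

% any/all comparison: every character of A must occur somewhere in the row.
Xany = all(any(char(A) == permute(char(B), [1, 3, 2]), 3), 2);

nMatches = nnz(Xsort);           % number of matching rows
matchIdx = find(Xsort);          % row indices of the matching rows
```

Here "SGA" passes the any/all test (it contains both G and S) but fails the sorted comparison because it has only one S. Which behaviour is wanted depends on whether repeated residues should count.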

3 Comments

Thanks! This worked the best for me, but I had to make some changes to the way I sorted my character array. The way you coded it, it alphabetized by row and didn't alphabetize the columns. I solved it by doing this. I think it is because my variable was already a character array rather than a string.
A = 'GSA'
B = ['SGD';'SGS';'SGA';'SGV';'SGS';'SGA';'SGD';'SGS';'SGS';'SGY';'SGD';'SGS';'SGI']
for i = 1:size(B,1) % loop over rows; length(B) would return the largest dimension
B(i,:) = sort(B(i,:));
end
Result = all(sort(A) == B, 2);
MatchIdx = find(Result);
MatchIdx =
3
6
You don't need the loop, you can simply specify the sort dimension argument:
A = 'GSA'
A = 'GSA'
B = ['SGD';'SGS';'SGA';'SGV';'SGS';'SGA';'SGD';'SGS';'SGS';'SGY';'SGD';'SGS';'SGI']
B = 13×3 char array
'SGD' 'SGS' 'SGA' 'SGV' 'SGS' 'SGA' 'SGD' 'SGS' 'SGS' 'SGY' 'SGD' 'SGS' 'SGI'
X = all(sort(A)==sort(B,2),2)
X = 13×1 logical array
0 0 1 0 0 1 0 0 0 0 0 0 0
Yep, you're right! That worked. Thank you!


More Answers (1)

You could use multiple contains() tests.
But I suggest that instead you do something like
ismember(sort(char(InputSeq)), cellfun(@sort, cellstr(AAseq), 'uniform', 0))
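For reference, the "multiple contains() tests" idea mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the short AAseq list is hypothetical, and hits and matchIdx are made-up names. It tests one character at a time and combines the results, so it checks that each character is present but does not check how often it occurs.

```matlab
InputSeq = "GSA";
AAseq = ["SGD"; "SGS"; "SGA"; "SGV"; "ASG"];   % hypothetical short list

hits = true(size(AAseq));            % start by assuming every entry matches
for ch = char(InputSeq)              % loop over the characters G, S, A
    % keep only the entries that also contain this character
    hits = hits & contains(AAseq, ch);
end
matchIdx = find(hits);               % indices of entries containing all characters
```

Because contains() works element by element on the string array, this version does not need the sequences to have the same length.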

2 Comments

That is only returning true or false, i.e. "InputSeq is found somewhere in AAseq." I would like to get a logical array of the same size as AAseq, so I can get all of the indices of the matching sequences.
I had some luck with this. I also trimmed the input sequence down to 'GS', and the AAseq entries are all two characters long as well.
Matches = ismember(InputSeq, AAseq); (both variables are char arrays)
This gave me a 96x2 logical array. Column one seems to be "is G a member" and column two is "is S a member".
This kind of works for me. If I can get the row indices where both columns are true I will be good.
I tried this
MatchIndex = find(Matches == [1 1])
but it just gave me every index where there is a 1, rather than giving me indices where both columns are 1.
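One possible way to get the rows where both columns are true is to reduce along the second dimension with all() before calling find(). The comparison Matches == [1 1] is still element-wise, so find() on it returns the linear index of every true element. A minimal sketch, assuming Matches is an N-by-2 logical matrix like the one described above (the small matrix here is hypothetical stand-in data, and bothTrue is a made-up name):

```matlab
% Hypothetical N-by-2 logical matrix standing in for the result described above.
Matches = logical([1 0; 1 1; 0 1; 1 1]);

bothTrue   = all(Matches, 2);   % true only where every column in the row is true
MatchIndex = find(bothTrue);    % row indices where both columns are true
```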
ismember( cellfun(@sort, cellstr(AAseq), 'uniform', 0), sort(char(InputSeq)) )
You could also use strcmp().
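A rough sketch of how the strcmp() suggestion might look, returning one logical value per sequence: sort the characters within each sequence, then compare each sorted sequence with the sorted input. The AAseq list below is hypothetical and deliberately mixes lengths, and sortedSeqs, target, hits and matchIdx are made-up names.

```matlab
InputSeq = "GSA";
AAseq = ["SGD"; "SGS"; "SGA"; "SGVA"; "AGS"];  % hypothetical, mixed lengths

% Sort the characters inside each sequence; the result is a cell array
% of char vectors, so the sequences may differ in length.
sortedSeqs = cellfun(@sort, cellstr(AAseq), 'UniformOutput', false);

target   = sort(char(InputSeq));     % 'AGS'
hits     = strcmp(sortedSeqs, target);  % one logical value per sequence
matchIdx = find(hits);               % indices of the matching sequences
```

Since strcmp() compares whole char vectors, a sequence with extra characters such as "SGVA" does not match, and no equal-length assumption is needed.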


Release

R2019b
