How to search a string with multiple rows for text?

Hello, after running seq = getgenpept('NP_036795'); I want to search seq.Features for the text 'Protein'. I have been unable to find the correct function to search a string with multiple rows.
Running k = strfind(seq.Features,'Protein') results in "Error using strfind. Input strings must have one row."
Any thoughts? Best, Joe

3 Comments

Can you give us some more information so that we can help you, i.e. is seq a structure, a cell array, ...? Can you attach an example?
Excerpt from doc of getgenpept
Features: [40x64 char]
strfind cannot handle multi-row character arrays.
What does this array of characters look like? BTW: it's allowed to use for-loops.
Looks like the pic below.
What kind of info are you trying to extract from 'Protein'?
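As a possible way around the strfind error quoted in the question, here is a minimal sketch that searches row by row. The variable features is a hypothetical stand-in for seq.Features (the real array is 40x64 char); the names rows, hits and match_rows are made up for illustration. It relies on cellstr to split the character matrix into one string per row, since strfind does accept a cell array of strings.

```matlab
% Hypothetical stand-in for seq.Features (a padded multi-row char array).
features = char('source          1..116', ...
                '                /organism="Rattus norvegicus"', ...
                'Protein         1..116', ...
                '                /product="vesicle-associated membrane protein 2"');

% cellstr splits the char matrix into one cell per row (trailing blanks removed).
rows = cellstr(features);

% strfind accepts a cell array of strings and returns the start indices per row.
hits = strfind(rows, 'Protein');

% Row numbers that contain the search text (the search is case-sensitive).
match_rows = find(~cellfun(@isempty, hits));
```

Because strfind is case-sensitive, the row containing "membrane protein" is not reported here. If a case-insensitive search is wanted, regexpi could probably be used in place of strfind.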


Answers (1)

I guess this block of characters is easier to read on screen than to parse automatically. Regarding "find the correct function": I don't think there is such a function; a small program is needed. Anyhow, the script below creates a structure, sas, which is a start.
%% Create test data. (The OCR program missed most of the underscores.)
buf = { 'source 1..116 '
' /organism="Rattus norvegicus" '
' /dbxref="taxon: 10116^ '
' /chromosome=^10^ '
' /map="10824" '
'Protein 1..116 '
' /product="vesicle-associated membrane protein 2^ '
' /note="VAMP-2; synaptobrevin-2; Synaptobrevin 2 '
' (vesicle-associated membrane protein VAMP-2); '
' Vesicle-associated membrane protein (synaptobrevin 2)"'
' /calculated mol wt=12560 '
'Region 28..101 '
' /region name="Synaptobrevin" '
' /note="Synaptobrevin; pfam00957" '
' /dbxref="CDD:250253" '
'Site 95..114 '
' /site type="transmembrane region" '
' /inference="non-experimental evidence, no additional '
' details recorded" '
' /note="propagated from UniProt./Swiss-Prot (P63045.2).'
'CDS 1..116 '
' /gene="Vamp2^ '
' /gene synonym="RATVAMPB; RATVAMPIR; SYS; Syb2^ '
' /coded by="NM 012663.2:83..433" '
' /dbxref="GeneID:24803^ '
' /dbxref="RGD:3949" '};
str_array = char( buf );
%% Read and parse
sas = struct();
for rr = 1 : size( str_array, 1 )
    % Look for rows that start with a word followed by digits, two dots and digits
    tok = regexp( str_array(rr,:), '^(\w+)\s+(\d+\.{2}\d+)', 'tokens' );
    if not( isempty( tok ) )
        % New feature: start a cell array that holds its location
        field_name = tok{1}{1};
        sas.(field_name) = tok{1}(2);
    else
        % Continuation row: append it to the current feature
        sas.(field_name) = cat( 1, sas.(field_name) ...
                              , strtrim( str_array(rr,:) ) );
    end
end
The structure, sas, has one field for each sub-group:
>> sas
sas =
source: {5x1 cell}
Protein: {6x1 cell}
Region: {4x1 cell}
Site: {4x1 cell}
CDS: {6x1 cell}
>> sas.Protein
ans =
'1..116'
'/product="vesicle-associated membrane protein 2^'
'/note="VAMP-2; synaptobrevin-2; Synaptobrevin 2'
'(vesicle-associated membrane protein VAMP-2);'
'Vesicle-associated membrane protein (synaptobrevin 2)"'
'/calculated mol wt=12560'
>> char( sas.Protein )
ans =
1..116
/product="vesicle-associated membrane protein 2^
/note="VAMP-2; synaptobrevin-2; Synaptobrevin 2
(vesicle-associated membrane protein VAMP-2);
Vesicle-associated membrane protein (synaptobrevin 2)"
/calculated mol wt=12560
>>
The next step is to parse the sub-blocks.
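One way that parsing step might look, as a rough sketch only: the cell array block is a hand-made sample in the shape of sas.Protein, with the closing quotes and underscores restored by hand (an assumption, since the OCR'ed test data lacks them). The struct quals and its field location are hypothetical names. Rows that start with "/key=" open a new qualifier; any other row is treated as a continuation of the previous one.

```matlab
% Sample sub-block in the shape of sas.Protein (quotes and underscores restored by hand).
block = { '1..116'
          '/product="vesicle-associated membrane protein 2"'
          '/note="VAMP-2; synaptobrevin-2; Synaptobrevin 2'
          '(vesicle-associated membrane protein VAMP-2);'
          'Vesicle-associated membrane protein (synaptobrevin 2)"'
          '/calculated_mol_wt=12560' };

quals = struct();           % hypothetical output: one field per qualifier
quals.location = block{1};  % the first row holds the location
key = '';
for rr = 2 : numel(block)
    tok = regexp(block{rr}, '^/(\w+)=(.*)$', 'tokens', 'once');
    if ~isempty(tok)
        % A new "/key=value" qualifier starts on this row.
        key = tok{1};
        quals.(key) = tok{2};
    elseif ~isempty(key)
        % Continuation row: append it to the value of the current qualifier.
        quals.(key) = [quals.(key), ' ', block{rr}];
    end
end

% Strip the surrounding double quotes from quoted values.
names = fieldnames(quals);
for kk = 1 : numel(names)
    quals.(names{kk}) = regexprep(quals.(names{kk}), '^"(.*)"$', '$1');
end
```

Note that a qualifier that occurs twice in one sub-block (such as /dbxref in CDS) would overwrite the earlier value in this sketch; collecting the values in a cell array would avoid that.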

Asked: 27 Mar 2015
Edited: 29 Mar 2015