How to extract only the type of the membrane protein from its FASTA file header?

Hi. I am trying to prepare training data for, let's say, a learning machine. I have extracted some features from a FASTA file. Now I want to specify the class of each training instance, which is the "type" of the membrane protein, represented as part of the FASTA header. I just don't know how to access this part of the header and how to code each of the five types with a number. I have access to the header through fastaread. I can see a regular pattern in the header representation; it is just that this regularity changes slightly for each instance. Consider the following three cases:
41BB_HUMAN Q07011 homo sapiens (human). 4-1bb ligand receptor precursor (t-cell antigen 4-1bb homolog) (t-cell antigen ila) (cd137 antigen). 11/97
A33_HUMAN Q99795 homo sapiens (human). a33 antigen precursor. 11/97
A4_DROME P14599 drosophila melanogaster (fruit fly). beta-amyloid-like protein precursor. 11/97
Can anybody help, please?
  3 Comments
hiva
hiva on 25 Jan 2015
Edited: hiva on 25 Jan 2015
Hi. Yes, there are 2059 proteins in my training set and they are of 5 different types. Here is a link with more details about the data: http://garfield.library.upenn.edu/histcomp/alberts_all_fixed2/node/13713.ht
My data is from UniProt, according to your links.
The point is that the type of each protein has already been extracted from UniProt; it is mentioned as part of the protein's header. Take a look at some of the proteins in the file:
>AMA1_PLACH P16445 plasmodium chabaudi. apical membrane antigen 1 precursor (merozoite surface antigen). 11/97
MKEIYYIVILCSLYLINLGNCSEGTDKIISENGDVKFDLIPKENTERSHKLINPWEKFME
KYDIEKVHGSGIRVDLGEDARVENQDYRIPSGKCPVMGKGITIQNSKVSFLTRVATGNQK
VREGGLAFPQTDVNISPITIDNLKLMYKDHKEILALNDMSLCAKHASFYVPGTNVNTAYR
HPAVYDKSNKTCYILYVAAQENMGPRYCSNEEDNENQPFCFTPEKKDEYKNLSYLTKNLR
EDWETSCPNKSIQNAKFGVWVDGYCSEYQKKEVHDNKTLLECNQIVFNESASDQPKQYEK
HLEDTAKIRRGIVDRNGKLIGEALLPIGSYRADQVKSKGKGYNWANYDKKTKKCYIFNKK
PTCLINDKDFVATTALSSLEEGPQESFPCDIYKKKIAEEIKVMNVNRNNNGNDTIKFPRI
FISDDKESLNCPCEPTQLTQSTCKFFVCNCVEKRQFISENNEVEIKDEFKSEYESPINQR
MLIIIILIATGAILASLLIFYFFKSNKPGDDYDKMGQADTYGKAQSRKDEMLDPEVSFWG
EDKRASHTTPVLMEKPYY
>AMFR_HUMAN P26442 homo sapiens (human). autocrine motility factor receptor precursor (amf receptor) (gp78). 2/96
MRDSACWSQRKDELLQQARKRFLNKSSEDDAASESFLPSEGASSDPVTLRRRMLAAARNG
GFRSSRPPSAPLPSSAASCALCPTDWRRPVPILPLHGKAGLTALPLYKACGLIVFGQLIN
LILLCNTFYVTFLFPLETLQILTVGMISSGVDWTAWGGGRSGGSEPVACLQQAASTPASC
IRPTNAGVLSTTPSGKSVGEAHSVSPPPRRGVTSVIKLLSLLWKHVDCARARPTGSCTPE
QQGILEKELLVRYLEQRRGKSRAIGCDEVTPFCPTTSGTDFPSLQSKAGLISVNSGAPAS
HECAPWVPSPLSISLSRLDLGSG
>ANPA_HUMAN P16066 homo sapiens (human). atrial natriuretic peptide receptor a precursor (anp-a) (anpra) (gc-a) (guanylate cyclase) (ec 4.6.1.2). 10/96
MPGPRRPAGSRLRLLLLLLLPPLLLLLRGSHAGNLTVAVVLPLANTSYPWSWARVGPAVE
LALAQVKARPDLLPGWTVRTVLGSSENALGVCSDTAAPLAAVDLKWEHNPAVFLGPGCVY
AAAPVGRFTAHWRVPLLTAGAPALGFGVKDEYALTTRAGPSYAKLGDFVAALHRRLGWER
QALMLYAYRPGDEEHCFFLVEGLFMRVRDRLNITVDHLEFAEDDLSHYTRLLRTMPRKGR
VIYICSSPDAFRTLMLLALEAGLCGEDYVFFHLDIFGQSLQGGQGPAPRRPWERGDGQDV
SARQAFQAAKIITYKDPDNPEYLEFLKQLKHLAYEQFNFTMEDGLVNTIPASFHDGLLLY
IQAVTETLAHGGTVTDGENITQRMWNRSFQGVTGYLKIDSSGDRETDFSLWDMDPENGAF
RVVLNYNGTSQELVAVSGRKLNWPLGYPPPDIPKCGFDNEDPACNQDHLSTLEVLALVGS
LSLLGILIVSFFIYRKMQLEKELASELWRVRWEDVEPSSLERHLRSAGSRLTLSGRGSNY
GSLLTTEGQFQVFAKTAYYKGNLVAVKRVNRKRIELTRKVLFELKHMRDVQNEHLTRFVG
ACTDPPNICILTEYCPRGSLQDILENESITLDWMFRYSLTNDIVKGMLFLHNGAICSHGN
LKSSNCVVDGRFVLKITDYGLESFRDLDPEQGHTVYAKKLWTAPELLRMASPPVRGSQAG
DVYSFGIILQEIALRSGVFHVEGLDLSPKEIIERVTRGEQPPFRPSLALQSHLEELGLLM
QRCWAEDPQERPPFQQIRLTLRKFNRENSSNILDNLLSRMEQYANNLEELVEERTQAYLE
EKRKAEALLYQILPHSVAEQLKRGETVQAEAFDSVTIYFSDIVGFTALSAESTPMQVVTL
LNDLYTCFDAVIDNFDVYKVETIGDAYMVVSGLPVRNGRLHACEVARMALALLDAVRSFR
IRHRPQEQLRLRIGIHTGPVCAGVVGLKMPRYCLFGDTVNTASRMESNGEALKIHLSSET
KAVLEEFGGFELELRGDVEMKGKGKVRTYWLLGERGSSTRG
My problem is that, in order to prepare my data vectors, I need the class of each protein along with the rest of the features so that I can use supervised learning. I just don't know how to extract only the "type" of these proteins from their headers and then assign an integer to each type, in MATLAB.
Luuk van Oosten
Luuk van Oosten on 25 Jan 2015
Edited: Luuk van Oosten on 25 Jan 2015
Could you change the end of the link to .html, please (instead of .ht)? This one does not work.
If I understand you correctly, you already have the list of proteins mentioned in this article, and you know to which of the five classes they belong. So if I say SSR3_HUMAN, you say 'multipass transmembrane protein', right?
And now you want to extract this SSR3_HUMAN part from the header in the FASTA file, correct?


Answers (1)

Luuk van Oosten
Luuk van Oosten on 25 Jan 2015
Got an idea. You import your FASTA file. You put your headers in one column and the corresponding sequences in the next column or something (anything that works for you).
Something like:
File = 'C:\Users\Documents\your_fastafile.fasta';
your_data = fastaread(File);
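Note that both the header and the amino acid sequence are displayed in quotation marks in the generated struct; the quotes are only how MATLAB shows character arrays and are not part of the data.
Now pull the headers out into a cell array of plain strings with something similar to the following: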
for i = 1:length(your_data)
    % wrap the header string in a 1x1 cell ...
    header{i,1} = {your_data(i).Header};
    % ... and unwrap it again, so each entry is a plain string
    header_no_quotationmarks{i,1} = header{i,1}{:};
end
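As a shorter alternative to the loop, the headers can be collected in one line. This is only a sketch: it assumes your_data is a struct array with Header and Sequence fields, as fastaread returns, and it builds a small stand-in struct by hand (with made-up placeholder sequences) so that it runs without a FASTA file or the Bioinformatics Toolbox.

```matlab
% Stand-in for the output of fastaread: a 1x2 struct array with the
% fields Header and Sequence. The sequences are placeholders.
your_data = struct( ...
    'Header',   {'A33_HUMAN Q99795 homo sapiens (human). a33 antigen precursor. 11/97', ...
                 'A4_DROME P14599 drosophila melanogaster (fruit fly). beta-amyloid-like protein precursor. 11/97'}, ...
    'Sequence', {'MKE', 'MRD'});

% {your_data.Header} expands to a comma-separated list of all headers;
% the transpose turns the resulting row cell array into a column.
headers = {your_data.Header}';
```

Each element of headers is then an ordinary character array, the same as the entries of header_no_quotationmarks.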
(This could be easier, but I still had part of this lying around from another project; I'm just copy-pasting here.) Now you want to extract the part that describes your protein from the header, so let us take an example:
AMA1_PLACH P16445 plasmodium chabaudi. apical membrane antigen 1 precursor (merozoite surface antigen)
This can be seen as a string. And as you already noted yourself, there is some regularity in these strings. They all start with what you want: the AMA1_PLACH part. Headers in FASTA files obtained from other sources tend to start with something like
sp|P0C2K0|A1KB_LOXBO etc. etc.
Where the sp|P0C2K0|-part will screw things up. Let me assume that ALL your headers start with the info you want. Anyway, we have a lot of strings containing the full header. Now take:
str = 'AMA1_PLACH P16445 plasmodium chabaudi. apical membrane antigen 1 precursor (merozoite surface antigen)'
Now use regexp (see help regexp / the online help documents of regexp for more info).
g = regexp(str, ' ', 'split');
This generates 'g': your string 'str' split (hence 'split') into separate parts wherever there is a space (the ' ' part in regexp).
if you now request
g{1}
MATLAB looks at 'g' and takes the first element.... your AMA1_PLACH!!!
So write your script/program/function to loop over all your data: take the header as a string, split it wherever there is a space, then take g{1} and store that info in a cell/array (whatever works for you; you probably want it next to the amino acid sequence).
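The loop described above can also be written without an explicit loop, since regexp and regexprep accept a cell array of strings. The following is a rough sketch rather than a tested solution for the full data set: the first two headers are copied from the question, the third is the 'sp|...' example from this answer, and the variable names first_token and entry_name are made up for illustration.

```matlab
% Example headers; in practice this would be the cell array of all headers.
headers = { ...
    '41BB_HUMAN Q07011 homo sapiens (human). 4-1bb ligand receptor precursor (t-cell antigen 4-1bb homolog) (t-cell antigen ila) (cd137 antigen). 11/97'; ...
    'A33_HUMAN Q99795 homo sapiens (human). a33 antigen precursor. 11/97'; ...
    'sp|P0C2K0|A1KB_LOXBO etc. etc.'};

% First run of non-whitespace characters in every header
% (the same thing g{1} gives for a single string).
first_token = regexp(headers, '^\S+', 'match', 'once');

% If a token carries a prefix such as 'sp|P0C2K0|', drop everything up to
% and including the last '|'; tokens without '|' are left unchanged.
entry_name = regexprep(first_token, '^.*\|', '');
```

For these three headers, entry_name holds 41BB_HUMAN, A33_HUMAN and A1KB_LOXBO.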
Maybe you can have a look here, it is where I got the idea of the regexp. You can write this in many different ways, and maybe this is not the most elegant, but hey, it seems to work (at least for my own mini-fastafile).
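The question also asks how to code the five types with a number, which the steps above do not cover. One possible approach, sketched under the assumption that you have a separate list pairing each entry name with its type: look up the type of every protein with ismember and let unique assign the integers. The lookup lists known_names and known_types below are hypothetical, and the labels 'type_A' and 'type_B' are placeholders, not real class names from the data set.

```matlab
% Entry names extracted from the headers (examples from the question).
entry_name = {'41BB_HUMAN'; 'A33_HUMAN'; 'A4_DROME'};

% HYPOTHETICAL lookup table: the type of each known protein.
known_names = {'A33_HUMAN'; 'A4_DROME'; '41BB_HUMAN'};
known_types = {'type_B';    'type_A';   'type_B'};

% Find each entry name in the lookup table.
[found, pos] = ismember(entry_name, known_names);
assert(all(found), 'Some entry names are missing from the lookup table.');
types = known_types(pos);

% unique sorts the distinct labels; its third output gives, for every
% protein, the index of its label in type_list, i.e. an integer class code.
[type_list, ~, class_id] = unique(types);
```

Here class_id is a column vector with one integer per protein, and type_list(class_id) recovers the original labels, so the mapping between numbers and types is not lost.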
