featureparse

Parse features from GenBank, GenPept, or EMBL data

Syntax

FeatStruct = featureparse(Features)
FeatStruct = featureparse(Features, ...'Feature', FeatureValue, ...)
FeatStruct = featureparse(Features, ...'Sequence', SequenceValue, ...)

Input Arguments

Features
Any of the following:
  • MATLAB® structure with fields corresponding to GenBank®, GenPept, or EMBL data, such as those returned by genbankread, genpeptread, emblread, getgenbank, getgenpept, or getembl

  • Character vector or character array containing the text from the Features section of a GenBank, GenPept, or EMBL-formatted file

FeatureValue
Name of a feature contained in Features. When specified, featureparse returns only the substructure that corresponds to this feature. If there are multiple features with the same FeatureValue, then FeatStruct is an array of structures.

SequenceValue
Logical value that controls whether featureparse extracts, when possible, the sequence for each feature, joining and complementing pieces of the source sequence as needed, and stores it in the Sequence field of the returned structure, FeatStruct. When extracting the sequence from an incomplete CDS feature, featureparse uses the codon_start qualifier to adjust the frame of the sequence. Choices are true or false (default).

Output Arguments

FeatStruct
Output structure containing a field for every database feature. Each field name in FeatStruct matches the corresponding feature name in the GenBank, GenPept, or EMBL database, with the exceptions listed in the table below. Fields in FeatStruct contain substructures with feature qualifiers as fields. In the GenBank, GenPept, and EMBL databases, for each feature, the only mandatory qualifier is its location, which featureparse translates to the field Location. When possible, featureparse also translates this location to numeric indices, creating an Indices field.

Note

If you use the Indices field to extract sequence information, you may need to complement the sequences.

Description

FeatStruct = featureparse(Features) parses the features from Features, which contains GenBank, GenPept, or EMBL features. Features can be one of the following:

  • Character vector or string containing GenBank, GenPept, or EMBL features

  • MATLAB character array including text describing GenBank, GenPept, or EMBL features

  • MATLAB structure with fields corresponding to GenBank, GenPept, or EMBL data, such as those returned by genbankread, genpeptread, emblread, getgenbank, getgenpept, or getembl
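
As a brief sketch of the structure and text input forms, the following assumes the sample file nm175642.txt used in the Examples section is on the MATLAB path. The Features field of the structure returned by genbankread holds the feature table text as a character array, so it can be passed on its own. The variable names fromStruct and fromText are illustrative only.

```matlab
% Read a GenBank-formatted file into a MATLAB structure.
gbkStruct = genbankread('nm175642.txt');

% Form 1: pass the whole structure.
fromStruct = featureparse(gbkStruct);

% Form 2: pass only the text of the Features section, which genbankread
% stores in the Features field as a character array.
fromText = featureparse(gbkStruct.Features);

% Both calls parse the same feature table, so the top-level
% feature names are expected to match.
disp(fieldnames(fromStruct)');
disp(fieldnames(fromText)');
```

The text form carries no source sequence, so the structure form is presumably the one to use when 'Sequence' is set to true.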

FeatStruct is the output structure containing a field for every database feature. Each field name in FeatStruct matches the corresponding feature name in the GenBank, GenPept, or EMBL database, with the following exceptions.

Feature Name in GenBank, GenPept, or EMBL Database    Field Name in MATLAB Structure
-10_signal                                            minus_10_signal
-35_signal                                            minus_35_signal
3'UTR                                                 three_prime_UTR
3'clip                                                three_prime_clip
5'UTR                                                 five_prime_UTR
5'clip                                                five_prime_clip
D-loop                                                D_loop
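
The renamed fields are consistent with a general MATLAB rule: a structure field name must be a valid identifier, which rules out hyphens, apostrophes, and a leading digit. The sketch below checks this with isvarname for the names in the table above. The two cell arrays are typed in by hand for illustration and are not produced by featureparse.

```matlab
% Database feature keys from the table and the field names they map to
% (entered by hand for this illustration).
dbNames    = {'-10_signal', '-35_signal', '3''UTR', '3''clip', ...
              '5''UTR', '5''clip', 'D-loop'};
fieldNames = {'minus_10_signal', 'minus_35_signal', 'three_prime_UTR', ...
              'three_prime_clip', 'five_prime_UTR', 'five_prime_clip', 'D_loop'};

% Test each name against the rules for MATLAB identifiers.
dbValid    = cellfun(@isvarname, dbNames);
fieldValid = cellfun(@isvarname, fieldNames);

fprintf('Database keys usable as field names: %d of %d\n', ...
        nnz(dbValid), numel(dbNames));
fprintf('Replacement names usable as field names: %d of %d\n', ...
        nnz(fieldValid), numel(fieldNames));
```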

Fields in FeatStruct contain substructures with feature qualifiers as fields. In the GenBank, GenPept, and EMBL databases, for each feature, the only mandatory qualifier is its location, which featureparse translates to the field Location. When possible, featureparse also translates this location to numeric indices, creating an Indices field.

Note

If you use the Indices field to extract sequence information, you may need to complement the sequences.
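
A minimal sketch of using Indices by hand follows. It assumes the sample file nm175642.txt, and it assumes that the CDS feature of that record is a single forward-strand range, so that Indices holds one start position followed by one end position. For a feature on the complementary strand, or one joined from several segments, the extracted pieces would also need to be reverse-complemented (for example with seqrcomplement) or concatenated; the 'Sequence' property handles those cases for you.

```matlab
gbkStruct = genbankread('nm175642.txt');
cds = featureparse(gbkStruct, 'Feature', 'CDS');

% Location is the original location text; Indices is its numeric translation.
disp(cds.Location);
idx = cds.Indices;

% Assumption: a single forward-strand segment, so idx is [startPos endPos].
manualSeq = gbkStruct.Sequence(idx(1):idx(end));

% Let featureparse perform the extraction and compare the two results.
auto = featureparse(gbkStruct, 'Feature', 'CDS', 'Sequence', true);
disp(strcmpi(manualSeq, auto.Sequence));
```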

FeatStruct = featureparse(Features, ...'PropertyName', PropertyValue, ...) calls featureparse with optional properties specified as property name/property value pairs. You can specify one or more properties in any order. Enclose each PropertyName in single quotation marks; property names are case insensitive. The property name/property value pairs are as follows:

FeatStruct = featureparse(Features, ...'Feature', FeatureValue, ...) returns only the substructure that corresponds to FeatureValue, the name of a feature contained in Features. If there are multiple features with the same FeatureValue, then FeatStruct is an array of structures.
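
As a short sketch of the array-of-structures case, the following assumes the sample file nm175642.txt, whose record contains 31 exon features according to the output shown in the Examples section. The loop variable and the choice to print only the first three locations are illustrative.

```matlab
gbkStruct = genbankread('nm175642.txt');

% Request only the exon features; the result is a structure array
% with one element per exon in the record.
exons = featureparse(gbkStruct, 'Feature', 'exon');
fprintf('Number of exon features: %d\n', numel(exons));

% Each element carries its own Location text.
for k = 1:3
    fprintf('exon %d: %s\n', k, exons(k).Location);
end
```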

FeatStruct = featureparse(Features, ...'Sequence', SequenceValue, ...) controls whether featureparse extracts, when possible, the sequence for each feature, joining and complementing pieces of the source sequence as needed, and stores it in the field Sequence. When extracting the sequence from an incomplete CDS feature, featureparse uses the codon_start qualifier to adjust the frame of the sequence. Choices are true or false (default).
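
The effect of the 'Sequence' property can be sketched as follows, again assuming the sample file nm175642.txt. The variable names cdsPlain and cdsSeq are illustrative, and the check on the default case simply reports what isfield returns rather than asserting a particular result.

```matlab
gbkStruct = genbankread('nm175642.txt');

% Default ('Sequence', false): no sequence extraction is requested.
cdsPlain = featureparse(gbkStruct, 'Feature', 'CDS');

% With 'Sequence', true: the extracted nucleotides are stored in Sequence.
cdsSeq = featureparse(gbkStruct, 'Feature', 'CDS', 'Sequence', true);

fprintf('Sequence field without the option: %d\n', isfield(cdsPlain, 'Sequence'));
fprintf('Sequence field with the option:    %d\n', isfield(cdsSeq, 'Sequence'));
fprintf('Extracted CDS length: %d nt\n', numel(cdsSeq.Sequence));
```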

Examples

Obtain all the features stored in a GenBank file.

gbkStruct = genbankread('nm175642.txt');
features = featureparse(gbkStruct)
features = struct with fields:
    source: [1×1 struct]
      gene: [1×1 struct]
      exon: [1×31 struct]
       CDS: [1×1 struct]
       STS: [1×2 struct]

Get a subset of features from a GenBank record. For example, obtain only the coding sequence (CDS) features of two strains of the Influenza A virus (H5N1) from the GenBank database.

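% Retrieve the two records from the GenBank database (requires an Internet connection).
hk01 = getgenbank('AF509094');
vt04 = getgenbank('DQ094287');
% Keep only the CDS feature of each record and extract its nucleotide sequence.
hk01_cds = featureparse(hk01,'feature','CDS','Sequence',true);
vt04_cds = featureparse(vt04,'feature','CDS','Sequence',true);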

Use nt2aa and nwalign to align the amino acid sequences translated from the corresponding nucleotide sequences.

[sc,al] = nwalign(nt2aa(hk01_cds),nt2aa(vt04_cds),'extendgap',1);

Use seqinsertgaps to copy the gaps from the aligned amino acid sequences to their corresponding nucleotide sequences, which codon-aligns them.

hk01_aligned = seqinsertgaps(hk01_cds,al(1,:))
hk01_aligned = 
'caaaagcaggagattaaaatgaatccaaatcagaagataatgaccattggatcaatctgtatggtaatcggaatgattagcctggtgttacaaattgggaacatgatttcaatatgggccagtcattcaattcagaaaatgaaccaacaccaaactgaaccatgcaatcaaagcatcattacttatgaaaacaacacctgggtaaatcagacatatgtcaacatcagcaataccaattttcttactgagaaagttgtggcttcaatagcattatcgggcaattcatctctttgccccatcagtggatgggctgtctacagtaaggacaacggtataagaatcggttccaaaggggatgtgttcgttataagagagccgttcatctcgtgctcccacttggaatgcagaactttctttttgactcagggatccttgctgaatgacaagcattctaatgggaccgtcaaagatagaagcccttacagaacattgatgagttgccctgtaggtgaggctccctccccatataactcaagatttgagtctgttgcttggtcggcaagtgcttgtcatgacggcactagttggttgacaattggaatttctggcccagacaatggggctgtggctgtattgaaatacaatggcataataacagacactatcaagagttggaggaacagcatactgagaactcaagagtctgaatgtgcatgtgtaaatggttcttgtttcactgtaatgactgacggaccaagtaatgggcaggcatcatataaaatcttcaaaatagaaaaagggaaagtagttaaatcagtcgaattgaatgcccctaattatcactatgaggagtgctcctgttatcctgacgctggcgaaatcacatgtgtgtgcagggataattggcatggctcaaatcggccatgggtatctttcgatcaaaatttggagtatcaaataggatatatatgcagtggagttttcggagacaatccacgccccaatgatgggacaggcagttgtgatccggtgctccctaacggggcctatggagtaaaagggttttcatttaaatacggcgatggtgtttggatcgggagaaccaaaagcactaattccaggagcggctttgaaatgatttgggatccaaatgggtggactggaacggacagtaacttctcgctgaagcaagatatcgtagcgatgactgattggtcaggatatagcgggagttttgtccagcatccagacattacagaattagattgcataagaccttgtttctcggttgagctaatcagagggcggcccaaagagagcaccatttggactagtgggagcagcatatctttttgtggtgtaaatagtgacactgtgggttggtcttggccagacggtgctgagttgccattcaccattgacaagtaa'
vt04_aligned = seqinsertgaps(vt04_cds,al(3,:))
vt04_aligned = 
'------------------atgaatccaaatcagaagataataaccatcggatcaatctgtatggtaactggaatagttagcttaatgttacaagttgggaacatgatctcaatatgggtcagtcattcaattcacacagggaatcaacaccaagctgaacca------------------------------------------------------------gtcagcaatactaattttcttactgagaaagctgtggcttcagtaaaattagcgggcaattcatctctttgccccattaacggatgggctgtatacagtaaggacaacagtataaggatcggttccaagggggatgtgtttgttataagagagccgttcatctcatgctcccacttggaatgcagaactttctttttgactcagggagccttgctgaatgacaagcactccaatgggactgtcaaagacagaagccctcacagaacattaatgagttgtcctgtgggtgaggctccctccccatataactcaaggtttgagtctgttgcttggtcagcaagtgcttgccatgatggcaccagttggttgacaattggaatttctggcccagacaatggggctgtggctgtattgaaatacaatggcataataacagacactatcaagagttggaggaacaacatactgagaactcaagagtctgaatgtgcatgtgtaaatggctcttgctttactgtaatgactgacggaccaagtaatggtcaggcatcacataagatcttcaaaatggaaaaagggaaagtggttaaatcagtcgaattggatgctcctaattatcattatgaggaatgctcctgttatcctgatgccggcgaaatcacatgtgtgtgcagggataattggcatggctcaaatcggccatgggtatctttcaatcaaaacttggagtatcaaataggatatatatgcagtggagttttcggagacaatccacgccccaatgatggaacaggtagttgtggtccggtgtcctctaacggggcatatggggtaaaagggttttcatttaaatacggcaatggtgtctggatcgggagaaccaaaagcactaattccaggagcggctttgaaatgatttgggatccaaatgggtggactgaaacggacagtagcttttcagtgaaacaagatatcgtagcaataactgattggtcaggatatagcgggagttttgtccagcatccagaactgacaggactagattgcataagaccttgtttctgggttgagttgatcagagggcggcccaaagagagcacaatttggactagtgggagcagcatatctttttgtggtgtaaatagtgacactgtgggttggtcttggccagacggtgctgagttgccattcaccattgacaagtag'
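
To see what the codon-alignment step does on a small scale, here is a toy sketch with made-up sequences (not taken from the records above). It relies on the default behavior of seqinsertgaps, in which each symbol of the gapped amino acid sequence corresponds to three nucleotides, so each amino acid gap becomes a three-nucleotide gap.

```matlab
% Hypothetical 3-codon nucleotide sequence (atg aaa ccc = M K P) and a
% gapped amino acid sequence with one gap after the first residue.
ntSeq    = 'atgaaaccc';
gappedAA = 'M-KP';

% Copy the gap pattern onto the nucleotide sequence, codon by codon.
ntAligned = seqinsertgaps(ntSeq, gappedAA);
disp(ntAligned);
```

Because gaps are inserted in whole codons, the reading frame of the nucleotide sequence is preserved, which is what dnds needs.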

Once you have codon-aligned the two sequences, use them as input to other functions, such as dnds, which calculates the synonymous and nonsynonymous substitution rates of the codon-aligned nucleotide sequences. By setting Verbose to true, you can also display the codons considered in the computations and their amino acid translations.

[dn,ds] = dnds(hk01_aligned,vt04_aligned,'verbose',true)
DNDS: 
Codons considered in the computations:
ATGAATCCAAATCAGAAGATAATGACCATTGGATCAATCTGTATGGTAATCGGAATGATTAGCCTGGTGTTACAAATTGGGAACATGATTTCAATATGGGCCAGTCATTCAATTCAGAAAATGAACCAACACCAAACTGAACCAATCAGCAATACCAATTTTCTTACTGAGAAAGTTGTGGCTTCAATAGCATTATCGGGCAATTCATCTCTTTGCCCCATCAGTGGATGGGCTGTCTACAGTAAGGACAACGGTATAAGAATCGGTTCCAAAGGGGATGTGTTCGTTATAAGAGAGCCGTTCATCTCGTGCTCCCACTTGGAATGCAGAACTTTCTTTTTGACTCAGGGATCCTTGCTGAATGACAAGCATTCTAATGGGACCGTCAAAGATAGAAGCCCTTACAGAACATTGATGAGTTGCCCTGTAGGTGAGGCTCCCTCCCCATATAACTCAAGATTTGAGTCTGTTGCTTGGTCGGCAAGTGCTTGTCATGACGGCACTAGTTGGTTGACAATTGGAATTTCTGGCCCAGACAATGGGGCTGTGGCTGTATTGAAATACAATGGCATAATAACAGACACTATCAAGAGTTGGAGGAACAGCATACTGAGAACTCAAGAGTCTGAATGTGCATGTGTAAATGGTTCTTGTTTCACTGTAATGACTGACGGACCAAGTAATGGGCAGGCATCATATAAAATCTTCAAAATAGAAAAAGGGAAAGTAGTTAAATCAGTCGAATTGAATGCCCCTAATTATCACTATGAGGAGTGCTCCTGTTATCCTGACGCTGGCGAAATCACATGTGTGTGCAGGGATAATTGGCATGGCTCAAATCGGCCATGGGTATCTTTCGATCAAAATTTGGAGTATCAAATAGGATATATATGCAGTGGAGTTTTCGGAGACAATCCACGCCCCAATGATGGGACAGGCAGTTGTGATCCGGTGCTCCCTAACGGGGCCTATGGAGTAAAAGGGTTTTCATTTAAATACGGCGATGGTGTTTGGATCGGGAGAACCAAAAGCACTAATTCCAGGAGCGGCTTTGAAATGATTTGGGATCCAAATGGGTGGACTGGAACGGACAGTAACTTCTCGCTGAAGCAAGATATCGTAGCGATGACTGATTGGTCAGGATATAGCGGGAGTTTTGTCCAGCATCCAGACATTACAGAATTAGATTGCATAAGACCTTGTTTCTCGGTTGAGCTAATCAGAGGGCGGCCCAAAGAGAGCACCATTTGGACTAGTGGGAGCAGCATATCTTTTTGTGGTGTAAATAGTGACACTGTGGGTTGGTCTTGGCCAGACGGTGCTGAGTTGCCATTCACCATTGACAAG
ATGAATCCAAATCAGAAGATAATAACCATCGGATCAATCTGTATGGTAACTGGAATAGTTAGCTTAATGTTACAAGTTGGGAACATGATCTCAATATGGGTCAGTCATTCAATTCACACAGGGAATCAACACCAAGCTGAACCAGTCAGCAATACTAATTTTCTTACTGAGAAAGCTGTGGCTTCAGTAAAATTAGCGGGCAATTCATCTCTTTGCCCCATTAACGGATGGGCTGTATACAGTAAGGACAACAGTATAAGGATCGGTTCCAAGGGGGATGTGTTTGTTATAAGAGAGCCGTTCATCTCATGCTCCCACTTGGAATGCAGAACTTTCTTTTTGACTCAGGGAGCCTTGCTGAATGACAAGCACTCCAATGGGACTGTCAAAGACAGAAGCCCTCACAGAACATTAATGAGTTGTCCTGTGGGTGAGGCTCCCTCCCCATATAACTCAAGGTTTGAGTCTGTTGCTTGGTCAGCAAGTGCTTGCCATGATGGCACCAGTTGGTTGACAATTGGAATTTCTGGCCCAGACAATGGGGCTGTGGCTGTATTGAAATACAATGGCATAATAACAGACACTATCAAGAGTTGGAGGAACAACATACTGAGAACTCAAGAGTCTGAATGTGCATGTGTAAATGGCTCTTGCTTTACTGTAATGACTGACGGACCAAGTAATGGTCAGGCATCACATAAGATCTTCAAAATGGAAAAAGGGAAAGTGGTTAAATCAGTCGAATTGGATGCTCCTAATTATCATTATGAGGAATGCTCCTGTTATCCTGATGCCGGCGAAATCACATGTGTGTGCAGGGATAATTGGCATGGCTCAAATCGGCCATGGGTATCTTTCAATCAAAACTTGGAGTATCAAATAGGATATATATGCAGTGGAGTTTTCGGAGACAATCCACGCCCCAATGATGGAACAGGTAGTTGTGGTCCGGTGTCCTCTAACGGGGCATATGGGGTAAAAGGGTTTTCATTTAAATACGGCAATGGTGTCTGGATCGGGAGAACCAAAAGCACTAATTCCAGGAGCGGCTTTGAAATGATTTGGGATCCAAATGGGTGGACTGAAACGGACAGTAGCTTTTCAGTGAAACAAGATATCGTAGCAATAACTGATTGGTCAGGATATAGCGGGAGTTTTGTCCAGCATCCAGAACTGACAGGACTAGATTGCATAAGACCTTGTTTCTGGGTTGAGTTGATCAGAGGGCGGCCCAAAGAGAGCACAATTTGGACTAGTGGGAGCAGCATATCTTTTTGTGGTGTAAATAGTGACACTGTGGGTTGGTCTTGGCCAGACGGTGCTGAGTTGCCATTCACCATTGACAAG
Translations:
M  N  P  N  Q  K  I  M  T  I  G  S  I  C  M  V  I  G  M  I  S  L  V  L  Q  I  G  N  M  I  S  I  W  A  S  H  S  I  Q  K  M  N  Q  H  Q  T  E  P  I  S  N  T  N  F  L  T  E  K  V  V  A  S  I  A  L  S  G  N  S  S  L  C  P  I  S  G  W  A  V  Y  S  K  D  N  G  I  R  I  G  S  K  G  D  V  F  V  I  R  E  P  F  I  S  C  S  H  L  E  C  R  T  F  F  L  T  Q  G  S  L  L  N  D  K  H  S  N  G  T  V  K  D  R  S  P  Y  R  T  L  M  S  C  P  V  G  E  A  P  S  P  Y  N  S  R  F  E  S  V  A  W  S  A  S  A  C  H  D  G  T  S  W  L  T  I  G  I  S  G  P  D  N  G  A  V  A  V  L  K  Y  N  G  I  I  T  D  T  I  K  S  W  R  N  S  I  L  R  T  Q  E  S  E  C  A  C  V  N  G  S  C  F  T  V  M  T  D  G  P  S  N  G  Q  A  S  Y  K  I  F  K  I  E  K  G  K  V  V  K  S  V  E  L  N  A  P  N  Y  H  Y  E  E  C  S  C  Y  P  D  A  G  E  I  T  C  V  C  R  D  N  W  H  G  S  N  R  P  W  V  S  F  D  Q  N  L  E  Y  Q  I  G  Y  I  C  S  G  V  F  G  D  N  P  R  P  N  D  G  T  G  S  C  D  P  V  L  P  N  G  A  Y  G  V  K  G  F  S  F  K  Y  G  D  G  V  W  I  G  R  T  K  S  T  N  S  R  S  G  F  E  M  I  W  D  P  N  G  W  T  G  T  D  S  N  F  S  L  K  Q  D  I  V  A  M  T  D  W  S  G  Y  S  G  S  F  V  Q  H  P  D  I  T  E  L  D  C  I  R  P  C  F  S  V  E  L  I  R  G  R  P  K  E  S  T  I  W  T  S  G  S  S  I  S  F  C  G  V  N  S  D  T  V  G  W  S  W  P  D  G  A  E  L  P  F  T  I  D  K  
M  N  P  N  Q  K  I  I  T  I  G  S  I  C  M  V  T  G  I  V  S  L  M  L  Q  V  G  N  M  I  S  I  W  V  S  H  S  I  H  T  G  N  Q  H  Q  A  E  P  V  S  N  T  N  F  L  T  E  K  A  V  A  S  V  K  L  A  G  N  S  S  L  C  P  I  N  G  W  A  V  Y  S  K  D  N  S  I  R  I  G  S  K  G  D  V  F  V  I  R  E  P  F  I  S  C  S  H  L  E  C  R  T  F  F  L  T  Q  G  A  L  L  N  D  K  H  S  N  G  T  V  K  D  R  S  P  H  R  T  L  M  S  C  P  V  G  E  A  P  S  P  Y  N  S  R  F  E  S  V  A  W  S  A  S  A  C  H  D  G  T  S  W  L  T  I  G  I  S  G  P  D  N  G  A  V  A  V  L  K  Y  N  G  I  I  T  D  T  I  K  S  W  R  N  N  I  L  R  T  Q  E  S  E  C  A  C  V  N  G  S  C  F  T  V  M  T  D  G  P  S  N  G  Q  A  S  H  K  I  F  K  M  E  K  G  K  V  V  K  S  V  E  L  D  A  P  N  Y  H  Y  E  E  C  S  C  Y  P  D  A  G  E  I  T  C  V  C  R  D  N  W  H  G  S  N  R  P  W  V  S  F  N  Q  N  L  E  Y  Q  I  G  Y  I  C  S  G  V  F  G  D  N  P  R  P  N  D  G  T  G  S  C  G  P  V  S  S  N  G  A  Y  G  V  K  G  F  S  F  K  Y  G  N  G  V  W  I  G  R  T  K  S  T  N  S  R  S  G  F  E  M  I  W  D  P  N  G  W  T  E  T  D  S  S  F  S  V  K  Q  D  I  V  A  I  T  D  W  S  G  Y  S  G  S  F  V  Q  H  P  E  L  T  G  L  D  C  I  R  P  C  F  W  V  E  L  I  R  G  R  P  K  E  S  T  I  W  T  S  G  S  S  I  S  F  C  G  V  N  S  D  T  V  G  W  S  W  P  D  G  A  E  L  P  F  T  I  D  K  
dn = 0.0397
ds = 0.1957
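
A common follow-up, shown here as a small sketch using the two values printed above, is to form the ratio of the nonsynonymous rate to the synonymous rate. A ratio below 1 is conventionally read as a sign of purifying selection. The variable name omega is illustrative.

```matlab
dn = 0.0397;   % nonsynonymous substitution rate reported above
ds = 0.1957;   % synonymous substitution rate reported above

% Ratio of nonsynonymous to synonymous substitution rates.
omega = dn / ds;
fprintf('dN/dS = %.4f\n', omega);
```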

Version History

Introduced in R2006b