multialign
Align multiple sequences using progressive method
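Syntax
SeqsMultiAligned = multialign(Seqs)
SeqsMultiAligned = multialign(Seqs,Tree)
SeqsMultiAligned = multialign(___,Name=Value)
Description
SeqsMultiAligned = multialign(Seqs) performs a progressive multiple alignment for a set of sequences.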
Pairwise distances between sequences are computed after pairwise alignment with the Gonnet scoring matrix, by counting the proportion of sites at which the two sequences differ (ignoring gaps). The guide tree is then calculated by the neighbor-joining method, assuming equal variance and independence of evolutionary distance estimates.
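The default guide-tree construction described above can be approximated explicitly with seqpdist and seqneighjoin. This is only a sketch: the 'p-distance' method and the 'equivar' option are assumptions chosen to match the description, and the resulting tree is not guaranteed to be identical to the one multialign builds internally.

```matlab
% Read the example protein sequences shipped with Bioinformatics Toolbox.
p53 = fastaread('p53samples.txt');

% Pairwise distances: align each pair with the Gonnet matrix, then take the
% proportion of differing sites (assumed here to correspond to 'p-distance').
dist = seqpdist(p53,'Method','p-distance','ScoringMatrix','GONNET');

% Neighbor-joining guide tree, assuming equal variance and independence
% of the distance estimates ('equivar').
guideTree = seqneighjoin(dist,'equivar',p53);

% Progressive alignment driven by the explicit guide tree.
ma = multialign(p53,guideTree);
```

Building the tree yourself is mainly useful when you want to inspect or modify it before the alignment.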
SeqsMultiAligned = multialign(Seqs,Tree) uses a tree as a guide for the progressive alignment. The sequences should have the same order as the leaves in the tree or use a field ("Header" or "Name") to identify the sequences.
SeqsMultiAligned = multialign(___,Name=Value) uses additional options specified by one or more name-value arguments.
Examples
Align multiple sequences
This example shows how to align multiple protein sequences.
Use the fastaread function to read p53samples.txt, a FASTA-formatted file included with Bioinformatics Toolbox™, which contains p53 protein sequences of seven species.
p53 = fastaread('p53samples.txt')
p53=7×1 struct array with fields:
Header
Sequence
Compute the pairwise distances between each pair of sequences using the 'GONNET' scoring matrix.
dist = seqpdist(p53,'ScoringMatrix','GONNET');
Build a phylogenetic tree using an unweighted average distance (UPGMA) method. This tree will be used as a guiding tree in the next step of progressive alignment.
tree = seqlinkage(dist,'average',p53)
Phylogenetic tree object with 7 leaves (6 branches)
Perform progressive alignment using the PAM family scoring matrices.
ma = multialign(p53,tree,'ScoringMatrix',...
    {'pam150','pam200','pam250'})
ma=7×1 struct array with fields:
Header
Sequence
Align Nucleotide Sequences
Enter an array of sequences.
seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};
Promote terminations with gaps in the alignment.
multialign(seqs,'terminalGapAdjust',true)
ans = 3x17 char array
'--CACGTAACATCTC--'
'ACGACGTAACATCTTCT'
'-AAACGTAACATCTCGC'
Compare the alignment without termination gap adjustment.
multialign(seqs)
ans = 3x17 char array
'CA--CGTAACATCT--C'
'ACGACGTAACATCTTCT'
'AA-ACGTAACATCTCGC'
Input Arguments
Seqs
— Nucleotide or amino acid sequences
cell array of character vectors | vector of strings | matrix of characters | vector of structures
Nucleotide or amino acid sequences, specified as a cell array of character vectors, vector of strings, matrix of characters, or vector of structures.
You can specify:
Cell array of character vectors or vector of strings containing nucleotide or amino acid sequences.
Matrix of characters, in which each row corresponds to a nucleotide or amino acid sequence.
Vector of structures containing a Sequence field for the residues and a Header or Name field for the labels.
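The input forms listed above can be sketched as follows. The sequences are the ones used in the nucleotide example, and the labels seq1 to seq3 are hypothetical. A matrix of characters is omitted because its rows must all have the same length.

```matlab
% Three short nucleotide sequences as a cell array of character vectors.
seqsCell = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};

% The same sequences as a vector of strings.
seqsStr = string(seqsCell);

% The same sequences as a vector of structures (hypothetical labels).
seqsStruct = struct('Header',{'seq1','seq2','seq3'},'Sequence',seqsCell);

% Each form can be passed directly to multialign.
maCell   = multialign(seqsCell);
maStr    = multialign(seqsStr);
maStruct = multialign(seqsStruct);
```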
Tree
— Phylogenetic tree
phytree object
Phylogenetic tree, specified as a phytree object. You can calculate the tree using the seqlinkage or seqneighjoin function.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Example: SeqsMultiAligned = multialign(Seqs,Weights="equal") assigns the same weight to every sequence.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: SeqsMultiAligned = multialign(Seqs,"Weights","equal")
Weights
— Sequence weighting method
"THG" (default) | "equal"
Sequence weighting method, specified as "THG" or "equal". Weights emphasize highly divergent sequences by scaling the scoring matrix and gap penalties. Closer sequences receive smaller weights.
"THG" — Thompson-Higgins-Gibson method using the phylogenetic tree branch distances weighted by their thickness.
"equal" — Assigns the same weight to every sequence.
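A quick way to see the effect of the weighting method is to run the alignment with both settings and compare the results. The sketch below assumes the p53samples.txt file from the earlier example. Whether the two alignments differ depends on the data.

```matlab
% Read the example protein sequences shipped with Bioinformatics Toolbox.
p53 = fastaread('p53samples.txt');

% Align with the default THG weights and with equal weights.
maTHG   = multialign(p53,'Weights','THG');
maEqual = multialign(p53,'Weights','equal');

% Compare the aligned lengths (number of columns) of the two results.
lenTHG   = length(maTHG(1).Sequence);
lenEqual = length(maEqual(1).Sequence);
fprintf('THG: %d columns, equal: %d columns\n',lenTHG,lenEqual);
```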
ScoringMatrix
— Scoring matrix for progressive alignment
character vector | string scalar | cell array of character vectors | array of strings | numeric matrix | numeric array
Scoring matrix for the progressive alignment, specified as a character vector, string scalar, or numeric matrix. You can specify a series of scoring matrices as a cell array of character vectors, array of strings, or numeric array.
Match and mismatch scores are interpolated from the series of scoring matrices by considering the distances between the two profiles or sequences being aligned. The first matrix corresponds to the smallest distance, and the last matrix to the largest distance. Scores for intermediate distances are obtained by linear interpolation.
You can specify scoring matrix names. Valid choices are:
"BLOSUM62"
"BLOSUM30" increasing by 5 up to "BLOSUM90" (default for amino acid sequences is the "BLOSUM80" to "BLOSUM30" series)
"BLOSUM100"
"PAM10" increasing by 10 up to "PAM500"
"DAYHOFF"
"GONNET"
"NUC44" (default for nucleotide sequences). This choice is not supported for amino acid sequences.
Note
The above scoring matrices, provided with the software, also include a scale factor that converts the units of the output score to bits.
You can also specify a numeric matrix of size M-by-M, such as the one returned by the blosum, pam, dayhoff, gonnet, or nuc44 function.
You can also specify a numeric array of size M-by-M-by-N for a series of N user-defined scoring matrices.
Note
If you use a scoring matrix that you created or that was created by one of the above functions, the matrix does not include a scale factor. The output score is returned in the same units as the scoring matrix. When passing your own series of scoring matrices, ensure they share the same scale.
If you need to compile multialign into a standalone application or software component using MATLAB® Compiler™, use a numeric matrix instead of the scoring matrix name.
Example: "BLOSUM62" or 'BLOSUM62' specifies a BLOSUM scoring matrix with a percent identity level of 62, and includes a scale factor.
Example: ["pam150","pam200","pam250"] or {'pam150','pam200','pam250'} specifies a series of three PAM scoring matrices.
Example: blosum(62) specifies the numeric matrix returned by the blosum function, and does not include a scale factor.
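A user-defined series can be assembled as an M-by-M-by-N array. The sketch below builds one from three BLOSUM matrices and rescales each to bits so that the series shares one scale. It assumes that the second output of blosum is a structure with a Scale field, and the identity levels 80, 62, and 30 are illustrative choices.

```matlab
% Read the example protein sequences shipped with Bioinformatics Toolbox.
p53 = fastaread('p53samples.txt');

% Identity levels, ordered from the smallest to the largest distance.
levels = [80 62 30];

% Build the M-by-M-by-N array, multiplying each matrix by its own scale
% factor so that every matrix in the series is expressed in bits.
[first,info] = blosum(levels(1));
series = zeros([size(first) numel(levels)]);
series(:,:,1) = first * info.Scale;
for k = 2:numel(levels)
    [M,info] = blosum(levels(k));
    series(:,:,k) = M * info.Scale;
end

% Use the numeric series instead of scoring matrix names.
ma = multialign(p53,'ScoringMatrix',series);
```

A numeric series like this is also the form to use when compiling with MATLAB Compiler, since matrix names are not supported there.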
SMInterp
— Use linear interpolation of scoring matrices
true or 1 (default) | false or 0
Use linear interpolation of the scoring matrices, specified as a numeric or logical true (1) or false (0). When SMInterp is false, each scoring matrix is assigned to a fixed range depending on the distances between the two profiles or sequences being aligned.
GapOpen
— Initial penalty for opening gap
positive scalar | function handle
Initial penalty for opening a gap, specified as a positive scalar or a function handle.
If you enter a function, multialign passes four values to the function: the average score for two matched residues (sm), the average score for two mismatched residues (sx), and the lengths of the two profiles or sequences (len1, len2). By default, multialign uses the function handle @(sm,sx,len1,len2) 5*sm, which sets the initial penalty for opening a gap to five times the average score for two matched residues. Although the default function does not depend on sx, len1, or len2, your custom function can use these values.
Data Types: double
ExtendGap
— Initial penalty for extending gap
positive scalar | function handle
Initial penalty for extending a gap, specified as a positive scalar or a function handle. If you specify this value, the function uses the affine gap penalty scheme, that is, it scores the first gap using the GapOpen value and scores subsequent gaps using the ExtendGap value. If you do not specify this value, the function scores all gaps equally, using the GapOpen penalty.
If you enter a function, multialign passes four values to the function: the average score for two matched residues (sm), the average score for two mismatched residues (sx), and the lengths of the two profiles or sequences (len1, len2). By default, multialign uses the function handle @(sm,sx,len1,len2) sm/4, which sets the initial penalty for extending a gap to one-fourth the average score for two matched residues. Although the default function does not depend on sx, len1, or len2, your custom function can use these values.
Data Types: double
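Both gap penalties accept either a scalar or a function handle with the four-input signature described above. The sketch below shows each form. The scalar values and the handles openFcn and extendFcn are hypothetical choices for illustration, not recommended settings.

```matlab
% Short nucleotide sequences from the earlier example.
seqs = {'CACGTAACATCTC','ACGACGTAACATCTTCT','AAACGTAACATCTCGC'};

% Fixed scalar penalties (illustrative values).
maScalar = multialign(seqs,'GapOpen',10,'ExtendGap',1);

% Custom penalties as functions of the average match score (sm), the
% average mismatch score (sx), and the two lengths (len1, len2).
openFcn   = @(sm,sx,len1,len2) 4*sm;   % milder than the default 5*sm
extendFcn = @(sm,sx,len1,len2) sm/8;   % milder than the default sm/4
maFcn = multialign(seqs,'GapOpen',openFcn,'ExtendGap',extendFcn);
```

Function handles keep the penalties proportional to the scoring matrix in use, which matters when the matrix changes along a series.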
DelayCutoff
— Threshold delay of divergent sequences
numeric scalar
Threshold delay of divergent sequences, specified as a numeric scalar. The multialign function delays the alignment of divergent sequences whose closest neighbor is farther than:
(DelayCutoff) * (median patristic distance between sequences)
The default value is 1, which delays sequences whose closest neighbor is farther away than the median distance.
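The threshold in the formula above can be estimated from a guide tree with the pdist method of phytree objects. This is a rough sketch for inspection only: it assumes the median is taken over all pairwise patristic distances between leaves, which the description does not spell out, so the value may not match the one multialign uses internally.

```matlab
% Build a guide tree as in the first example.
p53  = fastaread('p53samples.txt');
dist = seqpdist(p53,'ScoringMatrix','GONNET');
tree = seqlinkage(dist,'average',p53);

% Patristic (along-the-tree) distances between every pair of leaves.
patristic = pdist(tree);

% Estimated delay threshold for the default cutoff of 1.
delayCutoff = 1;
threshold = delayCutoff * median(patristic);

% Run the alignment with the same cutoff.
ma = multialign(p53,tree,'DelayCutoff',delayCutoff);
```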
UseParallel
— Use parallel computation
false or 0 (default) | true or 1
Use parallel computation of the pairwise alignments, specified as a numeric or logical false (0) or true (1).
If true, and Parallel Computing Toolbox™ is installed, then computation occurs using parfor-loops.
If a parpool is open, then the computation uses the open parpool and occurs in parallel.
If there is no open parpool, but automatic creation is enabled in the Parallel Preferences, then the default pool is opened automatically and computation occurs in parallel.
If there is no open parpool and automatic creation is disabled, then computation uses parfor-loops in serial mode.
If Parallel Computing Toolbox is not installed, then computation uses parfor-loops in serial mode.
If false, then the computation uses for-loops in serial mode.
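A minimal sketch of requesting parallel computation, assuming the p53samples.txt file from the earlier example. As described above, the call falls back to serial execution when no pool is available or Parallel Computing Toolbox is not installed.

```matlab
% Read the example protein sequences shipped with Bioinformatics Toolbox.
p53 = fastaread('p53samples.txt');

% Request parallel computation of the pairwise alignments.
ma = multialign(p53,'UseParallel',true);
```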
Verbose
— Display sequences with sequence information
false or 0 (default) | true or 1
Display the sequences with sequence information, specified as a numeric or logical false (0) or true (1).
ExistingGapAdjust
— Control automatic adjustment based on existing gaps
true or 1 (default) | false or 0
Control automatic adjustment based on existing gaps, specified as a numeric or logical true (1) or false (0).
When true, for every profile position, multialign proportionally lowers the penalty for opening a gap toward the penalty of extending a gap based on the proportion of gaps found in the contiguous symbols and on the weight of the input profile.
When false, multialign turns off the automatic adjustment, based on existing gaps, of the position-specific penalties for opening a gap.
This argument is analogous to the ExistingGapAdjust argument of the profalign function and is applied at every step of the progressive alignment of profiles.
TerminalGapAdjust
— Adjust penalty for opening gap at ends of sequence
false or 0 (default) | true or 1
Adjust the penalty for opening a gap at the ends of the sequence, specified as a numeric or logical false (0) or true (1). When true, the multialign function adjusts the penalty for opening a gap at the ends of the sequence to be equal to the penalty for extending a gap.
This argument is analogous to the TerminalGapAdjust argument of the profalign function and is applied at every step of the progressive alignment of profiles.
Output Arguments
SeqsMultiAligned
— Aligned sequences
cell array of character vectors | vector of strings | matrix of characters | vector of structures
Aligned sequences, returned as a cell array of character vectors, vector of strings, matrix of characters, or vector of structures. The format of SeqsMultiAligned matches the format of the input sequences to align, Seqs.
When Seqs is a cell array of character vectors, vector of strings, or matrix of characters, the output alignment in SeqsMultiAligned follows the same order as the input.
When Seqs is a vector of structures, the Sequence field of SeqsMultiAligned is updated with the alignment. Other fields of SeqsMultiAligned match the fields of Seqs.
Extended Capabilities
Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.
To run in parallel, set 'UseParallel' to true.
For more information, see the 'UseParallel' name-value pair argument.
Version History
Introduced before R2006a
See Also
align2cigar
| hmmprofalign
| multialignread
| multialignwrite
| nwalign
| profalign
| seqprofile
| seqconsensus
| seqneighjoin