
seqtrim

Trim sequences based on specified criterion

Description

seqtrim(fastqFile) trims the sequences in fastqFile and saves the trimmed sequences in new FASTQ files. By default, each output file name is the input file name with the suffix '_trimmed' appended. If you do not specify a trimming criterion, the function uses the default criterion.


seqtrim(fastqFile,Name,Value) uses additional options specified by one or more Name,Value pair arguments.


[outFiles,nSeqTrimmed,nSeqUntrimmed] = seqtrim(___) returns a cell array outFiles with the names of the output files. nSeqTrimmed and nSeqUntrimmed are the numbers of sequences that were trimmed and left untrimmed, respectively, in each input file.


Examples


Trim each sequence when the number of bases with quality below 20 is greater than 3 within a sliding window of size 25.

[outFile,nt,unt] = seqtrim('SRR005164_1_50.fastq','Method','MaxNumberLowQualityBases', ...
                     'Threshold',[3 20],'WindowSize',25);

Check the number of sequences that were trimmed.

nt
nt = 
36

Check the number of sequences that were left untrimmed.

unt
unt = 
14

Trim the first 10 bases of each sequence.

[outfile,nt] = seqtrim('SRR005164_1_50.fastq','Method','Termini', ...
                     'Threshold',[10 0]);

Trim the last 5 bases.

[outfile,nt] = seqtrim('SRR005164_1_50.fastq','Method','Termini', ...
                     'Threshold',[0 5]);

Trim each sequence after position 50, keeping bases 1 through 50.

[outfile,nt] = seqtrim('SRR005164_1_50.fastq','Method','BasePositions', ...
                     'Threshold',[1 50]);

Trim each sequence when the running average base quality becomes less than 20.

[outFile,nt,unt] = seqtrim('SRR005164_1_50.fastq','Method','MeanQuality', ...
     'Threshold',20)

Trim each sequence when the percentage of bases with quality below 10 exceeds 15%.

[outFile,nt,unt] = seqtrim('SRR005164_1_50.fastq','Method','MaxPercentLowQualityBases', ...
     'Threshold',[15 10])

Input Arguments


fastqFile

Names of FASTQ-formatted files with sequence and quality information, specified as a character vector, string, string vector, or cell array of character vectors.

Example: 'SRR005164_1_50.fastq'

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
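As a brief illustration of the two syntaxes, the following sketch shows the same request written both ways. It reuses the sample file from the Examples section. The second form is left as a comment because both calls would write to the same default output file name.

```matlab
% R2021a and later: Name=Value syntax
[outFile, nt] = seqtrim('SRR005164_1_50.fastq', Method='Termini', Threshold=[10 0]);

% Any release: comma-separated pairs with quoted names (equivalent call).
% [outFile, nt] = seqtrim('SRR005164_1_50.fastq', 'Method', 'Termini', 'Threshold', [10 0]);
```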

Example: 'Method','MaxNumberLowQualityBases','Threshold',[3 20] specifies to trim each sequence when the number of bases with quality below 20 is greater than 3.

Method

Criterion to trim sequences, specified as one of the following options. Specify only one trimming criterion per function call.

  • 'MaxNumberLowQualityBases' – Applies a maximum threshold on the number of low-quality bases allowed before trimming a sequence starting at the 5' end.

  • 'MaxPercentLowQualityBases' – Applies a maximum threshold on the percentage of low-quality bases allowed before trimming a sequence starting at the 5' end.

  • 'MeanQuality' – Applies a minimum threshold on the running average base quality allowed before trimming a sequence starting at the 5' end.

  • 'BasePositions' – Trims each sequence according to the base positions (first base and last base) starting at the 5' end.

  • 'Termini' – Trims each sequence from either the 5' or 3' end or from both ends.

Use this name-value pair argument together with 'Threshold' to specify the appropriate threshold value. Depending on the trimming criterion, the corresponding value for 'Threshold' varies. See the 'Threshold' option for the default values.

Note

Sequences that become empty after trimming are saved in the output files as empty sequences. To remove empty sequences from files, use the seqfilter function with the 'MinLength' option set to 1.
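The note above can be sketched as a two-step workflow. This is an illustration only: it assumes that seqfilter accepts 'MinLength' as a 'Method' value together with a 'Threshold', and it reuses the sample file from the Examples section.

```matlab
% Trim 10 bases from each end; very short reads may become empty.
trimmedFile = seqtrim('SRR005164_1_50.fastq', 'Method', 'Termini', 'Threshold', [10 10]);

% Keep only sequences with at least one base (assumed seqfilter call form).
filteredFile = seqfilter(trimmedFile, 'Method', 'MinLength', 'Threshold', 1);
```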

Threshold

Threshold value for the trimming criterion, specified as a scalar or vector. Use this name-value pair to define the threshold value for the trimming criterion specified by 'Method'.

Depending on the trimming criterion, the corresponding value for 'Threshold' can be a scalar or two-element vector. If you do not specify 'Threshold', then the function uses the default threshold value of the corresponding method. For each trimming criterion, the function uses the encoding format of the base quality specified by the 'Encoding' name-value pair argument.

'Method': 'MaxNumberLowQualityBases'
'Threshold': Two-element vector [V1 V2]. V1 is a nonnegative integer that specifies the maximum number of low-quality bases allowed before trimming. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base.
Default 'Threshold' value: [0 10]

'Method': 'MaxPercentLowQualityBases'
'Threshold': Two-element vector [V1 V2]. V1 is a scalar between 0 and 100 that specifies the maximum percentage of low-quality bases allowed before trimming. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base.
Default 'Threshold' value: [0 10]

'Method': 'MeanQuality'
'Threshold': Positive scalar that specifies the minimum threshold on the running average base quality allowed before trimming a sequence starting at the 5' end.
Default 'Threshold' value: 0

'Method': 'BasePositions'
'Threshold': Two-element vector [V1 V2], where V1 and V2 are positive integers specifying the base positions to start trimming at the 5' end and 3' end, respectively. To trim only the 5' end of each sequence before position V1, use [V1 Inf]. To trim only the 3' end of each sequence after position V2, use [1 V2].
Default 'Threshold' value: [1 Inf], that is, each sequence is left untrimmed.

'Method': 'Termini'
'Threshold': Two-element vector [V1 V2], where V1 and V2 are nonnegative integers specifying the number of bases to trim at the 5' end and the 3' end, respectively. To trim V1 bases at the 5' end only, use [V1 0]. To trim V2 bases at the 3' end only, use [0 V2].
Default 'Threshold' value: [0 0], that is, each sequence is left untrimmed.
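To make the difference between the 'Termini' and 'BasePositions' thresholds concrete, the following sketch applies both to a hypothetical 10-base read using plain indexing. It only illustrates the semantics described above and is not the function's implementation.

```matlab
seq = 'ACGTACGTAC';                   % hypothetical 10-base read

% 'Termini' with Threshold [2 3]: drop 2 bases at the 5' end and 3 at the 3' end.
t = [2 3];
termini = seq(t(1)+1 : end-t(2));     % 'GTACG'

% 'BasePositions' with Threshold [3 8]: keep positions 3 through 8.
p = [3 8];
lastPos = min(p(2), numel(seq));      % a V2 of Inf keeps the 3' end intact
basePos = seq(p(1) : lastPos);        % 'GTACGT'
```

With 'Termini' the values count bases to remove, whereas with 'BasePositions' they name the first and last positions to keep.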

WindowSize

Size of the sliding window to apply the trimming criterion to a sequence, specified as a positive integer. The size of the window corresponds to the number of bases that the function uses at one time to apply the criterion. Any given sequence is trimmed before the first base of the window that violates the given criterion.

The sliding window can be applied to the following methods:

  • 'MaxNumberLowQualityBases',

  • 'MaxPercentLowQualityBases', and

  • 'MeanQuality'.

Note

Sequences shorter than the size of the window are saved in the output file as empty sequences. To remove empty sequences from files, use the seqfilter function with the 'MinLength' option set to the value of 1.
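The sliding-window behavior described above can be sketched in a few lines for the 'MaxNumberLowQualityBases' criterion. The quality scores, window size, and threshold are hypothetical, and the loop is an illustration of the documented rule, not the function's actual implementation. It assumes a read with no violating window is left untrimmed.

```matlab
q   = [30 32 31 12 8 9 30 5 7 6];  % hypothetical Phred scores for one read
w   = 4;                           % 'WindowSize'
thr = [2 20];                      % [V1 V2]: more than 2 bases below quality 20 triggers trimming

keep = numel(q);                   % assumed default: no window violates, read left untrimmed
if numel(q) < w
    keep = 0;                      % reads shorter than the window come out empty
else
    for s = 1:numel(q) - w + 1
        if sum(q(s:s+w-1) < thr(2)) > thr(1)
            keep = s - 1;          % trim before the first base of the violating window
            break
        end
    end
end
trimmedQ = q(1:keep);              % [30 32]
```

Here the window starting at position 3 is the first to contain three low-quality bases, so only the first two bases are kept.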

Encoding

Base quality encoding format, specified as a character vector or string.
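The encoding determines how the characters in a FASTQ quality line map to numeric base qualities. As a minimal sketch, assuming a Phred+33 encoding (a common convention, not necessarily the one used by your files), a hypothetical quality string decodes as follows:

```matlab
qualStr = 'I5+!';                  % hypothetical FASTQ quality string
phred   = double(qualStr) - 33;    % assumes a Phred+33 encoding; gives [40 20 10 0]
```

A 'Threshold' quality value such as 20 is compared against these decoded scores, so choosing the wrong encoding shifts every comparison.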

OutputDir

Relative or absolute path to the output file directory, specified as a character vector or string. The default is the current directory.

Example: 'OutputDir','F:\results'

Suffix to use in the output file name, specified as a character vector or string. It is inserted after the input file name and before the file extension. The default is '_trimmed'.
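The naming rule above can be reproduced with fileparts. This sketch only shows how the suffix is placed between the file name and the extension; it does not call seqtrim.

```matlab
inFile = 'SRR005164_1_50.fastq';
suffix = '_trimmed';                   % default suffix
[~, name, ext] = fileparts(inFile);    % split off the extension
outName = [name suffix ext];           % 'SRR005164_1_50_trimmed.fastq'
```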

UseParallel

Boolean indicating whether to perform computation in parallel, specified as true or false.

For parallel computing, you must have Parallel Computing Toolbox™. If a parallel pool does not exist, one is created automatically when the auto-creation option is enabled in your parallel preferences. Otherwise, computation runs in serial mode.

Note

  • There is a cost associated with sharing large input files across workers in a distributed environment. In some cases, running in parallel may not be beneficial in terms of performance.

  • During parallel computations, the work is divided by files, not by sequences, so running in parallel provides no speedup for a single large file.

Example: 'UseParallel',true
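Because the work is divided by files, parallel execution is most likely to help when you pass several input files in one call. The following sketch uses hypothetical file names and assumes that Parallel Computing Toolbox is available.

```matlab
files = {'sample_A.fastq', 'sample_B.fastq'};   % hypothetical file names

% Each file can be processed by a different worker.
[outFiles, nt, unt] = seqtrim(files, 'Method', 'MeanQuality', 'Threshold', 20, ...
                              'UseParallel', true);

% nt(k) and unt(k) correspond to files{k}.
```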

Flag to overwrite existing files, specified as a numeric or logical 1 (true) or 0 (false).

When the value is false and a file matching one of the output file names already exists, the function generates an error.

Data Types: double | logical

Output Arguments


outFiles

Output file names, returned as a cell array of character vectors.

nSeqTrimmed

Number of sequences trimmed in each input file, returned as a scalar or an n-by-1 vector, where n is the number of input files. If there are multiple input files, the order within nSeqTrimmed corresponds to the order of the input files.

nSeqUntrimmed

Number of sequences left untrimmed in each input file, returned as a scalar or an n-by-1 vector, where n is the number of input files. If there are multiple input files, the order within nSeqUntrimmed corresponds to the order of the input files.


Version History

Introduced in R2016b