bioinfo.pipeline.block.SRAFasterqDump

Download NGS data from SRA

Since R2024a

Description

An SRAFasterqDump block enables you to download sequence read data in the FASTQ or FASTA format from SRA (Sequence Read Archive) [1].

bioinfo.pipeline.block.SRAFasterqDump requires the SRA Toolkit for Bioinformatics Toolbox™. If this support package is not installed, then the function provides a download link. For details, see Bioinformatics Toolbox Software Support Packages.

Creation

Syntax

b = bioinfo.pipeline.block.SRAFasterqDump

b = bioinfo.pipeline.block.SRAFasterqDump(options)

b = bioinfo.pipeline.block.SRAFasterqDump(Name=Value)

Description

b = bioinfo.pipeline.block.SRAFasterqDump creates an SRAFasterqDump block.

b = bioinfo.pipeline.block.SRAFasterqDump(options) uses additional options specified by options.

b = bioinfo.pipeline.block.SRAFasterqDump(Name=Value) specifies additional options using one or more name-value arguments. For example, you can retrieve the FASTA-formatted file by using the FastaOutput name-value argument. The name-value arguments set the corresponding properties of an SRAFasterqDumpOptions object, which is then assigned to the Options property of the block.

Input Arguments

`options` — `SRAFasterqDump` options
`SRAFasterqDumpOptions` | string scalar | character vector

SRAFasterqDump options, specified as an SRAFasterqDumpOptions object, string scalar, or character vector.

If you specify a string scalar or character vector, it must be in the fasterq-dump original syntax (prefixed by a dash).

Data Types: char | string
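
The two forms of the options argument can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes the SRA Toolkit support package is installed and that the installed fasterq-dump version accepts the native --fasta and --threads flags used in the second form.

```matlab
% Form 1: pass an SRAFasterqDumpOptions object.
opt = SRAFasterqDumpOptions;
opt.FastaOutput = true;   % request FASTA instead of FASTQ
opt.NumThreads = 8;       % use 8 threads instead of the default 6
b1 = bioinfo.pipeline.block.SRAFasterqDump(opt);

% Form 2: pass the options in the native fasterq-dump syntax.
% Assumption: --fasta and --threads are accepted by the installed toolkit.
b2 = bioinfo.pipeline.block.SRAFasterqDump("--fasta --threads 8");
```

The object form lets MATLAB validate property names and values as you set them, while the native string is convenient when you already have a working fasterq-dump command line.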

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: b = bioinfo.pipeline.block.SRAFasterqDump(FastaOutput=true) specifies to download the FASTA-formatted file.

`AppendOutputFile` — Flag to append new data to the output file
`false` or 0 (default) | `true` or 1

Flag to append new data to the output file instead of overwriting it, specified as a numeric or logical 1 (true) or 0 (false). By default, the output file is overwritten with new data.

Data Types: double | logical

`ConcatenateReads` — Flag to concatenate sequence information pertaining to each spot
`false` or 0 (default) | `true` or 1

Flag to concatenate sequence information pertaining to each spot, specified as a numeric or logical 1 (true) or 0 (false). By default, the software does not concatenate the information pertaining to each spot. If the value is true, the software does not split the spots and, for each spot, writes four lines of FASTQ or two lines of FASTA into one output file. For details, see FASTQ/FASTA concatenated.

Data Types: double | logical

`ExtraCommand` — Additional commands
`""` (default) | character vector | string scalar

Additional commands, specified as a character vector or string scalar.

The commands must be in the native syntax (prefixed by one or two dashes). Use this option to apply undocumented flags and flags without corresponding MATLAB® properties.

Example: ExtraCommand="--fasta-ref-tbl --internal-ref"

Data Types: char | string

`FastaOutput` — Flag to save output in FASTA format
`false` or 0 (default) | `true` or 1

Flag to save the output in the FASTA format, specified as a numeric or logical 1 (true) or 0 (false). The default output format is the FASTQ format.

Data Types: double | logical

`FastaOutputUnsorted` — Flag to split sequence information without preserving spot order
`false` or 0 (default) | `true` or 1

Flag to split sequence information pertaining to each spot without preserving the spot order, specified as a numeric or logical 1 (true) or 0 (false).

If the value is true, the software splits the sequence information in each spot into reads. For each read, two lines of FASTA are written into the single output file. Setting FastaOutputUnsorted=true is the same as setting SplitType="SplitSpot", with the following exceptions:

With FastaOutputUnsorted=true, the original order of the spots and reads is not preserved, and the FastaOutputUnsorted name-value argument applies to FASTA output only.
This setting is faster than the SplitSpot option and does not use temporary files.

Data Types: double | logical

`FilterByBases` — String of bases used to filter output
empty string array (default) | string scalar

String of bases used to filter the output, specified as a string scalar. The software keeps only the reads that contain the specified string of bases.

Data Types: string

`IncludeAll` — Flag to include all object properties
`false` or 0 (default) | `true` or 1

Flag to include all object properties with corresponding default values when converting properties to the original option syntax, specified as a numeric or logical 1 (true) or 0 (false). You can convert properties to the original syntax prefixed by one or two dashes (such as '-e 8 --split-file') by using the getCommand function.

When IncludeAll=false and you call getCommand(optionsObject), the software converts only the specified properties. If the value is true, getCommand converts all available properties, using default values for unspecified properties, to the original syntax.

Note

When IncludeAll is true, the software still does not translate a property whose default value is NaN, Inf, [], '', or "".

Data Types: logical | double
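
The effect of IncludeAll on getCommand can be sketched as follows. This is an illustrative sketch that assumes the SRA Toolkit support package is installed; the exact text of the returned commands depends on the installed version, so no output is shown.

```matlab
% Create an options object and modify a single property.
opt = SRAFasterqDumpOptions;
opt.NumThreads = 8;

% With IncludeAll=false (the default), only the modified property is
% converted to the native syntax.
cmdModified = getCommand(opt);
disp(cmdModified)

% With IncludeAll=true, properties left at their defaults are converted too,
% except those whose default is NaN, Inf, [], '', or "".
opt.IncludeAll = true;
cmdAll = getCommand(opt);
disp(cmdAll)
```

Comparing the two results is a quick way to check which native flags a set of property values maps to before running a download.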

`IncludeTechnical` — Flag to include technical reads in downloaded files
`false` or 0 (default) | `true` or 1

Flag to include technical reads in the downloaded files, specified as a numeric or logical 1 (true) or 0 (false).

Data Types: double | logical

`MinReadLength` — Minimum length required for read to be included
0 (default) | nonnegative integer

Minimum length required for a read to be included in the output, specified as a nonnegative integer. By default, no read is filtered out.

Data Types: double

`NumThreads` — Number of parallel threads
`6` (default) | positive integer

Number of parallel threads to use, specified as a positive integer. The software runs threads on separate processors or cores. Increasing the number of threads generally improves the runtime significantly, but also increases the memory footprint.

Data Types: double

`OutputDirectory` — Folder where output files are saved
empty string array (default) | character vector | string scalar

Folder where the output files are saved, specified as a character vector or string scalar. By default, the software saves the files in the current directory.

Data Types: char | string

`OutputFileName` — Base name of output files
empty string array (default) | character vector | string scalar

Base name of the output files, specified as a character vector or string scalar. The default base name is the accession run number.

Data Types: char | string

`SplitType` — Method used to split sequence information
`"SplitThree"` (default) | `"SplitFiles"` | `"SplitSpot"`

Method used to split sequence information pertaining to each spot, specified as one of the following:

"SplitThree" — The software splits spots into reads. For each read, the software writes four lines of FASTQ or two lines of FASTA. For spots with two reads, the software produces *_1.fastq and *_2.fastq files. The software places unmated reads in *.fastq. If the accession does not have any spot with one single read, the software does not create a *.fastq file. For details, see FASTQ/FASTA split 3.
"SplitSpot" — The software splits spots into reads. For each read, the software writes four lines of FASTQ or two lines of FASTA. All the reads are saved to a single output file. For details, see FASTQ/FASTA split spot.
"SplitFiles" — The software splits spots into reads. For each read, the software writes four lines of FASTQ or two lines of FASTA. The software assigns each read a number n, where 1 ≤ n ≤ 5, and then saves each nth read to the nth file (*_n.fastq). For details, see FASTQ/FASTA split file.

By default, the reads refer to biological reads only. However, if you set IncludeTechnical to true, then the software also includes the technical reads in the output files.

Data Types: char | string
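
As a brief sketch of how SplitType and IncludeTechnical combine, the following creates a block that writes each nth read to its own file and keeps technical reads. It assumes the SRA Toolkit support package is installed; which Reads_n outputs are actually produced depends on the accession.

```matlab
% Write the nth read of every spot to the nth file and include technical
% reads in addition to biological reads.
b = bioinfo.pipeline.block.SRAFasterqDump(SplitType="SplitFiles", ...
    IncludeTechnical=true);

% The name-value arguments are stored on the Options property of the block.
disp(b.Options.SplitType)
disp(b.Options.IncludeTechnical)
```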

Properties

`ErrorHandler` — Function to handle errors from `run` method
`[]` (default) | function handle

Function to handle errors from the run method of the block, specified as a function handle. The handle specifies the function to call if the run method encounters an error within a pipeline. For the pipeline to continue after a block fails, ErrorHandler must return a structure that is compatible with the output ports of the block. The error handling function is called with the following two inputs:

Structure with these fields:

Field	Description
identifier	Identifier of the error that occurred
message	Text of the error message
index	Linear index indicating which block process failed in the parallel run. By default, the index is 1 because there is only one run per block. For details on how block inputs can be split across different dimensions for multiple run calls, see Bioinformatics Pipeline SplitDimension.

Input structure passed to the run method when it fails

Data Types: function_handle
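
An error handler with the two inputs described above might look like the following sketch. The function name reportAndContinue is hypothetical, and returning an empty string for every output port is an assumption about what counts as a compatible structure; adjust the placeholder values to whatever your downstream blocks expect.

```matlab
b = bioinfo.pipeline.block.SRAFasterqDump;
b.ErrorHandler = @reportAndContinue;

function out = reportAndContinue(errInfo, inputs)
    % errInfo has the fields identifier, message, and index.
    % inputs is the input structure that was passed to the run method.
    ids = join(string(inputs.SRRID), ", ");
    warning("Download failed for %s (run %d): %s", ...
        ids, errInfo.index, errInfo.message);

    % Return one field per output port of the block.
    % Assumption: an empty string is an acceptable placeholder value.
    portNames = ["Reads", "Reads_1", "Reads_2", "Reads_3", "Reads_4", "Reads_5"];
    out = struct;
    for k = 1:numel(portNames)
        out.(portNames(k)) = "";
    end
end
```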

`Inputs` — Input ports
structure

This property is read-only.

Input ports of the block, specified as a structure. The field names of the structure are the names of the block input ports, and the field values are bioinfo.pipeline.Input objects. These objects describe the input port behaviors. The input port names are the expected field names of the input structure that you pass to the block run method.

The SRAFasterqDump block Inputs structure has one field, SRRID, which contains the accession numbers. This input is required and must be satisfied.

Data Types: struct
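
To illustrate how the input port name maps to the input structure, the following sketch runs the block on its own, outside a pipeline, using the emptyInputs and run object functions. It assumes the SRA Toolkit support package is installed and that a network connection is available; the accession number is the one used in the example on this page.

```matlab
b = bioinfo.pipeline.block.SRAFasterqDump;

% Create an input structure whose field names match the input ports.
in = emptyInputs(b);

% SRRID is the only input port, and it is required.
in.SRRID = "SRR11846824";

% Run the block directly. The returned structure has one field per
% output port (Reads, Reads_1, ..., Reads_5).
out = run(b, in);
disp(fieldnames(out))
```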

`Outputs` — Output ports
structure

This property is read-only.

Output ports of the block, specified as a structure. The field names of the structure are the names of the block output ports, and the field values are bioinfo.pipeline.Output objects. These objects describe the output port behaviors. The field names of the output structure returned by the block run method are the same as the output port names.

The SRAFasterqDump block Outputs structure has the following fields: Reads, Reads_1, Reads_2, Reads_3, Reads_4, Reads_5. The field values are the output filenames. The total number of output files varies depending on the SplitType option and the accession run number.

The Reads field corresponds to the single output file produced when you specify SplitType="SplitSpot". The Reads_n fields, where 1 ≤ n ≤ 5, correspond to the output files produced when you specify SplitType="SplitThree" or SplitType="SplitFiles". For details, see SplitType.

Tip

To see the actual location of the output files, get the results of the block first. Then use the unwrap function as shown in this example.

Data Types: struct

`Options` — `SRAFasterqDump` options
`SRAFasterqDumpOptions` object (default)

SRAFasterqDump options, specified as an SRAFasterqDumpOptions object. The default value is a default SRAFasterqDumpOptions object.

Object Functions

`compile`	Perform block-specific additional checks and validations
`copy`	Copy array of handle objects
`emptyInputs`	Create input structure for use with `run` method
`eval`	Evaluate block object
`run`	Run block object

Examples

Download NGS Data from SRA Using Bioinformatics Pipeline

Import the pipeline and block objects needed for the example so that you can create these objects without specifying the entire namespace.

import bioinfo.pipeline.Pipeline
import bioinfo.pipeline.block.*

Create a pipeline.

P = Pipeline;

Create an SRAFasterqDump block and specify the accession number SRR11846824 as the block input. SRR11846824 has two reads per spot and no unaligned reads.

SRAFQDump = SRAFasterqDump;
SRAFQDump.Inputs.SRRID.Value = "SRR11846824";
addBlock(P,SRAFQDump);

Run the pipeline to download the corresponding FASTQ files from SRA for the specified accession number.

run(P);

Get the results of the SRAFQDump block.

R = results(P,SRAFQDump)

R = struct with fields:
      Reads: [1×1 bioinfo.pipeline.datatype.Incomplete]
    Reads_1: [1×1 bioinfo.pipeline.datatype.File]
    Reads_2: [1×1 bioinfo.pipeline.datatype.File]
    Reads_3: [1×1 bioinfo.pipeline.datatype.Incomplete]
    Reads_4: [1×1 bioinfo.pipeline.datatype.Incomplete]
    Reads_5: [1×1 bioinfo.pipeline.datatype.Incomplete]

View the names of the downloaded files by using the unwrap function.

unwrap(R.Reads_1)
unwrap(R.Reads_2)

By default, the block uses the SplitType="SplitThree" option and downloads only biological reads. Specifically, the block splits spots into reads. For spots with two reads, the block produces *_1.fastq and *_2.fastq and displays them in the Reads_1 and Reads_2 fields, respectively. The block saves any unaligned reads in a *.fastq file and displays it in the Reads field. Because this accession has no unaligned reads, the block did not produce a *.fastq file, and the Reads field is returned as Incomplete. Reads_3, Reads_4, and Reads_5 are also Incomplete because of the usage of SplitType="SplitThree". For more details on the block output behavior, see Outputs.

You can specify other download options by using an SRAFasterqDumpOptions object. For instance, to download the FASTA-formatted file, specify FastaOutput=true and rerun the block.

opt = SRAFasterqDumpOptions;
opt.FastaOutput = true;
SRAFQDump.Options = opt;

You can also download SAM files from SRA using the SRASAMDump block.

SRASDump = SRASAMDump;

Specify the accession number to download.

SRASDump.Inputs.SRRID.Value = "SRR11846824";

Specify the options using an SRASAMDumpOptions object. For instance, set the output filename and compress the output file using bzip2.

samdumpopt = SRASAMDumpOptions;
samdumpopt.BZip2 = 1;
samdumpopt.OutputFileName = "SRR11846824.sam.bz2"

samdumpopt = 
  SRASAMDumpOptions with properties:

   Default properties:
       ExtraCommand: ""
        FastaOutput: 0
        FastqOutput: 0
               GZip: 0
      HideIdentical: 0
         IncludeAll: 0
      MinMapQuality: 0
      OutputPrimary: 0
    OutputUnaligned: 0
            Version: "3.0.6"

   Modified properties:
              BZip2: 1
     OutputFileName: "SRR11846824.sam.bz2"

SRASDump.Options = samdumpopt;

Add the block to the pipeline and run the pipeline.

addBlock(P,SRASDump);
run(P);

Get the block results.

R2 = results(P,SRASDump);

View the names of the output files by using the unwrap function.

unwrap(R2.OutputFiles)

After downloading the files, you can use them for downstream analyses. For instance, you can run bowtie2 to map the reads to the reference sequence, and then visualize the mapped reads in the Genomics Viewer app.

First, download the C. elegans reference sequence.

celegans_refseq = fastaread("https://s3.amazonaws.com/igv.broadinstitute.org/genomes/seq/ce11/ce11.fa");

Save the Chromosome 3 reference data in a FASTA file.

celegans_chr3 = celegans_refseq(3).Sequence;
fastawrite("celegans_chr3.fa",celegans_chr3);

Create a FileChooser block to select the Chromosome 3 reference file.

fcRef = FileChooser;
fcRef.Files = fullfile(pwd,"celegans_chr3.fa");
addBlock(P,fcRef);

Build a set of index files using the Bowtie2Build block. Set the base name of the index files and the name of the reference FASTA file.

buildIndex = Bowtie2Build;
buildIndex.Inputs.IndexBaseName.Value = "celegans_chr3_index";
addBlock(P,buildIndex);
connect(P,fcRef,buildIndex,["Files","ReferenceFASTAFiles"]);
run(P);

Align reads to the reference using the Bowtie2 block. Create the block and then connect it to buildIndex and SRAFQDump blocks.

alignReads = Bowtie2;
alignReads.OutFilename = "SRR11846824_mapped.sam";
addBlock(P,alignReads);
connect(P,buildIndex,alignReads,["IndexBaseName","IndexBaseName"]);
connect(P,SRAFQDump,alignReads,["Reads_1","Reads1Files";"Reads_2","Reads2Files"]);
run(P);

Bowtie2 produces a SAM file. To visualize the mapped reads in the Genomics Viewer app, convert the SAM file to a BAM file.

First, make a UserFunction block to create a BioMap object from the SAM file.

biomapObj = UserFunction;
biomapObj.Function = "BioMap";
biomapObj.RequiredArguments = "inputSAM";
biomapObj.OutputArguments = "biomapObject";
addBlock(P,biomapObj);

Next, connect the biomapObj block to the alignReads block, which provides the SAM file needed. Suppress two informational warnings issued during the creation of a BioMap object.

connect(P,alignReads,biomapObj,["SAMFile","inputSAM"]);
w = warning;
warning("off","bioinfo:BioMap:BioMap:UnsortedReadsInSAMFile");
warning("off","bioinfo:saminfo:InvalidTagField");
run(P);
warning(w); % Restore warnings

Use the write method of the BioMap object to convert the SAM file to a BAM file.

sam2bam = UserFunction;
sam2bam.Function = "write";
sam2bam.RequiredArguments = ["biomapObj","BAMFileName"];
sam2bam.NameValueArguments = "Format";
sam2bam.Inputs.BAMFileName.Value = "../../../SRR11846824_mapped.bam";
sam2bam.Inputs.Format.Value = "BAM";
addBlock(P,sam2bam);
connect(P,biomapObj,sam2bam,["biomapObject","biomapObj"]);
run(P);

Create a FileChooser block to select the generated BAM file.

fcBAM = FileChooser;
fcBAM.Files = fullfile(pwd,"SRR11846824_mapped.bam");
addBlock(P,fcBAM);

Create a FileChooser block to select the C. elegans cytoband file, which is provided with the toolbox.

fcCyto = FileChooser;
fcCyto.Files = fullfile(pwd,"celegans_cytoBandIdeo.txt.gz");
addBlock(P,fcCyto);

View the alignment data using the Genomics Viewer app.

gv = GenomicsViewer;
addBlock(P,gv);
connect(P,fcRef,gv,["Files","Reference"]);
connect(P,fcCyto,gv,["Files","Cytoband"]);
connect(P,fcBAM,gv,["Files","Tracks"]);
run(P);

Use the zoom slider to zoom in and see the features. Alternatively, enter the following in the search text box: Generated:3,711,861-3,711,940.

Delete the pipeline results and downloaded files.

deleteResults(P,IncludeFiles=true);

References

[1] SRA Toolkit Development Team. https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit

Version History

Introduced in R2024a
