mRNA Markup Workflow Overview
We present the mRNA Markup workflow as a realistic workflow example created within the BioExtract Server. The workflow performs comprehensive annotation and primary analysis of a set of transcripts. Such sets are currently generated in abundance from assemblies of EST sequences, and as the read lengths of newer sequencing technologies increase, similar assemblies of RNA-Seq data will become available for many species and sampling conditions. Using NCBI BLAST+, MuSeqBox (a program for multi-query sequence BLAST output examination, developed by the Brendel Group), and Linux shell scripting, the workflow partitions an input transcript set into several categories, including contaminants (sequencing artifacts), potential chimeras, likely full-length protein-coding mRNAs, miRNAs, and potentially novel transcripts for further analysis with other programs.
mRNA Markup Workflow Tools
- mRNA_Markup_Start
- This tool simply accepts the initial mRNA input file as well as the bacterial, reference protein, and comprehensive protein databases and passes them to the appropriate steps.
- blastn, blastx, rpstblastn
- The standard NCBI BLAST+ programs from http://blast.ncbi.nlm.nih.gov/
- MuSeqBox
- MuSeqBox is a program designed for multi-query sequence BLAST output examination. It parses the BLAST output, extracts the informative parameters of each BLAST hit, and saves them in tabular form. The hit tables can be analyzed further to produce subsets of BLAST hits according to user-specified criteria. More information is available at http://www.plantgdb.org/MuSeqBox/help.php. Input may be either a BLAST report or another MuSeqBox file. Output is a tabular description of BLAST hits, stored in a .msb file.
- MuSeqBox_Partition
- This program separates the query sequences that produced BLAST hits from those that did not. The "-label" parameter labels the kind of hits being described. For example, the label "BC" denotes bacterial contamination; with this label, MuSeqBox_Partition produces BC-mRNA.fas and not_BC-mRNA.fas. Input consists of a MuSeqBox output file and a file containing the query sequences for the BLAST search whose output was processed by MuSeqBox. Output consists of two FASTA files, one containing the sequences that correspond to BLAST hits and one containing the sequences that do not.
- mRNA_Markup_Summary
- This tool performs the final step of the workflow. It produces a summary report (Summary.txt) detailing how many sequences were matched during each step as well as how many potentially novel sequences remain. The matched and unmatched sequences themselves are presented as well. Input consists of the FASTA files produced in the previous steps.
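The partitioning idea behind MuSeqBox_Partition can be illustrated with a minimal Python sketch. This is not the tool's actual implementation: the function names `read_fasta` and `partition_by_hits` are hypothetical, the set of hit IDs is assumed to have been extracted from the .msb file already (the .msb layout is not described here), and a sequence ID is assumed to be the first whitespace-delimited token of the FASTA header.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def partition_by_hits(
    fasta_lines: Iterable[str], hit_ids: Set[str], label: str
) -> Dict[str, str]:
    """Split records into '<label>-mRNA.fas' and 'not_<label>-mRNA.fas'.

    Returns a dict mapping each output file name to its FASTA text.
    """
    hit_name = f"{label}-mRNA.fas"
    rest_name = f"not_{label}-mRNA.fas"
    out: Dict[str, List[str]] = {hit_name: [], rest_name: []}
    for header, seq in read_fasta(fasta_lines):
        parts = header.split()
        seq_id = parts[0] if parts else ""  # assumed ID convention
        target = hit_name if seq_id in hit_ids else rest_name
        out[target].append(f">{header}\n{seq}\n")
    return {name: "".join(records) for name, records in out.items()}
```

Calling `partition_by_hits(open("mRNA.fas"), {"seq1"}, "BC")` would, under these assumptions, return the text for BC-mRNA.fas and not_BC-mRNA.fas, mirroring the file names described above.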
Sample Input
The workflow uses a sample input file consisting of Arabidopsis mRNA and searches the following BLAST databases by default:
- Vector Database: UniVec
- Bacteria Database: E. coli from NCBI Nucleotide
- Reference Database: ATpepTAIR10 (A set of Arabidopsis protein sequences available at http://www.plantgdb.org/XGDB/phplib/download.php?GDB=At)
- All Protein Database: UniRef90-Viridiplantae (http://www.uniprot.org/uniref/?query=identity:0.9+taxonomy:33090&format=*&compress=yes)
- Protein Domain Database: NCBI’s CDD
Detailed Breakdown
The workflow consists of several “steps”, each of which comprises several analytic tools. Each step reads an mRNA input and runs BLAST+ to identify, for example, bacterial contamination or reference protein hits. MuSeqBox then tabulates the hits, and MuSeqBox_Partition splits the input sequences into those that produced hits and those that did not. Sequences that did not produce hits are used as input to subsequent steps.
Searches for vector contamination are always performed against UniVec, and protein domain searches are always performed against CDD. Other BLAST+ subject databases are user-selectable.
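The cascade described above, in which each BLAST+ search takes the unmatched sequences of the previous step as its query, can be sketched as a small planning function. This is an illustration only: the database names are placeholders, the function `plan_commands` is hypothetical, and only the basic BLAST+ options `-query`, `-db`, and `-out` are shown because the search parameters the workflow actually uses are not stated here. Steps 3.1 and 3.2 branch off the reference protein matches and are left out.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    label: str     # label used in the partitioned FASTA file names
    program: str   # BLAST+ program run in this step
    database: str  # placeholder database name (assumption)
    report: str    # name of the BLAST output passed on to MuSeqBox


# Main chain of the workflow in execution order.
MAIN_CHAIN = [
    Step("VC", "blastn", "UniVec", "blastn_Vector"),
    Step("BC", "blastn", "bacteria_db", "blastn_Bacteria"),
    Step("RA", "blastx", "reference_protein_db", "blastx_RefProt"),
    Step("AA", "blastx", "all_protein_db", "blastx_AllProt"),
    Step("CD", "rpstblastn", "CDD", "rpstblastn_ProtDomain"),
]


def plan_commands(initial_fasta: str, chain: List[Step] = MAIN_CHAIN) -> List[List[str]]:
    """Return one BLAST+ argument list per step, chaining the 'not_' outputs."""
    commands = []
    query = initial_fasta
    for step in chain:
        commands.append(
            [step.program, "-query", query, "-db", step.database, "-out", step.report]
        )
        # The unmatched sequences of this step feed the next search.
        query = f"not_{step.label}-mRNA.fas"
    return commands


if __name__ == "__main__":
    for cmd in plan_commands("input-mRNA.fas"):
        print(" ".join(cmd))
```

The function only builds argument lists and runs nothing, so the data flow can be inspected without BLAST+ installed.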
- Step 0: Input sequence submission
  - mRNA_Markup_Start
    - Inputs
      - Initial mRNA input file
      - Bacterial contamination DB: a file representing typical bacterial hosts (such as E. coli) in a sequencing project
      - Ref. Protein DB: a reference protein set (proteins most likely to have homologs in the mRNA translations of the input)
      - All Protein DB: a comprehensive protein set (to be searched when the reference protein set did not give hits)
    - Outputs
      - mRNA input file
- Step 1: Eliminate vector contamination
  - blastn
    - Inputs
      - Query: Initial mRNA input file from Step 0
      - Subject Database: Vector Database UniVec
    - Outputs
      - blastn_Vector
  - MuSeqBox
    - Inputs
      - mRNA File: blastn_Vector
    - Outputs
      - VC.msb
  - MuSeqBox_Partition
    - Inputs
      - mRNA File: blastn input query from Step 1
      - MuSeqBox File: VC.msb
    - Outputs
      - VC-mRNA.fas: sequences likely resulting from vector contamination
      - not_VC-mRNA.fas: remaining sequences
- Step 2: Eliminate bacterial contamination
  - blastn
    - Inputs
      - Query: not_VC-mRNA.fas
      - Subject Database: Bacteria Database from Step 0
    - Outputs
      - blastn_Bacteria
  - MuSeqBox
    - Inputs
      - mRNA File: blastn_Bacteria
    - Outputs
      - BC.msb
  - MuSeqBox_Partition
    - Inputs
      - mRNA File: not_VC-mRNA.fas
      - MuSeqBox File: BC.msb
    - Outputs
      - BC-mRNA.fas: sequences likely resulting from bacterial contamination
      - not_BC-mRNA.fas: remaining sequences
- Step 3: Find matches in a reference protein database
  - blastx
    - Inputs
      - Query: not_BC-mRNA.fas
      - Subject: Reference Protein Database from Step 0
    - Outputs
      - blastx_RefProt
  - MuSeqBox
    - Inputs
      - mRNA File: blastx_RefProt
    - Outputs
      - RA.msb
  - MuSeqBox_Partition
    - Inputs
      - mRNA File: not_BC-mRNA.fas
      - MuSeqBox File: RA.msb
    - Outputs
      - RA-mRNA.fas: sequences likely matching the reference protein set
      - not_RA-mRNA.fas: remaining sequences
- Step 3.1: Identify potential full-length coding sequences
  - MuSeqBox
    - Inputs
      - RA.msb
    - Outputs
      - fullcds.txt
  - MuSeqBox_Partition
    - Inputs
      - mRNA File: RA-mRNA.fas
      - MuSeqBox File: fullcds.txt
    - Outputs
      - FL-mRNA.fas: potential full-length coding sequences
      - not_FL-mRNA.fas: remaining sequences
- Step 3.2: Identify potential chimeric sequences
  - MuSeqBox
    - Inputs
      - RA.msb
    - Outputs
      - PC.msb
  - MuSeqBox_Partition
    - Inputs
      - mRNA File: not_FL-mRNA.fas
      - MuSeqBox File: PC.msb
    - Outputs
      - PC-mRNA.fas
      - not_PC-mRNA.fas
- Step 4: Find matches in a comprehensive protein database
  - blastx
    - Inputs
      - Query: not_RA-mRNA.fas
      - Subject: All Protein Database from Step 0
    - Outputs
      - blastx_AllProt
  - MuSeqBox
    - Inputs
      - blastx_AllProt
    - Outputs
      - AA.msb
  - MuSeqBox_Partition
    - Inputs
      - mRNA File: not_RA-mRNA.fas
      - MuSeqBox File: AA.msb
    - Outputs
      - AA-mRNA.fas
      - not_AA-mRNA.fas
- Step 5: Find matches in a protein domain database
  - rpstblastn
    - Inputs
      - Query: not_AA-mRNA.fas
      - Subject: CDD
    - Outputs
      - rpstblastn_ProtDomain
  - MuSeqBox
    - Inputs
      - rpstblastn_ProtDomain
    - Outputs
      - CD.msb
  - MuSeqBox_Partition
    - Inputs
      - mRNA File: not_AA-mRNA.fas
      - MuSeqBox File: CD.msb
    - Outputs
      - CD-mRNA.fas
      - not_CD-mRNA.fas
- Step 6: Produce summary report
  - mRNA_Markup_Summary
    - Inputs
      - Original mRNA input: blastn input query from Step 1
      - Vector-contaminated sequences: VC-mRNA.fas
      - Bacteria-contaminated sequences: BC-mRNA.fas
      - Sequences matching ReferenceDB: RA-mRNA.fas
      - Full-length coding sequences: FL-mRNA.fas
      - Not full-length coding sequences: not_FL-mRNA.fas
      - Potential chimeric sequences: PC-mRNA.fas
      - Not potential chimeric sequences: not_PC-mRNA.fas
      - Sequences matching AllProteinDB: AA-mRNA.fas
      - Sequences matching ProteinDomainDB: CD-mRNA.fas
      - Remaining sequences: not_CD-mRNA.fas
    - Outputs
      - Summary.txt
      - The FASTA files submitted as input are also presented here.
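The counting performed in the summary step can be approximated with a short Python sketch that tallies FASTA records per category. It is an assumption-laden illustration rather than the mRNA_Markup_Summary tool itself: the function names are hypothetical, and the actual layout of Summary.txt is not described here, so the two-column report format below is invented for the example.

```python
from typing import Dict, Iterable


def count_fasta_records(lines: Iterable[str]) -> int:
    """Count sequences by counting FASTA header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


def summarize(categories: Dict[str, Iterable[str]]) -> str:
    """Build a two-column text report: category name, number of sequences.

    `categories` maps a descriptive name to the lines of one FASTA file.
    The report layout is illustrative, not the real Summary.txt format.
    """
    rows = [(name, count_fasta_records(lines)) for name, lines in categories.items()]
    width = max((len(name) for name, _ in rows), default=0)
    return "\n".join(f"{name.ljust(width)}  {count}" for name, count in rows)
```

For instance, `summarize({"Vector-contaminated": open("VC-mRNA.fas"), "Remaining sequences": open("not_CD-mRNA.fas")})` would list how many sequences fell into each of those two categories.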