CpGAT Workflow Overview

PlantGDB's Comprehensive plant Gene Annotation Tool (CpGAT) workflow allows users to annotate any genomic region using any combination of transcript and protein datasets. The pipeline uses EVM (EVidence Modeler) to evaluate transcript- and ab initio-derived exons, and incorporates PASA to derive UTR regions and alternative splice variants. The pipeline outputs a GFF3-formatted file of gene structures.

How it works: The user chooses a genome region to annotate, then selects transcript datasets (from the same species and/or related species) and/or protein datasets (from related species) based on taxonomic similarity to the genome of interest. The user also selects the splice-site model that most closely represents the species of interest. The pipeline then does the following:

  1. BLASTn/BLASTx analysis using the transcript/protein datasets to query the genome region, returning the subset of transcripts/proteins with high similarity.
  2. GenomeThreader spliced-alignment of high similarity transcripts/proteins to the genome region, outputting the longest open reading frames (ORF) and their exons.
  3. BLASTx analysis of the ORF exons to a reference protein database (e.g. UniRef90) to allow the coding sequences to be ranked by similarity to known proteins. Output is GFF3 files for high and low quality coding sequences.
  4. Ab initio genefinders (GeneMark, BGF, Augustus) are used to evaluate the genomic region (after repeat masking), outputting three GFF3 files of ab initio gene models.
  5. The respective outputs of steps 3 and 4 are used as input to EVidence Modeler (EVM), along with a weighting table, to derive optimal gene models based on a combination of evidence alignments and ab initio models. The output is a GFF3 file of optimal gene models (without UTR).
  6. The respective outputs of steps 3 and 5 are used as input to PASA, which identifies untranslated regions (UTR) from evidence alignments and creates alternative splicing models from the exon list. The output is a single GFF3 file showing all possible gene models (including UTR) that meet the threshold quality criteria.
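
The data flow between the six steps above can be sketched as a small dependency graph. This is only an illustration of the ordering described in the list, not part of CpGAT itself; the stage names are hypothetical labels, and the sketch assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Each stage maps to the stages whose output it consumes, following the
# six numbered steps above. The stage names are illustrative labels only.
DEPENDS_ON = {
    "1_blast_search": set(),
    "2_genomethreader": {"1_blast_search"},
    "3_rank_orfs_blastx": {"2_genomethreader"},
    "4_ab_initio": set(),  # runs on the repeat-masked genome region
    "5_evm": {"3_rank_orfs_blastx", "4_ab_initio"},
    "6_pasa": {"3_rank_orfs_blastx", "5_evm"},
}


def run_order(graph):
    """Return one valid execution order in which every stage follows its inputs."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for stage in run_order(DEPENDS_ON):
        print(stage)
```

Because step 4 has no upstream dependency in this graph, the ab initio branch could in principle run alongside the evidence-alignment branch (steps 1 to 3) before both meet at EVM.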

A new BioExtract Server workflow was developed in parallel with CpGAT, allowing the CpGAT pipeline to be executed and customized from within the BioExtract Server. This page documents the specialized tools created to implement the CpGAT workflow on the BioExtract Server. Here is a summary of the workflow steps. Some of these steps may be repeated at different points throughout the workflow.

CpGAT_Start
An Overview of CpGAT

An input node which feeds data to each branch of the workflow.

input: 4 files
  1. A Genomic DNA FASTA file
  2. The name of the species splice-site model
  3. Transcript data files
  4. Reference protein and repeat mask data files
output: 4 files
  1. A Genomic DNA FASTA file
  2. The name of the species splice-site model
  3. Transcript data files
  4. Reference protein and repeat mask data files
VMatch_CpGAT
Combines the mkvtree and vmatch commands. That is, mkvtree is called each time vmatch is called. This is not ideal, but is necessary because of the way tools are implemented.
input: fasta genomic sequence
output: masked (X) fasta genomic (for ab initio gene finders)
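
As an illustration of what an X-masked sequence looks like, the sketch below replaces a list of repeat intervals with X. It does not call mkvtree or vmatch; the function name and the 1-based, inclusive interval convention are assumptions made for this example.

```python
def mask_regions(sequence, regions, mask_char="X"):
    """Replace each 1-based, inclusive (start, end) region with mask_char.

    The interval convention is an assumption for this sketch; it is not
    taken from the vmatch output format.
    """
    chars = list(sequence)
    for start, end in regions:
        if start < 1 or end > len(chars) or start > end:
            raise ValueError(f"invalid region: {(start, end)}")
        # Overwrite the region in place; overlapping regions are harmless.
        chars[start - 1:end] = mask_char * (end - start + 1)
    return "".join(chars)


if __name__ == "__main__":
    print(mask_regions("ACGTACGTAC", [(3, 5)]))
```

Masking with X rather than removing the repeat keeps every remaining base at its original coordinate, so gene models predicted on the masked sequence still line up with the unmasked genome.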
BLAST_CpGAT
The standard BLAST suite (blastn, blastx, etc.), but with no pre-processed databases. The formatdb command is called on the database supplied with -d each time the tool is run. As with vmatch, this is not ideal, but it ensures the index files are available to BLAST when it runs.
input: fasta genomic (transcript, protein indices)
output: standard output format (not table format), up to evalue threshold
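
The format-then-search pattern described above might look roughly like the following. This is a sketch only: it builds argument lists for the legacy NCBI formatdb and blastall programs without running them, and the exact options passed by the BioExtract wrapper are not documented here, so the function name and the chosen options are assumptions.

```python
# Programs that search a protein database; all others search nucleotide databases.
PROTEIN_DB_PROGRAMS = {"blastp", "blastx"}


def blast_commands(program, database_fasta, query_fasta, evalue):
    """Build the formatdb call followed by the blastall call (hypothetical wrapper)."""
    is_protein = program in PROTEIN_DB_PROGRAMS
    # formatdb: -i is the FASTA file to index, -p T/F marks protein or nucleotide.
    formatdb = ["formatdb", "-i", database_fasta, "-p", "T" if is_protein else "F"]
    # blastall: -p program, -d database, -i query, -e E-value cutoff.
    blastall = [
        "blastall", "-p", program,
        "-d", database_fasta,
        "-i", query_fasta,
        "-e", str(evalue),
    ]
    return [formatdb, blastall]


if __name__ == "__main__":
    for cmd in blast_commands("blastx", "proteins.fasta", "genome.fasta", 1e-10):
        print(" ".join(cmd))
```

Each list could be handed to subprocess.run in order. Re-indexing on every run costs time for large databases, which is the trade-off the description above acknowledges.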
Solar_CpGAT
Sorting Out Local Alignment Results (SOLAR)
Parses BLAST output and combines local alignments.
input: blast output in standard format, source fasta file
output: tabular format with query ID, strand, subject ID, query start, query end, structure, OR FASTA-formatted files matching the IDs in the solar output table (RNA and protein)
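
The second output mode, FASTA records whose IDs appear in the solar table, amounts to filtering a FASTA file by a set of IDs. The sketch below shows that filtering step in isolation. It assumes the record ID is the first whitespace-delimited token of the FASTA header; the function names are hypothetical and are not taken from the Solar_CpGAT tool.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def select_by_id(lines, wanted_ids):
    """Return FASTA records whose ID (first header token) is in wanted_ids."""
    selected = []
    for header, sequence in read_fasta(lines):
        tokens = header.split()
        record_id = tokens[0] if tokens else ""
        if record_id in wanted_ids:
            selected.append(f">{header}\n{sequence}")
    return selected


if __name__ == "__main__":
    fasta = [">t1 sample transcript", "ACGT", "AC", ">t2", "GG"]
    print("\n".join(select_by_id(fasta, {"t1"})))
```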
GenomeThreader_CpGAT
The standard gth tool, using either protein or cDNA input.
input: genomic sequence (not masked), matched protein/mrna sequence, species parameters
output: GTH flat file output (not xml) e.g. tabular data, similar to gsq
GetPredictedmRNA_CpGAT
This combines the following scripts/commands:
input: gth output, genomic sequence
output: predicted mrna, fasta dna format, with fasta headers containing genomic sequence ID, evidence ID, alignment start/stop (absolute)
GetLongestORFs_CpGAT
Gets the longest ORFs whose length is at least the length specified by -minsize.
getorf
Gets all open reading frames from the output of the previous step.
SelectLongOrf
Selects the longest ORF.
input: previous output fasta
output: longest ORF in protein FASTA format, including stop codons. FASTA headers consist of genomic sequence ID, evidence ID, alignment start/stop (absolute), ORF start/ORF stop (relative)
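
The selection rule, keep the longest ORF that is at least -minsize long, can be sketched as follows. This is not the getorf or SelectLongOrf code: it assumes an ORF runs from an ATG to the next in-frame stop codon, scans only the three forward frames, measures minsize in nucleotides, and returns the nucleotide sequence rather than the translated protein.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence, minsize=30):
    """Return (start, end, orf) for the longest ATG-to-stop ORF, or None.

    start and end are 0-based, end-exclusive offsets relative to the input
    sequence, and the stop codon is included in the ORF. Only the three
    forward frames are scanned (an assumption of this sketch).
    """
    sequence = sequence.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                length = end - start
                if length >= minsize and (best is None or length > best[1] - best[0]):
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    if best is None:
        return None
    return best[0], best[1], sequence[best[0]:best[1]]


if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTGGGTAACC", minsize=9))
```

The relative start and end returned here correspond to the ORF start/stop fields in the FASTA header described above; adding them to the absolute alignment start would place the ORF on the genome.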
ExtractFullCDS_CpGAT
This tool combines the ExtraFlCDS_protgth and ExtraFlCDS_mRNAgth programs, calling the appropriate one based on the -p parameter. The script uses the solar-formatted BLAST reference output to identify full CDS structures and parse them into a full CDS GFF3 file; at the same time, it identifies non-CDS-containing structures and ranks them according to similarity and coverage in a separate file.
input: 4 files
  1. longest ORFs (fasta)
  2. gth exon structure (tabular flat file format)
  3. homologous alignments (solar format)
  4. threshold parameters for confidence (currently hard-coded)
output: 2 files
  1. GFF formatted exon structures with full CDS
  2. GFF formatted exon structures with no CDS annotation, GFF 2nd keywords column indicates whether high or low confidence, according to similarity and coverage
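
The high/low confidence labelling described above can be pictured as a simple threshold test on similarity and coverage. The actual thresholds are hard-coded in the tool and are not given on this page, so the cutoff values, field names, and function names below are purely hypothetical.

```python
# Hypothetical cutoffs; the real values are hard-coded in ExtractFullCDS_CpGAT
# and are not documented on this page.
MIN_SIMILARITY = 0.6
MIN_COVERAGE = 0.8


def confidence(similarity, coverage,
               min_similarity=MIN_SIMILARITY, min_coverage=MIN_COVERAGE):
    """Label a structure 'high' only if both measures reach their cutoff."""
    if similarity >= min_similarity and coverage >= min_coverage:
        return "high"
    return "low"


def rank_structures(structures):
    """Sort (id, similarity, coverage) tuples from best to worst and label them."""
    ordered = sorted(structures, key=lambda s: (s[1], s[2]), reverse=True)
    return [(sid, confidence(sim, cov)) for sid, sim, cov in ordered]


if __name__ == "__main__":
    examples = [("a", 0.55, 0.95), ("b", 0.90, 0.85), ("c", 0.70, 0.40)]
    print(rank_structures(examples))
```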
BGF_CpGAT
bgf
Run BGF to get gene prediction
bgfmerge
Calls the merge2results program to combine two bgf output files.
bgf2gff
Calls the convert_bgf_to_jigsaw.pl, sort_and_rewrite_jigsaw.pl and bgf2gff2.pl scripts to convert the BGF output to GFF format. The result is the BGF gene models in GFF format.
input: masked genomic sequence (fasta format); species parameter
output: GFF3 formatted file with gene models. GFF3 ID consists of genome segment ID, left right coordinates (absolute)
GeneMark_CpGAT
genemark
gmark2gff
Calls the convert_genemark_to_jigsaw.pl, sort_and_rewrite_jigsaw.pl and gmark2gff3.pl scripts to convert the genemark output to gff format
input: masked genomic sequence (fasta format); species parameter
output: GFF3 formatted file with gene models, ID consists of genome segment ID, left right coordinates (absolute)
Augustus_CpGAT
augustus
Calls the standard augustus program, though only enough parameters to accommodate the workflow are implemented
clean_aug_gff
Calls convert_to_jigsaw.pl (it's not clear whether its output is used anywhere) and clean_aug.pl; otherwise the same as BGF.
input: masked genomic sequence (fasta format); species parameter
output: GFF3 formatted file with gene models, ID consists of genome segment ID, left right coordinates (absolute)
EVM_CpGAT
evm
GetAbsPos_GFF
This script changes the relative coordinates to absolute coordinates, allowing the gff3 file to be uploaded to mysql tables and displayed in a genome context viewer.
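
The coordinate shift itself is simple arithmetic on the GFF3 start and end columns. The sketch below is not the GetAbsPos_GFF script; it assumes the annotated segment begins at a known 1-based genome position (segment_start, a hypothetical parameter) and that the input rows are standard nine-column, tab-delimited GFF3.

```python
def to_absolute(gff_lines, segment_start):
    """Shift GFF3 start/end columns from segment-relative to absolute positions.

    GFF3 coordinates are 1-based and inclusive, so relative position 1 maps
    to segment_start itself: absolute = relative + segment_start - 1.
    """
    shifted = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            shifted.append(line)  # keep directives, comments and blank lines
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 GFF3 columns, got {len(cols)}")
        cols[3] = str(int(cols[3]) + segment_start - 1)  # start column
        cols[4] = str(int(cols[4]) + segment_start - 1)  # end column
        shifted.append("\t".join(cols))
    return shifted


if __name__ == "__main__":
    rows = ["##gff-version 3", "seg1\tEVM\tgene\t1\t100\t.\t+\t.\tID=g1"]
    print("\n".join(to_absolute(rows, 5001)))
```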
input: 14 files
  1. input genome file: Genomic_DNA_Input.fasta from CpGAT_Start
  2. transcript file: DNA.mRNAgth.fullcds.gff from ExtractFullCDS_CpGAT
  3. protein file: DNA.protgth.fullcds.gff from ExtractFullCDS_CpGAT
  4. genemark file: Genomic_DNA_Input.masked.genemark.gff from GeneMark_CpGAT
  5. augustus file: Genomic_DNA_Input.masked.AUG.gff from Augustus_CpGAT
  6. bgf file: Genomic_DNA_Input.masked.bgf.gff from BGF_CpGAT
  7. mRNA exon file: DNA.mRNAgth.asembl.exon.txt from GetPredictedmRNA_CpGAT
  8. protein exon file: DNA.protgth.sortnr.exon.txt from GetPredictedmRNA_CpGAT
  9. mRNA PASA output file: exon.10197.pasa.out from GetPredictedmRNA_CpGAT
  10. Reference Protein db: Reference_Protein from CpGAT_Start
  11. transcript alignment file: DNA.mRNAgth.noflcds.gff from ExtractFullCDS_CpGAT
  12. protein alignment file: DNA.protgth.noflcds.gff from ExtractFullCDS_CpGAT
  13. mRNA solar id map file: matched_mRNAs_evde_id.map from Solar_CpGAT
  14. protein solar id map: matched_Peps_evde_id.map from Solar_CpGAT
output:
CpGAT_Finish
Filters the output from EVM showing only the pertinent output files.
input:
output: