gff3_QC full documentation

Background

The GFF3 format is flexible and easy to use for most biologists, but this flexibility also allows many errors to be introduced. This QC program aims to detect over 50 types of formatting errors.

Errors are detected by reviewing three types of feature sets in a GFF3 file, and thus are grouped into three categories (Error category – feature type): * Intra-model errors (Ema) – multiple features within a model * Inter-model errors (Emr) – multiple features across models * Single feature errors (Esf) – each single feature.

In addition, we distinguish between errors that apply to protein-coding genes in the ‘canonical’ Sequence ontology style, and errors that apply to ‘non-canonical’ gene models – i.e. non-coding models, or protein-coding genes that are not modeled with gene, mRNA, CDS and exon features. To perform error-checking on a gff3 file that contains non-canonical gene models, you can specify the –noncg argument when running the program.

Below we list all errors currently considered by gff3_QC.py, including the error code, the error tag (a brief explanation of the error), and whether the error is checked for non-canonical gene models (when using the –noncg argument).

View the gff3_QC.py readme for instructions on how to run the program.

Intra-model: Multiple features within a model (Ema)

The error category ‘Intra-model’ collects formatting errors that can be found by jointly considering multiple features within a gene model, such as gene, mRNA, exon, and CDS features. Errors in this category are given an ‘Error_Code’ starting with ‘Ema’.

Error_Code

Error_level

Error_Tag

Checked if non-canonical

Ema0001

Warning

Parent feature start and end coordinates exceed those of child features

Yes

Ema0002

Warning

Protein sequence contains internal stop codons

No

Ema0003

Warning

This feature is not contained within the parent feature coordinates

Yes

Ema0004

Info

Incomplete gene feature that should contain at least one mRNA, exon, and CDS

No

Ema0005

Info

Pseudogene has invalid child feature type

Yes

Ema0006

Info

Wrong phase

No

Ema0007

Warning

CDS and parent feature on different strands

Yes

Ema0008

Warning

Warning for distinct isoforms that do not share any regions

No

Ema0009

Warning

Incorrectly merged gene parent? Isoforms that do not share coding sequences are found

No

Inter-model: Multiple features across models (Emr)

The error category ‘Inter-model’ collects formatting errors that can be found by comparing multiple gene models. Errors in this category are given an ‘Error_Code’ starting with ‘Emr’.

Error_Code

Error_level

Error_Tag

Checked if non-canonical

Emr0001

Warning

Duplicate transcript found

No

Emr0002

Warning

Incorrectly split gene parent?

No

Emr0003

Error

Duplicate ID

Yes

Single feature (Esf)

The error category ‘Single Feature’ collects formatting errors that can be found by searching the GFF3 file line by line. Errors in this category are given an ‘Error_Code’ starting with ‘Esf’.

Error_Code

Error_level

Error_Tag

Checked if non-canonical

Esf0001

Info

Feature type may need to be changed to pseudogene

Yes

Esf0002

Error

Start/Stop is not a valid 1-based integer coordinate

Yes

Esf0003

Error

strand information missing

Yes

Esf0004

Error

Seqid not found in any ##sequence-region

Yes

Esf0005

Error

Start is less than the ##sequence-region start

Yes

Esf0006

Error

End is greater than the ##sequence-region end

Yes

Esf0007

Error

Seqid not found in the embedded ##FASTA

Yes

Esf0008

Error

End is greater than the embedded ##FASTA sequence length

Yes

Esf0009

Info

Found Ns in a feature using the embedded ##FASTA

Yes

Esf0010

Error

Seqid not found in the external FASTA file

Yes

Esf0011

Error

End is greater than the external FASTA sequence length

Yes

Esf0012

Info

Found Ns in a feature using the external FASTA

Yes

Esf0013

Error

White chars not allowed at the start of a line

Yes

Esf0014

Error

##gff-version” missing from the first line

Yes

Esf0015

Error

Expecting certain fields in the feature

Yes

Esf0016

Error

##sequence-region seqid may only appear once

Yes

Esf0017

Error

Start/End is not a valid integer

Yes

Esf0018

Error

Start is not less than or equal to end

Yes

Esf0019

Info

Version is not “3”

Yes

Esf0020

Error

Version is not a valid integer

Yes

Esf0021

Info

Unknown directive

Yes

Esf0022

Error

Features should contain 9 fields

Yes

Esf0023

Error

escape certain characters

Yes

Esf0024

Error

Score is not a valid floating point number

Yes

Esf0025

Error

Strand has illegal characters

Yes

Esf0026

Error

Phase is not 0, 1, or 2, or not a valid integer

Yes

Esf0027

Error

Phase is required for all CDS features

Yes

Esf0028

Info

Attributes must escape the percent (%) sign and any control characters

Yes

Esf0029

Error

Attributes must contain one and only one equal (=) sign

Yes

Esf0030

Error

Empty attribute tag

Yes

Esf0031

Error

Empty attribute value

Yes

Esf0032

Error

Found multiple attribute tags

Yes

Esf0033

Info

Found “, “ in a attribute, possible unescaped

Yes

Esf0034

Info

attribute has identical values (count, value)

Yes

Esf0035

Info

attribute has unresolved forward reference

Yes

Esf0036

Info

Value of a attribute contains unescaped “,”

Yes

Esf0037

Error

Target attribute should have 3 or 4 values

Yes

Esf0038

Error

Start/End value of Target attribute is not a valid integer coordinate

Yes

Esf0039

Error

Strand value of Target attribute has illegal characters

Yes

Esf0040

Error

Value of Is_circular attribute is not “true”

Yes

Esf0041

Error

Unknown reserved (uppercase) attribute

Yes