gff3_fix full documentation

Background

The gff3_fix program fixes 30 of the error types detected by gff3_QC.py. The section ‘gff3_fix’ lists every error type that gff3_fix.py can currently fix, together with the fix function applied to it. (Note that in some cases the fix removes the affected gene model.) The section ‘Fix function’ describes the method each fix function uses. The section ‘Currently no automatic fix available’ lists the error types that gff3_fix does not yet handle.

Note that the gff3_fix program requires that all features contain an ID attribute. You can use lib/gff3_ID_generator.py to generate IDs if your gff3 file does not have them for every feature.
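
Before running gff3_fix, it can be useful to confirm that every feature really does carry an ID attribute. The following is a minimal sketch of such a check, not part of gff3_fix or gff3_ID_generator.py; the function name features_missing_id and the sample lines are hypothetical and used for illustration only.

```python
def features_missing_id(gff3_lines):
    """Return 1-based line numbers of feature lines that have no ID attribute."""
    missing = []
    for line_number, raw in enumerate(gff3_lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # wrong field count is a separate error type (Esf0022)
        # Column 9 holds tag=value pairs separated by semicolons.
        tags = [pair.split("=", 1)[0] for pair in fields[8].split(";") if pair]
        if "ID" not in tags:
            missing.append(line_number)
    return missing


# Hypothetical input: the exon on line 4 has a Parent but no ID.
example = [
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=gene1\n",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=mRNA1;Parent=gene1\n",
    "chr1\tsrc\texon\t1\t100\t.\t+\t.\tParent=mRNA1\n",
]
print(features_missing_id(example))  # [4]
```

If the returned list is not empty, generate the missing IDs before running gff3_fix.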

gff3_fix

| Error code | Error tag | Fix function |
| --- | --- | --- |
| Ema0001 | Parent feature start and end coordinates exceed those of child features | fix_boundary |
| Ema0003 | This feature is not contained within the parent feature coordinates | fix_boundary |
| Ema0005 | Pseudogene has invalid child feature type | pseudogene |
| Ema0006 | Wrong phase | fix_phase |
| Ema0007 | CDS and parent feature on different strands | delete_model |
| Ema0009 | Incorrectly merged gene parent? Isoforms that do not share coding sequences are found | split |
| Emr0001 | Duplicate transcript found | remove_duplicate_trans |
| Emr0002 | Incorrectly split gene parent? | merge |
| Esf0001 | Feature type may need to be changed to pseudogene | pseudogene |
| Esf0002 | Start/Stop is not a valid 1-based integer coordinate | delete_model |
| Esf0003 | strand information missing | delete_model |
| Esf0013 | White chars not allowed at the start of a line | gff3 parse |
| Esf0014 | “##gff-version” missing from the first line | add_gff3_version |
| Esf0016 | ##sequence-region seqid may only appear once | remove_directive |
| Esf0017 | Start/End is not a valid integer | delete_model |
| Esf0018 | Start is not less than or equal to end | delete_model |
| Esf0020 | Version is not a valid integer | remove_directive |
| Esf0021 | Unknown directive | remove_directive |
| Esf0022 | Features should contain 9 fields | delete_model |
| Esf0025 | Strand has illegal characters | delete_model |
| Esf0026 | Phase is not 0, 1, or 2, or not a valid integer | fix_phase |
| Esf0027 | Phase is required for all CDS features | fix_phase |
| Esf0029 | Attributes must contain one and only one equal (=) sign | fix_attributes |
| Esf0030 | Empty attribute tag | fix_attributes |
| Esf0031 | Empty attribute value | fix_attributes |
| Esf0032 | Found multiple attribute tags | fix_attributes |
| Esf0033 | Found “, ” in a attribute, possible unescaped | fix_attributes |
| Esf0034 | attribute has identical values (count, value) | fix_attributes |
| Esf0036 | Value of a attribute contains unescaped “,” | fix_attributes |
| Esf0041 | Unknown reserved (uppercase) attribute | fix_attributes |
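
As an illustration of how one of the fixes listed above plays out, consider Ema0001 and Ema0003, which both map to fix_boundary. This document describes that fix as updating the parent coordinates from the minimum and maximum coordinates of the child features. The sketch below shows that idea on plain dictionaries; the helper name fit_parent_to_children and the data layout are assumptions made for this example and are not taken from the gff3_fix source.

```python
def fit_parent_to_children(parent, children):
    """Return a copy of parent whose start/end span exactly covers its children."""
    fixed = dict(parent)  # leave the caller's record untouched
    if not children:
        return fixed  # nothing to derive the span from
    fixed["start"] = min(child["start"] for child in children)
    fixed["end"] = max(child["end"] for child in children)
    return fixed


# Hypothetical mRNA whose coordinates exceed those of its exons (Ema0001).
mrna = {"id": "mRNA1", "start": 1, "end": 500}
exons = [{"start": 50, "end": 120}, {"start": 200, "end": 450}]
print(fit_parent_to_children(mrna, exons))
# {'id': 'mRNA1', 'start': 50, 'end': 450}
```

The same computation covers Ema0003, where a child extends beyond its parent: the parent span grows instead of shrinking.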

Fix function

| Fix function | Method |
| --- | --- |
| delete_model | Remove the whole model from the original gff3 file |
| remove_duplicate_trans | Remove the duplicate transcripts |
| remove_directive | Remove the directive |
| pseudogene | Remove CDS features and change the feature type of the remaining features: first-level → pseudogene; second-level → pseudogenic_transcript; third-level (exon) → pseudogenic_exon |
| fix_boundary | Update the coordinates of the parent using the minimum and maximum coordinates of its child features |
| fix_phase | Correct the phase using next_phase = (3 - ((CDS['end'] - CDS['start'] + 1 - phase) % 3)) % 3. Note: if the first CDS segment does not have a phase, the initial phase is set to 0. |
| fix_attributes | Remove empty attribute tags/values; remove redundant equal signs (=); remove duplicate attributes; make the first character of an unknown reserved attribute lowercase; merge multiple attribute tags and remove duplicate attribute values; replace , with %2C |
| split | Split the incorrectly merged transcript from a gene model and generate a new gene model |
| merge | Merge the incorrectly split gene models |
| add_gff3_version | Add ##gff-version 3 to the first line of the gff3 file |
| gff3 parse | Parse the gff3 file; ignore blank lines; remove whitespace characters at the start of a line |
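
The fix_phase formula is easier to follow with a worked example. The sketch below applies the formula exactly as written above to a list of CDS segments, starting from phase 0 as the note describes. It assumes the segments are already ordered from the 5' end to the 3' end of the coding sequence; the function names and that ordering assumption are illustrative and not taken from the gff3_fix source.

```python
def next_phase(cds, phase):
    """Phase of the following CDS segment, per the formula in this document."""
    return (3 - ((cds["end"] - cds["start"] + 1 - phase) % 3)) % 3


def assign_phases(cds_segments, initial_phase=0):
    """Return the phase of each CDS segment, given segments in 5'->3' order."""
    phases = []
    phase = initial_phase  # the first segment defaults to phase 0
    for cds in cds_segments:
        phases.append(phase)
        phase = next_phase(cds, phase)
    return phases


# Hypothetical three-segment CDS.
segments = [
    {"start": 1, "end": 10},   # 10 bases: 3 codons plus 1 leftover base
    {"start": 21, "end": 31},  # 11 bases: 2 finish the split codon, then 3 codons
    {"start": 41, "end": 49},  # 9 bases: 3 complete codons
]
print(assign_phases(segments))  # [0, 2, 0]
```

The first segment leaves one base of an unfinished codon, so the second segment must skip two bases before its first complete codon, which gives phase 2.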

Currently no automatic fix available

| Error code | Error tag |
| --- | --- |
| Ema0002 | Protein sequence contains internal stop codons |
| Ema0004 | Incomplete gene feature that should contain at least one mRNA, exon, and CDS |
| Ema0008 | Warning for distinct isoforms that do not share any regions |
| Emr0003 | Duplicate ID |
| Esf0004 | Seqid not found in any ##sequence-region |
| Esf0005 | Start is less than the ##sequence-region start |
| Esf0006 | End is greater than the ##sequence-region end |
| Esf0007 | Seqid not found in the embedded ##FASTA |
| Esf0008 | End is greater than the embedded ##FASTA sequence length |
| Esf0009 | Found Ns in a feature using the embedded ##FASTA |
| Esf0010 | Seqid not found in the external FASTA file |
| Esf0011 | End is greater than the external FASTA sequence length |
| Esf0012 | Found Ns in a feature using the external FASTA |
| Esf0015 | Expecting certain fields in the feature |
| Esf0019 | Version is not “3” |
| Esf0023 | escape certain characters |
| Esf0024 | Score is not a valid floating point number |
| Esf0035 | attribute has unresolved forward reference |
| Esf0037 | Target attribute should have 3 or 4 values |
| Esf0038 | Start/End value of Target attribute is not a valid integer coordinate |
| Esf0039 | Strand value of Target attribute has illegal characters |
| Esf0040 | Value of Is_circular attribute is not “true” |