Compositional features of eukaryotic genomes for checking predicted genes.

Stéphane Cruveiller; Kamel Jabbari; Oliver Clay; Giorgio Bernardi

doi:10.1093/bib/4.1.43

Research output: Contribution to journal › Review article › peer-review

15 Citations (Scopus)

Abstract

Gene prediction relies on the identification of characteristic features of coding sequences that distinguish them from non-coding DNA. The recent large-scale sequencing of entire genomes from higher eukaryotes, in conjunction with currently used gene prediction algorithms, has provided an abundance of putative genes that can now be analysed for their compositional properties. Strong, systematic differences still exist, in several species, between the compositional properties of sets of ex novo predicted genes and genes that have been experimentally detected and/or verified. This is particularly evident in the estimated gene set (>45,000 genes) of the recently sequenced rice genome, where roughly half the predicted genes are compositionally unusual and have no known orthologues in the dicot Arabidopsis. In a few cases such differences might suggest a bias in experimental gene-finding protocols, but the quasi-random nature of the compositionally aberrant predicted genes is a strong indication that many, if not most, of them are false positives. It therefore appears that some important features of coding regions have not yet been taken into account in existing gene prediction programs. Statistical base compositional properties of curated gene data sets from vertebrates, which we briefly review here, should therefore provide a useful benchmark for fine-tuning probabilistic gene models and model parameters that are currently in use.

Original language	English (US)
Pages (from-to)	43-52
Number of pages	10
Journal	Briefings in Bioinformatics
Volume	4
Issue number	1
DOI	https://doi.org/10.1093/bib/4.1.43
State	Published - Mar 2003
Externally published	Yes

ASJC Scopus subject areas

Information Systems
Molecular Biology

Cite this

@article{df32cd10094749dfaf53a8ae9b8c442e,
title = "Compositional features of eukaryotic genomes for checking predicted genes.",
abstract = "Gene prediction relies on the identification of characteristic features of coding sequences that distinguish them from non-coding DNA. The recent large-scale sequencing of entire genomes from higher eukaryotes, in conjunction with currently used gene prediction algorithms, has provided an abundance of putative genes that can now be analysed for their compositional properties. Strong, systematic differences still exist, in several species, between the compositional properties of sets of ex novo predicted genes and genes that have been experimentally detected and/or verified. This is particularly evident in the estimated gene set (>45,000 genes) of the recently sequenced rice genome, where roughly half the predicted genes are compositionally unusual and have no known orthologues in the dicot Arabidopsis. In a few cases such differences might suggest a bias in experimental gene-finding protocols, but the quasi-random nature of the compositionally aberrant predicted genes is a strong indication that many, if not most, of them are false positives. It therefore appears that some important features of coding regions have not yet been taken into account in existing gene prediction programs. Statistical base compositional properties of curated gene data sets from vertebrates, which we briefly review here, should therefore provide a useful benchmark for fine-tuning probabilistic gene models and model parameters that are currently in use.",

author = "St{\'e}phane Cruveiller and Kamel Jabbari and Oliver Clay and Giorgio Bernardi",
note = "Copyright: This record is sourced from MEDLINE{\textregistered}/PubMed{\textregistered}, a database of the U.S. National Library of Medicine",
year = "2003",
month = mar,
doi = "10.1093/bib/4.1.43",
language = "English (US)",
volume = "4",
pages = "43--52",
journal = "Briefings in Bioinformatics",
issn = "1467-5463",
publisher = "Oxford University Press",
number = "1",
}

TY - JOUR
T1 - Compositional features of eukaryotic genomes for checking predicted genes.
AU - Cruveiller, Stéphane
AU - Jabbari, Kamel
AU - Clay, Oliver
AU - Bernardi, Giorgio
N1 - Copyright: This record is sourced from MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine
PY - 2003/3
Y1 - 2003/3
N2 - Gene prediction relies on the identification of characteristic features of coding sequences that distinguish them from non-coding DNA. The recent large-scale sequencing of entire genomes from higher eukaryotes, in conjunction with currently used gene prediction algorithms, has provided an abundance of putative genes that can now be analysed for their compositional properties. Strong, systematic differences still exist, in several species, between the compositional properties of sets of ex novo predicted genes and genes that have been experimentally detected and/or verified. This is particularly evident in the estimated gene set (>45,000 genes) of the recently sequenced rice genome, where roughly half the predicted genes are compositionally unusual and have no known orthologues in the dicot Arabidopsis. In a few cases such differences might suggest a bias in experimental gene-finding protocols, but the quasi-random nature of the compositionally aberrant predicted genes is a strong indication that many, if not most, of them are false positives. It therefore appears that some important features of coding regions have not yet been taken into account in existing gene prediction programs. Statistical base compositional properties of curated gene data sets from vertebrates, which we briefly review here, should therefore provide a useful benchmark for fine-tuning probabilistic gene models and model parameters that are currently in use.

AB - Gene prediction relies on the identification of characteristic features of coding sequences that distinguish them from non-coding DNA. The recent large-scale sequencing of entire genomes from higher eukaryotes, in conjunction with currently used gene prediction algorithms, has provided an abundance of putative genes that can now be analysed for their compositional properties. Strong, systematic differences still exist, in several species, between the compositional properties of sets of ex novo predicted genes and genes that have been experimentally detected and/or verified. This is particularly evident in the estimated gene set (>45,000 genes) of the recently sequenced rice genome, where roughly half the predicted genes are compositionally unusual and have no known orthologues in the dicot Arabidopsis. In a few cases such differences might suggest a bias in experimental gene-finding protocols, but the quasi-random nature of the compositionally aberrant predicted genes is a strong indication that many, if not most, of them are false positives. It therefore appears that some important features of coding regions have not yet been taken into account in existing gene prediction programs. Statistical base compositional properties of curated gene data sets from vertebrates, which we briefly review here, should therefore provide a useful benchmark for fine-tuning probabilistic gene models and model parameters that are currently in use.

UR - http://www.scopus.com/inward/record.url?scp=0038206894&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0038206894&partnerID=8YFLogxK
U2 - 10.1093/bib/4.1.43
DO - 10.1093/bib/4.1.43
M3 - Review article
C2 - 12715833
AN - SCOPUS:0038206894
SN - 1467-5463
VL - 4
SP - 43
EP - 52
JO - Briefings in Bioinformatics
JF - Briefings in Bioinformatics
IS - 1
ER -
