Frontiers in Computational Genomics
Chapter Abstracts


Chapter 1
Gene Prediction Accuracy in Large DNA Sequences

Roderic Guigó and Thomas Wiehe

Abstract
Identifying the genes encoded in a genome is essential to our ability to convert its nucleotide sequence into valuable biological knowledge about the organism. Despite substantial progress during the nineties, computational gene identification remains an open problem in higher eukaryotic organisms: a year after the publication of the draft sequence of the human genome, the exact number of human genes remains highly controversial. In this chapter we review the accuracy of computational gene-finding methods. We emphasize recently developed methods based on the comparison of genomic sequences. These methods show promise in identifying novel genes that are not well represented in the databases of known coding sequences.
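
As an illustration of what "accuracy" can mean here, the sketch below computes nucleotide-level sensitivity and specificity for a predicted set of coding positions against an annotated one. It assumes the convention common in gene-finding evaluations, where specificity is the fraction of predicted coding nucleotides that are truly coding; the abstract itself does not state which measures the chapter uses, and the function name and set-based representation are hypothetical.

```python
def nucleotide_accuracy(true_coding, predicted_coding):
    """Return (sensitivity, specificity) at the nucleotide level.

    Both arguments are sets of sequence positions labelled as coding.
    Assumed convention (not taken from the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
    """
    tp = len(true_coding & predicted_coding)   # coding and predicted coding
    fn = len(true_coding - predicted_coding)   # coding but missed
    fp = len(predicted_coding - true_coding)   # predicted but not coding
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, specificity


# Toy example: an annotated exon at positions 10-19, a prediction at 15-24.
annotated = set(range(10, 20))
predicted = set(range(15, 25))
sn, sp = nucleotide_accuracy(annotated, predicted)
```

In this toy case half of the annotated nucleotides are recovered and half of the predicted nucleotides are correct, so both measures equal 0.5.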


Chapter 2
Sensitive Protein Alignment Algorithms

Adam Godzik

Abstract
Recent advances in sensitive protein alignment algorithms come mostly from exploiting additional information, available either from the analysis of substitution patterns in homologous proteins (profile or PSSM methods) or from the stability requirements of the final protein structure (threading). Many new methods bring in types of information beyond structure and sequence variation among homologs and thus hold significant promise for further improving the sensitivity with which distant homologs are discovered. Methods such as ISS (intermediate sequence search) and other neighborhood methods explore features of protein sequence space, searching for "bridging proteins" that connect divergent families through a series of comparisons. Context analysis, as implemented in existing systems, uses the position of a gene in a genome and the functions of its genomic neighbors to identify function and, indirectly, to recognize homologies. Other methods use keyword and literature searches to build a "literature profile" of a protein, discovering distant relations between proteins.


Chapter 3
Measuring Generality of Knowledge-Based Potentials Extracted from Protein Structure Sets

Frank Eisenhaber, Shamil R. Sunyaev, Vladimir G. Tumanyan, and Eugene N. Kuznetsov

Abstract
Sequence-structure matching techniques such as threading with so-called knowledge-based potentials rely on the extraction of amino acid location distributions (e.g., of solvent accessibilities, backbone conformational descriptors or pairwise residue-residue distances) from learning sets of protein structures. These functions are used to characterize amino acid type preferences and to describe spatial locations in structural templates in the same physical terms. The transferability of these functions from the tertiary structure sets used for their derivation to new proteins outside the learning set is critical for predictor performance. We present the problem in the framework of statistical hypothesis tests that quantify the extent to which the extracted knowledge-based functions can be extrapolated to new proteins. Several statistical models for knowledge-based functions are considered in this work: (1) We show that the derivation of knowledge-based potentials within the framework of statistical physics, especially with Boltzmann's distribution, is not consistent with the observed distributions of residue location parameters. (2) The assumption of amino acid type-specific but protein-independent propensities is often a valid approximation, although many cases of clear disagreement can be found. (3) We propose two models of propensities corrected for the actual amino acid composition in a query protein.
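
To make the notion of a propensity concrete, here is a minimal sketch of model (2): protein-independent propensities estimated by pooling all residues of a learning set. The estimator used (observed joint frequency divided by the product of the marginal frequencies) is one common choice and is an assumption here, not necessarily the authors' formula; the function name, the location classes and the toy data are hypothetical.

```python
from collections import Counter


def pooled_propensities(observations):
    """Estimate propensities from (amino_acid, location_class) pairs.

    Assumed estimator (not taken from the abstract):
      propensity(a, s) = P(a, s) / (P(a) * P(s))
    Values above 1 mean amino acid a is over-represented in class s.
    """
    obs = list(observations)
    n = len(obs)
    pair_counts = Counter(obs)
    aa_counts = Counter(a for a, _ in obs)
    loc_counts = Counter(s for _, s in obs)
    return {
        (a, s): (count / n) / ((aa_counts[a] / n) * (loc_counts[s] / n))
        for (a, s), count in pair_counts.items()
    }


# Toy learning set: leucine mostly buried, lysine mostly exposed.
toy_set = (
    [("L", "buried")] * 3 + [("L", "exposed")]
    + [("K", "buried")] + [("K", "exposed")] * 3
)
props = pooled_propensities(toy_set)
```

Testing transferability then amounts to asking whether propensities estimated this way on the learning set still describe the counts observed in a protein that was left out.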


Chapter 4
Protein Structure Modeling in Functional Genomics

Jürgen Kopp, Manuel C. Peitsch, and Torsten Schwede

Abstract
Functional genomics can be greatly assisted by three-dimensional structures, which give insights into the molecular function of proteins. Although structural genomics projects aim to provide protein structures on a large scale, in the foreseeable future this can be achieved only for a small fraction of all protein sequences. Comparative modeling techniques complement experimental methods by building models for protein sequences that are related to known structures. Depending on their quality, three-dimensional models have applications ranging from low-resolution functional assignments to molecular simulations at an atomic level of detail.


Chapter 5
Identifying Domains, Repeats and Motifs from Protein Sequences

Chris P. Ponting and Alex Bateman

Abstract
The newly sequenced genomes provide previously undreamt-of opportunities to generate experimentally testable hypotheses. These hypotheses arise from the identification of gene regions that are homologous, that is, that have arisen from a common ancestor, and that often possess similarities in function. These homologues encode families of protein domains, repeats or motifs that have often been assembled by evolution into multiple distinct combinations, or architectures. This article suggests strategies for the analysis of protein sequences in terms of their domain, repeat and motif architectures, with an emphasis on statistical evaluation of sequence similarities and the prediction of function from sequence.


Chapter 6
Exploiting the Variations in the Genomic Associations of Genes to Predict Pathways and Reconstruct their Evolution

Martijn A. Huynen and Berend Snel

Abstract
Various methods have been proposed to predict functional interactions between proteins using the genomic association of their genes. Here we exploit the variation in this genomic association to increase the resolution of these methods. The metabolic distance between enzymes, as defined by the number of substrates that separate them in a pathway, is used to quantify the directness of functional interaction. We show a positive correlation between the strength of the genomic association and the directness of the functional interaction: i.e. the stronger the genomic association is, the shorter the average metabolic distance between the enzymes. Subsequently we map the variation and reconstruct the evolution of a single pathway in detail: iron-sulfur cluster assembly in Proteobacteria and mitochondria. By deconstructing the evolution of this pathway to the level of gene-gain and gene-loss events we identify (sub)modules within it. Proteins that are within the same sub-module are expected to have a more direct functional interaction than ones that are in different sub-modules.
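
The metabolic distance defined above can be sketched as a breadth-first search over enzymes that share metabolites. This is only an illustration under simplifying assumptions: each enzyme is represented by the set of metabolites it consumes or produces, two enzymes sharing a metabolite are taken to be at distance 1, and the function name and the toy pathway are hypothetical rather than taken from the chapter.

```python
from collections import deque


def metabolic_distance(enzyme_metabolites, start, goal):
    """Smallest number of metabolites separating two enzymes, or None.

    enzyme_metabolites maps each enzyme name to the set of metabolites it
    consumes or produces (an assumed, simplified pathway representation).
    """
    if start == goal:
        return 0
    # Index: metabolite -> enzymes that use it.
    enzymes_by_metabolite = {}
    for enzyme, metabolites in enzyme_metabolites.items():
        for metabolite in metabolites:
            enzymes_by_metabolite.setdefault(metabolite, set()).add(enzyme)
    # Breadth-first search; each step crosses one shared metabolite.
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        enzyme, distance = queue.popleft()
        for metabolite in enzyme_metabolites[enzyme]:
            for neighbour in enzymes_by_metabolite[metabolite]:
                if neighbour == goal:
                    return distance + 1
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, distance + 1))
    return None  # the two enzymes are not connected


# Toy linear pathway A -> B -> C -> D catalysed by E1, E2 and E3.
toy_pathway = {
    "E1": {"A", "B"},
    "E2": {"B", "C"},
    "E3": {"C", "D"},
}
d12 = metabolic_distance(toy_pathway, "E1", "E2")  # 1 (separated by B)
d13 = metabolic_distance(toy_pathway, "E1", "E3")  # 2 (separated by B and C)
```

With distances like these in hand, the correlation described in the abstract would be obtained by comparing them with a score for the strength of genomic association of each enzyme pair.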


Chapter 7
Using Genomic Context in the Analysis of Multi-Component Systems

Natalie D. Fedorova, Anastasia N. Nikolskaya, and Michael Y. Galperin

Abstract
Multi-domain organization of newly sequenced proteins has long been recognized as a factor that significantly complicates their functional annotation, particularly the automated annotation in the course of large-scale genome sequencing projects. We discuss here some approaches to the analysis of multi-domain (multi-component) systems. We argue that comprehensive annotation of multi-subunit enzymes and complex enzyme systems, such as ABC transporters or members of the two-component signal transduction system, has to combine detailed analysis of their domain composition, operon organization of the corresponding genes, gene content in the particular genome, and phylogenetic analysis. This multi-faceted "genomic context" approach needs to include a human component, i.e. a highly-qualified genome biologist who should be able to identify new truly interesting cases and put them into proper context.


Chapter 8
Prolegomena to the Evolution of Transcriptional Regulation in Bacterial Genomes

Mikhail S. Gelfand and Olga N. Laikova

Abstract
The availability of many complete bacterial genomes makes it possible to apply comparative genomic approaches to the analysis of transcriptional regulation and the evolution of regulatory interactions. Here we list some observations about the co-evolution of regulators and the signals they recognize in DNA; the birth, evolution, and degeneration of regulons; the influence of horizontal gene transfer on regulation and on the evolution of ubiquitous regulators; and the interaction of regulons. Although scattered and incomplete, these observations may provide a first basis for the development of a general theory of the evolution of regulation.


Chapter 9
Experimental RNomics

Alexander Hüttenhofer and Jürgen Brosius

Abstract
A major milestone of genome projects, the identification of all genes, can only be achieved when genes encoding non-messenger RNAs, including longer mRNA-like RNAs that lack open reading frames and small non-messenger RNAs (snmRNAs), are not ignored. RNAs play a variety of important roles in different compartments and functions of the cell. We review biomathematical identification of snmRNA candidates as well as experimental approaches in various organisms from bacteria to man. Because Experimental RNomics is unbiased in identifying novel snmRNAs, it is suited to identifying even novel classes of RNAs, such as micro RNAs (miRNAs). Combined approaches promise to reveal a plethora of additional genome-encoded RNAs and their roles in gene regulation.


Chapter 10
Genome-Scale Phylogenetic Trees

Yuri I. Wolf, Igor B. Rogozin, Nick V. Grishin, and Eugene V. Koonin

Abstract
Genome comparisons indicate that horizontal gene transfer and differential gene loss are major evolutionary phenomena that involve a large fraction, if not the majority, of the genes, at least in prokaryotes. The extent of these events casts doubt on the very feasibility of constructing a "tree of life" because the trees for different genes often tell different stories. However, alternative approaches to tree construction that attempt to determine tree topology on the basis of comparisons of complete gene sets seem to reveal a phylogenetic signal that supports the three-domain evolutionary scenario and suggests the possibility of delineation of previously undetected major clades of prokaryotes. If the validity of these whole-genome approaches to tree building is confirmed by analyses of numerous new genomes that are currently sequenced at an increasing rate, it will seem that the concept of a universal, "species" tree still makes sense, but this tree should be reinterpreted as a prevailing trend in the evolution of genome-scale gene sets rather than a complete picture of evolution.


Chapter 11
Mathematical Modeling of the Evolution of Domain Composition of Proteomes: A Birth-and-Death Process with Innovation

Georgy P. Karev, Yuri I. Wolf, Andrey Y. Rzhetsky, Faina S. Berezovskaya, and Eugene V. Koonin

Abstract
A simple model of the evolution of the domain composition of proteomes is described, with the following elementary processes: i) domain birth (duplication with divergence), ii) death (inactivation and/or deletion), and iii) innovation (emergence from non-coding or non-globular sequences or acquisition via horizontal gene transfer). This formalism can be described as a birth, death and innovation model (BDIM). Formulas for the equilibrium frequencies of domain families of different sizes and for the total number of families at equilibrium are derived for a general BDIM. All asymptotics of the equilibrium frequencies of domain families that are possible for this type of model are found, and their dependence on the model parameters is investigated. It is proved that power-law asymptotics appear if, and only if, the model is balanced, i.e. the domain duplication and deletion rates are asymptotically equal up to the second order. It is further proved that any power asymptotic with a power other than -1 can appear only if the hypothesis that the duplication/deletion rates are independent of the size of a domain family is rejected. Specific cases of BDIMs, namely simple, linear, polynomial and rational models, are considered in detail, and the distributions of the equilibrium frequencies of domain families of different sizes are determined for each case. We apply the developed formalism to the analysis of the domain family size distributions in prokaryotic and eukaryotic proteomes and show a good fit between these empirical data and a particular form of the model, the second-order balanced linear BDIM. The developed approach is oriented toward a mathematical description of the evolution of domain composition, but a simple reformulation could be applied to models of other evolving networks with preferential attachment.
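
The equilibrium behaviour of a linear BDIM can be explored numerically with a short sketch. It rests on assumptions that the abstract does not spell out: families of size i duplicate at rate birth * (i + a) and lose members at rate death * (i + b), and at equilibrium the flow from size i to size i + 1 equals the reverse flow, so that f(i + 1) = f(i) * birth_rate(i) / death_rate(i + 1). The parameter names a and b and both function names are hypothetical.

```python
import math


def linear_bdim_equilibrium(n_max, a, b, birth=1.0, death=1.0):
    """Relative equilibrium frequencies f(1..n_max), normalised to f(1) = 1.

    Assumed rates (illustrative parametrisation):
      birth_rate(i) = birth * (i + a)
      death_rate(i) = death * (i + b)
    Balance between neighbouring size classes gives
      f(i + 1) = f(i) * birth_rate(i) / death_rate(i + 1).
    """
    freqs = [1.0]
    for i in range(1, n_max):
        birth_rate_i = birth * (i + a)
        death_rate_next = death * (i + 1 + b)
        freqs.append(freqs[-1] * birth_rate_i / death_rate_next)
    return freqs


def loglog_slope(freqs, i_lo, i_hi):
    """Slope of log f(i) against log i between family sizes i_lo and i_hi."""
    rise = math.log(freqs[i_hi - 1]) - math.log(freqs[i_lo - 1])
    run = math.log(i_hi) - math.log(i_lo)
    return rise / run


# Balanced case (birth == death): the tail should follow a power law.
freqs = linear_bdim_equilibrium(2000, a=0.5, b=1.0)
slope = loglog_slope(freqs, 1000, 2000)
```

Under these assumptions the product of rate ratios is a ratio of gamma functions, so in the balanced case the tail behaves like a power law with exponent a - b - 1; for the values above the computed slope should be close to -1.5. Setting a equal to b, which makes the per-domain rates independent of family size, gives an exponent of -1.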
