Roderic Guigó and Thomas Wiehe
Abstract
Essential to our ability to convert the nucleotide sequence of the genomes into valuable biological knowledge of
the organisms is the identification of the genes encoded in these genomes. Despite substantial progress during the
nineties, computational gene identification remains an open problem in higher eukaryotic organisms: a year after the publication of
the draft sequence of the human genome, the exact number of human genes remains highly controversial. In this chapter we
review the accuracy of computational gene finding methods. We emphasize recently developed methods, based on the comparison
of genomic sequences. These methods show promise in identifying novel genes, not well represented in the databases of
known coding sequences.
Chapter 2
Sensitive Protein Alignment Algorithms
Adam Godzik
Abstract
The recent advances in sensitive protein alignment algorithms come mostly from exploring additional information
available either from the analysis of substitution patterns in homologous proteins (profile or PSSM methods) or from the requirements
of the stability of the final protein structure (threading). Many new methods bring in new types of information beyond structure
and sequence variation among homologs and thus hold significant promise for further advancing the sensitivity of discovering
distant homologs. Methods such as ISS (intermediate sequence search) or other neighborhood methods explore features of
protein sequence space searching for "bridging proteins" to connect divergent families by a series of comparisons. Context
analysis, as implemented in some systems, uses position in a genome and the function of genome neighbors to identify function and
indirectly recognize homologies. Other methods use keyword and literature searches to build a "literature profile" of a protein,
discovering distant relations between proteins.
Chapter 3
Measuring Generality of Knowledge-Based Potentials
Extracted from Protein Structure Sets
Frank Eisenhaber, Shamil R. Sunyaev, Vladimir G. Tumanyan, and Eugene N. Kuznetsov
Abstract
Sequence-structure matching techniques such as threading with so-called knowledge-based potentials rely on the extraction
of amino acid location distributions (e.g., of solvent accessibilities, backbone conformational descriptors or pairwise
residue-residue distances) from learning sets of protein structures. These functions are used for the characterization of amino acid
type preferences as well as for the description of spatial locations in structural templates in the same physical terms.
The transferability of these functions from tertiary structure sets used for their derivation to new proteins outside the learning set
is critical for predictor performance. We present the problem in the framework of statistical hypothesis tests that allow
quantifying the extent to which the extracted knowledge-based functions can be extrapolated to new proteins. Several statistical models
for knowledge-based functions are considered in this work: (1) We show that the derivation of knowledge-based potentials within
the framework of statistical physics, especially with Boltzmann's distribution, is not consistent with the observed distributions
of residue location parameters. (2) The assumption of amino acid type-specific but protein-independent propensities is often a
valid approximation, although many cases of clear disagreement can be found. (3) We propose two models of propensities
corrected for the actual amino acid composition in a query protein.
Chapter 4
Protein Structure Modeling in Functional Genomics
Jürgen Kopp, Manuel C. Peitsch, Torsten Schwede
Abstract
Functional genomics can be greatly assisted by three-dimensional structures, giving insights into the molecular function
of proteins. Although structural genomics projects are aiming to provide protein structures on a large scale, in the foreseeable
future this can be achieved only for a small fraction of all protein sequences. Comparative modeling techniques are used to
complement experimental methods by building models for protein sequences which are related to known structures. According to their
quality, applications of three-dimensional models range from low-resolution functional assignments to molecular simulations at
an atomic level of detail.
Chapter 5
Identifying Domains, Repeats and Motifs from Protein Sequences
Chris P. Ponting and Alex Bateman
Abstract
The newly-sequenced genomes provide previously undreamt-of opportunities to generate experimentally-testable
hypotheses. These hypotheses arise from the identification of gene regions that are homologous, having arisen from a common ancestor,
and that often possess similarities in function. These homologues encode families of protein domains, repeats or motifs, that often
have been assembled by evolution into multiple distinct combinations, or architectures. This article suggests strategies for the
analysis of protein sequences in terms of their domain, repeat and motif architectures with an emphasis on statistical evaluation
of sequence similarities and the prediction of function from sequence.
Chapter 6
Exploiting the Variations in the Genomic Associations of Genes to Predict Pathways and Reconstruct their Evolution
Martijn A. Huynen and Berend Snel
Abstract
Various methods have been proposed to predict functional interactions between proteins using the genomic association of
their genes. Here we exploit the variation in this genomic association to increase the resolution of these methods. The
metabolic distance between enzymes, as defined by the number of substrates that separate them in a pathway, is used to quantify
the directness of functional interaction. We show a positive correlation between the strength of the genomic association and
the directness of the functional interaction, i.e. the stronger the genomic association is, the shorter the average metabolic
distance between the enzymes. Subsequently we map the variation and reconstruct the evolution of a single pathway in detail:
iron-sulfur cluster assembly in Proteobacteria and mitochondria. By deconstructing the evolution of this pathway to the level of
gene-gain and gene-loss events we identify (sub)modules within it. Proteins that are within the same sub-module are expected to have
a more direct functional interaction than ones that are in different sub-modules.
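The metabolic distance described in this abstract can be illustrated with a minimal sketch. It assumes that two enzymes are at distance 1 when they share a metabolite, and that the distance between any two enzymes is the shortest such chain, found by breadth-first search. The function name `metabolic_distance` and the pathway `toy_pathway` are hypothetical and are not taken from the chapter.

```python
from collections import deque

def metabolic_distance(reactions, start, goal):
    """Shortest number of shared-metabolite steps between two enzymes.

    reactions maps an enzyme name to the set of metabolites it consumes
    or produces (an assumed, simplified pathway representation).
    Returns None when the enzymes are not connected.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        enzyme, dist = queue.popleft()
        for other, metabolites in reactions.items():
            if other in seen:
                continue
            # Two enzymes are neighbours if they share at least one metabolite.
            if reactions[enzyme] & metabolites:
                if other == goal:
                    return dist + 1
                seen.add(other)
                queue.append((other, dist + 1))
    return None

# Hypothetical linear pathway A -> B -> C -> D plus an unrelated enzyme E4.
toy_pathway = {
    "E1": {"A", "B"},
    "E2": {"B", "C"},
    "E3": {"C", "D"},
    "E4": {"X", "Y"},
}

d_adjacent = metabolic_distance(toy_pathway, "E1", "E2")
d_two_steps = metabolic_distance(toy_pathway, "E1", "E3")
d_unlinked = metabolic_distance(toy_pathway, "E1", "E4")
```

In a real analysis, ubiquitous metabolites such as water or ATP would probably have to be excluded before building the graph, since they would otherwise connect almost every pair of enzymes.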
Chapter 7
Using Genomic Context in the Analysis of Multi-Component Systems
Natalie D. Fedorova, Anastasia N. Nikolskaya, and Michael Y. Galperin
Abstract
Multi-domain organization of newly sequenced proteins has long been recognized as a factor that significantly complicates
their functional annotation, particularly the automated annotation in the course of large-scale genome sequencing projects. We
discuss here some approaches to the analysis of multi-domain (multi-component) systems. We argue that comprehensive annotation
of multi-subunit enzymes and complex enzyme systems, such as ABC transporters or members of the two-component
signal transduction system, has to combine detailed analysis of their domain composition, operon organization of the
corresponding genes, gene content in the particular genome, and phylogenetic analysis. This multi-faceted "genomic context" approach needs
to include a human component, i.e. a highly-qualified genome biologist who should be able to identify new truly interesting
cases and put them into proper context.
Chapter 8
Prolegomena to the Evolution of Transcriptional Regulation in Bacterial Genomes
Mikhail S. Gelfand, Olga N. Laikova
Abstract
Availability of many complete bacterial genomes makes it possible to apply comparative genomic approaches to the
analysis of transcriptional regulation and evolution of regulatory interactions. Here we list some observations about co-evolution
of regulators and the signals they recognize in DNA; birth, evolution, and degeneration of regulons; influence of horizontal
gene transfer on regulation and evolution of ubiquitous regulators; interaction of regulons. Although scattered and incomplete,
these observations may provide the first base for development of a general theory of evolution of regulation.
Chapter 9
Experimental RNomics
Alexander Hüttenhofer and Jürgen Brosius
Abstract
A major milestone of genome projects, identification of all genes, can only be achieved when genes encoding
non-messenger RNAs including longer mRNA-like RNAs that lack open reading frames and small non-messenger RNAs (snmRNAs) are
not ignored. RNAs play a variety of important roles in different compartments and functions of the cell. We review
biomathematical identification of snmRNA candidates as well as experimental approaches in various organisms from bacteria to man.
As Experimental RNomics is unbiased in identifying novel snmRNAs, it is suitable to identify even novel classes of RNAs, such
as microRNAs (miRNAs). Combined approaches promise to reveal a further plethora of genome-encoded RNAs and their roles
in gene regulation.
Chapter 10
Genome-Scale Phylogenetic Trees
Yuri I. Wolf, Igor B. Rogozin, Nick V. Grishin, and Eugene V. Koonin
Abstract
Genome comparisons indicate that horizontal gene transfer and differential gene loss are major evolutionary phenomena
that involve a large fraction, if not the majority, of the genes, at least in prokaryotes. The extent of these events casts doubt on
the very feasibility of constructing a "tree of life" because the trees for different genes often tell different stories.
However, alternative approaches to tree construction that attempt to determine tree topology on the basis of comparisons of complete
gene sets seem to reveal a phylogenetic signal that supports the three-domain evolutionary scenario and suggests the possibility
of delineation of previously undetected major clades of prokaryotes. If the validity of these whole-genome approaches to
tree building is confirmed by analyses of numerous new genomes that are currently sequenced at an increasing rate, it will seem
that the concept of a universal, "species" tree still makes sense, but this tree should be reinterpreted as a prevailing trend in
the evolution of genome-scale gene sets rather than a complete picture of evolution.
Chapter 11
Mathematical Modeling of the Evolution of Domain Composition of Proteomes: A Birth-and-Death Process
with Innovation
Georgy P. Karev, Yuri I. Wolf, Andrey Y. Rzhetsky, Faina S. Berezovskaya, and Eugene V. Koonin
Abstract
A simple model of evolution of the domain composition of proteomes is described, with the following elementary processes:
i) domain birth (duplication with divergence), ii) death (inactivation and/or deletion), and iii) innovation (emergence from
non-coding or non-globular sequences or acquisition via horizontal gene transfer). This formalism can be described as a
birth, death and innovation model (BDIM). The formulas for equilibrium frequencies of domain families of different size and the
total number of families at equilibrium were derived for a general BDIM. All asymptotics of equilibrium frequencies of
domain families possible for the given type of models are found and their appearance depending on model parameters is investigated. It
is proved that the power law asymptotics appears if, and only if, the model is balanced, i.e. domain duplication and deletion rates
are asymptotically equal up to the second order. It is further proved that any power asymptotic with a power
other than -1 can appear only if the hypothesis of independence of the duplication/deletion rates from the size of a domain family is rejected. Specific
cases of BDIMs, namely simple, linear, polynomial and rational models, are considered in detail and the distributions of
the equilibrium frequencies of domain families of different size are determined for each case. We apply the developed formalism
to the analysis of the domain family size distributions in prokaryotic and eukaryotic proteomes and show a good fit between
these empirical data and a particular form of the model, the second-order balanced linear BDIM. The developed approach is oriented
toward a mathematical description of evolution of domain composition, but a simple reformulation could be applied to models of
other evolving networks with preferential attachment.
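A small numerical sketch may help make the equilibrium of such a model concrete. The abstract does not reproduce the chapter's formulas, so the sketch rests on assumed dynamics: a family of size i grows to size i+1 at rate birth(i), shrinks to size i-1 at rate death(i), and new families of size 1 appear at a constant innovation rate. Under these assumptions the equilibrium counts satisfy f_1 = innovation / death(1) and f_i = f_(i-1) * birth(i-1) / death(i). The function name `bdim_equilibrium` and all parameter values are hypothetical.

```python
import math

def bdim_equilibrium(birth, death, innovation, max_size):
    """Equilibrium counts f_1..f_N of families of each size.

    Assumed dynamics (not taken verbatim from the chapter):
    size i -> i+1 at rate birth(i), size i -> i-1 at rate death(i),
    and new size-1 families arrive at the constant rate `innovation`.
    """
    counts = [innovation / death(1)]
    for size in range(2, max_size + 1):
        counts.append(counts[-1] * birth(size - 1) / death(size))
    return counts

# Hypothetical balanced linear rates: birth(i) = lam*(i + a), death(i) = dlt*(i + b).
lam = dlt = 1.0
a, b = 1.0, 2.0
f = bdim_equilibrium(lambda i: lam * (i + a), lambda i: dlt * (i + b), 1.0, 2000)

# Local log-log slope of the tail; for these rates it approaches a - b - 1.
slope = (math.log(f[1999]) - math.log(f[999])) / (math.log(2000) - math.log(1000))
```

With equal rate constants the tail decays as a power law, in line with the balance condition stated in the abstract. Setting a = b gives an exponent of -1, while a different exponent requires the per-domain rates to depend on family size.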