
Bioinformatics and Genomes: Current Perspectives Chapter Abstracts



Chapter 1
Predicting Protein Structural Domain Boundaries From Sequence Data
Richard A. George, Jens Kleinjung and Jaap Heringa

Abstract
Understanding the domain content of a protein and delineating the associated domain boundaries is a crucial step for many areas in protein science. Structural studies by NMR and X-ray crystallography are often greatly aided by such knowledge. Also, sequence database searches are typically enhanced when performed using sequence stretches corresponding to a single domain, because a domain is a recurring functional and evolutionary unit in proteins. In this chapter, we will review existing methods for predicting and delineating domain boundaries from sequence and structure information. We will also describe the available public databases with protein domain information. Finally, we will give an overview of our approaches for domain prediction. These include: attempts to predict linker fragments directly from observed differences between linker and non-linker segments; our method SnapDRAGON, which exploits the consistency of domain boundary placement observed in a large set of ab initio 3D models generated for a given query sequence by a distance geometry-based method; and DOMAINATION, a method that predicts boundaries based on a new PSI-BLAST-related iterative protocol.


Chapter 2
Using Multiple Alignment Methods to Assess the Quality of Genomic Data Analysis
Cédric Notredame and Chantal Abergel

Abstract
The analysis of multiple sequence alignments can generate essential clues for genomic data analysis. Yet, to be informative such analyses require some means of estimating the reliability of a multiple alignment. In this chapter we describe a novel method allowing the unambiguous identification of the residues correctly aligned within a multiple alignment. This method uses an index named CORE (Consistency of the Overall Residue Evaluation) based on the T-Coffee multiple sequence alignment algorithm. We provide two examples of applications: one where the CORE index is used to identify correct blocks within a difficult multiple alignment and another where the CORE index is used on genomic data to identify the proper start codon and a frame-shift within one of the sequences.


Chapter 3
Analysis of expression data
Javier Tamames

Abstract
DNA array technology has become the most powerful tool in functional genomics. It is capable of monitoring simultaneously the expression profiles of thousands of genes in different conditions; hence it is an invaluable tool for exploring the responses of an organism. In addition, the potential of array techniques for exploring mutations and polymorphisms in genes is enormous.

From the bioinformatics point of view, a microarray experiment involves high complexity, both in the preparation and management of the experiment, and in the processing of the results. The difficulties are summarized as follows: To prepare microarrays, high numbers of clones are used, with the expectation that arrays containing hundreds of thousands of genes will soon be available. The management of the collections of clones must be planned carefully, and every clone must be tracked in terms of its location, function and performance in the experiments. In the management of the experiment, the creation of interfaces to control the machinery involved in the experiment is desirable but difficult, since different vendors exist and no standard is being used. After the hybridisation step, accurate imaging software must be used in order to obtain clear results.

The analysis of the results obtained can be performed using different approaches: Clustering algorithms are widely used; however, other methods have been used as well. It is important to know the advantages and drawbacks of the different methods in order to obtain meaningful information. Once groups of related genes have been found, a very important and often ignored element is data mining of the results, to extend the biological information about such groups of genes.

Finally, it is very important to store all the information relative to the experiment, using relational schemes of storage. To date, even though several different schemes have been proposed, no single standard exists. This makes it difficult to exchange and compare experiments between laboratories. In this chapter, I address in detail some of these points and propose possible solutions.


Chapter 4
High Rate Of Gene Displacement In Vitamin Biosynthesis Pathways
Enrique Morett, Gloria Saab-Rincon, Enrique Merino, Peer Bork, Emmanuvel Rajan, Leticia Olvera, and Maricela Olvera

Abstract
One of the most challenging tasks in genomic science is the prediction of the function of genes for which there is no clear sequence similarity to annotated genes. However, it is even more challenging to assign the correct function to genes that display sequence similarity to genes of unrelated function: analogous enzymes perform the same biochemical reaction but they are not phylogenetically related, such that it is not possible to identify them by sequence similarity. Here we propose that the vitamin biosynthetic pathways have experienced multiple events of gene loss and recovery of function by unrelated genes, the so-called analogous gene displacement. We carried out an extensive search for the genes that participate in thiamin biosynthetic pathways in the completely sequenced genomes. We show that the great majority of these organisms lack from a few to many orthologues of the Escherichia coli thi genes. We searched for the analogous enzymes using gene neighbourhood, co-occurrence in operons, identification of regulatory sequences, and anticorrelation strategies. Our strategy resulted in the identification of some possible analogous enzymes in this pathway.


Chapter 5
Prediction of Posttranslational Modifications From Amino Acid Sequence: Problems, Pitfalls, Methodological Hints
Frank Eisenhaber, Birgit Eisenhaber and Sebastian Maurer-Stroh

Abstract
Although high-throughput techniques for experimental proteome characterization are still in their infancy, the number of reported instances of posttranslational modifications of proteins is already larger than the number of their sequences in databases. Thus, a typical protein appears to be covalently modified several times during its lifetime, to allow the regulation of its function. Given the huge number of sequences of otherwise uncharacterised proteins resulting from genome projects, the computer-aided prediction of the possibility of posttranslational modifications from amino acid sequence becomes a necessity for genome annotation.

Only some types of posttranslational modification can be considered as reasonable targets for predictor development for the moment due to problems with the quality of learning sets and the complexity of protein substrate recognition for modification. Information on the sequence motif responsible for targeting can be extracted from the sequence variability of natural substrate proteins as well as of model compounds and from the structures of responsible enzymes. The motif description needs to be rich enough, albeit not necessarily in terms of positional amino acid type preferences, for a reliable discrimination of true substrates from unrelated sequences. Solutions for the construction of prediction functions as well as for the assessment of false prediction with rigorous statistical criteria are reviewed.


Chapter 6
Automatic Genome Annotation And The Status Of Sequence Databases
Miguel A. Andrade

Abstract
The characterisation of the genes predicted from a completely sequenced genome gives a detailed view of an organism. However, most of the proteins deduced from a genome project have never been seen before, and cannot be biochemically characterised on a short time scale. Therefore, computational methods are used to annotate the function of new protein sequences based on similar sequences already annotated in the sequence databases. However, the complex relation between function and sequence makes this difficult and erroneous annotation can happen. New errors can be produced when a wrongly annotated entry is used to annotate a new protein. In this chapter, I discuss the problems of function inference by protein sequence similarity, considering also the status of databases in relation to annotation errors. Finally, I evaluate the inter-relation between databases and the annotation process by similarity (using an automatic system) and how this situation is likely to evolve.


Chapter 7
Dynamics And Complexity In Systems Biology Modeling: Theoretical Challenges In Metabolic Simulation
Eric Minch

Abstract
The networks of metabolic and control interactions in the cell challenge our ability to gather appropriate data and to develop appropriate tools, but primarily the challenge is a theoretical one. The system we are trying to understand is not a machine with fixed structure. The components themselves as well as the types of components, and the interaction events in which they participate, are continually changing and altering the control and spatial structure, due to environmental perturbations and autonomous dynamics. This chapter briefly describes the available modeling methodologies (including stochastic, semiquantitative, and hybrid models), and explains the essential conceptual challenges such as:


* What is an appropriate representation of events involving interactions among internal regions of a macromolecule?
* How can we detect and characterize signalling and regulatory activities, and their effect on the control structure?
* At what point must a component be regarded as macroscopic, affecting the spatial structure?
* What is the proper representation of function as opposed to process?


Chapter 8
Mapping Words For Genome Data Integration
Carolina Perez-Iratxeta and Miguel A. Andrade

Abstract
In molecular databases, each entry is commonly related to others, within the database and in different databases, through cross-linking associations. Moreover, each single entry is annotated, more or less systematically, using words from an ontology: a controlled and structured set of concepts. Knowledge can be gained by means of comparing the annotations of the items within or between databases. We present a very general scheme to derive existing associations between features or elements in databases. This heuristic approach can be applied to discover associations between the annotations of the entries. An application is presented consisting of a mapping from the MeSH (Medical Subject Headings) ontology of MEDLINE to the one of SWISS-PROT keywords. The mapping can be incorporated in a protocol for suggesting to the SWISS-PROT annotators keywords for new entries and, in some cases, correcting previous inconsistencies.


Chapter 9
Structural Genomics and Structural Bioinformatics
Patrick Aloy, Baldomero Oliva and Robert B. Russell

Abstract
In recent years, we have seen how molecular biology has gone from sequencing a single gene to sequencing an entire eukaryotic genome. The large-scale sequencing projects are producing more and more sequences and the sequence databases are growing exponentially. However, full understanding of the biological role of these proteins will require knowledge of their structure and function. The different structural genomics projects were born as an attempt to tackle the problem of protein structure determination for complete genomes (proteomes). Experimental and computational approaches are needed to fulfil the expectation of assigning structures to proteomes. The experimental level (Structural Genomics) includes large-scale cloning, expression, purification and structure determination of the proteins. The computational level (Structural Bioinformatics) includes protein target selection, structure interpretation, comparative modelling and, finally, prediction of protein function. In this chapter, our intention is to take you through all these different aspects of Structural Genomics, to give you a rough idea of the state-of-the-art techniques, and to discuss the foreseen improvements and limitations.


Chapter 10
Role of Mnemonics and Virtual Reality in Visualizing Genomics and Proteomics Data
Seán I. O'Donoghue

Abstract
Students and researchers in the life sciences face a major challenge in dealing with large volumes of rapidly growing data. To meet the challenge they require informatics systems that integrate the many diverse databases in the life sciences, and facilitate cross-querying and data retrieval. In addition, there is an increasing need for systems that automatically organize new and existing data into meaningful graphical displays that help users with understanding, remembering, and navigating through these data. Several such systems have been developed. In the future, such systems may use insights from mnemonic techniques to help users deal with the large volume of data involved. The primary method of mnemonics is firstly, to encode abstract data into concrete objects (in this case, graphical representations of proteins or genes), and secondly, to place these objects in a space with a specific and meaningful context. For proteomics data, a natural spatial context is a 'bioatlas', i.e. where proteins are located within the context of a cell, organ, or organism. While there are clear limitations to such a view, I argue that it is a good starting point. Finally, I argue that the usefulness and usability of such views will be greatly enhanced by virtual reality techniques.


Chapter 11
Pseudogenes and genomes
David Torrents, Mikita Suyama and Peer Bork

Abstract
Knowledge of the gene content of any organism is essential for the study and understanding of its biology. The recent sequencing of large and complex genomes has forced the scientific community to develop or improve computer programs in order to identify such genes. These algorithms are based on the identification of characteristic patterns of gene-related elements (such as promoters, splice sites, polyadenylation signals, and others) and present an estimated success rate of 80%. But neither these programs nor their evaluation procedures normally take into consideration the presence of non-functional gene copies in the genome. These dispensable gene copies, known as pseudogenes, are formed either by retrotransposition or by tandem duplication. In some cases they are difficult to differentiate by using standard procedures since they share many sequence characteristics with their corresponding functional parental genes. The only criteria used so far to identify such non-functional elements depend on the detection of either disruptions in the open reading frame or any typical sign of retrotransposition. This leads to misclassification of some genes. In order to overcome this situation, we have developed an independent strategy that is capable of differentiating many functional from non-functional sequences. This procedure takes advantage of the different selective constraints associated with pseudogenes and genes. Using this method we estimated that the human genome contains 40000 pseudogenes, doubling current approximations. We are also proposing an error rate of 23% in standard procedures of gene annotation regarding the classification of genes and pseudogenes.


Chapter 12
Bioinformatics and Genomes: The Future
Contributions from all authors

No abstract is available for this chapter. The first paragraph is as follows:

Predicting how the emergence of a new research area may affect biology or even the entire world is not really what we may call an exact science. Actually, the vast majority of such predictions made in the course of the last century proved to be quite short sighted. However, it is important to try! Without looking too much ahead, all the authors in this book have a clear intuition of what seems to be the genuine driving force in their respective fields of speciality. Albeit it may be difficult to tell the exact destination one may at least point out the directions these forces seem to be aiming at.
