Abstract
Understanding the domain content of a protein and delineating the associated domain boundaries is a crucial step in many areas of protein science. Structural studies by NMR and X-ray crystallography are often greatly aided by such knowledge. Also, sequence database searches are typically enhanced when performed using sequence stretches corresponding to a single domain, because a domain is a recurring functional
and evolutionary unit in proteins. In this chapter, we will review existing methods for predicting and
delineating domain boundaries from sequence and structure information. We will also describe the available
public databases with protein domain information. Finally, we will give an overview of our approaches for
domain prediction. These include: attempts to predict linker fragments directly from observed differences
between linker and non-linker segments; our method SnapDRAGON, which exploits the consistency of
domain boundary placement observed in a large set of ab initio 3D models generated for a given query sequence
by a distance geometry-based method; and DOMAINATION, a method that predicts boundaries based on
a new PSI-BLAST-related iterative protocol.
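As a rough, hypothetical sketch of the consistency idea behind SnapDRAGON (not the published implementation), the following Python fragment tallies how often each sequence position receives a domain boundary across a set of ab initio models and reports the centre of each well-supported region; the window size, support threshold, and toy model set are all invented for illustration.

    def consensus_boundaries(boundary_lists, seq_len, window=10, min_support=0.5):
        """Vote domain boundaries over many ab initio models.

        boundary_lists: one list of predicted boundary positions per model.
        A position is supported if a model places a boundary within
        +/- window residues of it; runs of positions supported by at
        least min_support of the models are reported by their centre.
        """
        n_models = len(boundary_lists)
        votes = [0] * (seq_len + 1)
        for boundaries in boundary_lists:
            for b in boundaries:
                for pos in range(max(1, b - window), min(seq_len, b + window) + 1):
                    votes[pos] += 1
        cutoff = min_support * n_models
        consensus, run = [], []
        for pos in range(1, seq_len + 1):
            if votes[pos] >= cutoff:
                run.append(pos)
            elif run:
                consensus.append(run[len(run) // 2])  # centre of the supported run
                run = []
        if run:
            consensus.append(run[len(run) // 2])
        return consensus

    # toy example: four of five models agree on a boundary near residue 120
    models = [[118], [121], [119], [123], [60]]
    print(consensus_boundaries(models, seq_len=250))  # -> [120]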
Abstract
The analysis of multiple sequence alignments can generate essential clues for genomic data analysis. Yet, to be informative, such analyses require some means of estimating the reliability of a multiple alignment. In this chapter, we describe a novel method allowing the unambiguous identification of the residues correctly aligned within a multiple alignment. This method uses an index named
CORE (Consistency of the Overall Residue Evaluation) based on the T-Coffee multiple sequence
alignment algorithm. We provide two examples of applications: one where the CORE index is used to
identify the correct blocks within a difficult multiple alignment, and another where it is used on genomic data to identify the proper start codon and a frameshift in one of the sequences.
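As a toy illustration of the consistency principle underlying the CORE index (the actual T-Coffee library construction and extension are considerably more elaborate), the sketch below scores each residue of a multiple alignment by the fraction of its column partners with which it is also paired in a library of pairwise alignments; all data structures here are hypothetical.

    def core_like_scores(msa, library):
        """Toy per-residue consistency score in the spirit of the CORE index.

        msa:     list of gapped sequences of equal length.
        library: set of ((seq_i, res_i), (seq_j, res_j)) residue pairings
                 collected from pairwise alignments of the same sequences.
        Returns one list of scores in [0, 1] per sequence.
        """
        n = len(msa)
        # map each alignment column to the ungapped residue index, per sequence
        res_idx = []
        for seq in msa:
            idx, count = [], 0
            for ch in seq:
                idx.append(None if ch == '-' else count)
                count += ch != '-'
            res_idx.append(idx)

        scores = [[0.0] * sum(ch != '-' for ch in seq) for seq in msa]
        for col in range(len(msa[0])):
            present = [(s, res_idx[s][col]) for s in range(n)
                       if res_idx[s][col] is not None]
            for s, r in present:
                support = sum(((s, r), (t, q)) in library or
                              ((t, q), (s, r)) in library
                              for t, q in present if t != s)
                if len(present) > 1:
                    scores[s][r] = support / (len(present) - 1)
        return scores

    # invented two-sequence example with a three-pair library
    msa = ["AC-E", "ACGE"]
    lib = {((0, 0), (1, 0)), ((0, 1), (1, 1)), ((0, 2), (1, 3))}
    print(core_like_scores(msa, lib))  # -> [[1.0, 1.0, 1.0], [1.0, 1.0, 0.0, 1.0]]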
Abstract
DNA array technology has become the most powerful tool in functional genomics. It is capable of simultaneously monitoring the expression profiles of thousands of genes under different conditions, which makes it an invaluable tool for exploring the responses of an organism. In addition, the potential of array techniques for exploring mutations and polymorphisms in genes is enormous.
From the bioinformatics point of view, a microarray experiment is highly complex, both in the preparation and management of the experiment and in the processing of the results. The difficulties can be summarized as follows: To prepare microarrays, large numbers of clones are used, with the expectation that arrays containing hundreds of thousands of genes will soon be available. The management of the clone collections must be planned carefully, and every clone must be tracked in terms of its location, function and performance in the experiments. In the management of the experiment, the creation of interfaces to control the machinery involved is desirable but difficult, since equipment comes from different vendors and no common standard is in use. After the hybridisation step, accurate imaging software must be used in order to obtain clear results.
The analysis of the results can be performed using different approaches: clustering algorithms are widely used, although other methods have been applied as well. It is important to know the advantages and drawbacks of the different methods in order to obtain meaningful information. Once groups of related genes have been found, a very important and often ignored step is data mining of the results, to extend the biological information about such groups of genes.
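As a minimal sketch of the clustering approach mentioned above, the fragment below applies average-linkage hierarchical clustering with a correlation distance to a toy expression matrix; the data and the choice of linkage and metric are illustrative assumptions, not a recommendation from this chapter.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    # toy expression matrix: rows are genes, columns are conditions
    expr = np.array([
        [2.1, 2.0, 0.1, 0.2],   # gene A: high early, low late
        [2.0, 1.9, 0.2, 0.1],   # gene B: same pattern as A
        [0.1, 0.2, 2.2, 2.1],   # gene C: opposite pattern
    ])

    # correlation distance groups genes with similar profile shapes
    dist = pdist(expr, metric='correlation')
    tree = linkage(dist, method='average')
    clusters = fcluster(tree, t=2, criterion='maxclust')
    print(clusters)   # e.g. [1 1 2]: A and B cluster together, C apart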
Finally, it is very important to store all the information relative to the experiment, using relational storage schemes. To date, even though several different schemes have been proposed, no single standard exists. This makes it difficult to exchange and compare experiments between laboratories. In this chapter, I address some of these points in detail and propose possible solutions.
Abstract
One of the most challenging tasks in genomic science is the prediction of the function of genes for which there is no clear sequence similarity to annotated genes. However, it is even more challenging to assign the correct function to genes that display sequence similarity to genes of unrelated function: analogous
enzymes perform the same biochemical reaction but are not phylogenetically related, so it is not
possible to identify them by sequence similarity. Here we propose that the vitamin biosynthetic pathways
have experienced multiple events of gene loss and recovery of function by unrelated genes, the so-called
analogous gene displacement. We carried out an extensive search for the genes that participate in thiamin
biosynthetic pathways in the completely sequenced genomes. We show that the great majority of these organisms lack orthologues to anywhere from a few to many of the Escherichia coli thi genes. We searched for the analogous enzymes using gene neighbourhood, co-occurrence in operons, identification of regulatory sequences, and anticorrelation strategies. Our strategy resulted in the identification of some possible analogous enzymes in this pathway.
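The anticorrelation strategy mentioned above can be sketched as a comparison of phylogenetic (presence/absence) profiles: a candidate analog should tend to be present precisely in the genomes where the known gene's orthologue is missing. The fragment below is a hypothetical illustration; the gene name thiE is used only as an example, and the profiles are invented.

    def anticorrelated_candidates(known_profile, candidate_profiles, min_score=0.8):
        """Rank candidate analogous genes by profile complementarity.

        known_profile:      dict genome -> 1/0, presence of the known gene's
                            orthologue in each genome.
        candidate_profiles: dict gene -> presence/absence dict over the
                            same genomes.
        Scores each candidate by the fraction of genomes in which its
        presence is the opposite of the known gene's.
        """
        genomes = list(known_profile)
        hits = []
        for gene, profile in candidate_profiles.items():
            complementary = sum(profile[g] != known_profile[g] for g in genomes)
            score = complementary / len(genomes)
            if score >= min_score:
                hits.append((score, gene))
        return sorted(hits, reverse=True)

    # invented profiles over four genomes
    thiE = {'g1': 1, 'g2': 1, 'g3': 0, 'g4': 0}
    candidates = {'orfX': {'g1': 0, 'g2': 0, 'g3': 1, 'g4': 1},   # perfect complement
                  'orfY': {'g1': 1, 'g2': 0, 'g3': 1, 'g4': 0}}
    print(anticorrelated_candidates(thiE, candidates))  # -> [(1.0, 'orfX')]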
Abstract
Although high-throughput techniques for experimental proteome characterization are still in their infancy, the number of reported instances of posttranslational modifications of proteins is already larger than the number of their sequences in databases. Thus, a typical protein appears to be covalently modified
several times during its lifetime, to allow the regulation of its function. Given the huge number of sequences
of otherwise uncharacterised proteins resulting from genome projects, the computer-aided prediction of
the possibility of posttranslational modifications from amino acid sequence becomes a necessity for
genome annotation.
For the moment, only some types of posttranslational modification can be considered reasonable targets for predictor development, owing to problems with the quality of learning sets and the complexity of substrate recognition by the modifying enzymes. Information on the sequence motif responsible for targeting can be extracted from the sequence variability of natural substrate proteins and model compounds, as well as from the structures of the responsible enzymes. The motif description needs to be rich enough, albeit not necessarily in terms of positional amino acid type preferences, for a reliable discrimination of true substrates from unrelated sequences. Solutions for constructing prediction functions, as well as for assessing false prediction rates with rigorous statistical criteria, are reviewed.
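To make the notion of positional amino acid preferences concrete, the sketch below scores potential modification sites with a small position weight matrix. Both the matrix values and the motif are invented for illustration; as noted above, a realistic predictor needs a richer motif description than positional preferences alone.

    # hypothetical log-odds matrix for a modification motif, positions -2..+2
    # around the candidate site (values invented for illustration)
    PWM = {
        -2: {'R': 1.2, 'K': 0.9},
        -1: {'R': 0.8, 'S': 0.3},
         0: {'S': 2.0, 'T': 1.5},     # the residue that would be modified
         1: {'P': -1.0, 'L': 0.6},
         2: {'E': 0.7, 'D': 0.5},
    }

    def score_site(seq, pos, pwm=PWM, default=-0.5):
        """Sum log-odds contributions over the motif window; higher scores
        are more substrate-like. Residues absent from a column receive a
        default penalty."""
        total = 0.0
        for offset, column in pwm.items():
            i = pos + offset
            if 0 <= i < len(seq):
                total += column.get(seq[i], default)
        return total

    seq = "MARKSSLPED"
    for i, aa in enumerate(seq):
        if aa in "ST":                      # scan serines and threonines
            print(i, aa, round(score_site(seq, i), 2))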
Abstract
The characterisation of the genes predicted from a completely sequenced genome gives a detailed view of an organism. However, most of the proteins deduced from a genome project have never been seen before, and cannot be biochemically characterised on a short time scale. Therefore, computational methods are used to annotate the function of new protein sequences based on similar sequences already annotated in
the sequence databases. However, the complex relation between function and sequence makes this difficult, and erroneous annotations can occur. New errors are then produced when a wrongly annotated entry is used
to annotate a new protein. In this chapter, I discuss the problems of function inference by protein
sequence similarity, considering also the status of databases in relation to annotation errors. Finally, I evaluate
the inter-relation between databases and the annotation process by similarity (using an automatic system)
and how this situation is likely to evolve.
Abstract
The networks of metabolic and control interactions in the cell challenge our ability to gather appropriate data and to develop appropriate tools, but primarily the challenge is a theoretical one. The system we are trying to understand is not a machine with fixed structure. The components themselves as well as the
types of components, and the interaction events in which they participate are continually changing, altering the control and spatial structure in response to environmental perturbations and autonomous dynamics. This
chapter briefly describes the available modeling methodologies (including stochastic, semiquantitative, and
hybrid models), and explains the essential conceptual challenges such as:
* What is an appropriate representation of events involving interactions among internal regions of
a macromolecule?
* How can we detect and characterize signalling and regulatory activities, and their effect on
the control structure?
* At what point must a component be regarded as macroscopic, affecting the spatial structure?
* What is the proper representation of function as opposed to process?
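Of the modeling methodologies listed above, the stochastic approach is the easiest to make concrete. The sketch below is a minimal Gillespie-style stochastic simulation of a toy gene-expression system (constitutive transcription plus first-order mRNA decay); the reactions and rates are invented for illustration, and real cellular models are far richer.

    import random

    def gillespie(state, reactions, t_end):
        """Minimal Gillespie stochastic simulation.

        state:     dict species -> copy number.
        reactions: list of (propensity_fn, change_dict) pairs, where
                   propensity_fn(state) gives the reaction rate and
                   change_dict the stoichiometric update.
        """
        t, trace = 0.0, [(0.0, dict(state))]
        while t < t_end:
            props = [rate(state) for rate, _ in reactions]
            total = sum(props)
            if total == 0:
                break
            t += random.expovariate(total)       # waiting time to next event
            r = random.uniform(0, total)         # choose a reaction
            for p, (_, change) in zip(props, reactions):
                r -= p
                if r <= 0:
                    for species, d in change.items():
                        state[species] += d
                    break
            trace.append((t, dict(state)))
        return trace

    # toy model: constitutive transcription and first-order mRNA decay
    reactions = [
        (lambda s: 0.5,             {'mRNA': +1}),
        (lambda s: 0.1 * s['mRNA'], {'mRNA': -1}),
    ]
    print(gillespie({'mRNA': 0}, reactions, t_end=100.0)[-1])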
Abstract
In molecular databases, each entry is commonly related to others, within the same database and in different databases, through cross-linking associations. Moreover, each entry is annotated, more or less systematically, using words from an ontology: a controlled and structured set of concepts. Knowledge can be gained by comparing the annotations of the items within or between databases. We present a very general scheme to derive existing associations between features or elements in databases. This
heuristic approach can be applied to discover associations between the annotations of the entries. An application
is presented, consisting of a mapping from the MeSH (Medical Subject Headings) ontology of MEDLINE to the ontology of SWISS-PROT keywords. The mapping can be incorporated into a protocol for suggesting keywords for new entries to the SWISS-PROT annotators and, in some cases, for correcting previous inconsistencies.
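A minimal sketch of the kind of association discovery described here: scoring MeSH-term/keyword pairs by pointwise mutual information over a corpus of entries annotated in both vocabularies. The scoring measure, support threshold, and toy corpus are assumptions for illustration, not the chapter's actual heuristic.

    from collections import Counter
    from math import log2

    def associations(entries, min_support=3):
        """Score (MeSH term, keyword) pairs by pointwise mutual information.

        entries: list of (mesh_terms, keywords) pairs, one per database
                 entry annotated in both vocabularies.
        """
        n = len(entries)
        mesh_counts, kw_counts, pair_counts = Counter(), Counter(), Counter()
        for mesh_terms, keywords in entries:
            for m in set(mesh_terms):
                mesh_counts[m] += 1
                for k in set(keywords):
                    pair_counts[(m, k)] += 1
            for k in set(keywords):
                kw_counts[k] += 1
        scored = []
        for (m, k), c in pair_counts.items():
            if c >= min_support:      # ignore rare, unreliable pairs
                pmi = log2((c / n) / ((mesh_counts[m] / n) * (kw_counts[k] / n)))
                scored.append((pmi, m, k))
        return sorted(scored, reverse=True)

    # invented toy corpus of cross-annotated entries
    entries = [(['Hemoglobins'], ['Oxygen transport']),
               (['Hemoglobins', 'Anemia'], ['Oxygen transport']),
               (['Hemoglobins'], ['Oxygen transport', 'Disease mutation']),
               (['Kinases'], ['Transferase'])]
    print(associations(entries))  # -> [(0.415..., 'Hemoglobins', 'Oxygen transport')]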
Abstract
In recent years, we have seen molecular biology go from sequencing a single gene to sequencing entire eukaryotic genomes. The large-scale sequencing projects are producing more and more sequences, and the sequence databases are growing exponentially. However, full understanding of the biological role of the encoded proteins will require knowledge of their structure and function. The different structural
genomics projects were born as an attempt to tackle the problem of protein structure determination for
complete genomes (proteomes). Experimental and computational approaches are needed to fulfil the expectation of assigning structures to entire proteomes. The experimental level (Structural Genomics) includes large-scale cloning, expression, purification and structure determination of the proteins. The computational level (Structural Bioinformatics) includes protein target selection, structure interpretation, comparative modelling and, finally, prediction of protein function. In this chapter, our intention is to take you through these different aspects of Structural Genomics, to give you a rough idea of the state-of-the-art techniques, and to discuss anticipated improvements and limitations.
Abstract
Students and researchers in the life sciences face a major challenge in dealing with large volumes of rapidly growing data. To meet the challenge they require informatics systems that integrate the many diverse databases in the life sciences, and facilitate cross-querying and data retrieval. In addition, there is an increasing need for systems that automatically organize new and existing data into meaningful graphical displays that
help users with understanding, remembering, and navigating through these data. Several such systems have
been developed. In the future, such systems may use insights from mnemonic techniques to help users deal
with the large volume of data involved. The primary method of mnemonics is, first, to encode abstract data into concrete objects (in this case, graphical representations of proteins or genes) and, second, to place
these objects in a space with a specific and meaningful context. For proteomics data, a natural spatial context is
a 'bioatlas', in which proteins are located within the context of a cell, organ, or organism. While there
are clear limitations to such a view, I argue that it is a good starting point. Finally, I argue that the usefulness
and usability of such views will be greatly enhanced by virtual reality techniques.
Abstract
Knowledge of the gene content of an organism is essential for the study and understanding of its biology. The recent sequencing of large and complex genomes has forced the scientific community to develop or improve computer programs to identify such genes. These algorithms are based on the identification of characteristic patterns of gene-related elements (such as promoters, splice sites, polyadenylation signals, and others) and achieve an estimated success rate of 80%. However, neither these programs nor their
evaluation procedures normally take into consideration the presence of non-functional gene copies in the
genome. These dispensable gene copies, known as pseudogenes, are formed either by retrotransposition or by
tandem duplication. In some cases they are difficult to differentiate using standard procedures, since they share many sequence characteristics with their corresponding functional parental genes. The only criteria used so far to identify such non-functional elements depend on the detection of either disruptions in the open reading frame or typical signs of retrotransposition. This leads to the misclassification of some genes.
To overcome this situation, we have developed an independent strategy that is capable of differentiating many functional from non-functional sequences. This procedure takes advantage of the different selective constraints associated with pseudogenes and genes. Using this method, we estimate that the human genome contains 40,000 pseudogenes, double current approximations. We also propose an error rate of 23% in the classification of genes versus pseudogenes by standard gene-annotation procedures.
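A deliberately crude sketch of the selective-constraint idea (not the chapter's actual procedure): comparing a gene copy with its parental gene, count codon substitutions that change the encoded amino acid (nonsynonymous) versus silent ones (synonymous). Functional genes tend to show few nonsynonymous changes relative to synonymous ones, whereas pseudogenes, free of constraint, accumulate both at similar rates. Real analyses use properly normalised Ka/Ks estimators (e.g. Nei-Gojobori); the sequences below are invented, and Biopython's Seq.translate is assumed available.

    from Bio.Seq import Seq

    def substitution_classes(cds_a, cds_b):
        """Count nonsynonymous (N) and synonymous (S) codon substitutions
        between two aligned coding sequences. Ignores the site
        normalisation performed by real Ka/Ks estimators."""
        n = s = 0
        for i in range(0, min(len(cds_a), len(cds_b)) - 2, 3):
            ca, cb = cds_a[i:i + 3], cds_b[i:i + 3]
            if ca == cb or '-' in ca + cb:
                continue          # identical codon or alignment gap
            if str(Seq(ca).translate()) == str(Seq(cb).translate()):
                s += 1            # silent: amino acid unchanged
            else:
                n += 1            # amino acid replaced
        return n, s

    gene   = ("ATGGCTGCA", "ATGGCAGCA")   # one silent change  -> (0, 1)
    pseudo = ("ATGGCTGCA", "ATGCCTTCA")   # two replacements   -> (2, 0)
    print(substitution_classes(*gene), substitution_classes(*pseudo))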
No abstract is available for this chapter. The first paragraph is as follows:
Predicting how the emergence of a new research area may affect biology or even the entire world is not really what we may call an exact science. Actually, the vast majority of such predictions made in the course of the last century proved to be quite short-sighted. However, it is important to try! Without looking too far ahead, all the authors in this book have a clear intuition of what seems to be the genuine driving force in their respective fields of speciality. Although it may be difficult to tell the exact destination, one may at least point out the directions in which these forces seem to be aiming.