All sequence comparison rests on evolutionary relationships, or homology. The more similar two protein (or nucleic acid) sequences are, the greater the probability that they are descended from a common ancestor, and hence the more likely they are to have a similar function. One of the first steps in characterizing a 'new' protein is to search existing databases for other proteins that are sufficiently similar that functional and structural characteristics may be inferred. An understanding of how such comparisons are performed may help the scientist search more effectively and interpret the results.
Chapter 2.
Internet: Sequence Databases
Paul Rangel
DNA and protein sequence databases are the cornerstone of bioinformatics research. DNA databases such as GenBank and EMBL accept genome data from sequencing projects around the world and make it available to researchers via the internet. Protein sequence databases play the same role for protein sequences: they are the central location for protein sequence data submissions. PIR's Protein Sequence Database (PSD) and SWISS-PROT are the two main databases. They provide a variety of ways to query the data, along with bioinformatics analysis tools that facilitate genetic research. The underlying organization of these databases has shaped the way computer-based molecular biology research is conducted. This chapter builds an understanding of sequence databases by reviewing data storage, common tools, and related online resources.
Chapter 3.
Internet: Multiple Sequence Analysis Part I: An Introduction to the Theory and Application of Multiple Sequence Analysis
Steve Thompson
I introduce the foundations, principles, and applications of multiple sequence analysis in this chapter, with a beginner's perspective in mind. I begin with a general introduction to the principles of pairwise sequence comparison, scoring matrices, and the dynamic programming algorithm. The concepts of similarity, significance, and homology are discussed next. These principles are then extended to multiple sequence alignment and analysis and its varied applications, specifically motif, profile, and phylogenetic techniques. A brief discussion of multiple sequence alignment as it relates to protein structure prediction concludes the chapter. These concepts are all illustrated in Part II's (Chapter 4) practical session using the Accelrys Wisconsin Package software.
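As a rough illustration of the dynamic programming idea mentioned in this abstract, the following minimal Python sketch computes a global (Needleman-Wunsch style) alignment score for two sequences. It is not taken from the chapter: the function name `global_alignment_score` and the match, mismatch, and gap values are assumptions chosen for illustration, and a real analysis would use a scoring matrix such as those the chapter discusses.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score by dynamic programming.

    The scoring values are illustrative assumptions, not a published matrix.
    Only the previous row of the score table is kept in memory.
    """
    # Row 0: aligning an empty prefix of `a` against prefixes of `b` costs gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a prefix of `a` against an empty prefix of `b`.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Either align a[i-1] with b[j-1], or open a gap in one sequence.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap
            left = curr[j - 1] + gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
```

With these assumed values, two identical seven-residue sequences score 7, and each gap needed to reconcile a length difference lowers the score by 2.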
Chapter 4.
Internet: Multiple Sequence Analysis Part II: A Practical Tour of SeqLab, the Accelrys GCG Wisconsin Package
Steve Thompson
Using an example protein, Elongation Factor 1a, and the foundations laid out in the previous chapter, I lead the reader through a 'hands-on' instructional tour of multiple sequence alignment and analysis using the Accelrys Genetics Computer Group SeqLab graphical user interface to the Wisconsin Package. A protein dataset is assembled and refined with LookUp and FastA; the sequences are analyzed for motifs, both from PROSITE and de novo using expectation maximization; an alignment is created, refined, and visualized; and profiles, including Hidden Markov Models, are built from the alignment, which are used to search sequence databases and to merge distant homologues into the alignment. Phylogenetic issues related to multiple sequence alignment are next investigated: masking concepts, format complications, and reliability. I conclude with a brief discussion of protein versus coding DNA and suggest a way in which they can be dealt with simultaneously.
Chapter 5.
Internet: Hidden Markov Models: Principles and Applications
Julian Gough
This chapter takes the reader through all of the steps involved in using hidden Markov models. The purpose is to make the reader aware of all of the important components which a user might potentially need to understand, but not go into a detailed explanation of algorithms such as dynamic programming which are necessary to write software from scratch. For more detailed information on the algorithms, the book "Biological sequence analysis" (Durbin et al., 1998) is recommended as further reading. Software which has already been written and which is freely available is recommended at the end of the chapter. In this chapter there is a strong emphasis on the use of hidden Markov models for remote protein homology detection, but the principles are described in general terms applicable to other applications.
Chapter 6.
Internet: Large scale EST analysis
Matthew G. Links, Jacek J. Nowak and William L. Crosby
Expressed Sequence Tag (EST) collections serve as an initial foundation for investigating the gene expression of a given organism. While there have been advances in the acquisition of genomic sequence data, these have been limited to a select group of organisms. Thus, for many organisms, EST data will continue to be a major focus of inquiry. An EST collection contains sequences of varying quality, redundant sequences, and incomplete transcripts. Presented here is a discussion of the approaches commonly taken to deal with the imperfect nature of EST data and thereby glean insight into the gene expression of a given organism.
Chapter 7.
Internet: Genquire and Apollo: The Evolution of Genome Browsers
Mark Wilkinson, David Block, William Crosby, Suzanna Lewis, Mark Gibson, Nomi Harris, Colin Wiel, John Richter, Michele Clamp and Steve Searle
Two genome browser/editors, Genquire and Apollo, reveal a convergence of understanding in the genome annotation and curation communities regarding the most intuitive and useful interface to genome sequence, feature, and annotation data. They share a common history of prior genome browsing and editing packages, and have evolved with the goal of selecting the best features from these ancestors and combining them into a comprehensive software suite. Apollo and Genquire both provide a straightforward graphical interface for browsing pre-existing features and annotations by globally dispersed research communities. In addition, these programs support de novo manual identification, editing, and annotation of new sequence features by experts. This software gives researchers the ability to directly amend gene structures and other important features of the genome such as pseudogenes, promoters, functional RNAs, transposable elements, and more. This detailed correction of genomic feature descriptions brings them into agreement with what is known from direct biological research and is a necessary prerequisite for further genomic studies. Thus Apollo and Genquire enable researchers who are anxious to move on to the study of whole genome transcriptomes and other comprehensive genomic analyses. These programs also share a common heritage and architecture. The difference between the two lies in the language they are written in: Genquire is written in Perl and Apollo is written in Java. Details of the implementation of the data and graphical widget layers are documented in the companion chapter (Chapter 8). Here we describe the two browsers only from a user's perspective. We believe that these two programs represent the culmination of the development of genome browsers, and are sufficiently expandable and flexible to alleviate the need for additional duplication of effort by others in the future.
Chapter 8.
Internet: Genquire and Apollo: Construction of a Genome 'Sandbox'
Suzanna Lewis, Mark Gibson, Nomi Harris, Colin Wiel, John Richter, Mark Wilkinson, David Block, William Crosby, Michele Clamp and Steve Searle
Apollo and Genquire provide a straightforward graphical interface for browsing features and editing de novo annotations. These programs also share a common heritage and architecture. The difference between the two lies in the language they are written in: Genquire is written in Perl and Apollo is written in Java. Their user interfaces are described in the companion chapter; this chapter discusses the implementation, architecture, and internals from the perspective of what a software developer would need to understand in order to install and configure the software and extend it by plugging in new components. Both programs are flexible in their ability to interface with various underlying data sources, ranging from proprietary database schemas to common flat-file formats such as GenBank, with only minor additional coding required to adapt to any new data source. Finally, both are expandable through a plug-in interface, enabling external software to alter the display and/or the data being browsed, both graphically and in the underlying data store.
Chapter 9.
Internet: Protein Structure Prediction
Lawrence Kelley
Predicting the 3-dimensional structure of a protein from its amino acid sequence is a major unsolved problem in biology. Many techniques have been developed to tackle the problem, from close and remote homology detection to statistical mechanics and empirical energy functions. This chapter provides a broad overview of some of the techniques shown to be successful at international blind trials in the fields of ab initio protein folding, comparative/homology modelling, and fold recognition (threading). Secondary structure prediction, iterative sequence searching, and the importance of the sequence/structure databases are discussed. Finally, there is a brief appraisal of automated protein structure prediction on the internet and the use of protein structure in predicting molecular function.
Chapter 10.
Internet: Molecular Structure Databases and Proteomics Tools
Jeremy Giovannetti
Sequence data holds only so much information, especially when it comes to proteins. Three-dimensional protein structures are representative of the molecule as it functions in the cell. Knowing what a molecule looks like in its biologically active state is a powerful piece of information. Molecular structure databases are portals into the three dimensional configurations of molecules. Structure databases, primarily concerned with proteins, are used for functional and evolutionary studies of molecules. Databases developed for structural studies of DNA and RNA also exist, though the amount of structural data they contain is minimal. Proteins can be analyzed in a variety of ways beyond the scope of structure databases. Proteomics tools, computer programs that assist molecular biological research, expedite the analysis process by generating data with minimal experimentation. The tools described in the second half of this chapter include more than 100 related to protein sequences and various 3-D structure applications. Applications of the tools vary from translating a DNA sequence to determining the amino acid composition of a protein sequence to generating a 3-D structure from a sequence.
Chapter 11.
Internet: An Introduction to Microarray Data Analysis
M. Madan Babu
This chapter aims to provide an introduction to the analysis of gene expression data obtained using microarray experiments. It has been divided into four sections. The first section provides basic concepts on the working of microarrays and describes the basic principles behind a microarray experiment. The second section deals with the representation and extraction of information from images obtained from microarray experiments. The third section addresses different methods for comparing expression profiles of genes and also provides an overview of different methods for clustering genes with similar expression profiles. The last section focuses on relating gene expression data with other biological information; it will provide the readers with a feel for the kind of biological discoveries one can make by integrating gene expression data with external information.
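To make the idea of comparing expression profiles concrete, here is a minimal Python sketch that scores the similarity of two gene expression profiles with the Pearson correlation coefficient. The abstract does not say which comparison methods the chapter covers, so the choice of Pearson correlation, the function name `pearson`, and the example values are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.

    Assumes both profiles have at least two measurements and non-zero variance.
    """
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of measurements")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance numerator and the two standard-deviation numerators.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Hypothetical expression levels for two genes across three conditions.
    gene_a = [1.0, 2.0, 3.0]
    gene_b = [2.0, 4.0, 6.0]
    print(pearson(gene_a, gene_b))
```

Profiles that rise and fall together give values near 1, and profiles that move in opposite directions give values near -1. A clustering method could use such a score as its similarity measure.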
Chapter 12.
Internet: Jemboss: A Graphical Interface to EMBOSS
Tim Carver
Jemboss is an extensible graphical front end to the European Molecular Biology Open Software Suite (EMBOSS) applications (Rice et al., 2000). The user is presented with a graphical display of the parameters for an application. Default parameters can be accepted or changed. Jemboss runs the EMBOSS applications and displays the results. A useful feature of EMBOSS is that some parameters change or become active or redundant according to the value given to other parameters or on the properties of a sequence. Jemboss calculates and displays these dependencies on-the-fly. This interface can run the applications either locally in a standalone mode or by calling a remote server in client-server mode. The second method uses web services technology to make EMBOSS accessible from external sites. Features of both modes are discussed, including their installation and when they should be used. A description is given of the input parameter forms, and sequence, project and file management systems that are incorporated into the interface.
Chapter 13.
Internet: Bioinformatics Over the Web: SeWeR, as You May Think
Malay K. Basu
The recent proliferation of bioinformatics services on the World Wide Web (WWW) requires efficient means to utilize these resources. The current model of using these services over the Internet via HTML-CGI suffers from severe deficiencies, and more efficient means are yet to be implemented widely. This review discusses some of these deficiencies and suggests alternatives for performing bioinformatics analyses over the web. As a case study, it discusses SEquence analysis using WEb Resources (SeWeR), a new-generation, intelligent web interface for performing bioinformatics analyses over the web. SeWeR is versatile, cross-platform, customizable, and integrated, and can be used to access any HTML-CGI based service without any change to the architecture of the existing service. SeWeR is distributed freely under the GNU General Public License (GPL), and can be used as a webpage or as stand-alone software from a desktop.
Chapter 14.
Internet: BioMOBY: The MOBY-S Platform for Interoperable Data-Service Provision
Mark Wilkinson
The BioMOBY project was initiated in late 2001 with the goal of exploring open-source platforms for the interoperable provision and discovery of biological data services. To date, this has resulted in the creation and deployment of a simple, extensible platform to enable the discovery, representation, retrieval, and integration of biological data from widely disparate data hosts and analysis services. The current platform, MOBY-S (for "MOBY Services"), is based on a web-services paradigm, but extends this by using a novel, ontology-aware web-service registry system. An open and extensible data-modeling system was designed in which data structures are determined by a data-type (Class) ontology that allows inheritance and container relationships, thus enabling new data-types to be created; service providers are obliged to use data types that exist in this ontology, but may extend the ontology with their own novel data-types. Similarly, an ontology of service types is being built, in which specific data retrieval and analysis services are categorized and related to one another. The registry is publicly available, as are the interfaces to the Class and Service ontologies. At this time, a wide variety of services are available via the MOBY-S system, ranging from keyword searches against Medline to BLAST sequence comparisons. As new services are registered, they immediately become interoperable with all other appropriate services, thus allowing service providers to generate ad-hoc services without concerning themselves with wider data-integration issues. Though it has been operational for less than a year, this "modular" approach is already showing great promise in solving the biological data integration problem.