Comparative genomics provides a powerful set of tools for identifying functional DNA elements, which often cannot be distinguished from non-functional DNA by analysis of a single genome alone. As the amount of available sequence data has grown in recent years, so have the power and scope of comparative analyses. Here we provide a broad overview of recent work in the field, with the intent of describing some of the more important or promising techniques, as well as explaining some basic theory. We also cover some basics of data interpretation and use, as a common starting point for many different downstream methods. Finally, we discuss some of the theoretical and practical limitations of comparative genomics and some of the clever ways researchers have tried to overcome them.
Chapter 2
Computational Analysis and Paleogenomics of Interspersed Repeats in Eukaryotes
Cédric Feschotte and Ellen J. Pritham
Interspersed repeats occupy a significant fraction of the genetic material and represent the single major component of large eukaryotic genomes. They result from the persistent activity and gradual accumulation of transposable elements (TEs), sequences that are able to replicate in virtually all organisms and that have been successfully maintained throughout the evolution of life. Despite their selfish nature, the movement and amplification of TEs have had an enormous impact on the evolution of genes and the dynamics of genomes. Improvements to the efficiency of DNA sequencing coupled with decreases in its associated costs have fueled the sequencing of dozens of eukaryotic genomes. This has resulted in the rapid accumulation of large quantities of DNA sequences in the public databases. As such, the identification and annotation of repeats has become an integral facet of genome biology and has provoked a shift from the study of single TEs to huge populations of elements. Here we review the approaches and methods by which TEs are identified, classified and analyzed in complete eukaryotic genome sequences. We provide examples illustrating how these processes greatly facilitate genome annotation and illuminate the extent of the role of TEs in the evolution of genomes and species.
Chapter 3
Eukaryotic Transcriptional Regulation: Signals, Interactions and Modules
Sridhar Hannenhalli
The activity of genes and proteins in a cell is regulated, in significant part, at the level of transcription. Basal transcription, although executed by the polymerase enzyme, is regulated by a cohort of cooperating transcription factor proteins that bind to their cognate DNA binding sites in the vicinity of the gene and, through interactions with each other and with polymerase, assist in the proper positioning and activation of polymerase at the transcription start site. The overall transcriptional control is distributed among smaller groups of transcription factors - transcriptional modules. Transcriptional modules provide an elegant mechanism for co-regulation of many genes required in the same pathway. Computational methods have mainly addressed two types of biological questions pertaining to transcription - initiation and regulation. Computational prediction of transcription start sites has been extensively studied and many excellent reviews have been published, so we will only briefly summarize the current state of the art. We will mainly focus on the computational research pertaining to transcriptional regulation. We have organized this review around three computational problems - predicting transcription factor binding sites, predicting interaction between a pair of transcription factors, and predicting transcription modules.
Chapter 4
Genomic Sequence Analysis: A Case Study in Constrained Heaviest Segments
Kun-Mao Chao
Methods for genomic sequence analysis have been studied for more than a decade. One line of investigation is to locate biologically meaningful segments, such as conserved regions or GC-rich regions, in DNA sequences. A common approach is to assign a real number (also called a score) to each residue, and then look for the maximum-sum or maximum-average segment. In this chapter, we address a few interesting applications concerning the search for the "heaviest" segment of a numerical sequence, a problem that naturally arises in biomolecular sequence analysis. We also introduce some fundamental algorithmic techniques for solving these problems.
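The maximum-sum variant mentioned above can be sketched in a few lines. This is a minimal illustration, not the chapter's own method: it uses the classic linear-time scan (Kadane's algorithm) without any length constraint, and the function names and the +1/-1 scoring of GC versus AT bases are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple


def gc_scores(dna: str, gc: float = 1.0, at: float = -1.0) -> List[float]:
    """Assign a score to each residue (hypothetical scheme: G/C = +1, others = -1)."""
    return [gc if base in "GC" else at for base in dna.upper()]


def heaviest_segment(scores: Sequence[float]) -> Tuple[float, int, int]:
    """Return (best_sum, start, end) of the maximum-sum non-empty segment.

    Indices are 0-based and `end` is inclusive. Runs in O(n) time.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    best_sum = cur_sum = scores[0]
    best_start = best_end = cur_start = 0
    for i in range(1, len(scores)):
        if cur_sum < 0:
            # A negative running sum can only hurt; start a new segment here.
            cur_sum = scores[i]
            cur_start = i
        else:
            cur_sum += scores[i]
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i
    return best_sum, best_start, best_end


if __name__ == "__main__":
    dna = "ATGCGCGATAT"
    total, start, end = heaviest_segment(gc_scores(dna))
    print(total, dna[start:end + 1])
```

With this scoring, the heaviest segment of a DNA string is its most GC-rich stretch. Adding a minimum or maximum length, or maximizing the average instead of the sum, requires different techniques from this simple scan.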
Chapter 5
A Survey of Sequence Alignment
Daniel G. Brown
We give a general overview of different perspectives on the meaning and creation of sequence alignments. Our points of view come from statistical, computational, algorithmic, evolutionary, and biological directions. We then survey recent work on algorithms in this area, focusing on local pairwise alignment of DNA sequences, but also discussing some recent work on multiple and global alignments.
Chapter 6
Computational Challenges of Microarray Analysis
Pawel Michalak, Young Bun Kim, and Jean Gao
DNA microarrays, often known as chips or biochips, enable researchers to analyze the expression of thousands of genes in a single experiment. Microarrays have thus largely contributed to the recent explosion in the rate of acquisition of biomedical data. However, microarray analysis, which until recently has lagged behind the technological development, poses a number of unique challenges. Here we provide a short review of computational problems of microarray analysis related to data normalization, the rate of false positives due to multiple comparisons, detection of differential expression, and cross-species use of microarrays.
Chapter 7
Computational Analysis of HIV Molecular Sequences
Colombe Chappey
Despite the approval of 21 antiretroviral drugs, the use of combination therapy, the breakthrough in vaccine-related immunology research, and the successes in HIV-1 prevention, HIV/AIDS continues to be a leading cause of illness and death in the United States. It is estimated that approximately one million individuals in the United States are currently living with this disease, and approximately 40,000 new cases of HIV-1 infection are diagnosed each year. The extraordinary replication dynamics of HIV-1 facilitate its escape from selective pressure exerted by the human immune system and by combination drug therapy. Since the beginning of the epidemic, bioinformatics approaches have been successfully used in multiple HIV research areas. A multitude of computational analyses have been published that quantify and simulate the biological interplay between viral genetic variation and host immune response throughout the infection in HIV-1 infected patients. Insights gained from these studies are crucial for the development of potential therapeutic agents and vaccines that will result in the control, treatment, and prevention of HIV infection. The chapter covers a range of computational analyses that focus largely on HIV-1 genetic variation. Bioinformatics is described here as a set of computational and quantitative approaches working hand in hand with specialists from other disciplines: clinicians, immunologists, epidemiologists, virologists, structural biologists, evolutionary biologists, statisticians, and mathematicians.
Chapter 8
Biological Databases
Alberto Riva
Databases are increasingly important tools for modern biological research. They not only collect the data and knowledge being generated by modern high-throughput methods and make them available in efficient and flexible ways, but they also provide a common reference point to ensure consistency among multiple, constantly growing data sources. More importantly, they determine the way our knowledge is organized, represented and communicated: creating a database automatically implies creating or adopting standardized nomenclatures and common identifiers for the entities it contains. Despite the growing trend towards "clusters" of databases that are extensively linked to each other (such as NCBI Entrez or Ensembl), many useful resources are still isolated and poorly connected with the rest, resulting in inconsistencies, difficulties in retrieving information, and delays in updating. The number and variety of publicly available biological databases reflect the extreme diversity of the known biological processes and structures, and are also a product of the "organic" growth they underwent, starting as small-scale resources developed by individual research groups for their own purposes, and becoming in many cases comprehensive sources of knowledge that benefit the whole research community. This chapter will provide an overview of the most important publicly available resources, touching on databases of sequences, genes, mutations, protein structure, and protein interactions, all the way to clinically relevant phenotypes and diseases. We will describe the structure, contents and intended use of these resources, and we will address the problems related to creating, updating and integrating such large collections of dynamic data.
Chapter 9
Distributed Computational Biology: Clusters and Grids
David Levine
Demands for processing and storing large biological data have driven the creation of clusters and grids of computers. Working in parallel, assemblies of computers can tackle very large computational problems and allow previously unmanageable work to be accomplished. Many clusters and grids have been created, and we describe several representative systems illustrating both hardware and software architectures. Software applications are described to provide examples of current use, and cooperative computational grids show a path to future trends in distributed computational biology.