Genome sequencing is a young and still evolving field. At the beginning, the possibility of sequencing entire eukaryotic genomes was naturally applied to those species for which biological data was already the most abundant. Today, however, genome sequencing is turning towards species that are of interest because of their key position in the evolutionary tree or because of their economic value. The corollary to this situation is that little information may be available on the genome of a species before a sequencing project is initiated. This chapter nevertheless describes some parameters that are worth looking into before time and money are invested in a large project, such as determining precisely the size of the genome, estimating the level of polymorphism in a population or exploring the repeat content. Depending on each particular case, knowledge of some of these basic properties may help lay the foundation for a successful sequencing project.
Radiation hybrid mapping is a method that combines aspects of both genetic and physical mapping and has proven to be key in the rapid construction of whole-genome maps. A panel of radiation hybrid cell lines, each retaining a different portion of the donor genome in the background of a recipient cell, is used to score the presence or absence of sequence-tagged sites (STSs). The frequency with which markers co-segregate reflects their proximity to one another. Statistical analysis of the similarity of the retention patterns allows marker order and intermarker distances to be defined. This chapter aims to provide hands-on experience in using this mapping methodology and extends into the logistics of building a radiation hybrid map to support the assembly of bacterial clone maps spanning whole chromosomes.
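As a rough illustration of how retention patterns translate into distances, the sketch below applies a commonly used two-point estimator of breakage frequency to a pair of markers. The vector encoding ("1" retained, "0" absent, "2" unscored), the function name and the cap on theta are assumptions made for this example and are not taken from the chapter; dedicated mapping packages use multipoint methods that are considerably more robust.

```python
import math

def two_point_distance(pattern_a, pattern_b, max_theta=0.999):
    """Estimate breakage frequency and distance between two markers.

    Each pattern is a string with one character per hybrid cell line:
    "1" = marker retained, "0" = marker absent, "2" = not scored
    (hypothetical encoding chosen for this sketch).
    """
    # Keep only hybrids in which both markers were scored unambiguously.
    pairs = [(a, b) for a, b in zip(pattern_a, pattern_b)
             if a in "01" and b in "01"]
    total = len(pairs)
    if total == 0:
        raise ValueError("no hybrids scored for both markers")

    # Retention frequency of each marker across the scored hybrids.
    ret_a = sum(a == "1" for a, _ in pairs) / total
    ret_b = sum(b == "1" for _, b in pairs) / total

    # Hybrids in which exactly one of the two markers is retained.
    discordant = sum(a != b for a, b in pairs)

    # Expected number of discordant hybrids if the markers were unlinked.
    expected = total * (ret_a + ret_b - 2 * ret_a * ret_b)
    if expected == 0:
        raise ValueError("marker retained in all or none of the hybrids")

    # Breakage frequency, capped so that the logarithm stays finite.
    theta = min(discordant / expected, max_theta)

    # Map distance in centiRays.
    distance_cr = -100 * math.log(1 - theta)
    return theta, distance_cr


if __name__ == "__main__":
    marker_a = "1100110010"
    marker_b = "1100110110"  # differs from marker_a in a single hybrid
    theta, distance = two_point_distance(marker_a, marker_b)
    print(f"theta = {theta:.3f}, distance = {distance:.1f} cR")
```

Markers with identical retention patterns give a breakage frequency of zero, while markers that behave independently approach the cap, which is why two-point distances become unreliable for weakly linked pairs.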
This chapter provides an introduction to the theory and practice of genomic library construction in fosmid and BAC vectors. It contains a protocol for preparing agarose-embedded high-molecular-mass genomic DNA from materials that require grinding and homogenization to access the DNA. It describes the preparation of fosmid and BAC cloning vectors, partial digestion of the genomic DNA and preparative pulsed-field gel electrophoresis, and guides the reader step-by-step through the entire process of fosmid and BAC library construction.
The ability to clone long stretches of DNA has been an essential feature of the development of genome analysis technologies over the twenty years that culminated in the completion of a draft sequence of the human genome in 2001. Yeast artificial chromosomes (YACs) have played a central role in this development, mainly due to their capability for carrying inserts of exogenous DNA up to 2 Mb in size. Although many of the apparent attributes of YACs have been usurped by P1 artificial chromosomes (PACs) (Ioannou et al., 1994) and bacterial artificial chromosomes (BACs) (Shizuya et al., 1992), YACs generally offer the most comprehensive clone coverage of complex genomes, and hence are invaluable for generating highly complete clone-based physical maps. YACs are relatively difficult to purify, being linear and for practical purposes indistinguishable from host chromosomes. Because they are propagated in a eukaryotic host, however, they can carry particular eukaryotic sequences that would not be supported in a bacterial host. Consequently, although inserts can be unstable and chimaerism can be a problem, particularly in larger YACs, they are an important resource for the completion of genome sequencing projects.
Molecular cytogenetics is an evolving scientific field that combines the techniques of both cytogenetics and molecular genetics. FISH-based molecular cytogenetic techniques allow chromosomal rearrangements to be investigated at a higher resolution than standard G-banding analysis. FISH and related technologies are used to identify cryptic chromosomal rearrangements, to clarify complex karyotypes and to define the position of translocation breakpoints. These investigations help to refine specific translocation breakpoint regions allowing them to be characterised at the molecular level by PCR and sequencing. In this chapter I describe the techniques that may be employed to identify both the genes involved and the potential causative mechanisms underlying chromosomal rearrangements.
The majority of large genomes sequenced to date have made heavy use of a hierarchical mapping approach, and this is the strategy in use by the International Human Genome Sequencing Consortium. Ideally, to minimise redundancy and cost, physical clone maps of the genome are developed ahead of large-scale genomic sequencing. However, the necessity to produce the sequence and the technology now available to achieve high-throughput sequencing dictate the pace and the mapping strategy employed. In this chapter we discuss the resources and methods needed to generate physical clone maps of human chromosomes 6 and 9 at the Sanger Institute and the criteria for selecting clones to be sequenced. Large-insert bacterial clone maps were constructed by combined restriction enzyme fingerprinting and landmark content analysis and were the substrate for genomic sequencing of the chromosomes. In addition, the clones in the map are a lasting resource for future genomic analyses, which include chromosome structure, comparative genomic hybridisation, gene inactivation and other functional genetics.
FPC (FingerPrinted Contigs) builds contigs from marker data and fingerprinted clones. Contigs are ordered by framework markers. The initial versions of FPC mainly supported interactive assembly of maps, which would not scale up efficiently to whole genome maps. Subsequent versions of FPC added increased automation for assembling maps. In this chapter, we will explore how to efficiently build physical maps with FPC using the most recent features. We begin with a brief history of physical mapping as it relates to FPC, and then analyze the concepts and parameters incorporated by the software. Finally, we present a tutorial that guides you through the most useful features of FPC. The demo files used in the tutorial are available online.
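The idea behind fingerprint-based contig building can be sketched by counting the bands that two clones share within a size tolerance. The example below is a simplified illustration only: the function name, the band values and the default tolerance are hypothetical, and FPC's own overlap statistic and parameters are not reproduced here.

```python
def shared_bands(bands_a, bands_b, tolerance=7):
    """Count bands that match between two clone fingerprints.

    bands_a and bands_b are lists of band positions (arbitrary units);
    two bands are treated as the same fragment when their positions
    differ by no more than `tolerance`. Each band is used at most once.
    """
    a, b = sorted(bands_a), sorted(bands_b)
    i = j = matches = 0
    # Walk both sorted lists in parallel, pairing bands greedily.
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            matches += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # band in clone A has no partner; move on
        else:
            j += 1  # band in clone B has no partner; move on
    return matches


if __name__ == "__main__":
    clone_1 = [100, 250, 400, 610, 900]
    clone_2 = [103, 255, 500, 612, 1200]
    print(shared_bands(clone_1, clone_2))
```

A raw count of shared bands ignores how many bands each clone has, so real assembly software converts it into a probability that the match arose by chance before deciding whether two clones overlap.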
The genomes of most organisms, from the simplest unicellular organism to more complex species, consist of a variety of genomic landscapes. Each landscape has a unique profile of high-copy repeats, low-copy repeats, genic content, GC-richness and so forth. Particularly near regions of structurally important sequences, such as the centromere, the genomic landscape can become quite problematic. The primary factor contributing to the additional difficulty in studying these areas within the human genome is that clusters of genomic duplications and unusual repeat structures often lie in close proximity (1-2 Mb) to the centromeres. Consequently, sequence similarity-based methods of global genome assembly fail to assign the correct positions to duplicated sequences. Because of this, artificial overlaps form, significant warping of working draft sequences occurs, and numerous gaps appear in the assembly, reducing the overall quality and relevance of the assembly in these regions. These effects are further compounded by the absence of unique STSs within such regions and a general under-representation of such areas in clone-by-clone sequencing projects. Thus, the presence of large spans of duplicated sequence near centromeres (and telomeres) interferes with a generalized approach to genome analysis. To combat this problem, specialized computational and experimental approaches have been developed to accurately map and assemble these difficult, but biologically relevant, genomic landscapes. Herein we will explore recently developed techniques aimed at building sequence-ready maps of duplicated regions and solving the structure of pericentromeric regions.
The DNA sequence organization of human telomeres includes large stretches of highly similar duplicated and low-copy DNA adjacent to terminal telomere repeat sequences. This unusual sequence organization has led to significant complications with respect to mapping and sequencing. Our approach to solving these problems, described in this paper, has been to isolate each telomere region using a specialized yeast artificial chromosome (YAC) system that permits propagation of large telomere-terminal human DNA fragments as linear plasmids in yeast. Each YAC contains a terminal repeat tract, the entire subtelomeric repeat region, and the adjacent single-copy DNA region, physically linked on a single large DNA segment that has been purified from the rest of the human genome. From this starting material, the most distal single-copy segments of each chromosome arm can be identified, analyzed, and used to validate subtelomeric sequence structure. The particular repeat organization and DNA sequence of each subtelomeric region can then be deciphered without interference from duplicons derived from elsewhere in the genome. This basic approach has been used successfully for most human telomeres, and is applicable to all vertebrate and most eukaryotic genomes.
Shotgun sequencing is the strategy of choice for large-scale sequencing projects. In addition to comparing shotgun sequencing with alternative strategies, this chapter details the methodologies for producing high-quality random subclone libraries, template production and sequencing in use at the Wellcome Trust Sanger Institute.
Finishing is the process by which assembled shotgun DNA sequence data is manipulated and supplemented to produce a complete, high-quality reference sequence. This chapter highlights a number of techniques that can be used to carry out this process and their application to the finishing of both large and small genome sequencing projects. All the techniques described are being applied, and continue to be developed, at the Wellcome Trust Sanger Institute (WTSI).
Modern high-throughput DNA sequencing laboratories increasingly rely on software to automate the processing, analyzing, storing, and retrieving of sequence data. We outline the issues and tasks faced in such an environment and describe software solutions that have been developed to address them. Major applications used in the human sequencing project are emphasized.
As more mammalian genomes have been sequenced, attention has turned to sequence annotation. An annotated sequence provides a wealth of information about the organism that is not directly obvious from the sequence alone. It also acts as a standard, allowing investigators around the world to work on the same basic gene structures and to compare subsequent findings. In this chapter we show how to assemble finished sequence clones into a contiguous reference sequence and then subject this to an array of sequence analysis tools. By examining these analyses we show how to annotate exon/intron structures on the genomic sequence to define genes. In regions where there is insufficient evidence to draw a complete gene structure or where further evidence is required, we suggest methods to identify the necessary sequence from mRNA sources. Finally, we show how these data can be compiled into simple flat files of coordinates on a reference sequence and transferred between investigators.
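To make the idea of a flat file of coordinates concrete, the sketch below writes exon coordinates as tab-delimited lines in the nine-column GFF3 layout. The chapter does not say which format it uses, so the choice of GFF3, the function name and the example identifiers are all assumptions made for illustration.

```python
def exon_to_gff3(seqid, start, end, strand, gene_id, source="manual"):
    """Format one exon as a single GFF3 line (1-based, inclusive coordinates).

    The arguments are hypothetical example inputs: `seqid` names the
    reference sequence and `gene_id` the parent feature of the exon.
    """
    if start < 1 or start > end:
        raise ValueError("coordinates must satisfy 1 <= start <= end")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    columns = [
        seqid,               # reference sequence identifier
        source,              # who or what produced the annotation
        "exon",              # feature type
        str(start),
        str(end),
        ".",                 # score: not applicable
        strand,
        ".",                 # phase: only meaningful for CDS features
        f"Parent={gene_id}"  # attribute linking the exon to its parent
    ]
    return "\t".join(columns)


if __name__ == "__main__":
    exons = [(1200, 1350), (2010, 2190), (3400, 3625)]
    print("##gff-version 3")
    for start, end in exons:
        print(exon_to_gff3("chr6_contig1", start, end, "+", "gene0001"))
```

Because every record is a plain line of coordinates against a named reference sequence, files of this kind can be exchanged and compared between investigators without any shared software.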
Biological sequence databases represent indispensable tools for scientific discovery. A comprehensive database of all publicly available DNA sequences has been maintained through a longstanding collaboration among informatics groups in the United States, Europe, and Japan. Over the past decade, extraordinary gains have been made in the extensive sequencing of genomes and transcriptomes. Consequently, the databases have developed new procedures for bulk submission of new classes of data in both intermediate and finished forms. Additional databases have been established to track clone sequencing progress and to capture the original sequencing traces. More recent developments have focused on tools for browsing and analyzing whole genomes.