Microbial genomics

Genome-wide Comparative Alignment Tools

from Luo et al (2011) in Microbial Population Genetics

Genome sequence comparison has been an important method for understanding gene function and genome evolution since the early days of gene sequencing. Alignment of DNA sequences is the core process in comparative genomics. In recent years, an important new sequence-analysis task has emerged: comparing an entire genome with another. Several powerful alignment algorithms have been developed to align two or more sequences.

MUMmer
MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form. It can handle the thousands of contigs produced by a shotgun sequencing project and align them to another set of contigs or to a genome using the NUCmer program included with the system. If the species are too divergent for a DNA sequence alignment to detect similarity, the PROmer program can generate alignments based on the six-frame translations of both input sequences. The original MUMmer system, version 1.0, was described in a 1999 Nucleic Acids Research paper. Version 2.1 was described in a 2002 Nucleic Acids Research paper, and MUMmer 3.0 was described in a 2004 Genome Biology paper.
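To make the idea of a six-frame translation concrete, the following is a minimal Python sketch, not PROmer's actual code. It assumes the standard genetic code and plain A/C/G/T input; the function names (reverse_complement, translate, six_frame_translations) are illustrative only, and any codon containing another symbol is reported as X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

Aligning the translated frames of two genomes instead of their nucleotides lets similarity at the protein level be detected even when the DNA itself has diverged.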

BLAT
BLAT (the BLAST-Like Alignment Tool) is a sequence alignment tool that is similar in many ways to BLAST. Like BLAST, it rapidly scans for relatively short matches (hits) and extends these into high-scoring pairs (HSPs). However, BLAT differs from BLAST in several significant ways. Where BLAST builds an index of the query sequence and then scans linearly through the database, BLAT builds an index of the database and then scans linearly through the query sequence. Where BLAST triggers an extension when one or two hits occur in proximity to each other, BLAT can trigger extensions on any number of perfect or near-perfect hits. Where BLAST returns each area of homology between two sequences as a separate alignment, BLAT stitches them together into a larger alignment. Both the client/server and the stand-alone versions can perform comparisons at the nucleotide, protein, or translated nucleotide level.
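The "index the database, scan the query" strategy can be illustrated with a small Python sketch. This is a simplified illustration rather than BLAT's implementation: it assumes exact k-mer matches only, indexes non-overlapping k-mers of the database, and uses hypothetical function names (build_index, find_hits, hits_by_diagonal). Hits that share a diagonal (database position minus query position) are candidates for being stitched into one larger alignment.

```python
from collections import defaultdict


def build_index(database, k):
    """Index the non-overlapping k-mers of the database by start position."""
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):
        index[database[pos:pos + k]].append(pos)
    return index


def find_hits(index, query, k):
    """Scan every overlapping k-mer of the query against the database index."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for dpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, dpos))
    return hits


def hits_by_diagonal(hits):
    """Group (query, database) hits by diagonal, i.e. dpos - qpos."""
    diagonals = defaultdict(list)
    for qpos, dpos in hits:
        diagonals[dpos - qpos].append((qpos, dpos))
    return dict(diagonals)
```

Because the index is built once over the database and reused for every query, the cost of each new query depends mainly on the query length rather than on the database size.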

MegaBLAST
MegaBLAST uses the greedy algorithm of Zhang et al. for nucleotide sequence alignment search and concatenates many queries to save time spent scanning the database. The program is optimized for aligning sequences that differ only slightly, for example as a result of sequencing or other similar "errors". It is up to 10 times faster than more common sequence similarity search and alignment programs and can therefore be used to compare two large sets of sequences against each other quickly.

Suggested reading:
1. Microbial Population Genetics
2. Genomics books

Genome comparison visualization tool

Comparative analysis is an increasingly important step in the annotation and analysis of genome sequence data, allowing phenotypic differences between strains and species to be correlated with changes in the chromosomes. For example, comparative sequence analysis has enabled the identification of cis-regulatory regions and the location of coding exons by purely computational means. Visual front-ends are needed to make viewing alignments intuitive and to facilitate the discovery of conserved sequences in functionally significant regions. Below we describe a few visualization tools for genome comparisons.

PipMaker and MultiPipMaker
PipMaker is a World-Wide Web site for comparing two long DNA sequences to identify conserved segments and for producing informative, high-resolution displays of the resulting alignments. One display is a percent identity plot (pip), which shows both the position in one sequence and the degree of similarity for each aligning segment between the two sequences in a compact and easily understandable form. The web site also provides a plot of the locations of those segments in both species. PipMaker is appropriate for comparing genomic sequences from any two related species, although the types of information that can be inferred (e.g., protein-coding regions and cis-regulatory elements) depend on the level of conservation and the time and divergence rate since the separation of the species. PipMaker supports analysis of unfinished or working draft sequences by permitting one of the two sequences to be in un-oriented and unordered contigs. Similarly, MultiPipMaker allows the user to visualize relationships among more than two sequences. All pairwise alignments with the first sequence are computed and then returned as interleaved pips. Moreover, MultiPipMaker can be requested to compute a true multiple alignment of the input sequences and return a nucleotide-level view of the results.

ACT
ACT (Artemis Comparison Tool) is a DNA sequence comparison viewer based on Artemis, an annotation tool. It displays pairwise comparisons such as parsed BLAST alignments. Like other Artemis tools, ACT is written in Java and runs on Unix, GNU/Linux, Macintosh and MS Windows systems. It can read complete EMBL and GENBANK entries, or sequences in FASTA or raw sequence format. Additional sequence features can be read from files in EMBL, GENBANK or GFF format. The sequence comparison displayed by ACT is usually the result of running a blastn or tblastx search.

VISTA
VISTA (Visualization and Alignment Software for Comparative Genomics) is a visualization tool that displays GLASS alignments. It depicts long alignments of DNA sequences from two or more organisms, together with various types of annotation, in a clear and easily interpretable format. It was originally developed to locate conserved sequences in syntenic regions of different genomes. The key features of the VISTA program are the following:
1. Clean graphical output, allowing for easy identification of sequence similarities and differences.
2. Easily configurable, enabling the visualization of alignments of up to several million bases at different levels of resolution.
3. Displays alignments of draft sequences.
4. Displays sequence annotations such as repeats, coding exons, UTRs and more.
The VISTA plot is based on moving a user-specified window over the entire alignment and calculating the percent identity over the window at each base pair.
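The sliding-window calculation behind this kind of plot can be sketched in a few lines of Python. This is an illustrative sketch, not VISTA's code: it assumes two rows of a pairwise alignment given as equal-length strings with "-" for gaps, a window centered on each column and truncated at the ends of the alignment, and gap columns counted as mismatches. The function name window_identity is hypothetical.

```python
def window_identity(aln_a, aln_b, window):
    """Percent identity in a window centered on each alignment column.

    aln_a and aln_b are the two rows of a pairwise alignment ('-' = gap).
    Columns containing a gap are counted as mismatches (an assumption).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    n = len(aln_a)

    # Prefix sums of matching columns allow each window to be scored in O(1).
    prefix = [0]
    for a, b in zip(aln_a, aln_b):
        prefix.append(prefix[-1] + (1 if a == b and a != "-" else 0))

    half = window // 2
    identities = []
    for i in range(n):
        lo = max(0, i - half)          # window start, clipped at the left end
        hi = min(n, i + half + 1)      # window end (exclusive), clipped at the right
        identities.append(100.0 * (prefix[hi] - prefix[lo]) / (hi - lo))
    return identities
```

Plotting the returned values against the column index gives a curve in which conserved regions appear as peaks; a larger window smooths the curve at the cost of resolution.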

SynPlot
SynPlot is an application program, written in Perl, for viewing global alignments of syntenic regions of genomic DNA sequence; it displays DIALIGN and GLASS alignments. The alignment is used to calculate the percentage identity within a sliding window along the alignment, the width of which can be specified by the user. This information is used to draw a picture of the alignment in PostScript format. The sequences are rendered as lines interrupted by spaces corresponding to the gaps introduced by the alignment, with a plot of the percentage identity underneath. Features can also be drawn on the sequence lines. For this, the program uses a GFF format file output by ACeDB from the annotated genomic sequence, and a configuration file that specifies the color, height and order in which the rectangles representing the features are drawn.

Examples of Comparative Microbial Genomics

CFGP
CFGP (Comparative Fungal Genomics Platform) is a web-based multifunctional informatics workbench. The CFGP comprises three layers: the basal layer, the middleware and the user interface. The data warehouse in the basal layer contains standardized genome sequences of 65 fungal species. The middleware processes queries via six analysis tools: BLAST, ClustalW, InterProScan, SignalP 3.0, PSORT II and a newly developed tool named BLASTMatrix. BLASTMatrix permits the identification and visualization of genes homologous to a query across multiple species. The Data-driven User Interface (DUI) of the CFGP was built on a new concept of pre-collecting data and post-executing analysis, instead of the 'fill-in-the-form-and-press-SUBMIT' user interfaces used by most bioinformatics sites. A tool termed Favorite, which supports the management of encapsulated sequence data and provides a personalized data repository to users, is another novel feature of the DUI.

MicroScope
MicroScope is a microbial genome annotation and comparative analysis platform developed by the French National Sequencing Center located at Genoscope. It is made up of three major components: (i) a set of syntactic and functional annotation tools, (ii) a relational database, the Prokaryotic Genome DataBase (PkGDB), which is linked to metabolic pathway databases (MicroCyc) created using the Pathway Tools software, and (iii) a graphical interface, the Magnifying Genome (MaGe), which supports expert annotation that combines synteny results with metabolic network predictions.

Microbial Genome Resources for Comparative Genomics

A variety of specialized data resources manage the results of microbial genome data processing and interpretation at different stages. These stages correspond to different levels of microbial genome characterization. Draft and finished microbial genome data are continuously incorporated into various microbial genome data resources. Below are brief descriptions of the main microbial genome data resources.

GOLD Genomes Online Database
GOLD (Genomes Online Database) is a World Wide Web resource for comprehensive access to information regarding complete and ongoing genome projects, as well as metagenomes and metadata, around the world. GOLD was created in 1997 with the aim to (i) monitor all genome sequencing projects from instigation to completion and (ii) provide the community with a centralized database integrating diverse information related to those projects in the form of hypertext links to disparate web-based resources. Although several different types of statistics, related to each of the data fields, can be derived by the user at any point using the search engine, the database also provides readily available graphical overviews for specific data types.

Complete and ongoing projects and their associated metadata can be accessed in GOLD through pre-computed lists and a search page. As of March 2008, GOLD contained information on more than 3613 sequencing projects, of which 731 had been completed and their sequence data deposited in the public databases (GOLD V2.0). GOLD continues to expand with the goal of providing metadata related to the projects and the organisms or environments in line with the 'Minimum Information about a Genome Sequence' (MIGS) guideline.

ASAP A systematic annotation package
ASAP (a systematic annotation package for community analysis of genomes) is a relational database with a web interface, developed to store, update and distribute genome sequence data and their functional characterizations. ASAP facilitates ongoing community annotation of genomes and tracking of information as genome projects move from preliminary data collection through post-sequencing functional analysis. The database includes multiple genome sequences at various stages of analysis, corresponding experimental data and access to collections of related genome resources. Its development was motivated by the need to involve a greater community of researchers more directly, with their collective expertise, in keeping the genome annotation current and to provide a synergistic link between up-to-date annotation and functional genomic data. ASAP supports three levels of users: public viewers, annotators and curators. Public viewers can readily browse updated annotation information, such as that for Escherichia coli K-12 strain MG1655, genome-wide transcript profiles from more than 50 microarray experiments and an extensive collection of mutant strains and associated phenotypic data. Annotators worldwide are currently using ASAP to participate in a community annotation project for the Erwinia chrysanthemi strain 3937 genome. Curation of the E. chrysanthemi genome annotation, as well as those of additional published enterobacterial genomes, is underway and will be publicly accessible in the near future.

CMR Comprehensive Microbial Resource
CMR (Comprehensive Microbial Resource) is a tool that allows researchers to access all the bacterial genome sequences completed to date. It contains robust annotation of all completed microbial genomes and allows for a wide variety of data retrievals. For each genome not sequenced at The Institute for Genomic Research (TIGR), two kinds of annotation are displayed: the primary annotation taken from the genome sequencing center and the TIGR annotation generated by an automated annotation process at TIGR. CMR thus allows access to the information on all of the bacterial genomes or any subset of them. Retrievals can be based on protein properties such as molecular role assignments and taxonomy. The CMR also has special web-based tools to allow data mining using pre-run homology searches, whole-genome dot-plots, batch downloading and traversal across genomes using a variety of data types.

IMG Integrated Microbial Genomes
The IMG (Integrated Microbial Genomes) system serves as a community resource for comparative analysis and annotation of all publicly available genomes from the three domains of life, in a uniquely integrated context. IMG provides tools and viewers for analyzing and annotating genomes, genes and functions, individually or in a comparative context. An increasing number of eukaryotic genomes, viruses (including phages) and plasmids have also been added to IMG in order to increase its genomic context for comparative analysis. IMG's analytical tools have been gradually generalized and enhanced in terms of their usability, analysis flow and performance. These tools allow users to focus on a subset of genes, genomes and functions of interest, and conduct analysis using summary tables, graphical viewers and various methods for comparing genes, pathways and functions across genomes.

SEED Comparative genomics research
The SEED is an open-source annotation environment for comparative genomics, developed by the Fellowship for Interpretation of Genomes (FIG) together with collaborating institutions. Annotation in the SEED is organized around subsystems, sets of functional roles that together implement a specific biological process or structural complex. Rather than annotating one genome at a time, expert curators annotate a single subsystem across all available genomes, and the resulting curated assignments are then used to project annotations onto newly sequenced genomes.

Phylogenetics and Phylogenomics for Microbial Genomes

Microorganisms, prokaryotes in particular, often lack morphological and behavioral characters amenable to phylogenetic analysis. This lack made gene sequence information the most prevalent source of data for phylogenetic analysis in the pre-genomic era. Molecular phylogenetics based on single genes, in particular the small-subunit rRNA (SSU rRNA), has laid the foundation for a modern classification system, conceptually represented by the 'universal tree of life'. However, phylogenetic trees based on single genes or gene families may show conflicting results due to a variety of problems, such as mutational saturation of the single genes and horizontal gene transfer. Consequently, although SSU rRNA gene sequences continue to be considered a molecular criterion for species delineation, it is anticipated that much additional taxonomic information can be extracted from complete genome sequences.

Now that large-scale genome-sequencing projects are sampling many organismal lineages, it is becoming possible to compare large data sets of DNA or protein sequences to study speciation and evolution. The steady increase in the number of completely sequenced prokaryotic genomes has created a boom for bioinformatics. With more than 700 prokaryotic genomes completely sequenced, there has been increasing interest in using various characters from whole genomes to study prokaryotes. This is giving birth to a new field of research: phylogenomics. Phylogenomics uses entire genomes to infer a species tree and has become the de facto standard for reconstructing reliable phylogenies.

One major branch of phylogenomics involves the use of these data to reconstruct the evolutionary history of organisms. Access to large amounts of genomic data could potentially alleviate problems associated with single-gene phylogenetics, because a large number of characters can now be used for phylogenetic analysis to avoid biases. With this increase, the emphasis of phylogenetic inference is shifting from the search for informative characters to the development of better reconstruction methods using genomic data. Existing models used in tree-building algorithms only partially take into account molecular evolutionary processes, and phylogenomic inference will benefit from an increased understanding of these mechanisms. Interestingly, phylogenomics also provides the opportunity to use new 'morphological-like' characters that are based on genome structure, such as rare genomic changes (RGCs). The integration of genomic data into phylogenetics is still at an early stage. Given the breadth of organismal diversity, gene-scale phylogenetics is still an invaluable asset to the pursuit of the Tree of Life. Comparative genomics, with its potential to vastly increase both the amount and type of molecular data available for a small but critical fraction of biodiversity, is bound to play an increasingly important role in efforts to assemble a robust picture of the Tree of Life.

BPhyOG A database for overlapping genes in prokaryotic genomes
BPhyOG (Bacterial Phylogeny based on Overlapping Genes) is an online interactive server for reconstructing the phylogenies of completely sequenced bacterial genomes on the basis of their shared overlapping genes (OGs). It provides two tree-reconstruction methods: Neighbor Joining (NJ) and the Unweighted Pair-Group Method using Arithmetic averages (UPGMA). Users can apply the desired method to generate phylogenetic trees, which are based on an evolutionary distance matrix for the selected genomes. The distance between two genomes is defined by the normalized number of their shared OG pairs. BPhyOG also allows users to browse the OGs that were used to infer the phylogenetic relationships. It provides detailed annotation for each OG pair and the features of the component genes through hyperlinks. Users can also retrieve each of the homologous OG pairs that have been determined among the genomes. BPhyOG is a useful tool for analyzing the tree of life and overlapping genes from a genomic standpoint. It currently includes 177 completely sequenced bacterial genomes containing 79,855 OG pairs, with their annotation and homologous OG pairs comprehensively integrated. The reliability of the phylogenies and the completeness of the annotations make BPhyOG a comprehensive and powerful web server for genomic and genetic studies.
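A distance-based tree of this kind can be sketched in Python as follows. This is a hedged illustration, not BPhyOG's implementation: the text does not give the normalization, so the sketch assumes a distance of one minus the number of shared OG pairs divided by the smaller of the two genomes' OG-pair counts. The names og_distance and upgma are hypothetical, only UPGMA is shown (not NJ), and the tree is returned as nested tuples without branch lengths.

```python
from itertools import combinations


def og_distance(og_a, og_b):
    """Distance from shared OG pairs; the normalization is an assumption."""
    if not og_a or not og_b:
        return 1.0
    shared = len(og_a & og_b)
    return 1.0 - shared / min(len(og_a), len(og_b))


def upgma(dist_matrix):
    """Cluster genomes by UPGMA.

    dist_matrix maps each unordered pair of genome names (a, b) to a distance.
    Returns the tree topology as nested tuples.
    """
    d = {frozenset(pair): value for pair, value in dist_matrix.items()}
    sizes = {}  # cluster -> number of genomes it contains
    for pair in dist_matrix:
        for name in pair:
            sizes[name] = 1

    while len(sizes) > 1:
        # Merge the two closest clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # The distance to the merged cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                na * d[frozenset((a, other))] + nb * d[frozenset((b, other))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))
```

For example, with each genome represented as a set of OG-pair identifiers, the distance matrix can be built with {(a, b): og_distance(genomes[a], genomes[b]) for a, b in combinations(genomes, 2)} and passed to upgma.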

Comparative Genomics of Metabolic Pathways in Microbial Genomes

Understanding the regulatory mechanisms should allow the examination of engineering pathways with pre-determined expression patterns (i.e. expression is activated by a given compound or in a specific environmental or physiological condition). Metabolic pathways have evolved to execute their function efficiently, while tolerating perturbations, such as changes in environmental parameters or in the physiological status of the cell. Below we describe some of the databases and programs for integrated analyses of metabolic pathways.

KEGG
KEGG (Kyoto Encyclopedia of Genes and Genomes) is a database of biological systems that integrates genomic, chemical and systemic functional information. KEGG provides a reference knowledge base for linking genomes to life through the process of PATHWAY mapping. The PATHWAY database contains information about conserved sub-pathways (or pathway motifs), which are often encoded by positionally coupled genes on the chromosome and which are especially useful in predicting gene functions. The genomic information is stored in the GENES database, which is a collection of gene catalogs for all the completely sequenced genomes and some partial genomes with up-to-date annotation of gene functions. A third database in KEGG is LIGAND that includes information about chemical compounds, enzyme molecules and enzymatic reactions. In addition, KEGG provides a reference knowledge base for linking genomes to the environment, such as for the analysis of drug-target relationships, through the process of BRITE mapping. KEGG BRITE is an ontology database representing functional hierarchies of various biological objects, including molecules, cells, organisms, diseases and drugs, as well as relationships among them. Additionally, the KEGG resource is being expanded to suit the needs for practical applications. KEGG DRUG contains all approved drugs in the US and Japan, and KEGG DISEASE is a new database linking disease genes, pathways, drugs and diagnostic markers.

KEGG provides Java graphics tools for browsing genome maps, comparing two genome maps and manipulating expression maps, as well as computational tools for sequence comparison, graph comparison and path computation.

BioCyc
The BioCyc database collection is a set of 160 pathway/genome databases (PGDBs) for most eukaryotic and prokaryotic species whose genomes have been completely sequenced to date. Each PGDB offers a wealth of genomic and metabolic information on certain microorganisms, including P. aeruginosa and S. cerevisiae. Each database provides information on a microorganism's annotated genome, on the biochemical reaction(s) that each gene product catalyses and on the organism's metabolic pathways, predicted from its annotated genome by a program called PathoLogic. The information from each database is comprehensive and complex. In addition, each bacterial PGDB includes predicted operons for the corresponding species. The BioCyc collection provides a unique resource for computational systems biology, namely global and comparative analyses of genomes and metabolic networks, and a supplement to the BioCyc resource of curated PGDBs. The Omics viewer available through the BioCyc website allows scientists to visualize combinations of gene expression, proteomics and metabolomics data on the metabolic maps of these organisms.

MetaCyc
MetaCyc is a universal database of metabolic pathways and enzymes from all domains of life. The pathways in MetaCyc are curated from the primary scientific literature, and the small-molecule metabolic pathways are experimentally determined. Each reaction in a MetaCyc pathway is annotated with one or more well-characterized enzymes. Because MetaCyc contains only experimentally elucidated knowledge, it provides a uniquely high-quality resource for metabolic pathways and enzymes. MetaCyc stores pathways involved in both primary metabolism and secondary metabolism. MetaCyc also stores compounds, proteins, protein complexes and genes associated with these pathways. It is extensively linked to other biological databases containing protein and nucleic-acid sequence data, bibliographic data and protein structures. MetaCyc also contains objects for the genes that encode many enzymes within the database. While it does not contain primary sequence data, MetaCyc does contain links to external sequence databases.

EcoCyc
EcoCyc is a bioinformatics database that describes the genome and the biochemical machinery of E. coli K-12 MG1655. The long-term goal of this project is to describe the molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists, and for biologists who work with related microorganisms. EcoCyc contains the complete genome sequence of E. coli, and describes the nucleotide position and function of every E. coli gene. The annotation of the Escherichia coli K-12 genome in the EcoCyc database is one of the most accurate, complete and multidimensional genome annotations. EcoCyc information was derived from 15 000 publications. The database contains extensive descriptions of E. coli cellular networks, describing its metabolic, transport and transcriptional regulatory processes. Database queries to EcoCyc survey the global properties of E. coli cellular networks and illuminate the extent of information gaps for E. coli, such as dead-end metabolites. EcoCyc provides a genome browser with novel properties, and a novel interactive display of transcriptional regulatory networks.

Comparative Genomics Microarray Analyses Technology

The advent of DNA microarray technology has greatly expanded our ability to monitor changes in the abundance of transcripts. Such a development has been a milestone in several areas of microbiology. In clinical microbiology, microarrays are used for microorganism detection and identification and gene-expression analysis. DNA microarrays have allowed us to monitor the effects of pathogens on host-cell gene expression in a much greater depth and on a significantly broader scale than previous single gene studies. The results generated by these studies are complex, and few systematic studies have been carried out to compare results among studies.

Comparative transcriptomics: whole-genome mRNA transcript profiling using microarrays

Whole-genome microarrays from fully sequenced genomes are a powerful platform for identifying differences in gene content between organisms and for studying gene expression dynamics. The generation of messenger RNA expression profiles is referred to as transcriptomics, as these are based on the process of transcription. Given the inhospitable in vivo and the varied ex vivo environments encountered by most microbial pathogens, transcriptome analysis holds particular promise for identifying and determining the functions of differentially regulated, virulence associated genes. The basic principle of this technique involves extracting the mRNA expressed under a range of environmental conditions and hybridizing these sequences to a high-density gridded microarray of the DNA content of an organism. Such high-throughput analysis allows massive parallel gene expression and gene discovery studies to be undertaken. DNA microarray analysis will complement other technologies such as in vivo expression technology and differential fluorescence analysis to identify and investigate which bacterial genes are differentially expressed in the host.

The application of DNA microarrays to microbial pathogens
The study of the complete set of genes expressed and modified in a cell is an important and rapidly evolving discipline that is readily applicable to microbial pathogens. For example, strains of Staphylococcus aureus resistant to the antibiotic vancomycin present a potentially serious public-health problem. In this case, the Gaasterland group described the development of a multi-strain S. aureus microarray. Pairwise comparisons of the available genomes of strains of S. aureus have revealed considerable variation in gene content across the epidemiological landscape. The group identified changes in protein-coding potential that correlate with antibiotic resistance by measuring differences in gene expression in vancomycin-sensitive and vancomycin-resistant pairs of S. aureus isolates. Philip Butcher's group used microarrays to help understand the complex pathophysiology of Mycobacterium tuberculosis infection. Besides discussing some methodological aspects of microarray work, they focused on the use of M. tuberculosis microarrays to investigate the intracellular lifestyle of this organism and its interaction with host macrophages. In the future, it would be exciting to integrate results from in vitro work like this with results from in vivo microarray work on mammalian hosts to provide a whole-genomic view of host-pathogen interactions.

Analytic Tools in Comparative Genomics

The rapid accumulation of bacterial genome sequences has opened up a new field of research, that of comparative genomics. Interpretation of raw DNA sequence data involves the identification and annotation of genes, proteins, and regulatory and/or metabolic pathways. There is therefore a natural shift towards the creation of tools for viewing and manipulating data in a comparative genomics context. In addition, genome annotations need to be reprocessed on a regular basis to take into account the newly characterized functions of genes. Furthermore, large-scale functional analyses generate additional data that contribute to the interpretation of genomic data. These considerations are driving the research community to think about how to manage public collections of genomes in novel ways. One role of bioinformatics is to assist biologists in the extraction of biological knowledge from this flood of data. Consequently, software designed for the analysis and functional annotation of a single genome has evolved into tools for comparative genomics that detect relatively conserved information across many genomes simultaneously. Here we introduce several popular tools for bacterial genome annotation and comparative genomics.
