Administrative Appointments

  • Faculty Co-Director, Medicine Teaching and Mentoring Academy (2016 - Present)

Honors & Awards

  • John Buckley Entrance Scholarship for Science, Manchester University (1988-1991)
  • Prize Studentship, The Wellcome Trust (1991-1994)
  • Cold Spring Harbor Fellowship, Cold Spring Harbor Laboratory (1996-1997)
  • Army Breast Cancer Research Fellowship, Department of Defense (1997-1998)

Professional Education

  • B.Sc., Manchester University, Genetics (1991)
  • Ph.D., Manchester University, Molecular Biology (1994)

Current Research and Scholarly Interests

Adaptive Evolution and the Fitness Landscape

When yeast are evolved under various selective pressures in a chemostat, mutations that arise and confer an adaptive advantage expand within the population. We pioneered the use of high-throughput sequencing to identify such mutations, to follow their dynamics within populations, and to characterize the interactions between them (such as epistasis). We have also developed a DNA barcode-based lineage tracking system to determine the distribution of fitness effects (DFE) for newly arising beneficial mutations. In addition, we have characterized what we call the genotype-fitness map for beneficial mutations and investigated why these mutations increase fitness. We are also interested in how beneficial mutations trade off between different traits, and how those trade-offs constrain adaptive evolution.

All Publications

  • Changes in the distribution of fitness effects and adaptive mutational spectra following a single first step towards adaptation. Nature communications Aggeli, D., Li, Y., Sherlock, G. 2021; 12 (1): 5193


    Historical contingency and diminishing returns epistasis have been typically studied for relatively divergent genotypes and/or over long evolutionary timescales. Here, we use Saccharomyces cerevisiae to study the extent of diminishing returns and the changes in the adaptive mutational spectra following a single first adaptive mutational step. We further evolve three clones that arose under identical conditions from a common ancestor. We follow their evolutionary dynamics by lineage tracking and determine adaptive outcomes using fitness assays and whole genome sequencing. We find that diminishing returns manifests as smaller fitness gains during the 2nd step of adaptation compared to the 1st step, mainly due to a compressed distribution of fitness effects. We also find that the beneficial mutational spectra for the 2nd adaptive step are contingent on the 1st step, as we see both shared and diverging adaptive strategies. Finally, we find that adaptive loss-of-function mutations, such as nonsense and frameshift mutations, are less common in the second step of adaptation than in the first step.

    View details for DOI 10.1038/s41467-021-25440-7

    View details for PubMedID 34465770

  • Single nucleotide mapping of trait space reveals Pareto fronts that constrain adaptation. Nature ecology & evolution Li, Y., Petrov, D. A., Sherlock, G. 2019


    Trade-offs constrain the improvement of performance of multiple traits simultaneously. Such trade-offs define Pareto fronts, which represent a set of optimal individuals that cannot be improved in any one trait without reducing performance in another. Surprisingly, experimental evolution often yields genotypes with improved performance in all measured traits, perhaps indicating an absence of trade-offs at least in the short term. Here we densely sample adaptive mutations in Saccharomyces cerevisiae to ask whether first-step adaptive mutations result in trade-offs during the growth cycle. We isolated thousands of adaptive clones evolved under carefully chosen conditions and quantified their performances in each part of the growth cycle. We too find that some first-step adaptive mutations can improve all traits to a modest extent. However, our dense sampling allowed us to identify trade-offs and establish the existence of Pareto fronts between fermentation and respiration, and between respiration and stationary phases. Moreover, we establish that no single mutation in the ancestral genome can circumvent the detected trade-offs. Finally, we sequenced hundreds of these adaptive clones, revealing new targets of adaptation and defining the genetic basis of the identified trade-offs.

    View details for DOI 10.1038/s41559-019-0993-0

    View details for PubMedID 31611676

  • The dynamics of adaptive genetic diversity during the early stages of clonal evolution. Nature ecology & evolution Blundell, J. R., Schwartz, K., Francois, D., Fisher, D. S., Sherlock, G., Levy, S. F. 2018


    The dynamics of genetic diversity in large clonally evolving cell populations are poorly understood, despite having implications for the treatment of cancer and microbial infections. Here, we combine barcode lineage tracking, sequencing of adaptive clones and mathematical modelling of mutational dynamics to understand adaptive diversity changes during experimental evolution of Saccharomyces cerevisiae under nitrogen and carbon limitation. We find that, despite differences in beneficial mutational mechanisms and fitness effects, early adaptive genetic diversity increases predictably, driven by the expansion of many single-mutant lineages. However, a crash in adaptive diversity follows, caused by highly fit double-mutant 'jackpot' clones that are fed from exponentially growing single mutants, a process closely related to the classic Luria-Delbruck experiment. The diversity crash is likely to be a general feature of asexual evolution with clonal interference; however, both its timing and magnitude are stochastic and depend on the population size, the distribution of beneficial fitness effects and patterns of epistasis.

    View details for PubMedID 30598529

  • Hidden Complexity of Yeast Adaptation under Simple Evolutionary Conditions. Current biology Li, Y., Venkataram, S., Agarwala, A., Dunn, B., Petrov, D. A., Sherlock, G., Fisher, D. S. 2018; 28 (4): 515-+


    Few studies have "quantitatively" probed how adaptive mutations result in increased fitness. Even in simple microbial evolution experiments, with full knowledge of the underlying mutations and specific growth conditions, it is challenging to determine where within a growth-saturation cycle those fitness gains occur. A common implicit assumption is that most benefits derive from an increased exponential growth rate. Here, we instead show that, in batch serial transfer experiments, adaptive mutants' fitness gains can be dominated by benefits that are accrued in one growth cycle, but not realized until the next growth cycle. For thousands of evolved clones (most with only a single mutation), we systematically varied the lengths of fermentation, respiration, and stationary phases to assess how their fitness, as measured by barcode sequencing, depends on these phases of the growth-saturation-dilution cycles. These data revealed that, whereas all adaptive lineages gained similar and modest benefits from fermentation, most of the benefits for the highest fitness mutants came instead from the time spent in respiration. From monoculture and high-resolution pairwise fitness competition experiments for a dozen of these clones, we determined that the benefits "accrued" during respiration are only largely "realized" later as a shorter duration of lag phase in the following growth cycle. These results reveal hidden complexities of the adaptive process even under ostensibly simple evolutionary conditions, in which fitness gains can accrue during time spent in a growth phase with little cell division, and reveal that the memory of those gains can be realized in the subsequent growth cycle.

    View details for PubMedID 29429618

    View details for PubMedCentralID PMC5823527

  • Development of a Comprehensive Genotype-to-Fitness Map of Adaptation-Driving Mutations in Yeast. Cell Venkataram, S., Dunn, B., Li, Y., Agarwala, A., Chang, J., Ebel, E. R., Geiler-Samerotte, K., Hérissant, L., Blundell, J. R., Levy, S. F., Fisher, D. S., Sherlock, G., Petrov, D. A. 2016; 166 (6): 1585-1596 e22


    Adaptive evolution plays a large role in generating the phenotypic diversity observed in nature, yet current methods are impractical for characterizing the molecular basis and fitness effects of large numbers of individual adaptive mutations. Here, we used a DNA barcoding approach to generate the genotype-to-fitness map for adaptation-driving mutations from a Saccharomyces cerevisiae population experimentally evolved by serial transfer under limiting glucose. We isolated and measured the fitness of thousands of independent adaptive clones and sequenced the genomes of hundreds of clones. We found only two major classes of adaptive mutations: self-diploidization and mutations in the nutrient-responsive Ras/PKA and TOR/Sch9 pathways. Our large sample size and precision of measurement allowed us to determine that there are significant differences in fitness between mutations in different genes, between different paralogs, and even between different classes of mutations within the same gene.

    View details for DOI 10.1016/j.cell.2016.08.002

    View details for PubMedID 27594428

    View details for PubMedCentralID PMC5070919

  • Quantitative evolutionary dynamics using high-resolution lineage tracking. Nature Levy, S. F., Blundell, J. R., Venkataram, S., Petrov, D. A., Fisher, D. S., Sherlock, G. 2015; 519 (7542): 181-186


    Evolution of large asexual cell populations underlies ∼30% of deaths worldwide, including those caused by bacteria, fungi, parasites, and cancer. However, the dynamics underlying these evolutionary processes remain poorly understood because they involve many competing beneficial lineages, most of which never rise above extremely low frequencies in the population. To observe these normally hidden evolutionary dynamics, we constructed a sequencing-based ultra high-resolution lineage tracking system in Saccharomyces cerevisiae that allowed us to monitor the relative frequencies of ∼500,000 lineages simultaneously. In contrast to some expectations, we found that the spectrum of fitness effects of beneficial mutations is neither exponential nor monotonic. Early adaptation is a predictable consequence of this spectrum and is strikingly reproducible, but the initial small-effect mutations are soon outcompeted by rarer large-effect mutations that result in variability between replicates. These results suggest that early evolutionary dynamics may be deterministic for a period of time before stochastic effects become important.

    View details for DOI 10.1038/nature14279

    View details for PubMedID 25731169

  • Arrayed in vivo barcoding for multiplexed sequence verification of plasmid DNA and demultiplexing of pooled libraries. Nucleic acids research Li, W., Miller, D., Liu, X., Tosi, L., Chkaiban, L., Mei, H., Hung, P. H., Parekkadan, B., Sherlock, G., Levy, S. F. 2024


    Sequence verification of plasmid DNA is critical for many cloning and molecular biology workflows. To leverage high-throughput sequencing, several methods have been developed that add a unique DNA barcode to individual samples prior to pooling and sequencing. However, these methods require an individual plasmid extraction and/or in vitro barcoding reaction for each sample processed, limiting throughput and adding cost. Here, we develop an arrayed in vivo plasmid barcoding platform that enables pooled plasmid extraction and library preparation for Oxford Nanopore sequencing. This method has a high accuracy and recovery rate, and greatly increases throughput and reduces cost relative to other plasmid barcoding methods or Sanger sequencing. We use in vivo barcoding to sequence verify >45 000 plasmids and show that the method can be used to transform error-containing dispersed plasmid pools into sequence-perfect arrays or well-balanced pools. In vivo barcoding does not require any specialized equipment beyond a low-overhead Oxford Nanopore sequencer, enabling most labs to flexibly process hundreds to thousands of plasmids in parallel.

    View details for DOI 10.1093/nar/gkae332

    View details for PubMedID 38709890

  • Analysis and culturing of the prototypic crAssphage reveals a phage-plasmid lifestyle. bioRxiv : the preprint server for biology Schmidtke, D. T., Hickey, A. S., Liachko, I., Sherlock, G., Bhatt, A. S. 2024


    The prototypic crAssphage (Carjivirus communis) is one of the most abundant, prevalent, and persistent gut bacteriophages, yet it remains uncultured and its lifestyle uncharacterized. For the last decade, crAssphage has escaped plaque-dependent culturing efforts, leading us to investigate alternative lifestyles that might explain its widespread success. Through genomic analyses and culturing, we find that crAssphage uses a phage-plasmid lifestyle to persist extrachromosomally. Plasmid-related genes are more highly expressed than those implicated in phage maintenance. Leveraging this finding, we use a plaque-free culturing approach to measure crAssphage replication in culture with Phocaeicola vulgatus, Phocaeicola dorei, and Bacteroides stercoris, revealing a broad host range. We demonstrate that crAssphage persists with its hosts in culture without causing major cell lysis events or integrating into host chromosomes. The ability to switch between phage and plasmid lifestyles within a wide range of hosts contributes to the prolific nature of crAssphage in the human gut microbiome.

    View details for DOI 10.1101/2024.03.20.585998

    View details for PubMedID 38562748

  • Spindle architecture constrains karyotype in budding yeast. bioRxiv : the preprint server for biology Helsen, J., Reza, M. H., Sherlock, G., Dey, G. 2023


    The eukaryotic cell division machinery must rapidly and reproducibly duplicate and partition the cell's chromosomes in a carefully coordinated process. However, chromosome number varies dramatically between genomes, even on short evolutionary timescales. We sought to understand how the mitotic machinery senses and responds to karyotypic changes by using a set of budding yeast strains in which the native chromosomes have been successively fused. Using a combination of cell biological profiling, genetic engineering, and experimental evolution, we show that chromosome fusions are well tolerated up until a critical point. However, with fewer than five centromeres, outward forces in the metaphase spindle cannot be countered by kinetochore-microtubule attachments, triggering mitotic defects. Our findings demonstrate that spindle architecture is a constraining factor for karyotype evolution.

    View details for DOI 10.1101/2023.10.25.563899

    View details for PubMedID 37961714

    View details for PubMedCentralID PMC10634821

  • Evolution of haploid and diploid populations reveals common, strong, and variable pleiotropic effects in non-home environments. eLife Chen, V., Johnson, M. S., Hérissant, L., Humphrey, P. T., Yuan, D. C., Li, Y., Agarwala, A., Hoelscher, S. B., Petrov, D. A., Desai, M. M., Sherlock, G. 2023; 12


    Adaptation is driven by the selection for beneficial mutations that provide a fitness advantage in the specific environment in which a population is evolving. However, environments are rarely constant or predictable. When an organism well adapted to one environment finds itself in another, pleiotropic effects of mutations that made it well adapted to its former environment will affect its success. To better understand such pleiotropic effects, we evolved both haploid and diploid barcoded budding yeast populations in multiple environments, isolated adaptive clones, and then determined the fitness effects of adaptive mutations in "non-home" environments in which they were not selected. We find that pleiotropy is common, with most adaptive evolved lineages showing fitness effects in non-home environments. Consistent with other studies, we find that these pleiotropic effects are unpredictable: they are beneficial in some environments and deleterious in others. However, we do find that lineages with adaptive mutations in the same genes tend to show similar pleiotropic effects. We also find that ploidy influences the observed adaptive mutational spectra in a condition-specific fashion. In some conditions, haploids and diploids are selected with adaptive mutations in identical genes, while in others they accumulate mutations in almost completely disjoint sets of genes.

    View details for DOI 10.7554/eLife.92899

    View details for PubMedID 37861305

  • Arrayed in vivo barcoding for multiplexed sequence verification of plasmid DNA and demultiplexing of pooled libraries. bioRxiv : the preprint server for biology Li, W., Miller, D., Liu, X., Tosi, L., Chkaiban, L., Mei, H., Hung, P., Parekkadan, B., Sherlock, G., Levy, S. F. 2023


    Sequence verification of plasmid DNA is critical for many cloning and molecular biology workflows. To leverage high-throughput sequencing, several methods have been developed that add a unique DNA barcode to individual samples prior to pooling and sequencing. However, these methods require an individual plasmid extraction and/or in vitro barcoding reaction for each sample processed, limiting throughput and adding cost. Here, we develop an arrayed in vivo plasmid barcoding platform that enables pooled plasmid extraction and library preparation for Oxford Nanopore sequencing. This method has a high accuracy and recovery rate, and greatly increases throughput and reduces cost relative to other plasmid barcoding methods or Sanger sequencing. We use in vivo barcoding to sequence verify >45,000 plasmids and show that the method can be used to transform error-containing dispersed plasmid pools into sequence-perfect arrays or well-balanced pools. In vivo barcoding does not require any specialized equipment beyond a low-overhead Oxford Nanopore sequencer, enabling most labs to flexibly process hundreds to thousands of plasmids in parallel.

    View details for DOI 10.1101/2023.10.13.562064

    View details for PubMedID 37873145

  • Improved Sugarcane-Based Fermentation Processes by an Industrial Fuel-Ethanol Yeast Strain. Journal of fungi (Basel, Switzerland) Muller, G., de Godoy, V. R., Dário, M. G., Duval, E. H., Alves-Jr, S. L., Bücker, A., Rosa, C. A., Dunn, B., Sherlock, G., Stambuk, B. U. 2023; 9 (8)


    In Brazil, sucrose-rich broths (cane juice and/or molasses) are used to produce billions of liters of both fuel ethanol and cachaça per year using selected Saccharomyces cerevisiae industrial strains. Considering the important role of feedstock (sugar) prices in the overall process economics, to improve sucrose fermentation the genetic characteristics of a group of eight fuel-ethanol and five cachaça industrial yeasts that tend to dominate the fermentors during the production season were determined by array comparative genomic hybridization. The widespread presence of genes encoding invertase at multiple telomeres has been shown to be a common feature of both baker's and distillers' yeast strains, and is postulated to be an adaptation to sucrose-rich broths. Our results show that only two strains (one fuel-ethanol and one cachaça yeast) have amplification of genes encoding invertase, with high specific activity. The other industrial yeast strains had a single locus (SUC2) in their genome, with different patterns of invertase activity. These results indicate that invertase activity probably does not limit sucrose fermentation during fuel-ethanol and cachaça production by these industrial strains. Using this knowledge, we changed the mode of sucrose metabolism of an industrial strain by avoiding extracellular invertase activity, overexpressing the intracellular invertase, and increasing its transport through the AGT1 permease. This approach allowed the direct consumption of the disaccharide by the cells, without releasing glucose or fructose into the medium, and an 11% higher ethanol production from sucrose by the modified industrial yeast, when compared to its parental strain.

    View details for DOI 10.3390/jof9080803

    View details for PubMedID 37623574

  • Paths to adaptation under fluctuating nitrogen starvation: The spectrum of adaptive mutations in Saccharomyces cerevisiae is shaped by retrotransposons and microhomology-mediated recombination. PLoS genetics Hays, M., Schwartz, K., Schmidtke, D. T., Aggeli, D., Sherlock, G. 2023; 19 (5): e1010747


    There are many mechanisms that give rise to genomic change: while point mutations are often emphasized in genomic analyses, evolution acts upon many other types of genetic changes that can result in less subtle perturbations. Changes in chromosome structure, DNA copy number, and novel transposon insertions all create large genomic changes, which can have correspondingly large impacts on phenotypes and fitness. In this study we investigate the spectrum of adaptive mutations that arise in a population under consistently fluctuating nitrogen conditions. We specifically contrast these adaptive alleles and the mutational mechanisms that create them, with mechanisms of adaptation under batch glucose limitation and constant selection in low, non-fluctuating nitrogen conditions to address if and how selection dynamics influence the molecular mechanisms of evolutionary adaptation. We observe that retrotransposon activity accounts for a substantial number of adaptive events, along with microhomology-mediated mechanisms of insertion, deletion, and gene conversion. In addition to loss of function alleles, which are often exploited in genetic screens, we identify putative gain of function alleles and alleles acting through as-of-yet unclear mechanisms. Taken together, our findings emphasize that how selection (fluctuating vs. non-fluctuating) is applied also shapes adaptation, just as the selective pressure (nitrogen vs. glucose) does itself. Fluctuating environments can activate different mutational mechanisms, shaping adaptive events accordingly. Experimental evolution, which allows a wider array of adaptive events to be assessed, is thus a complementary approach to both classical genetic screens and natural variation studies to characterize the genotype-to-phenotype-to-fitness map.

    View details for DOI 10.1371/journal.pgen.1010747

    View details for PubMedID 37192196

  • Experimental evolution for cell biology. Trends in cell biology Helsen, J., Sherlock, G., Dey, G. 2023


    Evolutionary cell biology explores the origins, principles, and core functions of cellular features and regulatory networks through the lens of evolution. This emerging field relies heavily on comparative experiments and genomic analyses that focus exclusively on extant diversity and historical events, providing limited opportunities for experimental validation. In this opinion article, we explore the potential for experimental laboratory evolution to augment the evolutionary cell biology toolbox, drawing inspiration from recent studies that combine laboratory evolution with cell biological assays. Primarily focusing on approaches for single cells, we provide a generalizable template for adapting experimental evolution protocols to provide fresh insight into long-standing questions in cell biology.

    View details for DOI 10.1016/j.tcb.2023.04.006

    View details for PubMedID 37188561

  • An improved algorithm for inferring mutational parameters from bar-seq evolution experiments. BMC genomics Li, F., Mahadevan, A., Sherlock, G. 2023; 24 (1): 246


    BACKGROUND: Genetic barcoding provides a high-throughput way to simultaneously track the frequencies of large numbers of competing and evolving microbial lineages. However making inferences about the nature of the evolution that is taking place remains a difficult task. RESULTS: Here we describe an algorithm for the inference of fitness effects and establishment times of beneficial mutations from barcode sequencing data, which builds upon a Bayesian inference method by enforcing self-consistency between the population mean fitness and the individual effects of mutations within lineages. By testing our inference method on a simulation of 40,000 barcoded lineages evolving in serial batch culture, we find that this new method outperforms its predecessor, identifying more adaptive mutations and more accurately inferring their mutational parameters. CONCLUSION: Our new algorithm is particularly suited to inference of mutational parameters when read depth is low. We have made Python code for our serial dilution evolution simulations, as well as both the old and new inference methods, available on GitHub ( ), in the hope that it can find broader use by the microbial evolution community.

    View details for DOI 10.1186/s12864-023-09345-x

    View details for PubMedID 37149606

  • Insufficient evidence for non-neutrality of synonymous mutations. Nature Kruglyak, L., Beyer, A., Bloom, J. S., Grossbach, J., Lieberman, T. D., Mancuso, C. P., Rich, M. S., Sherlock, G., Kaplan, C. D. 2023; 616 (7957): E8-E9

    View details for DOI 10.1038/s41586-023-05865-4

    View details for PubMedID 37076734

    View details for PubMedCentralID 4359748

  • Yca1 metacaspase: Diverse functions determine how yeast live and let die. FEMS yeast research Lam, D. K., Sherlock, G. 2023


    The Yca1 metacaspase was discovered due to its role in the regulation of apoptosis in Saccharomyces cerevisiae. However, the mechanisms that drive apoptosis in yeast remain poorly understood. Additionally, Yca1 and other metacaspase proteins have recently been recognized for their involvement in other cellular processes, including cellular proteostasis and cell cycle regulation. In this Minireview, we outline recent findings on Yca1 that will enable the further study of metacaspase multifunctionality and novel apoptosis pathways in yeast and other non-metazoans. In addition, we discuss advancements in high-throughput screening technologies that can be applied to answer complex questions surrounding the apoptotic and non-apoptotic functions of metacaspase proteins across a diverse range of species.

    View details for DOI 10.1093/femsyr/foad022

    View details for PubMedID 37002543

  • Fit-Seq2.0: An Improved Software for High-Throughput Fitness Measurements Using Pooled Competition Assays. Journal of molecular evolution Li, F., Tarkington, J., Sherlock, G. 2023


    The fitness of a genotype is defined as its lifetime reproductive success, with fitness itself being a composite trait likely dependent on many underlying phenotypes. Measuring fitness is important for understanding how alteration of different cellular components affects a cell's ability to reproduce. Here, we describe an improved approach, implemented in Python, for estimating fitness in high throughput via pooled competition assays.

    View details for DOI 10.1007/s00239-023-10098-0

    View details for PubMedID 36877292

  • S. cerevisiae Cells Can Grow without the Pds5 Cohesin Subunit. mBio Choudhary, K., Itzkovich, Z., Alonso-Perez, E., Bishara, H., Dunn, B., Sherlock, G., Kupiec, M. 2022: e0142022


    During DNA replication, the newly created sister chromatids are held together until their separation at anaphase. The cohesin complex is in charge of creating and maintaining sister chromatid cohesion (SCC) in all eukaryotes. In Saccharomyces cerevisiae cells, cohesin is composed of two elongated proteins, Smc1 and Smc3, bridged by the kleisin Mcd1/Scc1. The latter also acts as a scaffold for three additional proteins, Scc3/Irr1, Wpl1/Rad61, and Pds5. Although the HEAT-repeat protein Pds5 is essential for cohesion, its precise function is still debated. Deletion of the ELG1 gene, encoding a PCNA unloader, can partially suppress the temperature-sensitive pds5-1 allele, but not a complete deletion of PDS5. We carried out a genetic screen for high-copy-number suppressors and another for spontaneously arising mutants, allowing the survival of a pds5Δ elg1Δ strain. Our results show that cells remain viable in the absence of Pds5 provided that there is both an elevation in the level of Mcd1 (which can be due to mutations in the CLN2 gene, encoding a G1 cyclin), and an increase in the level of SUMO-modified PCNA on chromatin (caused by lack of PCNA unloading in elg1Δ mutants). The elevated SUMO-PCNA levels increase the recruitment of the Srs2 helicase, which evicts Rad51 molecules from the moving fork, creating single-stranded DNA (ssDNA) regions that serve as sites for increased cohesin loading and SCC establishment. Thus, our results delineate a double role for Pds5 in protecting the cohesin ring and interacting with the DNA replication machinery. IMPORTANCE Sister chromatid cohesion is vital for faithful chromosome segregation, chromosome folding into loops, and gene expression. A multisubunit protein complex known as cohesin holds the sister chromatids from S phase until the anaphase stage. In this study, we explore the function of the essential cohesin subunit Pds5 in the regulation of sister chromatid cohesion. We performed two independent genetic screens to bypass the function of the Pds5 protein. We observe that Pds5 protein is a cohesin stabilizer, and elevating the levels of Mcd1 protein along with SUMO-PCNA accumulation on chromatin can compensate for the loss of the PDS5 gene. In addition, Pds5 plays a role in coordinating the DNA replication and sister chromatid cohesion establishment. This work elucidates the function of cohesin subunit Pds5, the G1 cyclin Cln2, and replication factors PCNA, Elg1, and Srs2 in the proper regulation of sister chromatid cohesion.

    View details for DOI 10.1128/mbio.01420-22

    View details for PubMedID 35708277

  • Neural networks enable efficient and accurate simulation-based inference of evolutionary parameters from adaptation dynamics. PLoS biology Avecilla, G., Chuong, J. N., Li, F., Sherlock, G., Gresham, D., Ram, Y. 2022; 20 (5): e3001633


    The rate of adaptive evolution depends on the rate at which beneficial mutations are introduced into a population and the fitness effects of those mutations. The rate of beneficial mutations and their expected fitness effects is often difficult to empirically quantify. As these 2 parameters determine the pace of evolutionary change in a population, the dynamics of adaptive evolution may enable inference of their values. Copy number variants (CNVs) are a pervasive source of heritable variation that can facilitate rapid adaptive evolution. Previously, we developed a locus-specific fluorescent CNV reporter to quantify CNV dynamics in evolving populations maintained in nutrient-limiting conditions using chemostats. Here, we use CNV adaptation dynamics to estimate the rate at which beneficial CNVs are introduced through de novo mutation and their fitness effects using simulation-based likelihood-free inference approaches. We tested the suitability of 2 evolutionary models: a standard Wright-Fisher model and a chemostat model. We evaluated 2 likelihood-free inference algorithms: the well-established Approximate Bayesian Computation with Sequential Monte Carlo (ABC-SMC) algorithm, and the recently developed Neural Posterior Estimation (NPE) algorithm, which applies an artificial neural network to directly estimate the posterior distribution. By systematically evaluating the suitability of different inference methods and models, we show that NPE has several advantages over ABC-SMC and that a Wright-Fisher evolutionary model suffices in most cases. Using our validated inference framework, we estimate the CNV formation rate at the GAP1 locus in the yeast Saccharomyces cerevisiae to be 10^-4.7 to 10^-4 CNVs per cell division and a fitness coefficient of 0.04 to 0.1 per generation for GAP1 CNVs in glutamine-limited chemostats. We experimentally validated our inference-based estimates using 2 distinct experimental methods (barcode lineage tracking and pairwise fitness assays), which provide independent confirmation of the accuracy of our approach. Our results are consistent with a beneficial CNV supply rate that is 10-fold greater than the estimated rates of beneficial single-nucleotide mutations, explaining the outsized importance of CNVs in rapid adaptive evolution. More generally, our study demonstrates the utility of novel neural network-based likelihood-free inference methods for inferring the rates and effects of evolutionary processes from empirical data with possible applications ranging from tumor to viral evolution.

    View details for DOI 10.1371/journal.pbio.3001633

    View details for PubMedID 35622868

  • How to Use the Candida Genome Database. Methods in molecular biology (Clifton, N.J.) Skrzypek, M. S., Binkley, J., Sherlock, G. 2022; 2542: 55-69


    The Candida Genome Database provides access to biological information about genes and proteins of several medically important Candida species. The website is organized into easily navigable pages that enable data retrieval and analysis. This chapter shows how to explore the CGD Home page and Locus Summary pages, which are the main access points to the database. It also provides a description of how to use the GO analysis tools, GO Term Finder, and GO Slim Mapper and how to browse large-scale datasets using the JBrowse genome browser. Finally, it shows how to search and retrieve data for user-defined sets of genes using the Advanced Search and Batch Download tools.

    View details for DOI 10.1007/978-1-0716-2549-1_4

    View details for PubMedID 36008656

  • Quantifying rapid bacterial evolution and transmission within the mouse intestine. Cell host & microbe Vasquez, K. S., Willis, L., Cira, N. J., Ng, K. M., Pedro, M. F., Aranda-Diaz, A., Rajendram, M., Yu, F. B., Higginbottom, S. K., Neff, N., Sherlock, G., Xavier, K. B., Quake, S. R., Sonnenburg, J. L., Good, B. H., Huang, K. C. 2021


    Due to limitations on high-resolution strain tracking, selection dynamics during gut microbiota colonization and transmission between hosts remain mostly mysterious. Here, we introduced hundreds of barcoded Escherichia coli strains into germ-free mice and quantified strain-level dynamics and metagenomic changes. Mutations in genes involved in motility and metabolite utilization are reproducibly selected within days. Even with rapid selection, coprophagy enforced similar barcode distributions across co-housed mice. Whole-genome sequencing of hundreds of isolates revealed linked alleles that demonstrate between-host transmission. A population-genetics model predicts substantial fitness advantages for certain mutants and that migration accounted for 10% of the resident microbiota each day. Treatment with ciprofloxacin suggests interplay between selection and transmission. While initial colonization was mostly uniform, in two mice a bottleneck reduced diversity and selected for ciprofloxacin resistance in the absence of drug. These findings highlight the interplay between environmental transmission and rapid, deterministic selection during evolution of the intestinal microbiota.

    View details for DOI 10.1016/j.chom.2021.08.003

    View details for PubMedID 34473943

  • Evolutionary dynamics and structural consequences of de novo beneficial mutations and mutant lineages arising in a constant environment. BMC biology Kinnersley, M., Schwartz, K., Yang, D., Sherlock, G., Rosenzweig, F. 2021; 19 (1): 20


    BACKGROUND: Microbial evolution experiments can be used to study the tempo and dynamics of evolutionary change in asexual populations, founded from single clones and growing into large populations with multiple clonal lineages. High-throughput sequencing can be used to catalog de novo mutations as potential targets of selection, determine in which lineages they arise, and track the fates of those lineages. Here, we describe a long-term experimental evolution study to identify targets of selection and to determine when, where, and how often those targets are hit. RESULTS: We experimentally evolved replicate Escherichia coli populations that originated from a mutator/nonsense suppressor ancestor under glucose limitation for between 300 and 500 generations. Whole-genome, whole-population sequencing enabled us to catalog 3346 de novo mutations that reached >1% frequency. We sequenced the genomes of 96 clones from each population when allelic diversity was greatest in order to establish whether mutations were in the same or different lineages and to depict lineage dynamics. Operon-specific mutations that enhance glucose uptake were the first to rise to high frequency, followed by global regulatory mutations. Mutations related to energy conservation, membrane biogenesis, and mitigating the impact of nonsense mutations, both ancestral and derived, arose later. New alleles were confined to relatively few loci, with many instances of identical mutations arising independently in multiple lineages, among and within replicate populations. However, most never exceeded 10% in frequency and were at a lower frequency at the end of the experiment than at their maxima, indicating clonal interference. Many alleles mapped to key structures within the proteins that they mutated, providing insight into their functional consequences. CONCLUSIONS: Overall, we find that when mutational input is increased by an ancestral defect in DNA repair, the spectrum of high-frequency beneficial mutations in a simple, constant resource-limited environment is narrow, resulting in extreme parallelism where many adaptive mutations arise but few ever go to fixation.

    View details for DOI 10.1186/s12915-021-00954-0

    View details for PubMedID 33541358

  • Adaptation is influenced by the complexity of environmental change during evolution in a dynamic environment. PLoS genetics Boyer, S., Herissant, L., Sherlock, G. 2021; 17 (1): e1009314


    The environmental conditions of microorganisms' habitats may fluctuate in unpredictable ways, such as changes in temperature, carbon source, pH, and salinity to name a few. Environmental heterogeneity presents a challenge to microorganisms, as they have to adapt not only to be fit under a specific condition, but they must also be robust across many conditions and be able to deal with the switch between conditions itself. While experimental evolution has been used to gain insight into the adaptive process, this has largely been in either unvarying or consistently varying conditions. In cases where changing environments have been investigated, relatively little is known about how such environments influence the dynamics of the adaptive process itself, as well as the genetic and phenotypic outcomes. We designed a systematic series of evolution experiments where we used two growth conditions that have differing timescales of adaptation and varied the rate of switching between them. We used lineage tracking to follow adaptation, and whole genome sequenced adaptive clones from each of the experiments. We find that both the switch rate and the order of the conditions influence adaptation. We also find different adaptive outcomes, at both the genetic and phenotypic levels, even when populations spent the same amount of total time in the two different conditions, but the order and/or switch rate differed. Thus, in a variable environment adaptation depends not only on the nature of the conditions and phenotypes under selection, but also on the complexity of the manner in which those conditions are combined to result in a given dynamic environment.

    View details for DOI 10.1371/journal.pgen.1009314

    View details for PubMedID 33493203

  • Acquisition, transmission and strain diversity of human gut-colonizing crAss-like phages. Nature communications Siranosian, B. A., Tamburini, F. B., Sherlock, G., Bhatt, A. S. 2020; 11 (1): 280


    CrAss-like phages are double-stranded DNA viruses that are prevalent in human gut microbiomes. Here, we analyze gut metagenomic data from mother-infant pairs and patients undergoing fecal microbiota transplantation to evaluate the patterns of acquisition, transmission and strain diversity of crAss-like phages. We find that crAss-like phages are rarely detected at birth but are increasingly prevalent in the infant microbiome after one month of life. We observe nearly identical genomes in 50% of cases where the same crAss-like clade is detected in both the mother and the infant, suggesting vertical transmission. In cases of putative transmission of prototypical crAssphage (p-crAssphage), we find that a subset of strains present in the mother are detected in the infant, and that strain diversity in infants increases with time. Putative tail fiber proteins are enriched for nonsynonymous strain variation compared to other genes, suggesting a potential evolutionary benefit to maintaining strain diversity in specific genes. Finally, we show that p-crAssphage can be acquired through fecal microbiota transplantation.

    View details for DOI 10.1038/s41467-019-14103-3

    View details for PubMedID 31941900

  • Improved discovery of genetic interactions using CRISPRiSeq across multiple environments GENOME RESEARCH Jaffe, M., Dziulko, A., Smith, J. D., St Onge, R. P., Levy, S. F., Sherlock, G. 2019; 29 (4): 668–81
  • Gene flow contributes to diversification of the major fungal pathogen Candida albicans NATURE COMMUNICATIONS Ropars, J., Maufrais, C., Diogo, D., Marcet-Houben, M., Perin, A., Sertour, N., Mosca, K., Permal, E., Laval, G., Bouchier, C., Ma, L., Schwartz, K., Voelz, K., May, R. C., Poulain, J., Battail, C., Wincker, P., Borman, A. M., Chowdhary, A., Fan, S., Kim, S., Le Pape, P., Romeo, O., Shin, J., Gabaldon, T., Sherlock, G., Bougnoux, M., d'Enfert, C. 2018; 9: 2253


    Elucidating population structure and levels of genetic diversity and recombination is necessary to understand the evolution and adaptation of species. Candida albicans is the second most frequent agent of human fungal infections worldwide, causing high-mortality rates. Here we present the genomic sequences of 182 C. albicans isolates collected worldwide, including commensal isolates, as well as ones responsible for superficial and invasive infections, constituting the largest dataset to date for this major fungal pathogen. Although C. albicans shows a predominantly clonal population structure, we find evidence of gene flow between previously known and newly identified genetic clusters, supporting the occurrence of (para)sexuality in nature. A highly clonal lineage, which experimentally shows reduced fitness, has undergone pseudogenization in genes required for virulence and morphogenesis, which may explain its niche restriction. Candida albicans thus takes advantage of both clonality and gene flow to diversify.

    View details for PubMedID 29884848

  • Diff-seq: A high throughput sequencing-based mismatch detection assay for DNA variant enrichment and discovery NUCLEIC ACIDS RESEARCH Aggeli, D., Karas, V. O., Sinnott-Armstrong, N. A., Varghese, V., Shafer, R. W., Greenleaf, W. J., Sherlock, G. 2018; 46 (7)


    Much of the within-species genetic variation is in the form of single nucleotide polymorphisms (SNPs), typically detected by whole genome sequencing (WGS) or microarray-based technologies. However, WGS produces mostly uninformative reads that perfectly match the reference, while microarrays require genome-specific reagents. We have developed Diff-seq, a sequencing-based mismatch detection assay for SNP discovery without the requirement for specialized nucleic-acid reagents. Diff-seq leverages the Surveyor endonuclease to cleave mismatched DNA molecules that are generated after cross-annealing of a complex pool of DNA fragments. Sequencing libraries enriched for Surveyor-cleaved molecules result in increased coverage at the variant sites. Diff-seq detected all mismatches present in an initial test substrate, with specific enrichment dependent on the identity and context of the variation. Application to viral sequences resulted in increased observation of variant alleles in a biologically relevant context. Diff-seq has the potential to increase the sensitivity and efficiency of high-throughput sequencing in the detection of variation.

    View details for PubMedID 29361139

    View details for PubMedCentralID PMC5909455

  • Using the Candida Genome Database. Methods in molecular biology (Clifton, N.J.) Skrzypek, M. S., Binkley, J., Sherlock, G. 2018; 1757: 31–47


    Studying Candida biology requires access to genomic sequence data in conjunction with experimental information that together provide functional context to genes and proteins, and aid in interpreting newly generated experimental data. The Candida Genome Database (CGD) curates the Candida literature, and integrates functional information about Candida genes and their products with a set of analysis tools that facilitate searching for sets of genes and exploring their biological roles. This chapter describes how the various types of information available at CGD can be searched, retrieved, and analyzed. Starting with the guided tour of the CGD Home page and Locus Summary page, this unit shows how to navigate the various assemblies of the C. albicans genome, how to use Gene Ontology tools to make sense of large-scale data, and how to access the microarray data archived at CGD, as well as visualize high-throughput sequencing data through the use of JBrowse.

    View details for PubMedID 29761455

  • Comparative genomics reveals high biological diversity and specific adaptations in the industrially and medically important fungal genus Aspergillus. Genome biology de Vries, R. P., Riley, R., Wiebenga, A., Aguilar-Osorio, G., Amillis, S., Uchima, C. A., Anderluh, G., Asadollahi, M., Askin, M., Barry, K., Battaglia, E., Bayram, Ö., Benocci, T., Braus-Stromeyer, S. A., Caldana, C., Cánovas, D., Cerqueira, G. C., Chen, F., Chen, W., Choi, C., Clum, A., Dos Santos, R. A., Damásio, A. R., Diallinas, G., Emri, T., Fekete, E., Flipphi, M., Freyberg, S., Gallo, A., Gournas, C., Habgood, R., Hainaut, M., Harispe, M. L., Henrissat, B., Hildén, K. S., Hope, R., Hossain, A., Karabika, E., Karaffa, L., Karányi, Z., Kraševec, N., Kuo, A., Kusch, H., LaButti, K., Lagendijk, E. L., Lapidus, A., Levasseur, A., Lindquist, E., Lipzen, A., Logrieco, A. F., Maccabe, A., Mäkelä, M. R., Malavazi, I., Melin, P., Meyer, V., Mielnichuk, N., Miskei, M., Molnár, Á. P., Mulé, G., Ngan, C. Y., Orejas, M., Orosz, E., Ouedraogo, J. P., Overkamp, K. M., Park, H., Perrone, G., Piumi, F., Punt, P. J., Ram, A. F., Ramón, A., Rauscher, S., Record, E., Riaño-Pachón, D. M., Robert, V., Röhrig, J., Ruller, R., Salamov, A., Salih, N. S., Samson, R. A., Sándor, E., Sanguinetti, M., Schütze, T., Sepcic, K., Shelest, E., Sherlock, G., Sophianopoulou, V., Squina, F. M., Sun, H., Susca, A., Todd, R. B., Tsang, A., Unkles, S. E., van de Wiele, N., van Rossen-Uffink, D., Oliveira, J. V., Vesth, T. C., Visser, J., Yu, J., Zhou, M., Andersen, M. R., Archer, D. B., Baker, S. E., Benoit, I., Brakhage, A. A., Braus, G. H., Fischer, R., Frisvad, J. C., Goldman, G. H., Houbraken, J., Oakley, B., Pócsi, I., Scazzocchio, C., Seiboth, B., vanKuyk, P. A., Wortman, J., Dyer, P. S., Grigoriev, I. V. 2017; 18 (1): 28-?


    The fungal genus Aspergillus is of critical importance to humankind. Species include those with industrial applications, important pathogens of humans, animals and crops, a source of potent carcinogenic contaminants of food, and an important genetic model. The genome sequences of eight aspergilli have already been explored to investigate aspects of fungal biology, raising questions about evolution and specialization within this genus. We have generated genome sequences for ten novel, highly diverse Aspergillus species and compared these in detail to sister and more distant genera. Comparative studies of key aspects of fungal biology, including primary and secondary metabolism, stress response, biomass degradation, and signal transduction, revealed both conservation and diversity among the species. Observed genomic differences were validated with experimental studies. This revealed several highlights, such as the potential for sex in asexual species, organic acid production genes being a key feature of black aspergilli, alternative approaches for degrading plant biomass, and indications for the genetic basis of stress response. A genome-wide phylogenetic analysis demonstrated in detail the relationship of the newly genome sequenced species with other aspergilli. Many aspects of biological differences between fungal species cannot be explained by current knowledge obtained from genome sequences. The comparative genomics and experimental study, presented here, allows for the first time a genus-wide view of the biological diversity of the aspergilli and in many, but not all, cases linked genome differences to phenotype. Insights gained could be exploited for biotechnological and medical applications of fungi.

    View details for DOI 10.1186/s13059-017-1151-0

    View details for PubMedID 28196534

  • Seeking Goldilocks During Evolution of Drug Resistance. PLoS biology Sherlock, G., Petrov, D. A. 2017; 15 (2)


    Speciation can occur when a population is split and the resulting subpopulations evolve independently, accumulating mutations over time that make them incompatible with one another. It is thought that such incompatible mutations, known as Bateson-Dobzhansky-Muller (BDM) incompatibilities, may arise when the two populations face different environments, which impose different selective pressures. However, a new study in PLOS Biology by Ono et al. finds that the first-step mutations selected in yeast populations evolving in parallel in the presence of the antifungal drug nystatin are frequently incompatible with one another. This incompatibility is environment dependent, such that the combination of two incompatible alleles can become advantageous under increasing drug concentrations. This suggests that the activity for the affected pathway must have an optimum level, the value of which varies according to the drug concentration. It is likely that many biological processes similarly have an optimum under a given environment and many single-step adaptive ways to reach it; thus, not only should BDM incompatibilities commonly arise during parallel evolution, they might be virtually inevitable, as the combination of two such steps is likely to overshoot the optimum.

    View details for DOI 10.1371/journal.pbio.2001872

    View details for PubMedID 28158184

    View details for PubMedCentralID PMC5291373

  • The Candida Genome Database (CGD): incorporation of Assembly 22, systematic identifiers and visualization of high throughput sequencing data NUCLEIC ACIDS RESEARCH Skrzypek, M. S., Binkley, J., Binkley, G., Miyasato, S. R., Simison, M., Sherlock, G. 2017; 45 (D1): D592-D596

    View details for DOI 10.1093/nar/gkw924

    View details for Web of Science ID 000396575500083

  • iSeq: A New Double-Barcode Method for Detecting Dynamic Genetic Interactions in Yeast G3-GENES GENOMES GENETICS Jaffe, M., Sherlock, G., Levy, S. F. 2017; 7 (1): 143-153
  • Extremely Rare Polymorphisms in Saccharomyces cerevisiae Allow Inference of the Mutational Spectrum. PLoS genetics Zhu, Y. O., Sherlock, G., Petrov, D. A. 2017; 13 (1)


    The characterization of mutational spectra is usually carried out in one of three ways: by direct observation through mutation accumulation (MA) experiments, through parent-offspring sequencing, or by indirect inference from sequence data. Direct observations of spontaneous mutations with MA experiments are limited, given (i) the rarity of spontaneous mutations, (ii) applicability only to laboratory model species with short generation times, and (iii) the possibility that mutational spectra under lab conditions might be different from those observed in nature. Trio sequencing is an elegant solution, but it is not applicable in all organisms. Indirect inference, usually from divergence data, faces no such technical limitations, but relies upon critical assumptions regarding the strength of natural selection that are likely to be violated. Ideally, new mutational events would be directly observed before the biased filter of selection, and without the technical limitations common to lab experiments. One approach is to identify very young mutations from population sequencing data. Here we do so by leveraging two characteristics common to all new mutations: they are necessarily rare in the population, and absent in the genomes of immediate relatives. From 132 clinical yeast strains, we were able to identify 1,425 putatively new mutations and show that they exhibit extremely low signatures of selection, as well as display a mutational spectrum that is similar to that identified by a large scale MA experiment. We verify that population sequencing data are a potential wealth of information for inferring mutational spectra, and should be considered for analysis where MA experiments are infeasible or especially tedious.

    View details for DOI 10.1371/journal.pgen.1006455

    View details for PubMedID 28046117

    View details for PubMedCentralID PMC5207638

  • Preparation of Yeast DNA Sequencing Libraries. Cold Spring Harbor protocols Schwartz, K., Sherlock, G. 2016; 2016 (10): pdb prot088930-?


    This protocol provides a detailed description of how to prepare a DNA sequencing library from yeast genomic DNA for use with the Illumina sequencing platform. This method does not require purchase of Illumina kits for library preparation but instead employs specific reagents purchased largely from New England BioLabs, which significantly reduces the cost of library preparation. Although we assume here that users intend to generate libraries with ∼400-bp insert sizes for paired-end sequencing, it is relatively straightforward to modify the shearing and size selection steps for longer or shorter inserts.

    View details for DOI 10.1101/pdb.prot088930

    View details for PubMedID 27698239

  • High-Throughput Yeast Strain Sequencing. Cold Spring Harbor protocols Schwartz, K., Sherlock, G. 2016; 2016 (10): pdb top077651-?


    The original yeast genome sequencing project was a monumental task, spanning several years, which resulted in the first sequenced eukaryotic genome. The 12 Mbp reference sequence was generated from yeast strain S288c and was of extremely high quality. In the years since it was published, sequencing technology has advanced apace, such that it is within the reach of most labs to sequence yeast strains of interest almost as a matter of standard practice, either via core facilities at their institution or through commercial sequencing services. Because of the availability of the high-quality reference sequence (which itself has received approximately 1500 updates derived from high-throughput sequencing data), reliable identification of differences between a strain of interest and the reference is relatively straightforward, at least for the nonrepetitive regions of the genome. In this introduction, we describe current high-throughput sequencing technology and methods for analysis of the resulting data.

    View details for DOI 10.1101/pdb.top077651

    View details for PubMedID 27698244

  • Analysis of Repair Mechanisms following an Induced Double-Strand Break Uncovers Recessive Deleterious Alleles in the Candida albicans Diploid Genome MBIO Feri, A., Loll-Krippleber, R., Commere, P., Maufrais, C., Sertour, N., Schwartz, K., Sherlock, G., Bougnoux, M., d'Enfert, C., Legrand, M. 2016; 7 (5)


    The diploid genome of the yeast Candida albicans is highly plastic, exhibiting frequent loss-of-heterozygosity (LOH) events. To provide a deeper understanding of the mechanisms leading to LOH, we investigated the repair of a unique DNA double-strand break (DSB) in the laboratory C. albicans SC5314 strain using the I-SceI meganuclease. Upon I-SceI induction, we detected a strong increase in the frequency of LOH events at an I-SceI target locus positioned on chromosome 4 (Chr4), including events spreading from this locus to the proximal telomere. Characterization of the repair events by single nucleotide polymorphism (SNP) typing and whole-genome sequencing revealed a predominance of gene conversions, but we also observed mitotic crossover or break-induced replication events, as well as combinations of independent events. Importantly, progeny that had undergone homozygosis of part or all of Chr4 haplotype B (Chr4B) were inviable. Mining of genome sequencing data for 155 C. albicans isolates allowed the identification of a recessive lethal allele in the GPI16 gene on Chr4B unique to C. albicans strain SC5314 which is responsible for this inviability. Additional recessive lethal or deleterious alleles were identified in the genomes of strain SC5314 and two clinical isolates. Our results demonstrate that recessive lethal alleles in the genomes of C. albicans isolates prevent the occurrence of specific extended LOH events. While these and other recessive lethal and deleterious alleles are likely to accumulate in C. albicans due to clonal reproduction, their occurrence may in turn promote the maintenance of corresponding nondeleterious alleles and, consequently, heterozygosity in the C. albicans species. Recessive lethal alleles impose significant constraints on the biology of diploid organisms. Using a combination of an I-SceI meganuclease-mediated DNA DSB, a fluorescence-activated cell sorter (FACS)-optimized reporter of LOH, and a compendium of 155 genome sequences, we were able to unmask and identify recessive lethal and deleterious alleles in isolates of Candida albicans, a diploid yeast and the major fungal pathogen of humans. Accumulation of recessive deleterious mutations upon clonal reproduction of C. albicans could contribute to the maintenance of heterozygosity despite the high frequency of LOH events in this species.

    View details for DOI 10.1128/mBio.01109-16

    View details for Web of Science ID 000390132900016

    View details for PubMedID 27729506

    View details for PubMedCentralID PMC5061868

  • Whole Genome Analysis of 132 Clinical Saccharomyces cerevisiae Strains Reveals Extensive Ploidy Variation G3-GENES GENOMES GENETICS Zhu, Y. O., Sherlock, G., Petrov, D. A. 2016; 6 (8): 2421-2434


    Budding yeast has undergone several independent transitions from commercial to clinical lifestyles. The frequency of such transitions suggests that clinical yeast strains are derived from environmentally available yeast populations, including commercial sources. However, despite their important role in adaptive evolution, the prevalence of polyploidy and aneuploidy has not been extensively analyzed in clinical strains. In this study, we have looked for patterns governing the transition to clinical invasion in the largest screen of clinical yeast isolates to date. In particular, we have focused on the hypothesis that ploidy changes have influenced adaptive processes. We sequenced 144 yeast strains, 132 of which are clinical isolates. We found pervasive large-scale genomic variation in both overall ploidy (34% of strains identified as 3n/4n) and individual chromosomal copy numbers (36% of strains identified as aneuploid). We also found evidence for the highly dynamic nature of yeast genomes, with 35 strains showing partial chromosomal copy number changes and eight strains showing multiple independent chromosomal events. Intriguingly, a lineage identified to be baker's/commercial derived with a unique damaging mutation in NDC80 was particularly prone to polyploidy, with 83% of its members being triploid or tetraploid. Polyploidy was in turn associated with a >2× increase in aneuploidy rates as compared to other lineages. This dataset provides a rich source of information on the genomics of clinical yeast strains and highlights the potential importance of large-scale genomic copy variation in yeast adaptation.

    View details for DOI 10.1534/g3.116.029397/-/DC1

    View details for Web of Science ID 000381282300017

    View details for PubMedID 27317778

    View details for PubMedCentralID PMC4978896

  • Heterozygote Advantage Is a Common Outcome of Adaptation in Saccharomyces cerevisiae GENETICS Sellis, D., Kvitek, D. J., Dunn, B., Sherlock, G., Petrov, D. A. 2016; 203 (3): 1401-?


    Adaptation in diploids is predicted to proceed via mutations that are at least partially dominant in fitness. Recently, we argued that many adaptive mutations might also be commonly overdominant in fitness. Natural (directional) selection acting on overdominant mutations should drive them into the population but then, instead of bringing them to fixation, should maintain them as balanced polymorphisms via heterozygote advantage. If true, this would make adaptive evolution in sexual diploids differ drastically from that of haploids. The validity of this prediction has not yet been tested experimentally. Here, we performed four replicate evolutionary experiments with diploid yeast populations (Saccharomyces cerevisiae) growing in glucose-limited continuous cultures. We sequenced 24 evolved clones and identified initial adaptive mutations in all four chemostats. The first adaptive mutations in all four chemostats were three copy number variations, all of which proved to be overdominant in fitness. The fact that fitness overdominant mutations were always the first step in independent adaptive walks supports the prediction that heterozygote advantage can arise as a common outcome of directional selection in diploids and demonstrates that overdominance of de novo adaptive mutations in diploids is not rare.

    View details for DOI 10.1534/genetics.115.185165

    View details for Web of Science ID 000379473600028

    View details for PubMedID 27194750

    View details for PubMedCentralID PMC4937471

  • How to Use the Candida Genome Database. Methods in molecular biology (Clifton, N.J.) Skrzypek, M. S., Binkley, J., Sherlock, G. 2016; 1356: 3-15


    Studying Candida biology requires access to genomic sequence data in conjunction with experimental information that provides functional context to genes and proteins. The Candida Genome Database (CGD) integrates functional information about Candida genes and their products with a set of analysis tools that facilitate searching for sets of genes and exploring their biological roles. This chapter describes how the various types of information available at CGD can be searched, retrieved, and analyzed. Starting with the guided tour of the CGD Home page and Locus Summary page, this unit shows how to navigate the various assemblies of the C. albicans genome, how to use Gene Ontology tools to make sense of large-scale data, and how to access the microarray data archived at CGD.

    View details for DOI 10.1007/978-1-4939-3052-4_1

    View details for PubMedID 26519061

  • Structured nucleosome fingerprints enable high-resolution mapping of chromatin architecture within regulatory regions GENOME RESEARCH Schep, A. N., Buenrostro, J. D., Denny, S. K., Schwartz, K., Sherlock, G., Greenleaf, W. J. 2015; 25 (11): 1757-1770

    View details for DOI 10.1101/gr.192294.115

    View details for PubMedID 26314830

  • A single nucleotide polymorphism uncovers a novel function for the transcription factor Ace2 during Candida albicans hyphal development. PLoS genetics Calderón-Noreña, D. M., González-Novo, A., Orellana-Muñoz, S., Gutiérrez-Escribano, P., Arnáiz-Pita, Y., Dueñas-Santero, E., Suárez, M. B., Bougnoux, M., del Rey, F., Sherlock, G., d'Enfert, C., Correa-Bordes, J., de Aldana, C. R. 2015; 11 (4)


    Candida albicans is a major invasive fungal pathogen in humans. An important virulence factor is its ability to switch between the yeast and hyphal forms, and these filamentous forms are important in tissue penetration and invasion. A common feature for filamentous growth is the ability to inhibit cell separation after cytokinesis, although it is poorly understood how this process is regulated developmentally. In C. albicans, the formation of filaments during hyphal growth requires changes in septin ring dynamics. In this work, we studied the functional relationship between septins and the transcription factor Ace2, which controls the expression of enzymes that catalyze septum degradation. We found that alternative translation initiation produces two Ace2 isoforms. While full-length Ace2, Ace2L, influences septin dynamics in a transcription-independent manner in hyphal cells but not in yeast cells, the use of methionine-55 as the initiation codon gives rise to Ace2S, which functions as the nuclear transcription factor required for the expression of cell separation genes. Genetic evidence indicates that Ace2L influences the incorporation of the Sep7 septin to hyphal septin rings in order to avoid inappropriate activation of cell separation during filamentous growth. Interestingly, a natural single nucleotide polymorphism (SNP) present in the C. albicans WO-1 background and other C. albicans commensal and clinical isolates generates a stop codon in the ninth codon of Ace2L that mimics the phenotype of cells lacking Ace2L. Finally, we report that Ace2L and Ace2S interact with the NDR kinase Cbk1 and that impairing activity of this kinase results in a defect in septin dynamics similar to that of hyphal cells lacking Ace2L. Together, our findings identify Ace2L and the NDR kinase Cbk1 as new elements of the signaling system that modify septin ring dynamics in hyphae to allow cell-chain formation, a feature that appears to have evolved in specific C. albicans lineages.

    View details for DOI 10.1371/journal.pgen.1005152

    View details for PubMedID 25875512

    View details for PubMedCentralID PMC4398349

  • The Valley-of-Death: Reciprocal sign epistasis constrains adaptive trajectories in a constant, nutrient limiting environment GENOMICS Chiotti, K. E., Kvitek, D. J., Schmidt, K. H., Koniges, G., Schwartz, K., Donckels, E. A., Rosenzweig, F., Sherlock, G. 2014; 104 (6): 431-437


    The fitness landscape is a powerful metaphor for describing the relationship between genotype and phenotype for a population under selection. However, empirical data as to the topography of fitness landscapes are limited, owing to difficulties in measuring fitness for large numbers of genotypes under any condition. We previously reported a case of reciprocal sign epistasis (RSE), where two mutations individually increased yeast fitness in a glucose-limited environment, but reduced fitness when combined, suggesting the existence of two peaks on the fitness landscape. We sought to determine whether a ridge connected these peaks so that populations founded by one mutant could reach the peak created by the other, avoiding the low-fitness "Valley-of-Death" between them. Sequencing clones after 250 generations of further evolution provided no evidence for such a ridge, but did reveal many presumptive beneficial mutations, adding to a growing body of evidence that clonal interference pervades evolving microbial populations.

    View details for DOI 10.1016/j.ygeno.2014.10.011

    View details for Web of Science ID 000346059200007

    View details for PubMedID 25449178

  • Experimental evolution: prospects and challenges. Genomics Rosenzweig, F., Sherlock, G. 2014; 104 (6 Pt A): v-vi

    View details for DOI 10.1016/j.ygeno.2014.11.008

    View details for PubMedID 25496938

    View details for PubMedCentralID PMC4428657

  • Literature-based gene curation and proposed genetic nomenclature for cryptococcus. Eukaryotic cell Inglis, D. O., Skrzypek, M. S., Liaw, E., Moktali, V., Sherlock, G., Stajich, J. E. 2014; 13 (7): 878-883


    Cryptococcus, a major cause of disseminated infections in immunocompromised patients, kills over 600,000 people per year worldwide. Genes involved in the virulence of the meningitis-causing fungus are being characterized at an increasing rate, and to date, at least 648 Cryptococcus gene names have been published. However, these data are scattered throughout the literature and are challenging to find. Furthermore, conflicts in locus identification exist, so that named genes have been subsequently published under new names or names associated with one locus have been used for another locus. To avoid these conflicts and to provide a central source of Cryptococcus gene information, we have collected all published Cryptococcus gene names from the scientific literature and associated them with standard Cryptococcus locus identifiers and have incorporated them into FungiDB. FungiDB is a panfungal genome database that collects gene information and functional data and provides search tools for 61 species of fungi and oomycetes. We applied these published names to a manually curated ortholog set of all Cryptococcus species currently in FungiDB, including Cryptococcus neoformans var. neoformans strains JEC21 and B-3501A, C. neoformans var. grubii strain H99, and Cryptococcus gattii strains R265 and WM276, and have written brief descriptions of their functions. We also compiled a protocol for gene naming that summarizes guidelines proposed by members of the Cryptococcus research community. The centralization of genomic and literature-based information for Cryptococcus at FungiDB will help researchers communicate about genes of interest, such as those related to virulence, and will further facilitate research on the pathogen.

    View details for DOI 10.1128/EC.00083-14

    View details for PubMedID 24813190

  • Ex Uno Plures: Clonal Reinforcement Drives Evolution of a Simple Microbial Community PLOS GENETICS Kinnersley, M., Wenger, J., Kroll, E., Adams, J., Sherlock, G., Rosenzweig, F. 2014; 10 (6)


    A major goal of genetics is to define the relationship between phenotype and genotype, while a major goal of ecology is to identify the rules that govern community assembly. Achieving these goals by analyzing natural systems can be difficult, as selective pressures create dynamic fitness landscapes that vary in both space and time. Laboratory experimental evolution offers the benefit of controlling variables that shape fitness landscapes, helping to achieve both goals. We previously showed that a clonal population of E. coli experimentally evolved under continuous glucose limitation gives rise to a genetically diverse community consisting of one clone, CV103, that best scavenges but incompletely utilizes the limiting resource, and others, CV101 and CV116, that consume its overflow metabolites. Because this community can be disassembled and reassembled, and involves cooperative interactions that are stable over time, its genetic diversity is sustained by clonal reinforcement rather than by clonal interference. To understand the genetic factors that produce this outcome, and to illuminate the community's underlying physiology, we sequenced the genomes of ancestral and evolved clones. We identified ancestral mutations in intermediary metabolism that may have predisposed the evolution of metabolic interdependence. Phylogenetic reconstruction indicates that the lineages that gave rise to this community diverged early, as CV103 shares only one Single Nucleotide Polymorphism with the other evolved clones. Underlying CV103's phenotype we identified a set of mutations that likely enhance glucose scavenging and maintain redox balance, but may do so at the expense of carbon excreted in overflow metabolites. Because these overflow metabolites serve as growth substrates that are differentially accessible to the other community members, and because the scavenging lineage shares only one SNP with these other clones, we conclude that this lineage likely served as an "engine" generating diversity by creating new metabolic niches, but not the occupants themselves.

    View details for DOI 10.1371/journal.pgen.1004430

    View details for Web of Science ID 000338847700045

    View details for PubMedID 24968217

    View details for PubMedCentralID PMC4072538

  • Extensive and coordinated control of allele-specific expression by both transcription and translation in Candida albicans GENOME RESEARCH Muzzey, D., Sherlock, G., Weissman, J. S. 2014; 24 (6): 963-973


    Though sequence differences between alleles are often limited to a few polymorphisms, these differences can cause large and widespread allelic variation at the expression level. Such allele-specific expression (ASE) has been extensively explored at the level of transcription but not translation. Here we measured ASE in the diploid yeast Candida albicans at both the transcriptional and translational levels using RNA-seq and ribosome profiling, respectively. Since C. albicans is an obligate diploid, our analysis isolates ASE arising from cis elements in a natural, nonhybrid organism, where allelic effects reflect evolutionary forces. Importantly, we find that ASE arising from translation is of a similar magnitude as transcriptional ASE, both in terms of the number of genes affected and the magnitude of the bias. We further observe coordination between ASE at the levels of transcription and translation for single genes. Specifically, reinforcing relationships--where transcription and translation favor the same allele--are more frequent than expected by chance, consistent with selective pressure tuning ASE at multiple regulatory steps. Finally, we parameterize alleles based on a range of properties and find that SNP location and predicted mRNA-structure stability are associated with translational ASE in cis. Since this analysis probes more than 4000 allelic pairs spanning a broad range of variations, our data provide a genome-wide view into the relative impact of cis elements that regulate translation.

    View details for DOI 10.1101/gr.166322.113

    View details for Web of Science ID 000336662200008

    View details for PubMedID 24732588

    View details for PubMedCentralID PMC4032860

  • PHENOTYPIC AND GENOTYPIC CONVERGENCES ARE INFLUENCED BY HISTORICAL CONTINGENCY AND ENVIRONMENT IN YEAST EVOLUTION Spor, A., Kvitek, D. J., Nidelet, T., Martin, J., Legrand, J., Dillmann, C., Bourgais, A., de Vienne, D., Sherlock, G., Sicard, D. 2014; 68 (3): 772-790


    Different organisms have independently and recurrently evolved similar phenotypic traits at different points throughout history. This phenotypic convergence may be caused by genotypic convergence and in addition, constrained by historical contingency. To investigate how convergence may be driven by selection in a particular environment and constrained by history, we analyzed nine life-history traits and four metabolic traits during an experimental evolution of six yeast strains in four different environments. In each of the environments, the population converged towards a different multivariate phenotype. However, the evolution of most traits, including fitness components, was constrained by history. Phenotypic convergence was partly associated with the selection of mutations in genes involved in the same pathway. By further investigating the convergence in one gene, BMH1, mutated in 20% of the evolved populations, we show that both the history and the environment influenced the types of mutations (missense/nonsense), their location within the gene itself, as well as their effects on multiple traits. However, these effects could not be easily predicted from ancestors' phylogeny or past-selection. Combined, our data highlight the role of pleiotropy and epistasis in shaping a rugged fitness landscape.

    View details for DOI 10.1111/evo.12302

    View details for Web of Science ID 000332046700014

    View details for PubMedID 24164389

  • PortEco: a resource for exploring bacterial biology through high-throughput data and analysis tools. Nucleic acids research Hu, J. C., Sherlock, G., Siegele, D. A., Aleksander, S. A., Ball, C. A., Demeter, J., Gouni, S., Holland, T. A., Karp, P. D., Lewis, J. E., Liles, N. M., McIntosh, B. K., Mi, H., Muruganujan, A., Wymore, F., Thomas, P. D. 2014; 42 (1): D677-84


    PortEco aims to collect, curate and provide data and analysis tools to support basic biological research in Escherichia coli (and eventually other bacterial systems). PortEco is implemented as a 'virtual' model organism database that provides a single unified interface to the user, while integrating information from a variety of sources. The main focus of PortEco is to enable broad use of the growing number of high-throughput experiments available for E. coli, and to leverage community annotation through the EcoliWiki and GONUTS systems. Currently, PortEco includes curated data from hundreds of genome-wide RNA expression studies, from high-throughput phenotyping of single-gene knockouts under hundreds of annotated conditions, from chromatin immunoprecipitation experiments for tens of different DNA-binding factors and from ribosome profiling experiments that yield insights into protein expression. Conditions have been annotated with a consistent vocabulary, and data have been consistently normalized to enable users to find, compare and interpret relevant experiments. PortEco includes tools for data analysis, including clustering, enrichment analysis and exploration via genome browsers. PortEco search and data analysis tools are extensively linked to the curated gene, metabolic pathway and regulation content at its sister site, EcoCyc.

    View details for DOI 10.1093/nar/gkt1203

    View details for PubMedID 24285306

  • Curation accuracy of model organism databases. Database : the journal of biological databases and curation Keseler, I. M., Skrzypek, M., Weerasinghe, D., Chen, A. Y., Fulcher, C., Li, G., Lemmer, K. C., Mladinich, K. M., Chow, E. D., Sherlock, G., Karp, P. D. 2014; 2014


    Manual extraction of information from the biomedical literature, or biocuration, is the central methodology used to construct many biological databases. For example, the UniProt protein database, the EcoCyc Escherichia coli database and the Candida Genome Database (CGD) are all based on biocuration. Biological databases are used extensively by life science researchers, as online encyclopedias, as aids in the interpretation of new experimental data and as gold standards for the development of new bioinformatics algorithms. Although manual curation has been assumed to be highly accurate, we are aware of only one previous study of biocuration accuracy. We assessed the accuracy of EcoCyc and CGD by manually selecting curated assertions within randomly chosen EcoCyc and CGD gene pages and by then validating that the data found in the referenced publications supported those assertions. A database assertion is considered to be in error if that assertion could not be found in the publication cited for that assertion. We identified 10 errors in the 633 facts that we validated across the two databases, for an overall error rate of 1.58%, and individual error rates of 1.82% for CGD and 1.40% for EcoCyc. These data suggest that manual curation of the experimental literature by Ph.D.-level scientists is highly accurate.

    View details for DOI 10.1093/database/bau058

    View details for PubMedID 24923819

    View details for PubMedCentralID PMC4207230

  • The Candida Genome Database: The new homology information page highlights protein similarity and phylogeny. Nucleic acids research Binkley, J., Arnaud, M. B., Inglis, D. O., Skrzypek, M. S., Shah, P., Wymore, F., Binkley, G., Miyasato, S. R., Simison, M., Sherlock, G. 2014; 42 (1): D711-6


    The Candida Genome Database (CGD) is a freely available online resource that provides gene, protein and sequence information for multiple Candida species, along with web-based tools for accessing, analyzing and exploring these data. The goal of CGD is to facilitate and accelerate research into Candida pathogenesis and biology. The CGD Web site is organized around Locus pages, which display information collected about individual genes. Locus pages have multiple tabs for accessing different types of information; the default Summary tab provides an overview of the gene name, aliases, phenotype and Gene Ontology curation, whereas other tabs display more in-depth information, including protein product details for coding genes, notes on changes to the sequence or structure of the gene and a comprehensive reference list. Here, in this update to previous NAR Database articles featuring CGD, we describe a new tab that we have added to the Locus page, entitled the Homology Information tab, which displays phylogeny and gene similarity information for each locus.

    View details for DOI 10.1093/nar/gkt1046

    View details for PubMedID 24185697

  • The Aspergillus Genome Database: multispecies curation and incorporation of RNA-Seq data to improve structural gene annotations. Nucleic acids research Cerqueira, G. C., Arnaud, M. B., Inglis, D. O., Skrzypek, M. S., Binkley, G., Simison, M., Miyasato, S. R., Binkley, J., Orvis, J., Shah, P., Wymore, F., Sherlock, G., Wortman, J. R. 2014; 42 (1): D705-10


    The Aspergillus Genome Database (AspGD) is a freely available web-based resource that was designed for Aspergillus researchers and is also a valuable source of information for the entire fungal research community. In addition to being a repository and central point of access to genome, transcriptome and polymorphism data, AspGD hosts a comprehensive comparative genomics toolbox that facilitates the exploration of precomputed orthologs among the 20 currently available Aspergillus genomes. AspGD curators perform gene product annotation based on review of the literature for four key Aspergillus species: Aspergillus nidulans, Aspergillus oryzae, Aspergillus fumigatus and Aspergillus niger. We have iteratively improved the structural annotation of Aspergillus genomes through the analysis of publicly available transcription data, mostly expressed sequence tags, as described in a previous NAR Database article (Arnaud et al. 2012). In this update, we report substantive structural annotation improvements for A. nidulans, A. oryzae and A. fumigatus genomes based on recently available RNA-Seq data. Over 26 000 loci were updated across these species; although those primarily comprise the addition and extension of untranslated regions (UTRs), the new analysis also enabled over 1000 modifications affecting the coding sequence of genes in each target genome.

    View details for DOI 10.1093/nar/gkt1029

    View details for PubMedID 24194595

  • Identification of cell cycle-regulated genes periodically expressed in U2OS cells and their regulation by FOXM1 and E2F transcription factors MOLECULAR BIOLOGY OF THE CELL Grant, G. D., Brooks, L., Zhang, X., Mahoney, J. M., Martyanov, V., Wood, T. A., Sherlock, G., Cheng, C., Whitfield, M. L. 2013; 24 (23): 3634-3650


    We identify the cell cycle-regulated mRNA transcripts genome-wide in the osteosarcoma-derived U2OS cell line. This results in 2140 transcripts mapping to 1871 unique cell cycle-regulated genes that show periodic oscillations across multiple synchronous cell cycles. We identify genomic loci bound by the G2/M transcription factor FOXM1 by chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) and associate these with cell cycle-regulated genes. FOXM1 is bound to cell cycle-regulated genes with peak expression in both S phase and G2/M phases. We show that ChIP-seq genomic loci are responsive to FOXM1 using a real-time luciferase assay in live cells, showing that FOXM1 strongly activates promoters of G2/M phase genes and weakly activates those induced in S phase. Analysis of ChIP-seq data from a panel of cell cycle transcription factors (E2F1, E2F4, E2F6, and GABPA) from the Encyclopedia of DNA Elements and ChIP-seq data for the DREAM complex finds that a set of core cell cycle genes regulated in both U2OS and HeLa cells are bound by multiple cell cycle transcription factors. These data identify the cell cycle-regulated genes in a second cancer-derived cell line and provide a comprehensive picture of the transcriptional regulatory systems controlling periodic gene expression in the human cell division cycle.

    View details for DOI 10.1091/mbc.E13-05-0264

    View details for Web of Science ID 000328125100005

    View details for PubMedID 24109597

    View details for PubMedCentralID PMC3842991

  • Whole genome, whole population sequencing reveals that loss of signaling networks is the major adaptive strategy in a constant environment. PLoS genetics Kvitek, D. J., Sherlock, G. 2013; 9 (11)


    Molecular signaling networks are ubiquitous across life and likely evolved to allow organisms to sense and respond to environmental change in dynamic environments. Few examples exist regarding the dispensability of signaling networks, and it remains unclear whether they are an essential feature of a highly adapted biological system. Here, we show that signaling network function carries a fitness cost in yeast evolving in a constant environment. We performed whole-genome, whole-population Illumina sequencing on replicate evolution experiments and find the major theme of adaptive evolution in a constant environment is the disruption of signaling networks responsible for regulating the response to environmental perturbations. Over half of all identified mutations occurred in three major signaling networks that regulate growth control: glucose signaling, Ras/cAMP/PKA and HOG. This results in a loss of environmental sensitivity that is reproducible across experiments. However, adaptive clones show reduced viability under starvation conditions, demonstrating an evolutionary tradeoff. These mutations are beneficial in an environment with a constant and predictable nutrient supply, likely because they result in constitutive growth, but reduce fitness in an environment where nutrient supply is not constant. Our results are a clear example of the myopic nature of evolution: a loss of environmental sensitivity in a constant environment is adaptive in the short term, but maladaptive should the environment change.

    View details for DOI 10.1371/journal.pgen.1003972

    View details for PubMedID 24278038

    View details for PubMedCentralID PMC3836717

  • Ras Signaling Gets Fine-Tuned: Regulation of Multiple Pathogenic Traits of Candida albicans EUKARYOTIC CELL Inglis, D. O., Sherlock, G. 2013; 12 (10): 1316-1325


    Candida albicans is an opportunistic fungal pathogen that can cause disseminated infection in patients with indwelling catheters or other implanted medical devices. A common resident of the human microbiome, C. albicans responds to environmental signals, such as cell contact with catheter materials and exposure to serum or CO2, by triggering the expression of a variety of traits, some of which are known to contribute to its pathogenic lifestyle. Such traits include adhesion, biofilm formation, filamentation, white-to-opaque (W-O) switching, and two recently described phenotypes, finger and tentacle formation. Under distinct sets of environmental conditions and in specific cell types (mating type-like a [MTLa]/alpha cells, MTL homozygotes, or daughter cells), C. albicans utilizes (or reutilizes) a single signal transduction pathway, the Ras pathway, to affect these phenotypes. Ras1, Cyr1, Tpk2, and Pde2, the proteins of the Ras signaling pathway, are the only nontranscriptional regulatory proteins that are known to be essential for regulating all of these processes. How does C. albicans utilize this one pathway to regulate all of these phenotypes? The regulation of distinct and yet related processes by a single, evolutionarily conserved pathway is accomplished through the use of downstream transcription factors that are active under specific environmental conditions and in different cell types. In this minireview, we discuss the role of Ras signaling pathway components and Ras pathway-regulated transcription factors as well as the transcriptional regulatory networks that fine-tune gene expression in diverse biological contexts to generate specific phenotypes that impact the virulence of C. albicans.

    View details for DOI 10.1128/EC.00094-13

    View details for Web of Science ID 000324861400001

    View details for PubMedID 23913542

    View details for PubMedCentralID PMC3811338

  • Comparative metabolic footprinting of a large number of commercial wine yeast strains in Chardonnay fermentations. FEMS yeast research Richter, C. L., Dunn, B., Sherlock, G., Pugh, T. 2013; 13 (4): 394-410


    Wine has been made for thousands of years. In modern times, as the importance of yeast as an ingredient in winemaking became better appreciated, companies worldwide have collected and marketed specific yeast strains for enhancing positive and minimizing negative attributes in wine. It is generally believed that each yeast strain contributes uniquely to fermentation performance and wine style because of its genetic background; however, the impact of metabolic diversity among wine yeasts on aroma compound production has not been extensively studied. We characterized the metabolic footprints of 69 different commercial wine yeast strains in triplicate fermentations of identical Chardonnay juice, by measuring 29 primary and secondary metabolites; we additionally measured seven attributes of fermentation performance of these strains. We identified up to 1000-fold differences between strains for some of the metabolites and observed large differences in fermentation performance, suggesting significant metabolic diversity. These differences represent potential selective markers for the strains that may be important to the wine industry. Analysis of these metabolic traits further builds on the known genomic diversity of these strains and provides new insights into their genetic and metabolic relatedness.

    View details for DOI 10.1111/1567-1364.12046

    View details for PubMedID 23528123

  • Recurrent Rearrangement during Adaptive Evolution in an Interspecific Yeast Hybrid Suggests a Model for Rapid Introgression PLOS GENETICS Dunn, B., Paulish, T., Stanbery, A., Piotrowski, J., Koniges, G., Kroll, E., Louis, E. J., Liti, G., Sherlock, G., Rosenzweig, F. 2013; 9 (3)


    Genome rearrangements are associated with eukaryotic evolutionary processes ranging from tumorigenesis to speciation. Rearrangements are especially common following interspecific hybridization, and some of these could be expected to have strong selective value. To test this expectation we created de novo interspecific yeast hybrids between two diverged but largely syntenic Saccharomyces species, S. cerevisiae and S. uvarum, then experimentally evolved them under continuous ammonium limitation. We discovered that a characteristic interspecific genome rearrangement arose multiple times in independently evolved populations. We uncovered nine different breakpoints, all occurring in a narrow ~1-kb region of chromosome 14, and all producing an "interspecific fusion junction" within the MEP2 gene coding sequence, such that the 5' portion derives from S. cerevisiae and the 3' portion derives from S. uvarum. In most cases the rearrangements altered both chromosomes, resulting in what can be considered to be an introgression of a several-kb region of S. uvarum into an otherwise intact S. cerevisiae chromosome 14, while the homeologous S. uvarum chromosome 14 experienced an interspecific reciprocal translocation at the same breakpoint within MEP2, yielding a chimaeric chromosome; these events result in the presence in the cell of two MEP2 fusion genes having identical breakpoints. Given that MEP2 encodes for a high-affinity ammonium permease, that MEP2 fusion genes arise repeatedly under ammonium-limitation, and that three independent evolved isolates carrying MEP2 fusion genes are each more fit than their common ancestor, the novel MEP2 fusion genes are very likely adaptive under ammonium limitation. Our results suggest that, when homoploid hybrids form, the admixture of two genomes enables swift and otherwise unavailable evolutionary innovations. Furthermore, the architecture of the MEP2 rearrangement suggests a model for rapid introgression, a phenomenon seen in numerous eukaryotic phyla, that does not require repeated backcrossing to one of the parental species.

    View details for DOI 10.1371/journal.pgen.1003366

    View details for Web of Science ID 000316866700042

    View details for PubMedID 23555283

    View details for PubMedCentralID PMC3605161

  • Improved Gene Ontology Annotation for Biofilm Formation, Filamentous Growth, and Phenotypic Switching in Candida albicans EUKARYOTIC CELL Inglis, D. O., Skrzypek, M. S., Arnaud, M. B., Binkley, J., Shah, P., Wymore, F., Sherlock, G. 2013; 12 (1): 101-108


    The opportunistic fungal pathogen Candida albicans is a significant medical threat, especially for immunocompromised patients. Experimental research has focused on specific areas of C. albicans biology, with the goal of understanding the multiple factors that contribute to its pathogenic potential. Some of these factors include cell adhesion, invasive or filamentous growth, and the formation of drug-resistant biofilms. The Gene Ontology (GO) is a standardized vocabulary that the Candida Genome Database (CGD) and other groups use to describe the functions of gene products. To improve the breadth and accuracy of pathogenicity-related gene product descriptions and to facilitate the description of as yet uncharacterized but potentially pathogenicity-related genes in Candida species, CGD undertook a three-part project: first, the addition of terms to the biological process branch of the GO to improve the description of fungus-related processes; second, manual recuration of gene product annotations in CGD to use the improved GO vocabulary; and third, computational ortholog-based transfer of GO annotations from experimentally characterized gene products, using these new terms, to uncharacterized orthologs in other Candida species. Through genome annotation and analysis, we identified candidate pathogenicity genes in seven non-C. albicans Candida species and in one additional C. albicans strain, WO-1. We also defined a set of C. albicans genes at the intersection of biofilm formation, filamentous growth, pathogenesis, and phenotypic switching of this opportunistic fungal pathogen, which provides a compelling list of candidates for further experimentation.

    View details for DOI 10.1128/EC.00238-12

    View details for Web of Science ID 000313061100010

    View details for PubMedID 23143685

    View details for PubMedCentralID PMC3535841

  • Assembly of a phased diploid Candida albicans genome facilitates allele-specific measurements and provides a simple model for repeat and indel structure GENOME BIOLOGY Muzzey, D., Schwartz, K., Weissman, J. S., Sherlock, G. 2013; 14 (9)


    Candida albicans is a ubiquitous opportunistic fungal pathogen that afflicts immunocompromised human hosts. With rare and transient exceptions the yeast is diploid, yet despite its clinical relevance the respective sequences of its two homologous chromosomes have not been completely resolved. We construct a phased diploid genome assembly by deep sequencing a standard laboratory wild-type strain and a panel of strains homozygous for particular chromosomes. The assembly has 700-fold coverage on average, allowing extensive revision and expansion of the number of known SNPs and indels. This phased genome significantly enhances the sensitivity and specificity of allele-specific expression measurements by enabling pooling and cross-validation of signal across multiple polymorphic sites. Additionally, the diploid assembly reveals pervasive and unexpected patterns in allelic differences between homologous chromosomes. Firstly, we see striking clustering of indels, concentrated primarily in the repeat sequences in promoters. Secondly, both indels and their repeat-sequence substrate are enriched near replication origins. Finally, we reveal an intimate link between repeat sequences and indels, which argues that repeat length is under selective pressure for most eukaryotes. This connection is described by a concise one-parameter model that explains repeat-sequence abundance in C. albicans as a function of the indel rate, and provides a general framework to interpret repeat abundance in species ranging from bacteria to humans. The phased genome assembly and insights into repeat plasticity will be valuable for better understanding allele-specific phenomena and genome evolution.

    View details for DOI 10.1186/gb-2013-14-9-r97

    View details for Web of Science ID 000328195700003

    View details for PubMedID 24025428

  • Comprehensive annotation of secondary metabolite biosynthetic genes and gene clusters of Aspergillus nidulans, A. fumigatus, A. niger and A. oryzae. BMC microbiology Inglis, D. O., Binkley, J., Skrzypek, M. S., Arnaud, M. B., Cerqueira, G. C., Shah, P., Wymore, F., Wortman, J. R., Sherlock, G. 2013; 13: 91


    BACKGROUND: Secondary metabolite production, a hallmark of filamentous fungi, is an expanding area of research for the Aspergilli. These compounds are potent chemicals, ranging from deadly toxins to therapeutic antibiotics to potential anti-cancer drugs. The genome sequences for multiple Aspergilli have been determined, and provide a wealth of predictive information about secondary metabolite production. Sequence analysis and gene overexpression strategies have enabled the discovery of novel secondary metabolites and the genes involved in their biosynthesis. The Aspergillus Genome Database (AspGD) provides a central repository for gene annotation and protein information for Aspergillus species. These annotations include Gene Ontology (GO) terms, phenotype data, gene names and descriptions and they are crucial for interpreting both small- and large-scale data and for aiding in the design of new experiments that further Aspergillus research. RESULTS: We have manually curated Biological Process GO annotations for all genes in AspGD with recorded functions in secondary metabolite production, adding new GO terms that specifically describe each secondary metabolite. We then leveraged these new annotations to predict roles in secondary metabolism for genes lacking experimental characterization. As a starting point for manually annotating Aspergillus secondary metabolite gene clusters, we used antiSMASH (antibiotics and Secondary Metabolite Analysis SHell) and SMURF (Secondary Metabolite Unknown Regions Finder) algorithms to identify potential clusters in A. nidulans, A. fumigatus, A. niger and A. oryzae, which we subsequently refined through manual curation. CONCLUSIONS: This set of 266 manually curated secondary metabolite gene clusters will facilitate the investigation of novel Aspergillus secondary metabolites.

    View details for DOI 10.1186/1471-2180-13-91

    View details for PubMedID 23617571

  • Turbidostat culture of Saccharomyces cerevisiae W303-1A under selective pressure elicited by ethanol selects for mutations in SSD1 and UTH1 FEMS YEAST RESEARCH Avrahami-Moyal, L., Engelberg, D., Wenger, J. W., Sherlock, G., Braun, S. 2012; 12 (5): 521-533


    We investigated the genetic causes of ethanol tolerance by characterizing mutations selected in Saccharomyces cerevisiae W303-1A under the selective pressure of ethanol. W303-1A was subjected to three rounds of turbidostat culture in a medium supplemented with increasing amounts of ethanol. By the end of selection, the growth rate of the culture had increased from 0.029 to 0.32 h⁻¹. Unlike the progenitor strain, all yeast cells isolated from this population were able to form colonies on medium supplemented with 7% ethanol within 6 days, our definition of ethanol tolerance. Several clones selected from all three stages of selection were able to form dense colonies within 2 days on solid medium supplemented with 9% ethanol. We sequenced the whole genomes of six clones and identified mutations responsible for ethanol tolerance. Thirteen additional clones were tested for the presence of similar mutations. In 15 of 19 tolerant clones, the stop codon in ssd1-d was replaced with an amino acid-encoding codon. Three other clones contained one of two mutations in UTH1, and one clone did not contain mutations in either SSD1 or UTH1. We showed that the mutations in SSD1 and UTH1 increased tolerance of the cell wall to zymolyase and conclude that stability of the cell wall is a major factor in increased tolerance to ethanol.

    View details for DOI 10.1111/j.1567-1364.2012.00803.x

    View details for Web of Science ID 000306189600003

    View details for PubMedID 22443114

  • APJ1 and GRE3 Homologs Work in Concert to Allow Growth in Xylose in a Natural Saccharomyces sensu stricto Hybrid Yeast GENETICS Schwartz, K., Wenger, J. W., Dunn, B., Sherlock, G. 2012; 191 (2): 621-U504


    Creating Saccharomyces yeasts capable of efficient fermentation of pentoses such as xylose remains a key challenge in the production of ethanol from lignocellulosic biomass. Metabolic engineering of industrial Saccharomyces cerevisiae strains has yielded xylose-fermenting strains, but these strains have not yet achieved industrial viability due largely to xylose fermentation being prohibitively slower than that of glucose. Recently, it has been shown that naturally occurring xylose-utilizing Saccharomyces species exist. Uncovering the genetic architecture of such strains will shed further light on xylose metabolism, suggesting additional engineering approaches or possibly even enabling the development of xylose-fermenting yeasts that are not genetically modified. We previously identified a hybrid yeast strain, the genome of which is largely Saccharomyces uvarum, which has the ability to grow on xylose as the sole carbon source. To circumvent the sterility of this hybrid strain, we developed a novel method to genetically characterize its xylose-utilization phenotype, using a tetraploid intermediate, followed by bulk segregant analysis in conjunction with high-throughput sequencing. We found that this strain's growth in xylose is governed by at least two genetic loci, within which we identified the responsible genes: one locus contains a known xylose-pathway gene, a novel homolog of the aldo-keto reductase gene GRE3, while a second locus contains a homolog of APJ1, which encodes a putative chaperone not previously connected to xylose metabolism. Our work demonstrates that the power of sequencing combined with bulk segregant analysis can also be applied to a nongenetically tractable hybrid strain that contains a complex, polygenic trait, and identifies new avenues for metabolic engineering as well as for construction of nongenetically modified xylose-fermenting strains.

    View details for DOI 10.1534/genetics.112.140053

    View details for Web of Science ID 000308999300020

    View details for PubMedID 22426884

    View details for PubMedCentralID PMC3374322

  • Analysis of the Saccharomyces cerevisiae pan-genome reveals a pool of copy number variants distributed in diverse yeast strains from differing industrial environments GENOME RESEARCH Dunn, B., Richter, C., Kvitek, D. J., Pugh, T., Sherlock, G. 2012; 22 (5): 908-924


    Although the budding yeast Saccharomyces cerevisiae is arguably one of the most well-studied organisms on earth, the genome-wide variation within this species--i.e., its "pan-genome"--has been less explored. We created a multispecies microarray platform containing probes covering the genomes of several Saccharomyces species: S. cerevisiae, including regions not found in the standard laboratory S288c strain, as well as the mitochondrial and 2-μm circle genomes, plus S. paradoxus, S. mikatae, S. kudriavzevii, S. uvarum, S. kluyveri, and S. castellii. We performed array-Comparative Genomic Hybridization (aCGH) on 83 different S. cerevisiae strains collected across a wide range of habitats; of these, 69 were commercial wine strains, while the remaining 14 were from a diverse set of other industrial and natural environments. We observed interspecific hybridization events, introgression events, and pervasive copy number variation (CNV) in all but a few of the strains. These CNVs were distributed throughout the strains such that they did not produce any clear phylogeny, suggesting extensive mating in both industrial and wild strains. To validate our results and to determine whether apparently similar introgressions and CNVs were identical by descent or recurrent, we also performed whole-genome sequencing on nine of these strains. These data may help pinpoint genomic regions involved in adaptation to different industrial milieus, as well as shed light on the course of domestication of S. cerevisiae.

    View details for DOI 10.1101/gr.130310.111

    View details for Web of Science ID 000303369600010

    View details for PubMedID 22369888

    View details for PubMedCentralID PMC3337436

  • Different selective pressures lead to different genomic outcomes as newly-formed hybrid yeasts evolve BMC EVOLUTIONARY BIOLOGY Piotrowski, J. S., Nagarajan, S., Kroll, E., Stanbery, A., Chiotti, K. E., Kruckeberg, A. L., Dunn, B., Sherlock, G., Rosenzweig, F. 2012; 12


    Interspecific hybridization occurs in every eukaryotic kingdom. While hybrid progeny are frequently at a selective disadvantage, in some instances their increased genome size and complexity may result in greater stress resistance than their ancestors, which can be adaptively advantageous at the edges of their ancestors' ranges. While this phenomenon has been repeatedly documented in the field, the response of hybrid populations to long-term selection has not often been explored in the lab. To fill this knowledge gap we crossed the two most distantly related members of the Saccharomyces sensu stricto group, S. cerevisiae and S. uvarum, and established a mixed population of homoploid and aneuploid hybrids to study how different types of selection impact hybrid genome structure. As temperature was raised incrementally from 31°C to 46.5°C over 500 generations of continuous culture, selection favored loss of the S. uvarum genome, although the kinetics of genome loss differed among independent replicates. Temperature-selected isolates exhibited greater inherent and induced thermal tolerance than parental species and founding hybrids, and also exhibited ethanol resistance. In contrast, as exogenous ethanol was increased from 0% to 14% over 500 generations of continuous culture, selection favored euploid S. cerevisiae x S. uvarum hybrids. Ethanol-selected isolates were more ethanol tolerant than S. uvarum and one of the founding hybrids, but did not exhibit resistance to temperature stress. Relative to parental and founding hybrids, temperature-selected strains showed heritable differences in cell wall structure in the forms of increased resistance to zymolyase digestion and Micafungin, which targets cell wall biosynthesis. This is the first study to show experimentally that the genomic fate of newly-formed interspecific hybrids depends on the type of selection they encounter during the course of evolution, underscoring the importance of the ecological theatre in determining the outcome of the evolutionary play.

    View details for DOI 10.1186/1471-2148-12-46

    View details for Web of Science ID 000305180500001

    View details for PubMedID 22471618

  • The Candida genome database incorporates multiple Candida species: multispecies search and analysis tools with curated gene and protein information for Candida albicans and Candida glabrata. Nucleic acids research Inglis, D. O., Arnaud, M. B., Binkley, J., Shah, P., Skrzypek, M. S., Wymore, F., Binkley, G., Miyasato, S. R., Simison, M., Sherlock, G. 2012; 40 (Database issue): D667-74


    The Candida Genome Database (CGD) is an internet-based resource that provides centralized access to genomic sequence data and manually curated functional information about genes and proteins of the fungal pathogen Candida albicans and other Candida species. As the scope of Candida research, and the number of sequenced strains and related species, has grown in recent years, the need for expanded genomic resources has also grown. To answer this need, CGD has expanded beyond storing data solely for C. albicans, now integrating data from multiple species. Herein we describe the incorporation of this multispecies information, which includes curated gene information and the reference sequence for C. glabrata, as well as orthology relationships that interconnect Locus Summary pages, allowing easy navigation between genes of C. albicans and C. glabrata. These orthology relationships are also used to predict GO annotations of their products. We have also added protein information pages that display domains, structural information and physicochemical properties; bibliographic pages highlighting important topic areas in Candida biology; and a laboratory strain lineage page that describes the lineage of commonly used laboratory strains. All of these data are freely available from CGD. We welcome feedback from the research community.

    View details for DOI 10.1093/nar/gkr945

    View details for Web of Science ID 000298601300101

    View details for PubMedID 22064862

    View details for PubMedCentralID PMC3245171

  • The Aspergillus Genome Database (AspGD): recent developments in comprehensive multispecies curation, comparative genomics and community resources. Nucleic acids research Arnaud, M. B., Cerqueira, G. C., Inglis, D. O., Skrzypek, M. S., Binkley, J., Chibucos, M. C., Crabtree, J., Howarth, C., Orvis, J., Shah, P., Wymore, F., Binkley, G., Miyasato, S. R., Simison, M., Sherlock, G., Wortman, J. R. 2012; 40 (Database issue): D653-9


    The Aspergillus Genome Database (AspGD) is a freely available, web-based resource for researchers studying fungi of the genus Aspergillus, which includes organisms of clinical, agricultural and industrial importance. AspGD curators have now completed comprehensive review of the entire published literature about Aspergillus nidulans and Aspergillus fumigatus, and this annotation is provided with streamlined, ortholog-based navigation of the multispecies information. AspGD facilitates comparative genomics by providing a full-featured genomics viewer, as well as matched and standardized sets of genomic information for the sequenced aspergilli. AspGD also provides resources to foster interaction and dissemination of community information and resources. We welcome and encourage feedback.

    View details for DOI 10.1093/nar/gkr875

    View details for PubMedID 22080559

    View details for PubMedCentralID PMC3245136

  • GC-Content Normalization for RNA-Seq Data BMC BIOINFORMATICS Risso, D., Schwartz, K., Sherlock, G., Dudoit, S. 2011; 12


    Transcriptome sequencing (RNA-Seq) has become the assay of choice for high-throughput studies of gene expression. However, as is the case with microarrays, major technology-related artifacts and biases affect the resulting expression measures. Normalization is therefore essential to ensure accurate inference of expression levels and subsequent analyses thereof. We focus on biases related to GC-content and demonstrate the existence of strong sample-specific GC-content effects on RNA-Seq read counts, which can substantially bias differential expression analysis. We propose three simple within-lane gene-level GC-content normalization approaches and assess their performance on two different RNA-Seq datasets, involving different species and experimental designs. Our methods are compared to state-of-the-art normalization procedures in terms of bias and mean squared error for expression fold-change estimation and in terms of Type I error and p-value distributions for tests of differential expression. The exploratory data analysis and normalization methods proposed in this article are implemented in the open-source Bioconductor R package EDASeq. Our within-lane normalization procedures, followed by between-lane normalization, reduce GC-content bias and lead to more accurate estimates of expression fold-changes and tests of differential expression. Such results are crucial for the biological interpretation of RNA-Seq experiments, where downstream analyses can be sensitive to the supplied lists of genes.

    View details for DOI 10.1186/1471-2105-12-480

    View details for Web of Science ID 000302434000001

    View details for PubMedID 22177264

  • Hunger Artists: Yeast Adapted to Carbon Limitation Show Trade-Offs under Carbon Sufficiency PLOS GENETICS Wenger, J. W., Piotrowski, J., Nagarajan, S., Chiotti, K., Sherlock, G., Rosenzweig, F. 2011; 7 (8)


    As organisms adaptively evolve to a new environment, selection results in the improvement of certain traits, bringing about an increase in fitness. Trade-offs may result from this process if function in other traits is reduced in alternative environments either by the adaptive mutations themselves or by the accumulation of neutral mutations elsewhere in the genome. Though the cost of adaptation has long been a fundamental premise in evolutionary biology, the existence of and molecular basis for trade-offs in alternative environments are not well-established. Here, we show that yeast evolved under aerobic glucose limitation show surprisingly few trade-offs when cultured in other carbon-limited environments, under either aerobic or anaerobic conditions. However, while adaptive clones consistently outperform their common ancestor under carbon limiting conditions, in some cases they perform less well than their ancestor in aerobic, carbon-rich environments, indicating that trade-offs can appear when resources are non-limiting. To more deeply understand how adaptation to one condition affects performance in others, we determined steady-state transcript abundance of adaptive clones grown under diverse conditions and performed whole-genome sequencing to identify mutations that distinguish them from one another and from their common ancestor. We identified mutations in genes involved in glucose sensing, signaling, and transport, which, when considered in the context of the expression data, help explain their adaptation to carbon poor environments. However, different sets of mutations in each independently evolved clone indicate that multiple mutational paths lead to the adaptive phenotype. We conclude that yeasts that evolve high fitness under one resource-limiting condition also become more fit under other resource-limiting conditions, but may pay a fitness cost when those same resources are abundant.

    View details for DOI 10.1371/journal.pgen.1002202

    View details for Web of Science ID 000294297000006

    View details for PubMedID 21829391

    View details for PubMedCentralID PMC3150441

  • DNA methylation profiling reveals novel biomarkers and important roles for DNA methyltransferases in prostate cancer GENOME RESEARCH Kobayashi, Y., Absher, D. M., Gulzar, Z. G., Young, S. R., McKenney, J. K., Peehl, D. M., Brooks, J. D., Myers, R. M., Sherlock, G. 2011; 21 (7): 1017-1027


    Candidate gene-based studies have identified a handful of aberrant CpG DNA methylation events in prostate cancer. However, DNA methylation profiles have not been compared on a large scale between prostate tumor and normal prostate, and the mechanisms behind these alterations are unknown. In this study, we quantitatively profiled 95 primary prostate tumors and 86 benign adjacent prostate tissue samples for their DNA methylation levels at 26,333 CpGs representing 14,104 gene promoters by using the Illumina HumanMethylation27 platform. A 2-class Significance Analysis of this data set revealed 5912 CpG sites with increased DNA methylation and 2151 CpG sites with decreased DNA methylation in tumors (FDR < 0.8%). Prediction Analysis of this data set identified 87 CpGs that are the most predictive diagnostic methylation biomarkers of prostate cancer. By integrating available clinical follow-up data, we also identified 69 prognostic DNA methylation alterations that correlate with biochemical recurrence of the tumor. To identify the mechanisms responsible for these genome-wide DNA methylation alterations, we measured the gene expression levels of several DNA methyltransferases (DNMTs) and their interacting proteins by TaqMan qPCR and observed increased expression of DNMT3A2, DNMT3B, and EZH2 in tumors. Subsequent transient transfection assays in cultured primary prostate cells revealed that DNMT3B1 and DNMT3B2 overexpression resulted in increased methylation of a substantial subset of CpG sites that showed tumor-specific increased methylation.

    View details for DOI 10.1101/gr.119487.110

    View details for Web of Science ID 000292298000003

    View details for PubMedID 21521786

    View details for PubMedCentralID PMC3129245

  • Integrated genomic analyses of ovarian carcinoma NATURE Bell, D., Berchuck, A., Birrer, M., Chien, J., Cramer, D. W., Dao, F., Dhir, R., Disaia, P., Gabra, H., Glenn, P., Godwin, A. K., Gross, J., Hartmann, L., Huang, M., Huntsman, D. G., Iacocca, M., Imielinski, M., Kalloger, S., Karlan, B. Y., Levine, D. A., Mills, G. B., Morrison, C., Mutch, D., Olvera, N., Orsulic, S., Park, K., Petrelli, N., Rabeno, B., Rader, J. S., Sikic, B. I., Smith-McCune, K., Sood, A. K., Bowtell, D., Penny, R., Testa, J. R., Chang, K., Dinh, H. H., Drummond, J. A., Fowler, G., Gunaratne, P., Hawes, A. C., Kovar, C. L., Lewis, L. R., Morgan, M. B., Newsham, I. F., Santibanez, J., Reid, J. G., Trevino, L. R., Wu, Y., Wang, M., Muzny, D. M., Wheeler, D. A., Gibbs, R. A., Getz, G., Lawrence, M. S., Cibulskis, K., Sivachenko, A. Y., Sougnez, C., Voet, D., Wilkinson, J., Bloom, T., Ardlie, K., Fennell, T., Baldwin, J., Gabriel, S., Lander, E. S., Ding, L., Fulton, R. S., Koboldt, D. C., McLellan, M. D., Wylie, T., Walker, J., O'Laughlin, M., Dooling, D. J., Fulton, L., Abbott, R., Dees, N. D., Zhang, Q., Kandoth, C., Wendl, M., Schierding, W., Shen, D., Harris, C. C., Schmidt, H., Kalicki, J., Delehaunty, K. D., Fronick, C. C., Demeter, R., Cook, L., Wallis, J. W., Lin, L., Magrini, V. J., Hodges, J. S., Eldred, J. M., Smith, S. M., Pohl, C. S., Vandin, F., Raphael, B. J., Weinstock, G. M., Mardis, R., Wilson, R. K., Meyerson, M., Winckler, W., Getz, G., Verhaak, R. G., Carter, S. L., Mermel, C. H., Saksena, G., Nguyen, H., Onofrio, R. C., Lawrence, M. S., Hubbard, D., Gupta, S., Crenshaw, A., Ramos, A. H., Ardlie, K., Chin, L., Protopopov, A., Zhang, J., Kim, T. M., Perna, I., Xiao, Y., Zhang, H., Ren, G., Sathiamoorthy, N., Park, R. W., Lee, E., Park, P. J., Kucherlapati, R., Absher, D. M., Waite, L., Sherlock, G., Brooks, J. D., Li, J. Z., Xu, J., Myers, R. M., Laird, P. W., Cope, L., Herman, J. G., Shen, H., Weisenberger, D. J., Noushmehr, H., Pan, F., Triche, T., Berman, B. P., Van den Berg, D. J., Buckley, J., Baylin, S. B., Spellman, P. T., Purdom, E., Neuvial, P., Bengtsson, H., Jakkula, L. R., Durinck, S., Han, J., Dorton, S., Marr, H., Choi, Y. G., Wang, V., Wang, N. J., Ngai, J., Conboy, J. G., Parvin, B., Feiler, H. S., Speed, T. P., Gray, J. W., Levine, D. A., Socci, N. D., Liang, Y., Taylor, B. S., Schultz, N., Borsu, L., Lash, A. E., Brennan, C., Viale, A., Sander, C., Ladanyi, M., Hoadley, K. A., Meng, S., Du, Y., Shi, Y., Li, L., Turman, Y. J., Zang, D., Helms, E. B., Balu, S., Zhou, X., Wu, J., Topal, M. D., Hayes, D. N., Perou, C. M., Getz, G., Voet, D., Saksena, G., Zhang, J., Zhang, H., Wu, C. J., Shukla, S., Cibulskis, K., Lawrence, M. S., Sivachenko, A., Jing, R., Park, R. W., Liu, Y., Park, P. J., Noble, M., Chin, L., Carter, H., Kim, D., Karchin, R., Spellman, P. T., Purdom, E., Neuvial, P., Bengtsson, H., Durinck, S., Han, J., Korkola, J. E., Heiser, L. M., Cho, R. J., Hu, Z., Parvin, B., Speed, T. P., Gray, J. W., Schultz, N., Cerami, E., Taylor, B. S., Olshen, A., Reva, B., Antipin, Y., Shen, R., Mankoo, P., Sheridan, R., Ciriello, G., Chang, W. K., Bernanke, J. A., Borsu, L., Levine, D. A., Ladanyi, M., Sander, C., Haussler, D., Benz, C. C., Stuart, J. M., Benz, S. C., Sanborn, J. Z., Vaske, C. J., Zhu, J., Szeto, C., Scott, G. K., Yau, C., Hoadley, K. A., Du, Y., Balu, S., Hayes, D. N., Perou, C. M., Wilkerson, M. D., Zhang, N., Akbani, R., Baggerly, K. A., Yung, W. K., Mills, G. B., Weinstein, J. N., Penny, R., Shelton, T., Grimm, D., Hatfield, M., Morris, S., Yena, P., Rhodes, P., Sherman, M., Paulauskis, J., Millis, S., Kahn, A., Greene, J. M., Sfeir, R., Jensen, M. A., Chen, J., Whitmore, J., Alonso, S., Jordan, J., Chu, A., Zhang, J., Barker, A., Compton, C., Eley, G., Ferguson, M., Fielding, P., Gerhard, D. S., Myles, R., Schaefer, C., Shaw, K. R., Vaught, J., Vockley, J. B., Good, P. J., Guyer, M. S., Ozenberger, B., Peterson, J., Thomson, E. 2011; 474 (7353): 609-615


    A catalogue of molecular aberrations that cause ovarian cancer is critical for developing and deploying therapies that will improve patients' lives. The Cancer Genome Atlas project has analysed messenger RNA expression, microRNA expression, promoter methylation and DNA copy number in 489 high-grade serous ovarian adenocarcinomas and the DNA sequences of exons from coding genes in 316 of these tumours. Here we report that high-grade serous ovarian cancer is characterized by TP53 mutations in almost all tumours (96%); low prevalence but statistically recurrent somatic mutations in nine further genes including NF1, BRCA1, BRCA2, RB1 and CDK12; 113 significant focal DNA copy number aberrations; and promoter methylation events involving 168 genes. Analyses delineated four ovarian cancer transcriptional subtypes, three microRNA subtypes, four promoter methylation subtypes and a transcriptional signature associated with survival duration, and shed new light on the impact that tumours with BRCA1/2 (BRCA1 or BRCA2) and CCNE1 aberrations have on survival. Pathway analyses suggested that homologous recombination is defective in about half of the tumours analysed, and that NOTCH and FOXM1 signalling are involved in serous ovarian cancer pathophysiology.

    View details for DOI 10.1038/nature10166

    View details for Web of Science ID 000292204300032

    View details for PubMedID 21720365

    View details for PubMedCentralID PMC3163504

  • A User's Guide to the Encyclopedia of DNA Elements (ENCODE) PLOS BIOLOGY Myers, R. M., Stamatoyannopoulos, J., Snyder, M., Dunham, I., Hardison, R. C., Bernstein, B. E., Gingeras, T. R., Kent, W. J., Birney, E., Wold, B., Crawford, G. E., Bernstein, B. E., Epstein, C. B., Shoresh, N., Ernst, J., Mikkelsen, T. S., Kheradpour, P., Zhang, X., Wang, L., Issner, R., Coyne, M. J., Durham, T., Ku, M., Thanh Truong, T., Ward, L. D., Altshuler, R. C., Lin, M. F., Kellis, M., Gingeras, T. R., Davis, C. A., Kapranov, P., Dobin, A., Zaleski, C., Schlesinger, F., Batut, P., Chakrabortty, S., Jha, S., Lin, W., Drenkow, J., Wang, H., Bell, K., Gao, H., Bell, I., Dumais, E., Dumais, J., Antonarakis, S. E., Ucla, C., Borel, C., Guigo, R., Djebali, S., Lagarde, J., Kingswood, C., Ribeca, P., Sammeth, M., Alioto, T., Merkel, A., Tilgner, H., Carninci, P., Hayashizaki, Y., Lassmann, T., Takahashi, H., Abdelhamid, R. F., Hannon, G., Fejes-Toth, K., Preall, J., Gordon, A., Sotirova, V., Reymond, A., Howald, C., Graison, E. A., Chrast, J., Ruan, Y., Ruan, X., Shahab, A., Poh, W. T., Wei, C., Crawford, G. E., Furey, T. S., Boyle, A. P., Sheffield, N. C., Song, L., Shibata, Y., Vales, T., Winter, D., Zhang, Z., London, D., Wang, T., Birney, E., Keefe, D., Iyer, V. R., Lee, B., McDaniell, R. M., Liu, Z., Battenhouse, A., Bhinge, A. A., Lieb, J. D., Grasfeder, L. L., Showers, K. A., Giresi, P. G., Kim, S. K., Shestak, C., Myers, R. M., Pauli, F., Reddy, T. E., Gertz, J., Partridge, E. C., Jain, P., Sprouse, R. O., Bansal, A., Pusey, B., Muratet, M. A., Varley, K. E., Bowling, K. M., Newberry, K. M., Nesmith, A. S., Dilocker, J. A., Parker, S. L., Waite, L. L., Thibeault, K., Roberts, K., Absher, D. M., Wold, B., Mortazavi, A., Williams, B., Marinov, G., Trout, D., Pepke, S., King, B., McCue, K., Kirilusha, A., DeSalvo, G., Fisher-Aylor, K., Amrhein, H., Vielmetter, J., Sherlock, G., Sidow, A., Batzoglou, S., Rauch, R., Kundaje, A., Libbrecht, M., Margulies, E. H., Parker, S. C., Elnitski, L., Green, E. D., Hubbard, T., Harrow, J., Searle, S., Kokocinski, F., Aken, B., Frankish, A., Hunt, T., Despacio-Reyes, G., Kay, M., Mukherjee, G., Bignell, A., Saunders, G., Boychenko, V., Brent, M., van Baren, M. J., Brown, R. H., Gerstein, M., Khurana, E., Balasubramanian, S., Zhang, Z., Lam, H., Cayting, P., Robilotto, R., Lu, Z., Guigo, R., Derrien, T., Tanzer, A., Knowles, D. G., Mariotti, M., Kent, W. J., Haussler, D., Harte, R., Diekhans, M., Kellis, M., Lin, M., Kheradpour, P., Ernst, J., Reymond, A., Howald, C., Graison, E. A., Chrast, J., Valencia, A., Tress, M., Manuel Rodriguez, J., Snyder, M., Landt, S. G., Raha, D., Shi, M., Euskirchen, G., Grubert, F., Kasowski, M., Lian, J., Cayting, P., Lacroute, P., Xu, Y., Monahan, H., Patacsil, D., Slifer, T., Yang, X., Charos, A., Reed, B., Wu, L., Auerbach, R. K., Habegger, L., Hariharan, M., Rozowsky, J., Abyzov, A., Weissman, S. M., Gerstein, M., Struhl, K., Lamarre-Vincent, N., Lindahl-Allen, M., Miotto, B., Moqtaderi, Z., Fleming, J. D., Newburger, P., Farnham, P. J., Frietze, S., O'Geen, H., Xu, X., Blahnik, K. R., Cao, A. R., Iyengar, S., Stamatoyannopoulos, J. A., Kaul, R., Thurman, R. E., Wang, H., Navas, P. A., Sandstrom, R., Sabo, P. J., Weaver, M., Canfield, T., Lee, K., Neph, S., Roach, V., Reynolds, A., Johnson, A., Rynes, E., Giste, E., Vong, S., Neri, J., Frum, T., Johnson, E. M., Nguyen, E. D., Ebersol, A. K., Sanchez, M. E., Sheffer, H. H., Lotakis, D., Haugen, E., Humbert, R., Kutyavin, T., Shafer, T., Dekker, J., Lajoie, B. R., Sanyal, A., Kent, W. J., Rosenbloom, K. R., Dreszer, T. R., Raney, B. J., Barber, G. P., Meyer, L. R., Sloan, C. A., Malladi, V. S., Cline, M. S., Learned, K., Swing, V. K., Zweig, A. S., Rhead, B., Fujita, P. A., Roskin, K., Karolchik, D., Kuhn, R. M., Haussler, D., Birney, E., Dunham, I., Wilder, S. P., Keefe, D., Sobral, D., Herrero, J., Beal, K., Lukk, M., Brazma, A., Vaquerizas, J. M., Luscombe, N. M., Bickel, P. J., Boley, N., Brown, J. B., Li, Q., Huang, H., Gerstein, M., Habegger, L., Sboner, A., Rozowsky, J., Auerbach, R. K., Yip, K. Y., Cheng, C., Yan, K., Bhardwaj, N., Wang, J., Lochovsky, L., Jee, J., Gibson, T., Leng, J., Du, J., Hardison, R. C., Harris, R. S., Song, G., Miller, W., Haussler, D., Roskin, K., Suh, B., Wang, T., Paten, B., Noble, W. S., Hoffman, M. M., Buske, O. J., Weng, Z., Dong, X., Wang, J., Xi, H., Tenenbaum, S. A., Doyle, F., Penalva, L. O., Chittur, S., Tullius, T. D., Parker, S. C., White, K. P., Karmakar, S., Victorsen, A., Jameel, N., Bild, N., Grossman, R. L., Snyder, M., Landt, S. G., Yang, X., Patacsil, D., Slifer, T., Dekker, J., Lajoie, B. R., Sanyal, A., Weng, Z., Whitfield, T. W., Wang, J., Collins, P. J., Trinklein, N. D., Partridge, E. C., Myers, R. M., Giddings, M. C., Chen, X., Khatun, J., Maier, C., Yu, Y., Gunawardena, H., Risk, B., Feingold, E. A., Lowdon, R. F., Dillon, L. A., Good, P. J. 2011; 9 (4)


    The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.

    View details for DOI 10.1371/journal.pbio.1001046

    View details for Web of Science ID 000289938900014

  • Reciprocal Sign Epistasis between Frequently Experimentally Evolved Adaptive Mutations Causes a Rugged Fitness Landscape PLOS GENETICS Kvitek, D. J., Sherlock, G. 2011; 7 (4)


    The fitness landscape captures the relationship between genotype and evolutionary fitness and is a pervasive metaphor used to describe the possible evolutionary trajectories of adaptation. However, little is known about the actual shape of fitness landscapes, including whether valleys of low fitness create local fitness optima, acting as barriers to adaptive change. Here we provide evidence of a rugged molecular fitness landscape arising during an evolution experiment in an asexual population of Saccharomyces cerevisiae. We identify the mutations that arose during the evolution using whole-genome sequencing and use competitive fitness assays to describe the mutations individually responsible for adaptation. In addition, we find that a fitness valley between two adaptive mutations in the genes MTH1 and HXT6/HXT7 is caused by reciprocal sign epistasis, where the fitness cost of the double mutant prohibits the two mutations from being selected in the same genetic background. The constraint enforced by reciprocal sign epistasis causes the mutations to remain mutually exclusive during the experiment, even though adaptive mutations in these two genes occur several times in independent lineages during the experiment. Our results show that epistasis plays a key role during adaptation and that inter-genic interactions can act as barriers between adaptive solutions. These results also provide a new interpretation on the classic Dobzhansky-Muller model of reproductive isolation and display some surprising parallels with mutations in genes often associated with tumors.

    View details for DOI 10.1371/journal.pgen.1002056

    View details for Web of Science ID 000289977000039

    View details for PubMedID 21552329

    View details for PubMedCentralID PMC3084205

  • Rapid Evolution of Simple Microbial Communities in the Laboratory 14th Evolutionary Biology Meeting Kinnersley, M., Wenger, J. W., Sherlock, G., Rosenzweig, F. R. SPRINGER-VERLAG BERLIN. 2011: 107–120
  • Rnnotator: an automated de novo transcriptome assembly pipeline from stranded RNA-Seq reads BMC GENOMICS Martin, J., Bruno, V. M., Fang, Z., Meng, X., Blow, M., Zhang, T., Sherlock, G., Snyder, M., Wang, Z. 2010; 11


    Comprehensive annotation and quantification of transcriptomes are outstanding problems in functional genomics. While high throughput mRNA sequencing (RNA-Seq) has emerged as a powerful tool for addressing these problems, its success is dependent upon the availability and quality of reference genome sequences, thus limiting the organisms to which it can be applied. Here, we describe Rnnotator, an automated software pipeline that generates transcript models by de novo assembly of RNA-Seq data without the need for a reference genome. We have applied the Rnnotator assembly pipeline to two yeast transcriptomes and compared the results to the reference gene catalogs of these organisms. The contigs produced by Rnnotator are highly accurate (95%) and reconstruct full-length genes for the majority of the existing gene models (54.3%). Furthermore, our analyses revealed many novel transcribed regions that are absent from well annotated genomes, suggesting Rnnotator serves as a complementary approach to analysis based on a reference genome for comprehensive transcriptomics. These results demonstrate that the Rnnotator pipeline is able to reconstruct full-length transcripts in the absence of a complete reference genome.

    View details for DOI 10.1186/1471-2164-11-663

    View details for Web of Science ID 000285303000001

    View details for PubMedID 21106091

    View details for PubMedCentralID PMC3152782

  • Comprehensive annotation of the transcriptome of the human fungal pathogen Candida albicans using RNA-seq GENOME RESEARCH Bruno, V. M., Wang, Z., Marjani, S. L., Euskirchen, G. M., Martin, J., Sherlock, G., Snyder, M. 2010; 20 (10): 1451-1458


    Candida albicans is the major invasive fungal pathogen of humans, causing diseases ranging from superficial mucosal infections to disseminated, systemic infections that are often life-threatening. We have used massively parallel high-throughput sequencing of cDNA (RNA-seq) to generate a high-resolution map of the C. albicans transcriptome under several different environmental conditions. We have quantitatively determined all of the regions that are transcribed under these different conditions, and have identified 602 novel transcriptionally active regions (TARs) and numerous novel introns that are not represented in the current genome annotation. Interestingly, the expression of many of these TARs is regulated in a condition-specific manner. This comprehensive transcriptome analysis significantly enhances the current genome annotation of C. albicans, a necessary framework for a complete understanding of the molecular mechanisms of pathogenesis for this important eukaryotic pathogen.

    View details for DOI 10.1101/gr.109553.110

    View details for Web of Science ID 000282375000015

    View details for PubMedID 20810668

    View details for PubMedCentralID PMC2945194

  • Annotare-a tool for annotating high-throughput biomedical investigations and resulting data BIOINFORMATICS Shankar, R., Parkinson, H., Burdett, T., Hastings, E., Liu, J., Miller, M., Srinivasa, R., White, J., Brazma, A., Sherlock, G., Stoeckert, C. J., Ball, C. A. 2010; 26 (19): 2470-2471


    Computational methods in molecular biology will increasingly depend on standards-based annotations that describe biological experiments in an unambiguous manner. Annotare is a software tool that enables biologists to easily annotate their high-throughput experiments, biomaterials and data in a standards-compliant way that facilitates meaningful search and analysis. Annotare is available under the terms of the open-source MIT License. It has been tested on both Mac and Windows.

    View details for DOI 10.1093/bioinformatics/btq462

    View details for Web of Science ID 000282170000021

    View details for PubMedID 20733062

    View details for PubMedCentralID PMC2944206

  • Microarray karyotyping of maltose-fermenting Saccharomyces yeasts with differing maltotriose utilization profiles reveals copy number variation in genes involved in maltose and maltotriose utilization JOURNAL OF APPLIED MICROBIOLOGY Duval, E. H., Alves, S. L., Dunn, B., Sherlock, G., Stambuk, B. U. 2010; 109 (1): 248-259


    We performed an analysis of maltotriose utilization by 52 Saccharomyces yeast strains able to ferment maltose efficiently and correlated the observed phenotypes with differences in the copy number of genes possibly involved in maltotriose utilization by yeast cells. The analysis of maltose and maltotriose utilization by laboratory and industrial strains of the species Saccharomyces cerevisiae and Saccharomyces pastorianus (a natural S. cerevisiae/Saccharomyces bayanus hybrid) was carried out using microscale liquid cultivation, as well as in aerobic batch cultures. All strains utilize maltose efficiently as a carbon source, but three different phenotypes were observed for maltotriose utilization: efficient growth, slow/delayed growth and no growth. Through microarray karyotyping and pulsed-field gel electrophoresis blots, we analysed the copy number and localization of several maltose-related genes in selected S. cerevisiae strains. While most strains lacked the MPH2 and MPH3 transporter genes, almost all strains analysed had the AGT1 gene and increased copy number of MALx1 permeases. Our results showed that S. pastorianus yeast strains utilized maltotriose more efficiently than S. cerevisiae strains and highlighted the importance of the AGT1 gene for efficient maltotriose utilization by S. cerevisiae yeasts. Our results revealed new maltotriose utilization phenotypes, contributing to a better understanding of the metabolism of this carbon source for improved fermentation by Saccharomyces yeasts.

    View details for DOI 10.1111/j.1365-2672.2009.04656.x

    View details for Web of Science ID 000278674300024

    View details for PubMedID 20070441

  • A Genome-Wide Analysis Reveals No Nuclear Dobzhansky-Muller Pairs of Determinants of Speciation between S. cerevisiae and S. paradoxus, but Suggests More Complex Incompatibilities PLOS GENETICS Kao, K. C., Schwartz, K., Sherlock, G. 2010; 6 (7)


    The Dobzhansky-Muller (D-M) model of speciation by genic incompatibility is widely accepted as the primary cause of interspecific postzygotic isolation. Since the introduction of this model, there have been theoretical and experimental data supporting the existence of such incompatibilities. However, speciation genes have been largely elusive, with only a handful of candidate genes identified in a few organisms. The Saccharomyces sensu stricto yeasts, which have small genomes and can mate interspecifically to produce sterile hybrids, are thus an ideal model for studying postzygotic isolation. Among them, only a single D-M pair, comprising a mitochondrially targeted product of a nuclear gene and a mitochondrially encoded locus, has been found. Thus far, no D-M pair of nuclear genes has been identified between any sensu stricto yeasts. We report here the first detailed genome-wide analysis of rare meiotic products from an otherwise sterile hybrid and show that no classic D-M pairs of speciation genes exist between the nuclear genomes of the closely related yeasts S. cerevisiae and S. paradoxus. Instead, our analyses suggest that more complex interactions, likely involving multiple loci having weak effects, may be responsible for their post-zygotic separation. The lack of a nuclear encoded classic D-M pair between these two yeasts, yet the existence of multiple loci that may each exert a small effect through complex interactions suggests that initial speciation events might not always be mediated by D-M pairs. An alternative explanation may be that the accumulation of polymorphisms leads to gamete inviability due to the activities of anti-recombination mechanisms and/or incompatibilities between the species' transcriptional and metabolic networks, with no single pair at least initially being responsible for the incompatibility. After such a speciation event, it is possible that one or more D-M pairs might subsequently arise following isolation.

    View details for DOI 10.1371/journal.pgen.1001038

    View details for Web of Science ID 000280512700034

    View details for PubMedID 20686707

    View details for PubMedCentralID PMC2912382

  • TB database 2010: Overview and update TUBERCULOSIS Galagan, J. E., Sisk, P., Stolte, C., Weiner, B., Koehrsen, M., Wymore, F., Reddy, T. B., Zucker, J. D., Engels, R., Gellesch, M., Hubble, J., Jin, H., Larson, L., Mao, M., Nitzberg, M., White, J., Zachariah, Z. K., Sherlock, G., Ball, C. A., Schoolnik, G. K. 2010; 90 (4): 225-235


    The Tuberculosis Database (TBDB) is an online database providing integrated access to genome sequence, expression data and literature curation for TB. TBDB currently houses genome assemblies for numerous strains of Mycobacterium tuberculosis (MTB) as well as assemblies for over 20 strains related to MTB and useful for comparative analysis. TBDB stores pre- and post-publication gene-expression data from M. tuberculosis and its close relatives, including over 3000 MTB microarrays, 95 RT-PCR datasets, 2700 microarrays for human and mouse TB related experiments, and 260 arrays for Streptomyces coelicolor. To enable wide use of these data, TBDB provides a suite of tools for searching, browsing, analyzing, and downloading the data. We provide here an overview of TBDB focusing on recent data releases and enhancements. In particular, we describe the recent release of a Global Genetic Diversity dataset for TB, support for short-read re-sequencing data, new tools for exploring gene expression data in the context of gene regulation, and the integration of a metabolic network reconstruction and BioCyc with TBDB. By integrating a wide range of genomic data with tools for their use, TBDB is a unique platform for both basic science research in TB, as well as research into the discovery and development of TB drugs, vaccines and biomarkers.

    View details for DOI 10.1016/

    View details for Web of Science ID 000280233900002

    View details for PubMedID 20488753

  • Bulk Segregant Analysis by High-Throughput Sequencing Reveals a Novel Xylose Utilization Gene from Saccharomyces cerevisiae PLOS GENETICS Wenger, J. W., Schwartz, K., Sherlock, G. 2010; 6 (5)


    Fermentation of xylose is a fundamental requirement for the efficient production of ethanol from lignocellulosic biomass sources. Although they aggressively ferment hexoses, it has long been thought that native Saccharomyces cerevisiae strains cannot grow fermentatively or non-fermentatively on xylose. Population surveys have uncovered a few naturally occurring strains that are weakly xylose-positive, and some S. cerevisiae have been genetically engineered to ferment xylose, but no strain, either natural or engineered, has yet been reported to ferment xylose as efficiently as glucose. Here, we used a medium-throughput screen to identify Saccharomyces strains that can increase in optical density when xylose is presented as the sole carbon source. We identified 38 strains that have this xylose utilization phenotype, including strains of S. cerevisiae, other sensu stricto members, and hybrids between them. All the S. cerevisiae xylose-utilizing strains we identified are wine yeasts, and for those that could produce meiotic progeny, the xylose phenotype segregates as a single gene trait. We mapped this gene by Bulk Segregant Analysis (BSA) using tiling microarrays and high-throughput sequencing. The gene is a putative xylitol dehydrogenase, which we name XDH1, and is located in the subtelomeric region of the right end of chromosome XV in a region not present in the S288c reference genome. We further characterized the xylose phenotype by performing gene expression microarrays and by genetically dissecting the endogenous Saccharomyces xylose pathway. We have demonstrated that natural S. cerevisiae yeasts are capable of utilizing xylose as the sole carbon source, characterized the genetic basis for this trait as well as the endogenous xylose utilization pathway, and demonstrated the feasibility of BSA using high-throughput sequencing.

    View details for DOI 10.1371/journal.pgen.1000942

    View details for Web of Science ID 000278557300012

    View details for PubMedID 20485559

    View details for PubMedCentralID PMC2869308

  • The Aspergillus Genome Database, a curated comparative genomics resource for gene, protein and sequence information for the Aspergillus research community NUCLEIC ACIDS RESEARCH Arnaud, M. B., Chibucos, M. C., Costanzo, M. C., Crabtree, J., Inglis, D. O., Lotia, A., Orvis, J., Shah, P., Skrzypek, M. S., Binkley, G., Miyasato, S. R., Wortman, J. R., Sherlock, G. 2010; 38: D420-D427


    The Aspergillus Genome Database (AspGD) is an online genomics resource for researchers studying the genetics and molecular biology of the Aspergilli. AspGD combines high-quality manual curation of the experimental scientific literature examining the genetics and molecular biology of Aspergilli, cutting-edge comparative genomics approaches to iteratively refine and improve structural gene annotations across multiple Aspergillus species, and web-based research tools for accessing and exploring the data. All of these data are freely available. We welcome feedback from users and the research community.

    View details for DOI 10.1093/nar/gkp751

    View details for Web of Science ID 000276399100066

    View details for PubMedID 19773420

    View details for PubMedCentralID PMC2808984

  • New tools at the Candida Genome Database: biochemical pathways and full-text literature search NUCLEIC ACIDS RESEARCH Skrzypek, M. S., Arnaud, M. B., Costanzo, M. C., Inglis, D. O., Shah, P., Binkley, G., Miyasato, S. R., Sherlock, G. 2010; 38: D428-D432


    The Candida Genome Database (CGD) provides online access to genomic sequence data and manually curated functional information about genes and proteins of the human pathogen Candida albicans. Herein, we describe two recently added features, Candida Biochemical Pathways and the Textpresso full-text literature search tool. The Biochemical Pathways tool provides visualization of metabolic pathways and analysis tools that facilitate interpretation of experimental data, including results of large-scale experiments, in the context of Candida metabolism. Textpresso for Candida allows searching through the full-text of Candida-specific literature, including clinical and epidemiological studies.

    View details for DOI 10.1093/nar/gkp836

    View details for Web of Science ID 000276399100067

    View details for PubMedID 19808938

    View details for PubMedCentralID PMC2808937

  • Industrial fuel ethanol yeasts contain adaptive copy number changes in genes involved in vitamin B1 and B6 biosynthesis GENOME RESEARCH Stambuk, B. U., Dunn, B., Alves, S. L., Duval, E. H., Sherlock, G. 2009; 19 (12): 2271-2278


    Fuel ethanol is now a global energy commodity that is competitive with gasoline. Using microarray-based comparative genome hybridization (aCGH), we have determined gene copy number variations (CNVs) common to five industrially important fuel ethanol Saccharomyces cerevisiae strains responsible for the production of billions of gallons of fuel ethanol per year from sugarcane. These strains have significant amplifications of the telomeric SNO and SNZ genes, which are involved in the biosynthesis of vitamins B6 (pyridoxine) and B1 (thiamin). We show that increased copy number of these genes confers the ability to grow more efficiently under the repressing effects of thiamin, especially in medium lacking pyridoxine and with high sugar concentrations. These genetic changes have likely been adaptive and selected for in the industrial environment, and may be required for the efficient utilization of biomass-derived sugars from other renewable feedstocks.

    View details for DOI 10.1101/gr.094276.109

    View details for Web of Science ID 000272273400011

    View details for PubMedID 19897511

    View details for PubMedCentralID PMC2792166

  • Gene Ontology and the annotation of pathogen genomes: the case of Candida albicans TRENDS IN MICROBIOLOGY Arnaud, M. B., Costanzo, M. C., Shah, P., Skrzypek, M. S., Sherlock, G. 2009; 17 (7): 295-303


    The Gene Ontology (GO) is a structured controlled vocabulary developed to describe the roles and locations of gene products in a consistent manner and in a way that can be shared across organisms. The unicellular fungus Candida albicans is similar in many ways to the model organism Saccharomyces cerevisiae but, as both a commensal and a pathogen of humans, differs greatly in its lifestyle. With an expanding at-risk population of immunosuppressed patients, increased use of invasive medical procedures, the increasing prevalence of drug resistance and the emergence of additional Candida species as serious pathogens, it has never been more crucial to improve our understanding of Candida biology to guide the development of better treatments. In this brief review, we examine the importance of GO in the annotation of C. albicans gene products, with a focus on those involved in pathogenesis. We also discuss how sequence information combined with GO facilitates the transfer of knowledge across related species and the challenges and opportunities that such an approach presents.

    View details for DOI 10.1016/j.tim.2009.04.007

    View details for Web of Science ID 000268616600006

    View details for PubMedID 19577928

  • Evolution of pathogenicity and sexual reproduction in eight Candida genomes NATURE Butler, G., Rasmussen, M. D., Lin, M. F., Santos, M. A., Sakthikumar, S., Munro, C. A., Rheinbay, E., Grabherr, M., Forche, A., Reedy, J. L., Agrafioti, I., Arnaud, M. B., Bates, S., Brown, A. J., Brunke, S., Costanzo, M. C., Fitzpatrick, D. A., De Groot, P. W., Harris, D., Hoyer, L. L., Hube, B., Klis, F. M., Kodira, C., Lennard, N., Logue, M. E., Martin, R., Neiman, A. M., Nikolaou, E., Quail, M. A., Quinn, J., Santos, M. C., Schmitzberger, F. F., Sherlock, G., Shah, P., Silverstein, K. A., Skrzypek, M. S., Soll, D., Staggs, R., Stansfield, I., Stumpf, M. P., Sudbery, P. E., Srikantha, T., Zeng, Q., Berman, J., Berriman, M., Heitman, J., Gow, N. A., Lorenz, M. C., Birren, B. W., Kellis, M., Cuomo, C. A. 2009; 459 (7247): 657-662


    Candida species are the most common cause of opportunistic fungal infection worldwide. Here we report the genome sequences of six Candida species and compare these and related pathogens and non-pathogens. There are significant expansions of cell wall, secreted and transporter gene families in pathogenic species, suggesting adaptations associated with virulence. Large genomic tracts are homozygous in three diploid species, possibly resulting from recent recombination events. Surprisingly, key components of the mating and meiosis pathways are missing from several species. These include major differences at the mating-type loci (MTL); Lodderomyces elongisporus lacks MTL, and components of the a1/2 cell identity determinant were lost in other species, raising questions about how mating and cell types are controlled. Analysis of the CUG leucine-to-serine genetic-code change reveals that 99% of ancestral CUG codons were erased and new ones arose elsewhere. Lastly, we revise the Candida albicans gene catalogue, identifying many new genes.

    View details for DOI 10.1038/nature08064

    View details for Web of Science ID 000266608600034

    View details for PubMedID 19465905

    View details for PubMedCentralID PMC2834264

  • TB database: an integrated platform for tuberculosis research NUCLEIC ACIDS RESEARCH Reddy, T. B., Riley, R., Wymore, F., Montgomery, P., DeCaprio, D., Engels, R., Gellesch, M., Hubble, J., Jen, D., Jin, H., Koehrsen, M., Larson, L., Mao, M., Nitzberg, M., Sisk, P., Stolte, C., Weiner, B., White, J., Zachariah, Z. K., Sherlock, G., Galagan, J. E., Ball, C. A., Schoolnik, G. K. 2009; 37: D499-D508


    The effective control of tuberculosis (TB) has been thwarted by the need for prolonged, complex and potentially toxic drug regimens, by reliance on an inefficient vaccine and by the absence of biomarkers of clinical status. The promise of the genomics era for TB control is substantial, but has been hindered by the lack of a central repository that collects and integrates genomic and experimental data about this organism in a way that can be readily accessed and analyzed. The Tuberculosis Database (TBDB) is an integrated database providing access to TB genomic data and resources, relevant to the discovery and development of TB drugs, vaccines and biomarkers. The current release of TBDB houses genome sequence data and annotations for 28 different Mycobacterium tuberculosis strains and related bacteria. TBDB stores pre- and post-publication gene-expression data from M. tuberculosis and its close relatives. TBDB currently hosts data for nearly 1500 public tuberculosis microarrays and 260 arrays for Streptomyces. In addition, TBDB provides access to a suite of comparative genomics and microarray analysis software. By bringing together M. tuberculosis genome annotation and gene-expression data with a suite of analysis tools, TBDB provides a unique discovery platform for TB research.

    View details for DOI 10.1093/nar/gkn652

    View details for PubMedID 18835847

  • Implementation of GenePattern within the Stanford Microarray Database NUCLEIC ACIDS RESEARCH Hubble, J., Demeter, J., Jin, H., Mao, M., Nitzberg, M., Reddy, T. B., Wymore, F., Zachariah, K., Sherlock, G., Ball, C. A. 2009; 37: D898-D901


    Hundreds of researchers across the world use the Stanford Microarray Database (SMD) to store, annotate, view, analyze and share microarray data. In addition to providing registered users at Stanford access to their own data, SMD also provides access to public data, and tools with which to analyze those data, to any public user anywhere in the world. Previously, the addition of new microarray data analysis tools to SMD has been limited by available engineering resources, and in addition, the existing suite of tools did not provide a simple way to design, execute and share analysis pipelines, or to document such pipelines for the purposes of publication. To address this, we have incorporated the GenePattern software package directly into SMD, providing access to many new analysis tools, as well as a plug-in architecture that allows users to directly integrate and share additional tools through SMD. In this article, we describe our implementation of the GenePattern microarray analysis software package into the SMD code base. This extension is available with the SMD source code that is fully and freely available to others under an Open Source license, enabling other groups to create a local installation of SMD with an enriched data analysis capability.

    View details for DOI 10.1093/nar/gkn786

    View details for Web of Science ID 000261906200157

    View details for PubMedID 18953035

    View details for PubMedCentralID PMC2686537

  • Novel Low Abundance and Transient RNAs in Yeast Revealed by Tiling Microarrays and Ultra High-Throughput Sequencing Are Not Conserved Across Closely Related Yeast Species PLOS GENETICS Lee, A., Hansen, K. D., Bullard, J., Dudoit, S., Sherlock, G. 2008; 4 (12)


    A complete description of the transcriptome of an organism is crucial for a comprehensive understanding of how it functions and how its transcriptional networks are controlled, and may provide insights into the organism's evolution. Despite the status of Saccharomyces cerevisiae as arguably the most well-studied model eukaryote, we still do not have a full catalog or understanding of all its genes. In order to interrogate the transcriptome of S. cerevisiae for low abundance or rapidly turned over transcripts, we deleted elements of the RNA degradation machinery with the goal of preferentially increasing the relative abundance of such transcripts. We then used high-resolution tiling microarrays and ultra high-throughput sequencing (UHTS) to identify, map, and validate unannotated transcripts that are more abundant in the RNA degradation mutants relative to wild-type cells. We identified 365 currently unannotated transcripts, the majority presumably representing low abundance or short-lived RNAs, of which 185 are previously unknown and unique to this study. It is likely that many of these are cryptic unstable transcripts (CUTs), which are rapidly degraded and whose function(s) within the cell are still unclear, while others may be novel functional transcripts. Of the 185 transcripts we identified as novel to our study, greater than 80 percent come from regions of the genome that have lower conservation scores amongst closely related yeast species than 85 percent of the verified ORFs in S. cerevisiae. Such regions of the genome have typically been less well-studied, and by definition transcripts from these regions will distinguish S. cerevisiae from these closely related species.

    View details for DOI 10.1371/journal.pgen.1000299

    View details for Web of Science ID 000263667900014

    View details for PubMedID 19096707

    View details for PubMedCentralID PMC2601015

  • Molecular characterization of clonal interference during adaptive evolution in asexual populations of Saccharomyces cerevisiae NATURE GENETICS Kao, K. C., Sherlock, G. 2008; 40 (12): 1499-1504


    The classical model of adaptive evolution in an asexual population postulates that each adaptive clone is derived from the one preceding it. However, experimental evidence has suggested more complex dynamics, with theory predicting the fixation probability of a beneficial mutation as dependent on the mutation rate, population size and the mutation's selection coefficient. Clonal interference has been demonstrated in viruses and bacteria but not in a eukaryote, and a detailed molecular characterization is lacking. Here we use three different fluorescent markers to visualize the dynamics of asexually evolving yeast populations. For each adaptive clone within one of our evolving populations, we identified the underlying mutations, monitored their population frequencies and used microarrays to characterize changes in the transcriptome. These results represent the most detailed molecular characterization of experimental evolution to date and provide direct experimental evidence supporting both the clonal interference and the multiple mutation models.

    View details for DOI 10.1038/ng.280

    View details for Web of Science ID 000261215900030

    View details for PubMedID 19029899

  • Changes to NIH Grant System May Backfire SCIENCE Karp, P. D., Sherlock, G., Gerlt, J. A., Sim, I., Paulsen, I., Babbitt, P. C., Laderoute, K., Hunter, L., Sternberg, P., Wooley, J., Bourne, P. E. 2008; 322 (5905): 1187-1188

    View details for Web of Science ID 000261033400017

    View details for PubMedID 19023064

  • Comprehensive genomic characterization defines human glioblastoma genes and core pathways NATURE Chin, L., Meyerson, M., Aldape, K., Bigner, D., Mikkelsen, T., VandenBerg, S., Kahn, A., Penny, R., Ferguson, M. L., Gerhard, D. S., Getz, G., Brennan, C., Taylor, B. S., Winckler, W., Park, P., Ladanyi, M., Hoadley, K. A., Verhaak, R. G., Hayes, D. N., Spellman, P. T., Absher, D., Weir, B. A., Ding, L., Wheeler, D., Lawrence, M. S., Cibulskis, K., Mardis, E., Zhang, J., Wilson, R. K., Donehower, L., Wheeler, D. A., Purdom, E., Wallis, J., Laird, P. W., Herman, J. G., Schuebel, K. E., Weisenberger, D. J., Baylin, S. B., Schultz, N., Yao, J., Wiedemeyer, R., Weinstein, J., Sander, C., Gibbs, R. A., Gray, J., Kucherlapati, R., Lander, E. S., Myers, R. M., Perou, C. M., McLendon, R., Friedman, A., Van Meir, E. G., Brat, D. J., Mastrogianakis, G. M., Olson, J. J., Lehman, N., Yung, W. K., Bogler, O., Berger, M., Prados, M., Muzny, D., Morgan, M., Scherer, S., Sabo, A., Nazareth, L., Lewis, L., Hall, O., Zhu, Y., Ren, Y., Alvi, O., Yao, J., Hawes, A., Jhangiani, S., Fowler, G., San Lucas, A., Kovar, C., Cree, A., Dinh, H., Santibanez, J., Joshi, V., Gonzalez-Garay, M. L., Miller, C. A., Milosavljevic, A., Sougnez, C., Fennell, T., Mahan, S., Wilkinson, J., Ziaugra, L., Onofrio, R., Bloom, T., Nicol, R., Ardlie, K., Baldwin, J., Gabriel, S., Fulton, R. S., McLellan, M. D., Larson, D. E., Shi, X., Abbott, R., Fulton, L., Chen, K., Koboldt, D. C., Wendl, M. C., Meyer, R., Tang, Y., Lin, L., Osborne, J. R., Dunford-Shore, B. H., Miner, T. L., Delehaunty, K., Markovic, C., Swift, G., Courtney, W., Pohl, C., Abbott, S., Hawkins, A., Leong, S., Haipek, C., Schmidt, H., Wiechert, M., Vickery, T., Scott, S., Dooling, D. J., Chinwalla, A., Weinstock, G. M., O'Kelly, M., Robinson, J., Alexe, G., Beroukhim, R., Carter, S., Chiang, D., Gould, J., Gupta, S., Korn, J., Mermel, C., Mesirov, J., Monti, S., Nguyen, H., Parkin, M., Reich, M., Stransky, N., Garraway, L., Golub, T., Protopopov, A., Perna, I., Aronson, S., Sathiamoorthy, N., Ren, G., Kim, H., Kong, S. W., Xiao, Y., Kohane, I. S., Seidman, J., Cope, L., Pan, F., Van Den Berg, D., van Neste, L., Yi, J. M., Li, J. Z., Southwick, A., Brady, S., Aggarwal, A., Chung, T., Sherlock, G., Brooks, J. D., Jakkula, L. R., Lapuk, A. V., Marr, H., Dorton, S., Choi, Y. G., Han, J., Ray, A., Wang, V., Durinck, S., Robinson, M., Wang, N. J., Vranizan, K., Peng, V., Van Name, E., Fontenay, G. V., Ngai, J., Conboy, J. G., Parvin, B., Feiler, H. S., Speed, T. P., Socci, N. D., Olshen, A., Lash, A., Reva, B., Antipin, Y., Stukalov, A., Gross, B., Cerami, E., Wang, W. Q., Qin, L., Seshan, V. E., Villafania, L., Cavatore, M., Borsu, L., Viale, A., Gerald, W., Topal, M. D., Qi, Y., Balu, S., Shi, Y., Wu, G., Bittner, M., Shelton, T., Lenkiewicz, E., Morris, S., Beasley, D., Sanders, S., Sfeir, R., Chen, J., Nassau, D., Feng, L., Hickey, E., Schaefer, C., Madhavan, S., Buetow, K., Barker, A., Vockley, J., Compton, C., Vaught, J., Fielding, P., Collins, F., Good, P., Guyer, M., Ozenberger, B., Peterson, J., Thomson, E. 2008; 455 (7216): 1061-1068


    Human cancer cells typically harbour multiple chromosomal aberrations, nucleotide substitutions and epigenetic modifications that drive malignant transformation. The Cancer Genome Atlas (TCGA) pilot project aims to assess the value of large-scale multi-dimensional analysis of these molecular characteristics in human cancer and to provide the data rapidly to the research community. Here we report the interim integrative analysis of DNA copy number, gene expression and DNA methylation aberrations in 206 glioblastomas--the most common type of adult brain cancer--and nucleotide sequence aberrations in 91 of the 206 glioblastomas. This analysis provides new insights into the roles of ERBB2, NF1 and TP53, uncovers frequent mutations of the phosphatidylinositol-3-OH kinase regulatory subunit gene PIK3R1, and provides a network view of the pathways altered in the development of glioblastoma. Furthermore, integration of mutation, DNA methylation and clinical treatment data reveals a link between MGMT promoter methylation and a hypermutator phenotype consequent to mismatch repair deficiency in treated glioblastomas, an observation with potential clinical implications. Together, these findings establish the feasibility and power of TCGA, demonstrating that it can rapidly expand knowledge of the molecular basis of cancer.

    View details for DOI 10.1038/nature07385

    View details for Web of Science ID 000260252600035

    View details for PubMedID 18772890

  • Reconstruction of the genome origins and evolution of the hybrid lager yeast Saccharomyces pastorianus GENOME RESEARCH Dunn, B., Sherlock, G. 2008; 18 (10): 1610-1623


    Inter-specific hybridization leading to abrupt speciation is a well-known, common mechanism in angiosperm evolution; only recently, however, have similar hybridization and speciation mechanisms been documented to occur frequently among the closely related group of sensu stricto Saccharomyces yeasts. The economically important lager beer yeast Saccharomyces pastorianus is such a hybrid, formed by the union of Saccharomyces cerevisiae and Saccharomyces bayanus-related yeasts; efforts to understand its complex genome, searching for both biological and brewing-related insights, have been underway since its hybrid nature was first discovered. It had been generally thought that a single hybridization event resulted in a unique S. pastorianus species, but it has been recently postulated that there have been two or more hybridization events. Here, we show that there may have been two independent origins of S. pastorianus strains, and that each independent group--defined by characteristic genome rearrangements, copy number variations, ploidy differences, and DNA sequence polymorphisms--is correlated with specific breweries and/or geographic locations. Finally, by reconstructing common ancestral genomes via array-CGH data analysis and by comparing representative DNA sequences of the S. pastorianus strains with those of many different S. cerevisiae isolates, we have determined that the most likely S. cerevisiae ancestral parent for each of the independent S. pastorianus groups was an ale yeast, with different, but closely related ale strains contributing to each group's parentage.

    View details for DOI 10.1101/gr.076075.108

    View details for Web of Science ID 000259700800008

    View details for PubMedID 18787083

    View details for PubMedCentralID PMC2556262

  • Minimum information specification for in situ hybridization and immunohistochemistry experiments (MISFISHIE) NATURE BIOTECHNOLOGY Deutsch, E. W., Ball, C. A., Berman, J. J., Bova, G. S., Brazma, A., Bumgarner, R. E., Campbell, D., Causton, H. C., Christiansen, J. H., Daian, F., Dauga, D., Davidson, D. R., Gimenez, G., Goo, Y. A., Grimmond, S., Henrich, T., Herrmann, B. G., Johnson, M. H., Korb, M., Mills, J. C., Oudes, A. J., Parkinson, H. E., Pascal, L. E., Pollet, N. I., Quackenbush, J., Ramialison, M., Ringwald, M., Salgado, D., Sansone, S., Sherlock, G., Stoeckert, C. J., Swedlow, J., Taylor, R. C., Walashek, L., Warford, A., Wilkinson, D. G., Zhou, Y., Zon, L. I., Liu, A. Y., True, L. D. 2008; 26 (3): 305-312


    One purpose of the biomedical literature is to report results in sufficient detail that the methods of data collection and analysis can be independently replicated and verified. Here we present reporting guidelines for gene expression localization experiments: the minimum information specification for in situ hybridization and immunohistochemistry experiments (MISFISHIE). MISFISHIE is modeled after the Minimum Information About a Microarray Experiment (MIAME) specification for microarray experiments. Both guidelines define what information should be reported without dictating a format for encoding that information. MISFISHIE describes six types of information to be provided for each experiment: experimental design, biomaterials and treatments, reporters, staining, imaging data and image characterizations. This specification has benefited the consortium within which it was developed and is expected to benefit the wider research community. We welcome feedback from the scientific community to help improve our proposal.

    View details for DOI 10.1038/nbt1391

    View details for Web of Science ID 000254123400023

    View details for PubMedID 18327244

  • Isolation and molecular characterization of cancer stem cells in MMTV-Wnt-1 murine breast tumors STEM CELLS Cho, R. W., Wang, X., Diehn, M., Shedden, K., Chen, G. Y., Sherlock, G., Gurney, A., Lewicki, J., Clarke, M. F. 2008; 26 (2): 364-371


    In human breast cancers, a phenotypically distinct minority population of tumorigenic (TG) cancer cells (sometimes referred to as cancer stem cells) drives tumor growth when transplanted into immunodeficient mice. Our objective was to identify a mouse model of breast cancer stem cells that could have relevance to the study of human breast cancer. To do so, we used breast tumors of the mouse mammary tumor virus (MMTV)-Wnt-1 mice. MMTV-Wnt-1 breast tumors were harvested, dissociated into single-cell suspensions, and sorted by flow cytometry on Thy1, CD24, and CD45. Sorted cells were then injected into recipient background FVB/NJ female syngeneic mice. In six of seven tumors examined, Thy1+CD24+ cancer cells, which constituted approximately 1%-4% of tumor cells, were highly enriched for cells capable of regenerating new tumors compared with cells of the tumor that did not fit this profile ("not-Thy1+CD24+"). Resultant tumors had a phenotypic diversity similar to that of the original tumor and behaved in a similar manner when passaged. Microarray analysis comparing Thy1+CD24+ tumor cells to not-Thy1+CD24+ cells identified a list of differentially expressed genes. Orthologs of these differentially expressed genes predicted survival of human breast cancer patients from two different study groups. These studies suggest that there is a cancer stem cell compartment in the MMTV-Wnt-1 murine breast tumor and that there is a clinical utility of this model for the study of cancer stem cells.

    View details for DOI 10.1634/stemcells.2007-0440

    View details for Web of Science ID 000253372600008

    View details for PubMedID 17975224

  • The XBabelPhish MAGE-ML and XML translator BMC BIOINFORMATICS Maier, D., Wymore, F., Sherlock, G., Ball, C. A. 2008; 9


    MAGE-ML has been promoted as a standard format for describing microarray experiments and the data they produce. Two characteristics of the MAGE-ML format compromise its use as a universal standard: First, MAGE-ML files are exceptionally large - too large to be easily read by most people, and often too large to be read by most software programs. Second, the MAGE-ML standard permits many ways of representing the same information. As a result, different producers of MAGE-ML create different documents describing the same experiment and its data. Recognizing all the variants is an unwieldy software engineering task, resulting in software packages that can read and process MAGE-ML from some, but not all producers. This Tower of MAGE-ML Babel bars the unencumbered exchange of microarray experiment descriptions couched in MAGE-ML. We have developed XBabelPhish - an XQuery-based technology for translating one MAGE-ML variant into another. XBabelPhish's use is not restricted to translating MAGE-ML documents. It can transform XML files independent of their DTD, XML schema, or semantic content. Moreover, it is designed to work on very large (> 200 Mb.) files, which are common in the world of MAGE-ML. XBabelPhish provides a way to inter-translate MAGE-ML variants for improved interchange of microarray experiment information. More generally, it can be used to transform most XML files, including very large ones that exceed the capacity of most XML tools.

    View details for DOI 10.1186/1471-2105-9-28

    View details for Web of Science ID 000253159700001

    View details for PubMedID 18205924

    View details for PubMedCentralID PMC2233607

  • The Stanford Tissue Microarray Database NUCLEIC ACIDS RESEARCH Marinelli, R. J., Montgomery, K., Liu, C. L., Shah, N. H., Prapong, W., Nitzberg, M., Zachariah, Z. K., Sherlock, G. J., Natkunam, Y., West, R. B., van de Rijn, M., Brown, P. O., Ball, C. A. 2008; 36: D871-D877


    The Stanford Tissue Microarray Database (TMAD) is a public resource for disseminating annotated tissue images and associated expression data. Stanford University pathologists, researchers and their collaborators worldwide use TMAD for designing, viewing, scoring and analyzing their tissue microarrays. The use of tissue microarrays allows hundreds of human tissue cores to be simultaneously probed by antibodies to detect protein abundance (Immunohistochemistry; IHC), or by labeled nucleic acids (in situ hybridization; ISH) to detect transcript abundance. TMAD archives multi-wavelength fluorescence and bright-field images of tissue microarrays for scoring and analysis. As of July 2007, TMAD contained 205 161 images archiving 349 distinct probes on 1488 tissue microarray slides. Of these, 31 306 images for 68 probes on 125 slides have been released to the public. To date, 12 publications have been based on these raw public data. TMAD incorporates the NCI Thesaurus ontology for searching tissues in the cancer domain. Image processing researchers can extract images and scores for training and testing classification algorithms. The production server uses the Apache HTTP Server, Oracle Database and Perl application code. Source code is available to interested researchers under a no-cost license.

    View details for DOI 10.1093/nar/gkm861

    View details for PubMedID 17989087

  • OntologyWidget - a reusable, embeddable widget for easily locating ontology terms BMC BIOINFORMATICS Beauheim, C. C., Wymore, F., Nitzberg, M., Zachariah, Z. K., Jin, H., Skene, J. H., Ball, C. A., Sherlock, G. 2007; 8


    Biomedical ontologies are being widely used to annotate biological data in a computer-accessible, consistent and well-defined manner. However, due to their size and complexity, annotating data with appropriate terms from an ontology is often challenging for experts and non-experts alike, because there exist few tools that allow one to quickly find relevant ontology terms to easily populate a web form. We have produced a tool, OntologyWidget, which allows users to rapidly search for and browse ontology terms. OntologyWidget can easily be embedded in other web-based applications. OntologyWidget is written using AJAX (Asynchronous JavaScript and XML) and has two related elements. The first is a dynamic auto-complete ontology search feature. As a user enters characters into the search box, the appropriate ontology is queried remotely for terms that match the typed-in text, and the query results populate a drop-down list with all potential matches. Upon selection of a term from the list, the user can locate this term within a generic and dynamic ontology browser, which comprises the second element of the tool. The ontology browser shows the paths from a selected term to the root as well as parent/child tree hierarchies. We have implemented web services at the Stanford Microarray Database (SMD), which provide the OntologyWidget with access to over 40 ontologies from the Open Biological Ontology (OBO) website 1. Each ontology is updated weekly. Adopters of the OntologyWidget can either use SMD's web services, or elect to rely on their own.
Deploying the OntologyWidget can be accomplished in three simple steps: (1) install Apache Tomcat 2 on one's web server, (2) download and install the OntologyWidget servlet stub that provides access to the SMD ontology web services, and (3) create an html (HyperText Markup Language) file that refers to the OntologyWidget using a simple, well-defined format. We have developed OntologyWidget, an easy-to-use ontology search and display tool that can be used on any web page by creating a simple html description. OntologyWidget provides a rapid auto-complete search function paired with an interactive tree display. We have developed a web service layer that communicates between the web page interface and a database of ontology terms. We currently store 40 of the ontologies from the OBO website 1, as well as several others. These ontologies are automatically updated on a weekly basis. OntologyWidget can be used in any web-based application to take advantage of the ontologies we provide via web services or any other ontology that is provided elsewhere in the correct format. The full source code for the JavaScript and description of the OntologyWidget is available from

    View details for DOI 10.1186/1471-2105-8-338

    View details for Web of Science ID 000250989100001

    View details for PubMedID 17854506

    View details for PubMedCentralID PMC2080642

  • The prognostic role of a gene signature from tumorigenic breast-cancer cells. NEW ENGLAND JOURNAL OF MEDICINE Liu, R., Wang, X., Chen, G. Y., Dalerba, P., Gurney, A., Hoey, T., Sherlock, G., Lewicki, J., Shedden, K., Clarke, M. F. 2007; 356 (3): 217-226


    Breast cancers contain a minority population of cancer cells characterized by CD44 expression but low or undetectable levels of CD24 (CD44+CD24-/low) that have higher tumorigenic capacity than other subtypes of cancer cells. We compared the gene-expression profile of CD44+CD24-/low tumorigenic breast-cancer cells with that of normal breast epithelium. Differentially expressed genes were used to generate a 186-gene "invasiveness" gene signature (IGS), which was evaluated for its association with overall survival and metastasis-free survival in patients with breast cancer or other types of cancer. There was a significant association between the IGS and both overall and metastasis-free survival (P<0.001, for both) in patients with breast cancer, which was independent of established clinical and pathological variables. When combined with the prognostic criteria of the National Institutes of Health, the IGS was used to stratify patients with high-risk early breast cancer into prognostic categories (good or poor); among patients with a good prognosis, the 10-year rate of metastasis-free survival was 81%, and among those with a poor prognosis, it was 57%. The IGS was also associated with the prognosis in medulloblastoma (P=0.004), lung cancer (P=0.03), and prostate cancer (P=0.01). The prognostic power of the IGS was increased when combined with the wound-response (WR) signature. The IGS is strongly associated with metastasis-free survival and overall survival for four different types of tumors. This genetic signature of tumorigenic breast-cancer cells was even more strongly associated with clinical outcomes when combined with the WR signature in breast cancer.

    View details for Web of Science ID 000243488100004

    View details for PubMedID 17229949

  • The Stanford Microarray Database: implementation of new analysis tools and open source release of software NUCLEIC ACIDS RESEARCH Demeter, J., Beauheim, C., Gollub, J., Hernandez-Boussard, T., Jin, H., Maier, D., Matese, J. C., Nitzberg, M., Wymore, F., Zachariah, Z. K., Brown, P. O., Sherlock, G., Ball, C. A. 2007; 35: D766-D770


    The Stanford Microarray Database (SMD) is a research tool and archive that allows hundreds of researchers worldwide to store, annotate, analyze and share data generated by microarray technology. SMD supports most major microarray platforms, and is MIAME-supportive and can export or import MAGE-ML. The primary mission of SMD is to be a research tool that supports researchers from the point of data generation to data publication and dissemination, but it also provides unrestricted access to analysis tools and public data from 300 publications. In addition to supporting ongoing research, SMD makes its source code fully and freely available to others under an Open Source license, enabling other groups to create a local installation of SMD. In this article, we describe several data analysis tools implemented in SMD and we discuss features of our software release.

    View details for DOI 10.1093/nar/gkl1019

    View details for Web of Science ID 000243494600151

    View details for PubMedID 17182626

    View details for PubMedCentralID PMC1781111

  • Sequence resources at the Candida genome database NUCLEIC ACIDS RESEARCH Arnaud, M. B., Costanzo, M. C., Skrzypek, M. S., Shah, P., Binkley, G., Lane, C., Miyasato, S. R., Sherlock, G. 2007; 35: D452-D456


    The Candida Genome Database (CGD) contains a curated collection of genomic information and community resources for researchers who are interested in the molecular biology of the opportunistic pathogen Candida albicans. With the recent release of a new assembly of the C.albicans genome, Assembly 20, C.albicans genomics has entered a new era. Although the C.albicans genome assembly continues to undergo refinement, multiple assemblies and gene nomenclatures will remain in widespread use by the research community. CGD has now taken on the responsibility of maintaining the most up-to-date version of the genome sequence by providing the data from this new assembly alongside the data from the previous assemblies, as well as any future corrections and refinements. In this database update, we describe the sequence information available for C.albicans, the sequence information contained in CGD, and the tools for sequence retrieval, analysis and comparison that CGD provides. CGD is freely accessible at and CGD curators may be contacted by email at

    View details for DOI 10.1093/nar/gkl899

    View details for Web of Science ID 000243494600092

    View details for PubMedID 17090582

    View details for PubMedCentralID PMC1669745

  • A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB BMC BIOINFORMATICS Rayner, T. F., Rocca-Serra, P., Spellman, P. T., Causton, H. C., Farne, A., Holloway, E., Irizarry, R. A., Liu, J., Maier, D. S., Miller, M., Petersen, K., Quackenbush, J., Sherlock, G., Stoeckert, C. J., White, J., Whetzel, P. L., Wymore, F., Parkinson, H., Sarkans, U., Ball, C. A., Brazma, A. 2006; 7


    Sharing of microarray data within the research community has been greatly facilitated by the development of the disclosure and communication standards MIAME and MAGE-ML by the MGED Society. However, the complexity of the MAGE-ML format has made its use impractical for laboratories lacking dedicated bioinformatics support. We propose a simple tab-delimited, spreadsheet-based format, MAGE-TAB, which will become a part of the MAGE microarray data standard and can be used for annotating and communicating microarray data in a MIAME compliant fashion. MAGE-TAB will enable laboratories without bioinformatics experience or support to manage, exchange and submit well-annotated microarray data in a standard format using a spreadsheet. The MAGE-TAB format is self-contained, and does not require an understanding of MAGE-ML or XML.

    View details for DOI 10.1186/1471-2105-7-489

    View details for Web of Science ID 000242642800001

    View details for PubMedID 17087822

    View details for PubMedCentralID PMC1687205

  • Cell cycle - Complex evolution NATURE Sherlock, G. 2006; 443 (7111): 513-?

    View details for DOI 10.1038/443513a

    View details for Web of Science ID 000240988200026

    View details for PubMedID 17024077

  • The Candida Genome Database: Facilitating research on Candida albicans molecular biology FEMS YEAST RESEARCH Costanzo, M. C., Arnaud, M. B., Skrzypek, M. S., Binkley, G., Lane, C., Miyasato, S. R., Sherlock, G. 2006; 6 (5): 671-684


    The Candida Genome Database (CGD) is a resource for information about the Candida albicans genomic sequence and the molecular biology of its encoded gene products. CGD collects and organizes data from the biological literature concerning C. albicans, and provides tools for viewing, searching, analysing, and downloading these data. CGD also serves as an organizing centre for the C. albicans research community, providing a gene-name registry, contact information, and research community news. This article describes the information contained in CGD and how to access it, either from the perspective of a bench scientist interested in the function of one or a few genes, or from the perspective of a biologist or bioinformatician interpreting large-scale functional genomic datasets.

    View details for DOI 10.1111/j.1567-1364.2006.000074.x

    View details for Web of Science ID 000239004600001

    View details for PubMedID 16879419

  • Radiation-induced effects on gene expression: An in vivo study on breast cancer 3rd International Conference on Translational Research and Pre-Clinical Strategies in Radiation Oncology Helland, A., Johnsen, H., Froyland, C., Landmark, H. B., Saetersdal, A. B., Holmen, M. M., Gjertsen, T., Nesland, J. M., Ottestad, W., Jeffrey, S. S., Ottestad, L. O., Rodningen, O. K., Sherlock, G., Borresen-Dale, A. ELSEVIER IRELAND LTD. 2006: 230–35


    Breast cancer is diagnosed worldwide in approximately one million women annually and radiation therapy is an integral part of treatment. The purpose of this study was to investigate the molecular basis underlying response to radiotherapy in breast cancer tissue. Tumour biopsies were sampled before radiation and after 10 treatments (of 2 Gray (Gy) each) from 19 patients with breast cancer receiving radiation therapy. Gene expression microarray analyses were performed to identify in vivo radiation-responsive genes in tumours from patients diagnosed with breast cancer. The mutation status of the TP53 gene was determined by using direct sequencing. Several genes involved in cell cycle regulation and DNA repair were found to be significantly induced by radiation treatment. Mutations were found in the TP53 gene in 39% of the tumours and the gene expression profiles observed seemed to be influenced by the TP53 mutation status.

    View details for DOI 10.1016/j.radonc.2006.07.007

    View details for Web of Science ID 000240882300018

    View details for PubMedID 16890317

  • Development of the Minimum Information Specification for in situ Hybridization and Immunohistochemistry Experiments (MISFISHIE) OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY Deutsch, E. W., Ball, C. A., Bova, G. S., Brazma, A., Bumgarner, R. E., Campbell, D., Causton, H. C., Christiansen, J., Davidson, D., Eichner, L. J., Goo, Y. A., Grimmond, S., Henrich, T., Johnson, M. H., Korb, M., Mills, J. C., Oudes, A., Parkinson, H. E., Pascal, L. E., Quackenbush, J., Ramialison, M., Ringwald, M., Sansone, S., Sherlock, G., Stoeckert, C. J., Swedlow, J., Taylor, R. C., Walashek, L., Zhou, Y., Liu, A. Y., True, L. D. 2006; 10 (2): 205-208


    We describe the creation process of the Minimum Information Specification for In Situ Hybridization and Immunohistochemistry Experiments (MISFISHIE). Modeled after the existing minimum information specification for microarray data, we created a new specification for gene expression localization experiments, initially to facilitate data sharing within a consortium. After successful use within the consortium, the specification was circulated to members of the wider biomedical research community for comment and refinement. After a period of acquiring many new suggested requirements, it was necessary to enter a final phase of excluding those requirements that were deemed inappropriate as a minimum requirement for all experiments. The full specification will soon be published as a version 1.0 proposal to the community, upon which a more full discussion must take place so that the final specification may be achieved with the involvement of the whole community.

    View details for Web of Science ID 000240210900017

    View details for PubMedID 16901227

  • Top-down standards will not serve systems biology NATURE Quackenbush, J. 2006; 440 (7080): 24-24

    View details for DOI 10.1038/440024a

    View details for Web of Science ID 000235685700017

    View details for PubMedID 16511469

  • The Stanford Microarray Database: a user's guide. Methods in molecular biology (Clifton, N.J.) Gollub, J., Ball, C. A., Sherlock, G. 2006; 338: 191-208


    The Stanford Microarray Database (SMD) is a DNA microarray research database that provides a large amount of data for public use. This chapter describes the use of the primary tools for searching, browsing, retrieving, and analyzing data available for SMD. With this introduction, researchers and students will be able to examine and analyze a large body of gene expression and other experiments. Additional tools for depositing, annotating, sharing, and analyzing data, available only to registered users, are also described. SMD is available for installation as a local database.

    View details for PubMedID 16888360

  • Global analysis of gene function in yeast by quantitative phenotypic profiling MOLECULAR SYSTEMS BIOLOGY Brown, J. A., Sherlock, G., Myers, C. L., Burrows, N. M., Deng, C., Wu, H. I., McCann, K. E., Troyanskaya, O. G., Brown, J. M. 2006; 2


    We present a method for the global analysis of the function of genes in budding yeast based on hierarchical clustering of the quantitative sensitivity profiles of the 4756 strains with individual homozygous deletion of nonessential genes to a broad range of cytotoxic or cytostatic agents. This method is superior to other global methods of identifying the function of genes involved in the various DNA repair and damage checkpoint pathways as well as other interrogated functions. Analysis of the phenotypic profiles of the 51 diverse treatments places a total of 860 genes of unknown function in clusters with genes of known function. We demonstrate that this can not only identify the function of unknown genes but can also suggest the mechanism of action of the agents used. This method will be useful when used alone and in conjunction with other global approaches to identify gene function in yeast.

    View details for DOI 10.1038/msb4100043

    View details for Web of Science ID 000243245400005

    View details for PubMedID 16738548

    View details for PubMedCentralID PMC1681475

  • Wrestling with SUMO and bio-ontologies. Nature biotechnology Stoeckert, C., Ball, C., Brazma, A., Brinkman, R., Causton, H., Fan, L., Fostel, J., Fragoso, G., Heiskanen, M., Holstege, F., Morrison, N., Parkinson, H., Quackenbush, J., Rocca-Serra, P., Sansone, S. A., Sarkans, U., Sherlock, G., Stevens, R., Taylor, C., Taylor, R., Whetzel, P., White, J. 2006; 24 (1): 21-2; author reply 23

    View details for DOI 10.1038/nbt0106-21a

    View details for PubMedID 16404382

  • Clustering microarray data DNA MICROARRAYS, PART B: DATABASES AND STATISTICS Gollub, J., Sherlock, G. 2006; 411: 194-?


    Even a simple, small-scale, microarray experiment generates thousands to millions of data points. Clearly, spreadsheets or plotting programs do not suffice for analysis of such large volumes of data, and comprehensive analysis requires systematic methods for selection and organization of data. This chapter focuses on the concepts and algorithms of hierarchical clustering and the most commonly employed methods of partitioning or organizing microarray data, and freely available software that implements these algorithms.

    View details for DOI 10.1016/S0076-6879(06)11010-1

    View details for Web of Science ID 000244506300010

    View details for PubMedID 16939791

  • Storage and retrieval of microarray data and open source microarray database software MOLECULAR BIOTECHNOLOGY Sherlock, G., Ball, C. A. 2005; 30 (3): 239-251


    Microarray technology has been widely adopted by researchers who use both home-made microarrays and microarrays purchased from commercial vendors. Associated with the adoption of this technology has been a deluge of complex data, both from the microarrays themselves, and also in the form of associated metadata, such as gene annotation information, the properties and treatment of biological samples, and the data transformation and analysis steps taken downstream. In addition, standards for annotation and data exchange have been proposed, and are now being adopted by journals and funding agencies alike. The coupling of large quantities of complex data with extensive and complex standards requires all but the most small-scale of microarray users to have access to a robust and scalable database with various tools. In this review, we discuss some of the desirable properties of such a database, and look at the features of several freely available alternatives.

    View details for Web of Science ID 000230547300006

    View details for PubMedID 15988049

  • A human-curated annotation of the Candida albicans genome PLOS GENETICS Braun, B. R., Hoog, M. V., d'Enfert, C., Martchenko, M., Dungan, J., Kuo, A., Inglis, D. O., Uhl, M. A., Hogues, H., Berriman, M., Lorenz, M., Levitin, A., Oberholzer, U., Bachewich, C., Harcus, D., Marcil, A., Dignard, D., Iouk, T., Zito, R., Frangeul, L., Tekaia, F., Rutherford, K., Wang, E., Munro, C. A., Bates, S., Gow, N. A., Hoyer, L. L., Kohler, G., Morschhauser, J., Newport, G., Znaidi, S., Raymond, M., Turcotte, B., Sherlock, G., Costanzo, M., Ihmels, J., Berman, J., Sanglard, D., Agabian, N., Mitchell, A. P., Johnson, A. D., Whiteway, M., Nantel, A. 2005; 1 (1): 36-57


    Recent sequencing and assembly of the genome for the fungal pathogen Candida albicans used simple automated procedures for the identification of putative genes. We have reviewed the entire assembly, both by hand and with additional bioinformatic resources, to accurately map and describe 6,354 genes and to identify 246 genes whose original database entries contained sequencing errors (or possibly mutations) that affect their reading frame. Comparison with other fungal genomes permitted the identification of numerous fungus-specific genes that might be targeted for antifungal therapy. We also observed that, compared to other fungi, the protein-coding sequences in the C. albicans genome are especially rich in short sequence repeats. Finally, our improved annotation permitted a detailed analysis of several multigene families, and comparative genomic studies showed that C. albicans has a far greater catabolic range, encoding respiratory Complex 1, several novel oxidoreductases and ketone body degrading enzymes, malonyl-CoA and enoyl-CoA carriers, several novel amino acid degrading enzymes, a variety of secreted catabolic lipases and proteases, and numerous transporters to assimilate the resulting nutrients. The results of these efforts will ensure that the Candida research community has uniform and comprehensive genomic information for medical research as well as for future diagnostic and therapeutic applications.

    View details for DOI 10.1371/journal.pgen.0010001

    View details for Web of Science ID 000234295900006

    View details for PubMedID 16103911

    View details for PubMedCentralID PMC1183520

  • Of fish and chips NATURE METHODS Sherlock, G. 2005; 2 (5): 329-330

    View details for Web of Science ID 000228790200008

    View details for PubMedID 15846357

  • Microarray karyotyping of commercial wine yeast strains reveals shared, as well as unique, genomic signatures BMC GENOMICS Dunn, B., Levine, R. P., Sherlock, G. 2005; 6


    Genetic differences between yeast strains used in wine-making may account for some of the variation seen in their fermentation properties and may also produce differing sensory characteristics in the final wine product itself. To investigate this, we have determined genomic differences among several Saccharomyces cerevisiae wine strains by using a "microarray karyotyping" (also known as "array-CGH" or "aCGH") technique. We have studied four commonly used commercial wine yeast strains, assaying three independent isolates from each strain. All four wine strains showed common differences with respect to the laboratory S. cerevisiae strain S288C, some of which may be specific to commercial wine yeasts. We observed very little intra-strain variation; i.e., the genomic karyotypes of different commercial isolates of the same strain looked very similar, although an exception to this was seen among the Montrachet isolates. A moderate amount of inter-strain genomic variation between the four wine strains was observed, mostly in the form of depletions or amplifications of single genes; these differences allowed unique identification of each strain. Many of the inter-strain differences appear to be in transporter genes, especially hexose transporters (HXT genes), metal ion sensors/transporters (CUP1, ZRT1, ENA genes), members of the major facilitator superfamily, and in genes involved in drug response (PDR3, SNQ1, QDR1, RDS1, AYT1, YAR068W). We therefore used halo assays to investigate the response of these strains to three different fungicidal drugs (cycloheximide, clotrimazole, sulfomethuron methyl). Strains with fewer copies of the CUP1 loci showed hypersensitivity to sulfomethuron methyl. Microarray karyotyping is a useful tool for analyzing the genome structures of wine yeasts.
Despite only small to moderate variations in gene copy numbers between different wine yeast strains and within different isolates of a given strain, there was enough variation to allow unique identification of strains; additionally, some of the variation correlated with drug sensitivity. The relatively small number of differences seen by microarray karyotyping between the strains suggests that the differences in fermentative and organoleptic properties ascribed to these different strains may arise from a small number of genetic changes, making it possible to test whether the observed differences do indeed confer different sensory properties in the finished wine.

    View details for DOI 10.1186/1471-2164-6-53

    View details for Web of Science ID 000228998600001

    View details for PubMedID 15833139

    View details for PubMedCentralID PMC1097725

  • The Stanford Microarray Database accommodates additional microarray platforms and data formats NUCLEIC ACIDS RESEARCH Ball, C. A., Awad, I. A., Demeter, J., Gollub, J., Hebert, J. M., Hernandez-Boussard, T., Jin, H., Matese, J. C., Nitzberg, M., Wymore, F., Zachariah, Z. K., Brown, P. O., Sherlock, G. 2005; 33: D580-D582


    The Stanford Microarray Database (SMD) is a research tool for hundreds of Stanford researchers and their collaborators. In addition, SMD functions as a resource for the entire biological research community by providing unrestricted access to microarray data published by SMD users and by disseminating its source code. In addition to storing GenePix (Axon Instruments) and ScanAlyze output from spotted microarrays, SMD has recently added the ability to store, retrieve, display and analyze the complete raw data produced by several additional microarray platforms and image analysis software packages, so that we can also now accept data from Affymetrix GeneChips (MAS5/GCOS or dChip), Agilent Catalog or Custom arrays (using Agilent's Feature Extraction software) or data created by SpotReader (Niles Scientific). We have implemented software that allows us to accept MAGE-ML documents from array manufacturers and to submit MIAME-compliant data in MAGE-ML format directly to ArrayExpress and GEO, greatly increasing the ease with which data from SMD can be published adhering to accepted standards and also increasing the accessibility of published microarray data to the general public. We have introduced a new tool to facilitate data sharing among our users, so that datasets can be shared during, before or after the completion of data analysis. The latest version of the source code for the complete database package was released in November 2004, allowing researchers around the world to deploy their own installations of SMD.

    View details for Web of Science ID 000226524300119

    View details for PubMedID 15608265

  • The Candida Genome Database (CGD), a community resource for Candida albicans gene and protein information NUCLEIC ACIDS RESEARCH Arnaud, M. B., Costanzo, M. C., Skrzypek, M. S., Binkley, G., Lane, C., Miyasato, S. R., Sherlock, G. 2005; 33: D358-D363


    The Candida Genome Database (CGD) is a new database that contains genomic information about the opportunistic fungal pathogen Candida albicans. CGD is a public resource for the research community that is interested in the molecular biology of this fungus. CGD curators are in the process of combing the scientific literature to collect all C.albicans gene names and aliases; to assign gene ontology terms that describe the molecular function, biological process, and subcellular localization of each gene product; to annotate mutant phenotypes; and to summarize the function and biological context of each gene product in free-text description lines. CGD also provides community resources, including a reservation system for gene names and a colleague registry through which Candida researchers can share contact information and research interests. CGD is publicly funded (by NIH grant R01 DE15873-01 from the NIDCR) and is freely available at

    View details for DOI 10.1093/nar/gki003

    View details for Web of Science ID 000226524300074

    View details for PubMedID 15608216

    View details for PubMedCentralID PMC539957

  • GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes BIOINFORMATICS Boyle, E. I., Weng, S. A., Gollub, J., Jin, H., Botstein, D., Cherry, J. M., Sherlock, G. 2004; 20 (18): 3710-3715


    GO::TermFinder comprises a set of object-oriented Perl modules for accessing Gene Ontology (GO) information and evaluating and visualizing the collective annotation of a list of genes to GO terms. It can be used to draw conclusions from microarray and other biological data, calculating the statistical significance of each annotation. GO::TermFinder can be used on any system on which Perl can be run, either as a command line application, in single or batch mode, or as a web-based CGI script. The full source code and documentation for GO::TermFinder are freely available from

    View details for DOI 10.1093/bioinformatics/bth456

    View details for Web of Science ID 000225786600064

    View details for PubMedID 15297299

    View details for PubMedCentralID PMC3037731

  • An open letter on microarray data from the MGED Society MICROBIOLOGY-SGM Ball, C., Brazma, A., Causton, H., Chervitz, S., Edgar, R., Hingamp, P., Matese, J. C., Parkinson, H., Quackenbush, J., Ringwald, M., Sansone, S. A., Sherlock, G., Spellman, P., Stoeckert, C., Tateno, Y., Taylor, R., White, J., Winegarden, N. 2004; 150: 3522-3524

    View details for DOI 10.1099/mic.0.27637-0

    View details for Web of Science ID 000225372700003

    View details for PubMedID 15528642

  • Caryoscope: An Open Source Java application for viewing microarray data in a genomic context BMC BIOINFORMATICS Awad, I. A., Rees, C. A., Hernandez-Boussard, T., Ball, C. A., Sherlock, G. 2004; 5


    Microarray-based comparative genome hybridization experiments generate data that can be mapped onto the genome. These data are interpreted more easily when represented graphically in a genomic context. We have developed Caryoscope, which is an open source Java application for visualizing microarray data from array comparative genome hybridization experiments in a genomic context. Caryoscope can read General Feature Format files (GFF files), as well as comma- and tab-delimited files, that define the genomic positions of the microarray reporters for which data are obtained. The microarray data can be browsed using an interactive, zoomable interface, which helps users identify regions of chromosomal deletion or amplification. The graphical representation of the data can be exported in a number of graphic formats, including publication-quality formats such as PostScript. Caryoscope is a useful tool that can aid in the visualization, exploration and interpretation of microarray data in a genomic context.

    View details for DOI 10.1186/1471-2105-5-151

    View details for Web of Science ID 000225769900002

    View details for PubMedID 15488149

    View details for PubMedCentralID PMC528725

  • GeneXplorer: an interactive web application for microarray data visualization and analysis BMC BIOINFORMATICS Rees, C. A., Demeter, J., Matese, J. C., Botstein, D., Sherlock, G. 2004; 5


    When publishing large-scale microarray datasets, it is of great value to create supplemental websites where either the full data, or selected subsets corresponding to figures within the paper, can be browsed. We set out to create a CGI application containing many of the features of some of the existing standalone software for the visualization of clustered microarray data. We present GeneXplorer, a web application for interactive microarray data visualization and analysis in a web environment. GeneXplorer allows users to browse a microarray dataset in an intuitive fashion. It provides simple access to microarray data over the Internet and uses only HTML and JavaScript to display graphic and annotation information. It provides radar and zoom views of the data, allows display of the nearest neighbors to a gene expression vector based on their Pearson correlations and provides the ability to search gene annotation fields. The software is released under the permissive MIT Open Source license, and the complete documentation and the entire source code are freely available for download from CPAN

    View details for DOI 10.1186/1471-2105-5-141

    View details for Web of Science ID 000224940600001

    View details for PubMedID 15458579

    View details for PubMedCentralID PMC523853

  • Submission of microarray data to public repositories. PLoS biology Ball, C. A., Brazma, A., Causton, H., Chervitz, S., Edgar, R., Hingamp, P., Matese, J. C., Parkinson, H., Quackenbush, J., Ringwald, M., Sansone, S., Sherlock, G., Spellman, P., Stoeckert, C., Tateno, Y., Taylor, R., White, J., Winegarden, N. 2004; 2 (9): E317-?

    View details for PubMedID 15340489

  • Funding high-throughput data sharing NATURE BIOTECHNOLOGY Ball, C. A., Sherlock, G., Brazma, A. 2004; 22 (9): 1179-1183

    View details for DOI 10.1038/nbt0904-1179

    View details for Web of Science ID 000223653400040

    View details for PubMedID 15340487

  • Standards for microarray data: an open letter. Environmental health perspectives Ball, C., Brazma, A., Causton, H., Chervitz, S., Edgar, R., Hingamp, P., Matese, J. C., Parkinson, H., Quackenbush, J., Ringwald, M., Sansone, S., Sherlock, G., Spellman, P., Stoeckert, C., Tateno, Y., Taylor, R., White, J., Winegarden, N. 2004; 112 (12): A666-7

    View details for PubMedID 15345376

  • STARTing to recycle NATURE GENETICS Sherlock, G. 2004; 36 (8): 795-796

    View details for DOI 10.1038/ng0804-795

    View details for Web of Science ID 000222974000010

    View details for PubMedID 15284848

  • Final words: cell age and cell cycle are unlinked TRENDS IN BIOTECHNOLOGY Spellman, P. T., Sherlock, G. 2004; 22 (6): 277-278


    Cooper has a simple belief: that the cell cycle is connected to age and size. Furthermore, as a result of this connection in his mind he believes that there are no possible manipulations that can operate on a batch culture to synchronize cells within the cell cycle, such that those cells can undergo a semblance of a normal cell cycle. His formulation of this argument is as a 'fundamental law', the law of conservation of cell-age order (LCCAO). The first part of this law - 'there is no batch treatment of the culture that can lead to an alteration of the cell-age order' - can probably be proved true, in the mathematical sense, and certainly makes intuitive sense. Unfortunately the corollaries of this law are rather suspect, drawing inferences from cell age to cell size to the cell cycle.

    View details for Web of Science ID 000222301000006

    View details for PubMedID 15158055

  • Reply: whole-culture synchronization - effective tools for cell cycle studies TRENDS IN BIOTECHNOLOGY Spellman, P. T., Sherlock, G. 2004; 22 (6): 270-273


    Studies of gene expression during the eukaryotic cell cycle in whole-culture synchronized cultures have been published using many methodologies. These procedures alter the state of the cell cycle for a population of cells, rather than purifying a population of cells that are in the same state. Criticism of these methods (e.g. see Cooper, this issue, pp. 266-269) suggests that these studies are flawed, and posits that such methodologies cannot be used to study the cell cycle because they alter the size and age distributions of the cultures. We believe that whole-culture cell cycle studies work even though they alter the size and age distributions: these cells still progress through the cell cycle and although we do not suggest that the methods are perfect, we will explain how these microarray studies have successfully identified cell cycle regulated genes and why these results are biologically meaningful.

    View details for Web of Science ID 000222301000004

    View details for PubMedID 15158053

  • The Longhorn Array Database (LAD): An open-source, MIAME compliant implementation of the Stanford Microarray database (SMD) BMC BIOINFORMATICS Killion, P. J., Sherlock, G., Iyer, V. R. 2003; 4


    The power of microarray analysis can be realized only if data is systematically archived and linked to biological annotations as well as analysis algorithms. The Longhorn Array Database (LAD) is a MIAME compliant microarray database that operates on PostgreSQL and Linux. It is a fully open source version of the Stanford Microarray Database (SMD), one of the largest microarray databases. LAD is available at http://www.longhornarraydatabase.org. Our development of LAD provides a simple, free, open, reliable and proven solution for storage and analysis of two-color microarray data.

    View details for Web of Science ID 000185003900001

    View details for PubMedID 12930545

  • SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data NUCLEIC ACIDS RESEARCH Diehn, M., Sherlock, G., Binkley, G., Jin, H., Matese, J. C., Hernandez-Boussard, T., Rees, C. A., Cherry, J. M., Botstein, D., Brown, P. O., Alizadeh, A. A. 2003; 31 (1): 219-223


    The explosion in the number of functional genomic datasets generated with tools such as DNA microarrays has created a critical need for resources that facilitate the interpretation of large-scale biological data. SOURCE is a web-based database that brings together information from a broad range of resources, and provides it in a manner particularly useful for genome-scale analyses. SOURCE's GeneReports include aliases, chromosomal location, functional descriptions, GeneOntology annotations, gene expression data, and links to external databases. We curate published microarray gene expression datasets and allow users to rapidly identify sets of co-regulated genes across a variety of tissues and a large number of conditions using a simple and intuitive interface. SOURCE provides content both in gene and cDNA clone-centric pages, and thus simplifies analysis of datasets generated using cDNA microarrays. SOURCE is continuously updated and contains the most recent and accurate information available for human, mouse, and rat genes. By allowing dynamic linking to individual gene or clone reports, SOURCE facilitates browsing of large genomic datasets. Finally, SOURCE's batch interface allows rapid extraction of data for thousands of genes or clones at once and thus facilitates statistical analyses such as assessing the enrichment of functional attributes within clusters of genes. SOURCE is available at

    View details for DOI 10.1093/nar/gkg014

    View details for Web of Science ID 000181079700050

    View details for PubMedID 12519986

    View details for PubMedCentralID PMC165461

  • The Stanford Microarray Database: data access and quality assessment tools NUCLEIC ACIDS RESEARCH Gollub, J., Ball, C. A., Binkley, G., Demeter, J., Finkelstein, D. B., Hebert, J. M., Hernandez-Boussard, T., Jin, H., Kaloper, M., Matese, J. C., Schroeder, M., Brown, P. O., Botstein, D., Sherlock, G. 2003; 31 (1): 94-96


    The Stanford Microarray Database (SMD) serves as a microarray research database for Stanford investigators and their collaborators. In addition, SMD functions as a resource for the entire scientific community, by making freely available all of its source code and providing full public access to data published by SMD users, along with many tools to explore and analyze those data. SMD currently provides public access to data from 3500 microarrays, including data from 85 publications, and this total is increasing rapidly. In this article, we describe some of SMD's newer tools for accessing public data, assessing data quality and for data analysis.

    View details for DOI 10.1093/nar/gkg078

    View details for Web of Science ID 000181079700020

    View details for PubMedID 12519956

    View details for PubMedCentralID PMC165525

  • Microarray databases: storage and retrieval of microarray data. Methods in molecular biology (Clifton, N.J.) Sherlock, G., Ball, C. A. 2003; 224: 235-248

    View details for PubMedID 12710676

  • The underlying principles of scientific publication. Bioinformatics Ball, C. A., Sherlock, G., Parkinson, H., Rocca-Serra, P., Brooksbank, C., Causton, H. C., Cavalieri, D., Gaasterland, T., Hingamp, P., Holstege, F., Ringwald, M., Spellman, P., Stoeckert, C. J., Stewart, J. E., Taylor, R., Brazma, A., Quackenbush, J. 2002; 18 (11): 1409-?

    View details for PubMedID 12424109

  • Standards for Microarray data SCIENCE Ball, C. A., Sherlock, G., Parkinson, H., Rocca-Serra, P., Brooksbank, C., Causton, H. C., Cavalieri, D., Gaasterland, T., Hingamp, P., Holstege, F., Ringwald, M., Spellman, P., Stoeckert, C. J., Stewart, J. E., Taylor, R., Brazma, A., Quackenbush, J. 2002; 298 (5593): 539-539

    View details for Web of Science ID 000178634800016

    View details for PubMedID 12387284

  • Identification of genes periodically expressed in the human cell cycle and their expression in tumors MOLECULAR BIOLOGY OF THE CELL Whitfield, M. L., Sherlock, G., Saldanha, A. J., Murray, J. I., Ball, C. A., Alexander, K. E., Matese, J. C., Perou, C. M., Hurt, M. M., Brown, P. O., Botstein, D. 2002; 13 (6): 1977-2000


    The genome-wide program of gene expression during the cell division cycle in a human cancer cell line (HeLa) was characterized using cDNA microarrays. Transcripts of >850 genes showed periodic variation during the cell cycle. Hierarchical clustering of the expression patterns revealed coexpressed groups of previously well-characterized genes involved in essential cell cycle processes such as DNA replication, chromosome segregation, and cell adhesion along with genes of uncharacterized function. Most of the genes whose expression had previously been reported to correlate with the proliferative state of tumors were found herein also to be periodically expressed during the HeLa cell cycle. However, some of the genes periodically expressed in the HeLa cell cycle do not have a consistent correlation with tumor proliferation. Cell cycle-regulated transcripts of genes involved in fundamental processes such as DNA replication and chromosome segregation seem to be more highly expressed in proliferative tumors simply because they contain more cycling cells. The data in this report provide a comprehensive catalog of cell cycle regulated genes that can serve as a starting point for functional discovery. The full dataset is available at

    View details for DOI 10.1091/mbc.02-02-0030

    View details for Web of Science ID 000176418800016

    View details for PubMedID 12058064

    View details for PubMedCentralID PMC117619

  • Molecular characterisation of soft tissue tumours: a gene expression study LANCET Nielsen, T. O., West, R. B., Linn, S. C., Alter, O., Knowling, M. A., O'Connell, J. X., Zhu, S., Fero, M., Sherlock, G., Pollack, J. R., Brown, P. O., Botstein, D., van de Rijn, M. 2002; 359 (9314): 1301-1307


    Soft-tissue tumours are derived from mesenchymal cells such as fibroblasts, muscle cells, or adipocytes, but for many such tumours the histogenesis is controversial. We aimed to start molecular characterisation of these rare neoplasms and to do a genome-wide search for new diagnostic markers. We analysed gene-expression patterns of 41 soft-tissue tumours with spotted cDNA microarrays. After removal of errors introduced by use of different microarray batches, the expression patterns of 5520 genes that were well defined were used to separate tumours into discrete groups by hierarchical clustering and singular value decomposition. Synovial sarcomas, gastrointestinal stromal tumours, neural tumours, and a subset of the leiomyosarcomas, showed strikingly distinct gene-expression patterns. Other tumour categories--malignant fibrous histiocytoma, liposarcoma, and the remaining leiomyosarcomas--shared molecular profiles that were not predicted by histological features or immunohistochemistry. Strong expression of known genes, such as KIT in gastrointestinal stromal tumours, was noted within gene sets that distinguished the different sarcomas. However, many uncharacterised genes also contributed to the distinction between tumour types. These results suggest a new method for classification of soft-tissue tumours, which could improve on the method based on histological findings. Large numbers of uncharacterised genes contributed to distinctions between the tumours, and some of these could be useful markers for diagnosis, have prognostic significance, or prove possible targets for treatment.

    View details for Web of Science ID 000174989700013

    View details for PubMedID 11965276

  • Exploratory screening of genes and clusters from microarray experiments STATISTICA SINICA Tibshirani, R., Hastie, T., Narasimhan, B., Eisen, M., Sherlock, G., Brown, P., Botstein, D. 2002; 12 (1): 47-59
  • Design and implementation of microarray gene expression markup language (MAGE-ML) GENOME BIOLOGY Spellman, P. T., Miller, M., Stewart, J., Troup, C., Sarkans, U., Chervitz, S., Bernhart, D., Sherlock, G., Ball, C., Lepage, M., Swiatek, M., Marks, W. L., Goncalves, J., Markel, S., Iordan, D., Shojatalab, M., Pizarro, A., White, J., Hubley, R., Deutsch, E., Senger, M., Aronow, B. J., Robinson, A., Bassett, D., Stoeckert, C. J., Brazma, A. 2002; 3 (9)


    Meaningful exchange of microarray data is currently difficult because it is rare that published data provide sufficient information depth or are even in the same format from one publication to another. Only when data can be easily exchanged will the entire biological community be able to derive the full benefit from such microarray studies. To this end we have developed three key ingredients towards standardizing the storage and exchange of microarray data. First, we have created a minimal information for the annotation of a microarray experiment (MIAME)-compliant conceptualization of microarray experiments modeled using the unified modeling language (UML) named MAGE-OM (microarray gene expression object model). Second, we have translated MAGE-OM into an XML-based data format, MAGE-ML, to facilitate the exchange of data. Third, some of us are now using MAGE (or its progenitors) in data production settings. Finally, we have developed a freely available software tool kit (MAGE-STK) that eases the integration of MAGE-ML into end users' systems. MAGE will help microarray data producers and users to exchange information by providing a common platform for data exchange, and MAGE-STK will make the adoption of MAGE easier.

    View details for Web of Science ID 000207581400013

    View details for PubMedID 12225585

    View details for PubMedCentralID PMC126871

  • Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO) NUCLEIC ACIDS RESEARCH Dwight, S. S., Harris, M. A., Dolinski, K., Ball, C. A., Binkley, G., Christie, K. R., Fisk, D. G., Issel-Tarver, L., Schroeder, M., Sherlock, G., Sethuraman, A., Weng, S., Botstein, D., Cherry, J. M. 2002; 30 (1): 69-72


    The Saccharomyces Genome Database (SGD) resources, ranging from genetic and physical maps to genome-wide analysis tools, reflect the scientific progress in identifying genes and their functions over the last decade. As emphasis shifts from identification of the genes to identification of the role of their gene products in the cell, SGD seeks to provide its users with annotations that will allow relationships to be made between gene products, both within Saccharomyces cerevisiae and across species. To this end, SGD is annotating genes to the Gene Ontology (GO), a structured representation of biological knowledge that can be shared across species. The GO consists of three separate ontologies describing molecular function, biological process and cellular component. The goal is to use published information to associate each characterized S.cerevisiae gene product with one or more GO terms from each of the three ontologies. To be useful, this must be done in a manner that allows accurate associations based on experimental evidence, modifications to GO when necessary, and careful documentation of the annotations through evidence codes for given citations. Reaching this goal is an ongoing process at SGD. For information on the current progress of GO annotations at SGD and other participating databases, as well as a description of each of the three ontologies, please visit the GO Consortium page. SGD gene associations to GO can be found by visiting our site.

    View details for Web of Science ID 000173077100017

    View details for PubMedID 11752257

  • Minimum information about a microarray experiment (MIAME) - toward standards for microarray data NATURE GENETICS Brazma, A., Hingamp, P., Quackenbush, J., Sherlock, G., Spellman, P., Stoeckert, C., Aach, J., Ansorge, W., Ball, C. A., Causton, H. C., Gaasterland, T., Glenisson, P., Holstege, F. C., Kim, I. F., Markowitz, V., Matese, J. C., Parkinson, H., Robinson, A., Sarkans, U., Schulze-Kremer, S., Stewart, J., Taylor, R., Vilo, J., Vingron, M. 2001; 29 (4): 365-371


    Microarray analysis has become a widely used tool for the generation of gene expression data on a genomic scale. Although many significant results have been derived from microarray studies, one limitation has been the lack of standards for presenting and exchanging such data. Here we present a proposal, the Minimum Information About a Microarray Experiment (MIAME), that describes the minimum information required to ensure that microarray data can be easily interpreted and that results derived from its analysis can be independently verified. The ultimate goal of this work is to establish a standard for recording and reporting microarray-based gene expression data, which will in turn facilitate the establishment of databases and public repositories and enable the development of data analysis tools. With respect to MIAME, we concentrate on defining the content and structure of the necessary information rather than the technical format for capturing it.

    View details for Web of Science ID 000172507500006

    View details for PubMedID 11726920

  • Analysis of large-scale gene expression data. Briefings in bioinformatics Sherlock, G. 2001; 2 (4): 350-362


    DNA microarray technology has resulted in the generation of large complex data sets, such that the bottleneck in biological investigation has shifted from data generation, to data analysis. This review discusses some of the algorithms and tools for the analysis and organisation of microarray expression data, including clustering methods, partitioning methods, and methods for correlating expression data to other biological data.

    View details for PubMedID 11808747

  • Creating the gene ontology resource: Design and implementation GENOME RESEARCH Ashburner, M., Ball, C. A., Blake, J. A., Butler, H., Cherry, J. M., Corradi, J., Dolinski, K., Eppig, J. T., Harris, M., Hill, D. P., Lewis, S., Marshall, B., Mungall, C., Reiser, L., Rhee, S., Richardson, J. E., Richter, J., Ringwald, M., Rubin, G. M., Sherlock, G., Yoon, J. 2001; 11 (8): 1425-1433


    The exponential growth in the volume of accessible biological information has generated a confusion of voices surrounding the annotation of molecular information about genes and their products. The Gene Ontology (GO) project seeks to provide a set of structured vocabularies for specific biological domains that can be used to describe gene products in any organism. This work includes building three extensive ontologies to describe molecular function, biological process, and cellular component, and providing a community database resource that supports the use of these ontologies. The GO Consortium was initiated by scientists associated with three model organism databases: SGD, the Saccharomyces Genome database; FlyBase, the Drosophila genome database; and MGD/GXD, the Mouse Genome Informatics databases. Additional model organism database groups are joining the project. Each of these model organism information systems is annotating genes and gene products using GO vocabulary terms and incorporating these annotations into their respective model organism databases. Each database contributes its annotation files to a shared GO data resource accessible to the public. The GO site can be used by the community both to recover the GO vocabularies and to access the annotated gene product data sets from the model organism databases. The GO Consortium supports the development of the GO database resource and provides tools enabling curators and researchers to query and manipulate the vocabularies. We believe that the shared development of this molecular annotation resource will contribute to the unification of biological information.

    View details for Web of Science ID 000170263900015

    View details for PubMedID 11483584

  • Missing value estimation methods for DNA microarrays BIOINFORMATICS Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., Altman, R. B. 2001; 17 (6): 520-525


    Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data. We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.

    View details for Web of Science ID 000169404700005

    View details for PubMedID 11395428

  • The Stanford Microarray Database NUCLEIC ACIDS RESEARCH Sherlock, G., Hernandez-Boussard, T., Kasarskis, A., Binkley, G., Matese, J. C., Dwight, S. S., Kaloper, M., Weng, S., Jin, H., Ball, C. A., Eisen, M. B., Spellman, P. T., Brown, P. O., Botstein, D., Cherry, J. M. 2001; 29 (1): 152-155


    The Stanford Microarray Database (SMD) stores raw and normalized data from microarray experiments, and provides web interfaces for researchers to retrieve, analyze and visualize their data. The two immediate goals for SMD are to serve as a storage site for microarray data from ongoing research at Stanford University, and to facilitate the public dissemination of that data once published, or released by the researcher. Of paramount importance is the connection of microarray data with the biological data that pertains to the DNA deposited on the microarray (genes, clones etc.). SMD makes use of many public resources to connect expression information to the relevant biology, including SGD [Ball,C.A., Dolinski,K., Dwight,S.S., Harris,M.A., Issel-Tarver,L., Kasarskis,A., Scafe,C.R., Sherlock,G., Binkley,G., Jin,H. et al. (2000) Nucleic Acids Res., 28, 77-80], YPD and WormPD [Costanzo,M.C., Hogan,J.D., Cusick,M.E., Davis,B.P., Fancher,A.M., Hodges,P.E., Kondu,P., Lengieza,C., Lew-Smith,J.E., Lingner,C. et al. (2000) Nucleic Acids Res., 28, 73-76], Unigene [Wheeler,D.L., Chappey,C., Lash,A.E., Leipe,D.D., Madden,T.L., Schuler,G.D., Tatusova,T.A. and Rapp,B.A. (2000) Nucleic Acids Res., 28, 10-14], dbEST [Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) Nature Genet., 4, 332-333] and SWISS-PROT [Bairoch,A. and Apweiler,R. (2000) Nucleic Acids Res., 28, 45-48] and can be accessed at

    View details for Web of Science ID 000166360300039

    View details for PubMedID 11125075

  • Saccharomyces Genome Database provides tools to survey gene expression and functional analysis data NUCLEIC ACIDS RESEARCH Ball, C. A., Jin, H., Sherlock, G., Weng, S., Matese, J. C., Andrada, R., Binkley, G., Dolinski, K., Dwight, S. S., Harris, M. A., Issel-Tarver, L., Schroeder, R., Botstein, D., Cherry, J. M. 2001; 29 (1): 80-81


    Upon the completion of the Saccharomyces cerevisiae genomic sequence in 1996 [Goffeau,A. et al. (1997) Nature, 387, 5], several creative and ambitious projects have been initiated to explore the functions of gene products or gene expression on a genome-wide scale. To help researchers take advantage of these projects, the Saccharomyces Genome Database (SGD) has created two new tools, Function Junction and Expression Connection. Together, the tools form a central resource for querying multiple large-scale analysis projects for data about individual genes. Function Junction provides information from diverse projects that shed light on the role a gene product plays in the cell, while Expression Connection delivers information produced by the ever-increasing number of microarray projects. WWW access to SGD is available at genome-www.stanford.edu/Saccharomyces/.

    View details for Web of Science ID 000166360300019

    View details for PubMedID 11125055

  • A whole-genome microarray reveals genetic diversity among Helicobacter pylori strains PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Salama, N., Guillemin, K., McDaniel, T. K., Sherlock, G., Tompkins, L., Falkow, S. 2000; 97 (26): 14668-14673


    Helicobacter pylori colonizes the stomach of half of the world's population, causing a wide spectrum of disease ranging from asymptomatic gastritis to ulcers to gastric cancer. Although the basis for these diverse clinical outcomes is not understood, more severe disease is associated with strains harboring a pathogenicity island. To characterize the genetic diversity of more and less virulent strains, we examined the genomic content of 15 H. pylori clinical isolates by using a whole genome H. pylori DNA microarray. We found that a full 22% of H. pylori genes are dispensable in one or more strains, thus defining a minimal functional core of 1281 H. pylori genes. While the core genes encode most metabolic and cellular processes, the strain-specific genes include genes unique to H. pylori, restriction modification genes, transposases, and genes encoding cell surface proteins, which may aid the bacteria under specific circumstances during their long-term infection of genetically diverse hosts. We observed distinct patterns of the strain-specific gene distribution along the chromosome, which may result from different mechanisms of gene acquisition and loss. Among the strain-specific genes, we have found a class of candidate virulence genes identified by their coinheritance with the pathogenicity island.

    View details for Web of Science ID 000165993700121

    View details for PubMedID 11121067

  • Gene Ontology: tool for the unification of biology NATURE GENETICS Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., Sherlock, G. 2000; 25 (1): 25-29

    View details for PubMedID 10802651

  • Analysis of large-scale gene expression data CURRENT OPINION IN IMMUNOLOGY Sherlock, G. 2000; 12 (2): 201-205


    The advent of cDNA and oligonucleotide microarray technologies has led to a paradigm shift in biological investigation, such that the bottleneck in research is shifting from data generation to data analysis. Hierarchical clustering, divisive clustering, self-organizing maps and k-means clustering have all been recently used to make sense of this mass of data.

    View details for Web of Science ID 000085786300012

    View details for PubMedID 10712947

  • Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling NATURE Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. G., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L. M., Marti, G. E., Moore, T., Hudson, J., Lu, L. S., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O., Staudt, L. M. 2000; 403 (6769): 503-511


    Diffuse large B-cell lymphoma (DLBCL), the most common subtype of non-Hodgkin's lymphoma, is clinically heterogeneous: 40% of patients respond well to current therapy and have prolonged survival, whereas the remainder succumb to the disease. We proposed that this variability in natural history reflects unrecognized molecular heterogeneity in the tumours. Using DNA microarrays, we have conducted a systematic characterization of gene expression in B-cell malignancies. Here we show that there is diversity in gene expression among the tumours of DLBCL patients, apparently reflecting the variation in tumour proliferation rate, host response and differentiation state of the tumour. We identified two molecularly distinct forms of DLBCL which had gene expression patterns indicative of different stages of B-cell differentiation. One type expressed genes characteristic of germinal centre B cells ('germinal centre B-like DLBCL'); the second type expressed genes normally induced during in vitro activation of peripheral blood B cells ('activated B-like DLBCL'). Patients with germinal centre B-like DLBCL had a significantly better overall survival than those with activated B-like DLBCL. The molecular classification of tumours on the basis of gene expression can thus identify previously undetected and clinically significant subtypes of cancer.

    View details for Web of Science ID 000085227300039

    View details for PubMedID 10676951

  • Integrating functional genomic information into the Saccharomyces genome database NUCLEIC ACIDS RESEARCH Ball, C. A., Dolinski, K., Dwight, S. S., Harris, M. A., Issel-Tarver, L., Kasarskis, A., Scafe, C. R., Sherlock, G., Binkley, G., Jin, H., Kaloper, M., Orr, S. D., Schroeder, M., Weng, S., Zhu, Y., Botstein, D., Cherry, J. M. 2000; 28 (1): 77-80


    The Saccharomyces Genome Database (SGD) stores and organizes information about the nearly 6200 genes in the yeast genome. The information is organized around the 'locus page' and directs users to the detailed information they seek. SGD is endeavoring to integrate the existing information about yeast genes with the large volume of data generated by functional analyses that are beginning to appear in the literature and on web sites. New features will include searches of systematic analyses and Gene Summary Paragraphs that succinctly review the literature for each gene. In addition to current information, such as gene product and phenotype descriptions, the new locus page will also describe a gene product's cellular process, function and localization using a controlled vocabulary developed in collaboration with two other model organism databases. We describe these developments in SGD through the newly reorganized locus page. The SGD is accessible via the WWW at

    View details for Web of Science ID 000084896300020

    View details for PubMedID 10592186

  • Using the Saccharomyces Genome Database (SGD) for analysis of protein similarities and structure NUCLEIC ACIDS RESEARCH Chervitz, S. A., Hester, E. T., Ball, C. A., Dolinski, K., Dwight, S. S., Harris, M. A., Juvik, G., Malekian, A., Roberts, S., Roe, T., Scafe, C., Schroeder, M., Sherlock, G., Weng, S., Zhu, Y., Cherry, J. M., Botstein, D. 1999; 27 (1): 74-78


    The Saccharomyces Genome Database (SGD) collects and organizes information about the molecular biology and genetics of the yeast Saccharomyces cerevisiae. The latest protein structure and comparison tools available at SGD are presented here. With the completion of the yeast sequence and the Caenorhabditis elegans sequence soon to follow, comparison of proteins from complete eukaryotic proteomes will be an extremely powerful way to learn more about a particular protein's structure, its function, and its relationships with other proteins. SGD can be accessed through the World Wide Web at

    View details for Web of Science ID 000077983000017

    View details for PubMedID 9847146

  • Comparison of the complete protein sets of worm and yeast: Orthology and divergence SCIENCE Chervitz, S. A., Aravind, L., Sherlock, G., Ball, C. A., Koonin, E. V., Dwight, S. S., Harris, M. A., Dolinski, K., Mohr, S., Smith, T., Weng, S., Cherry, J. M., Botstein, D. 1998; 282 (5396): 2022-2028


    Comparative analysis of predicted protein sequences encoded by the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae suggests that most of the core biological functions are carried out by orthologous proteins (proteins of different species that can be traced back to a common ancestor) that occur in comparable numbers. The specialized processes of signal transduction and regulatory control that are unique to the multicellular worm appear to use novel proteins, many of which re-use conserved domains. Major expansion of the number of some of these domains seen in the worm may have contributed to the advent of multicellularity. The proteins conserved in yeast and worm are likely to have orthologs throughout eukaryotes; in contrast, the proteins unique to the worm may well define metazoans.

    View details for Web of Science ID 000077467100036

    View details for PubMedID 9851918

  • Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization MOLECULAR BIOLOGY OF THE CELL Spellman, P. T., Sherlock, G., Zhang, M. Q., Iyer, V. R., Anders, K., Eisen, M. B., Brown, P. O., Botstein, D., Futcher, B. 1998; 9 (12): 3273-3297


    We sought to create a comprehensive catalog of yeast genes whose transcript levels vary periodically within the cell cycle. To this end, we used DNA microarrays and samples from yeast cultures synchronized by three independent methods: alpha factor arrest, elutriation, and arrest of a cdc15 temperature-sensitive mutant. Using periodicity and correlation algorithms, we identified 800 genes that meet an objective minimum criterion for cell cycle regulation. In separate experiments, designed to examine the effects of inducing either the G1 cyclin Cln3p or the B-type cyclin Clb2p, we found that the mRNA levels of more than half of these 800 genes respond to one or both of these cyclins. Furthermore, we analyzed our set of cell cycle-regulated genes for known and new promoter elements and show that several known elements (or variations thereof) contain information predictive of cell cycle regulation. A full description and complete data sets are available at

    View details for Web of Science ID 000077388600003

    View details for PubMedID 9843569



    In the budding yeast Saccharomyces cerevisiae, progress of the cell cycle beyond the major control point in G1 phase, termed START, requires activation of the evolutionarily conserved Cdc28 protein kinase by direct association with G1 cyclins. We have used a conditional lethal mutation in CDC28 of S. cerevisiae to clone a functional homologue from the human fungal pathogen Candida albicans. The protein sequence, deduced from the nucleotide sequence, is 79% identical to that of S. cerevisiae Cdc28 and as such is the most closely related protein yet identified. We have also isolated from C. albicans two genes encoding putative G1 cyclins, by their ability to rescue a conditional G1 cyclin defect in S. cerevisiae; one of these genes encodes a protein of 697 amino acids and is identical to the product of the previously described CCN1 gene. The second gene codes for a protein of 465 residues, which has significant homology to S. cerevisiae Cln3. These data suggest that the events and regulatory mechanisms operating at START are highly conserved between these two organisms.

    View details for Web of Science ID A1994QA10400006

    View details for PubMedID 7830719



    In Saccharomyces cerevisiae, START has been shown to comprise a series of tightly regulated reactions by which the cellular environment is assessed and, under appropriate conditions, cells are committed to a further round of mitotic division. The key effector of START is the product of the CDC28 gene and the mechanisms by which the protein kinase activity of this gene product is regulated at START are well characterized. This is in contrast to the events which follow p34CDC28 activation and the way in which progress to S phase is achieved, which are less clear. We suggest two possible models to describe the regulation of these events. Firstly, it is conceivable that the only post-START targets of the p34CDC28/G1 cyclin kinase complex are components of the SBF and DSC1 transcription factors. This would require that either SBF or DSC1 regulates CDC4 function either directly by activating the transcription of CDC4 itself or else indirectly by activating the transcription of a mediator of CDC4 function in a manner analogous to the way in which the control of CDC7 function may be mediated by transcriptional regulation of DBF4 (Jackson et al., 1993). Potential regulatory effectors of CDC4 function include SCM4, which suppresses cdc4 mutations in an allele-specific manner (Smith et al., 1992) or its homologue HFS1 (J. Hartley & J. Rosamond, unpublished). This possibility is supported by the finding that CDC4 has no upstream SCB or MCB elements, whereas SCM4 and HFS1 have either an exact or close match to the SCB. This model would further require that genes needed for bud emergence and spindle pole body duplication are also subject to transcriptional regulation by DSC1 or SBF. An alternative model is that the p34CDC28/G1 cyclin complexes have several targets post-START, one being DSC1 and the others being as yet unidentified components of the pathways leading to CDC4 function, spindle pole body duplication and bud emergence. This model could account for the functional redundancy observed amongst the G1 cyclins with the various cyclins providing substrate specificity for the kinase complex. We suggest that a complex containing Cln3 protein is primarily responsible for, and acts most efficiently on, the targets containing Swi6 protein (SBF and DSC1), with complexes containing other G1 cyclins (Cln1 and/or Cln2 proteins) principally involved in activating the other pathways. However, there must be overlap in the function of these complexes with each cyclin able to substitute for some or all of the functions when necessary, albeit with differing efficiencies. This hypothesis is supported by several observations. (ABSTRACT TRUNCATED AT 400 WORDS)

    View details for Web of Science ID A1993MH54500001

    View details for PubMedID 8277239