Administrative Appointments
-
Advisory Board, WormBase, Caenorhabditis community database (2002 - 2018)
-
Advisory Board, The Blueprint Initiative, Interaction Database (2003 - 2005)
-
Advisory Board, dictyBase, Dictyostelium discoideum community database (2003 - 2008)
-
Advisory Board, EcoCyc, E. coli genetics and pathway resource (2004 - 2008)
-
Advisory Board, TIGR Rice Genome Annotation Project (2004 - 2010)
-
External Consultants Panel, NHGRI ENCODE and modENCODE projects (2005 - 2011)
-
Member, Academic Council Committee on Libraries (C-LIB) (2008 - 2010)
-
Advisory Board, FlyBase, Drosophila Knowledgebase (2008 - 2018)
-
Executive Board, International Society of Biocuration (2010 - 2015)
-
Member, Committee on Academic Computing and Information Systems (C-ACIS) (2012 - 2015)
-
President, International Society of Biocuration (2015 - 2016)
-
Chair, Committee on Academic Computing and Information Systems (C-ACIS) (2015 - 2018)
-
External Scientific Advisers, 4D-Nucleome Common Fund Project (2015 - 2019)
-
Advisory Board, FaceBase, Comprehensive data and resources for craniofacial researchers, USC (2015 - 2024)
-
Advisory Board, Laboratory of Neuro Imaging Resource (LONIR), USC (2015 - 2024)
-
Advisory Board, XenBase, Xenopus Knowledgebase, Univ of Calgary (2016 - 2024)
-
Advisory Board, GlyGen, Glycoscience Informatics Resource, George Washington University (2020 - 2024)
-
Advisory Board, ZFIN, Zebrafish Genome Database, Univ of Oregon (2021 - 2024)
-
Member, Committee on Academic Computing and Information Systems (C-ACIS) (2024 - Present)
Honors & Awards
-
Ira Herskowitz Award, Genetics Society of America (August 2018)
Professional Education
-
Ph.D., University of California, Molecular Biology (1985)
-
B.S., Purdue University, Biochemistry (1979)
-
B.S., Purdue University, Biological Sciences (1979)
Community and International Work
-
Stanford at The Tech, San Jose
Topic: Public Understanding of Genetics
Partnering Organization(s): The Tech Museum of San Jose
Location: Bay Area
Ongoing Project: Yes
Opportunities for Student Involvement: Yes
Current Research and Scholarly Interests
The Cherry lab identifies, validates and integrates scientific information into encyclopedic databases essential for investigation as well as scientific education. Published results of scientific experimentation are a foundation of our understanding of the natural world and provide motivation for new experiments. The combination of in-depth understanding reported in the literature with computational analyses is an essential ingredient of modern biological research. Mastery of the volumes of published literature requires comprehensive databases that provide the facts and underlying experimental data in publicly accessible ways. Curation, extraction and sorting of factual experimental data from peer-reviewed journal articles are necessary to acquire these data from their source. Large quantitative datasets from global studies extend our knowledge of genes, their products and their interactions. Integrating quantitative datasets with curated, focused experimental results creates unique comprehensive databases. My group creates such essential databases and makes them available to scientists and educators seeking to understand experimental results and to teach scientific knowledge.
The exploration of the genes and other important elements of a genome involves the use of previous results to aid the design of experiments that explore, for example, gene regulation, protein function, and the interaction of these processes. New technologies are being applied to the determination of many molecular interactions of the components of chromosomes and the specific controls for the generation of the many cell types that create an organism from a single set of chromosomes. These methods create very large datasets that cannot be appreciated without computational methods and access to databases of scientific results.
The Cherry lab specializes in designing and managing a public database of information for the budding yeast Saccharomyces cerevisiae and has recently begun applying its expertise to human genomic information. Our current projects address three areas of research: engineering for the design of databases and software for the effective integration of complex experimental results; defining standards for eukaryotic genomic data that measure reliability and quality; and developing vocabularies that enhance communication between researchers, and between computational resources. This research involves the collection and standardization of experimental results, the incorporation of detailed descriptions of these data into complex biological models, the application of flexible search and retrieval tools, and the distribution of the integrated information for the acceleration of discovery.
Three major bioinformatics resources funded by the National Institutes of Health are provided by the lab. The Saccharomyces Genome Database project is the foremost database on a single organism. It is the archetype of all such databases because of its high quality, rich design, completeness, ease of use, and facilitation of scientific discovery. The Gene Ontology Consortium invented a structured vocabulary for the specification and description of gene function, their involvement in biological processes and their location within subcellular complexes and components. This innovative knowledgebase has unified biological nomenclature and is crucial for the analysis of biological results. The ENCODE Data Coordination Center provides an essential component for the analysis and use of large-scale studies of the human genome. Our work specifies the accurate and complete submission of human genomic experimental results, verifies the data quality, specifies and compiles the dataset experimental details, integrates data with existing human genome databases, and distributes these results with their analyses via a portal that serves the diverse biomedical research community of skilled bioinformaticists, biologists, and educators.
2024-25 Courses
-
Independent Studies (7)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Directed Reading in Genetics
GENE 299 (Aut, Win, Spr, Sum)
- Graduate Research
GENE 399 (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
GENE 370 (Aut, Win, Spr, Sum)
- Supervised Study
GENE 260 (Aut, Win, Spr, Sum)
-
Prior Year Courses
2021-22 Courses
- Computational Analysis of Biological Information: Introduction to Python for Biologists
GENE 218, MI 218, PATH 218 (Spr)
- Introductory Python Programming for Genomics
BIOS 274 (Win)
All Publications
-
CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data.
Nucleic acids research
2024
Abstract
Hundreds of millions of single cells have been analyzed using high-throughput transcriptomic methods. The cumulative knowledge within these datasets provides an exciting opportunity for unlocking insights into health and disease at the level of single cells. Meta-analyses that span diverse datasets building on recent advances in large language models and other machine-learning approaches pose exciting new directions to model and extract insight from single-cell data. Despite the promise of these and emerging analytical tools for analyzing large amounts of data, the sheer number of datasets, data models and accessibility remains a challenge. Here, we present CZ CELLxGENE Discover (cellxgene.cziscience.com), a data platform that provides curated and interoperable single-cell data. Available via a free-to-use online data portal, CZ CELLxGENE hosts a growing corpus of community-contributed data of over 93 million unique cells. Curated, standardized and associated with consistent cell-level metadata, this collection of single-cell transcriptomic data is the largest of its kind and growing rapidly via community contributions. A suite of tools and features enables accessibility and reusability of the data via both computational and visual interfaces to allow researchers to explore individual datasets, perform cross-corpus analysis, and run meta-analyses of tens of millions of cells across studies and tissues at the resolution of single cells.
View details for DOI 10.1093/nar/gkae1142
View details for PubMedID 39607691
-
Saccharomyces Genome Database: Advances in Genome Annotation, Expanded Biochemical Pathways, and Other Key Enhancements.
Genetics
2024
Abstract
Budding yeast (Saccharomyces cerevisiae) is the most extensively characterized eukaryotic model organism and has long been used to gain insight into the fundamentals of genetics, cellular biology, and the functions of specific genes and proteins. The Saccharomyces Genome Database (SGD) is a scientific resource that provides information about the genome and biology of S. cerevisiae. For more than 30 years, SGD has maintained the genetic nomenclature, chromosome maps, and functional annotation for budding yeast along with search and analysis tools to explore these data. Here we describe recent updates at SGD, including the two most recent reference genome annotation updates, expanded biochemical pathways representation, changes to SGD search and data files, and other enhancements to the SGD website and user interface. These activities are part of our continuing effort to promote insights gained from yeast to enable the discovery of functional relationships between sequence and gene products in fungi and higher eukaryotes.
View details for DOI 10.1093/genetics/iyae185
View details for PubMedID 39530598
-
The ENCODE Uniform Analysis Pipelines.
Research square
2023
Abstract
The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/) is publicly available in GitHub, with images available on Dockerhub (https://hub.docker.com), enabling access to a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud allows small labs the ability to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses.
View details for DOI 10.21203/rs.3.rs-3111932/v1
View details for PubMedID 37503119
View details for PubMedCentralID PMC10371165
-
Annotating and prioritizing human non-coding variants with RegulomeDB v.2.
Nature genetics
2023; 55 (5): 724-726
View details for DOI 10.1038/s41588-023-01365-3
View details for PubMedID 37173523
View details for PubMedCentralID 3431494
-
The Gene Ontology Knowledgebase in 2023.
Genetics
2023
Abstract
The Gene Ontology (GO) knowledgebase (http://geneontology.org) is a comprehensive resource concerning the functions of genes and gene products (proteins and non-coding RNAs). GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms. Here, we provide an updated overview of the GO knowledgebase, as well as the efforts of the broad, international consortium of scientists that develops, maintains and updates the GO knowledgebase. The GO knowledgebase consists of three components: 1) the Gene Ontology - a computational knowledge structure describing functional characteristics of genes; 2) GO annotations - evidence-supported statements asserting that a specific gene product has a particular functional characteristic; and 3) GO Causal Activity Models (GO-CAMs) - mechanistic models of molecular "pathways" (GO biological processes) created by linking multiple GO annotations using defined relations. Each of these components is continually expanded, revised and updated in response to newly published discoveries, and receives extensive QA checks, reviews and user feedback. For each of these components, we provide a description of the current contents, recent developments to keep the knowledgebase up to date with new discoveries, as well as guidance on how users can best make use of the data we provide. We conclude with future directions for the project.
View details for DOI 10.1093/genetics/iyad031
View details for PubMedID 36866529
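The first two components named in the abstract above (an ontology of terms, and evidence-supported annotations attaching gene products to those terms) can be illustrated with a toy sketch. Everything here is hypothetical: the term identifiers, gene names, and the `ancestors` and `annotated_to` helpers are invented for illustration and are not taken from the GO knowledgebase or its software. Only `EXP` is a real GO evidence code. The sketch assumes a simple "is a" hierarchy and shows why a query for a general term should also return gene products annotated to more specific terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One evidence-supported statement: a gene product has a characteristic."""
    gene_product: str
    term_id: str
    evidence: str


# Hypothetical toy ontology: each term maps to its direct "is a" parents.
IS_A = {
    "T:catalytic_activity": set(),
    "T:kinase_activity": {"T:catalytic_activity"},
    "T:protein_kinase_activity": {"T:kinase_activity"},
}


def ancestors(term_id, is_a):
    """Return every term reachable by following parent links from term_id."""
    seen = set()
    stack = [term_id]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotated_to(term_id, annotations, is_a):
    """Gene products annotated to term_id directly or via a more specific term."""
    return sorted(
        {
            a.gene_product
            for a in annotations
            if a.term_id == term_id or term_id in ancestors(a.term_id, is_a)
        }
    )


# Hypothetical annotations for two invented gene products.
ANNOTATIONS = [
    Annotation("geneA", "T:protein_kinase_activity", "EXP"),
    Annotation("geneB", "T:catalytic_activity", "EXP"),
]

print(annotated_to("T:kinase_activity", ANNOTATIONS, IS_A))
print(annotated_to("T:catalytic_activity", ANNOTATIONS, IS_A))
```

In this toy data, querying the general term returns both gene products, while querying the intermediate term returns only the one annotated to its more specific child.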
-
Saccharomyces Genome Database Update: Server Architecture, Pan-Genome Nomenclature, and External Resources.
Genetics
2023
Abstract
As one of the first model organism knowledgebases, Saccharomyces Genome Database (SGD) has been supporting the scientific research community since 1993. As technologies and research evolve, so does SGD: from updates in software architecture, to curation of novel data types, to incorporation of data from, and collaboration with, other knowledgebases. We are continuing to make steps toward providing the community with an S. cerevisiae pan-genome. Here we describe software upgrades, a new nomenclature system for genes not found in the reference strain, and additions to gene pages. With these improvements, we aim to remain a leading resource for students, researchers, and the broader scientific community.
View details for DOI 10.1093/genetics/iyac191
View details for PubMedID 36607068
-
New Data and Collaborations at the Saccharomyces Genome Database: Updated reference genome, alleles, and the Alliance of Genome Resources.
Genetics
2022
Abstract
Saccharomyces cerevisiae is used to provide fundamental understanding of eukaryotic genetics, gene product function, and cellular biological processes. Saccharomyces Genome Database (SGD) has been supporting the yeast research community since 1993, serving as its de facto hub. Over the years, SGD has maintained the genetic nomenclature, chromosome maps, and functional annotation, and developed various tools and methods for analysis and curation of a variety of emerging data types. More recently, SGD and six other model organism focused knowledgebases have come together to create the Alliance of Genome Resources to develop sustainable genome information resources that promote and support the use of various model organisms to understand the genetic and genomic bases of human biology and disease. Here we describe recent activities at SGD, including the latest reference genome annotation update, the development of a curation system for mutant alleles, and new pages addressing homology across model organisms as well as the use of yeast to study human disease.
View details for DOI 10.1093/genetics/iyab224
View details for PubMedID 34897464
-
Dive into Epigenetics and Gene Regulation - Navigation using the ENCODE Portal
SPRINGERNATURE. 2020: 744
View details for Web of Science ID 000598482602403
-
Incorporation of a unified protein abundance dataset into the Saccharomyces genome database.
Database : the journal of biological databases and curation
2020; 2020
Abstract
The identification and accurate quantitation of protein abundance has been a major objective of proteomics research. Abundance studies have the potential to provide users with data that can be used to gain a deeper understanding of protein function and regulation and can also help identify cellular pathways and modules that operate under various environmental stress conditions. One of the central missions of the Saccharomyces Genome Database (SGD; https://www.yeastgenome.org) is to work with researchers to identify and incorporate datasets of interest to the wider scientific community, thereby enabling hypothesis-driven research. A large number of studies have detailed efforts to generate proteome-wide abundance data, but deeper analyses of these data have been hampered by the inability to compare results between studies. Recently, a unified protein abundance dataset was generated through the evaluation of more than 20 abundance datasets, which were normalized and converted to common measurement units, in this case molecules per cell. We have incorporated these normalized protein abundance data and associated metadata into the SGD database, as well as the SGD YeastMine data warehouse, resulting in the addition of 56 487 values for untreated cells grown in either rich or defined media and 28 335 values for cells treated with environmental stressors. Abundance data for protein-coding genes are displayed in a sortable, filterable table on Protein pages, available through Locus Summary pages. A median abundance value was incorporated, and a median absolute deviation was calculated for each protein-coding gene and incorporated into SGD. These values are displayed in the Protein section of the Locus Summary page. The inclusion of these data has enhanced the quality and quantity of protein experimental information presented at SGD and provides opportunities for researchers to access and utilize the data to further their research.
View details for DOI 10.1093/database/baaa008
View details for PubMedID 32128557
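The abstract above mentions a median abundance and a median absolute deviation (MAD) for each protein-coding gene. As a minimal sketch of those two summary statistics, assuming a plain list of abundance values in molecules per cell: the function name and the sample numbers are hypothetical, and this is not the code used by SGD.

```python
from statistics import median


def median_and_mad(values):
    """Return (median, median absolute deviation) for a non-empty list."""
    if not values:
        raise ValueError("values must be non-empty")
    med = median(values)
    # MAD is the median of the absolute distances from the median.
    mad = median(abs(v - med) for v in values)
    return med, mad


# Hypothetical abundance values for one protein across five studies.
abundances = [1200, 1500, 1100, 9800, 1300]
med, mad = median_and_mad(abundances)
print(med, mad)  # 1300 200
```

The single outlying value (9800) barely moves either statistic, which is one reason a median and MAD are often preferred to a mean and standard deviation when combining measurements from heterogeneous studies.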
-
The Alliance of Genome Resources: Building a Modern Data Ecosystem for Model Organism Databases
GENETICS
2019; 213 (4): 1189–96
Abstract
Model organisms are essential experimental platforms for discovering gene functions, defining protein and genetic networks, uncovering functional consequences of human genome variation, and for modeling human disease. For decades, researchers who use model organisms have relied on Model Organism Databases (MODs) and the Gene Ontology Consortium (GOC) for expertly curated annotations, and for access to integrated genomic and biological information obtained from the scientific literature and public data archives. Through the development and enforcement of data and semantic standards, these genome resources provide rapid access to the collected knowledge of model organisms in human readable and computation-ready formats that would otherwise require countless hours for individual researchers to assemble on their own. Since their inception, the MODs for the predominant biomedical model organisms [Mus sp (laboratory mouse), Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, Danio rerio, and Rattus norvegicus] along with the GOC have operated as a network of independent, highly collaborative genome resources. In 2016, these six MODs and the GOC joined forces as the Alliance of Genome Resources (the Alliance). By implementing shared programmatic access methods and data-specific web pages with a unified "look and feel," the Alliance is tackling barriers that have limited the ability of researchers to easily compare common data types and annotations across model organisms. To adapt to the rapidly changing landscape for evaluating and funding core data resources, the Alliance is building a modern, extensible, and operationally efficient "knowledge commons" for model organisms using shared, modular infrastructure.
View details for DOI 10.1534/genetics.119.302523
View details for Web of Science ID 000501177400004
View details for PubMedID 31796553
View details for PubMedCentralID PMC6893393
-
The ENCODE Portal as an Epigenomics Resource.
Current protocols in bioinformatics
2019; 68 (1): e89
Abstract
The Encyclopedia of DNA Elements (ENCODE) web portal hosts genomic data generated by the ENCODE Consortium, Genomics of Gene Regulation, The NIH Roadmap Epigenomics Consortium, and the modENCODE and modERN projects. The goal of the ENCODE project is to build a comprehensive map of the functional elements of the human and mouse genomes. Currently, the portal database stores over 500 TB of raw and processed data from over 15,000 experiments spanning assays that measure gene expression, DNA accessibility, DNA and RNA binding, DNA methylation, and 3D chromatin structure across numerous cell lines, tissue types, and differentiation states with selected genetic and molecular perturbations. The ENCODE portal provides unrestricted access to the aforementioned data and relevant metadata as a service to the scientific community. The metadata model captures the details of the experiments, raw and processed data files, and processing pipelines in human and machine-readable form and enables the user to search for specific data either using a web browser or programmatically via REST API. Furthermore, ENCODE data can be freely visualized or downloaded for additional analyses. © 2019 The Authors. Basic Protocol: Query the portal Support Protocol 1: Batch downloading Support Protocol 2: Using the cart to download files Support Protocol 3: Visualize data Alternate Protocol: Query building and programmatic access.
View details for DOI 10.1002/cpbi.89
View details for PubMedID 31751002
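The programmatic access mentioned in the abstract above can be sketched as building a search URL that a script then fetches. This sketch only constructs the URL and makes no network request. The `/search/` path and the `type`, `format`, `assay_title`, and `limit` parameters are assumptions about the portal's query interface and should be checked against the current ENCODE REST API documentation; `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed search endpoint on the ENCODE portal.
ENCODE_SEARCH = "https://www.encodeproject.org/search/"


def build_search_url(**filters):
    """Build a JSON search URL for experiments, adding any extra filters."""
    params = {"type": "Experiment", "format": "json"}
    params.update(filters)
    return ENCODE_SEARCH + "?" + urlencode(params)


# Hypothetical query: a handful of experiments for one assay type.
url = build_search_url(assay_title="ATAC-seq", limit=5)
print(url)
```

The resulting URL can be passed to any HTTP client. Keeping URL construction separate from the request makes the query easy to inspect or paste into a web browser before it is automated.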
-
Integration of macromolecular complex data into the Saccharomyces Genome Database.
Database : the journal of biological databases and curation
2019; 2019
Abstract
Proteins seldom function individually. Instead, they interact with other proteins or nucleic acids to form stable macromolecular complexes that play key roles in important cellular processes and pathways. One of the goals of Saccharomyces Genome Database (SGD; www.yeastgenome.org) is to provide a complete picture of budding yeast biological processes. To this end, we have collaborated with the Molecular Interactions team that provides the Complex Portal database at EMBL-EBI to manually curate the complete yeast complexome. These data, from a total of 589 complexes, were previously available only in SGD's YeastMine data warehouse (yeastmine.yeastgenome.org) and the Complex Portal (www.ebi.ac.uk/complexportal). We have now incorporated these macromolecular complex data into the SGD core database and designed complex-specific reports to make these data easily available to researchers. These web pages contain referenced summaries focused on the composition and function of individual complexes. In addition, detailed information about how subunits interact within the complex, their stoichiometry and the physical structure are displayed when such information is available. Finally, we generate network diagrams displaying subunits and Gene Ontology annotations that are shared between complexes. Information on macromolecular complexes will continue to be updated in collaboration with the Complex Portal team and curated as more data become available.
View details for PubMedID 30715277
-
New developments on the Encyclopedia of DNA Elements (ENCODE) data portal.
Nucleic acids research
2019
Abstract
The Encyclopedia of DNA Elements (ENCODE) is an ongoing collaborative research project aimed at identifying all the functional elements in the human and mouse genomes. Data generated by the ENCODE consortium are freely accessible at the ENCODE portal (https://www.encodeproject.org/), which is developed and maintained by the ENCODE Data Coordinating Center (DCC). Since the initial portal release in 2013, the ENCODE DCC has updated the portal to make ENCODE data more findable, accessible, interoperable and reusable. Here, we report on recent updates, including new ENCODE data and assays, ENCODE uniform data processing pipelines, new visualization tools, a dataset cart feature, unrestricted public access to ENCODE data on the cloud (Amazon Web Services open data registry, https://registry.opendata.aws/encode-project/) and more comprehensive tutorials and documentation.
View details for DOI 10.1093/nar/gkz1062
View details for PubMedID 31713622
-
Prevention of data duplication for high throughput sequencing repositories
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2018
-
SnoVault and encodeD: A novel object-based storage system and applications to ENCODE metadata
PLOS ONE
2017; 12 (4)
Abstract
The Encyclopedia of DNA elements (ENCODE) project is an ongoing collaborative effort to create a comprehensive catalog of functional elements initiated shortly after the completion of the Human Genome Project. The current database exceeds 6500 experiments across more than 450 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the H. sapiens and M. musculus genomes. All ENCODE experimental data, metadata, and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, unified processing, and distribution to community resources and the scientific community. As the volume of data increases, the identification and organization of experimental details becomes increasingly intricate and demands careful curation. The ENCODE DCC has created a general purpose software system, known as SnoVault, that supports metadata and file submission, a database used for metadata storage, web pages for displaying the metadata and a robust API for querying the metadata. The software is fully open-source, code and installation instructions can be found at: http://github.com/ENCODE-DCC/snovault/ (for the generic database) and http://github.com/ENCODE-DCC/encoded/ to store genomic data in the manner of ENCODE. The core database engine, SnoVault (which is completely independent of ENCODE, genomic data, or bioinformatic data) has been released as a separate Python package.
View details for DOI 10.1371/journal.pone.0175310
View details for Web of Science ID 000399955200049
View details for PubMedID 28403240
-
Curated protein information in the Saccharomyces genome database
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2017
Abstract
Due to recent advancements in the production of experimental proteomic data, the Saccharomyces genome database (SGD; www.yeastgenome.org) has been expanding our protein curation activities to make new data types available to our users. Because of broad interest in post-translational modifications (PTM) and their importance to protein function and regulation, we have recently started incorporating expertly curated PTM information on individual protein pages. Here we also present the inclusion of new abundance and protein half-life data obtained from high-throughput proteome studies. These new data types have been included with the aim to facilitate cellular biology research. Database URL: www.yeastgenome.org.
View details for DOI 10.1093/database/bax011
View details for Web of Science ID 000397530600002
View details for PubMedID 28365727
-
Saccharomyces genome database informs human biology.
Nucleic acids research
2017
Abstract
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is an expertly curated database of literature-derived functional information for the model organism budding yeast, Saccharomyces cerevisiae. SGD constantly strives to synergize new types of experimental data and bioinformatics predictions with existing data, and to organize them into a comprehensive and up-to-date information resource. The primary mission of SGD is to facilitate research into the biology of yeast and to provide this wealth of information to advance, in many ways, research on other organisms, even those as evolutionarily distant as humans. To build such a bridge between biological kingdoms, SGD is curating data regarding yeast-human complementation, in which a human gene can successfully replace the function of a yeast gene, and/or vice versa. These data are manually curated from published literature, made available for download, and incorporated into a variety of analysis tools provided by SGD.
View details for PubMedID 29140510
-
ENCODE data at the ENCODE portal.
Nucleic acids research
2016; 44 (D1): D726-32
Abstract
The Encyclopedia of DNA Elements (ENCODE) Project is in its third phase of creating a comprehensive catalog of functional elements in the human genome. This phase of the project includes an expansion of assays that measure diverse RNA populations, identify proteins that interact with RNA and DNA, probe regions of DNA hypersensitivity, and measure levels of DNA methylation in a wide range of cell and tissue types to identify putative regulatory elements. To date, results for almost 5000 experiments have been released for use by the scientific community. These data are available for searching, visualization and download at the new ENCODE Portal (www.encodeproject.org). The revamped ENCODE Portal provides new ways to browse and search the ENCODE data based on the metadata that describe the assays as well as summaries of the assays that focus on data provenance. In addition, it is a flexible platform that allows integration of genomic data from multiple projects. The portal experience was designed to improve access to ENCODE data by relying on metadata that allow reusability and reproducibility of the experiments.
View details for DOI 10.1093/nar/gkv1160
View details for PubMedID 26527727
View details for PubMedCentralID PMC4702836
-
From one to many: expanding the Saccharomyces cerevisiae reference genome panel.
Database : the journal of biological databases and curation
2016; 2016
Abstract
In recent years, thousands of Saccharomyces cerevisiae genomes have been sequenced to varying degrees of completion. The Saccharomyces Genome Database (SGD) has long been the keeper of the original eukaryotic reference genome sequence, which was derived primarily from S. cerevisiae strain S288C. Because new technologies are pushing S. cerevisiae annotation past the limits of any system based exclusively on a single reference sequence, SGD is actively working to expand the original S. cerevisiae systematic reference sequence from a single genome to a multi-genome reference panel. We first commissioned the sequencing of additional genomes and their automated analysis using the AGAPE pipeline. Here we describe our curation strategy to produce manually reviewed high-quality genome annotations in order to elevate 11 of these additional genomes to Reference status. Database URL: http://www.yeastgenome.org/.
View details for DOI 10.1093/database/baw020
View details for PubMedID 26989152
View details for PubMedCentralID PMC4795930
-
Principles of metadata organization at the ENCODE data coordination center.
Database : the journal of biological databases and curation
2016; 2016
Abstract
The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center (DCC) is responsible for organizing, describing and providing access to the diverse data generated by the ENCODE project. The description of these data, known as metadata, includes the biological sample used as input, the protocols and assays performed on these samples, the data files generated from the results and the computational methods used to analyze the data. Here, we outline the principles and philosophy used to define the ENCODE metadata in order to create a metadata standard that can be applied to diverse assays and multiple genomic projects. In addition, we present how the data are validated and used by the ENCODE DCC in creating the ENCODE Portal (https://www.encodeproject.org/). Database URL: www.encodeproject.org.
View details for DOI 10.1093/database/baw001
View details for PubMedID 26980513
View details for PubMedCentralID PMC4792520
-
The Saccharomyces Genome Database: Advanced Searching Methods and Data Mining.
Cold Spring Harbor protocols
2015; 2015 (12): pdb.prot088906
Abstract
At the core of the Saccharomyces Genome Database (SGD) are chromosomal features that encode a product. These include protein-coding genes and major noncoding RNA genes, such as tRNA and rRNA genes. The basic entry point into SGD is a gene or open-reading frame name that leads directly to the locus summary information page. A keyword describing function, phenotype, selective condition, or text from abstracts will also provide a door into the SGD. A DNA or protein sequence can be used to identify a gene or a chromosomal region using BLAST. Protein and DNA sequence identifiers, PubMed and NCBI IDs, author names, and function terms are also valid entry points. The information in SGD has been gathered and is maintained by a group of scientific biocurators and software developers who are devoted to providing researchers with up-to-date information from the published literature, connections to all the major research resources, and tools that allow the data to be explored. All the collected information cannot be represented or summarized for every possible question; therefore, it is necessary to be able to search the structured data in the database. This protocol describes the YeastMine tool, which provides an advanced search capability via an interactive tool. The SGD also archives results from microarray expression experiments, and a strategy designed to explore these data using the SPELL (Serial Pattern of Expression Levels Locator) tool is provided.
View details for DOI 10.1101/pdb.prot088906
View details for PubMedID 26631124
View details for PubMedCentralID PMC5673598
-
Ontology application and use at the ENCODE DCC.
Database : the journal of biological databases and curation
2015; 2015
Abstract
The Encyclopedia of DNA elements (ENCODE) project is an ongoing collaborative effort to create a catalog of genomic annotations. To date, the project has generated over 4000 experiments across more than 350 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory network and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All ENCODE experimental data, metadata and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage and distribution to community resources and the scientific community. As the volume of data increases, the organization of experimental details becomes increasingly complicated and demands careful curation to identify related experiments. Here, we describe the ENCODE DCC's use of ontologies to standardize experimental metadata. We discuss how ontologies, when used to annotate metadata, provide improved searching capabilities and facilitate the ability to find connections within a set of experiments. Additionally, we provide examples of how ontologies are used to annotate ENCODE metadata and how the annotations can be identified via ontology-driven searches at the ENCODE portal. As genomic datasets grow larger and more interconnected, standardization of metadata becomes increasingly vital to allow for exploration and comparison of data between different scientific projects.
View details for DOI 10.1093/database/bav010
View details for PubMedID 25776021
View details for PubMedCentralID PMC4360730
-
Annotation of functional variation in personal genomes using RegulomeDB
GENOME RESEARCH
2012; 22 (9): 1790-1797
Abstract
As the sequencing of healthy and disease genomes becomes more commonplace, detailed annotation provides interpretation for individual variation responsible for normal and disease phenotypes. Current approaches focus on direct changes in protein coding genes, particularly nonsynonymous mutations that directly affect the gene product. However, most individual variation occurs outside of genes and, indeed, most markers generated from genome-wide association studies (GWAS) identify variants outside of coding segments. Identification of potential regulatory changes that perturb these sites will lead to a better localization of truly functional variants and interpretation of their effects. We have developed a novel approach and database, RegulomeDB, which guides interpretation of regulatory variants in the human genome. RegulomeDB includes high-throughput, experimental data sets from ENCODE and other sources, as well as computational predictions and manual annotations to identify putative regulatory potential and identify functional variants. These data sources are combined into a powerful tool that scores variants to help separate functional variants from a large pool and provides a small set of putative sites with testable hypotheses as to their function. We demonstrate the applicability of this tool to the annotation of noncoding variants from 69 full sequenced genomes as well as that of a personal genome, where thousands of functionally associated variants were identified. Moreover, we demonstrate a GWAS where the database is able to quickly identify the known associated functional variant and provide a hypothesis as to its function. Overall, we expect this approach and resource to be valuable for the annotation of human genome sequences.
View details for DOI 10.1101/gr.137323.112
View details for PubMedID 22955989
-
Updates to the Alliance of Genome Resources central infrastructure
GENETICS
2024; 227 (1)
Abstract
The Alliance of Genome Resources (Alliance) is an extensible coalition of knowledgebases focused on the genetics and genomics of intensively studied model organisms. The Alliance is organized as individual knowledge centers with strong connections to their research communities and a centralized software infrastructure, discussed here. Model organisms currently represented in the Alliance are budding yeast, Caenorhabditis elegans, Drosophila, zebrafish, frog, laboratory mouse, laboratory rat, and the Gene Ontology Consortium. The project is in a rapid development phase to harmonize knowledge, store it, analyze it, and present it to the community through a web portal, direct downloads, and application programming interfaces (APIs). Here, we focus on developments over the last 2 years. Specifically, we added and enhanced tools for browsing the genome (JBrowse), downloading sequences, mining complex data (AllianceMine), visualizing pathways, full-text searching of the literature (Textpresso), and sequence similarity searching (SequenceServer). We enhanced existing interactive data tables and added an interactive table of paralogs to complement our representation of orthology. To support individual model organism communities, we implemented species-specific "landing pages" and will add disease-specific portals soon; in addition, we support a common community forum implemented in Discourse software. We describe our progress toward a central persistent database to support curation, the data modeling that underpins harmonization, and progress toward a state-of-the-art literature curation system with integrated artificial intelligence and machine learning (AI/ML).
View details for DOI 10.1093/genetics/iyae049
View details for Web of Science ID 001287647500001
View details for PubMedID 38552170
View details for PubMedCentralID PMC11075569
-
The EN-TEx resource of multi-tissue personal epigenomes & variant-impact models.
Cell
2023; 186 (7): 1493-1511.e40
Abstract
Understanding how genetic variants impact molecular phenotypes is a key goal of functional genomics, currently hindered by reliance on a single haploid reference genome. Here, we present the EN-TEx resource of 1,635 open-access datasets from four donors (∼30 tissues × ∼15 assays). The datasets are mapped to matched, diploid genomes with long-read phasing and structural variants, instantiating a catalog of >1 million allele-specific loci. These loci exhibit coordinated activity along haplotypes and are less conserved than corresponding, non-allele-specific ones. Surprisingly, a deep-learning transformer model can predict the allele-specific activity based only on local nucleotide-sequence context, highlighting the importance of transcription-factor-binding motifs particularly sensitive to variants. Furthermore, combining EN-TEx with existing genome annotations reveals strong associations between allele-specific and GWAS loci. It also enables models for transferring known eQTLs to difficult-to-profile tissues (e.g., from skin to heart). Overall, EN-TEx provides rich data and generalizable models for more accurate personal functional genomics.
View details for DOI 10.1016/j.cell.2023.02.018
View details for PubMedID 37001506
-
Describing the Impact of Genomic Variation on Function (IGVF) Consortium submitted on behalf of the IGVF Consortium members
ELSEVIER SCIENCE INC. 2022: S219
View details for DOI 10.1016/j.gim.2022.01.384
View details for Web of Science ID 000796586200125
-
ClinGen Variant Curation Interface: a variant classification platform for the application of evidence criteria from ACMG/AMP guidelines.
Genome medicine
2022; 14 (1): 6
Abstract
BACKGROUND: Identification of clinically significant genetic alterations involved in human disease has been dramatically accelerated by developments in next-generation sequencing technologies. However, the infrastructure and accessible comprehensive curation tools necessary for analyzing an individual patient genome and interpreting genetic variants to inform healthcare management have been lacking. RESULTS: Here we present the ClinGen Variant Curation Interface (VCI), a global open-source variant classification platform for supporting the application of evidence criteria and classification of variants based on the ACMG/AMP variant classification guidelines. The VCI is among a suite of tools developed by the NIH-funded Clinical Genome Resource (ClinGen) Consortium and supports an FDA-recognized human variant curation process. Essential to this is the ability to enable collaboration and peer review across ClinGen Expert Panels supporting users in comprehensively identifying, annotating, and sharing relevant evidence while making variant pathogenicity assertions. To facilitate evidence-based improvements in human variant classification, the VCI is publicly available to the genomics community. Navigation workflows support users providing guidance to comprehensively apply the ACMG/AMP evidence criteria and document provenance for asserting variant classifications. CONCLUSIONS: The VCI offers a central platform for clinical variant classification that fills a gap in the learning healthcare system, facilitates widespread adoption of standards for clinical curation, and is available at https://curation.clinicalgenome.org.
View details for DOI 10.1186/s13073-021-01004-8
View details for PubMedID 35039090
-
The Gene Ontology resource: enriching a GOld mine
NUCLEIC ACIDS RESEARCH
2021; 49 (D1): D325–D334
Abstract
The Gene Ontology Consortium (GOC) provides the most comprehensive resource currently available for computable knowledge regarding the functions of genes and gene products. Here, we report the advances of the consortium over the past two years. The new GO-CAM annotation framework was notably improved, and we formalized the model with a computational schema to check and validate the rapidly increasing repository of 2838 GO-CAMs. In addition, we describe the impacts of several collaborations to refine GO and report a 10% increase in the number of GO annotations, a 25% increase in annotated gene products, and over 9,400 new scientific articles annotated. As the project matures, we continue our efforts to review older annotations in light of newer findings and to maintain consistency with other ontologies. As a result, 20 000 annotations derived from experimental data were reviewed, corresponding to 2.5% of experimental GO annotations. The website (http://geneontology.org) was redesigned for quick access to documentation, downloads and tools. To maintain an accurate resource and support traceability and reproducibility, we have made available a historical archive covering the past 15 years of GO data with a consistent format and file structure for both the ontology and annotations.
View details for DOI 10.1093/nar/gkaa1113
View details for Web of Science ID 000608437800042
View details for PubMedID 33290552
View details for PubMedCentralID PMC7779012
-
Data Sanitization to Reduce Private Information Leakage from Functional Genomics.
Cell
2020; 183 (4): 905
Abstract
The generation of functional genomics datasets is surging, because they provide insight into gene regulation and organismal phenotypes (e.g., genes upregulated in cancer). The intent behind functional genomics experiments is not necessarily to study genetic variants, yet they pose privacy concerns due to their use of next-generation sequencing. Moreover, there is a great incentive to broadly share raw reads for better statistical power and general research reproducibility. Thus, we need new modes of sharing beyond traditional controlled-access models. Here, we develop a data-sanitization procedure allowing raw functional genomics reads to be shared while minimizing privacy leakage, enabling principled privacy-utility trade-offs. Our protocol works with traditional Illumina-based assays and newer technologies such as 10x single-cell RNA sequencing. It involves quantifying the privacy leakage in reads by statistically linking study participants to known individuals. We carried out these linkages using data from highly accurate reference genomes and more realistic environmental samples.
View details for DOI 10.1016/j.cell.2020.09.036
View details for PubMedID 33186529
-
An atlas of dynamic chromatin landscapes in mouse fetal development.
Nature
2020; 583 (7818): 744–51
Abstract
The Encyclopedia of DNA Elements (ENCODE) project has established a genomic resource for mammalian development, profiling a diverse panel of mouse tissues at 8 developmental stages from 10.5 days after conception until birth, including transcriptomes, methylomes and chromatin states. Here we systematically examined the state and accessibility of chromatin in the developing mouse fetus. In total we performed 1,128 chromatin immunoprecipitation with sequencing (ChIP-seq) assays for histone modifications and 132 assay for transposase-accessible chromatin using sequencing (ATAC-seq) assays for chromatin accessibility across 72 distinct tissue-stages. We used integrative analysis to develop a unified set of chromatin state annotations, infer the identities of dynamic enhancers and key transcriptional regulators, and characterize the relationship between chromatin state and accessibility during developmental gene regulation. We also leveraged these data to link enhancers to putative target genes and demonstrate tissue-specific enrichments of sequence variants associated with disease in humans. The mouse ENCODE data sets provide a compendium of resources for biomedical researchers and achieve, to our knowledge, the most comprehensive view of chromatin dynamics during mammalian fetal development to date.
View details for DOI 10.1038/s41586-020-2093-3
View details for PubMedID 32728240
-
Perspectives on ENCODE.
Nature
2020; 583 (7818): 693–98
Abstract
The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.
View details for DOI 10.1038/s41586-020-2449-8
View details for PubMedID 32728248
-
CNN-Peaks: ChIP-Seq peak detection pipeline using convolutional neural networks that imitate human visual inspection.
Scientific reports
2020; 10 (1): 7933
Abstract
ChIP-seq is one of the core experimental resources available to understand genome-wide epigenetic interactions and identify the functional elements associated with diseases. The analysis of ChIP-seq data is important but poses a difficult computational challenge, due to the presence of irregular noise and bias on various levels. Although many peak-calling methods have been developed, the current computational tools still require, in some cases, human manual inspection using data visualization. However, the huge volumes of ChIP-seq data make it almost impossible for human researchers to manually uncover all the peaks. Recently developed convolutional neural networks (CNN), which are capable of achieving human-like classification accuracy, can be applied to this challenging problem. In this study, we design a novel supervised learning approach for identifying ChIP-seq peaks using CNNs, and integrate it into a software pipeline called CNN-Peaks. We use data labeled by human researchers who annotate the presence or absence of peaks in some genomic segments, as training data for our model. The trained model is then applied to predict peaks in previously unseen genomic segments from multiple ChIP-seq datasets including benchmark datasets commonly used for validation of peak calling methods. We observe a performance superior to that of previous methods.
View details for DOI 10.1038/s41598-020-64655-4
View details for PubMedID 32404971
-
Expanded encyclopaedias of DNA elements in the human and mouse genomes.
Nature
2020; 583 (7818): 699–710
Abstract
The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.
View details for DOI 10.1038/s41586-020-2493-4
View details for PubMedID 32728249
-
Transcriptome visualization and data availability at the Saccharomyces Genome Database.
Nucleic acids research
2019
Abstract
The Saccharomyces Genome Database (SGD; www.yeastgenome.org) maintains the official annotation of all genes in the Saccharomyces cerevisiae reference genome and aims to elucidate the function of these genes and their products by integrating manually curated experimental data. Technological advances have allowed researchers to profile RNA expression and identify transcripts at high resolution. These data can be configured in web-based genome browser applications for display to the general public. Accordingly, SGD has incorporated published transcript isoform data in our instance of JBrowse, a genome visualization platform. This resource will help clarify S. cerevisiae biological processes by furthering studies of transcriptional regulation, untranslated regions, genome engineering, and expression quantification in S. cerevisiae.
View details for DOI 10.1093/nar/gkz892
View details for PubMedID 31612944
-
RNAcentral: a hub of information for non-coding RNA sequences (vol 47, pg D221, 2019)
NUCLEIC ACIDS RESEARCH
2019; 47 (D1): D1250–D1251
View details for DOI 10.1093/nar/gky1206
View details for Web of Science ID 000462587400170
-
The Gene Ontology Resource: 20 years and still GOing strong
NUCLEIC ACIDS RESEARCH
2019; 47 (D1): D330–D338
Abstract
The Gene Ontology resource (GO; http://geneontology.org) provides structured, computable knowledge regarding the functions of genes and gene products. Founded in 1998, GO has become widely adopted in the life sciences, and its contents are under continual improvement, both in quantity and in quality. Here, we report the major developments of the GO resource during the past two years. Each monthly release of the GO resource is now packaged and given a unique identifier (DOI), enabling GO-based analyses on a specific release to be reproduced in the future. The molecular function ontology has been refactored to better represent the overall activities of gene products, with a focus on transcription regulator activities. Quality assurance efforts have been ramped up to address potentially out-of-date or inaccurate annotations. New evidence codes for high-throughput experiments now enable users to filter out annotations obtained from these sources. GO-CAM, a new framework for representing gene function that is more expressive than standard GO annotations, has been released, and users can now explore the growing repository of these models. We also provide the 'GO ribbon' widget for visualizing GO annotations to a gene; the widget can be easily embedded in any web page.
View details for DOI 10.1093/nar/gky1055
View details for Web of Science ID 000462587400049
View details for PubMedID 30395331
View details for PubMedCentralID PMC6323945
-
Integrative Meta-Assembly Pipeline (IMAP): Chromosome-level genome assembler combining multiple de novo assemblies.
PloS one
2019; 14 (8): e0221858
Abstract
Genomic data have become major resources to understand complex mechanisms at fine-scale temporal and spatial resolution in functional and evolutionary genetic studies, including human diseases, such as cancers. Recently, a large number of whole genomes of evolving populations of yeast (Saccharomyces cerevisiae W303 strain) were sequenced in a time-dependent manner to identify temporal evolutionary patterns. For this type of study, a chromosome-level sequence assembly of the strain or population at time zero is required to compare with the genomes derived later. However, there is no fully automated computational approach in experimental evolution studies to establish the chromosome-level genome assembly using unique features of sequencing data. In this study, we developed a new software pipeline, the integrative meta-assembly pipeline (IMAP), to build chromosome-level genome sequence assemblies by generating and combining multiple initial assemblies using three de novo assemblers from short-read sequencing data. We significantly improved the continuity and accuracy of the genome assembly using a large collection of sequencing data and hybrid assembly approaches. We validated our pipeline by generating chromosome-level assemblies of yeast strains W303 and SK1, and compared our results with assemblies built using long-read sequencing and various assembly evaluation metrics. We also constructed chromosome-level sequence assemblies of S. cerevisiae strain Sigma1278b, and three commonly used fungal strains: Aspergillus nidulans A713, Neurospora crassa 73, and Thielavia terrestris CBS 492.74, for which long-read sequencing data are not yet available. Finally, we examined the effect of IMAP parameters, such as reference and resolution, on the quality of the final assembly of the yeast strains W303 and SK1. We developed a cost-effective pipeline to generate chromosome-level sequence assemblies using only short-read sequencing data. Our pipeline combines the strengths of reference-guided and meta-assembly approaches. Our pipeline is available online at http://github.com/jkimlab/IMAP including a Docker image, as well as a Perl script, to help users install the IMAP package, including several prerequisite programs. Users can use IMAP to easily build the chromosome-level assembly for the genome of their interest.
View details for DOI 10.1371/journal.pone.0221858
View details for PubMedID 31454399
-
Updated regulation curation model at the Saccharomyces Genome Database
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2018
-
Evaluating the Clinical Validity of Gene-Disease Associations: An Evidence-Based Framework Developed by the Clinical Genome Resource.
American journal of human genetics
2017; 100 (6): 895-906
Abstract
With advances in genomic sequencing technology, the number of reported gene-disease relationships has rapidly expanded. However, the evidence supporting these claims varies widely, confounding accurate evaluation of genomic variation in a clinical setting. Despite the critical need to differentiate clinically valid relationships from less well-substantiated relationships, standard guidelines for such evaluation do not currently exist. The NIH-funded Clinical Genome Resource (ClinGen) has developed a framework to define and evaluate the clinical validity of gene-disease pairs across a variety of Mendelian disorders. In this manuscript we describe a proposed framework to evaluate relevant genetic and experimental evidence supporting or contradicting a gene-disease relationship and the subsequent validation of this framework using a set of representative gene-disease pairs. The framework provides a semiquantitative measurement for the strength of evidence of a gene-disease relationship that correlates to a qualitative classification: "Definitive," "Strong," "Moderate," "Limited," "No Reported Evidence," or "Conflicting Evidence." Within the ClinGen structure, classifications derived with this framework are reviewed and confirmed or adjusted based on clinical expertise of appropriate disease experts. Detailed guidance for utilizing this framework and access to the curation interface is available on our website. This evidence-based, systematic method to assess the strength of gene-disease relationships will facilitate more knowledgeable utilization of genomic variants in clinical and research settings.
View details for DOI 10.1016/j.ajhg.2017.04.015
View details for PubMedID 28552198
-
Outreach and online training services at the Saccharomyces Genome Database
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2017
Abstract
The Saccharomyces Genome Database (SGD; www.yeastgenome.org), the primary genetics and genomics resource for the budding yeast S. cerevisiae, provides free public access to expertly curated information about the yeast genome and its gene products. As the central hub for the yeast research community, SGD engages in a variety of social outreach efforts to inform our users about new developments, promote collaboration, increase public awareness of the importance of yeast to biomedical research, and facilitate scientific discovery. Here we describe these various outreach methods, from networking at scientific conferences to the use of online media such as blog posts and webinars, and include our perspectives on the benefits provided by outreach activities for model organism databases. Database URL: http://www.yeastgenome.org.
View details for DOI 10.1093/database/bax002
View details for Web of Science ID 000397529600002
View details for PubMedID 28365719
-
Active Interaction Mapping Reveals the Hierarchical Organization of Autophagy.
Molecular cell
2017; 65 (4): 761-774.e5
Abstract
We have developed a general progressive procedure, Active Interaction Mapping, to guide assembly of the hierarchy of functions encoding any biological system. Using this process, we assemble an ontology of functions comprising autophagy, a central recycling process implicated in numerous diseases. A first-generation model, built from existing gene networks in Saccharomyces, captures most known autophagy components in broad relation to vesicle transport, cell cycle, and stress response. Systematic analysis identifies synthetic-lethal interactions as most informative for further experiments; consequently, we saturate the model with 156,364 such measurements across autophagy-activating conditions. These targeted interactions provide more information about autophagy than all previous datasets, producing a second-generation ontology of 220 functions. Approximately half are previously unknown; we confirm roles for Gyp1 at the phagophore-assembly site, Atg24 in cargo engulfment, Atg26 in cytoplasm-to-vacuole targeting, and Ssd1, Did4, and others in selective and non-selective autophagy. The procedure and autophagy hierarchy are at http://atgo.ucsd.edu/.
View details for DOI 10.1016/j.molcel.2016.12.024
View details for PubMedID 28132844
-
Expansion of the Gene Ontology knowledgebase and resources
NUCLEIC ACIDS RESEARCH
2017; 45 (D1): D331-D338
Abstract
The Gene Ontology (GO) is a comprehensive resource of computable knowledge regarding the functions of genes and gene products. As such, it is extensively used by the biomedical research community for the analysis of -omics and related data. Our continued focus is on improving the quality and utility of the GO resources, and we welcome and encourage input from researchers in all areas of biology. In this update, we summarize the current contents of the GO knowledgebase, and present several new features and improvements that have been made to the ontology, the annotations and the tools. Among the highlights are 1) developments that facilitate access to, and application of, the GO knowledgebase, and 2) extensions to the resource as well as increasing support for descriptions of causal models of biological systems and network biology. To learn more, visit http://geneontology.org/.
View details for DOI 10.1093/nar/gkw1108
View details for Web of Science ID 000396575500049
View details for PubMedID 27899567
-
RNAcentral: a comprehensive database of non-coding RNA sequences
NUCLEIC ACIDS RESEARCH
2017; 45 (D1): D128-D134
Abstract
RNAcentral is a database of non-coding RNA (ncRNA) sequences that aggregates data from specialised ncRNA resources and provides a single entry point for accessing ncRNA sequences of all ncRNA types from all organisms. Since its launch in 2014, RNAcentral has integrated twelve new resources, taking the total number of collaborating databases to 22, and began importing new types of data, such as modified nucleotides from MODOMICS and PDB. We created new species-specific identifiers that refer to unique RNA sequences within a context of single species. The website has been subject to continuous improvements focusing on text and sequence similarity searches as well as genome browsing functionality. All RNAcentral data is provided for free and is available for browsing, bulk downloads, and programmatic access at http://rnacentral.org/.
View details for DOI 10.1093/nar/gkw1008
View details for Web of Science ID 000396575500020
View details for PubMedCentralID PMC5210518
-
The Encyclopedia of DNA elements (ENCODE): data portal update.
Nucleic acids research
2017
Abstract
The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center has developed the ENCODE Portal database and website as the source for the data and metadata generated by the ENCODE Consortium. Two principles have motivated the design. First, experimental protocols, analytical procedures and the data themselves should be made publicly accessible through a coherent, web-based search and download interface. Second, the same interface should serve carefully curated metadata that record the provenance of the data and justify its interpretation in biological terms. Since its initial release in 2013 and in response to recommendations from consortium members and the wider community of scientists who use the Portal to access ENCODE data, the Portal has been regularly updated to better reflect these design principles. Here we report on these updates, including results from new experiments, uniformly-processed data from other projects, new visualization tools and more comprehensive metadata to describe experiments and analyses. Additionally, the Portal is now home to meta(data) from related projects including Genomics of Gene Regulation, Roadmap Epigenome Project, Model organism ENCODE (modENCODE) and modERN. The Portal now makes available over 13000 datasets and their accompanying metadata and can be accessed at: https://www.encodeproject.org/.
View details for PubMedID 29126249
-
XenMine: A genomic interaction tool for the Xenopus community.
Developmental biology
2016
Abstract
The Xenopus community has embraced recent advances in sequencing technology, resulting in the accumulation of numerous RNA-Seq and ChIP-Seq datasets. However, easily accessing and comparing datasets generated by multiple laboratories is challenging. Thus, we have created a central space to view, search and analyze data, providing essential information on gene expression changes and regulatory elements present in the genome. XenMine (www.xenmine.org) is a user-friendly website containing published genomic datasets from both Xenopus tropicalis and Xenopus laevis. We have established an analysis pipeline where all published datasets are uniformly processed with the latest genome releases. Information from these datasets can be extracted and compared using an array of pre-built or custom templates. With these search tools, users can easily extract sequences for all putative regulatory domains surrounding a gene of interest, identify the expression values of a gene of interest over developmental time, and analyze lists of genes for gene ontology terms and publications. Additionally, XenMine hosts an in-house genome browser that allows users to visualize all available ChIP-Seq data, extract specifically marked sequences, and aid in identifying important regulatory elements within the genome. Altogether, XenMine is an excellent tool for visualizing, accessing and querying analyzed datasets rapidly and efficiently.
View details for DOI 10.1016/j.ydbio.2016.02.034
View details for PubMedID 27157655
-
The Saccharomyces Genome Database Variant Viewer.
Nucleic acids research
2016; 44 (D1): D698-702
Abstract
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. In recent years, we have moved toward increased representation of sequence variation and allelic differences within S. cerevisiae. The publication of numerous additional genomes has motivated the creation of new tools for their annotation and analysis. Here we present the Variant Viewer: a dynamic open-source web application for the visualization of genomic and proteomic differences. Multiple sequence alignments have been constructed across high quality genome sequences from 11 different S. cerevisiae strains and stored in the SGD. The alignments and summaries are encoded in JSON and used to create a two-tiered dynamic view of the budding yeast pan-genome, available at http://www.yeastgenome.org/variant-viewer.
View details for DOI 10.1093/nar/gkv1250
View details for PubMedID 26578556
View details for PubMedCentralID PMC4702884
-
Integration of new alternative reference strain genome sequences into the Saccharomyces genome database.
Database : the journal of biological databases and curation
2016; 2016
Abstract
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the authoritative community resource for the Saccharomyces cerevisiae reference genome sequence and its annotation. To provide a wider scope of genetic and phenotypic variation in yeast, the genome sequences and their corresponding annotations from 11 alternative S. cerevisiae reference strains have been integrated into SGD. Genomic and protein sequence information for genes from these strains is now available on the Sequence and Protein tab of the corresponding Locus Summary pages. We illustrate how these genome sequences can be utilized to aid our understanding of strain-specific functional and phenotypic differences. Database URL: www.yeastgenome.org.
View details for DOI 10.1093/database/baw074
View details for PubMedID 27252399
View details for PubMedCentralID PMC4888754
-
Providing Access to Genomic Variant Knowledge in a Healthcare Setting: A Vision for the ClinGen Electronic Health Records Workgroup.
Clinical pharmacology and therapeutics
2016; 99 (2): 157–60
Abstract
The Clinical Genome Resource (ClinGen) is a National Institutes of Health (NIH)-funded collaborative program that brings together a variety of projects designed to provide high-quality, curated information on clinically relevant genes and variants. ClinGen's EHR (Electronic Health Record) Workgroup aims to ensure that ClinGen is accessible to providers and patients through EHR and related systems. This article describes the current scope of these efforts and progress to date. The ClinGen public portal can be accessed at www.clinicalgenome.org.
View details for PubMedID 26418054
-
The Saccharomyces Genome Database: A Tool for Discovery.
Cold Spring Harbor protocols
2015; 2015 (12): pdb.top083840
Abstract
The Saccharomyces Genome Database (SGD) is the main community repository of information for the budding yeast, Saccharomyces cerevisiae. The SGD has collected published results on chromosomal features, including genes and their products, and has become an encyclopedia of information on the biology of the yeast cell. This information includes gene and gene product function, phenotype, interactions, regulation, complexes, and pathways. All information has been integrated into a unique web resource, accessible via http://yeastgenome.org. The website also provides custom tools to allow useful searches and visualization of data. The experimentally defined functions of genes, mutant phenotypes, and sequence homologies archived in the SGD provide a platform for understanding many fields of biological research. The mission of SGD is to provide public access to all published experimental results on yeast to aid life science students, educators, and researchers. As such, the SGD has become an essential tool for the design of experiments and for the analysis of experimental results.
View details for DOI 10.1101/pdb.top083840
View details for PubMedID 26631132
View details for PubMedCentralID PMC5673599
-
The Saccharomyces Genome Database: Exploring Biochemical Pathways and Mutant Phenotypes.
Cold Spring Harbor protocols
2015; 2015 (12): pdb.prot088898
Abstract
Many biochemical processes, and the proteins and cofactors involved, have been defined for the eukaryote Saccharomyces cerevisiae. This understanding has been largely derived through the awesome power of yeast genetics. The proteins responsible for the reactions that build complex molecules and generate energy for the cell have been integrated into web-based tools that provide classical views of pathways. The Yeast Pathways in the Saccharomyces Genome Database (SGD) is, however, the only database created from manually curated literature annotations. In this protocol, gene function is explored using phenotype annotations to enable hypotheses to be formulated about a gene's action. A common use of the SGD is to understand more about a gene that was identified via a phenotypic screen or found to interact with a gene/protein of interest. There are still many genes that do not yet have an experimentally defined function and so the information currently available can be used to speculate about their potential function. Typically, computational annotations based on sequence similarity are used to predict gene function. In addition, annotations are sometimes available for phenotypes of mutations in the gene of interest. Integrated results for a few example genes will be explored in this protocol. This will be instructive for the exploration of details that aid the analysis of experimental results and the establishment of connections within the yeast literature.
View details for DOI 10.1101/pdb.prot088898
View details for PubMedID 26631123
View details for PubMedCentralID PMC5673601
-
The Saccharomyces Genome Database: Gene Product Annotation of Function, Process, and Component.
Cold Spring Harbor protocols
2015; 2015 (12): pdb.prot088914
Abstract
An ontology is a highly structured form of controlled vocabulary. Each entry in the ontology is commonly called a term. These terms are used when talking about an annotation. However, each term has a definition that, like the definition of a word found within a dictionary, provides the complete usage and detailed explanation of the term. It is critical to consult a term's definition because the distinction between terms can be subtle. The use of ontologies in biology started as a way of unifying communication between scientific communities and to provide a standard dictionary for different topics, including molecular functions, biological processes, mutant phenotypes, chemical properties and structures. The creation of ontology terms and their definitions often requires debate to reach agreement but the result has been a unified descriptive language used to communicate knowledge. In addition to terms and definitions, ontologies require a relationship used to define the type of connection between terms. In an ontology, a term can have more than one parent term, the term above it in an ontology, as well as more than one child, the term below it in the ontology. Many ontologies are used to construct annotations in the Saccharomyces Genome Database (SGD), as in all modern biological databases; however, Gene Ontology (GO), a descriptive system used to categorize gene function, is the most extensively used ontology in SGD annotations. Examples included in this protocol illustrate the structure and features of this ontology.
View details for DOI 10.1101/pdb.prot088914
View details for PubMedID 26631125
View details for PubMedCentralID PMC5673600
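The ontology structure described in the abstract above (terms with definitions, typed relationships, and the possibility of more than one parent per term) can be sketched as a small directed acyclic graph. This is a minimal illustration only: the term IDs, names, definitions, and the `Term`, `ancestors`, and `children` helpers are hypothetical and are not taken from GO or from SGD.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Term:
    term_id: str
    name: str
    definition: str
    # Maps each parent term ID to the relationship type that links
    # this term to that parent (for example "is_a" or "part_of").
    parents: Dict[str, str] = field(default_factory=dict)


# Hypothetical toy ontology; IDs and definitions are invented for illustration.
TERMS: Dict[str, Term] = {
    t.term_id: t
    for t in [
        Term("T:0001", "cell part", "Any constituent of a cell."),
        Term("T:0002", "membrane", "A lipid bilayer.", {"T:0001": "is_a"}),
        Term("T:0003", "organelle", "A specialized subunit of a cell.", {"T:0001": "is_a"}),
        # One term with two parents, linked by different relationship types.
        Term(
            "T:0004",
            "organelle membrane",
            "A membrane surrounding an organelle.",
            {"T:0002": "is_a", "T:0003": "part_of"},
        ),
    ]
}


def ancestors(terms: Dict[str, Term], term_id: str) -> Set[str]:
    """Return the IDs of every term reachable by following parent links."""
    seen: Set[str] = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        for parent_id in terms[current].parents:
            if parent_id not in seen:
                seen.add(parent_id)
                stack.append(parent_id)
    return seen


def children(terms: Dict[str, Term], term_id: str) -> List[str]:
    """Return the sorted IDs of terms that list term_id as a direct parent."""
    return sorted(t.term_id for t in terms.values() if term_id in t.parents)
```

Because a term may have several parents, walking upward from one term can reach the same ancestor by more than one path, which is why `ancestors` tracks the terms it has already seen.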
-
The Saccharomyces Genome Database: Exploring Genome Features and Their Annotations.
Cold Spring Harbor protocols
2015; 2015 (12): pdb.prot088922
Abstract
Genomic-scale assays result in data that provide information over the entire genome. Such base pair resolution data cannot be summarized easily except via a graphical viewer. A genome browser is a tool that displays genomic data and experimental results as horizontal tracks. Genome browsers allow searches for a chromosomal coordinate or a feature, such as a gene name, but they do not allow searches by function or upstream binding site. Entry into a genome browser requires that you identify the gene name or chromosomal coordinates for a region of interest. A track provides a representation for genomic results and is displayed as a row of data shown as line segments to indicate regions of the chromosome with a feature. Another type of track presents a graph or wiggle plot that indicates the processed signal intensity computed for a particular experiment or set of experiments. Wiggle plots are typical for genomic assays such as the various next-generation sequencing methods (e.g., chromatin immunoprecipitation [ChIP]-seq or RNA-seq), where they represent a peak of DNA binding, histone modification, or the mapping of an RNA sequence. Here we explore the browser that has been built into the Saccharomyces Genome Database (SGD).
View details for DOI 10.1101/pdb.prot088922
View details for PubMedID 26631126
View details for PubMedCentralID PMC5673602
-
Gene Ontology Consortium: going forward
NUCLEIC ACIDS RESEARCH
2015; 43 (D1): D1049-D1056
Abstract
The Gene Ontology (GO; http://www.geneontology.org) is a community-based bioinformatics resource that supplies information about gene product function using ontologies to represent biological knowledge. Here we describe improvements and expansions to several branches of the ontology, as well as updates that have allowed us to more efficiently disseminate the GO and capture feedback from the research community. The Gene Ontology Consortium (GOC) has expanded areas of the ontology such as cilia-related terms, cell-cycle terms and multicellular organism processes. We have also implemented new tools for generating ontology terms based on a set of logical rules making use of templates, and we have made efforts to increase our use of logical definitions. The GOC has a new and improved web site summarizing new developments and documentation, serving as a portal to GO data. Users can perform GO enrichment analysis, and search the GO for terms, annotations to gene products, and associated metadata across multiple species using the all-new AmiGO 2 browser. We encourage and welcome the input of the research community in all biological areas in our continued effort to improve the Gene Ontology.
View details for DOI 10.1093/nar/gku1179
View details for Web of Science ID 000350210400154
View details for PubMedCentralID PMC4383973
-
RNAcentral: an international database of ncRNA sequences
NUCLEIC ACIDS RESEARCH
2015; 43 (D1): D123-D129
Abstract
The field of non-coding RNA biology has been hampered by the lack of availability of a comprehensive, up-to-date collection of accessioned RNA sequences. Here we present the first release of RNAcentral, a database that collates and integrates information from an international consortium of established RNA sequence databases. The initial release contains over 8.1 million sequences, including representatives of all major functional classes. A web portal (http://rnacentral.org) provides free access to data, search functionality, cross-references, source code and an integrated genome browser for selected species.
View details for DOI 10.1093/nar/gku991
View details for Web of Science ID 000350210400020
View details for PubMedCentralID PMC4384043
-
AGAPE (Automated Genome Analysis PipelinE) for pan-genome analysis of Saccharomyces cerevisiae.
PloS one
2015; 10 (3)
Abstract
The characterization and public release of genome sequences from thousands of organisms is expanding the scope for genetic variation studies. However, understanding the phenotypic consequences of genetic variation remains a challenge in eukaryotes due to the complexity of the genotype-phenotype map. One approach to this is the intensive study of model systems for which diverse sources of information can be accumulated and integrated. Saccharomyces cerevisiae is an extensively studied model organism, with well-known protein functions and thoroughly curated phenotype data. To develop and expand the available resources linking genomic variation with function in yeast, we aim to model the pan-genome of S. cerevisiae. To initiate the yeast pan-genome, we newly sequenced or re-sequenced the genomes of 25 strains that are commonly used in the yeast research community using advanced sequencing technology at high quality. We also developed a pipeline for automated pan-genome analysis, which integrates the steps of assembly, annotation, and variation calling. To assign strain-specific functional annotations, we identified genes that were not present in the reference genome. We classified these according to their presence or absence across strains and characterized each group of genes with known functional and phenotypic features. The functional roles of novel genes not found in the reference genome and associated with strains or groups of strains appear to be consistent with anticipated adaptations in specific lineages. As more S. cerevisiae strain genomes are released, our analysis can be used to collate genome data and relate it to lineage-specific patterns of genome evolution. Our new tool set will enhance our understanding of genomic and functional evolution in S. cerevisiae, and will be available to the yeast genetics and molecular biology community.
View details for DOI 10.1371/journal.pone.0120671
View details for PubMedID 25781462
-
Saccharomyces genome database provides new regulation data.
Nucleic acids research
2014; 42 (Database issue): D717-25
Abstract
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is the community resource for genomic, gene and protein information about the budding yeast Saccharomyces cerevisiae, containing a variety of functional information about each yeast gene and gene product. We have recently added regulatory information to SGD and present it on a new tabbed section of the Locus Summary entitled 'Regulation'. We are compiling transcriptional regulator-target gene relationships, which are curated from the literature at SGD or imported, with permission, from the YEASTRACT database. For nearly every S. cerevisiae gene, the Regulation page displays a table of annotations showing the regulators of that gene, and a graphical visualization of its regulatory network. For genes whose products act as transcription factors, the Regulation page also shows a table of their target genes, accompanied by a Gene Ontology enrichment analysis of the biological processes in which those genes participate. We additionally synthesize information from the literature for each transcription factor in a free-text Regulation Summary, and provide other information relevant to its regulatory function, such as DNA binding site motifs and protein domains. All of the regulation data are available for querying, analysis and download via YeastMine, the InterMine-based data warehouse system in use at SGD.
View details for DOI 10.1093/nar/gkt1158
View details for PubMedID 24265222
View details for PubMedCentralID PMC3965049
-
The Reference Genome Sequence of Saccharomyces cerevisiae: Then and Now.
G3 (Bethesda, Md.)
2014; 4 (3): 389-398
Abstract
The genome of the budding yeast Saccharomyces cerevisiae was the first completely sequenced from a eukaryote. It was released in 1996 as the work of a worldwide effort of hundreds of researchers. In the time since, the yeast genome has been intensively studied by geneticists, molecular biologists, and computational scientists all over the world. Maintenance and annotation of the genome sequence have long been provided by the Saccharomyces Genome Database, one of the original model organism databases. To deepen our understanding of the eukaryotic genome, the S. cerevisiae strain S288C reference genome sequence was updated recently in its first major update since 1996. The new version, called "S288C 2010," was determined from a single yeast colony using modern sequencing technologies and serves as the anchor for further innovations in yeast genomic science.
View details for DOI 10.1534/g3.113.008995
View details for PubMedID 24374639
View details for PubMedCentralID PMC3962479
-
DATABASE, The Journal of Biological Databases and Curation, is now the official journal of the International Society for Biocuration.
Database : the journal of biological databases and curation
2013; 2013: bat077
View details for DOI 10.1093/database/bat077
View details for PubMedID 24319113
View details for PubMedCentralID PMC3855479
-
A guide to best practices for Gene Ontology (GO) manual annotation
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2013
Abstract
The Gene Ontology Consortium (GOC) is a community-based bioinformatics project that classifies gene product function through the use of structured controlled vocabularies. A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the kingdoms of life. This number includes two classes of GO annotations: those created manually by experienced biocurators reviewing the literature or by examination of biological data (1.1 million annotations covering 2226 species) and those generated computationally via automated methods. As manual annotations are often used to propagate functional predictions between related proteins within and between genomes, it is critical to provide accurate consistent manual annotations. Toward this goal, we present here the conventions defined by the GOC for the creation of manual annotation. This guide represents the best practices for manual annotation as established by the GOC project over the past 12 years. We hope this guide will encourage research communities to annotate gene products of their interest to enhance the corpus of GO annotations available to all. Database URL: http://www.geneontology.org.
View details for DOI 10.1093/database/bat054
View details for Web of Science ID 000322067500001
View details for PubMedID 23842463
View details for PubMedCentralID PMC3706743
-
InterMOD: integrated data and tools for the unification of model organism research.
Scientific reports
2013; 3: 1802-?
Abstract
Model organisms are widely used for understanding basic biology, and have significantly contributed to the study of human disease. In recent years, genomic analysis has provided extensive evidence of widespread conservation of gene sequence and function amongst eukaryotes, allowing insights from model organisms to help decipher gene function in a wider range of species. The InterMOD consortium is developing an infrastructure based around the InterMine data warehouse system to integrate genomic and functional data from a number of key model organisms, leading the way to improved cross-species research. So far including budding yeast, nematode worm, fruit fly, zebrafish, rat and mouse, the project has set up data warehouses, synchronized data models, and created analysis tools and links between data from different species. The project unites a number of major model organism databases, improving both the consistency and accessibility of comparative research, to the benefit of the wider scientific community.
View details for DOI 10.1038/srep01802
View details for PubMedID 23652793
View details for PubMedCentralID PMC3647165
-
The new modern era of yeast genomics: community sequencing and the resulting annotation of multiple Saccharomyces cerevisiae strains at the Saccharomyces Genome Database
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2013
Abstract
The first completed eukaryotic genome sequence was that of the yeast Saccharomyces cerevisiae, and the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is the original model organism database. SGD remains the authoritative community resource for the S. cerevisiae reference genome sequence and its annotation, and continues to provide comprehensive biological information correlated with S. cerevisiae genes and their products. A diverse set of yeast strains have been sequenced to explore commercial and laboratory applications, and a brief history of those strains is provided. The publication of these new genomes has motivated the creation of new tools, and SGD will annotate and provide comparative analyses of these sequences, correlating changes with variations in strain phenotypes and protein function. We are entering a new era at SGD, as we incorporate these new sequences and make them accessible to the scientific community, all in an effort to continue in our mission of educating researchers and facilitating discovery.
View details for DOI 10.1093/database/bat012
View details for Web of Science ID 000316172400001
View details for PubMedID 23487186
View details for PubMedCentralID PMC3595989
-
The YeastGenome app: the Saccharomyces Genome Database at your fingertips
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2013
Abstract
The Saccharomyces Genome Database (SGD) is a scientific database that provides researchers with high-quality curated data about the genes and gene products of Saccharomyces cerevisiae. To provide instant and easy access to this information on mobile devices, we have developed YeastGenome, a native application for the Apple iPhone and iPad. YeastGenome can be used to quickly find basic information about S. cerevisiae genes and chromosomal features regardless of internet connectivity. With or without network access, you can view basic information and Gene Ontology annotations about a gene of interest by searching gene names and gene descriptions or by browsing the database within the app to find the gene of interest. With internet access, the app provides more detailed information about the gene, including mutant phenotypes, references and protein and genetic interactions, and provides hyperlinks to retrieve detailed information by showing SGD pages and views of the genome browser. SGD provides online help that describes basic ways to navigate the mobile version of SGD, highlights key features and answers frequently asked questions related to the app. The app is available from iTunes (http://itunes.com/apps/yeastgenome). The YeastGenome app is provided freely as a service to our community, as part of SGD's mission to provide free and open access to all its data and annotations.
View details for DOI 10.1093/database/bat004
View details for Web of Science ID 000316179800001
View details for PubMedID 23396302
View details for PubMedCentralID PMC3567487
-
A gene ontology inferred from molecular networks
NATURE BIOTECHNOLOGY
2013; 31 (1): 38-?
Abstract
Ontologies have proven very useful for capturing knowledge as a hierarchy of terms and their interrelationships. In biology a major challenge has been to construct ontologies of gene function given incomplete biological knowledge and inconsistencies in how this knowledge is manually curated. Here we show that large networks of gene and protein interactions in Saccharomyces cerevisiae can be used to infer an ontology whose coverage and power are equivalent to those of the manually curated Gene Ontology (GO). The network-extracted ontology (NeXO) contains 4,123 biological terms and 5,766 term-term relations, capturing 58% of known cellular components. We also explore robust NeXO terms and term relations that were initially not cataloged in GO, a number of which have now been added based on our analysis. Using quantitative genetic interaction profiling and chemogenomics, we find further support for many of the uncharacterized terms identified by NeXO, including multisubunit structures related to protein trafficking or mitochondrial function. This work enables a shift from using ontologies to evaluate data to using data to construct and evaluate ontologies.
View details for DOI 10.1038/nbt.2463
View details for Web of Science ID 000313563600020
View details for PubMedID 23242164
View details for PubMedCentralID PMC3654867
-
Gene Ontology Annotations and Resources
NUCLEIC ACIDS RESEARCH
2013; 41 (D1): D530-D535
Abstract
The Gene Ontology (GO) Consortium (GOC, http://www.geneontology.org) is a community-based bioinformatics resource that classifies gene product function through the use of structured, controlled vocabularies. Over the past year, the GOC has implemented several processes to increase the quantity, quality and specificity of GO annotations. First, the number of manual, literature-based annotations has grown at an increasing rate. Second, as a result of a new 'phylogenetic annotation' process, manually reviewed, homology-based annotations are becoming available for a broad range of species. Third, the quality of GO annotations has been improved through a streamlined process for, and automated quality checks of, GO annotations deposited by different annotation groups. Fourth, the consistency and correctness of the ontology itself has increased by using automated reasoning tools. Finally, the GO has been expanded not only to cover new areas of biology through focused interaction with experts, but also to capture greater specificity in all areas of the ontology using tools for adding new combinatorial terms. The GOC works closely with other ontology developers to support integrated use of terminologies. The GOC supports its user community through the use of e-mail lists, social media and web-based resources.
View details for DOI 10.1093/nar/gks1050
View details for Web of Science ID 000312893300075
View details for PubMedCentralID PMC3531070
-
In the beginning there was babble ...
AUTOPHAGY
2012; 8 (8): 1165-1167
View details for DOI 10.4161/auto.20665
View details for Web of Science ID 000308505200001
View details for PubMedID 22836666
View details for PubMedCentralID PMC3625114
-
YeastMine-an integrated data warehouse for Saccharomyces cerevisiae data as a multipurpose tool-kit
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2012
Abstract
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) provides high-quality curated genomic, genetic, and molecular information on the genes and their products of the budding yeast Saccharomyces cerevisiae. To accommodate the increasingly complex, diverse needs of researchers for searching and comparing data, SGD has implemented InterMine (http://www.InterMine.org), an open source data warehouse system with a sophisticated querying interface, to create YeastMine (http://yeastmine.yeastgenome.org). YeastMine is a multifaceted search and retrieval environment that provides access to diverse data types. Searches can be initiated with a list of genes, a list of Gene Ontology terms, or lists of many other data types. The results from queries can be combined for further analysis and saved or downloaded in customizable file formats. Queries themselves can be customized by modifying predefined templates or by creating a new template to access a combination of specific data types. YeastMine offers multiple scenarios in which it can be used such as a powerful search interface, a discovery tool, a curation aid and also a complex database presentation format. DATABASE URL: http://yeastmine.yeastgenome.org.
View details for DOI 10.1093/database/bar062
View details for Web of Science ID 000304923700001
View details for PubMedID 22434830
View details for PubMedCentralID PMC3308152
-
Considerations for creating and annotating the budding yeast Genome Map at SGD: a progress report
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2012
Abstract
The Saccharomyces Genome Database (SGD) is compiling and annotating a comprehensive catalogue of functional sequence elements identified in the budding yeast genome. Recent advances in deep sequencing technologies have enabled, for example, global analyses of transcription profiling and assembly of maps of transcription factor occupancy and higher order chromatin organization, at nucleotide level resolution. With this growing influx of published genome-scale data come new challenges for their storage, display, analysis and integration. Here, we describe SGD's progress in the creation of a consolidated resource for genome sequence elements in the budding yeast, the considerations taken in its design and the lessons learned thus far. The data within this collection can be accessed at http://browse.yeastgenome.org and downloaded from http://downloads.yeastgenome.org. DATABASE URL: http://www.yeastgenome.org.
View details for DOI 10.1093/database/bar057
View details for Web of Science ID 000304922200001
View details for PubMedID 22434826
View details for PubMedCentralID PMC3308148
-
CvManGO, a method for leveraging computational predictions to improve literature-based Gene Ontology annotations
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2012
Abstract
The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method will identify genes that require annotation updates, taking an important step towards finding ways to prioritize literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets to find in particular those genes that are missing literature-supported annotations, but our survey found that there are no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation. DATABASE URL: http://www.yeastgenome.org.
View details for DOI 10.1093/database/bas001
View details for Web of Science ID 000304919800001
View details for PubMedID 22434836
View details for PubMedCentralID PMC3308158
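The pairing step described in the abstract above can be illustrated with a minimal sketch. The abstract does not give CvManGO's actual rules, so everything here is an assumption: the toy ontology, the term identifiers, the category names and the choice of which categories trigger review are all hypothetical.

```python
# Toy ontology (hypothetical IDs): each term maps to its direct parents.
PARENTS = {
    "GO:root": set(),
    "GO:binding": {"GO:root"},
    "GO:dna_binding": {"GO:binding"},
    "GO:catalysis": {"GO:root"},
}


def ancestors(term):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def relate(manual, computational):
    """Classify how a manual term and a computational term relate in the graph."""
    if manual == computational:
        return "same"
    if manual in ancestors(computational):
        return "computational_more_specific"
    if computational in ancestors(manual):
        return "manual_more_specific"
    return "different_lineage"


def needs_review(manual, computational):
    """Assumed rule: flag pairs where the prediction adds or contradicts detail."""
    return relate(manual, computational) in {
        "computational_more_specific",
        "different_lineage",
    }


if __name__ == "__main__":
    print(relate("GO:binding", "GO:dna_binding"))
    print(needs_review("GO:dna_binding", "GO:binding"))
```

In this sketch a gene would be queued for literature review when its computational prediction is deeper in the graph than the manual annotation, or sits on a branch the manual annotation does not cover.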
-
Saccharomyces Genome Database: the genomics resource of budding yeast.
Nucleic acids research
2012; 40 (Database issue): D700-5
Abstract
The Saccharomyces Genome Database (SGD, http://www.yeastgenome.org) is the community resource for the budding yeast Saccharomyces cerevisiae. The SGD project provides the highest-quality manually curated information from peer-reviewed literature. The experimental results reported in the literature are extracted and integrated within a well-developed database. These data are combined with quality high-throughput results and provided through Locus Summary pages, a powerful query engine and rich genome browser. The acquisition, integration and retrieval of these data allow SGD to facilitate experimental design and analysis by providing an encyclopedia of the yeast genome, its chromosomal features, their functions and interactions. Public access to these data is provided to researchers and educators via web pages designed for optimal ease of use.
View details for DOI 10.1093/nar/gkr1029
View details for PubMedID 22110037
View details for PubMedCentralID PMC3245034
-
The Gene Ontology: enhancements for 2011
NUCLEIC ACIDS RESEARCH
2012; 40 (D1): D559-D564
Abstract
The Gene Ontology (GO) (http://www.geneontology.org) is a community bioinformatics resource that represents gene product function through the use of structured, controlled vocabularies. The number of GO annotations of gene products has increased due to curation efforts among GO Consortium (GOC) groups, including focused literature-based annotation and ortholog-based functional inference. The GO ontologies continue to expand and improve as a result of targeted ontology development, including the introduction of computable logical definitions and development of new tools for the streamlined addition of terms to the ontology. The GOC continues to support its user community through the use of e-mail lists, social media and web-based resources.
View details for DOI 10.1093/nar/gkr1028
View details for Web of Science ID 000298601300084
View details for PubMedCentralID PMC3245151
-
Toward an interactive article: integrating journals and biological databases
BMC BIOINFORMATICS
2011; 12
Abstract
Journal articles and databases are two major modes of communication in the biological sciences, and thus integrating these critical resources is of urgent importance to increase the pace of discovery. Projects focused on bridging the gap between journals and databases have been on the rise over the last five years and have resulted in the development of automated tools that can recognize entities within a document and link those entities to a relevant database. Unfortunately, automated tools cannot resolve ambiguities that arise from one term being used to signify entities that are quite distinct from one another. Instead, resolving these ambiguities requires some manual oversight. Finding the right balance between the speed and portability of automation and the accuracy and flexibility of manual effort is a crucial goal to making text markup a successful venture. We have established a journal article mark-up pipeline that links GENETICS journal articles and the model organism database (MOD) WormBase. This pipeline uses a lexicon built with entities from the database as a first step. The entity markup pipeline results in links from over nine classes of objects including genes, proteins, alleles, phenotypes and anatomical terms. New entities and ambiguities are discovered and resolved by a database curator through a manual quality control (QC) step, along with help from authors via a web form that is provided to them by the journal. New entities discovered through this pipeline are immediately sent to an appropriate curator at the database. Ambiguous entities that do not automatically resolve to one link are resolved by hand ensuring an accurate link. This pipeline has been extended to other databases, namely Saccharomyces Genome Database (SGD) and FlyBase, and has been implemented in marking up a paper with links to multiple databases. Our semi-automated pipeline hyperlinks articles published in GENETICS to model organism databases such as WormBase. Our pipeline results in interactive articles that are data rich with high accuracy. The use of a manual quality control step sets this pipeline apart from other hyperlinking tools and results in benefits to authors, journals, readers and databases.
View details for DOI 10.1186/1471-2105-12-175
View details for Web of Science ID 000293000700001
View details for PubMedID 21595960
-
Using computational predictions to improve literature-based Gene Ontology annotations: a feasibility study
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2011
Abstract
Annotation using Gene Ontology (GO) terms is one of the most important ways in which biological information about specific gene products can be expressed in a searchable, computable form that may be compared across genomes and organisms. Because literature-based GO annotations are often used to propagate functional predictions between related proteins, their accuracy is critically important. We present a strategy that employs a comparison of literature-based annotations with computational predictions to identify and prioritize genes whose annotations need review. Using this method, we show that comparison of manually assigned 'unknown' annotations in the Saccharomyces Genome Database (SGD) with InterPro-based predictions can identify annotations that need to be updated. A survey of literature-based annotations and computational predictions made by the Gene Ontology Annotation (GOA) project at the European Bioinformatics Institute (EBI) across several other databases shows that this comparison strategy could be used to maintain and improve the quality of GO annotations for other organisms besides yeast. The survey also shows that although GOA-assigned predictions are the most comprehensive source of functional information for many genomes, a large proportion of genes in a variety of different organisms entirely lack these predictions but do have manual annotations. This underscores the critical need for manually performed, literature-based curation to provide functional information about genes that are outside the scope of widely used computational methods. Thus, the combination of manual and computational methods is essential to provide the most accurate and complete functional annotation of a genome. Database URL: http://www.yeastgenome.org.
View details for DOI 10.1093/database/bar004
View details for Web of Science ID 000299630600010
View details for PubMedID 21411447
View details for PubMedCentralID PMC3067894
-
Towards BioDBcore: a community-defined information specification for biological databases
NUCLEIC ACIDS RESEARCH
2011; 39: D7-D10
Abstract
The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases.
View details for DOI 10.1093/nar/gkq1173
View details for Web of Science ID 000285831700002
View details for PubMedID 21097465
-
Saccharomyces Genome Database provides mutant phenotype data
NUCLEIC ACIDS RESEARCH
2010; 38: D433-D436
Abstract
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is a scientific database for the molecular biology and genetics of the yeast Saccharomyces cerevisiae, which is commonly known as baker's or budding yeast. The information in SGD includes functional annotations, mapping and sequence information, protein domains and structure, expression data, mutant phenotypes, physical and genetic interactions and the primary literature from which these data are derived. Here we describe how published phenotypes and genetic interaction data are annotated and displayed in SGD.
View details for DOI 10.1093/nar/gkp917
View details for Web of Science ID 000276399100068
View details for PubMedID 19906697
View details for PubMedCentralID PMC2808950
-
The Gene Ontology in 2010: extensions and refinements (The Gene Ontology Consortium)
NUCLEIC ACIDS RESEARCH
2010; 38: D331-D335
View details for DOI 10.1093/nar/gkp1018
View details for Web of Science ID 000276399100051
-
The Gene Ontology's Reference Genome Project: A Unified Framework for Functional Annotation across Species
PLOS COMPUTATIONAL BIOLOGY
2009; 5 (7)
Abstract
The Gene Ontology (GO) is a collaborative effort that provides structured vocabularies for annotating the molecular function, biological role, and cellular location of gene products in a highly systematic way and in a species-neutral manner with the aim of unifying the representation of gene function across different organisms. Each contributing member of the GO Consortium independently associates GO terms to gene products from the organism(s) they are annotating. Here we introduce the Reference Genome project, which brings together those independent efforts into a unified framework based on the evolutionary relationships between genes in these different organisms. The Reference Genome project has two primary goals: to increase the depth and breadth of annotations for genes in each of the organisms in the project, and to create data sets and tools that enable other genome annotation efforts to infer GO annotations for homologous genes in their organisms. In addition, the project has several important incidental benefits, such as increasing annotation consistency across genome databases, and providing important improvements to the GO's logical structure and biological content.
View details for DOI 10.1371/journal.pcbi.1000431
View details for Web of Science ID 000269220100031
View details for PubMedID 19578431
View details for PubMedCentralID PMC2699109
-
Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns
TRENDS IN MICROBIOLOGY
2009; 17 (7): 286-294
Abstract
The quest to characterize each of the genes of the yeast Saccharomyces cerevisiae has propelled the development and application of novel high-throughput (HTP) experimental techniques. To handle the enormous amount of information generated by these techniques, new bioinformatics tools and resources are needed. Gene Ontology (GO) annotations curated by the Saccharomyces Genome Database (SGD) have facilitated the development of algorithms that analyze HTP data and help predict functions for poorly characterized genes in S. cerevisiae and other organisms. Here, we describe how published results are incorporated into GO annotations at SGD and why researchers can benefit from using these resources wisely to analyze their HTP data and predict gene functions.
View details for DOI 10.1016/j.tim.2009.04.005
View details for Web of Science ID 000268616600005
View details for PubMedID 19577472
View details for PubMedCentralID PMC3057094
-
New mutant phenotype data curation system in the Saccharomyces Genome Database.
Database : the journal of biological databases and curation
2009; 2009: bap001
Abstract
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) organizes and displays molecular and genetic information about the genes and proteins of baker's yeast, Saccharomyces cerevisiae. Mutant phenotype screens have been the starting point for a large proportion of yeast molecular biological studies, and are still used today to elucidate the functions of uncharacterized genes and discover new roles for previously studied genes. To greatly facilitate searching and comparison of mutant phenotypes across genes, we have devised a new controlled-vocabulary system for capturing phenotype information. Each phenotype annotation is represented as an 'observable', which is the entity or process that is observed, and a 'qualifier' that describes the change in that entity or process in the mutant (e.g. decreased, increased, or abnormal). Additional information about the mutant, such as strain background, allele name, conditions under which the phenotype is observed, or the identity of relevant chemicals, is captured in separate fields. For each gene, a summary of the mutant phenotype information is displayed on the Locus Summary page, and the complete information is displayed in tabular format on the Phenotype Details Page. All of the information is searchable and may also be downloaded in bulk using SGD's Batch Download Tool or Download Data Files Page. In the future, phenotypes will be integrated with other curated data to allow searching across different types of functional information, such as genetic and physical interaction data and Gene Ontology annotations. Database URL: http://www.yeastgenome.org/
View details for DOI 10.1093/database/bap001
View details for PubMedID 20157474
View details for PubMedCentralID PMC2790299
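The observable/qualifier representation described in the abstract above can be sketched as a small record type. This is only an illustration of the idea, not SGD's schema: the class name, field names, gene names and example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhenotypeAnnotation:
    """One phenotype annotation: what was observed and how it changed."""
    gene: str
    observable: str          # entity or process that is observed
    qualifier: str           # change in the mutant, e.g. "decreased"
    # Additional details are kept in separate optional fields.
    strain_background: Optional[str] = None
    allele: Optional[str] = None
    condition: Optional[str] = None
    chemical: Optional[str] = None

    def summary(self) -> str:
        """Short 'observable: qualifier' form, as might appear in a summary view."""
        return f"{self.observable}: {self.qualifier}"


def genes_with_observable(annotations, observable):
    """Return the sorted, de-duplicated genes annotated to one observable."""
    return sorted({a.gene for a in annotations if a.observable == observable})


# Hypothetical example data.
EXAMPLES = [
    PhenotypeAnnotation("GENE2", "heat sensitivity", "increased", condition="37 C"),
    PhenotypeAnnotation("GENE1", "heat sensitivity", "decreased"),
    PhenotypeAnnotation("GENE1", "cell size", "abnormal", allele="gene1-1"),
]

if __name__ == "__main__":
    print(EXAMPLES[0].summary())
    print(genes_with_observable(EXAMPLES, "heat sensitivity"))
```

Keeping the observable and qualifier as separate controlled fields is what makes a cross-gene query such as `genes_with_observable` a simple equality test rather than a free-text search.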
-
New mutant phenotype data curation system in the Saccharomyces Genome Database
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2009
View details for DOI 10.1093/database/bap001
View details for Web of Science ID 000208191300001
-
Gene Ontology annotations at SGD: new data sources and annotation methods
NUCLEIC ACIDS RESEARCH
2008; 36: D577-D581
Abstract
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) collects and organizes biological information about the chromosomal features and gene products of the budding yeast Saccharomyces cerevisiae. Although published data from traditional experimental methods are the primary sources of evidence supporting Gene Ontology (GO) annotations for a gene product, high-throughput experiments and computational predictions can also provide valuable insights in the absence of an extensive body of literature. Therefore, GO annotations available at SGD now include high-throughput data as well as computational predictions provided by the GO Annotation Project (GOA UniProt; http://www.ebi.ac.uk/GOA/). Because the annotation method used to assign GO annotations varies by data source, GO resources at SGD have been modified to distinguish data sources and annotation methods. In addition to providing information for genes that have not been experimentally characterized, GO annotations from independent sources can be compared to those made by SGD to help keep the literature-based GO annotations current.
View details for DOI 10.1093/nar/gkm909
View details for Web of Science ID 000252545400104
View details for PubMedID 17982175
View details for PubMedCentralID PMC2238894
-
Combining guilt-by-association and guilt-by-profiling to predict Saccharomyces cerevisiae gene function
GENOME BIOLOGY
2008; 9
Abstract
Learning the function of genes is a major goal of computational genomics. Methods for inferring gene function have typically fallen into two categories: 'guilt-by-profiling', which exploits correlation between function and other gene characteristics; and 'guilt-by-association', which transfers function from one gene to another via biological relationships. We have developed a strategy ('Funckenstein') that performs guilt-by-profiling and guilt-by-association and combines the results. Using a benchmark set of functional categories and input data for protein-coding genes in Saccharomyces cerevisiae, Funckenstein was compared with a previous combined strategy. Subsequently, we applied Funckenstein to 2,455 Gene Ontology terms. In the process, we developed 2,455 guilt-by-profiling classifiers based on 8,848 gene characteristics and 12 functional linkage graphs based on 23 biological relationships. Funckenstein outperforms a previous combined strategy using a common benchmark dataset. The combination of 'guilt-by-profiling' and 'guilt-by-association' gave significant improvement over the component classifiers, showing the greatest synergy for the most specific functions. Performance was evaluated by cross-validation and by literature examination of the top-scoring novel predictions. These quantitative predictions should help prioritize experimental study of yeast gene functions.
View details for Web of Science ID 000278173500007
View details for PubMedID 18613951
-
The Gene Ontology project in 2008
NUCLEIC ACIDS RESEARCH
2008; 36: D440-D444
Abstract
The Gene Ontology (GO) project (http://www.geneontology.org/) provides a set of structured, controlled vocabularies for community use in annotating genes, gene products and sequences (also see http://www.sequenceontology.org/). The ontologies have been extended and refined for several biological areas, and improvements to the structure of the ontologies have been implemented. To improve the quantity and quality of gene product annotations available from its public repository, the GO Consortium has launched a focused effort to provide comprehensive and detailed annotation of orthologous genes across a number of 'reference' genomes, including human and several key model organisms. Software developments include two releases of the ontology-editing tool OBO-Edit, and improvements to the AmiGO browser interface.
View details for DOI 10.1093/nar/gkm883
View details for Web of Science ID 000252545400079
View details for PubMedID 17984083
View details for PubMedCentralID PMC2238979
-
Mining experimental evidence of molecular function claims from the literature
BIOINFORMATICS
2007; 23 (23): 3232-3240
Abstract
The rate at which gene-related findings appear in the scientific literature makes it difficult if not impossible for biomedical scientists to keep fully informed and up to date. The importance of these findings argues for the development of automated methods that can find, extract and summarize this information. This article reports on methods for determining the molecular function claims that are being made in a scientific article, specifically those that are backed by experimental evidence. The most significant result is that for molecular function claims based on direct assays, our methods achieved recall of 70.7% and precision of 65.7%. Furthermore, our methods correctly identified in the text 44.6% of the specific molecular function claims backed up by direct assays, but with a precision of only 0.92%, a disappointing outcome that led to an examination of the different kinds of errors. These results were based on an analysis of 1823 articles from the literature of Saccharomyces cerevisiae (budding yeast). The annotation files for S.cerevisiae are available from ftp://genome-ftp.stanford.edu/pub/yeast/data_download/literature_curation/gene_association.sgd.gz. The draft protocol vocabulary is available by request from the first author.
View details for DOI 10.1093/bioinformatics/btm495
View details for Web of Science ID 000251334800017
View details for PubMedID 17942445
-
The Saccharomyces Genome Database provides comprehensive information about the biology of S-cerevisiae and tools for studies in comparative genomics
Experimental Biology 2007 Annual Meeting
FEDERATION AMER SOC EXP BIOL. 2007: A264–A264
View details for Web of Science ID 000245708502115
-
Tetrahymena genome database (TGD): a resource for comparative studies with a model protist.
WILEY-BLACKWELL PUBLISHING, INC. 2007: 54S–54S
View details for Web of Science ID 000245312600169
-
Expanded protein information at SGD: new pages and proteome browser
NUCLEIC ACIDS RESEARCH
2007; 35: D468-D471
Abstract
The recent explosion in protein data generated from both directed small-scale studies and large-scale proteomics efforts has greatly expanded the quantity of available protein information and has prompted the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) to enhance the depth and accessibility of protein annotations. In particular, we have expanded ongoing efforts to improve the integration of experimental information and sequence-based predictions and have redesigned the protein information web pages. A key feature of this redesign is the development of a GBrowse-derived interactive Proteome Browser customized to improve the visualization of sequence-based protein information. This Proteome Browser has enabled SGD to unify the display of hidden Markov model (HMM) domains, protein family HMMs, motifs, transmembrane regions, signal peptides, hydropathy plots and profile hits using several popular prediction algorithms. In addition, a physico-chemical properties page has been introduced to provide easy access to basic protein information. Improvements to the layout of the Protein Information page and integration of the Proteome Browser will facilitate the ongoing expansion of sequence-specific experimental information captured in SGD, including post-translational modifications and other user-defined annotations. Finally, SGD continues to improve upon the availability of genetic and physical interaction data in an ongoing collaboration with BioGRID by providing direct access to more than 82,000 manually-curated interactions.
View details for DOI 10.1093/nar/gkl931
View details for Web of Science ID 000243494600095
View details for PubMedID 17142221
View details for PubMedCentralID PMC1669759
-
Saccharomyces cerevisiae S288C genome annotation: a working hypothesis
YEAST
2006; 23 (12): 857-865
Abstract
The S. cerevisiae genome is the most well-characterized eukaryotic genome and one of the simplest in terms of identifying open reading frames (ORFs), yet its primary annotation has been updated continually in the decade since its initial release in 1996 (Goffeau et al., 1996). The Saccharomyces Genome Database (SGD; www.yeastgenome.org) (Hirschman et al., 2006), the community-designated repository for this reference genome, strives to ensure that the S. cerevisiae annotation is as accurate and useful as possible. At SGD, the S. cerevisiae genome sequence and annotation are treated as a working hypothesis, which must be repeatedly tested and refined. In this paper, in celebration of the tenth anniversary of the completion of the S. cerevisiae genome sequence, we discuss the ways in which the S. cerevisiae sequence and annotation have changed, consider the multiple sources of experimental and comparative data on which these changes are based, and describe our methods for evaluating, incorporating and documenting these new data.
View details for DOI 10.1002/yea.1400
View details for Web of Science ID 000242009800002
View details for PubMedID 17001629
View details for PubMedCentralID PMC3040122
-
Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote
PLOS BIOLOGY
2006; 4 (9): 1620-1642
Abstract
The ciliate Tetrahymena thermophila is a model organism for molecular and cellular biology. Like other ciliates, this species has separate germline and soma functions that are embodied by distinct nuclei within a single cell. The germline-like micronucleus (MIC) has its genome held in reserve for sexual reproduction. The soma-like macronucleus (MAC), which possesses a genome processed from that of the MIC, is the center of gene expression and does not directly contribute DNA to sexual progeny. We report here the shotgun sequencing, assembly, and analysis of the MAC genome of T. thermophila, which is approximately 104 Mb in length and composed of approximately 225 chromosomes. Overall, the gene set is robust, with more than 27,000 predicted protein-coding genes, 15,000 of which have strong matches to genes in other organisms. The functional diversity encoded by these genes is substantial and reflects the complexity of processes required for a free-living, predatory, single-celled organism. This is highlighted by the abundance of lineage-specific duplications of genes with predicted roles in sensing and responding to environmental conditions (e.g., kinases), using diverse resources (e.g., proteases and transporters), and generating structural complexity (e.g., kinesins and dyneins). In contrast to the other lineages of alveolates (apicomplexans and dinoflagellates), no compelling evidence could be found for plastid-derived genes in the genome. UGA, the only T. thermophila stop codon, is used in some genes to encode selenocysteine, thus making this organism the first known with the potential to translate all 64 codons in nuclear genes into amino acids. We present genomic evidence supporting the hypothesis that the excision of DNA from the MIC to generate the MAC specifically targets foreign DNA as a form of genome self-defense. The combination of the genome sequence, the functional diversity encoded therein, and the presence of some pathways missing from other model organisms makes T. thermophila an ideal model for functional genomic studies to address biological, biomedical, and biotechnological questions of fundamental importance.
View details for DOI 10.1371/journal.pbio.0040286
View details for Web of Science ID 000240740900012
View details for PubMedID 16933976
View details for PubMedCentralID PMC1557398
-
Genome Snapshot: a new resource at the Saccharomyces Genome Database (SGD) presenting an overview of the Saccharomyces cerevisiae genome
NUCLEIC ACIDS RESEARCH
2006; 34: D442-D445
Abstract
Sequencing and annotation of the entire Saccharomyces cerevisiae genome has made it possible to gain a genome-wide perspective on yeast genes and gene products. To make this information available on an ongoing basis, the Saccharomyces Genome Database (SGD) (http://www.yeastgenome.org/) has created the Genome Snapshot (http://db.yeastgenome.org/cgi-bin/genomeSnapShot.pl). The Genome Snapshot summarizes the current state of knowledge about the genes and chromosomal features of S.cerevisiae. The information is organized into two categories: (i) number of each type of chromosomal feature annotated in the genome and (ii) number and distribution of genes annotated to Gene Ontology terms. Detailed lists are accessible through SGD's Advanced Search tool (http://db.yeastgenome.org/cgi-bin/search/featureSearch), and all the data presented on this page are available from the SGD ftp site (ftp://ftp.yeastgenome.org/yeast/).
View details for DOI 10.1093/nar/gkj117
View details for Web of Science ID 000239307700097
View details for PubMedID 16381907
View details for PubMedCentralID PMC1347479
-
Tetrahymena Genome Database (TGD): a new genomic resource for Tetrahymena thermophila research
NUCLEIC ACIDS RESEARCH
2006; 34: D500-D503
Abstract
We have developed a web-based resource (available at www.ciliate.org) for researchers studying the model ciliate organism Tetrahymena thermophila. Employing the underlying database structure and programming of the Saccharomyces Genome Database, the Tetrahymena Genome Database (TGD) integrates the wealth of knowledge generated by the Tetrahymena research community about genome structure, genes and gene products with the newly sequenced macronuclear genome determined by The Institute for Genomic Research (TIGR). TGD provides information curated from the literature about each published gene, including a standardized gene name, a link to the genomic locus in our graphical genome browser, gene product annotations utilizing the Gene Ontology, links to published literature about the gene and more. TGD also displays automatic annotations generated for the gene models predicted by TIGR. A variety of tools are available at TGD for searching the Tetrahymena genome, its literature and information about members of the research community.
View details for DOI 10.1093/nar/gkj054
View details for Web of Science ID 000239307700109
View details for PubMedID 16381920
View details for PubMedCentralID PMC1347417
-
The Gene Ontology (GO) project in 2006
NUCLEIC ACIDS RESEARCH
2006; 34: D322-D326
Abstract
The Gene Ontology (GO) project (http://www.geneontology.org) develops and uses a set of structured, controlled vocabularies for community use in annotating genes, gene products and sequences (also see http://song.sourceforge.net/). The GO Consortium continues to improve the vocabulary content, reflecting the impact of several novel mechanisms of incorporating community input. A growing number of model organism databases and genome annotation groups contribute annotation sets using GO terms to GO's public repository. Updates to the AmiGO browser have improved access to contributed genome annotations. As the GO project continues to grow, the use of the GO vocabularies is becoming more varied as well as more widespread. The GO project provides an ontological annotation system that enables biologists to infer knowledge from large amounts of data.
View details for DOI 10.1093/nar/gkj021
View details for Web of Science ID 000239307700070
View details for PubMedID 16381878
-
PatMatch: a program for finding patterns in peptide and nucleotide sequences
NUCLEIC ACIDS RESEARCH
2005; 33: W262-W266
Abstract
Here, we present PatMatch, an efficient, web-based pattern-matching program that enables searches for short nucleotide or peptide sequences such as cis-elements in nucleotide sequences or small domains and motifs in protein sequences. The program can be used to find matches to a user-specified sequence pattern that can be described using ambiguous sequence codes and a powerful and flexible pattern syntax based on regular expressions. A recent upgrade has improved performance and now supports both mismatches and wildcards in a single pattern. This enhancement has been achieved by replacing the previous searching algorithm, scan_for_matches [D'Souza et al. (1997), Trends in Genetics, 13, 497-498], with nondeterministic-reverse grep (NR-grep), a general pattern matching tool that allows for approximate string matching [Navarro (2001), Software Practice and Experience, 31, 1265-1312]. We have tailored NR-grep to be used for DNA and protein searches with PatMatch. The stand-alone version of the software can be adapted for use with any sequence dataset and is available for download at The Arabidopsis Information Resource (TAIR) at ftp://ftp.arabidopsis.org/home/tair/Software/Patmatch/. The PatMatch server is available on the web at http://www.arabidopsis.org/cgi-bin/patmatch/nph-patmatch.pl for searching Arabidopsis thaliana sequences.
View details for DOI 10.1093/nar/gki368
View details for Web of Science ID 000230271400050
View details for PubMedID 15980466
View details for PubMedCentralID PMC1160129
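The idea of describing a sequence pattern with ambiguous sequence codes and regular expressions, as in the abstract above, can be sketched briefly. This is not PatMatch's syntax or the NR-grep algorithm: it uses Python's standard `re` module, handles exact matches only (no mismatches), and the function names are hypothetical. The code table is the standard IUPAC nucleotide ambiguity alphabet.

```python
import re

# Standard IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pattern_to_regex(pattern):
    """Translate an IUPAC pattern such as 'TATAWA' into a regex string."""
    return "".join(IUPAC[code] for code in pattern.upper())


def find_matches(pattern, sequence):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # A lookahead with a capture group reports overlapping hits.
    lookahead = re.compile(f"(?=({pattern_to_regex(pattern)}))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence.upper())]


if __name__ == "__main__":
    for start, text in find_matches("TATAWA", "GGTATAAAGGTATATACC"):
        print(start, text)
```

A tool that also tolerates mismatches, as the abstract describes, would need an approximate-matching algorithm in place of the plain regex search used here.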
-
Inference of combinatorial regulation in yeast transcriptional networks: A case study of sporulation
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2005; 102 (6): 1998-2003
Abstract
Decomposing transcriptional regulatory networks into functional modules and determining logical relations between them is the first step toward understanding transcriptional regulation at the system level. Modules based on analysis of genome-scale data can serve as the basis for inferring combinatorial regulation and for building mathematical models to quantitatively describe the behavior of the networks. We present here an algorithm called modem to identify target genes of a transcription factor (TF) from a single expression experiment, based on a joint probabilistic model for promoter sequence and gene expression data. We show how this method can facilitate the discovery of specific instances of combinatorial regulation and illustrate this for a specific case of transcriptional networks that regulate sporulation in the yeast Saccharomyces cerevisiae. Applying this method to analyze two crucial TFs in sporulation, Ndt80p and Sum1p, we were able to delineate their overlapping binding sites. We proposed a mechanistic model for the competitive regulation by the two TFs on a defined subset of sporulation genes. We show that this model accounts for the temporal control of the "middle" sporulation genes and suggest a similar regulatory arrangement can be found in developmental programs in higher organisms.
View details for Web of Science ID 000227072900037
View details for PubMedID 15684073
-
Fungal BLAST and Model Organism BLASTP Best Hits: new comparison resources at the Saccharomyces Genome Database (SGD)
NUCLEIC ACIDS RESEARCH
2005; 33: D374-D377
Abstract
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/) is a scientific database of gene, protein and genomic information for the yeast Saccharomyces cerevisiae. SGD has recently developed two new resources that facilitate nucleotide and protein sequence comparisons between S.cerevisiae and other organisms. The Fungal BLAST tool provides directed searches against all fungal nucleotide and protein sequences available from GenBank, divided into categories according to organism, status of completeness and annotation, and source. The Model Organism BLASTP Best Hits resource displays, for each S.cerevisiae protein, the single most similar protein from several model organisms and presents links to the database pages of those proteins, facilitating access to curated information about potential orthologs of yeast proteins.
View details for DOI 10.1093/nar/gki023
View details for Web of Science ID 000226524300077
View details for PubMedID 15608219
View details for PubMedCentralID PMC539977
-
GO::TermFinder - open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes
BIOINFORMATICS
2004; 20 (18): 3710-3715
Abstract
GO::TermFinder comprises a set of object-oriented Perl modules for accessing Gene Ontology (GO) information and evaluating and visualizing the collective annotation of a list of genes to GO terms. It can be used to draw conclusions from microarray and other biological data, calculating the statistical significance of each annotation. GO::TermFinder can be used on any system on which Perl can be run, either as a command line application, in single or batch mode, or as a web-based CGI script. The full source code and documentation for GO::TermFinder are freely available from http://search.cpan.org/dist/GO-TermFinder/.
View details for DOI 10.1093/bioinformatics/bth456
View details for Web of Science ID 000225786600064
View details for PubMedID 15297299
View details for PubMedCentralID PMC3037731
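The statistical significance mentioned in the abstract can be illustrated with a hypergeometric tail probability, a common choice for term enrichment. The sketch below is written in Python rather than Perl and is not the GO::TermFinder API; the function name `enrichment_p_value` and its parameters are hypothetical, and the use of an uncorrected hypergeometric test is an assumption made for illustration.

```python
from math import comb


def enrichment_p_value(background: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes in the sample.

    background: total number of genes considered
    annotated:  genes in the background annotated to the GO term
    sample:     size of the gene list of interest
    hits:       genes in the list annotated to the GO term
    """
    upper = min(annotated, sample)
    total = comb(background, sample)
    # Sum the hypergeometric probability mass from `hits` up to the maximum.
    tail = sum(
        comb(annotated, i) * comb(background - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical numbers: 6000 genes, 40 annotated to a term,
    # a list of 50 genes of which 5 carry the annotation.
    print(enrichment_p_value(6000, 40, 50, 5))
```

Because many terms are tested for one gene list, a real analysis would also apply a multiple-testing correction, which this sketch leaves out.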
-
Saccharomyces genome database: Underlying principles and organisation
BRIEFINGS IN BIOINFORMATICS
2004; 5 (1): 9-22
Abstract
A scientific database can be a powerful tool for biologists in an era where large-scale genomic analysis, combined with smaller-scale scientific results, provides new insights into the roles of genes and their products in the cell. However, the collection and assimilation of data is, in itself, not enough to make a database useful. The data must be incorporated into the database and presented to the user in an intuitive and biologically significant manner. Most importantly, this presentation must be driven by the user's point of view; that is, from a biological perspective. The success of a scientific database can therefore be measured by the response of its users - statistically, by usage numbers and, in a less quantifiable way, by its relationship with the community it serves and its ability to serve as a model for similar projects. Since its inception ten years ago, the Saccharomyces Genome Database (SGD) has seen a dramatic increase in its usage, has developed and maintained a positive working relationship with the yeast research community, and has served as a template for at least one other database. The success of SGD, as measured by these criteria, is due in large part to philosophies that have guided its mission and organisation since it was established in 1993. This paper aims to detail these philosophies and how they shape the organisation and presentation of the database.
View details for Web of Science ID 000222244300002
View details for PubMedID 15153302
-
The Gene Ontology (GO) database and informatics resource
NUCLEIC ACIDS RESEARCH
2004; 32: D258-D261
Abstract
The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences. Many model organism databases and genome annotation groups use the GO and contribute their annotation sets to the GO resource. The GO database integrates the vocabularies and contributed annotations and provides full access to this information in several formats. Members of the GO Consortium continually work collectively, involving outside experts as needed, to expand and update the GO vocabularies. The GO Web resource also provides access to extensive documentation about the GO project and links to applications that use GO data for functional analyses.
View details for DOI 10.1093/nar/gkh036
View details for Web of Science ID 000188079000059
View details for PubMedID 14681407
View details for PubMedCentralID PMC308770
-
Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms
NUCLEIC ACIDS RESEARCH
2004; 32: D311-D314
Abstract
The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/), a scientific database of the molecular biology and genetics of the yeast Saccharomyces cerevisiae, has recently developed several new resources that allow the comparison and integration of information on a genome-wide scale, enabling the user not only to find detailed information about individual genes, but also to make connections across groups of genes with common features and across different species. The Fungal Alignment Viewer displays alignments of sequences from multiple fungal genomes, while the Sequence Similarity Query tool displays PSI-BLAST alignments of each S.cerevisiae protein with similar proteins from any species whose sequences are contained in the non-redundant (nr) protein data set at NCBI. The Yeast Biochemical Pathways tool integrates groups of genes by their common roles in metabolism and displays the metabolic pathways in a graphical form. Finally, the Find Chromosomal Features search interface provides a versatile tool for querying multiple types of information in SGD.
View details for DOI 10.1093/nar/gkh033
View details for Web of Science ID 000188079000073
View details for PubMedID 14681421
View details for PubMedCentralID PMC308767
-
Defining Saccharomyces genes.
21st International Conference on Yeast Genetics and Molecular Biology
WILEY-BLACKWELL. 2003: S280–S280
View details for Web of Science ID 000184161800667
-
The Community Annotation system at the Saccharomyces genome database (SGD).
21st International Conference on Yeast Genetics and Molecular Biology
WILEY-BLACKWELL. 2003: S345–S345
View details for Web of Science ID 000184161800824
-
Saccharomyces Genome Database (SGD) provides biochemical and structural information for budding yeast proteins
NUCLEIC ACIDS RESEARCH
2003; 31 (1): 216-218
Abstract
The Saccharomyces Genome Database (SGD: http://genome-www.stanford.edu/Saccharomyces/) has recently developed new resources to provide more complete information about proteins from the budding yeast Saccharomyces cerevisiae. The PDB Homologs page provides structural information from the Protein Data Bank (PDB) about yeast proteins and/or their homologs. SGD has also created a resource that utilizes the eMOTIF database for motif information about a given protein. A third new resource is the Protein Information page, which contains protein physical and chemical properties, such as molecular weight and hydropathicity scores, predicted from the translated ORF sequence.
View details for DOI 10.1093/nar/gkg054
View details for Web of Science ID 000181079700049
View details for PubMedID 12519985
View details for PubMedCentralID PMC165501
-
Gene function, metabolic pathways and comparative genomics in yeast
2nd International Computational Systems Bioinformatics Conference
IEEE COMPUTER SOC. 2003: 437–438
View details for Web of Science ID 000188997700075
-
SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data
NUCLEIC ACIDS RESEARCH
2003; 31 (1): 219-223
Abstract
The explosion in the number of functional genomic datasets generated with tools such as DNA microarrays has created a critical need for resources that facilitate the interpretation of large-scale biological data. SOURCE is a web-based database that brings together information from a broad range of resources, and provides it in a manner particularly useful for genome-scale analyses. SOURCE's GeneReports include aliases, chromosomal location, functional descriptions, GeneOntology annotations, gene expression data, and links to external databases. We curate published microarray gene expression datasets and allow users to rapidly identify sets of co-regulated genes across a variety of tissues and a large number of conditions using a simple and intuitive interface. SOURCE provides content both in gene and cDNA clone-centric pages, and thus simplifies analysis of datasets generated using cDNA microarrays. SOURCE is continuously updated and contains the most recent and accurate information available for human, mouse, and rat genes. By allowing dynamic linking to individual gene or clone reports, SOURCE facilitates browsing of large genomic datasets. Finally, SOURCE's batch interface allows rapid extraction of data for thousands of genes or clones at once and thus facilitates statistical analyses such as assessing the enrichment of functional attributes within clusters of genes. SOURCE is available at http://source.stanford.edu.
View details for DOI 10.1093/nar/gkg014
View details for Web of Science ID 000181079700050
View details for PubMedID 12519986
View details for PubMedCentralID PMC165461
-
A systematic approach to reconstructing transcription networks in Saccharomyces cerevisiae
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (26): 16893-16898
Abstract
Decomposing regulatory networks into functional modules is a first step toward deciphering the logical structure of complex networks. We propose a systematic approach to reconstructing transcription modules (defined by a transcription factor and its target genes) and identifying conditions/perturbations under which a particular transcription module is activated/deactivated. Our approach integrates information from regulatory sequences, genome-wide mRNA expression data, and functional annotation. We systematically analyzed gene expression profiling experiments in which the yeast cell was subjected to various environmental or genetic perturbations. We were able to construct transcription modules with high specificity and sensitivity for many transcription factors, and predict the activation of these modules under anticipated as well as unexpected conditions. These findings generate testable hypotheses when combined with existing knowledge on signaling pathways and protein-protein interactions. Correlating the activation of a module to a specific perturbation predicts links in the cell's regulatory networks, and examining coactivated modules suggests specific instances of crosstalk between regulatory pathways.
View details for DOI 10.1073/pnas.252638199
View details for Web of Science ID 000180101600070
View details for PubMedID 12482955
View details for PubMedCentralID PMC139240
-
Identification of unstable transcripts in Arabidopsis by cDNA microarray analysis: Rapid decay is associated with a group of touch- and specific clock-controlled genes
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (17): 11513-11518
Abstract
mRNA degradation provides a powerful means for controlling gene expression during growth, development, and many physiological transitions in plants and other systems. Rates of decay help define the steady state levels to which transcripts accumulate in the cytoplasm and determine the speed with which these levels change in response to the appropriate signals. When fast responses are to be achieved, rapid decay of mRNAs is necessary. Accordingly, genes with unstable transcripts often encode proteins that play important regulatory roles. Although detailed studies have been carried out on individual genes with unstable transcripts, there is limited knowledge regarding their nature and associations from a genomic perspective, or the physiological significance of rapid mRNA turnover in intact organisms. To address these problems, we have applied cDNA microarray analysis to identify and characterize genes with unstable transcripts in Arabidopsis thaliana (AtGUTs). Our studies showed that at least 1% of the 11,521 clones represented on Arabidopsis Functional Genomics Consortium microarrays correspond to transcripts that are rapidly degraded, with estimated half-lives of less than 60 min. AtGUTs encode proteins that are predicted to participate in a broad range of cellular processes, with transcriptional functions being over-represented relative to the whole Arabidopsis genome annotation. Analysis of public microarray expression data for these genes argues that mRNA instability is of high significance during plant responses to mechanical stimulation and is associated with specific genes controlled by the circadian clock.
View details for DOI 10.1073/pnas.152204099
View details for Web of Science ID 000177606900100
View details for PubMedID 12167669
-
Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO)
NUCLEIC ACIDS RESEARCH
2002; 30 (1): 69-72
Abstract
The Saccharomyces Genome Database (SGD) resources, ranging from genetic and physical maps to genome-wide analysis tools, reflect the scientific progress in identifying genes and their functions over the last decade. As emphasis shifts from identification of the genes to identification of the role of their gene products in the cell, SGD seeks to provide its users with annotations that will allow relationships to be made between gene products, both within Saccharomyces cerevisiae and across species. To this end, SGD is annotating genes to the Gene Ontology (GO), a structured representation of biological knowledge that can be shared across species. The GO consists of three separate ontologies describing molecular function, biological process and cellular component. The goal is to use published information to associate each characterized S.cerevisiae gene product with one or more GO terms from each of the three ontologies. To be useful, this must be done in a manner that allows accurate associations based on experimental evidence, modifications to GO when necessary, and careful documentation of the annotations through evidence codes for given citations. Reaching this goal is an ongoing process at SGD. For information on the current progress of GO annotations at SGD and other participating databases, as well as a description of each of the three ontologies, please visit the GO Consortium page at http://www.geneontology.org. SGD gene associations to GO can be found by visiting our site at http://genome-www.stanford.edu/Saccharomyces/.
View details for Web of Science ID 000173077100017
View details for PubMedID 11752257
-
Saccharomyces genome database
GUIDE TO YEAST GENETICS AND MOLECULAR AND CELL BIOLOGY, PT B
2002; 350: 329-346
View details for Web of Science ID 000176466300019
View details for PubMedID 12073322
-
Microarray data quality analysis: lessons from the AFGC project. Arabidopsis Functional Genomics Consortium.
Plant molecular biology
2002; 48 (1-2): 119-131
Abstract
Genome-wide expression profiling with DNA microarrays has provided, and will continue to provide, a great deal of data to the plant scientific community. However, reliability concerns have required the development of data quality tests for common systematic biases. Fortunately, most large-scale systematic biases are detectable and some are correctable by normalization. Technical replication experiments and statistical surveys indicate that these biases vary widely in severity and appearance. As a result, no single normalization or correction method currently available is able to address all the issues. However, careful sequence selection, array design, experimental design and experimental annotation can substantially improve the quality and biological of microarray data. In this review, we discuss these issues with reference to examples from the Arabidopsis Functional Genomics Consortium (AFGC) microarray project.
View details for PubMedID 11860205
-
Information resources at SGD: Gene Ontology, Gene Summary Paragraphs, and the Literature Guide.
WILEY-BLACKWELL. 2001: S331–S331
View details for Web of Science ID 000170442100575
-
Creating the gene ontology resource: Design and implementation
GENOME RESEARCH
2001; 11 (8): 1425-1433
Abstract
The exponential growth in the volume of accessible biological information has generated a confusion of voices surrounding the annotation of molecular information about genes and their products. The Gene Ontology (GO) project seeks to provide a set of structured vocabularies for specific biological domains that can be used to describe gene products in any organism. This work includes building three extensive ontologies to describe molecular function, biological process, and cellular component, and providing a community database resource that supports the use of these ontologies. The GO Consortium was initiated by scientists associated with three model organism databases: SGD, the Saccharomyces Genome database; FlyBase, the Drosophila genome database; and MGD/GXD, the Mouse Genome Informatics databases. Additional model organism database groups are joining the project. Each of these model organism information systems is annotating genes and gene products using GO vocabulary terms and incorporating these annotations into their respective model organism databases. Each database contributes its annotation files to a shared GO data resource accessible to the public at http://www.geneontology.org/. The GO site can be used by the community both to recover the GO vocabularies and to access the annotated gene product data sets from the model organism databases. The GO Consortium supports the development of the GO database resource and provides tools enabling curators and researchers to query and manipulate the vocabularies. We believe that the shared development of this molecular annotation resource will contribute to the unification of biological information.
View details for Web of Science ID 000170263900015
View details for PubMedID 11483584
-
Visualization of expression clusters using Sammon's non-linear mapping
BIOINFORMATICS
2001; 17 (7): 658-659
Abstract
A method of exploratory analysis and visualization of multi-dimensional gene expression data using Sammon's Non-Linear Mapping (NLM) is presented.
View details for Web of Science ID 000170249100012
View details for PubMedID 11448886
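For readers unfamiliar with Sammon's mapping, the quantity it minimizes is a stress value comparing pairwise distances in the original space with those in the low-dimensional layout. The sketch below computes that stress for a given pair of configurations; it is not the software described in the paper, the function name `sammon_stress` is hypothetical, and Euclidean distance is assumed.

```python
from itertools import combinations
from math import dist


def sammon_stress(high_dim, low_dim):
    """Sammon stress between original points and their low-dimensional images.

    Both arguments are equal-length sequences of coordinate tuples.
    Points in `high_dim` are assumed to be pairwise distinct, since each
    original distance appears in a denominator.
    """
    scale = 0.0   # sum of original pairwise distances
    error = 0.0   # sum of squared distance errors, weighted by 1/original distance
    for i, j in combinations(range(len(high_dim)), 2):
        d_star = dist(high_dim[i], high_dim[j])   # distance in the original space
        d = dist(low_dim[i], low_dim[j])          # distance in the projection
        scale += d_star
        error += (d_star - d) ** 2 / d_star
    return error / scale


if __name__ == "__main__":
    # Hypothetical expression profiles (three genes, four conditions)
    # and a candidate two-dimensional layout.
    profiles = [(1.0, 0.5, 0.2, 0.1), (0.9, 0.6, 0.1, 0.0), (0.1, 0.2, 0.8, 1.0)]
    layout = [(0.0, 0.0), (0.2, 0.0), (1.3, 0.4)]
    print(sammon_stress(profiles, layout))
```

The mapping itself is obtained by iteratively adjusting the low-dimensional coordinates to reduce this value, for example by gradient descent; a stress of zero means all pairwise distances are preserved exactly.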
-
Computer manipulation of DNA and protein sequences.
Current protocols in molecular biology / edited by Frederick M. Ausubel ... [et al.]
2001; Chapter 7: Unit7 7-?
Abstract
This unit outlines a variety of methods by which DNA sequences can be manipulated by computers. Procedures for entering sequence data into the computer and assembling raw sequence data into a contiguous sequence are described first, followed by a description of methods of analyzing and manipulating sequences--e.g., verifying sequences, constructing restriction maps, designing oligonucleotides, identifying protein-coding regions, and predicting secondary structures. This unit also provides information on the large amount of software available for sequence analysis. The appendix to this unit lists some of the commercial software, shareware, and free software related to DNA sequence manipulation. The goal of this unit is to serve as a starting point for researchers interested in utilizing the tremendous sequencing resources available to the computer-knowledgeable molecular biology laboratory.
View details for DOI 10.1002/0471142727.mb0707s30
View details for PubMedID 18265271
-
Characteristics of amino acids.
Current protocols in molecular biology / edited by Frederick M. Ausubel ... [et al.]
2001; Appendix 1: Appendix 1C-?
Abstract
This appendix presents useful basic information, including common abbreviations, useful measurements and data, characteristics of amino acids and nucleic acids, information on radioactivity and the safe use of radioisotopes and other hazardous chemicals, conversions for centrifuges and rotors, characteristics of common detergents, and common conversion factors.
View details for DOI 10.1002/0471142727.mba01cs33
View details for PubMedID 18265025
-
Genome comparisons highlight similarity and diversity within the eukaryotic kingdoms
CURRENT OPINION IN CHEMICAL BIOLOGY
2001; 5 (1): 86-89
Abstract
In 2000, the number of completely sequenced eukaryotic genomes increased to four. The addition of Drosophila and Arabidopsis into this cohort permits additional insights into the processes that have shaped evolution. Analysis and comparisons of both completed genomes and partially sequenced genomes have already shed light on mechanisms such as gene duplication and gene loss that have long been hypothesized to be major forces in speciation. Indeed, duplicate gene pairs in Saccharomyces, Arabidopsis, Caenorhabditis and Drosophila are high: 30%, 60%, 48% and 40%, respectively. Evidence of horizontal gene-transfer, thought to be a major evolutionary force in bacteria, has been found in Arabidopsis. The release of the 'first draft' of the human genome sequence in 2000 heralds a new stage of biological study. Understanding the as-yet-unannotated human genome will be largely based on conclusions, techniques and tools developed during the analysis and comparison of the genome of these four model organisms.
View details for Web of Science ID 000167051500014
View details for PubMedID 11166654
-
Saccharomyces Genome Database provides tools to survey gene expression and functional analysis data
NUCLEIC ACIDS RESEARCH
2001; 29 (1): 80-81
Abstract
Upon the completion of the Saccharomyces cerevisiae genomic sequence in 1996 [Goffeau,A. et al. (1997) Nature, 387, 5], several creative and ambitious projects have been initiated to explore the functions of gene products or gene expression on a genome-wide scale. To help researchers take advantage of these projects, the Saccharomyces Genome Database (SGD) has created two new tools, Function Junction and Expression Connection. Together, the tools form a central resource for querying multiple large-scale analysis projects for data about individual genes. Function Junction provides information from diverse projects that shed light on the role a gene product plays in the cell, while Expression Connection delivers information produced by the ever-increasing number of microarray projects. WWW access to SGD is available at genome-www.stanford.edu/Saccharomyces/.
View details for Web of Science ID 000166360300019
View details for PubMedID 11125055
-
The Stanford Microarray Database
NUCLEIC ACIDS RESEARCH
2001; 29 (1): 152-155
Abstract
The Stanford Microarray Database (SMD) stores raw and normalized data from microarray experiments, and provides web interfaces for researchers to retrieve, analyze and visualize their data. The two immediate goals for SMD are to serve as a storage site for microarray data from ongoing research at Stanford University, and to facilitate the public dissemination of that data once published, or released by the researcher. Of paramount importance is the connection of microarray data with the biological data that pertains to the DNA deposited on the microarray (genes, clones etc.). SMD makes use of many public resources to connect expression information to the relevant biology, including SGD [Ball,C.A., Dolinski,K., Dwight,S.S., Harris,M.A., Issel-Tarver,L., Kasarskis,A., Scafe,C.R., Sherlock,G., Binkley,G., Jin,H. et al. (2000) Nucleic Acids Res., 28, 77-80], YPD and WormPD [Costanzo,M.C., Hogan,J.D., Cusick,M.E., Davis,B.P., Fancher,A.M., Hodges,P.E., Kondu,P., Lengieza,C., Lew-Smith,J.E., Lingner,C. et al. (2000) Nucleic Acids Res., 28, 73-76], Unigene [Wheeler,D.L., Chappey,C., Lash,A.E., Leipe,D.D., Madden,T.L., Schuler,G.D., Tatusova,T.A. and Rapp,B.A. (2000) Nucleic Acids Res., 28, 10-14], dbEST [Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) Nature Genet., 4, 332-333] and SWISS-PROT [Bairoch,A. and Apweiler,R. (2000) Nucleic Acids Res., 28, 45-48] and can be accessed at http://genome-www.stanford.edu/microarray.
View details for Web of Science ID 000166360300039
View details for PubMedID 11125075
-
Gene Ontology: tool for the unification of biology
NATURE GENETICS
2000; 25 (1): 25-29
View details for PubMedID 10802651
-
Comparative genomics of the eukaryotes
SCIENCE
2000; 287 (5461): 2204-2215
Abstract
A comparative analysis of the genomes of Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae (and the proteins they are predicted to encode) was undertaken in the context of cellular, developmental, and evolutionary processes. The nonredundant protein sets of flies and worms are similar in size and are only twice that of yeast, but different gene families are expanded in each genome, and the multidomain proteins and signaling pathways of the fly and worm are far more complex than those of yeast. The fly has orthologs to 177 of the 289 human disease genes examined and provides the foundation for rapid analysis of some of the basic processes involved in human disease.
View details for Web of Science ID 000086049100035
View details for PubMedID 10731134
View details for PubMedCentralID PMC2754258
-
The genome sequence of Drosophila melanogaster
SCIENCE
2000; 287 (5461): 2185-2195
Abstract
The fly Drosophila melanogaster is one of the most intensively studied organisms in biology and serves as a model system for the investigation of many developmental and cellular processes common to higher eukaryotes, including humans. We have determined the nucleotide sequence of nearly all of the approximately 120-megabase euchromatic portion of the Drosophila genome using a whole-genome shotgun sequencing strategy supported by extensive clone-based sequence and a high-quality bacterial artificial chromosome physical map. Efforts are under way to close the remaining gaps; however, the sequence is of sufficient accuracy and contiguity to be declared substantially complete and to support an initial analysis of genome structure and preliminary gene annotation and interpretation. The genome encodes approximately 13,600 genes, somewhat fewer than the smaller Caenorhabditis elegans genome, but with comparable functional diversity.
View details for Web of Science ID 000086049100033
View details for PubMedID 10731132
-
Integrating functional genomic information into the Saccharomyces genome database
NUCLEIC ACIDS RESEARCH
2000; 28 (1): 77-80
Abstract
The Saccharomyces Genome Database (SGD) stores and organizes information about the nearly 6200 genes in the yeast genome. The information is organized around the 'locus page' and directs users to the detailed information they seek. SGD is endeavoring to integrate the existing information about yeast genes with the large volume of data generated by functional analyses that are beginning to appear in the literature and on web sites. New features will include searches of systematic analyses and Gene Summary Paragraphs that succinctly review the literature for each gene. In addition to current information, such as gene product and phenotype descriptions, the new locus page will also describe a gene product's cellular process, function and localization using a controlled vocabulary developed in collaboration with two other model organism databases. We describe these developments in SGD through the newly reorganized locus page. The SGD is accessible via the WWW at http://genome-www.stanford.edu/Saccharomyces/
View details for Web of Science ID 000084896300020
View details for PubMedID 10592186
-
Gene Ontology: a controlled vocabulary to describe the function, biological process and cellular location of gene products in genome databases.
CELL PRESS. 1999: A419–A419
View details for Web of Science ID 000082879802373
-
Unified display of Arabidopsis thaliana physical maps from AtDB, the A.thaliana database
NUCLEIC ACIDS RESEARCH
1999; 27 (1): 79-84
Abstract
In the past several years, there has been a tremendous effort to construct physical maps and to sequence the genome of Arabidopsis thaliana. As a result, four of the five chromosomes are completely covered by overlapping clones except at the centromeric and nucleolus organizer regions (NOR). In addition, over 30% of the genome has been sequenced and completion is anticipated by the end of the year 2000. Despite these accomplishments, the physical maps are provided in many formats on laboratories' Web sites. These data are thus difficult to obtain in a coherent manner for researchers. To alleviate this problem, AtDB (Arabidopsis thaliana DataBase, URL: http://genome-www.stanford.edu/Arabidopsis/) has constructed a unified display of the physical maps where all publicly available physical-map data for all chromosomes are presented through the Web in a clickable, 'on-the-fly' graphic, created by CGI programs that directly consult our relational database.
View details for Web of Science ID 000077983000018
View details for PubMedID 9847147
-
Using the Saccharomyces Genome Database (SGD) for analysis of protein similarities and structure
NUCLEIC ACIDS RESEARCH
1999; 27 (1): 74-78
Abstract
The Saccharomyces Genome Database (SGD) collects and organizes information about the molecular biology and genetics of the yeast Saccharomyces cerevisiae. The latest protein structure and comparison tools available at SGD are presented here. With the completion of the yeast sequence and the Caenorhabditis elegans sequence soon to follow, comparison of proteins from complete eukaryotic proteomes will be an extremely powerful way to learn more about a particular protein's structure, its function, and its relationships with other proteins. SGD can be accessed through the World Wide Web at http://genome-www.stanford.edu/Saccharomyces/
View details for Web of Science ID 000077983000017
View details for PubMedID 9847146
-
Comparison of the complete protein sets of worm and yeast: Orthology and divergence
SCIENCE
1998; 282 (5396): 2022-2028
Abstract
Comparative analysis of predicted protein sequences encoded by the genomes of Caenorhabditis elegans and Saccharomyces cerevisiae suggests that most of the core biological functions are carried out by orthologous proteins (proteins of different species that can be traced back to a common ancestor) that occur in comparable numbers. The specialized processes of signal transduction and regulatory control that are unique to the multicellular worm appear to use novel proteins, many of which re-use conserved domains. Major expansion of the number of some of these domains seen in the worm may have contributed to the advent of multicellularity. The proteins conserved in yeast and worm are likely to have orthologs throughout eukaryotes; in contrast, the proteins unique to the worm may well define metazoans.
View details for Web of Science ID 000077467100036
View details for PubMedID 9851918
-
Expanding yeast knowledge online
YEAST
1998; 14 (16): 1453-1469
Abstract
The completion of the Saccharomyces cerevisiae genome sequencing project and the continued development of improved technology for large-scale genome analysis have led to tremendous growth in the amount of new yeast genetics and molecular biology data. Efficient organization, presentation, and dissemination of this information are essential if researchers are to exploit this knowledge. In addition, the development of tools that provide efficient analysis of this information and link it with pertinent information from other systems is becoming increasingly important at a time when the complete genome sequences of other organisms are becoming available. The aim of this review is to familiarize biologists with the type of data resources currently available on the World Wide Web (WWW).
View details for Web of Science ID 000077792400003
View details for PubMedID 9885151
-
Arabidopsis thaliana: A model plant for genome analysis
SCIENCE
1998; 282 (5389): 662-?
Abstract
Arabidopsis thaliana is a small plant in the mustard family that has become the model system of choice for research in plant biology. Significant advances in understanding plant growth and development have been made by focusing on the molecular genetics of this simple angiosperm. The 120-megabase genome of Arabidopsis is organized into five chromosomes and contains an estimated 20,000 genes. More than 30 megabases of annotated genomic sequence has already been deposited in GenBank by a consortium of laboratories in Europe, Japan, and the United States. The entire genome is scheduled to be sequenced by the end of the year 2000. Reaching this milestone should enhance the value of Arabidopsis as a model for plant biology and the analysis of complex organisms in general.
View details for Web of Science ID 000076607500039
View details for PubMedID 9784120
-
Genome maps 9. Arabidopsis thaliana. Wall chart.
Science
1998; 282 (5389): 663-667
View details for PubMedID 9841422
-
AtDB, the Arabidopsis thaliana database, and graphical-web-display of progress by the Arabidopsis genome initiative
NUCLEIC ACIDS RESEARCH
1998; 26 (1): 80-84
Abstract
AtDB, the Arabidopsis thaliana Database, has a primary role to provide public access to the collected genomic information for A. thaliana via the World Wide Web (URL: http://genome-www.stanford.edu/). AtDB presents interactive physical and genetic maps that are hyperlinked with detailed information about the clones and markers placed on these maps. A large literature collection on Arabidopsis, contact information on researchers worldwide, laboratory method manuals and other information useful to plant molecular biologists are also provided. This paper discusses the database-driven clickable displays that provide easy navigation within a variety of genomic maps, including those summarizing progress of the international Arabidopsis genomic sequencing effort, AGI (the Arabidopsis Genome Initiative). The interface uses client-side hyperlinked GIF-images that direct the user to detailed database-information. A new BLAST service is also described. This gives users access to the thousands of Arabidopsis BAC clone end-sequences and includes hyperlinked images summarizing the search results. The linking of genetic and physically mapped regions and their sequence into information for loci within that region is an ongoing goal for this project.
View details for Web of Science ID 000071778900017
View details for PubMedID 9399805
-
SGD: Saccharomyces Genome Database
NUCLEIC ACIDS RESEARCH
1998; 26 (1): 73-79
Abstract
The Saccharomyces Genome Database (SGD) provides Internet access to the complete Saccharomyces cerevisiae genomic sequence, its genes and their products, the phenotypes of its mutants, and the literature supporting these data. The amount of information and the number of features provided by SGD have increased greatly following the release of the S. cerevisiae genomic sequence, which is currently the only complete sequence of a eukaryotic genome. SGD aids researchers by providing not only basic information, but also tools such as sequence similarity searching that lead to detailed information about features of the genome and relationships between genes. SGD presents information using a variety of user-friendly, dynamically created graphical displays illustrating physical, genetic and sequence feature maps. SGD can be accessed via the World Wide Web at http://genome-www.stanford.edu/Saccharomyces/
View details for Web of Science ID 000071778900016
View details for PubMedID 9399804
-
Genetics - Yeast as a model organism
SCIENCE
1997; 277 (5330): 1259-1260
View details for Web of Science ID A1997XT82700041
View details for PubMedID 9297238
-
Arabidopsis genomic information from AtDB.
AMER SOC PLANT BIOLOGISTS. 1997: 11003–
View details for Web of Science ID A1997XL11900023
-
The nucleotide sequence of Saccharomyces cerevisiae chromosome V
NATURE
1997; 387 (6632): 78-81
Abstract
Here we report the sequence of 569,202 base pairs of Saccharomyces cerevisiae chromosome V. Analysis of the sequence revealed a centromere, two telomeres and 271 open reading frames (ORFs) plus 13 tRNAs and four small nuclear RNAs. There are two Ty1 transposable elements, each of which contains an ORF (included in the count of 271). Of the ORFs, 78 (29%) are new, 81 (30%) have potential homologues in the public databases, and 112 (41%) are previously characterized yeast genes.
View details for Web of Science ID A1997XB54600008
View details for PubMedID 9169868
-
The nucleotide sequence of Saccharomyces cerevisiae chromosome IV
NATURE
1997; 387 (6632): 75-78
Abstract
The complete DNA sequence of the yeast Saccharomyces cerevisiae chromosome IV has been determined. Apart from chromosome XII, which contains the 1-2 Mb rDNA cluster, chromosome IV is the longest S. cerevisiae chromosome. It was split into three parts, which were sequenced by a consortium from the European Community, the Sanger Centre, and groups from St Louis and Stanford in the United States. The sequence of 1,531,974 base pairs contains 796 predicted or known genes, 318 (39.9%) of which have been previously identified. Of the 478 new genes, 225 (28.3%) are homologous to previously identified genes and 253 (32%) have unknown functions or correspond to spurious open reading frames (ORFs). On average there is one gene approximately every two kilobases. Superimposed on alternating regional variations in G+C composition, there is a large central domain with a lower G+C content that contains all the yeast transposon (Ty) elements and most of the tRNA genes. Chromosome IV shares with chromosomes II, V, XII, XIII and XV some long clustered duplications which partly explain its origin.
View details for Web of Science ID A1997XB54600007
View details for PubMedID 9169867
-
Genetic and physical maps of Saccharomyces cerevisiae
NATURE
1997; 387 (6632): 67-73
Abstract
Genetic and physical maps for the 16 chromosomes of Saccharomyces cerevisiae are presented. The genetic map is the result of 40 years of genetic analysis. The physical map was produced from the results of an international systematic sequencing effort. The data for the maps are accessible electronically from the Saccharomyces Genome Database (SGD: http://genome-www.stanford.edu/Saccharomyces/).
View details for Web of Science ID A1997XB54600006
View details for PubMedID 9169866
-
The nucleotide sequence of Saccharomyces cerevisiae chromosome XVI
NATURE
1997; 387 (6632): 103-105
Abstract
The nucleotide sequence of the 948,061 base pairs of chromosome XVI has been determined, completing the sequence of the yeast genome. Chromosome XVI was the last yeast chromosome identified, and some of the genes mapped early to it, such as GAL4, PEP4 and RAD1 (ref. 2) have played important roles in the development of yeast biology. The architecture of this final chromosome seems to be typical of the large yeast chromosomes, and shows large duplications with other yeast chromosomes. Chromosome XVI contains 487 potential protein-encoding genes, 17 tRNA genes and two small nuclear RNA genes; 27% of the genes have significant similarities to human gene products, and 48% are new and of unknown biological function. Systematic efforts to explore gene function have begun.
View details for Web of Science ID A1997XB54600015
View details for PubMedID 9169875
-
Molecular linguistics: Extracting information from gene and protein sequences
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
1997; 94 (11): 5506-5507
View details for Web of Science ID A1997XB71100005
View details for PubMedID 9159100
View details for PubMedCentralID PMC34160
-
Genetic nomenclature guide. Saccharomyces cerevisiae.
Trends in genetics : TIG
1995: 11-12
View details for PubMedID 7660459
-
AN INTEGRATED GENETIC RFLP MAP OF THE ARABIDOPSIS-THALIANA GENOME
PLANT JOURNAL
1993; 3 (5): 745-754
View details for Web of Science ID A1993LC75800013
-
DETECTION OF HERPES-SIMPLEX VIRUS THYMIDINE KINASE AND LATENCY-ASSOCIATED TRANSCRIPT GENE-SEQUENCES IN HUMAN HERPETIC CORNEAS BY POLYMERASE CHAIN-REACTION AMPLIFICATION
INVESTIGATIVE OPHTHALMOLOGY & VISUAL SCIENCE
1991; 32 (6): 1808-1815
Abstract
Herpes simplex virus (HSV) latency in sensory ganglion neurons is well documented, but the existence of extraneuronal corneal latency is less well defined. To investigate the possibility of extraneuronal latency during ocular HSV infection, corneal specimens from 18 patients with quiescent herpes simplex keratitis (HSK) were obtained at the time of keratoplasty. Polymerase chain reaction (PCR) amplification followed by Southern blot hybridization with a radiolabeled oligonucleotide probe was used to detect the presence of the HSV-1 genome in these human corneal samples. Two pairs of oligonucleotides from the region of the HSV thymidine kinase (TK) gene and the latency-associated transcript (LAT) gene were used as primers in the PCR amplification. The DNA sequences from either the TK or the LAT gene were identified in 15 of 18 HSK corneas (83%). These results demonstrate that the HSV genome was retained, at least in part, in human corneas during quiescent HSV infection, giving further support to the concept of corneal extraneuronal latency.
View details for Web of Science ID A1991FM17900014
View details for PubMedID 1851732
-
CODON USAGE TABLE FOR XENOPUS-LAEVIS
METHODS IN CELL BIOLOGY
1991; 36: 675-677
View details for Web of Science ID A1991MC41400038
View details for PubMedID 1811159
-
SACCHAROMYCES-CEREVISIAE HOMOSERINE KINASE IS HOMOLOGOUS TO PROKARYOTIC HOMOSERINE KINASES
GENE
1990; 96 (2): 177-180
Abstract
The Saccharomyces cerevisiae gene (THR1) encoding homoserine kinase (HK; EC 2.7.1.39) was cloned by complementation in yeast. Disruption of the THR1 gene results in threonine auxotrophy in yeast. Comparison of the amino acid sequences of yeast and bacterial HKs reveals substantial similarity.
View details for Web of Science ID A1990EM78200004
View details for PubMedID 2176637
-
MUTATIONAL ANALYSIS OF CONSERVED NUCLEOTIDES IN A SELF-SPLICING GROUP-I INTRON
JOURNAL OF MOLECULAR BIOLOGY
1990; 215 (3): 345-358
Abstract
We have constructed all single base substitutions in almost all of the highly conserved residues of the Tetrahymena self-splicing intron. Mutation of highly conserved residues almost invariably leads to loss of enzymatic activity. In many cases, activity could be regained by making additional mutations that restored predicted base-pairings; these second site suppressors in general confirm the secondary structure derived from phylogenetic data. At several positions, our suppression data can be most readily explained by assuming non-Watson-Crick base-pairings. In addition to the requirements imposed by the secondary structure, the sequence of the intron is constrained by "negative interactions", the exclusion of particular nucleotide sequences that would form undesirable secondary structures. A comparison of genetic and phylogenetic data suggests sites that may be involved in tertiary structural interactions.
View details for Web of Science ID A1990ED16700004
View details for PubMedID 1700131
-
GENETIC DISSECTION OF AN RNA ENZYME
COLD SPRING HARBOR SYMPOSIA ON QUANTITATIVE BIOLOGY
1987; 52: 173-180
View details for Web of Science ID A1987P094200021
View details for PubMedID 2456876
-
THE INTERNALLY LOCATED TELOMERIC SEQUENCES IN THE GERM-LINE CHROMOSOMES OF TETRAHYMENA ARE AT THE ENDS OF TRANSPOSON-LIKE ELEMENTS
CELL
1985; 43 (3): 747-758
Abstract
The germ-line micronuclear genome of the ciliate Tetrahymena thermophila contains approximately 10^2 chromosome-internal blocks of tandemly repeated C4A2 sequences (micC4A2). This repeated sequence is the telomeric sequence in the somatic macronucleus. Each of six cloned micC4A2 blocks was found to be adjacent to a conserved 30 bp sequence, which we propose is the terminal inverted repeat of a family of DNA elements (the Tel-1 family). This 30 bp sequence contains a site for the infrequently cutting restriction enzyme Bst XI, which allows full-length Tel-1 elements to be cut out of the micronuclear genome. BAL 31 exonuclease digestion of Bst XI-cut micronuclear DNA showed the majority of micC4A2 blocks to be associated with the ends of the Tel-1 family. We propose that Tel-1 elements are transposable and suggest a novel mechanism to account for the origin of micC4A2, in which telomeric repeats are added to the ends of free linear forms of the transposable elements prior to reintegration.
View details for Web of Science ID A1985AWV6100022
View details for PubMedID 3000613
-
DNA termini in ciliate macronuclei.
Cold Spring Harbor symposia on quantitative biology
1983; 47: 1195-1207
View details for PubMedID 6407801
-
DNA TERMINI IN CILIATE MACRONUCLEI
COLD SPRING HARBOR SYMPOSIA ON QUANTITATIVE BIOLOGY
1982; 47: 1195-1207
View details for Web of Science ID A1982QR19100066
-
EVIDENCE FOR A PLASMA-MEMBRANE REDOX SYSTEM ON INTACT ASCITES TUMOR-CELLS WITH DIFFERENT METASTATIC CAPACITY
BIOCHIMICA ET BIOPHYSICA ACTA
1981; 634 (1): 11-18
Abstract
An NADH-ferricyanide reductase was demonstrated on the external surface of intact mouse ascites tumor cells grown in culture. The oxidation/reduction reaction was due to enzymatic rather than inorganic iron catalysis, as demonstrated by the kinetics and specificity of the reaction. Activities of three markers for cytoplasmic contents were lacking with the intact tumor cells. The dehydrogenase activity was inhibited by p-chloromercuribenzoate, bathophenanthroline sulfonate, and the anticancer drug adriamycin. Sodium azide and potassium cyanide inhibited it only partially. The response to inhibitors resembled that of isolated plasma membranes rather than that of mitochondria. Consistent with these findings, neither superoxide dismutase nor rotenone affected the redox activity. The findings provide evidence for the operation of a plasma membrane redox system at the surface of intact, living cells.
View details for Web of Science ID A1981KZ18600002
View details for PubMedID 7470494
-
ABSENCE OF GANGLIOSIDES IN A HIGHER PLANT
EXPERIENTIA
1978; 34 (11): 1433-1434
View details for Web of Science ID A1978FX18000016