Seung Yon (Sue) Rhee has a B.A. in Biology from Swarthmore College and a Ph.D. in Biology from Stanford University. She was the founding director of TAIR (The Arabidopsis Information Resource) and has been a staff associate and then a full member of the Department of Plant Biology at the Carnegie Institution for Science since 1999. She is currently the director of the department. Her group strives to understand how plants adapt and acclimate to changes in their environment. It develops computational tools and integrative frameworks to systematically identify the functions of novel proteins, pathways, and protein networks by combining genomic resources with computational, biocuration, statistical, genetic, molecular, and evolutionary biology methods. The group is currently developing novel approaches to identify new classes of transcriptional regulators, patterns of metabolic network evolution, and the genetic networks that evolved to control salt tolerance in plants. She has been a leader in newly emerging fields of biology such as genome annotation, biocuration, bioinformatics, and systems biology.
Assistant Professor, Biology
Director of Plant Biology Department, Carnegie Institution for Science (2016 - Present)
Current Research and Scholarly Interests
Humans depend on plant metabolism for survival and well-being. For example, over 25% of drugs are natural products or derivatives of plant metabolism. Despite this dependence, we know very little about plant metabolism. Of the estimated 200,000-1,000,000 metabolites and >1 billion enzymes in plants, we know how only ~1000 metabolites are made by ~3000 enzymes. Plant metabolism is therefore one of the most understudied areas in biology and medicine, with a huge unrealized potential for discovering new chemistry and biology.
We wish to understand how plants control their metabolism in response to environmental signals. We also want to understand why different plants have adapted different ways of responding to their environments. With these understandings, we expect to be able to engineer plants to optimize their metabolism under different environmental scenarios.
We developed methods to generate high-quality genetic, metabolic, and protein interaction networks. First, we generated AraNet (http://www.functionalnet.org/aranet/), the first genome-wide gene co-function network for a plant, by integrating large-scale data for Arabidopsis. This network increased the share of Arabidopsis genes with a predicted function from 50% to 80% of the genome. Using inferences made from AraNet, we discovered a potential regulatory complex of the COP9 signalosome and a novel post-translational regulatory mechanism controlling auxin transport in root development. Second, we developed a high-quality prediction system for metabolic enzymes called E2P2 by integrating different prediction programs. Using E2P2, we generated and compared metabolic networks of 11 plant species and 1 alga (www.plantcyc.org), which led to the discovery of several novel genomic signatures of specialized metabolism, including widespread colocation of specialized metabolic enzymes, potentially comprising metabolic pathways. In addition, as part of a consortium, we generated the most comprehensive metabolite profiles (~4000 metabolites) of a large collection of Arabidopsis mutants and identified patterns of metabotypes in the network, which we are using to build an algorithm to predict the most likely paths and control points in the metabolic network. Finally, in collaboration with Wolf Frommer’s group, we generated MIND (www.associomics.org), the first large-scale membrane protein interaction map (~12000 interactions among ~4000 proteins) for a multicellular organism, and identified numerous regulatory links controlling signal transduction and transport systems. Using MIND and other large-scale protein interaction data, we are developing methods to systematically identify signaling architectures in plants.
Genome-wide prediction of metabolic enzymes, pathways and gene clusters in plants.
Plant metabolism underpins many traits of ecological and agronomic importance. Plants produce numerous compounds to cope with their environments but the biosynthetic pathways for most of these compounds have not yet been elucidated. To engineer and improve metabolic traits, we need comprehensive and accurate knowledge of the organization and regulation of plant metabolism at the genome scale. Here, we present a computational pipeline to identify metabolic enzymes, pathways, and gene clusters from a sequenced genome. Using this pipeline, we generated metabolic pathway databases for 22 species and identified metabolic gene clusters from 18 species. This unified resource can be used to conduct a wide array of comparative studies of plant metabolism. Using the resource, we discovered a widespread occurrence of metabolic gene clusters in plants: 11,969 clusters from 18 species. The prevalence of metabolic gene clusters offers an intriguing possibility of an untapped source for uncovering new metabolite biosynthesis pathways. For example, more than 1,700 clusters contain enzymes that could generate a specialized metabolite scaffold (signature enzymes) and enzymes that modify the scaffold (tailoring enzymes). In four species with sufficient gene expression data, we identified 43 highly coexpressed clusters that contain signature and tailoring enzymes, of which eight were characterized previously to be functional pathways. Finally, we identified patterns of genome organization that implicate local gene duplication and, to a lesser extent, single gene transposition as having played roles in the evolution of plant metabolic gene clusters.
View details for DOI 10.1104/pp.16.01942
View details for PubMedID 28228535
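The abstract above does not spell out how gene clusters are called, so the following is only a minimal Python sketch of the general idea, not the published pipeline. It assumes a cluster is a run of enzyme-coding genes along a chromosome in which consecutive enzymes are separated by at most `max_gap` non-enzyme genes. The function names, the role labels, and the thresholds `max_gap` and `min_enzymes` are all hypothetical.

```python
from typing import List, Optional, Tuple

# Each gene is (gene_id, role), listed in chromosomal order.
# role is "signature", "tailoring", "other_enzyme", or None for a non-enzyme gene.
# These labels are illustrative assumptions, not the pipeline's own vocabulary.
Gene = Tuple[str, Optional[str]]


def find_clusters(genes: List[Gene], max_gap: int = 1,
                  min_enzymes: int = 3) -> List[List[Gene]]:
    """Group enzyme genes separated by at most max_gap non-enzyme genes."""
    clusters: List[List[Gene]] = []
    current: List[Gene] = []
    gap = 0  # consecutive non-enzyme genes seen since the last enzyme
    for gene in genes:
        if gene[1] is not None:
            current.append(gene)
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                # The run is broken; keep it only if it is large enough.
                if len(current) >= min_enzymes:
                    clusters.append(current)
                current = []
    if len(current) >= min_enzymes:
        clusters.append(current)
    return clusters


def has_signature_and_tailoring(cluster: List[Gene]) -> bool:
    """True if the cluster holds both a scaffold-forming and a modifying enzyme."""
    roles = {role for _, role in cluster}
    return "signature" in roles and "tailoring" in roles


# Toy chromosome with placeholder gene IDs.
example_genes: List[Gene] = [
    ("g1", "signature"), ("g2", None), ("g3", "tailoring"), ("g4", "tailoring"),
    ("g5", None), ("g6", None), ("g7", "other_enzyme"), ("g8", "other_enzyme"),
]
example_clusters = find_clusters(example_genes)
```

On this toy input the sketch reports a single cluster made of `g1`, `g3`, and `g4`, which contains both a signature and a tailoring enzyme; `g7` and `g8` fall below the `min_enzymes` threshold. A real analysis would also need physical distances and coexpression data, which are left out here.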
Enhancing gene regulatory network inference through data integration with markov random fields
A gene regulatory network links transcription factors to their target genes and represents a map of transcriptional regulation. Much progress has been made in deciphering gene regulatory networks computationally. However, gene regulatory network inference for most eukaryotic organisms remains challenging. To improve the accuracy of gene regulatory network inference and facilitate candidate selection for experimentation, we developed an algorithm called GRACE (Gene Regulatory network inference ACcuracy Enhancement). GRACE exploits a priori biological knowledge and heterogeneous data integration to generate high-confidence network predictions for eukaryotic organisms using Markov Random Fields in a semi-supervised fashion. GRACE uses a novel optimization scheme to integrate regulatory evidence and biological relevance. It is particularly suited for model learning with sparse regulatory gold standard data. We show GRACE's potential to produce high-confidence regulatory networks compared to state-of-the-art approaches using Drosophila melanogaster and Arabidopsis thaliana data. In an A. thaliana developmental gene regulatory network, GRACE recovers cell cycle-related regulatory mechanisms and further hypothesizes several novel regulatory links, including a putative control mechanism of vascular structure formation due to modifications in cell proliferation.
View details for DOI 10.1038/srep41174
View details for Web of Science ID 000393297200001
View details for PubMedID 28145456
- Computational inference of gene regulatory networks: Approaches, limitations and opportunities BIOCHIMICA ET BIOPHYSICA ACTA-GENE REGULATORY MECHANISMS 2017; 1860 (1): 41-52
The quality of metabolic pathway resources depends on initial enzymatic function assignments: a case for maize
BMC SYSTEMS BIOLOGY
As metabolic pathway resources become more commonly available, researchers have unprecedented access to information about their organism of interest. Despite efforts to ensure consistency between various resources, information content and quality can vary widely. Two maize metabolic pathway resources for the B73 inbred line, CornCyc 4.0 and MaizeCyc 2.2, are based on the same gene model set and were developed using Pathway Tools software. These resources differ in their initial enzymatic function assignments and in the extent of manual curation. We present an in-depth comparison between CornCyc and MaizeCyc to demonstrate the effect of initial computational enzymatic function assignments on the quality and content of metabolic pathway resources. These two resources are different in their content. MaizeCyc contains GO annotations for over 21,000 genes that CornCyc is missing. CornCyc contains on average 1.6 transcripts per gene, while MaizeCyc contains almost no alternate splicing. MaizeCyc also does not match CornCyc's breadth in representing the metabolic domain; MaizeCyc has fewer compounds, reactions, and pathways than CornCyc. CornCyc's computational predictions are more accurate than those in MaizeCyc when compared to experimentally determined function assignments, demonstrating the relative strength of the enzymatic function assignment pipeline used to generate CornCyc. Our results show that the quality of initial enzymatic function assignments primarily determines the quality of the final metabolic pathway resource. Therefore, biologists should pay close attention to the methods and information sources used to develop a metabolic pathway resource to gauge the utility of using such functional assignments to construct hypotheses for experimental studies.
View details for DOI 10.1186/s12918-016-0369-x
View details for Web of Science ID 000389483700001
View details for PubMedID 27899149
Computational inference of gene regulatory networks: Approaches, limitations and opportunities.
Biochimica et biophysica acta
Gene regulatory networks lie at the core of cell function control. In E. coli and S. cerevisiae, the study of gene regulatory networks has led to the discovery of regulatory mechanisms responsible for the control of cell growth, differentiation and responses to environmental stimuli. In plants, computational rendering of gene regulatory networks is gaining momentum, thanks to the recent availability of high-quality genomes and transcriptomes and development of computational network inference approaches. Here, we review current techniques, challenges and trends in gene regulatory network inference and highlight challenges and opportunities for plant science. We provide plant-specific application examples to guide researchers in selecting methodologies that suit their particular research questions. Given the interdisciplinary nature of gene regulatory network inference, we tried to cater to both biologists and computer scientists to help them engage in a dialogue about concepts and caveats in network inference. Specifically, we discuss problems and opportunities in heterogeneous data integration for eukaryotic organisms and common caveats to be considered during network model evaluation. This article is part of a Special Issue entitled: Plant Gene Regulatory Mechanisms and Networks, edited by Dr. Erich Grotewold and Dr. Nathan Springer.
View details for DOI 10.1016/j.bbagrm.2016.09.003
View details for PubMedID 27641093
A Framework for Discovering, Designing, and Testing MicroProteins to Regulate Synthetic Transcriptional Modules.
Methods in molecular biology (Clifton, N.J.)
2016; 1482: 175-188
Transcription factors often form protein complexes and give rise to intricate transcriptional networks. The regulation of transcription factor multimerization plays a key role in the fine-tuning of the underlying transcriptional pathways and can be exploited to modulate synthetic transcriptional modules. A novel mode of regulating protein complex formation is emerging: microProteins (miPs), which are truncated transcription factors, engage in protein-protein interactions with transcriptional complexes and modulate their transcriptional activity. Here, we outline a strategy for the discovery, design, and testing of putative miPs to fine-tune the activity of transcription factors regulating synthetic or natural transcriptional circuits.
View details for DOI 10.1007/978-1-4939-6396-6_12
View details for PubMedID 27557768
Target Enrichment Improves Mapping of Complex Traits by Deep Sequencing.
G3 (Bethesda, Md.)
2015; 6 (1): 67-77
Complex traits such as crop performance and human diseases are controlled by multiple genetic loci, many of which have small effects and often go undetected by traditional quantitative trait locus (QTL) mapping. Recently, bulked segregant analysis with large F2 pools and genome-level markers (named extreme-QTL or X-QTL mapping) has been used to identify many QTL. To estimate parameters impacting QTL detection for X-QTL mapping, we simulated the effects of population size, marker density, and sequencing depth of markers on QTL detectability for traits with differing heritabilities. These simulations indicate that a high (>90%) chance of detecting QTL with at least 5% effect requires 5000× sequencing depth for a trait with heritability of 0.4-0.7. For most eukaryotic organisms, whole-genome sequencing at this depth is not economically feasible. Therefore, we tested and confirmed the feasibility of applying deep sequencing of target-enriched markers for X-QTL mapping. We used two traits in Arabidopsis thaliana with different heritabilities: seed size (H² = 0.61) and seedling greening in response to salt (H² = 0.94). We used a modified G test to identify QTL regions and developed a model-based statistical framework to resolve individual peaks by incorporating recombination rates. Multiple QTL were identified for both traits, including previously undiscovered QTL. We call our method target-enriched X-QTL (TEX-QTL) mapping; this mapping approach is not limited by the genome size or the availability of recombinant inbred populations and should be applicable to many organisms and traits.
View details for DOI 10.1534/g3.115.023671
View details for PubMedID 26530422
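The abstract above mentions a modified G test for identifying QTL regions but does not describe the modification. As a rough illustration only, the sketch below implements the standard, unmodified G test of independence on a 2x2 table of allele read counts from two pools at one marker. The function name, the argument names, and the example counts are assumptions made for illustration and do not come from the paper.

```python
import math
from typing import Tuple


def g_test_2x2(pool_a_ref: int, pool_a_alt: int,
               pool_b_ref: int, pool_b_alt: int) -> Tuple[float, float]:
    """Standard G test of independence for allele counts in two pools.

    Returns (G, p), where p comes from a chi-square distribution with
    1 degree of freedom. Assumes every row and column total is positive.
    """
    observed = [[pool_a_ref, pool_a_alt], [pool_b_ref, pool_b_alt]]
    row_sums = [sum(row) for row in observed]
    col_sums = [pool_a_ref + pool_b_ref, pool_a_alt + pool_b_alt]
    total = sum(row_sums)

    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero count contributes nothing to G
                expected = row_sums[i] * col_sums[j] / total
                g += 2.0 * o * math.log(o / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error

    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(G / 2)).
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p


# Hypothetical marker covered at 5000x in each pool.
g_null, p_null = g_test_2x2(2500, 2500, 2500, 2500)    # no allele skew
g_skew, p_skew = g_test_2x2(2500, 2500, 3000, 2000)    # skew in one pool
```

With identical allele ratios in both pools G is zero and p is one; shifting one pool to a 60:40 ratio at the same depth yields a large G and a very small p. A genome scan would repeat this per marker and would still need the peak-resolution step described in the abstract.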
Patterns of Metabolite Changes Identified from Large-Scale Gene Perturbations in Arabidopsis Using a Genome-Scale Metabolic Network
2015; 167 (4): 1685-1698
Metabolomics enables quantitative evaluation of metabolic changes caused by genetic or environmental perturbations. However, little is known about how perturbing a single gene changes the metabolic system as a whole and which network and functional properties are involved in this response. To answer this question, we investigated the metabolite profiles from 136 mutants with single gene perturbations of functionally diverse Arabidopsis (Arabidopsis thaliana) genes. Fewer than 10 metabolites were changed significantly relative to the wild type in most of the mutants, indicating that the metabolic network was robust to perturbations of single metabolic genes. These changed metabolites were closer to each other in a genome-scale metabolic network than expected by chance, supporting the notion that the genetic perturbations changed the network more locally than globally. Surprisingly, the changed metabolites were close to the perturbed reactions in only 30% of the mutants of the well-characterized genes. To determine the factors that contributed to the distance between the observed metabolic changes and the perturbation site in the network, we examined nine network and functional properties of the perturbed genes. Only the isozyme number affected the distance between the perturbed reactions and changed metabolites. This study revealed patterns of metabolic changes from large-scale gene perturbations and relationships between characteristics of the perturbed genes and metabolic changes.
View details for DOI 10.1104/pp.114.252361
View details for Web of Science ID 000354438500038
View details for PubMedID 25670818
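The comparison in the abstract above, whether changed metabolites sit closer together in the network than expected by chance, can be illustrated with a small permutation test. This is a sketch under stated assumptions rather than the study's method: the network is treated as an undirected, connected graph stored as an adjacency dictionary, distance is the shortest-path length, and the null model draws random metabolite sets of the same size. All names and the toy network are hypothetical.

```python
import random
from collections import deque
from itertools import combinations
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]


def shortest_path_lengths(graph: Graph, source: str) -> Dict[str, int]:
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def mean_pairwise_distance(graph: Graph, nodes: List[str]) -> float:
    """Average shortest-path length over all pairs (graph assumed connected)."""
    pairs = list(combinations(nodes, 2))
    total = sum(shortest_path_lengths(graph, a)[b] for a, b in pairs)
    return total / len(pairs)


def empirical_p(graph: Graph, changed: List[str],
                n_perm: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Observed mean distance and the share of random sets at least as compact."""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(graph, changed)
    all_nodes = sorted(graph)
    hits = sum(
        mean_pairwise_distance(graph, rng.sample(all_nodes, len(changed))) <= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy linear network m0 - m1 - ... - m9 with placeholder metabolite names.
toy_graph: Graph = {f"m{i}": [] for i in range(10)}
for i in range(9):
    toy_graph[f"m{i}"].append(f"m{i + 1}")
    toy_graph[f"m{i + 1}"].append(f"m{i}")

observed_distance, p_value = empirical_p(toy_graph, ["m0", "m1", "m2"])
```

Three adjacent metabolites on this chain have a mean pairwise distance of 4/3, and only a small share of random triples are that compact. A genome-scale network is rarely fully connected, so a real analysis would have to decide how to treat unreachable pairs.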
Measuring semantic similarities by combining gene ontology annotations and gene co-function networks
Gene Ontology (GO) has been used widely to study functional relationships between genes. The current semantic similarity measures rely only on GO annotations and GO structure. This limits the power of GO-based similarity because of the limited proportion of genes that are annotated to GO in most organisms. We introduce a novel approach called NETSIM (network-based similarity measure) that incorporates information from gene co-function networks in addition to using the GO structure and annotations. Using metabolic reaction maps of yeast, Arabidopsis, and human, we demonstrate that NETSIM can improve the accuracy of GO term similarities. We also demonstrate that NETSIM works well even for genomes with sparser gene annotation data. We applied NETSIM on large Arabidopsis gene families such as cytochrome P450 monooxygenases to group the members functionally and show that this grouping could facilitate functional characterization of genes in these families. Using NETSIM as an example, we demonstrated that the performance of a semantic similarity measure could be significantly improved after incorporating genome-specific information. NETSIM incorporates both GO annotations and gene co-function network data as a priori knowledge in the model. Therefore, functional similarities of GO terms that are not explicitly encoded in GO but are relevant in a taxon-specific manner become measurable when GO annotations are limited. Supplementary information and software are available at http://www.msu.edu/~jinchen/NETSIM.
View details for DOI 10.1186/s12859-015-0474-7
View details for Web of Science ID 000349915400001
View details for PubMedID 25886899
microProtein Prediction Program (miP3): A Software for Predicting microProteins and Their Target Transcription Factors
INTERNATIONAL JOURNAL OF GENOMICS
An emerging concept in transcriptional regulation is that a class of truncated transcription factors (TFs), called microProteins (miPs), engages in protein-protein interactions with TF complexes and provides feedback controls. A handful of miP examples have been described in the literature but the extent of their prevalence is unclear. Here we present an algorithm that predicts miPs and their target TFs from a sequenced genome. The algorithm is called miP prediction program (miP3), which is implemented in Python. The software will help shed light on the prevalence, biological roles, and evolution of miPs. Moreover, miP3 can be used to predict other types of miP-like proteins that may have evolved from other functional classes such as kinases and receptors. The program is freely available and can be applied to any sequenced genome.
View details for DOI 10.1155/2015/734147
View details for Web of Science ID 000354307500001
View details for PubMedID 26060811
Becoming data-savvy in a big-data world
TRENDS IN PLANT SCIENCE
2014; 19 (10): 619-622
Plant biology is becoming a data-driven science. High-throughput technologies generate data quickly from molecular to ecosystem levels. Statistical and computational approaches enable describing and interpreting data quantitatively. We highlight the purpose, common problems, and general principles in data analysis. We use RNA sequencing (RNAseq) analysis to illustrate the rationale behind some of the choices made in statistical data analysis. Finally, we provide a list of free online resources that emphasize intuition behind quantitative data analysis.
View details for DOI 10.1016/j.tplants.2014.08.003
View details for Web of Science ID 000343359900004
View details for PubMedID 25213119
Border Control-A Membrane-Linked Interactome of Arabidopsis
2014; 344 (6185): 711-716
Cellular membranes act as signaling platforms and control solute transport. Membrane receptors, transporters, and enzymes communicate with intracellular processes through protein-protein interactions. Using a split-ubiquitin yeast two-hybrid screen that covers a test-space of 6.4 × 10⁶ pairs, we identified 12,102 membrane/signaling protein interactions from Arabidopsis. Besides confirmation of expected interactions such as heterotrimeric G protein subunit interactions and aquaporin oligomerization, >99% of the interactions were previously unknown. Interactions were confirmed at a rate of 32% in orthogonal in planta split-green fluorescent protein interaction assays, which was statistically indistinguishable from the confirmation rate for known interactions collected from literature (38%). Regulatory associations in membrane protein trafficking, turnover, and phosphorylation include regulation of potassium channel activity through abscisic acid signaling, transporter activity by a WNK kinase, and a brassinolide receptor kinase by trafficking-related proteins. These examples underscore the utility of the membrane/signaling protein interaction network for gene discovery and hypothesis generation in plants and other organisms.
View details for DOI 10.1126/science.1251358
View details for Web of Science ID 000335912900032
View details for PubMedID 24833385
Genomic Signatures of Specialized Metabolism in Plants
2014; 344 (6183): 510-513
All plants synthesize basic metabolites needed for survival (primary metabolism), but different taxa produce distinct metabolites that are specialized for specific environmental interactions (specialized metabolism). Because evolutionary pressures on primary and specialized metabolism differ, we investigated differences in the emergence and maintenance of these processes across 16 species encompassing major plant lineages from algae to angiosperms. We found that, relative to their primary metabolic counterparts, genes coding for specialized metabolic functions have proliferated to a much greater degree and by different mechanisms and display lineage-specific patterns of physical clustering within the genome and coexpression. These properties illustrate the differential evolution of specialized metabolism in plants, and collectively they provide unique signatures for the potential discovery of novel specialized metabolic processes.
View details for DOI 10.1126/science.1252076
View details for Web of Science ID 000335157700041
View details for PubMedID 24786077
A Comprehensive Analysis of MicroProteins Reveals Their Potentially Widespread Mechanism of Transcriptional Regulation.
2014; 165 (1): 149-159
Truncated transcription factor-like proteins called microProteins (miPs) can modulate transcription factor activities, thereby increasing transcriptional regulatory complexity. To understand their prevalence, evolution, and function, we predicted over 400 genes that encode putative miPs from Arabidopsis (Arabidopsis thaliana) using a bioinformatics pipeline and validated two novel miPs involved in flowering time and response to abiotic and biotic stress. We provide an evolutionary perspective for a class of miPs targeting homeodomain transcription factors in plants and metazoans. We identify domain loss as one mechanism of miP evolution and suggest the possible roles of miPs on the evolution of their target transcription factors. Overall, we reveal a prominent layer of transcriptional regulation by miPs, show pervasiveness of such proteins both within and across genomes, and provide a framework for studying their function and evolution.
View details for DOI 10.1104/pp.114.235903
View details for PubMedID 24616380
View details for PubMedCentralID PMC4012575
Towards revealing the functions of all genes in plants
TRENDS IN PLANT SCIENCE
2014; 19 (4): 212-221
The great recent progress made in identifying the molecular parts lists of organisms revealed the paucity of our understanding of what most of the parts do. In this review, we introduce computational and statistical approaches and omics data used for inferring gene function in plants, with an emphasis on network-based inference. We also discuss caveats associated with network-based function predictions such as performance assessment, annotation propagation, the guilt-by-association concept, and the meaning of hubs. Finally, we note the current limitations and possible future directions such as the need for gold standard data from several species, unified access to data and tools, quantitative comparison of data and tool quality, and high-throughput experimental validation platforms for systematic gene function elucidation in plants.
View details for DOI 10.1016/j.tplants.2013.10.006
View details for Web of Science ID 000334976000010
View details for PubMedID 24231067
- Interview with Seung Yon Rhee. Trends in Plant Science 2014; 19 (4): 198-199
Systems Analysis of Plant Functional, Transcriptional, Physical Interaction, and Metabolic Networks
2012; 24 (10): 3859-3875
Physiological responses, developmental programs, and cellular functions rely on complex networks of interactions at different levels and scales. Systems biology brings together high-throughput biochemical, genetic, and molecular approaches to generate omics data that can be analyzed and used in mathematical and computational models toward uncovering these networks on a global scale. Various approaches, including transcriptomics, proteomics, interactomics, and metabolomics, have been employed to obtain these data on the cellular, tissue, organ, and whole-plant level. We summarize progress on gene regulatory, cofunction, protein interaction, and metabolic networks. We also illustrate the main approaches that have been used to obtain these networks, with specific examples from Arabidopsis thaliana, and describe the pros and cons of each approach.
View details for DOI 10.1105/tpc.112.100776
View details for Web of Science ID 000312378300004
View details for PubMedID 23110892
Towards understanding how molecular networks evolve in plants
CURRENT OPINION IN PLANT BIOLOGY
2012; 15 (2): 177-184
Residing beneath the phenotypic landscape of a plant are intricate and dynamic networks of genes and proteins. As evolution operates on phenotypes, we expect its forces to somehow shape these underlying molecular networks. In this review, we discuss progress being made to elucidate the nature of these forces and their impact on the composition and structure of molecular networks. We also outline current limitations and open questions facing the broader field of plant network analysis.
View details for DOI 10.1016/j.pbi.2012.01.006
View details for Web of Science ID 000303640500010
View details for PubMedID 22280840
Uncovering Arabidopsis membrane protein interactome enriched in transporters using mating-based split ubiquitin assays and classification models.
Frontiers in plant science
2012; 3: 124
High-throughput data are a double-edged sword: the benefit of a large amount of data comes with an associated cost of noise. To increase the reliability and scalability of high-throughput protein interaction data generation, we tested the efficacy of classification to enrich potential protein-protein interactions. We applied this method to identify interactions among Arabidopsis membrane proteins enriched in transporters. We validated our method with multiple retests. Classification improved the quality of the ensuing interaction network and was effective in reducing the search space and increasing the true positive rate. The final network of 541 interactions among 239 proteins (of which 179 are transporters) is the first protein interaction network enriched in membrane transporters reported for any organism. This network has similar topological attributes to other published protein interaction networks. It also extends and fills gaps in currently available biological networks in plants and allows building a number of hypotheses about processes and mechanisms involving signal-transduction and transport systems.
View details for DOI 10.3389/fpls.2012.00124
View details for PubMedID 22737156
- Metabolomics as a hypothesis-generating functional genomics tool for the annotation of Arabidopsis thaliana genes of "unknown function" FRONTIERS IN PLANT SCIENCE 2012; 3
Systematic prediction of gene function in Arabidopsis thaliana using a probabilistic functional gene network
2011; 6 (9): 1429-1442
AraNet is a functional gene network for the reference plant Arabidopsis and has been constructed in order to identify new genes associated with plant traits. It is highly predictive for diverse biological pathways and can be used to prioritize genes for functional screens. Moreover, AraNet provides a web-based tool with which plant biologists can efficiently discover novel functions of Arabidopsis genes (http://www.functionalnet.org/aranet/). This protocol explains how to conduct network-based prediction of gene functions using AraNet and how to interpret the prediction results. Functional discovery in plant biology is facilitated by combining candidate prioritization by AraNet with focused experimental tests.
View details for DOI 10.1038/nprot.2011.372
View details for Web of Science ID 000295362900012
View details for PubMedID 21886106
View details for PubMedCentralID PMC3654671
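Network-based prioritization of the kind described in the abstract above is often explained through guilt by association: a candidate gene scores highly when it is strongly linked to genes already known to act in the process of interest. The sketch below is a simplified illustration of that idea and not the AraNet implementation. It assumes the network is a list of weighted, undirected edges and scores each candidate by summing the weights of its edges to the seed genes. The gene names, weights, and function name are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str, float]  # (gene_a, gene_b, confidence weight)


def rank_candidates(edges: Iterable[Edge],
                    seeds: Iterable[str]) -> List[Tuple[str, float]]:
    """Rank non-seed genes by the summed weight of their links to seed genes."""
    seed_set = set(seeds)
    scores: Dict[str, float] = defaultdict(float)
    for gene_a, gene_b, weight in edges:
        # Only edges joining a seed gene to a non-seed gene count.
        if gene_a in seed_set and gene_b not in seed_set:
            scores[gene_b] += weight
        elif gene_b in seed_set and gene_a not in seed_set:
            scores[gene_a] += weight
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy network with placeholder gene names and weights.
toy_edges: List[Edge] = [
    ("seedA", "gene1", 2.0),
    ("seedB", "gene1", 1.5),
    ("seedA", "gene2", 3.0),
    ("gene2", "gene3", 4.0),   # no seed involved, so it is ignored
    ("seedA", "seedB", 5.0),   # seed-to-seed link, so it is ignored
]
ranking = rank_candidates(toy_edges, ["seedA", "seedB"])
```

Here `gene1` ranks first because it is linked to both seed genes, `gene2` ranks second, and `gene3` receives no score because it has no direct link to a seed. As the abstract notes, such a ranking only prioritizes candidates; focused experimental tests are still needed to confirm function.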
Integration of Brassinosteroid Signal Transduction with the Transcription Network for Plant Growth Regulation in Arabidopsis
2010; 19 (5): 765-777
Brassinosteroids (BRs) regulate a wide range of developmental and physiological processes in plants through a receptor-kinase signaling pathway that controls the BZR transcription factors. Here, we use transcript profiling and chromatin-immunoprecipitation microarray (ChIP-chip) experiments to identify 953 BR-regulated BZR1 target (BRBT) genes. Functional studies of selected BRBTs further demonstrate roles in BR promotion of cell elongation. The BRBT genes reveal numerous molecular links between the BR-signaling pathway and downstream components involved in developmental and physiological processes. Furthermore, the results reveal extensive crosstalk between BR and other hormonal and light-signaling pathways at multiple levels. For example, BZR1 not only controls the expression of many signaling components of other hormonal and light pathways but also coregulates common target genes with light-signaling transcription factors. Our results provide a genomic map of steroid hormone actions in plants that reveals a regulatory network that integrates hormonal and light-signaling pathways for plant growth regulation.
View details for DOI 10.1016/j.devcel.2010.10.010
View details for Web of Science ID 000284516300016
View details for PubMedID 21074725
View details for PubMedCentralID PMC3018842
Creation of a Genome-Wide Metabolic Pathway Database for Populus trichocarpa Using a New Approach for Reconstruction and Curation of Metabolic Pathways for Plants
2010; 153 (4): 1479-1491
Metabolic networks reconstructed from sequenced genomes or transcriptomes can help visualize and analyze large-scale experimental data, predict metabolic phenotypes, discover enzymes, engineer metabolic pathways, and study metabolic pathway evolution. We developed a general approach for reconstructing metabolic pathway complements of plant genomes. Two new reference databases were created and added to the core of the infrastructure: a comprehensive, all-plant reference pathway database, PlantCyc, and a reference enzyme sequence database, RESD, for annotating metabolic functions of protein sequences. PlantCyc (version 3.0) includes 714 metabolic pathways and 2,619 reactions from over 300 species. RESD (version 1.0) contains 14,187 literature-supported enzyme sequences from across all kingdoms. We used RESD, PlantCyc, and MetaCyc (an all-species reference metabolic pathway database), in conjunction with the pathway prediction software Pathway Tools, to reconstruct a metabolic pathway database, PoplarCyc, from the recently sequenced genome of Populus trichocarpa. PoplarCyc (version 1.0) contains 321 pathways with 1,807 assigned enzymes. Comparing PoplarCyc (version 1.0) with AraCyc (version 6.0, Arabidopsis [Arabidopsis thaliana]) showed comparable numbers of pathways distributed across all domains of metabolism in both databases, except for a higher number of AraCyc pathways in secondary metabolism and a 1.5-fold increase in carbohydrate metabolic enzymes in PoplarCyc. Here, we introduce these new resources and demonstrate the feasibility of using them to identify candidate enzymes for specific pathways and to analyze metabolite profiling data through concrete examples. These resources can be searched by text or BLAST, browsed, and downloaded from our project Web site (http://plantcyc.org).
View details for DOI 10.1104/pp.110.157396
View details for Web of Science ID 000280566000004
View details for PubMedID 20522724
View details for PubMedCentralID PMC2923894
PlantMetabolomics.org: A Web Portal for Plant Metabolomics Experiments
2010; 152 (4): 1807-1816
PlantMetabolomics.org (PM) is a web portal and database for exploring, visualizing, and downloading plant metabolomics data. Widespread public access to well-annotated metabolomics datasets is essential for establishing metabolomics as a functional genomics tool. PM integrates metabolomics data generated from different analytical platforms from multiple laboratories along with the key visualization tools such as ratio and error plots. Visualization tools can quickly show how one condition compares to another and which analytical platforms show the largest changes. The database tries to capture a complete annotation of the experiment metadata along with the metabolite abundance data based on the evolving Metabolomics Standards Initiative. PM can be used as a platform for deriving hypotheses by enabling metabolomic comparisons between genetically unique Arabidopsis (Arabidopsis thaliana) populations subjected to different environmental conditions. Each metabolite is linked to relevant experimental data and information from various annotation databases. The portal also provides detailed protocols and tutorials on conducting plant metabolomics experiments to promote metabolomics in the community. PM currently houses Arabidopsis metabolomics data generated by a consortium of laboratories utilizing metabolomics to help elucidate the functions of uncharacterized genes. PM is publicly available at http://www.plantmetabolomics.org.
View details for DOI 10.1104/pp.109.151027
View details for Web of Science ID 000276335900005
View details for PubMedID 20147492
View details for PubMedCentralID PMC2850039
Database for Mass Spectrometry-Based Plant Metabolomics
SPRINGER. 2010: S7–S7
View details for Web of Science ID 000285367500015
Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana
2010; 28 (2): 149-U14
We introduce a rational approach for associating genes with plant traits by combined use of a genome-scale functional network and targeted reverse genetic screening. We present a probabilistic network (AraNet) of functional associations among 19,647 (73%) genes of the reference flowering plant Arabidopsis thaliana. AraNet associations are predictive for diverse biological pathways, and outperform predictions derived only from literature-based protein interactions, achieving 21% precision for 55% of genes. AraNet prioritizes genes for limited-scale functional screening, resulting in a hit-rate tenfold greater than screens of random insertional mutants, when applied to early seedling development as a test case. By interrogating network neighborhoods, we identify AT1G80710 (now DROUGHT SENSITIVE 1; DRS1) and AT3G05090 (now LATERAL ROOT STIMULATOR 1; LRS1) as regulators of drought sensitivity and lateral root development, respectively. AraNet (http://www.functionalnet.org/aranet/) provides a resource for plant gene function identification and genetic dissection of plant traits.
View details for DOI 10.1038/nbt.1603
View details for Web of Science ID 000274317200023
View details for PubMedID 20118918
View details for PubMedCentralID PMC2857375
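The network-neighborhood idea in the AraNet abstract above can be illustrated with a minimal sketch. It assumes a simple guilt-by-association rule: each candidate gene is scored by the summed weight of its edges to a set of seed genes already linked to a trait. This rule, the function name `rank_candidates`, and the toy gene labels and weights are illustrative assumptions, not the scoring scheme or data of the published network.

```python
from collections import defaultdict

def rank_candidates(edges, seeds):
    """Rank non-seed genes by summed edge weight to the seed genes.

    `edges` is an iterable of (gene_a, gene_b, weight) tuples describing an
    undirected weighted network; `seeds` are genes already tied to the trait.
    The sum-of-weights rule is an assumed heuristic, not the paper's method.
    """
    seeds = set(seeds)
    scores = defaultdict(float)
    for gene_a, gene_b, weight in edges:
        # Only edges that connect a seed to a non-seed contribute to a score.
        if gene_a in seeds and gene_b not in seeds:
            scores[gene_b] += weight
        elif gene_b in seeds and gene_a not in seeds:
            scores[gene_a] += weight
    # Highest-scoring candidates first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy network with hypothetical gene labels and weights.
toy_edges = [
    ("g1", "g2", 2.0),
    ("g1", "g3", 1.0),
    ("g4", "g3", 1.5),
    ("g2", "g5", 3.0),  # ignored: neither endpoint is a seed
]
ranking = rank_candidates(toy_edges, seeds={"g1", "g4"})
print(ranking)
```

In this toy case g3 outranks g2 because it is connected to both seed genes. A ranked list of this kind is what would then feed a limited-scale reverse genetic screen.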
A membrane protein/signaling protein interaction network for Arabidopsis version AMPv2.
Frontiers in physiology
2010; 1: 24-?
Interactions between membrane proteins and the soluble fraction are essential for signal transduction and for regulating nutrient transport. To gain insights into the membrane-based interactome, 3,852 open reading frames (ORFs) out of a target list of 8,383 representing membrane and signaling proteins from Arabidopsis thaliana were cloned into a Gateway-compatible vector. The mating-based split ubiquitin system was used to screen for potential protein-protein interactions (pPPIs) among 490 Arabidopsis ORFs. A binary robotic screen between 142 receptor-like kinases (RLKs), 72 transporters, 57 soluble protein kinases and phosphatases, 40 glycosyltransferases, 95 proteins of various functions, and 89 proteins with unknown function detected 387 out of 90,370 possible PPIs. A secondary screen confirmed 343 (of 386) pPPIs between 179 proteins, yielding a scale-free network (r² = 0.863). Eighty of 142 transmembrane RLKs tested positive, identifying 3 homomers, 63 heteromers, and 80 pPPIs with other proteins. Thirty-one out of 142 RLK interactors (including RLKs) had previously been found to be phosphorylated; thus interactors may be substrates for respective RLKs. None of the pPPIs described here had been reported in the major interactome databases, including potential interactors of G-protein-coupled receptors, phospholipase C, and AMT ammonium transporters. Two RLKs found as putative interactors of AMT1;1 were independently confirmed using a split luciferase assay in Arabidopsis protoplasts. These RLKs may be involved in ammonium-dependent phosphorylation of the C-terminus and regulation of ammonium uptake activity. The robotic screening method established here will enable a systematic analysis of membrane protein interactions in fungi, plants and metazoa.
View details for DOI 10.3389/fphys.2010.00024
View details for PubMedID 21423366
The Gene Ontology's Reference Genome Project: A Unified Framework for Functional Annotation across Species
PLOS COMPUTATIONAL BIOLOGY
2009; 5 (7)
The Gene Ontology (GO) is a collaborative effort that provides structured vocabularies for annotating the molecular function, biological role, and cellular location of gene products in a highly systematic way and in a species-neutral manner with the aim of unifying the representation of gene function across different organisms. Each contributing member of the GO Consortium independently associates GO terms to gene products from the organism(s) they are annotating. Here we introduce the Reference Genome project, which brings together those independent efforts into a unified framework based on the evolutionary relationships between genes in these different organisms. The Reference Genome project has two primary goals: to increase the depth and breadth of annotations for genes in each of the organisms in the project, and to create data sets and tools that enable other genome annotation efforts to infer GO annotations for homologous genes in their organisms. In addition, the project has several important incidental benefits, such as increasing annotation consistency across genome databases, and providing important improvements to the GO's logical structure and biological content.
View details for DOI 10.1371/journal.pcbi.1000431
View details for Web of Science ID 000269220100031
View details for PubMedID 19578431
Exploiting Domain Knowledge to Improve Biological Significance of Biclusters with Key Missing Genes
IEEE 25th International Conference on Data Engineering
IEEE. 2009: 1219–1222
View details for Web of Science ID 000269126700121
The rules of gene expression in plants: Organ identity and gene body methylation are key factors for regulation of gene expression in Arabidopsis thaliana
Microarray technology is a widely used approach for monitoring genome-wide gene expression. For Arabidopsis, there are over 1,800 microarray hybridizations representing many different experimental conditions on Affymetrix ATH1 gene chips alone. This huge amount of data offers a unique opportunity to infer the principles that govern the regulation of gene expression in plants. We used bioinformatics methods to analyze publicly available data obtained using the ATH1 chip from Affymetrix. A total of 1887 ATH1 hybridizations were normalized and filtered to eliminate low-quality hybridizations. We classified and compared control and treatment hybridizations and determined differential gene expression. The largest differences in gene expression were observed when comparing samples obtained from different organs. On average, ten-fold more genes were differentially expressed between organs as compared to any other experimental variable. We defined "gene responsiveness" as the number of comparisons in which a gene changed its expression significantly. We defined genes with the highest and lowest responsiveness levels as hypervariable and housekeeping genes, respectively. Remarkably, housekeeping genes were best distinguished from hypervariable genes by differences in methylation status in their transcribed regions. Moreover, methylation in the transcribed region was inversely correlated (R² = 0.8) with gene responsiveness on a genome-wide scale. We provide an example of this negative relationship using genes encoding TCA cycle enzymes, by contrasting their regulatory responsiveness to nitrate and methylation status in their transcribed regions. Our results indicate that the Arabidopsis transcriptome is largely established during development and is comparatively stable when faced with external perturbations.
We suggest a novel functional role for DNA methylation in the transcribed region as a key determinant capable of restraining the capacity of a gene to respond to internal/external cues. Our findings suggest a prominent role for epigenetic mechanisms in the regulation of gene expression in plants.
View details for DOI 10.1186/1471-2164-9-438
View details for Web of Science ID 000260172900003
View details for PubMedID 18811951
View details for PubMedCentralID PMC2566314
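The "gene responsiveness" measure defined in the abstract above (the number of comparisons in which a gene changed its expression significantly) can be sketched in a few lines. The sketch assumes each control-versus-treatment comparison has already been reduced to a set of significantly changed genes. The function names, the toy gene labels, and the `low` and `high` cutoffs are hypothetical and are not taken from the paper.

```python
from collections import Counter

def gene_responsiveness(comparisons):
    """Count, per gene, the comparisons in which it changed significantly.

    `comparisons` is an iterable of sets; each set holds the genes called
    differentially expressed in one control-versus-treatment comparison.
    """
    counts = Counter()
    for significant_genes in comparisons:
        counts.update(significant_genes)  # each gene in the set gains +1
    return counts

def classify(counts, all_genes, low=0, high=3):
    """Label genes using hypothetical cutoffs (assumed, not from the paper)."""
    labels = {}
    for gene in all_genes:
        n = counts.get(gene, 0)
        if n <= low:
            labels[gene] = "housekeeping"
        elif n >= high:
            labels[gene] = "hypervariable"
        else:
            labels[gene] = "intermediate"
    return labels

# Toy data: three comparisons over four hypothetical genes.
toy_comparisons = [{"geneA", "geneB"}, {"geneA"}, {"geneA", "geneC"}]
counts = gene_responsiveness(toy_comparisons)
labels = classify(counts, ["geneA", "geneB", "geneC", "geneD"])
print(labels)
```

With real data the cutoffs would more plausibly be taken from the tails of the responsiveness distribution than fixed by hand. The fixed values here only keep the toy example readable.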
- Big data: The future of biocuration NATURE 2008; 455 (7209): 47-50
- Homeodomain proteins in mice and plants: What we know and what we don't 67th Annual Meeting of the Society-for-Developmental-Biology ACADEMIC PRESS INC ELSEVIER SCIENCE. 2008: 564–64
Use and misuse of the gene ontology annotations
NATURE REVIEWS GENETICS
2008; 9 (7): 509-515
The Gene Ontology (GO) project is a collaboration among model organism databases to describe gene products from all organisms using a consistent and computable language. GO produces sets of explicitly defined, structured vocabularies that describe biological processes, molecular functions and cellular components of gene products in both a computer- and human-readable manner. Here we describe key aspects of GO, which, when overlooked, can cause erroneous results, and address how these pitfalls can be avoided.
View details for DOI 10.1038/nrg2363
View details for Web of Science ID 000256832900011
View details for PubMedID 18475267
The low temperature-responsive, Solanum CBF1 genes maintain high identity in their upstream regions in a genomic environment undergoing gene duplications, deletions, and rearrangements
PLANT MOLECULAR BIOLOGY
2008; 67 (5): 483-497
Some plants like Arabidopsis thaliana increase in freezing tolerance when exposed to low nonfreezing temperatures, a process known as cold acclimation. Other plants including tomato, Solanum lycopersicum, are chilling sensitive and incur injury during prolonged low temperature exposure. A key initial event that occurs upon low temperature exposure is the induction of genes encoding the CBF transcription factors. In Arabidopsis three CBF genes, present in a tandemly-linked cluster, are induced by low temperatures. Tomato also harbors three tandemly-linked CBF genes, Sl-CBF3-CBF1-CBF2, but only one of these, Sl-CBF1, is low-temperature responsive. Here we report that Solanum species that are closely-allied to cultivated tomato essentially share this structural organization, but the locus is in a dynamic state of flux. Additional paralogs and in-frame deletions between adjacent genes occur, and the genomic regions flanking the CBF genes are dissimilar across Solanum species. Nevertheless, the CBF1 upstream region remains intact and highly conserved. This feature differed for CBF2 and CBF3, whose upstream regions were far less conserved. CBF1 was also the only low-temperature responsive gene in the cluster and its expression was greatly affected by a circadian clock. The tuber-bearing S. tuberosum and S. commersonii also harbored a fourth gene, CBF4, which was also low temperature responsive. CBF4 was physically linked to CBF5 in S. tuberosum, but CBF5 was absent from S. commersonii. Phylogenic analyses suggest that CBF5-CBF4 resulted from the duplication of the CBF3-CBF1-CBF2 cluster. DNA sequence motifs shared between the Solanum CBF1 and CBF4 upstream regions were identified, portions of which were also present in the Arabidopsis CBF1-3 upstream regions. 
These results suggest that much greater functional constraints are placed upon the Solanum CBF1 upstream regions over the other CBF upstream regions and that CBF4 has retained the capacity for low temperature responsiveness following the duplication event that gave rise to CBF4.
View details for DOI 10.1007/s11103-008-9333-5
View details for Web of Science ID 000256661500004
View details for PubMedID 18415686
Molecular and cellular approaches for the detection of protein-protein interactions: latest techniques and current limitations
2008; 53 (4): 610-635
Homotypic and heterotypic protein interactions are crucial for all levels of cellular function, including architecture, regulation, metabolism, and signaling. Therefore, protein interaction maps represent essential components of post-genomic toolkits needed for understanding biological processes at a systems level. Over the past decade, a wide variety of methods have been developed to detect, analyze, and quantify protein interactions, including surface plasmon resonance spectroscopy, NMR, yeast two-hybrid screens, peptide tagging combined with mass spectrometry and fluorescence-based technologies. Fluorescence techniques range from co-localization of tags, which may be limited by the optical resolution of the microscope, to fluorescence resonance energy transfer-based methods that have molecular resolution and can also report on the dynamics and localization of the interactions within a cell. Proteins interact via highly evolved complementary surfaces with affinities that can vary over many orders of magnitude. Some of the techniques described in this review, such as surface plasmon resonance, provide detailed information on physical properties of these interactions, while others, such as two-hybrid techniques and mass spectrometry, are amenable to high-throughput analysis using robotics. In addition to providing an overview of these methods, this review emphasizes techniques that can be applied to determine interactions involving membrane proteins, including the split ubiquitin system and fluorescence-based technologies for characterizing hits obtained with high-throughput approaches. Mass spectrometry-based methods are covered in a review by Miernyk and Thelen (2008; this issue, pp. 597-609). In addition, we discuss the use of interaction data to construct interaction networks and as the basis for the exciting possibility of using them to predict interaction surfaces.
View details for DOI 10.1111/j.1365-313X.2007.03332.x
View details for Web of Science ID 000252931800002
View details for PubMedID 18269572
The Plant Ontology Database: a community resource for plant structure and developmental stages controlled vocabulary and annotations
NUCLEIC ACIDS RESEARCH
2008; 36: D449-D454
The Plant Ontology Consortium (POC, http://www.plantontology.org) is a collaborative effort among model plant genome databases and plant researchers that aims to create, maintain and facilitate the use of a controlled vocabulary (ontology) for plants. The ontology allows users to ascribe attributes of plant structure (anatomy and morphology) and developmental stages to data types, such as genes and phenotypes, to provide a semantic framework to make meaningful cross-species and database comparisons. The POC builds upon groundbreaking work by the Gene Ontology Consortium (GOC) by adopting and extending the GOC's principles, existing software and database structure. Over the past year, POC has added hundreds of ontology terms to associate with thousands of genes and gene products from Arabidopsis, rice and maize, which are available through a newly updated web-based browser (http://www.plantontology.org/amigo/go.cgi) for viewing, searching and querying. The Consortium has also implemented new functionalities to facilitate the application of PO in genomic research and updated the website to keep the contents current. In this report, we present a brief description of resources available from the website, changes to the interfaces, data updates, community activities and future enhancement.
View details for DOI 10.1093/nar/gkm908
View details for Web of Science ID 000252545400081
View details for PubMedID 18194960
View details for PubMedCentralID PMC2238838
The Gene Ontology project in 2008
NUCLEIC ACIDS RESEARCH
2008; 36: D440-D444
The Gene Ontology (GO) project (http://www.geneontology.org/) provides a set of structured, controlled vocabularies for community use in annotating genes, gene products and sequences (also see http://www.sequenceontology.org/). The ontologies have been extended and refined for several biological areas, and improvements to the structure of the ontologies have been implemented. To improve the quantity and quality of gene product annotations available from its public repository, the GO Consortium has launched a focused effort to provide comprehensive and detailed annotation of orthologous genes across a number of 'reference' genomes, including human and several key model organisms. Software developments include two releases of the ontology-editing tool OBO-Edit, and improvements to the AmiGO browser interface.
View details for DOI 10.1093/nar/gkm883
View details for Web of Science ID 000252545400079
View details for PubMedID 17984083
View details for PubMedCentralID PMC2238979
The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases
NUCLEIC ACIDS RESEARCH
2008; 36: D623-D631
MetaCyc (MetaCyc.org) is a universal database of metabolic pathways and enzymes from all domains of life. The pathways in MetaCyc are curated from the primary scientific literature, and are experimentally determined small-molecule metabolic pathways. Each reaction in a MetaCyc pathway is annotated with one or more well-characterized enzymes. Because MetaCyc contains only experimentally elucidated knowledge, it provides a uniquely high-quality resource for metabolic pathways and enzymes. BioCyc (BioCyc.org) is a collection of more than 350 organism-specific Pathway/Genome Databases (PGDBs). Each BioCyc PGDB contains the predicted metabolic network of one organism, including metabolic pathways, enzymes, metabolites and reactions predicted by the Pathway Tools software using MetaCyc as a reference database. BioCyc PGDBs also contain predicted operons and predicted pathway hole fillers (predictions of which enzymes may catalyze pathway reactions that have not been assigned to an enzyme). The BioCyc website offers many tools for computational analysis of PGDBs, including comparative analysis and analysis of omics data in a pathway context. The BioCyc PGDBs generated by SRI are offered for adoption by any interested party for the ongoing integration of metabolic and genome-related information about an organism.
View details for DOI 10.1093/nar/gkm900
View details for Web of Science ID 000252545400112
View details for PubMedID 17965431
View details for PubMedCentralID PMC2238876
- Minimum reporting standards for plant biology context information in metabolomic studies METABOLOMICS 2007; 3 (3): 195-201
The plant structure ontology, a unified vocabulary of anatomy and morphology of a flowering plant
2007; 143 (2): 587-599
Formal description of plant phenotypes and standardized annotation of gene expression and protein localization data require uniform terminology that accurately describes plant anatomy and morphology. This facilitates cross species comparative studies and quantitative comparison of phenotypes and expression patterns. A major drawback is variable terminology that is used to describe plant anatomy and morphology in publications and genomic databases for different species. The same terms are sometimes applied to different plant structures in different taxonomic groups. Conversely, similar structures are named by their species-specific terms. To address this problem, we created the Plant Structure Ontology (PSO), the first generic ontological representation of anatomy and morphology of a flowering plant. The PSO is intended for a broad plant research community, including bench scientists, curators in genomic databases, and bioinformaticians. The initial releases of the PSO integrated existing ontologies for Arabidopsis (Arabidopsis thaliana), maize (Zea mays), and rice (Oryza sativa); more recent versions of the ontology encompass terms relevant to Fabaceae, Solanaceae, additional cereal crops, and poplar (Populus spp.). Databases such as The Arabidopsis Information Resource, Nottingham Arabidopsis Stock Centre, Gramene, MaizeGDB, and SOL Genomics Network are using the PSO to describe expression patterns of genes and phenotypes of mutants and natural variants and are regularly contributing new annotations to the Plant Ontology database. The PSO is also used in specialized public databases, such as BRENDA, GENEVESTIGATOR, NASCArrays, and others. Over 10,000 gene annotations and phenotype descriptions from participating databases can be queried and retrieved using the Plant Ontology browser. The PSO, as well as contributed gene associations, can be obtained at www.plantontology.org.
View details for DOI 10.1104/pp.106.092825
View details for Web of Science ID 000244032400005
View details for PubMedID 17142475
View details for PubMedCentralID PMC1803752
Whole-plant growth stage ontology for angiosperms and its application in plant biology
2006; 142 (2): 414-428
Plant growth stages are identified as distinct morphological landmarks in a continuous developmental process. The terms describing these developmental stages record the morphological appearance of the plant at a specific point in its life cycle. The widely differing morphology of plant species consequently gave rise to heterogeneous vocabularies describing growth and development. Each species or family specific community developed distinct terminologies for describing whole-plant growth stages. This semantic heterogeneity made it impossible to use growth stage description contained within plant biology databases to make meaningful computational comparisons. The Plant Ontology Consortium (http://www.plantontology.org) was founded to develop standard ontologies describing plant anatomical as well as growth and developmental stages that can be used for annotation of gene expression patterns and phenotypes of all flowering plants. In this article, we describe the development of a generic whole-plant growth stage ontology that describes the spatiotemporal stages of plant growth as a set of landmark events that progress from germination to senescence. This ontology represents a synthesis and integration of terms and concepts from a variety of species-specific vocabularies previously used for describing phenotypes and genomic information. It provides a common platform for annotating gene function and gene expression in relation to the developmental trajectory of a plant described at the organismal level. As proof of concept the Plant Ontology Consortium used the plant ontology growth stage ontology to annotate genes and phenotypes in plants with initial emphasis on those represented in The Arabidopsis Information Resource, Gramene database, and MaizeGDB.
View details for DOI 10.1104/pp.106.085720
View details for Web of Science ID 000241161900004
View details for PubMedID 16905665
Systematic analysis of Arabidopsis organelles and a protein localization database for facilitating fluorescent tagging of full-length Arabidopsis proteins
2006; 141 (2): 527-539
Cells are organized into a complex network of subcellular compartments that are specialized for various biological functions. Subcellular location is an important attribute of protein function. To facilitate systematic elucidation of protein subcellular location, we analyzed experimentally verified protein localization data of 1,300 Arabidopsis (Arabidopsis thaliana) proteins. The 1,300 experimentally verified proteins are distributed among 40 different compartments, with most of the proteins localized to four compartments: mitochondria (36%), nucleus (28%), plastid (17%), and cytosol (13.3%). About 19% of the proteins are found in multiple compartments, in which a high proportion (36.4%) is localized to both cytosol and nucleus. Characterization of the overrepresented Gene Ontology molecular functions and biological processes suggests that the Golgi apparatus and peroxisome may play more diverse functions but are involved in more specialized processes than other compartments. To support systematic empirical determination of protein subcellular localization using a technology called fluorescent tagging of full-length proteins, we developed a database and Web application to provide preselected green fluorescent protein insertion position and primer sequences for all Arabidopsis proteins to study their subcellular localization and to store experimentally verified protein localization images, videos, and their annotations of proteins generated using the fluorescent tagging of full-length proteins technology. The database can be searched, browsed, and downloaded using a Web browser at http://aztec.stanford.edu/gfp/. The software can also be downloaded from the same Web site for local installation.
View details for DOI 10.1104/pp.106.078881
View details for Web of Science ID 000238168800028
View details for PubMedID 16617091
View details for PubMedCentralID PMC1475441
Taking the first steps towards a standard for reporting on phylogenies: Minimum information about a phylogenetic analysis (MIAPA)
OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY
2006; 10 (2): 231-237
In the eight years since phylogenomics was introduced as the intersection of genomics and phylogenetics, the field has provided fundamental insights into gene function, genome history and organismal relationships. The utility of phylogenomics is growing with the increase in the number and diversity of taxa for which whole genome and large transcriptome sequence sets are being generated. We assert that the synergy between genomic and phylogenetic perspectives in comparative biology would be enhanced by the development and refinement of minimal reporting standards for phylogenetic analyses. Encouraged by the development of the Minimum Information About a Microarray Experiment (MIAME) standard, we propose a similar roadmap for the development of a Minimal Information About a Phylogenetic Analysis (MIAPA) standard. Key in the successful development and implementation of such a standard will be broad participation by developers of phylogenetic analysis software, phylogenetic database developers, practitioners of phylogenomics, and journal editors.
View details for Web of Science ID 000240210900021
View details for PubMedID 16901231
PubSearch and PubFetch: a simple management system for semiautomated retrieval and annotation of biological information from the literature.
Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.]
2006; Chapter 9: Unit9 7-?
For most systems in biology, a large body of literature exists that describes the complexity of the system based on experimental results. Manual review of this literature to extract targeted information into biological databases is difficult and time consuming. To address this problem, we developed PubSearch and PubFetch, which store literature, keyword, and gene information in a relational database, index the literature with keywords and gene names, and provide a Web user interface for annotating the genes from experimental data found in the associated literature. A set of protocols is provided in this unit for installing, populating, running, and using PubSearch and PubFetch. In addition, we provide support protocols for performing controlled vocabulary annotations. Intended users of PubSearch and PubFetch are database curators and biology researchers interested in tracking the literature and capturing information about genes of interest in a more effective way than with conventional spreadsheets and lab notebooks.
View details for DOI 10.1002/0471250953.bi0907s13
View details for PubMedID 18428773
Bioinformatics and its applications in plant biology
ANNUAL REVIEW OF PLANT BIOLOGY
2006; 57: 335-360
Bioinformatics plays an essential role in today's plant science. As the amount of data grows exponentially, there is a parallel growth in the demand for tools and methods in data management, visualization, integration, analysis, modeling, and prediction. At the same time, many researchers in biology are unfamiliar with available bioinformatics methods, tools, and databases, which could lead to missed opportunities or misinterpretation of the information. In this review, we describe some of the key concepts, methods, software packages, and databases used in bioinformatics, with an emphasis on those relevant to plant science. We also cover some fundamental issues related to biological sequence analyses, transcriptome analyses, computational proteomics, computational metabolomics, bio-ontologies, and biological databases. Finally, we explore a few emerging research topics in bioinformatics.
View details for DOI 10.1146/annurev.arplant.56.032604.144103
View details for Web of Science ID 000239807700013
View details for PubMedID 16669765
The Gene Ontology (GO) project in 2006
NUCLEIC ACIDS RESEARCH
2006; 34: D322-D326
The Gene Ontology (GO) project (http://www.geneontology.org) develops and uses a set of structured, controlled vocabularies for community use in annotating genes, gene products and sequences (also see http://song.sourceforge.net/). The GO Consortium continues to improve the vocabulary content, reflecting the impact of several novel mechanisms of incorporating community input. A growing number of model organism databases and genome annotation groups contribute annotation sets using GO terms to GO's public repository. Updates to the AmiGO browser have improved access to contributed genome annotations. As the GO project continues to grow, the use of the GO vocabularies is becoming more varied as well as more widespread. The GO project provides an ontological annotation system that enables biologists to infer knowledge from large amounts of data.
View details for DOI 10.1093/nar/gkj021
View details for Web of Science ID 000239307700070
View details for PubMedID 16381878
MetaCyc: a multiorganism database of metabolic pathways and enzymes
NUCLEIC ACIDS RESEARCH
2006; 34: D511-D516
MetaCyc is a database of metabolic pathways and enzymes located at http://MetaCyc.org/. Its goal is to serve as a metabolic encyclopedia, containing a collection of non-redundant pathways central to small molecule metabolism, which have been reported in the experimental literature. Most of the pathways in MetaCyc occur in microorganisms and plants, although animal pathways are also represented. MetaCyc contains metabolic pathways, enzymatic reactions, enzymes, chemical compounds, genes and review-level comments. Enzyme information includes substrate specificity, kinetic properties, activators, inhibitors, cofactor requirements and links to sequence and structure databases. Data are curated from the primary literature by curators with expertise in biochemistry and molecular biology. MetaCyc serves as a readily accessible comprehensive resource on microbial and plant pathways for genome analysis, basic research, education, metabolic engineering and systems biology. Querying, visualization and curation of the database are supported by SRI's Pathway Tools software. The PathoLogic component of Pathway Tools is used in conjunction with MetaCyc to predict the metabolic network of an organism from its annotated genome. SRI and the European Bioinformatics Institute employed this tool to create pathway/genome databases (PGDBs) for 165 organisms, available at the BioCyc.org website. These PGDBs also include predicted operons and pathway hole fillers.
View details for DOI 10.1093/nar/gkj128
View details for Web of Science ID 000239307700112
View details for PubMedID 16381923
- MIAME/Plant - adding value to plant microarrray experiments PLANT METHODS 2006; 2
Plant Ontology (PO): a controlled vocabulary of plant structures and growth stages
COMPARATIVE AND FUNCTIONAL GENOMICS
2005; 6 (7-8): 388-397
The Plant Ontology Consortium (POC) (www.plantontology.org) is a collaborative effort among several plant databases and experts in plant systematics, botany and genomics. A primary goal of the POC is to develop simple yet robust and extensible controlled vocabularies that accurately reflect the biology of plant structures and developmental stages. These provide a network of vocabularies linked by relationships (ontology) to facilitate queries that cut across datasets within a database or between multiple databases. The current version of the ontology integrates diverse vocabularies used to describe Arabidopsis, maize and rice (Oryza sp.) anatomy, morphology and growth stages. Using the ontology browser, over 3500 gene annotations from three species-specific databases, The Arabidopsis Information Resource (TAIR) for Arabidopsis, Gramene for rice and MaizeGDB for maize, can now be queried and retrieved.
View details for DOI 10.1002/cfg.496
View details for Web of Science ID 000235811600007
View details for PubMedID 18629207
PatMatch: a program for finding patterns in peptide and nucleotide sequences
NUCLEIC ACIDS RESEARCH
2005; 33: W262-W266
Here, we present PatMatch, an efficient, web-based pattern-matching program that enables searches for short nucleotide or peptide sequences such as cis-elements in nucleotide sequences or small domains and motifs in protein sequences. The program can be used to find matches to a user-specified sequence pattern that can be described using ambiguous sequence codes and a powerful and flexible pattern syntax based on regular expressions. A recent upgrade has improved performance and now supports both mismatches and wildcards in a single pattern. This enhancement has been achieved by replacing the previous searching algorithm, scan_for_matches [D'Souza et al. (1997), Trends in Genetics, 13, 497-498], with nondeterministic-reverse grep (NR-grep), a general pattern matching tool that allows for approximate string matching [Navarro (2001), Software Practice and Experience, 31, 1265-1312]. We have tailored NR-grep to be used for DNA and protein searches with PatMatch. The stand-alone version of the software can be adapted for use with any sequence dataset and is available for download at The Arabidopsis Information Resource (TAIR) at ftp://ftp.arabidopsis.org/home/tair/Software/Patmatch/. The PatMatch server is available on the web at http://www.arabidopsis.org/cgi-bin/patmatch/nph-patmatch.pl for searching Arabidopsis thaliana sequences.
View details for DOI 10.1093/nar/gki368
View details for Web of Science ID 000230271400050
View details for PubMedID 15980466
View details for PubMedCentralID PMC1160129
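The PatMatch abstract above describes patterns written with ambiguous sequence codes and a regular-expression-based syntax. The following is a minimal Python sketch of that general idea only: it expands IUPAC nucleotide ambiguity codes into regular-expression character classes and reports every hit. It is not PatMatch's code and does not use NR-grep, it does not handle mismatches, and the function names and the example sequence are hypothetical.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(pattern: str) -> str:
    """Translate an IUPAC nucleotide pattern into a regular expression."""
    return "".join(IUPAC_DNA[base] for base in pattern.upper())


def find_matches(pattern: str, sequence: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for every hit, overlaps included."""
    # Wrapping the pattern in a zero-width lookahead lets overlapping hits
    # be reported, which a plain finditer scan would skip.
    regex = re.compile(f"(?=({iupac_to_regex(pattern)}))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    # CANNTG is the E-box consensus; the sequence is made up for illustration.
    for start, hit in find_matches("CANNTG", "ttCACGTGaaCATATGcc"):
        print(start, hit)
```

Exact matching with character classes is the simplest case. Supporting mismatches, as the upgraded PatMatch does, requires an approximate string matching algorithm rather than a plain regular expression.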
- Bioinformatics. Current limitations and insights for the future PLANT PHYSIOLOGY 2005; 138 (2): 569-570
MetaCyc and AraCyc. Metabolic pathway databases for plant research
2005; 138 (1): 27-37
MetaCyc (http://metacyc.org) contains experimentally determined biochemical pathways to be used as a reference database for metabolism. In conjunction with the Pathway Tools software, MetaCyc can be used to computationally predict the metabolic pathway complement of an annotated genome. To increase the breadth of pathways and enzymes, more than 60 plant-specific pathways have been added or updated in MetaCyc recently. In contrast to MetaCyc, which contains metabolic data for a wide range of organisms, AraCyc is a species-specific database containing only enzymes and pathways found in the model plant Arabidopsis (Arabidopsis thaliana). AraCyc (http://arabidopsis.org/tools/aracyc/) was the first computationally predicted plant metabolism database derived from MetaCyc. Since its initial computational build, AraCyc has been under continued curation to enhance data quality and to increase breadth of pathway coverage. Twenty-eight pathways have been manually curated from the literature recently. Pathway predictions in AraCyc have also been recently updated with the latest functional annotations of Arabidopsis genes that use controlled vocabulary and literature evidence. AraCyc currently features 1,418 unique genes mapped onto 204 pathways with 1,156 literature citations. The Omics Viewer, a user data visualization and analysis tool, allows a list of genes, enzymes, or metabolites with experimental values to be painted on a diagram of the full pathway map of AraCyc. Other recent enhancements to both MetaCyc and AraCyc include implementation of an evidence ontology, which has been used to provide information on data quality, expansion of the secondary metabolism node of the pathway ontology to accommodate curation of secondary metabolic pathways, and enhancement of the cellular component ontology for storing and displaying enzyme and pathway locations within subcellular compartments.
View details for DOI 10.1104/pp.105.060376
View details for Web of Science ID 000229023100004
View details for PubMedID 15888675
View details for PubMedCentralID PMC1104157
- Biological databases for plant research PLANT PHYSIOLOGY 2005; 138 (1): 1-3
Using the Arabidopsis Information Resource (TAIR) to find information about Arabidopsis genes.
Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.]
2005; Chapter 1: Unit 1.11
The Arabidopsis Information Resource (TAIR; http://www.arabidopsis.org) is a comprehensive Web resource of Arabidopsis biology for plant scientists. TAIR curates and integrates information about genes, proteins, gene expression, mutant phenotypes, biological materials such as DNA and seed stocks, genetic markers, genetic and physical maps, biochemical pathways, genome organization, images of mutant plants and protein sub-cellular localizations, publications, and the research community. Data in TAIR are extensively interconnected and can be accessed through a variety of Web-based search and display tools. This unit primarily focuses on some basic methods for searching, browsing, visualizing, and analyzing information about Arabidopsis genes. Gene expression data from microarrays are a recent addition to the database, and methods for accessing these data are also described. Two pattern identification programs are described for mining TAIR's unique Arabidopsis sequence data sets. We also describe how to use AraCyc for mining plant metabolic pathways.
View details for DOI 10.1002/0471250953.bi0111s9
View details for PubMedID 18428741
An ontology for cell types
2005; 6 (2)
We describe an ontology for cell types that covers the prokaryotic, fungal, animal and plant worlds. It includes over 680 cell types. These cell types are classified under several generic categories and are organized as a directed acyclic graph. The ontology is available in the formats adopted by the Open Biological Ontologies umbrella and is designed to be used in the context of model organism genome and other biological databases. The ontology is freely available at http://obo.sourceforge.net/ and can be viewed using standard ontology visualization tools such as OBO-Edit and COBrA.
View details for Web of Science ID 000227026500017
View details for PubMedID 15693950
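The cell type ontology above is described as a directed acyclic graph, in which a term may sit under more than one parent. A minimal Python sketch of such a structure follows, assuming only is_a links. The class, its methods, and the example terms are hypothetical and are not taken from the ontology files or from OBO-Edit or COBrA.

```python
from collections import defaultdict


class Ontology:
    """Terms linked by is_a edges; a term may have several parents (a DAG)."""

    def __init__(self) -> None:
        self.parents: dict[str, set[str]] = defaultdict(set)

    def add_is_a(self, child: str, parent: str) -> None:
        """Record 'child is_a parent', refusing edges that would form a cycle."""
        if child == parent or child in self.ancestors(parent):
            raise ValueError(f"{child} is_a {parent} would create a cycle")
        self.parents[child].add(parent)

    def ancestors(self, term: str) -> set[str]:
        """All terms reachable by following is_a edges upward from term."""
        seen: set[str] = set()
        stack = list(self.parents.get(term, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen


if __name__ == "__main__":
    # Illustrative terms only; not copied from the published ontology.
    onto = Ontology()
    onto.add_is_a("plant cell", "cell")
    onto.add_is_a("epidermal cell", "cell")
    onto.add_is_a("guard cell", "plant cell")
    onto.add_is_a("guard cell", "epidermal cell")
    print(sorted(onto.ancestors("guard cell")))
```

Because a term can have several parents, a query for everything annotated under a generic category has to walk all upward paths, which is what `ancestors` does.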
Community-based gene structure annotation
TRENDS IN PLANT SCIENCE
2005; 10 (1): 9-14
Uncertainty and inconsistency of gene structure annotation remain limitations on research in the genome era, frustrating both biologists and bioinformaticians, who have to sort out annotation errors for their genes of interest or to generate trustworthy datasets for algorithmic development. It is unrealistic to hope for better software solutions in the near future that would solve all the problems. The issue is all the more urgent with more species being sequenced and analyzed by comparative genomics - erroneous annotations could easily propagate, whereas correct annotations in one species will greatly facilitate annotation of novel genomes. We propose a dynamic, economically feasible solution to the annotation predicament: broad-based, web-technology-enabled community annotation, a prototype of which is now in use for Arabidopsis.
View details for DOI 10.1016/j.tplants.2004.11.002
View details for Web of Science ID 000226764900003
View details for PubMedID 15642518
A proposed framework for the description of plant metabolomics experiments and their results
2004; 22 (12): 1601-1606
The study of the metabolite complement of biological samples, known as metabolomics, is creating large amounts of data, and support for handling these data sets is required to facilitate meaningful analyses that will answer biological questions. We present a data model for plant metabolomics known as ArMet (architecture for metabolomics). It encompasses the entire experimental time line from experiment definition and description of biological source material, through sample growth and preparation to the results of chemical analysis. Such formal data descriptions, which specify the full experimental context, enable principled comparison of data sets, allow proper interpretation of experimental results, permit the repetition of experiments and provide a basis for the design of systems for data storage and transmission. The current design and example implementations are freely available (http://www.armet.org/). We seek to advance discussion and community adoption of a standard for metabolomics, which would promote principled collection, storage and transmission of experiment data.
View details for DOI 10.1038/nbt1041
View details for Web of Science ID 000225638600040
View details for PubMedID 15583675
Freezing-sensitive tomato has a functional CBF cold response pathway, but a CBF regulon that differs from that of freezing-tolerant Arabidopsis
2004; 39 (6): 905-919
Many plants increase in freezing tolerance in response to low temperature, a process known as cold acclimation. In Arabidopsis, cold acclimation involves action of the CBF cold response pathway. Key components of the pathway include rapid cold-induced expression of three homologous genes encoding transcriptional activators, CBF1, 2 and 3 (also known as DREB1b, c and a, respectively), followed by expression of CBF-targeted genes, the CBF regulon, that increase freezing tolerance. Unlike Arabidopsis, tomato cannot cold acclimate, raising the question of whether it has a functional CBF cold response pathway. Here we show that tomato, like Arabidopsis, encodes three CBF homologs, LeCBF1-3 (Lycopersicon esculentum CBF1-3), that are present in tandem array in the genome. Only the tomato LeCBF1 gene, however, was found to be cold-inducible. As is the case for Arabidopsis CBF1-3, transcripts for LeCBF1-3 did accumulate in response to mechanical agitation, but not in response to drought, ABA or high salinity. Constitutive overexpression of LeCBF1 in transgenic Arabidopsis plants induced expression of CBF-targeted genes and increased freezing tolerance, indicating that LeCBF1 encodes a functional homolog of the Arabidopsis CBF1-3 proteins. However, constitutive overexpression of either LeCBF1 or AtCBF3 in transgenic tomato plants did not increase freezing tolerance. Gene expression studies, including the use of a cDNA microarray representing approximately 8000 tomato genes, identified only four genes that were induced 2.5-fold or more in the LeCBF1 or AtCBF3 overexpressing plants, three of which were putative members of the tomato CBF regulon as they were also upregulated in response to low temperature. Additional experiments indicated that of eight tomato genes that were likely orthologs of Arabidopsis CBF regulon genes, none were responsive to CBF overexpression in tomato. From these results, we conclude that tomato has a complete CBF cold response pathway, but that the tomato CBF regulon differs from that of Arabidopsis and appears to be considerably smaller and less diverse in function.
View details for DOI 10.1111/j.1365-313X.2004.02176.x
View details for Web of Science ID 000224178500009
View details for PubMedID 15341633
Design, implementation and maintenance of a model organism database for Arabidopsis thaliana
COMPARATIVE AND FUNCTIONAL GENOMICS
2004; 5 (4): 362-369
The Arabidopsis Information Resource (TAIR) is a web-based community database for the model plant Arabidopsis thaliana. It provides an integrated view of genes, sequences, proteins, germplasms, clones, metabolic pathways, gene expression, ecotypes, polymorphisms, publications, maps and community information. TAIR is developed and maintained by collaboration between software developers and biologists. Biologists provide specification and use cases for the system, acquire, analyse and curate data, interact with users and test the software. Software developers design, implement and test the database and software. In this review, we briefly describe how TAIR was built and is being maintained.
View details for DOI 10.1002/cfg.408
View details for Web of Science ID 000222231600005
View details for PubMedID 18629167
View details for PubMedCentralID PMC2447457
Functional annotation of the Arabidopsis genome using controlled vocabularies
2004; 135 (2): 745-755
Controlled vocabularies are increasingly used by databases to describe genes and gene products because they facilitate identification of similar genes within an organism or among different organisms. One of The Arabidopsis Information Resource's goals is to associate all Arabidopsis genes with terms developed by the Gene Ontology Consortium that describe the molecular function, biological process, and subcellular location of a gene product. We have also developed terms describing Arabidopsis anatomy and developmental stages and use these to annotate published gene expression data. As of March 2004, we used computational and manual annotation methods to make 85,666 annotations representing 26,624 unique loci. We focus on associating genes to controlled vocabulary terms based on experimental data from the literature and use The Arabidopsis Information Resource-developed PubSearch software to facilitate this process. Each annotation is tagged with a combination of evidence codes, evidence descriptions, and references that provide a robust means to assess data quality. Annotation of all Arabidopsis genes will allow quantitative comparisons between sets of genes derived from sources such as microarray experiments. The Arabidopsis annotation data will also facilitate annotation of newly sequenced plant genomes by using sequence similarity to transfer annotations to homologous genes. In addition, complete and up-to-date annotations will make unknown genes easy to identify and target for experimentation. Here, we describe the process of Arabidopsis functional annotation using a variety of data sources and illustrate several ways in which this information can be accessed and used to infer knowledge about Arabidopsis and other plant species.
View details for DOI 10.1104/pp.104.040071
View details for Web of Science ID 000222165300020
View details for PubMedID 15173566
View details for PubMedCentralID PMC514112
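The functional annotation abstract above notes that each annotation carries an evidence code and a reference, and that annotated gene sets can be compared quantitatively. A minimal Python sketch of one way to model this follows. The record layout, the function name, and all loci, terms and references are hypothetical and do not reflect TAIR's schema or PubSearch; the evidence codes used (IDA, IMP, IGI, IPI, IEP, IEA) are standard Gene Ontology evidence codes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Annotation:
    locus: str          # gene identifier
    term: str           # controlled-vocabulary term
    evidence_code: str  # e.g. IDA (direct assay), IEA (electronic annotation)
    reference: str      # source supporting the annotation


# GO evidence codes that denote experimental support.
EXPERIMENTAL_CODES = frozenset({"IDA", "IMP", "IGI", "IPI", "IEP"})


def term_counts(
    annotations: Iterable[Annotation],
    loci: set[str],
    codes: Optional[frozenset[str]] = None,
) -> Counter:
    """Count how many of the given loci carry each term.

    If codes is given, only annotations with one of those evidence codes
    are counted. Each (locus, term) pair is counted once.
    """
    pairs = {
        (a.locus, a.term)
        for a in annotations
        if a.locus in loci and (codes is None or a.evidence_code in codes)
    }
    return Counter(term for _, term in pairs)


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        Annotation("LOCUS_A", "nucleus", "IDA", "REF:1"),
        Annotation("LOCUS_A", "nucleus", "IEA", "REF:2"),
        Annotation("LOCUS_B", "nucleus", "IEA", "REF:2"),
        Annotation("LOCUS_B", "kinase activity", "IMP", "REF:3"),
    ]
    gene_set = {"LOCUS_A", "LOCUS_B"}
    print(term_counts(records, gene_set))
    print(term_counts(records, gene_set, EXPERIMENTAL_CODES))
```

Filtering on evidence codes before counting is one simple way to use them as a data quality measure: the same gene set can be summarized with all annotations or with experimentally supported ones only.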
High-throughput fluorescent tagging of full-length Arabidopsis gene products in planta
2004; 135 (1): 25-38
We developed a high-throughput methodology, termed fluorescent tagging of full-length proteins (FTFLP), to analyze expression patterns and subcellular localization of Arabidopsis gene products in planta. Determination of these parameters is a logical first step in functional characterization of the approximately one-third of all known Arabidopsis genes that encode novel proteins of unknown function. Our FTFLP-based approach offers two significant advantages: first, it produces internally-tagged full-length proteins that are likely to exhibit native intracellular localization, and second, it yields information about the tissue specificity of gene expression by the use of native promoters. To demonstrate how FTFLP may be used for characterization of the Arabidopsis proteome, we tagged a series of known proteins with diverse subcellular targeting patterns as well as several proteins with unknown function and unassigned subcellular localization.
View details for DOI 10.1104/pp.104.040139
View details for Web of Science ID 000221420800005
View details for PubMedID 15141064
View details for PubMedCentralID PMC429330
Strategies for avoiding reinventing the precollege education and outreach wheel
2004; 166 (4): 1601-1609
The National Science Foundation's recent mandate that all Principal Investigators address the broader impacts of their research has prompted an unprecedented number of scientists to seek opportunities to participate in precollege education and outreach. To help interested geneticists avoid duplicating efforts and make use of existing resources, we examined several precollege genetics, genomics, and biotechnology education efforts and noted the elements that contributed to their success, indicated by program expansion, participant satisfaction, or participant learning. Identifying a specific audience and their needs and resources, involving K-12 teachers in program development, and evaluating program efforts are integral to program success. We highlighted a few innovative programs to illustrate these findings. Challenges that may compromise further development and dissemination of these programs include absence of reward systems for participation in outreach as well as lack of training for scientists doing outreach. Several programs and institutions are tackling these issues in ways that will help sustain outreach efforts while allowing them to be modified to meet the changing needs of their participants, including scientists, teachers, and students. Most importantly, resources and personnel are available to facilitate greater and deeper involvement of scientists in precollege and public education.
View details for Web of Science ID 000221377700002
View details for PubMedID 15126383
View details for PubMedCentralID PMC1470816
- Ontologies in biology: Design, applications and future challenges NATURE REVIEWS GENETICS 2004; 5 (3): 213-222
- Carpe diem. Retooling the "publish or perish" model into the "share and survive" model PLANT PHYSIOLOGY 2004; 134 (2): 543-547
The Gene Ontology (GO) database and informatics resource
NUCLEIC ACIDS RESEARCH
2004; 32: D258-D261
The Gene Ontology (GO) project (http://www.geneontology.org/) provides structured, controlled vocabularies and classifications that cover several domains of molecular and cellular biology and are freely available for community use in the annotation of genes, gene products and sequences. Many model organism databases and genome annotation groups use the GO and contribute their annotation sets to the GO resource. The GO database integrates the vocabularies and contributed annotations and provides full access to this information in several formats. Members of the GO Consortium continually work collectively, involving outside experts as needed, to expand and update the GO vocabularies. The GO Web resource also provides access to extensive documentation about the GO project and links to applications that use GO data for functional analyses.
View details for DOI 10.1093/nar/gkh036
View details for Web of Science ID 000188079000059
View details for PubMedID 14681407
View details for PubMedCentralID PMC308770
MetaCyc: a multiorganism database of metabolic pathways and enzymes
NUCLEIC ACIDS RESEARCH
2004; 32: D438-D442
The MetaCyc database (see URL http://MetaCyc.org) is a collection of metabolic pathways and enzymes from a wide variety of organisms, primarily microorganisms and plants. The goal of MetaCyc is to contain a representative sample of each experimentally elucidated pathway, and thereby to catalog the universe of metabolism. MetaCyc also describes reactions, chemical compounds and genes. Many of the pathways and enzymes in MetaCyc contain extensive information, including comments and literature citations. SRI's Pathway Tools software supports querying, visualization and curation of MetaCyc. With its wide breadth and depth of metabolic information, MetaCyc is a valuable resource for a variety of applications. MetaCyc is the reference database of pathways and enzymes that is used in conjunction with SRI's metabolic pathway prediction program to create Pathway/Genome Databases that can be augmented with curation from the scientific literature and published on the world wide web. MetaCyc also serves as a readily accessible comprehensive resource on microbial and plant pathways for genome analysis, basic research, education, metabolic engineering and systems biology. In the past 2 years the data content and the Pathway Tools software used to query, visualize and edit MetaCyc have been expanded significantly. These enhancements are described in this paper.
View details for DOI 10.1093/nar/gkh100
View details for Web of Science ID 000188079000104
View details for PubMedID 14681452
Microspore separation in the quartet 3 mutants of Arabidopsis is impaired by a defect in a developmentally regulated polygalacturonase required for pollen mother cell wall degradation
2003; 133 (3): 1170-1180
Mutations in the QUARTET loci in Arabidopsis result in failure of microspore separation during pollen development due to a defect in degradation of the pollen mother cell wall during late stages of pollen development. Mutations in a new locus required for microspore separation, QRT3, were isolated, and the corresponding gene was cloned by T-DNA tagging. QRT3 encodes a protein that is approximately 30% similar to an endopolygalacturonase from peach (Prunus persica). The QRT3 protein was expressed in yeast (Saccharomyces cerevisiae) and found to exhibit polygalacturonase activity. In situ hybridization experiments showed that QRT3 is specifically and transiently expressed in the tapetum during the phase when microspores separate from their meiotic siblings. Immunohistochemical localization of QRT3 indicated that the protein is secreted from tapetal cells during the early microspore stage. Thus, QRT3 plays a direct role in degrading the pollen mother cell wall during microspore development.
View details for DOI 10.1104/pp.103.028266
View details for Web of Science ID 000186644600022
View details for PubMedID 14551328
View details for PubMedCentralID PMC281612
AraCyc: A biochemical pathway database for Arabidopsis
2003; 132 (2): 453-460
AraCyc is a database containing biochemical pathways of Arabidopsis, developed at The Arabidopsis Information Resource (http://www.arabidopsis.org). The aim of AraCyc is to represent Arabidopsis metabolism as completely as possible with a user-friendly Web-based interface. It presently features more than 170 pathways that include information on compounds, intermediates, cofactors, reactions, genes, proteins, and protein subcellular locations. The database uses Pathway Tools software, which allows the users to visualize a bird's eye view of all pathways in the database down to the individual chemical structures of the compounds. The database was built using Pathway Tools' Pathologic module with MetaCyc, a collection of pathways from more than 150 species, as a reference database. This initial build was manually refined and annotated. More than 20 plant-specific pathways, including carotenoid, brassinosteroid, and gibberellin biosyntheses have been added from the literature. A list of more than 40 plant pathways will be added in the coming months. The quality of the initial, automatic build of the database was compared with the manually improved version, and with EcoCyc, an Escherichia coli database using the same software system that has been manually annotated for many years. In addition, a Perl interface, PerlCyc, was developed that allows programmers to access Pathway Tools databases from the popular Perl language. AraCyc is available at the tools section of The Arabidopsis Information Resource Web site (http://www.arabidopsis.org/tools/aracyc).
View details for DOI 10.1104/pp.102.017236
View details for Web of Science ID 000185076600010
View details for PubMedID 12805578
View details for PubMedCentralID PMC166988
The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community
NUCLEIC ACIDS RESEARCH
2003; 31 (1): 224-228
Arabidopsis thaliana is the most widely-studied plant today. The concerted efforts of over 11 000 researchers and 4000 organizations around the world are generating a rich diversity and quantity of information and materials. This information is made available through a comprehensive on-line resource called the Arabidopsis Information Resource (TAIR) (http://arabidopsis.org), which is accessible via commonly used web browsers and can be searched and downloaded in a number of ways. In the last two years, efforts have been focused on increasing data content and diversity, functionally annotating genes and gene products with controlled vocabularies, and improving data retrieval, analysis and visualization tools. New information includes sequence polymorphisms including alleles, germplasms and phenotypes, Gene Ontology annotations, gene families, protein information, metabolic pathways, gene expression data from microarray experiments and seed and DNA stocks. New data visualization and analysis tools include SeqViewer, which interactively displays the genome from the whole chromosome down to 10 kb of nucleotide sequence, and AraCyc, a metabolic pathway database and map tool that allows overlaying expression data onto the pathway diagrams. Finally, we have recently incorporated seed and DNA stock information from the Arabidopsis Biological Resource Center (ABRC) and implemented a shopping-cart style on-line ordering system.
View details for DOI 10.1093/nar/gkg076
View details for Web of Science ID 000181079700051
View details for PubMedID 12519987
View details for PubMedCentralID PMC165523
Human immunodeficiency virus reverse transcriptase and protease sequence database
NUCLEIC ACIDS RESEARCH
2003; 31 (1): 298-303
The HIV reverse transcriptase and protease sequence database is an on-line relational database that catalogues evolutionary and drug-related sequence variation in the human immunodeficiency virus (HIV) reverse transcriptase (RT) and protease enzymes, the molecular targets of antiretroviral therapy (http://hivdb.stanford.edu). The database contains a compilation of nearly all published HIV RT and protease sequences, including submissions to GenBank, sequences published in journal articles and sequences of HIV isolates from persons participating in clinical trials. Sequences are linked to data about the source of the sequence, the antiretroviral drug treatment history of the person from whom the sequence was obtained and the results of in vitro drug susceptibility testing. Sequence data on two new molecular targets of HIV drug therapy--gp41 (cell fusion) and integrase--will be added to the database in 2003.
View details for DOI 10.1093/nar/gkg100
View details for Web of Science ID 000181079700071
View details for PubMedID 12520007
View details for PubMedCentralID PMC165547
TAIR: a resource for integrated Arabidopsis data.
Functional & integrative genomics
2002; 2 (6): 239-253
The Arabidopsis Information Resource (TAIR; http://arabidopsis.org) provides an integrated view of genomic data for Arabidopsis thaliana. The information is obtained from a battery of sources, including the Arabidopsis user community, the literature, and the major genome centers. Currently TAIR provides information about genes, markers, polymorphisms, maps, sequences, clones, DNA and seed stocks, gene families and proteins. In addition, users can find Arabidopsis publications and information about Arabidopsis researchers. Our emphasis is now on incorporating functional annotations of genes and gene products, genome-wide expression, and biochemical pathway data. Among the tools developed at TAIR, the most notable is the Sequence Viewer, which displays gene annotation, clones, transcripts, markers and polymorphisms on the Arabidopsis genome, and allows zooming in to the nucleotide level. A tool recently released is AraCyc, which is designed for visualization of biochemical pathways. We are also developing tools to extract information from the literature in a systematic way, and building controlled vocabularies to describe biological concepts in collaboration with other database groups. A significant new feature is the integration of the ABRC database functions and stock ordering system, which allows users to place orders for seed and DNA stocks directly from the TAIR site.
View details for PubMedID 12444417
Surviving in a sea of data: a survey of plant genome data resources and issues in building data management systems
PLANT MOLECULAR BIOLOGY
2002; 48 (1-2): 59-74
Exponential growth of data, largely from whole-genome analyses, has changed the way biologists think about and handle data. Optimal use of these data requires effective methods to analyze and manage these data sets. Computers, software and the World Wide Web are now integral components of biological discovery. Understanding how information is obtained, processed and annotated in public databases allows researchers to effectively organize, analyze and export their own data into these databases. In this review we focus largely on two areas related to management of genomic data. We cite examples of resources available in the public domain and describe some of the software for data management systems currently available for plant research. In addition, we discuss a few concepts of data management from the perspective of an individual or group that wishes to provide data to the public databases, to use the information in the public databases more efficiently, or to develop a database to manage large data sets internally or for public access. These concepts include data descriptions, exchange format, curation, attribution, and database implementation.
View details for Web of Science ID 000173211000005
View details for PubMedID 11860214
Creating the gene ontology resource: Design and implementation
2001; 11 (8): 1425-1433
The exponential growth in the volume of accessible biological information has generated a confusion of voices surrounding the annotation of molecular information about genes and their products. The Gene Ontology (GO) project seeks to provide a set of structured vocabularies for specific biological domains that can be used to describe gene products in any organism. This work includes building three extensive ontologies to describe molecular function, biological process, and cellular component, and providing a community database resource that supports the use of these ontologies. The GO Consortium was initiated by scientists associated with three model organism databases: SGD, the Saccharomyces Genome database; FlyBase, the Drosophila genome database; and MGD/GXD, the Mouse Genome Informatics databases. Additional model organism database groups are joining the project. Each of these model organism information systems is annotating genes and gene products using GO vocabulary terms and incorporating these annotations into their respective model organism databases. Each database contributes its annotation files to a shared GO data resource accessible to the public at http://www.geneontology.org/. The GO site can be used by the community both to recover the GO vocabularies and to access the annotated gene product data sets from the model organism databases. The GO Consortium supports the development of the GO database resource and provides tools enabling curators and researchers to query and manipulate the vocabularies. We believe that the shared development of this molecular annotation resource will contribute to the unification of biological information.
View details for Web of Science ID 000170263900015
View details for PubMedID 11483584
The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis, and visualization system for a model plant
NUCLEIC ACIDS RESEARCH
2001; 29 (1): 102-105
Arabidopsis thaliana, a small annual plant belonging to the mustard family, is the subject of study by an estimated 7000 researchers around the world. In addition to the large body of genetic, physiological and biochemical data gathered for this plant, it will be the first higher plant genome to be completely sequenced, with completion expected at the end of the year 2000. The sequencing effort has been coordinated by an international collaboration, the Arabidopsis Genome Initiative (AGI). The rationale for intensive investigation of Arabidopsis is that it is an excellent model for higher plants. In order to maximize use of the knowledge gained about this plant, there is a need for a comprehensive database and information retrieval and analysis system that will provide user-friendly access to Arabidopsis information. This paper describes the initial steps we have taken toward realizing these goals in a project called The Arabidopsis Information Resource (TAIR) (www.arabidopsis.org).
View details for Web of Science ID 000166360300025
View details for PubMedID 11125061
View details for PubMedCentralID PMC29827
Bioinformatic resources, challenges, and opportunities using Arabidopsis as a model organism in a post-genomic era
PLANT PHYSIOLOGY
2000; 124 (4): 1460-1464
Unified display of Arabidopsis thaliana physical maps from AtDB, the A.thaliana database
NUCLEIC ACIDS RESEARCH
1999; 27 (1): 79-84
In the past several years, there has been a tremendous effort to construct physical maps and to sequence the genome of Arabidopsis thaliana. As a result, four of the five chromosomes are completely covered by overlapping clones except at the centromeric and nucleolus organizer regions (NOR). In addition, over 30% of the genome has been sequenced and completion is anticipated by the end of the year 2000. Despite these accomplishments, the physical maps are provided in many formats on laboratories' Web sites. These data are thus difficult to obtain in a coherent manner for researchers. To alleviate this problem, AtDB (Arabidopsis thaliana DataBase, URL: http://genome-www.stanford.edu/Arabidopsis/) has constructed a unified display of the physical maps where all publicly available physical-map data for all chromosomes are presented through the Web in a clickable, 'on-the-fly' graphic, created by CGI programs that directly consult our relational database.
View details for Web of Science ID 000077983000018
View details for PubMedID 9847147
View details for PubMedCentralID PMC148102
Genome maps 9. Arabidopsis thaliana. Wall chart.
1998; 282 (5389): 663-667
View details for PubMedID 9841422
Tetrad pollen formation in quartet mutants of Arabidopsis thaliana is associated with persistence of pectic polysaccharides of the pollen mother cell wall
1998; 15 (1): 79-88
The quartet (qrt) mutants of Arabidopsis thaliana produce tetrad pollen in which microspores fail to separate during pollen development. Because the amount of callose deposition between microspores is correlated with tetrad pollen formation in other species, and because pectin is implicated as playing a role in cell adhesion, these cell-wall components in wild-type and mutant anthers were visualized by immunofluorescence microscopy at different stages of microsporogenesis. In wild-type, callose was detected around the pollen mother cell at the onset of meiosis and around the microspores during the tetrad stage. Microspores were released into the anther locule at the stage where callose was no longer detected. Deposition and degradation of callose during tetrad pollen formation in qrt1 and qrt2 mutants were indistinguishable from those in wild-type. Enzymatic removal of callose from wild-type microspores at the tetrad stage did not release the microspores, suggesting that callose removal is not sufficient to disperse the microspores in wild-type. Pectic components were detected in the primary wall of the pollen mother cell. This wall surrounded the callosic wall around the pollen mother cell and the microspores during the tetrad stage. In wild-type, pectic components of this wall were no longer detectable at the time of microspore release. However, in qrt1 and qrt2 mutants, pectic components of this wall persisted after callose degradation. This result suggests that failure of pectin degradation in the pollen mother cell wall is associated with tetrad pollen formation in qrt mutants, and indicates that QRT1 and QRT2 may be required for cell type-specific pectin degradation to separate microspores.
View details for Web of Science ID 000075109800008
View details for PubMedID 9744097
Flat-surface grafting in Arabidopsis thaliana
PLANT MOLECULAR BIOLOGY REPORTER
1995; 13 (2): 118-123
View details for Web of Science ID A1995RJ50600002
Tetrad analysis possible in Arabidopsis with mutation of the QUARTET (QRT) genes
1994; 264 (5164): 1458-1460
Two Arabidopsis thaliana genes, QRT1 and QRT2, are required for pollen separation during normal development. In qrt mutants, the outer walls of the four meiotic products of the pollen mother cell are fused, and pollen grains are released in tetrads. Pollen is viable and fertile, and the cytoplasmic pollen contents are discrete. Pollination with a single tetrad usually yields four seeds, and genetic analysis confirmed that marker loci segregate in a 2:2 ratio within these tetrads. These mutations allow tetrad analysis to be performed in Arabidopsis and define steps in pollen cell wall development.
View details for Web of Science ID A1994NP22100042
View details for PubMedID 8197459