Idan Gabdank
Senior Biocuration Scientist, Biomedical Data Science
Bio
Dr. Gabdank is a Senior Biocuration Scientist with the Lattice team, where his research interests focus on advancing computational genomics through data-driven portfolio management and strategic decision-making frameworks. His work involves developing data standards and assessment methodologies that optimize research impact and guide investment strategies in genomics initiatives. This includes facilitating cross-functional collaborations between researchers and federal agencies, overseeing computational genomics programs, and leading the development of data pipelines and curation standards. Before joining Lattice, Idan served as Program Director at the National Human Genome Research Institute and Director of Data Science at Stanford University School of Medicine, where he provided strategic oversight for multi-institutional genomics consortia, including the ENCODE and IGVF consortia. Idan received his PhD in Bioinformatics from Ben Gurion University of the Negev, specializing in computational approaches to genomic data analysis and standardization.
Current Role at Stanford
Manage data wrangling and curation for cutting-edge single-cell and CRISPR screen experiments within the Billion Cell Project funded by CZI, serving as a key member of the Lattice team at Stanford and working in close collaboration with CZI and academic labs to ensure standardized data processing and quality control across high-throughput experimental datasets. Integrate AI tools and automate cloud-based pipelines for data validation and curation, streamlining quality assurance and reducing manual oversight while maintaining data integrity standards.
Honors & Awards
-
Research Excellence Prize, Ben Gurion University of the Negev (2010)
-
Human Frontier Science Program Long-Term Cross-Disciplinary Fellowship, Human Frontier Science Program (2011 - 2014)
Work Experience
-
Program Director, National Human Genome Research Institute (August 11, 2024 - March 24, 2025)
Dr. Idan Gabdank joined the National Human Genome Research Institute's (NHGRI) Division of Genome Sciences as a program director in 2024 and served in that capacity until March 2025. He is a strong advocate for open and reproducible science, adhering to the FAIR principles. Dr. Gabdank was part of the NHGRI team overseeing the Human Genome Reference Program (HGRP) and the Computational Genomics and Data Science Program (CGDS).
Location
Bethesda, MD
-
Principal Data Wrangler, Stanford (October 1, 2018 - August 11, 2024)
Leader of a group of scientists responsible for coordination, curation, uniform processing and sharing of the data generated by the IGVF and ENCODE consortia.
Location
Stanford, CA
All Publications
-
Deciphering the impact of genomic variation on function.
Nature
2024; 633 (8028): 47-57
Abstract
Our genomes influence nearly every aspect of human biology, from molecular and cellular functions to phenotypes in health and disease. Studying the differences in DNA sequence between individuals (genomic variation) could reveal previously unknown mechanisms of human biology, uncover the basis of genetic predispositions to diseases, and guide the development of new diagnostic tools and therapeutic agents. Yet, understanding how genomic variation alters genome function to influence phenotype has proved challenging. To unlock these insights, we need a systematic and comprehensive catalogue of genome function and the molecular and cellular effects of genomic variants. Towards this goal, the Impact of Genomic Variation on Function (IGVF) Consortium will combine approaches in single-cell mapping, genomic perturbations and predictive modelling to investigate the relationships among genomic variation, genome function and phenotypes. IGVF will create maps across hundreds of cell types and states describing how coding variants alter protein activity, how noncoding variants change the regulation of gene expression, and how such effects connect through gene-regulatory and protein-interaction networks. These experimental data, computational predictions and accompanying standards and pipelines will be integrated into an open resource that will catalyse community efforts to explore how our genomes influence biology and disease across populations.
View details for DOI 10.1038/s41586-024-07510-0
View details for PubMedID 39232149
View details for PubMedCentralID PMC7405896
-
Multicenter integrated analysis of noncoding CRISPRi screens.
Nature methods
2024
Abstract
The ENCODE Consortium's efforts to annotate noncoding cis-regulatory elements (CREs) have advanced our understanding of gene regulatory landscapes. Pooled, noncoding CRISPR screens offer a systematic approach to investigate cis-regulatory mechanisms. The ENCODE4 Functional Characterization Centers conducted 108 screens in human cell lines, comprising >540,000 perturbations across 24.85 megabases of the genome. Using 332 functionally confirmed CRE-gene links in K562 cells, we established guidelines for screening endogenous noncoding elements with CRISPR interference (CRISPRi), including accurate detection of CREs that exhibit variable, often low, transcriptional effects. Benchmarking five screen analysis tools, we find that CASA produces the most conservative CRE calls and is robust to artifacts of low-specificity single guide RNAs. We uncover a subtle DNA strand bias for CRISPRi in transcribed regions with implications for screen design and analysis. Together, we provide an accessible data resource, predesigned single guide RNAs for targeting 3,275,697 ENCODE SCREEN candidate CREs with CRISPRi and screening guidelines to accelerate functional characterization of the noncoding genome.
View details for DOI 10.1038/s41592-024-02216-7
View details for PubMedID 38504114
View details for PubMedCentralID PMC3771521
-
The ENCODE Uniform Analysis Pipelines.
Research square
2023
Abstract
The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/) is publicly available in GitHub, with images available on Dockerhub (https://hub.docker.com), enabling access to a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud allows small labs the ability to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses.
View details for DOI 10.21203/rs.3.rs-3111932/v1
View details for PubMedID 37503119
View details for PubMedCentralID PMC10371165
-
The ENCODE4 long-read RNA-seq collection reveals distinct classes of transcript structure diversity.
bioRxiv : the preprint server for biology
2023
Abstract
The majority of mammalian genes encode multiple transcript isoforms that result from differential promoter use, changes in exonic splicing, and alternative 3' end choice. Detecting and quantifying transcript isoforms across tissues, cell types, and species has been extremely challenging because transcripts are much longer than the short reads normally used for RNA-seq. By contrast, long-read RNA-seq (LR-RNA-seq) gives the complete structure of most transcripts. We sequenced 264 LR-RNA-seq PacBio libraries totaling over 1 billion circular consensus reads (CCS) for 81 unique human and mouse samples. We detect at least one full-length transcript from 87.7% of annotated human protein coding genes and a total of 200,000 full-length transcripts, 40% of which have novel exon junction chains. To capture and compute on the three sources of transcript structure diversity, we introduce a gene and transcript annotation framework that uses triplets representing the transcript start site, exon junction chain, and transcript end site of each transcript. Using triplets in a simplex representation demonstrates how promoter selection, splice pattern, and 3' processing are deployed across human tissues, with nearly half of multi-transcript protein coding genes showing a clear bias toward one of the three diversity mechanisms. Evaluated across samples, the predominantly expressed transcript changes for 74% of protein coding genes. In evolution, the human and mouse transcriptomes are globally similar in types of transcript structure diversity, yet among individual orthologous gene pairs, more than half (57.8%) show substantial differences in mechanism of diversification in matching tissues. This initial large-scale survey of human and mouse long-read transcriptomes provides a foundation for further analyses of alternative transcript usage, and is complemented by short-read and microRNA data on the same samples and by epigenome data elsewhere in the ENCODE4 collection.
View details for DOI 10.1101/2023.05.15.540865
View details for PubMedID 37292896
View details for PubMedCentralID PMC10245583
-
The ENCODE Uniform Analysis Pipelines.
bioRxiv : the preprint server for biology
2023
Abstract
The Encyclopedia of DNA elements (ENCODE) project is a collaborative effort to create a comprehensive catalog of functional elements in the human genome. The current database comprises more than 19000 functional genomics experiments across more than 1000 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the Homo sapiens and Mus musculus genomes. All experimental data, metadata, and associated computational analyses created by the ENCODE consortium are submitted to the Data Coordination Center (DCC) for validation, tracking, storage, and distribution to community resources and the scientific community. The ENCODE project has engineered and distributed uniform processing pipelines in order to promote data provenance and reproducibility as well as allow interoperability between genomic resources and other consortia. All data files, reference genome versions, software versions, and parameters used by the pipelines are captured and available via the ENCODE Portal. The pipeline code, developed using Docker and Workflow Description Language (WDL; https://openwdl.org/) is publicly available in GitHub, with images available on Dockerhub (https://hub.docker.com), enabling access to a diverse range of biomedical researchers. ENCODE pipelines maintained and used by the DCC can be installed to run on personal computers, local HPC clusters, or in cloud computing environments via Cromwell. Access to the pipelines and data via the cloud allows small labs the ability to use the data or software without access to institutional compute clusters. Standardization of the computational methodologies for analysis and quality control leads to comparable results from different ENCODE collections - a prerequisite for successful integrative analyses.
View details for DOI 10.1101/2023.04.04.535623
View details for PubMedID 37066421
View details for PubMedCentralID PMC10104020
-
The EN-TEx resource of multi-tissue personal epigenomes & variant-impact models.
Cell
2023; 186 (7): 1493-1511.e40
Abstract
Understanding how genetic variants impact molecular phenotypes is a key goal of functional genomics, currently hindered by reliance on a single haploid reference genome. Here, we present the EN-TEx resource of 1,635 open-access datasets from four donors (∼30 tissues × ∼15 assays). The datasets are mapped to matched, diploid genomes with long-read phasing and structural variants, instantiating a catalog of >1 million allele-specific loci. These loci exhibit coordinated activity along haplotypes and are less conserved than corresponding, non-allele-specific ones. Surprisingly, a deep-learning transformer model can predict the allele-specific activity based only on local nucleotide-sequence context, highlighting the importance of transcription-factor-binding motifs particularly sensitive to variants. Furthermore, combining EN-TEx with existing genome annotations reveals strong associations between allele-specific and GWAS loci. It also enables models for transferring known eQTLs to difficult-to-profile tissues (e.g., from skin to heart). Overall, EN-TEx provides rich data and generalizable models for more accurate personal functional genomics.
View details for DOI 10.1016/j.cell.2023.02.018
View details for PubMedID 37001506
-
Author Correction: Perspectives on ENCODE.
Nature
2022
View details for DOI 10.1038/s41586-021-04213-8
View details for PubMedID 35474002
-
Author Correction: Expanded encyclopaedias of DNA elements in the human and mouse genomes.
Nature
2022
View details for DOI 10.1038/s41586-021-04226-3
View details for PubMedID 35474001
-
Perspectives on ENCODE.
Nature
2020; 583 (7818): 693-698
Abstract
The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may serve functional roles in regulating gene expression. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.
View details for DOI 10.1038/s41586-020-2449-8
View details for PubMedID 32728248
-
Expanded encyclopaedias of DNA elements in the human and mouse genomes.
Nature
2020; 583 (7818): 699-710
Abstract
The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE and Roadmap Epigenomics data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.
View details for DOI 10.1038/s41586-020-2493-4
View details for PubMedID 32728249
-
The ENCODE Portal as an Epigenomics Resource.
Current protocols in bioinformatics
2019; 68 (1): e89
Abstract
The Encyclopedia of DNA Elements (ENCODE) web portal hosts genomic data generated by the ENCODE Consortium, Genomics of Gene Regulation, The NIH Roadmap Epigenomics Consortium, and the modENCODE and modERN projects. The goal of the ENCODE project is to build a comprehensive map of the functional elements of the human and mouse genomes. Currently, the portal database stores over 500 TB of raw and processed data from over 15,000 experiments spanning assays that measure gene expression, DNA accessibility, DNA and RNA binding, DNA methylation, and 3D chromatin structure across numerous cell lines, tissue types, and differentiation states with selected genetic and molecular perturbations. The ENCODE portal provides unrestricted access to the aforementioned data and relevant metadata as a service to the scientific community. The metadata model captures the details of the experiments, raw and processed data files, and processing pipelines in human and machine-readable form and enables the user to search for specific data either using a web browser or programmatically via REST API. Furthermore, ENCODE data can be freely visualized or downloaded for additional analyses. © 2019 The Authors. Basic Protocol: Query the portal. Support Protocol 1: Batch downloading. Support Protocol 2: Using the cart to download files. Support Protocol 3: Visualize data. Alternate Protocol: Query building and programmatic access.
View details for DOI 10.1002/cpbi.89
View details for PubMedID 31751002
-
Recompleting the Caenorhabditis elegans genome.
Genome research
2019
Abstract
Caenorhabditis elegans was the first multicellular eukaryotic genome sequenced to apparent completion. Although this assembly employed a standard C. elegans strain (N2), it used sequence data from several laboratories, with DNA propagated in bacteria and yeast. Thus, the N2 assembly has many differences from any C. elegans available today. To provide a more accurate C. elegans genome, we performed long-read assembly of VC2010, a modern strain derived from N2. Our VC2010 assembly has 99.98% identity to N2 but with an additional 1.8 Mb including tandem repeat expansions and genome duplications. For 116 structural discrepancies between N2 and VC2010, 97 structures matching VC2010 (84%) were also found in two outgroup strains, implying deficiencies in N2. Over 98% of N2 genes encoded unchanged products in VC2010; moreover, we predicted ≥53 new genes in VC2010. The recompleted genome of C. elegans should be a valuable resource for genetics, genomics, and systems biology.
View details for DOI 10.1101/gr.244830.118
View details for PubMedID 31123080
-
New developments on the Encyclopedia of DNA Elements (ENCODE) data portal.
Nucleic acids research
2019
Abstract
The Encyclopedia of DNA Elements (ENCODE) is an ongoing collaborative research project aimed at identifying all the functional elements in the human and mouse genomes. Data generated by the ENCODE consortium are freely accessible at the ENCODE portal (https://www.encodeproject.org/), which is developed and maintained by the ENCODE Data Coordinating Center (DCC). Since the initial portal release in 2013, the ENCODE DCC has updated the portal to make ENCODE data more findable, accessible, interoperable and reusable. Here, we report on recent updates, including new ENCODE data and assays, ENCODE uniform data processing pipelines, new visualization tools, a dataset cart feature, unrestricted public access to ENCODE data on the cloud (Amazon Web Services open data registry, https://registry.opendata.aws/encode-project/) and more comprehensive tutorials and documentation.
View details for DOI 10.1093/nar/gkz1062
View details for PubMedID 31713622
-
Prevention of data duplication for high throughput sequencing repositories
DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION
2018
-
Intricate and Cell Type-Specific Populations of Endogenous Circular DNA (eccDNA) in Caenorhabditis elegans and Homo sapiens.
G3 (Bethesda, Md.)
2017; 7 (10): 3295-3303
Abstract
Investigations aimed at defining the 3D configuration of eukaryotic chromosomes have consistently encountered an endogenous population of chromosome-derived circular genomic DNA, referred to as extrachromosomal circular DNA (eccDNA). While the production, distribution, and activities of eccDNAs remain understudied, eccDNA formation from specific regions of the linear genome has profound consequences on the regulatory and coding capabilities for these regions. Here, we define eccDNA distributions in Caenorhabditis elegans and in three human cell types, utilizing a set of DNA topology-dependent approaches for enrichment and characterization. The use of parallel biophysical, enzymatic, and informatic approaches provides a comprehensive profiling of eccDNA robust to isolation and analysis methodology. Results in human and nematode systems provide quantitative analysis of the eccDNA loci at both unique and repetitive regions. Our studies converge on and support a consistent picture, in which endogenous genomic DNA circles are present in normal physiological states, and in which the circles come from both coding and noncoding genomic regions. Prominent among the coding regions generating DNA circles are several genes known to produce a diversity of protein isoforms, with mucin proteins and titin as specific examples.
View details for DOI 10.1534/g3.117.300141
View details for PubMedID 28801508
View details for PubMedCentralID PMC5633380
-
SnoVault and encodeD: A novel object-based storage system and applications to ENCODE metadata
PLOS ONE
2017; 12 (4)
Abstract
The Encyclopedia of DNA elements (ENCODE) project is an ongoing collaborative effort to create a comprehensive catalog of functional elements initiated shortly after the completion of the Human Genome Project. The current database exceeds 6500 experiments across more than 450 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure, regulatory and transcriptional landscape of the H. sapiens and M. musculus genomes. All ENCODE experimental data, metadata, and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, unified processing, and distribution to community resources and the scientific community. As the volume of data increases, the identification and organization of experimental details becomes increasingly intricate and demands careful curation. The ENCODE DCC has created a general purpose software system, known as SnoVault, that supports metadata and file submission, a database used for metadata storage, web pages for displaying the metadata and a robust API for querying the metadata. The software is fully open-source, code and installation instructions can be found at: http://github.com/ENCODE-DCC/snovault/ (for the generic database) and http://github.com/ENCODE-DCC/encoded/ to store genomic data in the manner of ENCODE. The core database engine, SnoVault (which is completely independent of ENCODE, genomic data, or bioinformatic data) has been released as a separate Python package.
View details for DOI 10.1371/journal.pone.0175310
View details for Web of Science ID 000399955200049
View details for PubMedID 28403240
-
The Encyclopedia of DNA elements (ENCODE): data portal update.
Nucleic acids research
2017
Abstract
The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center has developed the ENCODE Portal database and website as the source for the data and metadata generated by the ENCODE Consortium. Two principles have motivated the design. First, experimental protocols, analytical procedures and the data themselves should be made publicly accessible through a coherent, web-based search and download interface. Second, the same interface should serve carefully curated metadata that record the provenance of the data and justify its interpretation in biological terms. Since its initial release in 2013 and in response to recommendations from consortium members and the wider community of scientists who use the Portal to access ENCODE data, the Portal has been regularly updated to better reflect these design principles. Here we report on these updates, including results from new experiments, uniformly-processed data from other projects, new visualization tools and more comprehensive metadata to describe experiments and analyses. Additionally, the Portal is now home to meta(data) from related projects including Genomics of Gene Regulation, Roadmap Epigenome Project, Model organism ENCODE (modENCODE) and modERN. The Portal now makes available over 13000 datasets and their accompanying metadata and can be accessed at: https://www.encodeproject.org/.
View details for PubMedID 29126249
-
A streamlined tethered chromosome conformation capture protocol
BMC GENOMICS
2016; 17
Abstract
Identification of locus-locus contacts at the chromatin level provides a valuable foundation for understanding of nuclear architecture and function and a valuable tool for inferring long-range linkage relationships. As one approach to this, chromatin conformation capture-based techniques allow creation of genome spatial organization maps. While such approaches have been available for some time, methodological advances will be of considerable use in minimizing both time and input material required for successful application. Here we report a modified tethered conformation capture protocol that utilizes a series of rapid and efficient molecular manipulations. We applied the method to Caenorhabditis elegans, obtaining chromatin interaction maps that provide a sequence-anchored delineation of salient aspects of Caenorhabditis elegans chromosome structure, demonstrating a high level of consistency in overall chromosome organization between biological samples collected under different conditions. In addition to the application of the method to defining nuclear architecture, we found the resulting chromatin interaction maps to be of sufficient resolution and sensitivity to enable detection of large-scale structural variants such as inversions or translocations. Our streamlined protocol provides an accelerated, robust, and broadly applicable means of generating chromatin spatial organization maps and detecting genome rearrangements without a need for cellular or chromatin fractionation.
View details for DOI 10.1186/s12864-016-2596-3
View details for Web of Science ID 000373560100001
View details for PubMedID 27036078
View details for PubMedCentralID PMC4818521
-
ENCODE data at the ENCODE portal.
Nucleic acids research
2016; 44 (D1): D726-32
Abstract
The Encyclopedia of DNA Elements (ENCODE) Project is in its third phase of creating a comprehensive catalog of functional elements in the human genome. This phase of the project includes an expansion of assays that measure diverse RNA populations, identify proteins that interact with RNA and DNA, probe regions of DNA hypersensitivity, and measure levels of DNA methylation in a wide range of cell and tissue types to identify putative regulatory elements. To date, results for almost 5000 experiments have been released for use by the scientific community. These data are available for searching, visualization and download at the new ENCODE Portal (www.encodeproject.org). The revamped ENCODE Portal provides new ways to browse and search the ENCODE data based on the metadata that describe the assays as well as summaries of the assays that focus on data provenance. In addition, it is a flexible platform that allows integration of genomic data from multiple projects. The portal experience was designed to improve access to ENCODE data by relying on metadata that allow reusability and reproducibility of the experiments.
View details for DOI 10.1093/nar/gkv1160
View details for PubMedID 26527727
View details for PubMedCentralID PMC4702836
-
Principles of metadata organization at the ENCODE data coordination center.
Database : the journal of biological databases and curation
2016; 2016
Abstract
The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center (DCC) is responsible for organizing, describing and providing access to the diverse data generated by the ENCODE project. The description of these data, known as metadata, includes the biological sample used as input, the protocols and assays performed on these samples, the data files generated from the results and the computational methods used to analyze the data. Here, we outline the principles and philosophy used to define the ENCODE metadata in order to create a metadata standard that can be applied to diverse assays and multiple genomic projects. In addition, we present how the data are validated and used by the ENCODE DCC in creating the ENCODE Portal (https://www.encodeproject.org/). Database URL: www.encodeproject.org.
View details for DOI 10.1093/database/baw001
View details for PubMedID 26980513
View details for PubMedCentralID PMC4792520
-
Gamete-Type Dependent Crossover Interference Levels in a Defined Region of Caenorhabditis elegans Chromosome V.
G3 (Bethesda, Md.)
2014; 4 (1): 117-120
Abstract
In certain organisms, numbers of crossover events for any single chromosome are limited ("crossover interference") so that double crossover events are obtained at much lower frequencies than would be expected from the simple product of independent single-crossover events. We present a number of observations during which we examined interference over a large region of Caenorhabditis elegans chromosome V. Examining this region for multiple crossover events in heteroallelic configurations with limited dimorphism, we observed high levels of crossover interference in oocytes with only partial interference in spermatocytes.
View details for DOI 10.1534/g3.113.008672
View details for PubMedID 24240780
View details for PubMedCentralID PMC3887527
-
On topological indices for small RNA graphs
COMPUTATIONAL BIOLOGY AND CHEMISTRY
2012; 41: 35-40
Abstract
The secondary structure of RNAs can be represented by graphs at various resolutions. While it was shown that RNA secondary structures can be represented by coarse grain tree-graphs and meaningful topological indices can be used to distinguish between various structures, small RNAs are needed to be represented by full graphs. No meaningful topological index has yet been suggested for the analysis of such type of RNA graphs. Recalling that the second eigenvalue of the Laplacian matrix can be used to track topological changes in the case of coarse grain tree-graphs, it is plausible to assume that a topological index such as the Wiener index that represents all Laplacian eigenvalues may provide a similar guide for full graphs. However, by its original definition, the Wiener index was defined for acyclic graphs. Nevertheless, similarly to cyclic chemical graphs, small RNA graphs can be analyzed using elementary cuts, which enables the calculation of topological indices for small RNAs in an intuitive way. We show how to calculate a structural descriptor that is suitable for cyclic graphs, the Szeged index, for small RNA graphs by elementary cuts. We discuss potential uses of such a procedure that considers all eigenvalues of the associated Laplacian matrices to quantify the topology of small RNA graphs.
View details for DOI 10.1016/j.compbiolchem.2012.10.004
View details for Web of Science ID 000313772100004
View details for PubMedID 23147564
-
The RNAmute web server for the mutational analysis of RNA secondary structures
NUCLEIC ACIDS RESEARCH
2011; 39: W92-W99
Abstract
RNA mutational analysis at the secondary-structure level can be useful to a wide-range of biological applications. It can be used to predict an optimal site for performing a nucleotide mutation at the single molecular level, as well as to analyze basic phenomena at the systems level. For the former, as more sequence modification experiments are performed that include site-directed mutagenesis to find and explore functional motifs in RNAs, a pre-processing step that helps guide in planning the experiment becomes vital. For the latter, mutations are generally accepted as a central mechanism by which evolution occurs, and mutational analysis relating to structure should gain a better understanding of system functionality and evolution. In the past several years, the program RNAmute that is structure based and relies on RNA secondary-structure prediction has been developed for assisting in RNA mutational analysis. It has been extended from single-point mutations to treat multiple-point mutations efficiently by initially calculating all suboptimal solutions, after which only the mutations that stabilize the suboptimal solutions and destabilize the optimal one are considered as candidates for being deleterious. The RNAmute web server for mutational analysis is available at http://www.cs.bgu.ac.il/~xrnamute/XRNAmute.
View details for DOI 10.1093/nar/gkr207
View details for Web of Science ID 000292325300016
View details for PubMedID 21478166
-
Single-base Resolution Nucleosome Mapping on DNA Sequences
JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS
2010; 28 (1): 107-121
Abstract
Nucleosome DNA bendability pattern extracted from large nucleosome DNA database of C. elegans is used for construction of full length (116 dinucleotide positions) nucleosome DNA bendability matrix. The matrix can be used for sequence-directed mapping of the nucleosomes on the sequences. Several alternative positions for a given nucleosome are typically predicted, separated by multiples of nucleosome DNA period. The corresponding computer program is successfully tested on best known experimental examples of accurately positioned nucleosomes. The uncertainty of the computational mapping is +/-1 base. The procedure is placed on publicly accessible server and can be applied to any DNA sequence of interest.
View details for Web of Science ID 000278516000009
View details for PubMedID 20476799
-
FineStr: a web server for single-base-resolution nucleosome positioning
BIOINFORMATICS
2010; 26 (6): 845-846
Abstract
The DNA in eukaryotic cells is packed into the chromatin that is composed of nucleosomes. Positioning of the nucleosome core particles on the sequence is a problem of great interest because of the role nucleosomes play in different cellular processes including gene regulation. Using the sequence structure of 10.4 base DNA repeat presented in our previous works and nucleosome core DNA sequences database, we have derived the complete nucleosome DNA bendability matrix of Caenorhabditis elegans. We have developed a web server named FineStr that allows users to upload genomic sequences in FASTA format and to perform a single-base-resolution nucleosome mapping on them. FineStr server is freely available for use on the web at http://www.cs.bgu.ac.il/~nucleom. The site contains a help file with explanation regarding the exact usage. Contact: gabdank@cs.bgu.ac.il.
View details for DOI 10.1093/bioinformatics/btq030
View details for Web of Science ID 000275243500021
View details for PubMedID 20106816
-
Preferential translation of Hsp83 in Leishmania requires a thermosensitive polypyrimidine-rich element in the 3' UTR and involves scanning of the 5' UTR
RNA-A PUBLICATION OF THE RNA SOCIETY
2010; 16 (2): 364-374
Abstract
Heat shock proteins (HSPs) provide a useful system for studying developmental patterns in the digenetic Leishmania parasites, since their expression is induced in the mammalian life form. Translation regulation plays a key role in control of protein coding genes in trypanosomatids, and is directed exclusively by elements in the 3' untranslated region (UTR). Using sequential deletions of the Leishmania Hsp83 3' UTR (888 nucleotides [nt]), we mapped a region of 150 nt that was required, but not sufficient for preferential translation of a reporter gene at mammalian-like temperatures, suggesting that changes in RNA structure could be involved. An advanced bioinformatics package for prediction of RNA folding (UNAfold) marked the regulatory region on a highly probable structural arm that includes a polypyrimidine tract (PPT). Mutagenesis of this PPT abrogated completely preferential translation of the fused reporter gene. Furthermore, temperature elevation caused the regulatory region to melt more extensively than the same region that lacked the PPT. We propose that at elevated temperatures the regulatory element in the 3' UTR is more accessible to mediators that promote its interaction with the basal translation components at the 5' end during mRNA circularization. Translation initiation of Hsp83 at all temperatures appears to proceed via scanning of the 5' UTR, since a hairpin structure abolishes expression of a fused reporter gene.
View details for DOI 10.1261/rna.1874710
View details for Web of Science ID 000273868900013
View details for PubMedID 20040590
-
Nucleosome DNA Bendability Matrix (C. elegans)
JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS
2009; 26 (4): 403-411
Abstract
An original signal extraction procedure is applied to database of 146 base nucleosome core DNA sequences from C. elegans (S. M. Johnson et al. Genome Research 16, 1505-1516, 2006). The positional preferences of various dinucleotides within the 10.4 base nucleosome DNA repeat are calculated, resulting in derivation of the nucleosome DNA bendability matrix of 16x10 elements. A simplified one-line presentation of the matrix ("consensus" repeat) is ...A(TTTCCGGAAA)T.... All 6 chromosomes of C. elegans conform to the bendability pattern. The strongest affinity to their respective positions is displayed by dinucleotides AT and CG, separated within the repeat by 5 bases. The derived pattern makes a basis for sequence-directed mapping of nucleosome positions in the genome of C. elegans. As the first complete matrix of bendability available the pattern may serve for iterative calculations of the species-specific matrices of bendability applicable to other genomic sequences.
View details for Web of Science ID 000262917000001
View details for PubMedID 19108579
-
Computational identification of three-way junctions in folded RNAs: a case study in Arabidopsis.
In silico biology
2008; 8 (2): 105-120
Abstract
Three-way junctions in folded RNAs have been investigated both experimentally and computationally. The interest in their analysis stems from the fact that they have significantly been found to possess a functional role. In recent work, three-way junctions have been categorized into families depending on the relative lengths of the segments linking the three helices. Here, based on ideas originating from computational geometry, an algorithm is proposed for detecting three-way junctions in data sets of genes that are related to a metabolic pathway of interest. In its current implementation, the algorithm relies on a moving window that performs energy minimization folding predictions, and is demonstrated on a set of genes that are involved in purine metabolism in plants. The pattern matching algorithm can be extended to other organisms and other metabolic cycles of interest in which three-way junctions have been or will be discovered to play an important role. In the test case presented here, the computational prediction of a three-way junction in Arabidopsis that was speculated to have an interesting functional role is verified experimentally.
View details for PubMedID 18928199
-
In silico design of small RNA switches
IEEE TRANSACTIONS ON NANOBIOSCIENCE
2007; 6 (1): 4-11
Abstract
The discovery of natural RNA sensors that respond to a change in the environment by a conformational switch can be utilized for various biotechnological and nanobiotechnological advances. One class of RNA sensors is the riboswitch: an RNA genetic control element that is capable of sensing small molecules, responding to a deviation in ligand concentration with a structural change. Riboswitches are modularly built from smaller components. Computational methods can potentially be utilized in assembling these building block components and offering improvements in the biochemical design process. We describe a computational procedure to design RNA switches from building blocks with favorable properties. To achieve maximal throughput for genetic control purposes, future designer RNA switches can be assembled based on a computerized preprocessing buildup of the constituent domains, namely the aptamer and the expression platform in the case of a synthetic riboswitch. Conformational switching is enabled by the RNA versatility to possess two highly stable states that are energetically close to each other but topologically distinct, separated by an energy barrier between them. Initially, computer simulations can produce a list of short sequences that switch between two conformers when triggered by point mutations or temperature. The short sequences should possess an additional desirable property; when these selected small RNA switch segments are attached to various aptamers, the ligand binding mechanism should replace the aforementioned event triggers, which will no longer be effective for crossing the energy barrier. In the assembled RNA sequence, energy minimization folding predictions should then show no difference between the folded structure of the entire sequence relative to the folded structure of each of its constituents. Moreover, energy minimization methods applied on the entire sequence could aid at this preprocessing stage by exhibiting high mutational robustness to capture the stability of the formed hairpin in the expression platform. The above computer-assisted assembly procedure together with application specific considerations may further be tailored for therapeutic gene regulation. Index Terms: design of RNA switches, energy minimization methods, RNA folding predictions.
View details for DOI 10.1109/TNB.2007.891894
View details for Web of Science ID 000244944600002
View details for PubMedID 17393844
-
Primordia vita. Deconvolution from modern sequences.
Annual Meeting of the Deutsche-Gesellschaft-fur-Zuchtungskunde e V
SPRINGER. 2006: 559–65
Abstract
Evolution of the triplet code is reconstructed on the basis of consensus temporal order of appearance of amino acids. Several important predictions are confirmed by computational sequence analyses. The earliest amino acids, alanine and glycine, have been encoded by GCC and GGC codons, as today. They were succeeded, respectively, by A- and G-series of amino acids, encoded by pyrimidine-central and purine-central codons. The length of the earliest proteins is estimated to be 6-7 residues. The earliest mRNAs were short G+C-rich molecules. These short sequences could have formed hairpins. This is confirmed by analysis of modern prokaryotic mRNA sequences. Predominant size of detected ancient hairpins also corresponds to 6-7 amino acids, as above. Vestiges of last common ancestor can be found in extant proteins in form of entirely conserved short sequences of size six to nine residues present in all or almost all sequenced prokaryotic proteomes (omnipresent motifs). The functions of the topmost conserved octamers are not involved in the basic elementary syntheses. This suggests an initial abiotic supply of amino acids, bases and sugars.
View details for DOI 10.1007/s11084-006-9042-5
View details for Web of Science ID 000243623600019
View details for PubMedID 17120122
-
Tracing ancient mRNA hairpins
JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS
2006; 24 (2): 163-169
Abstract
From recent developments of the early evolution theory it follows that the earliest mRNAs were short (~20 nt) (G+C)-rich polynucleotides. These short sequences could form hairpins, which would be of high evolutionary advantage because of stability and uniqueness of their conformations. Due to mutations accumulated during billions of years of evolution, the speculated earliest hairpins would largely lose the initial complementarities. Some of the original complementary base-to-base contacts, however, may have survived. Computational analysis of modern prokaryotic mRNA sequences reveals excess population of the expected short range complementarities. The derived earliest mRNA hairpin size fully corresponds to the predicted size of ancient coding duplexes. The repertoire of the surviving hairpins traced in modern mRNA confirms duplex structure of the earliest mRNA, suggested by the early molecular evolution theory.
View details for Web of Science ID 000241066100007
View details for PubMedID 16928139
https://orcid.org/0000-0001-5025-5886