Jason Hilton
Senior Research Engineer, Biomedical Data Science
Current Role at Stanford
PI & Director, Lattice
All Publications
-
Data navigation on the ENCODE portal.
Nature communications
2025; 16 (1): 9592
Abstract
Spanning two decades, the collaborative ENCODE project aims to identify all the functional elements within the human and mouse genomes. To best serve the scientific community, the comprehensive ENCODE data, including results from 23,000+ functional genomics experiments, 800+ functional element characterization experiments and 60,000+ results from integrative computational analyses, are available on an open-access data portal ( https://www.encodeproject.org/ ). The final phase of the project includes data from several novel assays aimed at the characterization and validation of genomic elements. In addition to developing and maintaining the data portal, the Data Coordination Center (DCC) implemented and utilised uniform processing pipelines to generate uniformly processed data. Here we report recent updates to the data portal, including a redesigned home page, an improved search interface, new custom-designed pages highlighting biologically related datasets and an enhanced cart interface for data visualisation plus user-friendly data download options. A summary of data generated using uniform processing pipelines is also provided.
DOI: 10.1038/s41467-025-64343-9
PMID: 41168159
PMCID: 5389787
-
Mondo: Integrating Disease Terminology Across Communities.
Genetics
2025
Abstract
Precision medicine aims to enhance diagnosis, treatment, and prognosis by integrating multimodal data at the point of care. However, challenges arise due to the vast number of diseases, differing methods of classification, and conflicting terminological coding systems and practices used to represent molecular definitions of disease. This lack of interoperability artificially constrains the potential for diagnosis, clinical decision support, care outcome analysis, as well as data linkage across research domains to support the development or repurposing of therapeutics. There is a clear and pressing need for a unified system for managing disease entities, including identifiers, synonyms, and definitions. To address these issues, we created the Mondo disease ontology: a community-driven, open-source, unified disease classification system that harmonizes diverse terminologies into a consistent, computable framework. Mondo integrates key medical and biomedical terminologies, including Online Mendelian Inheritance in Man (OMIM), Orphanet, Medical Subject Headings (MeSH), National Cancer Institute Thesaurus (NCIt), and more, to provide a comprehensive and accurate representation of disease concepts with fully provenanced and attributed links back to the sources. Mondo can be used as the handle for curation of gene-disease associations utilized in diagnostic applications, in research applications such as computational phenotyping, and in clinical coding systems for clinical decision support, by pointing the clinician to the numerous knowledge resources linked to the Mondo identifier. Mondo's community-centric approach, stewarded by the Monarch Initiative's expertise in ontologies, ensures that the ontology remains adaptable to the evolving needs of biomedical research and clinical communities, as well as the knowledge providers.
DOI: 10.1093/genetics/iyaf215
PMID: 41052288
-
Defining breast epithelial cell types in the single-cell era.
Developmental cell
2025; 60 (17): 2218-2236
Abstract
Single-cell studies on breast tissue have contributed to a change in our understanding of breast epithelial diversity that has, in turn, precipitated a lack of consensus on breast cell types. The confusion surrounding this issue highlights a possible challenge for advancing breast atlas efforts. In this perspective, we present our consensus on the identities, properties, and naming conventions for breast epithelial cell types and propose goals for future atlas endeavors. Our proposals and their underlying thought processes aim to catalyze the adoption of a shared model for this tissue and to serve as guidance for other investigators facing similar challenges.
DOI: 10.1016/j.devcel.2025.06.032
PMID: 40925326
-
Gene Spatial Integration: enhancing spatial transcriptomics analysis via deep learning and batch effect mitigation.
Bioinformatics (Oxford, England)
2025
Abstract
Spatial transcriptomics (ST) is a groundbreaking technique for studying the correlation between cellular organization within a tissue and its physiological and pathological properties. Every facet of spatial information, including cell/spot proximity, distribution, and dimensionality, is significant. Most methods lean heavily on proximity for ST analysis; each yields useful insights but leaves other aspects untapped. In addition, samples procured at different times, from different donors, and by different technologies introduce batch effects that hinder the statistical approaches employed by most analysis tools. Addressing these challenges, we have developed a deep learning method for analyzing multiple integrated ST datasets, focusing on the distribution aspect. Furthermore, our method aims to leverage single-cell analysis tools. Our study introduces Gene Spatial Integration (GSI), a data integration pipeline that uses a representation-learning approach to extract the spatial distribution of genes into the same feature space as gene expression features. We employ an autoencoder network to extract spatial embeddings, facilitating the projection of spatial features into the gene expression feature space. Our approach allows for seamless integration of multiple samples with minimal detriment, increasing the performance of ST data analysis tools. We show an application of our method on the human DLPFC dataset. Our method consistently improves the clustering performance of Seurat, with the most significant increase observed in sample 151673, almost doubling the ARI score from 0.225 to 0.405. We also combine our pipeline with the clustering of GraphST, raising the ARI score for sample 151672 from 0.614 to 0.795. This result reveals the potential of the spatial-distribution aspect of genes and emphasizes the impact of integration and batch effect removal in developing a refined analysis for understanding tissue characteristics. An implementation of GSI is accessible at https://github.com/Riandanis/Spatial_Integration_GSI. Supplementary data are available at Bioinformatics online.
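The representation-learning step the abstract describes, compressing spatial features into a shared embedding space with an autoencoder, can be sketched with a minimal linear autoencoder trained by gradient descent. This is a generic illustration of the technique under assumed dimensions and random stand-in data, not the GSI implementation; all names (`W_enc`, `W_dec`, etc.) are illustrative.

```python
import numpy as np

# Minimal linear autoencoder: compress per-spot spatial feature vectors
# into a low-dimensional embedding and reconstruct them. Generic sketch
# of the representation-learning idea, not the actual GSI code.
rng = np.random.default_rng(0)

n, d, k = 200, 16, 4          # samples, input dim, embedding dim
X = rng.normal(size=(n, d))   # stand-in for spatial feature vectors

W_enc = rng.normal(scale=0.1, size=(d, k))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(k, d))  # decoder weights
lr = 0.05

losses = []
for _ in range(300):
    Z = X @ W_enc                          # embedding
    X_hat = Z @ W_dec                      # reconstruction
    diff = X_hat - X
    losses.append(float(np.mean(diff ** 2)))
    G = 2.0 * diff / (n * d)               # dLoss/dX_hat for mean-squared error
    grad_dec = Z.T @ G                     # gradient w.r.t. decoder weights
    grad_enc = X.T @ (G @ W_dec.T)         # gradient w.r.t. encoder weights
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
```

After training, `Z` plays the role of the learned spatial embedding that could be concatenated with, or projected into, the gene expression feature space; a practical pipeline would use a deeper nonlinear network and real spatial statistics as input.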
DOI: 10.1093/bioinformatics/btaf350
PMID: 40511994
-
CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data.
Nucleic acids research
2024
Abstract
Hundreds of millions of single cells have been analyzed using high-throughput transcriptomic methods. The cumulative knowledge within these datasets provides an exciting opportunity for unlocking insights into health and disease at the level of single cells. Meta-analyses that span diverse datasets, building on recent advances in large language models and other machine-learning approaches, pose exciting new directions for modeling and extracting insight from single-cell data. Despite the promise of these and emerging analytical tools for analyzing large amounts of data, the sheer number of datasets, data models and accessibility remains a challenge. Here, we present CZ CELLxGENE Discover (cellxgene.cziscience.com), a data platform that provides curated and interoperable single-cell data. Available via a free-to-use online data portal, CZ CELLxGENE hosts a growing corpus of community-contributed data of over 93 million unique cells. Curated, standardized and associated with consistent cell-level metadata, this collection of single-cell transcriptomic data is the largest of its kind and is growing rapidly via community contributions. A suite of tools and features enables accessibility and reusability of the data via both computational and visual interfaces, allowing researchers to explore individual datasets, perform cross-corpus analysis, and run meta-analyses of tens of millions of cells across studies and tissues at the resolution of single cells.
DOI: 10.1093/nar/gkae1142
PMID: 39607691
-
MAMS: matrix and analysis metadata standards to facilitate harmonization and reproducibility of single-cell data.
Genome biology
2024; 25 (1): 205
Abstract
Many datasets are being produced by consortia that seek to characterize healthy and disease tissues at single-cell resolution. While biospecimen and experimental information is often captured, detailed metadata standards related to data matrices and analysis workflows are currently lacking. To address this, we develop the matrix and analysis metadata standards (MAMS) to serve as a resource for data centers, repositories, and tool developers. We define metadata fields for matrices and parameters commonly utilized in analytical workflows and developed the rmams package to extract MAMS from single-cell objects. Overall, MAMS promotes the harmonization, integration, and reproducibility of single-cell data across platforms.
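The kind of matrix-level metadata record the abstract describes can be sketched as a simple validated dictionary. The field names below are illustrative assumptions, not the actual MAMS schema (which is defined by the paper and the rmams package), and the record values are hypothetical.

```python
# Hypothetical sketch of a MAMS-style matrix metadata record plus a
# simple validator. Field names are assumptions for illustration, not
# the actual MAMS specification.
REQUIRED_FIELDS = {"matrix_id", "data_type", "processing_status",
                   "tool", "tool_version"}

def missing_fields(record: dict) -> list:
    """Return a sorted list of required fields absent from the record."""
    return sorted(REQUIRED_FIELDS - record.keys())

record = {
    "matrix_id": "raw_counts",          # which matrix in the object
    "data_type": "integer counts",      # what the values represent
    "processing_status": "raw",         # raw vs. normalized vs. scaled
    "tool": "CellRanger",               # software that produced it
    "tool_version": "7.1.0",            # version, for reproducibility
}

complete = missing_fields(record)                   # -> []
incomplete = missing_fields({"matrix_id": "norm"})  # names the gaps
```

Capturing fields like these alongside each matrix is what lets downstream tools and repositories check, programmatically, whether an analysis workflow can be reproduced.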
DOI: 10.1186/s13059-024-03349-w
PMID: 39090672
PMCID: 7376497
-
Perspectives on ENCODE.
Nature
2020; 583 (7818): 693–98
Abstract
The Encyclopedia of DNA Elements (ENCODE) Project launched in 2003 with the long-term goal of developing a comprehensive map of functional elements in the human genome. These included genes, biochemical regions associated with gene regulation (for example, transcription factor binding sites, open chromatin, and histone marks) and transcript isoforms. The marks serve as sites for candidate cis-regulatory elements (cCREs) that may play functional roles in regulating gene expression1. The project has been extended to model organisms, particularly the mouse. In the third phase of ENCODE, nearly a million and more than 300,000 cCRE annotations have been generated for human and mouse, respectively, and these have provided a valuable resource for the scientific community.
DOI: 10.1038/s41586-020-2449-8
PMID: 32728248
-
Expanded encyclopaedias of DNA elements in the human and mouse genomes.
Nature
2020; 583 (7818): 699–710
Abstract
The human and mouse genomes contain instructions that specify RNAs and proteins and govern the timing, magnitude, and cellular context of their production. To better delineate these elements, phase III of the Encyclopedia of DNA Elements (ENCODE) Project has expanded analysis of the cell and tissue repertoires of RNA transcription, chromatin structure and modification, DNA methylation, chromatin looping, and occupancy by transcription factors and RNA-binding proteins. Here we summarize these efforts, which have produced 5,992 new experimental datasets, including systematic determinations across mouse fetal development. All data are available through the ENCODE data portal (https://www.encodeproject.org), including phase II ENCODE1 and Roadmap Epigenomics2 data. We have developed a registry of 926,535 human and 339,815 mouse candidate cis-regulatory elements, covering 7.9 and 3.4% of their respective genomes, by integrating selected datatypes associated with gene regulation, and constructed a web-based server (SCREEN; http://screen.encodeproject.org) to provide flexible, user-defined access to this resource. Collectively, the ENCODE data and registry provide an expansive resource for the scientific community to build a better understanding of the organization and function of the human and mouse genomes.
DOI: 10.1038/s41586-020-2493-4
PMID: 32728249
-
The ENCODE Portal as an Epigenomics Resource.
Current protocols in bioinformatics
2019; 68 (1): e89
Abstract
The Encyclopedia of DNA Elements (ENCODE) web portal hosts genomic data generated by the ENCODE Consortium, Genomics of Gene Regulation, the NIH Roadmap Epigenomics Consortium, and the modENCODE and modERN projects. The goal of the ENCODE project is to build a comprehensive map of the functional elements of the human and mouse genomes. Currently, the portal database stores over 500 TB of raw and processed data from over 15,000 experiments spanning assays that measure gene expression, DNA accessibility, DNA and RNA binding, DNA methylation, and 3D chromatin structure across numerous cell lines, tissue types, and differentiation states with selected genetic and molecular perturbations. The ENCODE portal provides unrestricted access to the aforementioned data and relevant metadata as a service to the scientific community. The metadata model captures the details of the experiments, raw and processed data files, and processing pipelines in human- and machine-readable form, and enables the user to search for specific data either using a web browser or programmatically via a REST API. Furthermore, ENCODE data can be freely visualized or downloaded for additional analyses.
Basic Protocol: Query the portal
Support Protocol 1: Batch downloading
Support Protocol 2: Using the cart to download files
Support Protocol 3: Visualize data
Alternate Protocol: Query building and programmatic access
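The programmatic access route mentioned above can be sketched by building a query URL for the portal's search endpoint. The `/search/` path and `format=json` parameter follow the portal's documented REST interface, but the specific filter values here are illustrative, and the request itself is not performed in this sketch.

```python
from urllib.parse import urlencode

# Build a query URL for the ENCODE portal's REST search endpoint.
# The /search/ path and format=json parameter follow the portal's
# programmatic interface; the filter values below are illustrative.
BASE = "https://www.encodeproject.org/search/"

def encode_search_url(**filters) -> str:
    """Assemble a search URL requesting machine-readable JSON results."""
    params = {"format": "json", **filters}
    return BASE + "?" + urlencode(params)

url = encode_search_url(type="Experiment",
                        assay_title="TF ChIP-seq",
                        status="released")
```

The resulting URL can be fetched with any HTTP client that sends an `Accept: application/json` header; the JSON response carries the same metadata shown in the browser-based search interface.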
DOI: 10.1002/cpbi.89
PMID: 31751002
-
New developments on the Encyclopedia of DNA Elements (ENCODE) data portal.
Nucleic acids research
2019
Abstract
The Encyclopedia of DNA Elements (ENCODE) is an ongoing collaborative research project aimed at identifying all the functional elements in the human and mouse genomes. Data generated by the ENCODE consortium are freely accessible at the ENCODE portal (https://www.encodeproject.org/), which is developed and maintained by the ENCODE Data Coordinating Center (DCC). Since the initial portal release in 2013, the ENCODE DCC has updated the portal to make ENCODE data more findable, accessible, interoperable and reusable. Here, we report on recent updates, including new ENCODE data and assays, ENCODE uniform data processing pipelines, new visualization tools, a dataset cart feature, unrestricted public access to ENCODE data on the cloud (Amazon Web Services open data registry, https://registry.opendata.aws/encode-project/) and more comprehensive tutorials and documentation.
DOI: 10.1093/nar/gkz1062
PMID: 31713622
-
Prevention of data duplication for high throughput sequencing repositories.
Database: the journal of biological databases and curation
2018
-
SnoVault and encodeD: A novel object-based storage system and applications to ENCODE metadata
PLOS ONE
2017; 12 (4)
Abstract
The Encyclopedia of DNA elements (ENCODE) project is an ongoing collaborative effort to create a comprehensive catalog of functional elements, initiated shortly after the completion of the Human Genome Project. The current database exceeds 6500 experiments across more than 450 cell lines and tissues using a wide array of experimental techniques to study the chromatin structure and the regulatory and transcriptional landscape of the H. sapiens and M. musculus genomes. All ENCODE experimental data, metadata, and associated computational analyses are submitted to the ENCODE Data Coordination Center (DCC) for validation, tracking, storage, unified processing, and distribution to community resources and the scientific community. As the volume of data increases, the identification and organization of experimental details becomes increasingly intricate and demands careful curation. The ENCODE DCC has created a general-purpose software system, known as SnoVault, that supports metadata and file submission, a database used for metadata storage, web pages for displaying the metadata and a robust API for querying the metadata. The software is fully open source; code and installation instructions can be found at http://github.com/ENCODE-DCC/snovault/ (for the generic database) and http://github.com/ENCODE-DCC/encoded/ (for storing genomic data in the manner of ENCODE). The core database engine, SnoVault (which is completely independent of ENCODE, genomic data, or bioinformatic data), has been released as a separate Python package.
DOI: 10.1371/journal.pone.0175310
Web of Science ID: 000399955200049
PMID: 28403240
-
The Encyclopedia of DNA elements (ENCODE): data portal update.
Nucleic acids research
2017
Abstract
The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center has developed the ENCODE Portal database and website as the source for the data and metadata generated by the ENCODE Consortium. Two principles have motivated the design. First, experimental protocols, analytical procedures and the data themselves should be made publicly accessible through a coherent, web-based search and download interface. Second, the same interface should serve carefully curated metadata that record the provenance of the data and justify its interpretation in biological terms. Since its initial release in 2013, and in response to recommendations from consortium members and the wider community of scientists who use the Portal to access ENCODE data, the Portal has been regularly updated to better reflect these design principles. Here we report on these updates, including results from new experiments, uniformly processed data from other projects, new visualization tools and more comprehensive metadata to describe experiments and analyses. Additionally, the Portal is now home to (meta)data from related projects including Genomics of Gene Regulation, Roadmap Epigenome Project, Model organism ENCODE (modENCODE) and modERN. The Portal now makes available over 13,000 datasets and their accompanying metadata and can be accessed at: https://www.encodeproject.org/.
PMID: 29126249
-
Surveying DNA Elements within Functional Genes of Heterocyst-Forming Cyanobacteria
PLOS ONE
2016; 11 (5): e0156034
Abstract
Some cyanobacteria are capable of differentiating a variety of cell types in response to environmental factors. For instance, in low nitrogen conditions, some cyanobacteria form heterocysts, which are specialized for N2 fixation. Many heterocyst-forming cyanobacteria have DNA elements interrupting key N2 fixation genes, elements that are excised during heterocyst differentiation. While the mechanism for the excision of the element has been well-studied, many questions remain regarding the introduction of the elements into the cyanobacterial lineage and whether they have been retained ever since or have been lost and reintroduced. To examine the evolutionary relationships and possible function of DNA sequences that interrupt genes of heterocyst-forming cyanobacteria, we identified and compared 101 interruption element sequences within genes from 38 heterocyst-forming cyanobacterial genomes. The interruption element lengths ranged from about 1 kb (the minimum able to encode the recombinase responsible for element excision) up to nearly 1 Mb. The recombinase gene sequences served as genetic markers that were common across the interruption elements and were used to track element evolution. Elements were found that interrupted 22 different orthologs, only five of which had been previously observed to be interrupted by an element. Most of the newly identified interrupted orthologs encode proteins that have been shown to have heterocyst-specific activity. However, the presence of interruption elements within genes with no known role in N2 fixation, as well as in three non-heterocyst-forming cyanobacteria, indicates that the processes that trigger the excision of elements may not be limited to heterocyst development or that the elements move randomly within genomes. This comprehensive analysis provides the framework to study the history and behavior of these unique sequences, and offers new insight regarding the frequency and persistence of interruption elements in heterocyst-forming cyanobacteria.
DOI: 10.1371/journal.pone.0156034
Web of Science ID: 000376291500040
PMID: 27206019
PMCID: PMC4874684
-
Principles of metadata organization at the ENCODE data coordination center.
Database : the journal of biological databases and curation
2016; 2016
Abstract
The Encyclopedia of DNA Elements (ENCODE) Data Coordinating Center (DCC) is responsible for organizing, describing and providing access to the diverse data generated by the ENCODE project. The description of these data, known as metadata, includes the biological sample used as input, the protocols and assays performed on these samples, the data files generated from the results and the computational methods used to analyze the data. Here, we outline the principles and philosophy used to define the ENCODE metadata in order to create a metadata standard that can be applied to diverse assays and multiple genomic projects. In addition, we present how the data are validated and used by the ENCODE DCC in creating the ENCODE Portal (https://www.encodeproject.org/). Database URL: www.encodeproject.org.
DOI: 10.1093/database/baw001
PMID: 26980513
PMCID: PMC4792520
ORCID: https://orcid.org/0000-0002-1196-4871