James Zou
Associate Professor of Biomedical Data Science and, by courtesy, of Computer Science and of Electrical Engineering
Department of Biomedical Data Science
Web page: https://www.james-zou.com/
Bio
I am an Associate Professor of Biomedical Data Science and, by courtesy, of Computer Science and Electrical Engineering at Stanford University. I work on making AI more reliable, human-compatible and statistically rigorous, and am especially interested in applications in human disease and health. I received my Ph.D. from Harvard in 2014, and was at one time a member of Microsoft Research, a Gates Scholar at Cambridge and a Simons fellow at U.C. Berkeley. I joined Stanford in 2016 and am excited to also be a Chan-Zuckerberg Investigator. My group is also part of the Stanford AI Lab. My research is supported by two Chan-Zuckerberg Biohub Investigator Awards, the Sloan Fellowship, the NSF CAREER Award, a Top Ten Clinical Achievement Award and faculty awards from Google, Adobe and Amazon.
Academic Appointments
- Associate Professor, Department of Biomedical Data Science
- Associate Professor (By courtesy), Computer Science
- Associate Professor (By courtesy), Electrical Engineering
- Member, Bio-X
- Faculty Affiliate, Institute for Human-Centered Artificial Intelligence (HAI)
- Member, Stanford Cancer Institute
- Member, Wu Tsai Neurosciences Institute
Honors & Awards
- Chan-Zuckerberg Investigator, CZ Biohub (2023)
- Sloan Research Fellowship, Sloan Foundation (2021)
- NSF CAREER Award, NSF (2020)
- RECOMB Best Paper, RECOMB (2019)
- Google Faculty Award, Google (2018)
- Chan-Zuckerberg Investigator, CZ Biohub (2017)
- Simons Research Fellow, Simons Foundation (2014)
- NSF GRFP, NSF (2008)
- Gates-Cambridge Scholar, Gates Foundation (2007)
Current Research and Scholarly Interests
My group works on both foundations of statistical machine learning and applications in biomedicine and healthcare. We develop new technologies that make ML more accountable to humans, more reliable and robust, and that reveal core scientific insights.
We want our ML to be impactful and beneficial, and as such, we are deeply motivated by transformative applications in biotech and health. We collaborate with and advise many academic and industry groups.
2024-25 Courses
- Deep Learning in Genomics and Biomedicine
BIODS 237, CS 273B (Spr)
- Foundation Models for Healthcare
BIODS 271, RAD 271 (Spr)
Independent Studies (17)
- Advanced Reading and Research
CS 499 (Aut, Win, Spr)
- Advanced Reading and Research
CS 499P (Aut, Win, Spr)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390A (Aut, Win, Spr)
- Curricular Practical Training
CS 390B (Aut, Win, Spr)
- Directed Reading and Research
BIODS 299 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Directed Study
BIOE 391 (Aut, Win, Spr)
- Independent Project
CS 399 (Aut, Win, Spr)
- Independent Project
CS 399P (Aut, Win, Spr)
- Independent Study
SYMSYS 196 (Aut, Win, Spr)
- Independent Work
CS 199 (Aut, Win, Spr)
- Independent Work
CS 199P (Aut, Win, Spr)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Part-time Curricular Practical Training
CS 390D (Aut, Win, Spr)
- Senior Project
CS 191 (Aut, Win, Spr)
- Writing Intensive Senior Research Project
CS 191W (Aut, Win, Spr)
Prior Year Courses
2023-24 Courses
- Biomedical Informatics Student Seminar
BIODS 201, BIOMEDIN 201 (Win)
- Critical Exploration of Topics in Biomedical Data Science: Generative AI
BIODS 290 (Aut)
- Deep Learning in Genomics and Biomedicine
BIODS 237, CS 273B (Spr)
- Foundation Models for Healthcare
BIODS 271, CS 277, RAD 271 (Win)
2022-23 Courses
- Deep Learning in Genomics and Biomedicine
BIODS 237, BIOMEDIN 273B, CS 273B, GENE 236 (Spr)
- Workshop in Biostatistics
BIODS 260B, STATS 260B (Win)
2021-22 Courses
- Value of Data and AI
CS 320 (Win)
Stanford Advisees
- Doctoral Dissertation Reader (AC)
Louis Blankemeier, Trang Le
- Orals Chair
Omar Khattab
- Postdoctoral Faculty Sponsor
Yiqun Chen, Siyu He, Sheng Liu, Pan Lu
- Doctoral Dissertation Advisor (AC)
Joseph Boen, Yixing Jiang, Elana Simon, Eric Sun, Rahul Thapa, Kailas Vodrahalli, Eric Wu, Kevin Wu
- Master's Program Advisor
Kathryn Garcia, Manoj Maddali, Ryan Park, Christopher Pondoc, Gaurav Rane, Daniel Schreck, Rohan Sikand, Jessy Song, Ori Spector, Ryan Zhao
- Doctoral Dissertation Co-Advisor (AC)
Shirley Wu
- Undergraduate Major Advisor
Nikhiya Shamsher
- Doctoral (Program)
Jacob Chang, Karen Feng, Weixin Liang, Kyle Swanson, Nitya Thakkar, Haotian Ye, Mert Yuksekgonul
- Postdoctoral Research Mentor
Ian Covert
All Publications
-
BABEL enables cross-modality translation between multiomic profiles at single-cell resolution.
Proceedings of the National Academy of Sciences of the United States of America
2021; 118 (15)
Abstract
Simultaneous profiling of multiomic modalities within a single cell is a grand challenge for single-cell biology. While there have been impressive technical innovations demonstrating feasibility (for example, generating paired measurements of single-cell transcriptome [single-cell RNA sequencing, scRNA-seq] and chromatin accessibility [single-cell assay for transposase-accessible chromatin using sequencing, scATAC-seq]), widespread application of joint profiling is challenging due to its experimental complexity, noise, and cost. Here, we introduce BABEL, a deep learning method that translates between the transcriptome and chromatin profiles of a single cell. Leveraging an interoperable neural network model, BABEL can predict single-cell expression directly from a cell's scATAC-seq and vice versa after training on relevant data. This makes it possible to computationally synthesize paired multiomic measurements when only one modality is experimentally available. Across several paired single-cell ATAC and gene expression datasets in human and mouse, we validate that BABEL accurately translates between these modalities for individual cells. BABEL also generalizes well to cell types within new biological contexts not seen during training. Starting from scATAC-seq of patient-derived basal cell carcinoma (BCC), BABEL generated single-cell expression that enabled fine-grained classification of complex cell states, despite having never seen BCC data. These predictions are comparable to analyses of experimental BCC scRNA-seq data for diverse cell types related to BABEL's training data. We further show that BABEL can incorporate additional single-cell data modalities, such as protein epitope profiling, thus enabling translation across chromatin, RNA, and protein. BABEL offers a powerful approach for data exploration and hypothesis generation.
View details for DOI 10.1073/pnas.2023070118
View details for PubMedID 33827925
-
Evaluating eligibility criteria of oncology trials using real-world data and AI.
Nature
2021
Abstract
There is a growing focus on making clinical trials more inclusive but the design of trial eligibility criteria remains challenging [1-3]. Here we systematically evaluate the effect of different eligibility criteria on cancer trial populations and outcomes with real-world data using the computational framework of Trial Pathfinder. We apply Trial Pathfinder to emulate completed trials of advanced non-small-cell lung cancer using data from a nationwide database of electronic health records comprising 61,094 patients with advanced non-small-cell lung cancer. Our analyses reveal that many common criteria, including exclusions based on several laboratory values, had a minimal effect on the trial hazard ratios. When we used a data-driven approach to broaden restrictive criteria, the pool of eligible patients more than doubled on average and the hazard ratio of the overall survival decreased by an average of 0.05. This suggests that many patients who were not eligible under the original trial criteria could potentially benefit from the treatments. We further support our findings through analyses of other types of cancer and patient-safety data from diverse clinical trials. Our data-driven methodology for evaluating eligibility criteria can facilitate the design of more-inclusive trials while maintaining safeguards for patient safety.
View details for DOI 10.1038/s41586-021-03430-5
View details for PubMedID 33828294
-
How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals.
Nature medicine
2021
View details for DOI 10.1038/s41591-021-01312-x
View details for PubMedID 33820998
-
Integrating spatial gene expression and breast tumour morphology via deep learning.
Nature biomedical engineering
2020
Abstract
Spatial transcriptomics allows for the measurement of RNA abundance at a high spatial resolution, making it possible to systematically link the morphology of cellular neighbourhoods and spatially localized gene expression. Here, we report the development of a deep learning algorithm for the prediction of local gene expression from haematoxylin-and-eosin-stained histopathology images using a new dataset of 30,612 spatially resolved gene expression data matched to histopathology images from 23 patients with breast cancer. We identified over 100 genes, including known breast cancer biomarkers of intratumoral heterogeneity and the co-localization of tumour growth and immune activation, the expression of which can be predicted from the histopathology images at a resolution of 100 µm. We also show that the algorithm generalizes well to The Cancer Genome Atlas and to other breast cancer gene expression datasets without the need for re-training. Predicting the spatially resolved transcriptome of a tissue directly from tissue images may enable image-based screening for molecular biomarkers with spatial variation.
View details for DOI 10.1038/s41551-020-0578-x
View details for PubMedID 32572199
-
Video-based AI for beat-to-beat assessment of cardiac function.
Nature
2020; 580 (7802): 252-256
Abstract
Accurate assessment of cardiac function is crucial for the diagnosis of cardiovascular disease [1], screening for cardiotoxicity [2] and decisions regarding the clinical management of patients with a critical illness [3]. However, human assessment of cardiac function focuses on a limited sampling of cardiac cycles and has considerable inter-observer variability despite years of training [4,5]. Here, to overcome this challenge, we present a video-based deep learning algorithm, EchoNet-Dynamic, that surpasses the performance of human experts in the critical tasks of segmenting the left ventricle, estimating ejection fraction and assessing cardiomyopathy. Trained on echocardiogram videos, our model accurately segments the left ventricle with a Dice similarity coefficient of 0.92, predicts ejection fraction with a mean absolute error of 4.1% and reliably classifies heart failure with reduced ejection fraction (area under the curve of 0.97). In an external dataset from another healthcare system, EchoNet-Dynamic predicts the ejection fraction with a mean absolute error of 6.0% and classifies heart failure with reduced ejection fraction with an area under the curve of 0.96. Prospective evaluation with repeated human measurements confirms that the model has variance that is comparable to or less than that of human experts. By leveraging information across multiple cardiac cycles, our model can rapidly identify subtle changes in ejection fraction, is more reproducible than human evaluation and lays the foundation for precise diagnosis of cardiovascular disease in real time. As a resource to promote further innovation, we also make publicly available a large dataset of 10,030 annotated echocardiogram videos.
View details for DOI 10.1038/s41586-020-2145-8
View details for PubMedID 32269341
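For readers unfamiliar with the segmentation metric quoted in the abstract above, the Dice similarity coefficient between two binary masks A and B is 2|A ∩ B| / (|A| + |B|). The following is a minimal sketch in plain Python. The function name and the toy masks are hypothetical illustrations and are not taken from the EchoNet-Dynamic code or data.

```python
def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two equal-length binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    # Pixels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical flattened masks (1 = left ventricle pixel, 0 = background).
pred = [0, 1, 1, 1, 0, 0]
truth = [0, 0, 1, 1, 1, 0]
score = dice_coefficient(pred, truth)  # 2 * 2 / (3 + 3)
```

A value of 1.0 means the predicted and reference masks coincide exactly, and 0.0 means they share no foreground pixels.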
-
How Much Does Your Data Exploration Overfit? Controlling Bias via Information Usage
IEEE TRANSACTIONS ON INFORMATION THEORY
2020; 66 (1): 302–23
View details for DOI 10.1109/TIT.2019.2945779
View details for Web of Science ID 000505566100019
-
Fast and covariate-adaptive method amplifies detection power in large-scale multiple hypothesis testing.
Nature communications
2019; 10 (1): 3433
Abstract
Multiple hypothesis testing is an essential component of modern data science. In many settings, in addition to the p-value, additional covariates for each hypothesis are available, e.g., functional annotation of variants in genome-wide association studies. Such information is ignored by popular multiple testing approaches such as the Benjamini-Hochberg procedure (BH). Here we introduce AdaFDR, a fast and flexible method that adaptively learns the optimal p-value threshold from covariates to significantly improve detection power. On eQTL analysis of the GTEx data, AdaFDR discovers 32% more associations than BH at the same false discovery rate. We prove that AdaFDR controls false discovery proportion and show that it makes substantially more discoveries while controlling false discovery rate (FDR) in extensive experiments. AdaFDR is computationally efficient and allows multi-dimensional covariates with both numeric and categorical values, making it broadly useful across many applications.
View details for DOI 10.1038/s41467-019-11247-0
View details for PubMedID 31366926
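As background for the abstract above, the Benjamini-Hochberg (BH) baseline it compares against can be sketched in a few lines. This is only the standard covariate-free step-up procedure, not AdaFDR itself. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of hypotheses rejected by the BH step-up procedure."""
    m = len(p_values)
    # Indices of the hypotheses sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for ten hypotheses.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, alpha=0.05)
```

BH applies the same threshold rule to every hypothesis. A covariate-adaptive method of the kind described in the abstract instead lets the threshold depend on side information about each hypothesis.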
-
Large dataset enables prediction of repair after CRISPR-Cas9 editing in primary T cells.
Nature biotechnology
2019
Abstract
Understanding of repair outcomes after Cas9-induced DNA cleavage is still limited, especially in primary human cells. We sequence repair outcomes at 1,656 on-target genomic sites in primary human T cells and use these data to train a machine learning model, which we have called CRISPR Repair Outcome (SPROUT). SPROUT accurately predicts the length, probability and sequence of nucleotide insertions and deletions, and will facilitate design of SpCas9 guide RNAs in therapeutically important primary human cells.
View details for DOI 10.1038/s41587-019-0203-2
View details for PubMedID 31359007
-
Making AI Forget You: Data Deletion in Machine Learning
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2019
View details for Web of Science ID 000534424303050
-
Interpretation of Neural Networks Is Fragile
ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE. 2019: 3681–88
View details for Web of Science ID 000485292603086
-
Ensuring that biomedical AI benefits diverse populations.
EBioMedicine
2021: 103358
Abstract
Artificial Intelligence (AI) can potentially impact many aspects of human health, from basic research discovery to individual health assessment. It is critical that these advances in technology broadly benefit diverse populations from around the world. This can be challenging because AI algorithms are often developed on non-representative samples and evaluated based on narrow metrics. Here we outline key challenges to biomedical AI in outcome design, data collection and technology evaluation, and use examples from precision health to illustrate how bias and health disparity may arise in each stage. We then suggest both short term approaches (more diverse data collection and AI monitoring) and longer term structural changes in funding, publications, and education to address these challenges.
View details for DOI 10.1016/j.ebiom.2021.103358
View details for PubMedID 33962897
-
Data valuation for medical imaging using Shapley value and application to a large-scale chest X-ray dataset.
Scientific reports
2021; 11 (1): 8366
Abstract
The reliability of machine learning models can be compromised when trained on low quality data. Many large-scale medical imaging datasets contain low quality labels extracted from sources such as medical reports. Moreover, images within a dataset may have heterogeneous quality due to artifacts and biases arising from equipment or measurement errors. Therefore, algorithms that can automatically identify low quality data are highly desired. In this study, we used data Shapley, a data valuation metric, to quantify the value of training data to the performance of a pneumonia detection algorithm in a large chest X-ray dataset. We characterized the effectiveness of data Shapley in identifying low quality versus valuable data for pneumonia detection. We found that removing training data with high Shapley values decreased the pneumonia detection performance, whereas removing data with low Shapley values improved the model performance. Furthermore, there were more mislabeled examples in low Shapley value data and more true pneumonia cases in high Shapley value data. Our results suggest that low Shapley value indicates mislabeled or poor quality images, whereas high Shapley value indicates data that are valuable for pneumonia detection. Our method can serve as a framework for using data Shapley to denoise large-scale medical imaging datasets.
View details for DOI 10.1038/s41598-021-87762-2
View details for PubMedID 33863957
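To illustrate the idea behind the abstract above, a training point's data Shapley value is its average marginal contribution to model performance over all orderings in which the training set could be assembled. The sketch below computes this exactly on a toy problem. The 1-nearest-neighbour utility, the function names and the four-point dataset are all assumptions chosen for illustration. They are not the pneumonia model, the data or the approximation scheme used in the paper, and exact enumeration is feasible only for a handful of points.

```python
from itertools import permutations


def one_nn_accuracy(train, valid):
    """Validation accuracy of a 1-nearest-neighbour classifier (0.0 if train is empty)."""
    if not train:
        return 0.0
    correct = 0
    for x, y in valid:
        # Label of the training point closest to x.
        _, nearest_label = min(train, key=lambda point: abs(point[0] - x))
        correct += int(nearest_label == y)
    return correct / len(valid)


def exact_data_shapley(train, valid, utility=one_nn_accuracy):
    """Average each point's marginal utility gain over every ordering of the training set."""
    n = len(train)
    values = [0.0] * n
    orderings = list(permutations(range(n)))
    for ordering in orderings:
        subset = []
        previous = utility(subset, valid)
        for i in ordering:
            subset.append(train[i])
            current = utility(subset, valid)
            values[i] += current - previous
            previous = current
    return [v / len(orderings) for v in values]


# Hypothetical 1-D data as (feature, label); the last training point is deliberately mislabeled.
train = [(0.0, 0), (1.0, 0), (4.0, 1), (0.9, 1)]
valid = [(0.2, 0), (0.85, 0), (1.2, 0), (3.8, 1), (4.2, 1)]
values = exact_data_shapley(train, valid)
```

In this toy setting the mislabeled point receives the lowest, negative value, which mirrors the abstract's observation that low Shapley values flag mislabeled or poor quality data.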
-
How to evaluate deep learning for cancer diagnostics - factors and recommendations.
Biochimica et biophysica acta. Reviews on cancer
2021: 188515
Abstract
The large volume of data used in cancer diagnosis presents a unique opportunity for deep learning algorithms, which improve in predictive performance with increasing data. When applying deep learning to cancer diagnosis, the goal is often to learn how to classify an input sample (such as images or biomarkers) into predefined categories (such as benign or cancerous). In this article, we examine examples of how deep learning algorithms have been implemented to make predictions related to cancer diagnosis using clinical, radiological, and pathological image data. We present a systematic approach for evaluating the development and application of clinical deep learning algorithms. Based on these examples and the current state of deep learning in medicine, we discuss the future possibilities in this space and outline a roadmap for implementations of deep learning in cancer diagnosis.
View details for DOI 10.1016/j.bbcan.2021.188515
View details for PubMedID 33513392
-
TrueImage: A Machine Learning Algorithm to Improve the Quality of Telehealth Photos.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2021; 26: 220–31
Abstract
Telehealth is an increasingly critical component of the health care ecosystem, especially due to the COVID-19 pandemic. Rapid adoption of telehealth has exposed limitations in the existing infrastructure. In this paper, we study and highlight photo quality as a major challenge in the telehealth workflow. We focus on teledermatology, where photo quality is particularly important; the framework proposed here can be generalized to other health domains. For telemedicine, dermatologists request that patients submit images of their lesions for assessment. However, these images are often of insufficient quality to make a clinical diagnosis since patients do not have experience taking clinical photos. A clinician has to manually triage poor quality images and request new images to be submitted, leading to wasted time for both the clinician and the patient. We propose an automated image assessment machine learning pipeline, TrueImage, to detect poor quality dermatology photos and to guide patients in taking better photos. Our experiments indicate that TrueImage can reject ~50% of the sub-par quality images, while retaining ~80% of good quality images patients send in, despite heterogeneity and limitations in the training data. These promising results suggest that our solution is feasible and can improve the quality of teledermatology care.
View details for PubMedID 33691019
-
Mouse aging cell atlas analysis reveals global and cell type-specific aging signatures.
eLife
2021; 10
Abstract
Aging is associated with complex molecular and cellular processes that are poorly understood. Here we leveraged the Tabula Muris Senis single-cell RNA-seq data set to systematically characterize gene expression changes during aging across diverse cell types in the mouse. We identified aging-dependent genes in 76 tissue-cell types from 23 tissues and characterized both shared and tissue-cell-specific aging behaviors. We found that the aging-related genes shared by multiple tissue-cell types also change their expression congruently in the same direction during aging in most tissue-cell types, suggesting a coordinated global aging behavior at the organismal level. Scoring cells based on these shared aging genes allowed us to contrast the aging status of different tissues and cell types from a transcriptomic perspective. In addition, we identified genes that exhibit age-related expression changes specific to each functional category of tissue-cell types. Altogether, our analyses provide one of the most comprehensive and systematic characterizations of the molecular signatures of aging across diverse tissue-cell types in a mammalian system.
View details for DOI 10.7554/eLife.62293
View details for PubMedID 33847263
-
Variation in COVID-19 Data Reporting Across India: 6 Months into the Pandemic.
Journal of the Indian Institute of Science
2020: 1–8
Abstract
India reported its first case of COVID-19 on January 30, 2020. Six months since then, COVID-19 continues to be a growing crisis in India with over 1.6 million reported cases. In this communication, we assess the quality of COVID-19 data reporting done by the state and union territory governments in India between July 12 and July 25, 2020. We compare our findings with those from an earlier assessment conducted in May 2020. We conclude that 6 months into the pandemic, the quality of COVID-19 data reporting across India continues to be highly disparate, which could hinder public health efforts.
View details for DOI 10.1007/s41745-020-00188-z
View details for PubMedID 33078049
-
Association of Rapid Eye Movement Sleep With Mortality in Middle-aged and Older Adults.
JAMA neurology
2020
Abstract
Importance: Rapid eye movement (REM) sleep has been linked with health outcomes, but little is known about the relationship between REM sleep and mortality.
Objective: To investigate whether REM sleep is associated with greater risk of mortality in 2 independent cohorts and to explore whether another sleep stage could be driving the findings.
Design, Setting, and Participants: This multicenter population-based cross-sectional study used data from the Outcomes of Sleep Disorders in Older Men (MrOS) Sleep Study and Wisconsin Sleep Cohort (WSC). MrOS participants were recruited from December 2003 to March 2005, and WSC began in 1988. MrOS and WSC participants who had REM sleep and mortality data were included. Analysis began May 2018 and ended December 2019.
Main Outcomes and Measures: All-cause and cause-specific mortality confirmed with death certificates.
Results: The MrOS cohort included 2675 individuals (2675 men [100%]; mean [SD] age, 76.3 [5.5] years) and was followed up for a median (interquartile range) of 12.1 (7.8-13.2) years. The WSC cohort included 1386 individuals (753 men [54.3%]; mean [SD] age, 51.5 [8.5] years) and was followed up for a median (interquartile range) of 20.8 (17.9-22.4) years. MrOS participants had a 13% higher mortality rate for every 5% reduction in REM sleep (percentage REM sleep SD = 6.6%) after adjusting for multiple demographic, sleep, and health covariates (age-adjusted hazard ratio, 1.12; fully adjusted hazard ratio, 1.13; 95% CI, 1.08-1.19). Results were similar for cardiovascular and other causes of death. Possible threshold effects were seen on the Kaplan-Meier curves, particularly for cancer; individuals with less than 15% REM sleep had a higher mortality rate compared with individuals with 15% or more for each mortality outcome with odds ratios ranging from 1.20 to 1.35. Findings were replicated in the WSC cohort despite younger age, inclusion of women, and longer follow-up (hazard ratio, 1.13; 95% CI, 1.08-1.19). A random forest model identified REM sleep as the most important sleep stage associated with survival.
Conclusions and Relevance: Decreased percentage REM sleep was associated with greater risk of all-cause, cardiovascular, and other noncancer-related mortality in 2 independent cohorts.
View details for DOI 10.1001/jamaneurol.2020.2108
View details for PubMedID 32628261
-
Deep learning models to detect hidden clinical correlates.
The Lancet. Digital health
2020; 2 (7): e334-e335
View details for DOI 10.1016/S2589-7500(20)30138-2
View details for PubMedID 33328091
View details for Web of Science ID 000544134300003
-
Clinical Genetics Lacks Standard Definitions and Protocols for the Collection and Use of Diversity Measures.
American journal of human genetics
2020
Abstract
Genetics researchers and clinical professionals rely on diversity measures such as race, ethnicity, and ancestry (REA) to stratify study participants and patients for a variety of applications in research and precision medicine. However, there are no comprehensive, widely accepted standards or guidelines for collecting and using such data in clinical genetics practice. Two NIH-funded research consortia, the Clinical Genome Resource (ClinGen) and Clinical Sequencing Evidence-generating Research (CSER), have partnered to address this issue and report how REA are currently collected, conceptualized, and used. Surveying clinical genetics professionals and researchers (n = 448), we found heterogeneity in the way REA are perceived, defined, and measured, with variation in the perceived importance of REA in both clinical and research settings. The majority of respondents (>55%) felt that REA are at least somewhat important for clinical variant interpretation, ordering genetic tests, and communicating results to patients. However, there was no consensus on the relevance of REA, including how each of these measures should be used in different scenarios and what information they can convey in the context of human genetics. A lack of common definitions and applications of REA across the precision medicine pipeline may contribute to inconsistencies in data collection, missing or inaccurate classifications, and misleading or inconclusive results. Thus, our findings support the need for standardization and harmonization of REA data collection and use in clinical genetics and precision health research.
View details for DOI 10.1016/j.ajhg.2020.05.005
View details for PubMedID 32504544
-
RNA-GPS predicts high-resolution RNA subcellular localization and highlights the role of splicing.
RNA (New York, N.Y.)
2020
Abstract
Subcellular localization is essential to RNA biogenesis, processing, and function across the gene expression life cycle. However, the specific nucleotide sequence motifs that direct RNA localization are incompletely understood. Fortunately, new sequencing technologies have provided transcriptome-wide atlases of RNA localization, creating an opportunity to leverage computational modeling. Here we present RNA-GPS, a new machine learning model that uses nucleotide-level features to predict RNA localization across 8 different subcellular locations - the first to provide such a wide range of predictions. RNA-GPS's design enables high throughput sequence ablation and feature importance analyses to probe the sequence motifs that drive localization prediction. We find localization informative motifs to be concentrated on 3' UTRs and scattered along the coding sequence, and motifs related to splicing to be important drivers of predicted localization, even for cytotopic distinctions for membraneless bodies within the nucleus or for organelles within the cytoplasm. Overall, our results suggest transcript splicing is one of many elements influencing RNA subcellular localization.
View details for DOI 10.1261/rna.074161.119
View details for PubMedID 32220894
-
A benchmark of algorithms for the analysis of pooled CRISPR screens.
Genome biology
2020; 21 (1): 62
Abstract
Genome-wide pooled CRISPR-Cas-mediated knockout, activation, and repression screens are powerful tools for functional genomic investigations. Despite their increasing importance, there is currently little guidance on how to design and analyze CRISPR-pooled screens. Here, we provide a review of the commonly used algorithms in the computational analysis of pooled CRISPR screens. We develop a comprehensive simulation framework to benchmark and compare the performance of these algorithms using both synthetic and real datasets. Our findings inform parameter choices of CRISPR screens and provide guidance to researchers on the design and analysis of pooled CRISPR screens.
View details for DOI 10.1186/s13059-020-01972-x
View details for PubMedID 32151271
-
An online platform for interactive feedback in biomedical machine learning
NATURE MACHINE INTELLIGENCE
2020; 2 (2): 86–88
View details for DOI 10.1038/s42256-020-0147-8
View details for Web of Science ID 000571258300004
-
Deep learning interpretation of echocardiograms.
NPJ digital medicine
2020; 3 (1): 10
Abstract
Echocardiography uses ultrasound technology to capture high temporal and spatial resolution images of the heart and surrounding structures, and is the most common imaging modality in cardiovascular medicine. Using convolutional neural networks on a large new dataset, we show that deep learning applied to echocardiography can identify local cardiac structures, estimate cardiac function, and predict systemic phenotypes that modify cardiovascular risk but are not readily identifiable to human interpretation. Our deep learning model, EchoNet, accurately identified the presence of pacemaker leads (AUC = 0.89), enlarged left atrium (AUC = 0.86), left ventricular hypertrophy (AUC = 0.75), left ventricular end systolic and diastolic volumes (R² = 0.74 and R² = 0.70), and ejection fraction (R² = 0.50), as well as predicted systemic phenotypes of age (R² = 0.46), sex (AUC = 0.88), weight (R² = 0.56), and height (R² = 0.33). Interpretation analysis validates that EchoNet shows appropriate attention to key cardiac structures when performing human-explainable tasks and highlights hypothesis-generating regions of interest when predicting systemic phenotypes difficult for human interpretation. Machine learning on echocardiography images can streamline repetitive tasks in the clinical workflow, provide preliminary interpretation in areas with insufficient qualified cardiologists, and predict phenotypes challenging for human evaluation.
View details for DOI 10.1038/s41746-019-0216-8
View details for PubMedID 33483633
-
LitGen: Genetic Literature Recommendation Guided by Human Explanations.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2020; 25: 67–78
Abstract
As genetic sequencing costs decrease, the lack of clinical interpretation of variants has become the bottleneck in using genetics data. A major rate limiting step in clinical interpretation is the manual curation of evidence in the genetic literature by highly trained biocurators. What makes curation particularly time-consuming is that the curator needs to identify papers that study variant pathogenicity using different types of approaches and evidence, e.g. biochemical assays or case control analysis. In collaboration with the Clinical Genome Resource (ClinGen), the flagship NIH program for clinical curation, we propose the first machine learning system, LitGen, that can retrieve papers for a particular variant and filter them by specific evidence types used by curators to assess for pathogenicity. LitGen uses semi-supervised deep learning to predict the type of evidence provided by each paper. It is trained on papers annotated by ClinGen curators and systematically evaluated on new test data collected by ClinGen. LitGen further leverages rich human explanations and unlabeled data to gain 7.9%-12.6% relative performance improvement over models learned only on the annotated papers. It is a useful framework to improve clinical variant curation.
View details for PubMedID 31797587
-
NCI Workshop on Artificial Intelligence in Radiation Oncology: Training the Next Generation.
Practical radiation oncology
2020
Abstract
Artificial intelligence (AI) is about to touch every aspect of radiotherapy, from consultation, treatment planning, quality assurance, and therapy delivery to outcomes modeling. There is an urgent need to train radiation oncologists and medical physicists in data science to help shepherd AI solutions into clinical practice. Poorly trained personnel may do more harm than good when attempting to apply rapidly developing and complex technologies. As the amount of AI research expands in our field, the radiation oncology community needs to discuss how to educate future generations in this area. The National Cancer Institute (NCI) Workshop on AI in Radiation Oncology (Shady Grove, MD, April 4-5, 2019) was the first (https://dctd.cancer.gov/NewsEvents/20190523_ai_in_radiation_oncology.htm) of two data science workshops in radiation oncology hosted by the NCI in 2019. During this workshop, the Training and Education Working Group was formed by volunteers among the invited attendees. Its members represent radiation oncology, medical physics, radiology, computer science, industry, and the NCI. In this perspective article written by members of the Training and Education Working Group, we provide and discuss Action Points relevant for future trainees interested in radiation oncology AI: (1) creating AI awareness and responsible conduct; (2) implementing a practical didactic curriculum; (3) creating a publicly available database of training resources; and (4) accelerating learning and funding opportunities. Together, these Action Points can facilitate the translation of AI into clinical practice.
View details for DOI 10.1016/j.prro.2020.06.001
View details for PubMedID 32544635
-
Predicting target genes of noncoding regulatory variants with ICE.
Bioinformatics (Oxford, England)
2020
Abstract
Interpreting genetic variants of unknown significance (VUS) is essential in clinical applications of genome sequencing for diagnosis and personalized care. Noncoding variants remain particularly difficult to interpret, despite making up a large majority of trait associations identified in GWAS analyses. Predicting the regulatory effects of noncoding variants on candidate genes is a key step in evaluating their clinical significance. Here we develop a machine learning algorithm, ICE (Inference of Connected eQTLs), to predict the regulatory targets of noncoding variants identified in studies of expression quantitative trait loci (eQTLs). We assemble datasets using eQTL results from the Genotype-Tissue Expression (GTEx) project and learn to separate positive and negative pairs based on annotations characterizing the variant, gene and the intermediate sequence. ICE achieves an area under the receiver operating characteristic curve (ROC-AUC) of 0.799 using random cross-validation, and 0.700 for a more stringent position-based cross-validation. Further evaluation on rare variants and experimentally-validated regulatory variants shows a significant enrichment in ICE identifying the true target genes versus negative controls. In gene ranking experiments, ICE achieves a top-1 accuracy of 50% and top-3 accuracy of 90%. Salient features, including GC content, histone modifications and Hi-C interactions, are further analyzed and visualized to illustrate their influences on predictions. ICE can be applied to any VUS of interest and each candidate nearby gene to output a score reflecting the likelihood of regulatory effect on the expression level. These scores can be used to prioritize variants and genes to assist in patient diagnosis and GWAS follow-up studies. Code and data used in this work are available at https://github.com/miaecle/eQTL_Trees. Supplementary data.
View details for DOI 10.1093/bioinformatics/btaa254
View details for PubMedID 32330225
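The top-1 and top-3 figures quoted in the abstract above are ranking metrics. The following is a minimal sketch of how such a top-k accuracy could be computed, assuming each variant comes with a score per candidate gene and one known target gene. The function name, the data layout and the toy scores are illustrative assumptions and are not taken from the ICE code base.

```python
def top_k_accuracy(cases, k):
    """Fraction of variants whose true target gene is among the k best-scored genes.

    `cases` is a list of (scores, true_gene) pairs, where `scores` maps each
    candidate gene name to a model score (higher means more likely target).
    This layout is a hypothetical one chosen for illustration.
    """
    hits = 0
    for scores, true_gene in cases:
        # Rank candidate genes from highest to lowest score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if true_gene in ranked[:k]:
            hits += 1
    return hits / len(cases)


# Toy example with made-up scores for two variants and three candidate genes.
toy_cases = [
    ({"GENE_A": 0.9, "GENE_B": 0.2, "GENE_C": 0.1}, "GENE_A"),
    ({"GENE_A": 0.3, "GENE_B": 0.6, "GENE_C": 0.5}, "GENE_C"),
]
print(top_k_accuracy(toy_cases, k=1))
print(top_k_accuracy(toy_cases, k=2))
```

In the toy data the second variant's true gene is ranked second, so it counts as a hit only once k reaches 2.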
-
Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation
ASSOC COMPUTATIONAL LINGUISTICS-ACL. 2020: 1363–74
View details for Web of Science ID 000570978201061
-
Deep profiling of protease substrate specificity enabled by dual random and scanned human proteome substrate phage libraries.
Proceedings of the National Academy of Sciences of the United States of America
2020
Abstract
Proteolysis is a major posttranslational regulator of biology inside and outside of cells. Broad identification of optimal cleavage sites and natural substrates of proteases is critical for drug discovery and to understand protease biology. Here, we present a method that employs two genetically encoded substrate phage display libraries coupled with next generation sequencing (SPD-NGS) that allows up to 10,000-fold deeper sequence coverage of the typical six- to eight-residue protease cleavage sites compared to state-of-the-art synthetic peptide libraries or proteomics. We applied SPD-NGS to two classes of proteases, the intracellular caspases, and the ectodomains of the sheddases, ADAMs 10 and 17. The first library (Lib 10AA) allowed us to identify 10^4 to 10^5 unique cleavage sites over a 1,000-fold dynamic range of NGS counts and produced consensus and optimal cleavage motifs based on position-specific scoring matrices. A second SPD-NGS library (Lib hP), which displayed virtually the entire human proteome tiled in contiguous 49 amino acid sequences with 25 amino acid overlaps, enabled us to identify candidate human proteome sequences. We identified up to 10^4 natural linear cut sites, depending on the protease, and captured most of the examples previously identified by proteomics and predicted 10- to 100-fold more. Structural bioinformatics was used to facilitate the identification of candidate natural protein substrates. SPD-NGS is rapid, reproducible, simple to perform and analyze, inexpensive, and renewable, with unprecedented depth of coverage for substrate sequences, and is an important tool for protease biologists interested in protease specificity for specific assays and inhibitors and to facilitate identification of natural protein substrates.
View details for DOI 10.1073/pnas.2009279117
View details for PubMedID 32973096
-
PB-Net: Automatic peak integration by sequential deep learning for multiple reaction monitoring.
Journal of proteomics
2020: 103820
Abstract
Mass spectrometry (MS) based proteomics has become an indispensable component of modern molecular and cellular biochemistry analysis. Multiple reaction monitoring (MRM) is one of the most well-established MS techniques for molecule detection and quantification. Despite its wide usage, an accurate computational framework for analyzing MRM data is lacking, and expert annotation is often required, especially to perform peak integration. Here we propose a deep learning method, PB-Net (Peak Boundary Neural Network), built upon recent advances in sequential neural networks, for fully automatic chromatographic peak integration. To train PB-Net, we generated a large dataset of over 170,000 expert-annotated peaks from MS transitions spanning a wide dynamic range, including both peptides and intact glycopeptides. Our model demonstrated outstanding performance on unseen test samples, reaching near-perfect agreement (Pearson's r = 0.997) with human-annotated ground truth. Systematic evaluations also show that PB-Net is substantially more robust and accurate compared to previous state-of-the-art peak integration software. PB-Net can benefit the wide community of mass spectrometry data analysis, especially in applications involving high-throughput MS experiments. Code and test data used in this work are available at https://github.com/miaecle/PB-net. SIGNIFICANCE: Human annotations serve an important role in accurate quantification of multiple reaction monitoring (MRM) experiments, though they are costly to collect and limit analysis throughput. In this work we proposed and developed a novel technique for the peak-integration step in MRM, based on recent innovations in sequential deep learning models. We collected in total 170,000 expert-annotated MRM peaks and trained a set of accurate and robust neural networks for the task. Results demonstrated a substantial improvement over the current state-of-the-art software for mass spectrometry analysis and a level of accuracy and precision comparable to that of human annotators.
View details for DOI 10.1016/j.jprot.2020.103820
View details for PubMedID 32416316
-
RNA-GPS Predicts SARS-CoV-2 RNA Localization to Host Mitochondria and Nucleolus.
bioRxiv : the preprint server for biology
2020
Abstract
The SARS-CoV-2 coronavirus is driving a global pandemic, but its biological mechanisms are less well understood. SARS-CoV-2 is an RNA virus whose multiple genomic and subgenomic RNA (sgRNA) transcripts hijack the host cell's machinery, located across distinct cytotopic locations. Subcellular localization of its viral RNA could play important roles in viral replication and host antiviral immune response. Here we perform computational modeling of SARS-CoV-2 viral RNA localization across eight subcellular neighborhoods. We compare hundreds of SARS-CoV-2 genomes to the human transcriptome and other coronaviruses and perform systematic sub-sequence analyses to identify the responsible signals. Using state-of-the-art machine learning models, we predict that the SARS-CoV-2 RNA genome and all sgRNAs are enriched in the host mitochondrial matrix and nucleolus. The 5' and 3' viral untranslated regions possess the strongest and most distinct localization signals. We discuss the mitochondrial localization signal in relation to the formation of double-membrane vesicles, a critical stage in the coronavirus life cycle. Our computational analysis serves as a hypothesis generation tool to suggest models for SARS-CoV-2 biology and inform experimental efforts to combat the virus.
View details for DOI 10.1101/2020.04.28.065201
View details for PubMedID 32511373
View details for PubMedCentralID PMC7263502
-
RNA-GPS Predicts SARS-CoV-2 RNA Residency to Host Mitochondria and Nucleolus.
Cell systems
2020
Abstract
SARS-CoV-2 genomic and subgenomic RNA (sgRNA) transcripts hijack the host cell's machinery. Subcellular localization of its viral RNA could, thus, play important roles in viral replication and host antiviral immune response. We perform computational modeling of SARS-CoV-2 viral RNA subcellular residency across eight subcellular neighborhoods. We compare hundreds of SARS-CoV-2 genomes with the human transcriptome and other coronaviruses. We predict the SARS-CoV-2 RNA genome and sgRNAs to be enriched toward the host mitochondrial matrix and nucleolus, and that the 5' and 3' viral untranslated regions contain the strongest, most distinct localization signals. We interpret the mitochondrial residency signal as an indicator of intracellular RNA trafficking with respect to double-membrane vesicles, a critical stage in the coronavirus life cycle. Our computational analysis serves as a hypothesis generation tool to suggest models for SARS-CoV-2 biology and inform experimental efforts to combat the virus. A record of this paper's Transparent Peer Review process is included in the Supplemental Information.
View details for DOI 10.1016/j.cels.2020.06.008
View details for PubMedID 32673562
-
A single-cell transcriptomic atlas characterizes ageing tissues in the mouse.
Nature
2020
Abstract
Ageing is characterized by a progressive loss of physiological integrity, leading to impaired function and increased vulnerability to death. Despite rapid advances over recent years, many of the molecular and cellular processes that underlie the progressive loss of healthy physiology are poorly understood. To gain a better insight into these processes, here we generate a single-cell transcriptomic atlas across the lifespan of Mus musculus that includes data from 23 tissues and organs. We found cell-specific changes occurring across multiple cell types and organs, as well as age-related changes in the cellular composition of different organs. Using single-cell transcriptomic data, we assessed cell-type-specific manifestations of different hallmarks of ageing, such as senescence, genomic instability and changes in the immune system. This transcriptomic atlas, which we denote Tabula Muris Senis, or 'Mouse Ageing Cell Atlas', provides molecular information about how the most important hallmarks of ageing are reflected in a broad range of tissues and cell types.
View details for DOI 10.1038/s41586-020-2496-1
View details for PubMedID 32669714
-
Deep learning interpretation of echocardiograms.
NPJ digital medicine
2020; 3: 10
Abstract
Echocardiography uses ultrasound technology to capture high temporal and spatial resolution images of the heart and surrounding structures, and is the most common imaging modality in cardiovascular medicine. Using convolutional neural networks on a large new dataset, we show that deep learning applied to echocardiography can identify local cardiac structures, estimate cardiac function, and predict systemic phenotypes that modify cardiovascular risk but are not readily identifiable to human interpretation. Our deep learning model, EchoNet, accurately identified the presence of pacemaker leads (AUC = 0.89), enlarged left atrium (AUC = 0.86), left ventricular hypertrophy (AUC = 0.75), left ventricular end systolic and diastolic volumes (R^2 = 0.74 and R^2 = 0.70), and ejection fraction (R^2 = 0.50), as well as predicted systemic phenotypes of age (R^2 = 0.46), sex (AUC = 0.88), weight (R^2 = 0.56), and height (R^2 = 0.33). Interpretation analysis validates that EchoNet shows appropriate attention to key cardiac structures when performing human-explainable tasks and highlights hypothesis-generating regions of interest when predicting systemic phenotypes difficult for human interpretation. Machine learning on echocardiography images can streamline repetitive tasks in the clinical workflow, provide preliminary interpretation in areas with insufficient qualified cardiologists, and predict phenotypes challenging for human evaluation.
View details for DOI 10.1038/s41746-019-0216-8
View details for PubMedID 31993508
-
Sex and gender analysis improves science and engineering.
Nature
2019; 575 (7781): 137–46
Abstract
The goal of sex and gender analysis is to promote rigorous, reproducible and responsible science. Incorporating sex and gender analysis into experimental design has enabled advancements across many disciplines, such as improved treatment of heart disease and insights into the societal impact of algorithmic bias. Here we discuss the potential for sex and gender analysis to foster scientific discovery, improve experimental efficiency and enable social equality. We provide a roadmap for sex and gender analysis across scientific disciplines and call on researchers, funding agencies, peer-reviewed journals and universities to coordinate efforts to implement robust methods of sex and gender analysis.
View details for DOI 10.1038/s41586-019-1657-6
View details for PubMedID 31695204
-
VetTag: improving automated veterinary diagnosis coding via large-scale language modeling.
NPJ digital medicine
2019; 2: 35
Abstract
Unlike human medical records, most veterinary records are free text without standard diagnosis coding. The lack of systematic coding is a major barrier to the growing interest in leveraging veterinary records for public health and translational research. Recent machine learning effort is limited to predicting 42 top-level diagnosis categories from veterinary notes. Here we develop a large-scale algorithm to automatically predict all 4577 standard veterinary diagnosis codes from free text. We train our algorithm on a curated dataset of over 100K expert-labeled veterinary notes and over one million unlabeled notes. Our algorithm is based on the adapted Transformer architecture and we demonstrate that large-scale language modeling on the unlabeled notes via pretraining and as an auxiliary objective during supervised learning greatly improves performance. We systematically evaluate the performance of the model and several baselines in challenging settings where algorithms trained on one hospital are evaluated in a different hospital with substantial domain shift. In addition, we show that hierarchical training can address severe data imbalances for fine-grained diagnosis with a few training cases, and we provide interpretation for what is learned by the deep network. Our algorithm addresses an important challenge in veterinary medicine, and our model and experiments add insights into the power of unsupervised learning for clinical natural language processing.
View details for DOI 10.1038/s41746-019-0113-1
View details for PubMedID 31304381
View details for PubMedCentralID PMC6550141
-
VetTag: improving automated veterinary diagnosis coding via large-scale language modeling
NPJ DIGITAL MEDICINE
2019; 2
View details for DOI 10.1038/s41746-019-0113-1
View details for Web of Science ID 000467504500001
-
Modeling Spatial Correlation of Transcripts with Application to Developing Pancreas
SCIENTIFIC REPORTS
2019; 9
View details for DOI 10.1038/s41598-019-41951-2
View details for Web of Science ID 000463178500053
-
Modeling Spatial Correlation of Transcripts with Application to Developing Pancreas.
Scientific reports
2019; 9 (1): 5592
Abstract
Recently, high-throughput image-based transcriptomic methods were developed, enabling researchers to spatially resolve gene expression variation at the molecular level for the first time. In this work, we develop a general analysis tool to quantitatively study the spatial correlations of gene expression in fixed tissue sections. As an illustration, we analyze the spatial distribution of single mRNA molecules measured by in situ sequencing on human fetal pancreas at three developmental time points: 80, 87 and 117 days post-fertilization. We develop a density profile-based method to capture the spatial relationship between gene expression and other morphological features of the tissue sample such as position of nuclei and endocrine cells of the pancreas. In addition, we build a statistical model to characterize correlations in the spatial distribution of the expression level among different genes. This model enables us to infer the inhibitory and clustering effects throughout different time points. Our analysis framework is applicable to a wide variety of spatially-resolved transcriptomic data to derive biological insights.
View details for PubMedID 30944357
-
A large CRISPR-induced bystander mutation causes immune dysregulation.
Communications biology
2019; 2: 70
Abstract
A persistent concern with CRISPR-Cas9 gene editing has been the potential to generate mutations at off-target genomic sites. While CRISPR-engineering mice to delete a ~360 bp intronic enhancer, here we discovered a founder line that had marked immune dysregulation caused by a 24 kb tandem duplication of the sequence adjacent to the on-target deletion. Our results suggest unintended repair of on-target genomic cuts can cause pathogenic "bystander" mutations that escape detection by routine targeted genotyping assays.
View details for PubMedID 30793048
-
Multiaccuracy: Black-Box Post-Processing for Fairness in Classification
ASSOC COMPUTING MACHINERY. 2019: 247–54
View details for DOI 10.1145/3306618.3314287
View details for Web of Science ID 000556121100035
-
Improving the Stability of the Knockoff Procedure: Multiple Simultaneous Knockoffs and Entropy Maximization
MICROTOME PUBLISHING. 2019
View details for Web of Science ID 000509687902024
-
Knockoffs for the Mass: New Feature Importance Statistics with False Discovery Guarantees
MICROTOME PUBLISHING. 2019
View details for Web of Science ID 000509687902018
-
Towards Automatic Concept-based Explanations
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2019
View details for Web of Science ID 000535866900082
-
Contrastive Multivariate Singular Spectrum Analysis
IEEE. 2019: 1122–27
View details for Web of Science ID 000535355700155
-
Contingent Payment Mechanisms for Resource Utilization
ASSOC COMPUTING MACHINERY. 2019: 422–30
View details for Web of Science ID 000474345000052
-
A primer on deep learning in genomics.
Nature genetics
2018
Abstract
Deep learning methods are a class of machine learning techniques capable of identifying highly complex patterns in large datasets. Here, we provide a perspective and primer on deep learning applications for genome analysis. We discuss successful applications in the fields of regulatory genomics, variant calling and pathogenicity scores. We include general guidance for how to effectively use deep learning methods as well as a practical guide to tools and resources. This primer is accompanied by an interactive online tutorial.
View details for PubMedID 30478442
-
The clinical imperative for inclusivity: Race, ethnicity, and ancestry (REA) in genomics.
Human mutation
2018; 39 (11): 1713–20
Abstract
The Clinical Genome Resource (ClinGen) Ancestry and Diversity Working Group highlights the need to develop guidance on race, ethnicity, and ancestry (REA) data collection and use in clinical genomics. We present quantitative and qualitative evidence to characterize: (1) acquisition of REA data via clinical laboratory requisition forms, and (2) information disparity across populations in the Genome Aggregation Database (gnomAD) at clinically relevant sites ascertained from annotations in ClinVar. Our requisition form analysis showed substantial heterogeneity in clinical laboratory ascertainment of REA, as well as marked incongruity among terms used to define REA categories. There was also striking disparity across REA populations in the amount of information available about clinically relevant variants in gnomAD. European ancestral populations constituted the majority of observations (55.8%), allele counts (59.7%), and private alleles (56.1%) in gnomAD at 550 loci with "pathogenic" and "likely pathogenic" expert-reviewed variants in ClinVar. Our findings highlight the importance of implementing and supporting programs to increase diversity in genome sequencing and clinical genomics, as well as measuring uncertainty around population-level datasets that are used in variant interpretation. Finally, we suggest the need for a standardized REA data collection framework to be developed through partnerships and collaborations and adopted across clinical genomics.
View details for PubMedID 30311373
-
DeepTag: inferring diagnoses from veterinary clinical notes.
NPJ digital medicine
2018; 1: 60
Abstract
Large-scale veterinary clinical records can become a powerful resource for patient care and research. However, clinicians lack the time and resources to annotate patient records with standard medical diagnostic codes, and most veterinary visits are captured in free-text notes. The lack of standard coding makes it challenging to use the clinical data to improve patient care. It is also a major impediment to cross-species translational research, which relies on the ability to accurately identify patient cohorts with specific diagnostic criteria in humans and animals. In order to reduce the coding burden for veterinary clinical practice and aid translational research, we have developed a deep learning algorithm, DeepTag, which automatically infers diagnostic codes from veterinary free-text notes. DeepTag is trained on a newly curated dataset of 112,558 veterinary notes manually annotated by experts. DeepTag extends multitask LSTM with an improved hierarchical objective that captures the semantic structures between diseases. To foster human-machine collaboration, DeepTag also learns to abstain in examples when it is uncertain and defers them to human experts, resulting in improved performance. DeepTag accurately infers disease codes from free text even in challenging cross-hospital settings where the text comes from different clinical settings than the ones used for training. It enables automated disease annotation across a broad range of clinical diagnoses with minimal preprocessing. The technical framework in this work can be applied in other medical domains that currently lack medical coding resources.
View details for DOI 10.1038/s41746-018-0067-8
View details for PubMedID 31304339
View details for PubMedCentralID PMC6550285
-
DeepTag: inferring diagnoses from veterinary clinical notes
NPJ DIGITAL MEDICINE
2018; 1
View details for DOI 10.1038/s41746-018-0067-8
View details for Web of Science ID 000449685400001
-
Integrative proteomics and bioinformatic prediction enable a high-confidence apicoplast proteome in malaria parasites.
PLoS biology
2018; 16 (9): e2005895
Abstract
Malaria parasites (Plasmodium spp.) and related apicomplexan pathogens contain a nonphotosynthetic plastid called the apicoplast. Derived from an unusual secondary eukaryote-eukaryote endosymbiosis, the apicoplast is a fascinating organelle whose function and biogenesis rely on a complex amalgamation of bacterial and algal pathways. Because these pathways are distinct from the human host, the apicoplast is an excellent source of novel antimalarial targets. Despite its biomedical importance and evolutionary significance, the absence of a reliable apicoplast proteome has limited most studies to the handful of pathways identified by homology to bacteria or primary chloroplasts, precluding our ability to study the most novel apicoplast pathways. Here, we combine proximity biotinylation-based proteomics (BioID) and a new machine learning algorithm to generate a high-confidence apicoplast proteome consisting of 346 proteins. Critically, the high accuracy of this proteome significantly outperforms previous prediction-based methods and extends beyond other BioID studies of unique parasite compartments. Half of identified proteins have unknown function, and 77% are predicted to be important for normal blood-stage growth. We validate the apicoplast localization of a subset of novel proteins and show that an ATP-binding cassette protein ABCF1 is essential for blood-stage survival and plays a previously unknown role in apicoplast biogenesis. These findings indicate critical organellar functions for newly discovered apicoplast proteins. The apicoplast proteome will be an important resource for elucidating unique pathways derived from secondary endosymbiosis and prioritizing antimalarial drug targets.
View details for PubMedID 30212465
-
Design AI so that it's fair
NATURE
2018; 559 (7714): 324–26
View details for DOI 10.1038/d41586-018-05707-8
View details for Web of Science ID 000439059800025
View details for PubMedID 30018439
-
Exploring patterns enriched in a dataset with contrastive principal component analysis
NATURE COMMUNICATIONS
2018; 9: 2134
Abstract
Visualization and exploration of high-dimensional data is a ubiquitous challenge across disciplines. Widely used techniques such as principal component analysis (PCA) aim to identify dominant trends in one dataset. However, in many settings we have datasets collected under different conditions, e.g., a treatment and a control experiment, and we are interested in visualizing and exploring patterns that are specific to one dataset. This paper proposes a method, contrastive principal component analysis (cPCA), which identifies low-dimensional structures that are enriched in a dataset relative to comparison data. In a wide variety of experiments, we demonstrate that cPCA with a background dataset enables us to visualize dataset-specific patterns missed by PCA and other standard methods. We further provide a geometric interpretation of cPCA and strong mathematical guarantees. An implementation of cPCA is publicly available, and can be used for exploratory data analysis in many applications where PCA is currently used.
View details for PubMedID 29849030
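The contrast between PCA and cPCA described in the abstract above can be illustrated with a small sketch. It assumes the common formulation in which contrastive directions are the leading eigenvectors of the target covariance minus alpha times the background covariance. The contrast strength `alpha`, the function names and the toy data are assumptions made here for illustration, and the pure-Python power iteration stands in for a proper linear algebra routine; this is not the authors' released implementation.

```python
import math


def covariance(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        centered = [r[j] - means[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += centered[i] * centered[j] / (n - 1)
    return cov


def top_direction(mat, iters=500):
    """Unit eigenvector for the largest eigenvalue of a symmetric matrix.

    Runs power iteration on mat + shift*I. The shift (a Gershgorin bound)
    makes all eigenvalues non-negative, so the largest one dominates.
    """
    d = len(mat)
    shift = max(sum(abs(x) for x in row) for row in mat)
    v = [1.0 / math.sqrt(d)] * d
    if shift == 0.0:  # zero matrix: no direction is preferred
        return v
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) + shift * v[i] for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def cpca_direction(target, background, alpha):
    """Leading direction of cov(target) - alpha * cov(background).

    alpha = 0 reduces to ordinary PCA on the target data.
    """
    ct, cb = covariance(target), covariance(background)
    d = len(ct)
    contrast = [[ct[i][j] - alpha * cb[i][j] for j in range(d)] for i in range(d)]
    return top_direction(contrast)


# Toy data: axis 0 varies strongly in both datasets, axis 1 mainly in the target.
target = [(-3, 1), (-1, -1), (1, -1), (3, 1)]
background = [(-3, 0.1), (-1, -0.1), (1, -0.1), (3, 0.1)]
print(cpca_direction(target, background, alpha=0.0))  # PCA direction
print(cpca_direction(target, background, alpha=1.0))  # contrastive direction
```

With alpha set to 0 the sketch picks the shared high-variance axis, while alpha set to 1 cancels that shared variance and picks the axis that varies mainly in the target data.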
-
Word embeddings quantify 100 years of gender and ethnic stereotypes
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2018; 115 (16): E3635–E3644
Abstract
Word embeddings are a powerful machine-learning framework that represents each English word by a vector. The geometric relationship between these vectors captures meaningful semantic relationships between the corresponding words. In this paper, we develop a framework to demonstrate how the temporal dynamics of the embedding help to quantify changes in stereotypes and attitudes toward women and ethnic minorities in the 20th and 21st centuries in the United States. We integrate word embeddings trained on 100 years of text data with the US Census to show that changes in the embedding track closely with demographic and occupation shifts over time. The embedding captures societal shifts (e.g., the women's movement in the 1960s and Asian immigration into the United States) and also illuminates how specific adjectives and occupations became more closely associated with certain populations over time. Our framework for temporal analysis of word embedding opens up a fruitful intersection between machine learning and quantitative social science.
View details for PubMedID 29615513
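The abstract above relies on the geometry of word vectors to measure how closely neutral words (such as occupations) sit to different groups. One simple way to score this, sketched here as an assumption rather than as the paper's exact procedure, is a relative norm distance: for each neutral word, take its Euclidean distance to the average vector of group A minus its distance to the average vector of group B, and sum over the neutral words. All names and the two-dimensional toy vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relative_norm_distance(neutral_vectors, group_a_vectors, group_b_vectors):
    """Sum over neutral words of dist(word, mean A) - dist(word, mean B).

    Negative values mean the neutral words sit closer to group A,
    positive values mean they sit closer to group B.
    """
    a = mean_vector(group_a_vectors)
    b = mean_vector(group_b_vectors)
    return sum(euclidean(w, a) - euclidean(w, b) for w in neutral_vectors)


# Made-up 2-D "embeddings" purely for illustration.
group_a = [(1.0, 0.0), (0.9, 0.1)]
group_b = [(0.0, 1.0), (0.1, 0.9)]
neutral = [(0.8, 0.2)]
print(relative_norm_distance(neutral, group_a, group_b))
```

Computing this score separately on embeddings trained for each decade would give a time series of the kind the abstract compares against census data.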
-
Autowarp: Learning a Warping Distance from Unlabeled Time Series Using Sequence Autoencoders
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2018
View details for Web of Science ID 000461852005015
-
Embedding for Informative Missingness: Deep Learning With Incomplete Data
IEEE. 2018: 437–45
View details for Web of Science ID 000461021200062
-
The Effects of Memory Replay in Reinforcement Learning
IEEE. 2018: 478–85
View details for Web of Science ID 000461021200067
-
Diabetes reversal by inhibition of the low-molecular-weight tyrosine phosphatase
NATURE CHEMICAL BIOLOGY
2017; 13 (6): 624-?
Abstract
Obesity-associated insulin resistance plays a central role in type 2 diabetes. As such, tyrosine phosphatases that dephosphorylate the insulin receptor (IR) are potential therapeutic targets. The low-molecular-weight protein tyrosine phosphatase (LMPTP) is a proposed IR phosphatase, yet its role in insulin signaling in vivo has not been defined. Here we show that global and liver-specific LMPTP deletion protects mice from high-fat diet-induced diabetes without affecting body weight. To examine the role of the catalytic activity of LMPTP, we developed a small-molecule inhibitor with a novel uncompetitive mechanism, a unique binding site at the opening of the catalytic pocket, and an exquisite selectivity over other phosphatases. This inhibitor is orally bioavailable, and it increases liver IR phosphorylation in vivo and reverses high-fat diet-induced diabetes. Our findings suggest that LMPTP is a key promoter of insulin resistance and that LMPTP inhibitors would be beneficial for treating type 2 diabetes.
View details for DOI 10.1038/nchembio.2344
View details for Web of Science ID 000401419300015
View details for PubMedID 28346406
-
Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation.
Nature methods
2017; 14 (3): 218-219
View details for DOI 10.1038/nmeth.4190
View details for PubMedID 28245214
-
NeuralFDR: Learning Discovery Thresholds from Hypothesis Features
NEURAL INFORMATION PROCESSING SYSTEMS (NIPS). 2017
View details for Web of Science ID 000452649401056
-
Quantifying unobserved protein-coding variants in human populations provides a roadmap for large-scale sequencing projects.
Nature communications
2016; 7: 13293
Abstract
As new proposals aim to sequence ever larger collections of humans, it is critical to have a quantitative framework to evaluate the statistical power of these projects. We developed a new algorithm, UnseenEst, and applied it to the exomes of 60,706 individuals to estimate the frequency distribution of all protein-coding variants, including rare variants that have not been observed yet in the current cohorts. Our results quantified the number of new variants that we expect to identify as sequencing cohorts reach hundreds of thousands of individuals. With 500K individuals, we find that we expect to capture 7.5% of all possible loss-of-function variants and 12% of all possible missense variants. We also estimate that 2,900 genes have loss-of-function frequency of <0.00001 in healthy humans, consistent with very strong intolerance to gene inactivation.
View details for DOI 10.1038/ncomms13293
View details for PubMedID 27796292
View details for PubMedCentralID PMC5095512
-
Limits on Active to Sterile Neutrino Oscillations from Disappearance Searches in the MINOS, Daya Bay, and Bugey-3 Experiments
PHYSICAL REVIEW LETTERS
2016; 117 (15)
Abstract
Searches for a light sterile neutrino have been performed independently by the MINOS and the Daya Bay experiments using the muon (anti)neutrino and electron antineutrino disappearance channels, respectively. In this Letter, results from both experiments are combined with those from the Bugey-3 reactor neutrino experiment to constrain oscillations into light sterile neutrinos. The three experiments are sensitive to complementary regions of parameter space, enabling the combined analysis to probe regions allowed by the Liquid Scintillator Neutrino Detector (LSND) and MiniBooNE experiments in a minimally extended four-neutrino flavor framework. Stringent limits on sin^{2}2θ_{μe} are set over 6 orders of magnitude in the sterile mass-squared splitting Δm_{41}^{2}. The sterile-neutrino mixing phase space allowed by the LSND and MiniBooNE experiments is excluded for Δm_{41}^{2}<0.8 eV^{2} at 95% CL_{s}.
View details for DOI 10.1103/PhysRevLett.117.151801
View details for PubMedID 27768356
-
Hierarchical Patterning of Multifunctional Conducting Polymer Nanoparticles as a Bionic Platform for Topographic Contact Guidance
ACS NANO
2015; 9 (2): 1767-1774
Abstract
The use of programmed electrical signals to influence biological events has been a widely accepted clinical methodology for neurostimulation. An optimal biocompatible platform for neural activation efficiently transfers electrical signals across the electrode-cell interface and also incorporates large-area neural guidance conduits. Inherently conducting polymers (ICPs) have emerged as frontrunners as soft biocompatible alternatives to traditionally used metal electrodes, which are highly invasive and elicit tissue damage over long-term implantation. However, fabrication techniques for the ICPs suffer a major bottleneck, which limits their usability and medical translation. Herein, we report that these limitations can be overcome using colloidal chemistry to fabricate multimodal conducting polymer nanoparticles. Furthermore, we demonstrate that these polymer nanoparticles can be precisely assembled into large-area linear conduits using surface chemistry. Finally, we validate that this platform can act as guidance conduits for neurostimulation, whereby the presence of electrical current induces remarkable dendritic axonal sprouting of cells.
View details for DOI 10.1021/nn506607x
View details for Web of Science ID 000349940500072
View details for PubMedID 25623615
-
Endovascular Repair With the Chimney Technique for Stanford Type B Aortic Dissection Involving Right-Sided Arch With Mirror Image Branching
JOURNAL OF ENDOVASCULAR THERAPY
2013; 20 (3): 283-288
Abstract
To report endovascular repair with the chimney technique of type B aortic dissection involving a right-sided aortic arch (RAA). Two hypertensive men aged 48 and 42 years with symptoms of aortic dissection resistant to medical therapy underwent emergent thoracic endovascular aortic repair with the chimney technique to extend the proximal landing zones. Both patients had right-sided arches with mirror image branching. One patient required a bare metal chimney stent to maintain perfusion to the right subclavian artery, while the other patient had a chimney stent to revascularize the right common carotid artery. Short-term follow-up (1 year and 1 month, respectively) showed that there was positive aortic remodeling, and the chimney stents were patent. Chimney TEVAR seems safe and effective for Stanford type B dissection in patients having RAA with mirror image branching and no sufficient proximal fixation zone.
View details for Web of Science ID 000320074100005
View details for PubMedID 23731297
-
Conversion of Human Fibroblasts to Functional Endothelial Cells by Defined Factors
ARTERIOSCLEROSIS THROMBOSIS AND VASCULAR BIOLOGY
2013; 33 (6): 1366
Abstract
Transdifferentiation of fibroblasts to endothelial cells (ECs) may provide a novel therapeutic avenue for diseases, including ischemia and fibrosis. Here, we demonstrate that human fibroblasts can be transdifferentiated into functional ECs by using only 2 factors, Oct4 and Klf4, under inductive signaling conditions. To determine whether human fibroblasts could be converted into ECs by transient expression of pluripotency factors, human neonatal fibroblasts were transduced with lentiviruses encoding Oct4 and Klf4 in the presence of soluble factors that promote the induction of an endothelial program. After 28 days, clusters of induced endothelial (iEnd) cells appeared and were isolated for further propagation and subsequent characterization. The iEnd cells resembled primary human ECs in their transcriptional signature by expressing endothelial phenotypic markers, such as CD31, vascular endothelial-cadherin, and von Willebrand Factor. Furthermore, the iEnd cells could incorporate acetylated low-density lipoprotein and form vascular structures in vitro and in vivo. When injected into the ischemic limb of mice, the iEnd cells engrafted, increased capillary density, and enhanced tissue perfusion. During the transdifferentiation process, the endogenous pluripotency network was not activated, suggesting that this process bypassed a pluripotent intermediate step. Pluripotent factor-induced transdifferentiation can be successfully applied for generating functional autologous ECs for therapeutic applications.
View details for DOI 10.1161/ATVBAHA.112.301167
View details for Web of Science ID 000319119500038
View details for PubMedID 23520160
-
Amino Acid Homeostasis Modulates Salicylic Acid-Associated Redox Status and Defense Responses in Arabidopsis
PLANT CELL
2010; 22 (11): 3845-3863
Abstract
The tight association between nitrogen status and pathogenesis has been broadly documented in plant-pathogen interactions. However, the interface between primary metabolism and disease responses remains largely unclear. Here, we show that knockout of a single amino acid transporter, LYSINE HISTIDINE TRANSPORTER1 (LHT1), is sufficient for Arabidopsis thaliana plants to confer a broad spectrum of disease resistance in a salicylic acid-dependent manner. We found that redox fine-tuning in photosynthetic cells was causally linked to the lht1 mutant-associated phenotypes. Furthermore, the enhanced resistance in lht1 could be attributed to a specific deficiency of its main physiological substrate, Gln, and not to a general nitrogen deficiency. Thus, by enabling nitrogen metabolism to moderate the cellular redox status, a plant primary metabolite, Gln, plays a crucial role in plant disease resistance.
View details for DOI 10.1105/tpc.110.079392
View details for Web of Science ID 000285576500025
View details for PubMedID 21097712
View details for PubMedCentralID PMC3015111
-
Alcoholic neurobiology: Changes in dependence and recovery
12th International Congress of the International-Society-for-Biomedical-Research-on-Alcoholism
WILEY-BLACKWELL. 2005: 1504–13
Abstract
This article presents the proceedings of a symposium held at the meeting of the International Society for Biomedical Research on Alcoholism (ISBRA) in Mannheim, Germany, in October 2004. Chronic alcoholism follows a fluctuating course, which provides a naturalistic experiment in vulnerability, resilience, and recovery of human neural systems in response to presence, absence, and history of the neurotoxic effects of alcoholism. Alcohol dependence is a progressive chronic disease that is associated with changes in neuroanatomy, neurophysiology, neural gene expression, psychology, and behavior. Specifically, alcohol dependence is characterized by a neuropsychological profile of mild to moderate impairment in executive functions, visuospatial abilities, and postural stability, together with relative sparing of declarative memory, language skills, and primary motor and perceptual abilities. Recovery from alcoholism is associated with a partial reversal of CNS deficits that occur in alcoholism. The reversal of deficits during recovery from alcoholism indicates that brain structure is capable of repair and restructuring in response to insult in adulthood. Indirect support of this repair model derives from studies of selective neuropsychological processes, structural and functional neuroimaging studies, and preclinical studies on degeneration and regeneration during the development of alcohol dependence and recovery from dependence. Genetics and brain regional specificity contribute to unique changes in neuropsychology and neuroanatomy in alcoholism and recovery. This symposium includes state-of-the-art presentations on changes that occur during active alcoholism as well as those that may occur during recovery-abstinence from alcohol dependence. Included are human neuroimaging and neuropsychological assessments, changes in human brain gene expression, allelic combinations of genes associated with alcohol dependence, preclinical studies investigating mechanisms of alcohol-induced neurotoxicity, and neuroprogenitor cell expansion during recovery from alcohol dependence.
View details for DOI 10.1097/01.alc.0000175013.50644.61
View details for Web of Science ID 000231767900018
View details for PubMedID 16156047
-
ANTI-IL-6 MONOCLONAL-ANTIBODIES PROTECT AGAINST LETHAL ESCHERICHIA-COLI INFECTION AND LETHAL TUMOR-NECROSIS-FACTOR-ALPHA CHALLENGE IN MICE
JOURNAL OF IMMUNOLOGY
1990; 145 (12): 4185-4191
Abstract
Potentially fatal physiologic and metabolic derangements can occur in response to bacterial infection in animals and man. Recently it has been shown that alterations in the levels of circulating cytokines such as IL-6 and TNF-alpha occur shortly after bacterial challenge. To understand better the role of IL-6 in inflammation, we investigated the effects of in vivo anti-mouse IL-6 antibody treatment in a mouse model of septic shock. Rat anti-mouse IL-6 neutralizing mAb was produced from splenocytes of an animal immunized with mouse rIL-6. This mAb, MP5-20F3, was a very potent and specific antagonist of mouse IL-6 in vitro bioactivity, demonstrated using the NFS60 myelomonocytic and KD83 plasmacytoma target cell lines, and also immunoprecipitated radiolabeled IL-6. Anti-IL-6 mAb pretreatment of mice subsequently challenged with lethal doses of i.p. Escherichia coli or i.v. TNF-alpha protected mice from death caused by these treatments. Pretreatment of E. coli-challenged mice with anti-IL-6 led to an increase in serum TNF bioactivity, in comparison to isotype control antibody, implicating IL-6 as a negative modulator of TNF in vivo. Anti-TNF-alpha treatment of mice challenged i.p. with live E. coli resulted in a 70% decrease in serum IL-6 levels, determined by immunoenzymetric assay, compared to control antibody, thereby supporting a role for TNF-alpha as a positive regulator of IL-6 levels. We conclude that IL-6 is a mediator in lethal E. coli infection, and suggest that antagonists of IL-6 may be beneficial therapeutically in life-threatening bacterial infection.
View details for Web of Science ID A1990EP04100033
View details for PubMedID 2124237