Oana M. Enache
Ph.D. Student in Biomedical Data Science, admitted Autumn 2021
Stanford Student Employee, Health Policy
Education & Certifications
-
MS, Duke University, Biostatistics
-
BA, University of California, Berkeley, Applied Mathematics
All Publications
-
Clinical Research Reporting Paradigms May Incompletely Describe Participant Identities.
American journal of epidemiology
2024
Abstract
Reporting of participants' baseline characteristics in clinical research is important for understanding a given study's context and typically occurs in a tabular format. However, this format incompletely and ambiguously describes included participants, as their identities are more fully represented by an intersecting set of sociodemographic characteristics rather than discrete characteristics in a table. Standard tabular reporting practices therefore introduce limitations in assessing a study's representativeness as well as its internal validity and external validity. To address this, we propose the addition of a simple graph that more clearly shows the joint distribution of baseline sociodemographic characteristics in a given study. We also discuss several practical considerations for the implementation of such graphs in the communication of clinical research.
View details for DOI 10.1093/aje/kwae291
View details for PubMedID 39191644
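The idea in the abstract above, that a table of separate characteristics hides how those characteristics intersect, can be sketched with a small count of joint categories. This is only an illustration under stated assumptions: the participant records, the field names (`sex`, `age_group`, `ethnicity`), and the helper `joint_counts` are hypothetical and do not come from the paper, which proposes a graph rather than this code.

```python
from collections import Counter

# Hypothetical participant records; the fields and values are illustrative only.
participants = [
    {"sex": "F", "age_group": "18-39", "ethnicity": "Hispanic"},
    {"sex": "F", "age_group": "18-39", "ethnicity": "Non-Hispanic"},
    {"sex": "M", "age_group": "40-64", "ethnicity": "Non-Hispanic"},
    {"sex": "F", "age_group": "40-64", "ethnicity": "Non-Hispanic"},
    {"sex": "M", "age_group": "40-64", "ethnicity": "Non-Hispanic"},
]


def joint_counts(records, keys):
    """Count participants in each intersection of the given characteristics."""
    return Counter(tuple(record[key] for key in keys) for record in records)


# Marginal view: what a conventional baseline table would report for one variable.
marginal_sex = joint_counts(participants, ("sex",))

# Joint view: how the same participants are spread across intersecting categories.
joint = joint_counts(participants, ("sex", "age_group", "ethnicity"))

for combination, count in sorted(joint.items()):
    print(combination, count)
```

The joint counts are the quantity a joint-distribution plot would display; the marginal counts alone cannot be used to recover them.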
-
A method for intelligent allocation of diagnostic testing by leveraging data from commercial wearable devices: a case study on COVID-19.
NPJ digital medicine
2022; 5 (1): 130
Abstract
Mass surveillance testing can help control outbreaks of infectious diseases such as COVID-19. However, diagnostic test shortages are prevalent globally and continue to occur in the US with the onset of new COVID-19 variants and emerging diseases like monkeypox, demonstrating an unprecedented need for improving our current methods for mass surveillance testing. By targeting surveillance testing toward individuals who are most likely to be infected and, thus, increasing the testing positivity rate (i.e., percent positive in the surveillance group), fewer tests are needed to capture the same number of positive cases. Here, we developed an Intelligent Testing Allocation (ITA) method by leveraging data from the CovIdentify study (6765 participants) and the MyPHD study (8580 participants), including smartwatch data from 1265 individuals of whom 126 tested positive for COVID-19. Our rigorous model and parameter search uncovered the optimal time periods and aggregate metrics for monitoring continuous digital biomarkers to increase the positivity rate of COVID-19 diagnostic testing. We found that resting heart rate (RHR) features distinguished between COVID-19-positive and -negative cases earlier in the course of the infection than steps features, as early as 10 and 5 days prior to the diagnostic test, respectively. We also found that including steps features increased the area under the receiver operating characteristic curve (AUC-ROC) by 7-11% when compared with RHR features alone, while including RHR features improved the AUC of the ITA model's precision-recall curve (AUC-PR) by 38-50% when compared with steps features alone. The best AUC-ROC (0.73±0.14 and 0.77 on the cross-validated training set and independent test set, respectively) and AUC-PR (0.55±0.21 and 0.24) were achieved by using data from a single device type (Fitbit) with high-resolution (minute-level) data. 
Finally, we show that ITA generates up to a 6.5-fold increase in the positivity rate in the cross-validated training set and up to a 4.5-fold increase in the positivity rate in the independent test set, including both symptomatic and asymptomatic (up to 27%) individuals. Our findings suggest that, if deployed on a large scale and without needing self-reported symptoms, the ITA method could improve the allocation of diagnostic testing resources and reduce the burden of test shortages.
View details for DOI 10.1038/s41746-022-00672-z
View details for PubMedID 36050372
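The fold increase in positivity rate reported in the abstract above is simple arithmetic: the share of positives among the individuals selected for testing, divided by the share of positives in the whole group. The sketch below shows that calculation on toy data. The scores, labels, and function names are hypothetical, and the ranking step merely stands in for whatever model produces the risk scores; it is not the authors' ITA implementation.

```python
def positivity_rate(labels):
    """Fraction of individuals who tested positive (1 = positive, 0 = negative)."""
    return sum(labels) / len(labels)


def fold_increase_at_budget(scores, labels, n_tests):
    """Positivity rate among the n_tests highest-scored individuals,
    divided by the positivity rate of the whole group."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    selected = [label for _, label in ranked[:n_tests]]
    return positivity_rate(selected) / positivity_rate(labels)


# Hypothetical risk scores and test outcomes for ten individuals.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05, 0.02, 0.01]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

baseline = positivity_rate(labels)
fold = fold_increase_at_budget(scores, labels, n_tests=4)
print(f"baseline positivity: {baseline:.2f}, fold increase with 4 tests: {fold:.2f}")
```

For scale, the abstract's smartwatch cohort of 1265 individuals with 126 positives corresponds to a baseline positivity rate of roughly 10%.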
-
Noncanonical open reading frames encode functional proteins essential for cancer cell survival
NATURE BIOTECHNOLOGY
2021; 39 (6): 697-+
Abstract
Although genomic analyses predict many noncanonical open reading frames (ORFs) in the human genome, it is unclear whether they encode biologically active proteins. Here we experimentally interrogated 553 candidates selected from noncanonical ORF datasets. Of these, 57 induced viability defects when knocked out in human cancer cell lines. Following ectopic expression, 257 showed evidence of protein expression and 401 induced gene expression changes. Clustered regularly interspaced short palindromic repeat (CRISPR) tiling and start codon mutagenesis indicated that their biological effects required translation as opposed to RNA-mediated effects. We found that one of these ORFs, G029442 (renamed glycine-rich extracellular protein-1, GREP1), encodes a secreted protein highly expressed in breast cancer, and its knockout in 263 cancer cell lines showed preferential essentiality in breast cancer-derived lines. The secretome of GREP1-expressing cells has an increased abundance of the oncogenic cytokine GDF15, and GDF15 supplementation mitigated the growth-inhibitory effect of GREP1 knockout. Our experiments suggest that noncanonical ORFs can express biologically active proteins that are potential therapeutic targets.
View details for DOI 10.1038/s41587-020-00806-2
View details for Web of Science ID 000612593200001
View details for PubMedID 33510483
View details for PubMedCentralID PMC8195866
-
Reply: Matters Arising 'Investigating sources of inaccuracy in wearable optical heart rate sensors'.
NPJ digital medicine
2021; 4 (1): 39
View details for DOI 10.1038/s41746-021-00409-4
View details for PubMedID 33637842
View details for PubMedCentralID PMC7910441
-
Cas9 activates the p53 pathway and selects for p53-inactivating mutations
NATURE GENETICS
2020; 52 (7): 662-+
Abstract
Cas9 is commonly introduced into cell lines to enable CRISPR-Cas9-mediated genome editing. Here, we studied the genetic and transcriptional consequences of Cas9 expression itself. Gene expression profiling of 165 pairs of human cancer cell lines and their Cas9-expressing derivatives revealed upregulation of the p53 pathway upon introduction of Cas9, specifically in wild-type TP53 (TP53-WT) cell lines. This was confirmed at the messenger RNA and protein levels. Moreover, elevated levels of DNA repair were observed in Cas9-expressing cell lines. Genetic characterization of 42 cell line pairs showed that introduction of Cas9 can lead to the emergence and expansion of p53-inactivating mutations. This was confirmed by competition experiments in isogenic TP53-WT and TP53-null (TP53-/-) cell lines. Lastly, Cas9 was less active in TP53-WT than in TP53-mutant cell lines, and Cas9-induced p53 pathway activation affected cellular sensitivity to both genetic and chemical perturbations. These findings may have broad implications for the proper use of CRISPR-Cas9-mediated genome editing.
View details for DOI 10.1038/s41588-020-0623-4
View details for Web of Science ID 000533846800003
View details for PubMedID 32424350
View details for PubMedCentralID PMC7343612
-
Adding to the CASeload: unwarranted p53 signaling induced by Cas9.
Molecular & cellular oncology
2020; 7 (5): 1789419
Abstract
We investigated the genetic and transcriptional changes associated with Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) associated protein 9 (Cas9) expression in human cancer cell lines. For a subset of cell lines with a wild-type tumor protein TP53 (best known as p53), we detected p53 pathway activation, DNA damage accumulation and emerging p53-inactivating mutations following Cas9 introduction. We discuss the potential implications of our findings in basic and translational research.
View details for DOI 10.1080/23723556.2020.1789419
View details for PubMedID 32944644
View details for PubMedCentralID PMC7469564
-
The GCTx format and cmap{Py, R, M, J} packages: resources for optimized storage and integrated traversal of annotated dense matrices
BIOINFORMATICS
2019; 35 (8): 1427-1429
Abstract
Facilitated by technological improvements, pharmacologic and genetic perturbational datasets have grown in recent years to include millions of experiments. Sharing and publicly distributing these diverse data creates many opportunities for discovery, but in recent years the unprecedented size of data generated and its complex associated metadata have also created data storage and integration challenges. We present the GCTx file format and a suite of open-source packages for the efficient storage, serialization and analysis of dense two-dimensional matrices. We have extensively used the format in the Connectivity Map to assemble and share massive datasets currently comprising 1.3 million experiments, and we anticipate that the format's generalizability, paired with code libraries that we provide, will lower barriers for integrated cross-assay analysis and algorithm development. Software packages (available in Python, R, Matlab and Java) are freely available at https://github.com/cmap. Additional instructions, tutorials and datasets are available at clue.io/code. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/bty784
View details for Web of Science ID 000473691900025
View details for PubMedID 30203022
View details for PubMedCentralID PMC6477971
-
Bioconda: sustainable and comprehensive software distribution for the life sciences.
Nature methods
2018; 15 (7): 475–76
View details for DOI 10.1038/s41592-018-0046-7
View details for PubMedID 29967506
-
A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles
CELL
2017; 171 (6): 1437-+
Abstract
We previously piloted the concept of a Connectivity Map (CMap), whereby genes, drugs, and disease states are connected by virtue of common gene-expression signatures. Here, we report more than a 1,000-fold scale-up of the CMap as part of the NIH LINCS Consortium, made possible by a new, low-cost, high-throughput reduced representation expression profiling method that we term L1000. We show that L1000 is highly reproducible, comparable to RNA sequencing, and suitable for computational inference of the expression levels of 81% of non-measured transcripts. We further show that the expanded CMap can be used to discover mechanism of action of small molecules, functionally annotate genetic variants of disease genes, and inform clinical trials. The 1.3 million L1000 profiles described here, as well as tools for their analysis, are available at https://clue.io.
View details for DOI 10.1016/j.cell.2017.10.049
View details for Web of Science ID 000417362700023
View details for PubMedID 29195078
-
A robust prognostic signature for hormone-positive node-negative breast cancer
GENOME MEDICINE
2013; 5: 92
Abstract
Systemic chemotherapy in the adjuvant setting can cure breast cancer in some patients that would otherwise recur with incurable, metastatic disease. However, since only a fraction of patients would have recurrence after surgery alone, the challenge is to stratify high-risk patients (who stand to benefit from systemic chemotherapy) from low-risk patients (who can safely be spared treatment related toxicities and costs). We focus here on risk stratification in node-negative, ER-positive, HER2-negative breast cancer. We use a large database of publicly available microarray datasets to build a random forests classifier and develop a robust multi-gene mRNA transcription-based predictor of relapse free survival at 10 years, which we call the Random Forests Relapse Score (RFRS). Performance was assessed by internal cross-validation, multiple independent data sets, and comparison to existing algorithms using receiver-operating characteristic and Kaplan-Meier survival analysis. Internal redundancy of features was determined using k-means clustering to define optimal signatures with smaller numbers of primary genes, each with multiple alternates. Internal OOB cross-validation for the initial (full-gene-set) model on training data reported an ROC AUC of 0.704, which was comparable to or better than those reported previously or obtained by applying existing methods to our dataset. Three risk groups with probability cutoffs for low, intermediate, and high-risk were defined. Survival analysis determined a highly significant difference in relapse rate between these risk groups. Validation of the models against independent test datasets showed highly similar results. Smaller 17-gene and 8-gene optimized models were also developed with minimal reduction in performance. Furthermore, the signature was shown to be almost equally effective on both hormone-treated and untreated patients. RFRS allows flexibility in both the number and identity of genes utilized from thousands to as few as 17 or eight genes, each with multiple alternatives. The RFRS reports a probability score strongly correlated with risk of relapse. This score could therefore be used to assign systemic chemotherapy specifically to those high-risk patients most likely to benefit from further treatment.
View details for DOI 10.1186/gm496
View details for Web of Science ID 000326549600002
View details for PubMedID 24112773
View details for PubMedCentralID PMC3961800
-
Modeling precision treatment of breast cancer
GENOME BIOLOGY
2013; 14 (10): R110
Abstract
First-generation molecular profiles for human breast cancers have enabled the identification of features that can predict therapeutic response; however, little is known about how the various data types can best be combined to yield optimal predictors. Collections of breast cancer cell lines mirror many aspects of breast cancer molecular pathobiology, and measurements of their omic and biological therapeutic responses are well-suited for development of strategies to identify the most predictive molecular feature sets. We used least squares-support vector machines and random forest algorithms to identify molecular features associated with responses of a collection of 70 breast cancer cell lines to 90 experimental or approved therapeutic agents. The datasets analyzed included measurements of copy number aberrations, mutations, gene and isoform expression, promoter methylation and protein expression. Transcriptional subtype contributed strongly to response predictors for 25% of compounds, and adding other molecular data types improved prediction for 65%. No single molecular dataset consistently out-performed the others, suggesting that therapeutic response is mediated at multiple levels in the genome. Response predictors were developed and applied to TCGA data, and were found to be present in subsets of those patient samples. These results suggest that matching patients to treatments based on transcriptional subtype will improve response rates, and inclusion of additional features from other profiling data types may provide additional benefit. Further, we suggest a systems biology strategy for guiding clinical trials so that patient cohorts most likely to respond to new therapies may be more efficiently identified.
View details for DOI 10.1186/gb-2013-14-10-r110
View details for Web of Science ID 000329387500003
View details for PubMedID 24176112
View details for PubMedCentralID PMC3937590