In 2014 I received my BS in Industrial Engineering from the Universitat Politècnica de Catalunya in Barcelona, where I majored in power electronics and signals. During my undergraduate, I served as the Team Captain of the ETSEIB Formula Student team, where I helped develop the first electric vehicle for the school. After graduating, I went on to obtain an MS from the Electrical & Computer Engineering department at Texas A&M University in College Station, Texas. There, I joined the Center for Bioinformatics and Genomic Systems Engineering, where I was involved in several computational genomics research projects under Dr. Aniruddha Datta. In 2015, I was awarded the “la Caixa” fellowship to pursue research in computational biology during my Ph.D. as a member of the Goutsias lab, developing computational methods to study epigenetic signatures in close collaboration with the Feinberg lab of the Johns Hopkins University School of Medicine. In addition to my research and Ph.D. coursework, I earned an MS focused on statistical learning from the Applied Mathematics & Statistics department at Johns Hopkins University in 2018. After successfully defending my dissertation entitled “Statistical Signal Processing Methods for Epigenetic Landscape Analysis” on May 10 of 2021, I joined Salzman’s lab at Stanford University as a Postdoctoral Research Fellow being awarded the postdoctoral fellowship from the Center for Computational, Evolutionary and Human Genomics.
Honors & Awards
Fellow, La Caixa Foundation (2016)
Postdoctoral Research Fellow, Center for Computational, Evolutionary & Human Genomics (9/2021)
Julia Salzman, Postdoctoral Faculty Sponsor
Estimating DNA methylation potential energy landscapes from nanopore sequencing data
2021; 11 (1): 21619
High-throughput third-generation nanopore sequencing devices have enormous potential for simultaneously observing epigenetic modifications in human cells over large regions of the genome. However, signals generated by these devices are subject to considerable noise that can lead to unsatisfactory detection performance and hamper downstream analysis. Here we develop a statistical method, CpelNano, for the quantification and analysis of 5mC methylation landscapes using nanopore data. CpelNano takes into account nanopore noise by means of a hidden Markov model (HMM) in which the true but unknown ("hidden") methylation state is modeled through an Ising probability distribution that is consistent with methylation means and pairwise correlations, whereas nanopore current signals constitute the observed state. It then estimates the associated methylation potential energy function by employing the expectation-maximization (EM) algorithm and performs differential methylation analysis via permutation-based hypothesis testing. Using simulations and analysis of published data obtained from three human cell lines (GM12878, MCF-10A, and MDA-MB-231), we show that CpelNano can faithfully estimate DNA methylation potential energy landscapes, substantially improving current methods and leading to a powerful tool for the modeling and analysis of epigenetic landscapes using nanopore sequencing data.
View details for DOI 10.1038/s41598-021-00781-x
View details for Web of Science ID 000714415600050
View details for PubMedID 34732768
View details for PubMedCentralID PMC8566571
Converging genetic and epigenetic drivers of paediatric acute lymphoblastic leukaemia identified by an information-theoretic analysis
NATURE BIOMEDICAL ENGINEERING
2021; 5 (4): 360-376
In cancer, linking epigenetic alterations to drivers of transformation has been difficult, in part because DNA methylation analyses must capture epigenetic variability, which is central to tumour heterogeneity and tumour plasticity. Here, by conducting a comprehensive analysis, based on information theory, of differences in methylation stochasticity in samples from patients with paediatric acute lymphoblastic leukaemia (ALL), we show that ALL epigenomes are stochastic and marked by increased methylation entropy at specific regulatory regions and genes. By integrating DNA methylation and single-cell gene-expression data, we arrived at a relationship between methylation entropy and gene-expression variability, and found that epigenetic changes in ALL converge on a shared set of genes that overlap with genetic drivers involved in chromosomal translocations across the disease spectrum. Our findings suggest that an epigenetically driven gene-regulation network, with UHRF1 (ubiquitin-like with PHD and RING finger domains 1) as a central node, links genetic drivers and epigenetic mediators in ALL.
View details for DOI 10.1038/s41551-021-00703-2
View details for Web of Science ID 000640471800001
View details for PubMedID 33859388
View details for PubMedCentralID PMC8370714
Detection of haplotype-dependent allele-specific DNA methylation in WGBS data
2020; 11 (1): 5238
In heterozygous genomes, allele-specific measurements can reveal biologically significant differences in DNA methylation between homologous alleles associated with local changes in genetic sequence. Current approaches for detecting such events from whole-genome bisulfite sequencing (WGBS) data perform statistically independent marginal analysis at individual cytosine-phosphate-guanine (CpG) sites, thus ignoring correlations in the methylation state, or carry-out a joint statistical analysis of methylation patterns at four CpG sites producing unreliable statistical evidence. Here, we employ the one-dimensional Ising model of statistical physics and develop a method for detecting allele-specific methylation (ASM) events within segments of DNA containing clusters of linked single-nucleotide polymorphisms (SNPs), called haplotypes. Comparisons with existing approaches using simulated and real WGBS data show that our method provides an improved fit to data, especially when considering large haplotypes. Importantly, the method employs robust hypothesis testing for detecting statistically significant imbalances in mean methylation level and methylation entropy, as well as for identifying haplotypes for which the genetic variant carries significant information about the methylation state. As such, our ASM analysis approach can potentially lead to biological discoveries with important implications for the genetics of complex human diseases.
View details for DOI 10.1038/s41467-020-19077-1
View details for Web of Science ID 000582056600001
View details for PubMedID 33067439
View details for PubMedCentralID PMC7567826
A Dysregulated DNA Methylation Landscape Linked to Gene Expression in MLL-Rearranged AML.
Translocations of the KMT2A (MLL) gene define a biologically distinct and clinically aggressive subtype of acute myeloid leukaemia (AML), marked by a characteristic gene expression profile and few cooperating mutations. Although dysregulation of the epigenetic landscape in this leukaemia is particularly interesting given the low mutation frequency, its comprehensive analysis using whole genome bisulphite sequencing (WGBS) has not been previously performed. Here we investigated epigenetic dysregulation in nine MLL-rearranged (MLL-r) AML samples by comparing them to six normal myeloid controls, using a computational method that encapsulates mean DNA methylation measurements along with analyses of methylation stochasticity. We discovered a dramatically altered epigenetic profile in MLL-r AML, associated with genome-wide hypomethylation and a markedly increased DNA methylation entropy reflecting an increasingly disordered epigenome. Methylation discordance mapped to key genes and regulatory elements that included bivalent promoters and active enhancers. Genes associated with significant changes in methylation stochasticity recapitulated known MLL-r AML expression signatures, suggesting a role for the altered epigenetic landscape in the transcriptional programme initiated by MLL translocations. Accordingly, we established statistically significant associations between discordances in methylation stochasticity and gene expression in MLL-r AML, thus providing a link between the altered epigenetic landscape and the phenotype.
View details for DOI 10.1080/15592294.2020.1734149
View details for PubMedID 32114880
Community assessment to advance computational prediction of cancer drug combinations in a pharmacogenomic screen
2019; 10: 2674
The effectiveness of most cancer targeted therapies is short-lived. Tumors often develop resistance that might be overcome with drug combinations. However, the number of possible combinations is vast, necessitating data-driven approaches to find optimal patient-specific treatments. Here we report AstraZeneca's large drug combination dataset, consisting of 11,576 experiments from 910 combinations across 85 molecularly characterized cancer cell lines, and results of a DREAM Challenge to evaluate computational strategies for predicting synergistic drug pairs and biomarkers. 160 teams participated to provide a comprehensive methodological development and benchmarking. Winning methods incorporate prior knowledge of drug-target interactions. Synergy is predicted with an accuracy matching biological replicates for >60% of combinations. However, 20% of drug combinations are poorly predicted by all methods. Genomic rationale for synergy predictions are identified, including ADAM17 inhibitor antagonism when combined with PIK3CB/D inhibition contrasting to synergy when combined with other PI3K-pathway inhibitors in PIK3CA mutant cells.
View details for DOI 10.1038/s41467-019-09799-2
View details for Web of Science ID 000471758500010
View details for PubMedID 31209238
View details for PubMedCentralID PMC6572829
Ranking genomic features using an information-theoretic measure of epigenetic discordance
2019; 20: 175
Establishment and maintenance of DNA methylation throughout the genome is an important epigenetic mechanism that regulates gene expression whose disruption has been implicated in human diseases like cancer. It is therefore crucial to know which genes, or other genomic features of interest, exhibit significant discordance in DNA methylation between two phenotypes. We have previously proposed an approach for ranking genes based on methylation discordance within their promoter regions, determined by centering a window of fixed size at their transcription start sites. However, we cannot use this method to identify statistically significant genomic features and handle features of variable length and with missing data.We present a new approach for computing the statistical significance of methylation discordance within genomic features of interest in single and multiple test/reference studies. We base the proposed method on a well-articulated hypothesis testing problem that produces p- and q-values for each genomic feature, which we then use to identify and rank features based on the statistical significance of their epigenetic dysregulation. We employ the information-theoretic concept of mutual information to derive a novel test statistic, which we can evaluate by computing Jensen-Shannon distances between the probability distributions of methylation in a test and a reference sample. We design the proposed methodology to simultaneously handle biological, statistical, and technical variability in the data, as well as variable feature lengths and missing data, thus enabling its wide-spread use on any list of genomic features. This is accomplished by estimating, from reference data, the null distribution of the test statistic as a function of feature length using generalized additive regression models. Differential assessment, using normal/cancer data from healthy fetal tissue and pediatric high-grade glioma patients, illustrates the potential of our approach to greatly facilitate the exploratory phases of clinically and biologically relevant methylation studies.The proposed approach provides the first computational tool for statistically testing and ranking genomic features of interest based on observed DNA methylation discordance in comparative studies that accounts, in a rigorous manner, for biological, statistical, and technical variability in methylation data, as well as for variability in feature length and for missing data.
View details for DOI 10.1186/s12859-019-2777-6
View details for Web of Science ID 000464754500001
View details for PubMedID 30961526
View details for PubMedCentralID PMC6454630
An information-theoretic approach to the modeling and analysis of whole-genome bisulfite sequencing data
2018; 19: 87
DNA methylation is a stable form of epigenetic memory used by cells to control gene expression. Whole genome bisulfite sequencing (WGBS) has emerged as a gold-standard experimental technique for studying DNA methylation by producing high resolution genome-wide methylation profiles. Statistical modeling and analysis is employed to computationally extract and quantify information from these profiles in an effort to identify regions of the genome that demonstrate crucial or aberrant epigenetic behavior. However, the performance of most currently available methods for methylation analysis is hampered by their inability to directly account for statistical dependencies between neighboring methylation sites, thus ignoring significant information available in WGBS reads.We present a powerful information-theoretic approach for genome-wide modeling and analysis of WGBS data based on the 1D Ising model of statistical physics. This approach takes into account correlations in methylation by utilizing a joint probability model that encapsulates all information available in WGBS methylation reads and produces accurate results even when applied on single WGBS samples with low coverage. Using the Shannon entropy, our approach provides a rigorous quantification of methylation stochasticity in individual WGBS samples genome-wide. Furthermore, it utilizes the Jensen-Shannon distance to evaluate differences in methylation distributions between a test and a reference sample. Differential performance assessment using simulated and real human lung normal/cancer data demonstrate a clear superiority of our approach over DSS, a recently proposed method for WGBS data analysis. Critically, these results demonstrate that marginal methods become statistically invalid when correlations are present in the data.This contribution demonstrates clear benefits and the necessity of modeling joint probability distributions of methylation using the 1D Ising model of statistical physics and of quantifying methylation stochasticity using concepts from information theory. By employing this methodology, substantial improvement of DNA methylation analysis can be achieved by effectively taking into account the massive amount of statistical information available in WGBS data, which is largely ignored by existing methods.
View details for DOI 10.1186/s12859-018-2086-5
View details for Web of Science ID 000427154900001
View details for PubMedID 29514626
View details for PubMedCentralID PMC5842653
HiMMe: using genetic patterns as a proxy for genome assembly reliability assessment
2017; 18: 694
The information content of genomes plays a crucial role in the existence and proper development of living organisms. Thus, tremendous effort has been dedicated to developing DNA sequencing technologies that provide a better understanding of the underlying mechanisms of cellular processes. Advances in the development of sequencing technology have made it possible to sequence genomes in a relatively fast and inexpensive way. However, as with any measurement technology, there is noise involved and this needs to be addressed to reach conclusions based on the resulting data. In addition, there are multiple intermediate steps and degrees of freedom when constructing genome assemblies that lead to ambiguous and inconsistent results among assemblers.Here we introduce HiMMe, an HMM-based tool that relies on genetic patterns to score genome assemblies. Through a Markov chain, the model is able to detect characteristic genetic patterns, while, by introducing emission probabilities, the noise involved in the process is taken into account. Prior knowledge can be used by training the model to fit a given organism or sequencing technology.Our results show that the method presented is able to recognize patterns even with relatively small k-mer size choices and limited computational resources.Our methodology provides an individual quality metric per contig in addition to an overall genome assembly score, with a time complexity well below that of an aligner. Ultimately, HiMMe provides meaningful statistical insights that can be leveraged by researchers to better select contigs and genome assemblies for downstream analysis.
View details for DOI 10.1186/s12864-017-3965-2
View details for Web of Science ID 000409208100002
View details for PubMedID 28874136
View details for PubMedCentralID PMC5584555
- Computational Considerations in Transcriptome Assemblies and Their Evaluation, using High Quality Human RNA-Seq data ASSOC COMPUTING MACHINERY. 2016