
Soumya Kundu
Ph.D. Student in Computer Science, admitted Autumn 2018
Education & Certifications
-
Master of Science, University of Connecticut, Computer Science and Engineering (2018)
-
Bachelor of Science, University of Connecticut, Computer Science and Engineering (2018)
All Publications
-
CXCL12 drives natural variation in coronary artery anatomy across diverse populations.
Cell
2025
Abstract
Coronary arteries have a specific branching pattern crucial for oxygenating heart muscle. Among humans, there is natural variation in coronary anatomy with respect to perfusion of the inferior/posterior left heart, which can branch from either the right arterial tree, the left, or both (a phenotype known as coronary dominance). Using angiographic data for >60,000 US veterans of diverse ancestry, we conducted a genome-wide association study of coronary dominance, revealing moderate heritability and identifying ten significant loci. The strongest association occurred near CXCL12 in both European- and African-ancestry cohorts, with downstream analyses implicating effects on CXCL12 expression. We show that CXCL12 is expressed in human fetal hearts at the time dominance is established. Reducing Cxcl12 in mice altered coronary dominance and caused septal arteries to develop away from Cxcl12 expression domains. These findings indicate that CXCL12 patterns human coronary arteries, paving the way for "medical revascularization" through targeting developmental pathways.
View details for DOI 10.1016/j.cell.2025.02.005
View details for PubMedID 40049164
-
Mapping the regulatory effects of common and rare non-coding variants across cellular and developmental contexts in the brain and heart.
bioRxiv : the preprint server for biology
2025
Abstract
Whole genome sequencing has identified over a billion non-coding variants in humans, while GWAS has revealed the non-coding genome as a significant contributor to disease. However, prioritizing causal common and rare non-coding variants in human disease, and understanding how selective pressures have shaped the non-coding genome, remains a significant challenge. Here, we predicted the effects of 15 million variants with deep learning models trained on single-cell ATAC-seq across 132 cellular contexts in adult and fetal brain and heart, producing nearly two billion context-specific predictions. Using these predictions, we distinguish candidate causal variants underlying human traits and diseases and their context-specific effects. While common variant effects are more cell-type-specific, rare variants exert more cell-type-shared regulatory effects, with selective pressures particularly targeting variants affecting fetal brain neurons. To prioritize de novo mutations with extreme regulatory effects, we developed FLARE, a context-specific functional genomic model of constraint. FLARE outperformed other methods in prioritizing case mutations from autism-affected families near syndromic autism-associated genes; for example, identifying mutation outliers near CNTNAP2 that would be missed by alternative approaches. Overall, our findings demonstrate the potential of integrating single-cell maps with population genetics and deep learning-based variant effect prediction to elucidate mechanisms of development and disease, ultimately supporting the notion that genetic contributions to neurodevelopmental disorders are predominantly rare.
View details for DOI 10.1101/2025.02.18.638922
View details for PubMedID 40027628
View details for PubMedCentralID PMC11870466
-
ChromBPNet: bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants.
bioRxiv : the preprint server for biology
2025
Abstract
Despite extensive mapping of cis-regulatory elements (cREs) across cellular contexts with chromatin accessibility assays, the sequence syntax and genetic variants that regulate transcription factor (TF) binding and chromatin accessibility at context-specific cREs remain elusive. We introduce ChromBPNet, a deep learning DNA sequence model of base-resolution accessibility profiles that detects, learns and deconvolves assay-specific enzyme biases from regulatory sequence determinants of accessibility, enabling robust discovery of compact TF motif lexicons, cooperative motif syntax and precision footprints across assays and sequencing depths. Extensive benchmarks show that ChromBPNet, despite its lightweight design, is competitive with much larger contemporary models at predicting variant effects on chromatin accessibility, pioneer TF binding and reporter activity across assays, cell contexts and ancestry, while providing interpretation of disrupted regulatory syntax. ChromBPNet also helps prioritize and interpret regulatory variants that influence complex traits and rare diseases, thereby providing a powerful lens to decode regulatory DNA and genetic variation.
View details for DOI 10.1101/2024.12.25.630221
View details for PubMedID 39829783
View details for PubMedCentralID PMC11741299
-
Molecular convergence of risk variants for congenital heart defects leveraging a regulatory map of the human fetal heart.
medRxiv : the preprint server for health sciences
2024
Abstract
Congenital heart defects (CHD) arise in part due to inherited genetic variants that alter genes and noncoding regulatory elements in the human genome. These variants are thought to act during fetal development to influence the formation of different heart structures. However, identifying the genes, pathways, and cell types that mediate these effects has been challenging due to the immense diversity of cell types involved in heart development as well as the superimposed complexities of interpreting noncoding sequences. As such, understanding the molecular functions of both noncoding and coding variants remains paramount to our fundamental understanding of cardiac development and CHD. Here, we created a gene regulation map of the healthy human fetal heart across developmental time, and applied it to interpret the functions of variants associated with CHD and quantitative cardiac traits. We collected single-cell multiomic data from 734,000 single cells sampled from 41 fetal hearts spanning post-conception weeks 6 to 22, enabling the construction of gene regulation maps in 90 cardiac cell types and states, including rare populations of cardiac conduction cells. Through an unbiased analysis of all 90 cell types, we find that both rare coding variants associated with CHD and common noncoding variants associated with valve traits converge to affect valvular interstitial cells (VICs). VICs are enriched for high expression of known CHD genes previously identified through mapping of rare coding variants. Eight CHD genes, as well as other genes in similar molecular pathways, are linked to common noncoding variants associated with other valve diseases or traits via enhancers in VICs. In addition, certain common noncoding variants impact enhancers with activities highly specific to particular subanatomic structures in the heart, illuminating how such variants can impact specific aspects of heart structure and function. 
Together, these results implicate new enhancers, genes, and cell types in the genetic etiology of CHD, identify molecular convergence of common noncoding and rare coding variants on VICs, and suggest a more expansive view of the cell types instrumental in genetic risk for CHD, beyond the working cardiomyocyte. This regulatory map of the human fetal heart will provide a foundational resource for understanding cardiac development, interpreting genetic variants associated with heart disease, and discovering targets for cell-type specific therapies.
View details for DOI 10.1101/2024.11.20.24317557
View details for PubMedID 39606363
View details for PubMedCentralID PMC11601760
-
Single cell variant to enhancer to gene map for coronary artery disease.
medRxiv : the preprint server for health sciences
2024
Abstract
Although genome wide association studies (GWAS) in large populations have identified hundreds of variants associated with common diseases such as coronary artery disease (CAD), most disease-associated variants lie within non-coding regions of the genome, rendering it difficult to determine the downstream causal gene and cell type. Here, we performed paired single nucleus gene expression and chromatin accessibility profiling from 44 human coronary arteries. To link disease variants to molecular traits, we developed a meta-map of 88 samples and discovered 11,182 single-cell chromatin accessibility quantitative trait loci (caQTLs). Heritability enrichment analysis and disease variant mapping demonstrated that smooth muscle cells (SMCs) harbor the greatest genetic risk for CAD. To capture the continuum of SMC cell states in disease, we used dynamic single cell caQTL modeling for the first time in tissue to uncover QTLs whose effects are modified by cell state and expand our insight into genetic regulation of heterogenous cell populations. Notably, we identified a variant in the COL4A1/COL4A2 CAD GWAS locus which becomes a caQTL as SMCs de-differentiate by changing a transcription factor binding site for EGR1/2. To unbiasedly prioritize functional candidate genes, we built a genome-wide single cell variant to enhancer to gene (scV2E2G) map for human CAD to link disease variants to causal genes in cell types. Using this approach, we found several hundred genes predicted to be linked to disease variants in different cell types. Next, we performed genome-wide Hi-C in 16 human coronary arteries to build tissue specific maps of chromatin conformation and link disease variants to integrated chromatin hubs and distal target genes. Using this approach, we show that rs4887091 within the ADAMTS7 CAD GWAS locus modulates function of a super chromatin interactome through a change in a CTCF binding site. 
Finally, we used CRISPR interference to validate a distal gene, AMOTL2, linked to a CAD GWAS locus. Collectively, we provide a disease-agnostic framework to translate human genetic findings to identify pathologic cell states and genes driving disease, producing a comprehensive scV2E2G map with genetic and tissue level convergence for future mechanistic and therapeutic studies.
View details for DOI 10.1101/2024.11.13.24317257
View details for PubMedID 39606421
View details for PubMedCentralID PMC11601770
-
Deciphering the impact of genomic variation on function.
Nature
2024; 633 (8028): 47-57
Abstract
Our genomes influence nearly every aspect of human biology, from molecular and cellular functions to phenotypes in health and disease. Studying the differences in DNA sequence between individuals (genomic variation) could reveal previously unknown mechanisms of human biology, uncover the basis of genetic predispositions to diseases, and guide the development of new diagnostic tools and therapeutic agents. Yet, understanding how genomic variation alters genome function to influence phenotype has proved challenging. To unlock these insights, we need a systematic and comprehensive catalogue of genome function and the molecular and cellular effects of genomic variants. Towards this goal, the Impact of Genomic Variation on Function (IGVF) Consortium will combine approaches in single-cell mapping, genomic perturbations and predictive modelling to investigate the relationships among genomic variation, genome function and phenotypes. IGVF will create maps across hundreds of cell types and states describing how coding variants alter protein activity, how noncoding variants change the regulation of gene expression, and how such effects connect through gene-regulatory and protein-interaction networks. These experimental data, computational predictions and accompanying standards and pipelines will be integrated into an open resource that will catalyse community efforts to explore how our genomes influence biology and disease across populations.
View details for DOI 10.1038/s41586-024-07510-0
View details for PubMedID 39232149
View details for PubMedCentralID 7405896
-
Detection and analysis of complex structural variation in human genomes across populations and in brains of donors with psychiatric disorders
Cell
2024; Published online September 30, 2024
View details for DOI 10.1016/j.cell.2024.09.014
-
Transcriptomics and chromatin accessibility in multiple African population samples.
bioRxiv : the preprint server for biology
2023
Abstract
Mapping the functional human genome and impact of genetic variants is often limited to European-descendent population samples. To aid in overcoming this limitation, we measured gene expression using RNA sequencing in lymphoblastoid cell lines (LCLs) from 599 individuals from six African populations to identify novel transcripts including those not represented in the hg38 reference genome. We used whole genomes from the 1000 Genomes Project and 164 Maasai individuals to identify 8,881 expression and 6,949 splicing quantitative trait loci (eQTLs/sQTLs), and 2,611 structural variants associated with gene expression (SV-eQTLs). We further profiled chromatin accessibility using ATAC-Seq in a subset of 100 representative individuals, to identify chromatin accessibility quantitative trait loci (caQTLs) and allele-specific chromatin accessibility, and provide predictions for the functional effect of 78.9 million variants on chromatin accessibility. Using this map of eQTLs and caQTLs we fine-mapped GWAS signals for a range of complex diseases. Combined, this work expands global functional genomic data to identify novel transcripts, functional elements and variants, understand population genetic history of molecular quantitative trait loci, and further resolve the genetic basis of multiple human traits and disease.
View details for DOI 10.1101/2023.11.04.564839
View details for PubMedID 37986808
View details for PubMedCentralID PMC10659267
-
CXCL12 regulates coronary artery dominance in diverse populations and links development to disease.
medRxiv : the preprint server for health sciences
2023
Abstract
Mammalian cardiac muscle is supplied with blood by right and left coronary arteries that form branches covering both ventricles of the heart. Whether branches of the right or left coronary arteries wrap around to the inferior side of the left ventricle is variable in humans and termed right or left dominance. Coronary dominance is likely a heritable trait, but its genetic architecture has never been explored. Here, we present the first large-scale multi-ancestry genome-wide association study of dominance in 61,043 participants of the VA Million Veteran Program, including over 10,300 Africans and 4,400 Admixed Americans. Dominance was moderately heritable with ten loci reaching genome wide significance. The most significant mapped to the chemokine CXCL12 in both Europeans and Africans. Whole-organ imaging of human fetal hearts revealed that dominance is established during development in locations where CXCL12 is expressed. In mice, dominance involved the septal coronary artery, and its patterning was altered with Cxcl12 deficiency. Finally, we linked human dominance patterns with coronary artery disease through colocalization, genome-wide genetic correlation and Mendelian Randomization analyses. Together, our data supports CXCL12 as a primary determinant of coronary artery dominance in humans of diverse backgrounds and suggests that developmental patterning of arteries may influence one's susceptibility to ischemic heart disease.
View details for DOI 10.1101/2023.10.27.23297507
View details for PubMedID 37961706
View details for PubMedCentralID PMC10635223
-
Integrative single-cell analysis of cardiogenesis identifies developmental trajectories and non-coding mutations in congenital heart disease.
Cell
2022; 185 (26): 4937
Abstract
To define the multi-cellular epigenomic and transcriptional landscape of cardiac cellular development, we generated single-cell chromatin accessibility maps of human fetal heart tissues. We identified eight major differentiation trajectories involving primary cardiac cell types, each associated with dynamic transcription factor (TF) activity signatures. We contrasted regulatory landscapes of iPSC-derived cardiac cell types and their in vivo counterparts, which enabled optimization of in vitro differentiation of epicardial cells. Further, we interpreted sequence-based deep learning models of cell-type-resolved chromatin accessibility profiles to decipher underlying TF motif lexicons. De novo mutations predicted to affect chromatin accessibility in arterial endothelium were enriched in congenital heart disease (CHD) cases vs. controls. In vitro studies in iPSCs validated the functional impact of identified variation on the predicted developmental cell types. This work thus defines the cell-type-resolved cis-regulatory sequence determinants of heart development and identifies disruption of cell type-specific regulatory elements in CHD.
View details for DOI 10.1016/j.cell.2022.11.028
View details for PubMedID 36563664
-
Author Correction: Single-nucleus chromatin accessibility profiling highlights regulatory mechanisms of coronary artery disease risk.
Nature genetics
2022
View details for DOI 10.1038/s41588-022-01142-8
View details for PubMedID 35768727
-
Single-nucleus chromatin accessibility profiling highlights regulatory mechanisms of coronary artery disease risk.
Nature genetics
2022
Abstract
Coronary artery disease (CAD) is a complex inflammatory disease involving genetic influences across cell types. Genome-wide association studies have identified over 200 loci associated with CAD, where the majority of risk variants reside in noncoding DNA sequences impacting cis-regulatory elements. Here, we applied single-nucleus assay for transposase-accessible chromatin with sequencing to profile 28,316 nuclei across coronary artery segments from 41 patients with varying stages of CAD, which revealed 14 distinct cellular clusters. We mapped ~320,000 accessible sites across all cells, identified cell-type-specific elements and transcription factors, and prioritized functional CAD risk variants. We identified elements in smooth muscle cell transition states (for example, fibromyocytes) and functional variants predicted to alter smooth muscle cell- and macrophage-specific regulation of MRAS (3q22) and LIPA (10q23), respectively. We further nominated key driver transcription factors such as PRDM16 and TBX2. Together, this single-nucleus atlas provides a critical step towards interpreting regulatory mechanisms across the continuum of CAD risk.
View details for DOI 10.1038/s41588-022-01069-0
View details for PubMedID 35590109
-
Single-cell epigenomic analyses implicate candidate causal variants at inherited risk loci for Alzheimer's and Parkinson's diseases.
Nature genetics
2020
Abstract
Genome-wide association studies of neurological diseases have identified thousands of variants associated with disease phenotypes. However, most of these variants do not alter coding sequences, making it difficult to assign their function. Here, we present a multi-omic epigenetic atlas of the adult human brain through profiling of single-cell chromatin accessibility landscapes and three-dimensional chromatin interactions of diverse adult brain regions across a cohort of cognitively healthy individuals. We developed a machine-learning classifier to integrate this multi-omic framework and predict dozens of functional SNPs for Alzheimer's and Parkinson's diseases, nominating target genes and cell types for previously orphaned loci from genome-wide association studies. Moreover, we dissected the complex inverted haplotype of the MAPT (encoding tau) Parkinson's disease risk locus, identifying putative ectopic regulatory interactions in neurons that may mediate this disease association. This work expands understanding of inherited variation and provides a roadmap for the epigenomic dissection of causal regulatory variation in disease.
View details for DOI 10.1038/s41588-020-00721-x
View details for PubMedID 33106633
-
Assessing the accuracy of phylogenetic rooting methods on prokaryotic gene families
PLOS ONE
2020; 15 (5): e0232950
Abstract
Almost all standard phylogenetic methods for reconstructing gene trees result in unrooted trees; yet, many of the most useful applications of gene trees require that the gene trees be correctly rooted. As a result, several computational methods have been developed for inferring the root of unrooted gene trees. However, the accuracy of such methods has never been systematically evaluated on prokaryotic gene families, where horizontal gene transfer is often one of the dominant evolutionary events driving gene family evolution. In this work, we address this gap by conducting a thorough comparative evaluation of five different rooting methods using large collections of both simulated and empirical prokaryotic gene trees. Our simulation study is based on 6000 true and reconstructed gene trees on 100 species and characterizes the rooting accuracy of the four methods under 36 different evolutionary conditions and 3 levels of gene tree reconstruction error. The empirical study is based on a large, carefully designed data set of 3098 gene trees from 504 bacterial species (406 Alphaproteobacteria and 98 Cyanobacteria) and reveals insights that supplement those gleaned from the simulation study. Overall, this work provides several valuable insights into the accuracy of the considered methods that will help inform the choice of rooting methods to use when studying microbial gene family evolution. Among other findings, this study identifies parsimonious Duplication-Transfer-Loss (DTL) rooting and Minimal Ancestor Deviation (MAD) rooting as two of the most accurate gene tree rooting methods for prokaryotes and specifies the evolutionary conditions under which these methods are most accurate, demonstrates that DTL rooting is highly sensitive to high evolutionary rates and gene tree error, and that rooting methods based on branch-lengths are generally robust to gene tree reconstruction error.
View details for DOI 10.1371/journal.pone.0232950
View details for Web of Science ID 000537496000029
View details for PubMedID 32413061
View details for PubMedCentralID PMC7228096
-
SaGePhy: An improved phylogenetic simulation framework for gene and subgene evolution.
Bioinformatics (Oxford, England)
2019
Abstract
SaGePhy is a software package for improved phylogenetic simulation of gene and subgene evolution. SaGePhy can be used to generate species trees, gene trees, and subgene or (protein) domain trees using a probabilistic birth-death process that allows for gene and subgene duplication, horizontal gene and subgene transfer, and gene and subgene loss. SaGePhy implements a range of important features not found in other phylogenetic simulation frameworks/software. These include (i) simulation of subgene or domain level evolution inside one or more gene trees, (ii) simultaneous simulation of both additive and replacing horizontal gene/subgene transfers, and (iii) probabilistic sampling of species tree and gene tree nodes, respectively, for gene-family and domain-family birth. SaGePhy is open-source, platform independent, and written in Java and Python. Executables, source code (open-source under the revised BSD licence), and a detailed manual are freely available from http://compbio.engr.uconn.edu/software/sagephy/.
View details for DOI 10.1093/bioinformatics/btz081
View details for PubMedID 30715213
-
RANGER-DTL 2.0: rigorous reconstruction of gene-family evolution by duplication, transfer and loss
BIOINFORMATICS
2018; 34 (18): 3214–16
Abstract
RANGER-DTL 2.0 is a software program for inferring gene family evolution using Duplication-Transfer-Loss reconciliation. This new software is highly scalable and easy to use, and offers many new features not currently available in any other reconciliation program. RANGER-DTL 2.0 has a particular focus on reconciliation accuracy and can account for many sources of reconciliation uncertainty including uncertain gene tree rooting, gene tree topological uncertainty, multiple optimal reconciliations and alternative event cost assignments. RANGER-DTL 2.0 is open-source and written in C++ and Python. Pre-compiled executables, source code (open-source under GNU GPL) and a detailed manual are freely available from http://compbio.engr.uconn.edu/software/RANGER-DTL/. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/bty314
View details for Web of Science ID 000446433800022
View details for PubMedID 29688310
View details for PubMedCentralID PMC6137995
-
On the impact of uncertain gene tree rooting on duplication-transfer-loss reconciliation
BMC. 2018: 290
Abstract
Duplication-Transfer-Loss (DTL) reconciliation is a powerful and increasingly popular technique for studying the evolution of microbial gene families. DTL reconciliation requires the use of rooted gene trees to perform the reconciliation with the species tree, and the standard technique for rooting gene trees is to assign a root that results in the minimum reconciliation cost across all rootings of that gene tree. However, even though it is well understood that many gene trees have multiple optimal roots, only a single optimal root is randomly chosen to create the rooted gene tree and perform the reconciliation. This remains an important overlooked and unaddressed problem in DTL reconciliation, leading to incorrect evolutionary inferences. In this work, we perform an in-depth analysis of the impact of uncertain gene tree rooting on the computed DTL reconciliation and provide the first computational tools to quantify and negate the impact of gene tree rooting uncertainty on DTL reconciliation. Our analysis of a large data set of over 4500 gene families from 100 species shows that a large fraction of gene trees have multiple optimal rootings, that these multiple roots often, but not always, appear closely clustered together in the same region of the gene tree, that many aspects of the reconciliation remain conserved across the multiple rootings, that gene tree error has a profound impact on the prevalence and structure of multiple optimal rootings, and that there are specific interesting patterns in the reconciliation of those gene trees that have multiple optimal roots. Our results show that unrooted gene trees can be meaningfully reconciled and high-quality evolutionary information can be obtained from them even after accounting for multiple optimal rootings. In addition, the techniques and tools introduced in this paper make it possible to systematically avoid incorrect evolutionary inferences caused by incorrect or uncertain gene tree rooting.
These tools have been implemented in the phylogenetic reconciliation software package RANGER-DTL 2.0, freely available from http://compbio.engr.uconn.edu/software/RANGER-DTL/.
View details for DOI 10.1186/s12859-018-2269-0
View details for Web of Science ID 000442105800011
View details for PubMedID 30367593