Soumyadeep Roy
Postdoctoral Scholar, Biomedical Informatics
Bio
I am a postdoctoral scholar at the Center for Biomedical Informatics Research of Stanford University, advised by Prof. Tina Hernandez-Boussard.
My primary area of research is natural language processing, with expertise in medical and healthcare applications. My research areas of interest are Foundation Models for Medicine, Generative AI, Text Summarization, and Efficient Pretraining.
I hold a PhD in Computer Science and Engineering from the Indian Institute of Technology Kharagpur, where I worked with Prof. Niloy Ganguly and Prof. Shamik Sural. There, I was part of the Complex Networks Research Group (CNeRG). In my PhD thesis, titled “Domain Adaptation for Medical Language Understanding”, I developed novel domain adaptation techniques to effectively and efficiently adapt open-domain AI models to the medical domain.
In summary, I have six years of experience working with medical NLP data, including clinical trial registry data (2018-2021), medical forum questions (2020-2021), DNA sequence data (2021-2024), clinical data (2021-2023), biomedical scientific literature (2023-2025), and EHR clinical notes (2025). My medical AI research experience includes 2.5 years at L3S Research Germany, collaborating with Hannover Medical School, as well as a 7-month research internship at GE HealthCare Technology and Innovation Center (HTIC) in Bangalore, India. I also presented a tutorial titled "Building Trustworthy AI Models for Medicine" at WSDM 2025, held in Germany, on March 10, 2025.
In my free time, I enjoy hiking and playing chess or table tennis.
Professional Education
- Doctor of Philosophy, Indian Institute of Technology, Kharagpur (2025)
- Master of Science, Indian Institute of Technology, Kharagpur (2019)
- Bachelor of Technology, Maulana Abul Kalam Azad University of Technology (2017)
All Publications
- Decision tree-based approach to robust Parkinson's disease subtyping using clinical data of the Michael J. Fox Foundation LRRK2 cross-sectional study.
Frontiers in artificial intelligence
2025; 8: 1668206
Abstract
Parkinson's Disease (PD) is a neurodegenerative disorder with high heterogeneity in clinical symptoms, progression course, treatment response, and genetic factors. PD subtyping therefore aims to enhance understanding of disease mechanisms and to facilitate targeted interventions or treatment regimens. Data-driven PD subtyping is typically done using cluster analysis. Still, such studies face barriers to widespread adoption in clinical practice due to the following issues: (i) results are quite sensitive to study design, and the actual subtype rules are not reasonably interpretable; (ii) results are not robustly replicable across multiple datasets, and most studies focus on a single dataset. This paper aims to identify novel PD subtypes using an interpretable decision-tree-based method that is robustly reproducible in an independent PD cohort. We first train a decision tree classifier on an LRRK2 dataset to determine whether a patient has early onset or late onset PD. Subtyping rules are then established by tracing back from the leaves of the learned decision tree. The independent MDS dataset is used for external validation, after mapping features between the two datasets. We obtained six novel subtypes that are clinically consistent and sufficiently large across both the training and external validation datasets. Finally, a clinical characterization study showed that the following clinical features may be the most important diagnostic markers for our six detected subtypes: (i) persistent asymmetry affecting the side of onset most, (ii) clinical course of 10 years or more, and (iii) postural instability not caused by other dysfunction. The subtypes identified in our study may provide relevant guidance for prognosis and therapeutic strategies. An early onset subtype (E4) can be linked to a comparatively favorable prognosis. In contrast, the mixed onset subtypes (M3 and M7) may predict faster functional decline, suggesting that patients in these groups could benefit from intensified supportive measures. One late onset subtype (L1) seems to have a more benign course, while the other two (L2 and L4) are connected with predictors of reduced quality of life and increased care dependency.
DOI: 10.3389/frai.2025.1668206
PubMed ID: 41356666
PubMed Central ID: PMC12678400
- Building Trustworthy AI Models for Medicine: From Theory to Applications
ASSOC COMPUTING MACHINERY. 2025: 1012-1015
DOI: 10.1145/3701551.3703477
Web of Science ID: 001476971200112
https://orcid.org/0000-0001-7269-2163