Behzad Naderalvojoud
Biostatistician 2, Med/BMIR
All Publications
-
Large Language Models Outperform Traditional Natural Language Processing Methods in Extracting Patient-Reported Outcomes in Inflammatory Bowel Disease.
Gastro Hep Advances
2025; 4 (2): 100563
Abstract
Patient-reported outcomes (PROs) are vital in assessing disease activity and treatment outcomes in inflammatory bowel disease (IBD). However, manual extraction of these PROs from the free text of clinical notes is burdensome. We aimed to improve data curation from free-text information in the electronic health record, making it more available for research and quality improvement. This study aimed to compare traditional natural language processing (tNLP) and large language models (LLMs) in extracting 3 IBD PROs (abdominal pain, diarrhea, fecal blood) from clinical notes across 2 institutions.

Clinic notes were annotated for each PRO using preset protocols. Models were developed and internally tested at the University of California, San Francisco, and then externally validated at Stanford University. We compared tNLP and LLM-based models on accuracy, sensitivity, specificity, and positive and negative predictive value. In addition, we conducted fairness and error assessments.

Interrater reliability between annotators was >90%. On the University of California, San Francisco test set (n = 50), the top-performing tNLP models showed accuracies of 92% (abdominal pain), 82% (diarrhea), and 80% (fecal blood), comparable to GPT-4, which was 96%, 88%, and 90% accurate, respectively. On external validation at Stanford (n = 250), tNLP models failed to generalize (61%-62% accuracy), while GPT-4 maintained accuracies >90%. Pathways Language Model-2 (PaLM-2) and GPT-4 showed similar performance. No biases were detected based on demographics or diagnosis.

LLMs are accurate and generalizable methods for extracting PROs. They maintain excellent accuracy across institutions, despite heterogeneity in note templates and authors. Widespread adoption of such tools has the potential to enhance IBD research and patient care.
View details for DOI 10.1016/j.gastha.2024.10.003
View details for PubMedID 39877865
View details for PubMedCentralID PMC11772946
-
Ensemble learning to enhance accurate identification of patients with glaucoma using electronic health records.
JAMIA Open
2025; 8 (4): ooaf080
Abstract
Existing ophthalmology studies for clinical phenotype identification in real-world datasets (RWD) rely exclusively on structured data elements (SDE). We evaluated the performance, generalizability, and fairness of multimodal ensemble models that integrate real-world SDE and free-text data, compared to SDE-only models, to identify patients with glaucoma.

This is a retrospective cross-sectional study involving 2 health systems: the University of Michigan (UoM) and Stanford University (SU). It involves 1728 patients visiting eye clinics during 2012-2021. Free-text embeddings extracted using BioClinicalBERT were combined with SDE. Edited Nearest Neighbor (ENN) undersampling and the Borderline Synthetic Minority Over-sampling Technique (bSMOTE) addressed class imbalance. Lasso Regression (LR), Random Forest (RF), and Support Vector Classifier (SVC) models were trained on UoM imbalanced (imb) and resampled data, along with a bagging ensemble method. Models were externally validated with SU data. Fairness was assessed using equalized odds difference (EOD) and Target Probability Difference (TPD).

Among 900 and 828 patients from UoM and SU, 10% and 23%, respectively, had glaucoma as confirmed by ophthalmologists. At UoM, multimodal LRimb (F1 = 76.60 [61.90-88.89]; AUROC = 95.41 [87.01-99.63]) outperformed unimodal RFimb (F1 = 69.77 [52.94-83.64]; AUROC = 97.72 [95.95-99.18]) and the ICD-coding method (F1 = 53.01 [39.51-65.43]; AUROC = 90.10 [84.59-93.93]). Bagging (BM = LRENN + LRbSMOTE) improved performance, achieving an F1 of 83.02 [70.59-92.86] and an AUROC of 97.59 [92.98-99.88]. During external validation, BM achieved the highest F1 (68.47 [62.61-73.75]), outperforming unimodal (F1 = 51.26 [43.80-58.13]) and multimodal LRimb (F1 = 62.46 [55.95-68.24]) models. BM EOD revealed lower disparities for sex (<0.1), race (<0.5), and ethnicity (<0.5), and BM had the least uncertainty in the TPD analysis compared to traditional models.

Multimodal ensemble models integrating structured and unstructured EHR data outperformed traditional SDE models, achieving fair predictions across demographic subgroups. Among ensemble methods, bagging demonstrated better generalizability than stacking, particularly when training data are limited. This approach can enhance phenotype discovery to enable future research studies using RWD, leading to better patient management and clinical outcomes.
View details for DOI 10.1093/jamiaopen/ooaf080
View details for PubMedID 40799932
View details for PubMedCentralID PMC12342940
-
Large Language Models Outperform Traditional Natural Language Processing Methods in Extracting Patient-Reported Outcomes in Inflammatory Bowel Disease
GASTRO HEP ADVANCES
2025; 4 (2)
View details for DOI 10.1016/j.gastha.2024.10.003
View details for Web of Science ID 001398433800001
-
Large language models outperform traditional natural language processing methods in extracting patient-reported outcomes in IBD.
medRxiv: the preprint server for health sciences
2024
Abstract
Patient-reported outcomes (PROs) are vital in assessing disease activity and treatment outcomes in inflammatory bowel disease (IBD). However, manual extraction of these PROs from the free text of clinical notes is burdensome. We aimed to improve data curation from free-text information in the electronic health record, making it more available for research and quality improvement. This study aimed to compare traditional natural language processing (tNLP) and large language models (LLMs) in extracting three IBD PROs (abdominal pain, diarrhea, fecal blood) from clinical notes across two institutions.

Clinic notes were annotated for each PRO using preset protocols. Models were developed and internally tested at the University of California San Francisco (UCSF), and then externally validated at Stanford University. We compared tNLP and LLM-based models on accuracy, sensitivity, specificity, positive and negative predictive value. Additionally, we conducted fairness and error assessments.

Inter-rater reliability between annotators was >90%. On the UCSF test set (n=50), the top-performing tNLP models showcased accuracies of 92% (abdominal pain), 82% (diarrhea) and 80% (fecal blood), comparable to GPT-4, which was 96%, 88%, and 90% accurate, respectively. On external validation at Stanford (n=250), tNLP models failed to generalize (61-62% accuracy) while GPT-4 maintained accuracies >90%. PaLM-2 and GPT-4 showed similar performance. No biases were detected based on demographics or diagnosis.

LLMs are accurate and generalizable methods for extracting PROs. They maintain excellent accuracy across institutions, despite heterogeneity in note templates and authors. Widespread adoption of such tools has the potential to enhance IBD research and patient care.
View details for DOI 10.1101/2024.09.05.24313139
View details for PubMedID 39281744
View details for PubMedCentralID PMC11398594
-
Towards global model generalizability: independent cross-site feature evaluation for patient-level risk prediction models using the OHDSI network.
Journal of the American Medical Informatics Association: JAMIA
2024
Abstract
Predictive models show promise in healthcare, but their successful deployment is challenging due to limited generalizability. Current external validation often focuses on model performance with restricted feature use from the original training data, offering little insight into a model's suitability at external sites. Our study introduces a methodology for evaluating features during both the development and validation phases, focusing on creating and validating predictive models for post-surgery patient outcomes with improved generalizability.

Electronic health records (EHRs) from 4 countries (United States, United Kingdom, Finland, and Korea) were mapped to the OMOP Common Data Model (CDM), 2008-2019. Machine learning (ML) models were developed to predict post-surgery prolonged opioid use (POU) risks using data collected 6 months before surgery. Both local and cross-site feature selection methods were applied in the development and external validation datasets. Models were developed using Observational Health Data Sciences and Informatics (OHDSI) tools and validated on separate patient cohorts.

Model development included 41,929 patients, 14.6% with POU. The external validation included 31,932 (UK), 23,100 (US), 7295 (Korea), and 3934 (Finland) patients, with POU rates of 44.2%, 22.0%, 15.8%, and 21.8%, respectively. The top-performing model, Lasso logistic regression, achieved an area under the receiver operating characteristic curve (AUROC) of 0.75 during local validation and an average of 0.69 (SD = 0.02) in external validation. Models trained with cross-site feature selection significantly outperformed those using only features from the development site in external validation (P < .05).

Using EHRs from four countries mapped to the OMOP CDM, we developed generalizable predictive models for POU. Our approach demonstrates the significant impact of cross-site feature selection in improving model performance, underscoring the importance of incorporating diverse feature sets from various clinical settings to enhance the generalizability and utility of predictive healthcare models.
View details for DOI 10.1093/jamia/ocae028
View details for PubMedID 38412331
-
Trends in Influenza Vaccination Rates among a Medicaid Population from 2016 to 2021.
Vaccines
2023; 11 (11)
Abstract
Seasonal influenza is a leading cause of death in the U.S., causing significant morbidity, mortality, and economic burden. Despite the proven efficacy of vaccination, rates remain notably low, especially among Medicaid enrollees. Leveraging Medicaid claims data, this study characterizes influenza vaccination rates among Medicaid enrollees and aims to elucidate factors influencing vaccine uptake, providing insights that might also be applicable to other vaccine-preventable diseases, including COVID-19.

This study used Medicaid claims data from nine U.S. states (2016-2021), encompassing three types of claims: fee-for-service, major Medicaid managed care plan, and combined. We included Medicaid enrollees who had an in-person healthcare encounter during an influenza season in this period, excluding those under 6 months of age, over 65 years, or having telehealth-only encounters. Vaccination was the primary outcome, with secondary outcomes involving in-person healthcare encounters. Chi-square tests, multivariable logistic regression, and Fisher's exact test were used for statistical analysis.

A total of 20,868,910 enrollees with at least one healthcare encounter in at least one influenza season between 2016 and 2021 were included in the study population. Overall, 15% (N = 3,050,471) of enrollees received an influenza vaccine between 2016 and 2021. Vaccination rates increased during peri-COVID periods compared to pre-COVID periods, from 14% to 16%. Children had the highest influenza vaccination rate among all age groups at 29%, whereas only 17% of those aged 5-17 years and 10% of those aged 18-64 years were vaccinated. We observed differences in the likelihood of receiving the influenza vaccine among enrollees based on their health conditions and medical encounters.

In this study of Medicaid enrollees across nine states, 15% received an influenza vaccine from July 2016 to June 2021. Vaccination rates rose annually, peaking during peri-COVID seasons. The highest uptake was among children (6 months-4 years), and the lowest was in adults (18-64 years). Female gender, urban residency, and Medicaid managed care affiliation positively influenced uptake, whereas mental health and substance abuse disorders decreased the likelihood of vaccination. This study, reliant on Medicaid claims data, underscores the need for outreach services.
View details for DOI 10.3390/vaccines11111712
View details for PubMedID 38006044
-
Improving machine learning with ensemble learning on observational healthcare data.
AMIA ... Annual Symposium proceedings. AMIA Symposium
2023; 2023: 521-529
Abstract
Ensemble learning is a powerful technique for improving the accuracy and reliability of prediction models, especially in scenarios where individual models may not perform well. However, combining models with varying accuracies may not always improve the final prediction results, as models with lower accuracies may obscure the results of models with higher accuracies. This paper addresses this issue and answers the question of when an ensemble approach outperforms individual models for prediction. As a result, we propose an ensemble model for predicting patients at risk of postoperative prolonged opioid use. The model incorporates two machine learning models that are trained using different covariates, resulting in high precision and recall. Our study, which employs five different machine learning algorithms, shows that the proposed approach significantly improves the final prediction results in terms of AUROC and AUPRC.
View details for PubMedID 38222353
View details for PubMedCentralID PMC10785929