Behzad Naderalvojoud's Profile

Contact

Work
behzadn@stanford.edu

University - Staff
Department: Med/BMIR
Position: Biostatistician 2

Additional Info

Mail Code: 5479

All Publications

Large Language Models Outperform Traditional Natural Language Processing Methods in Extracting Patient-Reported Outcomes in Inflammatory Bowel Disease. Gastro hep advances Patel, P. V., Davis, C., Ralbovsky, A., Tinoco, D., Williams, C. Y., Slatter, S., Naderalvojoud, B., Rosen, M. J., Hernandez-Boussard, T., Rudrapatna, V. 2025; 4 (2): 100563

Abstract

Patient-reported outcomes (PROs) are vital in assessing disease activity and treatment outcomes in inflammatory bowel disease (IBD). However, manual extraction of these PROs from the free-text of clinical notes is burdensome. We aimed to improve data curation from free-text information in the electronic health record, making it more available for research and quality improvement. This study aimed to compare traditional natural language processing (tNLP) and large language models (LLMs) in extracting 3 IBD PROs (abdominal pain, diarrhea, fecal blood) from clinical notes across 2 institutions. Clinic notes were annotated for each PRO using preset protocols. Models were developed and internally tested at the University of California, San Francisco, and then externally validated at Stanford University. We compared tNLP and LLM-based models on accuracy, sensitivity, specificity, positive, and negative predictive value. In addition, we conducted fairness and error assessments. Interrater reliability between annotators was >90%. On the University of California, San Francisco test set (n = 50), the top-performing tNLP models showcased accuracies of 92% (abdominal pain), 82% (diarrhea) and 80% (fecal blood), comparable to GPT-4, which was 96%, 88%, and 90% accurate, respectively. On external validation at Stanford (n = 250), tNLP models failed to generalize (61%-62% accuracy) while GPT-4 maintained accuracies >90%. Pathways Language Model-2 and Generative Pre-trained Transformer-4 showed similar performance. No biases were detected based on demographics or diagnosis. LLMs are accurate and generalizable methods for extracting PROs. They maintain excellent accuracy across institutions, despite heterogeneity in note templates and authors. Widespread adoption of such tools has the potential to enhance IBD research and patient care.

View details for DOI 10.1016/j.gastha.2024.10.003

View details for PubMedID 39877865

View details for PubMedCentralID PMC11772946

Large language models outperform traditional natural language processing methods in extracting patient-reported outcomes in IBD. medRxiv : the preprint server for health sciences Patel, P. V., Davis, C., Ralbovsky, A., Tinoco, D., Williams, C. Y., Slatter, S., Naderalvojoud, B., Rosen, M. J., Hernandez-Boussard, T., Rudrapatna, V. 2024

Abstract

Patient-reported outcomes (PROs) are vital in assessing disease activity and treatment outcomes in inflammatory bowel disease (IBD). However, manual extraction of these PROs from the free-text of clinical notes is burdensome. We aimed to improve data curation from free-text information in the electronic health record, making it more available for research and quality improvement. This study aimed to compare traditional natural language processing (tNLP) and large language models (LLMs) in extracting three IBD PROs (abdominal pain, diarrhea, fecal blood) from clinical notes across two institutions. Clinic notes were annotated for each PRO using preset protocols. Models were developed and internally tested at the University of California San Francisco (UCSF), and then externally validated at Stanford University. We compared tNLP and LLM-based models on accuracy, sensitivity, specificity, positive and negative predictive value. Additionally, we conducted fairness and error assessments. Inter-rater reliability between annotators was >90%. On the UCSF test set (n=50), the top-performing tNLP models showcased accuracies of 92% (abdominal pain), 82% (diarrhea) and 80% (fecal blood), comparable to GPT-4, which was 96%, 88%, and 90% accurate, respectively. On external validation at Stanford (n=250), tNLP models failed to generalize (61-62% accuracy) while GPT-4 maintained accuracies >90%. PaLM-2 and GPT-4 showed similar performance. No biases were detected based on demographics or diagnosis. LLMs are accurate and generalizable methods for extracting PROs. They maintain excellent accuracy across institutions, despite heterogeneity in note templates and authors. Widespread adoption of such tools has the potential to enhance IBD research and patient care.

View details for DOI 10.1101/2024.09.05.24313139

View details for PubMedID 39281744

View details for PubMedCentralID PMC11398594

Towards global model generalizability: independent cross-site feature evaluation for patient-level risk prediction models using the OHDSI network. Journal of the American Medical Informatics Association : JAMIA Naderalvojoud, B., Curtin, C. M., Yanover, C., El-Hay, T., Choi, B., Park, R. W., Tabuenca, J. G., Reeve, M. P., Falconer, T., Humphreys, K., Asch, S. M., Hernandez-Boussard, T. 2024

Abstract

Predictive models show promise in healthcare, but their successful deployment is challenging due to limited generalizability. Current external validation often focuses on model performance with restricted feature use from the original training data, lacking insights into their suitability at external sites. Our study introduces an innovative methodology for evaluating features during both the development phase and the validation, focusing on creating and validating predictive models for post-surgery patient outcomes with improved generalizability. Electronic health records (EHRs) from 4 countries (United States, United Kingdom, Finland, and Korea) were mapped to the OMOP Common Data Model (CDM), 2008-2019. Machine learning (ML) models were developed to predict post-surgery prolonged opioid use (POU) risks using data collected 6 months before surgery. Both local and cross-site feature selection methods were applied in the development and external validation datasets. Models were developed using Observational Health Data Sciences and Informatics (OHDSI) tools and validated on separate patient cohorts. Model development included 41 929 patients, 14.6% with POU. The external validation included 31 932 (UK), 23 100 (US), 7295 (Korea), and 3934 (Finland) patients with POU of 44.2%, 22.0%, 15.8%, and 21.8%, respectively. The top-performing model, Lasso logistic regression, achieved an area under the receiver operating characteristic curve (AUROC) of 0.75 during local validation and 0.69 (SD = 0.02) (averaged) in external validation. Models trained with cross-site feature selection significantly outperformed those using only features from the development site through external validation (P < .05). Using EHRs across four countries mapped to the OMOP CDM, we developed generalizable predictive models for POU. Our approach demonstrates the significant impact of cross-site feature selection in improving model performance, underscoring the importance of incorporating diverse feature sets from various clinical settings to enhance the generalizability and utility of predictive healthcare models.

View details for DOI 10.1093/jamia/ocae028

View details for PubMedID 38412331

Trends in Influenza Vaccination Rates among a Medicaid Population from 2016 to 2021. Vaccines Naderalvojoud, B., Shah, N. D., Mutanga, J. N., Belov, A., Staiger, R., Chen, J. H., Whitaker, B., Hernandez-Boussard, T. 2023; 11 (11)

Abstract

Seasonal influenza is a leading cause of death in the U.S., causing significant morbidity, mortality, and economic burden. Despite the proven efficacy of vaccinations, rates remain notably low, especially among Medicaid enrollees. Leveraging Medicaid claims data, this study characterizes influenza vaccination rates among Medicaid enrollees and aims to elucidate factors influencing vaccine uptake, providing insights that might also be applicable to other vaccine-preventable diseases, including COVID-19. This study used Medicaid claims data from nine U.S. states (2016-2021), encompassing three types of claims: fee-for-service, major Medicaid managed care plan, and combined. We included Medicaid enrollees who had an in-person healthcare encounter during an influenza season in this period, excluding those under 6 months of age, over 65 years, or having telehealth-only encounters. Vaccination was the primary outcome, with secondary outcomes involving in-person healthcare encounters. Chi-square tests, multivariable logistic regression, and Fisher's exact test were utilized for statistical analysis. A total of 20,868,910 enrollees with at least one healthcare encounter in at least one influenza season were included in the study population between 2016 and 2021. Overall, 15% (N = 3,050,471) of enrollees received an influenza vaccine between 2016 and 2021. During peri-COVID periods, there was an increase in vaccination rates among enrollees compared to pre-COVID periods, from 14% to 16%. Children (6 months-4 years) had the highest influenza vaccination rate among all age groups at 29%, whereas only 17% of those aged 5-17 years and 10% of those aged 18-64 years were vaccinated. We observed differences in the likelihood of receiving the influenza vaccine among enrollees based on their health conditions and medical encounters. In a study of Medicaid enrollees across nine states, 15% received an influenza vaccine from July 2016 to June 2021. Vaccination rates rose annually, peaking during peri-COVID seasons. The highest uptake was among children (6 months-4 years), and the lowest was in adults (18-64 years). Female gender, urban residency, and Medicaid-managed care affiliation positively influenced uptake. However, mental health and substance abuse disorders decreased the likelihood. This study, reliant on Medicaid claims data, underscores the need for outreach services.

View details for DOI 10.3390/vaccines11111712

View details for PubMedID 38006044

Improving machine learning with ensemble learning on observational healthcare data. AMIA ... Annual Symposium proceedings. AMIA Symposium Naderalvojoud, B., Hernandez-Boussard, T. 2023; 2023: 521-529

Abstract

Ensemble learning is a powerful technique for improving the accuracy and reliability of prediction models, especially in scenarios where individual models may not perform well. However, combining models with varying accuracies may not always improve the final prediction results, as models with lower accuracies may obscure the results of models with higher accuracies. This paper addresses this issue and answers the question of when an ensemble approach outperforms individual models for prediction. As a result, we propose an ensemble model for predicting patients at risk of postoperative prolonged opioid use. The model incorporates two machine learning models that are trained using different covariates, resulting in high precision and recall. Our study, which employs five different machine learning algorithms, shows that the proposed approach significantly improves the final prediction results in terms of AUROC and AUPRC.

View details for PubMedID 38222353

View details for PubMedCentralID PMC10785929
