Akshay Swaminathan
MD Student with Scholarly Concentration in Informatics & Data-Driven Medicine, expected graduation Winter 2026
Ph.D. Student in Biomedical Data Science with Scholarly Concentration in Informatics & Data-Driven Medicine, admitted Autumn 2024
All Publications
-
Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review.
JAMA
2024
Abstract
Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas. To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty. A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024. Studies evaluating 1 or more LLMs in health care. Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty. Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured.
Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented. Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.
View details for DOI 10.1001/jama.2024.21700
View details for PubMedID 39405325
View details for PubMedCentralID PMC11480901
-
Extraction of Unstructured Electronic Health Records to Evaluate Glioblastoma Treatment Patterns.
JCO clinical cancer informatics
2024; 8: e2300091
Abstract
Data on lines of therapy (LOTs) for cancer treatment are important for clinical oncology research, but LOTs are not explicitly recorded in electronic health records (EHRs). We present an efficient approach for clinical data abstraction and a flexible algorithm to derive LOTs from EHR-based medication data on patients with glioblastoma multiforme (GBM). Nonclinicians were trained to abstract the diagnosis of GBM from EHRs, and their accuracy was compared with abstraction performed by clinicians. The resulting data were used to build a cohort of patients with confirmed GBM diagnosis. An algorithm was developed to derive LOTs using structured medication data, accounting for the addition and discontinuation of therapies and drug class. Descriptive statistics were calculated and time-to-next-treatment (TTNT) analysis was performed using the Kaplan-Meier method. Treating clinicians as the gold standard, nonclinicians abstracted GBM diagnosis with a sensitivity of 0.98, specificity 1.00, positive predictive value 1.00, and negative predictive value 0.90, suggesting that nonclinician abstraction of GBM diagnosis was comparable with clinician abstraction. Of 693 patients with a confirmed diagnosis of GBM, 246 had structured information about the types of medications received. Of them, 165 (67.1%) received a first-line therapy (1L) of temozolomide, and the median TTNT from the start of 1L was 179 days. We described a workflow for extracting diagnosis of GBM and LOT from EHR data that combines nonclinician abstraction with algorithmic processing, demonstrating comparable accuracy with clinician abstraction and highlighting the potential for scalable and efficient EHR-based oncology research.
View details for DOI 10.1200/CCI.23.00091
View details for PubMedID 38857465
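The accuracy measures quoted in the glioblastoma abstract above (sensitivity, specificity, positive and negative predictive value) all follow from a 2x2 comparison of nonclinician labels against the clinician gold standard. The following is a minimal sketch of that arithmetic only; the class name, function name, and the counts are hypothetical and are not the study's data or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 counts of abstractor labels versus the gold standard (hypothetical)."""
    tp: int  # abstractor positive, gold standard positive
    fp: int  # abstractor positive, gold standard negative
    fn: int  # abstractor negative, gold standard positive
    tn: int  # abstractor negative, gold standard negative


def abstraction_metrics(c: ConfusionCounts) -> dict:
    """Return the four standard diagnostic accuracy measures."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


# Illustrative counts only; they are NOT taken from the paper.
example = ConfusionCounts(tp=45, fp=0, fn=5, tn=50)
metrics = abstraction_metrics(example)
```

With no false positives, specificity and positive predictive value are both 1.0, while missed cases lower sensitivity and negative predictive value, which is the same qualitative pattern the abstract reports.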
-
Burn Care Funding in the Era of Price Transparency-Does Verification Signal Bargaining Power?
Journal of burn care & research : official publication of the American Burn Association
2024
Abstract
The Price Transparency Rule of 2021 forced payors and hospitals to publicly disclose negotiated prices to foster competition and reduce cost. Burn care is costly and concentrated at fewer than 130 centers in the US. We aimed to analyze geographic price variations for inpatient burn care and measure the effects of American Burn Association (ABA) verification status and market concentration on prices. All available commercial rates for 2021-2022 for burn-related Diagnosis Related Groups (DRG) 927, 928, 929, 933, 934, and 935 were merged with hospital-level variables, ABA verification status, and Herfindahl-Hirschman Index (HHI) data. For DRG 927 (the most intensive burn admission), a linear mixed-effects model was fit with cost as the outcome and the following covariates: HHI, plan type, safety net status, profit status, verification status, rural status, and teaching hospital status. Random intercepts were included for individual burn centers. There were 170,738 rates published from 1541 unique hospitals. Commercial reimbursement rates for the same DRG varied by a factor of approximately three within hospitals for all DRGs. Similarly, rates across different hospitals varied by a factor of three for all DRGs, with DRG 927 having the most variation. Burn center status was independently associated with higher reimbursement rates adjusting for facility-level factors for all DRGs except for 935. Notably, HHI was the largest predictor of commercial rates (p<0.001). Negotiated prices for inpatient burn care vary widely. ABA-verified centers garner higher rates, as do burn centers in more concentrated or monopolistic markets.
View details for DOI 10.1093/jbcr/irae078
View details for PubMedID 38733210
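The burn care abstract above uses the Herfindahl-Hirschman Index (HHI) as its measure of market concentration. As a reminder of the standard definition, the HHI is the sum of squared market shares expressed in percent, so it ranges from near 0 (many small competitors) to 10,000 (a single provider). The sketch below illustrates that formula only; the function name and the volumes are hypothetical, and how the study defined markets and shares is not stated in the abstract.

```python
def herfindahl_hirschman_index(volumes):
    """Sum of squared percentage market shares for one market.

    `volumes` holds any non-negative size measure per provider
    (for example admissions); the values used below are hypothetical.
    """
    total = sum(volumes)
    if total <= 0:
        raise ValueError("total market volume must be positive")
    return sum((100.0 * v / total) ** 2 for v in volumes)


# A single provider holds 100% of the market.
monopoly = herfindahl_hirschman_index([500])

# Four providers with equal 25% shares.
four_equal = herfindahl_hirschman_index([100, 100, 100, 100])
```

Here the monopoly market scores 10,000 and the market with four equal providers scores 2,500, so a higher value indicates a more concentrated market.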
-
Emerging Outlook on Personalized Neuromodulation for Depression: Insights from Tractography-Based Targeting.
Biological psychiatry. Cognitive neuroscience and neuroimaging
2024
Abstract
Deep brain stimulation (DBS) has shown individual promise in treating treatment-resistant depression (TRD), but larger-scale trials have been less successful. Here, we create the largest meta-analysis with individual patient data (IPD) to date to explore whether the use of tractography enhances the efficacy of DBS for TRD. We systematically reviewed 1823 articles, selecting 32 that contributed data from 366 patients. We stratified the IPD based on stimulation target and use of tractography. Utilizing two-way type III Analysis of Variance (ANOVA), Welch Two Sample t-tests, and mixed-effects linear regression models, we evaluated changes in depression severity 9-15 months post-surgery (1-Y) and at last follow-up (LFU) (4 weeks - 8 years) as assessed by depression scales. Tractography was used for medial forebrain bundle (MFB, n=17/32), subcallosal cingulate (SCC, n=39/241), and ventral capsule/ventral striatum (VC/VS, n=3/41) targets; and not used for bed nucleus of stria terminalis (n=11), lateral habenula (n=10), and inferior thalamic peduncle (n=1). Across all patients, tractography significantly improved mean depression scores at 1-Y (p<0.001) and LFU (p=0.009). Within the target cohorts, tractography improved depression scores at 1-Y for both MFB and SCC, though significance was only met at the alpha = 0.1 level (SCC: β=15.8%, p=0.09; MFB: β=52.4%, p=0.10). Within the tractography cohort, MFB with tractography patients showed greater improvement than those with SCC with tractography (72.42±7.17% versus 54.78±4.08%) at 1-Y (p=0.044). Our findings underscore the promise of tractography in DBS for TRD as a methodology for personalization of therapy, supporting its inclusion in future trials.
View details for DOI 10.1016/j.bpsc.2024.04.007
View details for PubMedID 38679323
-
Diversifying cardiac intensive care unit models: Successful example of an operating surgeon-led unit.
JTCVS open
2023; 16: 524-531
Abstract
Objective: The intensivist-led cardiovascular intensive care unit model is the standard of care in cardiac surgery. This study examines whether a cardiovascular intensive care unit model that uses operating cardiac surgeons, cardiothoracic surgery residents, and advanced practice providers is associated with comparable outcomes. Methods: This is a single-institution review of the first 400 cardiac surgery patients admitted to an operating surgeon-led cardiovascular intensive care unit from 2020 to 2022. Inclusion criteria are elective status and operations managed by both cardiovascular intensive care unit models (aortic operations, valve operations, coronary operations, septal myectomy). Patients from the surgeon-led cardiovascular intensive care unit were exact matched by operation type and 1:1 propensity score matched with controls from the traditional cardiovascular intensive care unit using a logistic regression model that included age, sex, preoperative mortality risk, incision type, and use of cardiopulmonary bypass and circulatory arrest. Primary outcome was total postoperative length of stay. Secondary outcomes included postoperative intensive care unit length of stay, 30-day mortality, 30-day Society of Thoracic Surgeons-defined morbidity (permanent stroke, renal failure, cardiac reoperation, prolonged intubation, deep sternal infection), packed red cell transfusions, and vasopressor use. Outcomes between the 2 groups were compared using chi-square, Fisher exact test, or 2-sample t test as appropriate. Results: A total of 400 patients from the surgeon-led cardiovascular intensive care unit (mean age 61.2±12.8 years, 131 female patients [33%], 346 patients [86.5%] with European System for Cardiac Operative Risk Evaluation II <2%) and their matched controls were included. The most common operations across both units were coronary artery bypass grafting (n=318, 39.8%) and mitral valve repair or replacement (n=238, 29.8%).
Approximately half of the operations were performed via sternotomy (n=462, 57.8%). There were 3 (0.2%) in-hospital deaths, and 47 patients (5.9%) had a 30-day complication. The total length of stay was significantly shorter for the surgeon-led cardiovascular intensive care unit patients (6.3 vs 7.0 days, P=.028), and intensive care unit length of stay trended in the same direction (2.5 vs 2.9 days, P=.16). Intensive care unit readmission rates, 30-day mortality, and 30-day morbidity were not significantly different between cardiovascular intensive care unit models. The surgeon-led cardiovascular intensive care unit was associated with fewer postoperative red blood cell transfusions in the cardiovascular intensive care unit (P=.002) and decreased vasopressor use (P=.001). Conclusions: In its first 2 years, the surgeon-led cardiovascular intensive care unit demonstrated comparable outcomes to the traditional cardiovascular intensive care unit with significant improvements in total length of stay, postoperative transfusions in the cardiovascular intensive care unit, and vasopressor use. This early success exemplifies how an operating surgeon-led cardiovascular intensive care unit can provide similar outcomes to the standard-of-care model for patients undergoing elective cardiac surgery.
View details for DOI 10.1016/j.xjon.2023.09.040
View details for PubMedID 38204639
-
Natural language processing system for rapid detection and intervention of mental health crisis chat messages.
NPJ digital medicine
2023; 6 (1): 213
Abstract
Patients experiencing mental health crises often seek help through messaging-based platforms, but may face long wait times due to limited message triage capacity. Here we build and deploy a machine-learning-enabled system to improve response times to crisis messages in a large, national telehealth provider network. We train a two-stage natural language processing (NLP) system with keyword filtering followed by logistic regression on 721 electronic medical record chat messages, of which 32% are potential crises (suicidal/homicidal ideation, domestic violence, or non-suicidal self-injury). Model performance is evaluated on a retrospective test set (4/1/21-4/1/22, N=481) and a prospective test set (10/1/22-10/31/22, N=102,471). In the retrospective test set, the model has an AUC of 0.82 (95% CI: 0.78-0.86), sensitivity of 0.99 (95% CI: 0.96-1.00), and PPV of 0.35 (95% CI: 0.309-0.4). In the prospective test set, the model has an AUC of 0.98 (95% CI: 0.966-0.984), sensitivity of 0.98 (95% CI: 0.96-0.99), and PPV of 0.66 (95% CI: 0.626-0.692). The daily median time from message receipt to crisis specialist triage ranges from 8 to 13 min, compared to 9 h before the deployment of the system. We demonstrate that an NLP-based machine learning model can reliably identify potential crisis chat messages in a telehealth setting. Our system integrates into existing clinical workflows, suggesting that with appropriate training, humans can successfully leverage ML systems to facilitate triage of crisis messages.
View details for DOI 10.1038/s41746-023-00951-3
View details for PubMedID 37990134
-
Selective prediction for extracting unstructured clinical data.
Journal of the American Medical Informatics Association : JAMIA
2023
Abstract
While there are currently approaches to handle unstructured clinical data, such as manual abstraction and structured proxy variables, these methods may be time-consuming, not scalable, and imprecise. This article aims to determine whether selective prediction, which gives a model the option to abstain from generating a prediction, can improve the accuracy and efficiency of unstructured clinical data abstraction. We trained selective classifiers (logistic regression, random forest, support vector machine) to extract 5 variables from clinical notes: depression (n = 1563), glioblastoma (GBM, n = 659), rectal adenocarcinoma (DRA, n = 601), and abdominoperineal resection (APR, n = 601) and low anterior resection (LAR, n = 601) of adenocarcinoma. We varied the cost of false positives (FP), false negatives (FN), and abstained notes and measured total misclassification cost. The depression selective classifiers abstained on anywhere from 0% to 97% of notes, and the change in total misclassification cost ranged from -58% to 9%. Selective classifiers abstained on 5%-43% of notes across the GBM and colorectal cancer models. The GBM selective classifier abstained on 43% of notes, which led to improvements in sensitivity (0.94 to 0.96), specificity (0.79 to 0.96), PPV (0.89 to 0.98), and NPV (0.88 to 0.91) when compared to a non-selective classifier and when compared to structured proxy variables. We showed that selective classifiers outperformed both non-selective classifiers and structured proxy variables for extracting data from unstructured clinical notes. Selective prediction should be considered when abstaining is preferable to making an incorrect prediction.
View details for DOI 10.1093/jamia/ocad182
View details for PubMedID 37769323
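The selective prediction abstract above describes classifiers that may abstain, and a total misclassification cost that weighs false positives, false negatives, and abstentions. One common way to realize abstention is a confidence band around the decision threshold, sketched below. This is an illustration under stated assumptions, not the paper's method: the abstract does not say how abstention was implemented, and the function names, thresholds, cost values, probabilities, and labels here are all hypothetical.

```python
def selective_predict(prob, lower, upper):
    """Return 1, 0, or None (abstain) for a predicted probability.

    Assumption: abstain when the probability falls strictly between
    `lower` and `upper`. Setting lower == upper disables abstention.
    """
    if prob >= upper:
        return 1
    if prob <= lower:
        return 0
    return None


def total_cost(probs, labels, lower, upper,
               cost_fp=1.0, cost_fn=1.0, cost_abstain=0.25):
    """Sum the cost of false positives, false negatives, and abstentions."""
    cost = 0.0
    for prob, label in zip(probs, labels):
        pred = selective_predict(prob, lower, upper)
        if pred is None:
            cost += cost_abstain  # note deferred, e.g. to manual review
        elif pred == 1 and label == 0:
            cost += cost_fp
        elif pred == 0 and label == 1:
            cost += cost_fn
    return cost


# Hypothetical predicted probabilities and true labels.
probs = [0.95, 0.60, 0.45, 0.10, 0.55]
labels = [1, 0, 1, 0, 1]

non_selective_cost = total_cost(probs, labels, lower=0.5, upper=0.5)
selective_cost = total_cost(probs, labels, lower=0.3, upper=0.7)
```

In this toy data the non-selective classifier makes one false positive and one false negative (cost 2.0), while the selective classifier abstains on the three uncertain notes (cost 0.75). Whether abstaining pays off depends entirely on the relative costs chosen, which is the trade-off the abstract varies.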
-
Critically reading machine learning literature in neurosurgery: a reader's guide and checklist for appraising prediction models.
Neurosurgical focus
2023; 54 (6): E3
Abstract
OBJECTIVE: Machine learning (ML) has become an increasingly popular tool for use in neurosurgical research. The number of publications and interest in the field have recently seen significant expansion in both quantity and complexity. However, this also places a commensurate burden on the general neurosurgical readership to appraise this literature and decide if these algorithms can be effectively translated into practice. To this end, the authors sought to review the burgeoning neurosurgical ML literature and to develop a checklist to help readers critically review and digest this work. METHODS: The authors performed a literature search of recent ML papers in the PubMed database with the terms "neurosurgery" AND "machine learning," with additional modifiers "trauma," "cancer," "pediatric," and "spine" also used to ensure a diverse selection of relevant papers within the field. Papers were reviewed for their ML methodology, including the formulation of the clinical problem, data acquisition, data preprocessing, model development, model validation, model performance, and model deployment. RESULTS: The resulting checklist consists of 14 key questions for critically appraising ML models and development techniques; these are organized according to their timing along the standard ML workflow. In addition, the authors provide an overview of the ML development process, as well as a review of key terms, models, and concepts referenced in the literature. CONCLUSIONS: ML is poised to become an increasingly important part of neurosurgical research and clinical care. The authors hope that dissemination of education on ML techniques will help neurosurgeons to critically review new research better and more effectively integrate this technology into their practices.
View details for DOI 10.3171/2023.3.FOCUS2352
View details for PubMedID 37283326
-
Post-traumatic growth in PhD students during the COVID-19 pandemic.
Psychiatry research communications
2023; 3 (1): 100104
Abstract
Throughout the COVID-19 pandemic, graduate students have faced increased risk of mental health challenges. Research suggests that experiencing adversity may induce positive psychological changes, called post-traumatic growth (PTG). These changes can include improved relationships with others, perceptions of oneself, and enjoyment of life. Few existing studies have explored this phenomenon among graduate students. This secondary data analysis of a survey conducted in November 2020 among graduate students at a private R1 University in the northeast United States examined graduate students' levels and correlates of PTG during the COVID-19 pandemic. Students had a low level of PTG, with a mean score of 10.31 out of 50. Linear regression models showed significant positive relationships between anxiety and PTG and between a measure of self-reported impact of the pandemic and PTG. Non-White minorities also had significantly greater PTG than White participants. Experiencing more negative impact due to the pandemic and ruminating about the pandemic were correlated with greater PTG. These findings advance research on the patterns of PTG during the COVID-19 pandemic and can inform future studies of graduate students' coping mechanisms and support efforts to promote pandemic recovery and resilience.
View details for DOI 10.1016/j.psycom.2023.100104
View details for PubMedID 36743383
View details for PubMedCentralID PMC9886426