Learning decision thresholds for risk stratification models from aggregate clinician behavior.
Journal of the American Medical Informatics Association : JAMIA
Using a risk stratification model to guide clinical practice often requires the choice of a cutoff, called the decision threshold, on the model's output to trigger a subsequent action such as an electronic alert. Choosing this cutoff is not always straightforward. We propose a flexible approach that leverages the collective information in treatment decisions made in real life to learn reference decision thresholds from physician practice. Using the example of prescribing a statin for primary prevention of cardiovascular disease based on 10-year risk calculated by the 2013 pooled cohort equations, we demonstrate the feasibility of using real-world data to learn the implicit decision threshold that reflects existing physician behavior. Learning a decision threshold in this manner allows for evaluation of a proposed operating point against the threshold reflective of the community standard of care. Furthermore, this approach can be used to monitor and audit model-guided clinical decision making following model deployment.
View details for DOI 10.1093/jamia/ocab159
View details for PubMedID 34350942
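The threshold-learning idea in the abstract above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the implicit threshold is the cutoff at which the rule "treat if risk >= t" agrees most often with observed treatment decisions. The abstract does not describe the authors' estimation procedure, so the function name `learn_threshold` and the data below are hypothetical.

```python
def learn_threshold(risks, treated):
    """Return (cutoff, agreement) for the rule `risk >= cutoff` that best
    matches the observed treatment decisions.

    Assumption for illustration: agreement is plain accuracy, and ties are
    resolved in favor of the lowest candidate cutoff.
    """
    best_t, best_agree = None, -1
    for t in sorted(set(risks)):  # every observed risk is a candidate cutoff
        agree = sum((r >= t) == bool(y) for r, y in zip(risks, treated))
        if agree > best_agree:
            best_t, best_agree = t, agree
    return best_t, best_agree / len(risks)


# Synthetic, illustrative data: 10-year risk estimates and whether a statin
# was prescribed (1) or not (0). These are not values from the study.
risks = [0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.15, 0.20]
treated = [0, 0, 0, 1, 0, 1, 1, 1]

threshold, agreement = learn_threshold(risks, treated)
print(f"implicit threshold ~ {threshold}, agreement = {agreement}")
```

A proposed operating point could then be compared against the learned cutoff, which is the kind of evaluation the abstract describes.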
Systematic Review of Approaches to Preserve Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine.
Applied clinical informatics
2021; 12 (4): 808-815
OBJECTIVE: The change in performance of machine learning models over time as a result of temporal dataset shift is a barrier to machine learning-derived models facilitating decision-making in clinical practice. Our aim was to describe technical procedures used to preserve the performance of machine learning models in the presence of temporal dataset shifts. METHODS: Studies were included if they were fully published articles that used machine learning and implemented a procedure to mitigate the effects of temporal dataset shift in a clinical setting. We described how dataset shift was measured, the procedures used to preserve model performance, and their effects. RESULTS: Of 4,457 potentially relevant publications identified, 15 were included. The impact of temporal dataset shift was primarily quantified using changes, usually deterioration, in calibration or discrimination. Calibration deterioration was more common (n=11) than discrimination deterioration (n=3). Mitigation strategies were categorized as model level or feature level. Model-level approaches (n=15) were more common than feature-level approaches (n=2), with the most common approaches being model refitting (n=12), probability calibration (n=7), model updating (n=6), and model selection (n=6). In general, all mitigation strategies were successful at preserving calibration but not uniformly successful in preserving discrimination. CONCLUSION: There was limited research in preserving the performance of machine learning models in the presence of temporal dataset shift in clinical medicine. Future research could focus on the impact of dataset shift on clinical decision making, benchmark the mitigation strategies on a wider range of datasets and tasks, and identify optimal strategies for specific settings.
View details for DOI 10.1055/s-0041-1735184
View details for PubMedID 34470057
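To make the calibration-related terms in the abstract above concrete, here is a minimal sketch of one simple form of probability recalibration: shifting the model's intercept in logit space so that the mean predicted risk matches the observed event rate in recent data. This is an assumed, illustrative choice and not a procedure taken from any of the reviewed studies; all function names and data are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def calibration_in_the_large(preds, outcomes):
    """Mean predicted risk minus observed event rate (0 means no gap)."""
    return sum(preds) / len(preds) - sum(outcomes) / len(outcomes)


def recalibrate_intercept(preds, outcomes, tol=1e-9):
    """Find the logit-space shift `delta` that makes the mean prediction
    equal the observed event rate, using bisection on [-10, 10].

    Assumes every prediction lies strictly between 0 and 1 and the
    observed event rate is neither 0 nor 1.
    """
    target = sum(outcomes) / len(outcomes)
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        mean_pred = sum(sigmoid(logit(p) + mid) for p in preds) / len(preds)
        if mean_pred < target:
            lo = mid  # predictions still too low, shift further up
        else:
            hi = mid
    delta = (lo + hi) / 2
    return delta, [sigmoid(logit(p) + delta) for p in preds]


# Synthetic example: an older model under-predicts risk on newer data.
preds = [0.1, 0.2, 0.3, 0.4]
outcomes = [0, 1, 1, 1]

gap_before = calibration_in_the_large(preds, outcomes)
delta, recalibrated = recalibrate_intercept(preds, outcomes)
gap_after = calibration_in_the_large(recalibrated, outcomes)
print(f"gap before = {gap_before:.3f}, gap after = {gap_after:.6f}")
```

Because the shift is monotone, it leaves the ranking of patients unchanged. In this sketch the recalibration therefore repairs calibration without altering discrimination.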
An empirical characterization of fair machine learning for clinical risk prediction.
Journal of biomedical informatics
The use of machine learning to guide clinical decision making has the potential to worsen existing health disparities. Several recent works frame the problem as that of algorithmic fairness, a framework that has attracted considerable attention and criticism. However, the appropriateness of this framework is unclear due to both ethical and technical considerations, the latter of which include trade-offs between measures of fairness and model performance that are not well understood for predictive models of clinical outcomes. To inform the ongoing debate, we conduct an empirical study to characterize the impact of penalizing group fairness violations on an array of measures of model performance and group fairness. We repeat the analysis across multiple observational healthcare databases, clinical outcomes, and sensitive attributes. We find that procedures that penalize differences between the distributions of predictions across groups induce nearly universal degradation of multiple performance metrics within groups. On examining the secondary impact of these procedures, we observe heterogeneity in their effect on measures of fairness in calibration and ranking across experimental conditions. Beyond the reported trade-offs, we emphasize that analyses of algorithmic fairness in healthcare lack the contextual grounding and causal awareness necessary to reason about the mechanisms that lead to health disparities, as well as about the potential of algorithmic fairness methods to counteract those mechanisms. In light of these limitations, we encourage researchers building predictive models for clinical use to step outside the algorithmic fairness frame and engage critically with the broader sociotechnical context surrounding the use of machine learning in healthcare.
View details for DOI 10.1016/j.jbi.2020.103621
View details for PubMedID 33220494
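The phrase "penalizing group fairness violations" in the abstract above can be sketched as a training objective with an added penalty term. The version below uses the squared gap in mean predicted risk between groups as a deliberately simple stand-in for a penalty on differences between prediction distributions. The study's actual penalties are not given in the abstract, so the names `mean_prediction_gap` and `penalized_objective`, the weight `lam`, and the data are all assumptions.

```python
import math


def log_loss(preds, labels):
    """Average binary cross-entropy of predicted probabilities."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(preds, labels)
    ) / len(preds)


def mean_prediction_gap(preds, groups):
    """Largest difference in mean prediction between any two groups."""
    by_group = {}
    for p, g in zip(preds, groups):
        by_group.setdefault(g, []).append(p)
    means = [sum(v) / len(v) for v in by_group.values()]
    return max(means) - min(means)


def penalized_objective(preds, labels, groups, lam):
    """Log loss plus `lam` times the squared between-group prediction gap.

    A larger `lam` trades predictive fit for smaller differences in
    predictions across groups.
    """
    return log_loss(preds, labels) + lam * mean_prediction_gap(preds, groups) ** 2


# Synthetic, illustrative data only.
preds = [0.2, 0.4, 0.6, 0.8]
labels = [0, 0, 1, 1]
groups = ["a", "a", "b", "b"]

for lam in (0.0, 1.0):
    print(lam, round(penalized_objective(preds, labels, groups, lam), 4))
```

Minimizing such an objective pulls the groups' predictions together. When outcome rates genuinely differ between groups, this is one route to the within-group performance degradation the abstract reports.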
Development and validation of a prognostic model predicting symptomatic hemorrhagic transformation in acute ischemic stroke at scale in the OHDSI network.
PloS one
2020; 15 (1): e0226718
BACKGROUND AND PURPOSE: Hemorrhagic transformation (HT) after cerebral infarction is a complex and multifactorial phenomenon in the acute stage of ischemic stroke, and often results in a poor prognosis. Thus, identifying risk factors and making an early prediction of HT in acute cerebral infarction contributes not only to the selection of a therapeutic regimen but also, more importantly, to the improvement of prognosis of acute cerebral infarction. The purpose of this study was to develop and validate a model to predict a patient's risk of HT within 30 days of initial ischemic stroke. METHODS: We utilized a retrospective multicenter observational cohort study design to develop a Lasso Logistic Regression prediction model with a large US Electronic Health Record dataset structured to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). To examine clinical transportability, the model was externally validated across 10 additional real-world healthcare datasets that include EHR records for patients from America, Europe and Asia. RESULTS: In the database in which the model was developed, the target population cohort contained 621,178 patients with ischemic stroke, of whom 5,624 had HT within 30 days following initial ischemic stroke. 612 risk predictors, including the distance a patient travels in an ambulance to get to care for a HT, were identified. An area under the receiver operating characteristic curve (AUC) of 0.75 was achieved in the internal validation of the risk model. External validation was performed across 10 databases totaling 5,515,508 patients with ischemic stroke, of whom 86,401 had HT within 30 days following initial ischemic stroke. The mean external AUC was 0.71 and ranged between 0.60 and 0.78. CONCLUSIONS: A HT prognostic prediction model was developed with Lasso Logistic Regression based on routinely collected EMR data.
This model can identify patients who have a higher risk of HT than the population average with an AUC of 0.78. It shows the OMOP CDM is an appropriate data standard for EMR secondary use in clinical multicenter research for prognostic prediction model development and validation. In the future, combining this model with clinical information systems will help clinicians make the right therapy decisions for patients with acute ischemic stroke.
View details for DOI 10.1371/journal.pone.0226718
View details for PubMedID 31910437
Language models are an effective representation learning technique for electronic health record data.
Journal of biomedical informatics
Widespread adoption of electronic health records (EHRs) has fueled the use of machine learning to build prediction models for various clinical outcomes. However, this process is often constrained by having a relatively small number of patient records for training the model. We demonstrate that using patient representation schemes inspired by techniques in natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, where only a subset of the population is relevant. Such patient representation schemes enable a 3.5% mean improvement in AUROC on five prediction tasks compared to standard baselines, with the average improvement rising to 19% when only a small number of patient records are available for training the clinical prediction model.
View details for DOI 10.1016/j.jbi.2020.103637
View details for PubMedID 33290879
The Effectiveness of Multitask Learning for Phenotyping with Electronic Health Records Data.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2019; 24: 18–29
Electronic phenotyping is the task of ascertaining whether an individual has a medical condition of interest by analyzing their medical record and is foundational in clinical informatics. Increasingly, electronic phenotyping is performed via supervised learning. We investigate the effectiveness of multitask learning for phenotyping using electronic health records (EHR) data. Multitask learning aims to improve model performance on a target task by jointly learning additional auxiliary tasks and has been used in disparate areas of machine learning. However, its utility when applied to EHR data has not been established, and prior work suggests that its benefits are inconsistent. We present experiments that elucidate when multitask learning with neural nets improves performance for phenotyping using EHR data relative to neural nets trained for a single phenotype and to well-tuned baselines. We find that multitask neural nets consistently outperform single-task neural nets for rare phenotypes but underperform for relatively more common phenotypes. The effect size increases as more auxiliary tasks are added. Moreover, multitask learning reduces the sensitivity of neural nets to hyperparameter settings for rare phenotypes. Last, we quantify phenotype complexity and find that neural nets trained with or without multitask learning do not improve on simple baselines unless the phenotypes are sufficiently complex.
View details for PubMedID 30864307
Creating Fair Models of Atherosclerotic Cardiovascular Disease.
ASSOC COMPUTING MACHINERY
2019: 271–78
Unraveling the Complexity of Amyotrophic Lateral Sclerosis Survival Prediction.
Frontiers in neuroinformatics
2018; 12: 36
Objective: The heterogeneity of amyotrophic lateral sclerosis (ALS) survival duration, which varies from <1 year to >10 years, challenges clinical decisions and trials. Utilizing data from 801 deceased ALS patients, we: (1) assess the underlying complex relationships among common clinical ALS metrics; (2) identify which clinical ALS metrics are the "best" survival predictors and how their predictive ability changes as a function of disease progression. Methods: Analyses included examination of relationships within the raw data as well as the construction of interactive survival regression and classification models (generalized linear model and random forests model). Dimensionality reduction and feature clustering enabled decomposition of clinical variable contributions. Thirty-eight metrics were utilized, including Medical Research Council (MRC) muscle scores; respiratory function, including forced vital capacity (FVC) and FVC % predicted, oxygen saturation, negative inspiratory force (NIF); the Revised ALS Functional Rating Scale (ALSFRS-R) and its activities of daily living (ADL) and respiratory sub-scores; body weight; onset type, onset age, gender, and height. Prognostic random forest models confirm the dominance of patient age-related parameters decline in classifying survival at thresholds of 30, 60, 90, and 180 days and 1, 2, 3, 4, and 5 years. Results: Collective prognostic insight derived from the overall investigation includes: multi-dimensionality of ALSFRS-R scores suggests cautious usage for survival forecasting; upper and lower extremities independently degenerate and are autonomous from respiratory decline, with the latter associating with nearer-to-death classifications; height and weight-based metrics are auxiliary predictors for farther-from-death classifications; sex and onset site (limb, bulbar) are not independent survival predictors due to age co-correlation. 
Conclusion: The dimensionality and fluctuating predictors of ALS survival must be considered when developing predictive models for clinical trial development or in-clinic usage. Additional independent metrics and possible revisions to current metrics, like the ALSFRS-R, are needed to capture the underlying complexity needed for population and personalized forecasting of survival.
View details for PubMedID 29962944
View details for PubMedCentralID PMC6010549