An empirical characterization of fair machine learning for clinical risk prediction.
Journal of biomedical informatics
The use of machine learning to guide clinical decision making has the potential to worsen existing health disparities. Several recent works frame the problem as that of algorithmic fairness, a framework that has attracted considerable attention and criticism. However, the appropriateness of this framework is unclear due to both ethical as well as technical considerations, the latter of which include trade-offs between measures of fairness and model performance that are not well-understood for predictive models of clinical outcomes. To inform the ongoing debate, we conduct an empirical study to characterize the impact of penalizing group fairness violations on an array of measures of model performance and group fairness. We repeat the analysis across multiple observational healthcare databases, clinical outcomes, and sensitive attributes. We find that procedures that penalize differences between the distributions of predictions across groups induce nearly-universal degradation of multiple performance metrics within groups. On examining the secondary impact of these procedures, we observe heterogeneity of the effect of these procedures on measures of fairness in calibration and ranking across experimental conditions. Beyond the reported trade-offs, we emphasize that analyses of algorithmic fairness in healthcare lack the contextual grounding and causal awareness necessary to reason about the mechanisms that lead to health disparities, as well as about the potential of algorithmic fairness methods to counteract those mechanisms. In light of these limitations, we encourage researchers building predictive models for clinical use to step outside the algorithmic fairness frame and engage critically with the broader sociotechnical context surrounding the use of machine learning in healthcare.
View details for DOI 10.1016/j.jbi.2020.103621
View details for PubMedID 33220494
Development and validation of a prognostic model predicting symptomatic hemorrhagic transformation in acute ischemic stroke at scale in the OHDSI network.
2020; 15 (1): e0226718
BACKGROUND AND PURPOSE: Hemorrhagic transformation (HT) after cerebral infarction is a complex and multifactorial phenomenon in the acute stage of ischemic stroke, and often results in a poor prognosis. Thus, identifying risk factors and making an early prediction of HT in acute cerebral infarction contributes not only to the selections of therapeutic regimen but also, more importantly, to the improvement of prognosis of acute cerebral infarction. The purpose of this study was to develop and validate a model to predict a patient's risk of HT within 30 days of initial ischemic stroke.METHODS: We utilized a retrospective multicenter observational cohort study design to develop a Lasso Logistic Regression prediction model with a large, US Electronic Health Record dataset which structured to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). To examine clinical transportability, the model was externally validated across 10 additional real-world healthcare datasets include EHR records for patients from America, Europe and Asia.RESULTS: In the database the model was developed, the target population cohort contained 621,178 patients with ischemic stroke, of which 5,624 patients had HT within 30 days following initial ischemic stroke. 612 risk predictors, including the distance a patient travels in an ambulance to get to care for a HT, were identified. An area under the receiver operating characteristic curve (AUC) of 0.75 was achieved in the internal validation of the risk model. External validation was performed across 10 databases totaling 5,515,508 patients with ischemic stroke, of which 86,401 patients had HT within 30 days following initial ischemic stroke. The mean external AUC was 0.71 and ranged between 0.60-0.78.CONCLUSIONS: A HT prognostic predict model was developed with Lasso Logistic Regression based on routinely collected EMR data. This model can identify patients who have a higher risk of HT than the population average with an AUC of 0.78. It shows the OMOP CDM is an appropriate data standard for EMR secondary use in clinical multicenter research for prognostic prediction model development and validation. In the future, combining this model with clinical information systems will assist clinicians to make the right therapy decision for patients with acute ischemic stroke.
View details for DOI 10.1371/journal.pone.0226718
View details for PubMedID 31910437
Language models are an effective representation learning technique for electronic health record data.
Journal of biomedical informatics
Widespread adoption of electronic health records (EHRs) has fueled the development of using machine learning to build prediction models for various clinical outcomes. However, this process is often constrained by having a relatively small number of patient records for training the model. We demonstrate that using patient representation schemes inspired from techniques in natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, where only a subset of the population is relevant. Such patient representation schemes enable a 3.5% mean improvement in AUROC on five prediction tasks compared to standard baselines, with the average improvement rising to 19% when only a small number of patient records are available for training the clinical prediction model.
View details for DOI 10.1016/j.jbi.2020.103637
View details for PubMedID 33290879
The Effectiveness of Multitask Learning for Phenotyping with Electronic Health Records Data.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2019; 24: 18–29
Electronic phenotyping is the task of ascertaining whether an individual has a medical condition of interest by analyzing their medical record and is foundational in clinical informatics. Increasingly, electronic phenotyping is performed via supervised learning. We investigate the effectiveness of multitask learning for phenotyping using electronic health records (EHR) data. Multitask learning aims to improve model performance on a target task by jointly learning additional auxiliary tasks and has been used in disparate areas of machine learning. However, its utility when applied to EHR data has not been established, and prior work suggests that its benefits are inconsistent. We present experiments that elucidate when multitask learning with neural nets improves performance for phenotyping using EHR data relative to neural nets trained for a single phenotype and to well-tuned baselines. We find that multitask neural nets consistently outperform single-task neural nets for rare phenotypes but underperform for relatively more common phenotypes. The effect size increases as more auxiliary tasks are added. Moreover, multitask learning reduces the sensitivity of neural nets to hyperparameter settings for rare phenotypes. Last, we quantify phenotype complexity and find that neural nets trained with or without multitask learning do not improve on simple baselines unless the phenotypes are sufficiently complex.
View details for PubMedID 30864307
- Creating Fair Models of Atherosclerotic Cardiovascular Disease ASSOC COMPUTING MACHINERY. 2019: 271–78
- Unraveling the Complexity of Amyotrophic Lateral Sclerosis Survival Prediction FRONTIERS IN NEUROINFORMATICS 2018; 12