The Stanford Medicine data science ecosystem for clinical and translational research.
2023; 6 (3): ooad054
To describe the infrastructure, tools, and services developed at Stanford Medicine to maintain its data science ecosystem and research patient data repository for clinical and translational research. The data science ecosystem, dubbed the Stanford Data Science Resources (SDSR), includes infrastructure and tools to create, search, retrieve, and analyze patient data, as well as services for data deidentification, linkage, and processing to extract high-value information from healthcare IT systems. Data are made available via self-service and concierge access, on HIPAA compliant secure computing infrastructure supported by in-depth user training. The Stanford Medicine Research Data Repository (STARR) functions as the SDSR data integration point, and includes electronic medical records, clinical images, text, bedside monitoring data and HL7 messages. SDSR tools include those for electronic phenotyping, cohort building, and a search engine for patient timelines. The SDSR supports patient data collection, reproducible research, and teaching using healthcare data, and facilitates industry collaborations and large-scale observational studies. Research patient data repositories and their underlying data science infrastructure are essential to realizing a learning health system and advancing the mission of academic medical centers. Challenges to maintaining the SDSR include ensuring sufficient financial support while providing researchers and clinicians with maximal access to data and digital infrastructure, balancing tool development with user training, and supporting the diverse needs of users. Our experience maintaining the SDSR offers a case study for academic medical centers developing data science and research informatics infrastructure.
View details for DOI 10.1093/jamiaopen/ooad054
View details for PubMedID 37545984
View details for PubMedCentralID PMC10397535
The development of a mobile app-focused deduplication strategy for the Apple Heart Study that informs recommendations for future digital trials.
Stat (International Statistical Institute)
2022; 11 (1): e470
An app-based clinical trial enrolment process can contribute to duplicated records, carrying data management implications. Our objective was to identify duplicated records in real time in the Apple Heart Study (AHS). We leveraged personal identifiable information (PII) to develop a dissimilarity score (DS) using the Damerau-Levenshtein distance. For computational efficiency, we focused on four types of records at the highest risk of duplication. We used the receiver operating characteristic (ROC) curve and resampling methods to derive and validate a decision rule to classify duplicated records. We identified 16,398 (4%) duplicated participants, resulting in 419,297 unique participants out of a total of 438,435 possible. Our decision rule yielded a high positive predictive value (96%) with negligible impact on the trial's original findings. Our findings provide principled solutions for future digital trials. When establishing deduplication procedures for digital trials, we recommend collecting device identifiers in addition to participant identifiers; collecting and ensuring secure access to PII; conducting a pilot study to identify reasons for duplicated records; establishing an initial deduplication algorithm that can be refined; creating a data quality plan that informs refinement; and embedding the initial deduplication algorithm in the enrolment platform to ensure unique enrolment and linkage to previous records.
View details for DOI 10.1002/sta4.470
View details for PubMedID 36589778
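The dissimilarity score described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the field names, the per-field length normalization, and the averaging are hypothetical, and the function implements the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance rather than the study's actual scoring rule, which the abstract does not specify.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    insertions, deletions, substitutions, and adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ht" <-> "th"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def dissimilarity(rec_a: dict, rec_b: dict,
                  fields=("first_name", "last_name", "email")) -> float:
    """Hypothetical score in [0, 1]: mean of length-normalized distances.
    The field list and the normalization are illustrative assumptions."""
    total = 0.0
    for field in fields:
        x = rec_a[field].strip().lower()
        y = rec_b[field].strip().lower()
        longest = max(len(x), len(y))
        total += osa_distance(x, y) / longest if longest else 0.0
    return total / len(fields)
```

In a sketch like this, a pair of records would be flagged as a likely duplicate when the score falls below a cutoff. The abstract reports deriving and validating such a decision rule with ROC analysis and resampling, so any fixed cutoff chosen here would be a placeholder.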
Lessons learned in the Apple Heart Study and implications for the data management of future digital clinical trials.
Journal of biopharmaceutical statistics
The digital clinical trial is fast emerging as a pragmatic trial that can improve a trial's design including recruitment and retention, data collection and analytics. To that end, digital platforms such as electronic health records or wearable technologies that enable passive data collection can be leveraged, alleviating burden from the participant and study coordinator. However, there are challenges. For example, many of these data sources not originally intended for research may be noisier than traditionally obtained measures. Further, the secure flow of passively collected data and their integration for analysis is non-trivial. The Apple Heart Study was a prospective, single-arm, site-less digital trial designed to evaluate the ability of an app to detect atrial fibrillation. The study was designed with pragmatic features, such as an app for enrollment, a wearable device (the Apple Watch) for data collection, and electronic surveys for participant-reported outcomes that enabled a high volume of patient enrollment and accompanying data. These elements led to challenges including identifying the number of unique participants, maintaining participant-level linkage of multiple complex data streams, and participant adherence and engagement. Novel solutions were derived that inform future designs with an emphasis on data management. We build upon the excellent framework of the Clinical Trials Transformation Initiative to provide a comprehensive set of guidelines for data management of the digital clinical trial that include an increased role of collaborative data scientists in the design and conduct of the modern digital trial.
View details for DOI 10.1080/10543406.2022.2080698
View details for PubMedID 35695137
Arrhythmias Other Than Atrial Fibrillation in Those With an Irregular Pulse Detected With a Smartwatch: Findings From the Apple Heart Study.
Circulation. Arrhythmia and electrophysiology
The Apple Watch irregular pulse detection algorithm was found to have a positive predictive value of 0.84 for identification of atrial fibrillation (AF). We sought to describe the prevalence of arrhythmias other than AF in those with an irregular pulse detected on a smartwatch. The Apple Heart Study investigated a smartwatch-based irregular pulse notification algorithm to identify AF. For this secondary analysis, we analyzed participants who received an ambulatory ECG patch after index irregular pulse notification. We excluded participants with AF identified on ECG patch and described the prevalence of other arrhythmias on the remaining participant ECG patches. We also reported the proportion of participants self-reporting subsequent AF diagnosis. Among 419 297 participants enrolled in the Apple Heart Study, 450 participant ECG patches were analyzed, with no AF on 297 ECG patches (66%). Non-AF arrhythmias (excluding supraventricular tachycardias <30 beats and pauses <3 seconds) were detected in 119 participants (40.1%) with ECG patches without AF. The most common arrhythmias were frequent PACs (burden ≥1% to <5%, 15.8%; ≥5% to <15%, 8.8%), atrial tachycardia (≥30 beats, 5.4%), frequent PVCs (burden ≥1% to <5%, 6.1%; ≥5% to <15%, 2.7%), and nonsustained ventricular tachycardia (4-7 beats, 6.4%; ≥8 beats, 3.7%). Of 249 participants with no AF detected on ECG patch and patient-reported data available, 76 participants (30.5%) reported subsequent AF diagnosis. In participants with an irregular pulse notification on the Apple Watch and no AF observed on ECG patch, atrial and ventricular arrhythmias, mostly PACs and PVCs, were detected in 40% of participants. Defining optimal care for patients with detection of incidental arrhythmias other than AF is important as AF detection is further investigated, implemented, and refined.
View details for DOI 10.1161/CIRCEP.121.010063
View details for PubMedID 34565178
Apple Watch App Identifies Clinically Important Arrhythmias Other Than Atrial Fibrillation: Results From the Apple Heart Study
LIPPINCOTT WILLIAMS & WILKINS. 2019: E988
View details for Web of Science ID 000508228600061
Large-Scale Assessment of a Smartwatch to Identify Atrial Fibrillation.
The New England journal of medicine
2019; 381 (20): 1909–17
BACKGROUND: Optical sensors on wearable devices can detect irregular pulses. The ability of a smartwatch application (app) to identify atrial fibrillation during typical use is unknown. METHODS: Participants without atrial fibrillation (as reported by the participants themselves) used a smartphone (Apple iPhone) app to consent to monitoring. If a smartwatch-based irregular pulse notification algorithm identified possible atrial fibrillation, a telemedicine visit was initiated and an electrocardiography (ECG) patch was mailed to the participant, to be worn for up to 7 days. Surveys were administered 90 days after notification of the irregular pulse and at the end of the study. The main objectives were to estimate the proportion of notified participants with atrial fibrillation shown on an ECG patch and the positive predictive value of irregular pulse intervals with a targeted confidence interval width of 0.10. RESULTS: We recruited 419,297 participants over 8 months. Over a median of 117 days of monitoring, 2161 participants (0.52%) received notifications of irregular pulse. Among the 450 participants who returned ECG patches containing data that could be analyzed - which had been applied, on average, 13 days after notification - atrial fibrillation was present in 34% (97.5% confidence interval [CI], 29 to 39) overall and in 35% (97.5% CI, 27 to 43) of participants 65 years of age or older. Among participants who were notified of an irregular pulse, the positive predictive value was 0.84 (95% CI, 0.76 to 0.92) for observing atrial fibrillation on the ECG simultaneously with a subsequent irregular pulse notification and 0.71 (97.5% CI, 0.69 to 0.74) for observing atrial fibrillation on the ECG simultaneously with a subsequent irregular tachogram. Of 1376 notified participants who returned a 90-day survey, 57% contacted health care providers outside the study. There were no reports of serious app-related adverse events. CONCLUSIONS: The probability of receiving an irregular pulse notification was low. Among participants who received notification of an irregular pulse, 34% had atrial fibrillation on subsequent ECG patch readings and 84% of notifications were concordant with atrial fibrillation. This siteless (no on-site visits were required for the participants), pragmatic study design provides a foundation for large-scale pragmatic studies in which outcomes or adherence can be reliably assessed with user-owned devices. (Funded by Apple; Apple Heart Study ClinicalTrials.gov number, NCT03335800.)
View details for DOI 10.1056/NEJMoa1901183
View details for PubMedID 31722151
Rationale and design of a large-scale, app-based study to identify cardiac arrhythmias using a smartwatch: The Apple Heart Study.
American heart journal
2019; 207: 66–75
BACKGROUND: Smartwatch and fitness band wearable consumer electronics can passively measure pulse rate from the wrist using photoplethysmography (PPG). Identification of pulse irregularity or variability from these data has the potential to identify atrial fibrillation or atrial flutter (AF, collectively). The rapidly expanding consumer base of these devices allows for detection of undiagnosed AF at scale. METHODS: The Apple Heart Study is a prospective, single arm pragmatic study that has enrolled 419,093 participants (NCT03335800). The primary objective is to measure the proportion of participants with an irregular pulse detected by the Apple Watch (Apple Inc, Cupertino, CA) with AF on subsequent ambulatory ECG patch monitoring. The secondary objectives are to: 1) characterize the concordance of pulse irregularity notification episodes from the Apple Watch with simultaneously recorded ambulatory ECGs; 2) estimate the rate of initial contact with a health care provider within 3 months after notification of pulse irregularity. The study is conducted virtually, with screening, consent and data collection performed electronically from within an accompanying smartphone app. Study visits are performed by telehealth study physicians via video chat through the app, and ambulatory ECG patches are mailed to the participants. CONCLUSIONS: The results of this trial will provide initial evidence for the ability of a smartwatch algorithm to identify pulse irregularity and variability which may reflect previously unknown AF. The Apple Heart Study will help provide a foundation for how wearable technology can inform the clinical approach to AF identification and screening.
View details for PubMedID 30392584
Cohort Discovery Query Optimization via Computable Controlled Vocabulary Versioning.
Studies in health technology and informatics
2015; 216: 1084-?
Self-service cohort discovery tools strive to provide intuitive interfaces to large Clinical Data Warehouses that contain extensive historic information. In those tools, controlled vocabulary (e.g., ICD-9-CM, CPT) coded clinical information is often the main search criteria used because of its ubiquity in billing processes. These tools generally require a researcher to pick specific terms from the controlled vocabulary. However, controlled vocabularies evolve over time as medical knowledge changes and can even be replaced with new versions (e.g., ICD-9 to ICD-10). These tools generally only display the current version of the controlled vocabulary. Researchers should not be expected to understand the underlying controlled vocabulary versioning issues. We propose a computable controlled vocabulary versioning system that allows cohort discovery tools to automatically expand queries to account for terminology changes.
View details for PubMedID 26262383
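The query expansion idea in the abstract above can be sketched as a graph walk over version-to-version term mappings. This is a minimal illustration under stated assumptions, not the system the paper describes: the mapping table, the vocabulary labels, and the placeholder codes (`OLD-1`, `NEW-A`, and so on) are all hypothetical, and mappings are treated as symmetric equivalences, which real crosswalks often are not.

```python
from collections import defaultdict, deque

# Hypothetical mapping table: each pair links a term in one vocabulary
# version to its counterpart in another. Codes are placeholders, not
# real ICD or CPT codes.
MAPPINGS = [
    (("VOCAB-v1", "OLD-1"), ("VOCAB-v2", "MID-1")),
    (("VOCAB-v2", "MID-1"), ("VOCAB-v3", "NEW-A")),
    (("VOCAB-v2", "MID-1"), ("VOCAB-v3", "NEW-B")),  # one term split in two
]


def build_index(mappings):
    """Index mappings in both directions so expansion works from any version."""
    index = defaultdict(set)
    for older, newer in mappings:
        index[older].add(newer)
        index[newer].add(older)
    return index


def expand_term(term, index):
    """Return the selected term plus every term reachable through mappings."""
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for neighbor in index.get(current, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


index = build_index(MAPPINGS)
expanded = expand_term(("VOCAB-v3", "NEW-A"), index)
```

Under these assumptions, a researcher who picks only the current term would have the query rewritten to match records coded with any of the linked historical terms. Treating mappings as symmetric also pulls in sibling terms such as `NEW-B`, which broadens the cohort; a production system would likely need directional or typed mappings to control that.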
Pharmacovigilance using clinical notes.
Clinical pharmacology & therapeutics
2013; 93 (6): 547-555
With increasing adoption of electronic health records (EHRs), there is an opportunity to use the free-text portion of EHRs for pharmacovigilance. We present novel methods that annotate the unstructured clinical notes and transform them into a deidentified patient-feature matrix encoded using medical terminologies. We demonstrate the use of the resulting high-throughput data for detecting drug-adverse event associations and adverse events associated with drug-drug interactions. We show that these methods flag adverse events early (in most cases before an official alert), allow filtering of spurious signals by adjusting for potential confounding, and compile prevalence information. We argue that analyzing large volumes of free-text clinical notes enables drug safety surveillance using a yet untapped data source. Such data mining can be used for hypothesis generation and for rapid analysis of suspected adverse event risk.
View details for DOI 10.1038/clpt.2013.47
View details for PubMedID 23571773
A simple heuristic for blindfolded record linkage
JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
2012; 19 (E1): E157-E161
To address the challenge of balancing privacy with the need to create cross-site research registry records on individual patients, while matching the data for a given patient as he or she moves between participating sites. To evaluate the strategy of generating anonymous identifiers based on real identifiers in such a way that the chances of a shared patient being accurately identified were maximized, and the chances of incorrectly joining two records belonging to different people were minimized. Our hypothesis was that most variation in names occurs after the first two letters, and that date of birth is highly reliable, so a single match variable consisting of a hashed string built from the first two letters of the patient's first and last names plus their date of birth would have the desired characteristics. We compared and contrasted the match algorithm characteristics (rate of false positive v. rate of false negative) for our chosen variable against both Social Security Numbers and full names. In a data set of 19 000 records, a derived match variable consisting of a 2-character prefix from both first and last names combined with date of birth has a 97% sensitivity; by contrast, an anonymized identifier based on the patient's full names and date of birth has a sensitivity of only 87% and SSN has sensitivity 86%. The approach we describe is most useful in situations where privacy policies preclude the full exchange of the identifiers required by more sophisticated and sensitive linkage algorithms. For data sets of sufficiently high quality this effective approach, while producing a lower rate of matching than more complex algorithms, has the merit of being easy to explain to institutional review boards, adheres to the minimum necessary rule of the HIPAA privacy rule, and is faster and less cumbersome to implement than a full probabilistic linkage.
View details for DOI 10.1136/amiajnl-2011-000329
View details for PubMedID 22298567
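The match variable described in the abstract above (a hashed string built from the first two letters of the first and last names plus date of birth) can be sketched as follows. The choice of SHA-256, the lowercase normalization, the ISO date format, and the optional shared `salt` parameter are assumptions for illustration; the abstract does not state which hash function or normalization the authors used.

```python
import hashlib


def match_key(first_name: str, last_name: str, date_of_birth: str,
              salt: str = "") -> str:
    """Build an anonymous linkage key from name prefixes and date of birth.

    date_of_birth is assumed to be pre-normalized (e.g. 'YYYY-MM-DD');
    salt is a hypothetical secret shared by participating sites.
    """
    prefix_first = first_name.strip().lower()[:2]
    prefix_last = last_name.strip().lower()[:2]
    token = f"{prefix_first}{prefix_last}{date_of_birth}"
    return hashlib.sha256((salt + token).encode("utf-8")).hexdigest()


# Spelling variation after the first two letters does not change the key.
key_site_a = match_key("Jonathan", "Smith", "1970-01-31")
key_site_b = match_key("Jon", "Smithe", "1970-01-31")
```

Each site would compute the key locally and exchange only the digest, so records for the same patient can be joined without sharing the underlying identifiers. Because the input space of name prefixes and birth dates is small, an unsalted digest is open to brute-force enumeration, which is why a shared secret is included here as an optional parameter.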
Managing Medical Vocabulary Updates in a Clinical Data Warehouse: An RxNorm Case Study.
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
2010; 2010: 477-481
Use of terminology standards facilitates aggregating data from multiple sources for information retrieval, exchange and analysis. However, medical vocabularies are continuously updated and incorporating those changes consistently into clinical data warehouses requires rigorous methodology. To integrate pharmacy data from two hospital pharmacy information systems the Stanford Translational Research Integrated Database Environment (STRIDE) project mapped medication orders to RxNorm content using the RxNorm drug model. In order to keep the data relevant and up-to-date, we developed a strategy for updating to RxNorm, while preserving the original meaning and mapping of the legacy data. This case study discusses managing the vocabulary update by following the RxNorm content maintenance strategy and supplementing it with operations to retain access to its drug model information.
View details for PubMedID 21347024
Automated mapping of pharmacy orders from two electronic health record systems to RxNorm within the STRIDE clinical data warehouse.
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
2009; 2009: 244-248
The Stanford Translational Research Integrated Database Environment (STRIDE) clinical data warehouse integrates medication information from two Stanford hospitals that use different drug representation systems. To merge this pharmacy data into a single, standards-based model supporting research we developed an algorithm to map HL7 pharmacy orders to RxNorm concepts. A formal evaluation of this algorithm on 1.5 million pharmacy orders showed that the system could accurately assign pharmacy orders in over 96% of cases. This paper describes the algorithm and discusses some of the causes of failures in mapping to RxNorm.
View details for PubMedID 20351858
STRIDE--An integrated standards-based translational research informatics platform.
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
2009; 2009: 391-395
STRIDE (Stanford Translational Research Integrated Database Environment) is a research and development project at Stanford University to create a standards-based informatics platform supporting clinical and translational research. STRIDE consists of three integrated components: a clinical data warehouse, based on the HL7 Reference Information Model (RIM), containing clinical information on over 1.3 million pediatric and adult patients cared for at Stanford University Medical Center since 1995; an application development framework for building research data management applications on the STRIDE platform and a biospecimen data management system. STRIDE's semantic model uses standardized terminologies, such as SNOMED, RxNorm, ICD and CPT, to represent important biomedical concepts and their relationships. The system is in daily use at Stanford and is an important component of Stanford University's CTSA (Clinical and Translational Science Award) Informatics Program.
View details for PubMedID 20351886
Novel integration of hospital electronic medical records and gene expression measurements to identify genetic markers of maturation.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
Traditionally, the elucidation of genes involved in maturation and aging has been studied in a temporal fashion by examining gene expression at different time points in an organism's life as well as by knocking out, knocking in, and mutating genes thought to be involved. Here, we propose an in silico method to combine clinical electronic medical record (EMR) data and gene expression measurements in the context of disease to identify genes that may be involved in the process of human maturation and aging. First we show that absolute lymphocyte count may serve as a biomarker for maturation by using statistical methods to compare trends among different clinical laboratory tests in response to an increase in age. We then propose using the rate of decay for absolute lymphocyte count across 12 diseases as a proxy for differences in aging. We correlate the differing rates with gene expression across the same diseases to find maturation/aging related genes. Among the 53 genes with strongest correlations between expression profile and change in rate of decay, we found genes previously implicated in the process of aging, including MGMT (DNA repair), TERF2 (telomere stability), POLD1 (DNA replication and repair), and POLG (mtDNA replication).
View details for PubMedID 18229690
Clinical arrays of laboratory measures, or "clinarrays", built from an electronic health record enable disease subtyping by severity.
AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium
The severity of diseases has often been assigned by direct observation of a patient and by pathological examination after symptoms have appeared. As we move into the genomic era, the ability to predict disease severity prior to manifestation has improved dramatically due to genomic sequencing and analysis of gene expression microarrays. However, as the severity of diseases can be exacerbated by non genetic factors, the ability to predict disease severity by examining gene expression alone may be inadequate. We propose the creation of a "clinarray" to examine phenotypic expression in the form of clinical laboratory measurements. We demonstrate that the clinarray can be used to distinguish between the severities of patients with cystic fibrosis and those with Crohn's disease by applying unsupervised clustering methods that have been previously applied to microarrays.
View details for PubMedID 18693809
A proposed key escrow system for secure patient information disclosure in biomedical research databases
Annual Symposium of the American-Medical-Informatics-Association
HANLEY & BELFUS INC MED PUBLISHERS. 2002: 245–249
Access to clinical data is of increasing importance to biomedical research. The pending HIPAA privacy regulations provide specific requirements for the release of protected health information. Under the regulations, biomedical researchers may utilize anonymized data, or adhere to HIPAA requirements regarding protected health information. In order to provide researchers with anonymized data from a clinical research database, we reviewed several published strategies for de-identification of protected health information. Critical analysis with respect to this project suggests that de-identification alone is problematic when applied to clinical research databases. We propose a hybrid system; utilizing secure key escrow, de-identification, and role-based access for IRB approved researchers.
View details for PubMedID 12463824