Bio


I am a physicist by training and a biotechnologist by profession. I love systems-thinking approaches and complex problems.

Current Role at Stanford


My team is part of Technology & Digital Solutions. It is responsible for building and maintaining infrastructure such as STARR. The team also builds custom solutions such as CHOIR and SEAL to support Stanford Medicine clinical innovations. I joined Stanford in Oct 2012 at the Stanford Center for Genomics and Personalized Medicine (SCGPM). My responsibility at the Center was to develop and lead the bioinformatics team and establish a genomics data analysis facility. The team rocks and continues to deliver exceptional research innovations and services.

Education & Certifications


  • PhD, Boston University, MA, USA, Computational Physics (non-equilibrium statistical mechanics) (2000)
  • MSc, Indian Institute of Technology, Madras (aka Chennai), India, Physics (stochastic systems) (1994)
  • BSc, Jadavpur University, Calcutta (aka Kolkata), India, Physics (1992)

All Publications


  • The Stanford Medicine data science ecosystem for clinical and translational research. JAMIA open Callahan, A., Ashley, E., Datta, S., Desai, P., Ferris, T. A., Fries, J. A., Halaas, M., Langlotz, C. P., Mackey, S., Posada, J. D., Pfeffer, M. A., Shah, N. H. 2023; 6 (3): ooad054

    Abstract

    To describe the infrastructure, tools, and services developed at Stanford Medicine to maintain its data science ecosystem and research patient data repository for clinical and translational research. The data science ecosystem, dubbed the Stanford Data Science Resources (SDSR), includes infrastructure and tools to create, search, retrieve, and analyze patient data, as well as services for data deidentification, linkage, and processing to extract high-value information from healthcare IT systems. Data are made available via self-service and concierge access, on HIPAA compliant secure computing infrastructure supported by in-depth user training. The Stanford Medicine Research Data Repository (STARR) functions as the SDSR data integration point, and includes electronic medical records, clinical images, text, bedside monitoring data and HL7 messages. SDSR tools include tools for electronic phenotyping, cohort building, and a search engine for patient timelines. The SDSR supports patient data collection, reproducible research, and teaching using healthcare data, and facilitates industry collaborations and large-scale observational studies. Research patient data repositories and their underlying data science infrastructure are essential to realizing a learning health system and advancing the mission of academic medical centers. Challenges to maintaining the SDSR include ensuring sufficient financial support while providing researchers and clinicians with maximal access to data and digital infrastructure, balancing tool development with user training, and supporting the diverse needs of users. Our experience maintaining the SDSR offers a case study for academic medical centers developing data science and research informatics infrastructure.

    View details for DOI 10.1093/jamiaopen/ooad054

    View details for PubMedID 37545984

    View details for PubMedCentralID PMC10397535

  • A scalable, secure, and interoperable platform for deep data-driven health management. Nature communications Bahmani, A., Alavi, A., Buergel, T., Upadhyayula, S., Wang, Q., Ananthakrishnan, S. K., Alavi, A., Celis, D., Gillespie, D., Young, G., Xing, Z., Nguyen, M. H., Haque, A., Mathur, A., Payne, J., Mazaheri, G., Li, J. K., Kotipalli, P., Liao, L., Bhasin, R., Cha, K., Rolnik, B., Celli, A., Dagan-Rosenfeld, O., Higgs, E., Zhou, W., Berry, C. L., Van Winkle, K. G., Contrepois, K., Ray, U., Bettinger, K., Datta, S., Li, X., Snyder, M. P. 2021; 12 (1): 5757

    Abstract

    The large amount of biomedical data derived from wearable sensors, electronic health records, and molecular profiling (e.g., genomics data) is rapidly transforming our healthcare systems. The increasing scale and scope of biomedical data not only is generating enormous opportunities for improving health outcomes but also raises new challenges ranging from data acquisition and storage to data analysis and utilization. To meet these challenges, we developed the Personal Health Dashboard (PHD), which utilizes state-of-the-art security and scalability technologies to provide an end-to-end solution for big biomedical data analytics. The PHD platform is an open-source software framework that can be easily configured and deployed to any big data health project to store, organize, and process complex biomedical data sets, support real-time data analysis at both the individual level and the cohort level, and ensure participant privacy at every step. In addition to presenting the system, we illustrate the use of the PHD framework for large-scale applications in emerging multi-omics disease studies, such as collecting and visualization of diverse data types (wearable, clinical, omics) at a personal level, investigation of insulin resistance, and an infrastructure for the detection of presymptomatic COVID-19.

    View details for DOI 10.1038/s41467-021-26040-1

    View details for PubMedID 34599181

  • Benchmarking workflows to assess performance and suitability of germline variant calling pipelines in clinical diagnostic assays. BMC bioinformatics Krishnan, V., Utiramerur, S., Ng, Z., Datta, S., Snyder, M. P., Ashley, E. A. 2021; 22 (1): 85

    Abstract

    BACKGROUND: Benchmarking the performance of complex analytical pipelines is an essential part of developing Lab Developed Tests (LDT). Reference samples and benchmark calls published by Genome in a Bottle (GIAB) consortium have enabled the evaluation of analytical methods. The performance of such methods is not uniform across the different genomic regions of interest and variant types. Several benchmarking methods such as hap.py, vcfeval, and vcflib are available to assess the analytical performance characteristics of variant calling algorithms. However, assessing the performance characteristics of an overall LDT assay still requires stringing together several such methods and experienced bioinformaticians to interpret the results. In addition, these methods are dependent on the hardware, operating system and other software libraries, making it impossible to reliably repeat the analytical assessment, when any of the underlying dependencies change in the assay. Here we present a scalable and reproducible, cloud-based benchmarking workflow that is independent of the laboratory and the technician executing the workflow, or the underlying compute hardware used to rapidly and continually assess the performance of LDT assays, across their regions of interest and reportable range, using a broad set of benchmarking samples. RESULTS: The benchmarking workflow was used to evaluate the performance characteristics for secondary analysis pipelines commonly used by Clinical Genomics laboratories in their LDT assays such as the GATK HaplotypeCaller v3.7 and the SpeedSeq workflow based on FreeBayes v0.9.10. Five reference sample truth sets generated by Genome in a Bottle (GIAB) consortium, six samples from the Personal Genome Project (PGP) and several samples with validated clinically relevant variants from the Centers for Disease Control were used in this work. The performance characteristics were evaluated and compared for multiple reportable ranges, such as whole exome and the clinical exome. CONCLUSIONS: We have implemented a benchmarking workflow for clinical diagnostic laboratories that generates metrics such as specificity, precision and sensitivity for germline SNPs and InDels within a reportable range using whole exome or genome sequencing data. Combining these benchmarking results with validation using known variants of clinical significance in publicly available cell lines, we were able to establish the performance of variant calling pipelines in a clinical setting.

    View details for DOI 10.1186/s12859-020-03934-3

    View details for PubMedID 33627090

  • Arrhythmias Other Than Atrial Fibrillation in Those With an Irregular Pulse Detected With a Smartwatch: Findings From the Apple Heart Study. Circulation. Arrhythmia and electrophysiology Perino, A. C., Gummidipundi, S. E., Lee, J., Hedlin, H., Garcia, A., Ferris, T., Balasubramanian, V., Gardner, R. M., Cheung, L., Hung, G., Granger, C. B., Kowey, P., Rumsfeld, J. S., Russo, A. M., True Hills, M., Talati, N., Nag, D., Tsay, D., Desai, S., Desai, M., Mahaffey, K. W., Turakhia, M. P., Perez, M. V. 2021: CIRCEP121010063

    Abstract

    The Apple watch irregular pulse detection algorithm was found to have a positive predictive value of 0.84 for identification of atrial fibrillation (AF). We sought to describe the prevalence of arrhythmias other than AF in those with an irregular pulse detected on a smartwatch. The Apple Heart Study investigated a smartwatch-based irregular pulse notification algorithm to identify AF. For this secondary analysis, we analyzed participants who received an ambulatory ECG patch after index irregular pulse notification. We excluded participants with AF identified on ECG patch and described the prevalence of other arrhythmias on the remaining participant ECG patches. We also reported the proportion of participants self-reporting subsequent AF diagnosis. Among 419 297 participants enrolled in the Apple Heart Study, 450 participant ECG patches were analyzed, with no AF on 297 ECG patches (66%). Non-AF arrhythmias (excluding supraventricular tachycardias <30 beats and pauses <3 seconds) were detected in 119 participants (40.1%) with ECG patches without AF. The most common arrhythmias were frequent PACs (burden ≥1% to <5%, 15.8%; ≥5% to <15%, 8.8%), atrial tachycardia (≥30 beats, 5.4%), frequent PVCs (burden ≥1% to <5%, 6.1%; ≥5% to <15%, 2.7%), and nonsustained ventricular tachycardia (4-7 beats, 6.4%; ≥8 beats, 3.7%). Of 249 participants with no AF detected on ECG patch and patient-reported data available, 76 participants (30.5%) reported subsequent AF diagnosis. In participants with an irregular pulse notification on the Apple Watch and no AF observed on ECG patch, atrial and ventricular arrhythmias, mostly PACs and PVCs, were detected in 40% of participants. Defining optimal care for patients with detection of incidental arrhythmias other than AF is important as AF detection is further investigated, implemented, and refined.

    View details for DOI 10.1161/CIRCEP.121.010063

    View details for PubMedID 34565178

  • Large-Scale Assessment of a Smartwatch to Identify Atrial Fibrillation. The New England journal of medicine Perez, M. V., Mahaffey, K. W., Hedlin, H., Rumsfeld, J. S., Garcia, A., Ferris, T., Balasubramanian, V., Russo, A. M., Rajmane, A., Cheung, L., Hung, G., Lee, J., Kowey, P., Talati, N., Nag, D., Gummidipundi, S. E., Beatty, A., Hills, M. T., Desai, S., Granger, C. B., Desai, M., Turakhia, M. P., Apple Heart Study Investigators, Perez, M. V., Turakhia, M. P., Lhamo, K., Smith, S., Berdichesky, M., Sharma, B., Mahaffey, K., Parizo, J., Olivier, C., Nguyen, M., Tallapalli, S., Kaur, R., Gardner, R., Hung, G., Mitchell, D., Olson, G., Datta, S., Gerenrot, D., Wang, X., McCoy, P., Satpathy, B., Jacobsen, H., Makovey, D., Martin, A., Perino, A., O'Brien, C., Gupta, A., Toruno, C., Waydo, S., Brouse, C., Dorfman, D., Stein, J., Huang, J., Patel, M., Fleischer, S., Doll, E., O'Reilly, M., Dedoshka, K., Chou, M., Daniel, H., Crowley, M., Martin, C., Kirby, T., Brumand, M., McCrystale, K., Haggerty, M., Newberger, J., Keen, D., Antall, P., Holbrook, K., Braly, A., Noone, G., Leathers, B., Montrose, A., Kosowsky, J., Lewis, D., Finkelmeier, J. R., Bemis, K., Mahaffey, K. W., Desai, M., Talati, N., Nag, D., Rajmane, A., Desai, S., Caldbeck, D., Cheung, L., Granger, C., Rumsfeld, J., Kowey, P. R., Hills, M. T., Russo, A., Rockhold, F., Albert, C., Alonso, A., Wruck, L., Friday, K., Wheeler, M., Brodt, C., Park, S., Rogers, A., Jones, R., Ouyang, D., Chang, L., Yen, A., Dong, J., Mamic, P., Cheng, P., Shah, R., Lorvidhaya, P. 2019; 381 (20): 1909–17

    Abstract

    BACKGROUND: Optical sensors on wearable devices can detect irregular pulses. The ability of a smartwatch application (app) to identify atrial fibrillation during typical use is unknown. METHODS: Participants without atrial fibrillation (as reported by the participants themselves) used a smartphone (Apple iPhone) app to consent to monitoring. If a smartwatch-based irregular pulse notification algorithm identified possible atrial fibrillation, a telemedicine visit was initiated and an electrocardiography (ECG) patch was mailed to the participant, to be worn for up to 7 days. Surveys were administered 90 days after notification of the irregular pulse and at the end of the study. The main objectives were to estimate the proportion of notified participants with atrial fibrillation shown on an ECG patch and the positive predictive value of irregular pulse intervals with a targeted confidence interval width of 0.10. RESULTS: We recruited 419,297 participants over 8 months. Over a median of 117 days of monitoring, 2161 participants (0.52%) received notifications of irregular pulse. Among the 450 participants who returned ECG patches containing data that could be analyzed - which had been applied, on average, 13 days after notification - atrial fibrillation was present in 34% (97.5% confidence interval [CI], 29 to 39) overall and in 35% (97.5% CI, 27 to 43) of participants 65 years of age or older. Among participants who were notified of an irregular pulse, the positive predictive value was 0.84 (95% CI, 0.76 to 0.92) for observing atrial fibrillation on the ECG simultaneously with a subsequent irregular pulse notification and 0.71 (97.5% CI, 0.69 to 0.74) for observing atrial fibrillation on the ECG simultaneously with a subsequent irregular tachogram. Of 1376 notified participants who returned a 90-day survey, 57% contacted health care providers outside the study. There were no reports of serious app-related adverse events. CONCLUSIONS: The probability of receiving an irregular pulse notification was low. Among participants who received notification of an irregular pulse, 34% had atrial fibrillation on subsequent ECG patch readings and 84% of notifications were concordant with atrial fibrillation. This siteless (no on-site visits were required for the participants), pragmatic study design provides a foundation for large-scale pragmatic studies in which outcomes or adherence can be reliably assessed with user-owned devices. (Funded by Apple; Apple Heart Study ClinicalTrials.gov number, NCT03335800.)

    View details for DOI 10.1056/NEJMoa1901183

    View details for PubMedID 31722151

  • SciReader: A Cloud-based Recommender System for Biomedical Literature Desai, P., Telis, N., Lehmann, B., Bettinger, K., Pritchard, J. K., Datta, S. bioRxiv. 2018

    Abstract

    With the growing number of biomedical papers published each year, keeping up with relevant literature has become increasingly important, and yet more challenging. SciReader (www.scireader.com) is a cloud-based personalized recommender system that specifically aims to assist biomedical researchers and clinicians identify publications of interest to them. SciReader uses topic modeling and other machine learning algorithms to provide users with recommendations that are recent, relevant, and of high quality.

    bioRxiv preprint
  • Cloud-based interactive analytics for terabytes of genomic variants data. Bioinformatics (Oxford, England) Pan, C., McInnes, G., Deflaux, N., Snyder, M., Bingham, J., Datta, S., Tsao, P. S. 2017; 33 (23): 3709-3715

    Abstract

    Large scale genomic sequencing is now widely used to decipher questions in diverse realms such as biological function, human diseases, evolution, ecosystems, and agriculture. With the quantity and diversity these data harbor, a robust and scalable data handling and analysis solution is desired. We present interactive analytics using a cloud-based columnar database built on Dremel to perform information compression, comprehensive quality controls, and biological information retrieval in large volumes of genomic data. We demonstrate such Big Data computing paradigms can provide orders of magnitude faster turnaround for common genomic analyses, transforming long-running batch jobs submitted via a Linux shell into questions that can be asked from a web browser in seconds. Using this method, we assessed a study population of 475 deeply sequenced human genomes for genomic call rate, genotype and allele frequency distribution, variant density across the genome, and pharmacogenomic information. Our analysis framework is implemented in Google Cloud Platform and BigQuery. Codes are available at https://github.com/StanfordBioinformatics/mvp_aaa_codelabs. Contact: cuiping@stanford.edu or ptsao@stanford.edu. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btx468

    View details for PubMedID 28961771

    View details for PubMedCentralID PMC5860318

  • Digital Health: Tracking Physiomes and Activity Using Wearable Biosensors Reveals Useful Health-Related Information. PLoS biology Li, X., Dunn, J., Salins, D., Zhou, G., Zhou, W., Schüssler-Fiorenza Rose, S. M., Perelman, D., Colbert, E., Runge, R., Rego, S., Sonecha, R., Datta, S., McLaughlin, T., Snyder, M. P. 2017; 15 (1)

    Abstract

    A new wave of portable biosensors allows frequent measurement of health-related physiology. We investigated the use of these devices to monitor human physiological changes during various activities and their role in managing health and diagnosing and analyzing disease. By recording over 250,000 daily measurements for up to 43 individuals, we found personalized circadian differences in physiological parameters, replicating previous physiological findings. Interestingly, we found striking changes in particular environments, such as airline flights (decreased peripheral capillary oxygen saturation [SpO2] and increased radiation exposure). These events are associated with physiological macro-phenotypes such as fatigue, providing a strong association between reduced pressure/oxygen and fatigue on high-altitude flights. Importantly, we combined biosensor information with frequent medical measurements and made two important observations: First, wearable devices were useful in identification of early signs of Lyme disease and inflammatory responses; we used this information to develop a personalized, activity-based normalization framework to identify abnormal physiological signals from longitudinal data for facile disease detection. Second, wearables distinguish physiological differences between insulin-sensitive and -resistant individuals. Overall, these results indicate that portable biosensors provide useful information for monitoring personal activities and physiology and are likely to play an important role in managing health and enabling affordable health care access to groups traditionally limited by socioeconomic class or remote geography.

    View details for DOI 10.1371/journal.pbio.2001402

    View details for PubMedID 28081144

  • Secure cloud computing for genomic data Nature Biotechnology Datta, S., Bettinger, K., Snyder, M. 2016; 34 (6): 588-91

    View details for DOI 10.1038/nbt.3496

  • Sequence to Medical Phenotypes: A Framework for Interpretation of Human Whole Genome DNA Sequence Data. PLoS genetics Dewey, F. E., Grove, M. E., Priest, J. R., Waggott, D., Batra, P., Miller, C. L., Wheeler, M., Zia, A., Pan, C., Karzcewski, K. J., Miyake, C., Whirl-Carrillo, M., Klein, T. E., Datta, S., Altman, R. B., Snyder, M., Quertermous, T., Ashley, E. A. 2015; 11 (10)

    Abstract

    High throughput sequencing has facilitated a precipitous drop in the cost of genomic sequencing, prompting predictions of a revolution in medicine via genetic personalization of diagnostic and therapeutic strategies. There are significant barriers to realizing this goal that are related to the difficult task of interpreting personal genetic variation. A comprehensive, widely accessible application for interpretation of whole genome sequence data is needed. Here, we present a series of methods for identification of genetic variants and genotypes with clinical associations, phasing genetic data and using Mendelian inheritance for quality control, and providing predictive genetic information about risk for rare disease phenotypes and response to pharmacological therapy in single individuals and father-mother-child trios. We demonstrate application of these methods for disease and drug response prognostication in whole genome sequence data from twelve unrelated adults, and for disease gene discovery in one father-mother-child trio with apparently simplex congenital ventricular arrhythmia. In doing so we identify clinically actionable inherited disease risk and drug response genotypes in pre-symptomatic individuals. We also nominate a new candidate gene in congenital arrhythmia, ATP2B4, and provide experimental evidence of a regulatory role for variants discovered using this framework.

    View details for DOI 10.1371/journal.pgen.1005496

    View details for PubMedID 26448358

  • The Integrative Human Microbiome Project: Dynamic Analysis of Microbiome-Host Omics Profiles during Periods of Human Health and Disease Cell host & microbe Proctor, L. M. 2014; 16 (3): 276-289

    Abstract

    Much has been learned about the diversity and distribution of human-associated microbial communities, but we still know little about the biology of the microbiome, how it interacts with the host, and how the host responds to its resident microbiota. The Integrative Human Microbiome Project (iHMP, http://hmp2.org), the second phase of the NIH Human Microbiome Project, will study these interactions by analyzing microbiome and host activities in longitudinal studies of disease-specific cohorts and by creating integrated data sets of microbiome and host functional properties. These data sets will serve as experimental test beds to evaluate new models, methods, and analyses on the interactions of host and microbiome. Here we describe the three models of microbiome-associated human conditions, on the dynamics of preterm birth, inflammatory bowel disease, and type 2 diabetes, and their underlying hypotheses, as well as the multi-omic data types to be collected, integrated, and distributed through public repositories as a community resource.

    View details for DOI 10.1016/j.chom.2014.08.014

    View details for Web of Science ID 000342057000006

    View details for PubMedID 25211071