Bio


I am a physicist by training and a biotechnologist by profession. I believe that with the explosion of data in healthcare and new methods to analyze such large amounts of data, we will see massive changes in how human diseases are addressed through novel drugs, large-scale genomics, wearable sensors, and software to tie it all together. I want to drive part of this revolution.

Prior to joining Stanford in 2012, I spent a dozen years at various biotechs in the Bay Area, including roles as technology lead at Life Technologies (now Thermo Fisher) and as a founding team member of Verseon, a drug discovery company. Along the way, I have had fantastic opportunities to work alongside some of the smartest people in the field, learn from some of the most brilliant minds of our times, solve some fundamental technological problems, and deliver business impact.

Current Role at Stanford


I am currently the Director of Research IT at the School of Medicine. Research IT is a critical part of Stanford's Precision Health Strategy and exists to supply infrastructure, tools, and services used by researchers, patients/participants, and clinicians to collect and combine data to make discoveries and to improve human health and wellness. Our team builds and maintains STARR (STAnford medicine Research data Repository), Stanford REDCap, CHOIR, and mHealth platforms, and builds custom applications to streamline hundreds of studies.

I joined Stanford in October 2012 as the Director of Bioinformatics at the Stanford Center for Genomics and Personalized Medicine (SCGPM). My responsibility at the Center was to develop and lead the bioinformatics team and establish a genomics data analysis facility. Currently, the SCGPM bioinformatics team comprises a dozen scientists and software engineers. The team has a wide range of skill sets, including omics, computational biology, machine learning, software engineering, data management, databases, visualization, high-performance computing, IT, and cloud DevOps. The team currently supports several large-scale research and clinical programs at Stanford, including prestigious consortium efforts and interdisciplinary collaborations. The team also supports the Genetics Bioinformatics Service Center (2013-), a facility that provides best-in-class high-performance computational systems, scalable cloud computing, and cutting-edge bioinformatics services for the Stanford community.

Education & Certifications


  • PhD, Boston University, MA, USA, Computational Physics (non-equilibrium statistical mechanics) (2000)
  • MSc, Indian Institute of Technology, Madras (aka Chennai), India, Physics (stochastic systems) (1994)
  • BSc, Jadavpur University, Calcutta (aka Kolkata), India, Physics (1992)

All Publications


  • Benchmarking workflows to assess performance and suitability of germline variant calling pipelines in clinical diagnostic assays. BMC bioinformatics Krishnan, V., Utiramerur, S., Ng, Z., Datta, S., Snyder, M. P., Ashley, E. A. 2021; 22 (1): 85

    Abstract

    BACKGROUND: Benchmarking the performance of complex analytical pipelines is an essential part of developing Lab Developed Tests (LDT). Reference samples and benchmark calls published by Genome in a Bottle (GIAB) consortium have enabled the evaluation of analytical methods. The performance of such methods is not uniform across the different genomic regions of interest and variant types. Several benchmarking methods such as hap.py, vcfeval, and vcflib are available to assess the analytical performance characteristics of variant calling algorithms. However, assessing the performance characteristics of an overall LDT assay still requires stringing together several such methods and experienced bioinformaticians to interpret the results. In addition, these methods are dependent on the hardware, operating system and other software libraries, making it impossible to reliably repeat the analytical assessment, when any of the underlying dependencies change in the assay. Here we present a scalable and reproducible, cloud-based benchmarking workflow that is independent of the laboratory and the technician executing the workflow, or the underlying compute hardware used to rapidly and continually assess the performance of LDT assays, across their regions of interest and reportable range, using a broad set of benchmarking samples. RESULTS: The benchmarking workflow was used to evaluate the performance characteristics for secondary analysis pipelines commonly used by Clinical Genomics laboratories in their LDT assays such as the GATK HaplotypeCaller v3.7 and the SpeedSeq workflow based on FreeBayes v0.9.10. Five reference sample truth sets generated by Genome in a Bottle (GIAB) consortium, six samples from the Personal Genome Project (PGP) and several samples with validated clinically relevant variants from the Centers for Disease Control were used in this work. The performance characteristics were evaluated and compared for multiple reportable ranges, such as whole exome and the clinical exome. CONCLUSIONS: We have implemented a benchmarking workflow for clinical diagnostic laboratories that generates metrics such as specificity, precision and sensitivity for germline SNPs and InDels within a reportable range using whole exome or genome sequencing data. Combining these benchmarking results with validation using known variants of clinical significance in publicly available cell lines, we were able to establish the performance of variant calling pipelines in a clinical setting.

    View details for DOI 10.1186/s12859-020-03934-3

    View details for PubMedID 33627090

  • Large-Scale Assessment of a Smartwatch to Identify Atrial Fibrillation. The New England journal of medicine Perez, M. V., Mahaffey, K. W., Hedlin, H., Rumsfeld, J. S., Garcia, A., Ferris, T., Balasubramanian, V., Russo, A. M., Rajmane, A., Cheung, L., Hung, G., Lee, J., Kowey, P., Talati, N., Nag, D., Gummidipundi, S. E., Beatty, A., Hills, M. T., Desai, S., Granger, C. B., Desai, M., Turakhia, M. P., Apple Heart Study Investigators, Perez, M. V., Turakhia, M. P., Lhamo, K., Smith, S., Berdichesky, M., Sharma, B., Mahaffey, K., Parizo, J., Olivier, C., Nguyen, M., Tallapalli, S., Kaur, R., Gardner, R., Hung, G., Mitchell, D., Olson, G., Datta, S., Gerenrot, D., Wang, X., McCoy, P., Satpathy, B., Jacobsen, H., Makovey, D., Martin, A., Perino, A., O'Brien, C., Gupta, A., Toruno, C., Waydo, S., Brouse, C., Dorfman, D., Stein, J., Huang, J., Patel, M., Fleischer, S., Doll, E., O'Reilly, M., Dedoshka, K., Chou, M., Daniel, H., Crowley, M., Martin, C., Kirby, T., Brumand, M., McCrystale, K., Haggerty, M., Newberger, J., Keen, D., Antall, P., Holbrook, K., Braly, A., Noone, G., Leathers, B., Montrose, A., Kosowsky, J., Lewis, D., Finkelmeier, J. R., Bemis, K., Mahaffey, K. W., Desai, M., Talati, N., Nag, D., Rajmane, A., Desai, S., Caldbeck, D., Cheung, L., Granger, C., Rumsfeld, J., Kowey, P. R., Hills, M. T., Russo, A., Rockhold, F., Albert, C., Alonso, A., Wruck, L., Friday, K., Wheeler, M., Brodt, C., Park, S., Rogers, A., Jones, R., Ouyang, D., Chang, L., Yen, A., Dong, J., Mamic, P., Cheng, P., Shah, R., Lorvidhaya, P. 2019; 381 (20): 1909–17

    Abstract

    BACKGROUND: Optical sensors on wearable devices can detect irregular pulses. The ability of a smartwatch application (app) to identify atrial fibrillation during typical use is unknown. METHODS: Participants without atrial fibrillation (as reported by the participants themselves) used a smartphone (Apple iPhone) app to consent to monitoring. If a smartwatch-based irregular pulse notification algorithm identified possible atrial fibrillation, a telemedicine visit was initiated and an electrocardiography (ECG) patch was mailed to the participant, to be worn for up to 7 days. Surveys were administered 90 days after notification of the irregular pulse and at the end of the study. The main objectives were to estimate the proportion of notified participants with atrial fibrillation shown on an ECG patch and the positive predictive value of irregular pulse intervals with a targeted confidence interval width of 0.10. RESULTS: We recruited 419,297 participants over 8 months. Over a median of 117 days of monitoring, 2161 participants (0.52%) received notifications of irregular pulse. Among the 450 participants who returned ECG patches containing data that could be analyzed - which had been applied, on average, 13 days after notification - atrial fibrillation was present in 34% (97.5% confidence interval [CI], 29 to 39) overall and in 35% (97.5% CI, 27 to 43) of participants 65 years of age or older. Among participants who were notified of an irregular pulse, the positive predictive value was 0.84 (95% CI, 0.76 to 0.92) for observing atrial fibrillation on the ECG simultaneously with a subsequent irregular pulse notification and 0.71 (97.5% CI, 0.69 to 0.74) for observing atrial fibrillation on the ECG simultaneously with a subsequent irregular tachogram. Of 1376 notified participants who returned a 90-day survey, 57% contacted health care providers outside the study. There were no reports of serious app-related adverse events. CONCLUSIONS: The probability of receiving an irregular pulse notification was low. Among participants who received notification of an irregular pulse, 34% had atrial fibrillation on subsequent ECG patch readings and 84% of notifications were concordant with atrial fibrillation. This siteless (no on-site visits were required for the participants), pragmatic study design provides a foundation for large-scale pragmatic studies in which outcomes or adherence can be reliably assessed with user-owned devices. (Funded by Apple; Apple Heart Study ClinicalTrials.gov number, NCT03335800.).

    View details for DOI 10.1056/NEJMoa1901183

    View details for PubMedID 31722151

  • SciReader: A Cloud-based Recommender System for Biomedical Literature Desai, P., Telis, N., Lehmann, B., Bettinger, K., Pritchard, J. K., Datta, S. bioRxiv. 2018

    Abstract

    With the growing number of biomedical papers published each year, keeping up with relevant literature has become increasingly important, and yet more challenging. SciReader (www.scireader.com) is a cloud-based personalized recommender system that specifically aims to assist biomedical researchers and clinicians identify publications of interest to them. SciReader uses topic modeling and other machine learning algorithms to provide users with recommendations that are recent, relevant, and of high quality.

    bioRxiv preprint

  • Cloud-based interactive analytics for terabytes of genomic variants data. Bioinformatics (Oxford, England) Pan, C., McInnes, G., Deflaux, N., Snyder, M., Bingham, J., Datta, S., Tsao, P. S. 2017; 33 (23): 3709-3715

    Abstract

    Large scale genomic sequencing is now widely used to decipher questions in diverse realms such as biological function, human diseases, evolution, ecosystems, and agriculture. With the quantity and diversity these data harbor, a robust and scalable data handling and analysis solution is desired. We present interactive analytics using a cloud-based columnar database built on Dremel to perform information compression, comprehensive quality controls, and biological information retrieval in large volumes of genomic data. We demonstrate such Big Data computing paradigms can provide orders of magnitude faster turnaround for common genomic analyses, transforming long-running batch jobs submitted via a Linux shell into questions that can be asked from a web browser in seconds. Using this method, we assessed a study population of 475 deeply sequenced human genomes for genomic call rate, genotype and allele frequency distribution, variant density across the genome, and pharmacogenomic information. Our analysis framework is implemented in Google Cloud Platform and BigQuery. Codes are available at https://github.com/StanfordBioinformatics/mvp_aaa_codelabs. Contact: cuiping@stanford.edu or ptsao@stanford.edu. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btx468

    View details for PubMedID 28961771

    View details for PubMedCentralID PMC5860318

  • Digital Health: Tracking Physiomes and Activity Using Wearable Biosensors Reveals Useful Health-Related Information. PLoS biology Li, X., Dunn, J., Salins, D., Zhou, G., Zhou, W., Schüssler-Fiorenza Rose, S. M., Perelman, D., Colbert, E., Runge, R., Rego, S., Sonecha, R., Datta, S., McLaughlin, T., Snyder, M. P. 2017; 15 (1)

    Abstract

    A new wave of portable biosensors allows frequent measurement of health-related physiology. We investigated the use of these devices to monitor human physiological changes during various activities and their role in managing health and diagnosing and analyzing disease. By recording over 250,000 daily measurements for up to 43 individuals, we found personalized circadian differences in physiological parameters, replicating previous physiological findings. Interestingly, we found striking changes in particular environments, such as airline flights (decreased peripheral capillary oxygen saturation [SpO2] and increased radiation exposure). These events are associated with physiological macro-phenotypes such as fatigue, providing a strong association between reduced pressure/oxygen and fatigue on high-altitude flights. Importantly, we combined biosensor information with frequent medical measurements and made two important observations: First, wearable devices were useful in identification of early signs of Lyme disease and inflammatory responses; we used this information to develop a personalized, activity-based normalization framework to identify abnormal physiological signals from longitudinal data for facile disease detection. Second, wearables distinguish physiological differences between insulin-sensitive and -resistant individuals. Overall, these results indicate that portable biosensors provide useful information for monitoring personal activities and physiology and are likely to play an important role in managing health and enabling affordable health care access to groups traditionally limited by socioeconomic class or remote geography.

    View details for DOI 10.1371/journal.pbio.2001402

    View details for PubMedID 28081144

  • Secure cloud computing for genomic data. Nature Biotechnology Datta, S., Bettinger, K., Snyder, M. 2016; 34 (6): 588-91

    View details for DOI 10.1038/nbt.3496

  • Sequence to Medical Phenotypes: A Framework for Interpretation of Human Whole Genome DNA Sequence Data PLOS GENETICS Dewey, F. E., Grove, M. E., Priest, J. R., Waggott, D., Batra, P., Miller, C. L., Wheeler, M., Zia, A., Pan, C., Karzcewski, K. J., Miyake, C., Whirl-Carrillo, M., Klein, T. E., Datta, S., Altman, R. B., Snyder, M., Quertermous, T., Ashley, E. A. 2015; 11 (10)

    Abstract

    High throughput sequencing has facilitated a precipitous drop in the cost of genomic sequencing, prompting predictions of a revolution in medicine via genetic personalization of diagnostic and therapeutic strategies. There are significant barriers to realizing this goal that are related to the difficult task of interpreting personal genetic variation. A comprehensive, widely accessible application for interpretation of whole genome sequence data is needed. Here, we present a series of methods for identification of genetic variants and genotypes with clinical associations, phasing genetic data and using Mendelian inheritance for quality control, and providing predictive genetic information about risk for rare disease phenotypes and response to pharmacological therapy in single individuals and father-mother-child trios. We demonstrate application of these methods for disease and drug response prognostication in whole genome sequence data from twelve unrelated adults, and for disease gene discovery in one father-mother-child trio with apparently simplex congenital ventricular arrhythmia. In doing so we identify clinically actionable inherited disease risk and drug response genotypes in pre-symptomatic individuals. We also nominate a new candidate gene in congenital arrhythmia, ATP2B4, and provide experimental evidence of a regulatory role for variants discovered using this framework.

    View details for DOI 10.1371/journal.pgen.1005496

    View details for Web of Science ID 000364401600008

    View details for PubMedID 26448358

    View details for PubMedCentralID PMC4598191

  • The Integrative Human Microbiome Project: Dynamic Analysis of Microbiome-Host Omics Profiles during Periods of Human Health and Disease CELL HOST & MICROBE Proctor, L. M. 2014; 16 (3): 276-289

    Abstract

    Much has been learned about the diversity and distribution of human-associated microbial communities, but we still know little about the biology of the microbiome, how it interacts with the host, and how the host responds to its resident microbiota. The Integrative Human Microbiome Project (iHMP, http://hmp2.org), the second phase of the NIH Human Microbiome Project, will study these interactions by analyzing microbiome and host activities in longitudinal studies of disease-specific cohorts and by creating integrated data sets of microbiome and host functional properties. These data sets will serve as experimental test beds to evaluate new models, methods, and analyses on the interactions of host and microbiome. Here we describe the three models of microbiome-associated human conditions, on the dynamics of preterm birth, inflammatory bowel disease, and type 2 diabetes, and their underlying hypotheses, as well as the multi-omic data types to be collected, integrated, and distributed through public repositories as a community resource.

    View details for DOI 10.1016/j.chom.2014.08.014

    View details for Web of Science ID 000342057000006

    View details for PubMedID 25211071