Dr. Alexander Ioannidis (Ph.D., M.Phil) earned his Ph.D. from Stanford University in Computational and Mathematical Engineering, where he teaches machine learning and data science as an Adjunct Professor in the School of Engineering. He also has an M.S. in Management Science and Engineering (Optimization) from Stanford. Prior to Stanford, he worked in superconducting computing logic and quantum computing at Northrop Grumman. He graduated summa cum laude from Harvard University in Chemistry and Physics and earned an M.Phil in Computational Biology from the Department of Applied Mathematics and Theoretical Physics, as well as a Diploma in Greek, from the University of Cambridge. As a current research fellow in the Department of Biomedical Data Science at the Stanford School of Medicine, he focuses on the design of algorithms and the application of computational methods to problems in genomics, clinical data science, and precision health, with a particular focus on underrepresented populations in Oceania and Latin America.
Doctor of Philosophy, Stanford University, CME-PHD (2018)
Master of Science, Stanford University, MGTSC-MS (2018)
Master of Philosophy, University of Cambridge, Computational Biology (2005)
Bachelor of Arts, Harvard University, Chemistry and Physics (2003)
SALAI-Net: species-agnostic local ancestry inference network.
Bioinformatics (Oxford, England)
2022; 38 (Supplement_2): ii27-ii33
MOTIVATION: Local ancestry inference (LAI) is the high resolution prediction of ancestry labels along a DNA sequence. LAI is important in the study of human history and migrations, and it is beginning to play a role in precision medicine applications including ancestry-adjusted genome-wide association studies (GWASs) and polygenic risk scores (PRSs). Existing LAI models do not generalize well between species, chromosomes or even ancestry groups, requiring re-training for each different setting. Furthermore, such methods can lack interpretability, which is an important element in each of these applications. RESULTS: We present SALAI-Net, a portable statistical LAI method that can be applied on any set of species and ancestries (species-agnostic), requiring only haplotype data and no other biological parameters. Inspired by identity by descent methods, SALAI-Net estimates population labels for each segment of DNA by performing a reference matching approach, which leads to an interpretable and fast technique. We benchmark our models on whole-genome data of humans and we test these models' ability to generalize to dog breeds when trained on human data. SALAI-Net outperforms previous methods in terms of balanced accuracy, while generalizing between different settings, species and datasets. Moreover, it is up to two orders of magnitude faster and uses considerably less RAM memory than competing methods. AVAILABILITY AND IMPLEMENTATION: We provide an open source implementation and links to publicly available data at github.com/AI-sandbox/SALAI-Net. Data is publicly available as follows: https://www.internationalgenome.org (1000 Genomes), https://www.simonsfoundation.org/simons-genome-diversity-project (Simons Genome Diversity Project), https://www.sanger.ac.uk/resources/downloads/human/hapmap3.html (HapMap), ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516 (Human Genome Diversity Project) and https://www.ncbi.nlm.nih.gov/bioproject/PRJNA448733 (Canid genomes). SUPPLEMENTARY INFORMATION: Supplementary data are available from Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btac464
View details for PubMedID 36124792
Deconvoluting complex correlates of COVID-19 severity with a multi-omic pandemic tracking strategy.
2022; 13 (1): 5107
The SARS-CoV-2 pandemic has differentially impacted populations across race and ethnicity. A multi-omic approach represents a powerful tool to examine risk across multi-ancestry genomes. We leverage a pandemic tracking strategy in which we sequence viral and host genomes and transcriptomes from nasopharyngeal swabs of 1049 individuals (736 SARS-CoV-2 positive and 313 SARS-CoV-2 negative) and integrate them with digital phenotypes from electronic health records from a diverse catchment area in Northern California. Genome-wide association disaggregated by admixture mapping reveals novel COVID-19-severity-associated regions containing previously reported markers of neurologic, pulmonary and viral disease susceptibility. Phylodynamic tracking of consensus viral genomes reveals no association with disease severity or inferred ancestry. Summary data from multiomic investigation reveals metagenomic and HLA associations with severe COVID-19. The wealth of data available from residual nasopharyngeal swabs in combination with clinical data abstracted automatically at scale highlights a powerful strategy for pandemic tracking, and reveals distinct epidemiologic, genetic, and biological associations for those at the highest risk.
View details for DOI 10.1038/s41467-022-32397-8
View details for PubMedID 36042219
Archetypal Analysis for population genetics.
PLoS computational biology
2022; 18 (8): e1010301
The estimation of genetic clusters using genomic data has application from genome-wide association studies (GWAS) to demographic history to polygenic risk scores (PRS) and is expected to play an important role in the analyses of increasingly diverse, large-scale cohorts. However, existing methods are computationally intensive, prohibitively so in the case of nationwide biobanks. Here we explore Archetypal Analysis as an efficient, unsupervised approach for identifying genetic clusters and for associating individuals with them. Such unsupervised approaches help avoid conflating socially constructed ethnic labels with genetic clusters by eliminating the need for exogenous training labels. We show that Archetypal Analysis yields similar cluster structure to existing unsupervised methods such as ADMIXTURE and provides interpretative advantages. More importantly, we show that since Archetypal Analysis can be used with lower-dimensional representations of genetic data, significant reductions in computational time and memory requirements are possible. When Archetypal Analysis is run in such a fashion, it takes several orders of magnitude less compute time than the current standard, ADMIXTURE. Finally, we demonstrate uses ranging across datasets from humans to canids.
View details for DOI 10.1371/journal.pcbi.1010301
View details for PubMedID 36007005
Generative Moment Matching Networks for Genotype Simulation.
Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
2022; 2022: 1379-1383
The generation of synthetic genomic sequences using neural networks has potential to ameliorate privacy and data sharing concerns and to mitigate potential bias within datasets due to under-representation of some population groups. However, there is not a consensus on which architectures, training procedures, and evaluation metrics should be used when simulating single nucleotide polymorphism (SNP) sequences with neural networks. In this paper, we explore the use of Generative Moment Matching Networks (GMMNs) for SNP simulation, we present some architectural and procedural changes to properly train the networks, and we introduce an evaluation scheme to qualitatively and quantitatively assess the quality of the simulated sequences.
View details for DOI 10.1109/EMBC48229.2022.9871045
View details for PubMedID 36086656
Paths and timings of the peopling of Polynesia inferred from genomic networks.
2021; 597 (7877): 522-526
Polynesia was settled in a series of extraordinary voyages across an ocean spanning one third of the Earth1, but the sequences of islands settled remain unknown and their timings disputed. Currently, several centuries separate the dates suggested by different archaeological surveys2-4. Here, using genome-wide data from merely 430 modern individuals from 21 key Pacific island populations and novel ancestry-specific computational analyses, we unravel the detailed genetic history of this vast, dispersed island network. Our reconstruction of the branching Polynesian migration sequence reveals a serial founder expansion, characterized by directional loss of variants, that originated in Samoa and spread first through the Cook Islands (Rarotonga), then to the Society (Totaiete ma) Islands (11th century), the western Austral (Tuha'a Pae) Islands and Tuamotu Archipelago (12th century), and finally to the widely separated, but genetically connected, megalithic statue-building cultures of the Marquesas (Te Henua 'Enana) Islands in the north, Raivavae in the south, and Easter Island (Rapa Nui), the easternmost of the Polynesian islands, settled in approximately AD 1200 via Mangareva.
View details for DOI 10.1038/s41586-021-03902-8
View details for PubMedID 34552258
Mapping the human genetic architecture of COVID-19.
The genetic makeup of an individual contributes to susceptibility and response to viral infection. While environmental, clinical and social factors play a role in exposure to SARS-CoV-2 and COVID-19 disease severity1,2, host genetics may also be important. Identifying host-specific genetic factors may reveal biological mechanisms of therapeutic relevance and clarify causal relationships of modifiable environmental risk factors for SARS-CoV-2 infection and outcomes. We formed a global network of researchers to investigate the role of human genetics in SARS-CoV-2 infection and COVID-19 severity. We describe the results of three genome-wide association meta-analyses comprised of up to 49,562 COVID-19 patients from 46 studies across 19 countries. We reported 13 genome-wide significant loci that are associated with SARS-CoV-2 infection or severe manifestations of COVID-19. Several of these loci correspond to previously documented associations to lung or autoimmune and inflammatory diseases3-7. They also represent potentially actionable mechanisms in response to infection. Mendelian Randomization analyses support a causal role for smoking and body mass index for severe COVID-19 although not for type II diabetes. The identification of novel host genetic factors associated with COVID-19, with unprecedented speed, was made possible by the community of human genetic researchers coming together to prioritize sharing of data, results, resources and analytical frameworks. This working model of international collaboration underscores what is possible for future genetic discoveries in emerging pandemics, or indeed for any complex human disease.
View details for DOI 10.1038/s41586-021-03767-x
View details for PubMedID 34237774
Native American gene flow into Polynesia predating Easter Island settlement.
The possibility of voyaging contact between prehistoric Polynesian and Native American populations has long intrigued researchers. Proponents have pointed to the existence of New World crops, such as the sweet potato and bottle gourd, in the Polynesian archaeological record, but nowhere else outside the pre-Columbian Americas1-6, while critics have argued that these botanical dispersals need not have been human mediated7. The Norwegian explorer Thor Heyerdahl controversially suggested that prehistoric South American populations had an important role in the settlement of east Polynesia and particularly of Easter Island (Rapa Nui)2. Several limited molecular genetic studies have reached opposing conclusions, and the possibility continues to be as hotly contested today as it was when first suggested8-12. Here we analyse genome-wide variation in individuals from islands across Polynesia for signs of Native American admixture, analysing 807 individuals from 17 island populations and 15 Pacific coast Native American groups. We find conclusive evidence for prehistoric contact of Polynesian individuals with Native American individuals (around AD 1200) contemporaneous with the settlement of remote Oceania13-15. Our analyses suggest strongly that a single contact event occurred in eastern Polynesia, before the settlement of Rapa Nui, between Polynesian individuals and a Native American group most closely related to the indigenous inhabitants of present-day Colombia.
View details for DOI 10.1038/s41586-020-2487-2
View details for PubMedID 32641827
Validating and automating learning of cardiometabolic polygenic risk scores from direct-to-consumer genetic and phenotypic data: implications for scaling precision health research.
2022; 16 (1): 37
INTRODUCTION: A major challenge to enabling precision health at a global scale is the bias between those who enroll in state-sponsored genomic research and those suffering from chronic disease. More than 30 million people have been genotyped by direct-to-consumer (DTC) companies such as 23andMe, Ancestry DNA, and MyHeritage, providing a potential mechanism for democratizing access to medical interventions and thus catalyzing improvements in patient outcomes as the cost of data acquisition drops. However, much of these data are sequestered in the initial provider network, without the ability for the scientific community to either access or validate. Here, we present a novel geno-pheno platform that integrates heterogeneous data sources and applies learnings to common chronic disease conditions including Type 2 diabetes (T2D) and hypertension. METHODS: We collected genotyped data from a novel DTC platform where participants upload their genotype data files and were invited to answer general health questionnaires regarding cardiometabolic traits over a period of 6 months. Quality control, imputation, and genome-wide association studies were performed on this dataset, and polygenic risk scores were built in a case-control setting using the BASIL algorithm. RESULTS: We collected data on N=4,550 (389 cases / 4,161 controls) who reported being affected or previously affected for T2D and N=4,528 (1,027 cases / 3,501 controls) for hypertension. We identified 164 out of 272 variants showing identical effect direction to previously reported genome-significant findings in Europeans. Performance metric of the PRS models was AUC=0.68, which is comparable to previously published PRS models obtained with larger datasets including clinical biomarkers. DISCUSSION: DTC platforms have the potential of inverting research models of genome sequencing and phenotypic data acquisition. Quality control (QC) mechanisms proved to successfully enable traditional GWAS and PRS analyses. The direct participation of individuals has shown the potential to generate rich datasets enabling the creation of PRS cardiometabolic models. More importantly, federated learning of PRS from reuse of DTC data provides a mechanism for scaling precision health care delivery beyond the small number of countries who can afford to finance these efforts directly. CONCLUSIONS: The genetics of T2D and hypertension have been studied extensively in controlled datasets, and various polygenic risk scores (PRS) have been developed. We developed predictive tools for both phenotypes trained with heterogeneous genotypic and phenotypic data generated outside of the clinical environment and show that our methods can recapitulate prior findings with fidelity. From these observations, we conclude that it is possible to leverage DTC genetic repositories to identify individuals at risk of debilitating diseases based on their unique genetic landscape so that informed, timely clinical interventions can be incorporated.
View details for DOI 10.1186/s40246-022-00406-y
View details for PubMedID 36076307
Ancient DNA reveals five streams of migration into Micronesia and matrilocality in early Pacific seafarers.
Science (New York, N.Y.)
2022; 377 (6601): 72-79
Micronesia began to be peopled earlier than other parts of Remote Oceania, but the origins of its inhabitants remain unclear. We generated genome-wide data from 164 ancient and 112 modern individuals. Analysis reveals five migratory streams into Micronesia. Three are East Asian related, one is Polynesian, and a fifth is a Papuan source related to mainland New Guineans that is different from the New Britain-related Papuan source for southwest Pacific populations but is similarly derived from male migrants ~2500 to 2000 years ago. People of the Mariana Archipelago may derive all of their precolonial ancestry from East Asian sources, making them the only Remote Oceanians without Papuan ancestry. Female-inherited mitochondrial DNA was highly differentiated across early Remote Oceanian communities but homogeneous within, implying matrilocal practices whereby women almost never raised their children in communities different from the ones in which they grew up.
View details for DOI 10.1126/science.abm6536
View details for PubMedID 35771911
Predicting Dog Phenotypes from Genotypes.
Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
2022; 2022: 3558-3562
We analyze dog genotypes (i.e., positions of dog DNA sequences that often vary between different dogs) in order to predict the corresponding phenotypes (i.e., unique observed characteristics). More specifically, given chromosome data from a dog, we aim to predict the breed, height, and weight. We explore a variety of linear and non-linear classification and regression techniques to accomplish these three tasks. We also investigate the use of a neural network (both in linear and non-linear modes) for breed classification and compare the performance to traditional statistical methods. We show that linear methods generally outperform or match the performance of non-linear methods for breed classification. However, we show that the reverse is true for height and weight regression. Finally, we evaluate the results of all of these methods based on the number of input features used in the analysis. We conduct experiments using different fractions of the full genomic sequences, resulting in input sequences ranging from 20 SNPs to ∼200k SNPs. In doing so, we explore the impact of using a very limited number of SNPs for prediction. Our experiments demonstrate that these phenotypes in dogs can be predicted with as few as 0.5% of randomly selected SNPs (i.e., 992 SNPs) and that dog breeds can be classified with 50% balanced accuracy with as few as 0.02% SNPs (i.e., 40 SNPs).
View details for DOI 10.1109/EMBC48229.2022.9870905
View details for PubMedID 36085664
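The abstract above reports breed classification in terms of balanced accuracy without defining it. As a minimal sketch, assuming the standard definition (the mean of per-class recall), the metric could be computed as follows. The function name and the toy breed labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (standard definition; an assumption here)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no samples given")

    correct = defaultdict(int)  # correctly predicted samples per true class
    total = defaultdict(int)    # all samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall is computed per class, then averaged with equal class weight.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels: a classifier that always predicts the majority breed.
y_true = ["beagle", "beagle", "beagle", "poodle"]
y_pred = ["beagle", "beagle", "beagle", "beagle"]
score = balanced_accuracy(y_true, y_pred)
```

In this toy case plain accuracy is 0.75, while balanced accuracy is 0.5, because the minority class is never recovered. That sensitivity to rare classes is the usual reason for preferring the metric when class sizes are uneven.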
The genetic legacy of the Manila galleon trade in Mexico.
Philosophical transactions of the Royal Society of London. Series B, Biological sciences
2022; 377 (1852): 20200419
The population of Mexico has a considerable genetic substructure due to both its pre-Columbian diversity and due to genetic admixture from post-Columbian trans-oceanic migrations. The latter primarily originated in Europe and Africa, but also, to a lesser extent, in Asia. We analyze previously understudied genetic connections between Asia and Mexico to infer the timing and source of this genetic ancestry in Mexico. We identify the predominant origin within Southeast Asia (specifically western Indonesian and non-Negrito Filipino sources), and we date its arrival in Mexico to approximately 13 generations ago (1620 CE). This points to a genetic legacy from the seventeenth century Manila galleon trade between the colonial Spanish Philippines and the Pacific port of Acapulco. Indeed, within Mexico we observe the highest level of this trans-Pacific ancestry in Acapulco, located in the state of Guerrero. This colonial Spanish trade route from East Asia to Europe was centred on Mexico and appears in historical records, but its legacy has been largely ignored. Identities and stories were suppressed due to slavery, assimilation of the immigrants as 'Indios' and incomplete historical records. Here we characterize this understudied Mexican ancestry. This article is part of the theme issue 'Celebrating 50 years since Lewontin's apportionment of human diversity'.
View details for DOI 10.1098/rstb.2020.0419
View details for PubMedID 35430879
Opportunities and challenges for the use of common controls in sequencing studies.
Nature reviews. Genetics
Genome-wide association studies using large-scale genome and exome sequencing data have become increasingly valuable in identifying associations between genetic variants and disease, transforming basic research and translational medicine. However, this progress has not been equally shared across all people and conditions, in part due to limited resources. Leveraging publicly available sequencing data as external common controls, rather than sequencing new controls for every study, can better allocate resources by augmenting control sample sizes or providing controls where none existed. However, common control studies must be carefully planned and executed as even small differences in sample ascertainment and processing can result in substantial bias. Here, we discuss challenges and opportunities for the robust use of common controls in high-throughput sequencing studies, including study design, quality control and statistical approaches. Thoughtful generation and use of large and valuable genetic sequencing data sets will enable investigation of a broader and more representative set of conditions, environments and genetic ancestries than otherwise possible.
View details for DOI 10.1038/s41576-022-00487-4
View details for PubMedID 35581355
Bayesian model comparison for rare-variant association studies.
American journal of human genetics
Whole-genome sequencing studies applied to large populations or biobanks with extensive phenotyping raise new analytic challenges. The need to consider many variants at a locus or group of genes simultaneously and the potential to study many correlated phenotypes with shared genetic architecture provide opportunities for discovery not addressed by the traditional one variant, one phenotype association study. Here, we introduce a Bayesian model comparison approach called MRP (multiple rare variants and phenotypes) for rare-variant association studies that considers correlation, scale, and direction of genetic effects across a group of genetic variants, phenotypes, and studies, requiring only summary statistic data. We apply our method to exome sequencing data (n = 184,698) across 2,019 traits from the UK Biobank, aggregating signals in genes. MRP demonstrates an ability to recover signals such as associations between PCSK9 and LDL cholesterol levels. We additionally find MRP effective in conducting meta-analyses in exome data. Non-biomarker findings include associations between MC1R and red hair color and skin color, IL17RA and monocyte count, and IQGAP2 and mean platelet volume. Finally, we apply MRP in a multi-phenotype setting; after clustering the 35 biomarker phenotypes based on genetic correlation estimates, we find that joint analysis of these phenotypes results in substantial power gains for gene-trait associations, such as in TNFRSF13B in one of the clusters containing diabetes- and lipid-related traits. Overall, we show that the MRP model comparison approach improves upon useful features from widely used meta-analysis approaches for rare-variant association analyses and prioritizes protective modifiers of disease risk.
View details for DOI 10.1016/j.ajhg.2021.11.005
View details for PubMedID 34822764
- High Resolution Ancestry Deconvolution for Next Generation Genomic Data bioRxiv 2021
- Neural ADMIXTURE: rapid population clustering with autoencoders bioRxiv 2021
Discovering prescription patterns in pediatric acute-onset neuropsychiatric syndrome patients.
Journal of biomedical informatics
OBJECTIVE: Pediatric acute-onset neuropsychiatric syndrome (PANS) is a complex neuropsychiatric syndrome characterized by an abrupt onset of obsessive-compulsive symptoms and/or severe eating restrictions, along with at least two concomitant debilitating cognitive, behavioral, or neurological symptoms. A wide range of pharmacological interventions, along with behavioral and environmental modifications and psychotherapies, have been adopted to treat symptoms and underlying etiologies. Our goal was to develop a data-driven approach to identify treatment patterns in this cohort. MATERIALS AND METHODS: In this cohort study, we extracted medical prescription histories from electronic health records. We developed a modified dynamic programming approach to perform global alignment of those medication histories. Our approach is unique since it considers time gaps in prescription patterns as part of the similarity strategy. RESULTS: This study included 43 consecutive new-onset pre-pubertal patients who had at least 3 clinic visits. Our algorithm identified six clusters with distinct medication usage history, which may represent clinicians' practice of treating PANS of different severities and etiologies, i.e., two most severe groups requiring high dose intravenous steroids; two arthritic or inflammatory groups requiring prolonged nonsteroidal anti-inflammatory drug (NSAID) treatment; and two mild relapsing/remitting groups treated with a short course of NSAID. The psychometric scores as outcomes in each cluster generally improved within the first two years. DISCUSSION AND CONCLUSION: Our algorithm shows potential to improve our knowledge of treatment patterns in the PANS cohort, while helping clinicians understand how patients respond to a combination of drugs.
View details for DOI 10.1016/j.jbi.2020.103664
View details for PubMedID 33359113
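The abstract above describes a modified dynamic programming approach to the global alignment of medication histories. The modification (how time gaps enter the similarity) is not detailed there, so the following is only a sketch of the standard, unmodified global alignment recurrence (Needleman-Wunsch) applied to sequences of medication names. The function name, the scoring values, and the example histories are hypothetical, and the paper's time-gap handling is not attempted.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of sequences a and b.

    Standard Needleman-Wunsch recurrence with a linear gap penalty.
    The scoring values are illustrative assumptions, not the paper's.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned entirely against gaps

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap inserted in b
                score[i][j - 1] + gap,               # gap inserted in a
            )
    return score[n][m]


# Hypothetical medication histories, one entry per prescription event.
history_1 = ["NSAID", "steroid", "NSAID"]
history_2 = ["NSAID", "NSAID"]
similarity = global_alignment_score(history_1, history_2)
```

Pairwise scores of this kind can be collected into a similarity matrix and passed to a clustering method, which is one plausible route to the patient clusters the abstract mentions. The abstract does not state which clustering procedure was used.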
LAI-Net: Local-Ancestry Inference with Neural Networks
IEEE. 2020: 1314–18
View details for Web of Science ID 000615970401111
- Class-Conditional VAE-GAN for Local-Ancestry Simulation MLCB Proceedings 2019
Reconstructing admixture and migration dynamics of post-contact Mexico
WILEY. 2018: 228
View details for Web of Science ID 000430656803170
- Integrated Power Divider for Superconducting Digital Circuits IEEE Transactions on Applied Superconductivity 2011; 21 (3): 571–74
- Ultra-low-power superconductor logic Journal of Applied Physics 2011; 109 (10)
- Digital circuits using self-shunted Nb/NbxSi1-x/Nb Josephson junctions Applied Physics Letters 2010; 96 (21)