Alexander Ioannidis
Assistant Professor (Research) of Genetics and of Biomedical Data Science
Adjunct Professor, Institute for Computational and Mathematical Engineering (ICME)
Web page: https://ai-page.org/
Bio
Dr. Ioannidis earned his Ph.D. from Stanford University in Computational and Mathematical Engineering together with an M.S. in Management Science and Engineering (Optimization). He graduated summa cum laude from Harvard University in Chemistry and Physics and earned an M.Phil at the University of Cambridge from the Department of Applied Math and Theoretical Physics in Computational Biology. His research focuses on the design of algorithms and application of computational methods for problems in precision health, genomics, clinical data science, and AI in healthcare.
Academic Appointments
- Assistant Professor (Research), Genetics
- Assistant Professor (Research), Department of Biomedical Data Science
- Adjunct Professor, Institute for Computational and Mathematical Engineering (ICME)
Boards, Advisory Committees, Professional Organizations
- Editorial Board, Human Genomics
Professional Education
- Ph.D., Stanford, Computational and Mathematical Engineering
- M.S., Stanford, Management Science and Engineering
- M.Phil, University of Cambridge, Computational Biology
- B.A., summa cum laude, Harvard, Chemistry and Physics
2025-26 Courses
- Healthcare Acceleration: Artificial Intelligence
BMDS 272, DESIGN 266 (Aut)
Independent Studies (3)
- Graduate Research
GENE 399 (Aut, Win, Spr)
- Out-of-Department Graduate Research
BIO 300X (Spr)
- Supervised Study
GENE 260 (Aut, Win, Spr)
Prior Year Courses
2023-24 Courses
- Generative AI in Healthcare
BIODS 295, DESIGN 266 (Spr)
Graduate and Fellowship Programs
- Biomedical Data Science (PhD Program)
- Biomedical Data Science (Masters Program)
All Publications
- Neural ADMIXTURE for rapid genomic clustering.
Nature computational science
2023; 3 (7): 621-629
Abstract
Characterizing the genetic structure of large cohorts has become increasingly important as genetic studies extend to massive, increasingly diverse biobanks. Popular methods decompose individual genomes into fractional cluster assignments with each cluster representing a vector of DNA variant frequencies. However, with rapidly increasing biobank sizes, these methods have become computationally intractable. Here we present Neural ADMIXTURE, a neural network autoencoder that follows the same modeling assumptions as the current standard algorithm, ADMIXTURE, while reducing the compute time by orders of magnitude surpassing even the fastest alternatives. One month of continuous compute using ADMIXTURE can be reduced to just hours with Neural ADMIXTURE. A multi-head approach allows Neural ADMIXTURE to offer even further acceleration by calculating multiple cluster numbers in a single run. Furthermore, the models can be stored, allowing cluster assignment to be performed on new data in linear time without needing to share the training samples.
View details for DOI 10.1038/s43588-023-00482-7
View details for PubMedID 37600116
View details for PubMedCentralID PMC10438426
- Deconvoluting complex correlates of COVID-19 severity with a multi-omic pandemic tracking strategy.
Nature communications
2022; 13 (1): 5107
Abstract
The SARS-CoV-2 pandemic has differentially impacted populations across race and ethnicity. A multi-omic approach represents a powerful tool to examine risk across multi-ancestry genomes. We leverage a pandemic tracking strategy in which we sequence viral and host genomes and transcriptomes from nasopharyngeal swabs of 1049 individuals (736 SARS-CoV-2 positive and 313 SARS-CoV-2 negative) and integrate them with digital phenotypes from electronic health records from a diverse catchment area in Northern California. Genome-wide association disaggregated by admixture mapping reveals novel COVID-19-severity-associated regions containing previously reported markers of neurologic, pulmonary and viral disease susceptibility. Phylodynamic tracking of consensus viral genomes reveals no association with disease severity or inferred ancestry. Summary data from multiomic investigation reveals metagenomic and HLA associations with severe COVID-19. The wealth of data available from residual nasopharyngeal swabs in combination with clinical data abstracted automatically at scale highlights a powerful strategy for pandemic tracking, and reveals distinct epidemiologic, genetic, and biological associations for those at the highest risk.
View details for DOI 10.1038/s41467-022-32397-8
View details for PubMedID 36042219
- Archetypal Analysis for population genetics.
PLoS computational biology
2022; 18 (8): e1010301
Abstract
The estimation of genetic clusters using genomic data has application from genome-wide association studies (GWAS) to demographic history to polygenic risk scores (PRS) and is expected to play an important role in the analyses of increasingly diverse, large-scale cohorts. However, existing methods are computationally-intensive, prohibitively so in the case of nationwide biobanks. Here we explore Archetypal Analysis as an efficient, unsupervised approach for identifying genetic clusters and for associating individuals with them. Such unsupervised approaches help avoid conflating socially constructed ethnic labels with genetic clusters by eliminating the need for exogenous training labels. We show that Archetypal Analysis yields similar cluster structure to existing unsupervised methods such as ADMIXTURE and provides interpretative advantages. More importantly, we show that since Archetypal Analysis can be used with lower-dimensional representations of genetic data, significant reductions in computational time and memory requirements are possible. When Archetypal Analysis is run in such a fashion, it takes several orders of magnitude less compute time than the current standard, ADMIXTURE. Finally, we demonstrate uses ranging across datasets from humans to canids.
View details for DOI 10.1371/journal.pcbi.1010301
View details for PubMedID 36007005
- Paths and timings of the peopling of Polynesia inferred from genomic networks.
Nature
2021; 597 (7877): 522-526
Abstract
Polynesia was settled in a series of extraordinary voyages across an ocean spanning one third of the Earth1, but the sequences of islands settled remain unknown and their timings disputed. Currently, several centuries separate the dates suggested by different archaeological surveys2-4. Here, using genome-wide data from merely 430 modern individuals from 21 key Pacific island populations and novel ancestry-specific computational analyses, we unravel the detailed genetic history of this vast, dispersed island network. Our reconstruction of the branching Polynesian migration sequence reveals a serial founder expansion, characterized by directional loss of variants, that originated in Samoa and spread first through the Cook Islands (Rarotonga), then to the Society (Totaiete ma) Islands (11th century), the western Austral (Tuha'a Pae) Islands and Tuamotu Archipelago (12th century), and finally to the widely separated, but genetically connected, megalithic statue-building cultures of the Marquesas (Te Henua 'Enana) Islands in the north, Raivavae in the south, and Easter Island (Rapa Nui), the easternmost of the Polynesian islands, settled in approximately AD 1200 via Mangareva.
View details for DOI 10.1038/s41586-021-03902-8
View details for PubMedID 34552258
- Mapping the human genetic architecture of COVID-19.
Nature
2021
Abstract
The genetic makeup of an individual contributes to susceptibility and response to viral infection. While environmental, clinical and social factors play a role in exposure to SARS-CoV-2 and COVID-19 disease severity1,2, host genetics may also be important. Identifying host-specific genetic factors may reveal biological mechanisms of therapeutic relevance and clarify causal relationships of modifiable environmental risk factors for SARS-CoV-2 infection and outcomes. We formed a global network of researchers to investigate the role of human genetics in SARS-CoV-2 infection and COVID-19 severity. We describe the results of three genome-wide association meta-analyses comprised of up to 49,562 COVID-19 patients from 46 studies across 19 countries. We reported 13 genome-wide significant loci that are associated with SARS-CoV-2 infection or severe manifestations of COVID-19. Several of these loci correspond to previously documented associations to lung or autoimmune and inflammatory diseases3-7. They also represent potentially actionable mechanisms in response to infection. Mendelian Randomization analyses support a causal role for smoking and body mass index for severe COVID-19 although not for type II diabetes. The identification of novel host genetic factors associated with COVID-19, with unprecedented speed, was made possible by the community of human genetic researchers coming together to prioritize sharing of data, results, resources and analytical frameworks. This working model of international collaboration underscores what is possible for future genetic discoveries in emerging pandemics, or indeed for any complex human disease.
View details for DOI 10.1038/s41586-021-03767-x
View details for PubMedID 34237774
- Native American gene flow into Polynesia predating Easter Island settlement.
Nature
2020
Abstract
The possibility of voyaging contact between prehistoric Polynesian and Native American populations has long intrigued researchers. Proponents have pointed to the existence of New World crops, such as the sweet potato and bottle gourd, in the Polynesian archaeological record, but nowhere else outside the pre-Columbian Americas1-6, while critics have argued that these botanical dispersals need not have been human mediated7. The Norwegian explorer Thor Heyerdahl controversially suggested that prehistoric South American populations had an important role in the settlement of east Polynesia and particularly of Easter Island (Rapa Nui)2. Several limited molecular genetic studies have reached opposing conclusions, and the possibility continues to be as hotly contested today as it was when first suggested8-12. Here we analyse genome-wide variation in individuals from islands across Polynesia for signs of Native American admixture, analysing 807 individuals from 17 island populations and 15 Pacific coast Native American groups. We find conclusive evidence for prehistoric contact of Polynesian individuals with Native American individuals (around AD 1200) contemporaneous with the settlement of remote Oceania13-15. Our analyses suggest strongly that a single contact event occurred in eastern Polynesia, before the settlement of Rapa Nui, between Polynesian individuals and a Native American group most closely related to the indigenous inhabitants of present-day Colombia.
View details for DOI 10.1038/s41586-020-2487-2
View details for PubMedID 32641827
- Ultra-low-power superconductor logic
JOURNAL OF APPLIED PHYSICS
2011; 109 (10)
View details for DOI 10.1063/1.3585849
View details for Web of Science ID 000292115900092
- Advances in Biomedical Missing Data Imputation: A Survey
IEEE ACCESS
2025; 13: 16918-16932
View details for DOI 10.1109/ACCESS.2024.3516506
View details for Web of Science ID 001410383800049
- An archaic HLA class I receptor allele diversifies natural killer cell-driven immunity in First Nations peoples of Oceania.
Cell
2024
Abstract
Genetic variation in host immunity impacts the disproportionate burden of infectious diseases that can be experienced by First Nations peoples. Polymorphic human leukocyte antigen (HLA) class I and killer cell immunoglobulin-like receptors (KIRs) are key regulators of natural killer (NK) cells, which mediate early infection control. How this variation impacts their responses across populations is unclear. We show that HLA-A∗24:02 became the dominant ligand for inhibitory KIR3DL1 in First Nations peoples across Oceania, through positive natural selection. We identify KIR3DL1∗114, widespread across and unique to Oceania, as an allele lineage derived from archaic humans. KIR3DL1∗114+NK cells from First Nations Australian donors are inhibited through binding HLA-A∗24:02. The KIR3DL1∗114 lineage is defined by phenylalanine at residue 166. Structural and binding studies show phenylalanine 166 forms multiple unique contacts with HLA-peptide complexes, increasing both affinity and specificity. Accordingly, assessing immunogenetic variation and the functional implications for immunity are fundamental toward understanding population-based disease associations.
View details for DOI 10.1016/j.cell.2024.10.005
View details for PubMedID 39476840
- Polygenic risk score portability for common diseases across genetically diverse populations.
Human genomics
2024; 18 (1): 93
Abstract
Polygenic risk scores (PRS) derived from European individuals have reduced portability across global populations, limiting their clinical implementation at worldwide scale. Here, we investigate the performance of a wide range of PRS models across four ancestry groups (Africans, Europeans, East Asians, and South Asians) for 14 conditions of high-medical interest. To select the best-performing model per trait, we first compared PRS performances for publicly available scores, and constructed new models using different methods (LDpred2, PRS-CSx and SNPnet). We used 285 K European individuals from the UK Biobank (UKBB) for training and 18 K, including diverse ancestries, for testing. We then evaluated PRS portability for the best models in Europeans and compared their accuracies with respect to the best PRS per ancestry. Finally, we validated the selected PRS models using an independent set of 8,417 individuals from Biobank of the Americas-Genomelink (BbofA-GL); and performed a PRS-Phewas. We confirmed a decay in PRS performances relative to Europeans when the evaluation was conducted using the best-PRS model for Europeans (51.3% for South Asians, 46.6% for East Asians and 39.4% for Africans). We observed an improvement in the PRS performances when specifically selecting ancestry specific PRS models (phenotype variance increase: 1.62 for Africans, 1.40 for South Asians and 0.96 for East Asians). Additionally, when we selected the optimal model conditional on ancestry for CAD, HDL-C and LDL-C, hypertension, hypothyroidism and T2D, PRS performance for studied populations was more comparable to what was observed in Europeans. Finally, we were able to independently validate tested models for Europeans, and conducted a PRS-Phewas, identifying cross-trait interplay between cardiometabolic conditions, and between immune-mediated components. Our work comprehensively evaluated PRS accuracy across a wide range of phenotypes, reducing the uncertainty with respect to which PRS model to choose and in which ancestry group. This evaluation has let us identify specific conditions where implementing risk-prioritization strategies could have practical utility across diverse ancestral groups, contributing to democratizing the implementation of PRS.
View details for DOI 10.1186/s40246-024-00664-y
View details for PubMedID 39218908
View details for PubMedCentralID PMC11367857
- Genetic Signatures of Positive Selection in Human Populations Adapted to High Altitude in Papua New Guinea.
Genome biology and evolution
2024; 16 (8)
Abstract
Papua New Guinea (PNG) hosts distinct environments mainly represented by the ecoregions of the Highlands and Lowlands that display increased altitude and a predominance of pathogens, respectively. Since its initial peopling approximately 50,000 years ago, inhabitants of these ecoregions might have differentially adapted to the environmental pressures exerted by each of them. However, the genetic basis of adaptation in populations from these areas remains understudied. Here, we investigated signals of positive selection in 62 highlanders and 43 lowlanders across 14 locations in the main island of PNG using whole-genome genotype data from the Oceanian Genome Variation Project (OGVP) and searched for signals of positive selection through population differentiation and haplotype-based selection scans. Additionally, we performed archaic ancestry estimation to detect selection signals in highlanders within introgressed regions of the genome. Among highland populations we identified candidate genes representing known biomarkers for mountain sickness (SAA4, SAA1, PRDX1, LDHA) as well as candidate genes of the Notch signaling pathway (PSEN1, NUMB, RBPJ, MAML3), a novel proposed pathway for high altitude adaptation in multiple organisms. We also identified candidate genes involved in oxidative stress, inflammation, and angiogenesis, processes inducible by hypoxia, as well as in components of the eye lens and the immune response. In contrast, candidate genes in the lowlands are mainly related to the immune response (HLA-DQB1, HLA-DQA2, TAAR6, TAAR9, TAAR8, RNASE4, RNASE6, ANG). Moreover, we find two candidate regions to be also enriched with archaic introgressed segments, suggesting that archaic admixture has played a role in the local adaptation of PNG populations.
View details for DOI 10.1093/gbe/evae161
View details for PubMedID 39173139
- Evaluating disparities in receptor status, overall survival, and time to hormone therapy among women with breast cancer
LIPPINCOTT WILLIAMS & WILKINS. 2024
View details for Web of Science ID 001275557402568
- Genetic landscape of colorectal cancer (CRC) across genetic ancestries: Implications for early cancer detection (ECD).
LIPPINCOTT WILLIAMS & WILKINS. 2024
View details for Web of Science ID 001275557404030
- Deep history of cultural and linguistic evolution among Central African hunter-gatherers.
Nature human behaviour
2024
Abstract
Human evolutionary history in Central Africa reflects a deep history of population connectivity. However, Central African hunter-gatherers (CAHGs) currently speak languages acquired from their neighbouring farmers. Hence it remains unclear which aspects of CAHG cultural diversity results from long-term evolution preceding agriculture and which reflect borrowing from farmers. On the basis of musical instruments, foraging tools, specialized vocabulary and genome-wide data from ten CAHG populations, we reveal evidence of large-scale cultural interconnectivity among CAHGs before and after the Bantu expansion. We also show that the distribution of hunter-gatherer musical instruments correlates with the oldest genomic segments in our sample predating farming. Music-related words are widely shared between western and eastern groups and likely precede the borrowing of Bantu languages. In contrast, subsistence tools are less frequently exchanged and may result from adaptation to local ecologies. We conclude that CAHG material culture and specialized lexicon reflect a long evolutionary history in Central Africa.
View details for DOI 10.1038/s41562-024-01891-y
View details for PubMedID 38802540
View details for PubMedCentralID 6092560
- Comparison of colorectal cancer (CRC) characteristics across genetic ancestries: Implications for early cancer detection (ECD).
LIPPINCOTT WILLIAMS & WILKINS. 2024: 164
View details for DOI 10.1200/JCO.2024.42.3_suppl.164
View details for Web of Science ID 001266680500514
- Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2024; 29: 404-418
Abstract
Precision medicine models often perform better for populations of European ancestry due to the over-representation of this group in the genomic datasets and large-scale biobanks from which the models are constructed. As a result, prediction models may misrepresent or provide less accurate treatment recommendations for underrepresented populations, contributing to health disparities. This study introduces an adaptable machine learning toolkit that integrates multiple existing methodologies and novel techniques to enhance the prediction accuracy for underrepresented populations in genomic datasets. By leveraging machine learning techniques, including gradient boosting and automated methods, coupled with novel population-conditional re-sampling techniques, our method significantly improves the phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse populations. We evaluate our approach using the UK Biobank, which is composed primarily of British individuals with European ancestry, and a minority representation of groups with Asian and African ancestry. Performance metrics demonstrate substantial improvements in phenotype prediction for underrepresented groups, achieving prediction accuracy comparable to that of the majority group. This approach represents a significant step towards improving prediction accuracy amidst current dataset diversity challenges. By integrating a tailored pipeline, our approach fosters more equitable validity and utility of statistical genetics methods, paving the way for more inclusive models and outcomes.
View details for PubMedID 38160295
- HyperFast: Instant Classification for Tabular Data
edited by Wooldridge, M., Dy, J., Natarajan, S.
ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE. 2024: 11114-11123
View details for Web of Science ID 001241513600041
- Overcoming health disparities in precision medicine
edited by Hunter, L., Altman, R. B., Ritchie, M. D., Murray, T., Klein, T. E.
WORLD SCIENTIFIC PUBL CO PTE LTD. 2024: 322-326
View details for Web of Science ID 001258333100024
- PopGenAdapt: Semi-Supervised Domain Adaptation for Genotype-to-Phenotype Prediction in Underrepresented Populations.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2024; 29: 327-340
Abstract
The lack of diversity in genomic datasets, currently skewed towards individuals of European ancestry, presents a challenge in developing inclusive biomedical models. The scarcity of such data is particularly evident in labeled datasets that include genomic data linked to electronic health records. To address this gap, this paper presents PopGenAdapt, a genotype-to-phenotype prediction model which adopts semi-supervised domain adaptation (SSDA) techniques originally proposed for computer vision. PopGenAdapt is designed to leverage the substantial labeled data available from individuals of European ancestry, as well as the limited labeled and the larger amount of unlabeled data from currently underrepresented populations. The method is evaluated in underrepresented populations from Nigeria, Sri Lanka, and Hawaii for the prediction of several disease outcomes. The results suggest a significant improvement in the performance of genotype-to-phenotype models for these populations over state-of-the-art supervised learning methods, setting SSDA as a promising strategy for creating more inclusive machine learning models in biomedical research. Our code is available at https://github.com/AI-sandbox/PopGenAdapt.
View details for PubMedID 38160290
- Machine Learning Strategies for Improved Phenotype Prediction in Underrepresented Populations.
bioRxiv : the preprint server for biology
2023
Abstract
Precision medicine models often perform better for populations of European ancestry due to the over-representation of this group in the genomic datasets and large-scale biobanks from which the models are constructed. As a result, prediction models may misrepresent or provide less accurate treatment recommendations for underrepresented populations, contributing to health disparities. This study introduces an adaptable machine learning toolkit that integrates multiple existing methodologies and novel techniques to enhance the prediction accuracy for underrepresented populations in genomic datasets. By leveraging machine learning techniques, including gradient boosting and automated methods, coupled with novel population-conditional re-sampling techniques, our method significantly improves the phenotypic prediction from single nucleotide polymorphism (SNP) data for diverse populations. We evaluate our approach using the UK Biobank, which is composed primarily of British individuals with European ancestry, and a minority representation of groups with Asian and African ancestry. Performance metrics demonstrate substantial improvements in phenotype prediction for underrepresented groups, achieving prediction accuracy comparable to that of the majority group. This approach represents a significant step towards improving prediction accuracy amidst current dataset diversity challenges. By integrating a tailored pipeline, our approach fosters more equitable validity and utility of statistical genetics methods, paving the way for more inclusive models and outcomes.
View details for DOI 10.1101/2023.10.12.561949
View details for PubMedID 37904983
View details for PubMedCentralID PMC10614800
- Mexican Biobank advances population and medical genomics of diverse ancestries.
Nature
2023
Abstract
Latin America continues to be severely underrepresented in genomics research, and fine-scale genetic histories and complex trait architectures remain hidden owing to insufficient data1. To fill this gap, the Mexican Biobank project genotyped 6,057 individuals from 898 rural and urban localities across all 32 states in Mexico at a resolution of 1.8 million genome-wide markers with linked complex trait and disease information creating a valuable nationwide genotype-phenotype database. Here, using ancestry deconvolution and inference of identity-by-descent segments, we inferred ancestral population sizes across Mesoamerican regions over time, unravelling Indigenous, colonial and postcolonial demographic dynamics2-6. We observed variation in runs of homozygosity among genomic regions with different ancestries reflecting distinct demographic histories and, in turn, different distributions of rare deleterious variants. We conducted genome-wide association studies (GWAS) for 22 complex traits and found that several traits are better predicted using the Mexican Biobank GWAS compared to the UK Biobank GWAS7,8. We identified genetic and environmental factors associating with trait variation, such as the length of the genome in runs of homozygosity as a predictor for body mass index, triglycerides, glucose and height. This study provides insights into the genetic histories of individuals in Mexico and dissects their complex trait architectures, both crucial for making precision and preventive medicine initiatives accessible worldwide.
View details for DOI 10.1038/s41586-023-06560-0
View details for PubMedID 37821706
View details for PubMedCentralID 3738819
- PopGenAdapt: Semi-Supervised Domain Adaptation for Genotype-to-Phenotype Prediction in Underrepresented Populations.
bioRxiv : the preprint server for biology
2023
Abstract
The lack of diversity in genomic datasets, currently skewed towards individuals of European ancestry, presents a challenge in developing inclusive biomedical models. The scarcity of such data is particularly evident in labeled datasets that include genomic data linked to electronic health records. To address this gap, this paper presents PopGenAdapt, a genotype-to-phenotype prediction model which adopts semi-supervised domain adaptation (SSDA) techniques originally proposed for computer vision. PopGenAdapt is designed to leverage the substantial labeled data available from individuals of European ancestry, as well as the limited labeled and the larger amount of unlabeled data from currently underrepresented populations. The method is evaluated in underrepresented populations from Nigeria, Sri Lanka, and Hawaii for the prediction of several disease outcomes. The results suggest a significant improvement in the performance of genotype-to-phenotype models for these populations over state-of-the-art supervised learning methods, setting SSDA as a promising strategy for creating more inclusive machine learning models in biomedical research.
View details for DOI 10.1101/2023.10.10.561715
View details for PubMedID 37873492
- Demographic history and genetic structure in pre-Hispanic Central Mexico.
Science (New York, N.Y.)
2023; 380 (6645): eadd6142
Abstract
Aridoamerica and Mesoamerica are two distinct cultural areas in northern and central Mexico, respectively, that hosted numerous pre-Hispanic civilizations between 2500 BCE and 1521 CE. The division between these regions shifted southward because of severe droughts ~1100 years ago, which allegedly drove a population replacement in central Mexico by Aridoamerican peoples. In this study, we present shotgun genome-wide data from 12 individuals and 27 mitochondrial genomes from eight pre-Hispanic archaeological sites across Mexico, including two at the shifting border of Aridoamerica and Mesoamerica. We find population continuity that spans the climate change episode and a broad preservation of the genetic structure across present-day Mexico for the past 2300 years. Lastly, we identify a contribution to pre-Hispanic populations of northern and central Mexico from two ancient unsampled "ghost" populations.
View details for DOI 10.1126/science.add6142
View details for PubMedID 37167382
- Global Biobank Meta-analysis Initiative: Powering genetic discovery across human disease.
Cell genomics
2022; 2 (10): 100192
Abstract
Biobanks facilitate genome-wide association studies (GWASs), which have mapped genomic loci across a range of human diseases and traits. However, most biobanks are primarily composed of individuals of European ancestry. We introduce the Global Biobank Meta-analysis Initiative (GBMI)-a collaborative network of 23 biobanks from 4 continents representing more than 2.2 million consented individuals with genetic data linked to electronic health records. GBMI meta-analyzes summary statistics from GWASs generated using harmonized genotypes and phenotypes from member biobanks for 14 exemplar diseases and endpoints. This strategy validates that GWASs conducted in diverse biobanks can be integrated despite heterogeneity in case definitions, recruitment strategies, and baseline characteristics. This collaborative effort improves GWAS power for diseases, benefits understudied diseases, and improves risk prediction while also enabling the nomination of disease genes and drug candidates by incorporating gene and protein expression data and providing insight into the underlying biology of human diseases and traits.
View details for DOI 10.1016/j.xgen.2022.100192
View details for PubMedID 36777996
View details for PubMedCentralID PMC9903716
- SALAI-Net: species-agnostic local ancestry inference network.
Bioinformatics (Oxford, England)
2022; 38 (Supplement_2): ii27-ii33
Abstract
MOTIVATION: Local ancestry inference (LAI) is the high resolution prediction of ancestry labels along a DNA sequence. LAI is important in the study of human history and migrations, and it is beginning to play a role in precision medicine applications including ancestry-adjusted genome-wide association studies (GWASs) and polygenic risk scores (PRSs). Existing LAI models do not generalize well between species, chromosomes or even ancestry groups, requiring re-training for each different setting. Furthermore, such methods can lack interpretability, which is an important element in each of these applications.RESULTS: We present SALAI-Net, a portable statistical LAI method that can be applied on any set of species and ancestries (species-agnostic), requiring only haplotype data and no other biological parameters. Inspired by identity by descent methods, SALAI-Net estimates population labels for each segment of DNA by performing a reference matching approach, which leads to an interpretable and fast technique. We benchmark our models on whole-genome data of humans and we test these models' ability to generalize to dog breeds when trained on human data. SALAI-Net outperforms previous methods in terms of balanced accuracy, while generalizing between different settings, species and datasets. Moreover, it is up to two orders of magnitude faster and uses considerably less RAM memory than competing methods.AVAILABILITY AND IMPLEMENTATION: We provide an open source implementation and links to publicly available data at github.com/AI-sandbox/SALAI-Net. 
Data is publicly available as follows: https://www.internationalgenome.org (1000 Genomes), https://www.simonsfoundation.org/simons-genome-diversity-project (Simons Genome Diversity Project), https://www.sanger.ac.uk/resources/downloads/human/hapmap3.html (HapMap), ftp://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516 (Human Genome Diversity Project) and https://www.ncbi.nlm.nih.gov/bioproject/PRJNA448733 (Canid genomes). SUPPLEMENTARY INFORMATION: Supplementary data are available from Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btac464
View details for PubMedID 36124792
-
Validating and automating learning of cardiometabolic polygenic risk scores from direct-to-consumer genetic and phenotypic data: implications for scaling precision health research.
Human genomics
2022; 16 (1): 37
Abstract
INTRODUCTION: A major challenge to enabling precision health at a global scale is the bias between those who enroll in state sponsored genomic research and those suffering from chronic disease. More than 30 million people have been genotyped by direct-to-consumer (DTC) companies such as 23andMe, Ancestry DNA, and MyHeritage, providing a potential mechanism for democratizing access to medical interventions and thus catalyzing improvements in patient outcomes as the cost of data acquisition drops. However, much of these data are sequestered in the initial provider network, without the ability for the scientific community to either access or validate. Here, we present a novel geno-pheno platform that integrates heterogeneous data sources and applies learnings to common chronic disease conditions including Type 2 diabetes (T2D) and hypertension. METHODS: We collected genotyped data from a novel DTC platform where participants upload their genotype data files and were invited to answer general health questionnaires regarding cardiometabolic traits over a period of 6 months. Quality control, imputation, and genome-wide association studies were performed on this dataset, and polygenic risk scores were built in a case-control setting using the BASIL algorithm. RESULTS: We collected data on N=4,550 (389 cases / 4,161 controls) who reported being affected or previously affected for T2D and N=4,528 (1,027 cases / 3,501 controls) for hypertension. We identified 164 out of 272 variants showing identical effect direction to previously reported genome-significant findings in Europeans. Performance metric of the PRS models was AUC=0.68, which is comparable to previously published PRS models obtained with larger datasets including clinical biomarkers. DISCUSSION: DTC platforms have the potential of inverting research models of genome sequencing and phenotypic data acquisition. Quality control (QC) mechanisms proved to successfully enable traditional GWAS and PRS analyses.
The direct participation of individuals has shown the potential to generate rich datasets enabling the creation of PRS cardiometabolic models. More importantly, federated learning of PRS from reuse of DTC data provides a mechanism for scaling precision health care delivery beyond the small number of countries who can afford to finance these efforts directly. CONCLUSIONS: The genetics of T2D and hypertension have been studied extensively in controlled datasets, and various polygenic risk scores (PRS) have been developed. We developed predictive tools for both phenotypes trained with heterogeneous genotypic and phenotypic data generated outside of the clinical environment and show that our methods can recapitulate prior findings with fidelity. From these observations, we conclude that it is possible to leverage DTC genetic repositories to identify individuals at risk of debilitating diseases based on their unique genetic landscape so that informed, timely clinical interventions can be incorporated.
View details for DOI 10.1186/s40246-022-00406-y
View details for PubMedID 36076307
-
Ancient DNA reveals five streams of migration into Micronesia and matrilocality in early Pacific seafarers.
Science (New York, N.Y.)
2022; 377 (6601): 72-79
Abstract
Micronesia began to be peopled earlier than other parts of Remote Oceania, but the origins of its inhabitants remain unclear. We generated genome-wide data from 164 ancient and 112 modern individuals. Analysis reveals five migratory streams into Micronesia. Three are East Asian related, one is Polynesian, and a fifth is a Papuan source related to mainland New Guineans that is different from the New Britain-related Papuan source for southwest Pacific populations but is similarly derived from male migrants ~2500 to 2000 years ago. People of the Mariana Archipelago may derive all of their precolonial ancestry from East Asian sources, making them the only Remote Oceanians without Papuan ancestry. Female-inherited mitochondrial DNA was highly differentiated across early Remote Oceanian communities but homogeneous within, implying matrilocal practices whereby women almost never raised their children in communities different from the ones in which they grew up.
View details for DOI 10.1126/science.abm6536
View details for PubMedID 35771911
-
Predicting Dog Phenotypes from Genotypes.
Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
2022; 2022: 3558-3562
Abstract
We analyze dog genotypes (i.e., positions of dog DNA sequences that often vary between different dogs) in order to predict the corresponding phenotypes (i.e., unique observed characteristics). More specifically, given chromosome data from a dog, we aim to predict the breed, height, and weight. We explore a variety of linear and non-linear classification and regression techniques to accomplish these three tasks. We also investigate the use of a neural network (both in linear and non-linear modes) for breed classification and compare the performance to traditional statistical methods. We show that linear methods generally outperform or match the performance of non-linear methods for breed classification. However, we show that the reverse is true for height and weight regression. Finally, we evaluate the results of all of these methods based on the number of input features used in the analysis. We conduct experiments using different fractions of the full genomic sequences, resulting in input sequences ranging from 20 SNPs to ∼200k SNPs. In doing so, we explore the impact of using a very limited number of SNPs for prediction. Our experiments demonstrate that these phenotypes in dogs can be predicted with as few as 0.5% of randomly selected SNPs (i.e., 992 SNPs) and that dog breeds can be classified with 50% balanced accuracy with as few as 0.02% SNPs (i.e., 40 SNPs).
View details for DOI 10.1109/EMBC48229.2022.9870905
View details for PubMedID 36085664
-
Generative Moment Matching Networks for Genotype Simulation.
Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
2022; 2022: 1379-1383
Abstract
The generation of synthetic genomic sequences using neural networks has potential to ameliorate privacy and data sharing concerns and to mitigate potential bias within datasets due to under-representation of some population groups. However, there is not a consensus on which architectures, training procedures, and evaluation metrics should be used when simulating single nucleotide polymorphism (SNP) sequences with neural networks. In this paper, we explore the use of Generative Moment Matching Networks (GMMNs) for SNP simulation, we present some architectural and procedural changes to properly train the networks, and we introduce an evaluation scheme to qualitatively and quantitatively assess the quality of the simulated sequences.
View details for DOI 10.1109/EMBC48229.2022.9871045
View details for PubMedID 36086656
-
The genetic legacy of the Manila galleon trade in Mexico.
Philosophical transactions of the Royal Society of London. Series B, Biological sciences
2022; 377 (1852): 20200419
Abstract
The population of Mexico has a considerable genetic substructure due to both its pre-Columbian diversity and due to genetic admixture from post-Columbian trans-oceanic migrations. The latter primarily originated in Europe and Africa, but also, to a lesser extent, in Asia. We analyze previously understudied genetic connections between Asia and Mexico to infer the timing and source of this genetic ancestry in Mexico. We identify the predominant origin within Southeast Asia (specifically western Indonesian and non-Negrito Filipino sources), and we date its arrival in Mexico to approximately 13 generations ago (1620 CE). This points to a genetic legacy from the seventeenth century Manila galleon trade between the colonial Spanish Philippines and the Pacific port of Acapulco. Indeed, within Mexico we observe the highest level of this trans-Pacific ancestry in Acapulco, located in the state of Guerrero. This colonial Spanish trade route from East Asia to Europe was centred on Mexico and appears in historical records, but its legacy has been largely ignored. Identities and stories were suppressed due to slavery, assimilation of the immigrants as 'Indios' and incomplete historical records. Here we characterize this understudied Mexican ancestry. This article is part of the theme issue 'Celebrating 50 years since Lewontin's apportionment of human diversity'.
View details for DOI 10.1098/rstb.2020.0419
View details for PubMedID 35430879
-
Opportunities and challenges for the use of common controls in sequencing studies.
Nature reviews. Genetics
2022
Abstract
Genome-wide association studies using large-scale genome and exome sequencing data have become increasingly valuable in identifying associations between genetic variants and disease, transforming basic research and translational medicine. However, this progress has not been equally shared across all people and conditions, in part due to limited resources. Leveraging publicly available sequencing data as external common controls, rather than sequencing new controls for every study, can better allocate resources by augmenting control sample sizes or providing controls where none existed. However, common control studies must be carefully planned and executed as even small differences in sample ascertainment and processing can result in substantial bias. Here, we discuss challenges and opportunities for the robust use of common controls in high-throughput sequencing studies, including study design, quality control and statistical approaches. Thoughtful generation and use of large and valuable genetic sequencing data sets will enable investigation of a broader and more representative set of conditions, environments and genetic ancestries than otherwise possible.
View details for DOI 10.1038/s41576-022-00487-4
View details for PubMedID 35581355
-
Bayesian model comparison for rare-variant association studies.
American journal of human genetics
2021
Abstract
Whole-genome sequencing studies applied to large populations or biobanks with extensive phenotyping raise new analytic challenges. The need to consider many variants at a locus or group of genes simultaneously and the potential to study many correlated phenotypes with shared genetic architecture provide opportunities for discovery not addressed by the traditional one variant, one phenotype association study. Here, we introduce a Bayesian model comparison approach called MRP (multiple rare variants and phenotypes) for rare-variant association studies that considers correlation, scale, and direction of genetic effects across a group of genetic variants, phenotypes, and studies, requiring only summary statistic data. We apply our method to exome sequencing data (n = 184,698) across 2,019 traits from the UK Biobank, aggregating signals in genes. MRP demonstrates an ability to recover signals such as associations between PCSK9 and LDL cholesterol levels. We additionally find MRP effective in conducting meta-analyses in exome data. Non-biomarker findings include associations between MC1R and red hair color and skin color, IL17RA and monocyte count, and IQGAP2 and mean platelet volume. Finally, we apply MRP in a multi-phenotype setting; after clustering the 35 biomarker phenotypes based on genetic correlation estimates, we find that joint analysis of these phenotypes results in substantial power gains for gene-trait associations, such as in TNFRSF13B in one of the clusters containing diabetes- and lipid-related traits. Overall, we show that the MRP model comparison approach improves upon useful features from widely used meta-analysis approaches for rare-variant association analyses and prioritizes protective modifiers of disease risk.
View details for DOI 10.1016/j.ajhg.2021.11.005
View details for PubMedID 34822764
-
High Resolution Ancestry Deconvolution for Next Generation Genomic Data
bioRxiv
2021
-
Neural ADMIXTURE: rapid population clustering with autoencoders
bioRxiv
2021
-
Discovering prescription patterns in pediatric acute-onset neuropsychiatric syndrome patients.
Journal of biomedical informatics
2020: 103664
Abstract
OBJECTIVE: Pediatric acute-onset neuropsychiatric syndrome (PANS) is a complex neuropsychiatric syndrome characterized by an abrupt onset of obsessive-compulsive symptoms and/or severe eating restrictions, along with at least two concomitant debilitating cognitive, behavioral, or neurological symptoms. A wide range of pharmacological interventions along with behavioral and environmental modifications, and psychotherapies have been adopted to treat symptoms and underlying etiologies. Our goal was to develop a data-driven approach to identify treatment patterns in this cohort. MATERIALS AND METHODS: In this cohort study, we extracted medical prescription histories from electronic health records. We developed a modified dynamic programming approach to perform global alignment of those medication histories. Our approach is unique since it considers time gaps in prescription patterns as part of the similarity strategy. RESULTS: This study included 43 consecutive new-onset pre-pubertal patients who had at least 3 clinic visits. Our algorithm identified six clusters with distinct medication usage history which may represent clinicians' practice of treating PANS of different severities and etiologies, i.e., two most severe groups requiring high dose intravenous steroids; two arthritic or inflammatory groups requiring prolonged nonsteroidal anti-inflammatory drug (NSAID); and two mild relapsing/remitting groups treated with a short course of NSAID. The psychometric scores as outcomes in each cluster generally improved within the first two years. DISCUSSION AND CONCLUSION: Our algorithm shows potential to improve our knowledge of treatment patterns in the PANS cohort, while helping clinicians understand how patients respond to a combination of drugs.
View details for DOI 10.1016/j.jbi.2020.103664
View details for PubMedID 33359113
-
LAI-NET: LOCAL-ANCESTRY INFERENCE WITH NEURAL NETWORKS
IEEE. 2020: 1314–18
View details for Web of Science ID 000615970401111
-
Class-Conditional VAE-GAN for Local-Ancestry Simulation
MLCB Proceedings
2019
-
Reconstructing admixture and migration dynamics of post-contact Mexico
WILEY. 2018: 228
View details for Web of Science ID 000430656803170
-
Integrated Power Divider for Superconducting Digital Circuits
IEEE TRANSACTIONS ON APPLIED SUPERCONDUCTIVITY
2011; 21 (3): 571–74
View details for DOI 10.1109/TASC.2010.2086415
View details for Web of Science ID 000291050500113
-
Digital circuits using self-shunted Nb/NbxSi1-x/Nb Josephson junctions
APPLIED PHYSICS LETTERS
2010; 96 (21)
View details for DOI 10.1063/1.3432065
View details for Web of Science ID 000278183200086
https://orcid.org/0000-0002-4735-7803