Bio


Dr. Palacios’s research spans Bayesian nonparametrics, probabilistic AI, stochastic processes, and computational statistics. Her group develops stochastic models and efficient inference algorithms for understanding evolutionary dynamics in population genetics, infectious diseases and cancer.

Honors & Awards


  • Frederick E. Terman Fellow 2017, Stanford University (2017-2019)
  • Alfred P. Sloan Research Fellowship 2018, Sloan Foundation (2018-2020)

Professional Education


  • Ph.D., University of Washington, Statistics (2013)

All Publications


  • Generalizing matrix representations to fully heterochronous ranked tree shapes. ArXiv Jennings-Shaffer, C., Chen, C., Palacios, J. A., Matsen IV, F. A. 2025

    Abstract

    Phylogenetic tree shapes capture fundamental signatures of evolution. We consider "ranked" tree shapes, which are equipped with a total order on the internal nodes compatible with the tree graph. Recent work has established an elegant bijection of ranked tree shapes and a class of integer matrices, called F-matrices, defined by simple inequalities. This formulation is for isochronous ranked tree shapes, where all leaves share the same sampling time, such as in the study of ancient human demography from present-day individuals. Another important style of phylogenetics concerns trees where the "timing" of events is by branch length rather than calendar time. This style of tree, called a rooted phylogram, is output by popular maximum-likelihood methods. These trees are broadly relevant, such as to study the affinity maturation of B cells in the immune system. Discretizing time in a rooted phylogram gives a fully heterochronous ranked tree shape, where leaves are part of the total order. Here we extend the F-matrix framework to such fully heterochronous ranked tree shapes. We establish an explicit bijection between a class of F-matrices and the space of such tree shapes. The matrix representation has the key feature that values at any entry are highly constrained via four previous entries, enabling straightforward enumeration of all valid tree shapes. We also use this framework to develop probabilistic models on ranked tree shapes. Our work extends understanding of combinatorial objects that have a rich history in the literature.

    View details for PubMedID 41281197

    View details for PubMedCentralID PMC12636760

  • Efficient Bayesian Phylogenetics under the Infinite Sites Model. bioRxiv : the preprint server for biology Specht, I., Palacios, J. A. 2025

    Abstract

    Bayesian phylogenetic inference from molecular sequences can provide key insights into the evolutionary history of populations. Existing tools, however, often scale poorly with sample size. We present inPhynite, a highly efficient Bayesian phylogenetics algorithm for genomic datasets compatible with the infinite sites mutation model. A key advantage of this model is that likelihood calculation, which typically incurs a substantial computational cost, becomes trivial. We show that under the infinite sites assumption, it is possible to sample a coarse space of mutations and coalescences from which we may recover complete phylogenetic trees. We design an efficient Markov chain for this space together with effective population size trajectories, modeled as piecewise constant functions. Based on real and synthetic data, our method significantly outperforms competing methods, offering a speedup of over 225 times in statistical efficiency on large datasets without incurring any loss in accuracy. Finally, we demonstrate how inPhynite can help us understand the evolutionary history and past effective population sizes of human populations based on mitochondrial DNA.

    View details for DOI 10.1101/2025.11.14.688551

    View details for PubMedID 41292938

    View details for PubMedCentralID PMC12642419

  • Accounting for reporting delays in real-time phylodynamic analyses with preferential sampling. PLoS computational biology Medina, C. M., Palacios, J. A., Minin, V. M. 2025; 21 (5): e1012970

    Abstract

    The COVID-19 pandemic demonstrated that fast and accurate analysis of continually collected infectious disease surveillance data is crucial for situational awareness and policy making. Coalescent-based phylodynamic analysis can use genetic sequences of a pathogen to estimate changes in its effective population size, a measure of genetic diversity. These changes in effective population size can be connected to the changes in the number of infections in the population of interest under certain conditions. Phylodynamics is an important set of tools because its methods are often resilient to the ascertainment biases present in traditional surveillance data (e.g., preferentially testing symptomatic individuals). Unfortunately, it takes weeks or months to sequence and deposit the sampled pathogen genetic sequences into a database, making them available for such analyses. These reporting delays severely decrease precision of phylodynamic methods closer to present time, and for some models can lead to extreme biases. Here we present a method that affords reliable estimation of the effective population size trajectory closer to the time of data collection, allowing for policy decisions to be based on more recent data. Our work uses readily available historic times between sampling and reporting of sequenced samples for a population of interest, and incorporates this information into the sampling model to mitigate the effects of reporting delay in real-time analyses. We illustrate our methodology on simulated data and on SARS-CoV-2 sequences collected in the state of Washington in 2021.

    View details for DOI 10.1371/journal.pcbi.1012970

    View details for PubMedID 40327728

  • Curriculum Design in an Evolving Field: Perspectives on Biomedical Data Science from Stanford. Annual review of biomedical data science Yeh, C. Y., Wall, D. P., Matthys, K., Sabatti, C., Palacios, J. 2025

    Abstract

    In recent decades, there has been an explosion of data streams spanning the entire spectrum of biomedicine, opening novel opportunities to tackle biological and medical research questions, increasing our ability to provide effective and efficient health care. In parallel, augmented computational power has allowed the development and deployment of quantitative approaches at unprecedented scales. To effectively take advantage of this progress, it is important to invest in the training of a new generation of biomedical data scientists. Designing a graduate curriculum in the backdrop of a rapidly changing landscape of data, methods, and computing power demands flexibility and openness to adaptation. At the same time, we strive to ensure that the students acquire foundational competencies that might fuel productive and evolving careers, without being constrained to and defined by a niche trendy topic. We offer here a view of graduate training in biomedical data science from the standpoint of our experience at Stanford University. We conclude with a series of open challenges, the answers to which we believe will shape training in biomedical data science.

    View details for DOI 10.1146/annurev-biodatasci-090624-022951

    View details for PubMedID 40203230

  • Multiple merger coalescent inference of effective population size. Philosophical transactions of the Royal Society of London. Series B, Biological sciences Zhang, J., Palacios, J. A. 2025; 380 (1919): 20230306

    Abstract

    Variation in a sample of molecular sequence data informs about the past evolutionary history of the sample's population. Traditionally, Bayesian modelling coupled with the standard coalescent is used to infer the sample's bifurcating genealogy and demographic and evolutionary parameters such as effective population size and mutation rates. However, there are many situations where binary coalescent models do not accurately reflect the true underlying ancestral processes. Here, we propose a Bayesian non-parametric method for inferring effective population size trajectories from a multifurcating genealogy under the Λ-coalescent. In particular, we jointly estimate the effective population size and the model parameter for the Beta-coalescent model, a special type of Λ-coalescent. Finally, we test our methods on simulations and apply them to study various viral dynamics as well as Japanese sardine population size changes over time. The code and vignettes can be found in the phylodyn package. This article is part of the theme issue '"A mathematical theory of evolution": phylogenetic models dating back 100 years'.

    View details for DOI 10.1098/rstb.2023.0306

    View details for PubMedID 39976412

  • An Efficient Coalescent Model for Heterochronously Sampled Molecular Data. Journal of the American Statistical Association Cappello, L., Veber, A., Palacios, J. A. 2024
  • CRP-Tree: a phylogenetic association test for binary traits. Journal of the Royal Statistical Society. Series C, Applied statistics Zhang, J., Preising, G. A., Schumer, M., Palacios, J. A. 2024; 73 (2): 340-377
  • Statistical summaries of unlabelled evolutionary trees. Biometrika Samyak, R., Palacios, J. A. 2024; 111 (1): 171-193

    Abstract

    Rooted and ranked phylogenetic trees are mathematical objects that are useful in modelling hierarchical data and evolutionary relationships with applications to many fields such as evolutionary biology and genetic epidemiology. Bayesian phylogenetic inference usually explores the posterior distribution of trees via Markov chain Monte Carlo methods. However, assessing uncertainty and summarizing distributions remains challenging for these types of structures. While labelled phylogenetic trees have been extensively studied, relatively less literature exists for unlabelled trees that are increasingly useful, for example when one seeks to summarize samples of trees obtained with different methods, or from different samples and environments, and wishes to assess the stability and generalizability of these summaries. In our paper, we exploit recently proposed distance metrics of unlabelled ranked binary trees and unlabelled ranked genealogies, or trees equipped with branch lengths, to define the Frechet mean, variance and interquartile sets as summaries of these tree distributions. We provide an efficient combinatorial optimization algorithm for computing the Frechet mean of a sample or of distributions on unlabelled ranked tree shapes and unlabelled ranked genealogies. We show the applicability of our summary statistics for studying popular tree distributions and for comparing the SARS-CoV-2 evolutionary trees across different locations during the COVID-19 epidemic in 2020. Our current implementations are publicly available at https://github.com/RSamyak/fmatrix.

    View details for DOI 10.1093/biomet/asad025

    View details for PubMedID 38352626

  • Bayesian Change Point Detection with Spike-and-Slab Priors. Journal of computational and graphical statistics Cappello, L., Padilla, O., Palacios, J. A. 2023
  • adaPop: Bayesian inference of dependent population dynamics in coalescent models. PLoS computational biology Cappello, L., Kim, J., Palacios, J. A. 2023; 19 (3): e1010897

    Abstract

    The coalescent is a powerful statistical framework that allows us to infer past population dynamics leveraging the ancestral relationships reconstructed from sampled molecular sequence data. In many biomedical applications, such as in the study of infectious diseases, cell development, and tumorigenesis, several distinct populations share evolutionary history and therefore become dependent. The inference of such dependence is a highly important yet challenging problem. With advances in sequencing technologies, we are well positioned to exploit the wealth of high-resolution biological data for tackling this problem. Here, we present adaPop, a probabilistic model to estimate past population dynamics of dependent populations and to quantify their degree of dependence. An essential feature of our approach is the ability to track the time-varying association between the populations while making minimal assumptions on their functional shapes via Markov random field priors. We provide nonparametric estimators, extensions of our base model that integrate multiple data sources, and fast scalable inference algorithms. We test our method using simulated data under various dependent population histories and demonstrate the utility of our model in shedding light on evolutionary histories of different variants of SARS-CoV-2.

    View details for DOI 10.1371/journal.pcbi.1010897

    View details for PubMedID 36940209

  • Adaptive Preferential Sampling in Phylodynamics With an Application to SARS-CoV-2. Journal of computational and graphical statistics : a joint publication of American Statistical Association, Institute of Mathematical Statistics, Interface Foundation of North America Cappello, L., Palacios, J. A. 2022; 31 (2): 541-552

    Abstract

    Longitudinal molecular data of rapidly evolving viruses and pathogens provide information about disease spread and complement traditional surveillance approaches based on case count data. The coalescent is used to model the genealogy that represents the sample ancestral relationships. The basic assumption is that coalescent events occur at a rate inversely proportional to the effective population size N_e(t), a time-varying measure of genetic diversity. When the sampling process (collection of samples over time) depends on N_e(t), the coalescent and the sampling processes can be jointly modeled to improve estimation of N_e(t). Failing to do so can lead to bias due to model misspecification. However, the way that the sampling process depends on the effective population size may vary over time. We introduce an approach where the sampling process is modeled as an inhomogeneous Poisson process with rate equal to the product of N_e(t) and a time-varying coefficient, making minimal assumptions on their functional shapes via Markov random field priors. We provide efficient algorithms for inference, show the model performance vis-a-vis alternative methods in a simulation study, and apply our model to SARS-CoV-2 sequences from Los Angeles and Santa Clara counties. The methodology is implemented and available in the R package adapref. Supplementary files for this article are available online.

    View details for DOI 10.1080/10618600.2021.1987256

    View details for PubMedID 36035966

    View details for PubMedCentralID PMC9409340

  • An adjacent-swap Markov chain on coalescent trees. Journal of applied probability Simper, M., Palacios, J. A. 2022
  • Deconvoluting complex correlates of COVID-19 severity with a multi-omic pandemic tracking strategy. Nature communications Parikh, V. N., Ioannidis, A. G., Jimenez-Morales, D., Gorzynski, J. E., De Jong, H. N., Liu, X., Roque, J., Cepeda-Espinoza, V. P., Osoegawa, K., Hughes, C., Sutton, S. C., Youlton, N., Joshi, R., Amar, D., Tanigawa, Y., Russo, D., Wong, J., Lauzon, J. T., Edelson, J., Mas Montserrat, D., Kwon, Y., Rubinacci, S., Delaneau, O., Cappello, L., Kim, J., Shoura, M. J., Raja, A. N., Watson, N., Hammond, N., Spiteri, E., Mallempati, K. C., Montero-Martín, G., Christle, J., Kim, J., Kirillova, A., Seo, K., Huang, Y., Zhao, C., Moreno-Grau, S., Hershman, S. G., Dalton, K. P., Zhen, J., Kamm, J., Bhatt, K. D., Isakova, A., Morri, M., Ranganath, T., Blish, C. A., Rogers, A. J., Nadeau, K., Yang, S., Blomkalns, A., O'Hara, R., Neff, N. F., DeBoever, C., Szalma, S., Wheeler, M. T., Gates, C. M., Farh, K., Schroth, G. P., Febbo, P., deSouza, F., Cornejo, O. E., Fernandez-Vina, M., Kistler, A., Palacios, J. A., Pinsky, B. A., Bustamante, C. D., Rivas, M. A., Ashley, E. A. 2022; 13 (1): 5107

    Abstract

    The SARS-CoV-2 pandemic has differentially impacted populations across race and ethnicity. A multi-omic approach represents a powerful tool to examine risk across multi-ancestry genomes. We leverage a pandemic tracking strategy in which we sequence viral and host genomes and transcriptomes from nasopharyngeal swabs of 1049 individuals (736 SARS-CoV-2 positive and 313 SARS-CoV-2 negative) and integrate them with digital phenotypes from electronic health records from a diverse catchment area in Northern California. Genome-wide association disaggregated by admixture mapping reveals novel COVID-19-severity-associated regions containing previously reported markers of neurologic, pulmonary and viral disease susceptibility. Phylodynamic tracking of consensus viral genomes reveals no association with disease severity or inferred ancestry. Summary data from multiomic investigation reveals metagenomic and HLA associations with severe COVID-19. The wealth of data available from residual nasopharyngeal swabs in combination with clinical data abstracted automatically at scale highlights a powerful strategy for pandemic tracking, and reveals distinct epidemiologic, genetic, and biological associations for those at the highest risk.

    View details for DOI 10.1038/s41467-022-32397-8

    View details for PubMedID 36042219

  • Enumeration of binary trees compatible with a perfect phylogeny. Journal of mathematical biology Palacios, J. A., Bhaskar, A., Disanto, F., Rosenberg, N. A. 2022; 84 (6): 54

    Abstract

    Evolutionary models used for describing molecular sequence variation suppose that at a non-recombining genomic segment, sequences share ancestry that can be represented as a genealogy: a rooted, binary, timed tree, with tips corresponding to individual sequences. Under the infinitely-many-sites mutation model, mutations are randomly superimposed along the branches of the genealogy, so that every mutation occurs at a chromosomal site that has not previously mutated; if a mutation occurs at an interior branch, then all individuals descending from that branch carry the mutation. The implication is that observed patterns of molecular variation from this model impose combinatorial constraints on the hidden state space of genealogies. In particular, observed molecular variation can be represented in the form of a perfect phylogeny, a tree structure that fully encodes the mutational differences among sequences. For a sample of n sequences, a perfect phylogeny might not possess n distinct leaves, and hence might be compatible with many possible binary tree structures that could describe the evolutionary relationships among the n sequences. Here, we investigate enumerative properties of the set of binary ranked and unranked tree shapes that are compatible with a perfect phylogeny, and hence, the binary ranked and unranked tree shapes conditioned on an observed pattern of mutations under the infinitely-many-sites mutation model. We provide a recursive enumeration of these shapes. We consider both perfect phylogenies that can be represented as binary and those that are multifurcating. The results have implications for computational aspects of the statistical inference of evolutionary parameters that underlie sets of molecular sequences.

    View details for DOI 10.1007/s00285-022-01748-w

    View details for PubMedID 35552538

  • Statistical Challenges in Tracking the Evolution of SARS-CoV-2. Statistical science Cappello, L., Kim, J., Liu, S., Palacios, J. A. 2022; 37 (2): 162-182

    View details for DOI 10.1214/22-STS853

    View details for Web of Science ID 000798149000003

  • The impact of the COVID-19 preventive measures on influenza transmission: molecular and epidemiological evidence. International journal of infectious diseases : IJID : official publication of the International Society for Infectious Diseases Tran, L. K., Huang, D., Li, N., Li, L. M., Palacios, J. A., Chang, H.

    Abstract

    OBJECTIVE: We quantify the impact of COVID-19-related control measures on the spread of human influenza virus H1N1 and H3N2. METHODS: We analyzed case numbers to estimate the length of the 2019-2020 influenza season and compare its length to the median of the previous nine seasons. In addition, we used influenza molecular data to compare within-region and between-region genetic diversity and effective population size from 2019 to 2020. Finally, we analyzed personal behavior data, and policy stringency data for each region. RESULTS: The 2019-2020 influenza season was shorter than the median of the previous nine seasons in all regions. For H1N1 and H3N2, there was an increase in between-region genetic diversity in almost all pairs of regions between 2019 and 2020. For 10 of 11 regions for H1N1 and 9 of 11 regions for H3N2, there was a decrease in within-region genetic diversity. For 10 of 13 regions for H1N1 and 3 of 7 regions for H3N2, there was a decrease in effective population size. CONCLUSIONS: We found consistent evidence of decrease in influenza incidence after the introduction of preventive measures due to COVID-19 emergence.

    View details for DOI 10.1016/j.ijid.2021.12.323

    View details for PubMedID 34902583

  • Distance metrics for ranked evolutionary trees. Proceedings of the National Academy of Sciences of the United States of America Kim, J., Rosenberg, N. A., Palacios, J. A. 2020

    Abstract

    Genealogical tree modeling is essential for estimating evolutionary parameters in population genetics and phylogenetics. Recent mathematical results concerning ranked genealogies without leaf labels unlock opportunities in the analysis of evolutionary trees. In particular, comparisons between ranked genealogies facilitate the study of evolutionary processes of different organisms sampled at multiple time periods. We propose metrics on ranked tree shapes and ranked genealogies for lineages isochronously and heterochronously sampled. Our proposed tree metrics make it possible to conduct statistical analyses of ranked tree shapes and timed ranked tree shapes or ranked genealogies. Such analyses allow us to assess differences in tree distributions, quantify estimation uncertainty, and summarize tree distributions. We show the utility of our metrics via simulations and an application in infectious diseases.

    View details for DOI 10.1073/pnas.1922851117

    View details for PubMedID 33139566

  • Sequential importance sampling for multiresolution Kingman-Tajima coalescent counting. The annals of applied statistics Cappello, L., Palacios, J. A. 2020; 14 (2): 727-751

    Abstract

    Statistical inference of evolutionary parameters from molecular sequence data relies on coalescent models to account for the shared genealogical ancestry of the samples. However, inferential algorithms do not scale to available data sets. A strategy to improve computational efficiency is to rely on simpler coalescent and mutation models, resulting in smaller hidden state spaces. An estimate of the cardinality of the state-space of genealogical trees at different resolutions is essential to decide the best modeling strategy for a given dataset. To our knowledge, there is neither an exact nor approximate method to determine these cardinalities. We propose a sequential importance sampling algorithm to estimate the cardinality of the sample space of genealogical trees under different coalescent resolutions. Our sampling scheme proceeds sequentially across the set of combinatorial constraints imposed by the data, which in this work are completely linked sequences of DNA at a non-recombining segment. We analyze the cardinality of different genealogical tree spaces on simulations to study the settings that favor coarser resolutions. We apply our method to estimate the cardinality of genealogical tree spaces from mtDNA data from the 1000 Genomes Project and a sample from a Melanesian population at the β-globin locus.

    View details for DOI 10.1214/19-AOAS1313

    View details for PubMedID 33995755

    View details for PubMedCentralID PMC8118586

  • Discussion on "Horseshoe-based Bayesian nonparametric estimation of effective population size trajectories" by James R. Faulkner, Andrew F. Magee, Beth Shapiro, and Vladimir N. Minin. Biometrics Cappello, L., Ghosh, S., Palacios, J. A. 2020

    View details for DOI 10.1111/biom.13275

    View details for PubMedID 32378742

  • Bayesian Estimation of Population Size Changes by Sampling Tajima's Trees. Genetics Palacios, J. A., Véber, A., Cappello, L., Wang, Z., Wakeley, J., Ramachandran, S. 2019; 213 (3): 967-986

    Abstract

    The large state space of gene genealogies is a major hurdle for inference methods based on Kingman's coalescent. Here, we present a new Bayesian approach for inferring past population sizes, which relies on a lower-resolution coalescent process that we refer to as "Tajima's coalescent." Tajima's coalescent has a drastically smaller state space, and hence it is a computationally more efficient model, than the standard Kingman coalescent. We provide a new algorithm for efficient and exact likelihood calculations for data without recombination, which exploits a directed acyclic graph and a correspondingly tailored Markov Chain Monte Carlo method. We compare the performance of our Bayesian Estimation of population size changes by Sampling Tajima's Trees (BESTT) with a popular implementation of coalescent-based inference in BEAST using simulated and human data. We empirically demonstrate that BESTT can accurately infer effective population sizes, and it further provides an efficient alternative to Kingman's coalescent. The algorithms described here are implemented in the R package phylodyn, which is available for download at https://github.com/JuliaPalacios/phylodyn.

    View details for DOI 10.1534/genetics.119.302373

    View details for PubMedID 33954685

  • Bayesian Estimation of Population Size Changes by Sampling Tajima's Trees. Genetics Palacios, J. A., Véber, A., Cappello, L., Wang, Z., Wakeley, J., Ramachandran, S. 2019

    Abstract

    The large state space of gene genealogies is a major hurdle for inference methods based on Kingman's coalescent. Here, we present a new Bayesian approach for inferring past population sizes, which relies on a lower-resolution coalescent process that we refer to as "Tajima's coalescent". Tajima's coalescent has a drastically smaller state space, and hence it is a computationally more efficient model, than the standard Kingman coalescent. We provide a new algorithm for efficient and exact likelihood calculations for data without recombination, which exploits a directed acyclic graph and a correspondingly tailored Markov Chain Monte Carlo method. We compare the performance of our Bayesian Estimation of population size changes by Sampling Tajima's Trees (BESTT) with a popular implementation of coalescent-based inference in BEAST using simulated data and human data. We empirically demonstrate that BESTT can accurately infer effective population sizes, and it further provides an efficient alternative to Kingman's coalescent. The algorithms described here are implemented in the R package phylodyn, which is available for download at https://github.com/JuliaPalacios/phylodyn.

    View details for DOI 10.1534/genetics.119.302373

    View details for PubMedID 31511299

  • Exact limits of inference in coalescent models. Theoretical population biology Johndrow, J. E., Palacios, J. A. 2018

    Abstract

    Recovery of population size history from molecular sequence data is an important problem in population genetics. Inference commonly relies on a coalescent model linking the population size history to genealogies. The high computational cost of estimating parameters from these models usually compels researchers to select a subset of the available data or to rely on insufficient summary statistics for statistical inference. We consider the problem of recovering the true population size history from two possible alternatives on the basis of coalescent time data previously considered by Kim et al. (2015). We improve upon previous results by giving exact expressions for the probability of correctly distinguishing between the two hypotheses as a function of the separation between the alternative size histories, the number of individuals, loci, and the sampling times. In more complicated settings we estimate the exact probability of correct recovery by Monte Carlo simulation. Our results give considerably more pessimistic inferential limits than those previously reported. We also extend our analyses to pairwise SMC and SMC' models of recombination. This work is relevant for optimal design when the inference goal is to test scientific hypotheses about population size trajectories in coalescent models with and without recombination.

    View details for DOI 10.1016/j.tpb.2018.11.004

    View details for PubMedID 30571959

  • No Evidence for Recent Selection at FOXP2 among Diverse Human Populations. Cell Atkinson, E., Audesse, A., Palacios, J., Bobo, D., Webb, A., Ramachandran, S., Henn, B. 2018; 174 (6): 1424-+

    Abstract

    FOXP2, initially identified for its role in human speech, contains two nonsynonymous substitutions derived in the human lineage. Evidence for a recent selective sweep in Homo sapiens, however, is at odds with the presence of these substitutions in archaic hominins. Here, we comprehensively reanalyze FOXP2 in hundreds of globally distributed genomes to test for recent selection. We do not find evidence of recent positive or balancing selection at FOXP2. Instead, the original signal appears to have been due to sample composition. Our tests do identify an intronic region that is enriched for highly conserved sites that are polymorphic among humans, compatible with a loss of function in humans. This region is lowly expressed in relevant tissue types that were tested via RNA-seq in human prefrontal cortex and RT-PCR in immortalized human brain cells. Our results represent a substantial revision to the adaptive history of FOXP2, a gene regarded as vital to human evolution.

    View details for PubMedID 30078708

    View details for PubMedCentralID PMC6128738

  • PHYLODYN: an R package for phylodynamic simulation and inference MOLECULAR ECOLOGY RESOURCES Karcher, M. D., Palacios, J. A., Lan, S., Minin, V. N. 2017; 17 (1): 96-100

    Abstract

    We introduce phylodyn, an R package for phylodynamic analysis based on gene genealogies. The package's main functionality is Bayesian nonparametric estimation of effective population size fluctuations over time. Our implementation includes several Markov chain Monte Carlo-based methods and an integrated nested Laplace approximation-based approach for phylodynamic inference that have been developed in recent years. Genealogical data describe the timed ancestral relationships of individuals sampled from a population of interest. Here, individuals are assumed to be sampled at the same point in time (isochronous sampling) or at different points in time (heterochronous sampling); in addition, sampling events can be modelled with preferential sampling, which means that the intensity of sampling events is allowed to depend on the effective population size trajectory. We assume the coalescent and the sequentially Markov coalescent processes as generative models of genealogies. We include several coalescent simulation functions that are useful for testing our phylodynamics methods via simulation studies. We compare the performance and outputs of various methods implemented in phylodyn and outline their strengths and weaknesses. The R package phylodyn is available at https://github.com/mdkarcher/phylodyn.

    View details for DOI 10.1111/1755-0998.12630

    View details for Web of Science ID 000390413500012

  • phylodyn: an R package for phylodynamic simulation and inference. Molecular ecology resources Karcher, M. D., Palacios, J. A., Lan, S., Minin, V. N. 2016

    View details for DOI 10.1111/1755-0998.12630

    View details for PubMedID 27801980

  • Quantifying and Mitigating the Effect of Preferential Sampling on Phylodynamic Inference. PLoS computational biology Karcher, M. D., Palacios, J. A., Bedford, T., Suchard, M. A., Minin, V. N. 2016; 12 (3)

    Abstract

    Phylodynamics seeks to estimate effective population size fluctuations from molecular sequences of individuals sampled from a population of interest. One way to accomplish this task formulates an observed sequence data likelihood exploiting a coalescent model for the sampled individuals' genealogy and then integrating over all possible genealogies via Monte Carlo or, less efficiently, by conditioning on one genealogy estimated from the sequence data. However, when analyzing sequences sampled serially through time, current methods implicitly assume either that sampling times are fixed deterministically by the data collection protocol or that their distribution does not depend on the size of the population. Through simulation, we first show that, when sampling times do probabilistically depend on effective population size, estimation methods may be systematically biased. To correct for this deficiency, we propose a new model that explicitly accounts for preferential sampling by modeling the sampling times as an inhomogeneous Poisson process dependent on effective population size. We demonstrate that in the presence of preferential sampling our new model not only reduces bias, but also improves estimation precision. Finally, we compare the performance of the currently used phylodynamic methods with our proposed model through clinically-relevant, seasonal human influenza examples.

    View details for DOI 10.1371/journal.pcbi.1004789

    View details for PubMedID 26938243
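
    The idea of sampling times that follow an inhomogeneous Poisson process tied to effective population size can be sketched by thinning. This is only an illustration: the intensity form `beta0 * ne(t) ** beta1`, the function name `sample_times`, and all parameter names are assumptions made here, not the model specification from the paper.

```python
import random


def sample_times(ne, beta0, beta1, t_max, lam_max, seed=0):
    """Draw sampling times on [0, t_max] by thinning.

    Assumed intensity: lam(t) = beta0 * ne(t) ** beta1, where ne is a
    function returning the effective population size at time t.
    lam_max must be an upper bound on lam(t) over [0, t_max].
    """
    rng = random.Random(seed)
    times = []
    t = 0.0
    while True:
        # Candidate events come from a homogeneous process of rate lam_max.
        t += rng.expovariate(lam_max)
        if t > t_max:
            return times
        lam = beta0 * ne(t) ** beta1
        if lam > lam_max:
            raise ValueError("lam_max does not bound the intensity")
        # Keep each candidate with probability lam(t) / lam_max.
        if rng.random() < lam / lam_max:
            times.append(t)
```

    Setting `beta1 = 0` makes the intensity constant, so sampling no longer depends on population size; a positive `beta1` concentrates samples in periods when the population is large.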

  • An efficient Bayesian inference framework for coalescent-based nonparametric phylodynamics BIOINFORMATICS Lan, S., Palacios, J. A., Karcher, M., Minin, V. N., Shahbaba, B. 2015; 31 (20): 3282-3289

    Abstract

    The field of phylodynamics focuses on the problem of reconstructing population size dynamics over time using current genetic samples taken from the population of interest. This technique has been extensively used in many areas of biology but is particularly useful for studying the spread of quickly evolving infectious disease agents, e.g. influenza virus. Phylodynamic inference uses a coalescent model that defines a probability density for the genealogy of randomly sampled individuals from the population. When we assume that such a genealogy is known, the coalescent model, equipped with a Gaussian process prior on population size trajectory, allows for nonparametric Bayesian estimation of population size dynamics. Although this approach is quite powerful, large datasets collected during infectious disease surveillance challenge the state-of-the-art of Bayesian phylodynamics and demand inferential methods with relatively low computational cost. To satisfy this demand, we provide a computationally efficient Bayesian inference framework based on Hamiltonian Monte Carlo for coalescent process models. Moreover, we show that by splitting the Hamiltonian function, we can further improve the efficiency of this approach. Using several simulated and real datasets, we show that our method provides accurate estimates of population size dynamics and is substantially faster than alternative methods based on elliptical slice sampler and Metropolis-adjusted Langevin algorithm. The R code for all simulation studies and real data analysis conducted in this article is publicly available at http://www.ics.uci.edu/~slan/lanzi/CODES.html and in the R package phylodyn available at https://github.com/mdkarcher/phylodyn. Contact: S.Lan@warwick.ac.uk or babaks@uci.edu. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btv378

    View details for Web of Science ID 000362846600007

    View details for PubMedID 26093147
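
    For readers unfamiliar with Hamiltonian Monte Carlo, a single generic update can be sketched as follows. This is plain HMC for a scalar parameter with a unit-mass Gaussian momentum; it does not implement the split-Hamiltonian scheme or the coalescent target from the paper, and the name `hmc_step` and its arguments are hypothetical.

```python
import math
import random


def hmc_step(x, log_p, grad_log_p, eps, n_leap, rng):
    """One HMC update for a scalar x targeting density exp(log_p(x))."""
    p = rng.gauss(0.0, 1.0)  # fresh momentum draw
    x_new, p_new = x, p
    # Leapfrog integration: half momentum step, full steps, half step.
    p_new += 0.5 * eps * grad_log_p(x_new)
    for i in range(n_leap):
        x_new += eps * p_new
        if i < n_leap - 1:
            p_new += eps * grad_log_p(x_new)
    p_new += 0.5 * eps * grad_log_p(x_new)
    # Metropolis correction using the change in total energy.
    h_old = -log_p(x) + 0.5 * p * p
    h_new = -log_p(x_new) + 0.5 * p_new * p_new
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return x_new
    return x
```

    As a quick check, iterating `hmc_step` with `log_p = lambda x: -0.5 * x * x` and `grad_log_p = lambda x: -x` should produce draws whose mean and variance are close to 0 and 1.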

  • Bayesian Nonparametric Inference of Population Size Changes from Sequential Genealogies GENETICS Palacios, J. A., Wakeley, J., Ramachandran, S. 2015; 201 (1): 281-?

    Abstract

    Sophisticated inferential tools coupled with the coalescent model have recently emerged for estimating past population sizes from genomic data. Recent methods that model recombination require small sample sizes, make constraining assumptions about population size changes, and do not report measures of uncertainty for estimates. Here, we develop a Gaussian process-based Bayesian nonparametric method coupled with a sequentially Markov coalescent model that allows accurate inference of population sizes over time from a set of genealogies. In contrast to current methods, our approach considers a broad class of recombination events, including those that do not change local genealogies. We show that our method outperforms recent likelihood-based methods that rely on discretization of the parameter space. We illustrate the application of our method to multiple demographic histories, including population bottlenecks and exponential growth. In simulation, our Bayesian approach produces point estimates four times more accurate than maximum-likelihood estimation (based on the sum of absolute differences between the truth and the estimated values). Further, our method's credible intervals for population size as a function of time cover 90% of true values across multiple demographic scenarios, enabling formal hypothesis testing about population size differences over time. Using genealogies estimated with ARGweaver, we apply our method to European and Yoruban samples from the 1000 Genomes Project and confirm key known aspects of population size history over the past 150,000 years.

    View details for DOI 10.1534/genetics.115.177980

    View details for Web of Science ID 000361206400021

    View details for PubMedID 26224734

  • Phylogeography of the Trans-Volcanic bunchgrass lizard (Sceloporus bicanthalis) across the highlands of south-eastern Mexico BIOLOGICAL JOURNAL OF THE LINNEAN SOCIETY Leache, A. D., Palacios, J. A., Minin, V. N., Bryson, R. W. 2013; 110 (4): 852-865

    View details for DOI 10.1111/bij.12172

    View details for Web of Science ID 000330183200012

  • Gaussian Process-Based Bayesian Nonparametric Inference of Population Size Trajectories from Gene Genealogies BIOMETRICS Palacios, J. A., Minin, V. N. 2013; 69 (1): 8-18

    Abstract

    Changes in population size influence genetic diversity of the population and, as a result, leave a signature of these changes in individual genomes in the population. We are interested in the inverse problem of reconstructing past population dynamics from genomic data. We start with a standard framework based on the coalescent, a stochastic process that generates genealogies connecting randomly sampled individuals from the population of interest. These genealogies serve as a glue between the population demographic history and genomic sequences. It turns out that only the times of genealogical lineage coalescences contain information about population size dynamics. Viewing these coalescent times as a point process, estimating population size trajectories is equivalent to estimating a conditional intensity of this point process. Therefore, our inverse problem is similar to estimating an inhomogeneous Poisson process intensity function. We demonstrate how recent advances in Gaussian process-based nonparametric inference for Poisson processes can be extended to Bayesian nonparametric estimation of population size dynamics under the coalescent. We compare our Gaussian process (GP) approach to one of the state-of-the-art Gaussian Markov random field (GMRF) methods for estimating population trajectories. Using simulated data, we demonstrate that our method has better accuracy and precision. Next, we analyze two genealogies reconstructed from real sequences of hepatitis C and human Influenza A viruses. In both cases, we recover more believed aspects of the viral demographic histories than the GMRF approach. We also find that our GP method produces more reasonable uncertainty estimates than the GMRF method.

    View details for DOI 10.1111/biom.12003

    View details for Web of Science ID 000317303500003

    View details for PubMedID 23409705
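
    The point-process view in this abstract rests on the standard coalescent density: while k lineages remain, coalescence occurs at rate k(k-1)/2 divided by N(t). A minimal sketch of the resulting log-likelihood for isochronous samples is given below. It is not the Gaussian process method itself; the function name `coalescent_log_lik`, the midpoint-rule integration, and the `grid` parameter are assumptions made for illustration.

```python
import math


def coalescent_log_lik(coal_times, n, pop_size, grid=200):
    """Log-likelihood of coalescent times for n isochronous samples.

    coal_times: increasing times of the n - 1 coalescent events, measured
                backwards from the sampling time t = 0.
    pop_size:   function t -> N(t) > 0.
    The integral of 1 / N(t) over each interval is approximated by the
    midpoint rule with `grid` sub-intervals (an illustrative choice).
    """
    if len(coal_times) != n - 1:
        raise ValueError("expected n - 1 coalescent times")
    log_lik = 0.0
    prev = 0.0
    for k, t in zip(range(n, 1, -1), coal_times):
        pairs = k * (k - 1) / 2.0  # number of lineage pairs
        h = (t - prev) / grid
        integral = sum(h / pop_size(prev + (j + 0.5) * h) for j in range(grid))
        # Event term at time t plus survival term over (prev, t).
        log_lik += math.log(pairs / pop_size(t)) - pairs * integral
        prev = t
    return log_lik
```

    With a constant `pop_size` the integral is exact, which gives a simple sanity check: for three samples with coalescent times 0.1 and 0.6 and N(t) = 1, the value is ln 3 - 0.8.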