Honors & Awards

  • Career Award in the Biomedical Sciences, Burroughs Wellcome Fund (2004)
  • Sloan Fellow in Computational and Evolutionary Molecular Biology, Alfred P. Sloan Foundation (2006)
  • Dean's Basic Science Research Award, University of Michigan Medical School (2010)
  • Stanford Professorship in Population Genetics & Society, Stanford University School of Humanities & Sciences (2014)

Boards, Advisory Committees, Professional Organizations

  • Associate Editor, Evolution, Medicine, and Public Health (2014 - Present)
  • Editor-in-Chief, Theoretical Population Biology (2013 - Present)
  • Associate Editor, Molecular Biology and Evolution (2011 - 2014)
  • Associate Editor, Human Biology (2010 - Present)
  • Associate Editor, Genetics (2010 - Present)
  • Associate Editor, BMC Bioinformatics (2010 - 2014)
  • Associate Editor, American Journal of Human Genetics (2008 - 2010)

Professional Education

  • BA, Rice University, Mathematics (1997)
  • MS, Stanford University, Mathematics (1999)
  • PhD, Stanford University, Biology (2001)
  • Postdoc, University of Southern California, Molecular/Computational Biology (2005)

Current Research and Scholarly Interests

Research in the lab addresses problems in evolutionary biology and human
genetics through a combination of mathematical modeling, computer
simulations, development of statistical methods, and inference from
population-genetic data. Our current work covers topics such as human
genetic variation, inference of human evolutionary history, the role of
population genetics in the search for disease-susceptibility genes, the
relationship of gene trees and species trees, and mathematical properties
of statistics used for analyzing genetic variability.

Graduate and Fellowship Programs

  • Biology (School of Humanities and Sciences) (PhD Program)

All Publications

  • Autosomal Admixture Levels Are Informative About Sex Bias in Admixed Populations GENETICS Goldberg, A., Verdu, P., Rosenberg, N. A. 2014; 198 (3): 1209-1229
  • Theory and applications of a deterministic approximation to the coalescent model THEORETICAL POPULATION BIOLOGY Jewett, E. M., Rosenberg, N. A. 2014; 93: 14-29


    Under the coalescent model, the random number n_t of lineages ancestral to a sample is nearly deterministic as a function of time when n_t is moderate to large in value, and it is well approximated by its expectation E[n_t]. In turn, this expectation is well approximated by simple deterministic functions that are easy to compute. Such deterministic functions have been applied to estimate allele age, effective population size, and genetic diversity, and they have been used to study properties of models of infectious disease dynamics. Although a number of simple approximations of E[n_t] have been derived and applied to problems of population-genetic inference, the theoretical accuracy of the resulting approximate formulas and the inferences obtained using these approximations is not known, and the range of problems to which they can be applied is not well understood. Here, we demonstrate general procedures by which the approximation n_t ≈ E[n_t] can be used to reduce the computational complexity of coalescent formulas, and we show that the resulting approximations converge to their true values under simple assumptions. Such approximations provide alternatives to exact formulas that are computationally intractable or numerically unstable when the number of sampled lineages is moderate or large. We also extend an existing class of approximations of E[n_t] to the case of multiple populations of time-varying size with migration among them. Our results facilitate the use of the deterministic approximation n_t ≈ E[n_t] for deriving functionally simple, computationally efficient, and numerically stable approximations of coalescent formulas under complicated demographic scenarios.

    View details for DOI 10.1016/j.tpb.2013.12.007

    View details for Web of Science ID 000333727800002

    View details for PubMedID 24412419
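
As a rough illustration of the kind of "simple deterministic function" the abstract above refers to, the sketch below solves dn/dt = -n(n-1)/2 in closed form for a single population of constant size, with time t measured in coalescent units. Treating this particular function as representative of the approximations analyzed in the paper is an assumption; the function name `expected_lineages_approx` is hypothetical.

```python
import math


def expected_lineages_approx(n0: int, t: float) -> float:
    """Deterministic approximation to E[n_t] for a sample of n0 lineages.

    Assumptions (not taken from the paper): one population of constant
    size, time t in coalescent units, and lineage loss following the ODE
    dn/dt = -n(n-1)/2 with n(0) = n0. The closed-form solution is
    n(t) = n0 / (n0 - (n0 - 1) * exp(-t / 2)).
    """
    if n0 < 1:
        raise ValueError("n0 must be at least 1")
    if t < 0:
        raise ValueError("t must be non-negative")
    return n0 / (n0 - (n0 - 1) * math.exp(-t / 2.0))


if __name__ == "__main__":
    # The approximate number of ancestral lineages decays from n0 toward 1.
    for t in (0.0, 0.01, 0.1, 1.0, 10.0):
        print(t, expected_lineages_approx(50, t))
```

The closed form avoids the alternating sums in exact expressions for the distribution of n_t, which is the numerical-stability motivation the abstract mentions.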

  • An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines BMC EVOLUTIONARY BIOLOGY DeGiorgio, M., Syring, J., Eckert, A. J., Liston, A., Cronn, R., Neale, D. B., Rosenberg, N. A. 2014; 14
  • Discordance of Species Trees with Their Most Likely Gene Trees: A Unifying Principle MOLECULAR BIOLOGY AND EVOLUTION Rosenberg, N. A. 2013; 30 (12): 2709-2713


    A labeled gene tree topology that disagrees with a labeled species tree topology is said to be anomalous if it is more probable under a coalescent model for gene lineage evolution than the labeled gene tree topology that matches the species tree. It has previously been shown that as a consequence of short internal branches of the species tree, for every labeled species tree topology with five or more taxa, and for asymmetric four-taxon species tree topologies, an assignment of species tree branch lengths can be made which gives rise to anomalous gene trees (AGTs). Here, I offer an alternative characterization of this result--a labeled species tree topology produces AGTs if and only if it contains two consecutive internal branches in an ancestor-descendant relationship--and I provide a proof that follows from the change in perspective. The reformulation and alternative proof of the existence result for AGTs provide the insight that it is not merely short internal branches that generate AGTs, but instead, short internal branches that are arranged consecutively.

    View details for DOI 10.1093/molbev/mst160

    View details for Web of Science ID 000327793000016

    View details for PubMedID 24030555

  • A Population-Genetic Perspective on the Similarities and Differences Among Worldwide Human Populations HUMAN BIOLOGY Rosenberg, N. A. 2011; 83 (6): 659-684


    Recent studies have produced a variety of advances in the investigation of genetic similarities and differences among human populations. Here, I pose a series of questions about human population-genetic similarities and differences, and I then answer these questions by numerical computation with a single shared population-genetic data set. The collection of answers obtained provides an introductory perspective for understanding key results on the features of worldwide human genetic variation.

    View details for Web of Science ID 000209009300001

  • Genotype, haplotype and copy-number variation in worldwide human populations NATURE Jakobsson, M., Scholz, S. W., Scheet, P., Gibbs, J. R., VanLiere, J. M., Fung, H., Szpiech, Z. A., Degnan, J. H., Wang, K., Guerreiro, R., Bras, J. M., Schymick, J. C., Hernandez, D. G., Traynor, B. J., Simon-Sanchez, J., Matarin, M., Britton, A., van de Leemput, J., Rafferty, I., Bucan, M., Cann, H. M., Hardy, J. A., Rosenberg, N. A., Singleton, A. B. 2008; 451 (7181): 998-1003


    Genome-wide patterns of variation across individuals provide a powerful source of data for uncovering the history of migration, range expansion, and adaptation of the human species. However, high-resolution surveys of variation in genotype, haplotype and copy number have generally focused on a small number of population groups. Here we report the analysis of high-quality genotypes at 525,910 single-nucleotide polymorphisms (SNPs) and 396 copy-number-variable loci in a worldwide sample of 29 populations. Analysis of SNP genotypes yields strongly supported fine-scale inferences about population structure. Increasing linkage disequilibrium is observed with increasing geographic distance from Africa, as expected under a serial founder effect for the out-of-Africa spread of human populations. New approaches for haplotype analysis produce inferences about population structure that complement results based on unphased SNPs. Despite a difference from SNPs in the frequency spectrum of the copy-number variants (CNVs) detected--including a comparatively large number of CNVs in previously unexamined populations from Oceania and the Americas--the global distribution of CNVs largely accords with population structure analyses for SNP data sets of similar size. Our results produce new inferences about inter-population variation, support the utility of CNVs in human population-genetic research, and serve as a genomic resource for human-genetic studies in diverse worldwide populations.

    View details for DOI 10.1038/nature06742

    View details for Web of Science ID 000253313100050

    View details for PubMedID 18288195

  • Choosing Subsamples for Sequencing Studies by Minimizing the Average Distance to the Closest Leaf GENETICS Kang, J. T., Zhang, P., Zoellner, S., Rosenberg, N. A. 2015; 201 (2): 499-511


    Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It can also employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels, as they can reduce imputation errors arising from genetic dissimilarity between a population of interest and a second, distinct population from which the external reference panel has been constructed. As the cost of next-generation sequencing decreases, internal reference panel selection is becoming increasingly feasible. However, it is not clear how best to select individuals to include in such panels. We introduce a new method for selecting an internal reference panel, minimizing the average distance to the closest leaf (ADCL), and compare its performance relative to an earlier algorithm: maximizing phylogenetic diversity (PD). Employing both simulated data and sequences from the 1000 Genomes Project, we show that ADCL provides a significant improvement in imputation accuracy, especially for imputation of sites with low-frequency alleles. This improvement in imputation accuracy is robust to changes in reference panel size, marker density, and length of the imputation target region.

    View details for DOI 10.1534/genetics.115.176909

    View details for Web of Science ID 000362838500013
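
The ADCL criterion named in the abstract above can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the paper's algorithm: distances are supplied as a plain pairwise matrix rather than computed on a tree, the average is taken over all individuals (panel members contribute zero), and the panel is found by exhaustive search, which is feasible only for very small inputs. The names `adcl` and `best_panel` are hypothetical.

```python
from itertools import combinations


def adcl(dist, panel):
    """Average, over all individuals, of the distance to the closest panel member.

    dist is a square matrix (list of lists) of pairwise distances;
    panel is an iterable of row indices chosen as the reference panel.
    """
    panel = list(panel)
    n = len(dist)
    return sum(min(dist[i][j] for j in panel) for i in range(n)) / n


def best_panel(dist, k):
    """Return a size-k panel minimizing ADCL by brute-force enumeration."""
    return min(combinations(range(len(dist)), k), key=lambda p: adcl(dist, p))


if __name__ == "__main__":
    # Toy data: five individuals placed on a line, forming two clusters.
    positions = [0, 1, 2, 10, 11]
    dist = [[abs(a - b) for b in positions] for a in positions]
    panel = best_panel(dist, 2)
    print(panel, adcl(dist, panel))
```

In the toy data the selected panel takes one individual from each cluster, which matches the intuition that a low ADCL keeps every study individual close to some sequenced individual.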

  • Coalescent Histories for Lodgepole Species Trees JOURNAL OF COMPUTATIONAL BIOLOGY Disanto, F., Rosenberg, N. A. 2015; 22 (10): 918-929


    Coalescent histories are combinatorial structures that describe for a given gene tree and species tree the possible lists of branches of the species tree on which the gene tree coalescences take place. Properties of the number of coalescent histories for gene trees and species trees affect a variety of probabilistic calculations in mathematical phylogenetics. Exact and asymptotic evaluations of the number of coalescent histories, however, are known only in a limited number of cases. Here we introduce a particular family of species trees, the lodgepole species trees (λ_n) for n ≥ 0, in which tree λ_n has m = 2n+1 taxa. We determine the number of coalescent histories for the lodgepole species trees, in the case that the gene tree matches the species tree, showing that this number grows with m!! in the number of taxa m. This computation demonstrates the existence of tree families in which the growth in the number of coalescent histories is faster than exponential. Further, it provides a substantial improvement on the lower bound for the ratio of the largest number of matching coalescent histories to the smallest number of matching coalescent histories for trees with m taxa, increasing a previous bound of [Formula: see text] to [Formula: see text]. We discuss the implications of our enumerative results for phylogenetic computations.

    View details for DOI 10.1089/cmb.2015.0015

    View details for Web of Science ID 000361890100002

  • Clumpak: a program for identifying clustering modes and packaging population structure inferences across K MOLECULAR ECOLOGY RESOURCES Kopelman, N. M., Mayzel, J., Jakobsson, M., Rosenberg, N. A., Mayrose, I. 2015; 15 (5): 1179-1191


    The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population-genetic data analysis. Application of model-based clustering programs often entails a number of steps, in which the user considers different modelling assumptions, compares results across different predetermined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present Clumpak (Cluster Markov Packager Across K), a method that automates the postprocessing of results of model-based population structure analyses. For analysing multiple independent runs at a single K value, Clumpak identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software Clumpp. Next, Clumpak identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in Clumpp and simplifying the comparison of clustering results across different K values. Clumpak incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. Clumpak, available at, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology.

    View details for DOI 10.1111/1755-0998.12387

    View details for Web of Science ID 000359631600017

    View details for PubMedID 25684545

  • Genetic Diversity and Societally Important Disparities. Genetics Rosenberg, N. A., Kang, J. T. 2015; 201 (1): 1-12


    The magnitude of genetic diversity within human populations varies in a way that reflects the sequence of migrations by which people spread throughout the world. Beyond its use in human evolutionary genetics, worldwide variation in genetic diversity sometimes can interact with social processes to produce differences among populations in their relationship to modern societal problems. We review the consequences of genetic diversity differences in the settings of familial identification in forensic genetic testing, match probabilities in bone marrow transplantation, and representation in genome-wide association studies of disease. In each of these three cases, the contribution of genetic diversity to social differences follows from population-genetic principles. For a fourth setting that is not similarly grounded, we reanalyze with expanded genetic data a report that genetic diversity differences influence global patterns of human economic development, finding no support for the claim. The four examples describe a limit to the importance of genetic diversity for explaining societal differences while illustrating a distinction that certain biologically based scenarios do require consideration of genetic diversity for solving problems to which populations have been differentially predisposed by the unique history of human migrations.

    View details for DOI 10.1534/genetics.115.176750

    View details for PubMedID 26354973

  • Beyond 2/3 and 1/3: The Complex Signatures of Sex-Biased Admixture on the X Chromosome GENETICS Goldberg, A., Rosenberg, N. A. 2015; 201 (1): 263-279


    Sex-biased demography, in which parameters governing migration and population size differ between females and males, has been studied through comparisons of X chromosomes, which are inherited sex-specifically, and autosomes, which are not. A common form of sex bias in humans is sex-biased admixture, in which at least one of the source populations differs in its proportions of females and males contributing to an admixed population. Studies of sex-biased admixture often examine the mean ancestry for markers on the X chromosome in relation to the autosomes. A simple framework noting that in a population with equally many females and males, two-thirds of X chromosomes appear in females, suggests that the mean X-chromosomal admixture fraction is a linear combination of female and male admixture parameters, with coefficients 2/3 and 1/3, respectively. Extending a mechanistic admixture model to accommodate the X chromosome, we demonstrate that this prediction is not generally true in admixture models, although it holds in the limit for an admixture process occurring as a single event. For a model with constant ongoing admixture, we determine the mean X-chromosomal admixture, comparing admixture on female and male X chromosomes to corresponding autosomal values. Surprisingly, in reanalyzing African-American genetic data to estimate sex-specific contributions from African and European sources, we find that the range of contributions compatible with the excess African ancestry on the X chromosome compared to autosomes has a wide spread, permitting scenarios either without male-biased contributions from Europe or without female-biased contributions from Africa.

    View details for DOI 10.1534/genetics.115.178509

    View details for Web of Science ID 000361206400020
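
The "simple framework" described in the abstract above amounts to a weighted average with coefficients 2/3 and 1/3. The sketch below computes only that simple prediction; the abstract reports that it does not hold in general admixture models, so it should be read as the baseline being examined, not as the paper's result. The function and parameter names are hypothetical.

```python
def mean_x_admixture_simple(female_fraction: float, male_fraction: float) -> float:
    """Baseline prediction for the mean X-chromosomal admixture fraction.

    female_fraction and male_fraction are the fractions of females and of
    males, respectively, that a given source population contributes to the
    admixed population. With equally many females and males, two-thirds of
    X chromosomes are carried by females, giving weights 2/3 and 1/3.
    """
    for value in (female_fraction, male_fraction):
        if not 0.0 <= value <= 1.0:
            raise ValueError("admixture fractions must lie in [0, 1]")
    return (2.0 / 3.0) * female_fraction + (1.0 / 3.0) * male_fraction


if __name__ == "__main__":
    # A female-biased contribution raises the X-chromosomal value toward the
    # female fraction.
    print(mean_x_admixture_simple(0.9, 0.6))
```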

  • Implications of the apportionment of human genetic diversity for the apportionment of human phenotypic diversity. Studies in history and philosophy of biological and biomedical sciences Edge, M. D., Rosenberg, N. A. 2015; 52: 32-45


    Researchers in many fields have considered the meaning of two results about genetic variation for concepts of "race." First, at most genetic loci, apportionments of human genetic diversity find that worldwide populations are genetically similar. Second, when multiple genetic loci are examined, it is possible to distinguish people with ancestry from different geographical regions. These two results raise an important question about human phenotypic diversity: To what extent do populations typically differ on phenotypes determined by multiple genetic loci? It might be expected that such phenotypes follow the pattern of similarity observed at individual loci. Alternatively, because they have a multilocus genetic architecture, they might follow the pattern of greater differentiation suggested by multilocus ancestry inference. To address the question, we extend a well-known classification model of Edwards (2003) by adding a selectively neutral quantitative trait. Using the extended model, we show, in line with previous work in quantitative genetics, that regardless of how many genetic loci influence the trait, one neutral trait is approximately as informative about ancestry as a single genetic locus. The results support the relevance of single-locus genetic-diversity partitioning for predictions about phenotypic diversity.

    View details for DOI 10.1016/j.shpsc.2014.12.005

    View details for PubMedID 25677859

  • Enhancing the mathematical properties of new haplotype homozygosity statistics for the detection of selective sweeps THEORETICAL POPULATION BIOLOGY Garud, N. R., Rosenberg, N. A. 2015; 102: 94-101


    Soft selective sweeps represent an important form of adaptation in which multiple haplotypes bearing adaptive alleles rise to high frequency. Most statistical methods for detecting selective sweeps from genetic polymorphism data, however, have focused on identifying hard selective sweeps in which a favored allele appears on a single haplotypic background; these methods might be underpowered to detect soft sweeps. Among exceptions is the set of haplotype homozygosity statistics introduced for the detection of soft sweeps by Garud et al. (2015). These statistics, examining frequencies of multiple haplotypes in relation to each other, include H12, a statistic designed to identify both hard and soft selective sweeps, and H2/H1, a statistic that conditional on high H12 values seeks to distinguish between hard and soft sweeps. A challenge in the use of H2/H1 is that its range depends on the associated value of H12, so that equal H2/H1 values might provide different levels of support for a soft sweep model at different values of H12. Here, we enhance the H12 and H2/H1 haplotype homozygosity statistics for selective sweep detection by deriving the upper bound on H2/H1 as a function of H12, thereby generating a statistic that normalizes H2/H1 to lie between 0 and 1. Through a reanalysis of resequencing data from inbred lines of Drosophila, we show that the enhanced statistic both strengthens interpretations obtained with the unnormalized statistic and leads to empirical insights that are less readily apparent without the normalization.

    View details for DOI 10.1016/j.tpb.2015.04.001

    View details for Web of Science ID 000355239700009

    View details for PubMedID 25891325
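
For readers unfamiliar with the statistics named in the abstract above, the sketch below computes H1, H12, and H2/H1 from a list of haplotype frequencies. It assumes the definitions commonly attributed to Garud et al. (2015): H1 is the sum of squared frequencies, H12 pools the two most frequent haplotypes before squaring, and H2 is H1 without the most frequent haplotype's term. The normalization derived in the paper is not reproduced here, and the function name is hypothetical.

```python
def haplotype_homozygosity_stats(freqs):
    """Return H1, H12, and H2/H1 for a list of haplotype frequencies.

    Assumed definitions, with p1 >= p2 >= ... the sorted frequencies:
      H1    = sum_i p_i^2
      H12   = (p1 + p2)^2 + sum_{i>2} p_i^2
      H2    = H1 - p1^2
    """
    p = sorted(freqs, reverse=True)
    if not p or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("frequencies must be non-empty and sum to 1")
    h1 = sum(x * x for x in p)
    second = p[1] if len(p) > 1 else 0.0
    h12 = (p[0] + second) ** 2 + sum(x * x for x in p[2:])
    h2 = h1 - p[0] ** 2
    return {"H1": h1, "H12": h12, "H2/H1": h2 / h1}


if __name__ == "__main__":
    # One dominant haplotype (hard-sweep-like) versus two common haplotypes
    # (soft-sweep-like): H12 is high in both, H2/H1 differs.
    print(haplotype_homozygosity_stats([0.9, 0.05, 0.05]))
    print(haplotype_homozygosity_stats([0.45, 0.45, 0.10]))
```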

  • A comparison of worldwide phonemic and genetic variation in human populations PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Creanza, N., Ruhlen, M., Pemberton, T. J., Rosenberg, N. A., Feldman, M. W., Ramachandran, S. 2015; 112 (5): 1265-1272


    Worldwide patterns of genetic variation are driven by human demographic history. Here, we test whether this demographic history has left similar signatures on phonemes (sound units that distinguish meaning between words in languages) to those it has left on genes. We analyze, jointly and in parallel, phoneme inventories from 2,082 worldwide languages and microsatellite polymorphisms from 246 worldwide populations. On a global scale, both genetic distance and phonemic distance between populations are significantly correlated with geographic distance. Geographically close language pairs share significantly more phonemes than distant language pairs, whether or not the languages are closely related. The regional geographic axes of greatest phonemic differentiation correspond to axes of genetic differentiation, suggesting that there is a relationship between human dispersal and linguistic variation. However, the geographic distribution of phoneme inventory sizes does not follow the predictions of a serial founder effect during human expansion out of Africa. Furthermore, although geographically isolated populations lose genetic diversity via genetic drift, phonemes are not subject to drift in the same way: within a given geographic radius, languages that are relatively isolated exhibit more variance in number of phonemes than languages with many neighbors. This finding suggests that relatively isolated languages are more susceptible to phonemic change than languages with many neighbors. Within a language family, phoneme evolution along genetic, geographic, or cognate-based linguistic trees predicts similar ancestral phoneme states to those predicted from ancient sources. More genetic sampling could further elucidate the relative roles of vertical and horizontal transmission in phoneme evolution.

    View details for DOI 10.1073/pnas.1424033112

    View details for Web of Science ID 000349087700028

  • AABC: approximate approximate Bayesian computation for inference in population-genetic models. Theoretical population biology Buzbas, E. O., Rosenberg, N. A. 2015; 99: 31-42


    Approximate Bayesian computation (ABC) methods perform inference on model-specific parameters of mechanistically motivated parametric models when evaluating likelihoods is difficult. Central to the success of ABC methods, which have been used frequently in biology, is computationally inexpensive simulation of data sets from the parametric model of interest. However, when simulating data sets from a model is so computationally expensive that the posterior distribution of parameters cannot be adequately sampled by ABC, inference is not straightforward. We present "approximate approximate Bayesian computation" (AABC), a class of computationally fast inference methods that extends ABC to models in which simulating data is expensive. In AABC, we first simulate a number of data sets small enough to be computationally feasible to simulate from the parametric model. Conditional on these data sets, we use a statistical model that approximates the correct parametric model and enables efficient simulation of a large number of data sets. We show that under mild assumptions, the posterior distribution obtained by AABC converges to the posterior distribution obtained by ABC, as the number of data sets simulated from the parametric model and the sample size of the observed data set increase. We demonstrate the performance of AABC on a population-genetic model of natural selection, as well as on a model of the admixture history of hybrid populations. This latter example illustrates how, in population genetics, AABC is of particular utility in scenarios that rely on conceptually straightforward but potentially slow forward-in-time simulations.

    View details for DOI 10.1016/j.tpb.2014.09.002

    View details for PubMedID 25261426

  • On the Number of Ranked Species Trees Producing Anomalous Ranked Gene Trees IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Disanto, F., Rosenberg, N. A. 2014; 11 (6): 1229-1238
  • Upper bounds on FST in terms of the frequency of the most frequent allele and total homozygosity: the case of a specified number of alleles. Theoretical population biology Edge, M. D., Rosenberg, N. A. 2014; 97: 20-34


    FST is one of the most frequently used indices of genetic differentiation among groups. Though FST takes values between 0 and 1, authors going back to Wright have noted that under many circumstances, FST is constrained to be less than 1. Recently, we showed that at a genetic locus with an unspecified number of alleles, FST for two subpopulations is strictly bounded from above by functions of both the frequency of the most frequent allele (M) and the homozygosity of the total population (HT). In the two-subpopulation case, FST can equal one only when the frequency of the most frequent allele and the total homozygosity are 1/2. Here, we extend this work by deriving strict bounds on FST for two subpopulations when the number of alleles at the locus is specified to be I. We show that restricting to I alleles produces the same upper bound on FST over much of the allowable domain for M and HT, and we derive more restrictive bounds in the windows M∈[1/I,1/(I-1)) and HT∈[1/I,I/(I^2-1)). These results extend our understanding of the behavior of FST in relation to other population-genetic statistics.

    View details for DOI 10.1016/j.tpb.2014.08.001

    View details for PubMedID 25132646
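
The quantities related in the abstract above (FST, M, and HT) can be computed for a toy locus as follows. This is a sketch under assumptions not stated in the abstract: the two subpopulations are weighted equally, and FST is taken as (HS - HT) / (1 - HT), where HS is the mean within-subpopulation homozygosity and HT is the homozygosity of the pooled allele frequencies. The paper's bounds themselves are not reproduced, and the function name is hypothetical.

```python
def fst_two_subpopulations(p1, p2):
    """Return (FST, M, HT) for one locus in two equally weighted subpopulations.

    p1 and p2 are allele-frequency lists of equal length, each summing to 1.
    M is the frequency of the most frequent allele in the total population;
    HT is the homozygosity of the total population.
    """
    if len(p1) != len(p2):
        raise ValueError("allele-frequency lists must have equal length")
    total = [(a + b) / 2.0 for a, b in zip(p1, p2)]
    h_s = (sum(a * a for a in p1) + sum(b * b for b in p2)) / 2.0
    h_t = sum(x * x for x in total)
    if h_t >= 1.0:
        raise ValueError("FST is undefined when the total population is monomorphic")
    return (h_s - h_t) / (1.0 - h_t), max(total), h_t


if __name__ == "__main__":
    # Subpopulations fixed for different alleles: FST reaches 1, and both
    # M and HT equal 1/2, as the abstract states for the two-subpopulation case.
    print(fst_two_subpopulations([1.0, 0.0], [0.0, 1.0]))
    # Identical subpopulations: no differentiation.
    print(fst_two_subpopulations([0.5, 0.5], [0.5, 0.5]))
```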

  • Mean deep coalescence cost under exchangeable probability distributions DISCRETE APPLIED MATHEMATICS Than, C. V., Rosenberg, N. A. 2014; 174: 11-26
  • Patterns of Admixture and Population Structure in Native Populations of Northwest North America PLOS GENETICS Verdu, P., Pemberton, T. J., Laurent, R., Kemp, B. M., Gonzalez-Oliver, A., Gorodezky, C., Hughes, C. E., Shattuck, M. R., Petzelt, B., Mitchell, J., Harry, H., William, T., Worl, R., Cybulski, J. S., Rosenberg, N. A., Malhi, R. S. 2014; 10 (8)


    The initial contact of European populations with indigenous populations of the Americas produced diverse admixture processes across North, Central, and South America. Recent studies have examined the genetic structure of indigenous populations of Latin America and the Caribbean and their admixed descendants, reporting on the genomic impact of the history of admixture with colonizing populations of European and African ancestry. However, relatively little genomic research has been conducted on admixture in indigenous North American populations. In this study, we analyze genomic data at 475,109 single-nucleotide polymorphisms sampled in indigenous peoples of the Pacific Northwest in British Columbia and Southeast Alaska, populations with a well-documented history of contact with European and Asian traders, fishermen, and contract laborers. We find that the indigenous populations of the Pacific Northwest have higher gene diversity than Latin American indigenous populations. Among the Pacific Northwest populations, interior groups provide more evidence for East Asian admixture, whereas coastal groups have higher levels of European admixture. In contrast with many Latin American indigenous populations, the variance of admixture is high in each of the Pacific Northwest indigenous populations, as expected for recent and ongoing admixture processes. The results reveal some similarities but notable differences between admixture patterns in the Pacific Northwest and those in Latin America, contributing to a more detailed understanding of the genomic consequences of European colonization events throughout the Americas.

    View details for DOI 10.1371/journal.pgen.1004530

    View details for Web of Science ID 000341577800027

    View details for PubMedID 25122539

  • Population-Genetic Influences on Genomic Estimates of the Inbreeding Coefficient: A Global Perspective HUMAN HEREDITY Pemberton, T. J., Rosenberg, N. A. 2014; 77 (1-4): 37-48


    Culturally driven marital practices provide a key instance of an interaction between social and genetic processes in shaping patterns of human genetic variation, producing, for example, increased identity by descent through consanguineous marriage. A commonly used measure to quantify identity by descent in an individual is the inbreeding coefficient, a quantity that reflects not only consanguinity, but also other aspects of kinship in the population to which the individual belongs. Here, in populations worldwide, we examine the relationship between genomic estimates of the inbreeding coefficient and population patterns in genetic variation. Using genotypes at 645 microsatellites, we compare inbreeding coefficients from 5,043 individuals representing 237 populations worldwide to demographic consanguinity frequency estimates available for 26 populations as well as to other quantities that can illuminate population-genetic influences on inbreeding coefficients. We observe higher inbreeding coefficient estimates in populations and geographic regions with known high levels of consanguinity or genetic isolation and in populations with an increased effect of genetic drift and decreased genetic diversity with increasing distance from Africa. For the small number of populations with specific consanguinity estimates, we find a correlation between inbreeding coefficients and consanguinity frequency (r = 0.349, p = 0.040). The results emphasize the importance of both consanguinity and population-genetic factors in influencing variation in inbreeding coefficients, and they provide insight into factors useful for assessing the effect of consanguinity on genomic patterns in different populations. © 2014 S. Karger AG, Basel.

    View details for DOI 10.1159/000362878

    View details for Web of Science ID 000339321800006

    View details for PubMedID 25060268

  • From generation to generation: the genetics of Jewish populations. Human biology Rosenberg, N. A., Weitzman, S. P. 2013; 85 (6): 817-824

    View details for PubMedID 25079121

  • No Evidence from Genome-wide Data of a Khazar Origin for the Ashkenazi Jews HUMAN BIOLOGY Behar, D. M., Metspalu, M., Baran, Y., Kopelman, N. M., Yunusbayev, B., Gladstein, A., Tzur, S., Sahakyan, H., Bahmanimehr, A., Yepiskoposyan, L., Tambets, K., Khusnutdinova, E. K., Kushniarevich, A., Balanovsky, O., Balanovsky, E., Kovacevic, L., Marjanovic, D., Mihailov, E., Kouvatsi, A., Triantaphyllidis, C., King, R. J., Semino, O., Torroni, A., Hammer, M. F., Metspalu, E., Skorecki, K., Rosset, S., Halperin, E., Villems, R., Rosenberg, N. A. 2013; 85 (6): 859-900
  • Genetics and the History of the Samaritans: Y-Chromosomal Microsatellites and Genetic Affinity between Samaritans and Cohanim HUMAN BIOLOGY Oefner, P. J., Hoelzl, G., Shen, P., Shpirer, I., Gefel, D., Lavi, T., Woolf, E., Cohen, J., Cinnioglu, C., Underhill, P. A., Rosenberg, N. A., Hochrein, J., Granka, J. M., Hillel, J., Feldman, M. W. 2013; 85 (6): 825-857
  • Genotype Imputation Reference Panel Selection Using Maximal Phylogenetic Diversity GENETICS Zhang, P., Zhan, X., Rosenberg, N. A., Zoellner, S. 2013; 195 (2): 319-330
  • Coalescent Histories for Caterpillar-Like Families IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Rosenberg, N. A. 2013; 10 (5): 1253-1262


    A coalescent history is an assignment of branches of a gene tree to branches of a species tree on which coalescences in the gene tree occur. The number of coalescent histories for a pair consisting of a labeled gene tree topology and a labeled species tree topology is important in gene tree probability computations, and more generally, in studying evolutionary possibilities for gene trees on species trees. Defining the Tr-caterpillar-like family as a sequence of n-taxon trees constructed by replacing the r-taxon subtree of n-taxon caterpillars by a specific r-taxon labeled topology Tr, we examine the number of coalescent histories for caterpillar-like families with matching gene tree and species tree labeled topologies. For each Tr with size r≤8, we compute the number of coalescent histories for n-taxon trees in the Tr-caterpillar-like family. Next, as n→∞, we find that the limiting ratio of the numbers of coalescent histories for the Tr family and caterpillars themselves is correlated with the number of labeled histories for Tr. The results support a view that large numbers of coalescent histories occur when a tree has both a relatively balanced subtree and a high tree depth, contributing to deeper understanding of the combinatorics of gene trees and species trees.
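    As a small illustration of the counting problem for caterpillars themselves, the sketch below enumerates coalescent histories for a matching n-taxon caterpillar gene tree and species tree. The encoding is an assumption made for this example, not notation from the paper: internal branches are numbered 1 (above the cherry) to n-1 (above the root), coalescence i must be placed on a branch numbered at least i, and placements cannot move down the tree. The function names are hypothetical. The counts are compared against the Catalan numbers, the known result for matching caterpillars.

```python
from math import comb


def caterpillar_histories(n: int) -> int:
    """Count coalescent histories for matching n-taxon caterpillar trees.

    Assumed encoding: a history is a sequence b_1 <= b_2 <= ... <= b_{n-1}
    with i <= b_i <= n - 1, where b_i is the species tree branch that
    carries the i-th gene tree coalescence.
    """
    if n < 2:
        raise ValueError("need at least two taxa")
    top = n - 1  # index of the branch above the root
    # counts[b] = number of partial histories whose latest coalescence is on b
    counts = {b: 1 for b in range(1, top + 1)}
    for i in range(2, n):
        counts = {
            b: sum(c for prev, c in counts.items() if prev <= b)
            for b in range(i, top + 1)
        }
    return sum(counts.values())


def catalan(m: int) -> int:
    """m-th Catalan number."""
    return comb(2 * m, m) // (m + 1)


for n in range(2, 8):
    print(n, caterpillar_histories(n), catalan(n - 1))
```

    The enumeration grows quadratically in n per step, which is fine for illustration; the caterpillar-like families studied in the paper require more general machinery that is not reproduced here.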

    View details for DOI 10.1109/TCBB.2013.123

    View details for Web of Science ID 000331461400017

    View details for PubMedID 24524157

  • Genotype imputation in a coalescent model with infinitely-many-sites mutation THEORETICAL POPULATION BIOLOGY Huang, L., Buzbas, E. O., Rosenberg, N. A. 2013; 87: 62-74


    Empirical studies have identified population-genetic factors as important determinants of the properties of genotype-imputation accuracy in imputation-based disease association studies. Here, we develop a simple coalescent model of three sequences that we use to explore the theoretical basis for the influence of these factors on genotype-imputation accuracy, under the assumption of infinitely-many-sites mutation. Employing a demographic model in which two populations diverged at a given time in the past, we derive the approximate expectation and variance of imputation accuracy in a study sequence sampled from one of the two populations, choosing between two reference sequences, one sampled from the same population as the study sequence and the other sampled from the other population. We show that, under this model, imputation accuracy (as measured by the proportion of polymorphic sites that are imputed correctly in the study sequence) increases in expectation with the mutation rate, the proportion of the markers in a chromosomal region that are genotyped, and the time to divergence between the study and reference populations. Each of these effects derives largely from an increase in information available for determining the reference sequence that is genetically most similar to the sequence targeted for imputation. We analyze as a function of divergence time the expected gain in imputation accuracy in the target using a reference sequence from the same population as the target rather than from the other population. Together with a growing body of empirical investigations of genotype imputation in diverse human populations, our modeling framework lays a foundation for extending imputation techniques to novel populations that have not yet been extensively examined.

    View details for DOI 10.1016/j.tpb.2012.09.006

    View details for Web of Science ID 000322688800007

    View details for PubMedID 23079542

  • Long Runs of Homozygosity Are Enriched for Deleterious Variation AMERICAN JOURNAL OF HUMAN GENETICS Szpiech, Z. A., Xu, J., Pemberton, T. J., Peng, W., Zoellner, S., Rosenberg, N. A., Li, J. Z. 2013; 93 (1): 90-102


    Exome sequencing offers the potential to study the population-genomic variables that underlie patterns of deleterious variation. Runs of homozygosity (ROH) are long stretches of consecutive homozygous genotypes probably reflecting segments shared identically by descent as the result of processes such as consanguinity, population size reduction, and natural selection. The relationship between ROH and patterns of predicted deleterious variation can provide insight into the way in which these processes contribute to the maintenance of deleterious variants. Here, we use exome sequencing to examine ROH in relation to the distribution of deleterious variation in 27 individuals of varying levels of apparent inbreeding from 6 human populations. A significantly greater fraction of all genome-wide predicted damaging homozygotes fall in ROH than would be expected from the corresponding fraction of nondamaging homozygotes in ROH (p < 0.001). This pattern is strongest for long ROH (p < 0.05). ROH, and especially long ROH, harbor disproportionately more deleterious homozygotes than would be expected on the basis of the total ROH coverage of the genome and the genomic distribution of nondamaging homozygotes. The results accord with a hypothesis that recent inbreeding, which generates long ROH, enables rare deleterious variants to exist in homozygous form. Thus, just as inbreeding can elevate the occurrence of rare recessive diseases that represent homozygotes for strongly deleterious mutations, inbreeding magnifies the occurrence of mildly deleterious variants as well.

    View details for DOI 10.1016/j.ajhg.2013.05.003

    View details for Web of Science ID 000321804500008

    View details for PubMedID 23746547

  • Population Structure in a Comprehensive Genomic Data Set on Human Microsatellite Variation G3-GENES GENOMES GENETICS Pemberton, T. J., DeGiorgio, M., Rosenberg, N. A. 2013; 3 (5): 891-907


    Over the past two decades, microsatellite genotypes have provided the data for landmark studies of human population-genetic variation. However, the various microsatellite data sets have been prepared with different procedures and sets of markers, so that it has been difficult to synthesize available data for a comprehensive analysis. Here, we combine eight human population-genetic data sets at the 645 microsatellite loci they share in common, accounting for procedural differences in the production of the different data sets, to assemble a single data set containing 5795 individuals from 267 worldwide populations. We perform a systematic analysis of genetic relatedness, detecting 240 intra-population and 92 inter-population pairs of previously unidentified close relatives and proposing standardized subsets of unrelated individuals for use in future studies. We then augment the human data with a data set of 84 chimpanzees at the 246 loci they share in common with the human samples. Multidimensional scaling and neighbor-joining analyses of these data sets offer new insights into the structure of human populations and enable a comparison of genetic variation patterns in chimpanzees with those in humans. Our combined data sets are the largest of their kind reported to date and provide a resource for use in human population-genetic studies.

    View details for DOI 10.1534/g3.113.005728

    View details for Web of Science ID 000319438700010

    View details for PubMedID 23550135

  • Geographic Sampling Scheme as a Determinant of the Major Axis of Genetic Variation in Principal Components Analysis MOLECULAR BIOLOGY AND EVOLUTION DeGiorgio, M., Rosenberg, N. A. 2013; 30 (2): 480-488


    Principal component (PC) maps, which plot the values of a given PC estimated on the basis of allele frequency variation at the geographic sampling locations of a set of populations, are often used to investigate the properties of past range expansions. Some studies have argued that in a range expansion, the axis of greatest variation (i.e., the first PC) is parallel to the axis of expansion. In contrast, others have identified a pattern in which the axis of greatest variation is perpendicular to the axis of expansion. Here, we seek to understand this difference in outcomes by investigating the effect of the geographic sampling scheme on the direction of the axis of greatest variation under a two-dimensional range expansion model. From datasets simulated using each of two different schemes for the geographic sampling of populations under the model, we create PC maps for the first PC. We find that depending on the geographic sampling scheme, the axis of greatest variation can be either parallel or perpendicular to the axis of expansion. We provide an explanation for this result in terms of intra- and interpopulation coalescence times.

    View details for DOI 10.1093/molbev/mss233

    View details for Web of Science ID 000314122000023

    View details for PubMedID 23051843

  • The Relationship Between F-ST and the Frequency of the Most Frequent Allele GENETICS Jakobsson, M., Edge, M. D., Rosenberg, N. A. 2013; 193 (2): 515-528


    F(ST) is frequently used as a summary of genetic differentiation among groups. It has been suggested that F(ST) depends on the allele frequencies at a locus, as it exhibits a variety of peculiar properties related to genetic diversity: higher values for biallelic single-nucleotide polymorphisms (SNPs) than for multiallelic microsatellites, low values among high-diversity populations viewed as substantially distinct, and low values for populations that differ primarily in their profiles of rare alleles. A full mathematical understanding of the dependence of F(ST) on allele frequencies, however, has been elusive. Here, we examine the relationship between F(ST) and the frequency of the most frequent allele, demonstrating that the range of values that F(ST) can take is restricted considerably by the allele-frequency distribution. For a two-population model, we derive strict bounds on F(ST) as a function of the frequency M of the allele with highest mean frequency between the pair of populations. Using these bounds, we show that for a value of M chosen uniformly between 0 and 1 at a multiallelic locus whose number of alleles is left unspecified, the mean maximum F(ST) is ∼0.3585. Further, F(ST) is restricted to values much less than 1 when M is low or high, and the contribution to the maximum F(ST) made by the most frequent allele is on average ∼0.4485. Using bounds on homozygosity that we have previously derived as functions of M, we describe strict bounds on F(ST) in terms of the homozygosity of the total population, finding that the mean maximum F(ST) given this homozygosity is 1 - ln 2 ≈ 0.3069. Our results provide a conceptual basis for understanding the dependence of F(ST) on allele frequencies and genetic diversity and for interpreting the roles of these quantities in computations of F(ST) from population-genetic data. Further, our analysis suggests that many unusual observations of F(ST), including the relatively low F(ST) values in high-diversity human populations from Africa and the relatively low estimates of F(ST) for microsatellites compared to SNPs, can be understood not as biological phenomena associated with different groups of populations or classes of markers but rather as consequences of the intrinsic mathematical dependence of F(ST) on the properties of allele-frequency distributions.
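    To make the quantities in this abstract concrete, here is a minimal sketch that computes F(ST) and M for one locus in two populations. It assumes a common homozygosity-based definition, F(ST) = (H_S - H_T) / (1 - H_T), with the two populations weighted equally; the function name and the example frequencies are hypothetical, and the bounds derived in the paper are not reproduced.

```python
def fst_and_m(p1, p2):
    """Return (F_ST, M) for one locus in two equally weighted populations.

    p1, p2: allele-frequency lists over the same alleles, each summing to 1.
    Assumed definition: F_ST = (H_S - H_T) / (1 - H_T), where H_S is the
    mean within-population homozygosity and H_T is the homozygosity of the
    pooled (mean) allele frequencies.
    """
    if len(p1) != len(p2):
        raise ValueError("frequency vectors must list the same alleles")
    mean = [(a + b) / 2 for a, b in zip(p1, p2)]
    h_s = 0.5 * (sum(a * a for a in p1) + sum(b * b for b in p2))
    h_t = sum(m * m for m in mean)
    if h_t >= 1.0:
        raise ValueError("locus is monomorphic; F_ST is undefined")
    return (h_s - h_t) / (1 - h_t), max(mean)


# Populations fixed for different alleles are maximally differentiated.
print(fst_and_m([1.0, 0.0], [0.0, 1.0]))
# Identical frequencies give no differentiation.
print(fst_and_m([0.7, 0.3], [0.7, 0.3]))
# With many alleles at low frequency, F_ST stays small even when the
# two populations share no alleles at all.
print(fst_and_m([0.25] * 4 + [0.0] * 4, [0.0] * 4 + [0.25] * 4))
```

    The last call illustrates the abstract's point informally: the two populations are completely distinct, yet the value is far below 1 because M is low.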

    View details for DOI 10.1534/genetics.112.144758

    View details for Web of Science ID 000314821300015

    View details for PubMedID 23172852

  • Mathematical properties of the deep coalescence cost IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Than, C. V., Rosenberg, N. A. 2013; 10 (1): 61-72


    In the minimizing-deep-coalescences (MDC) approach for species tree inference, a tree that has the minimal deep coalescence cost for reconciling a collection of gene trees is taken as an estimate of the species tree topology. The MDC method possesses the desirable Pareto property, and in practice it is quite accurate and computationally efficient. Here, in order to better understand the MDC method, we investigate some properties of the deep coalescence cost. We prove that the unit neighborhood of either a rooted species tree or a rooted gene tree under the deep coalescence cost is exactly the same as the tree's unit neighborhood under the rooted nearest-neighbor interchange (NNI) distance. Next, for a fixed species tree, we obtain the maximum deep coalescence cost across all gene trees as well as the number of gene trees that achieve the maximum cost. We also study corresponding problems for a fixed gene tree.

    View details for DOI 10.1109/TCBB.2012.133

    View details for PubMedID 23702544

  • A Characterization of the Set of Species Trees that Produce Anomalous Ranked Gene Trees IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Degnan, J. H., Rosenberg, N. A., Stadler, T. 2012; 9 (6): 1558-1568


    Ranked gene trees, which consider both the gene tree topology and the sequence in which gene lineages separate, can potentially provide a new source of information for use in modeling genealogies and performing inference of species trees. Recently, we have calculated the probability distribution of ranked gene trees under the standard multispecies coalescent model for the evolution of gene lineages along the branches of a fixed species tree, demonstrating the existence of anomalous ranked gene trees (ARGTs), in which a ranked gene tree that does not match the ranked species tree can have greater probability under the model than the matching ranked gene tree. Here, we fully characterize the set of unranked species tree topologies that give rise to ARGTs, showing that this set contains all species tree topologies with five or more taxa, with the exceptions of caterpillars and pseudocaterpillars. The results have implications for the use of ranked gene trees in phylogenetic inference.

    View details for DOI 10.1109/TCBB.2012.110

    View details for Web of Science ID 000312558400002

    View details for PubMedID 22868677

  • Genomic Patterns of Homozygosity in Worldwide Human Populations AMERICAN JOURNAL OF HUMAN GENETICS Pemberton, T. J., Absher, D., Feldman, M. W., Myers, R. M., Rosenberg, N. A., Li, J. Z. 2012; 91 (2): 275-292


    Genome-wide patterns of homozygosity runs and their variation across individuals provide a valuable and often untapped resource for studying human genetic diversity and evolutionary history. Using genotype data at 577,489 autosomal SNPs, we employed a likelihood-based approach to identify runs of homozygosity (ROH) in 1,839 individuals representing 64 worldwide populations, classifying them by length into three classes (short, intermediate, and long) with a model-based clustering algorithm. For each class, the number and total length of ROH per individual show considerable variation across individuals and populations. The total lengths of short and intermediate ROH per individual increase with the distance of a population from East Africa, in agreement with similar patterns previously observed for locus-wise homozygosity and linkage disequilibrium. By contrast, total lengths of long ROH show large interindividual variations that probably reflect recent inbreeding patterns, with higher values occurring more often in populations with known high frequencies of consanguineous unions. Across the genome, distributions of ROH are not uniform, and they have distinctive continental patterns. ROH frequencies across the genome are correlated with local genomic variables such as recombination rate, as well as with signals of recent positive selection. In addition, long ROH are more frequent in genomic regions harboring genes associated with autosomal-dominant diseases than in regions not implicated in Mendelian diseases. These results provide insight into the way in which homozygosity patterns are produced, and they generate baseline homozygosity patterns that can be used to aid homozygosity mapping of genes associated with recessive diseases.

    View details for DOI 10.1016/j.ajhg.2012.06.014

    View details for Web of Science ID 000307608700006

    View details for PubMedID 22883143

  • Inferring Species Trees Directly from Biallelic Genetic Markers: Bypassing Gene Trees in a Full Coalescent Analysis MOLECULAR BIOLOGY AND EVOLUTION Bryant, D., Bouckaert, R., Felsenstein, J., Rosenberg, N. A., RoyChoudhury, A. 2012; 29 (8): 1917-1932


    The multispecies coalescent provides an elegant theoretical framework for estimating species trees and species demographics from genetic markers. However, practical applications of the multispecies coalescent model are limited by the need to integrate or sample over all gene trees possible for each genetic marker. Here we describe a polynomial-time algorithm that computes the likelihood of a species tree directly from the markers under a finite-sites model of mutation, effectively integrating over all possible gene trees. The method applies to independent (unlinked) biallelic markers such as well-spaced single nucleotide polymorphisms, and we have implemented it in SNAPP, a Markov chain Monte Carlo sampler for inferring species trees, divergence dates, and population sizes. We report results from simulation experiments and from an analysis of 1997 amplified fragment length polymorphism loci in 69 individuals sampled from six species of Ourisia (New Zealand native foxglove).

    View details for DOI 10.1093/molbev/mss086

    View details for Web of Science ID 000307171300004

    View details for PubMedID 22422763

  • A Quantitative Comparison of the Similarity between Genes and Geography in Worldwide Human Populations PLOS GENETICS Wang, C., Zoellner, S., Rosenberg, N. A. 2012; 8 (8)


    Multivariate statistical techniques such as principal components analysis (PCA) and multidimensional scaling (MDS) have been widely used to summarize the structure of human genetic variation, often in easily visualized two-dimensional maps. Many recent studies have reported similarity between geographic maps of population locations and MDS or PCA maps of genetic variation inferred from single-nucleotide polymorphisms (SNPs). However, this similarity has been evident primarily in a qualitative sense; and, because different multivariate techniques and marker sets have been used in different studies, it has not been possible to formally compare genetic variation datasets in terms of their levels of similarity with geography. In this study, using genome-wide SNP data from 128 populations worldwide, we perform a systematic analysis to quantitatively evaluate the similarity of genes and geography in different geographic regions. For each of a series of regions, we apply a Procrustes analysis approach to find an optimal transformation that maximizes the similarity between PCA maps of genetic variation and geographic maps of population locations. We consider examples in Europe, Sub-Saharan Africa, Asia, East Asia, and Central/South Asia, as well as in a worldwide sample, finding that significant similarity between genes and geography exists in general at different geographic levels. The similarity is highest in our examples for Asia and, once highly distinctive populations have been removed, Sub-Saharan Africa. Our results provide a quantitative assessment of the geographic structure of human genetic variation worldwide, supporting the view that geography plays a strong role in giving rise to human population structure.

    View details for DOI 10.1371/journal.pgen.1002886

    View details for Web of Science ID 000308529300044

    View details for PubMedID 22927824

  • Improvements to a Class of Distance Matrix Methods for Inferring Species Trees from Gene Trees JOURNAL OF COMPUTATIONAL BIOLOGY Helmkamp, L. J., Jewett, E. M., Rosenberg, N. A. 2012; 19 (6): 632-649


    Among the methods currently available for inferring species trees from gene trees, the GLASS method of Mossel and Roch (2010), the Shallowest Divergence (SD) method of Maddison and Knowles (2006), the STEAC method of Liu et al. (2009), and a related method that we call Minimum Average Coalescence (MAC) are computationally efficient and provide branch length estimates. Further, GLASS and STEAC have been shown to be consistent estimators of tree topology under a multispecies coalescent model. However, divergence time estimates obtained with these methods are all systematically biased under the model because the pairwise interspecific gene divergence times on which they rely must be more ancient than the species divergence time. Jewett and Rosenberg (2012) derived an expression for the bias of GLASS and used it to propose an improved method that they termed iGLASS. Here, we derive the biases of SD, STEAC, and MAC, and we propose improved analogues of these methods that we call iSD, iSTEAC, and iMAC. We conduct simulations to compare the performance of these methods with their original counterparts and with GLASS and iGLASS, finding that each of them decreases the bias and mean squared error of pairwise divergence time estimates. The new methods can therefore contribute to improvements in the estimation of species trees from information on gene trees.

    View details for DOI 10.1089/cmb.2012.0042

    View details for Web of Science ID 000305335100006

    View details for PubMedID 22697239

  • iGLASS: An Improvement to the GLASS Method for Estimating Species Trees from Gene Trees JOURNAL OF COMPUTATIONAL BIOLOGY Jewett, E. M., Rosenberg, N. A. 2012; 19 (3): 293-315


    Several methods have been designed to infer species trees from gene trees while taking into account gene tree/species tree discordance. Although some of these methods provide consistent species tree topology estimates under a standard model, most either do not estimate branch lengths or are computationally slow. An exception, the GLASS method of Mossel and Roch, is consistent for the species tree topology, estimates branch lengths, and is computationally fast. However, GLASS systematically overestimates divergence times, leading to biased estimates of species tree branch lengths. By assuming a multispecies coalescent model in which multiple lineages are sampled from each of two taxa at L independent loci, we derive the distribution of the waiting time until the first interspecific coalescence occurs between the two taxa, considering all loci and measuring from the divergence time. We then use the mean of this distribution to derive a correction to the GLASS estimator of pairwise divergence times. We show that our improved estimator, which we call iGLASS, consistently estimates the divergence time between a pair of taxa as the number of loci approaches infinity, and that it is an unbiased estimator of divergence times when one lineage is sampled per taxon. We also show that many commonly used clustering methods can be combined with the iGLASS estimator of pairwise divergence times to produce a consistent estimator of the species tree topology. Through simulations, we show that iGLASS can greatly reduce the bias and mean squared error in obtaining estimates of divergence times in a species tree.
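    The overestimation described here can be seen in a small simulation of the simplest case, one lineage sampled per taxon. The sketch assumes the standard coalescent in coalescent time units, so that at each locus the two lineages coalesce an Exp(1) waiting time before the divergence time tau. The GLASS-style estimate is the minimum coalescence time across L loci, whose excess over tau is then Exp(L) with mean 1/L. Subtracting 1/L is used below as a stand-in for the correction in this special case only; the general multi-lineage correction from the paper is not reproduced, and the function name and parameter values are hypothetical.

```python
import random


def mean_glass_estimate(tau: float, n_loci: int, n_reps: int, seed: int = 0) -> float:
    """Average minimum pairwise coalescence time across loci.

    Assumes one lineage per taxon and coalescent time units, so the
    coalescence time at each locus is tau + Exp(1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        total += tau + min(rng.expovariate(1.0) for _ in range(n_loci))
    return total / n_reps


tau, n_loci = 1.0, 5
glass_mean = mean_glass_estimate(tau, n_loci, n_reps=20000)
corrected_mean = glass_mean - 1.0 / n_loci  # remove the expected excess 1/L

print("true divergence time:      ", tau)
print("mean minimum-time estimate:", round(glass_mean, 3))
print("mean after subtracting 1/L:", round(corrected_mean, 3))
```

    The uncorrected average sits above tau by roughly 1/L, and the gap shrinks as more loci are added, which is consistent with the estimator being consistent but biased for finite L.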

    View details for DOI 10.1089/cmb.2011.0231

    View details for Web of Science ID 000301355100005

    View details for PubMedID 22216756

  • Refining the relationship between homozygosity and the frequency of the most frequent allele JOURNAL OF MATHEMATICAL BIOLOGY Reddy, S. B., Rosenberg, N. A. 2012; 64 (1-2): 87-108


    Recent work has established that for an arbitrary genetic locus with its number of alleles unspecified, the homozygosity of the locus confines the frequency of the most frequent allele within a narrow range, and vice versa. Here we extend beyond this limiting case by investigating the relationship between homozygosity and the frequency of the most frequent allele when the number of alleles at the locus is treated as known. Given the homozygosity of a locus with at most K alleles, we find that by taking into account the value of K, the width of the allowed range for the frequency of the most frequent allele decreases from 2/3 - π²/18 ≈ 0.1184 to 1/3 - 1/(3K) - {K/[3(K - 1)]} Σ_{k=2}^{K} 1/k². We further show that properties of the relationship between homozygosity and the frequency of the most frequent allele in the unspecified-K case can be obtained from the specified-K case by taking limits as K → ∞. The results contribute to a greater understanding of the mathematical properties of fundamental statistics employed in population-genetic analysis.
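    The two width expressions quoted in this abstract can be checked numerically. The sketch below reads the specified-K expression as 1/3 - 1/(3K) - {K/[3(K - 1)]} times the sum of 1/k² for k = 2, ..., K, and the unspecified-K value as 2/3 - π²/18; that reading of the symbols is an assumption, supported by the fact that the first expression converges to the second as K grows. The function name is hypothetical.

```python
from math import pi


def width_specified_k(k_alleles: int) -> float:
    """Width expression for a locus with at most K alleles.

    Assumed reading: 1/3 - 1/(3K) - K/(3(K-1)) * sum_{k=2}^{K} 1/k^2.
    """
    if k_alleles < 2:
        raise ValueError("K must be at least 2")
    partial_sum = sum(1.0 / (k * k) for k in range(2, k_alleles + 1))
    return (
        1.0 / 3.0
        - 1.0 / (3.0 * k_alleles)
        - k_alleles / (3.0 * (k_alleles - 1)) * partial_sum
    )


# Value quoted for the case with the number of alleles unspecified.
WIDTH_UNSPECIFIED_K = 2.0 / 3.0 - pi ** 2 / 18.0

for k_alleles in (2, 3, 5, 10, 100, 10_000):
    print(k_alleles, width_specified_k(k_alleles))
print("limit:", WIDTH_UNSPECIFIED_K)
```

    For K = 2 the expression is zero, which matches intuition: with two alleles, homozygosity determines the frequency of the more frequent allele exactly.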

    View details for DOI 10.1007/s00285-011-0406-8

    View details for Web of Science ID 000298652400004

    View details for PubMedID 21305294

  • The probability distribution of ranked gene trees on a species tree MATHEMATICAL BIOSCIENCES Degnan, J. H., Rosenberg, N. A., Stadler, T. 2012; 235 (1): 45-55


    The properties of random gene tree topologies have recently been studied under a coalescent model that treats a species tree as a fixed parameter. Here we develop the analogous theory for random ranked gene tree topologies, in which both the topology and the sequence of coalescences for a random gene tree are considered. We derive the probability distribution of ranked gene tree topologies conditional on a fixed species tree. We then show that similar to the unranked case, ranked gene trees that do not match either the ranking or the topology of the species tree can have greater probability than the matching ranked gene tree.

    View details for DOI 10.1016/j.mbs.2011.10.006

    View details for Web of Science ID 000299761300005

    View details for PubMedID 22075548

  • Haploscope: A Tool for the Graphical Display of Haplotype Structure in Populations GENETIC EPIDEMIOLOGY San Lucas, F. A., Rosenberg, N. A., Scheet, P. 2012; 36 (1): 17-21


    Patterns of linkage disequilibrium are often depicted pictorially by using tools that rely on visualizations of raw data or pairwise correlations among individual markers. Such approaches can fail to highlight some of the more interesting and complex features of haplotype structure. To enable natural visual comparisons of haplotype structure across subgroups of a population (e.g. isolated subpopulations or cases and controls), we propose an alternative visualization that provides a novel graphical representation of haplotype frequencies. We introduce Haploscope, a tool for visualizing the haplotype cluster frequencies that are produced by statistical models for population haplotype variation. We demonstrate the utility of our technique by examining haplotypes around the LCT gene, an example of recent positive selection, in samples from the Human Genome Diversity Panel. Haploscope, which has flexible options for annotation and inspection of haplotypes, is available for download at

    View details for DOI 10.1002/gepi.20640

    View details for Web of Science ID 000302244400003

    View details for PubMedID 22147662

  • A General Mechanistic Model for Admixture Histories of Hybrid Populations GENETICS Verdu, P., Rosenberg, N. A. 2011; 189 (4): 1413-?


    Admixed populations have been used for inferring migrations, detecting natural selection, and finding disease genes. These applications often use a simple statistical model of admixture rather than a modeling perspective that incorporates a more realistic history of the admixture process. Here, we develop a general model of admixture that mechanistically accounts for complex historical admixture processes. We consider two source populations contributing to the ancestry of a hybrid population, potentially with variable contributions across generations. For a random individual in the hybrid population at a given point in time, we study the fraction of genetic admixture originating from a specific one of the source populations by computing its moments as functions of time and of introgression parameters. We show that very different admixture processes can produce identical mean admixture proportions, but that such processes produce different values for the variance of the admixture proportion. When introgression parameters from each source population are constant over time, the long-term limit of the expectation of the admixture proportion depends only on the ratio of the introgression parameters. The variance of admixture decreases quickly over time after the source populations stop contributing to the hybrid population, but remains substantial when the contributions are ongoing. Our approach will facilitate the understanding of admixture mechanisms, illustrating how the moments of the distribution of admixture proportions can be informative about the historical admixture processes contributing to the genetic diversity of hybrid populations.
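    The statement about the long-term mean can be illustrated with a short recursion. This is a simplified reading of the abstract rather than the paper's full model: each generation, a fraction s1 of parents of the hybrid population is assumed to come from source 1, s2 from source 2, and h = 1 - s1 - s2 from the hybrid population itself, so the expected admixture fraction from source 1 follows E[g+1] = s1 + h·E[g]. The function name, the founding parameter, and the numeric values are hypothetical, and the variance results are not reproduced.

```python
def mean_admixture(founding_s1: float, s1: float, s2: float, generations: int):
    """Trajectory of the expected ancestry fraction from source 1.

    Assumed recursion: E[g+1] = s1 + (1 - s1 - s2) * E[g], starting from a
    founding generation with a fraction founding_s1 from source 1.
    """
    if min(s1, s2) < 0 or s1 + s2 > 1 or not 0 <= founding_s1 <= 1:
        raise ValueError("contributions must be valid proportions")
    h = 1.0 - s1 - s2  # contribution of the hybrid population to itself
    trajectory = [founding_s1]
    for _ in range(generations):
        trajectory.append(s1 + h * trajectory[-1])
    return trajectory


# Two histories with different magnitudes but the same ratio s1 : s2 = 1 : 2.
slow = mean_admixture(founding_s1=0.9, s1=0.05, s2=0.10, generations=400)
fast = mean_admixture(founding_s1=0.5, s1=0.10, s2=0.20, generations=400)

print("slow introgression, final mean:", slow[-1])
print("fast introgression, final mean:", fast[-1])
print("s1 / (s1 + s2):                ", 0.05 / 0.15)
```

    Both trajectories approach s1 / (s1 + s2), the fixed point of the recursion, regardless of the founding proportion; under this reading, only the speed of approach depends on the magnitudes of s1 and s2.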

    View details for DOI 10.1534/genetics.111.132787

    View details for Web of Science ID 000298412100023

    View details for PubMedID 21968194

  • Haplotype variation and genotype imputation in African populations GENETIC EPIDEMIOLOGY Huang, L., Jakobsson, M., Pemberton, T. J., Ibrahim, M., Nyambo, T., Omar, S., Pritchard, J. K., Tishkoff, S. A., Rosenberg, N. A. 2011; 35 (8): 766-780


    Sub-Saharan Africa has been identified as the part of the world with the greatest human genetic diversity. This high level of diversity causes difficulties for genome-wide association (GWA) studies in African populations, for example by reducing the accuracy of genotype imputation in African populations compared to non-African populations. Here, we investigate haplotype variation and imputation in Africa, using 253 unrelated individuals from 15 Sub-Saharan African populations. We identify the populations that provide the greatest potential for serving as reference panels for imputing genotypes in the remaining groups. Considering reference panels comprising samples of recent African descent in Phase 3 of the HapMap Project, we identify mixtures of reference groups that produce the maximal imputation accuracy in each of the sampled populations. We find that optimal HapMap mixtures and maximal imputation accuracies identified in detailed tests of imputation procedures can instead be predicted by using simple summary statistics that measure relationships between the pattern of genetic variation in a target population and the patterns in potential reference panels. Our results provide an empirical basis for facilitating the selection of reference panels in GWA studies of diverse human populations, especially those of African ancestry.

    View details for DOI 10.1002/gepi.20626

    View details for Web of Science ID 000297468600003

    View details for PubMedID 22125220

  • Coalescence-Time Distributions in a Serial Founder Model of Human Evolutionary History GENETICS DeGiorgio, M., Degnan, J. H., Rosenberg, N. A. 2011; 189 (2): 579-593


    Simulation studies have demonstrated that a variety of patterns in worldwide genetic variation are compatible with the trends predicted by a serial founder model, in which populations expand outward from an initial source via a process in which new populations contain only subsets of the genetic diversity present in their parental populations. Here, we provide analytical results for key quantities under the serial founder model, deriving distributions of coalescence times for pairs of lineages sampled either from the same population or from different populations. We use these distributions to obtain expectations for coalescence times and for homozygosity and heterozygosity values. A predicted approximate linear decline in expected heterozygosity with increasing distance from the source population reproduces a pattern that has been observed both in human genetic data and in simulations. Our formulas predict that populations close to the source location have lower between-population gene identity than populations far from the source, also mirroring results obtained from data and simulations. We show that different models that produce similar declining patterns in heterozygosity generate quite distinct patterns in coalescence-time distributions and gene identity measures, thereby providing a basis for distinguishing these models. We interpret the theoretical results in relation to their implications for human population genetics.

    View details for DOI 10.1534/genetics.111.129296

    View details for Web of Science ID 000296158500014

    View details for PubMedID 21775469

  • On the size distribution of private microsatellite alleles THEORETICAL POPULATION BIOLOGY Szpiech, Z. A., Rosenberg, N. A. 2011; 80 (2): 100-113


    Private microsatellite alleles tend to be found in the tails rather than in the interior of the allele size distribution. To explain this phenomenon, we have investigated the size distribution of private alleles in a coalescent model of two populations, assuming the symmetric stepwise mutation model as the mode of microsatellite mutation. For the case in which four alleles are sampled, two from each population, we condition on the configuration in which three distinct allele sizes are present, one of which is common to both populations, one of which is private to one population, and the third of which is private to the other population. Conditional on this configuration, we calculate the probability that the two private alleles occupy the two tails of the size distribution. This probability, which increases as a function of mutation rate and divergence time between the two populations, is seen to be greater than the value that would be predicted if there was no relationship between privacy and location in the allele size distribution. In accordance with the prediction of the model, we find that in pairs of human populations, the frequency with which private microsatellite alleles occur in the tails of the allele size distribution increases as a function of genetic differentiation between populations.

    View details for DOI 10.1016/j.tpb.2011.03.006

    View details for Web of Science ID 000293765500003

    View details for PubMedID 21514313

  • Consistency Properties of Species Tree Inference by Minimizing Deep Coalescences JOURNAL OF COMPUTATIONAL BIOLOGY Than, C. V., Rosenberg, N. A. 2011; 18 (1): 1-15


    Methods for inferring species trees from sets of gene trees need to account for the possibility of discordance among the gene trees. Assuming that discordance is caused by incomplete lineage sorting, species tree estimates can be obtained by finding those species trees that minimize the number of "deep" coalescence events required for a given collection of gene trees. Efficient algorithms now exist for applying the minimizing-deep-coalescence (MDC) criterion, and simulation experiments have demonstrated its promising performance. However, it has also been noted from simulation results that the MDC criterion is not always guaranteed to infer the correct species tree estimate. In this article, we investigate the consistency of the MDC criterion. Using the multispecies coalescent model, we show that there are indeed anomaly zones for the MDC criterion for asymmetric four-taxon species tree topologies, and for all species tree topologies with five or more taxa.

    View details for DOI 10.1089/cmb.2010.0102

    View details for Web of Science ID 000285965600001

    View details for PubMedID 21210728

  • Coalescent histories for discordant gene trees and species trees THEORETICAL POPULATION BIOLOGY Rosenberg, N. A., Degnan, J. H. 2010; 77 (3): 145-151


    Given a gene tree and a species tree, a coalescent history is a list of the branches of the species tree on which coalescences in the gene tree take place. Each pair consisting of a gene tree topology and a species tree topology has some number of possible coalescent histories. Here we show that, for each n ≥ 7, there exist a species tree topology S and a gene tree topology G ≠ S, both with n leaves, for which the number of coalescent histories exceeds the corresponding number of coalescent histories when the species tree topology is S and the gene tree topology is also S. This result has the interpretation that the gene tree topology G discordant with the species tree topology S can be produced by the evolutionary process in more ways than can the gene tree topology that matches the species tree topology, providing further insight into the surprising combinatorial properties of gene trees that arise from their joint consideration with species trees.

    View details for DOI 10.1016/j.tpb.2009.12.004

    View details for Web of Science ID 000276751300001

    View details for PubMedID 20064540

  • Lack of Population Diversity in Commonly Used Human Embryonic Stem-Cell Lines NEW ENGLAND JOURNAL OF MEDICINE Mosher, J. T., Pemberton, T. J., Harter, K., Wang, C., Buzbas, E. O., Dvorak, P., Simon, C., Morrison, S. J., Rosenberg, N. A. 2010; 362 (2): 183-185

    View details for DOI 10.1056/NEJMc0910371

    View details for Web of Science ID 000273558500033

    View details for PubMedID 20018958

  • Genomic microsatellites identify shared Jewish ancestry intermediate between Middle Eastern and European populations BMC GENETICS Kopelman, N. M., Stone, L., Wang, C., Gefel, D., Feldman, M. W., Hillel, J., Rosenberg, N. A. 2009; 10


    Genetic studies have often produced conflicting results on the question of whether distant Jewish populations in different geographic locations share greater genetic similarity to each other or instead, to nearby non-Jewish populations. We perform a genome-wide population-genetic study of Jewish populations, analyzing 678 autosomal microsatellite loci in 78 individuals from four Jewish groups together with similar data on 321 individuals from 12 non-Jewish Middle Eastern and European populations. We find that the Jewish populations show a high level of genetic similarity to each other, clustering together in several types of analysis of population structure. Further, Bayesian clustering, neighbor-joining trees, and multidimensional scaling place the Jewish populations as intermediate between the non-Jewish Middle Eastern and European populations. These results support the view that the Jewish populations largely share a common Middle Eastern ancestry and that over their history they have undergone varying degrees of admixture with non-Jewish populations of European descent.

    View details for DOI 10.1186/1471-2156-10-80

    View details for Web of Science ID 000273553900001

    View details for PubMedID 19995433

  • Out of Africa: modern human origins special feature: explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA DeGiorgio, M., Jakobsson, M., Rosenberg, N. A. 2009; 106 (38): 16057-16062


    Studies of worldwide human variation have discovered three trends in summary statistics as a function of increasing geographic distance from East Africa: a decrease in heterozygosity, an increase in linkage disequilibrium (LD), and a decrease in the slope of the ancestral allele frequency spectrum. Forward simulations of unlinked loci have shown that the decline in heterozygosity can be described by a serial founder model, in which populations migrate outward from Africa through a process where each of a series of populations is formed from a subset of the previous population in the outward expansion. Here, we extend this approach by developing a retrospective coalescent-based serial founder model that incorporates linked loci. Our model both recovers the observed decline in heterozygosity with increasing distance from Africa and produces the patterns observed in LD and the ancestral allele frequency spectrum. Surprisingly, although migration between neighboring populations and limited admixture between modern and archaic humans can be accommodated in the model while continuing to explain the three trends, a competing model in which a wave of outward modern human migration expands into a series of preexisting archaic populations produces nearly opposite patterns to those observed in the data. We conclude by developing a simpler model to illustrate that the feature that permits the serial founder model but not the archaic persistence model to explain the three trends observed with increasing distance from Africa is its incorporation of a cumulative effect of genetic drift as humans colonized the world.

    View details for DOI 10.1073/pnas.0903341106

    View details for PubMedID 19706453

  • Replication of Genetic Associations as Pseudoreplication due to Shared Genealogy GENETIC EPIDEMIOLOGY Rosenberg, N. A., VanLiere, J. M. 2009; 33 (6): 479-487


    The genotypes of individuals in replicate genetic association studies have some level of correlation due to shared descent in the complete pedigree of all living humans. As a result of this genealogical sharing, replicate studies that search for genotype-phenotype associations using linkage disequilibrium between marker loci and disease-susceptibility loci can be considered as "pseudoreplicates" rather than true replicates. We examine the size of the pseudoreplication effect in association studies simulated from evolutionary models of the history of a population, evaluating the excess probability that both of a pair of studies detect a disease association compared to the probability expected under the assumption that the two studies are independent. Each of nine combinations of a demographic model and a penetrance model leads to a detectable pseudoreplication effect, suggesting that the degree of support that can be attributed to a replicated genetic association result is less than that which can be attributed to a replicated result in a context of true independence.

    View details for DOI 10.1002/gepi.20400

    View details for Web of Science ID 000269432400002

    View details for PubMedID 19191270

  • Population differentiation and migration: Coalescence times in a two-sex island model for autosomal and X-linked loci THEORETICAL POPULATION BIOLOGY Ramachandran, S., Rosenberg, N. A., Feldman, M. W., Wakeley, J. 2008; 74 (4): 291-301


    Evolutionists have debated whether population-genetic parameters, such as effective population size and migration rate, differ between males and females. In humans, most analyses of this problem have focused on the Y chromosome and the mitochondrial genome, while the X chromosome has largely been omitted from the discussion. Past studies have compared F_ST values for the Y chromosome and mitochondrion under a model with migration rates that differ between the sexes but with equal male and female population sizes. In this study we investigate rates of coalescence for X-linked and autosomal lineages in an island model with different population sizes and migration rates for males and females, obtaining the mean time to coalescence for pairs of lineages from the same deme and for pairs of lineages from different demes. We apply our results to microsatellite data from the Human Genome Diversity Panel, and we examine the male and female migration rates implied by observed F_ST values.

    View details for DOI 10.1016/j.tpb.2008.08.003

    View details for Web of Science ID 000261533200002

    View details for PubMedID 18817799

  • ADZE: a rarefaction approach for counting alleles private to combinations of populations BIOINFORMATICS Szpiech, Z. A., Jakobsson, M., Rosenberg, N. A. 2008; 24 (21): 2498-2504


    Analysis of the distribution of alleles across populations is a useful tool for examining population diversity and relationships. However, sample sizes often differ across populations, sometimes making it difficult to assess allelic distributions across groups. We introduce a generalized rarefaction approach for counting alleles private to combinations of populations. Our method evaluates the number of alleles found in each of a set of populations but absent in all remaining populations, considering equal-sized subsamples from each population. Applying this method to a worldwide human microsatellite dataset, we observe a high number of alleles private to the combination of African and Oceanian populations. This result supports the possibility of a migration out of Africa into Oceania separate from the migrations responsible for the majority of the ancestry of the modern populations of Asia, and it highlights the utility of our approach to sample size correction in evaluating hypotheses about population history. We have implemented our method in the computer program ADZE, which is available for download at

    View details for DOI 10.1093/bioinformatics/btn478

    View details for Web of Science ID 000260381200012

    View details for PubMedID 18779233
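
The rarefaction idea summarized in this abstract can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the ADZE program itself: it assumes a single locus, allele counts supplied as plain dictionaries, and subsamples of `g` gene copies drawn without replacement from each population. The function names `absence_prob` and `private_to_combination` are hypothetical and chosen only for this example.

```python
from math import comb


def absence_prob(n_total, n_allele, g):
    """Probability that a subsample of g gene copies, drawn without
    replacement from n_total copies, contains no copy of an allele
    that is present n_allele times."""
    return comb(n_total - n_allele, g) / comb(n_total, g)


def private_to_combination(counts, combo, g):
    """Expected number of alleles found in every population in `combo`
    and in no other population, when each population is subsampled
    to g gene copies.

    counts: dict mapping population name -> dict of allele -> count
    combo:  set of population names (hypothetical input format)
    """
    totals = {pop: sum(c.values()) for pop, c in counts.items()}
    if any(g > n for n in totals.values()):
        raise ValueError("g exceeds the sample size of some population")
    alleles = set().union(*(c.keys() for c in counts.values()))
    expected = 0.0
    for allele in alleles:
        term = 1.0
        for pop, c in counts.items():
            q = absence_prob(totals[pop], c.get(allele, 0), g)
            # Present in populations of the combination, absent elsewhere.
            term *= (1.0 - q) if pop in combo else q
        expected += term
    return expected


counts = {"A": {"x": 2, "y": 2}, "B": {"x": 4}}
only_a = private_to_combination(counts, {"A"}, 2)
shared = private_to_combination(counts, {"A", "B"}, 2)
```

Because every population is reduced to the same subsample size `g`, the resulting values can be compared across populations whose original sample sizes differ. Averaging over loci is left out of the sketch.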

  • Genetic variation and population structure in Native Americans PLOS GENETICS Wang, S., Lewis, C. M., Jakobsson, M., Ramachandran, S., Ray, N., Bedoya, G., Rojas, W., Parra, M. V., Molina, J. A., Gallo, C., Mazzotti, G., Poletti, G., Hill, K., Hurtado, A. M., Labuda, D., Klitz, W., Barrantes, R., Bortolini, M. C., Salzano, F. M., Petzl-Erler, M. L., Tsuneto, L. T., Llop, E., Rothhammer, F., Excoffier, L., Feldman, M. W., Rosenberg, N. A., Ruiz-Linares, A. 2007; 3 (11): 2049-2067


    We examined genetic diversity and population structure in the American landmass using 678 autosomal microsatellite markers genotyped in 422 individuals representing 24 Native American populations sampled from North, Central, and South America. These data were analyzed jointly with similar data available in 54 other indigenous populations worldwide, including an additional five Native American groups. The Native American populations have lower genetic diversity and greater differentiation than populations from other continental regions. We observe gradients both of decreasing genetic diversity as a function of geographic distance from the Bering Strait and of decreasing genetic similarity to Siberians, signals of the southward dispersal of human populations from the northwestern tip of the Americas. We also observe evidence of: (1) a higher level of diversity and lower level of population structure in western South America compared to eastern South America, (2) a relative lack of differentiation between Mesoamerican and Andean populations, (3) a scenario in which coastal routes were easier for migrating peoples to traverse in comparison with inland routes, and (4) a partial agreement on a local scale between genetic similarity and the linguistic classification of populations. These findings offer new insights into the process of population dispersal and differentiation during the peopling of the Americas.

    View details for DOI 10.1371/journal.pgen.0030185

    View details for Web of Science ID 000251310200002

    View details for PubMedID 18039031

  • Genetic diversity and population structure inferred from the partially duplicated genome of domesticated carp, Cyprinus carpio L. GENETICS SELECTION EVOLUTION David, L., Rosenberg, N. A., Lavi, U., Feldman, M. W., Hillel, J. 2007; 39 (3): 319-340


    Genetic relationships among eight populations of domesticated carp (Cyprinus carpio L.), a species with a partially duplicated genome, were studied using 12 microsatellites and 505 AFLP bands. The populations included three aquacultured carp strains and five ornamental carp (koi) variants. Grass carp (Ctenopharyngodon idella) was used as an outgroup. AFLP-based gene diversity varied from 5% (grass carp) to 32% (koi) and reflected the reasonably well understood histories and breeding practices of the populations. A large fraction of the molecular variance was due to differences between aquacultured and ornamental carps. Further analyses based on microsatellite data, including cluster analysis and neighbor-joining trees, supported the genetic distinctiveness of aquacultured and ornamental carps, despite the recent divergence of the two groups. In contrast to what was observed for AFLP-based diversity, the frequency of heterozygotes based on microsatellites was comparable among all populations. This discrepancy can potentially be explained by duplication of some loci in Cyprinus carpio L., and a model that shows how duplication can increase heterozygosity estimates for microsatellites but not for AFLP loci is discussed. Our analyses in carp can help in understanding the consequences of genotyping duplicated loci and in interpreting discrepancies between dominant and co-dominant markers in species with recent genome duplication.

    View details for DOI 10.1051/gse:2007006

    View details for Web of Science ID 000245686900006

    View details for PubMedID 17433244

  • Clines, clusters, and the effect of study design on the inference of human population structure PLOS GENETICS Rosenberg, N. A., Mahajan, S., Ramachandran, S., Zhao, C. F., Pritchard, J. K., Feldman, M. W. 2005; 1 (6): 660-671


    Previously, we observed that without using prior information about individual sampling locations, a clustering algorithm applied to multilocus genotypes from worldwide human populations produced genetic clusters largely coincident with major geographic regions. It has been argued, however, that the degree of clustering is diminished by use of samples with greater uniformity in geographic distribution, and that the clusters we identified were a consequence of uneven sampling along genetic clines. Expanding our earlier dataset from 377 to 993 markers, we systematically examine the influence of several study design variables (sample size, number of loci, number of clusters, assumptions about correlations in allele frequencies across populations, and the geographic dispersion of the sample) on the "clusteredness" of individuals. With all other variables held constant, geographic dispersion is seen to have comparatively little effect on the degree of clustering. Examination of the relationship between genetic and geographic distance supports a view in which the clusters arise not as an artifact of the sampling scheme, but from small discontinuous jumps in genetic distance for most population pairs on opposite sides of geographic barriers, in comparison with genetic distance for pairs on the same side. Thus, analysis of the 993-locus dataset corroborates our earlier results: if enough markers are used with a sufficiently large worldwide sample, individuals can be partitioned into genetic clusters that match major geographic subdivisions of the globe, with some individuals from intermediate geographic locations having mixed membership in the clusters that correspond to neighboring regions.

    View details for DOI 10.1371/journal.pgen.0010070

    View details for Web of Science ID 000234900800005

    View details for PubMedID 16355252

  • Features of evolution and expansion of modern humans, inferred from genomewide microsatellite markers AMERICAN JOURNAL OF HUMAN GENETICS Zhivotovsky, L. A., Rosenberg, N. A., Feldman, M. W. 2003; 72 (5): 1171-1186


    We study data on variation in 52 worldwide populations at 377 autosomal short tandem repeat loci, to infer a demographic history of human populations. Variation at di-, tri-, and tetranucleotide repeat loci is distributed differently, although each class of markers exhibits a decrease of within-population genetic variation in the following order: sub-Saharan Africa, Eurasia, East Asia, Oceania, and America. There is a similar decrease in the frequency of private alleles. With multidimensional scaling, populations belonging to the same major geographic region cluster together, and some regions permit a finer resolution of populations. When a stepwise mutation model is used, a population tree based on T_D estimates of divergence time suggests that the branches leading to the present sub-Saharan African populations of hunter-gatherers were the first to diverge from a common ancestral population (approximately 71-142 thousand years ago). The branches corresponding to sub-Saharan African farming populations and those that left Africa diverge next, with subsequent splits of branches for Eurasia, Oceania, East Asia, and America. African hunter-gatherer populations and populations of Oceania and America exhibit no statistically significant signature of growth. The features of population subdivision and growth are discussed in the context of the ancient expansion of modern humans.

    View details for Web of Science ID 000182474400010

    View details for PubMedID 12690579

  • Genetic structure of human populations SCIENCE Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A., Feldman, M. W. 2002; 298 (5602): 2381-2385


    We studied human population structure using genotypes at 377 autosomal microsatellite loci in 1056 individuals from 52 populations. Within-population differences among individuals account for 93 to 95% of genetic variation; differences among major groups constitute only 3 to 5%. Nevertheless, without using prior information about the origins of individuals, we identified six main genetic clusters, five of which correspond to major geographic regions, and subclusters that often correspond to individual populations. General agreement of genetic and predefined populations suggests that self-reported ancestry can facilitate assessments of epidemiological risks but does not obviate the need to use genetic information in genetic association studies.

    View details for Web of Science ID 000179915900054

    View details for PubMedID 12493913

  • Association mapping in structured populations AMERICAN JOURNAL OF HUMAN GENETICS Pritchard, J. K., Stephens, M., Rosenberg, N. A., Donnelly, P. 2000; 67 (1): 170-181


    The use, in association studies, of the forthcoming dense genomewide collection of single-nucleotide polymorphisms (SNPs) has been heralded as a potential breakthrough in the study of the genetic basis of common complex disorders. A serious problem with association mapping is that population structure can lead to spurious associations between a candidate marker and a phenotype. One common solution has been to abandon case-control studies in favor of family-based tests of association, such as the transmission/disequilibrium test (TDT), but this comes at a considerable cost in the need to collect DNA from close relatives of affected individuals. In this article we describe a novel, statistically valid, method for case-control association studies in structured populations. Our method uses a set of unlinked genetic markers to infer details of population structure, and to estimate the ancestry of sampled individuals, before using this information to test for associations within subpopulations. It provides power comparable with the TDT in many settings and may substantially outperform it if there are conflicting associations in different subpopulations.

    View details for Web of Science ID 000088926900019

    View details for PubMedID 10827107

  • Microsatellite evolution in modern humans: a comparison of two data sets from the same populations ANNALS OF HUMAN GENETICS Jin, L., Baskett, M. L., Cavalli-Sforza, L. L., Zhivotovsky, L. A., Feldman, M. W., Rosenberg, N. A. 2000; 64: 117-134


    We genotyped 64 dinucleotide microsatellite repeats in individuals from populations that represent all inhabited continents. Microsatellite summary statistics are reported for these data, as well as for a data set that includes 28 out of 30 loci studied by Bowcock et al. (1994) in the same individuals. For both data sets, diversity statistics such as heterozygosity, number of alleles per locus, and number of private alleles per locus produced the highest values in Africans, intermediate values in Europeans and Asians, and low values in Americans. Evolutionary trees of populations based on genetic distances separated groups from different continents. Corresponding trees were topologically similar for the two data sets, with the exception that the (δμ)² genetic distance reliably distinguished groups from different continents for the larger data set, but not for the smaller one. Consistent with our results from diversity statistics and from evolutionary trees, population growth statistics S_k and β, which seem particularly useful for indicating recent and ancient population size changes, confirm a model of human evolution in which human populations expand in size and through space following the departure of a small group from Africa.

    View details for Web of Science ID 000088739600003

    View details for PubMedID 11246466
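
As an illustration of one statistic named in this abstract, the (δμ)² distance between two populations is commonly defined as the squared difference in mean allele size (repeat count) at a locus, averaged over loci. The sketch below assumes that definition and a hypothetical input format in which each locus is a dictionary mapping allele size to count. It is not taken from the paper's own software.

```python
def mean_allele_size(allele_counts):
    """Mean repeat count at one locus, weighted by allele counts."""
    total = sum(allele_counts.values())
    return sum(size * count for size, count in allele_counts.items()) / total


def delta_mu_squared(pop_a_loci, pop_b_loci):
    """Average over loci of the squared difference in mean allele size.

    pop_a_loci, pop_b_loci: lists of dicts (allele size -> count),
    one dict per locus, with loci in the same order in both lists.
    """
    if len(pop_a_loci) != len(pop_b_loci):
        raise ValueError("both populations need the same loci")
    per_locus = [
        (mean_allele_size(a) - mean_allele_size(b)) ** 2
        for a, b in zip(pop_a_loci, pop_b_loci)
    ]
    return sum(per_locus) / len(per_locus)


pop_a = [{10: 2, 12: 2}, {8: 4}]
pop_b = [{14: 4}, {8: 2, 10: 2}]
distance = delta_mu_squared(pop_a, pop_b)
```

In the example data the per-locus squared differences are 9 and 1, so the distance is their average.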

  • Use of unlinked genetic markers to detect population stratification in association studies AMERICAN JOURNAL OF HUMAN GENETICS Pritchard, J. K., Rosenberg, N. A. 1999; 65 (1): 220-228


    We examine the issue of population stratification in association-mapping studies. In case-control studies of association, population subdivision or recent admixture of populations can lead to spurious associations between a phenotype and unlinked candidate loci. Using a model of sampling from a structured population, we show that if population stratification exists, it can be detected by use of unlinked marker loci. We show that the case-control-study design, using unrelated control individuals, is a valid approach for association mapping, provided that marker loci unlinked to the candidate locus are included in the study, to test for stratification. We suggest guidelines as to the number of unlinked marker loci to use.

    View details for Web of Science ID 000081224300027

    View details for PubMedID 10364535
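
The idea of testing for stratification with unlinked markers can be sketched as follows. This is a minimal illustration, assuming one simple realization of the approach: a Pearson chi-square statistic comparing allele counts in cases and controls is computed at each unlinked locus, and the statistics and their degrees of freedom are summed across loci. The function names and input format are hypothetical, and the paper's exact statistic and recommended number of loci should be taken from the paper itself.

```python
def allele_chi_square(case_counts, control_counts):
    """Pearson chi-square statistic and degrees of freedom for a
    2 x k table of allele counts (cases vs. controls) at one locus."""
    row_totals = [sum(case_counts), sum(control_counts)]
    col_totals = [a + b for a, b in zip(case_counts, control_counts)]
    n = sum(row_totals)
    stat = 0.0
    for row, row_total in zip((case_counts, control_counts), row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    # Alleles absent from both groups do not contribute a degree of freedom.
    df = sum(1 for c in col_totals if c > 0) - 1
    return stat, df


def stratification_statistic(loci):
    """Sum per-locus chi-square statistics and degrees of freedom.

    loci: list of (case_counts, control_counts) pairs, one per
    unlinked marker locus.
    """
    total_stat, total_df = 0.0, 0
    for case_counts, control_counts in loci:
        stat, df = allele_chi_square(case_counts, control_counts)
        total_stat += stat
        total_df += df
    return total_stat, total_df


loci = [([30, 10], [10, 30]), ([20, 20], [20, 20])]
total_stat, total_df = stratification_statistic(loci)
```

Under the null hypothesis of no stratification, the summed statistic would be compared against a chi-square distribution with `total_df` degrees of freedom. The p-value computation is omitted here to keep the sketch free of external dependencies.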