Honors & Awards

  • Career Award in the Biomedical Sciences, Burroughs Wellcome Fund (2004)
  • Sloan Fellow in Computational and Evolutionary Molecular Biology, Alfred P. Sloan Foundation (2006)
  • Dean's Basic Science Research Award, University of Michigan Medical School (2010)
  • Stanford Professorship in Population Genetics & Society, Stanford University School of Humanities & Sciences (2014)

Boards, Advisory Committees, Professional Organizations

  • Associate Editor, Evolution, Medicine, and Public Health (2014 - Present)
  • Editor-in-Chief, Theoretical Population Biology (2013 - Present)
  • Associate Editor, Molecular Biology and Evolution (2011 - 2014)
  • Associate Editor, Human Biology (2010 - Present)
  • Associate Editor, Genetics (2010 - Present)
  • Associate Editor, BMC Bioinformatics (2010 - 2014)
  • Associate Editor, American Journal of Human Genetics (2008 - 2010)

Professional Education

  • BA, Rice University, Mathematics (1997)
  • MS, Stanford University, Mathematics (1999)
  • PhD, Stanford University, Biology (2001)
  • Postdoc, University of Southern California, Molecular/Computational Biology (2005)

Current Research and Scholarly Interests

Research in the lab addresses problems in evolutionary biology and human
genetics through a combination of mathematical modeling, computer
simulations, development of statistical methods, and inference from
population-genetic data. Our current work covers topics such as human
genetic variation, inference of human evolutionary history, the role of
population genetics in the search for disease-susceptibility genes, the
relationship of gene trees and species trees, and mathematical properties
of statistics used for analyzing genetic variability.

Graduate and Fellowship Programs

  • Biology (School of Humanities and Sciences) (PhD Program)

All Publications

  • Coalescent Histories for Lodgepole Species Trees. Journal of computational biology Disanto, F., Rosenberg, N. A. 2015; 22 (10): 918-929


    Coalescent histories are combinatorial structures that describe for a given gene tree and species tree the possible lists of branches of the species tree on which the gene tree coalescences take place. Properties of the number of coalescent histories for gene trees and species trees affect a variety of probabilistic calculations in mathematical phylogenetics. Exact and asymptotic evaluations of the number of coalescent histories, however, are known only in a limited number of cases. Here we introduce a particular family of species trees, the lodgepole species trees (λ_n)_{n≥0}, in which tree λ_n has m = 2n+1 taxa. We determine the number of coalescent histories for the lodgepole species trees, in the case that the gene tree matches the species tree, showing that this number grows with m!! in the number of taxa m. This computation demonstrates the existence of tree families in which the growth in the number of coalescent histories is faster than exponential. Further, it provides a substantial improvement on the lower bound for the ratio of the largest number of matching coalescent histories to the smallest number of matching coalescent histories for trees with m taxa, increasing a previous bound of [Formula: see text] to [Formula: see text]. We discuss the implications of our enumerative results for phylogenetic computations.

    View details for DOI 10.1089/cmb.2015.0015

    View details for PubMedID 25973633

  • Beyond 2/3 and 1/3: The Complex Signatures of Sex-Biased Admixture on the X Chromosome. Genetics Goldberg, A., Rosenberg, N. A. 2015; 201 (1): 263-279


    Sex-biased demography, in which parameters governing migration and population size differ between females and males, has been studied through comparisons of X chromosomes, which are inherited sex-specifically, and autosomes, which are not. A common form of sex bias in humans is sex-biased admixture, in which at least one of the source populations differs in its proportions of females and males contributing to an admixed population. Studies of sex-biased admixture often examine the mean ancestry for markers on the X chromosome in relation to the autosomes. A simple framework noting that in a population with equally many females and males, two-thirds of X chromosomes appear in females, suggests that the mean X-chromosomal admixture fraction is a linear combination of female and male admixture parameters, with coefficients 2/3 and 1/3, respectively. Extending a mechanistic admixture model to accommodate the X chromosome, we demonstrate that this prediction is not generally true in admixture models, although it holds in the limit for an admixture process occurring as a single event. For a model with constant ongoing admixture, we determine the mean X-chromosomal admixture, comparing admixture on female and male X chromosomes to corresponding autosomal values. Surprisingly, in reanalyzing African-American genetic data to estimate sex-specific contributions from African and European sources, we find that the range of contributions compatible with the excess African ancestry on the X chromosome compared to autosomes has a wide spread, permitting scenarios either without male-biased contributions from Europe or without female-biased contributions from Africa.

    View details for DOI 10.1534/genetics.115.178509

    View details for PubMedID 26209245

    View details for PubMedCentralID PMC4566268

  • Upper bounds on F-ST in terms of the frequency of the most frequent allele and total homozygosity: The case of a specified number of alleles THEORETICAL POPULATION BIOLOGY Edge, M. D., Rosenberg, N. A. 2014; 97: 20-34
  • Theory and applications of a deterministic approximation to the coalescent model. Theoretical population biology Jewett, E. M., Rosenberg, N. A. 2014; 93: 14-29


    Under the coalescent model, the random number n_t of lineages ancestral to a sample is nearly deterministic as a function of time when n_t is moderate to large in value, and it is well approximated by its expectation E[n_t]. In turn, this expectation is well approximated by simple deterministic functions that are easy to compute. Such deterministic functions have been applied to estimate allele age, effective population size, and genetic diversity, and they have been used to study properties of models of infectious disease dynamics. Although a number of simple approximations of E[n_t] have been derived and applied to problems of population-genetic inference, the theoretical accuracy of the resulting approximate formulas and the inferences obtained using these approximations is not known, and the range of problems to which they can be applied is not well understood. Here, we demonstrate general procedures by which the approximation n_t ≈ E[n_t] can be used to reduce the computational complexity of coalescent formulas, and we show that the resulting approximations converge to their true values under simple assumptions. Such approximations provide alternatives to exact formulas that are computationally intractable or numerically unstable when the number of sampled lineages is moderate or large. We also extend an existing class of approximations of E[n_t] to the case of multiple populations of time-varying size with migration among them. Our results facilitate the use of the deterministic approximation n_t ≈ E[n_t] for deriving functionally simple, computationally efficient, and numerically stable approximations of coalescent formulas under complicated demographic scenarios.

    View details for DOI 10.1016/j.tpb.2013.12.007

    View details for PubMedID 24412419

  • An empirical evaluation of two-stage species tree inference strategies using a multilocus dataset from North American pines BMC EVOLUTIONARY BIOLOGY DeGiorgio, M., Syring, J., Eckert, A. J., Liston, A., Cronn, R., Neale, D. B., Rosenberg, N. A. 2014; 14
  • Discordance of Species Trees with Their Most Likely Gene Trees: A Unifying Principle MOLECULAR BIOLOGY AND EVOLUTION Rosenberg, N. A. 2013; 30 (12): 2709-2713


    A labeled gene tree topology that disagrees with a labeled species tree topology is said to be anomalous if it is more probable under a coalescent model for gene lineage evolution than the labeled gene tree topology that matches the species tree. It has previously been shown that as a consequence of short internal branches of the species tree, for every labeled species tree topology with five or more taxa, and for asymmetric four-taxon species tree topologies, an assignment of species tree branch lengths can be made which gives rise to anomalous gene trees (AGTs). Here, I offer an alternative characterization of this result--a labeled species tree topology produces AGTs if and only if it contains two consecutive internal branches in an ancestor-descendant relationship--and I provide a proof that follows from the change in perspective. The reformulation and alternative proof of the existence result for AGTs provide the insight that it is not merely short internal branches that generate AGTs, but instead, short internal branches that are arranged consecutively.

    View details for DOI 10.1093/molbev/mst160

    View details for Web of Science ID 000327793000016

    View details for PubMedID 24030555

  • A Population-Genetic Perspective on the Similarities and Differences Among Worldwide Human Populations HUMAN BIOLOGY Rosenberg, N. A. 2011; 83 (6): 659-684


    Recent studies have produced a variety of advances in the investigation of genetic similarities and differences among human populations. Here, I pose a series of questions about human population-genetic similarities and differences, and I then answer these questions by numerical computation with a single shared population-genetic data set. The collection of answers obtained provides an introductory perspective for understanding key results on the features of worldwide human genetic variation.

    View details for Web of Science ID 000209009300001

  • Genotype, haplotype and copy-number variation in worldwide human populations NATURE Jakobsson, M., Scholz, S. W., Scheet, P., Gibbs, J. R., VanLiere, J. M., Fung, H., Szpiech, Z. A., Degnan, J. H., Wang, K., Guerreiro, R., Bras, J. M., Schymick, J. C., Hernandez, D. G., Traynor, B. J., Simon-Sanchez, J., Matarin, M., Britton, A., van de Leemput, J., Rafferty, I., Bucan, M., Cann, H. M., Hardy, J. A., Rosenberg, N. A., Singleton, A. B. 2008; 451 (7181): 998-1003


    Genome-wide patterns of variation across individuals provide a powerful source of data for uncovering the history of migration, range expansion, and adaptation of the human species. However, high-resolution surveys of variation in genotype, haplotype and copy number have generally focused on a small number of population groups. Here we report the analysis of high-quality genotypes at 525,910 single-nucleotide polymorphisms (SNPs) and 396 copy-number-variable loci in a worldwide sample of 29 populations. Analysis of SNP genotypes yields strongly supported fine-scale inferences about population structure. Increasing linkage disequilibrium is observed with increasing geographic distance from Africa, as expected under a serial founder effect for the out-of-Africa spread of human populations. New approaches for haplotype analysis produce inferences about population structure that complement results based on unphased SNPs. Despite a difference from SNPs in the frequency spectrum of the copy-number variants (CNVs) detected--including a comparatively large number of CNVs in previously unexamined populations from Oceania and the Americas--the global distribution of CNVs largely accords with population structure analyses for SNP data sets of similar size. Our results produce new inferences about inter-population variation, support the utility of CNVs in human population-genetic research, and serve as a genomic resource for human-genetic studies in diverse worldwide populations.

    View details for DOI 10.1038/nature06742

    View details for Web of Science ID 000253313100050

    View details for PubMedID 18288195

  • Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets. Proceedings of the National Academy of Sciences of the United States of America Edge, M. D., Algee-Hewitt, B. F., Pemberton, T. J., Li, J. Z., Rosenberg, N. A. 2017; 114 (22): 5671-5676


    Combining genotypes across datasets is central in facilitating advances in genetics. Data aggregation efforts often face the challenge of record matching: the identification of dataset entries that represent the same individual. We show that records can be matched across genotype datasets that have no shared markers based on linkage disequilibrium between loci appearing in different datasets. Using two datasets for the same 872 people, one with 642,563 genome-wide SNPs and the other with 13 short tandem repeats (STRs) used in forensic applications, we find that 90-98% of forensic STR records can be connected to corresponding SNP records and vice versa. Accuracy increases to 99-100% when ∼30 STRs are used. Our method expands the potential of data aggregation, but it also suggests privacy risks intrinsic in maintenance of databases containing even small numbers of markers, including databases of forensic significance.

    View details for DOI 10.1073/pnas.1619944114

    View details for PubMedID 28507140

  • Reply to Lazaridis and Reich: Robust model-based inference of male-biased admixture during Bronze Age migration from the Pontic-Caspian Steppe. Proceedings of the National Academy of Sciences of the United States of America Goldberg, A., Günther, T., Rosenberg, N. A., Jakobsson, M. 2017; 114 (20): E3875-E3877

    View details for DOI 10.1073/pnas.1704442114

    View details for PubMedID 28476765

  • Enumeration of Ancestral Configurations for Matching Gene Trees and Species Trees. Journal of computational biology : a journal of computational molecular cell biology Disanto, F., Rosenberg, N. A. 2017


    Given a gene tree and a species tree, ancestral configurations represent the combinatorially distinct sets of gene lineages that can reach a given node of the species tree. They have been introduced as a data structure for use in the recursive computation of the conditional probability under the multispecies coalescent model of a gene tree topology given a species tree, the cost of this computation being affected by the number of ancestral configurations of the gene tree in the species tree. For matching gene trees and species trees, we obtain enumerative results on ancestral configurations. We study ancestral configurations in balanced and unbalanced families of trees determined by a given seed tree, showing that for seed trees with more than one taxon, the number of ancestral configurations increases for both families exponentially in the number of taxa n. For fixed n, the maximal number of ancestral configurations tabulated at the species tree root node and the largest number of labeled histories possible for a labeled topology occur for trees with precisely the same unlabeled shape. For ancestral configurations at the root, the maximum increases with [Formula: see text], where [Formula: see text] is a quadratic recurrence constant. Under a uniform distribution over the set of labeled trees of given size, the mean number of root ancestral configurations grows with [Formula: see text] and the variance with ∼[Formula: see text]. The results provide a contribution to the combinatorial study of gene trees and species trees.

    View details for DOI 10.1089/cmb.2016.0159

    View details for PubMedID 28437136

  • Simulation-Based Evaluation of Hybridization Network Reconstruction Methods in the Presence of Incomplete Lineage Sorting EVOLUTIONARY BIOINFORMATICS Kamneva, O. K., Rosenberg, N. A. 2017; 13


    Hybridization events generate reticulate species relationships, giving rise to species networks rather than species trees. We report a comparative study of consensus, maximum parsimony, and maximum likelihood methods of species network reconstruction using gene trees simulated assuming a known species history. We evaluate the role of the divergence time between species involved in a hybridization event, the relative contributions of the hybridizing species, and the error in gene tree estimation. When gene tree discordance is mostly due to hybridization and not due to incomplete lineage sorting (ILS), most of the methods can detect even highly skewed hybridization events between highly divergent species. For recent divergences between hybridizing species, when the influence of ILS is sufficiently high, likelihood methods outperform parsimony and consensus methods, which erroneously identify extra hybridizations. The more sophisticated likelihood methods, however, are affected by gene tree errors to a greater extent than are consensus and parsimony.

    View details for DOI 10.1177/1176934317691935

    View details for Web of Science ID 000397606600002

    View details for PubMedID 28469378

  • Asymptotic Properties of the Number of Matching Coalescent Histories for Caterpillar-Like Families of Species Trees. IEEE/ACM transactions on computational biology and bioinformatics Disanto, F., Rosenberg, N. A. 2016; 13 (5): 913-925


    Coalescent histories provide lists of species tree branches on which gene tree coalescences can take place, and their enumerative properties assist in understanding the computational complexity of calculations central in the study of gene trees and species trees. Here, we solve an enumerative problem left open by Rosenberg (IEEE/ACM Transactions on Computational Biology and Bioinformatics 10: 1253-1262, 2013) concerning the number of coalescent histories for gene trees and species trees with a matching labeled topology that belongs to a generic caterpillar-like family. By bringing a generating function approach to the study of coalescent histories, we prove that for any caterpillar-like family with seed tree t, the sequence (h_n)_{n≥0} describing the number of matching coalescent histories of the nth tree of the family grows asymptotically as a constant multiple of the Catalan numbers. Thus, h_n ∼ β_t c_n, where the asymptotic constant β_t > 0 depends on the shape of the seed tree t. The result extends a claim demonstrated only for seed trees with at most eight taxa to arbitrary seed trees, expanding the set of cases for which detailed enumerative properties of coalescent histories can be determined. We introduce a procedure that computes from t the constant β_t as well as the algebraic expression for the generating function of the sequence (h_n)_{n≥0}.

    View details for PubMedID 26452289

  • Consistency and inconsistency of consensus methods for inferring species trees from gene trees in the presence of ancestral population structure. Theoretical population biology DeGiorgio, M., Rosenberg, N. A. 2016; 110: 12-24


    In the last few years, several statistically consistent consensus methods for species tree inference have been devised that are robust to the gene tree discordance caused by incomplete lineage sorting in unstructured ancestral populations. One source of gene tree discordance that has only recently been identified as a potential obstacle for phylogenetic inference is ancestral population structure. In this article, we describe a general model of ancestral population structure, and by relying on a single carefully constructed example scenario, we show that the consensus methods Democratic Vote, STEAC, STAR, R* Consensus, Rooted Triple Consensus, Minimize Deep Coalescences, and Majority-Rule Consensus are statistically inconsistent under the model. We find that among the consensus methods evaluated, the only method that is statistically consistent in the presence of ancestral population structure is GLASS/Maximum Tree. We use simulations to evaluate the behavior of the various consensus methods in a model with ancestral population structure, showing that as the number of gene trees increases, estimates on the basis of GLASS/Maximum Tree approach the true species tree topology irrespective of the level of population structure, whereas estimates based on the remaining methods only approach the true species tree topology if the level of structure is low. However, through simulations using species trees both with and without ancestral population structure, we show that GLASS/Maximum Tree performs unusually poorly on gene trees inferred from alignments with little information. This practical limitation of GLASS/Maximum Tree together with the inconsistency of other methods prompts the need for both further testing of additional existing methods and development of novel methods under conditions that incorporate ancestral population structure.

    View details for DOI 10.1016/j.tpb.2016.02.002

    View details for PubMedID 27086043

  • The probability of monophyly of a sample of gene lineages on a species tree. Proceedings of the National Academy of Sciences of the United States of America Mehta, R. S., Bryant, D., Rosenberg, N. A. 2016; 113 (29): 8002-8009


    Monophyletic groups (groups that consist of all of the descendants of a most recent common ancestor) arise naturally as a consequence of descent processes that result in meaningful distinctions between organisms. Aspects of monophyly are therefore central to fields that examine and use genealogical descent. In particular, studies in conservation genetics, phylogeography, population genetics, species delimitation, and systematics can all make use of mathematical predictions under evolutionary models about features of monophyly. One important calculation, the probability that a set of gene lineages is monophyletic under a two-species neutral coalescent model, has been used in many studies. Here, we extend this calculation for a species tree model that contains arbitrarily many species. We study the effects of species tree topology and branch lengths on the monophyly probability. These analyses reveal new behavior, including the maintenance of nontrivial monophyly probabilities for gene lineage samples that span multiple species and even for lineages that do not derive from a monophyletic species group. We illustrate the mathematical results using an example application to data from maize and teosinte.

    View details for DOI 10.1073/pnas.1601074113

    View details for PubMedID 27432988

    View details for PubMedCentralID PMC4961160

  • Does Gene Tree Discordance Explain the Mismatch between Macroevolutionary Models and Empirical Patterns of Tree Shape and Branching Times? Systematic biology Stadler, T., Degnan, J. H., Rosenberg, N. A. 2016; 65 (4): 628-639


    Classic null models for speciation and extinction give rise to phylogenies that differ in distribution from empirical phylogenies. In particular, empirical phylogenies are less balanced and have branching times closer to the root compared to phylogenies predicted by common null models. This difference might be due to null models of the speciation and extinction process being too simplistic, or due to the empirical datasets not being representative of random phylogenies. A third possibility arises because phylogenetic reconstruction methods often infer gene trees rather than species trees, producing an incongruity between models that predict species tree patterns and empirical analyses that consider gene trees. We investigate the extent to which the difference between gene trees and species trees under a combined birth-death and multispecies coalescent model can explain the difference in empirical trees and birth-death species trees. We simulate gene trees embedded in simulated species trees and investigate their difference with respect to tree balance and branching times. We observe that the gene trees are less balanced and typically have branching times closer to the root than the species trees. Empirical trees from TreeBase are also less balanced than our simulated species trees, and model gene trees can explain an imbalance increase of up to 8% compared to species trees. However, we see a much larger imbalance increase in empirical trees, about 100%, meaning that additional features must also be causing imbalance in empirical trees. This simulation study highlights the necessity of revisiting the assumptions made in phylogenetic analyses, as these assumptions, such as equating the gene tree with the species tree, might lead to a biased conclusion.

    View details for DOI 10.1093/sysbio/syw019

    View details for PubMedID 26968785

  • Individual Identifiability Predicts Population Identifiability in Forensic Microsatellite Markers. Current biology Algee-Hewitt, B. F., Edge, M. D., Kim, J., Li, J. Z., Rosenberg, N. A. 2016; 26 (7): 935-942


    Highly polymorphic genetic markers with significant potential for distinguishing individual identity are used as a standard tool in forensic testing [1, 2]. At the same time, population-genetic studies have suggested that genetically diverse markers with high individual identifiability also confer information about genetic ancestry [3-6]. The dual influence of polymorphism levels on ancestry inference and forensic desirability suggests that forensically useful marker sets with high levels of individual identifiability might also possess substantial ancestry information. We study a standard forensic marker set (the 13 CODIS loci used in the United States and elsewhere [2, 7-9]) together with 779 additional microsatellites [10], using direct population structure inference to test whether markers with substantial individual identifiability also produce considerable information about ancestry. Despite having been selected for individual identification and not for ancestry inference [11], the CODIS markers generate nontrivial model-based clustering patterns similar to those of other sets of 13 tetranucleotide microsatellites. Although the CODIS markers have relatively low values of the FST divergence statistic, their high heterozygosities produce greater ancestry inference potential than is possessed by less heterozygous marker sets. More generally, we observe that marker sets with greater individual identifiability also tend toward greater population identifiability. We conclude that population identifiability regularly follows as a byproduct of the use of highly polymorphic forensic markers. Our findings have implications for the design of new forensic marker sets and for evaluations of the extent to which individual characteristics beyond identification might be predicted from current and future forensic data.

    View details for DOI 10.1016/j.cub.2016.01.065

    View details for PubMedID 26996508

  • Choosing Subsamples for Sequencing Studies by Minimizing the Average Distance to the Closest Leaf GENETICS Kang, J. T., Zhang, P., Zoellner, S., Rosenberg, N. A. 2015; 201 (2): 499-511


    Imputation of genotypes in a study sample can make use of sequenced or densely genotyped external reference panels consisting of individuals that are not from the study sample. It also can employ internal reference panels, incorporating a subset of individuals from the study sample itself. Internal panels offer an advantage over external panels because they can reduce imputation errors arising from genetic dissimilarity between a population of interest and a second, distinct population from which the external reference panel has been constructed. As the cost of next-generation sequencing decreases, internal reference panel selection is becoming increasingly feasible. However, it is not clear how best to select individuals to include in such panels. We introduce a new method for selecting an internal reference panel--minimizing the average distance to the closest leaf (ADCL)--and compare its performance relative to an earlier algorithm: maximizing phylogenetic diversity (PD). Employing both simulated data and sequences from the 1000 Genomes Project, we show that ADCL provides a significant improvement in imputation accuracy, especially for imputation of sites with low-frequency alleles. This improvement in imputation accuracy is robust to changes in reference panel size, marker density, and length of the imputation target region.

    View details for DOI 10.1534/genetics.115.176909

    View details for Web of Science ID 000362838500013

    View details for PubMedID 26307072

    View details for PubMedCentralID PMC4596665

  • A General Model of the Relationship between the Apportionment of Human Genetic Diversity and the Apportionment of Human Phenotypic Diversity. Human biology Edge, M. D., Rosenberg, N. A. 2015; 87 (4): 313-337


    Models that examine genetic differences between populations alongside a genotype-phenotype map can provide insight about phenotypic variation among groups. We generalize a simple model of a completely heritable, additive, selectively neutral quantitative trait to examine the relationship between single-locus genetic differentiation and phenotypic differentiation on quantitative traits. In agreement with similar efforts using different models, we show that the expected degree to which two groups differ on a neutral quantitative trait is not strongly affected by the number of genetic loci that influence the trait: neutral trait differences are expected to have a magnitude comparable to the genetic differences at a single neutral locus. We discuss this result with respect to population differences in disease phenotypes, arguing that although neutral genetic differences between populations can contribute to specific differences between populations in health outcomes, systematic patterns of difference that run in the same direction for many genetically independent health conditions are unlikely to be explained by neutral genetic differentiation.

    View details for PubMedID 27737590

  • Genetic Diversity and Societally Important Disparities. Genetics Rosenberg, N. A., Kang, J. T. 2015; 201 (1): 1-12


    The magnitude of genetic diversity within human populations varies in a way that reflects the sequence of migrations by which people spread throughout the world. Beyond its use in human evolutionary genetics, worldwide variation in genetic diversity sometimes can interact with social processes to produce differences among populations in their relationship to modern societal problems. We review the consequences of genetic diversity differences in the settings of familial identification in forensic genetic testing, match probabilities in bone marrow transplantation, and representation in genome-wide association studies of disease. In each of these three cases, the contribution of genetic diversity to social differences follows from population-genetic principles. For a fourth setting that is not similarly grounded, we reanalyze with expanded genetic data a report that genetic diversity differences influence global patterns of human economic development, finding no support for the claim. The four examples describe a limit to the importance of genetic diversity for explaining societal differences while illustrating a distinction that certain biologically based scenarios do require consideration of genetic diversity for solving problems to which populations have been differentially predisposed by the unique history of human migrations.

    View details for DOI 10.1534/genetics.115.176750

    View details for PubMedID 26354973

    View details for PubMedCentralID PMC4566256

  • Clumpak: a program for identifying clustering modes and packaging population structure inferences across K MOLECULAR ECOLOGY RESOURCES Kopelman, N. M., Mayzel, J., Jakobsson, M., Rosenberg, N. A., Mayrose, I. 2015; 15 (5): 1179-1191


    The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population-genetic data analysis. Application of model-based clustering programs often entails a number of steps, in which the user considers different modelling assumptions, compares results across different predetermined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present Clumpak (Cluster Markov Packager Across K), a method that automates the postprocessing of results of model-based population structure analyses. For analysing multiple independent runs at a single K value, Clumpak identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software Clumpp. Next, Clumpak identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in Clumpp and simplifying the comparison of clustering results across different K values. Clumpak incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. Clumpak, available at, simplifies the use of model-based analyses of population structure in population genetics and molecular ecology.

    View details for DOI 10.1111/1755-0998.12387

    View details for Web of Science ID 000359631600017

    View details for PubMedID 25684545
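
    The cluster-alignment step described in the abstract above can be illustrated with a minimal sketch. It is not Clumpak's or Clumpp's actual algorithm: the function name `align_clusters`, the dot-product similarity score, and the exhaustive search over permutations (feasible only for small K) are all assumptions made for illustration.

```python
from itertools import permutations


def align_clusters(q_ref, q_run):
    """Reorder the columns (clusters) of q_run to best match q_ref.

    q_ref and q_run are membership matrices: one row per individual,
    one column per cluster. Hypothetical helper, not part of Clumpak.
    """
    k = len(q_ref[0])

    def score(perm):
        # Assumed similarity: sum of products of matched membership coefficients.
        return sum(
            row_ref[j] * row_run[perm[j]]
            for row_ref, row_run in zip(q_ref, q_run)
            for j in range(k)
        )

    # Exhaustive search over all K! relabelings of the clusters.
    best = max(permutations(range(k)), key=score)
    aligned = [[row[best[j]] for j in range(k)] for row in q_run]
    return best, aligned


# Two runs that found the same solution with switched cluster labels.
q_ref = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
q_run = [[0.1, 0.0, 0.9], [0.8, 0.1, 0.1], [0.2, 0.8, 0.0]]
best_perm, q_aligned = align_clusters(q_ref, q_run)
```

    After alignment, `q_aligned` has the same column order as `q_ref`, so the two runs can be compared or averaged directly.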

  • Implications of the apportionment of human genetic diversity for the apportionment of human phenotypic diversity. Studies in history and philosophy of biological and biomedical sciences Edge, M. D., Rosenberg, N. A. 2015; 52: 32-45


    Researchers in many fields have considered the meaning of two results about genetic variation for concepts of "race." First, at most genetic loci, apportionments of human genetic diversity find that worldwide populations are genetically similar. Second, when multiple genetic loci are examined, it is possible to distinguish people with ancestry from different geographical regions. These two results raise an important question about human phenotypic diversity: To what extent do populations typically differ on phenotypes determined by multiple genetic loci? It might be expected that such phenotypes follow the pattern of similarity observed at individual loci. Alternatively, because they have a multilocus genetic architecture, they might follow the pattern of greater differentiation suggested by multilocus ancestry inference. To address the question, we extend a well-known classification model of Edwards (2003) by adding a selectively neutral quantitative trait. Using the extended model, we show, in line with previous work in quantitative genetics, that regardless of how many genetic loci influence the trait, one neutral trait is approximately as informative about ancestry as a single genetic locus. The results support the relevance of single-locus genetic-diversity partitioning for predictions about phenotypic diversity.

    View details for DOI 10.1016/j.shpsc.2014.12.005

    View details for PubMedID 25677859

  • Enhancing the mathematical properties of new haplotype homozygosity statistics for the detection of selective sweeps THEORETICAL POPULATION BIOLOGY Garud, N. R., Rosenberg, N. A. 2015; 102: 94-101


    Soft selective sweeps represent an important form of adaptation in which multiple haplotypes bearing adaptive alleles rise to high frequency. Most statistical methods for detecting selective sweeps from genetic polymorphism data, however, have focused on identifying hard selective sweeps in which a favored allele appears on a single haplotypic background; these methods might be underpowered to detect soft sweeps. Among exceptions is the set of haplotype homozygosity statistics introduced for the detection of soft sweeps by Garud et al. (2015). These statistics, examining frequencies of multiple haplotypes in relation to each other, include H12, a statistic designed to identify both hard and soft selective sweeps, and H2/H1, a statistic that, conditional on high H12 values, seeks to distinguish between hard and soft sweeps. A challenge in the use of H2/H1 is that its range depends on the associated value of H12, so that equal H2/H1 values might provide different levels of support for a soft sweep model at different values of H12. Here, we enhance the H12 and H2/H1 haplotype homozygosity statistics for selective sweep detection by deriving the upper bound on H2/H1 as a function of H12, thereby generating a statistic that normalizes H2/H1 to lie between 0 and 1. Through a reanalysis of resequencing data from inbred lines of Drosophila, we show that the enhanced statistic both strengthens interpretations obtained with the unnormalized statistic and leads to empirical insights that are less readily apparent without the normalization.

    View details for DOI 10.1016/j.tpb.2015.04.001

    View details for Web of Science ID 000355239700009

    View details for PubMedID 25891325
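
    As a rough illustration of the unnormalized statistics named in the abstract above, the sketch below computes H1, H12, and H2/H1 from haplotype labels in one window. The abstract does not state the formulas; the definitions used here (H1 as the sum of squared haplotype frequencies, H12 pooling the two most frequent haplotypes, H2 omitting the most frequent one) are the standard ones as recalled from Garud et al. (2015). The function name and example data are hypothetical, and the normalized statistic derived in the paper is not reproduced.

```python
from collections import Counter


def haplotype_homozygosity(haplotypes):
    """Return (H1, H12, H2/H1) for a list of haplotype labels in one window."""
    counts = Counter(haplotypes)
    n = sum(counts.values())
    # Haplotype frequencies sorted from most to least frequent.
    freqs = sorted((c / n for c in counts.values()), reverse=True)
    p1 = freqs[0]
    p2 = freqs[1] if len(freqs) > 1 else 0.0
    # H1: ordinary haplotype homozygosity.
    h1 = sum(p * p for p in freqs)
    # H12: homozygosity after pooling the two most frequent haplotypes.
    h12 = (p1 + p2) ** 2 + sum(p * p for p in freqs[2:])
    # H2: homozygosity with the most frequent haplotype left out.
    h2 = h1 - p1 * p1
    return h1, h12, h2 / h1


# Hypothetical windows: one dominant haplotype versus two common haplotypes.
hard_like = ["A"] * 9 + ["B"]
soft_like = ["A"] * 5 + ["B"] * 4 + ["C"]
h1_hard, h12_hard, ratio_hard = haplotype_homozygosity(hard_like)
h1_soft, h12_soft, ratio_soft = haplotype_homozygosity(soft_like)
```

    In these toy windows both H12 values are high, while H2/H1 is much larger when two haplotypes are common, which is the contrast the statistic is meant to capture.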

  • A comparison of worldwide phonemic and genetic variation in human populations. Proceedings of the National Academy of Sciences of the United States of America Creanza, N., Ruhlen, M., Pemberton, T. J., Rosenberg, N. A., Feldman, M. W., Ramachandran, S. 2015; 112 (5): 1265-1272


    Worldwide patterns of genetic variation are driven by human demographic history. Here, we test whether this demographic history has left similar signatures on phonemes (sound units that distinguish meaning between words in languages) to those it has left on genes. We analyze, jointly and in parallel, phoneme inventories from 2,082 worldwide languages and microsatellite polymorphisms from 246 worldwide populations. On a global scale, both genetic distance and phonemic distance between populations are significantly correlated with geographic distance. Geographically close language pairs share significantly more phonemes than distant language pairs, whether or not the languages are closely related. The regional geographic axes of greatest phonemic differentiation correspond to axes of genetic differentiation, suggesting that there is a relationship between human dispersal and linguistic variation. However, the geographic distribution of phoneme inventory sizes does not follow the predictions of a serial founder effect during human expansion out of Africa. Furthermore, although geographically isolated populations lose genetic diversity via genetic drift, phonemes are not subject to drift in the same way: within a given geographic radius, languages that are relatively isolated exhibit more variance in number of phonemes than languages with many neighbors. This finding suggests that relatively isolated languages are more susceptible to phonemic change than languages with many neighbors. Within a language family, phoneme evolution along genetic, geographic, or cognate-based linguistic trees predicts similar ancestral phoneme states to those predicted from ancient sources. More genetic sampling could further elucidate the relative roles of vertical and horizontal transmission in phoneme evolution.

    View details for DOI 10.1073/pnas.1424033112

    View details for PubMedID 25605893

  • AABC: Approximate approximate Bayesian computation for inference in population-genetic models. Theoretical population biology Buzbas, E. O., Rosenberg, N. A. 2015; 99: 31-42


    Approximate Bayesian computation (ABC) methods perform inference on model-specific parameters of mechanistically motivated parametric models when evaluating likelihoods is difficult. Central to the success of ABC methods, which have been used frequently in biology, is computationally inexpensive simulation of data sets from the parametric model of interest. However, when simulating data sets from a model is so computationally expensive that the posterior distribution of parameters cannot be adequately sampled by ABC, inference is not straightforward. We present "approximate approximate Bayesian computation" (AABC), a class of computationally fast inference methods that extends ABC to models in which simulating data is expensive. In AABC, we first simulate a number of data sets small enough to be computationally feasible to simulate from the parametric model. Conditional on these data sets, we use a statistical model that approximates the correct parametric model and enables efficient simulation of a large number of data sets. We show that under mild assumptions, the posterior distribution obtained by AABC converges to the posterior distribution obtained by ABC, as the number of data sets simulated from the parametric model and the sample size of the observed data set increase. We demonstrate the performance of AABC on a population-genetic model of natural selection, as well as on a model of the admixture history of hybrid populations. This latter example illustrates how, in population genetics, AABC is of particular utility in scenarios that rely on conceptually straightforward but potentially slow forward-in-time simulations.

    View details for DOI 10.1016/j.tpb.2014.09.002

    View details for PubMedID 25261426
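
    For readers unfamiliar with the baseline method that AABC extends, the following is a minimal sketch of plain rejection ABC, not of AABC itself. The toy model (a count of successes out of 100 draws with unknown probability theta), the uniform prior, the tolerance, and all function names are assumptions chosen for illustration.

```python
import random


def abc_rejection(observed, simulate, sample_prior, distance, tol, n_draws, rng):
    """Keep prior draws whose simulated data fall within tol of the observed data."""
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        if distance(simulate(theta, rng), observed) <= tol:
            accepted.append(theta)
    return accepted


def simulate_count(theta, rng, n=100):
    """Toy simulator: number of successes in n independent draws."""
    return sum(rng.random() < theta for _ in range(n))


rng = random.Random(0)
posterior = abc_rejection(
    observed=30,                          # hypothetical observed count out of 100
    simulate=simulate_count,
    sample_prior=lambda r: r.random(),    # uniform prior on [0, 1]
    distance=lambda a, b: abs(a - b),
    tol=2,
    n_draws=5000,
    rng=rng,
)
posterior_mean = sum(posterior) / len(posterior)
```

    Every accepted draw costs one call to the simulator, which is why the approach becomes impractical when a single simulation is slow, the setting the abstract addresses.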

  • Autosomal Admixture Levels Are Informative About Sex Bias in Admixed Populations GENETICS Goldberg, A., Verdu, P., Rosenberg, N. A. 2014; 198 (3): 1209-1229
  • Upper bounds on FST in terms of the frequency of the most frequent allele and total homozygosity: the case of a specified number of alleles. Theoretical population biology Edge, M. D., Rosenberg, N. A. 2014; 97: 20-34


    FST is one of the most frequently used indices of genetic differentiation among groups. Though FST takes values between 0 and 1, authors going back to Wright have noted that under many circumstances, FST is constrained to be less than 1. Recently, we showed that at a genetic locus with an unspecified number of alleles, FST for two subpopulations is strictly bounded from above by functions of both the frequency of the most frequent allele (M) and the homozygosity of the total population (HT). In the two-subpopulation case, FST can equal one only when the frequency of the most frequent allele and the total homozygosity are 1/2. Here, we extend this work by deriving strict bounds on FST for two subpopulations when the number of alleles at the locus is specified to be I. We show that restricting to I alleles produces the same upper bound on FST over much of the allowable domain for M and HT, and we derive more restrictive bounds in the windows M∈[1/I,1/(I-1)) and HT∈[1/I,I/(I^2-1)). These results extend our understanding of the behavior of FST in relation to other population-genetic statistics.

    View details for DOI 10.1016/j.tpb.2014.08.001

    View details for PubMedID 25132646
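
    The quantities discussed in the abstract above can be computed directly from allele frequencies. The sketch below assumes two equally weighted subpopulations and the homozygosity-based form FST = (HS - HT) / (1 - HT), where HS is the mean within-subpopulation homozygosity; the function name and example frequencies are hypothetical, and the bounds derived in the paper are not reproduced.

```python
def fst_two_subpops(p, q):
    """Return (FST, M, HT) for two equally weighted subpopulations.

    p and q are allele-frequency lists over the same alleles, each summing to 1.
    The locus must be polymorphic in the total population (HT < 1).
    """
    mean = [(a + b) / 2 for a, b in zip(p, q)]
    h_total = sum(m * m for m in mean)                            # HT
    h_sub = (sum(a * a for a in p) + sum(b * b for b in q)) / 2   # HS
    m_top = max(mean)                                             # M
    fst = (h_sub - h_total) / (1 - h_total)
    return fst, m_top, h_total


# Subpopulations fixed for different alleles: M = HT = 1/2 and FST = 1.
fst_fixed, m_fixed, ht_fixed = fst_two_subpops([1.0, 0.0], [0.0, 1.0])

# A more typical case with a shared common allele.
fst_mild, m_mild, ht_mild = fst_two_subpops([0.9, 0.1], [0.7, 0.3])
```

    The first example matches the statement in the abstract that FST reaches 1 only when M and HT both equal 1/2; in the second, with M = 0.8, FST is far below 1.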

  • On the Number of Ranked Species Trees Producing Anomalous Ranked Gene Trees IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Disanto, F., Rosenberg, N. A. 2014; 11 (6): 1229-1238
  • Mean deep coalescence cost under exchangeable probability distributions DISCRETE APPLIED MATHEMATICS Than, C. V., Rosenberg, N. A. 2014; 174: 11-26
  • Patterns of Admixture and Population Structure in Native Populations of Northwest North America PLOS GENETICS Verdu, P., Pemberton, T. J., Laurent, R., Kemp, B. M., Gonzalez-Oliver, A., Gorodezky, C., Hughes, C. E., Shattuck, M. R., Petzelt, B., Mitchell, J., Harry, H., William, T., Worl, R., Cybulski, J. S., Rosenberg, N. A., Malhi, R. S. 2014; 10 (8)


    The initial contact of European populations with indigenous populations of the Americas produced diverse admixture processes across North, Central, and South America. Recent studies have examined the genetic structure of indigenous populations of Latin America and the Caribbean and their admixed descendants, reporting on the genomic impact of the history of admixture with colonizing populations of European and African ancestry. However, relatively little genomic research has been conducted on admixture in indigenous North American populations. In this study, we analyze genomic data at 475,109 single-nucleotide polymorphisms sampled in indigenous peoples of the Pacific Northwest in British Columbia and Southeast Alaska, populations with a well-documented history of contact with European and Asian traders, fishermen, and contract laborers. We find that the indigenous populations of the Pacific Northwest have higher gene diversity than Latin American indigenous populations. Among the Pacific Northwest populations, interior groups provide more evidence for East Asian admixture, whereas coastal groups have higher levels of European admixture. In contrast with many Latin American indigenous populations, the variance of admixture is high in each of the Pacific Northwest indigenous populations, as expected for recent and ongoing admixture processes. The results reveal some similarities but notable differences between admixture patterns in the Pacific Northwest and those in Latin America, contributing to a more detailed understanding of the genomic consequences of European colonization events throughout the Americas.

    View details for DOI 10.1371/journal.pgen.1004530

    View details for Web of Science ID 000341577800027

    View details for PubMedID 25122539

  • Population-Genetic Influences on Genomic Estimates of the Inbreeding Coefficient: A Global Perspective HUMAN HEREDITY Pemberton, T. J., Rosenberg, N. A. 2014; 77 (1-4): 37-48


    Culturally driven marital practices provide a key instance of an interaction between social and genetic processes in shaping patterns of human genetic variation, producing, for example, increased identity by descent through consanguineous marriage. A commonly used measure to quantify identity by descent in an individual is the inbreeding coefficient, a quantity that reflects not only consanguinity, but also other aspects of kinship in the population to which the individual belongs. Here, in populations worldwide, we examine the relationship between genomic estimates of the inbreeding coefficient and population patterns in genetic variation. Using genotypes at 645 microsatellites, we compare inbreeding coefficients from 5,043 individuals representing 237 populations worldwide to demographic consanguinity frequency estimates available for 26 populations as well as to other quantities that can illuminate population-genetic influences on inbreeding coefficients. We observe higher inbreeding coefficient estimates in populations and geographic regions with known high levels of consanguinity or genetic isolation and in populations with an increased effect of genetic drift and decreased genetic diversity with increasing distance from Africa. For the small number of populations with specific consanguinity estimates, we find a correlation between inbreeding coefficients and consanguinity frequency (r = 0.349, p = 0.040). The results emphasize the importance of both consanguinity and population-genetic factors in influencing variation in inbreeding coefficients, and they provide insight into factors useful for assessing the effect of consanguinity on genomic patterns in different populations.

    View details for DOI 10.1159/000362878

    View details for Web of Science ID 000339321800006

    View details for PubMedID 25060268

  • Genetics and the History of the Samaritans: Y-Chromosomal Microsatellites and Genetic Affinity between Samaritans and Cohanim HUMAN BIOLOGY Oefner, P. J., Hoelzl, G., Shen, P., Shpirer, I., Gefel, D., Lavi, T., Woolf, E., Cohen, J., Cinnioglu, C., Underhill, P. A., Rosenberg, N. A., Hochrein, J., Granka, J. M., Hillel, J., Feldman, M. W. 2013; 85 (6): 825-857
  • No Evidence from Genome-wide Data of a Khazar Origin for the Ashkenazi Jews HUMAN BIOLOGY Behar, D. M., Metspalu, M., Baran, Y., Kopelman, N. M., Yunusbayev, B., Gladstein, A., Tzur, S., Sahakyan, H., Bahmanimehr, A., Yepiskoposyan, L., Tambets, K., Khusnutdinova, E. K., Kushniarevich, A., Balanovsky, O., Balanovsky, E., Kovacevic, L., Marjanovic, D., Mihailov, E., Kouvatsi, A., Triantaphyllidis, C., King, R. J., Semino, O., Torroni, A., Hammer, M. F., Metspalu, E., Skorecki, K., Rosset, S., Halperin, E., Villems, R., Rosenberg, N. A. 2013; 85 (6): 859-900
  • From generation to generation: the genetics of jewish populations. Human biology Rosenberg, N. A., Weitzman, S. P. 2013; 85 (6): 817-824

    View details for PubMedID 25079121

  • Genotype imputation reference panel selection using maximal phylogenetic diversity. Genetics Zhang, P., Zhan, X., Rosenberg, N. A., Zöllner, S. 2013; 195 (2): 319-330


    The recent dramatic cost reduction of next-generation sequencing technology enables investigators to assess most variants in the human genome to identify risk variants for complex diseases. However, sequencing large samples remains very expensive. For a study sample with existing genotype data, such as array data from genome-wide association studies, a cost-effective approach is to sequence a subset of the study sample and then to impute the rest of the study sample, using the sequenced subset as a reference panel. The use of such an internal reference panel identifies population-specific variants and avoids the problem of a substantial mismatch in ancestry background between the study population and the reference population. To efficiently select an internal panel, we introduce an idea of phylogenetic diversity from mathematical phylogenetics and comparative genomics. We propose the "most diverse reference panel", defined as the subset with the maximal "phylogenetic diversity", thereby incorporating individuals that span a diverse range of genotypes within the sample. Using data both from simulations and from the 1000 Genomes Project, we show that the most diverse reference panel can substantially improve the imputation accuracy compared to randomly selected reference panels, especially for the imputation of rare variants. The improvement in imputation accuracy holds across different marker densities, reference panel sizes, and lengths for the imputed segments. We thus propose a novel strategy for planning sequencing studies on samples with existing genotype data.

    View details for DOI 10.1534/genetics.113.154591

    View details for PubMedID 23934887

  • Runs of homozygosity and parental relatedness. Genetics in medicine Rosenberg, N. A., Pemberton, T. J., Li, J. Z., Belmont, J. W. 2013; 15 (9): 753-754

    View details for DOI 10.1038/gim.2013.108

    View details for PubMedID 24008258

  • Coalescent Histories for Caterpillar-Like Families IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Rosenberg, N. A. 2013; 10 (5): 1253-1262


    A coalescent history is an assignment of branches of a gene tree to branches of a species tree on which coalescences in the gene tree occur. The number of coalescent histories for a pair consisting of a labeled gene tree topology and a labeled species tree topology is important in gene tree probability computations, and more generally, in studying evolutionary possibilities for gene trees on species trees. Defining the Tr-caterpillar-like family as a sequence of n-taxon trees constructed by replacing the r-taxon subtree of n-taxon caterpillars by a specific r-taxon labeled topology Tr, we examine the number of coalescent histories for caterpillar-like families with matching gene tree and species tree labeled topologies. For each Tr with size r≤8, we compute the number of coalescent histories for n-taxon trees in the Tr-caterpillar-like family. Next, as n→∞, we find that the limiting ratio of the numbers of coalescent histories for the Tr family and caterpillars themselves is correlated with the number of labeled histories for Tr. The results support a view that large numbers of coalescent histories occur when a tree has both a relatively balanced subtree and a high tree depth, contributing to deeper understanding of the combinatorics of gene trees and species trees.

    View details for DOI 10.1109/TCBB.2013.123

    View details for Web of Science ID 000331461400017

    View details for PubMedID 24524157

  • Genotype imputation in a coalescent model with infinitely-many-sites mutation THEORETICAL POPULATION BIOLOGY Huang, L., Buzbas, E. O., Rosenberg, N. A. 2013; 87: 62-74


    Empirical studies have identified population-genetic factors as important determinants of the properties of genotype-imputation accuracy in imputation-based disease association studies. Here, we develop a simple coalescent model of three sequences that we use to explore the theoretical basis for the influence of these factors on genotype-imputation accuracy, under the assumption of infinitely-many-sites mutation. Employing a demographic model in which two populations diverged at a given time in the past, we derive the approximate expectation and variance of imputation accuracy in a study sequence sampled from one of the two populations, choosing between two reference sequences, one sampled from the same population as the study sequence and the other sampled from the other population. We show that, under this model, imputation accuracy, as measured by the proportion of polymorphic sites that are imputed correctly in the study sequence, increases in expectation with the mutation rate, the proportion of the markers in a chromosomal region that are genotyped, and the time to divergence between the study and reference populations. Each of these effects derives largely from an increase in information available for determining the reference sequence that is genetically most similar to the sequence targeted for imputation. We analyze as a function of divergence time the expected gain in imputation accuracy in the target using a reference sequence from the same population as the target rather than from the other population. Together with a growing body of empirical investigations of genotype imputation in diverse human populations, our modeling framework lays a foundation for extending imputation techniques to novel populations that have not yet been extensively examined.

    View details for DOI 10.1016/j.tpb.2012.09.006

    View details for Web of Science ID 000322688800007

    View details for PubMedID 23079542

  • Long Runs of Homozygosity Are Enriched for Deleterious Variation AMERICAN JOURNAL OF HUMAN GENETICS Szpiech, Z. A., Xu, J., Pemberton, T. J., Peng, W., Zoellner, S., Rosenberg, N. A., Li, J. Z. 2013; 93 (1): 90-102


    Exome sequencing offers the potential to study the population-genomic variables that underlie patterns of deleterious variation. Runs of homozygosity (ROH) are long stretches of consecutive homozygous genotypes probably reflecting segments shared identically by descent as the result of processes such as consanguinity, population size reduction, and natural selection. The relationship between ROH and patterns of predicted deleterious variation can provide insight into the way in which these processes contribute to the maintenance of deleterious variants. Here, we use exome sequencing to examine ROH in relation to the distribution of deleterious variation in 27 individuals of varying levels of apparent inbreeding from 6 human populations. A significantly greater fraction of all genome-wide predicted damaging homozygotes fall in ROH than would be expected from the corresponding fraction of nondamaging homozygotes in ROH (p < 0.001). This pattern is strongest for long ROH (p < 0.05). ROH, and especially long ROH, harbor disproportionately more deleterious homozygotes than would be expected on the basis of the total ROH coverage of the genome and the genomic distribution of nondamaging homozygotes. The results accord with a hypothesis that recent inbreeding, which generates long ROH, enables rare deleterious variants to exist in homozygous form. Thus, just as inbreeding can elevate the occurrence of rare recessive diseases that represent homozygotes for strongly deleterious mutations, inbreeding magnifies the occurrence of mildly deleterious variants as well.

    View details for DOI 10.1016/j.ajhg.2013.05.003

    View details for Web of Science ID 000321804500008

    View details for PubMedID 23746547

    View details for PubMedCentralID PMC3710769
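
    The definition of ROH given in the abstract above (long stretches of consecutive homozygous genotypes) can be sketched as a simple scan. This is a deliberately simplified illustration and not the study's method: it measures length in markers rather than physical distance and allows no missing data or genotyping errors. The function name and the example genotypes are hypothetical.

```python
def runs_of_homozygosity(genotypes, min_length):
    """Return (start, end) index pairs, end exclusive, for each run of
    consecutive homozygous genotypes at least min_length markers long.

    Each genotype is a pair of alleles, for example ("A", "G").
    """
    runs = []
    start = None
    for i, (a, b) in enumerate(genotypes):
        if a == b:
            if start is None:
                start = i          # a new homozygous run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None           # a heterozygous site ends any open run
    # Close a run that extends to the last marker.
    if start is not None and len(genotypes) - start >= min_length:
        runs.append((start, len(genotypes)))
    return runs


example = [
    ("A", "A"), ("C", "C"), ("G", "T"), ("T", "T"), ("T", "T"),
    ("A", "A"), ("C", "C"), ("A", "G"), ("G", "G"),
]
roh = runs_of_homozygosity(example, min_length=3)
```

    With a minimum length of three markers, only the stretch covering indices 3 through 6 qualifies; the shorter homozygous stretches at either end are ignored.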

  • Population Structure in a Comprehensive Genomic Data Set on Human Microsatellite Variation G3-GENES GENOMES GENETICS Pemberton, T. J., DeGiorgio, M., Rosenberg, N. A. 2013; 3 (5): 891-907


    Over the past two decades, microsatellite genotypes have provided the data for landmark studies of human population-genetic variation. However, the various microsatellite data sets have been prepared with different procedures and sets of markers, so that it has been difficult to synthesize available data for a comprehensive analysis. Here, we combine eight human population-genetic data sets at the 645 microsatellite loci they share in common, accounting for procedural differences in the production of the different data sets, to assemble a single data set containing 5795 individuals from 267 worldwide populations. We perform a systematic analysis of genetic relatedness, detecting 240 intra-population and 92 inter-population pairs of previously unidentified close relatives and proposing standardized subsets of unrelated individuals for use in future studies. We then augment the human data with a data set of 84 chimpanzees at the 246 loci they share in common with the human samples. Multidimensional scaling and neighbor-joining analyses of these data sets offer new insights into the structure of human populations and enable a comparison of genetic variation patterns in chimpanzees with those in humans. Our combined data sets are the largest of their kind reported to date and provide a resource for use in human population-genetic studies.

    View details for DOI 10.1534/g3.113.005728

    View details for Web of Science ID 000319438700010

    View details for PubMedID 23550135

  • Geographic Sampling Scheme as a Determinant of the Major Axis of Genetic Variation in Principal Components Analysis MOLECULAR BIOLOGY AND EVOLUTION DeGiorgio, M., Rosenberg, N. A. 2013; 30 (2): 480-488


    Principal component (PC) maps, which plot the values of a given PC estimated on the basis of allele frequency variation at the geographic sampling locations of a set of populations, are often used to investigate the properties of past range expansions. Some studies have argued that in a range expansion, the axis of greatest variation (i.e., the first PC) is parallel to the axis of expansion. In contrast, others have identified a pattern in which the axis of greatest variation is perpendicular to the axis of expansion. Here, we seek to understand this difference in outcomes by investigating the effect of the geographic sampling scheme on the direction of the axis of greatest variation under a two-dimensional range expansion model. From datasets simulated using each of two different schemes for the geographic sampling of populations under the model, we create PC maps for the first PC. We find that depending on the geographic sampling scheme, the axis of greatest variation can be either parallel or perpendicular to the axis of expansion. We provide an explanation for this result in terms of intra- and interpopulation coalescence times.

    View details for DOI 10.1093/molbev/mss233

    View details for Web of Science ID 000314122000023

    View details for PubMedID 23051843

  • The Relationship Between F-ST and the Frequency of the Most Frequent Allele GENETICS Jakobsson, M., Edge, M. D., Rosenberg, N. A. 2013; 193 (2): 515-528


    F(ST) is frequently used as a summary of genetic differentiation among groups. It has been suggested that F(ST) depends on the allele frequencies at a locus, as it exhibits a variety of peculiar properties related to genetic diversity: higher values for biallelic single-nucleotide polymorphisms (SNPs) than for multiallelic microsatellites, low values among high-diversity populations viewed as substantially distinct, and low values for populations that differ primarily in their profiles of rare alleles. A full mathematical understanding of the dependence of F(ST) on allele frequencies, however, has been elusive. Here, we examine the relationship between F(ST) and the frequency of the most frequent allele, demonstrating that the range of values that F(ST) can take is restricted considerably by the allele-frequency distribution. For a two-population model, we derive strict bounds on F(ST) as a function of the frequency M of the allele with highest mean frequency between the pair of populations. Using these bounds, we show that for a value of M chosen uniformly between 0 and 1 at a multiallelic locus whose number of alleles is left unspecified, the mean maximum F(ST) is ∼0.3585. Further, F(ST) is restricted to values much less than 1 when M is low or high, and the contribution to the maximum F(ST) made by the most frequent allele is on average ∼0.4485. Using bounds on homozygosity that we have previously derived as functions of M, we describe strict bounds on F(ST) in terms of the homozygosity of the total population, finding that the mean maximum F(ST) given this homozygosity is 1 - ln 2 ≈ 0.3069. Our results provide a conceptual basis for understanding the dependence of F(ST) on allele frequencies and genetic diversity and for interpreting the roles of these quantities in computations of F(ST) from population-genetic data. Further, our analysis suggests that many unusual observations of F(ST), including the relatively low F(ST) values in high-diversity human populations from Africa and the relatively low estimates of F(ST) for microsatellites compared to SNPs, can be understood not as biological phenomena associated with different groups of populations or classes of markers but rather as consequences of the intrinsic mathematical dependence of F(ST) on the properties of allele-frequency distributions.

    View details for DOI 10.1534/genetics.112.144758

    View details for Web of Science ID 000314821300015

    View details for PubMedID 23172852

  • Mathematical properties of the deep coalescence cost. IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM Than, C. V., Rosenberg, N. A. 2013; 10 (1): 61-72


    In the minimizing-deep-coalescences (MDC) approach for species tree inference, a tree that has the minimal deep coalescence cost for reconciling a collection of gene trees is taken as an estimate of the species tree topology. The MDC method possesses the desirable Pareto property, and in practice it is quite accurate and computationally efficient. Here, in order to better understand the MDC method, we investigate some properties of the deep coalescence cost. We prove that the unit neighborhood of either a rooted species tree or a rooted gene tree under the deep coalescence cost is exactly the same as the tree's unit neighborhood under the rooted nearest-neighbor interchange (NNI) distance. Next, for a fixed species tree, we obtain the maximum deep coalescence cost across all gene trees as well as the number of gene trees that achieve the maximum cost. We also study corresponding problems for a fixed gene tree.

    View details for DOI 10.1109/TCBB.2012.133

    View details for PubMedID 23702544
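
    The deep coalescence cost discussed in the abstract above can be sketched for the simplest setting. The code assumes rooted binary trees written as nested tuples, exactly one sampled gene lineage per species, and identical leaf labels in both trees; the cost is taken as the number of extra lineages summed over the non-root branches of the species tree. The function names are hypothetical, and this is not the authors' implementation.

```python
def clades(tree):
    """Return (leaf set, list of all clade leaf sets) for a nested-tuple tree."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def deep_coalescence_cost(species_tree, gene_tree):
    """Sum of extra gene lineages over the non-root species tree branches.

    Gene lineages are assumed to coalesce as recently as the species tree
    allows, so the lineages leaving the branch above a species clade are
    the maximal gene tree clades contained in that species clade.
    """
    root_leaves, species_clades = clades(species_tree)
    _, gene_clades = clades(gene_tree)
    cost = 0
    for species_clade in species_clades:
        if species_clade == root_leaves:
            continue  # no branch above the root is counted
        inside = [g for g in gene_clades if g <= species_clade]
        maximal = [g for g in inside if not any(g < h for h in inside)]
        cost += len(maximal) - 1
    return cost


balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
discordant = (("A", "C"), ("B", "D"))

print(deep_coalescence_cost(balanced, balanced))
print(deep_coalescence_cost(caterpillar, balanced))
print(deep_coalescence_cost(balanced, discordant))
```

    A matching gene tree has cost 0. The balanced gene tree on the caterpillar species tree has cost 1, and the discordant gene tree on the balanced species tree has cost 2, one extra lineage on each of the two internal branches.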

  • Windfalls and pitfalls: Applications of population genetics to the search for disease genes. Evolution, medicine, and public health Edge, M. D., Gorroochurn, P., Rosenberg, N. A. 2013; 2013 (1): 254-272


    Association mapping can be viewed as an application of population genetics and evolutionary biology to the problem of identifying genes causally connected to phenotypes. However, some population-genetic principles important to the design and analysis of association studies have not been widely understood or have even been generally misunderstood. Some of these principles underlie techniques that can aid in the discovery of genetic variants that influence phenotypes ('windfalls'), whereas others can interfere with study design or interpretation of results ('pitfalls'). Here, considering examples involving genetic variant discovery, linkage disequilibrium, power to detect associations, population stratification and genotype imputation, we address misunderstandings in the application of population genetics to association studies, and we illuminate how some surprising results in association contexts can be easily explained when considered from evolutionary and population-genetic perspectives. Through our examples, we argue that population-genetic thinking, which takes a theoretical view of the evolutionary forces that guide the emergence and propagation of genetic variants, substantially informs the design and interpretation of genetic association studies. In particular, population-genetic thinking sheds light on genetic confounding, on the relationships between association signals of typed markers and causal variants, and on the advantages and disadvantages of particular strategies for measuring genetic variation in association studies.

    View details for DOI 10.1093/emph/eot021

    View details for PubMedID 24481204

  • The behavior of admixed populations in neighbor-joining inference of population trees. Pacific Symposium on Biocomputing Kopelman, N. M., Stone, L., Gascuel, O., Rosenberg, N. A. 2013: 273-284


    Neighbor-joining is one of the most widely used methods for constructing evolutionary trees. This approach from phylogenetics is often employed in population genetics, where distance matrices obtained from allele frequencies are used to produce a representation of population relationships in the form of a tree. In phylogenetics, the utility of neighbor-joining derives partly from a result that for a class of distance matrices including those that are additive or tree-like (generated by summing weights over the edges connecting pairs of taxa in a tree to obtain pairwise distances), application of neighbor-joining recovers exactly the underlying tree. For populations within a species, however, migration and admixture can produce distance matrices that reflect more complex processes than those obtained from the bifurcating trees typical in the multispecies context. Admixed populations (populations descended from recent mixture of groups that have long been separated) have been observed to be located centrally in inferred neighbor-joining trees, with short external branches incident to the path connecting their source populations. Here, using a simple model, we explore mathematically the behavior of an admixed population under neighbor-joining. We show that with an additive distance matrix, a population admixed among two source populations necessarily lies on the path between the sources. Relaxing the additivity requirement, we examine the smallest nontrivial case (four populations, one of which is admixed between two of the other three), showing that the two source populations never merge with each other before one of them merges with the admixed population. Furthermore, the distance on the constructed tree between the admixed population and either source population is always smaller than the distance between the source populations, and the external branch for the admixed population is always incident to the path connecting the sources.
    We define three properties that hold for four taxa and that we hypothesize are satisfied under more general conditions: antecedence of clustering, intermediacy of distances, and intermediacy of path lengths. Our findings can inform interpretations of neighbor-joining trees with admixed groups, and they provide an explanation for patterns observed in trees of human populations.

    View details for PubMedID 23424132
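
    The notion of an additive (tree-like) distance matrix used in the abstract above can be checked with the standard four-point condition: for every four taxa, the two largest of the three pairwise sums d(a,b) + d(c,d), d(a,c) + d(b,d), and d(a,d) + d(b,c) must be equal. The sketch below assumes a symmetric matrix stored as a dictionary of dictionaries; the helper names and the example edge weights are hypothetical, and the code does not implement neighbor-joining itself.

```python
from itertools import combinations


def symmetric(pairs):
    """Build a symmetric dict-of-dicts distance matrix from {(x, y): value}."""
    dist = {}
    for (x, y), value in pairs.items():
        dist.setdefault(x, {})[y] = value
        dist.setdefault(y, {})[x] = value
    return dist


def is_additive(dist, taxa, tol=1e-9):
    """Return True if every quartet of taxa satisfies the four-point condition."""
    for a, b, c, d in combinations(taxa, 4):
        sums = sorted([
            dist[a][b] + dist[c][d],
            dist[a][c] + dist[b][d],
            dist[a][d] + dist[b][c],
        ])
        # The two largest sums must coincide for a tree-like quartet.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


taxa = ["A", "B", "C", "D"]

# Distances read off the tree ((A,B),(C,D)) with external branch lengths
# 1, 2, 3, 4 for A, B, C, D and an internal branch of length 5.
tree_like = symmetric({
    ("A", "B"): 3, ("C", "D"): 7,
    ("A", "C"): 9, ("A", "D"): 10,
    ("B", "C"): 10, ("B", "D"): 11,
})

# The same matrix with one entry perturbed, which breaks additivity.
perturbed = symmetric({
    ("A", "B"): 3, ("C", "D"): 7,
    ("A", "C"): 12, ("A", "D"): 10,
    ("B", "C"): 10, ("B", "D"): 11,
})

print(is_additive(tree_like, taxa), is_additive(perturbed, taxa))
```

    For the first matrix the three sums are 10, 20, and 20, so the condition holds; in the perturbed matrix they are 10, 20, and 23, so it fails.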

  • A Characterization of the Set of Species Trees that Produce Anomalous Ranked Gene Trees IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS Degnan, J. H., Rosenberg, N. A., Stadler, T. 2012; 9 (6): 1558-1568


    Ranked gene trees, which consider both the gene tree topology and the sequence in which gene lineages separate, can potentially provide a new source of information for use in modeling genealogies and performing inference of species trees. Recently, we have calculated the probability distribution of ranked gene trees under the standard multispecies coalescent model for the evolution of gene lineages along the branches of a fixed species tree, demonstrating the existence of anomalous ranked gene trees (ARGTs), in which a ranked gene tree that does not match the ranked species tree can have greater probability under the model than the matching ranked gene tree. Here, we fully characterize the set of unranked species tree topologies that give rise to ARGTs, showing that this set contains all species tree topologies with five or more taxa, with the exceptions of caterpillars and pseudocaterpillars. The results have implications for the use of ranked gene trees in phylogenetic inference.

    View details for DOI 10.1109/TCBB.2012.110

    View details for Web of Science ID 000312558400002

    View details for PubMedID 22868677

  • A maximum-likelihood method to correct for allelic dropout in microsatellite data with no replicate genotypes. Genetics Wang, C., Schroeder, K. B., Rosenberg, N. A. 2012; 192 (2): 651-669


    Allelic dropout is a commonly observed source of missing data in microsatellite genotypes, in which one or both allelic copies at a locus fail to be amplified by the polymerase chain reaction. Especially for samples with poor DNA quality, this problem causes a downward bias in estimates of observed heterozygosity and an upward bias in estimates of inbreeding, owing to mistaken classifications of heterozygotes as homozygotes when one of the two copies drops out. One general approach for avoiding allelic dropout involves repeated genotyping of homozygous loci to minimize the effects of experimental error. Existing computational alternatives often require replicate genotyping as well. These approaches, however, are costly and are suitable only when enough DNA is available for repeated genotyping. In this study, we propose a maximum-likelihood approach together with an expectation-maximization algorithm to jointly estimate allelic dropout rates and allele frequencies when only one set of nonreplicated genotypes is available. Our method considers estimates of allelic dropout caused by both sample-specific factors and locus-specific factors, and it allows for deviation from Hardy-Weinberg equilibrium owing to inbreeding. Using the estimated parameters, we correct the bias in the estimation of observed heterozygosity through the use of multiple imputations of alleles in cases where dropout might have occurred. With simulated data, we show that our method can (1) effectively reproduce patterns of missing data and heterozygosity observed in real data; (2) correctly estimate model parameters, including sample-specific dropout rates, locus-specific dropout rates, and the inbreeding coefficient; and (3) successfully correct the downward bias in estimating the observed heterozygosity. We find that our method is fairly robust to violations of model assumptions caused by population structure and by genotyping errors from sources other than allelic dropout. 
    Because the data sets imputed under our model can be investigated in additional subsequent analyses, our method will be useful for preparing data for applications in diverse contexts in population genetics and molecular ecology.

    View details for DOI 10.1534/genetics.112.139519

    View details for PubMedID 22851645

  • Genomic Patterns of Homozygosity in Worldwide Human Populations AMERICAN JOURNAL OF HUMAN GENETICS Pemberton, T. J., Absher, D., Feldman, M. W., Myers, R. M., Rosenberg, N. A., Li, J. Z. 2012; 91 (2): 275-292


    Genome-wide patterns of homozygosity runs and their variation across individuals provide a valuable and often untapped resource for studying human genetic diversity and evolutionary history. Using genotype data at 577,489 autosomal SNPs, we employed a likelihood-based approach to identify runs of homozygosity (ROH) in 1,839 individuals representing 64 worldwide populations, classifying them by length into three classes (short, intermediate, and long) with a model-based clustering algorithm. For each class, the number and total length of ROH per individual show considerable variation across individuals and populations. The total lengths of short and intermediate ROH per individual increase with the distance of a population from East Africa, in agreement with similar patterns previously observed for locus-wise homozygosity and linkage disequilibrium. By contrast, total lengths of long ROH show large interindividual variations that probably reflect recent inbreeding patterns, with higher values occurring more often in populations with known high frequencies of consanguineous unions. Across the genome, distributions of ROH are not uniform, and they have distinctive continental patterns. ROH frequencies across the genome are correlated with local genomic variables such as recombination rate, as well as with signals of recent positive selection. In addition, long ROH are more frequent in genomic regions harboring genes associated with autosomal-dominant diseases than in regions not implicated in Mendelian diseases. These results provide insight into the way in which homozygosity patterns are produced, and they generate baseline homozygosity patterns that can be used to aid homozygosity mapping of genes associated with recessive diseases.

    View details for DOI 10.1016/j.ajhg.2012.06.014

    View details for Web of Science ID 000307608700006

    View details for PubMedID 22883143

    View details for PubMedCentralID PMC3415543

  • Inferring Species Trees Directly from Biallelic Genetic Markers: Bypassing Gene Trees in a Full Coalescent Analysis MOLECULAR BIOLOGY AND EVOLUTION Bryant, D., Bouckaert, R., Felsenstein, J., Rosenberg, N. A., RoyChoudhury, A. 2012; 29 (8): 1917-1932


    The multispecies coalescent provides an elegant theoretical framework for estimating species trees and species demographics from genetic markers. However, practical applications of the multispecies coalescent model are limited by the need to integrate or sample over all gene trees possible for each genetic marker. Here we describe a polynomial-time algorithm that computes the likelihood of a species tree directly from the markers under a finite-sites model of mutation, effectively integrating over all possible gene trees. The method applies to independent (unlinked) biallelic markers such as well-spaced single nucleotide polymorphisms, and we have implemented it in SNAPP, a Markov chain Monte Carlo sampler for inferring species trees, divergence dates, and population sizes. We report results from simulation experiments and from an analysis of 1997 amplified fragment length polymorphism loci in 69 individuals sampled from six species of Ourisia (New Zealand native foxglove).

    View details for DOI 10.1093/molbev/mss086

    View details for Web of Science ID 000307171300004

    View details for PubMedID 22422763

  • A Quantitative Comparison of the Similarity between Genes and Geography in Worldwide Human Populations PLOS GENETICS Wang, C., Zoellner, S., Rosenberg, N. A. 2012; 8 (8)


    Multivariate statistical techniques such as principal components analysis (PCA) and multidimensional scaling (MDS) have been widely used to summarize the structure of human genetic variation, often in easily visualized two-dimensional maps. Many recent studies have reported similarity between geographic maps of population locations and MDS or PCA maps of genetic variation inferred from single-nucleotide polymorphisms (SNPs). However, this similarity has been evident primarily in a qualitative sense; and, because different multivariate techniques and marker sets have been used in different studies, it has not been possible to formally compare genetic variation datasets in terms of their levels of similarity with geography. In this study, using genome-wide SNP data from 128 populations worldwide, we perform a systematic analysis to quantitatively evaluate the similarity of genes and geography in different geographic regions. For each of a series of regions, we apply a Procrustes analysis approach to find an optimal transformation that maximizes the similarity between PCA maps of genetic variation and geographic maps of population locations. We consider examples in Europe, Sub-Saharan Africa, Asia, East Asia, and Central/South Asia, as well as in a worldwide sample, finding that significant similarity between genes and geography exists in general at different geographic levels. The similarity is highest in our examples for Asia and, once highly distinctive populations have been removed, Sub-Saharan Africa. Our results provide a quantitative assessment of the geographic structure of human genetic variation worldwide, supporting the view that geography plays a strong role in giving rise to human population structure.

    View details for DOI 10.1371/journal.pgen.1002886

    View details for Web of Science ID 000308529300044

    View details for PubMedID 22927824

  • Improvements to a Class of Distance Matrix Methods for Inferring Species Trees from Gene Trees JOURNAL OF COMPUTATIONAL BIOLOGY Helmkamp, L. J., Jewett, E. M., Rosenberg, N. A. 2012; 19 (6): 632-649


    Among the methods currently available for inferring species trees from gene trees, the GLASS method of Mossel and Roch (2010), the Shallowest Divergence (SD) method of Maddison and Knowles (2006), the STEAC method of Liu et al. (2009), and a related method that we call Minimum Average Coalescence (MAC) are computationally efficient and provide branch length estimates. Further, GLASS and STEAC have been shown to be consistent estimators of tree topology under a multispecies coalescent model. However, divergence time estimates obtained with these methods are all systematically biased under the model because the pairwise interspecific gene divergence times on which they rely must be more ancient than the species divergence time. Jewett and Rosenberg (2012) derived an expression for the bias of GLASS and used it to propose an improved method that they termed iGLASS. Here, we derive the biases of SD, STEAC, and MAC, and we propose improved analogues of these methods that we call iSD, iSTEAC, and iMAC. We conduct simulations to compare the performance of these methods with their original counterparts and with GLASS and iGLASS, finding that each of them decreases the bias and mean squared error of pairwise divergence time estimates. The new methods can therefore contribute to improvements in the estimation of species trees from information on gene trees.

    View details for DOI 10.1089/cmb.2012.0042

    View details for Web of Science ID 000305335100006

    View details for PubMedID 22697239

  • iGLASS: An Improvement to the GLASS Method for Estimating Species Trees from Gene Trees JOURNAL OF COMPUTATIONAL BIOLOGY Jewett, E. M., Rosenberg, N. A. 2012; 19 (3): 293-315


    Several methods have been designed to infer species trees from gene trees while taking into account gene tree/species tree discordance. Although some of these methods provide consistent species tree topology estimates under a standard model, most either do not estimate branch lengths or are computationally slow. An exception, the GLASS method of Mossel and Roch, is consistent for the species tree topology, estimates branch lengths, and is computationally fast. However, GLASS systematically overestimates divergence times, leading to biased estimates of species tree branch lengths. By assuming a multispecies coalescent model in which multiple lineages are sampled from each of two taxa at L independent loci, we derive the distribution of the waiting time until the first interspecific coalescence occurs between the two taxa, considering all loci and measuring from the divergence time. We then use the mean of this distribution to derive a correction to the GLASS estimator of pairwise divergence times. We show that our improved estimator, which we call iGLASS, consistently estimates the divergence time between a pair of taxa as the number of loci approaches infinity, and that it is an unbiased estimator of divergence times when one lineage is sampled per taxon. We also show that many commonly used clustering methods can be combined with the iGLASS estimator of pairwise divergence times to produce a consistent estimator of the species tree topology. Through simulations, we show that iGLASS can greatly reduce the bias and mean squared error in obtaining estimates of divergence times in a species tree.

    View details for DOI 10.1089/cmb.2011.0231

    View details for Web of Science ID 000301355100005

    View details for PubMedID 22216756
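
    The upward bias of GLASS and the idea of a mean-based correction, as described in the abstract above, can be illustrated in a toy case. The sketch assumes one sampled lineage per taxon, independent loci, a constant ancestral population size, and time measured in coalescent units, so that at each locus the waiting time beyond the divergence time tau is exponential with rate 1. Under those assumptions the minimum over L loci exceeds tau by 1/L on average. The function names and parameter values are hypothetical, and the general multi-lineage estimator from the paper is not reproduced here.

```python
import random


def simulate_glass(tau, num_loci, rng):
    """Minimum interspecific coalescence time across loci (GLASS-style)."""
    return min(tau + rng.expovariate(1.0) for _ in range(num_loci))


def mean_estimates(tau, num_loci, replicates, seed=1):
    """Average the raw and the mean-corrected estimates over replicates."""
    rng = random.Random(seed)
    raw = [simulate_glass(tau, num_loci, rng) for _ in range(replicates)]
    mean_raw = sum(raw) / replicates
    # The minimum of num_loci independent rate-1 exponential waiting
    # times has mean 1 / num_loci, which is subtracted as the correction.
    mean_corrected = mean_raw - 1.0 / num_loci
    return mean_raw, mean_corrected


mean_raw, mean_corrected = mean_estimates(tau=2.0, num_loci=10, replicates=20000)
print(mean_raw, mean_corrected)
```

    With tau = 2 and 10 loci, the raw estimates average close to 2.1 and the corrected estimates close to 2.0, consistent with the direction of the bias described in the abstract.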

  • Refining the relationship between homozygosity and the frequency of the most frequent allele JOURNAL OF MATHEMATICAL BIOLOGY Reddy, S. B., Rosenberg, N. A. 2012; 64 (1-2): 87-108


    Recent work has established that for an arbitrary genetic locus with its number of alleles unspecified, the homozygosity of the locus confines the frequency of the most frequent allele within a narrow range, and vice versa. Here we extend beyond this limiting case by investigating the relationship between homozygosity and the frequency of the most frequent allele when the number of alleles at the locus is treated as known. Given the homozygosity of a locus with at most K alleles, we find that by taking into account the value of K, the width of the allowed range for the frequency of the most frequent allele decreases from $2/3 - \pi^2/18 \approx 0.1184$ to $1/3 - 1/(3K) - \{K/[3(K-1)]\} \sum_{k=2}^{K} 1/k^2$. We further show that properties of the relationship between homozygosity and the frequency of the most frequent allele in the unspecified-K case can be obtained from the specified-K case by taking limits as K → ∞. The results contribute to a greater understanding of the mathematical properties of fundamental statistics employed in population-genetic analysis.

    View details for DOI 10.1007/s00285-011-0406-8

    View details for Web of Science ID 000298652400004

    View details for PubMedID 21305294
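
    The two width expressions quoted in the abstract above, 2/3 - π²/18 for an unspecified number of alleles and 1/3 - 1/(3K) - {K/[3(K - 1)]} times the sum of 1/k² for k from 2 to K, can be compared numerically. The sketch below only evaluates these expressions as stated; the function names are hypothetical and no part of the derivation is reproduced.

```python
import math


def width_unspecified_k():
    """Width quoted for a locus whose number of alleles is unspecified."""
    return 2.0 / 3.0 - math.pi ** 2 / 18.0


def width_specified_k(k_max):
    """Width quoted for a locus with at most k_max alleles (k_max >= 2)."""
    partial = sum(1.0 / (k * k) for k in range(2, k_max + 1))
    return (1.0 / 3.0
            - 1.0 / (3.0 * k_max)
            - k_max / (3.0 * (k_max - 1)) * partial)


for k_max in (2, 3, 10, 100, 100000):
    print(k_max, width_specified_k(k_max))
print("limit", width_unspecified_k())
```

    For K = 2 the expression evaluates to 0, as expected because the homozygosity of a biallelic locus determines the frequency of its most frequent allele exactly. As K grows, the values increase toward 2/3 - π²/18, in line with the limit statement in the abstract.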

  • The probability distribution of ranked gene trees on a species tree MATHEMATICAL BIOSCIENCES Degnan, J. H., Rosenberg, N. A., Stadler, T. 2012; 235 (1): 45-55


    The properties of random gene tree topologies have recently been studied under a coalescent model that treats a species tree as a fixed parameter. Here we develop the analogous theory for random ranked gene tree topologies, in which both the topology and the sequence of coalescences for a random gene tree are considered. We derive the probability distribution of ranked gene tree topologies conditional on a fixed species tree. We then show that similar to the unranked case, ranked gene trees that do not match either the ranking or the topology of the species tree can have greater probability than the matching ranked gene tree.

    View details for DOI 10.1016/j.mbs.2011.10.006

    View details for Web of Science ID 000299761300005

    View details for PubMedID 22075548

  • Haploscope: A Tool for the Graphical Display of Haplotype Structure in Populations GENETIC EPIDEMIOLOGY San Lucas, F. A., Rosenberg, N. A., Scheet, P. 2012; 36 (1): 17-21


    Patterns of linkage disequilibrium are often depicted pictorially by using tools that rely on visualizations of raw data or pairwise correlations among individual markers. Such approaches can fail to highlight some of the more interesting and complex features of haplotype structure. To enable natural visual comparisons of haplotype structure across subgroups of a population (e.g. isolated subpopulations or cases and controls), we propose an alternative visualization that provides a novel graphical representation of haplotype frequencies. We introduce Haploscope, a tool for visualizing the haplotype cluster frequencies that are produced by statistical models for population haplotype variation. We demonstrate the utility of our technique by examining haplotypes around the LCT gene, an example of recent positive selection, in samples from the Human Genome Diversity Panel. Haploscope, which has flexible options for annotation and inspection of haplotypes, is available for download at

    View details for DOI 10.1002/gepi.20640

    View details for Web of Science ID 000302244400003

    View details for PubMedID 22147662

  • A General Mechanistic Model for Admixture Histories of Hybrid Populations GENETICS Verdu, P., Rosenberg, N. A. 2011; 189 (4): 1413-?


    Admixed populations have been used for inferring migrations, detecting natural selection, and finding disease genes. These applications often use a simple statistical model of admixture rather than a modeling perspective that incorporates a more realistic history of the admixture process. Here, we develop a general model of admixture that mechanistically accounts for complex historical admixture processes. We consider two source populations contributing to the ancestry of a hybrid population, potentially with variable contributions across generations. For a random individual in the hybrid population at a given point in time, we study the fraction of genetic admixture originating from a specific one of the source populations by computing its moments as functions of time and of introgression parameters. We show that very different admixture processes can produce identical mean admixture proportions, but that such processes produce different values for the variance of the admixture proportion. When introgression parameters from each source population are constant over time, the long-term limit of the expectation of the admixture proportion depends only on the ratio of the introgression parameters. The variance of admixture decreases quickly over time after the source populations stop contributing to the hybrid population, but remains substantial when the contributions are ongoing. Our approach will facilitate the understanding of admixture mechanisms, illustrating how the moments of the distribution of admixture proportions can be informative about the historical admixture processes contributing to the genetic diversity of hybrid populations.

    View details for DOI 10.1534/genetics.111.132787

    View details for Web of Science ID 000298412100023

    View details for PubMedID 21968194
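
    The statement in the abstract above that, with constant introgression parameters, the long-term expected admixture proportion depends only on their ratio can be illustrated with a simple recursion. The sketch assumes that in each generation a parent of a hybrid individual comes from source 1 with probability s1, from source 2 with probability s2, and from the hybrid population otherwise, so that the expected fraction of ancestry from source 1 satisfies E[g+1] = s1 + (1 - s1 - s2) E[g]. The parameter names and this form of the recursion are assumptions made for illustration and are not taken verbatim from the paper.

```python
def expected_admixture(s1, s2, generations, start=0.5):
    """Iterate E[g+1] = s1 + h * E[g], where h = 1 - s1 - s2.

    s1 and s2 are the assumed constant per-generation contributions of
    the two source populations; start is an arbitrary initial expectation.
    """
    h = 1.0 - s1 - s2
    expectation = start
    for _ in range(generations):
        expectation = s1 + h * expectation
    return expectation


# Two parameter sets with the same ratio s1 : s2 = 1 : 3.
strong_flow = expected_admixture(0.10, 0.30, generations=500)
weak_flow = expected_admixture(0.02, 0.06, generations=2000)

print(strong_flow, weak_flow)
```

    Both runs converge to s1 / (s1 + s2) = 0.25, the fixed point of the recursion, even though the absolute contributions differ fivefold. The recursion tracks only the mean, so it cannot distinguish processes that, as the abstract notes, differ in the variance of the admixture proportion.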

  • A Test of the Influence of Continental Axes of Orientation on Patterns of Human Gene Flow AMERICAN JOURNAL OF PHYSICAL ANTHROPOLOGY Ramachandran, S., Rosenberg, N. A. 2011; 146 (4): 515-529


    The geographic distribution of genetic variation reflects trends in past population migrations and can be used to make inferences about these migrations. It has been proposed that the east-west orientation of the Eurasian landmass facilitated the rapid spread of ancient technological innovations across Eurasia, while the north-south orientation of the Americas led to a slower diffusion of technology there. If the diffusion of technology was accompanied by gene flow, then this hypothesis predicts that genetic differentiation in the Americas along lines of longitude will be greater than that in Eurasia along lines of latitude. We use 678 microsatellite loci from 68 indigenous populations in Eurasia and the Americas to investigate the spatial axes that underlie population-genetic variation. We find that genetic differentiation increases more rapidly along lines of longitude in the Americas than along lines of latitude in Eurasia. Distance along lines of latitude explains a sizeable portion of genetic distance in Eurasia, whereas distance along lines of longitude does not explain a large proportion of Eurasian genetic variation. Genetic differentiation in the Americas occurs along both latitudinal and longitudinal axes and has a greater magnitude than corresponding differentiation in Eurasia, even when adjusting for the lower level of genetic variation in the American populations. These results support the view that continental orientation has influenced migration patterns and has played an important role in determining both the structure of human genetic variation and the distribution and spread of cultural traits.

    View details for DOI 10.1002/ajpa.21533

    View details for Web of Science ID 000297311600004

    View details for PubMedID 21913175

  • Haplotype variation and genotype imputation in African populations GENETIC EPIDEMIOLOGY Huang, L., Jakobsson, M., Pemberton, T. J., Ibrahim, M., Nyambo, T., Omar, S., Pritchard, J. K., Tishkoff, S. A., Rosenberg, N. A. 2011; 35 (8): 766-780


    Sub-Saharan Africa has been identified as the part of the world with the greatest human genetic diversity. This high level of diversity causes difficulties for genome-wide association (GWA) studies in African populations, for example by reducing the accuracy of genotype imputation in African populations compared to non-African populations. Here, we investigate haplotype variation and imputation in Africa, using 253 unrelated individuals from 15 Sub-Saharan African populations. We identify the populations that provide the greatest potential for serving as reference panels for imputing genotypes in the remaining groups. Considering reference panels comprising samples of recent African descent in Phase 3 of the HapMap Project, we identify mixtures of reference groups that produce the maximal imputation accuracy in each of the sampled populations. We find that optimal HapMap mixtures and maximal imputation accuracies identified in detailed tests of imputation procedures can instead be predicted by using simple summary statistics that measure relationships between the pattern of genetic variation in a target population and the patterns in potential reference panels. Our results provide an empirical basis for facilitating the selection of reference panels in GWA studies of diverse human populations, especially those of African ancestry.

    View details for DOI 10.1002/gepi.20626

    View details for Web of Science ID 000297468600003

    View details for PubMedID 22125220

  • Mathematical properties of F-st between admixed populations and their parental source populations THEORETICAL POPULATION BIOLOGY Boca, S. M., Rosenberg, N. A. 2011; 80 (3): 208-216


    We consider the properties of the F(st) measure of genetic divergence between an admixed population and its parental source populations. Among all possible populations admixed among an arbitrary set of parental populations, we show that the value of F(st) between an admixed population and a specific source population is maximized when the admixed population is simply the most distant of the other source populations. For the case with only two parental populations, as a function of the admixture fraction, we further demonstrate that this F(st) value is monotonic and convex, so that F(st) is informative about the admixture fraction. We illustrate our results using example human population-genetic data, showing how they provide a framework in which to interpret the features of F(st) in admixed populations.

    View details for DOI 10.1016/j.tpb.2011.05.003

    View details for Web of Science ID 000295902300004

    View details for PubMedID 21640742

  • Coalescence-Time Distributions in a Serial Founder Model of Human Evolutionary History GENETICS DeGiorgio, M., Degnan, J. H., Rosenberg, N. A. 2011; 189 (2): 579-593


    Simulation studies have demonstrated that a variety of patterns in worldwide genetic variation are compatible with the trends predicted by a serial founder model, in which populations expand outward from an initial source via a process in which new populations contain only subsets of the genetic diversity present in their parental populations. Here, we provide analytical results for key quantities under the serial founder model, deriving distributions of coalescence times for pairs of lineages sampled either from the same population or from different populations. We use these distributions to obtain expectations for coalescence times and for homozygosity and heterozygosity values. A predicted approximate linear decline in expected heterozygosity with increasing distance from the source population reproduces a pattern that has been observed both in human genetic data and in simulations. Our formulas predict that populations close to the source location have lower between-population gene identity than populations far from the source, also mirroring results obtained from data and simulations. We show that different models that produce similar declining patterns in heterozygosity generate quite distinct patterns in coalescence-time distributions and gene identity measures, thereby providing a basis for distinguishing these models. We interpret the theoretical results in relation to their implications for human population genetics.

    View details for DOI 10.1534/genetics.111.129296

    View details for Web of Science ID 000296158500014

    View details for PubMedID 21775469

  • On the size distribution of private microsatellite alleles THEORETICAL POPULATION BIOLOGY Szpiech, Z. A., Rosenberg, N. A. 2011; 80 (2): 100-113


    Private microsatellite alleles tend to be found in the tails rather than in the interior of the allele size distribution. To explain this phenomenon, we have investigated the size distribution of private alleles in a coalescent model of two populations, assuming the symmetric stepwise mutation model as the mode of microsatellite mutation. For the case in which four alleles are sampled, two from each population, we condition on the configuration in which three distinct allele sizes are present, one of which is common to both populations, one of which is private to one population, and the third of which is private to the other population. Conditional on this configuration, we calculate the probability that the two private alleles occupy the two tails of the size distribution. This probability, which increases as a function of mutation rate and divergence time between the two populations, is seen to be greater than the value that would be predicted if there was no relationship between privacy and location in the allele size distribution. In accordance with the prediction of the model, we find that in pairs of human populations, the frequency with which private microsatellite alleles occur in the tails of the allele size distribution increases as a function of genetic differentiation between populations.

    View details for DOI 10.1016/j.tpb.2011.03.006

    View details for Web of Science ID 000293765500003

    View details for PubMedID 21514313

  • Inference on the strength of balancing selection for epistatically interacting loci THEORETICAL POPULATION BIOLOGY Buzbas, E. O., Joyce, P., Rosenberg, N. A. 2011; 79 (3): 102-113


    Existing inference methods for estimating the strength of balancing selection in multi-locus genotypes rely on the assumption that there are no epistatic interactions between loci. Complex systems in which balancing selection is prevalent, such as sets of human immune system genes, are known to contain components that interact epistatically. Therefore, current methods may not produce reliable inference on the strength of selection at these loci. In this paper, we address this problem by presenting statistical methods that can account for epistatic interactions in making inference about balancing selection. A theoretical result due to Fearnhead (2006) is used to build a multi-locus Wright-Fisher model of balancing selection, allowing for epistatic interactions among loci. Antagonistic and synergistic types of interactions are examined. The joint posterior distribution of the selection and mutation parameters is sampled by Markov chain Monte Carlo methods, and the plausibility of models is assessed via Bayes factors. As a component of the inference process, an algorithm to generate multi-locus allele frequencies under balancing selection models with epistasis is also presented. Recent evidence on interactions among a set of human immune system genes is introduced as a motivating biological system for the epistatic model, and data on these genes are used to demonstrate the methods.

    View details for DOI 10.1016/j.tpb.2011.01.002

    View details for Web of Science ID 000289045000006

    View details for PubMedID 21277883

  • Consistency Properties of Species Tree Inference by Minimizing Deep Coalescences JOURNAL OF COMPUTATIONAL BIOLOGY Than, C. V., Rosenberg, N. A. 2011; 18 (1): 1-15


    Methods for inferring species trees from sets of gene trees need to account for the possibility of discordance among the gene trees. Assuming that discordance is caused by incomplete lineage sorting, species tree estimates can be obtained by finding those species trees that minimize the number of "deep" coalescence events required for a given collection of gene trees. Efficient algorithms now exist for applying the minimizing-deep-coalescence (MDC) criterion, and simulation experiments have demonstrated its promising performance. However, it has also been noted from simulation results that the MDC criterion is not always guaranteed to infer the correct species tree estimate. In this article, we investigate the consistency of the MDC criterion. Using the multispecies coalescent model, we show that there are indeed anomaly zones for the MDC criterion for asymmetric four-taxon species tree topologies, and for all species tree topologies with five or more taxa.

    View details for DOI 10.1089/cmb.2010.0102

    View details for Web of Science ID 000285965600001

    View details for PubMedID 21210728

  • Unbiased Estimation of Gene Diversity in Samples Containing Related Individuals: Exact Variance and Arbitrary Ploidy GENETICS DeGiorgio, M., Jankovic, I., Rosenberg, N. A. 2010; 186 (4): 1367-1387


    Gene diversity, a commonly used measure of genetic variation, evaluates the proportion of heterozygous individuals expected at a locus in a population, under the assumption of Hardy-Weinberg equilibrium. When using the standard estimator of gene diversity, the inclusion of related or inbred individuals in a sample produces a downward bias. Here, we extend a recently developed estimator shown to be unbiased in a diploid autosomal sample that includes known related or inbred individuals to the general case of arbitrary ploidy. We derive an exact formula for the variance of the new estimator, H, and present an approximation to facilitate evaluation of the variance when each individual is related to at most one other individual in a sample. When examining samples from the human X chromosome, which represent a mixture of haploid and diploid individuals, we find that H performs favorably compared to the standard estimator, both in theoretical computations of mean squared error and in data analysis. We thus propose that H is a useful tool in characterizing gene diversity in samples of arbitrary ploidy that contain related or inbred individuals.

    View details for DOI 10.1534/genetics.110.121756

    View details for Web of Science ID 000285297000024

    View details for PubMedID 20923981
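
    For orientation, the "standard estimator" that the abstract above compares against can be sketched as follows. This is a minimal illustration, assuming the usual form n/(n-1) times one minus the sum of squared sample allele frequencies, where n is the number of sampled allele copies. The function name `gene_diversity` is illustrative only, and the relatedness-corrected estimator from the paper is not reproduced here.

```python
from collections import Counter


def gene_diversity(alleles):
    """Standard gene diversity estimate from a list of sampled allele copies.

    Assumed form: n / (n - 1) * (1 - sum_i p_i^2), where p_i is the sample
    frequency of allele i and n is the number of sampled allele copies.
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two sampled allele copies")
    counts = Counter(alleles)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Two alleles at equal sample frequency among four copies: 4/3 * (1 - 1/2).
balanced = gene_diversity(["A", "A", "B", "B"])
# A monomorphic sample has zero diversity.
monomorphic = gene_diversity(["A", "A", "A", "A"])
```

    Because the estimator treats the n allele copies as independent draws, copies contributed by relatives are effectively counted more than once, which is the source of the downward bias the abstract describes.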

  • Inference of Unexpected Genetic Relatedness among Individuals in HapMap Phase III AMERICAN JOURNAL OF HUMAN GENETICS Pemberton, T. J., Wang, C., Li, J. Z., Rosenberg, N. A. 2010; 87 (4): 457-464


    The International Haplotype Map Project (HapMap) has provided an essential database for studies of human population genetics and genome-wide association. Phases I and II of the HapMap project generated genotype data across ∼3 million SNP loci in 270 individuals representing four populations. Phase III provides dense genotype data on ∼1.5 million SNPs, generated by Illumina and Affymetrix platforms in a larger set of individuals. Release 3 of phase III of the HapMap contains 1397 individuals from 11 populations, including 250 of the original 270 phase I and phase II individuals and 1147 additional individuals. Although some known relationships among the phase III individuals have been described in the data release, the genotype data that are currently available provide an opportunity to empirically ascertain previously unknown relationships. We performed a systematic analysis of genetic relatedness and were able not only to confirm the reported relationships, but also to detect numerous additional, previously unidentified pairs of close relatives in the HapMap sample. The inferred relative pairs make it possible to propose standardized subsets of unrelated individuals for use in future studies in which relatedness needs to be clearly defined.

    View details for DOI 10.1016/j.ajhg.2010.08.014

    View details for Web of Science ID 000283037600002

    View details for PubMedID 20869033

  • MLH1 Founder Mutations with Moderate Penetrance in Spanish Lynch Syndrome Families CANCER RESEARCH Borras, E., Pineda, M., Blanco, I., Jewett, E. M., Wang, F., Teule, A., Caldes, T., Urioste, M., Martinez-Bouzas, C., Brunet, J., Balmana, J., Torres, A., Ramon y Cajal, T., Sanz, J., Perez-Cabornero, L., Castellvi-Bel, S., Alonso, A., Lanas, A., Gonzalez, S., Moreno, V., Gruber, S. B., Rosenberg, N. A., Mukherjee, B., Lazaro, C., Capella, G. 2010; 70 (19): 7379-7391


    The variants c.306+5G>A and c.1865T>A (p.Leu622His) of the DNA repair gene MLH1 occur frequently in Spanish Lynch syndrome families. To understand their ancestral history and clinical effect, we performed functional assays and a penetrance analysis and studied their genetic and geographic origins. Detailed family histories were taken from 29 carrier families. Functional analysis included in silico and in vitro assays at the RNA and protein levels. Penetrance was calculated using a modified segregation analysis adjusted for ascertainment. Founder effects were evaluated by haplotype analysis. The identified MLH1 c.306+5G>A and c.1865T>A (p.Leu622His) variants are absent in control populations and segregate with the disease. Tumors from carriers of both variants show microsatellite instability and loss of expression of the MLH1 protein. The c.306+5G>A variant is a pathogenic mutation affecting mRNA processing. The c.1865T>A (p.Leu622His) variant causes defects in MLH1 expression and stability. For both mutations, the estimated penetrance is moderate (age-cumulative colorectal cancer risk by age 70 of 20.1% and 14.1% for c.306+5G>A and of 6.8% and 7.3% for c.1865T>A in men and women carriers, respectively) in the lower range of variability estimated for other pathogenic Spanish MLH1 mutations. A common haplotype was associated with each of the identified mutations, confirming their founder origin. The ages of c.306+5G>A and c.1865T>A mutations were estimated to be 53 to 122 and 12 to 22 generations, respectively. Our results confirm the pathogenicity, moderate penetrance, and founder origin of the MLH1 c.306+5G>A and c.1865T>A mutations. These findings have important implications for genetic counseling and molecular diagnosis of Lynch syndrome.

    View details for DOI 10.1158/0008-5472.CAN-10-0570

    View details for Web of Science ID 000282647700003

    View details for PubMedID 20858721

  • Coalescent histories for discordant gene trees and species trees THEORETICAL POPULATION BIOLOGY Rosenberg, N. A., Degnan, J. H. 2010; 77 (3): 145-151


    Given a gene tree and a species tree, a coalescent history is a list of the branches of the species tree on which coalescences in the gene tree take place. Each pair consisting of a gene tree topology and a species tree topology has some number of possible coalescent histories. Here we show that, for each n ≥ 7, there exist a species tree topology S and a gene tree topology G ≠ S, both with n leaves, for which the number of coalescent histories exceeds the corresponding number of coalescent histories when the species tree topology is S and the gene tree topology is also S. This result has the interpretation that the gene tree topology G discordant with the species tree topology S can be produced by the evolutionary process in more ways than can the gene tree topology that matches the species tree topology, providing further insight into the surprising combinatorial properties of gene trees that arise from their joint consideration with species trees.

    View details for DOI 10.1016/j.tpb.2009.12.004

    View details for Web of Science ID 000276751300001

    View details for PubMedID 20064540

  • Genome-wide association studies in diverse populations NATURE REVIEWS GENETICS Rosenberg, N. A., Huang, L., Jewett, E. M., Szpiech, Z. A., Jankovic, I., Boehnke, M. 2010; 11 (5): 356-366


    Genome-wide association (GWA) studies have identified a large number of SNPs associated with disease phenotypes. As most GWA studies have been performed in populations of European descent, this Review examines the issues involved in extending the consideration of GWA studies to diverse worldwide populations. Although challenges exist with issues such as imputation, admixture and replication, investigation of a greater diversity of populations could make substantial contributions to the goal of mapping the genetic determinants of complex diseases for the human population as a whole.

    View details for DOI 10.1038/nrg2760

    View details for Web of Science ID 000276771400013

    View details for PubMedID 20395969

  • Lack of Population Diversity in Commonly Used Human Embryonic Stem-Cell Lines NEW ENGLAND JOURNAL OF MEDICINE Mosher, J. T., Pemberton, T. J., Harter, K., Wang, C., Buzbas, E. O., Dvorak, P., Simon, C., Morrison, S. J., Rosenberg, N. A. 2010; 362 (2): 183-185

    View details for DOI 10.1056/NEJMc0910371

    View details for Web of Science ID 000273558500033

    View details for PubMedID 20018958

  • Comparing Spatial Maps of Human Population-Genetic Variation Using Procrustes Analysis STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY Wang, C., Szpiech, Z. A., Degnan, J. H., Jakobsson, M., Pemberton, T. J., Hardy, J. A., Singleton, A. B., Rosenberg, N. A. 2010; 9 (1)


    Recent applications of principal components analysis (PCA) and multidimensional scaling (MDS) in human population genetics have found that "statistical maps" based on the genotypes in population-genetic samples often resemble geographic maps of the underlying sampling locations. To provide formal tests of these qualitative observations, we describe a Procrustes analysis approach for quantitatively assessing the similarity of population-genetic and geographic maps. We confirm in two scenarios, one using single-nucleotide polymorphism (SNP) data from Europe and one using SNP data worldwide, that a measurably high level of concordance exists between statistical maps of population-genetic variation and geographic maps of sampling locations. Two other examples illustrate the versatility of the Procrustes approach in population-genetic applications, verifying the concordance of SNP analyses using PCA and MDS, and showing that statistical maps of worldwide copy-number variants (CNVs) accord with statistical maps of SNP variation, especially when CNV analysis is limited to samples with the highest-quality data. As statistical maps with PCA and MDS have become increasingly common for use in summarizing population relationships, our examples highlight the potential of Procrustes-based quantitative comparisons for interpreting the results in these maps.

    View details for DOI 10.2202/1544-6115.1493

    View details for Web of Science ID 000274198200007

    View details for PubMedID 20196748

  • Sequence determinants of human microsatellite variability BMC GENOMICS Pemberton, T. J., Sandefur, C. I., Jakobsson, M., Rosenberg, N. A. 2009; 10


    Microsatellite loci are frequently used in genomic studies of DNA sequence repeats and in population studies of genetic variability. To investigate the effect of sequence properties of microsatellites on their level of variability we have analyzed genotypes at 627 microsatellite loci in 1,048 worldwide individuals from the HGDP-CEPH cell line panel together with the DNA sequences of these microsatellites in the human RefSeq database. Calibrating PCR fragment lengths in individual genotypes by using the RefSeq sequence enabled us to infer repeat number in the HGDP-CEPH dataset and to calculate the mean number of repeats (as opposed to the mean PCR fragment length), under the assumption that differences in PCR fragment length reflect differences in the numbers of repeats in the embedded repeat sequences. We find the mean and maximum numbers of repeats across individuals to be positively correlated with heterozygosity. The size and composition of the repeat unit of a microsatellite are also important factors in predicting heterozygosity, with tetra-nucleotide repeat units high in G/C content leading to higher heterozygosity. Finally, we find that microsatellites containing more separate sets of repeated motifs generally have higher heterozygosity. These results suggest that sequence properties of microsatellites have a significant impact in determining the features of human microsatellite variability.

    View details for DOI 10.1186/1471-2164-10-612

    View details for Web of Science ID 000273570800002

    View details for PubMedID 20015383

  • Genomic microsatellites identify shared Jewish ancestry intermediate between Middle Eastern and European populations BMC GENETICS Kopelman, N. M., Stone, L., Wang, C., Gefel, D., Feldman, M. W., Hillel, J., Rosenberg, N. A. 2009; 10


    Genetic studies have often produced conflicting results on the question of whether distant Jewish populations in different geographic locations share greater genetic similarity to each other or, instead, to nearby non-Jewish populations. We perform a genome-wide population-genetic study of Jewish populations, analyzing 678 autosomal microsatellite loci in 78 individuals from four Jewish groups together with similar data on 321 individuals from 12 non-Jewish Middle Eastern and European populations. We find that the Jewish populations show a high level of genetic similarity to each other, clustering together in several types of analysis of population structure. Further, Bayesian clustering, neighbor-joining trees, and multidimensional scaling place the Jewish populations as intermediate between the non-Jewish Middle Eastern and European populations. These results support the view that the Jewish populations largely share a common Middle Eastern ancestry and that over their history they have undergone varying degrees of admixture with non-Jewish populations of European descent.

    View details for DOI 10.1186/1471-2156-10-80

    View details for Web of Science ID 000273553900001

    View details for PubMedID 19995433

  • The Relationship between Imputation Error and Statistical Power in Genetic Association Studies in Diverse Populations AMERICAN JOURNAL OF HUMAN GENETICS Huang, L., Wang, C., Rosenberg, N. A. 2009; 85 (5): 692-698


    Genotype-imputation methods provide an essential technique for high-resolution genome-wide association (GWA) studies with millions of single-nucleotide polymorphisms. For optimal design and interpretation of imputation-based GWA studies, it is important to understand the connection between imputation error and power to detect associations at imputed markers. Here, using a 2x3 chi-square test, we describe a relationship between genotype-imputation error rates and the sample-size inflation required for achieving statistical power at an imputed marker equal to that obtained if genotypes at the marker were known with certainty. Surprisingly, typical imputation error rates (approximately 2%-6%) lead to a large increase in the required sample size (approximately 10%-60%), and in some African populations whose genotypes are particularly difficult to impute, the required sample-size increase is as high as approximately 30%-150%. In most populations, each 1% increase in imputation error leads to an increase of approximately 5%-13% in the sample size required for maintaining power. These results imply that in GWA sample-size calculations investigators will need to account for a potentially considerable loss of power from even low levels of imputation error and that development of additional genomic resources that decrease imputation error will translate into substantial reduction in the sample sizes needed for imputation-based detection of the variants that underlie complex human diseases.

    View details for DOI 10.1016/j.ajhg.2009.09.017

    View details for Web of Science ID 000271916500015

    View details for PubMedID 19853241

  • Out of Africa: modern human origins special feature: explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa. Proceedings of the National Academy of Sciences of the United States of America DeGiorgio, M., Jakobsson, M., Rosenberg, N. A. 2009; 106 (38): 16057-16062


    Studies of worldwide human variation have discovered three trends in summary statistics as a function of increasing geographic distance from East Africa: a decrease in heterozygosity, an increase in linkage disequilibrium (LD), and a decrease in the slope of the ancestral allele frequency spectrum. Forward simulations of unlinked loci have shown that the decline in heterozygosity can be described by a serial founder model, in which populations migrate outward from Africa through a process where each of a series of populations is formed from a subset of the previous population in the outward expansion. Here, we extend this approach by developing a retrospective coalescent-based serial founder model that incorporates linked loci. Our model both recovers the observed decline in heterozygosity with increasing distance from Africa and produces the patterns observed in LD and the ancestral allele frequency spectrum. Surprisingly, although migration between neighboring populations and limited admixture between modern and archaic humans can be accommodated in the model while continuing to explain the three trends, a competing model in which a wave of outward modern human migration expands into a series of preexisting archaic populations produces nearly opposite patterns to those observed in the data. We conclude by developing a simpler model to illustrate that the feature that permits the serial founder model but not the archaic persistence model to explain the three trends observed with increasing distance from Africa is its incorporation of a cumulative effect of genetic drift as humans colonized the world.

    View details for DOI 10.1073/pnas.0903341106

    View details for PubMedID 19706453

  • Replication of Genetic Associations as Pseudoreplication due to Shared Genealogy GENETIC EPIDEMIOLOGY Rosenberg, N. A., VanLiere, J. M. 2009; 33 (6): 479-487


    The genotypes of individuals in replicate genetic association studies have some level of correlation due to shared descent in the complete pedigree of all living humans. As a result of this genealogical sharing, replicate studies that search for genotype-phenotype associations using linkage disequilibrium between marker loci and disease-susceptibility loci can be considered as "pseudoreplicates" rather than true replicates. We examine the size of the pseudoreplication effect in association studies simulated from evolutionary models of the history of a population, evaluating the excess probability that both of a pair of studies detect a disease association compared to the probability expected under the assumption that the two studies are independent. Each of nine combinations of a demographic model and a penetrance model leads to a detectable pseudoreplication effect, suggesting that the degree of support that can be attributed to a replicated genetic association result is less than that which can be attributed to a replicated result in a context of true independence.

    View details for DOI 10.1002/gepi.20400

    View details for Web of Science ID 000269432400002

    View details for PubMedID 19191270

  • Gene tree discordance, phylogenetic inference and the multispecies coalescent TRENDS IN ECOLOGY & EVOLUTION Degnan, J. H., Rosenberg, N. A. 2009; 24 (6): 332-340


    The field of phylogenetics is entering a new era in which trees of historical relationships between species are increasingly inferred from multilocus and genomic data. A major challenge for incorporating such large amounts of data into inference of species trees is that conflicting genealogical histories often exist in different genes throughout the genome. Recent advances in genealogical modeling suggest that resolving close species relationships is not quite as simple as applying more data to the problem. Here we discuss the complexities of genealogical discordance and review the issues that new methods for multilocus species tree inference will need to address to account successfully for naturally occurring genomic variability in evolutionary histories.

    View details for DOI 10.1016/j.tree.2009.01.009

    View details for Web of Science ID 000267008900007

    View details for PubMedID 19307040
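
    A small worked case may make the notion of gene tree discordance concrete. The sketch below assumes the classical three-taxon result under the multispecies coalescent, which is not stated in the abstract itself: for a rooted species tree ((A,B),C) whose internal branch has length T in coalescent units, the matching gene tree topology has probability 1 - (2/3)e^(-T), and each of the two discordant topologies has probability (1/3)e^(-T). The function name is illustrative.

```python
import math


def three_taxon_gene_tree_probs(t):
    """Gene tree topology probabilities for a rooted three-taxon species tree.

    t is the internal branch length in coalescent units (assumed >= 0).
    With probability exp(-t) the two sister lineages fail to coalesce on the
    internal branch, after which all three topologies are equally likely.
    """
    if t < 0:
        raise ValueError("branch length must be non-negative")
    each_discordant = math.exp(-t) / 3.0
    return {
        "concordant": 1.0 - 2.0 * each_discordant,
        "each_discordant": each_discordant,
    }


# A short internal branch leaves the three topologies nearly equiprobable;
# a long one makes the concordant topology dominate.
short_branch = three_taxon_gene_tree_probs(0.1)
long_branch = three_taxon_gene_tree_probs(5.0)
```

    With three taxa the concordant topology is always at least as probable as either discordant one, so this case shows how common discordance can be without by itself showing the harder situations the review discusses for larger trees.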

  • Haplotypic Background of a Private Allele at High Frequency in the Americas MOLECULAR BIOLOGY AND EVOLUTION Schroeder, K. B., Jakobsson, M., Crawford, M. H., Schurr, T. G., Boca, S. M., Conrad, D. F., Tito, R. Y., Osipova, L. P., Tarskaia, L. A., Zhadanov, S. I., Wall, J. D., Pritchard, J. K., Malhi, R. S., Smith, D. G., Rosenberg, N. A. 2009; 26 (5): 995-1016


    Recently, the observation of a high-frequency private allele, the 9-repeat allele at microsatellite D9S1120, in all sampled Native American and Western Beringian populations has been interpreted as evidence that all modern Native Americans descend primarily from a single founding population. However, this inference assumed that all copies of the 9-repeat allele were identical by descent and that the geographic distribution of this allele had not been influenced by natural selection. To investigate whether these assumptions are satisfied, we genotyped 34 single nucleotide polymorphisms across approximately 500 kilobases (kb) around D9S1120 in 21 Native American and Western Beringian populations and 54 other worldwide populations. All chromosomes with the 9-repeat allele share the same haplotypic background in the vicinity of D9S1120, suggesting that all sampled copies of the 9-repeat allele are identical by descent. Ninety-one percent of these chromosomes share the same 76.26 kb haplotype, which we call the "American Modal Haplotype" (AMH). Three observations lead us to conclude that the high frequency and widespread distribution of the 9-repeat allele are unlikely to be the result of positive selection: 1) aside from its association with the 9-repeat allele, the AMH does not have a high frequency in the Americas, 2) the AMH is not unusually long for its frequency compared with other haplotypes in the Americas, and 3) in Latin American mestizo populations, the proportion of Native American ancestry at D9S1120 is not unusual compared with that observed at other genomewide microsatellites. Using a new method for estimating the time to the most recent common ancestor (MRCA) of all sampled copies of an allele on the basis of an estimate of the length of the genealogy descended from the MRCA, we calculate the mean time to the MRCA of the 9-repeat allele to be between 7,325 and 39,900 years, depending on the demographic model used. The results support the hypothesis that all modern Native Americans and Western Beringians trace a large portion of their ancestry to a single founding population that may have been isolated from other Asian populations prior to expanding into the Americas.

    View details for DOI 10.1093/molbev/msp024

    View details for Web of Science ID 000265274000005

    View details for PubMedID 19221006

  • An Unbiased Estimator of Gene Diversity in Samples Containing Related Individuals MOLECULAR BIOLOGY AND EVOLUTION DeGiorgio, M., Rosenberg, N. A. 2009; 26 (3): 501-512


    Gene diversity is sometimes estimated from samples that contain inbred or related individuals. If inbred or related individuals are included in a sample, then the standard estimator for gene diversity produces a downward bias caused by an inflation of the variance of estimated allele frequencies. We develop an unbiased estimator for gene diversity that relies on kinship coefficients for pairs of individuals with known relationship and that reduces to the standard estimator when all individuals are noninbred and unrelated. Applying our estimator to data simulated based on allele frequencies observed for microsatellite loci in human populations, we find that the new estimator performs favorably compared with the standard estimator in terms of bias and similarly in terms of mean squared error. For human population-genetic data, we find that a close linear relationship previously seen between gene diversity and distance from East Africa is preserved when adjusting for the inclusion of close relatives.

    View details for DOI 10.1093/molbev/msn254

    View details for Web of Science ID 000263420900004

    View details for PubMedID 18988687

  • Genotype-Imputation Accuracy across Worldwide Human Populations AMERICAN JOURNAL OF HUMAN GENETICS Huang, L., Li, Y., Singleton, A. B., Hardy, J. A., Abecasis, G., Rosenberg, N. A., Scheet, P. 2009; 84 (2): 235-250


    A current approach to mapping complex-disease-susceptibility loci in genome-wide association (GWA) studies involves leveraging the information in a reference database of dense genotype data. By modeling the patterns of linkage disequilibrium in a reference panel, genotypes not directly measured in the study samples can be imputed and tested for disease association. This imputation strategy has been successful for GWA studies in populations well represented by existing reference panels. We used genotypes at 513,008 autosomal single-nucleotide polymorphism (SNP) loci in 443 unrelated individuals from 29 worldwide populations to evaluate the "portability" of the HapMap reference panels for imputation in studies of diverse populations. When a single HapMap panel was leveraged for imputation of randomly masked genotypes, European populations had the highest imputation accuracy, followed by populations from East Asia, Central and South Asia, the Americas, Oceania, the Middle East, and Africa. For each population, we identified "optimal" mixtures of reference panels that maximized imputation accuracy, and we found that in most populations, mixtures including individuals from at least two HapMap panels produced the highest imputation accuracy. From a separate survey of additional SNPs typed in the same samples, we evaluated imputation accuracy in the scenario in which all genotypes at a given SNP position were unobserved and were imputed on the basis of data from a commercial "SNP chip," again finding that most populations benefited from the use of combinations of two or more HapMap reference panels. Our results can serve as a guide for selecting appropriate reference panels for imputation-based GWA analysis in diverse populations.

    View details for DOI 10.1016/j.ajhg.2009.01.013

    View details for Web of Science ID 000263799700013

    View details for PubMedID 19215730

  • Properties of Consensus Methods for Inferring Species Trees from Gene Trees SYSTEMATIC BIOLOGY Degnan, J. H., DeGiorgio, M., Bryant, D., Rosenberg, N. A. 2009; 58 (1): 35-54


    Consensus methods provide a useful strategy for summarizing information from a collection of gene trees. An important application of consensus methods is to combine gene trees to estimate a species tree. To investigate the theoretical properties of consensus trees that would be obtained from large numbers of loci evolving according to a basic evolutionary model, we construct consensus trees from rooted gene trees that occur in proportion to gene-tree probabilities derived from coalescent theory. We consider majority-rule, rooted triple (R*), and greedy consensus trees obtained from known, rooted gene trees, both in the asymptotic case as numbers of gene trees approach infinity and for finite numbers of genes. Our results show that for some combinations of species-tree branch lengths, increasing the number of independent loci can make the rooted majority-rule consensus tree more likely to be at least partially unresolved. However, the probability that the R* consensus tree has the species-tree topology approaches 1 as the number of gene trees approaches infinity. Although the greedy consensus algorithm can be the quickest to converge on the correct species-tree topology when increasing the number of gene trees, it can also be positively misleading. The majority-rule consensus tree is not a misleading estimator of the species-tree topology, and the R* consensus tree is a statistically consistent estimator of the species-tree topology. Our results therefore suggest a method for using multiple loci to infer the species-tree topology, even when it is discordant with the most likely gene tree.

    View details for DOI 10.1093/sysbio/syp008

    View details for Web of Science ID 000266970700003

    View details for PubMedID 20525567

  • Population differentiation and migration: Coalescence times in a two-sex island model for autosomal and X-linked loci THEORETICAL POPULATION BIOLOGY Ramachandran, S., Rosenberg, N. A., Feldman, M. W., Wakeley, J. 2008; 74 (4): 291-301


    Evolutionists have debated whether population-genetic parameters, such as effective population size and migration rate, differ between males and females. In humans, most analyses of this problem have focused on the Y chromosome and the mitochondrial genome, while the X chromosome has largely been omitted from the discussion. Past studies have compared F_ST values for the Y chromosome and mitochondrion under a model with migration rates that differ between the sexes but with equal male and female population sizes. In this study we investigate rates of coalescence for X-linked and autosomal lineages in an island model with different population sizes and migration rates for males and females, obtaining the mean time to coalescence for pairs of lineages from the same deme and for pairs of lineages from different demes. We apply our results to microsatellite data from the Human Genome Diversity Panel, and we examine the male and female migration rates implied by observed F_ST values.

    View details for DOI 10.1016/j.tpb.2008.08.003

    View details for Web of Science ID 000261533200002

    View details for PubMedID 18817799

  • ADZE: a rarefaction approach for counting alleles private to combinations of populations BIOINFORMATICS Szpiech, Z. A., Jakobsson, M., Rosenberg, N. A. 2008; 24 (21): 2498-2504


    Analysis of the distribution of alleles across populations is a useful tool for examining population diversity and relationships. However, sample sizes often differ across populations, sometimes making it difficult to assess allelic distributions across groups. We introduce a generalized rarefaction approach for counting alleles private to combinations of populations. Our method evaluates the number of alleles found in each of a set of populations but absent in all remaining populations, considering equal-sized subsamples from each population. Applying this method to a worldwide human microsatellite dataset, we observe a high number of alleles private to the combination of African and Oceanian populations. This result supports the possibility of a migration out of Africa into Oceania separate from the migrations responsible for the majority of the ancestry of the modern populations of Asia, and it highlights the utility of our approach to sample size correction in evaluating hypotheses about population history. We have implemented our method in the computer program ADZE, which is available for download at

    View details for DOI 10.1093/bioinformatics/btn478

    View details for Web of Science ID 000260381200012

    View details for PubMedID 18779233
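
The rarefaction idea in the ADZE abstract can be illustrated with a minimal sketch. It assumes the usual hypergeometric rarefaction probability (the chance that an allele is missing from a subsample of g gene copies) and is not taken from the ADZE source code; the function names and the toy counts are hypothetical.

```python
from math import comb

def absent_prob(n_total, n_allele, g):
    """Probability that an allele with n_allele copies among n_total sampled
    gene copies is missing from a random subsample of size g (g <= n_total)."""
    return comb(n_total - n_allele, g) / comb(n_total, g)

def private_to_combination(counts, combo, g):
    """Expected number of alleles present in every population in `combo`
    and absent from all other populations, for subsamples of size g.

    counts: dict mapping population name -> dict of allele -> copy count.
    """
    alleles = set().union(*(c.keys() for c in counts.values()))
    total = 0.0
    for allele in alleles:
        term = 1.0
        for pop, c in counts.items():
            q = absent_prob(sum(c.values()), c.get(allele, 0), g)
            # Present in populations of the combination, absent elsewhere.
            term *= (1.0 - q) if pop in combo else q
        total += term
    return total

# Hypothetical toy data: two populations, ten gene copies each.
counts = {"A": {"a": 5, "b": 5}, "B": {"a": 10}}
print(private_to_combination(counts, {"A"}, 2))
print(private_to_combination(counts, {"A", "B"}, 2))
```

Because every population is reduced to the same subsample size g, populations with larger samples do not appear richer in private alleles simply because more gene copies were observed.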

  • Mathematical properties of the r^2 measure of linkage disequilibrium THEORETICAL POPULATION BIOLOGY VanLiere, J. M., Rosenberg, N. A. 2008; 74 (1): 130-137


    Statistics for linkage disequilibrium (LD), the non-random association of alleles at two loci, depend on the frequencies of the alleles at the loci under consideration. Here, we examine the r^2 measure of LD and its mathematical relationship to allele frequencies, quantifying the constraints on its maximum value. Assuming independent uniform distributions for the allele frequencies of two biallelic loci, we find that the mean maximum value of r^2 is approximately 0.43051, and that r^2 can exceed a threshold of 4/5 in only approximately 14.232% of the allele frequency space. If one locus is assumed to have known allele frequencies--the situation in an association study in which LD between a known marker locus and an unknown trait locus is of interest--we find that the mean maximum value of r^2 is greatest when the known locus has a minor allele frequency of approximately 0.30131. We find that in 1/4 of the space of allowed values of minor allele frequencies and haplotype frequencies at a pair of loci, the unconstrained maximum r^2 allowing for the possibility of recombination between the loci exceeds the constrained maximum assuming that no recombination has occurred. Finally, we use r^2_max to examine the connection between r^2 and the D' measure of linkage disequilibrium, finding that r^2/r^2_max = D'^2 for approximately 72.683% of the space of allowed values of (p_a, p_b, p_ab). Our results concerning the properties of r^2 have the potential to inform the interpretation of unusual LD behavior and to assist in the design of LD-based association-mapping studies.

    View details for DOI 10.1016/j.tpb.2008.05.006

    View details for Web of Science ID 000257912400014

    View details for PubMedID 18572214
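
The constraint that allele frequencies place on r^2 can be checked numerically. The sketch below assumes the standard definitions D = p_ab - p_a p_b and r^2 = D^2 / (p_a (1 - p_a) p_b (1 - p_b)), together with the requirement that all four haplotype frequencies be non-negative; the function names are hypothetical, and the grid average is only a rough numerical check of the mean reported in the abstract.

```python
def r2_max(p_a, p_b):
    """Largest r^2 attainable at two biallelic loci with allele
    frequencies p_a and p_b (both strictly between 0 and 1)."""
    # Haplotype frequencies must stay non-negative, which bounds D on each side.
    d_pos = min(p_a * (1 - p_b), (1 - p_a) * p_b)   # largest positive D
    d_neg = min(p_a * p_b, (1 - p_a) * (1 - p_b))   # largest |D| for negative D
    d = max(d_pos, d_neg)
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

def mean_r2_max(n=200):
    """Midpoint-grid average of r2_max over the unit square of (p_a, p_b)."""
    pts = [(i + 0.5) / n for i in range(n)]
    return sum(r2_max(a, b) for a in pts for b in pts) / (n * n)

print(r2_max(0.5, 0.5))   # equal frequencies allow complete LD
print(r2_max(0.1, 0.5))   # mismatched frequencies cap r^2 well below 1
print(mean_r2_max())      # compare with the mean quoted in the abstract
```

The second call shows why a marker with a minor allele frequency of 0.1 can never tag a trait allele at frequency 0.5 with high r^2, whatever the haplotype structure.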

  • The relationship between homozygosity and the frequency of the most frequent allele GENETICS Rosenberg, N. A., Jakobsson, M. 2008; 179 (4): 2027-2036


    Homozygosity is a commonly used summary of allele-frequency distributions at polymorphic loci. Because high-frequency alleles contribute disproportionately to the homozygosity of a locus, it often occurs that most homozygotes are homozygous for the most frequent allele. To assess the relationship between homozygosity and the highest allele frequency at a locus, for a given homozygosity value, we determine the lower and upper bounds on the frequency of the most frequent allele. These bounds suggest tight constraints on the frequency of the most frequent allele as a function of homozygosity, differing by at most 1/4 and having an average difference of 2/3 - pi^2/18, approximately 0.1184. The close connection between homozygosity and the frequency of the most frequent allele--which we illustrate using allele frequencies from human populations--has the consequence that when one of these two quantities is known, considerable information is available about the other quantity. This relationship also explains the similar performance of statistical tests of population-genetic models that rely on homozygosity and those that rely on the frequency of the most frequent allele, and it provides a basis for understanding the utility of extended homozygosity statistics in identifying haplotypes that have been elevated to high frequency as a result of positive selection.

    View details for DOI 10.1534/genetics.107.084772

    View details for Web of Science ID 000258591200024

    View details for PubMedID 18689892
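
A short numerical sketch of such bounds follows. The abstract does not state the bounds explicitly, so the expressions here are assumptions: the upper bound uses H >= M^2, and the lower bound spreads frequency as evenly as possible over K = ceil(1/H) alleles. Under these assumptions the largest gap and the average gap agree with the values quoted in the abstract.

```python
from math import ceil, sqrt

def max_freq_bounds(h):
    """Assumed lower and upper bounds on the frequency M of the most frequent
    allele for a locus with homozygosity h, where 0 < h <= 1."""
    upper = sqrt(h)              # h is a sum of squares that includes M^2
    k = ceil(1.0 / h)            # fewest alleles that can give homozygosity h
    if k == 1:                   # h == 1: a single allele at frequency 1
        return 1.0, 1.0
    # k - 1 alleles share the top frequency M; one allele takes the remainder.
    lower = (1.0 + sqrt(max(0.0, (k * h - 1.0) / (k - 1.0)))) / k
    return lower, upper

def mean_gap(n=20000):
    """Midpoint-grid average of (upper - lower) over homozygosity in (0, 1)."""
    total = 0.0
    for i in range(n):
        lower, upper = max_freq_bounds((i + 0.5) / n)
        total += upper - lower
    return total / n

print(max_freq_bounds(0.25))   # the gap is widest here
print(mean_gap())              # compare with 2/3 - pi^2/18
```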

  • Using population mixtures to optimize the utility of genomic databases: Linkage disequilibrium and association study design in India ANNALS OF HUMAN GENETICS Pemberton, T. J., Jakobsson, M., Conrad, D. F., Coop, G., Wall, J. D., Pritchard, J. K., Patel, P. I., Rosenberg, N. A. 2008; 72: 535-546


    When performing association studies in populations that have not been the focus of large-scale investigations of haplotype variation, it is often helpful to rely on genomic databases in other populations for study design and analysis - such as in the selection of tag SNPs and in the imputation of missing genotypes. One way of improving the use of these databases is to rely on a mixture of database samples that is similar to the population of interest, rather than using the single most similar database sample. We demonstrate the effectiveness of the mixture approach in the application of African, European, and East Asian HapMap samples for tag SNP selection in populations from India, a genetically intermediate region underrepresented in genomic studies of haplotype variation.

    View details for DOI 10.1111/j.1469-1809.2008.00457.x

    View details for Web of Science ID 000256684900009

    View details for PubMedID 18513279

  • Demographic history of European populations of Arabidopsis thaliana PLOS GENETICS Francois, O., Blum, M. G., Jakobsson, M., Rosenberg, N. A. 2008; 4 (5)


    The model plant species Arabidopsis thaliana is successful at colonizing land that has recently undergone human-mediated disturbance. To investigate the prehistoric spread of A. thaliana, we applied approximate Bayesian computation and explicit spatial modeling to 76 European accessions sequenced at 876 nuclear loci. We find evidence that a major migration wave occurred from east to west, affecting most of the sampled individuals. The longitudinal gradient appears to result from the plant having spread in Europe from the east approximately 10,000 years ago, with a rate of westward spread of approximately 0.9 km/year. This wave-of-advance model is consistent with a natural colonization from an eastern glacial refugium that overwhelmed ancient western lineages. However, the speed and time frame of the model also suggest that the migration of A. thaliana into Europe may have accompanied the spread of agriculture during the Neolithic transition.

    View details for DOI 10.1371/journal.pgen.1000075

    View details for Web of Science ID 000256869100015

    View details for PubMedID 18483550

  • Genetic variation and population structure in Native Americans PLOS GENETICS Wang, S., Lewis, C. M., Jakobsson, M., Ramachandran, S., Ray, N., Bedoya, G., Rojas, W., Parra, M. V., Molina, J. A., Gallo, C., Mazzotti, G., Poletti, G., Hill, K., Hurtado, A. M., Labuda, D., Klitz, W., Barrantes, R., Bortolini, M. C., Salzano, F. M., Petzl-Erler, M. L., Tsuneto, L. T., Llop, E., Rothhammer, F., Excoffier, L., Feldman, M. W., Rosenberg, N. A., Ruiz-Linares, A. 2007; 3 (11): 2049-2067


    We examined genetic diversity and population structure in the American landmass using 678 autosomal microsatellite markers genotyped in 422 individuals representing 24 Native American populations sampled from North, Central, and South America. These data were analyzed jointly with similar data available in 54 other indigenous populations worldwide, including an additional five Native American groups. The Native American populations have lower genetic diversity and greater differentiation than populations from other continental regions. We observe gradients both of decreasing genetic diversity as a function of geographic distance from the Bering Strait and of decreasing genetic similarity to Siberians--signals of the southward dispersal of human populations from the northwestern tip of the Americas. We also observe evidence of: (1) a higher level of diversity and lower level of population structure in western South America compared to eastern South America, (2) a relative lack of differentiation between Mesoamerican and Andean populations, (3) a scenario in which coastal routes were easier for migrating peoples to traverse in comparison with inland routes, and (4) a partial agreement on a local scale between genetic similarity and the linguistic classification of populations. These findings offer new insights into the process of population dispersal and differentiation during the peopling of the Americas.

    View details for DOI 10.1371/journal.pgen.0030185

    View details for Web of Science ID 000251310200002

    View details for PubMedID 18039031

  • CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure BIOINFORMATICS Jakobsson, M., Rosenberg, N. A. 2007; 23 (14): 1801-1806


    Clustering of individuals into populations on the basis of multilocus genotypes is informative in a variety of settings. In population-genetic clustering algorithms, such as BAPS, STRUCTURE and TESS, individual multilocus genotypes are partitioned over a set of clusters, often using unsupervised approaches that involve stochastic simulation. As a result, replicate cluster analyses of the same data may produce several distinct solutions for estimated cluster membership coefficients, even though the same initial conditions were used. Major differences among clustering solutions have two main sources: (1) 'label switching' of clusters across replicates, caused by the arbitrary way in which clusters in an unsupervised analysis are labeled, and (2) 'genuine multimodality,' truly distinct solutions across replicates. To facilitate the interpretation of population-genetic clustering results, we describe three algorithms for aligning multiple replicate analyses of the same data set. We have implemented these algorithms in the computer program CLUMPP (CLUster Matching and Permutation Program). We illustrate the use of CLUMPP by aligning the cluster membership coefficients from 100 replicate cluster analyses of 600 chickens from 20 different breeds. CLUMPP is freely available at

    View details for DOI 10.1093/bioinformatics/btm233

    View details for Web of Science ID 000249248300012

    View details for PubMedID 17485429
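
Label switching can be illustrated with a small sketch that aligns one replicate to a reference by trying every permutation of cluster labels. This is an illustration of the problem only, not a reproduction of any of the three CLUMPP algorithms; the squared-difference criterion, the function name, and the toy matrices are assumptions made for the example.

```python
from itertools import permutations

def align_labels(q_ref, q_rep):
    """Permute the cluster columns of q_rep to best match q_ref.

    Both arguments are lists of rows; each row holds one individual's
    membership coefficients across K clusters.
    Returns (best_permutation, permuted_matrix).
    """
    k = len(q_ref[0])

    def sq_dist(perm):
        # Sum of squared differences after relabeling q_rep's clusters.
        return sum((ref_row[c] - rep_row[perm[c]]) ** 2
                   for ref_row, rep_row in zip(q_ref, q_rep)
                   for c in range(k))

    best = min(permutations(range(k)), key=sq_dist)
    return best, [[row[j] for j in best] for row in q_rep]

# Hypothetical replicate: same solution as q_ref with cluster labels shuffled.
q_ref = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
q_rep = [[0.0, 0.9, 0.1], [0.1, 0.1, 0.8], [0.8, 0.0, 0.2]]
perm, aligned = align_labels(q_ref, q_rep)
print(perm, aligned)
```

Exhaustive search costs K! comparisons, so it is only practical for small K. A large residual distance after the best relabeling would point to genuinely different solutions rather than label switching.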

  • Estimating the number of ancestral lineages using a maximum-likelihood method based on rejection sampling GENETICS Blum, M. G., Rosenberg, N. A. 2007; 176 (3): 1741-1757


    Estimating the number of ancestral lineages of a sample of DNA sequences at time t in the past can be viewed as a variation on the problem of estimating the time to the most recent common ancestor. To estimate the number of ancestral lineages, we develop a maximum-likelihood approach that takes advantage of a prior model of population demography, in addition to the molecular data summarized by the pattern of polymorphic sites. The method relies on a rejection sampling algorithm that is introduced for simulating conditional coalescent trees given a fixed number of ancestral lineages at time t. Computer simulations show that the number of ancestral lineages can be estimated accurately, provided that the number of mutations that occurred since time t is sufficiently large. The method is applied to 986 present-day human sequences located in hypervariable region 1 of the mitochondrion to estimate the number of ancestral lineages of modern humans at the time of potential admixture with the Neanderthal population. Our estimates support a view that the proportion of the modern population consisting of Neanderthal contributions must be relatively small, less than approximately 5%, if the admixture happened as recently as 30,000 years ago.

    View details for DOI 10.1534/genetics.106.066233

    View details for Web of Science ID 000248416300030

    View details for PubMedID 17435232

  • Sampling properties of homozygosity-based statistics for linkage disequilibrium MATHEMATICAL BIOSCIENCES Rosenberg, N. A., Blum, M. G. 2007; 208 (1): 33-47


    Homozygosity-based statistics such as Ohta's identity-in-state (IIS) excess offer the potential to measure linkage disequilibrium for multiallelic loci in small samples. However, previous observations have suggested that for independent loci, in small samples these statistics might produce values that more frequently lie on one side rather than on the other side of zero. Here we investigate the sampling properties of the IIS excess. We find that for any pair of independent polymorphic loci, as sample size n approaches infinity, the sampling distribution of the IIS excess approaches a normal distribution. For large samples, the IIS excess tends towards symmetry around zero, and the probabilities of positive and of negative IIS excess both approach 1/2. Surprisingly, however, we also find that for sufficiently large n, independent loci can be chosen so that the probability of a sample having positive IIS excess is arbitrarily close to either 0 or 1. The results are applied to interpretation of data from human populations, and we conclude that before employing homozygosity-based statistics to measure LD in a particular sample, especially for loci with either very small or very large homozygosities, it is useful to verify that loci with the observed homozygosity values are not likely to produce a large bias in IIS excess in samples of the given size.

    View details for DOI 10.1016/j.mbs.2006.07.001

    View details for Web of Science ID 000248196400003

    View details for PubMedID 17157882

  • The probability distribution under a population divergence model of the number of genetic founding lineages of a population or species THEORETICAL POPULATION BIOLOGY Jakobsson, M., Rosenberg, N. A. 2007; 71 (4): 502-523


    The composition of genetic variation in a population or species is shaped by the number of events that led to the founding of the group. We consider a neutral coalescent model of two populations, where a derived population is founded as an offshoot of an ancestral population. For a given locus, using both recursive and nonrecursive approaches, we compute the probability distribution of the number of genetic founding lineages that have given rise to the derived population. This number of genetic founding lineages is defined as the number of ancestral individuals that contributed at the locus to the present-day derived population, and is formulated in terms of interspecific coalescence events. The effects of sample size and divergence time on the probability distribution of the number of founding lineages are studied in detail. For 99.99% of the loci in the derived population to each have one founding lineage, the two populations must be separated for 9.9N generations. However, only approximately 0.87N generations must pass since divergence for 99.99% of the loci to have <6 founding lineages. Our results are useful as a prior expectation on the number of founding lineages in scenarios that involve the evolution of one population from the splitting of an ancestral group, such as in the colonization of islands, the formation of polyploid species, and the domestication of crops and livestock from wild ancestors.

    View details for DOI 10.1016/j.tpb.2007.01.004

    View details for Web of Science ID 000247167600009

    View details for PubMedID 17383701

  • Genetic diversity and population structure inferred from the partially duplicated genome of domesticated carp, Cyprinus carpio L. GENETICS SELECTION EVOLUTION David, L., Rosenberg, N. A., Lavi, U., Feldman, M. W., Hillel, J. 2007; 39 (3): 319-340


    Genetic relationships among eight populations of domesticated carp (Cyprinus carpio L.), a species with a partially duplicated genome, were studied using 12 microsatellites and 505 AFLP bands. The populations included three aquacultured carp strains and five ornamental carp (koi) variants. Grass carp (Ctenopharyngodon idella) was used as an outgroup. AFLP-based gene diversity varied from 5% (grass carp) to 32% (koi) and reflected the reasonably well understood histories and breeding practices of the populations. A large fraction of the molecular variance was due to differences between aquacultured and ornamental carps. Further analyses based on microsatellite data, including cluster analysis and neighbor-joining trees, supported the genetic distinctiveness of aquacultured and ornamental carps, despite the recent divergence of the two groups. In contrast to what was observed for AFLP-based diversity, the frequency of heterozygotes based on microsatellites was comparable among all populations. This discrepancy can potentially be explained by duplication of some loci in Cyprinus carpio L., and a model that shows how duplication can increase heterozygosity estimates for microsatellites but not for AFLP loci is discussed. Our analyses in carp can help in understanding the consequences of genotyping duplicated loci and in interpreting discrepancies between dominant and co-dominant markers in species with recent genome duplication.

    View details for DOI 10.1051/gse:2007006

    View details for Web of Science ID 000245686900006

    View details for PubMedID 17433244

  • A private allele ubiquitous in the Americas BIOLOGY LETTERS Schroeder, K. B., Schurr, T. G., Long, J. C., Rosenberg, N. A., Crawford, M. H., Tarskaia, L. A., Osipova, L. P., Zhadanov, S. I., Smith, D. G. 2007; 3 (2): 218-223


    The three-wave migration hypothesis of Greenberg et al. has permeated the genetic literature on the peopling of the Americas. Greenberg et al. proposed that Na-Dene, Aleut-Eskimo and Amerind are language phyla which represent separate migrations from Asia to the Americas. We show that a unique allele at autosomal microsatellite locus D9S1120 is present in all sampled North and South American populations, including the Na-Dene and Aleut-Eskimo, and in related Western Beringian groups, at an average frequency of 31.7%. This allele was not observed in any sampled putative Asian source populations or in other worldwide populations. Neither selection nor admixture explains the distribution of this regionally specific marker. The simplest explanation for the ubiquity of this allele across the Americas is that the same founding population contributed a large fraction of ancestry to all modern Native American populations.

    View details for DOI 10.1098/rsbl.2006.0609

    View details for Web of Science ID 000244947700030

    View details for PubMedID 17301009

  • Low levels of genetic divergence across geographically and linguistically diverse populations from India PLOS GENETICS Rosenberg, N. A., Mahajan, S., Gonzalez-Quevedo, C., Blum, M. G., Nino-Rosales, L., Ninis, V., Das, P., Hegde, M., Molinari, L., Zapata, G., Weber, J. L., Belmont, J. W., Patel, P. I. 2006; 2 (12): 2052-2061


    Ongoing modernization in India has elevated the prevalence of many complex genetic diseases associated with a western lifestyle and diet to near-epidemic proportions. However, although India comprises more than one sixth of the world's human population, it has largely been omitted from genomic surveys that provide the backdrop for association studies of genetic disease. Here, by genotyping India-born individuals sampled in the United States, we carry out an extensive study of Indian genetic variation. We analyze 1,200 genome-wide polymorphisms in 432 individuals from 15 Indian populations. We find that populations from India, and populations from South Asia more generally, constitute one of the major human subgroups with increased similarity of genetic ancestry. However, only a relatively small amount of genetic differentiation exists among the Indian populations. Although caution is warranted due to the fact that United States-sampled Indian populations do not represent a random sample from India, these results suggest that the frequencies of many genetic variants are distinctive in India compared to other parts of the world and that the effects of population heterogeneity on the production of false positives in association studies may be smaller in Indians (and particularly in Indian-Americans) than might be expected for such a geographically and linguistically diverse subset of the human population.

    View details for DOI 10.1371/journal.pgen.0020215

    View details for Web of Science ID 000243482100010

    View details for PubMedID 17194221

  • A worldwide survey of haplotype variation and linkage disequilibrium in the human genome NATURE GENETICS Conrad, D. F., Jakobsson, M., Coop, G., Wen, X., Wall, J. D., Rosenberg, N. A., Pritchard, J. K. 2006; 38 (11): 1251-1260


    Recent genomic surveys have produced high-resolution haplotype information, but only in a small number of human populations. We report haplotype structure across 12 Mb of DNA sequence in 927 individuals representing 52 populations. The geographic distribution of haplotypes reflects human history, with a loss of haplotype diversity as distance increases from Africa. Although the extent of linkage disequilibrium (LD) varies markedly across populations, considerable sharing of haplotype structure exists, and inferred recombination hotspot locations generally match across groups. The four samples in the International HapMap Project contain the majority of common haplotypes found in most populations: averaging across populations, 83% of common 20-kb haplotypes in a population are also common in the most similar HapMap sample. Consequently, although the portability of tag SNPs based on the HapMap is reduced in low-LD Africans, the HapMap will be helpful for the design of genome-wide association mapping studies in nearly all human populations.

    View details for DOI 10.1038/ng1911

    View details for Web of Science ID 000241592700013

    View details for PubMedID 17057719

  • A general population-genetic model for the production by population structure of spurious genotype-phenotype associations in discrete, admixed or spatially distributed populations GENETICS Rosenberg, N. A., Nordborg, M. 2006; 173 (3): 1665-1678


    In linkage disequilibrium mapping of genetic variants causally associated with phenotypes, spurious associations can potentially be generated by any of a variety of types of population structure. However, mathematical theory of the production of spurious associations has largely been restricted to population structure models that involve the sampling of individuals from a collection of discrete subpopulations. Here, we introduce a general model of spurious association in structured populations, appropriate whether the population structure involves discrete groups, admixture among such groups, or continuous variation across space. Under the assumptions of the model, we find that a single common principle--applicable to both the discrete and admixed settings as well as to spatial populations--gives a necessary and sufficient condition for the occurrence of spurious associations. Using a mathematical connection between the discrete and admixed cases, we show that in admixed populations, spurious associations are less severe than in corresponding mixtures of discrete subpopulations, especially when the variance of admixture across individuals is small. This observation, together with the results of simulations that examine the relative influences of various model parameters, has important implications for the design and analysis of genetic association studies in structured populations.

    View details for DOI 10.1534/genetics.105.055335

    View details for Web of Science ID 000239629400040

    View details for PubMedID 16582435

  • Discordance of species trees with their most likely gene trees PLOS GENETICS Degnan, J. H., Rosenberg, N. A. 2006; 2 (5): 762-768


    Because of the stochastic way in which lineages sort during speciation, gene trees may differ in topology from each other and from species trees. Surprisingly, assuming that genetic lineages follow a coalescent model of within-species evolution, we find that for any species tree topology with five or more species, there exist branch lengths for which gene tree discordance is so common that the most likely gene tree topology to evolve along the branches of a species tree differs from the species phylogeny. This counterintuitive result implies that in combining data on multiple loci, the straightforward procedure of using the most frequently observed gene tree topology as an estimate of the species tree topology can be asymptotically guaranteed to produce an incorrect estimate. We conclude with suggestions that can aid in overcoming this new obstacle to accurate genomic inference of species phylogenies.

    View details for DOI 10.1371/journal.pgen.0020068

    View details for Web of Science ID 000239494600013

    View details for PubMedID 16733550

  • Clines, clusters, and the effect of study design on the inference of human population structure PLOS GENETICS Rosenberg, N. A., Mahajan, S., Ramachandran, S., Zhao, C. F., Pritchard, J. K., Feldman, M. W. 2005; 1 (6): 660-671


    Previously, we observed that without using prior information about individual sampling locations, a clustering algorithm applied to multilocus genotypes from worldwide human populations produced genetic clusters largely coincident with major geographic regions. It has been argued, however, that the degree of clustering is diminished by use of samples with greater uniformity in geographic distribution, and that the clusters we identified were a consequence of uneven sampling along genetic clines. Expanding our earlier dataset from 377 to 993 markers, we systematically examine the influence of several study design variables--sample size, number of loci, number of clusters, assumptions about correlations in allele frequencies across populations, and the geographic dispersion of the sample--on the "clusteredness" of individuals. With all other variables held constant, geographic dispersion is seen to have comparatively little effect on the degree of clustering. Examination of the relationship between genetic and geographic distance supports a view in which the clusters arise not as an artifact of the sampling scheme, but from small discontinuous jumps in genetic distance for most population pairs on opposite sides of geographic barriers, in comparison with genetic distance for pairs on the same side. Thus, analysis of the 993-locus dataset corroborates our earlier results: if enough markers are used with a sufficiently large worldwide sample, individuals can be partitioned into genetic clusters that match major geographic subdivisions of the globe, with some individuals from intermediate geographic locations having mixed membership in the clusters that correspond to neighboring regions.

    View details for DOI 10.1371/journal.pgen.0010070

    View details for Web of Science ID 000234900800005

    View details for PubMedID 16355252

  • The pattern of polymorphism in Arabidopsis thaliana PLOS BIOLOGY Nordborg, M., Hu, T. T., Ishino, Y., Jhaveri, J., Toomajian, C., Zheng, H. G., Bakker, E., Calabrese, P., Gladstone, J., Goyal, R., Jakobsson, M., Kim, S., Morozov, Y., Padhukasahasram, B., Plagnol, V., Rosenberg, N. A., Shah, C., Wall, J. D., Wang, J., Zhao, K. Y., Kalbfleisch, T., Schulz, V., Kreitman, M., Bergelson, J. 2005; 3 (7): 1289-1299


    We resequenced 876 short fragments in a sample of 96 individuals of Arabidopsis thaliana that included stock center accessions as well as a hierarchical sample from natural populations. Although A. thaliana is a selfing weed, the pattern of polymorphism in general agrees with what is expected for a widely distributed, sexually reproducing species. Linkage disequilibrium decays rapidly, within 50 kb. Variation is shared worldwide, although population structure and isolation by distance are evident. The data fail to fit standard neutral models in several ways. There is a genome-wide excess of rare alleles, at least partially due to selection. There is too much variation between genomic regions in the level of polymorphism. The local level of polymorphism is negatively correlated with gene density and positively correlated with segmental duplications. Because the data do not fit theoretical null distributions, attempts to infer natural selection from polymorphism data will require genome-wide surveys of polymorphism in order to identify anomalous regions. Despite this, our data support the utility of A. thaliana as a model for evolutionary functional genomics.

    View details for DOI 10.1371/journal.pbio.0030196

    View details for Web of Science ID 000230759000016

    View details for PubMedID 15907155

  • Polyploid and multilocus extensions of the Wahlund inequality THEORETICAL POPULATION BIOLOGY Rosenberg, N. A., Calabrese, P. P. 2004; 66 (4): 381-391


    Wahlund's inequality informally states that if a structured and an unstructured population have the same allele frequencies at a locus, the structured population contains more homozygotes. We show that this inequality holds generally for ploidy level P, that is, the structured population has more P-polyhomozygotes. Further, for M randomly chosen loci (M ≥ 2), the structured population is also expected to contain more M-multihomozygotes than an unstructured population with the same single-locus homozygosities. The extended inequalities suggest multilocus identity coefficients analogous to F_ST. Using microsatellite genotypes from human populations, we demonstrate that the multilocus Wahlund inequality can explain a positive bias in "identity-in-state excess".

    View details for DOI 10.1016/j.tpb.2004.07.001

    View details for Web of Science ID 000225649600009

    View details for PubMedID 15560915
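
The single-locus inequality for general ploidy can be checked with a minimal numerical sketch. It assumes random union of gene copies within each subpopulation, so that the probability of a P-polyhomozygote is the sum of p^P over alleles; the function names and example frequencies are hypothetical.

```python
def polyhomozygosity(freqs, ploidy=2):
    """Probability that `ploidy` randomly drawn gene copies are all identical."""
    return sum(p ** ploidy for p in freqs)

def wahlund_compare(subpop_freqs, weights, ploidy=2):
    """Return (structured, unstructured) P-polyhomozygosity.

    subpop_freqs: one allele-frequency list per subpopulation.
    weights: relative subpopulation sizes, summing to 1.
    """
    structured = sum(w * polyhomozygosity(f, ploidy)
                     for f, w in zip(subpop_freqs, weights))
    n_alleles = len(subpop_freqs[0])
    # The unstructured population has the pooled (weighted mean) frequencies.
    pooled = [sum(w * f[i] for f, w in zip(subpop_freqs, weights))
              for i in range(n_alleles)]
    return structured, polyhomozygosity(pooled, ploidy)

for ploidy in (2, 3, 4):
    print(ploidy, wahlund_compare([[0.9, 0.1], [0.1, 0.9]], [0.5, 0.5], ploidy))
```

The inequality in this form follows from the convexity of p^P for P >= 2, which is why the structured value is never smaller than the pooled one.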

  • Informativeness of genetic markers for inference of ancestry AMERICAN JOURNAL OF HUMAN GENETICS Rosenberg, N. A., Li, L. M., Ward, R., Pritchard, J. K. 2003; 73 (6): 1402-1422


    Inference of individual ancestry is useful in various applications, such as admixture mapping and structured-association mapping. Using information-theoretic principles, we introduce a general measure, the informativeness for assignment (I_n), applicable to any number of potential source populations, for determining the amount of information that multiallelic markers provide about individual ancestry. In a worldwide human microsatellite data set, we identify markers of highest informativeness for inference of regional ancestry and for inference of population ancestry within regions; these markers, which are listed in online-only tables in our article, can be useful both in testing for and in controlling the influence of ancestry on case-control genetic association studies. Markers that are informative in one collection of source populations are generally informative in others. Informativeness of random dinucleotides, the most informative class of microsatellites, is five to eight times that of random single-nucleotide polymorphisms (SNPs), but 2%-12% of SNPs have higher informativeness than the median for dinucleotides. Our results can aid in decisions about the type, quantity, and specific choice of markers for use in studies of ancestry.

    View details for Web of Science ID 000187491100015

    View details for PubMedID 14631557
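
An information-theoretic measure of this kind can be sketched as the mutual information between an allele and its source population. The version below assumes K equally weighted source populations and natural logarithms; the abstract does not give the formula, so the details here should be read as an assumption, and the function name is hypothetical.

```python
from math import log

def informativeness(freqs):
    """Mutual information between allele and source population.

    freqs[i][j]: frequency of allele j in population i, for K populations
    that are treated as equally likely sources.
    """
    k = len(freqs)
    n_alleles = len(freqs[0])
    total = 0.0
    for j in range(n_alleles):
        p_bar = sum(pop[j] for pop in freqs) / k   # average frequency of allele j
        if p_bar > 0:
            total -= p_bar * log(p_bar)
        for pop in freqs:
            if pop[j] > 0:                          # treat 0 * log(0) as 0
                total += (pop[j] / k) * log(pop[j])
    return total

print(informativeness([[1.0, 0.0], [0.0, 1.0]]))   # fixed differences: log(2)
print(informativeness([[0.3, 0.7], [0.3, 0.7]]))   # identical frequencies: 0
```

The measure is zero when all populations share the same allele frequencies and reaches log(K) when each population is fixed for its own allele, which gives a natural scale for ranking markers.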

  • Features of evolution and expansion of modern humans, inferred from genomewide microsatellite markers AMERICAN JOURNAL OF HUMAN GENETICS Zhivotovsky, L. A., Rosenberg, N. A., Feldman, M. W. 2003; 72 (5): 1171-1186


    We study data on variation in 52 worldwide populations at 377 autosomal short tandem repeat loci, to infer a demographic history of human populations. Variation at di-, tri-, and tetranucleotide repeat loci is distributed differently, although each class of markers exhibits a decrease of within-population genetic variation in the following order: sub-Saharan Africa, Eurasia, East Asia, Oceania, and America. There is a similar decrease in the frequency of private alleles. With multidimensional scaling, populations belonging to the same major geographic region cluster together, and some regions permit a finer resolution of populations. When a stepwise mutation model is used, a population tree based on T_D estimates of divergence time suggests that the branches leading to the present sub-Saharan African populations of hunter-gatherers were the first to diverge from a common ancestral population (approximately 71-142 thousand years ago). The branches corresponding to sub-Saharan African farming populations and those that left Africa diverge next, with subsequent splits of branches for Eurasia, Oceania, East Asia, and America. African hunter-gatherer populations and populations of Oceania and America exhibit no statistically significant signature of growth. The features of population subdivision and growth are discussed in the context of the ancient expansion of modern humans.

    View details for Web of Science ID 000182474400010

    View details for PubMedID 12690579

  • Genetic structure of human populations SCIENCE Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A., Feldman, M. W. 2002; 298 (5602): 2381-2385


    We studied human population structure using genotypes at 377 autosomal microsatellite loci in 1056 individuals from 52 populations. Within-population differences among individuals account for 93 to 95% of genetic variation; differences among major groups constitute only 3 to 5%. Nevertheless, without using prior information about the origins of individuals, we identified six main genetic clusters, five of which correspond to major geographic regions, and subclusters that often correspond to individual populations. General agreement of genetic and predefined populations suggests that self-reported ancestry can facilitate assessments of epidemiological risks but does not obviate the need to use genetic information in genetic association studies.

    View details for Web of Science ID 000179915900054

    View details for PubMedID 12493913

  • Genealogical trees, coalescent theory and the analysis of genetic polymorphisms NATURE REVIEWS GENETICS Rosenberg, N. A., Nordborg, M. 2002; 3 (5): 380-390


    Improvements in genotyping technologies have led to the increased use of genetic polymorphism for inference about population phenomena, such as migration and selection. Such inference presents a challenge, because polymorphism data reflect a unique, complex, non-repeatable evolutionary history. Traditional analysis methods do not take this into account. A stochastic process known as the 'coalescent' presents a coherent statistical framework for analysis of genetic polymorphisms.

    View details for DOI 10.1038/nrg795

    View details for Web of Science ID 000175350000015

    View details for PubMedID 11988763

  • Association mapping in structured populations AMERICAN JOURNAL OF HUMAN GENETICS Pritchard, J. K., Stephens, M., Rosenberg, N. A., Donnelly, P. 2000; 67 (1): 170-181


    The use, in association studies, of the forthcoming dense genomewide collection of single-nucleotide polymorphisms (SNPs) has been heralded as a potential breakthrough in the study of the genetic basis of common complex disorders. A serious problem with association mapping is that population structure can lead to spurious associations between a candidate marker and a phenotype. One common solution has been to abandon case-control studies in favor of family-based tests of association, such as the transmission/disequilibrium test (TDT), but this comes at a considerable cost in the need to collect DNA from close relatives of affected individuals. In this article we describe a novel, statistically valid, method for case-control association studies in structured populations. Our method uses a set of unlinked genetic markers to infer details of population structure, and to estimate the ancestry of sampled individuals, before using this information to test for associations within subpopulations. It provides power comparable with the TDT in many settings and may substantially outperform it if there are conflicting associations in different subpopulations.

    View details for Web of Science ID 000088926900019

    View details for PubMedID 10827107

  • Microsatellite evolution in modern humans: a comparison of two data sets from the same populations ANNALS OF HUMAN GENETICS Jin, L., Baskett, M. L., Cavalli-Sforza, L. L., Zhivotovsky, L. A., Feldman, M. W., Rosenberg, N. A. 2000; 64: 117-134


    We genotyped 64 dinucleotide microsatellite repeats in individuals from populations that represent all inhabited continents. Microsatellite summary statistics are reported for these data, as well as for a data set that includes 28 out of 30 loci studied by Bowcock et al. (1994) in the same individuals. For both data sets, diversity statistics such as heterozygosity, number of alleles per locus, and number of private alleles per locus produced the highest values in Africans, intermediate values in Europeans and Asians, and low values in Americans. Evolutionary trees of populations based on genetic distances separated groups from different continents. Corresponding trees were topologically similar for the two data sets, with the exception that the (δμ)² genetic distance reliably distinguished groups from different continents for the larger data set, but not for the smaller one. Consistent with our results from diversity statistics and from evolutionary trees, population growth statistics S_k and β, which seem particularly useful for indicating recent and ancient population size changes, confirm a model of human evolution in which human populations expand in size and through space following the departure of a small group from Africa.

    View details for Web of Science ID 000088739600003

    View details for PubMedID 11246466

  • Use of unlinked genetic markers to detect population stratification in association studies AMERICAN JOURNAL OF HUMAN GENETICS Pritchard, J. K., Rosenberg, N. A. 1999; 65 (1): 220-228


    We examine the issue of population stratification in association-mapping studies. In case-control studies of association, population subdivision or recent admixture of populations can lead to spurious associations between a phenotype and unlinked candidate loci. Using a model of sampling from a structured population, we show that if population stratification exists, it can be detected by use of unlinked marker loci. We show that the case-control-study design, using unrelated control individuals, is a valid approach for association mapping, provided that marker loci unlinked to the candidate locus are included in the study, to test for stratification. We suggest guidelines as to the number of unlinked marker loci to use.

    View details for Web of Science ID 000081224300027

    View details for PubMedID 10364535