Honors & Awards
Marshall Sherfield Fellow, Marshall Aid Commemoration Commission (2001-2)
Sloan Research Fellow in Molecular Biology, Sloan Foundation (2007-9)
Provost Award for Distinguished Research, Cornell University (2008)
MacArthur Fellow, John D. and Catherine T. MacArthur Foundation (2010)
Ph.D., Harvard University, Biology (2001)
A.M., Harvard University, Statistics (2001)
A.B., Harvard College, Biology (1997)
Current Research and Scholarly Interests
My research focuses on analyzing genome wide patterns of variation within and between species to address fundamental questions in biology, anthropology, and medicine. My group works on a variety of organisms and model systems ranging from humans and other primates to domesticated plant and animals. Much of our research is at the interface of computational biology, mathematical genetics, and evolutionary genomics.
Personal Genomics for Preventive Cardiology
The purpose of this study is to see if providing information to a person on their inherited (genetic) risk of cardiovascular disease (CVD) helps to motivate that person to change their diet, lifestyle or medication regimen to alter their risk.
Stanford is currently not accepting patients for this trial. For more information, please contact Josh Knowles, 650-804-2526.
Independent Studies (11)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Directed Reading in Genetics
GENE 299 (Aut, Win, Spr, Sum)
- Graduate Research
GENE 399 (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
GENE 370 (Aut, Win, Spr, Sum)
- Out-of-Department Graduate Research
BIO 300X (Aut, Win, Spr, Sum)
- Ph.D. Research
CME 400 (Aut, Spr, Sum)
PHYSICS 490 (Aut, Win, Spr, Sum)
- Supervised Study
GENE 260 (Aut, Win, Spr, Sum)
- Undergraduate Research
GENE 199 (Aut, Win, Spr, Sum)
- Biomedical Informatics Teaching Methods
Prior Year Courses
- Advanced Genetics
GENE 205 (Win)
- Current Topics in Human, Population, and Statistical Genomics
GENE 209 (Spr)
- Advanced Genetics
Postdoctoral Faculty Sponsor
Helio Costa, Christopher DeBoever, Christopher Gignoux, Katharine Grabek, Nilah Ioannidis, Aashish Jha, Arturo Lopez Pineda, Fernando Mendez, Morten Rasmussen, Elena Sorokin, Genevieve Wojcik
Doctoral Dissertation Reader (AC)
Zoe June Assaf, Amy Goldberg, Emily Tsang, Zachary Zappala
Doctoral Dissertation Advisor (AC)
Thomas Cooke, Julian Homburger
Population Genetic Inference from Personal Genome Data: Impact of Ancestry and Admixture on Human Genomic Variation
AMERICAN JOURNAL OF HUMAN GENETICS
2012; 91 (4): 660-671
Full sequencing of individual human genomes has greatly expanded our understanding of human genetic variation and population history. Here, we present a systematic analysis of 50 human genomes from 11 diverse global populations sequenced at high coverage. Our sample includes 12 individuals who have admixed ancestry and who have varying degrees of recent (within the last 500 years) African, Native American, and European ancestry. We found over 21 million single-nucleotide variants that contribute to a 1.75-fold range in nucleotide heterozygosity across diverse human genomes. This heterozygosity ranged from a high of one heterozygous site per kilobase in west African genomes to a low of 0.57 heterozygous sites per kilobase in segments inferred to have diploid Native American ancestry from the genomes of Mexican and Puerto Rican individuals. We show evidence of all three continental ancestries in the genomes of Mexican, Puerto Rican, and African American populations, and the genome-wide statistics are highly consistent across individuals from a population once ancestry proportions have been accounted for. Using a generalized linear model, we identified subtle variations across populations in the proportion of neutral versus deleterious variation and found that genome-wide statistics vary in admixed populations even once ancestry proportions have been factored in. We further infer that multiple periods of gene flow shaped the diversity of admixed populations in the Americas-70% of the European ancestry in today's African Americans dates back to European gene flow happening only 7-8 generations ago.
View details for DOI 10.1016/j.ajhg.2012.08.025
View details for Web of Science ID 000309568500008
View details for PubMedID 23040495
Melanesian Blond Hair Is Caused by an Amino Acid Change in TYRP1
2012; 336 (6081): 554-554
Naturally blond hair is rare in humans and found almost exclusively in Europe and Oceania. Here, we identify an arginine-to-cysteine change at a highly conserved residue in tyrosinase-related protein 1 (TYRP1) as a major determinant of blond hair in Solomon Islanders. This missense mutation is predicted to affect catalytic activity of TYRP1 and causes blond hair through a recessive mode of inheritance. The mutation is at a frequency of 26% in the Solomon Islands, is absent outside of Oceania, represents a strong common genetic effect on a complex human phenotype, and highlights the importance of examining genetic associations worldwide.
View details for DOI 10.1126/science.1217849
View details for Web of Science ID 000303498800036
View details for PubMedID 22556244
Demographic history and rare allele sharing among human populations
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2011; 108 (29): 11983-11988
High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2-4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ?1,000 sequenced chromosomes per population, whereas ?2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to overcome that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.
View details for DOI 10.1073/pnas.1019276108
View details for Web of Science ID 000292876900056
View details for PubMedID 21730125
- Genomics for the world NATURE 2011; 475 (7355): 163-165
Genome-wide patterns of population structure and admixture in West Africans and African Americans
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2010; 107 (2): 786-791
Quantifying patterns of population structure in Africans and African Americans illuminates the history of human populations and is critical for undertaking medical genomic studies on a global scale. To obtain a fine-scale genome-wide perspective of ancestry, we analyze Affymetrix GeneChip 500K genotype data from African Americans (n = 365) and individuals with ancestry from West Africa (n = 203 from 12 populations) and Europe (n = 400 from 42 countries). We find that population structure within the West African sample reflects primarily language and secondarily geographical distance, echoing the Bantu expansion. Among African Americans, analysis of genomic admixture by a principal component-based approach indicates that the median proportion of European ancestry is 18.5% (25th-75th percentiles: 11.6-27.7%), with very large variation among individuals. In the African-American sample as a whole, few autosomal regions showed exceptionally high or low mean African ancestry, but the X chromosome showed elevated levels of African ancestry, consistent with a sex-biased pattern of gene flow with an excess of European male and African female ancestry. We also find that genomic profiles of individual African Americans afford personalized ancestry reconstructions differentiating ancient vs. recent European and African ancestry. Finally, patterns of genetic similarity among inferred African segments of African-American genomes and genomes of contemporary African populations included in this study suggest African ancestry is most similar to non-Bantu Niger-Kordofanian-speaking populations, consistent with historical documents of the African Diaspora and trans-Atlantic slave trade.
View details for DOI 10.1073/pnas.0909559107
View details for Web of Science ID 000273559300050
View details for PubMedID 20080753
Genes mirror geography within Europe
2008; 456 (7218): 98-U5
Understanding the genetic structure of human populations is of fundamental interest to medical, forensic and anthropological sciences. Advances in high-throughput genotyping technology have markedly improved our understanding of global patterns of human genetic variation and suggest the potential to use large samples to uncover variation among closely spaced populations. Here we characterize genetic variation in a sample of 3,000 European individuals genotyped at over half a million variable DNA sites in the human genome. Despite low average levels of genetic differentiation among Europeans, we find a close correspondence between genetic and geographic distances; indeed, a geographical map of Europe arises naturally as an efficient two-dimensional summary of genetic variation in Europeans. The results emphasize that when mapping the genetic basis of a disease phenotype, spurious associations can arise if genetic structure is not properly accounted for. In addition, the results are relevant to the prospects of genetic ancestry testing; an individual's DNA can be used to infer their geographic origin with surprising accuracy-often to within a few hundred kilometres.
View details for DOI 10.1038/nature07331
View details for Web of Science ID 000260674000048
View details for PubMedID 18758442
- Population Genomic Analysis Reveals a Rich Speciation and Demographic History of Orang-utans (Pongo pygmaeus and Pongo abelii) PLOS ONE 2013; 8 (10)
Gene flow from North Africa contributes to differential human genetic diversity in southern Europe
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2013; 110 (29): 11791-11796
Human genetic diversity in southern Europe is higher than in other regions of the continent. This difference has been attributed to postglacial expansions, the demic diffusion of agriculture from the Near East, and gene flow from Africa. Using SNP data from 2,099 individuals in 43 populations, we show that estimates of recent shared ancestry between Europe and Africa are substantially increased when gene flow from North Africans, rather than Sub-Saharan Africans, is considered. The gradient of North African ancestry accounts for previous observations of low levels of sharing with Sub-Saharan Africa and is independent of recent gene flow from the Near East. The source of genetic diversity in southern Europe has important biomedical implications; we find that most disease risk alleles from genome-wide association studies follow expected patterns of divergence between Europe and North Africa, with the principal exception of multiple sclerosis.
View details for DOI 10.1073/pnas.1306223110
View details for Web of Science ID 000322086100040
View details for PubMedID 23733930
Analysis of the Genetic Basis of Disease in the Context of Worldwide Human Relationships and Migration
2013; 9 (5)
Genetic diversity across different human populations can enhance understanding of the genetic basis of disease. We calculated the genetic risk of 102 diseases in 1,043 unrelated individuals across 51 populations of the Human Genome Diversity Panel. We found that genetic risk for type 2 diabetes and pancreatic cancer decreased as humans migrated toward East Asia. In addition, biliary liver cirrhosis, alopecia areata, bladder cancer, inflammatory bowel disease, membranous nephropathy, systemic lupus erythematosus, systemic sclerosis, ulcerative colitis, and vitiligo have undergone genetic risk differentiation. This analysis represents a large-scale attempt to characterize genetic risk differentiation in the context of migration. We anticipate that our findings will enable detailed analysis pertaining to the driving forces behind genetic risk differentiation.
View details for DOI 10.1371/journal.pgen.1003447
View details for Web of Science ID 000320030000003
View details for PubMedID 23717210
Evolutionary and Population Genomics of the Cavity Causing Bacteria Streptococcus mutans
MOLECULAR BIOLOGY AND EVOLUTION
2013; 30 (4): 881-893
Streptococcus mutans is widely recognized as one of the key etiological agents of human dental caries. Despite its role in this important disease, our present knowledge of gene content variability across the species and its relationship to adaptation is minimal. Estimates of its demographic history are not available. In this study, we generated genome sequences of 57 S. mutans isolates, as well as representative strains of the most closely related species to S. mutans (S. ratti, S. macaccae, and S. criceti), to identify the overall structure and potential adaptive features of the dispensable and core components of the genome. We also performed population genetic analyses on the core genome of the species aimed at understanding the demographic history, and impact of selection shaping its genetic variation. The maximum gene content divergence among strains was approximately 23%, with the majority of strains diverging by 5-15%. The core genome consisted of 1,490 genes and the pan-genome approximately 3,296. Maximum likelihood analysis of the synonymous site frequency spectrum (SFS) suggested that the S. mutans population started expanding exponentially approximately 10,000 years ago (95% confidence interval [CI]: 3,268-14,344 years ago), coincidental with the onset of human agriculture. Analysis of the replacement SFS indicated that a majority of these substitutions are under strong negative selection, and the remainder evolved neutrally. A set of 14 genes was identified as being under positive selection, most of which were involved in either sugar metabolism or acid tolerance. Analysis of the core genome suggested that among 73 genes present in all isolates of S. mutans but absent in other species of the mutans taxonomic group, the majority can be associated with metabolic processes that could have contributed to the successful adaptation of S. mutans to its new niche, the human mouth, and with the dietary changes that accompanied the origin of agriculture.
View details for DOI 10.1093/molbev/mss278
View details for Web of Science ID 000317002300017
View details for PubMedID 23228887
High-throughput two-dimensional root system phenotyping platform facilitates genetic analysis of root growth and development
PLANT CELL AND ENVIRONMENT
2013; 36 (2): 454-466
High-throughput phenotyping of root systems requires a combination of specialized techniques and adaptable plant growth, root imaging and software tools. A custom phenotyping platform was designed to capture images of whole root systems, and novel software tools were developed to process and analyse these images. The platform and its components are adaptable to a wide range root phenotyping studies using diverse growth systems (hydroponics, paper pouches, gel and soil) involving several plant species, including, but not limited to, rice, maize, sorghum, tomato and Arabidopsis. The RootReader2D software tool is free and publicly available and was designed with both user-guided and automated features that increase flexibility and enhance efficiency when measuring root growth traits from specific roots or entire root systems during large-scale phenotyping studies. To demonstrate the unique capabilities and high-throughput capacity of this phenotyping platform for studying root systems, genome-wide association studies on rice (Oryza sativa) and maize (Zea mays) root growth were performed and root traits related to aluminium (Al) tolerance were analysed on the parents of the maize nested association mapping (NAM) population.
View details for DOI 10.1111/j.1365-3040.2012.02587.x
View details for Web of Science ID 000312997700017
View details for PubMedID 22860896
Population genomic analysis reveals a rich speciation and demographic history of orang-utans (Pongo pygmaeus and Pongo abelii).
2013; 8 (10)
To gain insights into evolutionary forces that have shaped the history of Bornean and Sumatran populations of orang-utans, we compare patterns of variation across more than 11 million single nucleotide polymorphisms found by previous mitochondrial and autosomal genome sequencing of 10 wild-caught orang-utans. Our analysis of the mitochondrial data yields a far more ancient split time between the two populations (~3.4 million years ago) than estimates based on autosomal data (0.4 million years ago), suggesting a complex speciation process with moderate levels of primarily male migration. We find that the distribution of selection coefficients consistent with the observed frequency spectrum of autosomal non-synonymous polymorphisms in orang-utans is similar to the distribution in humans. Our analysis indicates that 35% of genes have evolved under detectable negative selection. Overall, our findings suggest that purifying natural selection, genetic drift, and a complex demographic history are the dominant drivers of genome evolution for the two orang-utan populations.
View details for DOI 10.1371/journal.pone.0077175
View details for PubMedID 24194868
SnIPRE: Selection Inference Using a Poisson Random Effects Model
PLOS COMPUTATIONAL BIOLOGY
2012; 8 (12)
We present an approach for identifying genes under natural selection using polymorphism and divergence data from synonymous and non-synonymous sites within genes. A generalized linear mixed model is used to model the genome-wide variability among categories of mutations and estimate its functional consequence. We demonstrate how the model's estimated fixed and random effects can be used to identify genes under selection. The parameter estimates from our generalized linear model can be transformed to yield population genetic parameter estimates for quantities including the average selection coefficient for new mutations at a locus, the synonymous and non-synynomous mutation rates, and species divergence times. Furthermore, our approach incorporates stochastic variation due to the evolutionary process and can be fit using standard statistical software. The model is fit in both the empirical Bayes and Bayesian settings using the lme4 package in R, and Markov chain Monte Carlo methods in WinBUGS. Using simulated data we compare our method to existing approaches for detecting genes under selection: the McDonald-Kreitman test, and two versions of the Poisson random field based method MKprf. Overall, we find our method universally outperforms existing methods for detecting genes subject to selection using polymorphism and divergence data.
View details for DOI 10.1371/journal.pcbi.1002806
View details for Web of Science ID 000312901500013
View details for PubMedID 23236270
The Possibility of De Novo Assembly of the Genome and Population Genomics of the Mangrove Rivulus, Kryptolebias marmoratus
INTEGRATIVE AND COMPARATIVE BIOLOGY
2012; 52 (6): 737-742
How organisms adapt to the range of environments they encounter is a fundamental question in biology. Elucidating the genetic basis of adaptation is a difficult task, especially when the targets of selection are not known. Emerging sequencing technologies and assembly algorithms facilitate the genomic dissection of adaptation and population differentiation in a vast array of organisms. Here we describe the attributes of Kryptolebias marmoratus, one of two known self-fertilizing hermaphroditic vertebrates that make this fish an attractive genetic system and a model for understanding the genomics of adaptation. Long periods of selfing have resulted in populations composed of many distinct naturally homozygous strains with a variety of identifiable, and apparently heritable, phenotypes. There also is strong population genetic structure across a diverse range of mangrove habitats, making this a tractable system in which to study differentiation both within and among populations. The ability to rear K. marmoratus in the laboratory contributes further to its value as a model for understanding the genetic drivers for adaptation. To date, microsatellite markers distinguish wild isogenic strains but the naturally high homozygosity improves the quality of de novo assembly of the genome and facilitates the identification of genetic variants associated with phenotypes. Gene annotation can be accomplished with RNA-sequencing data in combination with de novo genome assembly. By combining genomic information with extensive laboratory-based phenotyping, it becomes possible to map genetic variants underlying differences in behavioral, life-history, and other potentially adaptive traits. Emerging genomic technologies provide the required resources for establishing K. marmoratus as a new model organism for behavioral genetics and evolutionary genetics research.
View details for DOI 10.1093/icb/ics094
View details for Web of Science ID 000311645400003
View details for PubMedID 22723055
Limited Evidence for Classic Selective Sweeps in African Populations
2012; 192 (3): 1049-?
While hundreds of loci have been identified as reflecting strong-positive selection in human populations, connections between candidate loci and specific selective pressures often remain obscure. This study investigates broader patterns of selection in African populations, which are underrepresented despite their potential to offer key insights into human adaptation. We scan for hard selective sweeps using several haplotype and allele-frequency statistics with a data set of nearly 500,000 genome-wide single-nucleotide polymorphisms in 12 highly diverged African populations that span a range of environments and subsistence strategies. We find that positive selection does not appear to be a strong determinant of allele-frequency differentiation among these African populations. Haplotype statistics do identify putatively selected regions that are shared across African populations. However, as assessed by extensive simulations, patterns of haplotype sharing between African populations follow neutral expectations and suggest that tails of the empirical distributions contain false-positive signals. After highlighting several genomic regions where positive selection can be inferred with higher confidence, we use a novel method to identify biological functions enriched among populations' empirical tail genomic windows, such as immune response in agricultural groups. In general, however, it seems that current methods for selection scans are poorly suited to populations that, like the African populations in this study, are affected by ascertainment bias and have low levels of linkage disequilibrium, possibly old selective sweeps, and potentially reduced phasing accuracy. Additionally, population history can confound the interpretation of selection statistics, suggesting that greater care is needed in attributing broad genetic patterns to human adaptation.
View details for DOI 10.1534/genetics.112.144071
View details for Web of Science ID 000310793900019
View details for PubMedID 22960214
North African Populations Carry the Signature of Admixture with Neandertals
2012; 7 (10)
One of the main findings derived from the analysis of the Neandertal genome was the evidence for admixture between Neandertals and non-African modern humans. An alternative scenario is that the ancestral population of non-Africans was closer to Neandertals than to Africans because of ancient population substructure. Thus, the study of North African populations is crucial for testing both hypotheses. We analyzed a total of 780,000 SNPs in 125 individuals representing seven different North African locations and searched for their ancestral/derived state in comparison to different human populations and Neandertals. We found that North African populations have a significant excess of derived alleles shared with Neandertals, when compared to sub-Saharan Africans. This excess is similar to that found in non-African humans, a fact that can be interpreted as a sign of Neandertal admixture. Furthermore, the Neandertal's genetic signal is higher in populations with a local, pre-Neolithic North African ancestry. Therefore, the detected ancient admixture is not due to recent Near Eastern or European migrations. Sub-Saharan populations are the only ones not affected by the admixture event with Neandertals.
View details for DOI 10.1371/journal.pone.0047765
View details for Web of Science ID 000311146900109
View details for PubMedID 23082212
- The genetic prehistory of southern Africa NATURE COMMUNICATIONS 2012; 3
North African Jewish and non-Jewish populations form distinctive, orthogonal clusters
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2012; 109 (34): 13865-13870
North African Jews constitute the second largest Jewish Diaspora group. However, their relatedness to each other; to European, Middle Eastern, and other Jewish Diaspora groups; and to their former North African non-Jewish neighbors has not been well defined. Here, genome-wide analysis of five North African Jewish groups (Moroccan, Algerian, Tunisian, Djerban, and Libyan) and comparison with other Jewish and non-Jewish groups demonstrated distinctive North African Jewish population clusters with proximity to other Jewish populations and variable degrees of Middle Eastern, European, and North African admixture. Two major subgroups were identified by principal component, neighbor joining tree, and identity-by-descent analysis-Moroccan/Algerian and Djerban/Libyan-that varied in their degree of European admixture. These populations showed a high degree of endogamy and were part of a larger Ashkenazi and Sephardic Jewish group. By principal component analysis, these North African groups were orthogonal to contemporary populations from North and South Morocco, Western Sahara, Tunisia, Libya, and Egypt. Thus, this study is compatible with the history of North African Jews-founding during Classical Antiquity with proselytism of local populations, followed by genetic isolation with the rise of Christianity and then Islam, and admixture following the emigration of Sephardic Jews during the Inquisition.
View details for DOI 10.1073/pnas.1204840109
View details for Web of Science ID 000308085200081
View details for PubMedID 22869716
Variation of BMP3 Contributes to Dog Breed Skull Diversity
2012; 8 (8)
Since the beginnings of domestication, the craniofacial architecture of the domestic dog has morphed and radiated to human whims. By beginning to define the genetic underpinnings of breed skull shapes, we can elucidate mechanisms of morphological diversification while presenting a framework for understanding human cephalic disorders. Using intrabreed association mapping with museum specimen measurements, we show that skull shape is regulated by at least five quantitative trait loci (QTLs). Our detailed analysis using whole-genome sequencing uncovers a missense mutation in BMP3. Validation studies in zebrafish show that Bmp3 function in cranial development is ancient. Our study reveals the causal variant for a canine QTL contributing to a major morphologic trait.
View details for DOI 10.1371/journal.pgen.1002849
View details for Web of Science ID 000308529300012
View details for PubMedID 22876193
PCAdmix: Principal Components-Based Assignment of Ancestry along Each Chromosome in Individuals with Admixed Ancestry from Two or More Populations
2012; 84 (4): 343-364
Identifying ancestry along each chromosome in admixed individuals provides a wealth of information for understanding the population genetic history of admixture events and is valuable for admixture mapping and identifying recent targets of selection. We present PCAdmix (available at https://sites.google.com/site/pcadmix/home ), a Principal Components-based algorithm for determining ancestry along each chromosome from a high-density, genome-wide set of phased single-nucleotide polymorphism (SNP) genotypes of admixed individuals. We compare our method to HAPMIX on simulated data from two ancestral populations, and we find high concordance between the methods. Our method also has better accuracy than LAMP when applied to three-population admixture, a situation as yet unaddressed by HAPMIX. Finally, we apply our method to a data set of four Latino populations with European, African, and Native American ancestry. We find evidence of assortative mating in each of the four populations, and we identify regions of shared ancestry that may be recent targets of selection and could serve as candidate regions for admixture-based association mapping.
View details for Web of Science ID 000313648400001
View details for PubMedID 23249312
Evolution and Functional Impact of Rare Coding Variation from Deep Sequencing of Human Exomes
2012; 337 (6090): 64-69
As a first step toward understanding how rare variants contribute to risk for complex diseases, we sequenced 15,585 human protein-coding genes to an average median depth of 111× in 2440 individuals of European (n = 1351) and African (n = 1088) ancestry. We identified over 500,000 single-nucleotide variants (SNVs), the majority of which were rare (86% with a minor allele frequency less than 0.5%), previously unknown (82%), and population-specific (82%). On average, 2.3% of the 13,595 SNVs each person carried were predicted to affect protein function of ~313 genes per genome, and ~95.7% of SNVs predicted to be functionally important were rare. This excess of rare functional variants is due to the combined effects of explosive, recent accelerated population growth and weak purifying selection. Furthermore, we show that large sample sizes will be required to associate rare variants with complex traits.
View details for DOI 10.1126/science.1219240
View details for Web of Science ID 000306053100043
View details for PubMedID 22604720
- Randomized Trial of Personal Genomics for Preventive Cardiology Design and Challenges CIRCULATION-CARDIOVASCULAR GENETICS 2012; 5 (3): 368-376
Type 2 Diabetes Risk Alleles Demonstrate Extreme Directional Differentiation among Human Populations, Compared to Other Diseases
2012; 8 (4): 100-115
Many disease-susceptible SNPs exhibit significant disparity in ancestral and derived allele frequencies across worldwide populations. While previous studies have examined population differentiation of alleles at specific SNPs, global ethnic patterns of ensembles of disease risk alleles across human diseases are unexamined. To examine these patterns, we manually curated ethnic disease association data from 5,065 papers on human genetic studies representing 1,495 diseases, recording the precise risk alleles and their measured population frequencies and estimated effect sizes. We systematically compared the population frequencies of cross-ethnic risk alleles for each disease across 1,397 individuals from 11 HapMap populations, 1,064 individuals from 53 HGDP populations, and 49 individuals with whole-genome sequences from 10 populations. Type 2 diabetes (T2D) demonstrated extreme directional differentiation of risk allele frequencies across human populations, compared with null distributions of European-frequency matched control genomic alleles and risk alleles for other diseases. Most T2D risk alleles share a consistent pattern of decreasing frequencies along human migration into East Asia. Furthermore, we show that these patterns contribute to disparities in predicted genetic risk across 1,397 HapMap individuals, T2D genetic risk being consistently higher for individuals in the African populations and lower in the Asian populations, irrespective of the ethnicity considered in the initial discovery of risk alleles. We observed a similar pattern in the distribution of T2D Genetic Risk Scores, which are associated with an increased risk of developing diabetes in the Diabetes Prevention Program cohort, for the same individuals. This disparity may be attributable to the promotion of energy storage and usage appropriate to environments and inconsistent energy intake. Our results indicate that the differential frequencies of T2D risk alleles may contribute to the observed disparity in T2D incidence rates across ethnic populations.
View details for DOI 10.1371/journal.pgen.1002621
View details for Web of Science ID 000303441800007
View details for PubMedID 22511877
- Detecting and annotating genetic variations using the HugeSeq pipeline NATURE BIOTECHNOLOGY 2012; 30 (3): 226-229
New insights into the Tyrolean Iceman's origin and phenotype as inferred by whole-genome sequencing
The Tyrolean Iceman, a 5,300-year-old Copper age individual, was discovered in 1991 on the Tisenjoch Pass in the Italian part of the Ötztal Alps. Here we report the complete genome sequence of the Iceman and show 100% concordance between the previously reported mitochondrial genome sequence and the consensus sequence generated from our genomic data. We present indications for recent common ancestry between the Iceman and present-day inhabitants of the Tyrrhenian Sea, that the Iceman probably had brown eyes, belonged to blood group O and was lactose intolerant. His genetic predisposition shows an increased risk for coronary heart disease and may have contributed to the development of previously reported vascular calcifications. Sequences corresponding to ~60% of the genome of Borrelia burgdorferi are indicative of the earliest human case of infection with the pathogen for Lyme borreliosis.
View details for DOI 10.1038/ncomms1701
View details for Web of Science ID 000302060100039
View details for PubMedID 22426219
Genomic Ancestry of North Africans Supports Back-to-Africa Migrations
2012; 8 (1)
North African populations are distinct from sub-Saharan Africans based on cultural, linguistic, and phenotypic attributes; however, the time and the extent of genetic divergence between populations north and south of the Sahara remain poorly understood. Here, we interrogate the multilayered history of North Africa by characterizing the effect of hypothesized migrations from the Near East, Europe, and sub-Saharan Africa on current genetic diversity. We present dense, genome-wide SNP genotyping array data (730,000 sites) from seven North African populations, spanning from Egypt to Morocco, and one Spanish population. We identify a gradient of likely autochthonous Maghrebi ancestry that increases from east to west across northern Africa; this ancestry is likely derived from "back-to-Africa" gene flow more than 12,000 years ago (ya), prior to the Holocene. The indigenous North African ancestry is more frequent in populations with historical Berber ethnicity. In most North African populations we also see substantial shared ancestry with the Near East, and to a lesser extent sub-Saharan Africa and Europe. To estimate the time of migration from sub-Saharan populations into North Africa, we implement a maximum likelihood dating method based on the distribution of migrant tracts. In order to first identify migrant tracts, we assign local ancestry to haplotypes using a novel, principal component-based analysis of three ancestral populations. We estimate that a migration of western African origin into Morocco began about 40 generations ago (approximately 1,200 ya); a migration of individuals with Nilotic ancestry into Egypt occurred about 25 generations ago (approximately 750 ya). Our genomic data reveal an extraordinarily complex history of migrations, involving at least five ancestral populations, into North Africa.
View details for DOI 10.1371/journal.pgen.1002397
View details for Web of Science ID 000300223400001
View details for PubMedID 22253600
Mutation Hot Spots in Yeast Caused by Long-Range Clustering of Homopolymeric Sequences
2012; 1 (1): 36-42
Evolutionary theory assumes that mutations occur randomly in the genome; however, studies performed in a variety of organisms indicate the existence of context-dependent mutation biases. Sources of mutagenesis variation across large genomic contexts (e.g., hundreds of bases) have not been identified. Here, we use high-coverage whole-genome sequencing of a conditional mismatch repair mutant line of diploid yeast to identify mutations that accumulated after 160 generations of growth. The vast majority of the mutations accumulated as insertion/deletions (in/dels) in homopolymeric [poly(dA:dT)] and repetitive DNA tracts. Surprisingly, the likelihood of an in/del mutation in a given poly(dA:dT) tract is increased by the presence of nearby poly(dA:dT) tracts in up to a 1,000 bp region centered on the given tract. Our work suggests that specific mutation hot spots can contribute disproportionately to the genetic variation that is introduced into populations and provides long-range genomic sequence context that contributes to mutagenesis.
View details for DOI 10.1016/j.celrep.2011.10.003
View details for Web of Science ID 000309709500006
View details for PubMedID 22832106
The genetic prehistory of southern Africa.
2012; 3: 1143-?
Southern and eastern African populations that speak non-Bantu languages with click consonants are known to harbour some of the most ancient genetic lineages in humans, but their relationships are poorly understood. Here, we report data from 23 populations analysed at over half a million single-nucleotide polymorphisms, using a genome-wide array designed for studying human history. The southern African Khoisan fall into two genetic groups, loosely corresponding to the northwestern and southeastern Kalahari, which we show separated within the last 30,000 years. We find that all individuals derive at least a few percent of their genomes from admixture with non-Khoisan populations that began ?1,200 years ago. In addition, the East African Hadza and Sandawe derive a fraction of their ancestry from admixture with a population related to the Khoisan, supporting the hypothesis of an ancient link between southern and eastern Africa.
View details for DOI 10.1038/ncomms2140
View details for PubMedID 23072811
An Aboriginal Australian Genome Reveals Separate Human Dispersals into Asia
2011; 334 (6052): 94-98
We present an Aboriginal Australian genomic sequence obtained from a 100-year-old lock of hair donated by an Aboriginal man from southern Western Australia in the early 20th century. We detect no evidence of European admixture and estimate contamination levels to be below 0.5%. We show that Aboriginal Australians are descendants of an early human dispersal into eastern Asia, possibly 62,000 to 75,000 years ago. This dispersal is separate from the one that gave rise to modern Asians 25,000 to 38,000 years ago. We also find evidence of gene flow between populations of the two dispersal waves prior to the divergence of Native Americans from modern Asian ancestors. Our findings support the hypothesis that present-day Aboriginal Australians descend from the earliest humans to occupy Australia, likely representing one of the oldest continuous populations outside Africa.
View details for DOI 10.1126/science.1211177
View details for Web of Science ID 000295580300047
View details for PubMedID 21940856
- SnapShot: Human Biomedical Genomics CELL 2011; 147 (1): 248-U262
- Reply to Ge and Sang: A single origin of domesticated rice PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA 2011; 108 (39): E756-E756
Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa
Asian rice, Oryza sativa is a cultivated, inbreeding species that feeds over half of the world's population. Understanding the genetic basis of diverse physiological, developmental, and morphological traits provides the basis for improving yield, quality and sustainability of rice. Here we show the results of a genome-wide association study based on genotyping 44,100 SNP variants across 413 diverse accessions of O. sativa collected from 82 countries that were systematically phenotyped for 34 traits. Using cross-population-based mapping strategies, we identified dozens of common variants influencing numerous complex traits. Significant heterogeneity was observed in the genetic architecture associated with subpopulation structure and response to environment. This work establishes an open-source translational research platform for genome-wide association studies in rice that directly links molecular variation in genes and metabolic pathways with the germplasm resources needed to accelerate varietal development and crop improvement.
View details for DOI 10.1038/ncomms1467
View details for Web of Science ID 000294807200009
View details for PubMedID 21915109
Phased Whole-Genome Genetic Risk in a Family Quartet Using a Major Allele Reference Sequence
2011; 7 (9)
Whole-genome sequencing harbors unprecedented potential for characterization of individual and family genetic variation. Here, we develop a novel synthetic human reference sequence that is ethnically concordant and use it for the analysis of genomes from a nuclear family with history of familial thrombophilia. We demonstrate that the use of the major allele reference sequence results in improved genotype accuracy for disease-associated variant loci. We infer recombination sites to the lowest median resolution demonstrated to date (< 1,000 base pairs). We use family inheritance state analysis to control sequencing error and inform family-wide haplotype phasing, allowing quantification of genome-wide compound heterozygosity. We develop a sequence-based methodology for Human Leukocyte Antigen typing that contributes to disease risk prediction. Finally, we advance methods for analysis of disease and pharmacogenomic risk across the coding and non-coding genome that incorporate phased variant data. We show these methods are capable of identifying multigenic risk for inherited thrombophilia and informing the appropriate pharmacological therapy. These ethnicity-specific, family-based approaches to interpretation of genetic variation are emblematic of the next generation of genetic risk assessment using whole-genome sequencing.
View details for DOI 10.1371/journal.pgen.1002280
View details for Web of Science ID 000295419100031
View details for PubMedID 21935354
A genome-wide perspective on the evolutionary history of enigmatic wolf-like canids
2011; 21 (8): 1294-1305
High-throughput genotyping technologies developed for model species can potentially increase the resolution of demographic history and ancestry in wild relatives. We use a SNP genotyping microarray developed for the domestic dog to assay variation in over 48K loci in wolf-like species worldwide. Despite the high mobility of these large carnivores, we find distinct hierarchical population units within gray wolves and coyotes that correspond with geographic and ecologic differences among populations. Further, we test controversial theories about the ancestry of the Great Lakes wolf and red wolf using an analysis of haplotype blocks across all 38 canid autosomes. We find that these enigmatic canids are highly admixed varieties derived from gray wolves and coyotes, respectively. This divergent genomic history suggests that they do not have a shared recent ancestry as proposed by previous researchers. Interspecific hybridization, as well as the process of evolutionary divergence, may be responsible for the observed phenotypic distinction of both forms. Such admixture complicates decisions regarding endangered species restoration and protection.
View details for DOI 10.1101/gr.116301.110
View details for Web of Science ID 000293335700009
View details for PubMedID 21566151
Genetic Architecture of Aluminum Tolerance in Rice (Oryza sativa) Determined through Genome-Wide Association Analysis and QTL Mapping
2011; 7 (8)
Aluminum (Al) toxicity is a primary limitation to crop productivity on acid soils, and rice has been demonstrated to be significantly more Al tolerant than other cereal crops. However, the mechanisms of rice Al tolerance are largely unknown, and no genes underlying natural variation have been reported. We screened 383 diverse rice accessions, conducted a genome-wide association (GWA) study, and conducted QTL mapping in two bi-parental populations using three estimates of Al tolerance based on root growth. Subpopulation structure explained 57% of the phenotypic variation, and the mean Al tolerance in Japonica was twice that of Indica. Forty-eight regions associated with Al tolerance were identified by GWA analysis, most of which were subpopulation-specific. Four of these regions co-localized with a priori candidate genes, and two highly significant regions co-localized with previously identified QTLs. Three regions corresponding to induced Al-sensitive rice mutants (ART1, STAR2, Nrat1) were identified through bi-parental QTL mapping or GWA to be involved in natural variation for Al tolerance. Haplotype analysis around the Nrat1 gene identified susceptible and tolerant haplotypes explaining 40% of the Al tolerance variation within the aus subpopulation, and sequence analysis of Nrat1 identified a trio of non-synonymous mutations predictive of Al sensitivity in our diversity panel. GWA analysis discovered more phenotype-genotype associations and provided higher resolution, but QTL mapping identified critical rare and/or subpopulation-specific alleles not detected by GWA analysis. Mapping using Indica/Japonica populations identified QTLs associated with transgressive variation where alleles from a susceptible aus or indica parent enhanced Al tolerance in a tolerant Japonica background. This work supports the hypothesis that selectively introgressing alleles across subpopulations is an efficient approach for trait enhancement in plant breeding programs and demonstrates the fundamental importance of subpopulation in interpreting and manipulating the genetics of complex traits in rice.
View details for DOI 10.1371/journal.pgen.1002221
View details for Web of Science ID 000294297000018
View details for PubMedID 21829395
Fast, Exact Linkage Analysis for Categorical Traits on Arbitrary Pedigree Designs
2011; 35 (5): 371-380
Multi-symptom diseases without a consistent continuous measurement of severity may be best understood with a categorical interpretation. In this paper, we present LOCate v.2, a fast, exact algorithm for linkage analysis of all types of categorical traits, both ordinal and nominal. Our method is able to incorporate missing data and analyze complex genealogical structure, including inbreeding loops. LOCate v.2 computes exact likelihoods efficiently through an elimination algorithm, similar to that used by Superlink for binary traits. We compare LOCate v.2 to LOT and QTLlink, two existing methods of linkage analysis for ordinal traits. We find that LOCate v.2 outperforms both methods when used to analyze simulated nominal traits. In addition, LOCate v.2 performs as well as QTLlink on simulated ordinal traits, and better than LOT due to the necessity of cutting large pedigrees for analysis in LOT. To demonstrate the versatility of LOCate v.2, we conduct an ordinal and nominal linkage analysis of ventricular arrhythmias in a large, inbred pedigree of German Shepherd dogs. We find that a trichotomous ordinal or nominal interpretation strengthens the evidence in favor of linkage to a region on chromosome 6, and provides new evidence of linkage to a region on chromosome 11. LOCate v.2 is a unified, fast, and robust method for linkage analysis of ordinal and nominal traits which will be valuable to researchers interested in investigating any type of categorical trait.
View details for DOI 10.1002/gepi.20585
View details for Web of Science ID 000291591100009
View details for PubMedID 21520271
On Identifying the Optimal Number of Population Clusters via the Deviance Information Criterion
2011; 6 (6)
Inferring population structure using bayesian clustering programs often requires a priori specification of the number of subpopulations, K, from which the sample has been drawn. Here, we explore the utility of a common bayesian model selection criterion, the Deviance Information Criterion (DIC), for estimating K. We evaluate the accuracy of DIC, as well as other popular approaches, on datasets generated by coalescent simulations under various demographic scenarios. We find that DIC outperforms competing methods in many genetic contexts, validating its application in assessing population structure.
View details for DOI 10.1371/journal.pone.0021014
View details for Web of Science ID 000292142800008
View details for PubMedID 21738600
Levels and Patterns of Nucleotide Variation in Domestication QTL Regions on Rice Chromosome 3 Suggest Lineage-Specific Selection
2011; 6 (6)
Oryza sativa or Asian cultivated rice is one of the major cereal grass species domesticated for human food use during the Neolithic. Domestication of this species from the wild grass Oryza rufipogon was accompanied by changes in several traits, including seed shattering, percent seed set, tillering, grain weight, and flowering time. Quantitative trait locus (QTL) mapping has identified three genomic regions in chromosome 3 that appear to be associated with these traits. We would like to study whether these regions show signatures of selection and whether the same genetic basis underlies the domestication of different rice varieties. Fragments of 88 genes spanning these three genomic regions were sequenced from multiple accessions of two major varietal groups in O. sativa--indica and tropical japonica--as well as the ancestral wild rice species O. rufipogon. In tropical japonica, the levels of nucleotide variation in these three QTL regions are significantly lower compared to genome-wide levels, and coalescent simulations based on a complex demographic model of rice domestication indicate that these patterns are consistent with selection. In contrast, there is no significant reduction in nucleotide diversity in the homologous regions in indica rice. These results suggest that there are differences in the genetic and selective basis for domestication between these two Asian rice varietal groups.
View details for DOI 10.1371/journal.pone.0020670
View details for Web of Science ID 000291356400016
View details for PubMedID 21674010
Molecular evidence for a single evolutionary origin of domesticated rice
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2011; 108 (20): 8351-8356
Asian rice, Oryza sativa, is one of world's oldest and most important crop species. Rice is believed to have been domesticated ?9,000 y ago, although debate on its origin remains contentious. A single-origin model suggests that two main subspecies of Asian rice, indica and japonica, were domesticated from the wild rice O. rufipogon. In contrast, the multiple independent domestication model proposes that these two major rice types were domesticated separately and in different parts of the species range of wild rice. This latter view has gained much support from the observation of strong genetic differentiation between indica and japonica as well as several phylogenetic studies of rice domestication. We reexamine the evolutionary history of domesticated rice by resequencing 630 gene fragments on chromosomes 8, 10, and 12 from a diverse set of wild and domesticated rice accessions. Using patterns of SNPs, we identify 20 putative selective sweeps on these chromosomes in cultivated rice. Demographic modeling based on these SNP data and a diffusion-based approach provide the strongest support for a single domestication origin of rice. Bayesian phylogenetic analyses implementing the multispecies coalescent and using previously published phylogenetic sequence datasets also point to a single origin of Asian domesticated rice. Finally, we date the origin of domestication at ?8,200-13,500 y ago, depending on the molecular clock estimate that is used, which is consistent with known archaeological data that suggests rice was first cultivated at around this time in the Yangtze Valley of China.
View details for DOI 10.1073/pnas.1104686108
View details for Web of Science ID 000290719600056
View details for PubMedID 21536870
Hunter-gatherer genomic diversity suggests a southern African origin for modern humans
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2011; 108 (13): 5154-5162
Africa is inferred to be the continent of origin for all modern human populations, but the details of human prehistory and evolution in Africa remain largely obscure owing to the complex histories of hundreds of distinct populations. We present data for more than 580,000 SNPs for several hunter-gatherer populations: the Hadza and Sandawe of Tanzania, and the ?Khomani Bushmen of South Africa, including speakers of the nearly extinct N|u language. We find that African hunter-gatherer populations today remain highly differentiated, encompassing major components of variation that are not found in other African populations. Hunter-gatherer populations also tend to have the lowest levels of genome-wide linkage disequilibrium among 27 African populations. We analyzed geographic patterns of linkage disequilibrium and population differentiation, as measured by F(ST), in Africa. The observed patterns are consistent with an origin of modern humans in southern Africa rather than eastern Africa, as is generally assumed. Additionally, genetic variation in African hunter-gatherer populations has been significantly affected by interaction with farmers and herders over the past 5,000 y, through both severe population bottlenecks and sex-biased migration. However, African hunter-gatherer populations continue to maintain the highest levels of genetic diversity in the world.
View details for DOI 10.1073/pnas.1017511108
View details for Web of Science ID 000288894800009
View details for PubMedID 21383195
Detecting Directional Selection in the Presence of Recent Admixture in African-Americans
2011; 187 (3): 823-835
We investigate the performance of tests of neutrality in admixed populations using plausible demographic models for African-American history as well as resequencing data from African and African-American populations. The analysis of both simulated and human resequencing data suggests that recent admixture does not result in an excess of false-positive results for neutrality tests based on the frequency spectrum after accounting for the population growth in the parental African population. Furthermore, when simulating positive selection, Tajima's D, Fu and Li's D, and haplotype homozygosity have lower power to detect population-specific selection using individuals sampled from the admixed population than from the nonadmixed population. Fay and Wu's H test, however, has more power to detect selection using individuals from the admixed population than from the nonadmixed population, especially when the selective sweep ended long ago. Our results have implications for interpreting recent genome-wide scans for positive selection in human populations.
View details for DOI 10.1534/genetics.110.122739
View details for Web of Science ID 000288457800016
View details for PubMedID 21196524
Genetic structure and domestication history of the grape
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2011; 108 (9): 3530-3535
The grape is one of the earliest domesticated fruit crops and, since antiquity, it has been widely cultivated and prized for its fruit and wine. Here, we characterize genome-wide patterns of genetic variation in over 1,000 samples of the domesticated grape, Vitis vinifera subsp. vinifera, and its wild relative, V. vinifera subsp. sylvestris from the US Department of Agriculture grape germplasm collection. We find support for a Near East origin of vinifera and present evidence of introgression from local sylvestris as the grape moved into Europe. High levels of genetic diversity and rapid linkage disequilibrium (LD) decay have been maintained in vinifera, which is consistent with a weak domestication bottleneck followed by thousands of years of widespread vegetative propagation. The considerable genetic diversity within vinifera, however, is contained within a complex network of close pedigree relationships that has been generated by crosses among elite cultivars. We show that first-degree relationships are rare between wine and table grapes and among grapes from geographically distant regions. Our results suggest that although substantial genetic diversity has been maintained in the grape subsequent to domestication, there has been a limited exploration of this diversity. We propose that the adoption of vegetative propagation was a double-edged sword: Although it provided a benefit by ensuring true breeding cultivars, it also discouraged the generation of unique cultivars through crosses. The grape currently faces severe pathogen pressures, and the long-term sustainability of the grape and wine industries will rely on the exploitation of the grape's tremendous natural genetic diversity.
View details for DOI 10.1073/pnas.1009363108
View details for Web of Science ID 000287844400021
View details for PubMedID 21245334
A Population Genetic Approach to Mapping Neurological Disorder Genes Using Deep Resequencing
2011; 7 (2)
Deep resequencing of functional regions in human genomes is key to identifying potentially causal rare variants for complex disorders. Here, we present the results from a large-sample resequencing (n ?= ?285 patients) study of candidate genes coupled with population genetics and statistical methods to identify rare variants associated with Autism Spectrum Disorder and Schizophrenia. Three genes, MAP1A, GRIN2B, and CACNA1F, were consistently identified by different methods as having significant excess of rare missense mutations in either one or both disease cohorts. In a broader context, we also found that the overall site frequency spectrum of variation in these cases is best explained by population models of both selection and complex demography rather than neutral models or models accounting for complex demography alone. Mutations in the three disease-associated genes explained much of the difference in the overall site frequency spectrum among the cases versus controls. This study demonstrates that genes associated with complex disorders can be mapped using resequencing and analytical methods with sample sizes far smaller than those required by genome-wide association studies. Additionally, our findings support the hypothesis that rare mutations account for a proportion of the phenotypic variance of these complex disorders.
View details for DOI 10.1371/journal.pgen.1001318
View details for Web of Science ID 000287697300033
View details for PubMedID 21383861
Comparative and demographic analysis of orang-utan genomes
2011; 469 (7331): 529-533
'Orang-utan' is derived from a Malay term meaning 'man of the forest' and aptly describes the southeast Asian great apes native to Sumatra and Borneo. The orang-utan species, Pongo abelii (Sumatran) and Pongo pygmaeus (Bornean), are the most phylogenetically distant great apes from humans, thereby providing an informative perspective on hominid evolution. Here we present a Sumatran orang-utan draft genome assembly and short read sequence data from five Sumatran and five Bornean orang-utan genomes. Our analyses reveal that, compared to other primates, the orang-utan genome has many unique features. Structural evolution of the orang-utan genome has proceeded much more slowly than other great apes, evidenced by fewer rearrangements, less segmental duplication, a lower rate of gene family turnover and surprisingly quiescent Alu repeats, which have played a major role in restructuring other primate genomes. We also describe a primate polymorphic neocentromere, found in both Pongo species, emphasizing the gradual evolution of orang-utan genome structure. Orang-utans have extremely low energy usage for a eutherian mammal, far lower than their hominid relatives. Adding their genome to the repertoire of sequenced primates illuminates new signals of positive selection in several pathways including glycolipid metabolism. From the population perspective, both Pongo species are deeply diverse; however, Sumatran individuals possess greater diversity than their Bornean counterparts, and more species-specific variation. Our estimate of Bornean/Sumatran speciation time, 400,000?years ago, is more recent than most previous studies and underscores the complexity of the orang-utan speciation process. Despite a smaller modern census population size, the Sumatran effective population size (N(e)) expanded exponentially relative to the ancestral N(e) after the split, while Bornean N(e) declined over the same period. Overall, the resources and analyses presented here offer new opportunities in evolutionary genomics, insights into hominid biology, and an extensive database of variation for conservation efforts.
View details for DOI 10.1038/nature09687
View details for Web of Science ID 000286596500032
View details for PubMedID 21270892
The functional spectrum of low-frequency coding variation
2011; 12 (9)
Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency.The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently afoot, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. According to the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants.This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.
View details for DOI 10.1186/gb-2011-12-9-r84
View details for Web of Science ID 000298926900001
View details for PubMedID 21917140
GENOME-WIDE ASSOCIATION MAPPING AND RARE ALLELES: FROM POPULATION GENOMICS TO PERSONALIZED MEDICINE - Session Introduction.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
Genome-wide associations studies (GWAS) have been very successful in identifying common genetic variation associated to numerous complex diseases . However, most of the identified common genetic variants appear to confer modest risk and few causal alleles have been identified . Furthermore, these associations account for a small portion of the total heritability of inherited disease variation . This has led to the reexamination of the contribution of environment, gene-gene and gene-environment interactions, and rare genetic variants in complex diseases [1, 3, 4]. There is strong evidence that rare variants play an important role in complex disease etiology and may have larger genetic effects than common variants . Currently, much of what we know regarding the contribution of rare genetic variants to disease risk is based on a limited number of phenotypes and candidate genes. However, rapid advancement of second generation sequencing technologies will invariably lead to widespread association studies comparing whole exome and eventually whole genome sequencing of cases and controls. A tremendous challenge for enabling these "next generation" medical genomic studies is developing statistical approaches for correlating rare genetic variants with disease outcome. The analysis of rare variants is challenging since methods used for common variants are woefully underpowered. Therefore, methods that can deal with genetic heterogeneity at the trait-associated locus have been developed to analyze rare variants. These methods instead analyzing individual variants analyze variants within a region/gene as a group and usually rely on collapsing. They can be applied to both in cases vs. controls and quantitative trait studies are needed. The paper of Bansal et al. in this volume describes the application of a number of statistical methods for testing associations between rare variants in two genes to obesity. The authors considered the relative merits of the different methods as well as important implementation details, such as the leveraging of genomic annotations and determining p-values. Knowledge of haplotypes can increase the power of GWAS studies and also highlight associations that are impossible to detect without haplotype phase (e.g. loss of heterozygosity). Even more complicated phase-dependent interactions of variants in linkage equilibrium have also been suggested as possible causes of missing heritability. In their work, Hallsorsson et al. formulate algorithmic strategies for haplotype phasing by multi-assembly of shared haplotypes from next-generation sequencing data. These methods would allow testing haplotypes harboring rare variants for association and potentially increase their explanatory power. Since single SNP tests are often underpowered in rare variant association analysis, Zeggini and Asimit propose a locus-based method that has high power in the presence of rare variants and that incorporate base quality scores available for sequencing data. Their results suggest that this multi-marker approach may be best suited for smaller regions, or after some filtering to reduce the number of SNPs that are jointly tested to reduce loss of power due to multiple-testing adjustments. Finally, the paper of Zhou et al., presents a penalized regression framework for association testing on sequence data, in the presence of both common and rare variants. This method also introduces the use of weights to incorporate available biological information on the variants. Although these tactics improve both false positive and false negative rates, they represent an incremental development and there is still significant room for improvement. With the development of sequencing technologies and methods to detect complex trait rare variant associations many new and exciting discovery are imminent. The analysis of rare variants is still in its infancy and the next few years promises to produce many new methods to meet the special demands of analyzing this type of data. Note from Publisher: This article contains the abstract and references.
View details for PubMedID 21121034
- HUMAN ORIGINS Shadows of early migrations NATURE 2010; 468 (7327): 1044-1045
ALCHEMY: a reliable method for automated SNP genotype calling for small batch sizes and highly homozygous populations
2010; 26 (23): 2952-2960
The development of new high-throughput genotyping products requires a significant investment in testing and training samples to evaluate and optimize the product before it can be used reliably on new samples. One reason for this is current methods for automated calling of genotypes are based on clustering approaches which require a large number of samples to be analyzed simultaneously, or an extensive training dataset to seed clusters. In systems where inbred samples are of primary interest, current clustering approaches perform poorly due to the inability to clearly identify a heterozygote cluster.As part of the development of two custom single nucleotide polymorphism genotyping products for Oryza sativa (domestic rice), we have developed a new genotype calling algorithm called 'ALCHEMY' based on statistical modeling of the raw intensity data rather than modelless clustering. A novel feature of the model is the ability to estimate and incorporate inbreeding information on a per sample basis allowing accurate genotyping of both inbred and heterozygous samples even when analyzed simultaneously. Since clustering is not used explicitly, ALCHEMY performs well on small sample sizes with accuracy exceeding 99% with as few as 18 samples.ALCHEMY is available for both commercial and academic use free of charge and distributed under the GNU General Public License at http://firstname.lastname@example.orgSupplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btq533
View details for Web of Science ID 000284430900005
View details for PubMedID 20926420
Fine-scale population structure and the era of next-generation sequencing
HUMAN MOLECULAR GENETICS
2010; 19: R221-R226
Fine-scale population structure characterizes most continents and is especially pronounced in non-cosmopolitan populations. Roughly half of the world's population remains non-cosmopolitan and even populations within cities often assort along ethnic and linguistic categories. Barriers to random mating can be ecologically extreme, such as the Sahara Desert, or cultural, such as the Indian caste system. In either case, subpopulations accumulate genetic differences if the barrier is maintained over multiple generations. Genome-wide polymorphism data, initially with only a few hundred autosomal microsatellites, have clearly established differences in allele frequency not only among continental regions, but also within continents and within countries. We review recent evidence from the analysis of genome-wide polymorphism data for genetic boundaries delineating human population structure and the main demographic and genomic processes shaping variation, and discuss the implications of population structure for the distribution and discovery of disease-causing genetic variants, in the light of the imminent availability of sequencing data for a multitude of diverse human genomes.
View details for DOI 10.1093/hmg/ddq403
View details for Web of Science ID 000284456100014
View details for PubMedID 20876616
Detection of Heterozygous Mutations in the Genome of Mismatch Repair Defective Diploid Yeast Using a Bayesian Approach
2010; 186 (2): 493-503
DNA replication errors that escape polymerase proofreading and mismatch repair (MMR) can lead to base substitution and frameshift mutations. Such mutations can disrupt gene function, reduce fitness, and promote diseases such as cancer and are also the raw material of molecular evolution. To analyze with limited bias genomic features associated with DNA polymerase errors, we performed a genome-wide analysis of mutations that accumulate in MMR-deficient diploid lines of Saccharomyces cerevisiae. These lines were derived from a common ancestor and were grown for 160 generations, with bottlenecks reducing the population to one cell every 20 generations. We sequenced to between 8- and 20-fold coverage one wild-type and three mutator lines using Illumina Solexa 36-bp reads. Using an experimentally aware Bayesian genotype caller developed to pool experimental data across sequencing runs for all strains, we detected 28 heterozygous single-nucleotide polymorphisms (SNPs) and 48 single-nt insertion/deletions (indels) from the data set. This method was evaluated on simulated data sets and found to have a very low false-positive rate (?6 × 10(-5)) and a false-negative rate of 0.08 within the unique mapping regions of the genome that contained at least sevenfold coverage. The heterozygous mutations identified by the Bayesian genotype caller were confirmed by Sanger sequencing. All of the mutations were unique to a given line, except for a single-nt deletion mutation which occurred independently in two lines. All 48 indels, composed of 46 deletions and two insertions, occurred in homopolymer (HP) tracts [i.e., 47 poly(A) or (T) tracts, 1 poly(G) or (C) tract] between 5 and 13 bp long. Our findings are of interest because HP tracts are present at high levels in the yeast genome (>77,400 for 5- to 20-nt HP tracts), and frameshift mutations in these regions are likely to disrupt gene function. In addition, they demonstrate that the mutation pattern seen previously in mismatch repair defective strains using a limited number of reporters holds true for the entire genome.
View details for DOI 10.1534/genetics.110.120105
View details for Web of Science ID 000282807400005
View details for PubMedID 20660644
Balancing Selection Maintains a Form of ERAP2 that Undergoes Nonsense-Mediated Decay and Affects Antigen Presentation
2010; 6 (10)
A remarkable characteristic of the human major histocompatibility complex (MHC) is its extreme genetic diversity, which is maintained by balancing selection. In fact, the MHC complex remains one of the best-known examples of natural selection in humans, with well-established genetic signatures and biological mechanisms for the action of selection. Here, we present genetic and functional evidence that another gene with a fundamental role in MHC class I presentation, endoplasmic reticulum aminopeptidase 2 (ERAP2), has also evolved under balancing selection and contains a variant that affects antigen presentation. Specifically, genetic analyses of six human populations revealed strong and consistent signatures of balancing selection affecting ERAP2. This selection maintains two highly differentiated haplotypes (Haplotype A and Haplotype B), with frequencies 0.44 and 0.56, respectively. We found that ERAP2 expressed from Haplotype B undergoes differential splicing and encodes a truncated protein, leading to nonsense-mediated decay of the mRNA. To investigate the consequences of ERAP2 deficiency on MHC presentation, we correlated surface MHC class I expression with ERAP2 genotypes in primary lymphocytes. Haplotype B homozygotes had lower levels of MHC class I expressed on the surface of B cells, suggesting that naturally occurring ERAP2 deficiency affects MHC presentation and immune response. Interestingly, an ERAP2 paralog, endoplasmic reticulum aminopeptidase 1 (ERAP1), also shows genetic signatures of balancing selection. Together, our findings link the genetic signatures of selection with an effect on splicing and a cellular phenotype. Although the precise selective pressure that maintains polymorphism is unknown, the demonstrated differences between the ERAP2 splice forms provide important insights into the potential mechanism for the action of selection.
View details for DOI 10.1371/journal.pgen.1001157
View details for Web of Science ID 000283647800028
View details for PubMedID 20976248
The Baker's Yeast Diploid Genome Is Remarkably Stable in Vegetative Growth and Meiosis
2010; 6 (9)
Accurate estimates of mutation rates provide critical information to analyze genome evolution and organism fitness. We used whole-genome DNA sequencing, pulse-field gel electrophoresis, and comparative genome hybridization to determine mutation rates in diploid vegetative and meiotic mutation accumulation lines of Saccharomyces cerevisiae. The vegetative lines underwent only mitotic divisions while the meiotic lines underwent a meiotic cycle every ?20 vegetative divisions. Similar base substitution rates were estimated for both lines. Given our experimental design, these measures indicated that the meiotic mutation rate is within the range of being equal to zero to being 55-fold higher than the vegetative rate. Mutations detected in vegetative lines were all heterozygous while those in meiotic lines were homozygous. A quantitative analysis of intra-tetrad mating events in the meiotic lines showed that inter-spore mating is primarily responsible for rapidly fixing mutations to homozygosity as well as for removing mutations. We did not observe 1-2 nt insertion/deletion (in-del) mutations in any of the sequenced lines and only one structural variant in a non-telomeric location was found. However, a large number of structural variations in subtelomeric sequences were seen in both vegetative and meiotic lines that did not affect viability. Our results indicate that the diploid yeast nuclear genome is remarkably stable during the vegetative and meiotic cell cycles and support the hypothesis that peripheral regions of chromosomes are more dynamic than gene-rich central sections where structural rearrangements could be deleterious. This work also provides an improved estimate for the mutational load carried by diploid organisms.
View details for DOI 10.1371/journal.pgen.1001109
View details for Web of Science ID 000282369200047
View details for PubMedID 20838597
Bayesian Linkage Analysis of Categorical Traits for Arbitrary Pedigree Designs
2010; 5 (8)
Pedigree studies of complex heritable diseases often feature nominal or ordinal phenotypic measurements and missing genetic marker or phenotype data.We have developed a Bayesian method for Linkage analysis of Ordinal and Categorical traits (LOCate) that can analyze complex genealogical structure for family groups and incorporate missing data. LOCate uses a Gibbs sampling approach to assess linkage, incorporating a simulated tempering algorithm for fast mixing. While our treatment is Bayesian, we develop a LOD (log of odds) score estimator for assessing linkage from Gibbs sampling that is highly accurate for simulated data. LOCate is applicable to linkage analysis for ordinal or nominal traits, a versatility which we demonstrate by analyzing simulated data with a nominal trait, on which LOCate outperforms LOT, an existing method which is designed for ordinal traits. We additionally demonstrate our method's versatility by analyzing a candidate locus (D2S1788) for panic disorder in humans, in a dataset with a large amount of missing data, which LOT was unable to handle.LOCate's accuracy and applicability to both ordinal and nominal traits will prove useful to researchers interested in mapping loci for categorical traits.
View details for DOI 10.1371/journal.pone.0012307
View details for Web of Science ID 000281292100005
View details for PubMedID 20865038
An ADAM9 mutation in canine cone-rod dystrophy 3 establishes homology with human cone-rod dystrophy 9
2010; 16 (167-70): 1549-1569
To identify the causative mutation in a canine cone-rod dystrophy (crd3) that segregates as an adult onset disorder in the Glen of Imaal Terrier breed of dog.Glen of Imaal Terriers were ascertained for crd3 phenotype by clinical ophthalmoscopic examination, and in selected cases by electroretinography. Blood samples from affected cases and non-affected controls were collected and used, after DNA extraction, to undertake a genome-wide association study using Affymetrix Version 2 Canine single nucleotide polymorphism chips and 250K Sty Assay protocol. Positional candidate gene analysis was undertaken for genes identified within the peak-association signal region. Retinal morphology of selected crd3-affected dogs was evaluated by light and electron microscopy.A peak association signal exceeding genome-wide significance was identified on canine chromosome 16. Evaluation of genes in this region suggested A Disintegrin And Metalloprotease domain, family member 9 (ADAM9), identified concurrently elsewhere as the cause of human cone-rod dystrophy 9 (CORD9), as a strong positional candidate for canine crd3. Sequence analysis identified a large genomic deletion (over 20 kb) that removed exons 15 and 16 from the ADAM9 transcript, introduced a premature stop, and would remove critical domains from the encoded protein. Light and electron microscopy established that, as in ADAM9 knockout mice, the primary lesion in crd3 appears to be a failure of the apical microvilli of the retinal pigment epithelium to appropriately invest photoreceptor outer segments. By electroretinography, retinal function appears normal in very young crd3-affected dogs, but by 15 months of age, cone dysfunction is present. Subsequently, both rod and cone function degenerate.Identification of this ADAM9 deletion in crd3-affected dogs establishes this canine disease as orthologous to CORD9 in humans, and offers opportunities for further characterization of the disease process, and potential for genetic therapeutic intervention.
View details for Web of Science ID 000281314900001
View details for PubMedID 20806078
A Simple Genetic Architecture Underlies Morphological Variation in Dogs
2010; 8 (8)
Domestic dogs exhibit tremendous phenotypic diversity, including a greater variation in body size than any other terrestrial mammal. Here, we generate a high density map of canine genetic variation by genotyping 915 dogs from 80 domestic dog breeds, 83 wild canids, and 10 outbred African shelter dogs across 60,968 single-nucleotide polymorphisms (SNPs). Coupling this genomic resource with external measurements from breed standards and individuals as well as skeletal measurements from museum specimens, we identify 51 regions of the dog genome associated with phenotypic variation among breeds in 57 traits. The complex traits include average breed body size and external body dimensions and cranial, dental, and long bone shape and size with and without allometric scaling. In contrast to the results from association mapping of quantitative traits in humans and domesticated plants, we find that across dog breeds, a small number of quantitative trait loci (< or = 3) explain the majority of phenotypic variation for most of the traits we studied. In addition, many genomic regions show signatures of recent selection, with most of the highly differentiated regions being associated with breed-defining traits such as body size, coat characteristics, and ear floppiness. Our results demonstrate the efficacy of mapping multiple traits in the domestic dog using a database of genotyped individuals and highlight the important role human-directed selection has played in altering the genetic architecture of key traits in this important species.
View details for DOI 10.1371/journal.pbio.1000451
View details for Web of Science ID 000281464500009
View details for PubMedID 20711490
Successful Computational Prediction of Novel Imprinted Genes from Epigenomic Features
MOLECULAR AND CELLULAR BIOLOGY
2010; 30 (13): 3357-3370
Approximately 100 mouse genes undergo genomic imprinting, whereby one of the two parental alleles is epigenetically silenced. Imprinted genes influence processes including development, X chromosome inactivation, obesity, schizophrenia, and diabetes, motivating the identification of all imprinted loci. Local sequence features have been used to predict candidate imprinted genes, but rigorous testing using reciprocal crosses validated only three, one of which resided in previously identified imprinting clusters. Here we show that specific epigenetic features in mouse cells correlate with imprinting status in mice, and we identify hundreds of additional genes predicted to be imprinted in the mouse. We used a multitiered approach to validate imprinted expression, including use of a custom single nucleotide polymorphism array and traditional molecular methods. Of 65 candidates subjected to molecular assays for allele-specific expression, we found 10 novel imprinted genes that were maternally expressed in the placenta.
View details for DOI 10.1128/MCB.01355-09
View details for Web of Science ID 000278626100018
View details for PubMedID 20421412
The Effect of Recent Admixture on Inference of Ancient Human Population History
2010; 185 (2): 611-U327
Despite the widespread study of genetic variation in admixed human populations, such as African-Americans, there has not been an evaluation of the effects of recent admixture on patterns of polymorphism or inferences about population demography. These issues are particularly relevant because estimates of the timing and magnitude of population growth in Africa have differed among previous studies, some of which examined African-American individuals. Here we use simulations and single-nucleotide polymorphism (SNP) data collected through direct resequencing and genotyping to investigate these issues. We find that when estimating the current population size and magnitude of recent growth in an ancestral population using the site frequency spectrum (SFS), it is possible to obtain reasonably accurate estimates of the parameters when using samples drawn from the admixed population under certain conditions. We also show that methods for demographic inference that use haplotype patterns are more sensitive to recent admixture than are methods based on the SFS. The analysis of human genetic variation data from the Yoruba people of Ibadan, Nigeria and African-Americans supports the predictions from the simulations. Our results have important implications for the evaluation of previous population genetic studies that have considered African-American individuals as a proxy for individuals from West Africa as well as for future population genetic studies of additional admixed populations.
View details for DOI 10.1534/genetics.109.113761
View details for Web of Science ID 000281905200018
View details for PubMedID 20382834
Genomic Diversity and Introgression in O. sativa Reveal the Impact of Domestication and Breeding on the Rice Genome
2010; 5 (5)
The domestication of Asian rice (Oryza sativa) was a complex process punctuated by episodes of introgressive hybridization among and between subpopulations. Deep genetic divergence between the two main varietal groups (Indica and Japonica) suggests domestication from at least two distinct wild populations. However, genetic uniformity surrounding key domestication genes across divergent subpopulations suggests cultural exchange of genetic material among ancient farmers.In this study, we utilize a novel 1,536 SNP panel genotyped across 395 diverse accessions of O. sativa to study genome-wide patterns of polymorphism, to characterize population structure, and to infer the introgression history of domesticated Asian rice. Our population structure analyses support the existence of five major subpopulations (indica, aus, tropical japonica, temperate japonica and GroupV) consistent with previous analyses. Our introgression analysis shows that most accessions exhibit some degree of admixture, with many individuals within a population sharing the same introgressed segment due to artificial selection. Admixture mapping and association analysis of amylose content and grain length illustrate the potential for dissecting the genetic basis of complex traits in domesticated plant populations.Genes in these regions control a myriad of traits including plant stature, blast resistance, and amylose content. These analyses highlight the power of population genomics in agricultural systems to identify functionally important regions of the genome and to decipher the role of human-directed breeding in refashioning the genomes of a domesticated species.
View details for Web of Science ID 000278034600010
View details for PubMedID 20520727
Genome-wide patterns of population structure and admixture among Hispanic/Latino populations
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2010; 107: 8954-8961
Hispanic/Latino populations possess a complex genetic structure that reflects recent admixture among and potentially ancient substructure within Native American, European, and West African source populations. Here, we quantify genome-wide patterns of SNP and haplotype variation among 100 individuals with ancestry from Ecuador, Colombia, Puerto Rico, and the Dominican Republic genotyped on the Illumina 610-Quad arrays and 112 Mexicans genotyped on Affymetrix 500K platform. Intersecting these data with previously collected high-density SNP data from 4,305 individuals, we use principal component analysis and clustering methods FRAPPE and STRUCTURE to investigate genome-wide patterns of African, European, and Native American population structure within and among Hispanic/Latino populations. Comparing autosomal, X and Y chromosome, and mtDNA variation, we find evidence of a significant sex bias in admixture proportions consistent with disproportionate contribution of European male and Native American female ancestry to present-day populations. We also find that patterns of linkage-disequilibria in admixed Hispanic/Latino populations are largely affected by the admixture dynamics of the populations, with faster decay of LD in populations of higher African ancestry. Finally, using the locus-specific ancestry inference method LAMP, we reconstruct fine-scale chromosomal patterns of admixture. We document moderate power to differentiate among potential subcontinental source populations within the Native American, European, and African segments of the admixed Hispanic/Latino genomes. Our results suggest future genome-wide association scans in Hispanic/Latino populations may require correction for local genomic ancestry at a subcontinental scale when associating differences in the genome with disease risk, progression, and drug efficacy, as well as for admixture mapping.
View details for DOI 10.1073/pnas.0914618107
View details for Web of Science ID 000277553800009
Genome-wide SNP and haplotype analyses reveal a rich history underlying dog domestication
2010; 464 (7290): 898-U109
Advances in genome technology have facilitated a new understanding of the historical and genetic processes crucial to rapid phenotypic evolution under domestication. To understand the process of dog diversification better, we conducted an extensive genome-wide survey of more than 48,000 single nucleotide polymorphisms in dogs and their wild progenitor, the grey wolf. Here we show that dog breeds share a higher proportion of multi-locus haplotypes unique to grey wolves from the Middle East, indicating that they are a dominant source of genetic diversity for dogs rather than wolves from east Asia, as suggested by mitochondrial DNA sequence data. Furthermore, we find a surprising correspondence between genetic and phenotypic/functional breed groupings but there are exceptions that suggest phenotypic diversification depended in part on the repeated crossing of individuals with novel phenotypes. Our results show that Middle Eastern wolves were a critical source of genome diversity, although interbreeding with local wolf populations clearly occurred elsewhere in the early history of specific lineages. More recently, the evolution of modern dog breeds seems to have been an iterative process that drew on a limited genetic toolkit to create remarkable phenotypic diversity.
View details for DOI 10.1038/nature08837
View details for Web of Science ID 000276397300039
View details for PubMedID 20237475
A genome-wide linkage scan in German shepherd dogs localizes canine platelet procoagulant deficiency (Scott syndrome) to canine chromosome 27
2010; 450 (1-2): 70-75
Scott syndrome is a rare hereditary bleeding disorder associated with an inability of stimulated platelets to externalize the negatively charged phospholipid, phosphatidylserine (PS). Canine Scott syndrome (CSS) is the only naturally occurring animal model of this defect and therefore represents a unique tool to discover a disease gene capable of producing this platelet phenotype. We undertook platelet function studies and linkage analyses in a pedigree of CSS-affected German shepherd dogs. Based on residual serum prothrombin and flow cytometric assays, CSS segregates as an autosomal recessive trait. An initial genome scan, performed by genotyping 48 dogs for 280 microsatellite markers, suggested linkage with markers on chromosome 27. Genotypes ultimately obtained for a total of 56 dogs at 11 markers on chromosome 27 revealed significant LOD scores for 2 markers near the centromere, with multipoint linkage indicating a CSS trait locus spanning approximately 14 cm. These results provide the basis for fine mapping studies to narrow the disease interval and target the evaluation of putative disease genes.
View details for DOI 10.1016/j.gene.2009.09.016
View details for Web of Science ID 000273925100010
View details for PubMedID 19854246
SNP identification, verification, and utility for population genetics in a non-model genus.
2010; 11: 32-?
By targeting SNPs contained in both coding and non-coding areas of the genome, we are able to identify genetic differences and characterize genome-wide patterns of variation among individuals, populations and species. We investigated the utility of 454 sequencing and MassARRAY genotyping for population genetics in natural populations of the teleost, Fundulus heteroclitus as well as closely related Fundulus species (F. grandis, F. majalis and F. similis).We used 454 pyrosequencing and MassARRAY genotyping technology to identify and type 458 genome-wide SNPs and determine genetic differentiation within and between populations and species of Fundulus. Specifically, pyrosequencing identified 96 putative SNPs across coding and non-coding regions of the F. heteroclitus genome: 88.8% were verified as true SNPs with MassARRAY. Additionally, putative SNPs identified in F. heteroclitus EST sequences were verified in most (86.5%) F. heteroclitus individuals; fewer were genotyped in F. grandis (74.4%), F. majalis (72.9%), and F. similis (60.7%) individuals. SNPs were polymorphic and showed latitudinal clinal variation separating northern and southern populations and established isolation by distance in F. heteroclitus populations. In F. grandis, SNPs were less polymorphic but still established isolation by distance. Markers differentiated species and populations.In total, these approaches were used to quickly determine differences within the Fundulus genome and provide markers for population genetic studies.
View details for DOI 10.1186/1471-2156-11-32
View details for PubMedID 20433726
Targets of Balancing Selection in the Human Genome
MOLECULAR BIOLOGY AND EVOLUTION
2009; 26 (12): 2755-2764
Balancing selection is potentially an important biological force for maintaining advantageous genetic diversity in populations, including variation that is responsible for long-term adaptation to the environment. By serving as a means to maintain genetic variation, it may be particularly relevant to maintaining phenotypic variation in natural populations. Nevertheless, its prevalence and specific targets in the human genome remain largely unknown. We have analyzed the patterns of diversity and divergence of 13,400 genes in two human populations using an unbiased single-nucleotide polymorphism data set, a genome-wide approach, and a method that incorporates demography in neutrality tests. We identified an unbiased catalog of genes with signatures of long-term balancing selection, which includes immunity genes as well as genes encoding keratins and membrane channels; the catalog also shows enrichment in functional categories involved in cellular structure. Patterns are mostly concordant in the two populations, with a small fraction of genes showing population-specific signatures of selection. Power considerations indicate that our findings represent a subset of all targets in the genome, suggesting that although balancing selection may not have an obvious impact on a large proportion of human genes, it is a key force affecting the evolution of a number of genes in humans.
View details for DOI 10.1093/molbev/msp190
View details for Web of Science ID 000271818500010
View details for PubMedID 19713326
Coat Variation in the Domestic Dog Is Governed by Variants in Three Genes
2009; 326 (5949): 150-153
Coat color and type are essential characteristics of domestic dog breeds. Although the genetic basis of coat color has been well characterized, relatively little is known about the genes influencing coat growth pattern, length, and curl. We performed genome-wide association studies of more than 1000 dogs from 80 domestic breeds to identify genes associated with canine fur phenotypes. Taking advantage of both inter- and intrabreed variability, we identified distinct mutations in three genes, RSPO2, FGF5, and KRT71 (encoding R-spondin-2, fibroblast growth factor-5, and keratin-71, respectively), that together account for most coat phenotypes in purebred dogs in the United States. Thus, an array of varied and seemingly complex phenotypes can be reduced to the combinatorial effects of only a few genes.
View details for DOI 10.1126/science.1177808
View details for Web of Science ID 000270355600058
View details for PubMedID 19713490
Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data
2009; 5 (10)
Demographic models built from genetic data play important roles in illuminating prehistorical events and serving as null models in genome scans for selection. We introduce an inference method based on the joint frequency spectrum of genetic variants within and between populations. For candidate models we numerically compute the expected spectrum using a diffusion approximation to the one-locus, two-allele Wright-Fisher process, involving up to three simultaneous populations. Our approach is a composite likelihood scheme, since linkage between neutral loci alters the variance but not the expectation of the frequency spectrum. We thus use bootstraps incorporating linkage to estimate uncertainties for parameters and significance values for hypothesis tests. Our method can also incorporate selection on single sites, predicting the joint distribution of selected alleles among populations experiencing a bevy of evolutionary forces, including expansions, contractions, migrations, and admixture. We model human expansion out of Africa and the settlement of the New World, using 5 Mb of noncoding DNA resequenced in 68 individuals from 4 populations (YRI, CHB, CEU, and MXL) by the Environmental Genome Project. We infer divergence between West African and Eurasian populations 140 thousand years ago (95% confidence interval: 40-270 kya). This is earlier than other genetic studies, in part because we incorporate migration. We estimate the European (CEU) and East Asian (CHB) divergence time to be 23 kya (95% c.i.: 17-43 kya), long after archeological evidence places modern humans in Europe. Finally, we estimate divergence between East Asians (CHB) and Mexican-Americans (MXL) of 22 kya (95% c.i.: 16.3-26.9 kya), and our analysis yields no evidence for subsequent migration. Furthermore, combining our demographic model with a previously estimated distribution of selective effects among newly arising amino acid mutations accurately predicts the frequency spectrum of nonsynonymous variants across three continental populations (YRI, CHB, CEU).
View details for DOI 10.1371/journal.pgen.1000695
View details for Web of Science ID 000272032100033
View details for PubMedID 19851460
An Expressed Fgf4 Retrogene Is Associated with Breed-Defining Chondrodysplasia in Domestic Dogs
2009; 325 (5943): 995-998
Retrotransposition of processed mRNAs is a common source of novel sequence acquired during the evolution of genomes. Although the vast majority of retroposed gene copies, or retrogenes, rapidly accumulate debilitating mutations that disrupt the reading frame, a small percentage become new genes that encode functional proteins. By using a multibreed association analysis in the domestic dog, we demonstrate that expression of a recently acquired retrogene encoding fibroblast growth factor 4 (fgf4) is strongly associated with chondrodysplasia, a short-legged phenotype that defines at least 19 dog breeds including dachshund, corgi, and basset hound. These results illustrate the important role of a single evolutionary event in constraining and directing phenotypic diversity in the domestic dog.
View details for DOI 10.1126/science.1173275
View details for Web of Science ID 000269242800046
View details for PubMedID 19608863
Complex population structure in African village dogs and its implications for inferring dog domestication history
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2009; 106 (33): 13903-13908
High genetic diversity of East Asian village dogs has recently been used to argue for an East Asian origin of the domestic dog. However, global village dog genetic diversity and the extent to which semiferal village dogs represent distinct, indigenous populations instead of admixtures of various dog breeds has not been quantified. Understanding these issues is critical to properly reconstructing the timing, number, and locations of dog domestication. To address these questions, we sampled 318 village dogs from 7 regions in Egypt, Uganda, and Namibia, measuring genetic diversity >680 bp of the mitochondrial D-loop, 300 SNPs, and 89 microsatellite markers. We also analyzed breed dogs, including putatively African breeds (Afghan hounds, Basenjis, Pharaoh hounds, Rhodesian ridgebacks, and Salukis), Puerto Rican street dogs, and mixed breed dogs from the United States. Village dogs from most African regions appear genetically distinct from non-native breed and mixed-breed dogs, although some individuals cluster genetically with Puerto Rican dogs or United States breed mixes instead of with neighboring village dogs. Thus, African village dogs are a mosaic of indigenous dogs descended from early migrants to Africa, and non-native, breed-admixed individuals. Among putatively African breeds, Pharaoh hounds, and Rhodesian ridgebacks clustered with non-native rather than indigenous African dogs, suggesting they have predominantly non-African origins. Surprisingly, we find similar mtDNA haplotype diversity in African and East Asian village dogs, potentially calling into question the hypothesis of an East Asian origin for dog domestication.
View details for DOI 10.1073/pnas.0902129106
View details for Web of Science ID 000269078700052
View details for PubMedID 19666600
Evolutionary Processes Acting on Candidate cis-Regulatory Regions in Humans Inferred from Patterns of Polymorphism and Divergence
2009; 5 (8)
Analysis of polymorphism and divergence in the non-coding portion of the human genome yields crucial information about factors driving the evolution of gene regulation. Candidate cis-regulatory regions spanning more than 15,000 genes in 15 African Americans and 20 European Americans were re-sequenced and aligned to the chimpanzee genome in order to identify potentially functional polymorphism and to characterize and quantify departures from neutral evolution. Distortions of the site frequency spectra suggest a general pattern of selective constraint on conserved non-coding sites in the flanking regions of genes (CNCs). Moreover, there is an excess of fixed differences that cannot be explained by a Gamma model of deleterious fitness effects, suggesting the presence of positive selection on CNCs. Extensions of the McDonald-Kreitman test identified candidate cis-regulatory regions with high probabilities of positive and negative selection near many known human genes, the biological characteristics of which exhibit genome-wide trends that differ from patterns observed in protein-coding regions. Notably, there is a higher probability of positive selection in candidate cis-regulatory regions near genes expressed in the fetal brain, suggesting that a larger portion of adaptive regulatory changes has occurred in genes expressed during brain development. Overall we find that natural selection has played an important role in the evolution of candidate cis-regulatory regions throughout hominid evolution.
View details for DOI 10.1371/journal.pgen.1000592
View details for Web of Science ID 000271533500007
View details for PubMedID 19662163
Genomewide SNP variation reveals relationships among landraces and modern varieties of rice
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2009; 106 (30): 12273-12278
Rice, the primary source of dietary calories for half of humanity, is the first crop plant for which a high-quality reference genome sequence from a single variety was produced. We used resequencing microarrays to interrogate 100 Mb of the unique fraction of the reference genome for 20 diverse varieties and landraces that capture the impressive genotypic and phenotypic diversity of domesticated rice. Here, we report the distribution of 160,000 nonredundant SNPs. Introgression patterns of shared SNPs revealed the breeding history and relationships among the 20 varieties; some introgressed regions are associated with agronomic traits that mark major milestones in rice improvement. These comprehensive SNP data provide a foundation for deep exploration of rice diversity and gene-trait relationships and their use for future rice improvement.
View details for DOI 10.1073/pnas.0900992106
View details for Web of Science ID 000268440200015
View details for PubMedID 19597147
Darwinian and demographic forces affecting human protein coding genes
2009; 19 (5): 838-849
Past demographic changes can produce distortions in patterns of genetic variation that can mimic the appearance of natural selection unless the demographic effects are explicitly removed. Here we fit a detailed model of human demography that incorporates divergence, migration, admixture, and changes in population size to directly sequenced data from 13,400 protein coding genes from 20 European-American and 19 African-American individuals. Based on this demographic model, we use several new and established statistical methods for identifying genes with extreme patterns of polymorphism likely to be caused by Darwinian selection, providing the first genome-wide analysis of allele frequency distributions in humans based on directly sequenced data. The tests are based on observations of excesses of high frequency-derived alleles, excesses of low frequency-derived alleles, and excesses of differences in allele frequencies between populations. We detect numerous new genes with strong evidence of selection, including a number of genes related to psychiatric and other diseases. We also show that microRNA controlled genes evolve under extremely high constraints and are more likely to undergo negative selection than other genes. Furthermore, we show that genes involved in muscle development have been subject to positive selection during recent human history. In accordance with previous studies, we find evidence for negative selection against mutations in genes associated with Mendelian disease and positive selection acting on genes associated with several complex diseases.
View details for DOI 10.1101/gr.088336.108
View details for Web of Science ID 000265668800017
View details for PubMedID 19279335
Methods for Human Demographic Inference Using Haplotype Patterns From Genomewide Single-Nucleotide Polymorphism Data
2009; 182 (1): 217-231
We propose a novel approximate-likelihood method to fit demographic models to human genomewide single-nucleotide polymorphism (SNP) data. We divide the genome into windows of constant genetic map width and then tabulate the number of distinct haplotypes and the frequency of the most common haplotype for each window. We summarize the data by the genomewide joint distribution of these two statistics-termed the HCN statistic. Coalescent simulations are used to generate the expected HCN statistic for different demographic parameters. The HCN statistic provides additional information for disentangling complex demography beyond statistics based on single-SNP frequencies. Application of our method to simulated data shows it can reliably infer parameters from growth and bottleneck models, even in the presence of recombination hotspots when properly modeled. We also examined how practical problems with genomewide data sets, such as errors in the genetic map, haplotype phase uncertainty, and SNP ascertainment bias, affect our method. Several modifications of our method served to make it robust to these problems. We have applied our method to data collected by Perlegen Sciences and find evidence for a severe population size reduction in northwestern Europe starting 32,500-47,500 years ago.
View details for DOI 10.1534/genetics.108.099275
View details for Web of Science ID 000270213800019
View details for PubMedID 19255370
Global distribution of genomic diversity underscores rich complex history of continental human populations
2009; 19 (5): 795-803
Characterizing patterns of genetic variation within and among human populations is important for understanding human evolutionary history and for careful design of medical genetic studies. Here, we analyze patterns of variation across 443,434 single nucleotide polymorphisms (SNPs) genotyped in 3845 individuals from four continental regions. This unique resource allows us to illuminate patterns of diversity in previously under-studied populations at the genome-wide scale including Latin America, South Asia, and Southern Europe. Key insights afforded by our analysis include quantifying the degree of admixture in a large collection of individuals from Guadalajara, Mexico; identifying language and geography as key determinants of population structure within India; and elucidating a north-south gradient in haplotype diversity within Europe. We also present a novel method for identifying long-range tracts of homozygosity indicative of recent common ancestry. Application of our approach suggests great variation within and among populations in the extent of homozygosity, suggesting both demographic history (such as population bottlenecks) and recent ancestry events (such as consanguinity) play an important role in patterning variation in large modern human populations.
View details for DOI 10.1101/gr.088898.108
View details for Web of Science ID 000265668800013
View details for PubMedID 19218534
Genome-Wide Survey of SNP Variation Uncovers the Genetic Structure of Cattle Breeds
2009; 324 (5926): 528-532
The imprints of domestication and breed development on the genomes of livestock likely differ from those of companion animals. A deep draft sequence assembly of shotgun reads from a single Hereford female and comparative sequences sampled from six additional breeds were used to develop probes to interrogate 37,470 single-nucleotide polymorphisms (SNPs) in 497 cattle from 19 geographically and biologically diverse breeds. These data show that cattle have undergone a rapid recent decrease in effective population size from a very large ancestral population, possibly due to bottlenecks associated with domestication, selection, and breed formation. Domestication and artificial selection appear to have left detectable signatures of selection within the cattle genome, yet the current levels of diversity within breeds are at least as great as exists within humans.
View details for DOI 10.1126/science.1167936
View details for Web of Science ID 000265411200051
View details for PubMedID 19390050
Linkage Disequilibrium and Demographic History of Wild and Domestic Canids
2009; 181 (4): 1493-1505
Assessing the extent of linkage disequilibrium (LD) in natural populations of a nonmodel species has been difficult due to the lack of available genomic markers. However, with advances in genotyping and genome sequencing, genomic characterization of natural populations has become feasible. Using sequence data and SNP genotypes, we measured LD and modeled the demographic history of wild canid populations and domestic dog breeds. In 11 gray wolf populations and one coyote population, we find that the extent of LD as measured by the distance at which r2=0.2 extends <10 kb in outbred populations to >1.7 Mb in populations that have experienced significant founder events and bottlenecks. This large range in the extent of LD parallels that observed in 18 dog breeds where the r2 value varies from approximately 20 kb to >5 Mb. Furthermore, in modeling demographic history under a composite-likelihood framework, we find that two of five wild canid populations exhibit evidence of a historical population contraction. Five domestic dog breeds display evidence for a minor population contraction during domestication and a more severe contraction during breed formation. Only a 5% reduction in nucleotide diversity was observed as a result of domestication, whereas the loss of nucleotide diversity with breed formation averaged 35%.
View details for DOI 10.1534/genetics.108.098830
View details for Web of Science ID 000270213700027
View details for PubMedID 19189949
Molecular and Evolutionary History of Melanism in North American Gray Wolves
2009; 323 (5919): 1339-1343
Morphological diversity within closely related species is an essential aspect of evolution and adaptation. Mutations in the Melanocortin 1 receptor (Mc1r) gene contribute to pigmentary diversity in natural populations of fish, birds, and many mammals. However, melanism in the gray wolf, Canis lupus, is caused by a different melanocortin pathway component, the K locus, that encodes a beta-defensin protein that acts as an alternative ligand for Mc1r. We show that the melanistic K locus mutation in North American wolves derives from past hybridization with domestic dogs, has risen to high frequency in forested habitats, and exhibits a molecular signature of positive selection. The same mutation also causes melanism in the coyote, Canis latrans, and in Italian gray wolves, and hence our results demonstrate how traits selected in domesticated species can influence the morphological diversity of their wild relatives.
View details for DOI 10.1126/science.1165448
View details for Web of Science ID 000263876700041
View details for PubMedID 19197024
Improving the Fitness of High-Dimensional Biomechanical Models via Data-Driven Stochastic Exploration
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING
2009; 56 (3): 552-564
The field of complex biomechanical modeling has begun to rely on Monte Carlo techniques to investigate the effects of parameter variability and measurement uncertainty on model outputs, search for optimal parameter combinations, and define model limitations. However, advanced stochastic methods to perform data-driven explorations, such as Markov chain Monte Carlo (MCMC), become necessary as the number of model parameters increases. Here, we demonstrate the feasibility and, what to our knowledge is, the first use of an MCMC approach to improve the fitness of realistically large biomechanical models. We used a Metropolis-Hastings algorithm to search increasingly complex parameter landscapes (3, 8, 24, and 36 dimensions) to uncover underlying distributions of anatomical parameters of a "truth model" of the human thumb on the basis of simulated kinematic data (thumbnail location, orientation, and linear and angular velocities) polluted by zero-mean, uncorrelated multivariate Gaussian "measurement noise." Driven by these data, ten Markov chains searched each model parameter space for the subspace that best fit the data (posterior distribution). As expected, the convergence time increased, more local minima were found, and marginal distributions broadened as the parameter space complexity increased. In the 36-D scenario, some chains found local minima but the majority of chains converged to the true posterior distribution (confirmed using a cross-validation dataset), thus demonstrating the feasibility and utility of these methods for realistically large biomechanical problems.
View details for DOI 10.1109/TBME.2008.2006033
View details for Web of Science ID 000265486800002
View details for PubMedID 19272906
Copy Number Variation of CCL3-like Genes Affects Rate of Progression to Simian-AIDS in Rhesus Macaques (Macaca mulatta)
2009; 5 (1)
Variation in genes underlying host immunity can lead to marked differences in susceptibility to HIV infection among humans. Despite heavy reliance on non-human primates as models for HIV/AIDS, little is known about which host factors are shared and which are unique to a given primate lineage. Here, we investigate whether copy number variation (CNV) at CCL3-like genes (CCL3L), a key genetic host factor for HIV/AIDS susceptibility and cell-mediated immune response in humans, is also a determinant of time until onset of simian-AIDS in rhesus macaques. Using a retrospective study of 57 rhesus macaques experimentally infected with SIVmac, we find that CCL3L CNV explains approximately 18% of the variance in time to simian-AIDS (p<0.001) with lower CCL3L copy number associating with more rapid disease course. We also find that CCL3L copy number varies significantly (p<10(-6)) among rhesus subpopulations, with Indian-origin macaques having, on average, half as many CCL3L gene copies as Chinese-origin macaques. Lastly, we confirm that CCL3L shows variable copy number in humans and chimpanzees and report on CCL3L CNV within and among three additional primate species. On the basis of our findings we suggest that (1) the difference in population level copy number may explain previously reported observations of longer post-infection survivorship of Chinese-origin rhesus macaques, (2) stratification by CCL3L copy number in rhesus SIV vaccine trials will increase power and reduce noise due to non-vaccine-related differences in survival, and (3) CCL3L CNV is an ancestral component of the primate immune response and, therefore, copy number variation has not been driven by HIV or SIV per se.
View details for DOI 10.1371/journal.pgen.1000346
View details for Web of Science ID 000266221100028
View details for PubMedID 19165326
- Evaluating signatures of sex-specific processes in the human genome NATURE GENETICS 2009; 41 (1): 8-10
The population reference sample, POPRES: A resource for population, disease, and pharmacological genetics research
AMERICAN JOURNAL OF HUMAN GENETICS
2008; 83 (3): 347-358
Technological and scientific advances, stemming in large part from the Human Genome and HapMap projects, have made large-scale, genome-wide investigations feasible and cost effective. These advances have the potential to dramatically impact drug discovery and development by identifying genetic factors that contribute to variation in disease risk as well as drug pharmacokinetics, treatment efficacy, and adverse drug reactions. In spite of the technological advancements, successful application in biomedical research would be limited without access to suitable sample collections. To facilitate exploratory genetics research, we have assembled a DNA resource from a large number of subjects participating in multiple studies throughout the world. This growing resource was initially genotyped with a commercially available genome-wide 500,000 single-nucleotide polymorphism panel. This project includes nearly 6,000 subjects of African-American, East Asian, South Asian, Mexican, and European origin. Seven informative axes of variation identified via principal-component analysis (PCA) of these data confirm the overall integrity of the data and highlight important features of the genetic structure of diverse populations. The potential value of such extensively genotyped collections is illustrated by selection of genetically matched population controls in a genome-wide analysis of abacavir-associated hypersensitivity reaction. We find that matching based on country of origin, identity-by-state distance, and multidimensional PCA do similarly well to control the type I error rate. The genotype and demographic data from this reference sample are freely available through the NCBI database of Genotypes and Phenotypes (dbGaP).
View details for DOI 10.1016/j.ajhg.2008.08.005
View details for Web of Science ID 000259307200006
View details for PubMedID 18760391
- Exclusion of ABCA-1 as a candidate gene for canine Scott syndrome JOURNAL OF THROMBOSIS AND HAEMOSTASIS 2008; 6 (9): 1608-U5
Patterns of Positive Selection in Six Mammalian Genomes
2008; 4 (8)
Genome-wide scans for positively selected genes (PSGs) in mammals have provided insight into the dynamics of genome evolution, the genetic basis of differences between species, and the functions of individual genes. However, previous scans have been limited in power and accuracy owing to small numbers of available genomes. Here we present the most comprehensive examination of mammalian PSGs to date, using the six high-coverage genome assemblies now available for eutherian mammals. The increased phylogenetic depth of this dataset results in substantially improved statistical power, and permits several new lineage- and clade-specific tests to be applied. Of approximately 16,500 human genes with high-confidence orthologs in at least two other species, 400 genes showed significant evidence of positive selection (FDR<0.05), according to a standard likelihood ratio test. An additional 144 genes showed evidence of positive selection on particular lineages or clades. As in previous studies, the identified PSGs were enriched for roles in defense/immunity, chemosensory perception, and reproduction, but enrichments were also evident for more specific functions, such as complement-mediated immunity and taste perception. Several pathways were strongly enriched for PSGs, suggesting possible co-evolution of interacting genes. A novel Bayesian analysis of the possible "selection histories" of each gene indicated that most PSGs have switched multiple times between positive selection and nonselection, suggesting that positive selection is often episodic. A detailed analysis of Affymetrix exon array data indicated that PSGs are expressed at significantly lower levels, and in a more tissue-specific manner, than non-PSGs. Genes that are specifically expressed in the spleen, testes, liver, and breast are significantly enriched for PSGs, but no evidence was found for an enrichment for PSGs among brain-specific genes. This study provides additional evidence for widespread positive selection in mammalian evolution and new genome-wide insights into the functional implications of positive selection.
View details for DOI 10.1371/journal.pgen.1000144
View details for Web of Science ID 000260410800035
View details for PubMedID 18670650
Natural selection on genes that underlie human disease susceptibility
2008; 18 (12): 883-889
What evolutionary forces shape genes that contribute to the risk of human disease? Do similar selective pressures act on alleles that underlie simple versus complex disorders [1-3]? Answers to these questions will shed light onto the origin of human disorders (e.g., ) and help to predict the population frequencies of alleles that contribute to disease risk, with important implications for the efficient design of mapping studies [5-7]. As a first step toward addressing these questions, we created a hand-curated version of the Mendelian Inheritance in Man database (OMIM). We then examined selective pressures on Mendelian-disease genes, genes that contribute to complex-disease risk, and genes known to be essential in mouse by analyzing patterns of human polymorphism and of divergence between human and rhesus macaque. We found that Mendelian-disease genes appear to be under widespread purifying selection, especially when the disease mutations are dominant (rather than recessive). In contrast, the class of genes that influence complex-disease risk shows little signs of evolutionary conservation, possibly because this category includes targets of both purifying and positive selection.
View details for DOI 10.1016/j.cub.2008.04.074
View details for Web of Science ID 000257151900026
View details for PubMedID 18571414
Assessing the evolutionary impact of amino acid mutations in the human genome
2008; 4 (5)
Quantifying the distribution of fitness effects among newly arising mutations in the human genome is key to resolving important debates in medical and evolutionary genetics. Here, we present a method for inferring this distribution using Single Nucleotide Polymorphism (SNP) data from a population with non-stationary demographic history (such as that of modern humans). Application of our method to 47,576 coding SNPs found by direct resequencing of 11,404 protein coding-genes in 35 individuals (20 European Americans and 15 African Americans) allows us to assess the relative contribution of demographic and selective effects to patterning amino acid variation in the human genome. We find evidence of an ancient population expansion in the sample with African ancestry and a relatively recent bottleneck in the sample with European ancestry. After accounting for these demographic effects, we find strong evidence for great variability in the selective effects of new amino acid replacing mutations. In both populations, the patterns of variation are consistent with a leptokurtic distribution of selection coefficients (e.g., gamma or log-normal) peaked near neutrality. Specifically, we predict 27-29% of amino acid changing (nonsynonymous) mutations are neutral or nearly neutral (|s|<0.01%), 30-42% are moderately deleterious (0.01%<|s|<1%), and nearly all the remainder are highly deleterious or lethal (|s|>1%). Our results are consistent with 10-20% of amino acid differences between humans and chimpanzees having been fixed by positive selection with the remainder of differences being neutral or nearly neutral. Our analysis also predicts that many of the alleles identified via whole-genome association mapping may be selectively neutral or (formerly) positively selected, implying that deleterious genetic variation affecting disease phenotype may be missed by this widely used approach for mapping genes underlying complex traits.
View details for DOI 10.1371/journal.pgen.1000083
View details for Web of Science ID 000256869100004
View details for PubMedID 18516229
Exploring population genetic models with recombination using efficient forward-time simulations
2008; 178 (4): 2417-2427
We present an exact forward-in-time algorithm that can efficiently simulate the evolution of a finite population under the Wright-Fisher model. We used simulations based on this algorithm to verify the accuracy of the ancestral recombination graph approximation by comparing it to the exact Wright-Fisher scenario. We find that the recombination graph is generally a very good approximation for models with complete outcrossing, whereas, for models with self-fertilization, the approximation becomes slightly inexact for some combinations of selfing and recombination parameters.
View details for DOI 10.1534/genetics.107.085332
View details for Web of Science ID 000255239600047
View details for PubMedID 18430959
Proportionally more deleterious genetic variation in European than in African populations
2008; 451 (7181): 994-U5
Quantifying the number of deleterious mutations per diploid human genome is of crucial concern to both evolutionary and medical geneticists. Here we combine genome-wide polymorphism data from PCR-based exon resequencing, comparative genomic data across mammalian species, and protein structure predictions to estimate the number of functionally consequential single-nucleotide polymorphisms (SNPs) carried by each of 15 African American (AA) and 20 European American (EA) individuals. We find that AAs show significantly higher levels of nucleotide heterozygosity than do EAs for all categories of functional SNPs considered, including synonymous, non-synonymous, predicted 'benign', predicted 'possibly damaging' and predicted 'probably damaging' SNPs. This result is wholly consistent with previous work showing higher overall levels of nucleotide variation in African populations than in Europeans. EA individuals, in contrast, have significantly more genotypes homozygous for the derived allele at synonymous and non-synonymous SNPs and for the damaging allele at 'probably damaging' SNPs than AAs do. For SNPs segregating only in one population or the other, the proportion of non-synonymous SNPs is significantly higher in the EA sample (55.4%) than in the AA sample (47.0%; P < 2.3 x 10(-37)). We observe a similar proportional excess of SNPs that are inferred to be 'probably damaging' (15.9% in EA; 12.1% in AA; P < 3.3 x 10(-11)). Using extensive simulations, we show that this excess proportion of segregating damaging alleles in Europeans is probably a consequence of a bottleneck that Europeans experienced at about the time of the migration out of Africa.
View details for DOI 10.1038/nature06611
View details for Web of Science ID 000253313100049
View details for PubMedID 18288194
Population genetics of polymorphism and divergence under fluctuating selection
2008; 178 (1): 325-337
Current methods for detecting fluctuating selection require time series data on genotype frequencies. Here, we propose an alternative approach that makes use of DNA polymorphism data from a sample of individuals collected at a single point in time. Our method uses classical diffusion approximations to model temporal fluctuations in the selection coefficients to find the expected distribution of mutation frequencies in the population. Using the Poisson random-field setting we derive the site-frequency spectrum (SFS) for three different models of fluctuating selection. We find that the general effect of fluctuating selection is to produce a more "U"-shaped site-frequency spectrum with an excess of high-frequency derived mutations at the expense of middle-frequency variants. We present likelihood-ratio tests, comparing the fluctuating selection models to the neutral model using SFS data, and use Monte Carlo simulations to assess their power. We find that we have sufficient power to reject a neutral hypothesis using samples on the order of a few hundred SNPs and a sample size of approximately 20 and power to distinguish between selection that varies in time and constant selection for a sample of size 20. We also find that fluctuating selection increases the probability of fixation of selected sites even if, on average, there is no difference in selection among a pair of alleles segregating at the locus. Fluctuating selection will, therefore, lead to an increase in the ratio of divergence to polymorphism similar to that observed under positive directional selection.
View details for DOI 10.1531/genetics.107.073361
View details for Web of Science ID 000252742200025
View details for PubMedID 17947441
Recent and ongoing selection in the human genome
NATURE REVIEWS GENETICS
2007; 8 (11): 857-868
The recent availability of genome-scale genotyping data has led to the identification of regions of the human genome that seem to have been targeted by selection. These findings have increased our understanding of the evolutionary forces that affect the human genome, have augmented our knowledge of gene function and promise to increase our understanding of the genetic basis of disease. However, inferences of selection are challenged by several confounding factors, especially the complex demographic history of human populations, and concordance between studies is variable. Although such studies will always be associated with some uncertainty, steps can be taken to minimize the effects of confounding factors and improve our interpretation of their findings.
View details for DOI 10.1038/nrg2187
View details for Web of Science ID 000250281000014
View details for PubMedID 17943193
Context-dependent mutation rates may cause spurious signatures of a fixation bias favoring higher GC-Content in humans
MOLECULAR BIOLOGY AND EVOLUTION
2007; 24 (10): 2196-2202
Understanding the proximate and ultimate causes underlying the evolution of nucleotide composition in mammalian genomes is of fundamental interest to the study of molecular evolution. Comparative genomics studies have revealed that many more substitutions occur from G and C nucleotides to A and T nucleotides than the reverse, suggesting that mammalian genomes are not at equilibrium for base composition. Analysis of human polymorphism data suggests that mutations that increase GC-content tend to be at much higher frequencies than those that decrease or preserve GC-content when the ancestral allele is inferred via parsimony using the chimpanzee genome. These observations have been interpreted as evidence for a fixation bias in favor of G and C alleles due to either positive natural selection or biased gene conversion. Here, we test the robustness of this interpretation to violations of the parsimony assumption using a data set of 21,488 noncoding single nucleotide polymorphisms (SNPs) discovered by the National Institute of Environmental Health Sciences (NIEHS) SNPs project via direct resequencing of n = 95 individuals. Applying standard nonparametric and parametric population genetic approaches, we replicate the signatures of a fixation bias in favor of G and C alleles when the ancestral base is assumed to be the base found in the chimpanzee outgroup. However, upon taking into account the probability of misidentifying the ancestral state of each SNP using a context-dependent mutation model, the corrected distribution of SNP frequencies for GC-content increasing SNPs are nearly indistinguishable from the patterns observed for other types of mutations, suggesting that the signature of fixation bias is a spurious artifact of the parsimony assumption.
View details for DOI 10.1093/molbev/msm149
View details for Web of Science ID 000250437000005
View details for PubMedID 17656634
Genome-wide patterns of nucleotide polymorphism in domesticated rice
2007; 3 (9): 1745-1756
Domesticated Asian rice (Oryza sativa) is one of the oldest domesticated crop species in the world, having fed more people than any other plant in human history. We report the patterns of DNA sequence variation in rice and its wild ancestor, O. rufipogon, across 111 randomly chosen gene fragments, and use these to infer the evolutionary dynamics that led to the origins of rice. There is a genome-wide excess of high-frequency derived single nucleotide polymorphisms (SNPs) in O. sativa varieties, a pattern that has not been reported for other crop species. We developed several alternative models to explain contemporary patterns of polymorphisms in rice, including a (i) selectively neutral population bottleneck model, (ii) bottleneck plus migration model, (iii) multiple selective sweeps model, and (iv) bottleneck plus selective sweeps model. We find that a simple bottleneck model, which has been the dominant demographic model for domesticated species, cannot explain the derived nucleotide polymorphism site frequency spectrum in rice. Instead, a bottleneck model that incorporates selective sweeps, or a more complex demographic model that includes subdivision and gene flow, are more plausible explanations for patterns of variation in domesticated rice varieties. If selective sweeps are indeed the explanation for the observed nucleotide data of domesticated rice, it suggests that strong selection can leave its imprint on genome-wide polymorphism patterns, contrary to expectations that selection results only in a local signature of variation.
View details for DOI 10.1371/journal.pgen.0030163
View details for Web of Science ID 000249767800017
View details for PubMedID 17907810
Global dissemination of a single mutation conferring white pericarp in rice
2007; 3 (8): 1418-1424
Here we report that the change from the red seeds of wild rice to the white seeds of cultivated rice (Oryza sativa) resulted from the strong selective sweep of a single mutation, a frame-shift deletion within the Rc gene that is found in 97.9% of white rice varieties today. A second mutation, also within Rc, is present in less than 3% of white accessions surveyed. Haplotype analysis revealed that the predominant mutation originated in the japonica subspecies and crossed both geographic and sterility barriers to move into the indica subspecies. A little less than one Mb of japonica DNA hitchhiked with the rc allele into most indica varieties, suggesting that other linked domestication alleles may have been transferred from japonica to indica along with white pericarp color. Our finding provides evidence of active cultural exchange among ancient farmers over the course of rice domestication coupled with very strong, positive selection for a single white allele in both subspecies of O. sativa.
View details for DOI 10.1371/journal.pgen.0030133
View details for Web of Science ID 000249767600010
View details for PubMedID 17696613
Context dependence, ancestral misidentification, and spurious signatures of natural selection
MOLECULAR BIOLOGY AND EVOLUTION
2007; 24 (8): 1792-1800
Population genetic analyses often use polymorphism data from one species, and orthologous genomic sequences from closely related outgroup species. These outgroup sequences are frequently used to identify ancestral alleles at segregating sites and to compare the patterns of polymorphism and divergence. Inherent in such studies is the assumption of parsimony, which posits that the ancestral state of each single nucleotide polymorphism (SNP) is the allele that matches the orthologous site in the outgroup sequence, and that all nucleotide substitutions between species have been observed. This study tests the effect of violating the parsimony assumption when mutation rates vary across sites and over time. Using a context-dependent mutation model that accounts for elevated mutation rates at CpG dinucleotides, increased propensity for transitional versus transversional mutations, as well as other directional and contextual mutation biases estimated along the human lineage, we show (using both simulations and a theoretical model) that enough unobserved substitutions could have occurred since the divergence of human and chimpanzee to cause many statistical tests to spuriously reject neutrality. Moreover, using both the chimpanzee and rhesus macaque genomes to parsimoniously identify ancestral states causes a large fraction of the data to be removed while not completely alleviating problem. By constructing a novel model of the context-dependent mutation process, we can correct polymorphism data for the effect of ancestral misidentification using a single outgroup.
View details for DOI 10.1093/molbev/msm108
View details for Web of Science ID 000248848400026
View details for PubMedID 17545186
On the utility of linkage disequilibrium as a statistic for identifying targets of positive selection in nonequilibrium populations
2007; 176 (4): 2371-2379
A critically important challenge in empirical population genetics is distinguishing neutral nonequilibrium processes from selective forces that produce similar patterns of variation. We here examine the extent to which linkage disequilibrium (i.e., nonrandom associations between markers) improves this discrimination. We show that patterns of linkage disequilibrium recently proposed to be unique to hitchhiking models are replicated under nonequilibrium neutral models. We also demonstrate that jointly considering spatial patterns of association among variants alongside the site-frequency spectrum is nonetheless of value. Through a comparison of models of equilibrium neutrality, nonequilibrium neutrality, equilibrium hitchhiking, nonequilibrium hitchhiking, and recurrent hitchhiking, we evaluate a linkage disequilibrium (LD) statistic (omega(max)) that appears to have power to identify regions recently shaped by positive selection. Most notably, for demographic parameters relevant to non-African populations of Drosophila melanogaster, we demonstrate that selected loci are distinguishable from neutral loci using this statistic.
View details for DOI 10.1534/genetics.106.069450
View details for Web of Science ID 000249530000036
View details for PubMedID 17565955
A Markov chain Monte Carlo approach for joint inference of population structure and inbreeding rates from multilocus genotype data
2007; 176 (3): 1635-1651
Nonrandom mating induces correlations in allelic states within and among loci that can be exploited to understand the genetic structure of natural populations (Wright 1965). For many species, it is of considerable interest to quantify the contribution of two forms of nonrandom mating to patterns of standing genetic variation: inbreeding (mating among relatives) and population substructure (limited dispersal of gametes). Here, we extend the popular Bayesian clustering approach STRUCTURE (Pritchard et al. 2000) for simultaneous inference of inbreeding or selfing rates and population-of-origin classification using multilocus genetic markers. This is accomplished by eliminating the assumption of Hardy-Weinberg equilibrium within clusters and, instead, calculating expected genotype frequencies on the basis of inbreeding or selfing rates. We demonstrate the need for such an extension by showing that selfing leads to spurious signals of population substructure using the standard STRUCTURE algorithm with a bias toward spurious signals of admixture. We gauge the performance of our method using extensive coalescent simulations and demonstrate that our approach can correct for this bias. We also apply our approach to understanding the population structure of the wild relative of domesticated rice, Oryza rufipogon, an important partially selfing grass species. Using a sample of n = 16 individuals sequenced at 111 random loci, we find strong evidence for existence of two subpopulations, which correlates well with geographic location of sampling, and estimate selfing rates for both groups that are consistent with estimates from experimental data (s approximately 0.48-0.70).
View details for DOI 10.1534/genetics.107.072371
View details for Web of Science ID 000248416300022
View details for PubMedID 17483417
Localizing recent adaptive evolution in the human genome
2007; 3 (6): 901-915
Identifying genomic locations that have experienced selective sweeps is an important first step toward understanding the molecular basis of adaptive evolution. Using statistical methods that account for the confounding effects of population demography, recombination rate variation, and single-nucleotide polymorphism ascertainment, while also providing fine-scale estimates of the position of the selected site, we analyzed a genomic dataset of 1.2 million human single-nucleotide polymorphisms genotyped in African-American, European-American, and Chinese samples. We identify 101 regions of the human genome with very strong evidence (p < 10(-5)) of a recent selective sweep and where our estimate of the position of the selective sweep falls within 100 kb of a known gene. Within these regions, genes of biological interest include genes in pigmentation pathways, components of the dystrophin protein complex, clusters of olfactory receptors, genes involved in nervous system development and function, immune system genes, and heat shock genes. We also observe consistent evidence of selective sweeps in centromeric regions. In general, we find that recent adaptation is strikingly pervasive in the human genome, with as much as 10% of the genome affected by linkage to a selective sweep.
View details for DOI 10.1371/journal.pgen.0030090
View details for Web of Science ID 000248349300005
View details for PubMedID 17542651
A mutation in the myostatin gene increases muscle mass and enhances racing performance in heterozygote dogs
2007; 3 (5): 779-786
Double muscling is a trait previously described in several mammalian species including cattle and sheep and is caused by mutations in the myostatin (MSTN) gene (previously referred to as GDF8). Here we describe a new mutation in MSTN found in the whippet dog breed that results in a double-muscled phenotype known as the "bully" whippet. Individuals with this phenotype carry two copies of a two-base-pair deletion in the third exon of MSTN leading to a premature stop codon at amino acid 313. Individuals carrying only one copy of the mutation are, on average, more muscular than wild-type individuals (p = 7.43 x 10(-6); Kruskal-Wallis Test) and are significantly faster than individuals carrying the wild-type genotype in competitive racing events (Kendall's nonparametric measure, tau = 0.3619; p approximately 0.00028). These results highlight the utility of performance-enhancing polymorphisms, marking the first time a mutation in MSTN has been quantitatively linked to increased athletic performance.
View details for DOI 10.1371/journal.pgen.0030079
View details for Web of Science ID 000247573900012
View details for PubMedID 17530926
Demographic histories and patterns of linkage disequilibrium in Chinese and Indian rhesus macaques
2007; 316 (5822): 240-243
To understand the demographic history of rhesus macaques (Macaca mulatta) and document the extent of linkage disequilibrium (LD) in the genome, we partially resequenced five Encyclopedia of DNA Elements regions in 9 Chinese and 38 captive-born Indian rhesus macaques. Population genetic analyses of the 1467 single-nucleotide polymorphisms discovered suggest that the two populations separated about 162,000 years ago, with the Chinese population tripling in size since then and the Indian population eventually shrinking by a factor of four. Using coalescent simulations, we confirmed that these inferred demographic events explain a much faster decay of LD in Chinese (r(2) approximately 0.15 at 10 kilobases) versus Indian (r(2) approximately 0.52 at 10 kilobases) macaque populations.
View details for DOI 10.1126/science.1140462
View details for Web of Science ID 000245654500040
View details for PubMedID 17431170
Evolutionary and biomedical insights from the rhesus macaque genome
2007; 316 (5822): 222-234
The rhesus macaque (Macaca mulatta) is an abundant primate species that diverged from the ancestors of Homo sapiens about 25 million years ago. Because they are genetically and physiologically similar to humans, rhesus monkeys are the most widely used nonhuman primate in basic and applied biomedical research. We determined the genome sequence of an Indian-origin Macaca mulatta female and compared the data with chimpanzees and humans to reveal the structure of ancestral primate genomes and to identify evidence for positive selection and lineage-specific expansions and contractions of gene families. A comparison of sequences from individual animals was used to investigate their underlying genetic diversity. The complete description of the macaque genome blueprint enhances the utility of this animal model for biomedical research and improves our understanding of the basic biology of the species.
View details for DOI 10.1126/science.1139247
View details for Web of Science ID 000245654500037
View details for PubMedID 17431167
A single IGF1 allele is a major determinant of small size in dogs
2007; 316 (5821): 112-115
The domestic dog exhibits greater diversity in body size than any other terrestrial vertebrate. We used a strategy that exploits the breed structure of dogs to investigate the genetic basis of size. First, through a genome-wide scan, we identified a major quantitative trait locus (QTL) on chromosome 15 influencing size variation within a single breed. Second, we examined genetic variation in the 15-megabase interval surrounding the QTL in small and giant breeds and found marked evidence for a selective sweep spanning a single gene (IGF1), encoding insulin-like growth factor 1. A single IGF1 single-nucleotide polymorphism haplotype is common to all small breeds and nearly absent from giant breeds, suggesting that the same causal sequence variant is a major contributor to body size in all small dogs.
View details for DOI 10.1126/science.1137045
View details for Web of Science ID 000245470500044
View details for PubMedID 17412960
Human Genome Variation 2006: emerging views on structural variation and large-scale SNP analysis.
The eighth annual Human Genome Variation Meeting was held in September 2006 in the Hong Kong Special Administrative Region, China. The meeting highlighted recent advances in characterization of genetic variation, including genome-wide association studies and structural variation.
View details for PubMedID 17262030
Selective sweep mapping of genes with large phenotypic effects
2005; 15 (12): 1809-1819
Many domestic dog breeds have originated through fixation of discrete mutations by intense artificial selection. As a result of this process, markers in the proximity of genes influencing breed-defining traits will have reduced variation (a selective sweep) and will show divergence in allele frequency. Consequently, low-resolution genomic scans can potentially be used to identify regions containing genes that have a major influence on breed-defining traits. We model the process of breed formation and show that the probability of two or three adjacent marker loci showing a spurious signal of selection within at least one breed (i.e., Type I error or false-positive rate) is low if highly variable and moderately spaced markers are utilized. We also use simulations with selection to demonstrate that even a moderately spaced set of highly polymorphic markers (e.g., one every 0.8 cM) has high power to detect regions targeted by strong artificial selection in dogs. Further, we show that a gene responsible for black coat color in the Large Munsterlander has a 40-Mb region surrounding the gene that is very low in heterozygosity for microsatellite markers. Similarly, we survey 302 microsatellite markers in the Dachshund and find three linked monomorphic microsatellite markers all within a 10-Mb region on chromosome 3. This region contains the FGFR3 gene, which is responsible for achondroplasia in humans, but not in dogs. Consequently, our results suggest that the causative mutation is a gene or regulatory region closely linked to FGFR3.
View details for DOI 10.1101/gr.4374505
View details for Web of Science ID 000233921900024
View details for PubMedID 16339379
Ascertainment bias in studies of human genome-wide polymorphism
2005; 15 (11): 1496-1502
Large-scale SNP genotyping studies rely on an initial assessment of nucleotide variation to identify sites in the DNA sequence that harbor variation among individuals. This "SNP discovery" sample may be quite variable in size and composition, and it has been well established that properties of the SNPs that are found are influenced by the discovery sampling effort. The International HapMap project relied on nearly any piece of information available to identify SNPs-including BAC end sequences, shotgun reads, and differences between public and private sequences-and even made use of chimpanzee data to confirm human sequence differences. In addition, the ascertainment criteria shifted from using only SNPs that had been validated in population samples, to double-hit SNPs, to finally accepting SNPs that were singletons in small discovery samples. In contrast, Perlegen's primary discovery was a resequencing-by-hybridization effort using the 24 people of diverse origin in the Polymorphism Discovery Resource. Here we take these two data sets and contrast two basic summary statistics, heterozygosity and F(ST), as well as the site frequency spectra, for 500-kb windows spanning the genome. The magnitude of disparity between these samples in these measures of variability indicates that population genetic analysis on the raw genotype data is ill advised. Given the knowledge of the discovery samples, we perform an ascertainment correction and show how the post-correction data are more consistent across these studies. However, discrepancies persist, suggesting that the heterogeneity in the SNP discovery process of the HapMap project resulted in a data set resistant to complete ascertainment correction. Ascertainment bias will likely erode the power of tests of association between SNPs and complex disorders, but the effect will likely be small, and perhaps more importantly, it is unlikely that the bias will introduce false-positive inferences.
View details for DOI 10.1101/gr.4107905
View details for Web of Science ID 000232889400005
View details for PubMedID 16251459
Genomic scans for selective sweeps using SNP data
2005; 15 (11): 1566-1575
Detecting selective sweeps from genomic SNP data is complicated by the intricate ascertainment schemes used to discover SNPs, and by the confounding influence of the underlying complex demographics and varying mutation and recombination rates. Current methods for detecting selective sweeps have little or no robustness to the demographic assumptions and varying recombination rates, and provide no method for correcting for ascertainment biases. Here, we present several new tests aimed at detecting selective sweeps from genomic SNP data. Using extensive simulations, we show that a new parametric test, based on composite likelihood, has a high power to detect selective sweeps and is surprisingly robust to assumptions regarding recombination rates and demography (i.e., has low Type I error). Our new test also provides estimates of the location of the selective sweep(s) and the magnitude of the selection coefficient. To illustrate the method, we apply our approach to data from the Seattle SNP project and to Chromosome 2 data from the HapMap project. In Chromosome 2, the most extreme signal is found in the lactase gene, which previously has been shown to be undergoing positive selection. Evidence for selective sweeps is also found in many other regions, including genes known to be associated with disease risk such as DPP10 and COL4A3.
View details for DOI 10.1101/gr.4252305
View details for Web of Science ID 000232889400012
View details for PubMedID 16251466
Natural selection on protein-coding genes in the human genome
2005; 437 (7062): 1153-1157
Comparisons of DNA polymorphism within species to divergence between species enables the discovery of molecular adaptation in evolutionarily constrained genes as well as the differentiation of weak from strong purifying selection. The extent to which weak negative and positive darwinian selection have driven the molecular evolution of different species varies greatly, with some species, such as Drosophila melanogaster, showing strong evidence of pervasive positive selection, and others, such as the selfing weed Arabidopsis thaliana, showing an excess of deleterious variation within local populations. Here we contrast patterns of coding sequence polymorphism identified by direct sequencing of 39 humans for over 11,000 genes to divergence between humans and chimpanzees, and find strong evidence that natural selection has shaped the recent molecular evolution of our species. Our analysis discovered 304 (9.0%) out of 3,377 potentially informative loci showing evidence of rapid amino acid evolution. Furthermore, 813 (13.5%) out of 6,033 potentially informative loci show a paucity of amino acid differences between humans and chimpanzees, indicating weak negative selection and/or balancing selection operating on mutations at these loci. We find that the distribution of negatively and positively selected genes varies greatly among biological processes and molecular functions, and that some classes, such as transcription factors, show an excess of rapidly evolving genes, whereas others, such as cytoskeletal proteins, show an excess of genes with extensive amino acid polymorphism within humans and yet little amino acid divergence between humans and chimpanzees.
View details for DOI 10.1038/nature04240
View details for Web of Science ID 000232660500044
View details for PubMedID 16237444
Distinguishing between selective sweeps and demography using DNA polymorphism data
2005; 170 (3): 1401-1410
In 2002 Kim and Stephan proposed a promising composite-likelihood method for localizing and estimating the fitness advantage of a recently fixed beneficial mutation. Here, we demonstrate that their composite-likelihood-ratio (CLR) test comparing selective and neutral hypotheses is not robust to undetected population structure or a recent bottleneck, with some parameter combinations resulting in a false positive rate of nearly 90%. We also propose a goodness-of-fit test for discriminating rejections due to directional selection (true positive) from those due to population and demographic forces (false positives) and demonstrate that the new method has high sensitivity to differentiate the two classes of rejections.
View details for DOI 10.1534/genetics.104.038224
View details for Web of Science ID 000231097700035
View details for PubMedID 15911584
A composite-likelihood approach for detecting directional selection from DNA sequence data
2005; 170 (3): 1411-1421
We present a novel composite-likelihood-ratio test (CLRT) for detecting genes and genomic regions that are subject to recurrent natural selection (either positive or negative). The method uses the likelihood functions of Hartl et al. (1994) for inference in a Wright-Fisher genic selection model and corrects for nonindependence among sites by application of coalescent simulations with recombination. Here, we (1) characterize the distribution of the CLRT statistic (Lambda) as a function of the population recombination rate (R=4Ner); (2) explore the effects of bias in estimation of R on the size (type I error) of the CLRT; (3) explore the robustness of the model to population growth, bottlenecks, and migration; (4) explore the power of the CLRT under varying levels of mutation, selection, and recombination; (5) explore the discriminatory power of the test in distinguishing negative selection from population growth; and (6) evaluate the performance of maximum composite-likelihood estimation (MCLE) of the selection coefficient. We find that the test has excellent power to detect weak negative selection and moderate power to detect positive selection. Moreover, the test is quite robust to bias in the estimate of local recombination rate, but not to certain demographic scenarios such as population growth or a recent bottleneck. Last, we demonstrate that the MCLE of the selection parameter has little bias for weak negative selection and has downward bias for positively selected mutations.
View details for DOI 10.1534/genetics.104.035097
View details for Web of Science ID 000231097700036
View details for PubMedID 15879513
Detecting coevolving amino acid sites using Bayesian mutational mapping
2005; 21: I126-I135
The evolution of protein sequences is constrained by complex interactions between amino acid residues. Because harmful substitutions may be compensated for by other substitutions at neighboring sites, residues can coevolve. We describe a Bayesian phylogenetic approach to the detection of coevolving residues in protein families. This method, Bayesian mutational mapping (BMM), assigns mutations to the branches of the evolutionary tree stochastically, and then test statistics are calculated to determine whether a coevolutionary signal exists in the mapping. Posterior predictive P-values provide an estimate of significance, and specificity is maintained by integrating over uncertainty in the estimation of the tree topology, branch lengths and substitution rates. A coevolutionary Markov model for codon substitution is also described, and this model is used as the basis of several test statistics.Results on simulated coevolutionary data indicate that the BMM method can successfully detect nearly all coevolving sites when the model has been correctly specified, and that non-parametric statistics such as mutual information are generally less powerful than parametric statistics. On a dataset of eukaryotic proteins from the phosphoglycerate kinase (PGK) family, interdomain site contacts yield a significantly greater coevolutionary signal than interdomain non-contacts, an indication that the method provides information about interacting sites. Failure to account for the heterogeneity in rates across sites in PGK resulted in a less discriminating test, yielding a marked increase in the number of reported positives at both contact and non-contact sites.http://www.dimmic.net/supplement/
View details for DOI 10.1093/bioinformatics/bti1032
View details for Web of Science ID 000230273000014
View details for PubMedID 15961449
A scan for positively selected genes in the genomes of humans and chimpanzees
2005; 3 (6): 976-985
Since the divergence of humans and chimpanzees about 5 million years ago, these species have undergone a remarkable evolution with drastic divergence in anatomy and cognitive abilities. At the molecular level, despite the small overall magnitude of DNA sequence divergence, we might expect such evolutionary changes to leave a noticeable signature throughout the genome. We here compare 13,731 annotated genes from humans to their chimpanzee orthologs to identify genes that show evidence of positive selection. Many of the genes that present a signature of positive selection tend to be involved in sensory perception or immune defenses. However, the group of genes that show the strongest evidence for positive selection also includes a surprising number of genes involved in tumor suppression and apoptosis, and of genes involved in spermatogenesis. We hypothesize that positive selection in some of these genes may be driven by genomic conflict due to apoptosis during spermatogenesis. Genes with maximal expression in the brain show little or no evidence for positive selection, while genes with maximal expression in the testis tend to be enriched with positively selected genes. Genes on the X chromosome also tend to show an elevated tendency for positive selection. We also present polymorphism data from 20 Caucasian Americans and 19 African Americans for the 50 annotated genes showing the strongest evidence for positive selection. The polymorphism analysis further supports the presence of positive selection in these genes by showing an excess of high-frequency derived nonsynonymous mutations.
View details for DOI 10.1371/journal.pbio.0030170
View details for Web of Science ID 000229992900008
View details for PubMedID 15869325
Simultaneous inference of selection and population growth from patterns of variation in the human genome
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2005; 102 (22): 7882-7887
Natural selection and demographic forces can have similar effects on patterns of DNA polymorphism. Therefore, to infer selection from samples of DNA sequences, one must simultaneously account for demographic effects. Here we take a model-based approach to this problem by developing predictions for patterns of polymorphism in the presence of both population size change and natural selection. If data are available from different functional classes of variation, and a priori information suggests that mutations in one of those classes are selectively neutral, then the putatively neutral class can be used to infer demographic parameters, and inferences regarding selection on other classes can be performed given demographic parameter estimates. This procedure is more robust to assumptions regarding the true underlying demography than previous approaches to detecting and analyzing selection. We apply this method to a large polymorphism data set from 301 human genes and find (i) widespread negative selection acting on standing nonsynonymous variation, (ii) that the fitness effects of nonsynonymous mutations are well predicted by several measures of amino acid exchangeability, especially site-specific methods, and (iii) strong evidence for very recent population growth.
View details for DOI 10.1073/pnas.0502300102
View details for Web of Science ID 000229531000020
View details for PubMedID 15905331
A statistical characterization of consistent patterns of human immunodeficiency virus evolution within infected patients
MOLECULAR BIOLOGY AND EVOLUTION
2005; 22 (3): 456-468
Within-patient HIV populations evolve rapidly because of a high mutation rate, short generation time, and strong positive selection pressures. Previous studies have identified "consistent patterns" of viral sequence evolution. Just before HIV infection progresses to AIDS, evolution seems to slow markedly, and the genetic diversity of the viral population drops. This evolutionary slowdown could be caused either by a reduction in the average viral replication rate or because selection pressures weaken with the collapse of the immune system. The former hypothesis (which we denote "cellular exhaustion") predicts a simultaneous reduction in both synonymous and nonsynonymous evolution, whereas the latter hypothesis (denoted "immune relaxation") predicts that only nonsynonymous evolution will slow. In this paper, we present a set of statistical procedures for distinguishing between these alternative hypotheses using DNA sequences sampled over the course of infection. The first component is a new method for estimating evolutionary rates that takes advantage of the temporal information in longitudinal DNA sequence samples. Second, we develop a set of probability models for the analysis of evolutionary rates in HIV populations in vivo. Application of these models to both synonymous and nonsynonymous evolution affords a comparison of the cellular-exhaustion and immune-relaxation hypotheses. We apply the procedures to longitudinal data sets in which sequences of the env gene were sampled over the entire course of infection. Our analyses (1) statistically confirm that an evolutionary slowdown occurs late in infection, (2) strongly support the immune-relaxation hypothesis, and (3) indicate that the cessation of nonsynonymous evolution is associated with disease progression.
View details for Web of Science ID 000227163100012
View details for PubMedID 15509726
Inferring SNP function using evolutionary, structural, and computational methods.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
View details for PubMedID 15759643
Population genetics of polymorphism and divergence for diploid selection models with arbitrary dominance
2004; 168 (1): 463-475
We develop a Poisson random-field model of polymorphism and divergence that allows arbitrary dominance relations in a diploid context. This model provides a maximum-likelihood framework for estimating both selection and dominance parameters of new mutations using information on the frequency spectrum of sequence polymorphisms. This is the first DNA sequence-based estimator of the dominance parameter. Our model also leads to a likelihood-ratio test for distinguishing nongenic from genic selection; simulations indicate that this test is quite powerful when a large number of segregating sites are available. We also use simulations to explore the bias in selection parameter estimates caused by unacknowledged dominance relations. When inference is based on the frequency spectrum of polymorphisms, genic selection estimates of the selection parameter can be very strongly biased even for minor deviations from the genic selection model. Surprisingly, however, when inference is based on polymorphism and divergence (McDonald-Kreitman) data, genic selection estimates of the selection parameter are nearly unbiased, even for completely dominant or recessive mutations. Further, we find that weak overdominant selection can increase, rather than decrease, the substitution rate relative to levels of polymorphism. This nonintuitive result has major implications for the interpretation of several popular tests of neutrality.
View details for DOI 10.1534/genetics.103.024745
View details for Web of Science ID 000224393700036
View details for PubMedID 15454557
Natural selection on the olfactory receptor gene family in humans and chimpanzees
AMERICAN JOURNAL OF HUMAN GENETICS
2003; 73 (3): 489-501
The olfactory receptor (OR) genes constitute the largest gene family in mammalian genomes. Humans have >1,000 OR genes, of which only approximately 40% have an intact coding region and are therefore putatively functional. In contrast, the fraction of intact OR genes in the genomes of the great apes is significantly greater (68%-72%), suggesting that selective pressures on the OR repertoire vary among these species. We have examined the evolutionary forces that shaped the OR gene family in humans and chimpanzees by resequencing 20 OR genes in 16 humans, 16 chimpanzees, and one orangutan. We compared the variation at the OR genes with that at intergenic regions. In both humans and chimpanzees, OR pseudogenes seem to evolve neutrally. In chimpanzees, patterns of variability are consistent with purifying selection acting on intact OR genes, whereas, in humans, there is suggestive evidence for positive selection acting on intact OR genes. These observations are likely due to differences in lifestyle, between humans and great apes, that have led to distinct sensory needs.
View details for Web of Science ID 000185134000003
View details for PubMedID 12908129
Maximum likelihood and Bayesian methods for estimating the distribution of selective effects among classes of mutations using DNA polymorphism data
THEORETICAL POPULATION BIOLOGY
2003; 63 (2): 91-103
Maximum likelihood and Bayesian approaches are presented for analyzing hierarchical statistical models of natural selection operating on DNA polymorphism within a panmictic population. For analyzing Bayesian models, we present Markov chain Monte-Carlo (MCMC) methods for sampling from the joint posterior distribution of parameters. For frequentist analysis, an Expectation-Maximization (EM) algorithm is presented for finding the maximum likelihood estimate of the genome wide mean and variance in selection intensity among classes of mutations. The framework presented here provides an ideal setting for modeling mutations dispersed through the genome and, in particular, for the analysis of how natural selection operates on different classes of single nucleotide polymorphisms (SNPs).
View details for DOI 10.1016/S0040-5809(02)00050-3
View details for Web of Science ID 000181575000003
View details for PubMedID 12615493
Selection on rapidly evolving proteins in the Arabidopsis genome
2003; 163 (2): 723-733
Genes that have undergone positive or diversifying selection are likely to be associated with adaptive divergence between species. One indicator of adaptive selection at the molecular level is an excess of amino acid replacement fixed differences per replacement site relative to the number of synonymous fixed differences per synonymous site (omega = K(a)/K(s)). We used an evolutionary expressed sequence tag (EST) approach to estimate the distribution of omega among 304 orthologous loci between Arabidopsis thaliana and A. lyrata to identify genes potentially involved in the adaptive divergence between these two Brassicaceae species. We find that 14 of 304 genes (approximately 5%) have an estimated omega > 1 and are candidates for genes with increased selection intensities. Molecular population genetic analyses of 6 of these rapidly evolving protein loci indicate that, despite their high levels of between-species nonsynonymous divergence, these genes do not have elevated levels of intraspecific replacement polymorphisms compared to previously studied genes. A hierarchical Bayesian analysis of protein-coding region evolution within and between species also indicates that the selection intensities of these genes are elevated compared to previously studied A. thaliana nuclear loci.
View details for Web of Science ID 000181417200025
View details for PubMedID 12618409
Bayesian analysis suggests that most amino acid replacements in Drosophila are driven by positive selection
JOURNAL OF MOLECULAR EVOLUTION
2003; 57: S154-S164
One of the principal goals of population genetics is to understand the processes by which genetic variation within species (polymorphism) becomes converted into genetic differences between species (divergence). In this transformation, selective neutrality, near neutrality, and positive selection may each play a role, differing from one gene to the next. Synonymous nucleotide sites are often used as a uniform standard of comparison across genes on the grounds that synonymous sites are subject to relatively weak selective constraints and so may, to a first approximation, be regarded as neutral. Synonymous sites are also interdigitated with nonsynonymous sites and so are affected equally by genomic context and demographic factors. Hence a comparison of levels of polymorphism and divergence between synonymous sites and amino acid replacement sites in a gene is potentially informative about the magnitude of selective forces associated with amino acid replacements. We have analyzed 56 genes in which polymorphism data from D. simulans are compared with divergence from a reference strain of D. melanogaster. The framework of the analysis is Bayesian and assumes that the distribution of selective effects (Malthusian fitnesses) is Gaussian with a mean that differs for each gene. In such a model, the average scaled selection intensity (gamma = N(e)s) of amino acid replacements eligible to become polymorphic or fixed is -7.31, and the standard deviation of selective effects within each locus is 6.79 (assuming homoscedasticity across loci). For newly arising mutations of this type that occur in autosomal or X-linked genes, the average proportion of beneficial mutations is 19.7%. Among the amino acid polymorphisms in the sample, the expected average proportion of beneficial mutations is 47.7%, and among amino acid replacements that become fixed the average proportion of beneficial mutations is 94.3%. The average scaled selection intensity of fixed mutations is +5.1. The presence of positive selection is pervasive with the single exception of kl-5, a Y-linked fertility gene. We find no evidence that a significant fraction of fixed amino acid replacements is neutral or nearly neutral or that positive selection drives amino acid replacements at only a subset of the loci. These results are model dependent and we discuss possible modifications of the model that might allow more neutral and nearly neutral amino acid replacements to be fixed.
View details for DOI 10.1007/s00239-003-0022-3
View details for Web of Science ID 000186396000016
View details for PubMedID 15008412
The cost of inbreeding in Arabidopsis
2002; 416 (6880): 531-534
Population geneticists have long sought to estimate the distribution of selection intensities among genes of diverse function across the genome. Only recently have DNA sequencing and analytical techniques converged to make this possible. Important advances have come from comparing genetic variation within species (polymorphism) with fixed differences between species (divergence). These approaches have been used to examine individual genes for evidence of selection. Here we use the fact that the time since species divergence allows combination of data across genes. In a comparison of amino-acid replacements among species of the mustard weed Arabidopsis with those among species of the fruitfly Drosophila, we find evidence for predominantly beneficial gene substitutions in Drosophila but predominantly detrimental substitutions in Arabidopsis. We attribute this difference to the Arabidopsis mating system of partial self-fertilization, which corroborates a prediction of population genetics theory that species with a high frequency of inbreeding are less efficient in eliminating deleterious mutations owing to their reduced effective population size.
View details for Web of Science ID 000174756500042
View details for PubMedID 11932744
A maximum likelihood method for analyzing pseudogene evolution: Implications for silent site evolution in humans and rodents
MOLECULAR BIOLOGY AND EVOLUTION
2002; 19 (1): 110-117
We present a new likelihood method for detecting constrained evolution at synonymous sites and other forms of nonneutral evolution in putative pseudogenes. The model is applicable whenever the DNA sequence is available from a protein-coding functional gene, a pseudogene derived from the protein-coding gene, and an orthologous functional copy of the gene. Two nested likelihood ratio tests are developed to test the hypotheses that (1) the putative pseudogene has equal rates of silent and replacement substitutions; and (2) the rate of synonymous substitution in the functional gene equals the rate of substitution in the pseudogene. The method is applied to a data set containing 74 human processed-pseudogene loci, 25 mouse processed-pseudogene loci, and 22 rat processed-pseudogene loci. Using the informatics resources of the Human Genome Project, we localized 67 of the human-pseudogene pairs in the genome and estimated the GC content of a large surrounding genomic region for each. We find that, for pseudogenes deposited in GC regions similar to those of their paralogs, the assumption of equal rates of silent and replacement site evolution in the pseudogene is upheld; in these cases, the rate of silent site evolution in the functional genes is approximately 70% the rate of evolution in the pseudogene. On the other hand, for pseudogenes located in genomic regions of much lower GC than their functional gene, we see a sharp increase in the rate of silent site substitutions, leading to a large rate of rejection for the pseudogene equality likelihood ratio test.
View details for Web of Science ID 000173101100013
View details for PubMedID 11752196
Directional selection and the site-frequency spectrum
2001; 159 (4): 1779-1788
In this article we explore statistical properties of the maximum-likelihood estimates (MLEs) of the selection and mutation parameters in a Poisson random field population genetics model of directional selection at DNA sites. We derive the asymptotic variances and covariance of the MLEs and explore the power of the likelihood ratio tests (LRT) of neutrality for varying levels of mutation and selection as well as the robustness of the LRT to deviations from the assumption of free recombination among sites. We also discuss the coverage of confidence intervals on the basis of two standard-likelihood methods. We find that the LRT has high power to detect deviations from neutrality and that the maximum-likelihood estimation performs very well when the ancestral states of all mutations in the sample are known. When the ancestral states are not known, the test has high power to detect deviations from neutrality for negative selection but not for positive selection. We also find that the LRT is not robust to deviations from the assumption of independence among sites.
View details for Web of Science ID 000173106800033
View details for PubMedID 11779814
Chromosomal effects of rapid gene evolution in Drosophila melanogaster
2001; 291 (5501): 128-130
Rapid adaptive fixation of a new favorable mutation is expected to affect neighboring genes along the chromosome. Evolutionary theory predicts that the chromosomal region would show a reduced level of genetic variation and an excess of rare alleles. We have confirmed these predictions in a region of the X chromosome of Drosophila melanogaster that contains a newly evolved gene for a component of the sperm axoneme. In D. simulans, where the novel gene does not exist, the pattern of genetic variation is consistent with selection against recurrent deleterious mutations. These findings imply that the pattern of genetic variation along a chromosome may be useful for inferring its evolutionary history and for revealing regions in which recent adaptive fixations have taken place.
View details for Web of Science ID 000166259100050
View details for PubMedID 11141564
Solvent accessibility and purifying selection within proteins of Escherichia coli and Salmonella enterica
MOLECULAR BIOLOGY AND EVOLUTION
2000; 17 (2): 301-308
The neutral theory of molecular evolution predicts that variation within species is inversely related to the strength of purifying selection, but the strength of purifying selection itself must be related to physical constraints imposed by protein folding and function. In this paper, we analyzed five enzymes for which polymorphic sequence variation within Escherichia coli and/or Salmonella enterica was available, along with a protein structure. Single and multivariate logistic regression models are presented that evaluate amino acid size, physicochemical properties, solvent accessibility, and secondary structure as predictors of polymorphism. A model that contains a positive coefficient of association between polymorphism and solvent accessibility and separate intercepts for each secondary-structure element is sufficient to explain the observed variation in polymorphism between sites. The model predicts an increase in the probability of amino acid polymorphism with increasing solvent accessibility for each protein regardless of physicochemical properties, secondary-structure element, or size of the amino acid. This result, when compared with the distribution of synonymous polymorphism, which shows no association with solvent accessibility, suggests a strong decrease in purifying selection with increasing solvent accessibility.
View details for Web of Science ID 000085139200010
View details for PubMedID 10677853
PRETREATMENT WITH 1,25(OH)2D(3) PROTECTS FROM CYTOXAN-INDUCED ALOPECIA WITHOUT PROTECTING THE LEUKEMIC-CELLS FROM CYTOXAN
AMERICAN JOURNAL OF THE MEDICAL SCIENCES
1995; 310 (2): 43-47
The authors previously demonstrated that pretreatment with 1,25(OH)2D3 protects from Cytoxan-induced alopecia in the rat model. The current study was designed to investigate whether 1,25(OH)2D3 protects the transplantable rat chloroleukemia (C51) cells from the cytotoxic effects of Cytoxan. In vitro, 4-hydroperoxycyclophosphamide had a dose-dependent cytotoxic effect on C51 cells. In separate experiments, preincubation with 1,25(OH)2D3 did not protect the C51 cells from the cytotoxic effects of 4-hydroperoxycyclophosphamide. In vivo, 4 groups of 10 5-day-old rats were treated as follows: Groups 1 and 2 received 0.2 micrograms of 1,25(OH)2D3 topically in ethanol daily starting on day 5 through day 10. Groups 3 and 4 received ethanol topically similarly. On day 7, all rats received 1 x 10(5) C51 cells intraperitoneally. On day 11, groups 1 and 3 received 35 mg/kg Cytoxan intraperitoneally. All rats in groups 2 and 4 were dead of leukemia by day 34. In groups 1 and 3, only 1 of 10 and 2 of 10 rats died of leukemia, respectively. Alopecia developed in all rats in group 3. In contrast, all rats in group 1 were protected from Cytoxan-induced alopecia. These results indicate that, in vivo, pretreatment with 1,25(OH)2D3 does not protect the rat chloroleukemia cells from the cytotoxic effect of Cytoxan, while protecting from Cytoxan-induced alopecia.
View details for Web of Science ID A1995RN01800001
View details for PubMedID 7631641