Dr. Tang received her PhD in Statistics, with a minor in Genetics, from Stanford University in 2002. From 2002 to 2006, she was on faculty in the PHS division at the Fred Hutchinson Cancer Research Center. Dr. Tang joined the Stanford Genetics Department in 2007. The goals of her research are to better understand the evolutionary forces that have shaped the pattern of genetic variation in humans, as well as to elucidate the genetic architecture of complex traits and diseases in the context of human evolution.
AB, Harvard and Radcliffe College, Biology (1997)
PhD, Stanford University, Statistics (minor Genetics) (2002)
Current Research and Scholarly Interests
Research in our laboratory develops and applies statistical methods for analyzing patterns of human genetic variation, which underlie the phenotypic diversity of our species. We are collaborating on various genome-wide studies focusing on stratified or recently admixed populations. These studies offer unique opportunities to elucidate the evolutionary forces that have shaped the patterns of genetic variation in humans, to uncover the genetic basis of complex traits, and to shed light on the mechanisms that lead to diverse phenotypes and disparate disease risks among populations.
- Statistical and Machine Learning Methods for Genomics
BIO 268 (Win)
Independent Studies (8)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Directed Reading in Genetics
GENE 299 (Aut, Win, Spr, Sum)
- Graduate Research
GENE 399 (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
GENE 370 (Aut, Win, Spr, Sum)
- Supervised Study
GENE 260 (Aut, Win, Spr, Sum)
- Undergraduate Research
GENE 199 (Aut, Win, Spr, Sum)
Prior Year Courses
- Statistical Genetics of Complex Traits
BIOS 259 (Win)
- Statistical and Machine Learning Methods for Genomics
BIO 268, BIOMEDIN 245, CS 373, GENE 245, STATS 345 (Spr)
Doctoral Dissertation Reader (AC)
Daniel Cotter, Roshni Patel, Alissa Severson, Ben Siranosian, Olivia de Goede
Postdoctoral Faculty Sponsor
Graduate and Fellowship Programs
Biomedical Informatics (PhD Program)
Genome-wide characterization of shared and distinct genetic components that influence blood lipid levels in ethnically diverse human populations.
American journal of human genetics
2013; 92 (6): 904-916
Blood lipid concentrations are heritable risk factors associated with atherosclerosis and cardiovascular diseases. Lipid traits exhibit considerable variation among populations of distinct ancestral origin as well as between individuals within a population. We performed association analyses to identify genetic loci influencing lipid concentrations in African American and Hispanic American women in the Women's Health Initiative SNP Health Association Resource. We validated one African-specific high-density lipoprotein cholesterol locus at CD36 as well as 14 known lipid loci that have been previously implicated in studies of European populations. Moreover, we demonstrate striking similarities in genetic architecture (loci influencing the trait, direction and magnitude of genetic effects, and proportions of phenotypic variation explained) of lipid traits across populations. In particular, we found that a disproportionate fraction of lipid variation in African Americans and Hispanic Americans can be attributed to genomic loci exhibiting statistical evidence of association in Europeans, even though the precise genes and variants remain unknown. At the same time, we found substantial allelic heterogeneity within shared loci, characterized both by population-specific rare variants and variants shared among multiple populations that occur at disparate frequencies. The allelic heterogeneity emphasizes the importance of including diverse populations in future genetic association studies of complex traits such as lipids; furthermore, the overlap in lipid loci across populations of diverse ancestral origin argues that additional knowledge can be gleaned from multiple populations.
View details for DOI 10.1016/j.ajhg.2013.04.025
View details for PubMedID 23726366
View details for PubMedCentralID PMC3675231
Genetic Architecture of Skin and Eye Color in an African-European Admixed Population
2013; 9 (3)
Variation in human skin and eye color is substantial and especially apparent in admixed populations, yet the underlying genetic architecture is poorly understood because most genome-wide studies are based on individuals of European ancestry. We study pigmentary variation in 699 individuals from Cape Verde, where extensive West African/European admixture has given rise to a broad range in trait values and genomic ancestry proportions. We develop and apply a new approach for measuring eye color, and identify two major loci (HERC2[OCA2] P = 2.3 × 10⁻⁶², SLC24A5 P = 9.6 × 10⁻⁹) that account for both blue versus brown eye color and varying intensities of brown eye color. We identify four major loci (SLC24A5 P = 5.4 × 10⁻²⁷, TYR P = 1.1 × 10⁻⁹, APBA2[OCA2] P = 1.5 × 10⁻⁸, SLC45A2 P = 6 × 10⁻⁹) for skin color that together account for 35% of the total variance, but the genetic component with the largest effect (~44%) is average genomic ancestry. Our results suggest that adjacent cis-acting regulatory loci for OCA2 explain the relationship between skin and eye color, and point to an underlying genetic architecture in which several genes of moderate effect act together with many genes of small effect to explain ~70% of the estimated heritability.
View details for DOI 10.1371/journal.pgen.1003372
View details for Web of Science ID 000316866700048
View details for PubMedID 23555287
View details for PubMedCentralID PMC3605137
Genome-Wide Association Studies of Quantitatively Measured Skin, Hair, and Eye Pigmentation in Four European Populations
2012; 7 (10)
Pigmentation of the skin, hair, and eyes varies both within and between human populations. Identifying the genes and alleles underlying this variation has been the goal of many candidate gene and several genome-wide association studies (GWAS). Most GWAS for pigmentary traits to date have been based on subjective phenotypes using categorical scales. But skin, hair, and eye pigmentation vary continuously. Here, we seek to characterize quantitative variation in these traits objectively and accurately and to determine their genetic basis. Objective and quantitative measures of skin, hair, and eye color were made using reflectance or digital spectroscopy in Europeans from Ireland, Poland, Italy, and Portugal. A GWAS was conducted for the three quantitative pigmentation phenotypes in 176 women across 313,763 SNP loci, and replication of the most significant associations was attempted in a sample of 294 European men and women from the same countries. We find that the pigmentation phenotypes are highly stratified along axes of European genetic differentiation. The country of sampling explains approximately 35% of the variation in skin pigmentation, 31% of the variation in hair pigmentation, and 40% of the variation in eye pigmentation. All three quantitative phenotypes are correlated with each other. In our two-stage association study, we reproduce the association of rs1667394 at the OCA2/HERC2 locus with eye color, but we do not identify new genetic determinants of skin and hair pigmentation. This supports the lack of major genes affecting skin and hair color variation within Europe and suggests that not only careful phenotyping but also larger cohorts are required to understand the genetic architecture of these complex quantitative traits. Interestingly, we also see that in each of these four populations, men are more lightly pigmented in the unexposed skin of the inner arm than women, a fact that is underappreciated and may vary across the world.
View details for DOI 10.1371/journal.pone.0048294
View details for Web of Science ID 000310600500094
View details for PubMedID 23118974
View details for PubMedCentralID PMC3485197
Ancestral Components of Admixed Genomes in a Mexican Cohort
2011; 7 (12)
For most of the world, human genome structure at a population level is shaped by interplay between ancient geographic isolation and more recent demographic shifts, factors that are captured by the concepts of biogeographic ancestry and admixture, respectively. The ancestry of non-admixed individuals can often be traced to a specific population in a precise region, but current approaches for studying admixed individuals generally yield coarse information in which genome ancestry proportions are identified according to continent of origin. Here we introduce a new analytic strategy for this problem that allows fine-grained characterization of admixed individuals with respect to both geographic and genomic coordinates. Ancestry segments from different continents, identified with a probabilistic model, are used to construct and study "virtual genomes" of admixed individuals. We apply this approach to a cohort of 492 parent-offspring trios from Mexico City. The relative contributions from the three continental-level ancestral populations (Africa, Europe, and America) vary substantially between individuals, and the distribution of haplotype block length suggests an admixing time of 10-15 generations. The European and Indigenous American virtual genomes of each Mexican individual can be traced to precise regions within each continent, and they reveal a gradient of Amerindian ancestry between indigenous people of southwestern Mexico and Mayans of the Yucatan Peninsula. This contrasts sharply with the African roots of African Americans, which have been characterized by a uniform mixing of multiple West African populations. We also use the virtual European and Indigenous American genomes to search for the signatures of selection in the ancestral populations, and we identify previously known targets of selection in other populations, as well as new candidate loci. The ability to infer precise ancestral components of admixed genomes will facilitate studies of disease-related phenotypes and will allow new insight into the adaptive and demographic history of indigenous people.
View details for DOI 10.1371/journal.pgen.1002410
View details for Web of Science ID 000299167900027
View details for PubMedID 22194699
View details for PubMedCentralID PMC3240599
Worldwide human relationships inferred from genome-wide patterns of variation
2008; 319 (5866): 1100-1104
Human genetic diversity is shaped by both demographic and biological factors and has fundamental implications for understanding the genetic basis of diseases. We studied 938 unrelated individuals from 51 populations of the Human Genome Diversity Panel at 650,000 common single-nucleotide polymorphism loci. Individual ancestry and population substructure were detectable with very high resolution. The relationship between haplotype heterozygosity and geography was consistent with the hypothesis of a serial founder effect with a single origin in sub-Saharan Africa. In addition, we observed a pattern of ancestral allele frequency distributions that reflects variation in population dynamics among geographic regions. This data set allows the most comprehensive characterization to date of human genetic variation.
View details for DOI 10.1126/science.1153717
View details for Web of Science ID 000253311700046
View details for PubMedID 18292342
Reconstructing genetic ancestry blocks in admixed individuals
AMERICAN JOURNAL OF HUMAN GENETICS
2006; 79 (1): 1-12
A chromosome in an individual of recently admixed ancestry resembles a mosaic of chromosomal segments, or ancestry blocks, each derived from a particular ancestral population. We consider the problem of inferring ancestry along the chromosomes in an admixed individual and thereby delineating the ancestry blocks. Using a simple population model, we infer gene-flow history in each individual. Compared with existing methods, which are based on a hidden Markov model, the Markov-hidden Markov model (MHMM) we propose has the advantage of accounting for the background linkage disequilibrium (LD) that exists in ancestral populations. When there are more than two ancestral groups, we allow each ancestral population to admix at a different time in history. We use simulations to illustrate the accuracy of the inferred ancestry as well as the importance of modeling the background LD; not accounting for background LD between markers may mislead us to false inferences about mixed ancestry in an indigenous population. The MHMM makes it possible to identify genomic blocks of a particular ancestry by use of any high-density single-nucleotide-polymorphism panel. One application of our method is to perform admixture mapping without genotyping special ancestry-informative-marker panels.
View details for Web of Science ID 000238341200001
View details for PubMedID 16773560
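To make the idea of ancestry blocks concrete, the sketch below decodes ancestry along one haplotype with a plain two-state hidden Markov model, the kind of baseline the abstract contrasts with the MHMM. It is not the MHMM and does not model background LD. The function name, the two-population setup, and the fixed `switch_prob` are illustrative assumptions, not details taken from the paper.

```python
import math

def viterbi_ancestry(alleles, freq_a, freq_b, switch_prob=0.05):
    """Most likely ancestry (0 = population A, 1 = population B) at each marker.

    alleles: 0/1 alleles along one haplotype
    freq_a, freq_b: per-marker frequency of allele 1 in each ancestral
        population; values must lie strictly between 0 and 1
    switch_prob: assumed chance of an ancestry switch between adjacent markers
    """
    freqs = (freq_a, freq_b)

    def log_emit(state, i):
        p = freqs[state][i]
        return math.log(p if alleles[i] == 1 else 1.0 - p)

    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob)

    # Uniform prior over the two ancestries at the first marker.
    score = [math.log(0.5) + log_emit(s, 0) for s in (0, 1)]
    back = []
    for i in range(1, len(alleles)):
        new_score, pointers = [], []
        for s in (0, 1):
            stay = score[s] + log_stay
            switch = score[1 - s] + log_switch
            best, prev = (stay, s) if stay >= switch else (switch, 1 - s)
            new_score.append(best + log_emit(s, i))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state to recover the full path.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy haplotype: the first half looks like population A, the second like B.
blocks = viterbi_ancestry([1, 1, 1, 1, 0, 0, 0, 0], [0.9] * 8, [0.1] * 8)
```

Runs of identical labels in the returned path correspond to ancestry blocks. Because this baseline treats markers as independent given ancestry, it is subject to the background-LD problem the abstract describes.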
Variation and genetic control of protein abundance in humans
2013; 499 (7456): 79-82
Gene expression differs among individuals and populations and is thought to be a major determinant of phenotypic variation. Although variation and genetic loci responsible for RNA expression levels have been analysed extensively in human populations, our knowledge is limited regarding the differences in human protein abundance and the genetic basis for this difference. Variation in messenger RNA expression is not a perfect surrogate for protein expression because the latter is influenced by an array of post-transcriptional regulatory mechanisms, and, empirically, the correlation between protein and mRNA levels is generally modest. Here we used isobaric tag-based quantitative mass spectrometry to determine relative protein levels of 5,953 genes in lymphoblastoid cell lines from 95 diverse individuals genotyped in the HapMap Project. We found that protein levels are heritable molecular phenotypes that exhibit considerable variation between individuals, populations and sexes. Levels of specific sets of proteins involved in the same biological process covary among individuals, indicating that these processes are tightly regulated at the protein level. We identified cis-pQTLs (protein quantitative trait loci), including variants not detected by previous transcriptome studies. This study demonstrates the feasibility of high-throughput human proteome quantification that, when integrated with DNA variation and transcriptome information, adds a new dimension to the characterization of gene expression regulation.
View details for DOI 10.1038/nature12223
View details for Web of Science ID 000321285600037
View details for PubMedID 23676674
Association of DXA-derived Bone Mineral Density and Fat Mass With African Ancestry
JOURNAL OF CLINICAL ENDOCRINOLOGY & METABOLISM
2013; 98 (4): E713-E717
Both genes and environment have been implicated in determining the complex body composition phenotypes in individuals of European ancestry; however, few studies have been conducted in other race/ethnic groups. We conducted a genome-wide admixture mapping study in an attempt to localize novel genomic regions associated with genetic ancestry. We selected a sample of 842 African-American women from the Women's Health Initiative single nucleotide polymorphism (SNP) Health Association Resource for whom several dual-energy X-ray absorptiometry (DXA)-derived bone mineral density (BMD) and fat mass phenotypes were available. We derived both global and local ancestry estimates for each individual from Affymetrix 6.0 data and analyzed the correlation of DXA phenotypes with global African ancestry. For each phenotype, we examined the association of local genetic ancestry (number of African ancestral alleles at each marker) and each DXA phenotype at 570 282 markers across the genome in additive models with adjustment for important covariates. Results: We identified statistically significant correlations of whole-body fat mass, trunk fat mass, and all 6 measures of BMD with a proportion of African ancestry. Genome-wide (admixture) significance for femoral neck BMD was achieved across 2 regions ∼3.7 MB and 0.3 MB on chromosome 19q13; similarly, total hip and intertrochanter BMD were associated with local ancestry in these regions. Trunk fat was the most significant fat mass phenotype, showing strong, but not genome-wide significant, associations on chromosome Xp22. Our results suggest that genomic regions in postmenopausal African-American women contribute to variance in BMD and fat mass and warrant further study.
View details for DOI 10.1210/jc.2012-3921
View details for Web of Science ID 000317195600014
View details for PubMedID 23436924
View details for PubMedCentralID PMC3615193
Variants in CXADR and F2RL1 are associated with blood pressure and obesity in African-Americans in regions identified through admixture mapping
JOURNAL OF HYPERTENSION
2012; 30 (10): 1970-1976
Genetic variants in 296 genes in regions identified through admixture mapping of hypertension, BMI, and lipids were assessed for association with hypertension, blood pressure (BP), BMI, and high-density lipoprotein cholesterol (HDL-C). This study examined coding SNPs identified from HapMap2 data that were located in genes on chromosomes 5, 6, 8, and 21, wherein ancestry association evidence for hypertension, BMI, or HDL-C was identified in previous admixture mapping studies. Genotyping was performed in 1733 unrelated African-Americans from the National Heart, Lung and Blood Institute's Family Blood Pressure Project, and gene-based association analyses were conducted for hypertension, SBP, DBP, BMI, and HDL-C. A gene score based on the number of minor alleles of each SNP in a gene was created and used for gene-based regression analyses, adjusting for age, age², sex, local marker ancestry, and BMI, as applicable. An individual's African ancestry estimated from 2507 ancestry-informative markers was also adjusted for to eliminate any confounding due to population stratification. CXADR (rs437470) on chromosome 21 was associated with SBP and DBP with or without adjusting for local ancestry (P < 0.0006). F2RL1 (rs631465) on chromosome 5 was associated with BMI (P = 0.0005). Local ancestry in these regions was associated with the respective traits as well. This study suggests that CXADR and F2RL1 likely play important roles in BP and obesity variation, respectively; and these findings are consistent with those of other studies, so replication and functional analyses are necessary.
View details for DOI 10.1097/HJH.0b013e3283578c80
View details for Web of Science ID 000308854500017
View details for PubMedID 22914544
View details for PubMedCentralID PMC3575678
Personal Omics Profiling Reveals Dynamic Molecular and Medical Phenotypes
2012; 148 (6): 1293-1307
Personalized medicine is expected to benefit from combining genomic information with regular monitoring of physiological states by multiple high-throughput methods. Here, we present an integrative personal omics profile (iPOP), an analysis that combines genomic, transcriptomic, proteomic, metabolomic, and autoantibody profiles from a single individual over a 14 month period. Our iPOP analysis revealed various medical risks, including type 2 diabetes. It also uncovered extensive, dynamic changes in diverse molecular components and biological pathways across healthy and diseased conditions. Extremely high-coverage genomic and transcriptomic data, which provide the basis of our iPOP, revealed extensive heteroallelic changes during healthy and diseased states and an unexpected RNA editing mechanism. This study demonstrates that longitudinal iPOP can be used to interpret healthy and diseased states by connecting genomic information with additional dynamic omics activity.
View details for DOI 10.1016/j.cell.2012.02.009
View details for PubMedID 22424236
Human genetic variation altering anthrax toxin sensitivity
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2012; 109 (8): 2972-2977
The outcome of exposure to infectious microbes or their toxins is influenced by both microbial and host genes. Some host genes encode defense mechanisms, whereas others assist pathogen functions. Genomic analyses have associated host gene mutations with altered infectious disease susceptibility, but evidence for causality is limited. Here we demonstrate that human genetic variation affecting capillary morphogenesis gene 2 (CMG2), which encodes a host membrane protein exploited by anthrax toxin as a principal receptor, dramatically alters toxin sensitivity. Lymphoblastoid cells derived from a HapMap Project cohort of 234 persons of African, European, or Asian ancestry differed in sensitivity mediated by the protective antigen (PA) moiety of anthrax toxin by more than four orders of magnitude, with 99% of the cohort showing a 250-fold range of sensitivity. We find that relative sensitivity is an inherited trait that correlates strongly with CMG2 mRNA abundance in cells of each ethnic/geographical group and in the combined population pool (P = 4 × 10⁻¹¹). The extent of CMG2 expression in transfected murine macrophages and human lymphoblastoid cells affected anthrax toxin binding, internalization, and sensitivity. A CMG2 single-nucleotide polymorphism (SNP) occurring frequently in African and European populations independently altered toxin uptake, but was not statistically associated with altered sensitivity in HapMap cell populations. Our results reveal extensive human diversity in cell lethality dependent on PA-mediated toxin binding and uptake, and identify individual differences in CMG2 expression level as a determinant of this diversity. Testing of genomically characterized human cell populations may offer a broadly useful strategy for elucidating effects of genetic variation on infectious disease susceptibility.
View details for DOI 10.1073/pnas.1121006109
View details for Web of Science ID 000300495100062
View details for PubMedID 22315420
View details for PubMedCentralID PMC3286947
Joint Testing of Genotype and Ancestry Association in Admixed Families
2010; 34 (8): 783-791
Current genome-wide association studies (GWAS) often involve populations that have experienced recent genetic admixture. Genotype data generated from these studies can be used to test for association directly, as in a non-admixed population. As an alternative, these data can be used to infer chromosomal ancestry, and thus allow for admixture mapping. We quantify the contribution of allele-based and ancestry-based association testing under a family design, and demonstrate that the two tests can provide non-redundant information. We propose a joint testing procedure, which efficiently integrates the two sources of information. The efficiencies of the allele, ancestry and combined tests are compared in the context of a GWAS. We discuss the impact of population history and provide guidelines for future design and analysis of GWAS in admixed populations.
View details for DOI 10.1002/gepi.20520
View details for Web of Science ID 000284719100002
View details for PubMedID 21031451
View details for PubMedCentralID PMC3103820
Lack of Association Between the Trp719Arg Polymorphism in Kinesin-Like Protein-6 and Coronary Artery Disease in 19 Case-Control Studies
JOURNAL OF THE AMERICAN COLLEGE OF CARDIOLOGY
2010; 56 (19): 1552-1563
We sought to replicate the association between the kinesin-like protein 6 (KIF6) Trp719Arg polymorphism (rs20455), and clinical coronary artery disease (CAD). Recent prospective studies suggest that carriers of the 719Arg allele in KIF6 are at increased risk of clinical CAD compared with noncarriers. The KIF6 Trp719Arg polymorphism (rs20455) was genotyped in 19 case-control studies of nonfatal CAD either as part of a genome-wide association study or in a formal attempt to replicate the initial positive reports. A total of 17,000 cases and 39,369 controls of European descent as well as a modest number of South Asians, African Americans, Hispanics, East Asians, and admixed cases and controls were successfully genotyped. None of the 19 studies demonstrated an increased risk of CAD in carriers of the 719Arg allele compared with noncarriers. Regression analyses and fixed-effects meta-analyses ruled out with high degree of confidence an increase of ≥2% in the risk of CAD among European 719Arg carriers. We also observed no increase in the risk of CAD among 719Arg carriers in the subset of Europeans with early-onset disease (younger than 50 years of age for men and younger than 60 years of age for women) compared with similarly aged controls as well as all non-European subgroups. The KIF6 Trp719Arg polymorphism was not associated with the risk of clinical CAD in this large replication study.
View details for DOI 10.1016/j.jacc.2010.06.022
View details for PubMedID 20933357
Molecular and Evolutionary History of Melanism in North American Gray Wolves
2009; 323 (5919): 1339-1343
Morphological diversity within closely related species is an essential aspect of evolution and adaptation. Mutations in the Melanocortin 1 receptor (Mc1r) gene contribute to pigmentary diversity in natural populations of fish, birds, and many mammals. However, melanism in the gray wolf, Canis lupus, is caused by a different melanocortin pathway component, the K locus, that encodes a beta-defensin protein that acts as an alternative ligand for Mc1r. We show that the melanistic K locus mutation in North American wolves derives from past hybridization with domestic dogs, has risen to high frequency in forested habitats, and exhibits a molecular signature of positive selection. The same mutation also causes melanism in the coyote, Canis latrans, and in Italian gray wolves, and hence our results demonstrate how traits selected in domesticated species can influence the morphological diversity of their wild relatives.
View details for DOI 10.1126/science.1165448
View details for Web of Science ID 000263876700041
View details for PubMedID 19197024
View details for PubMedCentralID PMC2903542
Characterizing the admixed African ancestry of African Americans
2009; 10 (12)
Accurate, high-throughput genotyping allows the fine characterization of genetic ancestry. Here we applied recently developed statistical and computational techniques to the question of African ancestry in African Americans by using data on more than 450,000 single-nucleotide polymorphisms (SNPs) genotyped in 94 Africans of diverse geographic origins included in the HGDP, as well as 136 African Americans and 38 European Americans participating in the Atherosclerotic Disease Vascular Function and Genetic Epidemiology (ADVANCE) study. To focus on African ancestry, we reduced the data to include only those genotypes in each African American determined statistically to be African in origin. From cluster analysis, we found that all the African Americans are admixed in their African components of ancestry, with the majority contributions being from West and West-Central Africa, and only modest variation in these African-ancestry proportions among individuals. Furthermore, by principal components analysis, we found little evidence of genetic structure within the African component of ancestry in African Americans. These results are consistent with historic mating patterns among African Americans that are largely uncorrelated to African ancestral origins, and they cast doubt on the general utility of mtDNA or Y-chromosome markers alone to delineate the full African ancestry of African Americans. Our results also indicate that the genetic architecture of African Americans is distinct from that of Africans, and that the greatest source of potential genetic stratification bias in case-control studies of African Americans derives from the proportion of European ancestry.
View details for DOI 10.1186/gb-2009-10-12-r141
View details for Web of Science ID 000274289000011
View details for PubMedID 20025784
View details for PubMedCentralID PMC2812948
Susceptibility locus for clinical and subclinical coronary artery disease at chromosome 9p21 in the multi-ethnic ADVANCE study
HUMAN MOLECULAR GENETICS
2008; 17 (15): 2320-2328
A susceptibility locus for coronary artery disease (CAD) at chromosome 9p21 has recently been reported, which may influence the age of onset of CAD. We sought to replicate these findings among white subjects and to examine whether these results are consistent with other racial/ethnic groups by genotyping three single nucleotide polymorphisms (SNPs) in the risk interval in the Atherosclerotic Disease, Vascular Function, and Genetic Epidemiology (ADVANCE) study. One or more of these SNPs was associated with clinical CAD in whites, U.S. Hispanics and U.S. East Asians. None of the SNPs were associated with CAD in African Americans although the power to detect an odds ratio (OR) in this group equivalent to that seen in whites was only 24-30%. ORs were higher in Hispanics and East Asians and lower in African Americans, but in all groups the 95% confidence intervals overlapped with ORs observed in whites. High-risk alleles were also associated with increased coronary artery calcification in controls and the magnitude of these associations by racial/ethnic group closely mirrored the magnitude observed for clinical CAD. Unexpectedly, we noted significant genotype frequency differences between male and female cases (P = 0.003-0.05). Consequently, men tended towards a recessive and women tended towards a dominant mode of inheritance. Finally, an effect of genotype on the age of onset of CAD was detected but only in men carrying two versus one or no copy of the high-risk allele and presenting with CAD at age >50 years. Further investigations in other populations are needed to confirm or refute our findings.
View details for DOI 10.1093/hmg/ddn132
View details for Web of Science ID 000257788300007
View details for PubMedID 18443000
View details for PubMedCentralID PMC2733811
IMPROVING POPULATION-SPECIFIC ALLELE FREQUENCY ESTIMATES BY ADAPTING SUPPLEMENTAL DATA: AN EMPIRICAL BAYES APPROACH
ANNALS OF APPLIED STATISTICS
2007; 1 (2): 459-479
Estimation of the allele frequency at genetic markers is a key ingredient in biological and biomedical research, such as studies of human genetic variation or of the genetic etiology of heritable traits. As genetic data becomes increasingly available, investigators face a dilemma: when should data from other studies and population subgroups be pooled with the primary data? Pooling additional samples will generally reduce the variance of the frequency estimates; however, used inappropriately, pooled estimates can be severely biased due to population stratification. Because of this potential bias, most investigators avoid pooling, even for samples with the same ethnic background and residing on the same continent. Here, we propose an empirical Bayes approach for estimating allele frequencies of single nucleotide polymorphisms. This procedure adaptively incorporates genotypes from related samples, so that more similar samples have a greater influence on the estimates. In every example we have considered, our estimator achieves a mean squared error (MSE) that is smaller than either pooling or not, and sometimes substantially improves over both extremes. The bias introduced is small, as is shown by a simulation study that is carefully matched to a real data example. Our method is particularly useful when small groups of individuals are genotyped at a large number of markers, a situation we are likely to encounter in a genome-wide association study.
View details for DOI 10.1214/07-AOAS121
View details for Web of Science ID 000261057600010
View details for PubMedCentralID PMC3065192
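The trade-off between pooling and not pooling can be illustrated with a simple shrinkage estimator. The sketch below is a generic Beta-binomial posterior mean, not the estimator proposed in the paper: the function name and the `prior_strength` parameter are hypothetical, and the empirical Bayes step of choosing the amount of shrinkage from the data is left out.

```python
def shrunken_frequency(primary_count, primary_n, supp_count, supp_n, prior_strength):
    """Allele frequency estimate pulled toward a supplemental sample.

    primary_count, primary_n: allele count and chromosomes in the primary sample
    supp_count, supp_n: the same quantities for the supplemental sample
    prior_strength: number of pseudo-chromosomes lent by the supplemental
        sample (hypothetical knob; 0 means no pooling, supp_n means full pooling)
    """
    supp_freq = supp_count / supp_n
    return (primary_count + prior_strength * supp_freq) / (primary_n + prior_strength)

# Primary sample: 3 of 20 chromosomes; supplemental sample: 40 of 200.
no_pooling = shrunken_frequency(3, 20, 40, 200, prior_strength=0)
partial = shrunken_frequency(3, 20, 40, 200, prior_strength=20)
full_pooling = shrunken_frequency(3, 20, 40, 200, prior_strength=200)
```

With `prior_strength=0` the estimate is the primary-sample frequency, and with `prior_strength` equal to the supplemental sample size it equals the pooled frequency. Intermediate values fall between the two, which is the range over which an adaptive method would choose.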
Recent genetic selection in the ancestral admixture of Puerto Ricans
AMERICAN JOURNAL OF HUMAN GENETICS
2007; 81 (3): 626-633
Recent studies have used dense markers to examine the human genome in ancestrally homogeneous populations for hallmarks of selection. No genomewide studies have focused on recently admixed groups--populations that have experienced admixing among continentally divided ancestral populations within the past 200-500 years. New World admixed populations are unique in that they represent the sudden confluence of geographically diverged genomes with novel environmental challenges. Here, we present a novel approach for studying selection by examining the genomewide distribution of ancestry in the genetically admixed Puerto Ricans. We find strong statistical evidence of recent selection in three chromosomal regions, including the human leukocyte antigen region on chromosome 6p, chromosome 8q, and chromosome 11q. Two of these regions harbor genes for olfactory receptors. Interestingly, all three regions exhibit deficiencies in the European-ancestry proportion.
View details for DOI 10.1086/520769
View details for Web of Science ID 000249128200019
View details for PubMedID 17701908
View details for PubMedCentralID PMC1950843
A statistical method for chromatographic alignment of LC-MS data
2007; 8 (2): 357-367
Integrated liquid-chromatography mass-spectrometry (LC-MS) is becoming a widely used approach for quantifying the protein composition of complex samples. The output of the LC-MS system measures the intensity of a peptide with a specific mass-charge ratio and retention time. In the last few years, this technology has been used to compare complex biological samples across multiple conditions. One challenge for comparative proteomic profiling with LC-MS is to match corresponding peptide features from different experiments. In this paper, we propose a new method, Peptide Element Alignment (PETAL), that uses raw spectrum data and detected peaks to simultaneously align features from multiple LC-MS experiments. PETAL creates spectrum elements, each of which represents the mass spectrum of a single peptide in a single scan. Peptides detected in different LC-MS data are aligned if they can be represented by the same elements. By considering each peptide separately, PETAL enjoys greater flexibility than time warping methods. While most existing methods process multiple data sets by sequentially aligning each data set to an arbitrarily chosen template data set, PETAL treats all experiments symmetrically and can analyze all experiments simultaneously. We illustrate the performance of PETAL on example data sets.
View details for DOI 10.1093/biostatistics/kxl015
View details for Web of Science ID 000245512000015
View details for PubMedID 16880200
Reduced selection leads to accelerated gene loss in Shigella
2007; 8 (8)
Obligate pathogenic bacteria lose more genes relative to facultative pathogens, which, in turn, lose more genes than free-living bacteria. It was suggested that the increased gene loss in obligate pathogens may be due to a reduction in the effectiveness of purifying selection. Less attention has been given to the causes of increased gene loss in facultative pathogens. We examined in detail the rate of gene loss in two groups of facultative pathogenic bacteria: pathogenic Escherichia coli, and Shigella. We show that Shigella strains are losing genes at an accelerated rate relative to pathogenic E. coli. We demonstrate that a genome-wide reduction in the effectiveness of selection contributes to the observed increase in the rate of gene loss in Shigella. When compared with their closely related pathogenic E. coli relatives, the more niche-limited Shigella strains appear to be losing genes at a significantly accelerated rate. A genome-wide reduction in the effectiveness of purifying selection plays a role in creating this observed difference. Our results demonstrate that differences in the effectiveness of selection contribute to differences in rate of gene loss in facultative pathogenic bacteria. We discuss how the lifestyle and pathogenicity of Shigella may alter the effectiveness of selection, thus influencing the rate of gene loss.
View details for DOI 10.1186/gb-2007-8-8-r164
View details for Web of Science ID 000253938500016
View details for PubMedID 17686180
View details for PubMedCentralID PMC2374995
Combining multiple family-based association studies.
2007; 1: S162-?
While high-throughput genotyping technologies are becoming readily available, the merit of using these technologies to perform genome-wide association studies has not been established. One major concern is that for studies of complex diseases and traits, the whole-genome approach requires such large sample sizes that both recruitment and genotyping pose considerable challenge. Here we propose a novel statistical method that boosts the effective sample size by combining data obtained from several studies. Specifically, we consider a situation in which various studies have genotyped non-overlapping subjects at largely non-overlapping sets of markers. Our approach, which exploits the local linkage disequilibrium structure without assuming an explicit population model, opens up the possibility of improving statistical power by incorporating existing data into future association studies.
View details for PubMedID 18466508
Genomewide evolutionary rates in laboratory and wild yeast
2006; 174 (1): 541-544
As wild organisms adapt to the laboratory environment, they become less relevant as biological models. It has been suggested that a commonly used S. cerevisiae strain has rapidly accumulated mutations in the lab. We report a low-to-intermediate rate of protein evolution in this strain relative to wild isolates.
View details for DOI 10.1534/genetics.106.060863
View details for Web of Science ID 000241134400048
View details for PubMedID 16816417
Locally weighted transmission/disequilibrium test for genetic association analysis
14th Genetic Analysis Workshop
BIOMED CENTRAL LTD. 2005
The transmission/disequilibrium test statistic has been used for assessing genetic association in affected-parent trios. In the presence of multiple tightly linked marker loci where local dependency may exist, haplotypes are reconstructed statistically to estimate the joint effects of these markers. In this manuscript, we propose an alternative to the haplotype approach by taking a weighted average of multiple loci, where the weight is proportional to the product of (1 - 2 × recombination fraction) and the linkage disequilibrium between markers. As an illustration, we applied the method to the simulated Aipotu data.
View details for Web of Science ID 000236103400060
View details for PubMedID 16451673
View details for PubMedCentralID PMC1866722
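The weighting rule described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `local_weights` and `weighted_average`, the normalization of the weights to sum to one, and the choice of what per-marker quantity is averaged are assumptions, and the abstract does not say which linkage disequilibrium measure is used.

```python
def local_weights(recomb_fractions, ld_values):
    """Weights proportional to (1 - 2 * theta) * LD, normalized to sum to 1.

    recomb_fractions[i]: recombination fraction between the target locus and
    marker i; ld_values[i]: a linkage disequilibrium measure for the same pair
    (which measure to use is an assumption left to the caller).
    """
    raw = [(1 - 2 * theta) * ld for theta, ld in zip(recomb_fractions, ld_values)]
    total = sum(raw)
    if total == 0:
        raise ValueError("all weights are zero; no informative markers")
    return [w / total for w in raw]


def weighted_average(per_marker_values, weights):
    """Weighted average of a per-marker quantity (hypothetical helper)."""
    return sum(v * w for v, w in zip(per_marker_values, weights))
```

A marker that is unlinked to the target locus (recombination fraction 0.5) or in linkage equilibrium with it (LD of 0) receives zero weight under this rule.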
A newly discovered founder population: the Roma/Gypsies
2005; 27 (10): 1084-1094
The Gypsies (a misnomer, derived from an early legend about Egyptian origins) defy the conventional definition of a population: they have no nation-state, speak different languages, belong to many religions and comprise a mosaic of socially and culturally divergent groups separated by strict rules of endogamy. Referred to as "the invisible minority", the Gypsies have for centuries been ignored by Western medicine, and their genetic heritage has only recently attracted attention. Common origins from a small group of ancestors characterise the 8-10 million European Gypsies as an unusual trans-national founder population, whose exodus from India played the role of a profound demographic bottleneck. Social and economic pressures within Europe led to gradual fragmentation, generating multiple genetically differentiated subisolates. The string of population bottlenecks and founder effects has shaped a unique genetic profile, whose potential for genetic research can be met only by study designs that acknowledge cultural tradition and self-identity.
View details for DOI 10.1002/bies.20287
View details for Web of Science ID 000232361100012
View details for PubMedID 16163730
Estimation of individual admixture: Analytical and study design considerations
2005; 28 (4): 289-301
The genome of an admixed individual represents a mixture of alleles from different ancestries. In the United States, the two largest minority groups, African-Americans and Hispanics, are both admixed. An understanding of the admixture proportion at an individual level (individual admixture, or IA) is valuable for both population geneticists and epidemiologists who conduct case-control association studies in these groups. Here we present an extension of a previously described frequentist (maximum likelihood or ML) approach to estimate individual admixture that allows for uncertainty in ancestral allele frequencies. We compare this approach both to prior partial likelihood based methods and to more recently described Bayesian MCMC methods. Our full ML method demonstrates increased robustness when compared to an existing partial ML approach. Simulations also suggest that this frequentist estimator achieves similar efficiency, measured by the mean squared error criterion, as Bayesian methods but requires just a fraction of the computational time to produce point estimates, allowing for extensive analysis (e.g., simulations) not possible by Bayesian methods. Our simulation results demonstrate that inclusion of ancestral populations or their surrogates in the analysis is required by any method of IA estimation to obtain reasonable results.
View details for DOI 10.1002/gepi.20064
View details for Web of Science ID 000228573700001
View details for PubMedID 15712363
Genetic structure, self-identified race/ethnicity, and confounding in case-control association studies
AMERICAN JOURNAL OF HUMAN GENETICS
2005; 76 (2): 268-275
We have analyzed genetic data for 326 microsatellite markers that were typed uniformly in a large multiethnic population-based sample of individuals as part of a study of the genetics of hypertension (Family Blood Pressure Program). Subjects identified themselves as belonging to one of four major racial/ethnic groups (white, African American, East Asian, and Hispanic) and were recruited from 15 different geographic locales within the United States and Taiwan. Genetic cluster analysis of the microsatellite markers produced four major clusters, which showed near-perfect correspondence with the four self-reported race/ethnicity categories. Of 3,636 subjects of varying race/ethnicity, only 5 (0.14%) showed genetic cluster membership different from their self-identified race/ethnicity. On the other hand, we detected only modest genetic differentiation between different current geographic locales within each race/ethnicity group. Thus, ancient geographic ancestry, which is highly correlated with self-identified race/ethnicity--as opposed to current residence--is the major determinant of genetic structure in the U.S. population. Implications of this genetic structure for case-control association studies are discussed.
View details for Web of Science ID 000226215100012
View details for PubMedID 15625622
Ethnicity and human genetic linkage maps
AMERICAN JOURNAL OF HUMAN GENETICS
2005; 76 (2): 276-290
Human genetic linkage maps are based on rates of recombination across the genome. These rates in humans vary by the sex of the parent from whom alleles are inherited, by chromosomal position, and by genomic features, such as GC content and repeat density. We have examined--for the first time, to our knowledge--racial/ethnic differences in genetic maps of humans. We constructed genetic maps based on 353 microsatellite markers in four racial/ethnic groups: whites, African Americans, Mexican Americans, and East Asians (Chinese and Japanese). These maps were generated using 9,291 subjects from 2,900 nuclear families who participated in the National Heart, Lung, and Blood Institute-funded Family Blood Pressure Program, the largest sample used for map construction to date. Although the maps for the different groups are generally similar, we did find regional and genomewide differences across ethnic groups, including a longer genomewide map for African Americans than for other populations. Some of this variation was explained by genotyping artifacts--namely, null alleles (i.e., alleles with null phenotypes) at a number of loci--and by ethnic differences in null-allele frequencies. In particular, null alleles appear to be the likely explanation for the excess map length in African Americans. We also found that nonrandom missing data biases map results. However, we found regions on chromosome 8p and telomeric segments with significant ethnic differences and a suggestive interval on chromosome 12q that were not due to genotype artifacts. The difference on chromosome 8p is likely due to a polymorphic inversion in the region. The results of our investigation have implications for inferences of possible genetic influences on human recombination as well as for future linkage studies, especially those involving populations of nonwhite ethnicity.
View details for Web of Science ID 000226215100013
View details for PubMedID 15627237
Geographic distribution of disease mutations in the Ashkenazi Jewish population supports genetic drift over selection
AMERICAN JOURNAL OF HUMAN GENETICS
2003; 72 (4): 812-822
The presence of four lysosomal storage diseases (LSDs) at increased frequency in the Ashkenazi Jewish population has suggested to many the operation of natural selection (carrier advantage) as the driving force. We compare LSDs and nonlysosomal storage diseases (NLSDs) in terms of the number of mutations, allele-frequency distributions, and estimated coalescence dates of mutations. We also provide new data on the European geographic distribution, in the Ashkenazi population, of seven LSD and seven NLSD mutations. No differences in any of the distributions were observed between LSDs and NLSDs. Furthermore, no regular pattern of geographic distribution was observed for LSD versus NLSD mutations, with some being more common in central Europe and others being more common in eastern Europe, within each group. The most striking disparate pattern was the geographic distribution of the two primary Tay-Sachs disease mutations, with the first being more common in central Europe (and likely older) and the second being exclusive to eastern Europe (primarily Lithuania and Russia) (and likely much younger). The latter demonstrates a pattern similar to two other recently arisen Lithuanian mutations, those for torsion dystonia and familial hypercholesterolemia. These observations provide compelling support for random genetic drift (chance founder effects, one approximately 11 centuries ago that affected all Ashkenazim and another approximately 5 centuries ago that affected Lithuanians), rather than selection, as the primary determinant of disease mutations in the Ashkenazi population.
View details for Web of Science ID 000181972600004
View details for PubMedID 12612865
Categorization of humans in biomedical research: genes, race and disease.
2002; 3 (7): comment2007-?
A debate has arisen regarding the validity of racial/ethnic categories for biomedical and genetic research. An epidemiologic perspective on the issue of human categorization in biomedical and genetic research strongly supports the continued use of self-identified race and ethnicity.
View details for PubMedID 12184798
Frequentist estimation of coalescence times from nucleotide sequence data using a tree-based partition
2002; 161 (1): 447-459
This article proposes a method of estimating the time to the most recent common ancestor (TMRCA) of a sample of DNA sequences. The method is based on the molecular clock hypothesis, but avoids assumptions about population structure. Simulations show that in a wide range of situations, the point estimate has small bias and the confidence interval has at least the nominal coverage probability. We discuss conditions that can lead to biased estimates. Performance of this estimator is compared with existing methods based on the coalescence theory. The method is applied to sequences of Y chromosomes and mtDNAs to estimate the coalescent times of human male and female populations.
View details for Web of Science ID 000175814900040
View details for PubMedID 12019257
Locating regions of differential variability in DNA and protein sequences
1999; 153 (1): 485-495
In the comparison of DNA and protein sequences between species or between paralogues or among individuals within a species or population, there is often some indication that different regions of the sequence are divergent or polymorphic to different degrees, indicating differential constraint or diversifying selection operating in different regions of the sequence. The problem is to test statistically whether the observed regional differences in the density of variant sites represent real differences and then to estimate as accurately as possible the location of the differential regions. A method is given for testing and locating regions of differential variation. The method consists of calculating G(x(k)) = k/n - x(k)/N, where x(k) is the position of the kth variant site along the sequence, n is the total number of variant sites, and N is the total sequence length. The estimated region is the longest stretch of adjacent sequence for which G(x(k)) is monotonically increasing (a hot spot) or decreasing (a cold spot). Critical values of this length for tests of significance are given, a sequential method is developed for locating multiple differential regions, and the power of the method against various alternatives is explored. The method locates the endpoints of hot spots and cold spots of variation with high accuracy.
View details for Web of Science ID 000082421600035
View details for PubMedID 10471728
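The statistic G(x(k)) = k/n - x(k)/N from the abstract above is simple to compute directly. The sketch below is an illustration under stated assumptions, not the published procedure: the function names are hypothetical, the stretch length is measured in sequence coordinates between the first and last variant site of a monotone run, and the significance test against critical values is omitted.

```python
def g_statistic(variant_positions, seq_length):
    """Return G(x(k)) = k/n - x(k)/N for each variant site, k = 1..n."""
    xs = sorted(variant_positions)
    n = len(xs)
    return [(k / n) - (x / seq_length) for k, x in enumerate(xs, start=1)]


def longest_monotone_stretch(variant_positions, seq_length, increasing=True):
    """Find the longest stretch over which G keeps rising (or falling).

    Returns (start_position, end_position) in sequence coordinates, or None
    if no two consecutive variant sites satisfy the requested direction.
    increasing=True looks for a candidate hot spot, False for a cold spot.
    """
    xs = sorted(variant_positions)
    g = g_statistic(xs, seq_length)
    best = None
    run_start = 0
    for i in range(1, len(xs)):
        step_ok = g[i] > g[i - 1] if increasing else g[i] < g[i - 1]
        if not step_ok:
            run_start = i  # the monotone run is broken; restart here
            continue
        if best is None or xs[i] - xs[run_start] > best[1] - best[0]:
            best = (xs[run_start], xs[i])
    return best


# Variant sites clustered around positions 50-56 of a 100-unit sequence.
sites = [10, 50, 52, 54, 56, 90]
hot = longest_monotone_stretch(sites, 100, increasing=True)
cold = longest_monotone_stretch(sites, 100, increasing=False)
```

G rises between two consecutive variant sites exactly when the gap between them is shorter than the average spacing N/n, so a run of closely spaced sites shows up as an increasing stretch and a long gap as a decreasing one.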