Professional Education

  • Bachelor of Science, Universita Degli Studi Di Torino, Biology (2004)
  • Master of Science, Universita Degli Studi Di Torino (2006)
  • Doctor of Philosophy, Washington University (2013)

All Publications

  • Leveraging heterogeneity across multiple datasets increases cell-mixture deconvolution accuracy and reduces biological and technical biases. Nature communications Vallania, F., Tam, A., Lofgren, S., Schaffert, S., Azad, T. D., Bongen, E., Haynes, W., Alsup, M., Alonso, M., Davis, M., Engleman, E., Khatri, P. 2018; 9 (1): 4735


    In silico quantification of cell proportions from mixed-cell transcriptomics data (deconvolution) requires a reference expression matrix, called basis matrix. We hypothesize that matrices created using only healthy samples from a single microarray platform would introduce biological and technical biases in deconvolution. We show presence of such biases in two existing matrices, IRIS and LM22, irrespective of deconvolution method. Here, we present immunoStates, a basis matrix built using 6160 samples with different disease states across 42 microarray platforms. We find that immunoStates significantly reduces biological and technical biases. Importantly, we find that different methods have virtually no or minimal effect once the basis matrix is chosen. We further show that cellular proportion estimates using immunoStates are consistently more correlated with measured proportions than IRIS and LM22, across all methods. Our results demonstrate the need and importance of incorporating biological and technical heterogeneity in a basis matrix for achieving consistently high accuracy.

    View details for DOI 10.1038/s41467-018-07242-6

    View details for PubMedID 30413720
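
    The deconvolution setup described in this abstract (a mixed-cell expression profile modelled as a basis matrix times a vector of cell proportions) can be illustrated with a minimal sketch. It assumes non-negative least squares solved by projected gradient descent, which is only one of several possible methods and is not claimed to be the one used in the paper. The function name `deconvolve` and the toy 4-gene, 2-cell-type basis matrix are hypothetical.

```python
def deconvolve(basis, mixture, iterations=500):
    """Estimate cell-type proportions p >= 0 that minimise ||basis @ p - mixture||^2.

    basis   : list of rows, one per gene, each with one value per cell type
    mixture : list of expression values, one per gene
    Illustrative projected-gradient solver; not the method from the paper.
    """
    n_genes, n_types = len(basis), len(basis[0])
    # The squared Frobenius norm bounds the largest eigenvalue of basis^T basis,
    # so 1 / norm is a safe gradient step size.
    step = 1.0 / sum(v * v for row in basis for v in row)
    p = [1.0 / n_types] * n_types
    for _ in range(iterations):
        residual = [
            sum(basis[g][k] * p[k] for k in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        gradient = [
            sum(basis[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        p = [max(0.0, p[k] - step * gradient[k]) for k in range(n_types)]
    total = sum(p)
    return [v / total for v in p] if total > 0 else p


# Hypothetical basis matrix: rows are genes, columns are two cell types.
toy_basis = [[10.0, 1.0], [8.0, 2.0], [1.0, 9.0], [2.0, 7.0]]
# Mixture built from known proportions 0.3 and 0.7.
toy_mixture = [3.7, 3.8, 6.6, 5.5]
estimated = deconvolve(toy_basis, toy_mixture)
```

    In this toy case the recovered proportions match the ones used to build the mixture, because the mixture is noise-free and built from the same basis matrix. The abstract's point is that with real data the choice of basis matrix matters more than the choice of solver.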

  • Author Correction: A multi-cohort study of the immune factors associated with M. tuberculosis infection outcomes. Nature Roy Chowdhury, R., Vallania, F., Yang, Q., Lopez Angel, C. J., Darboe, F., Penn-Nicholson, A., Rozot, V., Nemes, E., Malherbe, S. T., Ronacher, K., Walzl, G., Hanekom, W., Davis, M. M., Winter, J., Chen, X., Scriba, T. J., Khatri, P., Chien, Y. 2018


    The spelling of author Qianting Yang was corrected; the affiliation of author Stephanus T. Malherbe was corrected; and graphs in Fig. 4b and c were corrected owing to reanalysis of the data into the correct timed intervals.

    View details for DOI 10.1038/s41586-018-0635-8

    View details for PubMedID 30377311

  • A multi-cohort study of the immune factors associated with M. tuberculosis infection outcomes. Nature Roy Chowdhury, R., Vallania, F., Yang, Q., Lopez Angel, C. J., Darboe, F., Penn-Nicholson, A., Rozot, V., Nemes, E., Malherbe, S. T., Ronacher, K., Walzl, G., Hanekom, W., Davis, M. M., Winter, J., Chen, X., Scriba, T. J., Khatri, P., Chien, Y. 2018


    Most infections with Mycobacterium tuberculosis (Mtb) manifest as a clinically asymptomatic, contained state, known as latent tuberculosis infection, that affects approximately one-quarter of the global population (1). Although fewer than one in ten individuals eventually progress to active disease (2), tuberculosis is a leading cause of death from infectious disease worldwide (3). Despite intense efforts, immune factors that influence the infection outcomes remain poorly defined. Here we used integrated analyses of multiple cohorts to identify stage-specific host responses to Mtb infection. First, using high-dimensional mass cytometry analyses and functional assays of a cohort of South African adolescents, we show that latent tuberculosis is associated with enhanced cytotoxic responses, which are mostly mediated by CD16 (also known as FcγRIIIa) and natural killer cells, and continuous inflammation coupled with immune deviations in both T and B cell compartments. Next, using cell-type deconvolution of transcriptomic data from several cohorts of different ages, genetic backgrounds, geographical locations and infection stages, we show that although deviations in peripheral B and T cell compartments generally start at latency, they are heterogeneous across cohorts. However, an increase in the abundance of circulating natural killer cells in tuberculosis latency, with a corresponding decrease during active disease and a return to baseline levels upon clinical cure are features that are common to all cohorts. Furthermore, by analysing three longitudinal cohorts, we find that changes in peripheral levels of natural killer cells can inform disease progression and treatment responses, and inversely correlate with the inflammatory state of the lungs of patients with active tuberculosis. Together, our findings offer crucial insights into the underlying pathophysiology of tuberculosis latency, and identify factors that may influence infection outcomes.

    View details for DOI 10.1038/s41586-018-0439-x

    View details for PubMedID 30135583

  • Single-cell epigenetics - Chromatin modification atlas unveiled by mass cytometry. Clinical immunology (Orlando, Fla.) Cheung, P., Vallania, F., Dvorak, M., Chang, S. E., Schaffert, S., Donato, M., Rao, A., Mao, R., Utz, P. J., Khatri, P., Kuo, A. J. 2018


    Modifications of histone proteins are fundamental to the regulation of epigenetic phenotypes. Dysregulations of histone modifications have been linked to the pathogenesis of diverse human diseases. However, identifying differential histone modifications in patients with immune-mediated diseases has been challenging, in part due to the lack of a powerful analytic platform to study histone modifications in the complex human immune system. We recently developed a highly multiplexed platform, Epigenetic landscape profiling using cytometry by Time-Of-Flight (EpiTOF), to analyze the global levels of a broad array of histone modifications in single cells using mass cytometry. In this review, we summarize the development of EpiTOF and discuss its potential applications in biomedical research. We anticipate that this platform will provide new insights into the roles of epigenetic regulation in hematopoiesis, immune cell functions and immune system aging, and reveal aberrant epigenetic patterns associated with immune-mediated diseases.

    View details for DOI 10.1016/j.clim.2018.06.009

    View details for PubMedID 29960011

  • KLRD1-expressing natural killer cells predict influenza susceptibility GENOME MEDICINE Bongen, E., Vallania, F., Utz, P. J., Khatri, P. 2018; 10: 45


    Influenza infects tens of millions of people every year in the USA. Other than notable risk groups, such as children and the elderly, it is difficult to predict what subpopulations are at higher risk of infection. Viral challenge studies, where healthy human volunteers are inoculated with live influenza virus, provide a unique opportunity to study infection susceptibility. Biomarkers predicting influenza susceptibility would be useful for identifying risk groups and designing vaccines. We applied cell mixture deconvolution to estimate immune cell proportions from whole blood transcriptome data in four independent influenza challenge studies. We compared immune cell proportions in the blood between symptomatic shedders and asymptomatic nonshedders across three discovery cohorts prior to influenza inoculation and tested results in a held-out validation challenge cohort. Natural killer (NK) cells were significantly lower in symptomatic shedders at baseline in both discovery and validation cohorts. Hematopoietic stem and progenitor cells (HSPCs) were higher in symptomatic shedders at baseline in discovery cohorts. Although the HSPCs were higher in symptomatic shedders in the validation cohort, the increase was statistically nonsignificant. We observed that a gene associated with NK cells, KLRD1, which encodes CD94, was expressed at lower levels in symptomatic shedders at baseline in discovery and validation cohorts. KLRD1 expression in the blood at baseline negatively correlated with influenza infection symptom severity. KLRD1 expression 8 h post-infection in the nasal epithelium from a rhinovirus challenge study also negatively correlated with symptom severity. We identified KLRD1-expressing NK cells as a potential biomarker for influenza susceptibility. Expression of KLRD1 was inversely correlated with symptom severity. Our results support a model where an early response by KLRD1-expressing NK cells may control influenza infection.

    View details for DOI 10.1186/s13073-018-0554-1

    View details for Web of Science ID 000435421500001

    View details for PubMedID 29898768

    View details for PubMedCentralID PMC6001128

  • Single-Cell Chromatin Modification Profiling Reveals Increased Epigenetic Variations with Aging. Cell Cheung, P., Vallania, F., Warsinske, H. C., Donato, M., Schaffert, S., Chang, S. E., Dvorak, M., Dekker, C. L., Davis, M. M., Utz, P. J., Khatri, P., Kuo, A. J. 2018


    Post-translational modifications of histone proteins and exchanges of histone variants of chromatin are central to the regulation of nearly all DNA-templated biological processes. However, the degree and variability of chromatin modifications in specific human immune cells remain largely unknown. Here, we employ a highly multiplexed mass cytometry analysis to profile the global levels of a broad array of chromatin modifications in primary human immune cells at the single-cell level. Our data reveal markedly different cell-type- and hematopoietic-lineage-specific chromatin modification patterns. Differential analysis between younger and older adults shows that aging is associated with increased heterogeneity between individuals and elevated cell-to-cell variability in chromatin modifications. Analysis of a twin cohort unveils heritability of chromatin modifications and demonstrates that aging-related chromatin alterations are predominantly driven by non-heritable influences. Together, we present a powerful platform for chromatin and immunology research. Our discoveries highlight the profound impacts of aging on chromatin modifications.

    View details for DOI 10.1016/j.cell.2018.03.079

    View details for PubMedID 29706550

  • Interpretation of biological experiments changes with evolution of the Gene Ontology and its annotations SCIENTIFIC REPORTS Tomczak, A., Mortensen, J. M., Winnenburg, R., Liu, C., Alessi, D. T., Swamy, V., Vallania, F., Lofgren, S., Haynes, W., Shah, N. H., Musen, M. A., Khatri, P. 2018; 8: 5115


    Gene Ontology (GO) enrichment analysis is ubiquitously used for interpreting high throughput molecular data and generating hypotheses about underlying biological phenomena of experiments. However, the two building blocks of this analysis - the ontology and the annotations - evolve rapidly. We used gene signatures derived from 104 disease analyses to systematically evaluate how enrichment analysis results were affected by evolution of the GO over a decade. We found low consistency between enrichment analyses results obtained with early and more recent GO versions. Furthermore, there continues to be a strong annotation bias in the GO annotations where 58% of the annotations are for 16% of the human genes. Our analysis suggests that GO evolution may have affected the interpretation and possibly reproducibility of experiments over time. Hence, researchers must exercise caution when interpreting GO enrichment analyses and should reexamine previous analyses with the most recent GO version.

    View details for DOI 10.1038/s41598-018-23395-2

    View details for Web of Science ID 000428162800001

    View details for PubMedID 29572502

    View details for PubMedCentralID PMC5865181

  • Multicohort analysis reveals baseline transcriptional predictors of influenza vaccination responses SCIENCE IMMUNOLOGY Avey, S., Cheung, F., Fermin, D., Frelinger, J., Gaujoux, R., Gottardo, R., Khatri, P., Kleinstein, S. H., Kotliarov, Y., Meng, H., Sauteraud, R., Shen-Orr, S. S., Tsang, J. S., Vallania, F., Anguiano, E., Baisch, J., Baldwin, N., Belshe, R. B., Blevins, T. P., Chaussabel, D., Davis, M. M., Fikrig, E., Grill, D. E., Hafler, D. A., Henrich, E., Joshi, S. R., Kaech, S. M., Kennedy, R. B., Mohanty, S., Montgomery, R. R., Oberg, A. L., Obermoser, G., Ovsyannikova, I. G., Palucka, A., Pascual, V., Poland, G. A., Pulendran, B., Reinherz, E. L., Shaw, A. C., Siconolfi, B., Stuart, K. D., Tsang, S., Ueda, I., Wilson, J., Zapata, H. J., HIPC-CHI Signatures Project Team, HIPC-I Consortium 2017; 2 (14)
  • Methods to increase reproducibility in differential gene expression via meta-analysis. Nucleic acids research Sweeney, T. E., Haynes, W. A., Vallania, F., Ioannidis, J. P., Khatri, P. 2017; 45 (1)


    Findings from clinical and biological studies are often not reproducible when tested in independent cohorts. Due to the testing of a large number of hypotheses and relatively small sample sizes, results from whole-genome expression studies in particular are often not reproducible. Compared to single-study analysis, gene expression meta-analysis can improve reproducibility by integrating data from multiple studies. However, there are multiple choices in designing and carrying out a meta-analysis. Yet, clear guidelines on best practices are scarce. Here, we hypothesized that studying subsets of very large meta-analyses would allow for systematic identification of best practices to improve reproducibility. We therefore constructed three very large gene expression meta-analyses from clinical samples, and then examined meta-analyses of subsets of the datasets (all combinations of datasets with up to N/2 samples and K/2 datasets) compared to a 'silver standard' of differentially expressed genes found in the entire cohort. We tested three random-effects meta-analysis models using this procedure. We showed relatively greater reproducibility with more-stringent effect size thresholds with relaxed significance thresholds; relatively lower reproducibility when imposing extraneous constraints on residual heterogeneity; and an underestimation of actual false positive rate by Benjamini-Hochberg correction. In addition, multivariate regression showed that the accuracy of a meta-analysis increased significantly with more included datasets even when controlling for sample size.

    View details for DOI 10.1093/nar/gkw797

    View details for PubMedID 27634930

    View details for PubMedCentralID PMC5224496
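
    The abstract above refers to random-effects meta-analysis models without naming them. As an illustration only, the sketch below pools per-study effect sizes for a single gene with the DerSimonian-Laird estimator, a widely used random-effects model; whether it is one of the three models tested in the paper is an assumption. The function name `dersimonian_laird` and the example numbers are hypothetical.

```python
import math


def dersimonian_laird(effects, variances):
    """Pool per-study effect sizes with a DerSimonian-Laird random-effects model.

    effects   : per-study effect sizes (e.g. standardised mean differences)
    variances : per-study sampling variances of those effect sizes
    Returns (pooled_effect, standard_error, tau_squared).
    """
    k = len(effects)
    weights = [1.0 / v for v in variances]
    w_sum = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, effects)) / w_sum
    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    c = w_sum - sum(w * w for w in weights) / w_sum
    # Between-study variance, truncated at zero.
    tau_squared = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    re_weights = [1.0 / (v + tau_squared) for v in variances]
    re_sum = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / re_sum
    return pooled, math.sqrt(1.0 / re_sum), tau_squared


# Hypothetical example: two studies that disagree strongly.
pooled, se, tau2 = dersimonian_laird([0.0, 1.0], [0.01, 0.01])
```

    When studies disagree, the between-study variance `tau2` is positive and widens the standard error of the pooled effect. This is why effect-size thresholds and heterogeneity constraints interact with reproducibility in the way the abstract describes.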

  • Multicohort analysis reveals baseline transcriptional predictors of influenza vaccination responses. Science immunology 2017; 2 (14)


    Annual influenza vaccinations are currently recommended for all individuals 6 months and older. Antibodies induced by vaccination are an important mechanism of protection against infection. Despite the overall public health success of influenza vaccination, many individuals fail to induce a substantial antibody response. Systems-level immune profiling studies have discerned associations between transcriptional and cell subset signatures with the success of antibody responses. However, existing signatures have relied on small cohorts and have not been validated in large independent studies. We leveraged multiple influenza vaccination cohorts spanning distinct geographical locations and seasons from the Human Immunology Project Consortium (HIPC) and the Center for Human Immunology (CHI) to identify baseline (i.e., before vaccination) predictive transcriptional signatures of influenza vaccination responses. Our multicohort analysis of HIPC data identified nine genes (RAB24, GRB2, DPP3, ACTB, MVP, DPP7, ARPC4, PLEKHB2, and ARRB1) and three gene modules that were significantly associated with the magnitude of the antibody response, and these associations were validated in the independent CHI cohort. These signatures were specific to young individuals, suggesting that distinct mechanisms underlie the lower vaccine response in older individuals. We found an inverse correlation between the effect size of signatures in young and older individuals. Although the presence of an inflammatory gene signature, for example, was associated with better antibody responses in young individuals, it was associated with worse responses in older individuals. These results point to the prospect of predicting antibody responses before vaccination and provide insights into the biological mechanisms underlying successful vaccination responses.

    View details for DOI 10.1126/sciimmunol.aal4656

    View details for PubMedID 28842433

  • EMPOWERING MULTI-COHORT GENE EXPRESSION ANALYSIS TO INCREASE REPRODUCIBILITY. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Haynes, W. A., Vallania, F., Liu, C., Bongen, E., Tomczak, A., Andres-Terrè, M., Lofgren, S., Tam, A., Deisseroth, C. A., Li, M. D., Sweeney, T. E., Khatri, P. 2016; 22: 144-153


    A major contributor to the scientific reproducibility crisis has been that the results from homogeneous, single-center studies do not generalize to heterogeneous, real world populations. Multi-cohort gene expression analysis has helped to increase reproducibility by aggregating data from diverse populations into a single analysis. To make the multi-cohort analysis process more feasible, we have assembled an analysis pipeline which implements rigorously studied meta-analysis best practices. We have compiled and made publicly available the results of our own multi-cohort gene expression analysis of 103 diseases, spanning 615 studies and 36,915 samples, through a novel and interactive web application. As a result, we have made both the process of and the results from multi-cohort gene expression analysis more approachable for non-technical users.

    View details for PubMedID 27896970

    View details for PubMedCentralID PMC5167529

  • META-ANALYSIS OF CONTINUOUS PHENOTYPES IDENTIFIES A GENE SIGNATURE THAT CORRELATES WITH COPD DISEASE STATUS. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Scott, M., Vallania, F., Khatri, P. 2016; 22: 266-275


    The utility of multi-cohort two-class meta-analysis to identify robust differentially expressed gene signatures has been well established. However, many biomedical applications, such as gene signatures of disease progression, require one-class analysis. Here we describe an R package, MetaCorrelator, that can identify a reproducible transcriptional signature that is correlated with a continuous disease phenotype across multiple datasets. We successfully applied this framework to extract a pattern of gene expression that can predict lung function in patients with chronic obstructive pulmonary disease (COPD) in both peripheral blood mononuclear cells (PBMCs) and tissue. Our results point to a dysregulation in the oxidation state of the lungs of patients with COPD, as well as underscore the classically recognized inflammatory state that underlies this disease.

    View details for PubMedID 27896981

  • Validating single-cell genomics for the study of renal development KIDNEY INTERNATIONAL Jain, S., Noordam, M. J., Hoshi, M., Vallania, F. L., Conrad, D. F. 2014; 86 (5): 1049-1055


    Single-cell genomics will enable studies of the earliest events in kidney development, although it is unclear if existing technologies are mature enough to generate accurate and reproducible data on kidney progenitors. Here we designed a pilot study to validate a high-throughput assay to measure the expression levels of key regulators of kidney development in single cells isolated from embryonic mice. Our experiment produced 4608 expression measurements of 22 genes, made in small cell pools, and 28 single cells purified from the RET-positive ureteric bud. There were remarkable levels of concordance with expression data generated by traditional microarray analysis on bulk ureteric bud tissue with the correlation between our average single-cell measurements and GUDMAP measurements for each gene of 0.82-0.85. Nonetheless, a major motivation for single-cell technology is to uncover dynamic biology hidden in population means. There was evidence for extensive and surprising variation in expression of Wnt11 and Etv5, both downstream targets of activated RET. The variation for all genes in the study was strongly consistent with burst-like promoter kinetics. Thus, our results can inform the design of future single-cell experiments, which are poised to provide important insights into kidney development and disease.

    View details for DOI 10.1038/ki.2014.104

    View details for Web of Science ID 000344446000025

    View details for PubMedID 24759149

  • Origin and Consequences of the Relationship between Protein Mean and Variance PLOS ONE Vallania, F. L., Sherman, M., Goodwin, Z., Mogno, I., Cohen, B. A., Mitra, R. D. 2014; 9 (7)


    Cell-to-cell variance in protein levels (noise) is a ubiquitous phenomenon that can increase fitness by generating phenotypic differences within clonal populations of cells. An important challenge is to identify the specific molecular events that control noise. This task is complicated by the strong dependence of a protein's cell-to-cell variance on its mean expression level through a power-law like relationship (σ² ∝ μ^1.69). Here, we dissect the nature of this relationship using a stochastic model parameterized with experimentally measured values. This framework naturally recapitulates the power-law like relationship (σ² ∝ μ^1.6) and accurately predicts protein variance across the yeast proteome (r² = 0.935). Using this model we identified two distinct mechanisms by which protein variance can be increased. Variables that affect promoter activation, such as nucleosome positioning, increase protein variance by changing the exponent of the power-law relationship. In contrast, variables that affect processes downstream of promoter activation, such as mRNA and protein synthesis, increase protein variance in a mean-dependent manner following the power-law. We verified our findings experimentally using an inducible gene expression system in yeast. We conclude that the power-law-like relationship between noise and protein mean is due to the kinetics of promoter activation. Our results provide a framework for understanding how molecular processes shape stochastic variation across the genome.

    View details for DOI 10.1371/journal.pone.0102202

    View details for Web of Science ID 000339992600010

    View details for PubMedID 25062021
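
    A power-law relationship between variance and mean becomes a straight line on log-log axes, so its exponent can be read off as a slope. The sketch below shows that idea with ordinary least squares on log-transformed values. It is not the stochastic model from the paper, and the function name `fit_power_law_exponent`, the prefactor 2.0 and the synthetic data are hypothetical.

```python
import math


def fit_power_law_exponent(means, variances):
    """Fit variance = prefactor * mean ** exponent by least squares in log-log space.

    Returns (exponent, prefactor).
    """
    xs = [math.log(m) for m in means]
    ys = [math.log(v) for v in variances]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    exponent = sxy / sxx                  # slope of log(variance) vs log(mean)
    prefactor = math.exp(y_bar - exponent * x_bar)
    return exponent, prefactor


# Synthetic, noise-free data generated with an exponent of 1.69.
means = [10.0, 100.0, 1000.0, 10000.0]
variances = [2.0 * m ** 1.69 for m in means]
exponent, prefactor = fit_power_law_exponent(means, variances)
```

    On real measurements the points scatter around the line, and the abstract's distinction applies: some perturbations change the slope (the exponent), while others move proteins along the same line.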

  • Performance of Common Analysis Methods for Detecting Low-Frequency Single Nucleotide Variants in Targeted Next-Generation Sequence Data JOURNAL OF MOLECULAR DIAGNOSTICS Spencer, D. H., Tyagi, M., Vallania, F., Bredemeyer, A. J., Pfeifer, J. D., Mitra, R. D., Duncavage, E. J. 2014; 16 (1): 75-88


    Next-generation sequencing (NGS) is becoming a common approach for clinical testing of oncology specimens for mutations in cancer genes. Unlike inherited variants, cancer mutations may occur at low frequencies because of contamination from normal cells or tumor heterogeneity and can therefore be challenging to detect using common NGS analysis tools, which are often designed for constitutional genomic studies. We generated high-coverage (>1000×) NGS data from synthetic DNA mixtures with variant allele fractions (VAFs) of 25% to 2.5% to assess the performance of four variant callers, SAMtools, Genome Analysis Toolkit, VarScan2, and SPLINTER, in detecting low-frequency variants. SAMtools had the lowest sensitivity and detected only 49% of variants with VAFs of approximately 25%; whereas the Genome Analysis Toolkit, VarScan2, and SPLINTER detected at least 94% of variants with VAFs of approximately 10%. VarScan2 and SPLINTER achieved sensitivities of 97% and 89%, respectively, for variants with observed VAFs of 1% to 8%, with >98% sensitivity and >99% positive predictive value in coding regions. Coverage analysis demonstrated that >500× coverage was required for optimal performance. The specificity of SPLINTER improved with higher coverage, whereas VarScan2 yielded more false positive results at high coverage levels, although this effect was abrogated by removing low-quality reads before variant identification. Finally, we demonstrate the utility of high-sensitivity variant callers with data from 15 clinical lung cancers.

    View details for DOI 10.1016/j.jmoldx.2013.09.003

    View details for Web of Science ID 000328926500010

    View details for PubMedID 24211364

  • Population-based rare variant detection via pooled exome or custom hybridization capture with or without individual indexing BMC GENOMICS Ramos, E., Levinson, B. T., Chasnoff, S., Hughes, A., Young, A. L., Thornton, K., Li, A., Vallania, F. L., Province, M., Druley, T. E. 2012; 13


    Rare genetic variation in the human population is a major source of pathophysiological variability and has been implicated in a host of complex phenotypes and diseases. Finding disease-related genes harboring disparate functional rare variants requires sequencing of many individuals across many genomic regions and comparing against unaffected cohorts. However, despite persistent declines in sequencing costs, population-based rare variant detection across large genomic target regions remains cost prohibitive for most investigators. In addition, DNA samples are often precious and hybridization methods typically require large amounts of input DNA. Pooled sample DNA sequencing is a cost and time-efficient strategy for surveying populations of individuals for rare variants. We set out to 1) create a scalable, multiplexing method for custom capture with or without individual DNA indexing that was amenable to low amounts of input DNA and 2) expand the functionality of the SPLINTER algorithm for calling substitutions, insertions and deletions across either candidate genes or the entire exome by integrating the variant calling algorithm with the dynamic programming aligner, Novoalign. We report methodology for pooled hybridization capture with pre-enrichment, indexed multiplexing of up to 48 individuals or non-indexed pooled sequencing of up to 92 individuals with as little as 70 ng of DNA per person. Modified solid phase reversible immobilization bead purification strategies enable no sample transfers from sonication in 96-well plates through adapter ligation, resulting in 50% less library preparation reagent consumption. Custom Y-shaped adapters containing novel 7 base pair index sequences with a Hamming distance of ≥2 were directly ligated onto fragmented source DNA, eliminating the need for PCR to incorporate indexes, and this was followed by a custom blocking strategy using a single oligonucleotide regardless of index sequence. These results were obtained by aligning raw reads against the entire genome using Novoalign followed by variant calling of non-indexed pools using SPLINTER or SAMtools for indexed samples. With these pipelines, we find sensitivity and specificity of 99.4% and 99.7% for pooled exome sequencing. Sensitivity, and to a lesser degree specificity, proved to be a function of coverage. For rare variants (≤2% minor allele frequency), we achieved sensitivity and specificity of ≥94.9% and ≥99.99% for custom capture of 2.5 Mb in multiplexed libraries of 22-48 individuals with only ≥5-fold coverage/chromosome, but these parameters improved to ≥98.7 and 100% with 20-fold coverage/chromosome. This highly scalable methodology enables accurate rare variant detection, with or without individual DNA sample indexing, while reducing the amount of required source DNA and total costs through less hybridization reagent consumption, multi-sample sonication in a standard PCR plate, multiplexed pre-enrichment pooling with a single hybridization and lower sequencing coverage required to obtain high sensitivity.

    View details for DOI 10.1186/1471-2164-13-683

    View details for Web of Science ID 000312962400001

    View details for PubMedID 23216810

  • Detection of rare genomic variants from pooled sequencing using SPLINTER. Journal of visualized experiments : JoVE Vallania, F., Ramos, E., Cresci, S., Mitra, R. D., Druley, T. E. 2012


    As DNA sequencing technology has markedly advanced in recent years(2), it has become increasingly evident that the amount of genetic variation between any two individuals is greater than previously thought(3). In contrast, array-based genotyping has failed to identify a significant contribution of common sequence variants to the phenotypic variability of common disease(4,5). Taken together, these observations have led to the evolution of the Common Disease / Rare Variant hypothesis suggesting that the majority of the "missing heritability" in common and complex phenotypes is instead due to an individual's personal profile of rare or private DNA variants(6-8). However, characterizing how rare variation impacts complex phenotypes requires the analysis of many affected individuals at many genomic loci, and is ideally compared to a similar survey in an unaffected cohort. Despite the sequencing power offered by today's platforms, a population-based survey of many genomic loci and the subsequent computational analysis required remains prohibitive for many investigators. To address this need, we have developed a pooled sequencing approach(1,9) and a novel software package(1) for highly accurate rare variant detection from the resulting data. The ability to pool genomes from entire populations of affected individuals and survey the degree of genetic variation at multiple targeted regions in a single sequencing library provides excellent cost and time savings compared with traditional single-sample sequencing methodology. With a mean sequencing coverage per allele of 25-fold, our custom algorithm, SPLINTER, uses an internal variant calling control strategy to call insertions, deletions and substitutions up to four base pairs in length with high sensitivity and specificity from pools of up to 1 mutant allele in 500 individuals. Here we describe the method for preparing the pooled sequencing library followed by step-by-step instructions on how to use the SPLINTER package for pooled sequencing analysis. We show a comparison between pooled sequencing of 947 individuals, all of whom also underwent genome-wide array, at over 20kb of sequencing per person. Concordance between genotyping of tagged and novel variants called in the pooled sample was excellent. This method can be easily scaled up to any number of genomic loci and any number of individuals. By incorporating the internal positive and negative amplicon controls at ratios that mimic the population under study, the algorithm can be calibrated for optimal performance. This strategy can also be modified for use with hybridization capture or individual-specific barcodes and can be applied to the sequencing of naturally heterogeneous samples, such as tumor DNA.

    View details for DOI 10.3791/3943

    View details for PubMedID 22760212

  • Rare Variants in APP, PSEN1 and PSEN2 Increase Risk for AD in Late-Onset Alzheimer's Disease Families PLOS ONE Cruchaga, C., Chakraverty, S., Mayo, K., Vallania, F. L., Mitra, R. D., Faber, K., Williamson, J., Bird, T., Diaz-Arrastia, R., Foroud, T. M., Boeve, B. F., Graff-Radford, N. R., Jean, P. S., Lawson, M., Ehm, M. G., Mayeux, R., Goate, A. M. 2012; 7 (2)


    Pathogenic mutations in APP, PSEN1, PSEN2, MAPT and GRN have previously been linked to familial early onset forms of dementia. Mutation screening in these genes has been performed in either very small series or in single families with late onset AD (LOAD). Similarly, studies in single families have reported mutations in MAPT and GRN associated with clinical AD but no systematic screen of a large dataset has been performed to determine how frequently this occurs. We report sequence data for 439 probands from late-onset AD families with a history of four or more affected individuals. Sixty sequenced individuals (13.7%) carried a novel or pathogenic mutation. Eight pathogenic variants (one each in APP and MAPT, two in PSEN1 and four in GRN), three of which are novel, were found in 14 samples. Thirteen additional variants, present in 23 families, did not segregate with disease, but the frequency of these variants is higher in AD cases than controls, indicating that these variants may also modify risk for disease. The frequency of rare variants in these genes in this series is significantly higher than in the 1000 Genomes Project (p = 5.09 × 10⁻⁵; OR = 2.21; 95%CI = 1.49-3.28) or an unselected population of 12,481 samples (p = 6.82 × 10⁻⁵; OR = 2.19; 95%CI = 1.347-3.26). Rare coding variants in APP, PSEN1 and PSEN2 increase risk for or cause late onset AD. The presence of variants in these genes in LOAD and early-onset AD demonstrates that factors other than the mutation can impact the age at onset and penetrance of at least some variants associated with AD. MAPT and GRN mutations can be found in clinical series of AD most likely due to misdiagnosis. This study clearly demonstrates that rare variants in these genes could explain an important proportion of genetic heritability of AD, which is not detected by GWAS.

    View details for DOI 10.1371/journal.pone.0031039

    View details for Web of Science ID 000301977500027

    View details for PubMedID 22312439

  • Rare missense variants in CHRNB4 are associated with reduced risk of nicotine dependence HUMAN MOLECULAR GENETICS Haller, G., Druley, T., Vallania, F. L., Mitra, R. D., Li, P., Akk, G., Steinbach, J. H., Breslau, N., Johnson, E., Hatsukami, D., Stitzel, J., Bierut, L. J., Goate, A. M. 2012; 21 (3): 647-655


    Genome-wide association studies have identified common variation in the CHRNA5-CHRNA3-CHRNB4 and CHRNA6-CHRNB3 gene clusters that contribute to nicotine dependence. However, the role of rare variation in risk for nicotine dependence in these nicotinic receptor genes has not been studied. We undertook pooled sequencing of the coding regions and flanking sequence of the CHRNA5, CHRNA3, CHRNB4, CHRNA6 and CHRNB3 genes in African American and European American nicotine-dependent smokers and smokers without symptoms of dependence. Carrier status of individuals harboring rare missense variants at conserved sites in each of these genes was then compared in cases and controls to test for an association with nicotine dependence. Missense variants at conserved residues in CHRNB4 are associated with lower risk for nicotine dependence in African Americans and European Americans (AA P = 0.0025, odds-ratio (OR) = 0.31, 95% confidence-interval (CI) = 0.31-0.72; EA P = 0.023, OR = 0.69, 95% CI = 0.50-0.95). Furthermore, these individuals were found to smoke fewer cigarettes per day than non-carriers (AA P = 6.6 × 10⁻⁵, EA P = 0.021). Given the possibility of stochastic differences in rare allele frequencies between groups, replication of this association is necessary to confirm these findings. The functional effects of the two CHRNB4 variants contributing most to this association (T375I and T91I) and a missense variant in CHRNA3 (R37H) in strong linkage disequilibrium with T91I were examined in vitro. The minor allele of each polymorphism increased cellular response to nicotine (T375I P = 0.01, T91I P = 0.02, R37H P = 0.003), but the largest effect on in vitro receptor activity was seen in the presence of both CHRNB4 T91I and CHRNA3 R37H (P = 2 × 10⁻⁶).

    View details for DOI 10.1093/hmg/ddr498

    View details for Web of Science ID 000299351000015

    View details for PubMedID 22042774

  • High-throughput discovery of rare insertions and deletions in large cohorts GENOME RESEARCH Vallania, F. L., Druley, T. E., Ramos, E., Wang, J., Borecki, I., Province, M., Mitra, R. D. 2010; 20 (12): 1711-1718


    Pooled-DNA sequencing strategies enable fast, accurate, and cost-effective detection of rare variants, but current approaches are not able to accurately identify short insertions and deletions (indels), despite their pivotal role in genetic disease. Furthermore, the sensitivity and specificity of these methods depend on arbitrary, user-selected significance thresholds, whose optimal values change from experiment to experiment. Here, we present a combined experimental and computational strategy that combines a synthetically engineered DNA library inserted in each run and a new computational approach named SPLINTER that detects and quantifies short indels and substitutions in large pools. SPLINTER integrates information from the synthetic library to select the optimal significance thresholds for every experiment. We show that SPLINTER detects indels (up to 4 bp) and substitutions in large pools with high sensitivity and specificity, accurately quantifies variant frequency (r = 0.999), and compares favorably with existing algorithms for the analysis of pooled sequencing data. We applied our approach to analyze a cohort of 1152 individuals, identifying 48 variants and validating 14 of 14 (100%) predictions by individual genotyping. Thus, our strategy provides a novel and sensitive method that will speed the discovery of novel disease-causing rare variants.

    View details for DOI 10.1101/gr.109157.110

    View details for Web of Science ID 000284835000010

    View details for PubMedID 21041413

  • TATA is a modular component of synthetic promoters GENOME RESEARCH Mogno, I., Vallania, F., Mitra, R. D., Cohen, B. A. 2010; 20 (10): 1391-1397


    The expression of most genes is regulated by multiple transcription factors. The interactions between transcription factors produce complex patterns of gene expression that are not always obvious from the arrangement of cis-regulatory elements in a promoter. One critical element of promoters is the TATA box, the docking site for the RNA polymerase holoenzyme. Using a synthetic promoter system coupled to a thermodynamic model of combinatorial regulation, we analyze the effects of different strength TATA boxes on various aspects of combinatorial cis-regulation. The thermodynamic model explains 75% of the variance in gene expression in synthetic promoter libraries with different strength TATA boxes, suggesting that many of the salient aspects of cis-regulation are captured by the model. Our results demonstrate that the effect of changing the TATA box on gene expression is the same for all synthetic promoters regardless of the arrangement of cis-regulatory sites we studied. Our analysis also showed that in our synthetic system the strength of the RNA polymerase-TATA interaction does not alter the combinatorial interactions between transcription factors, or between transcription factors and RNA polymerase. Finally, we show that although stronger TATA boxes increase expression in a predictable fashion, stronger TATA boxes have very little effect on noise in our synthetic promoters, regardless of the arrangement of cis-regulatory sites. Our results support a modular model of promoter function, where cis-regulatory elements can be mixed and matched (programmed) with outcomes on expression that are predictable based on the rules of simple protein-protein and protein-DNA interactions.

    View details for DOI 10.1101/gr.106732.110

    View details for Web of Science ID 000282375000009

    View details for PubMedID 20627890

  • Cardiac signaling genes exhibit unexpected sequence diversity in sporadic cardiomyopathy, revealing HSPB7 polymorphisms associated with disease JOURNAL OF CLINICAL INVESTIGATION Matkovich, S. J., Van Booven, D. J., Hindes, A., Kang, M. Y., Druley, T. E., Vallania, F. L., Mitra, R. D., Reilly, M. P., Cappola, T. P., Dorn, G. W. 2010; 120 (1): 280-289


    Sporadic heart failure is thought to have a genetic component, but the contributing genetic events are poorly defined. Here, we used ultra-high-throughput resequencing of pooled DNAs to identify SNPs in 4 biologically relevant cardiac signaling genes, and then examined the association between allelic variants and incidence of sporadic heart failure in 2 large Caucasian populations. Resequencing of DNA pools, each containing DNA from approximately 100 individuals, was rapid, accurate, and highly sensitive for identifying common and rare SNPs; it also had striking advantages in time and cost efficiencies over individual resequencing using conventional Sanger methods. In 2,606 individuals examined, we identified a total of 129 separate SNPs in the 4 cardiac signaling genes, including 23 nonsynonymous SNPs that we believe to be novel. Comparison of allele frequencies between 625 Caucasian nonaffected controls and 1,117 Caucasian individuals with systolic heart failure revealed 12 SNPs in the cardiovascular heat shock protein gene HSPB7 with greater proportional representation in the systolic heart failure group; all 12 SNPs were confirmed in an independent replication study. These SNPs were found to be in tight linkage disequilibrium, likely reflecting a single genetic event, but none altered amino acid sequence. These results establish the power and applicability of pooled resequencing for comparative SNP association analysis of target subgenomes in large populations and identify an association between multiple HSPB7 polymorphisms and heart failure.

    View details for DOI 10.1172/JCI39085

    View details for Web of Science ID 000273495700031

    View details for PubMedID 20038796

  • The RhoU/Wrch1 Rho GTPase gene is a common transcriptional target of both the gp130/STAT3 and Wnt-1 pathways BIOCHEMICAL JOURNAL Schiavone, D., Dewilde, S., Vallania, F., Turkson, J., Di Cunto, F., Poli, V. 2009; 421: 283-292


    STAT3 (signal transducer and activator of transcription 3) is a transcription factor activated by cytokines, growth factors and oncogenes, whose activity is required for cell survival/proliferation of a wide variety of primary tumours and tumour cell lines. Prominent among its multiple effects on tumour cells is the stimulation of cell migration and metastasis, whose functional mechanisms are however not completely characterized. RhoU/Wrch1 (Wnt-responsive Cdc42 homologue) is an atypical Rho GTPase thought to be constitutively bound to GTP. RhoU was first identified as a Wnt-1-inducible mRNA and subsequently shown to act on the actin cytoskeleton by stimulating filopodia formation and stress fibre dissolution. It was in addition recently shown to localize to focal adhesions and to Src-induced podosomes and enhance cell migration. RhoU overexpression in mammary epithelial cells stimulates quiescent cells to re-enter the cell cycle and morphologically phenocopies Wnt-1-dependent transformation. In the present study we show that Wnt-1-mediated RhoU induction occurs at the transcriptional level. Moreover, we demonstrate that RhoU can also be induced by gp130 cytokines via STAT3, and we identify two functional STAT3-binding sites on the mouse RhoU promoter. RhoU induction by Wnt-1 is independent of beta-catenin, but does not involve STAT3. Rather, it is mediated by the Wnt/planar cell polarity pathway through the activation of JNK (c-Jun N-terminal kinase). Both the so-called non-canonical Wnt pathway and STAT3 are therefore able to induce RhoU, which in turn may be involved in mediating their effects on cell migration.

    View details for DOI 10.1042/BJ20090061

    View details for Web of Science ID 000268088100015

    View details for PubMedID 19397496

  • Quantification of rare allelic variants from pooled genomic DNA NATURE METHODS Druley, T. E., Vallania, F. L., Wegner, D. J., Varley, K. E., Knowles, O. L., Bonds, J. A., Robison, S. W., Doniger, S. W., Hamvas, A., Cole, F. S., Fay, J. C., Mitra, R. D. 2009; 6 (4): 263-265


    We report a targeted, cost-effective method to quantify rare single-nucleotide polymorphisms from pooled human genomic DNA using second-generation sequencing. We pooled DNA from 1,111 individuals and targeted four genes to identify rare germline variants. Our base-calling algorithm, SNPSeeker, derived from large deviation theory, detected single-nucleotide polymorphisms present at frequencies below the raw error rate of the sequencing platform.

    View details for DOI 10.1038/NMETH.1307

    View details for Web of Science ID 000264738800013

    View details for PubMedID 19252504

  • Genome-wide discovery of functional transcription factor binding sites by comparative genomics: The case of Stat3 PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Vallania, F., Schiavone, D., Dewilde, S., Pupo, E., Garbay, S., Calogero, R., Pontoglio, M., Provero, P., Poli, V. 2009; 106 (13): 5117-5122


    The identification of direct targets of transcription factors is a key problem in the study of gene regulatory networks. However, the use of high throughput experimental methods, such as ChIP-chip and ChIP-sequencing, is limited by their high cost and strong dependence on cellular type and context. We developed a computational method for the genome-wide identification of functional transcription factor binding sites based on positional weight matrices, comparative genomics, and gene expression profiling. The method was applied to Stat3, a transcription factor playing crucial roles in inflammation, immunity and oncogenesis, and able to induce distinct subsets of target genes in different cell types or conditions. A newly generated positional weight matrix enabled us to assign affinity scores of high specificity, as measured by EMSA competition assays. Phylogenetic conservation with 7 vertebrate species was used to select the binding sites most likely to be functional. Validation was carried out on predicted sites within genes identified as differentially expressed in the presence or absence of Stat3 by microarray analysis. Twelve of the fourteen sites tested were bound by Stat3 in vivo, as assessed by Chromatin Immunoprecipitation, allowing us to identify 9 Stat3 transcriptional targets. Given its high validation rate, and the availability of large transcription factor-dependent gene expression datasets obtained under diverse experimental conditions, our approach appears to be a valid alternative to high-throughput experimental assays for the discovery of novel direct targets of transcription factors.

    View details for DOI 10.1073/pnas.0900473106

    View details for Web of Science ID 000264790600031

    View details for PubMedID 19282476