Julia Salzman
Associate Professor of Biomedical Data Science, of Biochemistry and, by courtesy, of Statistics and of Biology
Department of Biomedical Data Science
Bio
The Salzman lab views genomic data as a statistical signal processing problem, leveraging tools from statistics, traditional machine learning and deep learning for biological discovery. We have recently introduced a new unifying paradigm, SPLASH (Statistically Primary aLignment Agnostic Sequence Homing), a data-compressive signal extraction method that bypasses the historical computational machinery of genomics. Ongoing work extends the scope of this approach, making many problems in genomic sequence analysis easily accessible to computational engineers and statisticians. We are using SPLASH and its sister algorithms to train biological deep learning models at unprecedented scale, and leveraging these models to predict phenotypes ranging from single-cell analyses of human and non-model organisms to drug responses in microbes, as well as other phenotypes across the tree of life, including in the study of plants, viruses, oceanic systems and their symbioses. The reference-free nature of SPLASH and its sister approaches enables both the prediction of biological behavior (e.g., phenotype) and the attribution of the features responsible for these predictions.
Academic Appointments
-
Associate Professor, Department of Biomedical Data Science
-
Associate Professor, Biochemistry
-
Associate Professor (By courtesy), Statistics
-
Associate Professor (By courtesy), Biology
-
Member, Bio-X
-
Faculty Fellow, Sarafan ChEM-H
-
Member, Stanford Cancer Institute
-
Member, Wu Tsai Neurosciences Institute
Honors & Awards
-
Arc Ignite Investigator, Arc Institute (2023)
-
NSF CAREER AWARD, National Science Foundation (2016-2021)
-
McCormick-Gabilan Fellowship, Stanford University (2015)
-
Alfred P. Sloan Research Fellow, Alfred P. Sloan Foundation (2014)
-
Baxter Faculty Scholar Award, Baxter Foundation (2014)
-
Pathway to Independence (K99/R00) Award, National Institutes of Health (2012-present)
-
Research Grant, National Science Foundation (2009-2012)
-
Magna Cum Laude, Princeton University (2002)
Professional Education
-
Ph. D., Stanford University, Statistics (2007)
-
A. B., Princeton University, Mathematics (2002)
Current Research and Scholarly Interests
Statistical computational biology focusing on splicing, cancer, and microbes
2024-25 Courses
-
Independent Studies (16)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Directed Investigation
BIOE 392 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIODS 299 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Directed Reading in Biochemistry
BIOC 299 (Aut, Win, Spr, Sum)
- Directed Reading in Biophysics
BIOPHYS 399 (Aut, Win, Spr, Sum)
- Directed Study
BIOE 391 (Aut, Win, Spr, Sum)
- Graduate Research
BIOPHYS 300 (Aut, Win, Spr, Sum)
- Graduate Research and Special Advanced Work
BIOC 399 (Aut, Win, Spr, Sum)
- Master's Research
CME 291 (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOC 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Ph.D. Research
CME 400 (Aut, Sum)
- Ph.D. Research Rotation
CME 391 (Aut, Win, Spr, Sum)
- The Teaching of Biochemistry
BIOC 221 (Aut, Win, Spr, Sum)
- Undergraduate Research
BIOC 199 (Aut, Win, Spr, Sum)
Prior Year Courses
2023-24 Courses
- Statistical Genomics for Planetary Health: Oceans, Plants, Microbes and Humans
BIODS 228 (Aut)
- Workshop in Biostatistics
BIODS 260C, STATS 260C (Spr)
2022-23 Courses
- Workshop in Biostatistics
BIODS 260A, STATS 260A (Aut)
2021-22 Courses
- Introduction to Analysis of RNA Sequence Data
BIOC 239, BIODS 239 (Aut)
- Workshop in Biostatistics
BIODS 260A, STATS 260A (Aut)
Stanford Advisees
-
Doctoral Dissertation Reader (AC)
Isabel Delwel, Yannick Lee-Yow
-
Doctoral (Program)
Holly McCann
Graduate and Fellowship Programs
-
Biomedical Data Science (PhD Program)
All Publications
-
Cell types of origin of the cell-free transcriptome.
Nature biotechnology
2022
Abstract
Cell-free RNA from liquid biopsies can be analyzed to determine disease tissue of origin. We extend this concept to identify cell types of origin using the Tabula Sapiens transcriptomic cell atlas as well as individual tissue transcriptomic cell atlases in combination with the Human Protein Atlas RNA consensus dataset. We define cell type signature scores, which allow the inference of cell types that contribute to cell-free RNA for a variety of diseases.
View details for DOI 10.1038/s41587-021-01188-9
View details for PubMedID 35132263
-
RNA splicing programs define tissue compartments and cell types at single cell resolution.
eLife
2021; 10
Abstract
The extent splicing is regulated at single-cell resolution has remained controversial due to both available data and methods to interpret it. We apply the SpliZ, a new statistical approach, to detect cell-type-specific splicing in >110K cells from 12 human tissues. Using 10x data for discovery, 9.1% of genes with computable SpliZ scores are cell-type-specifically spliced, including ubiquitously expressed genes MYL6 and RPS24. These results are validated with RNA FISH, single-cell PCR, and Smart-seq2. SpliZ analysis reveals 170 genes with regulated splicing during human spermatogenesis, including examples conserved in mouse and mouse lemur. The SpliZ allows model-based identification of subpopulations indistinguishable based on gene expression, illustrated by subpopulation-specific splicing of classical monocytes involving an ultraconserved exon in SAT1. Together, this analysis of differential splicing across multiple organs establishes that splicing is regulated cell-type-specifically.
View details for DOI 10.7554/eLife.70692
View details for PubMedID 34515025
-
Specific splice junction detection in single cells with SICILIAN.
Genome biology
2021; 22 (1): 219
Abstract
Precise splice junction calls are currently unavailable in scRNA-seq pipelines such as the 10x Chromium platform but are critical for understanding single-cell biology. Here, we introduce SICILIAN, a new method that assigns statistical confidence to splice junctions from a spliced aligner to improve precision. SICILIAN is a general method that can be applied to bulk or single-cell data, but has particular utility for single-cell analysis due to that data's unique challenges and opportunities for discovery. SICILIAN's precise splice detection achieves high accuracy on simulated data, improves concordance between matched single-cell and bulk datasets, and increases agreement between biological replicates. SICILIAN detects unannotated splicing in single cells, enabling the discovery of novel splicing regulation through single-cell analysis workflows.
View details for DOI 10.1186/s13059-021-02434-8
View details for PubMedID 34353340
-
Circular RNA Expression: Its Potential Regulation and Function in Abdominal Aortic Aneurysms
OXIDATIVE MEDICINE AND CELLULAR LONGEVITY
2021; 2021: 9934951
Abstract
Abdominal aortic aneurysms (AAAs) have posed a great threat to human life, and the necessity of its monitoring and treatment is decided by symptomatology and/or the aneurysm size. Accumulating evidence suggests that circular RNAs (circRNAs) contribute a part to the pathogenesis of AAAs. circRNAs are novel single-stranded RNAs with a closed loop structure and high stability, having become the candidate biomarkers for numerous kinds of human disorders. Besides, circRNAs act as molecular "sponge" in organisms, capable of regulating the transcription level. Here, we characterize that the molecular mechanisms underlying the role of circRNAs in AAA development were further elucidated. In the present work, studies on the biosynthesis, bibliometrics, and mechanisms of action of circRNAs were aims comprehensively reviewed, the role of circRNAs in the AAA pathogenic mechanism was illustrated, and their potential in diagnosing AAAs was examined. Moreover, the current evidence about the effects of circRNAs on AAA development through modulating endothelial cells (ECs), macrophages, and vascular smooth muscle cells (VSMCs) was summarized. Through thorough investigation, the molecular mechanisms underlying the role of circRNAs in AAA development were further elucidated. The results demonstrated that circRNAs had the application potential in the diagnosis and prevention of AAAs in clinical practice. The study of circRNA regulatory pathways would be of great assistance to the etiologic research of AAAs.
View details for DOI 10.1155/2021/9934951
View details for Web of Science ID 000674620900002
View details for PubMedID 34306317
View details for PubMedCentralID PMC8263248
-
High-throughput SARS-CoV-2 and host genome sequencing from single nasopharyngeal swabs.
medRxiv : the preprint server for health sciences
2020
Abstract
During COVID19 and other viral pandemics, rapid generation of host and pathogen genomic data is critical to tracking infection and informing therapies. There is an urgent need for efficient approaches to this data generation at scale. We have developed a scalable, high throughput approach to generate high fidelity low pass whole genome and HLA sequencing, viral genomes, and representation of human transcriptome from single nasopharyngeal swabs of COVID19 patients.
View details for DOI 10.1101/2020.07.27.20163147
View details for PubMedID 32766602
View details for PubMedCentralID PMC7402057
-
Molecular sampling at logarithmic rates for next-generation sequencing.
PLoS computational biology
2019; 15 (12): e1007537
Abstract
Next-generation sequencing is a cutting edge technology, but to quantify a dynamic range of abundances for different RNA or DNA species requires increasing sampling depth to levels that can be prohibitively expensive due to physical limits on molecular throughput of sequencers. To overcome this problem, we introduce a new general sampling theory which uses biophysical principles to functionally encode the abundance of a species before sampling, SeQUential depletIon and enriCHment (SQUICH). In theory and simulation, SQUICH enables sampling at a logarithmic rate to achieve the same precision as attained with conventional sequencing. A simple proof of principle experimental implementation of SQUICH in a controlled complex system of ~262,000 oligonucleotides already reduces sequencing depth by a factor of 10. SQUICH lays the groundwork for a general solution to a fundamental problem in molecular sampling and enables a new generation of efficient, precise molecular measurement at logarithmic or better sampling depth.
View details for DOI 10.1371/journal.pcbi.1007537
View details for PubMedID 31830035
View details for PubMedCentralID PMC6932819
-
Improved detection of gene fusions by applying statistical methods reveals oncogenic RNA cancer drivers.
Proceedings of the National Academy of Sciences of the United States of America
2019
Abstract
The extent to which gene fusions function as drivers of cancer remains a critical open question. Current algorithms do not sufficiently identify false-positive fusions arising during library preparation, sequencing, and alignment. Here, we introduce Data-Enriched Efficient PrEcise STatistical fusion detection (DEEPEST), an algorithm that uses statistical modeling to minimize false-positives while increasing the sensitivity of fusion detection. In 9,946 tumor RNA-sequencing datasets from The Cancer Genome Atlas (TCGA) across 33 tumor types, DEEPEST identifies 31,007 fusions, 30% more than identified by other methods, while calling 10-fold fewer false-positive fusions in nontransformed human tissues. We leverage the increased precision of DEEPEST to discover fundamental cancer biology. Namely, 888 candidate oncogenes are identified based on overrepresentation in DEEPEST calls, and 1,078 previously unreported fusions involving long intergenic noncoding RNAs, demonstrating a previously unappreciated prevalence and potential for function. DEEPEST also reveals a high enrichment for fusions involving oncogenes in cancers, including ovarian cancer, which has had minimal treatment advances in recent decades, finding that more than 50% of tumors harbor gene fusions predicted to be oncogenic. Specific protein domains are enriched in DEEPEST calls, indicating a global selection for fusion functionality: kinase domains are nearly 2-fold more enriched in DEEPEST calls than expected by chance, as are domains involved in (anaerobic) metabolism and DNA binding. The statistical algorithms, population-level analytic framework, and the biological conclusions of DEEPEST call for increased attention to gene fusions as drivers of cancer and for future research into using fusions for targeted therapy.
View details for DOI 10.1073/pnas.1900391116
View details for PubMedID 31308241
-
Hyperammonemia after capecitabine associated with occult impairment of the urea cycle.
Cancer medicine
2019
Abstract
BACKGROUND: Cancer patients receiving chemotherapy often complain of "chemobrain" or cognitive impairment, but mechanisms remain elusive. METHODS: A patient with gastric cancer developed delirium and hyperammonemia after chemotherapy with the 5-fluorouracil pro-drug capecitabine. Exome sequencing facilitated a search for mutations among 43 genes associated with hyperammonemia and affecting the urea cycle directly or indirectly. RESULTS: The patient's urea cycle was impaired by capecitabine-induced liver steatosis, and portosystemic shunting of gut ammonia into the systemic circulation. The patient was also heterozygous for amino acid substitution mutations previously reported to create dysfunctional proteins in 2 genes, ORNT2 (ornithine transporter-2 for the urea cycle), and ETFA (electron transport flavoprotein alpha for fatty acid oxidation). The mutations explained the patient's abnormal plasma amino acid profile and exaggerated response to allopurinol challenge. Global population variations among the 43 hyperammonemia genes were assessed for inactivating mutations, and for amino acid substitutions predicted to be deleterious by complementary algorithms, SIFT and PolyPhen-2. One or 2 deleterious mutations occur among the 43 genes in 13.9% and 1% of individuals, respectively. CONCLUSIONS: Capecitabine and 5-fluorouracil inhibit pyrimidine biosynthesis, decreasing ammonia utilization. These drugs can induce hyperammonemia in susceptible individuals. The risk factors of hyperammonemia, gene mutations and liver dysfunction, are not rare. Diagnosis will trigger appropriate treatment and ameliorate brain toxicity.
View details for PubMedID 30977266
-
Discovery of gene regulatory elements through a new bioinformatics analysis of haploid genetic screens.
PloS one
2019; 14 (1): e0198463
Abstract
The systematic identification of regulatory elements that control gene expression remains a challenge. Genetic screens that use untargeted mutagenesis have the potential to identify protein-coding genes, non-coding RNAs and regulatory elements, but their analysis has mainly focused on identifying the former two. To identify regulatory elements, we conducted a new bioinformatics analysis of insertional mutagenesis screens interrogating WNT signaling in haploid human cells. We searched for specific patterns of retroviral gene trap integrations (used as mutagens in haploid screens) in short genomic intervals overlapping with introns and regions upstream of genes. We uncovered atypical patterns of gene trap insertions that were not predicted to disrupt coding sequences, but caused changes in the expression of two key regulators of WNT signaling, suggesting the presence of cis-regulatory elements. Our methodology extends the scope of haploid genetic screens by enabling the identification of regulatory elements that control gene expression.
View details for PubMedID 30695034
-
Ambiguous splice sites distinguish circRNA and linear splicing in the human genome.
Bioinformatics (Oxford, England)
2018
Abstract
Motivation: Identification of splice sites is critical to gene annotation and to determine which sequences control circRNA biogenesis. Full-length RNA transcripts could in principle complete annotations of introns and exons in genomes without external ontologies, i.e., ab initio. However, whether it is possible to reconstruct genomic positions where splicing occurs from full-length transcripts, even if sampled in the absence of noise, depends on the genome sequence composition. If it is not, there exist provable limits on the use of RNA-Seq to define splice locations (linear or circular) in the genome. Results: We provide a formal definition of splice site ambiguity due to the genomic sequence by introducing a definition of equivalent junction, which is the set of local genomic positions resulting in the same RNA sequence when joined through RNA splicing. We show that equivalent junctions are prevalent in diverse eukaryotic genomes and occur in 88.64% and 78.64% of annotated human splice sites in linear and circRNA junctions, respectively. The observed fractions of equivalent junctions and the frequency of many individual motifs are statistically significant when compared against the null distribution computed via simulation or closed-form. The frequency of equivalent junctions establishes a fundamental limit on the possibility of ab initio reconstruction of RNA transcripts without appealing to the ontology of "GT-AG" boundaries defining introns. Said differently, completely ab initio is impossible in the vast majority of splice sites in annotated circRNAs and linear transcripts. Availability: Two python scripts generating an equivalent junction sequence per junction are available at: https://github.com/salzmanlab/Equivalent-Junctions. Supplementary information: Supplementary data are available at Bioinformatics online.
View details for PubMedID 30192918
-
ciRS-7 exonic sequence is embedded in a long non-coding RNA locus
PLOS GENETICS
2017; 13 (12): e1007114
Abstract
ciRS-7 is an intensely studied, highly expressed and conserved circRNA. Essentially nothing is known about its biogenesis, including the location of its promoter. A prevailing assumption has been that ciRS-7 is an exceptional circRNA because it is transcribed from a locus lacking any mature linear RNA transcripts of the same sense. To study the biogenesis of ciRS-7, we developed an algorithm to define its promoter and predicted that the human ciRS-7 promoter coincides with that of the long non-coding RNA, LINC00632. We validated this prediction using multiple orthogonal experimental assays. We also used computational approaches and experimental validation to establish that ciRS-7 exonic sequence is embedded in linear transcripts that are flanked by cryptic exons in both human and mouse. Together, this experimental and computational evidence generates a new model for regulation of this locus: (a) ciRS-7 is like other circRNAs, as it is spliced into linear transcripts; (b) expression of ciRS-7 is primarily determined by the chromatin state of LINC00632 promoters; (c) transcription and splicing factors sufficient for ciRS-7 biogenesis are expressed in cells that lack detectable ciRS-7 expression. These findings have significant implications for the study of the regulation and function of ciRS-7, and the analytic framework we developed to jointly analyze RNA-seq and ChIP-seq data reveal the potential for genome-wide discovery of important biological regulation missed in current reference annotations.
View details for PubMedID 29236709
-
Statistical algorithms improve accuracy of gene fusion detection.
Nucleic acids research
2017
Abstract
Gene fusions are known to play critical roles in tumor pathogenesis. Yet, sensitive and specific algorithms to detect gene fusions in cancer do not currently exist. In this paper, we present a new statistical algorithm, MACHETE (Mismatched Alignment CHimEra Tracking Engine), which achieves highly sensitive and specific detection of gene fusions from RNA-Seq data, including the highest Positive Predictive Value (PPV) compared to the current state-of-the-art, as assessed in simulated data. We show that the best performing published algorithms either find large numbers of fusions in negative control data or suffer from low sensitivity detecting known driving fusions in gold standard settings, such as EWSR1-FLI1. As proof of principle that MACHETE discovers novel gene fusions with high accuracy in vivo, we mined public data to discover and subsequently PCR validate novel gene fusions missed by other algorithms in the ovarian cancer cell line OVCAR3. These results highlight the gains in accuracy achieved by introducing statistical models into fusion detection, and pave the way for unbiased discovery of potentially driving and druggable gene fusions in primary tumors.
View details for DOI 10.1093/nar/gkx453
View details for PubMedID 28541529
-
Circular RNA biogenesis can proceed through an exon-containing lariat precursor.
eLife
2015; 4
Abstract
Pervasive expression of circular RNA is a recently discovered feature of eukaryotic gene expression programs, yet its function remains largely unknown. The presumed biogenesis of these RNAs involves a non-canonical 'backsplicing' event. Recent studies in mammalian cell culture posit that backsplicing is facilitated by inverted repeats flanking the circularized exon(s). Although such sequence elements are common in mammals, they are rare in lower eukaryotes, making current models insufficient to describe circularization. Through systematic splice site mutagenesis and the identification of splicing intermediates, we show that circular RNA in Schizosaccharomyces pombe is generated through an exon-containing lariat precursor. Furthermore, we have performed high-throughput and comprehensive mutagenesis of a circle-forming exon, which enabled us to discover a systematic effect of exon length on RNA circularization. Our results uncover a mechanism for circular RNA biogenesis that may account for circularization in genes that lack noticeable flanking intronic secondary structure.
View details for DOI 10.7554/eLife.07540
View details for PubMedID 26057830
View details for PubMedCentralID PMC4479058
-
Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development
Genome biology
2015
View details for DOI 10.1186/s13059-015-0690-5
-
Extensive site-directed mutagenesis reveals interconnected functional units in the alkaline phosphatase active site.
eLife
2015; 4
Abstract
Enzymes enable life by accelerating reaction rates to biological timescales. Conventional studies have focused on identifying the residues that have a direct involvement in an enzymatic reaction, but these so-called 'catalytic residues' are embedded in extensive interaction networks. Although fundamental to our understanding of enzyme function, evolution, and engineering, the properties of these networks have yet to be quantitatively and systematically explored. We dissected an interaction network of five residues in the active site of Escherichia coli alkaline phosphatase. Analysis of the complex catalytic interdependence of specific residues identified three energetically independent but structurally interconnected functional units with distinct modes of cooperativity. From an evolutionary perspective, this network is orders of magnitude more probable to arise than a fully cooperative network. From a functional perspective, new catalytic insights emerge. Further, such comprehensive energetic characterization will be necessary to benchmark the algorithms required to rationally engineer highly efficient enzymes.
View details for DOI 10.7554/eLife.06181
View details for PubMedID 25902402
View details for PubMedCentralID PMC4438272
-
Circular RNA Is Expressed across the Eukaryotic Tree of Life.
PloS one
2014; 9 (3)
Abstract
An unexpectedly large fraction of genes in metazoans (human, mouse, zebrafish, worm, fruit fly) express high levels of circularized RNAs containing canonical exons. Here we report that circular RNA isoforms are found in diverse species whose most recent common ancestor existed more than one billion years ago: fungi (Schizosaccharomyces pombe and Saccharomyces cerevisiae), a plant (Arabidopsis thaliana), and protists (Plasmodium falciparum and Dictyostelium discoideum). For all species studied to date, including those in this report, only a small fraction of the theoretically possible circular RNA isoforms from a given gene are actually observed. Unlike metazoans, Arabidopsis, D. discoideum, P. falciparum, S. cerevisiae, and S. pombe have very short introns (∼ 100 nucleotides or shorter), yet they still produce circular RNAs. A minority of genes in S. pombe and P. falciparum have documented examples of canonical alternative splicing, making it unlikely that all circular RNAs are by-products of alternative splicing or 'piggyback' on signals used in alternative RNA processing. In S. pombe, the relative abundance of circular to linear transcript isoforms changed in a gene-specific pattern during nitrogen starvation. Circular RNA may be an ancient, conserved feature of eukaryotic gene expression programs.
View details for DOI 10.1371/journal.pone.0090859
View details for PubMedID 24609083
View details for PubMedCentralID PMC3946582
-
Best permutation analysis
JOURNAL OF MULTIVARIATE ANALYSIS
2013; 121: 193-223
View details for DOI 10.1016/j.jmva.2013.03.001
View details for Web of Science ID 000323584100013
-
Cell-type specific features of circular RNA expression.
PLoS genetics
2013; 9 (9)
Abstract
Thousands of loci in the human and mouse genomes give rise to circular RNA transcripts; at many of these loci, the predominant RNA isoform is a circle. Using an improved computational approach for circular RNA identification, we found widespread circular RNA expression in Drosophila melanogaster and estimate that in humans, circular RNA may account for 1% as many molecules as poly(A) RNA. Analysis of data from the ENCODE consortium revealed that the repertoire of genes expressing circular RNA, the ratio of circular to linear transcripts for each gene, and even the pattern of splice isoforms of circular RNAs from each gene were cell-type specific. These results suggest that biogenesis of circular RNA is an integral, conserved, and regulated feature of the gene expression program.
View details for DOI 10.1371/journal.pgen.1003777
View details for PubMedID 24039610
View details for PubMedCentralID PMC3764148
-
Association between living environment and human oral viral ecology
ISME JOURNAL
2013; 7 (9): 1710-1724
Abstract
The human oral cavity has an indigenous microbiota known to include a robust community of viruses. Very little is known about how oral viruses are spread throughout the environment or to which viruses individuals are exposed. We sought to determine whether shared living environment is associated with the composition of human oral viral communities by examining the saliva of 21 human subjects; 11 subjects from different households and 10 unrelated subjects comprising 4 separate households. Although there were many viral homologues shared among all subjects studied, there were significant patterns of shared homologues in three of the four households that suggest shared living environment affects viral community composition. We also examined CRISPR (clustered regularly interspaced short palindromic repeat) loci, which are involved in acquired bacterial and archaeal resistance against invading viruses by acquiring short viral sequences. We analyzed 2 065 246 CRISPR spacers from 5 separate repeat motifs found in oral bacterial species of Gemella, Veillonella, Leptotrichia and Streptococcus to determine whether individuals from shared living environments may have been exposed to similar viruses. A significant proportion of CRISPR spacers were shared within subjects from the same households, suggesting either shared ancestry of their oral microbiota or similar viral exposures. Many CRISPR spacers matched virome sequences from different subjects, but no pattern specific to any household was found. Our data on viromes and CRISPR content indicate that shared living environment may have a significant role in determining the ecology of human oral viruses.
View details for DOI 10.1038/ismej.2013.63
View details for Web of Science ID 000323385600004
View details for PubMedID 23598790
-
Improved discovery of molecular interactions in genome-scale data with adaptive model-based normalization.
PloS one
2013; 8 (1)
Abstract
High throughput molecular-interaction studies using immunoprecipitations (IP) or affinity purifications are powerful and widely used in biology research. One of many important applications of this method is to identify the set of RNAs that interact with a particular RNA-binding protein (RBP). Here, the unique statistical challenge presented is to delineate a specific set of RNAs that are enriched in one sample relative to another, typically a specific IP compared to a non-specific control to model background. The choice of normalization procedure critically impacts the number of RNAs that will be identified as interacting with an RBP at a given significance threshold - yet existing normalization methods make assumptions that are often fundamentally inaccurate when applied to IP enrichment data. In this paper, we present a new normalization methodology that is specifically designed for identifying enriched RNA or DNA sequences in an IP. The normalization (called adaptive or AD normalization) uses a basic model of the IP experiment and is not a variant of mean, quantile, or other methodology previously proposed. The approach is evaluated statistically and tested with simulated and empirical data. The adaptive (AD) normalization method results in a greatly increased range in the number of enriched RNAs identified, fewer false positives, and overall better concordance with independent biological evidence, for the RBPs we analyzed, compared to median normalization. The approach is also applicable to the study of pairwise RNA, DNA and protein interactions such as the analysis of transcription factors via chromatin immunoprecipitation (ChIP) or any other experiments where samples from two conditions, one of which contains an enriched subset of the other, are studied.
View details for DOI 10.1371/journal.pone.0053930
View details for PubMedID 23349766
View details for PubMedCentralID PMC3551948
-
A penalized likelihood approach for robust estimation of isoform expression
eprint arXiv:1310.0379
2013
-
Statistical properties of an early stopping rule for resampling-based multiple testing
BIOMETRIKA
2012; 99 (4): 973-980
View details for DOI 10.1093/biomet/ass051
View details for Web of Science ID 000311303800015
-
Extensive Gene-Specific Translational Reprogramming in a Model of B Cell Differentiation and Abl-Dependent Transformation
PLOS ONE
2012; 7 (5)
Abstract
To what extent might the regulation of translation contribute to differentiation programs, or to the molecular pathogenesis of cancer? Pre-B cells transformed with the viral oncogene v-Abl are suspended in an immortalized, cycling state that mimics leukemias with a BCR-ABL1 translocation, such as Chronic Myelogenous Leukemia (CML) and Acute Lymphoblastic Leukemia (ALL). Inhibition of the oncogenic Abl kinase with imatinib reverses transformation, allowing progression to the next stage of B cell development. We employed a genome-wide polysome profiling assay called Gradient Encoding to investigate the extent and potential contribution of translational regulation to transformation and differentiation in v-Abl-transformed pre-B cells. Over half of the significantly translationally regulated genes did not change significantly at the level of mRNA abundance, revealing biology that might have been missed by measuring changes in transcript abundance alone. We found extensive, gene-specific changes in translation affecting genes with known roles in B cell signaling and differentiation, cancerous transformation, and cytoskeletal reorganization potentially affecting adhesion. These results highlight a major role for gene-specific translational regulation in remodeling the gene expression program in differentiation and malignant transformation.
View details for DOI 10.1371/journal.pone.0037108
View details for Web of Science ID 000305338500030
View details for PubMedID 22693568
-
Evidence of a robust resident bacteriophage population revealed through analysis of the human salivary virome
ISME JOURNAL
2012; 6 (5): 915-926
Abstract
Viruses are the most abundant known infectious agents on the planet and are significant drivers of diversity in a variety of ecosystems. Although there have been numerous studies of viral communities, few have focused on viruses within the indigenous human microbiota. We analyzed 2 267 695 virome reads from viral particles and compared them with 263 516 bacterial 16S rRNA gene sequences from the saliva of five healthy human subjects over a 2- to 3-month period, in order to improve our understanding of the role viruses have in the complex oral ecosystem. Our data reveal viral communities in human saliva dominated by bacteriophages whose constituents are temporally distinct. The preponderance of shared homologs between the salivary viral communities in two unrelated subjects in the same household suggests that environmental factors are determinants of community membership. When comparing salivary viromes to those from human stool and the respiratory tract, each group was distinct, further indicating that habitat is of substantial importance in shaping human viromes. Compared with coexisting bacteria, there was concordance among certain predicted host-virus pairings such as Veillonella and Streptococcus, whereas there was discordance among others such as Actinomyces. We identified 122 728 virulence factor homologs, suggesting that salivary viruses may serve as reservoirs for pathogenic gene function in the oral environment. That the vast majority of human oral viruses are bacteriophages whose putative gene function signifies some have a prominent role in lysogeny, suggests these viruses may have an important role in helping shape the microbial diversity in the human oral cavity.
View details for DOI 10.1038/ismej.2011.169
View details for Web of Science ID 000302950700002
View details for PubMedID 22158393
View details for PubMedCentralID PMC3329113
-
Widespread mRNA Association with Cytoskeletal Motor Proteins and Identification and Dynamics of Myosin-Associated mRNAs in S. cerevisiae
PLOS ONE
2012; 7 (2)
Abstract
Programmed mRNA localization to specific subcellular compartments for localized translation is a fundamental mechanism of post-transcriptional regulation that affects many, and possibly all, mRNAs in eukaryotes. We describe here a systematic approach to identify the RNA cargoes associated with the cytoskeletal motor proteins of Saccharomyces cerevisiae in combination with live-cell 3D super-localization microscopy of endogenously tagged mRNAs. Our analysis identified widespread association of mRNAs with cytoskeletal motor proteins, including association of Myo3 with mRNAs encoding key regulators of actin branching and endocytosis such as WASP and WIP. Using conventional fluorescence microscopy and expression of MS2-tagged mRNAs from endogenous loci, we observed a strong bias for actin patch nucleator mRNAs to localize to the cell cortex and the actin patch in a Myo3- and F-actin dependent manner. Use of a double-helix point spread function (DH-PSF) microscope allowed super-localization measurements of single mRNPs at a spatial precision of 25 nm in x and y and 50 nm in z in live cells with 50 ms exposure times, allowing quantitative profiling of mRNP dynamics. The actin patch mRNA exhibited distinct and characteristic diffusion coefficients when compared to a control mRNA. In addition, disruption of F-actin significantly expanded the 3D confinement radius of an actin patch nucleator mRNA, providing a quantitative assessment of the contribution of the actin cytoskeleton to mRNP dynamic localization. Our results provide evidence for specific association of mRNAs with cytoskeletal motor proteins in yeast, suggest that different mRNPs have distinct and characteristic dynamics, and lend insight into the mechanism of actin patch nucleator mRNA localization to actin patches.
View details for DOI 10.1371/journal.pone.0031912
View details for Web of Science ID 000302796200110
View details for PubMedID 22359641
View details for PubMedCentralID PMC3281097
-
Circular RNAs Are the Predominant Transcript Isoform from Hundreds of Human Genes in Diverse Cell Types
PLOS ONE
2012; 7 (2)
Abstract
Most human pre-mRNAs are spliced into linear molecules that retain the exon order defined by the genomic sequence. By deep sequencing of RNA from a variety of normal and malignant human cells, we found RNA transcripts from many human genes in which the exons were arranged in a non-canonical order. Statistical estimates and biochemical assays provided strong evidence that a substantial fraction of the spliced transcripts from hundreds of genes are circular RNAs. Our results suggest that a non-canonical mode of RNA splicing, resulting in a circular RNA isoform, is a general feature of the gene expression program in human cells.
View details for DOI 10.1371/journal.pone.0030733
View details for Web of Science ID 000301977500016
View details for PubMedID 22319583
View details for PubMedCentralID PMC3270023
-
Comparisons of CRISPRs and viromes in human saliva reveal bacterial adaptations to salivary viruses
Environmental Microbiology
2012
View details for DOI 10.1111/j.1462-2920.2012.02775.x
-
ESRRA-C11orf20 Is a Recurrent Gene Fusion in Serous Ovarian Carcinoma
PLOS BIOLOGY
2011; 9 (9)
Abstract
Every year, ovarian cancer kills approximately 14,000 women in the United States and more than 140,000 women worldwide. Most of these deaths are caused by tumors of the serous histological type, which is rarely diagnosed before it has disseminated. By deep paired-end sequencing of mRNA from serous ovarian cancers, followed by deep sequencing of the corresponding genomic region, we identified a recurrent fusion transcript. The fusion transcript joins the 5' exons of ESRRA, encoding a ligand-independent member of the nuclear-hormone receptor superfamily, to the 3' exons of C11orf20, a conserved but uncharacterized gene located immediately upstream of ESRRA in the reference genome. To estimate the prevalence of the fusion, we tested 67 cases of serous ovarian cancer by RT-PCR and sequencing and confirmed its presence in 10 of these. Targeted resequencing of the corresponding genomic region from two fusion-positive tumor samples identified a nearly clonal chromosomal rearrangement positioning ESRRA upstream of C11orf20 in one tumor, and evidence of local copy number variation in the ESRRA locus in the second tumor. We hypothesize that the recurrent novel fusion transcript may play a role in pathogenesis of a substantial fraction of serous ovarian cancers and could provide a molecular marker for detection of the cancer. Gene fusions involving adjacent or nearby genes can readily escape detection but may play important roles in the development and progression of cancer.
View details for DOI 10.1371/journal.pbio.1001156
View details for Web of Science ID 000295372800012
View details for PubMedID 21949640
View details for PubMedCentralID PMC3176749
-
Statistical Modeling of RNA-Seq Data
STATISTICAL SCIENCE
2011; 26 (1): 62-83
View details for DOI 10.1214/10-STS343
View details for Web of Science ID 000292424900013
-
Analysis of streptococcal CRISPRs from human saliva reveals substantial sequence diversity within and between subjects over time
GENOME RESEARCH
2011; 21 (1): 126-136
Abstract
Viruses may play an important role in the evolution of human microbial communities. Clustered regularly interspaced short palindromic repeats (CRISPRs) provide bacteria and archaea with adaptive immunity to previously encountered viruses. Little is known about CRISPR composition in members of human microbial communities, the relative rate of CRISPR locus change, or how CRISPR loci differ between the microbiota of different individuals. We collected saliva from four periodontally healthy human subjects over an 11- to 17-mo time period and analyzed CRISPR sequences with corresponding streptococcal repeats in order to improve our understanding of the predominant features of oral streptococcal adaptive immune repertoires. We analyzed a total of 6859 CRISPR bearing reads and 427,917 bacterial 16S rRNA gene sequences. We found a core (ranging from 7% to 22%) of shared CRISPR spacers that remained stable over time within each subject, but nearly a third of CRISPR spacers varied between time points. We document high spacer diversity within each subject, suggesting constant addition of new CRISPR spacers. No greater than 2% of CRISPR spacers were shared between subjects, suggesting that each individual was exposed to different virus populations. We detect changes in CRISPR spacer sequence diversity over time that may be attributable to locus diversification or to changes in streptococcal population structure, yet the composition of the populations within subjects remained relatively stable. The individual-specific and traceable character of CRISPR spacer complements could potentially open the way for expansion of the domain of personalized medicine to the oral microbiome, where lineages may be tracked as a function of health and other factors.
View details for DOI 10.1101/gr.111732.110
View details for Web of Science ID 000285868300013
View details for PubMedID 21149389
View details for PubMedCentralID PMC3012920
-
Proteome-Wide Search Reveals Unexpected RNA-Binding Proteins in Saccharomyces cerevisiae
PLOS ONE
2010; 5 (9)
Abstract
The vast landscape of RNA-protein interactions at the heart of post-transcriptional regulation remains largely unexplored. Indeed it is likely that, even in yeast, a substantial fraction of the regulatory RNA-binding proteins (RBPs) remain to be discovered. Systematic experimental methods can play a key role in discovering these RBPs--most of the known yeast RBPs lack RNA-binding domains that might enable this activity to be predicted. We describe here a proteome-wide approach to identify RNA-protein interactions based on in vitro binding of RNA samples to yeast protein microarrays that represent over 80% of the yeast proteome. We used this procedure to screen for novel RBPs and RNA-protein interactions. A complementary mass spectrometry technique also identified proteins that associate with yeast mRNAs. Both the protein microarray and mass spectrometry methods successfully identify previously annotated RBPs, suggesting that other proteins identified in these assays might be novel RBPs. Of 35 putative novel RBPs identified by either or both of these methods, 12, including 75% of the eight most highly-ranked candidates, reproducibly associated with specific cellular RNAs. Surprisingly, most of the 12 newly discovered RBPs were enzymes. Functional characteristics of the RNA targets of some of the novel RBPs suggest coordinated post-transcriptional regulation of subunits of protein complexes and a possible link between mRNA trafficking and vesicle transport. Our results suggest that many more RBPs still remain to be identified and provide a set of candidates for further investigation.
View details for DOI 10.1371/journal.pone.0012671
View details for Web of Science ID 000281687300015
View details for PubMedID 20844764
View details for PubMedCentralID PMC2937035
-
Reliable concurrent calling of multiple genetic alleles and 24-chromosome ploidy without embryo freezing using parental support™ technology (PS)
Fertility and Sterility
2008
View details for DOI 10.1016/j.fertnstert.2008.07.440
-
Limits on the ability of quantum states to convey classical messages
JOURNAL OF THE ACM
2006; 53 (1): 184-206
View details for Web of Science ID 000236521600005
-
An improved upper bound for the pebbling threshold of the n-path
Discrete Mathematics
2004
View details for DOI 10.1016/j.disc.2002.10.001