Arend Sidow
Professor of Pathology and of Genetics
Bio
Please refer to my NIH biosketch:
http://mendel.stanford.edu/sidowlab/SidowCurrentBiosketch.pdf
Academic Appointments
-
Professor, Pathology
-
Professor, Genetics
-
Member, Bio-X
-
Member, Stanford Cancer Institute
Current Research and Scholarly Interests
We have a highly collaborative research program in the evolutionary genomics of cancer. We apply well-established principles of phylogenetics to cancer evolution on the basis of whole genome sequencing and functional genomics data of multiple tumor samples from the same patient. Introductions to our work and the concepts we apply are best found in the Newburger et al. paper in Genome Research (2013) and the Sidow and Spies review in TIGS (2015).
More information can be found here: http://www.sidowlab.org
2024-25 Courses
-
Independent Studies (15)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Directed Reading in Cancer Biology
CBIO 299 (Aut, Win, Spr, Sum)
- Directed Reading in Genetics
GENE 299 (Aut, Win, Spr, Sum)
- Directed Reading in Pathology
PATH 299 (Aut, Win, Spr, Sum)
- Early Clinical Experience in Pathology
PATH 280 (Aut, Win, Spr, Sum)
- Graduate Research
CBIO 399 (Aut, Win, Spr, Sum)
- Graduate Research
GENE 399 (Aut, Win, Spr, Sum)
- Graduate Research
PATH 399 (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
GENE 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
PATH 370 (Aut, Win, Spr, Sum)
- Supervised Study
GENE 260 (Aut, Win, Spr, Sum)
- Undergraduate Research
GENE 199 (Aut, Win, Spr, Sum)
- Undergraduate Research
PATH 199 (Aut, Win, Spr, Sum)
-
Prior Year Courses
2023-24 Courses
- Genetics and Developmental Biology Training Camp
DBIO 200, GENE 200 (Aut)
Graduate and Fellowship Programs
-
Biomedical Data Science (PhD Program)
All Publications
-
SparseSignatures: An R package using LASSO-regularized non-negative matrix factorization to identify mutational signatures from human tumor samples.
STAR protocols
2022; 3 (3): 101513
Abstract
We outline the features of the R package SparseSignatures and its application to determine the signatures contributing to mutation profiles of tumor samples. We describe installation details and illustrate a step-by-step approach to (1) prepare the data for signature analysis, (2) determine the optimal parameters, and (3) employ them to determine the signatures and related exposure levels in the point mutation dataset. For complete details on the use and execution of this protocol, please refer to Lal et al. (2021).
View details for DOI 10.1016/j.xpro.2022.101513
View details for PubMedID 35779264
-
Benchmarking challenging small variants with linked and long reads.
Cell genomics
2022; 2 (5)
Abstract
Genome in a Bottle benchmarks are widely used to help validate clinical sequencing pipelines and develop variant calling and sequencing methods. Here we use accurate linked and long reads to expand benchmarks in 7 samples to include difficult-to-map regions and segmental duplications that are challenging for short reads. These benchmarks add more than 300,000 SNVs and 50,000 insertions or deletions (indels) and include 16% more exonic variants, many in challenging, clinically relevant genes not covered previously, such as PMS2. For HG002, we include 92% of the autosomal GRCh38 assembly while excluding regions problematic for benchmarking small variants, such as copy number variants, that should not have been in the previous version, which included 85% of GRCh38. It identifies eight times more false negatives in a short read variant call set relative to our previous benchmark. We demonstrate that this benchmark reliably identifies false positives and false negatives across technologies, enabling ongoing methods development.
View details for DOI 10.1016/j.xgen.2022.100128
View details for PubMedID 36452119
View details for PubMedCentralID PMC9706577
-
Aquila_stLFR: diploid genome assembly based structural variant calling package for stLFR linked-reads.
Bioinformatics advances
2021; 1 (1): vbab007
Abstract
Identifying structural variants (SVs) is critical in health and disease; however, detecting them remains a challenge. Several linked-read sequencing technologies, including 10X Genomics, TELL-Seq and single tube long fragment read (stLFR), have been recently developed as cost-effective approaches to reconstruct multi-megabase haplotypes (phase blocks) from sequence data of a single sample. These technologies provide an optimal sequencing platform to characterize SVs, though few computational algorithms can utilize them. Thus, we developed Aquila_stLFR, an approach that resolves SVs through haplotype-based assembly of stLFR linked-reads. Aquila_stLFR first partitions long fragment reads into two haplotype-specific blocks with the assistance of the high-quality reference genome, by taking advantage of the potential phasing ability of the linked-read itself. Each haplotype is then assembled independently, to achieve a complete diploid assembly to finally reconstruct the genome-wide SVs. We benchmarked Aquila_stLFR on a well-studied sample, NA24385, and showed Aquila_stLFR can detect medium to large size deletions (50 bp-10 kb) with high sensitivity and medium-size insertions (50 bp-1 kb) with high specificity. Source code and documentation are available on https://github.com/maiziex/Aquila_stLFR. Supplementary data are available at Bioinformatics Advances online.
View details for DOI 10.1093/bioadv/vbab007
View details for PubMedID 36700103
View details for PubMedCentralID PMC9710574
-
Aquila enables reference-assisted diploid personal genome assembly and comprehensive variant detection based on linked reads.
Nature communications
2021; 12 (1): 1077
Abstract
We introduce Aquila, a new approach to variant discovery in personal genomes, which is critical for uncovering the genetic contributions to health and disease. Aquila uses a reference sequence and linked-read data to generate a high quality diploid genome assembly, from which it then comprehensively detects and phases personal genetic variation. The contigs of the assemblies from our libraries cover >95% of the human reference genome, with over 98% of that in a diploid state. Thus, the assemblies support detection and accurate genotyping of the most prevalent types of human genetic variation, including single nucleotide polymorphisms (SNPs), small insertions and deletions (small indels), and structural variants (SVs), in all but the most difficult regions. All heterozygous variants are phased in blocks that can approach arm-level length. The final output of Aquila is a diploid and phased personal genome sequence, and a phased Variant Call Format (VCF) file that also contains homozygous and a few unphased heterozygous variants. Aquila represents a cost-effective approach that can be applied to cohorts for variation discovery or association studies, or to single individuals with rare phenotypes that could be caused by SVs or compound heterozygosity.
View details for DOI 10.1038/s41467-021-21395-x
View details for PubMedID 33597536
-
De novo mutational signature discovery in tumor genomes using SparseSignatures.
PLoS computational biology
2021; 17 (6): e1009119
Abstract
Cancer is the result of mutagenic processes that can be inferred from tumor genomes by analyzing rate spectra of point mutations, or "mutational signatures". Here we present SparseSignatures, a novel framework to extract signatures from somatic point mutation data. Our approach incorporates a user-specified background signature, employs regularization to reduce noise in non-background signatures, uses cross-validation to identify the number of signatures, and is scalable to large datasets. We show that SparseSignatures outperforms current state-of-the-art methods on simulated data using a variety of standard metrics. We then apply SparseSignatures to whole genome sequences of pancreatic and breast tumors, discovering well-differentiated signatures that are linked to known mutagenic mechanisms and are strongly associated with patient clinical features.
View details for DOI 10.1371/journal.pcbi.1009119
View details for PubMedID 34181655
-
De novo diploid genome assembly for genome-wide structural variant detection.
NAR genomics and bioinformatics
2020; 2 (1): lqz018
Abstract
Detection of structural variants (SVs) on the basis of read alignment to a reference genome remains a difficult problem. De novo assembly, traditionally used to generate reference genomes, offers an alternative for SV detection. However, it has not been applied broadly to human genomes because of fundamental limitations of short-fragment approaches and high cost of long-read technologies. We here show that 10× linked-read sequencing supports accurate SV detection. We examined variants in six de novo 10× assemblies with diverse experimental parameters from two commonly used human cell lines: NA12878 and NA24385. The assemblies are effective for detecting mid-size SVs, which were discovered by simple pairwise alignment of the assemblies' contigs to the reference (hg38). Our study also shows that the base-pair level SV breakpoint accuracy is high, with a majority of SVs having precisely correct sizes and breakpoints. Setting the ancestral state of SV loci by comparing to ape orthologs allows inference of the actual molecular mechanism (insertion or deletion) causing the mutation. In about half of cases, the mechanism is the opposite of the reference-based call. We uncover 214 SVs that may have been maintained as polymorphisms in the human lineage since before our divergence from chimp. Overall, we show that de novo assembly of 10× linked-read data can achieve cost-effective SV detection for personal genomes.
View details for DOI 10.1093/nargab/lqz018
View details for PubMedID 33575568
View details for PubMedCentralID PMC7671403
-
Assessment of human diploid genome assembly with 10x Linked-Reads data.
GigaScience
2019; 8 (11)
Abstract
BACKGROUND: Producing cost-effective haplotype-resolved personal genomes remains challenging. 10x Linked-Read sequencing, with its high base quality and long-range information, has been demonstrated to facilitate de novo assembly of human genomes and variant detection. In this study, we investigate in depth how the parameter space of 10x library preparation and sequencing affects assembly quality, on the basis of both simulated and real libraries. RESULTS: We prepared and sequenced eight 10x libraries with a diverse set of parameters from standard cell lines NA12878 and NA24385 and performed whole-genome assembly on the data. We also developed the simulator LRTK-SIM to follow the workflow of 10x data generation and produce realistic simulated Linked-Read data sets. We found that assembly quality could be improved by increasing the total sequencing coverage (C) and keeping physical coverage of DNA fragments (CF) or read coverage per fragment (CR) within broad ranges. The optimal physical coverage was between 332× and 823× and assembly quality worsened if it increased to >1,000× for a given C. Long DNA fragments could significantly extend phase blocks but decreased contig contiguity. The optimal length-weighted fragment length (W$\mu_{FL}$) was 50-150 kb. When broadly optimal parameters were used for library preparation and sequencing, 80% of the genome was assembled in a diploid state. CONCLUSIONS: The Linked-Read libraries we generated and the parameter space we identified provide theoretical considerations and practical guidelines for personal genome assemblies based on 10x Linked-Read sequencing.
View details for DOI 10.1093/gigascience/giz141
View details for PubMedID 31769805
-
Comprehensive genomic characterization of breast tumors with BRCA1 and BRCA2 mutations.
BMC medical genomics
2019; 12 (1): 84
Abstract
Germline mutations in the BRCA1 and BRCA2 genes predispose carriers to breast and ovarian cancer, and there remains a need to identify the specific genomic mechanisms by which cancer evolves in these patients. Here we present a systematic genomic analysis of breast tumors with BRCA1 and BRCA2 mutations. We analyzed genomic data from breast tumors, with a focus on comparing tumors with BRCA1/BRCA2 gene mutations with common classes of sporadic breast tumors. We identify differences between BRCA-mutated and sporadic breast tumors in patterns of point mutation, DNA methylation and structural variation. We show that structural variation disproportionately affects tumor suppressor genes and identify specific driver gene candidates that are enriched for structural variation. Compared to sporadic tumors, BRCA-mutated breast tumors show signals of reduced DNA methylation, more ancestral cell divisions, and elevated rates of structural variation that tend to disrupt highly expressed protein-coding genes and known tumor suppressors. Our analysis suggests that BRCA-mutated tumors are more aggressive than sporadic breast cancers because loss of the BRCA pathway causes multiple processes of mutagenesis and gene dysregulation.
View details for DOI 10.1186/s12920-019-0545-0
View details for PubMedID 31182087
-
High-quality genome sequences of uncultured microbes by assembly of read clouds.
Nature biotechnology
2018
Abstract
Although shotgun metagenomic sequencing of microbiome samples enables partial reconstruction of strain-level community structure, obtaining high-quality microbial genome drafts without isolation and culture remains difficult. Here, we present an application of read clouds, short-read sequences tagged with long-range information, to microbiome samples. We present Athena, a de novo assembler that uses read clouds to improve metagenomic assemblies. We applied this approach to sequence stool samples from two healthy individuals and compared it with existing short-read and synthetic long-read metagenomic sequencing techniques. Read-cloud metagenomic sequencing and Athena assembly produced the most comprehensive individual genome drafts with high contiguity (>200-kb N50, fewer than ten contigs), even for bacteria with relatively low (20×) raw short-read-sequence coverage. We also sequenced a complex marine-sediment sample and generated 24 intermediate-quality genome drafts (>70% complete, <10% contaminated), nine of which were complete (>90% complete, <5% contaminated). Our approach allows for culture-free generation of high-quality microbial genome drafts by using a single shotgun experiment.
View details for PubMedID 30320765
-
HAPDeNovo: a haplotype-based approach for filtering and phasing de novo mutations in linked read sequencing data.
BMC genomics
2018; 19 (1): 467
Abstract
BACKGROUND: De novo mutations (DNMs) are associated with neurodevelopmental and congenital diseases, and their detection can contribute to understanding disease pathogenicity. However, accurate detection is challenging because of their small number relative to the genome-wide false positives in next generation sequencing (NGS) data. Software such as DeNovoGear and TrioDeNovo have been developed to detect DNMs, but at good sensitivity they still produce many false positive calls. RESULTS: To address this challenge, we develop HAPDeNovo, a program that leverages phasing information from linked read sequencing, to remove false positive DNMs from candidate lists generated by DNM-detection tools. Short reads from each phasing block are allocated to each of the two haplotypes followed by generating a haploid genotype for each putative DNM. HAPDeNovo removes variants that are called as heterozygous in one of the haplotypes because they are almost certainly false positives. Our experiments on 10X Chromium linked read sequencing trio data reveal that HAPDeNovo eliminates 80 to 99% of false positives regardless of how large the candidate DNM set is. CONCLUSIONS: HAPDeNovo leverages the haplotype information from linked read sequencing to remove spurious false positive DNMs effectively, and it increases accuracy of DNM detection dramatically without sacrificing sensitivity.
View details for PubMedID 29914369
-
Multi-omic tumor data reveal diversity of molecular mechanisms that correlate with survival.
Nature communications
2018; 9 (1): 4453
Abstract
Outcomes for cancer patients vary greatly even within the same tumor type, and characterization of molecular subtypes of cancer holds important promise for improving prognosis and personalized treatment. This promise has motivated recent efforts to produce large amounts of multidimensional genomic (multi-omic) data, but current algorithms still face challenges in the integrated analysis of such data. Here we present Cancer Integration via Multikernel Learning (CIMLR), a new cancer subtyping method that integrates multi-omic data to reveal molecular subtypes of cancer. We apply CIMLR to multi-omic data from 36 cancer types and show significant improvements in both computational efficiency and ability to extract biologically meaningful cancer subtypes. The discovered subtypes exhibit significant differences in patient survival for 27 of 36 cancer types. Our analysis reveals integrated patterns of gene expression, methylation, point mutations, and copy number changes in multiple cancers and highlights patterns specifically associated with poor patient outcomes.
View details for PubMedID 30367051
-
Genome-wide reconstruction of complex structural variants using read clouds
NATURE METHODS
2017; 14 (9): 915-+
Abstract
In read cloud approaches, microfluidic partitioning of long genomic DNA fragments and barcoding of shorter fragments derived from these fragments retains long-range information in short sequencing reads. This combination of short reads with long-range information represents a powerful alternative to single-molecule long-read sequencing. We develop Genome-wide Reconstruction of Complex Structural Variants (GROC-SVs) for SV detection and assembly from read cloud data and apply this method to Illumina-sequenced 10x Genomics sarcoma and breast cancer data sets. Compared with short-fragment sequencing, GROC-SVs substantially improves the specificity of breakpoint detection at comparable sensitivity. This approach also performs sequence assembly across multiple breakpoints simultaneously, enabling the reconstruction of events exhibiting remarkable complexity. We show that chromothriptic rearrangements occurred before copy number amplifications, and that rates of single-nucleotide variants and SVs are not correlated. Our results support the use of read cloud approaches to advance the characterization of large and complex structural variation.
View details for PubMedID 28714986
-
A research roadmap for next-generation sequencing informatics
SCIENCE TRANSLATIONAL MEDICINE
2016; 8 (335)
Abstract
Next-generation sequencing technologies are fueling a wave of new diagnostic tests. Progress on a key set of nine research challenge areas will help generate the knowledge required to advance effectively these diagnostics to the clinic.
View details for DOI 10.1126/scitranslmed.aaf7314
View details for PubMedID 27099173
-
Lineage-specific enhancers activate self-renewal genes in macrophages and embryonic stem cells.
Science (New York, N.Y.)
2016; 351 (6274): aad5510
Abstract
Differentiated macrophages can self-renew in tissues and expand long term in culture, but the gene regulatory mechanisms that accomplish self-renewal in the differentiated state have remained unknown. Here we show that in mice, the transcription factors MafB and c-Maf repress a macrophage-specific enhancer repertoire associated with a gene network that controls self-renewal. Single-cell analysis revealed that, in vivo, proliferating resident macrophages can access this network by transient down-regulation of Maf transcription factors. The network also controls embryonic stem cell self-renewal but is associated with distinct embryonic stem cell-specific enhancers. This indicates that distinct lineage-specific enhancer platforms regulate a shared network of genes that control self-renewal potential in both stem and mature cells.
View details for DOI 10.1126/science.aad5510
View details for Web of Science ID 000369810000032
View details for PubMedID 26797145
View details for PubMedCentralID PMC4811353
-
Extensive sequencing of seven human genomes to characterize benchmark reference materials.
Scientific data
2016; 3: 160025
Abstract
The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.
View details for DOI 10.1038/sdata.2016.25
View details for PubMedID 27271295
View details for PubMedCentralID PMC4896128
-
svviz: a read viewer for validating structural variants
BIOINFORMATICS
2015; 31 (24): 3994-3996
View details for DOI 10.1093/bioinformatics/btv478
View details for PubMedID 26286809
-
Read clouds uncover variation in complex regions of the human genome.
Genome research
2015; 25 (10): 1570-1580
Abstract
Although an increasing amount of human genetic variation is being identified and recorded, determining variants within repeated sequences of the human genome remains a challenge. Most population and genome-wide association studies have therefore been unable to consider variation in these regions. Core to the problem is the lack of a sequencing technology that produces reads with sufficient length and accuracy to enable unique mapping. Here, we present a novel methodology of using read clouds, obtained by accurate short-read sequencing of DNA derived from long fragment libraries, to confidently align short reads within repeat regions and enable accurate variant discovery. Our novel algorithm, Random Field Aligner (RFA), captures the relationships among the short reads governed by the long read process via a Markov Random Field. We utilized a modified version of the Illumina TruSeq synthetic long-read protocol, which yielded shallow-sequenced read clouds. We test RFA through extensive simulations and apply it to discover variants on the NA12878 human sample, for which shallow TruSeq read cloud sequencing data are available, and on an invasive breast carcinoma genome that we sequenced using the same method. We demonstrate that RFA facilitates accurate recovery of variation in 155 Mb of the human genome, including 94% of 67 Mb of segmental duplication sequence and 96% of 11 Mb of transcribed sequence, that are currently hidden from short-read technologies.
View details for DOI 10.1101/gr.191189.115
View details for PubMedID 26286554
View details for PubMedCentralID PMC4579342
-
Constraint and divergence of global gene expression in the mammalian embryo
ELIFE
2015; 4
Abstract
The effects of genetic variation on gene regulation in the developing mammalian embryo remain largely unexplored. To globally quantify these effects, we crossed two divergent mouse strains and asked how genotype of the mother or of the embryo drives gene expression phenotype genomewide. Embryonic expression of 331 genes depends on the genotype of the mother. Embryonic genotype controls allele-specific expression of 1594 genes and a highly overlapping set of cis-expression quantitative trait loci (eQTL). A marked paucity of trans-eQTL suggests that the widespread expression differences do not propagate through the embryonic gene regulatory network. The cis-eQTL genes exhibit lower-than-average evolutionary conservation and are depleted for developmental regulators, consistent with purifying selection acting on expression phenotype of pattern formation genes. The widespread effect of maternal and embryonic genotype in conjunction with the purifying selection we uncovered suggests that embryogenesis is an important and understudied reservoir of phenotypic variation.
View details for DOI 10.7554/eLife.05538
View details for Web of Science ID 000373792400001
View details for PubMedID 25871848
View details for PubMedCentralID PMC4417935
-
Concepts in solid tumor evolution
TRENDS IN GENETICS
2015; 31 (4): 208-214
Abstract
Evolutionary mechanisms in cancer progression give tumors their individuality. Cancer evolution is different from organismal evolution, however, and we discuss where concepts from evolutionary genetics are useful or limited in facilitating an understanding of cancer. Based on these concepts we construct and apply the simplest plausible model of tumor growth and progression. Simulations using this simple model illustrate the importance of stochastic events early in tumorigenesis, highlight the dominance of exponential growth over linear growth and differentiation, and explain the clonal substructure of tumors.
View details for DOI 10.1016/j.tig.2015.02.001
View details for Web of Science ID 000353089500006
View details for PubMedID 25733351
View details for PubMedCentralID PMC4380537
-
Cell-lineage heterogeneity and driver mutation recurrence in pre-invasive breast neoplasia.
Genome medicine
2015; 7 (1): 28
Abstract
All cells in an individual are related to one another by a bifurcating lineage tree, in which each node is an ancestral cell that divided into two, each branch connects two nodes, and the root is the zygote. When a somatic mutation occurs in an ancestral cell, all its descendants carry the mutation, which can then serve as a lineage marker for the phylogenetic reconstruction of tumor progression. Using this concept, we investigate cell lineage relationships and genetic heterogeneity of pre-invasive neoplasias compared to invasive carcinomas. We deeply sequenced over a thousand phylogenetically informative somatic variants in 66 morphologically independent samples from six patients that represent a spectrum of normal, early neoplasia, carcinoma in situ, and invasive carcinoma. For each patient, we obtained a highly resolved lineage tree that establishes the phylogenetic relationships among the pre-invasive lesions and with the invasive carcinoma. The trees reveal lineage heterogeneity of pre-invasive lesions, both within the same lesion, and between histologically similar ones. On the basis of the lineage trees, we identified a large number of independent recurrences of PIK3CA H1047 mutations in separate lesions in four of the six patients, often separate from the diagnostic carcinoma. Our analyses demonstrate that multi-sample phylogenetic inference provides insights on the origin of driver mutations, lineage heterogeneity of neoplastic proliferations, and the relationship of genomically aberrant neoplasias with the primary tumors. PIK3CA driver mutations may be comparatively benign inducers of cellular proliferation.
View details for DOI 10.1186/s13073-015-0146-2
View details for PubMedID 25918554
View details for PubMedCentralID PMC4410742
-
Maternal bias and escape from X chromosome imprinting in the midgestation mouse placenta.
Developmental biology
2014; 390 (1): 80-92
Abstract
To investigate the epigenetic landscape at the interface between mother and fetus, we provide a comprehensive analysis of parent-of-origin bias in the mouse placenta. Using F1 interspecies hybrids between Mus musculus (C57BL/6J) and Mus musculus castaneus, we sequenced RNA from 23 individual midgestation placentas, five late stage placentas, and two yolk sac samples and then used SNPs to determine whether transcripts were preferentially generated from the maternal or paternal allele. In the placenta, we find 103 genes that show significant and reproducible parent-of-origin bias, of which 78 are novel candidates. Most (96%) show a strong maternal bias which we demonstrate, via multiple mathematical models, pyrosequencing, and FISH, is not due to maternal decidual contamination. Analysis of the X chromosome also reveals paternal expression of Xist and several genes that escape inactivation, most significantly Alas2, Fhl1, and Slc38a5. Finally, sequencing individual placentas allowed us to reveal notable expression similarity between littermates. In all, we observe a striking preference for maternal transcription in the midgestation mouse placenta and a dynamic imprinting landscape in extraembryonic tissues, reflecting the complex nature of epigenetic pathways in the placenta.
View details for DOI 10.1016/j.ydbio.2014.02.020
View details for PubMedID 24594094
-
Discovery of recurrent structural variants in nasopharyngeal carcinoma
GENOME RESEARCH
2014; 24 (2): 300-309
Abstract
We present the discovery of genes recurrently involved in structural variation in nasopharyngeal carcinoma (NPC) and the identification of a novel type of somatic structural variant. We identified the variants with high complexity mate-pair libraries and a novel computational algorithm specifically designed for tumor-normal comparisons, SMASH. SMASH combines signals from split reads and mate-pair discordance to detect somatic structural variants. We demonstrate a >90% validation rate and a breakpoint reconstruction accuracy of 3 bp by Sanger sequencing. Our approach identified three in-frame gene fusions (YAP1-MAML2, PTPLB-RSRC1, and SP3-PTK2) that had strong levels of expression in corresponding NPC tissues. We found two cases of a novel type of structural variant, which we call "coupled inversion," one of which produced the YAP1-MAML2 fusion. To investigate whether the identified fusion genes are recurrent, we performed fluorescent in situ hybridization (FISH) to screen 196 independent NPC cases. We observed recurrent rearrangements of MAML2 (three cases), PTK2 (six cases), and SP3 (two cases), corresponding to a combined rate of structural variation recurrence of 6% among tested NPC tissues.
View details for DOI 10.1101/gr.156224.113
View details for Web of Science ID 000330696800012
View details for PubMedID 24214394
View details for PubMedCentralID PMC3912420
-
Inference of tumor phylogenies with improved somatic mutation discovery.
Journal of computational biology
2013; 20 (11): 933-944
Abstract
Next-generation sequencing technologies provide a powerful tool for studying genome evolution during progression of advanced diseases such as cancer. Although many recent studies have employed new sequencing technologies to detect mutations across multiple, genetically related tumors, current methods do not exploit available phylogenetic information to improve the accuracy of their variant calls. Here, we present a novel algorithm that uses somatic single-nucleotide variations (SNVs) in multiple, related tissue samples as lineage markers for phylogenetic tree reconstruction. Our method then leverages the inferred phylogeny to improve the accuracy of SNV discovery. Experimental analyses demonstrate that our method achieves up to 32% improvement for somatic SNV calling of multiple, related samples over the accuracy of GATK's Unified Genotyper, the state-of-the-art multisample SNV caller.
View details for DOI 10.1089/cmb.2013.0106
View details for PubMedID 24195709
-
Transcription-factor occupancy at HOT regions quantitatively predicts RNA polymerase recruitment in five human cell lines
BMC GENOMICS
2013; 14
Abstract
High-occupancy target (HOT) regions are compact genome loci occupied by many different transcription factors (TFs). HOT regions were initially defined in invertebrate model organisms, and we here show that they are a ubiquitous feature of the human gene-regulation landscape. We identified HOT regions by a comprehensive analysis of ChIP-seq data from 96 DNA-associated proteins in 5 human cell lines. Most HOT regions co-localize with RNA polymerase II binding sites, but many are not near the promoters of annotated genes. At HOT promoters, TF occupancy is strongly predictive of transcription preinitiation complex recruitment and moderately predictive of initiating Pol II recruitment, but only weakly predictive of elongating Pol II and RNA transcript abundance. TF occupancy varies quantitatively within human HOT regions; we used this variation to discover novel associations between TFs. The sequence motif associated with any given TF's direct DNA binding is somewhat predictive of its empirical occupancy, but a great deal of occupancy occurs at sites without the TF's motif, implying indirect recruitment by another TF whose motif is present. Mammalian HOT regions are regulatory hubs that integrate the signals from diverse regulatory pathways to quantitatively tune the promoter for RNA polymerase II recruitment.
View details for DOI 10.1186/1471-2164-14-720
View details for Web of Science ID 000328633100002
View details for PubMedID 24138567
View details for PubMedCentralID PMC3826616
-
Genome evolution during progression to breast cancer
GENOME RESEARCH
2013; 23 (7): 1097-1108
Abstract
Cancer evolution involves cycles of genomic damage, epigenetic deregulation, and increased cellular proliferation that eventually culminate in the carcinoma phenotype. Early neoplasias, which are often found concurrently with carcinomas and are histologically distinguishable from normal breast tissue, are less advanced in phenotype than carcinomas and are thought to represent precursor stages. To elucidate their role in cancer evolution we performed comparative whole-genome sequencing of early neoplasias, matched normal tissue, and carcinomas from six patients, for a total of 31 samples. By using somatic mutations as lineage markers we built trees that relate the tissue samples within each patient. On the basis of these lineage trees we inferred the order, timing, and rates of genomic events. In four out of six cases, an early neoplasia and the carcinoma share a mutated common ancestor with recurring aneuploidies, and in all six cases evolution accelerated in the carcinoma lineage. Transition spectra of somatic mutations are stable and consistent across cases, suggesting that accumulation of somatic mutations is a result of increased ancestral cell division rather than specific mutational mechanisms. In contrast to highly advanced tumors that are the focus of much of the current cancer genome sequencing, neither the early neoplasia genomes nor the carcinomas are enriched with potentially functional somatic point mutations. Aneuploidies that occur in common ancestors of neoplastic and tumor cells are the earliest events that affect a large number of genes and may predispose breast tissue to eventual development of invasive carcinoma.
View details for DOI 10.1101/gr.151670.112
View details for Web of Science ID 000321119900007
View details for PubMedID 23568837
-
The origin, evolution, and functional impact of short insertion-deletion variants identified in 179 human genomes.
Genome research
2013; 23 (5): 749-761
Abstract
Short insertions and deletions (indels) are the second most abundant form of human genetic variation, but our understanding of their origins and functional effects lags behind that of other types of variants. Using population-scale sequencing, we have identified a high-quality set of 1.6 million indels from 179 individuals representing three diverse human populations. We show that rates of indel mutagenesis are highly heterogeneous, with 43%-48% of indels occurring in 4.03% of the genome, whereas in the remaining 96% their prevalence is 16 times lower than SNPs. Polymerase slippage can explain upwards of three-fourths of all indels, with the remainder being mostly simple deletions in complex sequence. However, insertions do occur and are significantly associated with pseudo-palindromic sequence features compatible with the fork stalling and template switching (FoSTeS) mechanism more commonly associated with large structural variations. We introduce a quantitative model of polymerase slippage, which enables us to identify indel-hypermutagenic protein-coding genes, some of which are associated with recurrent mutations leading to disease. Accounting for mutational rate heterogeneity due to sequence context, we find that indels across functional sequence are generally subject to stronger purifying selection than SNPs. We find that indel length modulates selection strength, and that indels affecting multiple functionally constrained nucleotides undergo stronger purifying selection. We further find that indels are enriched in associations with gene expression and find evidence for a contribution of nonsense-mediated decay. Finally, we show that indels can be integrated in existing genome-wide association studies (GWAS); although we do not find direct evidence that potentially causal protein-coding indels are enriched with associations to known disease-associated SNPs, our findings suggest that the causal variant underlying some of these associations may be indels.
View details for DOI 10.1101/gr.148718.112
View details for PubMedID 23478400
View details for PubMedCentralID PMC3638132
-
Global genomic profiling reveals an extensive p53-regulated autophagy program contributing to key p53 responses.
Genes & development
2013; 27 (9): 1016-1031
Abstract
The mechanisms by which the p53 tumor suppressor acts remain incompletely understood. To gain new insights into p53 biology, we used high-throughput sequencing to analyze global p53 transcriptional networks in primary mouse embryo fibroblasts in response to DNA damage. Chromatin immunoprecipitation sequencing reveals 4785 p53-bound sites in the genome located near 3193 genes involved in diverse biological processes. RNA sequencing analysis shows that only a subset of p53-bound genes is transcriptionally regulated, yielding a list of 432 p53-bound and regulated genes. Interestingly, we identify a host of autophagy genes as direct p53 target genes. While the autophagy program is regulated predominantly by p53, the p53 family members p63 and p73 contribute to activation of this autophagy gene network. Induction of autophagy genes in response to p53 activation is associated with enhanced autophagy in diverse settings and depends on p53 transcriptional activity. While p53-induced autophagy does not affect cell cycle arrest in response to DNA damage, it is important for both robust p53-dependent apoptosis triggered by DNA damage and transformation suppression by p53. Together, our data highlight an intimate connection between p53 and autophagy through a vast transcriptional network and indicate that autophagy contributes to p53-dependent apoptosis and cancer suppression.
View details for DOI 10.1101/gad.212282.112
View details for PubMedID 23651856
View details for PubMedCentralID PMC3656320
-
An integrated encyclopedia of DNA elements in the human genome
NATURE
2012; 489 (7414): 57-74
Abstract
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
View details for DOI 10.1038/nature11247
View details for Web of Science ID 000308347000039
View details for PubMedID 22955616
View details for PubMedCentralID PMC3439153
-
Architecture of the human regulatory network derived from ENCODE data
NATURE
2012; 489 (7414): 91-100
Abstract
Transcription factors bind in a combinatorial fashion to specify the on-and-off states of genes; the ensemble of these binding events forms a regulatory network, constituting the wiring diagram for a cell. To examine the principles of the human transcriptional regulatory network, we determined the genomic binding information of 119 transcription-related factors in over 450 distinct experiments. We found the combinatorial, co-association of transcription factors to be highly context specific: distinct combinations of factors bind at specific genomic locations. In particular, there are significant differences in the binding proximal and distal to genes. We organized all the transcription factor binding into a hierarchy and integrated it with other genomic information (for example, microRNA regulation), forming a dense meta-network. Factors at different levels have different properties; for instance, top-level transcription factors more strongly influence expression and middle-level ones co-regulate targets to mitigate information-flow bottlenecks. Moreover, these co-regulations give rise to many enriched network motifs (for example, noise-buffering feed-forward loops). Finally, more connected network components are under stronger selection and exhibit a greater degree of allele-specific activity (that is, differential binding to the two parental alleles). The regulatory information obtained in this study will be crucial for interpreting personal genome sequences and understanding basic principles of human biology and disease.
View details for DOI 10.1038/nature11245
View details for PubMedID 22955619
-
ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia
GENOME RESEARCH
2012; 22 (9): 1813-1831
Abstract
Chromatin immunoprecipitation (ChIP) followed by high-throughput DNA sequencing (ChIP-seq) has become a valuable and widely used approach for mapping the genomic location of transcription-factor binding and histone modifications in living cells. Despite its widespread use, there are considerable differences in how these experiments are conducted, how the results are scored and evaluated for quality, and how the data and metadata are archived for public use. These practices affect the quality and utility of any global ChIP experiment. Through our experience in performing ChIP-seq experiments, the ENCODE and modENCODE consortia have developed a set of working standards and guidelines for ChIP experiments that are updated routinely. The current guidelines address antibody validation, experimental replication, sequencing depth, data and metadata reporting, and data quality assessment. We discuss how ChIP quality, assessed in these ways, affects different uses of ChIP-seq data. All data sets used in the analysis have been deposited for public viewing and downloading at the ENCODE (http://encodeproject.org/ENCODE/) and modENCODE (http://www.modencode.org/) portals.
View details for DOI 10.1101/gr.136184.111
View details for PubMedID 22955991
-
Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements
GENOME RESEARCH
2012; 22 (9): 1735-1747
Abstract
Gene regulation at functional elements (e.g., enhancers, promoters, insulators) is governed by an interplay of nucleosome remodeling, histone modifications, and transcription factor binding. To enhance our understanding of gene regulation, the ENCODE Consortium has generated a wealth of ChIP-seq data on DNA-binding proteins and histone modifications. We additionally generated nucleosome positioning data on two cell lines, K562 and GM12878, by MNase digestion and high-depth sequencing. Here we relate 14 chromatin signals (12 histone marks, DNase, and nucleosome positioning) to the binding sites of 119 DNA-binding proteins across a large number of cell lines. We developed a new method for unsupervised pattern discovery, the Clustered AGgregation Tool (CAGT), which accounts for the inherent heterogeneity in signal magnitude, shape, and implicit strand orientation of chromatin marks. We applied CAGT on a total of 5084 data set pairs to obtain an exhaustive catalog of high-resolution patterns of histone modifications and nucleosome positioning signals around bound transcription factors. Our analyses reveal extensive heterogeneity in how histone modifications are deposited, and how nucleosomes are positioned around binding sites. With the exception of the CTCF/cohesin complex, asymmetry of nucleosome positioning is predominant. Asymmetry of histone modifications is also widespread, for all types of chromatin marks examined, including promoter, enhancer, elongation, and repressive marks. The fine-resolution signal shapes discovered by CAGT unveiled novel correlation patterns between chromatin marks, nucleosome positioning, and sequence content. Meta-analyses of the signal profiles revealed a common vocabulary of chromatin signals shared across multiple cell lines and binding proteins.
View details for DOI 10.1101/gr.136366.111
View details for PubMedID 22955985
-
A Cell Cycle Phosphoproteome of the Yeast Centrosome
SCIENCE
2011; 332 (6037): 1557-1561
Abstract
Centrosomes organize the bipolar mitotic spindle, and centrosomal defects cause chromosome instability. Protein phosphorylation modulates centrosome function, and we provide a comprehensive map of phosphorylation on intact yeast centrosomes (18 proteins). Mass spectrometry was used to identify 297 phosphorylation sites on centrosomes from different cell cycle stages. We observed different modes of phosphoregulation via specific protein kinases, phosphorylation site clustering, and conserved phosphorylated residues. Mutating all eight cyclin-dependent kinase (Cdk)-directed sites within the core component, Spc42, resulted in lethality and reduced centrosomal assembly. Alternatively, mutation of one conserved Cdk site within γ-tubulin (Tub4-S360D) caused mitotic delay and aberrant anaphase spindle elongation. Our work establishes the extent and complexity of this prominent posttranslational modification in centrosome biology and provides specific examples of phosphorylation control in centrosome function.
View details for DOI 10.1126/science.1205193
View details for Web of Science ID 000291990000045
View details for PubMedID 21700874
-
Determinants of nucleosome organization in primary human cells
NATURE
2011; 474 (7352): 516-U148
Abstract
Nucleosomes are the basic packaging units of chromatin, modulating accessibility of regulatory proteins to DNA and thus influencing eukaryotic gene regulation. Elaborate chromatin remodelling mechanisms have evolved that govern nucleosome organization at promoters, regulatory elements, and other functional regions in the genome. Analyses of chromatin landscape have uncovered a variety of mechanisms, including DNA sequence preferences, that can influence nucleosome positions. To identify major determinants of nucleosome organization in the human genome, we used deep sequencing to map nucleosome positions in three primary human cell types and in vitro. A majority of the genome showed substantial flexibility of nucleosome positions, whereas a small fraction showed reproducibly positioned nucleosomes. Certain sites that position in vitro can anchor the formation of nucleosomal arrays that have cell type-specific spacing in vivo. Our results unveil an interplay of sequence-based nucleosome preferences and non-nucleosomal factors in determining nucleosome organization within mammalian cells.
View details for DOI 10.1038/nature10002
View details for Web of Science ID 000291939700050
View details for PubMedID 21602827
View details for PubMedCentralID PMC3212987
-
A User's Guide to the Encyclopedia of DNA Elements (ENCODE)
PLOS BIOLOGY
2011; 9 (4)
Abstract
The mission of the Encyclopedia of DNA Elements (ENCODE) Project is to enable the scientific and medical communities to interpret the human genome sequence and apply it to understand human biology and improve health. The ENCODE Consortium is integrating multiple technologies and approaches in a collective effort to discover and define the functional elements encoded in the human genome, including genes, transcripts, and transcriptional regulatory regions, together with their attendant chromatin states and DNA methylation patterns. In the process, standards to ensure high-quality data have been implemented, and novel algorithms have been developed to facilitate analysis. Data and derived results are made available through a freely accessible database. Here we provide an overview of the project and the resources it is generating and illustrate the application of ENCODE data to interpret the human genome.
View details for DOI 10.1371/journal.pbio.1001046
View details for Web of Science ID 000289938900014
-
Identifying a High Fraction of the Human Genome to be under Selective Constraint Using GERP++
PLOS COMPUTATIONAL BIOLOGY
2010; 6 (12)
Abstract
Computational efforts to identify functional elements within genomes leverage comparative sequence information by looking for regions that exhibit evidence of selective constraint. One way of detecting constrained elements is to follow a bottom-up approach by computing constraint scores for individual positions of a multiple alignment and then defining constrained elements as segments of contiguous, highly scoring nucleotide positions. Here we present GERP++, a new tool that uses maximum likelihood evolutionary rate estimation for position-specific scoring and, in contrast to previous bottom-up methods, a novel dynamic programming approach to subsequently define constrained elements. GERP++ evaluates a richer set of candidate element breakpoints and ranks them based on statistical significance, eliminating the need for biased heuristic extension techniques. Using GERP++ we identify over 1.3 million constrained elements spanning over 7% of the human genome. We predict a higher fraction than earlier estimates largely due to the annotation of longer constrained elements, which improves one to one correspondence between predicted elements with known functional sequences. GERP++ is an efficient and effective tool to provide both nucleotide- and element-level constraint scores within deep multiple sequence alignments.
View details for DOI 10.1371/journal.pcbi.1001025
View details for Web of Science ID 000285574600013
View details for PubMedID 21152010
View details for PubMedCentralID PMC2996323
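The bottom-up idea in the abstract above (score each alignment position, then define elements as contiguous runs of high-scoring positions) can be illustrated with a much simpler stand-in. The sketch below is not the GERP++ dynamic program and does no significance ranking; it applies Kadane's maximum-subarray recurrence to a hypothetical list of per-position scores, assumed to be shifted so that unconstrained positions are negative. The function name and the input values are illustrative only.

```python
def best_constrained_segment(scores):
    """Return (start, end, total) of the highest-scoring contiguous segment.

    `scores` is a hypothetical list of per-position constraint scores,
    shifted so that unconstrained positions are negative; `end` is exclusive.
    This is Kadane's maximum-subarray recurrence, a simplified stand-in
    and not the GERP++ procedure itself.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best = (0, 1, scores[0])
    run_start, run_total = 0, 0.0
    for i, s in enumerate(scores):
        if run_total <= 0:
            # A non-positive running sum cannot help; restart the segment here.
            run_start, run_total = i, s
        else:
            run_total += s
        if run_total > best[2]:
            best = (run_start, i + 1, run_total)
    return best


# Made-up scores for seven alignment columns.
example = [-1.0, 2.0, 3.0, -0.5, 4.0, -6.0, 1.0]
print(best_constrained_segment(example))
```

A single mildly negative position (here index 3) does not split the segment, which is the behavior one would want from any element-calling scheme that tolerates isolated weakly scoring columns.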
-
Functional analyses of variants reveal a significant role for dominant negative and common alleles in oligogenic Bardet-Biedl syndrome
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2010; 107 (23): 10602-10607
Abstract
Technological advances hold the promise of rapidly catalyzing the discovery of pathogenic variants for genetic disease. However, this possibility is tempered by limitations in interpreting the functional consequences of genetic variation at candidate loci. Here, we present a systematic approach, grounded on physiologically relevant assays, to evaluate the mutational content (125 alleles) of the 14 genes associated with Bardet-Biedl syndrome (BBS). A combination of in vivo assays with subsequent in vitro validation suggests that a significant fraction of BBS-associated mutations have a dominant-negative mode of action. Moreover, we find that a subset of common alleles, previously considered to be benign, are, in fact, detrimental to protein function and can interact with strong rare alleles to modulate disease presentation. These data represent a comprehensive evaluation of genetic load in a multilocus disease. Importantly, superimposition of these results to human genetics data suggests a previously underappreciated complexity in disease architecture that might be shared among diverse clinical phenotypes.
View details for DOI 10.1073/pnas.1000219107
View details for Web of Science ID 000278549300050
View details for PubMedID 20498079
View details for PubMedCentralID PMC2890780
-
Evolutionary constraint facilitates interpretation of genetic variation in resequenced human genomes
GENOME RESEARCH
2010; 20 (3): 301-310
Abstract
Here, we demonstrate how comparative sequence analysis facilitates genome-wide base-pair-level interpretation of individual genetic variation and address two questions of importance for human personal genomics: first, whether an individual's functional variation comes mostly from noncoding or coding polymorphisms; and, second, whether population-specific or globally-present polymorphisms contribute more to functional variation in any given individual. Neither has been definitively answered by analyses of existing variation data because of a focus on coding polymorphisms, ascertainment biases in favor of common variation, and a lack of base-pair-level resolution for identifying functional variants. We resequenced 575 amplicons within 432 individuals at genomic sites enriched for evolutionary constraint and also analyzed variation within three published human genomes. We find that single-site measures of evolutionary constraint derived from mammalian multiple sequence alignments are strongly predictive of reductions in modern-day genetic diversity across a range of annotation categories and across the allele frequency spectrum from rare (<1%) to high frequency (>10% minor allele frequency). Furthermore, we show that putatively functional variation in an individual genome is dominated by polymorphisms that do not change protein sequence and that originate from our shared ancestral population and commonly segregate in human populations. These observations show that common, noncoding alleles contribute substantially to human phenotypes and that constraint-based analyses will be of value to identify phenotypically relevant variants in individual genomes.
View details for DOI 10.1101/gr.102210.109
View details for Web of Science ID 000275124600002
View details for PubMedID 20067941
View details for PubMedCentralID PMC2840986
-
ProPhylER: A curated online resource for protein function and structure based on evolutionary constraint analyses
GENOME RESEARCH
2010; 20 (1): 142-154
Abstract
ProPhylER (Protein Phylogeny and Evolutionary Rates) is a next-generation curated proteome resource that uses comparative sequence analysis to predict constraint and mutation impact for eukaryotic proteins. Its purpose is to inform any research program for which protein function and structure are relevant, by the predictive power of evolutionary constraint analyses. ProPhylER currently has nearly 9000 clusters of related proteins, including more than 200,000 sequences. It serves data via two interfaces. The "ProPhylER Interface" displays predictive analyses in sequence space; the "CrystalPainter" maps evolutionary constraints onto solved protein structures. Here we summarize ProPhylER's data content and analysis pipeline, demonstrate the use of ProPhylER's interfaces, and evaluate ProPhylER's unique regional analysis of evolutionary constraint. The high accuracy of ProPhylER's regional analysis complements the high resolution of its single-site analysis to effectively guide and inform structure-function investigations and predict the impact of polymorphisms.
View details for DOI 10.1101/gr.097121.109
View details for Web of Science ID 000273249500015
View details for PubMedID 19846609
View details for PubMedCentralID PMC2798826
-
Jarid2/Jumonji Coordinates Control of PRC2 Enzymatic Activity and Target Gene Occupancy in Pluripotent Cells
CELL
2009; 139 (7): 1290-1302
Abstract
Polycomb Repressive Complex 2 (PRC2) regulates key developmental genes in embryonic stem (ES) cells and during development. Here we show that Jarid2/Jumonji, a protein enriched in pluripotent cells and a founding member of the Jumonji C (JmjC) domain protein family, is a PRC2 subunit in ES cells. Genome-wide ChIP-seq analyses of Jarid2, Ezh2, and Suz12 binding reveal that Jarid2 and PRC2 occupy the same genomic regions. We further show that Jarid2 promotes PRC2 recruitment to the target genes while inhibiting PRC2 histone methyltransferase activity, suggesting that it acts as a "molecular rheostat" that finely calibrates PRC2 functions at developmental genes. Using Xenopus laevis as a model we demonstrate that Jarid2 knockdown impairs the induction of gastrulation genes in blastula embryos and results in failure of differentiation. Our findings illuminate a mechanism of histone methylation regulation in pluripotent cells and during early cell-fate transitions.
View details for DOI 10.1016/j.cell.2009.12.002
View details for Web of Science ID 000273048700017
View details for PubMedID 20064375
View details for PubMedCentralID PMC2911953
-
SHRiMP: Accurate Mapping of Short Color-space Reads
PLOS COMPUTATIONAL BIOLOGY
2009; 5 (5)
Abstract
The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25-70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP - the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced Ciona savignyi individual to the reference genome. We demonstrate that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual. SHRiMP is freely available at http://compbio.cs.toronto.edu/shrimp.
View details for DOI 10.1371/journal.pcbi.1000386
View details for Web of Science ID 000267081300009
View details for PubMedID 19461883
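The color-space reads mentioned in the abstract above come from the AB SOLiD two-base encoding, in which each color describes the transition between two adjacent bases. A minimal sketch of that encoding follows, assuming the commonly described mapping (A=0, C=1, G=2, T=3, color = XOR of neighboring codes). The function names are illustrative and are not part of SHRiMP.

```python
# Illustrative sketch of SOLiD-style color-space encoding; not SHRiMP code.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {v: k for k, v in BASE_TO_BITS.items()}


def to_color_space(seq):
    """Return the first base followed by one color (0-3) per adjacent base pair."""
    seq = seq.upper()
    codes = [BASE_TO_BITS[b] for b in seq]
    colors = [a ^ b for a, b in zip(codes, codes[1:])]
    return seq[0] + "".join(str(c) for c in colors)


def from_color_space(read):
    """Decode a color-space read (leading base plus colors) back to letters."""
    current = BASE_TO_BITS[read[0]]
    bases = [read[0]]
    for color in read[1:]:
        # Each color is the XOR of the previous and the next base code.
        current ^= int(color)
        bases.append(BITS_TO_BASE[current])
    return "".join(bases)


print(to_color_space("ATGGC"))    # A3103
print(from_color_space("A3103"))  # ATGGC
```

Because every color depends on two bases, a true single-base difference changes two adjacent colors, while an isolated sequencing error changes only one. This is one reason color-space reads call for a dedicated alignment step.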
-
Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data
NATURE METHODS
2008; 5 (9): 829-834
Abstract
Molecular interactions between protein complexes and DNA mediate essential gene-regulatory functions. Uncovering such interactions by chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-Seq) has recently become the focus of intense interest. We here introduce quantitative enrichment of sequence tags (QuEST), a powerful statistical framework based on the kernel density estimation approach, which uses ChIP-Seq data to determine positions where protein complexes contact DNA. Using QuEST, we discovered several thousand binding sites for the human transcription factors SRF, GABP and NRSF at an average resolution of about 20 base pairs. MEME motif-discovery tool-based analyses of the QuEST-identified sequences revealed DNA binding by cofactors of SRF, providing evidence that cofactor binding specificity can be obtained from ChIP-Seq data. By combining QuEST analyses with Gene Ontology (GO) annotations and expression data, we illustrate how general functions of transcription factors can be inferred.
View details for DOI 10.1038/NMETH.1246
View details for Web of Science ID 000258912700017
View details for PubMedID 19160518
View details for PubMedCentralID PMC2917543
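The kernel density estimation idea named in the abstract above can be sketched in a few lines. This is a simplified illustration, not the QuEST implementation: it ignores strand information, uses a Gaussian kernel with an arbitrarily chosen bandwidth of 30 bp, and runs on made-up tag positions. All names and parameter values are assumptions.

```python
import math


def kde_profile(tag_positions, start, end, bandwidth=30.0):
    """Gaussian kernel density of tag positions at each integer coordinate in [start, end)."""
    norm = 1.0 / (len(tag_positions) * bandwidth * math.sqrt(2.0 * math.pi))
    profile = []
    for x in range(start, end):
        total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in tag_positions)
        profile.append(norm * total)
    return profile


def peak_position(tag_positions, start, end, bandwidth=30.0):
    """Coordinate with the highest estimated tag density (a candidate binding position)."""
    profile = kde_profile(tag_positions, start, end, bandwidth)
    return start + max(range(len(profile)), key=profile.__getitem__)


# Hypothetical tags: a cluster near 100 and one isolated tag at 300.
tags = [95, 100, 100, 105, 300]
print(peak_position(tags, 0, 400))
```

Smoothing the tag counts this way lets a peak be placed at single-base coordinates even though individual tags are scattered around the true contact point. The bandwidth controls the trade-off between resolution and noise.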
-
The C. savignyi genetic map and its integration with the reference sequence facilitates insights into chordate genome evolution
GENOME RESEARCH
2008; 18 (8): 1369-1379
Abstract
The urochordate Ciona savignyi is an emerging model organism for the study of chordate evolution, development, and gene regulation. The extreme level of polymorphism in its population has inspired novel approaches in genome assembly, which we here continue to develop. Specifically, we present the reconstruction of all of C. savignyi's chromosomes via the development of a comprehensive genetic map, without a physical map intermediate. The resulting genetic map is complete, having one linkage group for each one of the 14 chromosomes. Eighty-three percent of the reference genome sequence is covered. The chromosomal reconstruction allowed us to investigate the evolution of genome structure in highly polymorphic species, by comparing the genome of C. savignyi to its divergent sister species, Ciona intestinalis. Both genomes have been extensively reshaped by intrachromosomal rearrangements. Interchromosomal changes have been extremely rare. This is in striking contrast to what has been observed in vertebrates, where interchromosomal events are commonplace. These results, when considered in light of the neutral theory, suggest fundamentally different modes of evolution of animal species with large versus small population sizes.
View details for DOI 10.1101/gr.078576.108
View details for Web of Science ID 000258116100018
View details for PubMedID 18519652
View details for PubMedCentralID PMC2493423
-
A high-resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning
GENOME RESEARCH
2008; 18 (7): 1051-1063
Abstract
Using the massively parallel technique of sequencing by oligonucleotide ligation and detection (SOLiD; Applied Biosystems), we have assessed the in vivo positions of more than 44 million putative nucleosome cores in the multicellular genetic model organism Caenorhabditis elegans. These analyses provide a global view of the chromatin architecture of a multicellular animal at extremely high density and resolution. While we observe some degree of reproducible positioning throughout the genome in our mixed stage population of animals, we note that the major chromatin feature in the worm is a diversity of allowed nucleosome positions at the vast majority of individual loci. While absolute positioning of nucleosomes can vary substantially, relative positioning of nucleosomes (in a repeated array structure likely to be maintained at least in part by steric constraints) appears to be a significant property of chromatin structure. The high density of nucleosomal reads enabled a substantial extension of previous analysis describing the usage of individual oligonucleotide sequences along the span of the nucleosome core and linker. We release this data set, via the UCSC Genome Browser, as a resource for the high-resolution analysis of chromatin conformation and DNA accessibility at individual loci within the C. elegans genome.
View details for DOI 10.1101/gr.076463.108
View details for Web of Science ID 000257249100005
View details for PubMedID 18477713
View details for PubMedCentralID PMC2493394
-
Identification of the Otopetrin Domain, a conserved domain in vertebrate otopetrins and invertebrate otopetrin-like family members
BMC EVOLUTIONARY BIOLOGY
2008; 8
Abstract
Otopetrin 1 (Otop1) encodes a multi-transmembrane domain protein with no homology to known transporters, channels, exchangers, or receptors. Otop1 is necessary for the formation of otoconia and otoliths, calcium carbonate biominerals within the inner ear of mammals and teleost fish that are required for the detection of linear acceleration and gravity. Vertebrate Otop1 and its paralogues Otop2 and Otop3 define a new gene family with homology to the invertebrate Domain of Unknown Function 270 genes (DUF270; pfam03189). Multi-species comparison of the predicted primary sequences and predicted secondary structures of 62 vertebrate otopetrin, and arthropod and nematode DUF270 proteins, has established that the genes encoding these proteins constitute a single family that we renamed the Otopetrin Domain Protein (ODP) gene family. Signature features of ODP proteins are three "Otopetrin Domains" that are highly conserved between vertebrates, arthropods and nematodes, and a highly constrained predicted loop structure. Our studies suggest a refined topologic model for ODP insertion into the lipid bilayer of 12 transmembrane domains, and highlight conserved amino-acid residues that will aid in the biochemical examination of ODP family function. The high degree of sequence and structural similarity of the ODP proteins may suggest a conserved role in the intracellular trafficking of calcium and the formation of biominerals.
View details for DOI 10.1186/1471-2148-8-41
View details for Web of Science ID 000254053700001
View details for PubMedID 18254951
View details for PubMedCentralID PMC2268672
-
Fruit fly family fun
CELL
2007; 131 (7): 1222-1223
Abstract
A recent comparative analysis of the sequenced genomes of 12 Drosophila species (Drosophila 12 Genomes Consortium, 2007; Stark et al., 2007) reveals a comprehensive picture of the evolution of small animal genomes and greatly improves computational predictions of functional elements in the D. melanogaster reference sequence.
View details for DOI 10.1016/j.cell.2007.12.003
View details for Web of Science ID 000252217200009
View details for PubMedID 18160030
-
Functional architecture and evolution of transcriptional elements that drive gene coexpression
SCIENCE
2007; 317 (5844): 1557-1560
Abstract
Transcriptional coexpression of interacting gene products is required for complex molecular processes; however, the function and evolution of cis-regulatory elements that orchestrate coexpression remain largely unexplored. We mutagenized 19 regulatory elements that drive coexpression of Ciona muscle genes and obtained quantitative estimates of the cis-regulatory activity of the 77 motifs that comprise these elements. We found that individual motif activity ranges broadly within and among elements, and among different instantiations of the same motif type. The activity of orthologous motifs is strongly constrained, although motif arrangement, type, and activity vary greatly among the elements of different co-regulated genes. Thus, the syntactical rules governing this regulatory function are flexible but become highly constrained evolutionarily once they are established in a particular element.
View details for DOI 10.1126/science.1145893
View details for Web of Science ID 000249467900044
View details for PubMedID 17872446
-
Mammalian Comparative Sequence Analysis of the Agrp Locus
PLOS ONE
2007; 2 (8)
Abstract
Agouti-related protein encodes a neuropeptide that stimulates food intake. Agrp expression in the brain is restricted to neurons in the arcuate nucleus of the hypothalamus and is elevated by states of negative energy balance. The molecular mechanisms underlying Agrp regulation, however, remain poorly defined. Using a combination of transgenic and comparative sequence analysis, we have previously identified a 760 bp conserved region upstream of Agrp which contains STAT binding elements that participate in Agrp transcriptional regulation. In this study, we attempt to improve the specificity for detecting conserved elements in this region by comparing genomic sequences from 10 mammalian species. Our analysis reveals a symmetrical organization of conserved sequences upstream of Agrp, which cluster into two inverted repeat elements. Conserved sequences within these elements suggest a role for homeodomain proteins in the regulation of Agrp and provide additional targets for functional evaluation.
View details for DOI 10.1371/journal.pone.0000702
View details for Web of Science ID 000207452400006
View details for PubMedID 17684549
View details for PubMedCentralID PMC1931611
-
Constructing a meaningful evolutionary average at the phylogenetic center of mass
BMC BIOINFORMATICS
2007; 8
Abstract
As a consequence of the evolutionary process, data collected from related species tend to be similar. This similarity by descent can obscure subtler signals in the data such as the evidence of constraint on variation due to shared selective pressures. In comparative sequence analysis, for example, sequence similarity is often used to illuminate important regions of the genome, but if the comparison is between closely related species, then similarity is the rule rather than the interesting exception. Furthermore, and perhaps worse yet, the contribution of a divergent third species may be masked by the strong similarity between the other two. Here we propose a remedy that weighs the contribution of each species according to its phylogenetic placement. We first solve the problem of summarizing data related by phylogeny, and we explain why an average should operate on the entire evolutionary trajectory that relates the data. This perspective leads to a new approach in which we define the average in terms of the phylogeny, using the data and a stochastic model to obtain a probability on evolutionary trajectories. With the assumption that the data evolve according to a Brownian motion process on the tree, we show that our evolutionary average can be computed as a convex combination of the species data. Thus, our approach, called the BranchManager, defines both an average and a novel taxon weighting scheme. We compare the BranchManager to two other methods, demonstrating why it exhibits desirable properties. In doing so, we devise a framework for comparison and introduce the concept of a representative point at which the average is situated. The BranchManager uses as its representative point the phylogenetic center of mass, a choice which has both intuitive and practical appeal.
Because our average is intrinsic to both the dataset and to the phylogeny, we expect it and its corresponding weighting scheme to be useful in all sorts of studies where interspecies data need to be combined. Obvious applications include evolutionary studies of morphology, physiology or behaviour, but quantitative measures such as sequence hydrophobicity and gene expression level are amenable to our approach as well. Other areas of potential impact include motif discovery and vaccine design. A Java implementation of the BranchManager is available for download, as is a script written in the statistical language R.
View details for DOI 10.1186/1471-2105-8-222
View details for Web of Science ID 000248131500001
View details for PubMedID 17594490
-
Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project
NATURE
2007; 447 (7146): 799-816
Abstract
We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
View details for DOI 10.1038/nature05874
View details for Web of Science ID 000247207500034
View details for PubMedID 17571346
View details for PubMedCentralID PMC2212820
-
Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome
GENOME RESEARCH
2007; 17 (6): 760-774
Abstract
A key component of the ongoing ENCODE project involves rigorous comparative sequence analyses for the initially targeted 1% of the human genome. Here, we present orthologous sequence generation, alignment, and evolutionary constraint analyses of 23 mammalian species for all ENCODE targets. Alignments were generated using four different methods; comparisons of these methods reveal large-scale consistency but substantial differences in terms of small genomic rearrangements, sensitivity (sequence coverage), and specificity (alignment accuracy). We describe the quantitative and qualitative trade-offs concomitant with alignment method choice and the levels of technical error that need to be accounted for in applications that require multisequence alignments. Using the generated alignments, we identified constrained regions using three different methods. While the different constraint-detecting methods are in general agreement, there are important discrepancies relating to both the underlying alignments and the specific algorithms. However, by integrating the results across the alignments and constraint-detecting methods, we produced constraint annotations that were found to be robust based on multiple independent measures. Analyses of these annotations illustrate that most classes of experimentally annotated functional elements are enriched for constrained sequences; however, large portions of each class (with the exception of protein-coding sequences) do not overlap constrained regions. The latter elements might not be under primary sequence constraint, might not be constrained across all mammals, or might have expendable molecular functions. Conversely, 40% of the constrained sequences do not overlap any of the functional elements that have been experimentally identified. Together, these findings demonstrate and quantify how many genomic functional elements await basic molecular characterization.
View details for DOI 10.1101/gr.6034307
View details for Web of Science ID 000247226900009
View details for PubMedID 17567995
View details for PubMedCentralID PMC1891336
-
Extreme genomic variation in a natural population
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2007; 104 (13): 5698-5703
Abstract
Whole-genome sequence data from samples of natural populations provide fertile grounds for analyses of intraspecific variation and tests of population genetic theory. We show that the urochordate Ciona savignyi, one of the species of ocean-dwelling broadcast spawners commonly known as sea squirts, exhibits the highest rates of single-nucleotide and structural polymorphism ever comprehensively quantified in a multicellular organism. We demonstrate that the cause for the extreme heterozygosity is a large effective population size, and, consistent with prediction by the neutral theory, we find evidence of strong purifying selection. These results constitute in-depth insight into the dynamics of highly polymorphic genomes and provide important empirical support of population genetic theory as it pertains to population size, heterozygosity, and natural selection.
View details for DOI 10.1073/pnas.0700890104
View details for Web of Science ID 000245331700079
View details for PubMedID 17372217
View details for PubMedCentralID PMC1838466
-
A haplome alignment and reference sequence of the highly polymorphic Ciona savignyi genome
GENOME BIOLOGY
2007; 8 (3)
Abstract
The sequence of Ciona savignyi was determined using a whole-genome shotgun strategy, but a high degree of polymorphism resulted in a fractured assembly wherein allelic sequences from the same genomic region assembled separately. We designed a multistep strategy to generate a nonredundant reference sequence from the original assembly by reconstructing and aligning the two 'haplomes' (haploid genomes). In the resultant 174 megabase reference sequence, each locus is represented once, misassemblies are corrected, and contiguity and continuity are dramatically improved.
View details for DOI 10.1186/gb-2007-8-3-r41
View details for Web of Science ID 000246081600014
View details for PubMedID 17374142
View details for PubMedCentralID PMC1868934
-
Structural and molecular evolutionary analysis of Agouti and Agouti-related proteins
CHEMISTRY & BIOLOGY
2006; 13 (12): 1297-1305
Abstract
Agouti (ASIP) and Agouti-related protein (AgRP) are endogenous antagonists of melanocortin receptors that play critical roles in the regulation of pigmentation and energy balance, respectively, and which arose from a common ancestral gene early in vertebrate evolution. The N-terminal domain of ASIP facilitates antagonism by binding to an accessory receptor, but here we show that the N-terminal domain of AgRP has the opposite effect and acts as a prodomain that negatively regulates antagonist function. Computational analysis reveals similar patterns of evolutionary constraint in the ASIP and AgRP C-terminal domains, but fundamental differences between the N-terminal domains. These studies shed light on the relationships between regulation of pigmentation and body weight, and they illustrate how evolutionary structure function analysis can reveal both unique and common mechanisms of action for paralogous gene products.
View details for DOI 10.1016/j.chembiol.2006.10.006
View details for Web of Science ID 000243323600008
View details for PubMedID 17185225
-
De novo discovery of a tissue-specific gene regulatory module in a chordate
GENOME RESEARCH
2005; 15 (10): 1315-1324
Abstract
We engage the experimental and computational challenges of de novo regulatory module discovery in a complex and largely unstudied metazoan genome. Our analysis is based on the comprehensive characterization of regulatory elements of 20 muscle genes in the chordate, Ciona savignyi. Three independent types of data we generate contribute to the characterization of a muscle-specific regulatory module: (1) Positive elements (PEs), short sequences sufficient for strong muscle expression that are identified in a high-resolution in vivo analysis; (2) CisModules (CMs), candidate regulatory modules defined by clusters of overrepresented motifs predicted de novo; and (3) Conserved elements (CEs), short noncoding sequences of strong conservation between C. savignyi and C. intestinalis. We estimate the accuracy of the computational predictions by an analysis of the intersection of these data. As final biological validation of the discovered muscle regulatory module, we implement a novel algorithm to search the genome for instances of the module and identify seven novel enhancers.
View details for DOI 10.1101/gr.4062605
View details for PubMedID 16169925
-
Distribution and intensity of constraint in mammalian genomic sequence
GENOME RESEARCH
2005; 15 (7): 901-913
Abstract
Comparisons of orthologous genomic DNA sequences can be used to characterize regions that have been subject to purifying selection and are enriched for functional elements. We here present the results of such an analysis on an alignment of sequences from 29 mammalian species. The alignment captures approximately 3.9 neutral substitutions per site and spans approximately 1.9 Mbp of the human genome. We identify constrained elements from 3 bp to over 1 kbp in length, covering approximately 5.5% of the human locus. Our estimate for the total amount of nonexonic constraint experienced by this locus is roughly twice that for exonic constraint. Constrained elements tend to cluster, and we identify large constrained regions that correspond well with known functional elements. While constraint density inversely correlates with mobile element density, we also show the presence of unambiguously constrained elements overlapping mammalian ancestral repeats. In addition, we describe a number of elements in this region that have undergone intense purifying selection throughout mammalian evolution, and we show that these important elements are more numerous than previously thought. These results were obtained with Genomic Evolutionary Rate Profiling (GERP), a statistically rigorous and biologically transparent framework for constrained element identification. GERP identifies regions at high resolution that exhibit nucleotide substitution deficits, and measures these deficits as "rejected substitutions". Rejected substitutions reflect the intensity of past purifying selection and are used to rank and characterize constrained elements. We anticipate that GERP and the types of analyses it facilitates will provide further insights and improved annotation for the human genome as mammalian genome sequence data become richer.
View details for Web of Science ID 000230424000001
View details for PubMedID 15965027
-
Physicochemical constraint violation by missense substitutions mediates impairment of protein function and disease severity
GENOME RESEARCH
2005; 15 (7): 978-986
Abstract
We find that the degree of impairment of protein function by missense variants is predictable by comparative sequence analysis alone. The applicable range of impairment is not confined to binary predictions that distinguish normal from deleterious variants, but extends continuously from mild to severe effects. The accuracy of predictions is strongly dependent on sequence variation and is highest when diverse orthologs are available. High predictive accuracy is achieved by quantification of the physicochemical characteristics in each position of the protein, based on observed evolutionary variation. The strong relationship between physicochemical characteristics of a missense variant and impairment of protein function extends to human disease. By using four diverse proteins for which sufficient comparative sequence data are available, we show that grades of disease, or likelihood of developing cancer, correlate strongly with physicochemical constraint violation by causative amino acid variants.
View details for Web of Science ID 000230424000009
View details for PubMedID 15965030
-
Phenotype-genotype correlation in Hirschsprung disease is illuminated by comparative analysis of the RET protein sequence
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2005; 102 (25): 8949-8954
Abstract
The ability to discriminate between deleterious and neutral amino acid substitutions in the genes of patients remains a significant challenge in human genetics. The increasing availability of genomic sequence data from multiple vertebrate species allows inclusion of sequence conservation and physicochemical properties of residues to be used for functional prediction. In this study, the RET receptor tyrosine kinase serves as a model disease gene in which a broad spectrum (≥116) of disease-associated mutations has been identified among patients with Hirschsprung disease and multiple endocrine neoplasia type 2. We report the alignment of the human RET protein sequence with the orthologous sequences of 12 non-human vertebrates (eight mammalian, one avian, and three teleost species), their comparative analysis, the evolutionary topology of the RET protein, and predicted tolerance for all published missense mutations. We show that, although evolutionary conservation alone provides significant information to predict the effect of a RET mutation, a model that combines comparative sequence data with analysis of physicochemical properties in a quantitative framework provides far greater accuracy. Although the ability to discern the impact of a mutation is imperfect, our analyses permit substantial discrimination between predicted functional classes of RET mutations and disease severity even for a multigenic disease such as Hirschsprung disease.
View details for Web of Science ID 000230049500031
View details for PubMedID 15956201
View details for PubMedCentralID PMC1157046
-
Trade-offs in detecting evolutionarily constrained sequence by comparative genomics
ANNUAL REVIEW OF GENOMICS AND HUMAN GENETICS
2005; 6: 143-164
Abstract
As whole-genome sequencing efforts extend beyond more traditional model organisms to include a deep diversity of species, comparative genomic analyses will be further empowered to reveal insights into the human genome and its evolution. The discovery and annotation of functional genomic elements is a necessary step toward a detailed understanding of our biology, and sequence comparisons have proven to be an integral tool for that task. This review is structured to broadly reflect the statistical challenges in discriminating these functional elements from the bulk of the genome that has evolved neutrally. Specifically, we review the comparative genomics literature in terms of specificity, sensitivity, and phylogenetic scope, as well as the trade-offs that relate these factors in standard analyses. We consider the impact of an expanding diversity of orthologous sequences on our ability to resolve functional elements. This impact is assessed through both recent comparative analyses of deep alignments and mathematical modeling.
View details for DOI 10.1146/annurev.genom.6.080604.162146
View details for Web of Science ID 000232441500008
View details for PubMedID 16124857
-
ABC: software for interactive browsing of genomic multiple sequence alignment data
BMC BIOINFORMATICS
2004; 5
Abstract
Alignment and comparison of related genome sequences is a powerful method to identify regions likely to contain functional elements. Such analyses are data intensive, requiring the inclusion of genomic multiple sequence alignments, sequence annotations, and scores describing regional attributes of columns in the alignment. Visualization and browsing of results can be difficult, and there are currently limited software options for performing this task. The Application for Browsing Constraints (ABC) is interactive Java software for intuitive and efficient exploration of multiple sequence alignments and data typically associated with alignments. It is used to move quickly from a summary view of the entire alignment via arbitrary levels of resolution to individual alignment columns. It allows for the simultaneous display of quantitative data (e.g., sequence similarity or evolutionary rates) and annotation data (e.g., the locations of genes, repeats, and constrained elements). It can be used to facilitate basic comparative sequence tasks, such as export of data in plain-text formats, visualization of phylogenetic trees, and generation of alignment summary graphics. The ABC is a lightweight, stand-alone, and flexible graphical user interface for browsing genomic multiple sequence alignments of specific loci, up to hundreds of kilobases or a few megabases in length. It is coded in Java for cross-platform use and the program and source code are freely available under the General Public License. Documentation and a sample data set are also available at http://mendel.stanford.edu/sidowlab/downloads.html.
View details for DOI 10.1186/1471-2105-5-192
View details for Web of Science ID 000226622100001
View details for PubMedID 15588288
View details for PubMedCentralID PMC539296
-
Noncoding regulatory sequences of Ciona exhibit strong correspondence between evolutionary constraint and functional importance
GENOME RESEARCH
2004; 14 (12): 2448-2456
Abstract
We show that sequence comparisons at different levels of resolution can efficiently guide functional analyses of regulatory regions in the ascidians Ciona savignyi and Ciona intestinalis. Sequence alignments of several tissue-specific genes guided discovery of minimal regulatory regions that are active in whole-embryo reporter assays. Using the Troponin I (TnI) locus as a case study, we show that more refined local sequence analyses can then be used to reveal functional substructure within a regulatory region. A high-resolution saturation mutagenesis in conjunction with comparative sequence analyses defined essential sequence elements within the TnI regulatory region. Finally, we found a significant, quantitative relationship between function and sequence divergence of noncoding functional elements. This work demonstrates the power of comparative sequence analysis between the two Ciona species for guiding gene regulatory experiments.
View details for DOI 10.1101/gr.2964504
View details for PubMedID 15545496
-
Genome sequence of the Brown Norway rat yields insights into mammalian evolution
NATURE
2004; 428 (6982): 493-521
Abstract
The laboratory rat (Rattus norvegicus) is an indispensable tool in experimental medicine and drug development, having made inestimable contributions to human health. We report here the genome sequence of the Brown Norway (BN) rat strain. The sequence represents a high-quality 'draft' covering over 90% of the genome. The BN rat sequence is the third complete mammalian genome to be deciphered, and three-way comparisons with the human and mouse genomes resolve details of mammalian evolution. This first comprehensive analysis includes genes and proteins and their relation to human disease, repeated sequences, comparative genome-wide studies of mammalian orthologous chromosomal regions and rearrangement breakpoints, reconstruction of ancestral karyotypes and the events leading to existing species, rates of variation, and lineage-specific and lineage-independent evolutionary events such as expansion of gene families, orthology relations and protein evolution.
View details for DOI 10.1038/nature02426
View details for Web of Science ID 000220540100032
View details for PubMedID 15057822
-
Automated whole-genome multiple alignment of rat, mouse, and human
GENOME RESEARCH
2004; 14 (4): 685-692
Abstract
We have built a whole-genome multiple alignment of the three currently available mammalian genomes using a fully automated pipeline that combines the local/global approach of the Berkeley Genome Pipeline and the LAGAN program. The strategy is based on progressive alignment and consists of two main steps: (1) alignment of the mouse and rat genomes, and (2) alignment of human to either the mouse-rat alignments from step 1, or the remaining unaligned mouse and rat sequences. The resulting alignments demonstrate high sensitivity, with 87% of all human gene-coding areas aligned in both mouse and rat. The specificity is also high: <7% of the rat contigs are aligned to multiple places in human, and 97% of all alignments with human sequence >100 kb agree with a three-way synteny map built independently, using predicted exons in the three genomes. At the nucleotide level <1% of the rat nucleotides are mapped to multiple places in the human sequence in the alignment, and 96.5% of human nucleotides within all alignments agree with the synteny map. The alignments are publicly available online, with visualization through the novel Multi-VISTA browser that we also present.
View details for DOI 10.1101/gr.2067704
View details for Web of Science ID 000220629900022
View details for PubMedID 15060011
View details for PubMedCentralID PMC383314
-
Characterization of evolutionary rates and constraints in three mammalian genomes
GENOME RESEARCH
2004; 14 (4): 539-548
Abstract
We present an analysis of rates and patterns of microevolutionary phenomena that have shaped the human, mouse, and rat genomes since their last common ancestor. We find evidence for a shift in the mutational spectrum between the mouse and rat lineages, with the net effect being a relative increase in GC content in the rat genome. Our estimate for the neutral point substitution rate separating the two rodents is 0.196 substitutions per site, and 0.65 substitutions per site for the tree relating all three mammals. Small insertions and deletions of 1-10 bp in length ("microindels") occur at approximately 5% of the point substitution rate. Inferred regional correlations in evolutionary rates between lineages and between types of sites support the idea that rates of evolution are influenced by local genomic or cell biological context. No substantial correlations between rates of point substitutions and rates of microindels are found, however, implying that the influences that affect these processes are distinct. Finally, we have identified those regions in the human genome that are evolving slowly, which are likely to include functional elements important to human biology. At least 5% of the human genome is under substantial constraint, most of which is noncoding.
View details for DOI 10.1101/gr.2034704
View details for Web of Science ID 000220629900005
View details for PubMedID 15059994
View details for PubMedCentralID PMC383297
-
Chaining algorithms for alignment of draft sequence
4th International Workshop on Algorithms in Bioinformatics (WABI 2004)
SPRINGER-VERLAG BERLIN. 2004: 326–337
View details for Web of Science ID 000224116500028
-
Genomic regulatory regions: insights from comparative sequence analysis
CURRENT OPINION IN GENETICS & DEVELOPMENT
2003; 13 (6): 604-610
Abstract
Comparative sequence analysis is contributing to the identification and characterization of genomic regulatory regions with functional roles. It is effective because functionally important regions tend to evolve at a slower rate than do less important regions. The choice of species for comparative analysis is crucial: shared ancestry of a clade of species facilitates the discovery of genomic features important to that clade, whereas increased sequence divergence improves the resolution at which features can be discovered. Recent studies suggest that comparative analyses are useful for all branches of life and that, in the near future, large-scale mammalian comparative sequence analysis will provide the best approach for the comprehensive discovery of human regulatory elements.
View details for DOI 10.1016/j.gde.2003.10.001
View details for Web of Science ID 000187248400009
View details for PubMedID 14638322
-
Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes
GENOME RESEARCH
2003; 13 (5): 813-820
Abstract
Comparative sequence analyses on a collection of carefully chosen mammalian genomes could facilitate identification of functional elements within the human genome and allow quantification of evolutionary constraint at the single nucleotide level. High-resolution quantification would be informative for determining the distribution of important positions within functional elements and for evaluating the relative importance of nucleotide sites that carry single nucleotide polymorphisms (SNPs). Because the level of resolution in comparative sequence analyses is a direct function of sequence diversity, we propose that the information content of a candidate mammalian genome be defined as the sequence divergence it would add relative to already-sequenced genomes. We show that reliable estimates of genomic sequence divergence can be obtained from small genomic regions. On the basis of a multiple sequence alignment of approximately 1.4 megabases each from eight mammals, we generate such estimates for five unsequenced mammals. Estimates of the neutral divergence in these data suggest that a small number of diverse mammalian genomes in addition to human, mouse, and rat would allow single nucleotide resolution in comparative sequence analyses.
View details for DOI 10.1101/gr.1064503
View details for Web of Science ID 000182645500007
View details for PubMedID 12727901
View details for PubMedCentralID PMC430923
-
LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA
GENOME RESEARCH
2003; 13 (4): 721-731
Abstract
To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences, and accurate enough to correctly align the conserved biological features between distant species. We present LAGAN, a system for rapid global alignment of two homologous genomic sequences, and Multi-LAGAN, a system for multiple global alignment of genomic sequences. We tested our systems on a data set consisting of greater than 12 Mb of high-quality sequence from 12 vertebrate species. All the sequence was derived from the genomic region orthologous to an approximately 1.5-Mb region on human chromosome 7q31.3. We found that both LAGAN and Multi-LAGAN compare favorably with other leading alignment methods in correctly aligning protein-coding exons, especially between distant homologs such as human and chicken, or human and fugu. Multi-LAGAN produced the most accurate alignments, while requiring just 75 minutes on a personal computer to obtain the multiple alignment of all 12 sequences. Multi-LAGAN is a practical method for generating multiple alignments of long genomic sequences at any evolutionary distance. Our systems are publicly available at http://lagan.stanford.edu.
View details for DOI 10.1101/gr.926603
View details for PubMedID 12654723
-
The integrity of a cholesterol-binding pocket in Niemann-Pick C2 protein is necessary to control lysosome cholesterol levels
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2003; 100 (5): 2518-2525
Abstract
The neurodegenerative disease Niemann-Pick Type C2 (NPC2) results from mutations in the NPC2 (HE1) gene that cause abnormally high cholesterol accumulation in cells. We find that purified NPC2, a secreted soluble protein, binds cholesterol specifically with a much higher affinity (K(d) = 30-50 nM) than previously reported. Genetic and biochemical studies identified single amino acid changes that prevent both cholesterol binding and the restoration of normal cholesterol levels in mutant cells. The amino acids that affect cholesterol binding surround a hydrophobic pocket in the NPC2 protein structure, identifying a candidate sterol-binding location. On the basis of evolutionary analysis and mutagenesis, three other regions of the NPC2 protein emerged as important, including one required for efficient secretion.
View details for DOI 10.1073/pnas.0530027100
View details for Web of Science ID 000181365000065
View details for PubMedID 12591949
View details for PubMedCentralID PMC151373
-
Functional evolution in the ancestral lineage of vertebrates or when genomic complexity was wagging its morphological tail.
Journal of structural and functional genomics
2003; 3 (1-4): 45-52
Abstract
Early vertebrate evolution is characterized by a significant increase of organismal complexity over a relatively short time span. We present quantitative evidence for a high rate of increase in morphological complexity during early vertebrate evolution. Possible molecular evolutionary mechanisms that underlie this increase in complexity fall into a small number of categories, one of which is gene duplication and subsequent structural or regulatory neofunctionalization. We discuss analyses of two gene families whose regulatory and structural evolution shed light on the connection between gene duplication and increases in organismal complexity.
View details for PubMedID 12836684
-
Sequence first. Ask questions later.
CELL
2002; 111 (1): 13-16
Abstract
Comparative sequence analyses of eukaryotic genes and genomic regions are beginning to provide a wealth of information that is directly relevant to human biology. Functional changes that set us apart from apes are identifiable, as are functional constraints in proteins and genomic elements that arose in our relatively distant phylogenetic past.
View details for Web of Science ID 000178461900004
View details for PubMedID 12372296
-
Inference of functional regions in proteins by quantification of evolutionary constraints
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2002; 99 (5): 2912-2917
Abstract
Likelihood estimates of local rates of evolution within proteins reveal that selective constraints on structure and function are quantitatively stable over billions of years of divergence. The stability of constraints produces an intramolecular clock that gives each protein a characteristic pattern of evolutionary rates along its sequence. This pattern allows the identification of constrained regions and, because the rate of evolution is a quantitative measure of the strength of the constraint, of their functional importance. We show that results from such analyses, which require only sequence alignments, are consistent with experimental and mutational data. The methodology has significant predictive power and may be used to guide structure-function studies for any protein represented by a modest number of homologs in sequence databases.
View details for DOI 10.1073/pnas.042692299
View details for Web of Science ID 000174284600059
View details for PubMedID 11880638
View details for PubMedCentralID PMC122447
-
Partitioning of tissue expression accompanies multiple duplications of the Na+/K+ ATPase alpha subunit gene
GENOME RESEARCH
2001; 11 (10): 1625-1631
Abstract
Vertebrate genomes contain multiple copies of related genes that arose through gene duplication. In the past it has been proposed that these duplicated genes were retained because of acquisition of novel beneficial functions. A more recent model, the duplication-degeneration-complementation hypothesis (DDC), posits that the functions of a single gene may become separately allocated among the duplicated genes, rendering both duplicates essential. Thus far, empirical evidence for this model has been limited to the engrailed and sox family of developmental regulators, and it has been unclear whether it may also apply to ubiquitously expressed genes with essential functions for cell survival. Here we describe the cloning of three zebrafish alpha subunits of the Na(+),K(+)-ATPase and a comprehensive evolutionary analysis of this gene family. The predicted amino acid sequences are extremely well conserved among vertebrates. The evolutionary relationships and the map positions of these genes and of other alpha-like sequences indicate that both tandem and ploidy duplications contributed to the expansion of this gene family in the teleost lineage. The duplications are accompanied by acquisition of clear functional specialization, consistent with the DDC model of genome evolution.
View details for Web of Science ID 000171456000004
View details for PubMedID 11591639
-
A double-deletion mutation in the Pitx3 gene causes arrested lens development in aphakia mice
GENOMICS
2001; 72 (1): 61-72
Abstract
The recessive aphakia (ak) mouse mutant is characterized by bilateral microphthalmia due to a failure of lens morphogenesis. We fine-mapped the ak locus to the interval between D19Umi1 and D19Mit9, developed new polymorphic markers, and mapped candidate genes by construction of a BAC contig. The Pitx3 gene, known to be expressed in lens primordia, shows zero recombination with the ak mutation on our intersubspecific intercross panel representing 1170 meioses. A recent report described a deletion in the intergenic region between Gbf1 and Pitx3 as the possible ak mutation. Our results differ in that we find not only the distant intergenic deletion, but also a much larger deletion directly in the Pitx3 gene, eliminating exon 1 and extending into intron 1 and the promoter region. Pitx3 transcript levels are severely reduced in ak/ak mice from E11.5 to newborn (5 ± 1% of the wildtype levels at E13.5), while an involvement of the flanking Gbf1 and Cig30 genes in the aberrant lens development is highly unlikely based on expression analysis. We conclude that the ak mutation consists of two deletions, the larger of which removes part of Pitx3, indicating a crucial role of this gene in early lens development.
View details for Web of Science ID 000167553700007
View details for PubMedID 11247667
-
A novel member of the F-box/WD40 gene family, encoding dactylin, is disrupted in the mouse dactylaplasia mutant
NATURE GENETICS
1999; 23 (1): 104-107
Abstract
Early outgrowth of the vertebrate embryonic limb requires signalling by the apical ectodermal ridge (AER) to the progress zone (PZ), which in response proliferates and lays down the pattern of the presumptive limb in a proximal to distal progression. Signals from the PZ maintain the AER until the anlagen for the distal phalanges have been formed. The semidominant mouse mutant dactylaplasia (Dac) disrupts the maintenance of the AER, leading to truncation of distal structures of the developing footplate, or autopod. Adult Dac homozygotes thus lack hands and feet except for malformed single digits, whereas heterozygotes lack phalanges of the three middle digits. Dac resembles the human autosomal dominant split hand/foot malformation (SHFM) diseases. One of these, SHFM3, maps to chromosome 10q24 (Refs 6,7), which is syntenic to the Dac region on chromosome 19, and may disrupt the orthologue of Dac. We report here the positional cloning of Dac and show that it belongs to the F-box/WD40 gene family, which encodes adapters that target specific proteins for destruction by presenting them to the ubiquitination machinery. In conjunction with recent biochemical studies, this report demonstrates the importance of this gene family in vertebrate embryonic development.
View details for Web of Science ID 000082337300026
View details for PubMedID 10471509