Gill Bejerano
Professor of Developmental Biology, of Computer Science, of Pediatrics (Genetics) and of Biomedical Data Science
Web page: http://bejerano.stanford.edu/
Bio
Gill Bejerano holds a B.Sc. in Mathematics, Physics and Computer Science, and a Ph.D. in Computer Science (Machine Learning applications in Biology) from Hebrew University of Jerusalem. Gill got into genomics in 2003, started a wet lab in 2007, began analyzing patient genomes and medical records in 2014, got into cryptogenomics in 2017, and became very interested in healthcare economics and patient risk management in 2021. He has been recognized with multiple academic awards, including two best paper awards and a Tomorrow's PI award, as well as Mallinckrodt, Sloan, Human Frontiers, Searle, Okawa, David and Lucile Packard, Microsoft and Sony Scholar awards. Gill has trained, collaborated with, and advised computational scientists, experimentalists, clinicians, and MBAs, and has helped both start-ups and Fortune 500 companies.
Academic Appointments
-
Professor, Developmental Biology
-
Professor, Computer Science
-
Professor, Pediatrics - Medical Genetics
-
Professor, Department of Biomedical Data Science
-
Member, Bio-X
-
Member, Cardiovascular Institute
-
Faculty Affiliate, Institute for Human-Centered Artificial Intelligence (HAI)
-
Member, Stanford Cancer Institute
-
Member, Wu Tsai Neurosciences Institute
Administrative Appointments
-
Member, Editorial Board, Gene (2007 - 2008)
-
Technical Advisory Board, Numenta (2008 - Present)
Honors & Awards
-
Rector Prize & Dean's list for undergraduate achievements, Hebrew University (1993-1996)
-
Intel award for achievements, Hebrew University (1996)
-
Rector Prize & Dean's list for graduate studies achievements, Hebrew University (1997-1999)
-
Rachel & Salim Banin scholarship, Hebrew University (1999)
-
Best paper by a young scientist award, RECOMB conference (1999)
-
Levi Eshkol graduate studies fellowship, Hebrew University (1999-2002)
-
Best paper by a young scientist award, RECOMB conference (2003)
-
Junior Faculty Grant, Edward Mallinckrodt, Jr. Foundation (2007-2010)
-
Tomorrow's Principal Investigator, Genome Technology Magazine (2008)
-
Alfred P. Sloan Fellow, Alfred P. Sloan Foundation (2008-2010)
-
Young Investigator Award, Human Frontier Science Program (2008-2011)
-
Searle Scholar, Searle Scholars Program (2008-2011)
-
Research Grant Award, Okawa Foundation (2008)
-
Fellow, David & Lucile Packard Foundation (2008-2013)
-
New Faculty Fellow, Microsoft Research (2009)
Professional Education
-
Ph.D., Hebrew University, Computer Science (2004)
-
B.Sc., Hebrew University, Physics, Mathematics, Computer Science (summa cum laude) (1997)
Current Research and Scholarly Interests
The Bejerano lab's interests evolve continuously. As of 2021 they span data science, genomic variants of large effect, cryptogenomics, machine learning with electronic health records, and healthcare economics.
We have done seminal work and continue to play an active role in:
1. Automating monogenic patient diagnosis and reanalysis.
2. The genomic signatures of independent divergent and convergent trait evolution in mammals.
3. The logic of human gene regulation.
4. The reasons for sequence ultraconservation.
5. Cryptogenomics to bridge medical silos.
6. Cryptogenetics to debate social injustice.
We are also getting quite interested in:
7. Managing patient risk using machine learning.
8. Understanding the incentive structure of the US healthcare system.
2024-25 Courses
- Foundations of Computational Human Genomics
BIOMEDIN 173A, CS 173A, DBIO 173A (Aut)
Independent Studies (23)
- Advanced Reading and Research
CS 499 (Aut, Win, Spr, Sum)
- Advanced Reading and Research
CS 499P (Aut, Win, Spr, Sum)
- Biomedical Informatics Teaching Methods
BIOMEDIN 290 (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390A (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390B (Aut, Win, Spr, Sum)
- Curricular Practical Training
CS 390C (Aut, Win, Spr, Sum)
- Directed Reading and Research
BIOMEDIN 299 (Aut, Win, Spr, Sum)
- Directed Reading in Developmental Biology
DBIO 299 (Aut, Win, Spr, Sum)
- Directed Reading in Neurosciences
NEPR 299 (Aut, Win, Spr, Sum)
- Graduate Research
DBIO 399 (Aut, Win, Spr, Sum)
- Independent Project
CS 399 (Aut, Win, Spr, Sum)
- Independent Project
CS 399P (Aut, Win, Spr, Sum)
- Independent Work
CS 199 (Aut, Win, Spr, Sum)
- Independent Work
CS 199P (Aut, Win, Spr, Sum)
- Medical Scholars Research
BIOMEDIN 370 (Aut, Win, Spr, Sum)
- Medical Scholars Research
DBIO 370 (Aut, Win, Spr, Sum)
- Part-time Curricular Practical Training
CS 390D (Aut, Win, Spr, Sum)
- Programming Service Project
CS 192 (Aut, Win, Spr, Sum)
- Research
PHYSICS 490 (Aut, Win, Spr, Sum)
- Senior Project
CS 191 (Aut, Win, Spr, Sum)
- Supervised Undergraduate Research
CS 195 (Aut, Win, Spr, Sum)
- Undergraduate Research
DBIO 199 (Aut, Win, Spr, Sum)
- Writing Intensive Senior Research Project
CS 191W (Aut, Win, Spr)
Prior Year Courses
2023-24 Courses
- Foundations of Computational Human Genomics
BIOMEDIN 173A, CS 173A, DBIO 173A (Win)
2022-23 Courses
- Foundations of Computational Human Genomics
BIOMEDIN 173A, CS 173A, DBIO 173A (Win)
Stanford Advisees
-
Master's Program Advisor
Arya Bakhtiar, Vinita Cheepurupalli, Hee Jung Choi, Hannah Cussen, Esi Donkor, Raghav Garg, Hamed Hekmat, Ilan Ladabaum, Will Reilly, Lillian Weng
-
Undergraduate Major Advisor
Ronit Jain
Graduate and Fellowship Programs
-
Biomedical Data Science (Masters Program)
-
Biomedical Data Science (Phd Program)
-
Developmental-Behavioral Pediatrics (Fellowship Program)
-
Human Genetics and Genetic Counseling (Masters Program)
-
Medical Genetics (Fellowship Program)
-
Molecular and Genetic Medicine (Fellowship Program)
All Publications
-
Large-scale mutational analysis identifies UNC93B1 variants that drive TLR-mediated autoimmunity in mice and humans.
The Journal of experimental medicine
2024; 221 (8)
Abstract
Nucleic acid-sensing Toll-like receptors (TLR) 3, 7/8, and 9 are key innate immune sensors whose activities must be tightly regulated to prevent systemic autoimmune or autoinflammatory disease or virus-associated immunopathology. Here, we report a systematic scanning-alanine mutagenesis screen of all cytosolic and luminal residues of the TLR chaperone protein UNC93B1, which identified both negative and positive regulatory regions affecting TLR3, TLR7, and TLR9 responses. We subsequently identified two families harboring heterozygous coding mutations in UNC93B1, UNC93B1+/T93I and UNC93B1+/R336C, both in key negative regulatory regions identified in our screen. These patients presented with cutaneous tumid lupus and juvenile idiopathic arthritis plus neuroinflammatory disease, respectively. Disruption of UNC93B1-mediated regulation by these mutations led to enhanced TLR7/8 responses, and both variants resulted in systemic autoimmune or inflammatory disease when introduced into mice via genome editing. Altogether, our results implicate the UNC93B1-TLR7/8 axis in human monogenic autoimmune diseases and provide a functional resource to assess the impact of yet-to-be-reported UNC93B1 mutations.
View details for DOI 10.1084/jem.20232005
View details for PubMedID 38780621
-
The Undiagnosed Diseases Network: Characteristics of solvable applicants and diagnostic suggestions for non-accepted ones.
Genetics in medicine : official journal of the American College of Medical Genetics
2024: 101203
Abstract
Can certain characteristics identify as solvable some undiagnosed patients who seek extensive evaluation and thorough record review, like by the Undiagnosed Diseases Network (UDN)? The UDN is a national research resource to solve medical mysteries through team science. Applicants provide informed consent to access to their medical records. After review, expert panels assess if applicants meet inclusion and exclusion criteria to select participants. When not accepting applicants, UDN experts may offer suggestions for diagnostic efforts. Using minimal information from initial applications, we compare features in applicants not accepted with those accepted and either solved or still not solved by the UDN. The diagnostic suggestions offered to non-accepted applicants and their clinicians were tallied. Non-accepted applicants were more often female, older at first symptoms and application, and longer in review than accepted applicants. The accepted and successfully diagnosed applicants were younger in ages, shorter in review time, more often non-white, of Hispanic ethnicity, and presenting with nervous system features. Half of non-accepted applicants were given suggestions for further local diagnostic evaluation. A few seemed to have two major diagnoses or a provocative environmental exposure history. Comprehensive UDN record review generates possibly helpful advice.
View details for DOI 10.1016/j.gim.2024.101203
View details for PubMedID 38967101
-
Loss of function of FAM177A1, a Golgi complex localized protein, causes a novel neurodevelopmental disorder.
Genetics in medicine : official journal of the American College of Medical Genetics
2024: 101166
Abstract
The function of FAM177A1 and its relationship to human disease is largely unknown. Recent studies have demonstrated FAM177A1 to be a critical immune-associated gene. One previous case study has linked FAM177A1 to a neurodevelopmental disorder in four siblings. We identified five individuals from three unrelated families with biallelic variants in FAM177A1. The physiological function of FAM177A1 was studied in a zebrafish model organism and human cell lines with loss-of-function variants similar to the affected cohort. These individuals share a characteristic phenotype defined by macrocephaly, global developmental delay, intellectual disability, seizures, behavioral abnormalities, hypotonia, and gait disturbance. We show that FAM177A1 localizes to the Golgi complex in mammalian and zebrafish cells. Intersection of the RNA-seq and metabolomic datasets from FAM177A1-deficient human fibroblasts and whole zebrafish larvae demonstrated dysregulation of pathways associated with apoptosis, inflammation, and negative regulation of cell proliferation. Our data sheds light on the emerging function of FAM177A1 and defines FAM177A1-related neurodevelopmental disorder as a new clinical entity.
View details for DOI 10.1016/j.gim.2024.101166
View details for PubMedID 38767059
-
Immunological and hematological findings as major features in a patient with a new germline pathogenic CBL variant.
American journal of medical genetics. Part A
2024: e63627
Abstract
Casitas B-lineage lymphoma (CBL) encodes an adaptor protein with E3-ligase activity negatively controlling intracellular signaling downstream of receptor tyrosine kinases. Somatic CBL mutations play a driver role in a variety of cancers, particularly myeloid malignancies, whereas germline defects in the same gene underlie a RASopathy having clinical overlap with Noonan syndrome (NS) and predisposing to juvenile myelomonocytic leukemia and vasculitis. Other features of the disorder include cardiac defects, postnatal growth delay, cryptorchidism, facial dysmorphisms, and predisposition to develop autoimmune disorders. Here we report a novel CBL variant (c.1202G>T; p.Cys401Phe) occurring de novo in a subject with café-au-lait macules, feeding difficulties, mild dysmorphic features, psychomotor delay, autism spectrum disorder, thrombocytopenia, hepatosplenomegaly, and recurrent hypertransaminasemia. The identified variant affects an evolutionarily conserved residue located in the RING finger domain, a known mutational hot spot of both germline and somatic mutations. Functional studies documented enhanced EGF-induced ERK phosphorylation in transiently transfected COS1 cells. The present findings further support the association of pathogenic CBL variants with immunological and hematological manifestations in the context of a presentation with only minor findings reminiscent of NS or a clinically related RASopathy.
View details for DOI 10.1002/ajmg.a.63627
View details for PubMedID 38613168
-
Exome and genome sequencing in a heterogeneous population of patients with rare disease: Identifying predictors of a diagnosis.
Genetics in medicine : official journal of the American College of Medical Genetics
2024: 101115
Abstract
Exome (ES) and genome sequencing (GS) are increasingly being utilized for individuals with rare and undiagnosed diseases; however, guidelines on their use remain limited. This study aimed to identify factors associated with diagnosis by ES and/or GS in a heterogeneous population of patients with rare and undiagnosed diseases. In this case control study, we reviewed data from 400 diagnosed and 400 undiagnosed randomly selected participants in the Undiagnosed Diseases Network (UDN), all of whom had undergone ES and/or GS. We analyzed factors associated with receiving a diagnosis by ES and/or GS. Factors associated with a decreased odds of being diagnosed included adult symptom onset, singleton sequencing, and having undergone ES and/or GS prior to acceptance to the UDN (48%, 51%, and 32% lower odds, respectively). Factors that increased the odds of being diagnosed by ES and/or GS included having primarily neurological symptoms and having undergone prior chromosomal microarray testing (44% and 59% higher odds, respectively). We identified several factors that were associated with receiving a diagnosis by ES and/or GS. This will ideally inform the utilization of ES and/or GS and help manage expectations of individuals and families undergoing these tests.
View details for DOI 10.1016/j.gim.2024.101115
View details for PubMedID 38436216
-
Recurring homozygous ACTN2 variant (p.Arg506Gly) causes a recessive myopathy.
Annals of clinical and translational neurology
2024
Abstract
ACTN2, encoding alpha-actinin-2, is essential for cardiac and skeletal muscle sarcomeric function. ACTN2 variants are a known cause of cardiomyopathy without skeletal muscle involvement. Recently, specific dominant monoallelic variants were reported as a rare cause of core myopathy of variable clinical onset, although the pathomechanism remains to be elucidated. The possibility of a recessively inherited ACTN2-myopathy has also been proposed in a single series. We provide clinical, imaging, and histological characterization of a series of patients with a novel biallelic ACTN2 variant. We report seven patients from five families with a recurring biallelic variant in ACTN2: c.1516A>G (p.Arg506Gly), all manifesting with a consistent phenotype of asymmetric, progressive, proximal, and distal lower extremity predominant muscle weakness. None of the patients have cardiomyopathy or respiratory insufficiency. Notably, all patients report Palestinian ethnicity, suggesting a possible founder ACTN2 variant, which was confirmed through haplotype analysis in two families. Muscle biopsies reveal an underlying myopathic process with disruption of the intermyofibrillar architecture, Type I fiber predominance and atrophy. MRI of the lower extremities demonstrate a distinct pattern of asymmetric muscle involvement with selective involvement of the hamstrings and adductors in the thigh, and anterior tibial group and soleus in the lower leg. Using an in vitro splicing assay, we show that c.1516A>G ACTN2 does not impair normal splicing. This series further establishes ACTN2 as a muscle disease gene, now also including variants with a recessive inheritance mode, and expands the clinical spectrum of actinopathies to adult-onset progressive muscle disease.
View details for DOI 10.1002/acn3.51983
View details for PubMedID 38311799
-
Genomics Research with Undiagnosed Children: Ethical Challenges at the Boundaries of Research and Clinical Care
JOURNAL OF PEDIATRICS
2023; 261
View details for DOI 10.1016/j.jpeds.2023.113537
View details for Web of Science ID 001029333600001
-
Whole-genome Comparisons Identify Repeated Regulatory Changes Underlying Convergent Appendage Evolution in Diverse Fish Lineages.
Molecular biology and evolution
2023; 40 (9)
Abstract
Fins are major functional appendages of fish that have been repeatedly modified in different lineages. To search for genomic changes underlying natural fin diversity, we compared the genomes of 36 percomorph fish species that span over 100 million years of evolution and either have complete or reduced pelvic and caudal fins. We identify 1,614 genomic regions that are well-conserved in fin-complete species but missing from multiple fin-reduced lineages. Recurrent deletions of conserved sequences in wild fin-reduced species are enriched for functions related to appendage development, suggesting that convergent fin reduction at the organismal level is associated with repeated genomic deletions near fin-appendage development genes. We used sequencing and functional enhancer assays to confirm that PelA, a Pitx1 enhancer previously linked to recurrent pelvic loss in sticklebacks, has also been independently deleted and may have contributed to the fin morphology in distantly related pelvic-reduced species. We also identify a novel enhancer that is conserved in the majority of percomorphs, drives caudal fin expression in transgenic stickleback, is missing in tetraodontiform, syngnathid, and synbranchid species with caudal fin reduction, and alters caudal fin development when targeted by genome editing. Our study illustrates a broadly applicable strategy for mapping phenotypes to genotypes across a tree of vertebrate species and highlights notable new examples of regulatory genomic hotspots that have been used to evolve recurrent phenotypes across 100 million years of fish evolution.
View details for DOI 10.1093/molbev/msad188
View details for PubMedID 37739926
-
Genomics Research with Undiagnosed Children: Ethical Challenges at the Boundaries of Research and Clinical Care.
The Journal of pediatrics
2023: 113537
Abstract
To explore the perspectives of parents of undiagnosed children enrolled in genomic diagnosis research regarding their motivations for enrolling their children, their understanding of the potential burdens and benefits, and the extent to which their experiences ultimately aligned with or diverged from their original expectations. In-depth interviews were conducted with parents, audio-recorded and transcribed. A structured codebook was applied to each transcript, after which iterative memoing was used to identify themes. Fifty-four parents participated, including 17 (31.5%) whose child received a diagnosis through research. Themes describing parents' expectations and experiences of genomic diagnosis research included: 1) the extent to which parents' motivations for participation focused on their hope that it would directly benefit their child; 2) the ways in which parents' frustrations regarding the research process confused the dual clinical and research goals of their participation; and 3) the limited clinical benefits parents ultimately experienced for their children. Our results suggest that parents of undiagnosed children seeking enrollment in genomic diagnosis research are at risk of a form of therapeutic misconception - in this case, diagnostic misconception. These findings indicate the need to examine the processes and procedures associated with this research in order to appropriately communicate and balance the potential burdens and benefits of study participation.
View details for DOI 10.1016/j.jpeds.2023.113537
View details for PubMedID 37271495
-
Participation in a national diagnostic research study: assessing the patient experience.
Orphanet journal of rare diseases
2023; 18 (1): 73
Abstract
INTRODUCTION: The Undiagnosed Diseases Network (UDN), a clinical research study funded by the National Institutes of Health, aims to provide answers for patients with undiagnosed conditions and generate knowledge about underlying disease mechanisms. UDN evaluations involve collaboration between clinicians and researchers and go beyond what is possible in clinical settings. While medical and research outcomes of UDN evaluations have been explored, this is the first formal assessment of the patient and caregiver experience. METHODS: We invited UDN participants and caregivers to participate in focus groups via email, newsletter, and a private participant Facebook group. We developed focus group questions based on research team expertise, literature focused on patients with rare and undiagnosed conditions, and UDN participant and family member feedback. In March 2021, we conducted, recorded, and transcribed four 60-min focus groups via Zoom. Transcripts were evaluated using a thematic analysis approach. RESULTS: The adult undiagnosed focus group described the UDN evaluation as validating and an avenue for access to medical providers. They also noted that the experience impacted professional choices and helped them rely on others for support. The adult diagnosed focus group described the healthcare system as not set up for rare disease. In the pediatric undiagnosed focus group, caregivers discussed a continued desire for information and gratitude for the UDN evaluation. They also described an ability to rule out information and coming to terms with not having answers. The pediatric diagnosed focus group discussed how the experience helped them focus on management and improved communication. Across focus groups, adults (undiagnosed/diagnosed) noted the comprehensiveness of the evaluation. Undiagnosed focus groups (adult/pediatric) discussed a desire for ongoing communication and care with the UDN. Diagnosed focus groups (adult/pediatric) highlighted the importance of the diagnosis they received in the UDN. The majority of the focus groups noted a positive future orientation after participation. CONCLUSION: Our findings are consistent with prior literature focused on the patient experience of rare and undiagnosed conditions and highlight benefits from comprehensive evaluations, regardless of whether a diagnosis is obtained. Focus group themes also suggest areas for improvement and future research related to the diagnostic odyssey.
View details for DOI 10.1186/s13023-023-02695-5
View details for PubMedID 37032333
-
Analysis of structural variation among inbred mouse strains.
BMC genomics
2023; 24 (1): 97
Abstract
BACKGROUND: 'Long read' sequencing methods have been used to identify previously uncharacterized structural variants that cause human genetic diseases. Therefore, we investigated whether long read sequencing could facilitate genetic analysis of murine models for human diseases. RESULTS: The genomes of six inbred strains (BTBR T+Itpr3tf/J, 129Sv1/J, C57BL/6/J, Balb/c/J, A/J, SJL/J) were analyzed using long read sequencing. Our results revealed that (i) Structural variants are very abundant within the genome of inbred strains (4.8 per gene) and (ii) that we cannot accurately infer whether structural variants are present using conventional short read genomic sequence data, even when nearby SNP alleles are known. The advantage of having a more complete map was demonstrated by analyzing the genomic sequence of BTBR mice. Based upon this analysis, knockin mice were generated and used to characterize a BTBR-unique 8-bp deletion within Draxin that contributes to the BTBR neuroanatomic abnormalities, which resemble human autism spectrum disorder. CONCLUSION: A more complete map of the pattern of genetic variation among inbred strains, which is produced by long read genomic sequencing of the genomes of additional inbred strains, could facilitate genetic discovery when murine models of human diseases are analyzed.
View details for DOI 10.1186/s12864-023-09197-5
View details for PubMedID 36864393
-
Whole-genome comparisons identify repeated regulatory changes underlying convergent appendage evolution in diverse fish lineages.
bioRxiv : the preprint server for biology
2023
Abstract
Fins are major functional appendages of fish that have been repeatedly modified in different lineages. To search for genomic changes underlying natural fin diversity, we compared the genomes of 36 wild fish species that either have complete or reduced pelvic and caudal fins. We identify 1,614 genomic regions that are well-conserved in fin-complete species but missing from multiple fin-reduced lineages. Recurrent deletions of conserved sequences (CONDELs) in wild fin-reduced species are enriched for functions related to appendage development, suggesting that convergent fin reduction at the organismal level is associated with repeated genomic deletions near fin-appendage development genes. We used sequencing and functional enhancer assays to confirm that PelA, a Pitx1 enhancer previously linked to recurrent pelvic loss in sticklebacks, has also been independently deleted and may have contributed to the fin morphology in distantly related pelvic-reduced species. We also identify a novel enhancer that is conserved in the majority of percomorphs, drives caudal fin expression in transgenic stickleback, is missing in tetraodontiform, syngnathid, and synbranchid species with caudal fin reduction, and which alters caudal fin development when targeted by genome editing. Our study illustrates a general strategy for mapping phenotypes to genotypes across a tree of vertebrate species, and highlights notable new examples of regulatory genomic hotspots that have been used to evolve recurrent phenotypes during 100 million years of fish evolution.
View details for DOI 10.1101/2023.01.30.526059
View details for PubMedID 36778215
View details for PubMedCentralID PMC9915506
-
A concurrent dual analysis of genomic data augments diagnoses: experiences of two clinical sites in the Undiagnosed Diseases Network.
Genetics in medicine : official journal of the American College of Medical Genetics
2022
Abstract
Next generation sequencing (NGS) has revolutionized the diagnostic process for rare/ultra-rare conditions. However, diagnosis rates differ between analytical pipelines. In the NIH-Undiagnosed Diseases Network (UDN) study, each individual's NGS data are concurrently analyzed by the UDN sequencing core laboratory and the clinical sites. We examined the outcomes of this practice. A retrospective review was performed at two UDN clinical sites, to compare variants, and diagnoses/candidate genes identified with the dual analyses of the NGS data. Ninety-five individuals had 100 diagnoses/candidate genes. There was 59% concordance between the UDN sequencing core laboratories and the clinical sites in identifying diagnoses/candidate genes. The core laboratory provided more diagnoses, while the clinical sites prioritized more research variants/candidate genes (p < 0.001). The clinical sites solely identified 15% of the diagnoses/candidate genes. The differences between the two pipelines were more often due to variant prioritization disparities, than variant detection. The unique dual analysis of NGS data in the UDN synergistically enhances outcomes. The core laboratory provides a clinical analysis with more diagnoses and the clinical sites prioritized more research variants/candidate genes. Implementing such concurrent dual analyses in other genomic research studies and clinical settings can improve both variant detection and prioritization.
View details for DOI 10.1016/j.gim.2022.12.001
View details for PubMedID 36481303
-
Discovering monogenic patients with a confirmed molecular diagnosis in millions of clinical notes with MonoMiner.
Genetics in medicine : official journal of the American College of Medical Genetics
2022
Abstract
PURPOSE: Cohort building is a powerful foundation for improving clinical care, performing biomedical research, recruiting for clinical trials, and many other applications. We set out to build a cohort of all monogenic patients with a definitive causal gene diagnosis in a 3-million patient hospital system. METHODS: We define a subset (4461) of OMIM diseases that have at least 1 known monogenic causal gene. We then introduce MonoMiner, a natural language processing framework to identify molecularly confirmed monogenic patients from free-text clinical notes. RESULTS: We show that ICD-10-CM codes cover only a fraction of monogenic diseases and that even where available, ICD-10-CM code‒based patient retrieval offers 0.14 precision. Searching by causal gene symbol offers great recall but has an even worse 0.07 precision. MonoMiner achieves 6 to 11 times higher precision (0.80), with 0.87 precision on disease diagnosis alone, tagging 4259 patients with 560 monogenic diseases and 534 causal genes, at 0.48 recall. CONCLUSION: MonoMiner enables the discovery of a large, high-precision cohort of patients with monogenic diseases with an established molecular diagnosis, empowering numerous downstream uses. Because it relies solely on clinical notes, MonoMiner is highly portable, and its approach is adaptable to other domains and languages.
View details for DOI 10.1016/j.gim.2022.07.008
View details for PubMedID 35976265
-
WhichTF is functionally important in your open chromatin data?
PLoS computational biology
2022; 18 (8): e1010378
Abstract
We present WhichTF, a computational method to identify functionally important transcription factors (TFs) from chromatin accessibility measurements. To rank TFs, WhichTF applies an ontology-guided functional approach to compute novel enrichment by integrating accessibility measurements, high-confidence pre-computed conservation-aware TF binding sites, and putative gene-regulatory models. Comparison with prior sheer abundance-based methods reveals the unique ability of WhichTF to identify context-specific TFs with functional relevance, including NF-kappaB family members in lymphocytes and GATA factors in cardiac cells. To distinguish the transcriptional regulatory landscape in closely related samples, we apply differential analysis and demonstrate its utility in lymphocyte, mesoderm developmental, and disease cells. We find suggestive, under-characterized TFs, such as RUNX3 in mesoderm development and GLI1 in systemic lupus erythematosus. We also find TFs known for stress response, suggesting routine experimental caveats that warrant careful consideration. WhichTF yields biological insight into known and novel molecular mechanisms of TF-mediated transcriptional regulation in diverse contexts, including human and mouse cell types, cell fate trajectories, and disease-associated cells.
View details for DOI 10.1371/journal.pcbi.1010378
View details for PubMedID 36040971
-
X-CAP improves pathogenicity prediction of stopgain variants.
Genome medicine
2022; 14 (1): 81
Abstract
Stopgain substitutions are the third-largest class of monogenic human disease mutations and often examined first in patient exomes. Existing computational stopgain pathogenicity predictors, however, exhibit poor performance at the high sensitivity required for clinical use. Here, we introduce a new classifier, termed X-CAP, which uses a novel training methodology and unique feature set to improve the AUROC by 18% and decrease the false-positive rate 4-fold on large variant databases. In patient exomes, X-CAP prioritizes causal stopgains better than existing methods do, further illustrating its clinical utility. X-CAP is available at https://github.com/bejerano-lab/X-CAP.
View details for DOI 10.1186/s13073-022-01078-y
View details for PubMedID 35906703
-
Champagne: Automated whole-genome phylogenomic character matrix method using large genomic indels for homoplasy-free inference.
Genome biology and evolution
2022
Abstract
We present Champagne, a whole-genome method for generating character matrices for phylogenomic analysis using large genomic indel events. By rigorously picking orthologous genes and locating large insertion and deletion events, Champagne delivers a character matrix that considerably reduces homoplasy compared to morphological and nucleotide-based matrices, on both established phylogenies and difficult-to-resolve nodes in the mammalian tree. Champagne provides ample evidence in the form of genomic structural variation to support incomplete lineage sorting and possible introgression in Paenungulata and human-chimp-gorilla which were previously inferred primarily through matrices composed of aligned single-nucleotide characters. Champagne also offers further evidence for Myomorpha as sister to Sciuridae and Hystricomorpha in the rodent tree. Champagne harbors distinct theoretical advantages as an automated method that produces nearly homoplasy-free character matrices on the whole-genome scale.
View details for DOI 10.1093/gbe/evac013
View details for PubMedID 35171243
-
Genetic counselor roles in the undiagnosed diseases network research study: Clinical care, collaboration, and curation.
Journal of genetic counseling
2021
Abstract
Genetic counselors (GCs) are increasingly filling important positions on research study teams, but there is limited literature describing the roles of GCs in these settings. GCs on the Undiagnosed Diseases Network (UDN) study team serve in a variety of roles across the research network and provide an opportunity to better understand genetic counselor roles in research. To quantitatively characterize the tasks regularly performed and professional fulfillment derived from these tasks, two surveys were administered to UDN GCs in a stepwise fashion. Responses from the first, free-response survey elicited the scope of tasks which informed development of a second structured, multiple-select survey. In survey 2, respondents were asked to select which roles they performed. Across 19 respondents, roles in survey 2 received a total of 947 selections averaging approximately 10 selections per role. When asked to indicate what roles they performed, respondents selected a mean of 50 roles (range 22-70). Survey 2 data were analyzed via thematic coding of responses and hierarchical cluster analysis to identify patterns in responses. From the thematic analysis, 20 non-overlapping codes emerged in seven categories: clinical interaction and care, communication, curation, leadership, participant management, research, and team management. Three themes emerged from the categories that represented the roles of GCs in the UDN: clinical care, collaboration, and curation. Cluster analyses showed that responses were more similar among individuals at the same institution than between institutions. This study highlights the ways GCs apply their unique skill set in the context of a clinical translational research network. Additionally, findings from this study reinforce the wide applicability of core skills that are part of genetic counseling training.
Clinical literacy, genomics expertise and analysis, interpersonal, psychosocial and counseling skills, education, professional practice skills, and an understanding of research processes make genetic counselors well suited for such roles and poised to positively impact research experiences and outcomes for participants.
View details for DOI 10.1002/jgc4.1493
View details for PubMedID 34374469
-
InpherNet accelerates monogenic disease diagnosis using patients' candidate genes' neighbors.
Genetics in medicine : official journal of the American College of Medical Genetics
2021
Abstract
PURPOSE: Roughly 70% of suspected Mendelian disease patients remain undiagnosed after genome sequencing, partly because knowledge about pathogenic genes is incomplete and constantly growing. Generating a novel pathogenic gene hypothesis from patient data can be time-consuming especially where cohort-based analysis is not available. METHODS: Each patient genome contains dozens to hundreds of candidate variants. Many sources of indirect evidence about each candidate may be considered. We introduce InpherNet, a network-based machine learning approach leveraging Monarch Initiative data to accelerate this process. RESULTS: InpherNet ranks candidate genes based on orthologs, paralogs, functional pathway members, and colocalized interaction partner gene neighbors. It can propose novel pathogenic genes and reveal known pathogenic genes whose diagnosed patient-based annotation is missing or partial. InpherNet is applied to patient cases where the causative gene is incorrectly ranked low by clinical gene-ranking methods that use only patient-derived evidence. InpherNet correctly ranks the causative gene top 1 or top 1-5 in roughly twice as many cases as seven comparable tools, including in cases where no clinical evidence for the diagnostic gene is in our knowledgebase. CONCLUSION: InpherNet improves the state of the art in considering candidate gene neighbors to accelerate monogenic diagnosis.
View details for DOI 10.1038/s41436-021-01238-2
View details for PubMedID 34230641
-
Variants in PRKAR1B cause a neurodevelopmental disorder with autism spectrum disorder, apraxia, and insensitivity to pain
GENETICS IN MEDICINE
2021
Abstract
We characterize the clinical and molecular phenotypes of six unrelated individuals with intellectual disability and autism spectrum disorder who carry heterozygous missense variants of the PRKAR1B gene, which encodes the R1β subunit of the cyclic AMP-dependent protein kinase A (PKA). Variants of PRKAR1B were identified by single- or trio-exome analysis. We contacted the families and physicians of the six individuals to collect phenotypic information, performed in vitro analyses of the identified PRKAR1B-variants, and investigated PRKAR1B expression during embryonic development. Recent studies of large patient cohorts with neurodevelopmental disorders found significant enrichment of de novo missense variants in PRKAR1B. In our cohort, de novo origin of the PRKAR1B variants could be confirmed in five of six individuals, and four carried the same heterozygous de novo variant c.1003C>T (p.Arg335Trp; NM_001164760). Global developmental delay, autism spectrum disorder, and apraxia/dyspraxia have been reported in all six, and reduced pain sensitivity was found in three individuals carrying the c.1003C>T variant. PRKAR1B expression in the brain was demonstrated during human embryonal development. Additionally, in vitro analyses revealed altered basal PKA activity in cells transfected with variant-harboring PRKAR1B expression constructs. Our study provides strong evidence for a PRKAR1B-related neurodevelopmental disorder.
View details for DOI 10.1038/s41436-021-01152-7
View details for Web of Science ID 000638059400001
View details for PubMedID 33833410
-
Avoiding genetic racial profiling in criminal DNA profile databases
NATURE COMPUTATIONAL SCIENCE
2021; 1 (4): 272-279
View details for DOI 10.1038/s43588-021-00058-3
View details for Web of Science ID 000888554600010
-
Avoiding genetic racial profiling in criminal DNA profile databases.
Nature computational science
2021; 1 (4): 272-279
Abstract
DNA profiling has become an essential tool for crime solving and prevention, and CODIS (Combined DNA Index System) criminal investigation databases have flourished at the national, state and even local level. However, reports suggest that the DNA profiles of all suspects searched in these databases are often retained, which could result in racial profiling. Here, we devise an approach to both enable broad DNA profile searches and preserve exonerated citizens' privacy through a real-time privacy-preserving procedure to query CODIS databases. Using our approach, an agent can privately and efficiently query a suspect's DNA profile device in the field, learning only whether the profile matches against any database profile. More importantly, the central database learns nothing about the queried profile, and thus cannot retain it. Our approach paves the way to implement privacy-preserving DNA profile searching in CODIS databases and any CODIS-like system.
View details for DOI 10.1038/s43588-021-00058-3
View details for PubMedID 38217177
-
Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases.
Genetics in medicine : official journal of the American College of Medical Genetics
2021
Abstract
PURPOSE: Genomic sequencing has become an increasingly powerful and relevant tool to be leveraged for the discovery of genetic aberrations underlying rare, Mendelian conditions. Although the computational tools incorporated into diagnostic workflows for this task are continually evolving and improving, we nevertheless sought to investigate commonalities across sequencing processing workflows to reveal consensus and standard practice tools and highlight exploratory analyses where technical and theoretical method improvements would be most impactful. METHODS: We collected details regarding the computational approaches used by a genetic testing laboratory and 11 clinical research sites in the United States participating in the Undiagnosed Diseases Network via meetings with bioinformaticians, online survey forms, and analyses of internal protocols. RESULTS: We found that tools for processing genomic sequencing data can be grouped into four distinct categories. Whereas well-established practices exist for initial variant calling and quality control steps, there is substantial divergence across sites in later stages for variant prioritization and multimodal data integration, demonstrating a diversity of approaches for solving the most mysterious undiagnosed cases. CONCLUSION: The largest differences across diagnostic workflows suggest that advances in structural variant detection, noncoding variant interpretation, and integration of additional biomedical data may be especially promising for solving chronically undiagnosed cases.
View details for DOI 10.1038/s41436-020-01084-8
View details for PubMedID 33580225
-
The Effect of Population Structure on Murine Genome-Wide Association Studies.
Frontiers in genetics
2021; 12: 745361
Abstract
The ability to use genome-wide association studies (GWAS) for genetic discovery depends upon our ability to distinguish true causative from false positive association signals. Population structure (PS) has been shown to cause false positive signals in GWAS. PS correction is routinely used for analysis of human GWAS results, and it has been assumed that it also should be utilized for murine GWAS using inbred strains. Nevertheless, there are fundamental differences between murine and human GWAS, and the impact of PS on murine GWAS results has not been carefully investigated. To assess the impact of PS on murine GWAS, we examined 8223 datasets that characterized biomedical responses in panels of inbred mouse strains. Rather than treat PS as a confounding variable, we examined it as a response variable. Surprisingly, we found that PS had a minimal impact on datasets measuring responses in ≤20 strains, and had surprisingly little impact on most datasets characterizing 21-40 inbred strains. Moreover, we show that true positive association signals arising from haplotype blocks, SNPs or indels, which were experimentally demonstrated to be causative for trait differences, would be rejected if PS correction were applied to them. Our results indicate that, because of the special conditions created by GWAS (the use of inbred strains, small sample sizes), PS assessment results should be carefully evaluated in conjunction with other criteria when murine GWAS results are evaluated.
View details for DOI 10.3389/fgene.2021.745361
View details for PubMedID 34589118
-
A comparative genomics multitool for scientific discovery and conservation
NATURE
2020; 587 (7833): 240-+
Abstract
The Zoonomia Project is investigating the genomics of shared and specialized traits in eutherian mammals. Here we provide genome assemblies for 131 species, of which all but 9 are previously uncharacterized, and describe a whole-genome alignment of 240 species of considerable phylogenetic diversity, comprising representatives from more than 80% of mammalian families. We find that regions of reduced genetic diversity are more abundant in species at a high risk of extinction, discern signals of evolutionary selection at high resolution and provide insights from individual reference genomes. By prioritizing phylogenetic diversity and making data available quickly and without restriction, the Zoonomia Project aims to support biological discovery, medical research and the conservation of biodiversity.
View details for DOI 10.1038/s41586-020-2876-6
View details for Web of Science ID 000588830300010
View details for PubMedID 33177664
-
A fully-automated method discovers loss of mouse-lethal and human-monogenic disease genes in 58 mammals.
Nucleic acids research
2020
Abstract
Gene losses provide an insightful route for studying the morphological and physiological adaptations of species, but their discovery is challenging. Existing genome annotation tools focus on annotating intact genes and do not attempt to distinguish nonfunctional genes from genes missing annotation due to sequencing and assembly artifacts. Previous attempts to annotate gene losses have required significant manual curation, which hampers their scalability for the ever-increasing deluge of newly sequenced genomes. Using extreme sequence erosion (amino acid deletions and substitutions) and sister species support as an unambiguous signature of loss, we developed an automated approach for detecting high-confidence gene loss events across a species tree. Our approach relies solely on gene annotation in a single reference genome, raw assemblies for the remaining species to analyze, and the associated phylogenetic tree for all organisms involved. Using human as reference, we discovered over 400 unique human ortholog erosion events across 58 mammals. This includes dozens of clade-specific losses of genes that result in early mouse lethality or are associated with severe human congenital diseases. Our discoveries yield intriguing potential for translational medical genetics and evolutionary biology, and our approach is readily applicable to large-scale genome sequencing efforts across the tree of life.
View details for DOI 10.1093/nar/gkaa550
View details for PubMedID 32614390
-
Clinical sites of the Undiagnosed Diseases Network: unique contributions to genomic medicine and science.
Genetics in medicine : official journal of the American College of Medical Genetics
2020
Abstract
The NIH Undiagnosed Diseases Network (UDN) evaluates participants with disorders that have defied diagnosis, applying personalized clinical and genomic evaluations and innovative research. The clinical sites of the UDN are essential to advancing the UDN mission; this study assesses their contributions relative to standard clinical practices. We analyzed retrospective data from four UDN clinical sites, from July 2015 to September 2019, for diagnoses, new disease gene discoveries and the underlying investigative methods. Of 791 evaluated individuals, 231 received 240 diagnoses and 17 new disease-gene associations were recognized. Straightforward diagnoses on UDN exome and genome sequencing occurred in 35% (84/240). We considered these tractable in standard clinical practice, although genome sequencing is not yet widely available clinically. The majority (156/240, 65%) required additional UDN-driven investigations, including 90 diagnoses that occurred after prior nondiagnostic exome sequencing and 45 diagnoses (19%) that were nongenetic. The UDN-driven investigations included complementary/supplementary phenotyping, innovative analyses of genomic variants, and collaborative science for functional assays and animal modeling. Investigations driven by the clinical sites identified diagnostic and research paradigms that surpass standard diagnostic processes. The new diagnoses, disease gene discoveries, and delineation of novel disorders represent a model for genomic medicine and science.
View details for DOI 10.1038/s41436-020-00984-z
View details for PubMedID 33093671
-
Morphogenesis is transcriptionally coupled to neurogenesis during peripheral olfactory organ development.
Development (Cambridge, England)
2020
Abstract
Sense organs acquire their distinctive shapes concomitantly with the differentiation of sensory cells and neurons necessary for their function. While our understanding of the mechanisms controlling morphogenesis and neurogenesis in these structures has grown, how these processes are coordinated remains largely unexplored. Neurogenesis in the zebrafish olfactory epithelium requires the bHLH proneural transcription factor Neurogenin1 (Neurog1). To address whether Neurog1 also controls morphogenesis, we analysed the migratory behaviour of early olfactory neural progenitors in neurog1 mutant embryos. Our results indicate that the oriented movements of these progenitors are disrupted in this context. Morphogenesis is similarly affected by mutations in the chemokine receptor gene, cxcr4b, suggesting it is a potential Neurog1 target gene. We find that Neurog1 directly regulates cxcr4b through an E-boxes cluster located just upstream of the cxcr4b transcription start site. Our results suggest that proneural transcription factors, such as Neurog1, directly couple distinct aspects of nervous system development.
View details for DOI 10.1242/dev.192971
View details for PubMedID 34004975
-
AMELIE speeds Mendelian diagnosis by matching patient phenotype and genotype to primary literature.
Science translational medicine
2020; 12 (544)
Abstract
The diagnosis of Mendelian disorders requires labor-intensive literature research. Trained clinicians can spend hours looking for the right publication(s) supporting a single gene that best explains a patient's disease. AMELIE (Automatic Mendelian Literature Evaluation) greatly accelerates this process. AMELIE parses all 29 million PubMed abstracts and downloads and further parses hundreds of thousands of full-text articles in search of information supporting the causality and associated phenotypes of most published genetic variants. AMELIE then prioritizes patient candidate variants for their likelihood of explaining any patient's given set of phenotypes. Diagnosis of singleton patients (without relatives' exomes) is the most time-consuming scenario, and AMELIE ranked the causative gene at the very top for 66% of 215 diagnosed singleton Mendelian patients from the Deciphering Developmental Disorders project. Evaluating only the top 11 AMELIE-scored genes of 127 (median) candidate genes per patient resulted in a rapid diagnosis in more than 90% of cases. AMELIE-based evaluation of all cases was 3 to 19 times more efficient than hand-curated database-based approaches. We replicated these results on a retrospective cohort of clinical cases from Stanford Children's Health and the Manton Center for Orphan Disease Research. An analysis web portal with our most recent update, programmatic interface, and code is available at AMELIE.stanford.edu.
View details for DOI 10.1126/scitranslmed.aau9113
View details for PubMedID 32434849
-
Transcription factor expression defines subclasses of developing projection neurons highly similar to single-cell RNA-seq subtypes.
Proceedings of the National Academy of Sciences of the United States of America
2020
Abstract
We are only just beginning to catalog the vast diversity of cell types in the cerebral cortex. Such categorization is a first step toward understanding how diversification relates to function. All cortical projection neurons arise from a uniform pool of progenitor cells that lines the ventricles of the forebrain. It is still unclear how these progenitor cells generate the more than 50 unique types of mature cortical projection neurons defined by their distinct gene-expression profiles. Moreover, exactly how and when neurons diversify their function during development is unknown. Here we relate gene expression and chromatin accessibility of two subclasses of projection neurons with divergent morphological and functional features as they develop in the mouse brain between embryonic day 13 and postnatal day 5 in order to identify transcriptional networks that diversify neuron cell fate. We compare these gene-expression profiles with published profiles of single cells isolated from similar populations and establish that layer-defined cell classes encompass cell subtypes and developmental trajectories identified using single-cell sequencing. Given the depth of our sequencing, we identify groups of transcription factors with particularly dense subclass-specific regulation and subclass-enriched transcription factor binding motifs. We also describe transcription factor-adjacent long noncoding RNAs that define each subclass and validate the function of Myt1l in balancing the ratio of the two subclasses in vitro. Our multidimensional approach supports an evolving model of progressive restriction of cell fate competence through inherited transcriptional identities.
View details for DOI 10.1073/pnas.2008013117
View details for PubMedID 32948690
-
Morphogenesis is transcriptionally coupled to neurogenesis during peripheral olfactory organ development.
Development (Cambridge, England)
2020
Abstract
Sense organs acquire their distinctive shapes concomitantly with the differentiation of sensory cells and neurons necessary for their function. While our understanding of the mechanisms controlling morphogenesis and neurogenesis in these structures has grown, how these processes are coordinated remains largely unexplored. Neurogenesis in the zebrafish olfactory epithelium requires the bHLH proneural transcription factor Neurogenin1 (Neurog1). To address whether Neurog1 also controls morphogenesis, we analysed the migratory behaviour of early olfactory neural progenitors in neurog1 mutant embryos. Our results indicate that the oriented movements of these progenitors are disrupted in this context. Morphogenesis is similarly affected by mutations in the chemokine receptor gene, cxcr4b, suggesting it is a potential Neurog1 target gene. We find that Neurog1 directly regulates cxcr4b through an E-boxes cluster located just upstream of the cxcr4b transcription start site. Our results suggest that proneural transcription factors, such as Neurog1, directly couple distinct aspects of nervous system development.
View details for DOI 10.1242/dev.192971
View details for PubMedID 33144399
-
A functional enrichment test for molecular convergent evolution finds a clear protein-coding signal in echolocating bats and whales.
Proceedings of the National Academy of Sciences of the United States of America
2019
Abstract
Distantly related species entering similar biological niches often adapt by evolving similar morphological and physiological characters. How much genomic molecular convergence (particularly of highly constrained coding sequence) contributes to convergent phenotypic evolution, such as echolocation in bats and whales, is a long-standing fundamental question. Like others, we find that convergent amino acid substitutions are not more abundant in echolocating mammals compared to their outgroups. However, we also ask a more informative question about the genomic distribution of convergent substitutions by devising a test to determine which, if any, of more than 4,000 tissue-affecting gene sets is most statistically enriched with convergent substitutions. We find that the gene set most overrepresented (q-value = 2.2e-3) with convergent substitutions in echolocators, affecting 18 genes, regulates development of the cochlear ganglion, a structure with empirically supported relevance to echolocation. Conversely, when comparing to nonecholocating outgroups, no significant gene set enrichment exists. For aquatic and high-altitude mammals, our analysis highlights 15 and 16 genes from the gene sets most affected by molecular convergence which regulate skin and lung physiology, respectively. Importantly, our test requires that the most convergence-enriched set cannot also be enriched for divergent substitutions, such as in the pattern produced by inactivated vision genes in subterranean mammals. Showing a clear role for adaptive protein-coding molecular convergence, we discover nearly 2,600 convergent positions, highlight 77 of them in 3 organs, and provide code to investigate other clades across the tree of life.
View details for DOI 10.1073/pnas.1818532116
View details for PubMedID 31570615
-
ClinPhen extracts and prioritizes patient phenotypes directly from medical records to expedite genetic disease diagnosis
GENETICS IN MEDICINE
2019; 21 (7): 1585–93
View details for DOI 10.1038/s41436-018-0381-1
View details for Web of Science ID 000473518700017
-
CRISPR/Cas9 Genome Engineering in Engraftable Human Brain-Derived Neural Stem Cells.
iScience
2019; 15: 524–35
Abstract
Human neural stem cells (NSCs) offer therapeutic potential for neurodegenerative diseases, such as inherited monogenic nervous system disorders, and neural injuries. Gene editing in NSCs (GE-NSCs) could enhance their therapeutic potential. We show that NSCs are amenable to gene targeting at multiple loci using Cas9 mRNA with synthetic chemically modified guide RNAs along with DNA donor templates. Transplantation of GE-NSC into oligodendrocyte mutant shiverer-immunodeficient mice showed that GE-NSCs migrate and differentiate into astrocytes, neurons, and myelin-producing oligodendrocytes, highlighting the fact that GE-NSCs retain their NSC characteristics of self-renewal and site-specific global migration and differentiation. To show the therapeutic potential of GE-NSCs, we generated GALC lysosomal enzyme overexpressing GE-NSCs that are able to cross-correct GALC enzyme activity through the mannose-6-phosphate receptor pathway. These GE-NSCs have the potential to be an investigational cell and gene therapy for a range of neurodegenerative disorders and injuries of the central nervous system, including lysosomal storage disorders.
View details for DOI 10.1016/j.isci.2019.04.036
View details for PubMedID 31132746
-
Darwin: A Genomics Coprocessor
IEEE MICRO
2019; 39 (3): 29–37
View details for DOI 10.1109/MM.2019.2910009
View details for Web of Science ID 000467551700005
-
S-CAP extends pathogenicity prediction to genetic variants that affect RNA splicing
NATURE GENETICS
2019; 51 (4): 755-+
View details for DOI 10.1038/s41588-019-0348-4
View details for Web of Science ID 000462767500022
-
LUNG DISEASE IN SYSTEMIC JIA: AN EMERGING PROBLEM LINKED WITH YOUNG AGE AND ANTI-IL-1/IL-6
BMJ PUBLISHING GROUP. 2019: A57
View details for DOI 10.1136/annrheumdis-2018-EWRR2019.115
View details for Web of Science ID 000466415300116
-
S-CAP extends pathogenicity prediction to genetic variants that affect RNA splicing.
Nature genetics
2019
Abstract
Exome analysis of patients with a likely monogenic disease does not identify a causal variant in over half of cases. Splice-disrupting mutations make up the second largest class of known disease-causing mutations. Each individual (singleton) exome harbors over 500 rare variants of unknown significance (VUS) in the splicing region. The existing relevant pathogenicity prediction tools tackle all non-coding variants as one amorphic class and/or are not calibrated for the high sensitivity required for clinical use. Here we calibrate seven such tools and devise a novel tool called Splicing Clinically Applicable Pathogenicity prediction (S-CAP) that is over twice as powerful as all previous tools, removing 41% of patient VUS at 95% sensitivity. We show that S-CAP does this by using its own features and not via meta-prediction over previous tools, and that splicing pathogenicity prediction is distinct from predicting molecular splicing changes. S-CAP is an important step on the path to deriving non-coding causal diagnoses.
View details for PubMedID 30804562
-
Components of genetic associations across 2,138 phenotypes in the UK Biobank highlight adipocyte biology.
Nature communications
2019; 10 (1): 4064
Abstract
Population-based biobanks with genomic and dense phenotype data provide opportunities for generating effective therapeutic hypotheses and understanding the genomic role in disease predisposition. To characterize latent components of genetic associations, we apply truncated singular value decomposition (DeGAs) to matrices of summary statistics derived from genome-wide association analyses across 2,138 phenotypes measured in 337,199 White British individuals in the UK Biobank study. We systematically identify key components of genetic associations and the contributions of variants, genes, and phenotypes to each component. As an illustration of the utility of the approach to inform downstream experiments, we report putative loss of function variants, rs114285050 (GPR151) and rs150090666 (PDE3B), that substantially contribute to obesity-related traits and experimentally demonstrate the role of these genes in adipocyte biology. Our approach to dissect components of genetic associations across the human phenome will accelerate biomedical hypothesis generation by providing insights on previously unexplored latent structures.
View details for DOI 10.1038/s41467-019-11953-9
View details for PubMedID 31492854
-
AVADA: toward automated pathogenic variant evidence retrieval directly from the full-text literature.
Genetics in medicine : official journal of the American College of Medical Genetics
2019
Abstract
Both monogenic pathogenic variant cataloging and clinical patient diagnosis start with variant-level evidence retrieval followed by expert evidence integration in search of diagnostic variants and genes. Here, we try to accelerate pathogenic variant evidence retrieval by an automatic approach. Automatic VAriant evidence DAtabase (AVADA) is a novel machine learning tool that uses natural language processing to automatically identify pathogenic genetic variant evidence in full-text primary literature about monogenic disease and convert it to genomic coordinates. AVADA automatically retrieved almost 60% of likely disease-causing variants deposited in the Human Gene Mutation Database (HGMD), a 4.4-fold improvement over the current best open source automated variant extractor. AVADA contains over 60,000 likely disease-causing variants that are in HGMD but not in ClinVar. AVADA also highlights the challenges of automated variant mapping and pathogenicity curation. However, when combined with manual validation, on 245 diagnosed patients, AVADA provides valuable evidence for an additional 18 diagnostic variants, on top of ClinVar's 21, versus only 2 using the best current automated approach. AVADA advances automated retrieval of pathogenic monogenic variant evidence from full-text literature. Far from perfect, but much faster than PubMed/Google Scholar search, careful curation of AVADA-retrieved evidence can aid both database curation and patient diagnosis.
View details for DOI 10.1038/s41436-019-0643-6
View details for PubMedID 31467448
-
CLINPHEN EXTRACTS AND PRIORITIZES PHENOTYPES FROM MEDICAL RECORDS TO ACCELERATE GENOMIC DIAGNOSIS
BMJ PUBLISHING GROUP. 2019: 179
View details for DOI 10.1136/jim-2018-000939.262
View details for Web of Science ID 000457712500272
-
Emergent high fatality lung disease in systemic juvenile arthritis.
Annals of the rheumatic diseases
2019
Abstract
To investigate the characteristics and risk factors of a novel parenchymal lung disease (LD), increasingly detected in systemic juvenile idiopathic arthritis (sJIA). In a multicentre retrospective study, 61 cases were investigated using physician-reported clinical information and centralised analyses of radiological, pathological and genetic data. LD was associated with distinctive features, including acute erythematous clubbing and a high frequency of anaphylactic reactions to the interleukin (IL)-6 inhibitor, tocilizumab. Serum ferritin elevation and/or significant lymphopaenia preceded LD detection. The most prevalent chest CT pattern was septal thickening, involving the periphery of multiple lobes ± ground-glass opacities. The predominant pathology (23 of 36) was pulmonary alveolar proteinosis and/or endogenous lipoid pneumonia (PAP/ELP), with atypical features including regional involvement and concomitant vascular changes. Apparent severe delayed drug hypersensitivity occurred in some cases. The 5-year survival was 42%. Whole exome sequencing (20 of 61) did not identify a novel monogenic defect or likely causal PAP-related or macrophage activation syndrome (MAS)-related mutations. Trisomy 21 and young sJIA onset increased LD risk. Exposure to IL-1 and IL-6 inhibitors (46 of 61) was associated with multiple LD features. By several indicators, severity of sJIA was comparable in drug-exposed subjects and published sJIA cohorts. MAS at sJIA onset was increased in the drug-exposed, but was not associated with LD features. A rare, life-threatening lung disease in sJIA is defined by a constellation of unusual clinical characteristics. The pathology, a PAP/ELP variant, suggests macrophage dysfunction. Inhibitor exposure may promote LD, independent of sJIA severity, in a small subset of treated patients. Treatment/prevention strategies are needed.
View details for DOI 10.1136/annrheumdis-2019-216040
View details for PubMedID 31562126
-
Darwin-WGA: A Co-processor Provides Increased Sensitivity in Whole Genome Alignments with High Speedup
IEEE. 2019: 359–72
View details for DOI 10.1109/HPCA.2019.00050
View details for Web of Science ID 000469766300028
-
Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts.
Nature medicine
2019
Abstract
It is estimated that 350 million individuals worldwide suffer from rare diseases, which are predominantly caused by mutation in a single gene. The current molecular diagnostic rate is estimated at 50%, with whole-exome sequencing (WES) among the most successful approaches. For patients in whom WES is uninformative, RNA sequencing (RNA-seq) has shown diagnostic utility in specific tissues and diseases. This includes muscle biopsies from patients with undiagnosed rare muscle disorders, and cultured fibroblasts from patients with mitochondrial disorders. However, for many individuals, biopsies are not performed for clinical care, and tissues are difficult to access. We sought to assess the utility of RNA-seq from blood as a diagnostic tool for rare diseases of different pathophysiologies. We generated whole-blood RNA-seq from 94 individuals with undiagnosed rare diseases spanning 16 diverse disease categories. We developed a robust approach to compare data from these individuals with large sets of RNA-seq data for controls (n = 1,594 unrelated controls and n = 49 family members) and demonstrated the impacts of expression, splicing, gene and variant filtering strategies on disease gene identification. Across our cohort, we observed that RNA-seq yields a 7.5% diagnostic rate, and an additional 16.7% with improved candidate gene resolution.
View details for DOI 10.1038/s41591-019-0457-8
View details for PubMedID 31160820
-
A sequence-based, deep learning model accurately predicts RNA splicing branchpoints.
RNA (New York, N.Y.)
2018
Abstract
Experimental detection of RNA splicing branchpoints is difficult. To date, high-confidence experimental annotations exist for 18% of 3' splice sites in the human genome. We develop a deep-learning based branchpoint predictor, LaBranchoR, which predicts a correct branchpoint for at least 75% of 3' splice sites genome-wide. Detailed analysis of cases in which our predicted branchpoint deviates from experimental data suggests a correct branchpoint is predicted in over 90% of cases. We use our predicted branchpoints to identify a novel sequence element upstream of branchpoints consistent with extended U2 snRNA base pairing, show an association between weak branchpoints and alternative splicing, and explore the effects of genetic variants on branchpoints. We provide genome-wide branchpoint annotations and in silico mutagenesis scores at http://bejerano.stanford.edu/labranchor.
View details for PubMedID 30224349
-
Independent erosion of conserved transcription factor binding sites points to shared hindlimb, vision and external testes loss in different mammals.
Nucleic acids research
2018
Abstract
Genetic variation in cis-regulatory elements is thought to be a major driving force in morphological and physiological changes. However, identifying transcription factor binding events that code for complex traits remains a challenge, motivating novel means of detecting putatively important binding events. Using a curated set of 1154 high-quality transcription factor motifs, we demonstrate that independently eroded binding sites are enriched for independently lost traits in three distinct pairs of placental mammals. We show that these independently eroded events pinpoint the loss of hindlimbs in dolphin and manatee, degradation of vision in naked mole-rat and star-nosed mole, and the loss of external testes in white rhinoceros and Weddell seal. We additionally show that our method may also be utilized with more than two species. Our study exhibits a novel methodology to detect cis-regulatory mutations which help explain a portion of the molecular mechanism underlying complex trait formation and loss.
View details for PubMedID 30137416
-
An MTF1 binding site disrupted by a homozygous variant in the promoter of ATP7B likely causes Wilson Disease.
European journal of human genetics : EJHG
2018
Abstract
Approximately 2% of the human genome accounts for protein-coding genes, yet most known Mendelian disease-causing variants lie in exons or splice sites. Individuals who symptomatically present with monogenic disorders but do not possess function-altering variants in the protein-coding regions of causative genes may harbor variants in the surrounding gene regulatory domains. We present such a case: a male of Afghani descent was clinically diagnosed with Wilson Disease, a disorder of systemic copper buildup, but was found to have no function-altering coding variants in ATP7B (ENST00000242839.4), the typically causative gene. Our analysis revealed the homozygous variant chr13:g.52,586,149T>C (NC_000013.10, hg19) 676bp into the ATP7B promoter, which disrupts a metal regulatory transcription factor 1 (MTF1) binding site and diminishes expression of ATP7B in response to copper intake, likely resulting in Wilson Disease. Our approach to identify the causative variant can be generalized to systematically discover function-altering non-coding variants underlying disease and motivates evaluation of gene regulatory variants.
View details for PubMedID 30087448
-
Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization.
Genetics in medicine : official journal of the American College of Medical Genetics
2018
Abstract
PURPOSE: Exome sequencing and diagnosis is beginning to spread across the medical establishment. The most time-consuming part of genome-based diagnosis is the manual step of matching the potentially long list of patient candidate genes to patient phenotypes to identify the causative disease. METHODS: We introduce Phrank (for phenotype ranking), an information theory-inspired method that utilizes a Bayesian network to prioritize candidate diseases or genes, as a stand-alone module that can be run with any underlying knowledgebase and any variant filtering scheme. RESULTS: Phrank outperforms existing methods at ranking the causative disease or gene when applied to 169 real patient exomes with Mendelian diagnoses. Phrank's greatest improvement is in disease space, where across all 169 patients it ranks only 3 diseases on average ahead of the true diagnosis, whereas Phenomizer ranks 32 diseases ahead of the causal one. CONCLUSIONS: Using Phrank to rank all patient candidate genes or diseases, as they start working through a new case, will save the busy clinician much time in deriving a genetic diagnosis.
View details for PubMedID 29997393
-
BIALLELIC LOSS OF FUNCTION WNT5A MUTATIONS IN AN INFANT WITH SEVERE AND ATYPICAL MANIFESTATIONS OF ROBINOW SYNDROME AND UNAFFECTED PARENTS - A NEW LOCUS FOR AUTOSOMAL RECESSIVE DISEASE
WILEY. 2018: 1504
View details for Web of Science ID 000434040600110
-
ClinPhen extracts and prioritizes patient phenotypes directly from medical records to expedite genetic disease diagnosis.
Genetics in medicine : official journal of the American College of Medical Genetics
2018
Abstract
Diagnosing monogenic diseases facilitates optimal care, but can involve the manual evaluation of hundreds of genetic variants per case. Computational tools like Phrank expedite this process by ranking all candidate genes by their ability to explain the patient's phenotypes. To use these tools, busy clinicians must manually encode patient phenotypes from lengthy clinical notes. With 100 million human genomes estimated to be sequenced by 2025, a fast alternative to manual phenotype extraction from clinical notes will become necessary. We introduce ClinPhen, a fast, high-accuracy tool that automatically converts clinical notes into a prioritized list of patient phenotypes using Human Phenotype Ontology (HPO) terms. ClinPhen shows superior accuracy and 20× speedup over existing phenotype extractors, and its novel phenotype prioritization scheme improves the performance of gene-ranking tools. While a dedicated clinician can process 200 patient records in a 40-hour workweek, ClinPhen does the same in 10 minutes. Compared with manual phenotype extraction, ClinPhen saves an additional 3-5 hours per Mendelian disease diagnosis. Providers can now add ClinPhen's output to each summary note attached to a filled testing laboratory request form. ClinPhen makes a substantial contribution to improvements in efficiency critically needed to meet the surging demand for clinical diagnostic sequencing.
View details for PubMedID 30514889
-
Biallelic loss-of-function WNT5A mutations in an infant with severe and atypical manifestations of Robinow syndrome.
American journal of medical genetics. Part A
2018; 176 (4): 1030–36
Abstract
Robinow syndrome (RS) is a well-recognized Mendelian disorder known to demonstrate both autosomal dominant and autosomal recessive inheritance. Typical manifestations include short stature, characteristic facies, and skeletal anomalies. Recessive inheritance has been associated with mutations in ROR2 while dominant inheritance has been observed for mutations in WNT5A, DVL1, and DVL3. Through trio whole genome sequencing, we identified a homozygous frameshifting single nucleotide deletion in WNT5A in a previously reported, deceased infant with a unique constellation of features comprising a 46,XY disorder of sex development with multiple congenital malformations including congenital diaphragmatic hernia, ambiguous genitalia, dysmorphic facies, shortened long bones, adactyly, and ventricular septal defect. The parents, who are both heterozygous for the deletion, appear clinically unaffected. In conjunction with published observations of Wnt5a double knockout mice, we provide evidence for the possibility of autosomal recessive inheritance in association with WNT5A loss-of-function mutations in RS.
View details for PubMedID 29575631
-
A screen for deeply conserved non-coding GWAS SNPs uncovers a MIR-9-2 functional mutation associated to retinal vasculature defects in human
Nucleic Acids Research
2018; 1
Abstract
Thousands of human disease-associated single nucleotide polymorphisms (SNPs) lie in the non-coding genome, but only a handful have been demonstrated to affect gene expression and human biology. We computationally identified risk-associated SNPs in deeply conserved non-exonic elements (CNEs) potentially contributing to 45 human diseases. We further demonstrated that human CNE1/rs17421627 associated with retinal vasculature defects showed transcriptional activity in the zebrafish retina, while introducing the risk-associated allele completely abolished CNE1 enhancer activity. Furthermore, deletion of CNE1 led to retinal vasculature defects and to a specific downregulation of microRNA-9, rather than MEF2C as predicted by the original genome-wide association studies. Consistent with these results, miR-9 depletion affects retinal vasculature formation, demonstrating MIR-9-2 as a critical gene underpinning the associated trait. Importantly, we validated that other CNEs act as transcriptional enhancers that can be disrupted by conserved non-coding SNPs. This study uncovers disease-associated non-coding mutations that are deeply conserved, providing a path for in vivo testing to reveal their cis-regulated genes and biological roles.
View details for DOI 10.1093/nar/gky166
View details for PubMedCentralID PMC5909433
-
Deriving genomic diagnoses without revealing patient genomes
SCIENCE
2017; 357 (6352): 692-+
Abstract
Patient genomes are interpretable only in the context of other genomes; however, genome sharing enables discrimination. Thousands of monogenic diseases have yielded definitive genomic diagnoses and potential gene therapy targets. Here we show how to provide such diagnoses while preserving participant privacy through the use of secure multiparty computation. In multiple real scenarios (small patient cohorts, trio analysis, two-hospital collaboration), we used our methods to identify the causal variant and discover previously unrecognized disease genes and variants while keeping up to 99.7% of all participants' most sensitive genomic information private.
View details for PubMedID 28818945
-
Chitayat syndrome: hyperphalangism, characteristic facies, hallux valgus and bronchomalacia results from a recurrent c.266A > G p.(Tyr89Cys) variant in the ERF gene
JOURNAL OF MEDICAL GENETICS
2017; 54 (3): 157-165
Abstract
In 1993, Chitayat et al. reported a newborn with hyperphalangism, facial anomalies, and bronchomalacia. We identified three additional families with similar findings. Features include bilateral accessory phalanx resulting in shortened index fingers; hallux valgus; distinctive face; respiratory compromise. To identify the genetic aetiology of Chitayat syndrome and identify a unifying cause for this specific form of hyperphalangism. Through ongoing collaboration, we had collected patients with strikingly-similar phenotype. Trio-based exome sequencing was first performed in Patient 2 through Deciphering Developmental Disorders study. Proband-only exome sequencing had previously been independently performed in Patient 4. Following identification of a candidate gene variant in Patient 2, the same variant was subsequently confirmed from exome data in Patient 4. Sanger sequencing was used to validate this variant in Patients 1, 3; confirm paternal inheritance in Patient 5. A recurrent, novel variant NM_006494.2:c.266A>G p.(Tyr89Cys) in ERF was identified in five affected individuals: de novo (patient 1, 2 and 3) and inherited from an affected father (patient 4 and 5). p.Tyr89Cys is an aromatic polar neutral to polar neutral amino acid substitution, at a highly conserved position and lies within the functionally important ETS-domain of the protein. The recurrent ERF c.266A>G p.(Tyr89Cys) variant causes Chitayat syndrome. ERF variants have previously been associated with complex craniosynostosis. In contrast, none of the patients with the c.266A>G p.(Tyr89Cys) variant have craniosynostosis. We report the molecular aetiology of Chitayat syndrome and discuss potential mechanisms for this distinctive phenotype associated with the p.Tyr89Cys substitution in ERF.
View details for Web of Science ID 000397862400003
-
Systematic reanalysis of clinical exome data yields additional diagnoses: implications for providers
GENETICS IN MEDICINE
2017; 19 (2): 209-214
Abstract
Clinical exome sequencing is nondiagnostic for about 75% of patients evaluated for a possible Mendelian disorder. We examined the ability of systematic reevaluation of exome data to establish additional diagnoses. The exome and phenotypic data of 40 individuals with previously nondiagnostic clinical exomes were reanalyzed with current software and literature. A definitive diagnosis was identified for 4 of 40 participants (10%). In these cases the causative variant is de novo and in a relevant autosomal-dominant disease gene. The literature to tie the causative genes to the participants' phenotypes was weak, nonexistent, or not readily located at the time of the initial clinical exome reports. At the time of diagnosis by reanalysis, the supporting literature was 1 to 3 years old. Approximately 250 gene-disease and 9,200 variant-disease associations are reported annually. This increase in information necessitates regular reevaluation of nondiagnostic exomes. To be practical, systematic reanalysis requires further automation and more up-to-date variant databases. To maximize the diagnostic yield of exome sequencing, providers should periodically request reanalysis of nondiagnostic exomes. Accordingly, policies regarding reanalysis should be weighed in combination with factors such as cost and turnaround time when selecting a clinical exome laboratory.
View details for DOI 10.1038/gim.2016.88
View details for Web of Science ID 000393534200010
-
Mutations of AKT3 are associated with a wide spectrum of developmental disorders including extreme megalencephaly.
Brain : a journal of neurology
2017; 140 (10): 2610–22
Abstract
Mutations of genes within the phosphatidylinositol-3-kinase (PI3K)-AKT-MTOR pathway are well known causes of brain overgrowth (megalencephaly) as well as segmental cortical dysplasia (such as hemimegalencephaly, focal cortical dysplasia and polymicrogyria). Mutations of the AKT3 gene have been reported in a few individuals with brain malformations, to date. Therefore, our understanding regarding the clinical and molecular spectrum associated with mutations of this critical gene is limited, with no clear genotype-phenotype correlations. We sought to further delineate this spectrum, study levels of mosaicism and identify genotype-phenotype correlations of AKT3-related disorders. We performed targeted sequencing of AKT3 on individuals with these phenotypes by molecular inversion probes and/or Sanger sequencing to determine the type and level of mosaicism of mutations. We analysed all clinical and brain imaging data of mutation-positive individuals including neuropathological analysis in one instance. We performed ex vivo kinase assays on AKT3 engineered with the patient mutations and examined the phospholipid binding profile of pleckstrin homology domain localizing mutations. We identified 14 new individuals with AKT3 mutations with several phenotypes dependent on the type of mutation and level of mosaicism. Our comprehensive clinical characterization, and review of all previously published patients, broadly segregates individuals with AKT3 mutations into two groups: patients with highly asymmetric cortical dysplasia caused by the common p.E17K mutation, and patients with constitutional AKT3 mutations exhibiting more variable phenotypes including bilateral cortical malformations, polymicrogyria, periventricular nodular heterotopia and diffuse megalencephaly without cortical dysplasia. All mutations increased kinase activity, and pleckstrin homology domain mutants exhibited enhanced phospholipid binding. 
Overall, our study shows that activating mutations of the critical AKT3 gene are associated with a wide spectrum of brain involvement ranging from focal or segmental brain malformations (such as hemimegalencephaly and polymicrogyria) predominantly due to mosaic AKT3 mutations, to diffuse bilateral cortical malformations, megalencephaly and heterotopia due to constitutional AKT3 mutations. We also provide the first detailed neuropathological examination of a child with extreme megalencephaly due to a constitutional AKT3 mutation. This child has one of the largest documented paediatric brain sizes, to our knowledge. Finally, our data show that constitutional AKT3 mutations are associated with megalencephaly, with or without autism, similar to PTEN-related disorders. Recognition of this broad clinical and molecular spectrum of AKT3 mutations is important for providing early diagnosis and appropriate management of affected individuals, and will facilitate targeted design of future human clinical trials using PI3K-AKT pathway inhibitors.
View details for PubMedID 28969385
-
MicroRNA-9 Couples Brain Neurogenesis and Angiogenesis.
Cell reports
2017; 20 (7): 1533–42
Abstract
In the developing brain, neurons expressing VEGF-A and blood vessels grow in close apposition, but many of the molecular pathways regulating neuronal VEGF-A and neurovascular system development remain to be deciphered. Here, we show that miR-9 links neurogenesis and angiogenesis through the formation of neurons expressing VEGF-A. We found that miR-9 directly targets the transcription factors TLX and ONECUTs to regulate VEGF-A expression. miR-9 inhibition leads to increased TLX and ONECUT expression, resulting in VEGF-A overexpression. This untimely increase of neuronal VEGF-A signal leads to the thickening of blood vessels at the expense of the normal formation of the neurovascular network in the brain and retina. Thus, this conserved transcriptional cascade is critical for proper brain development in vertebrates. Because of this dual role on neural stem cell proliferation and angiogenesis, miR-9 and its downstream targets are promising factors for cellular regenerative therapy following stroke and for brain tumor treatment.
View details for PubMedID 28813666
-
M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity.
Nature genetics
2016
Abstract
Variant pathogenicity classifiers such as SIFT, PolyPhen-2, CADD, and MetaLR assist in interpretation of the hundreds of rare, missense variants in the typical patient genome by deprioritizing some variants as likely benign. These widely used methods misclassify 26 to 38% of known pathogenic mutations, which could lead to missed diagnoses if the classifiers are trusted as definitive in a clinical setting. We developed M-CAP, a clinical pathogenicity classifier that outperforms existing methods at all thresholds and correctly dismisses 60% of rare, missense variants of uncertain significance in a typical genome at 95% sensitivity.
View details for DOI 10.1038/ng.3703
View details for PubMedID 27776117
-
Chitayat syndrome: hyperphalangism, characteristic facies, hallux valgus and bronchomalacia results from a recurrent c.266A>G p.(Tyr89Cys) variant in the ERF gene.
Journal of medical genetics
2016
Abstract
In 1993, Chitayat et al. reported a newborn with hyperphalangism, facial anomalies, and bronchomalacia. We identified three additional families with similar findings. Features include bilateral accessory phalanx resulting in shortened index fingers; hallux valgus; distinctive face; respiratory compromise. To identify the genetic aetiology of Chitayat syndrome and identify a unifying cause for this specific form of hyperphalangism. Through ongoing collaboration, we had collected patients with strikingly-similar phenotype. Trio-based exome sequencing was first performed in Patient 2 through Deciphering Developmental Disorders study. Proband-only exome sequencing had previously been independently performed in Patient 4. Following identification of a candidate gene variant in Patient 2, the same variant was subsequently confirmed from exome data in Patient 4. Sanger sequencing was used to validate this variant in Patients 1, 3; confirm paternal inheritance in Patient 5. A recurrent, novel variant NM_006494.2:c.266A>G p.(Tyr89Cys) in ERF was identified in five affected individuals: de novo (patient 1, 2 and 3) and inherited from an affected father (patient 4 and 5). p.Tyr89Cys is an aromatic polar neutral to polar neutral amino acid substitution, at a highly conserved position and lies within the functionally important ETS-domain of the protein. The recurrent ERF c.266A>G p.(Tyr89Cys) variant causes Chitayat syndrome. ERF variants have previously been associated with complex craniosynostosis. In contrast, none of the patients with the c.266A>G p.(Tyr89Cys) variant have craniosynostosis. We report the molecular aetiology of Chitayat syndrome and discuss potential mechanisms for this distinctive phenotype associated with the p.Tyr89Cys substitution in ERF.
View details for DOI 10.1136/jmedgenet-2016-104143
View details for PubMedID 27738187
-
TBR1 regulates autism risk genes in the developing neocortex.
Genome research
2016; 26 (8): 1013-1022
Abstract
Exome sequencing studies have identified multiple genes harboring de novo loss-of-function (LoF) variants in individuals with autism spectrum disorders (ASD), including TBR1, a master regulator of cortical development. We performed ChIP-seq for TBR1 during mouse cortical neurogenesis and show that TBR1-bound regions are enriched adjacent to ASD genes. ASD genes were also enriched among genes that are differentially expressed in Tbr1 knockouts, which together with the ChIP-seq data, suggests direct transcriptional regulation. Of the nine ASD genes examined, seven were misexpressed in the cortices of Tbr1 knockout mice, including six with increased expression in the deep cortical layers. ASD genes with adjacent cortical TBR1 ChIP-seq peaks also showed unusually low levels of LoF mutations in a reference human population and among Icelanders. We then leveraged TBR1 binding to identify an appealing subset of candidate ASD genes. Our findings highlight a TBR1-regulated network of ASD genes in the developing neocortex that are relatively intolerant to LoF mutations, indicating that these genes may play critical roles in normal cortical development.
View details for DOI 10.1101/gr.203612.115
View details for PubMedID 27325115
-
Systematic reanalysis of clinical exome data yields additional diagnoses: implications for providers.
Genetics in medicine
2016
Abstract
Clinical exome sequencing is nondiagnostic for about 75% of patients evaluated for a possible Mendelian disorder. We examined the ability of systematic reevaluation of exome data to establish additional diagnoses. The exome and phenotypic data of 40 individuals with previously nondiagnostic clinical exomes were reanalyzed with current software and literature. A definitive diagnosis was identified for 4 of 40 participants (10%). In these cases the causative variant is de novo and in a relevant autosomal-dominant disease gene. The literature to tie the causative genes to the participants' phenotypes was weak, nonexistent, or not readily located at the time of the initial clinical exome reports. At the time of diagnosis by reanalysis, the supporting literature was 1 to 3 years old. Approximately 250 gene-disease and 9,200 variant-disease associations are reported annually. This increase in information necessitates regular reevaluation of nondiagnostic exomes. To be practical, systematic reanalysis requires further automation and more up-to-date variant databases. To maximize the diagnostic yield of exome sequencing, providers should periodically request reanalysis of nondiagnostic exomes. Accordingly, policies regarding reanalysis should be weighed in combination with factors such as cost and turnaround time when selecting a clinical exome laboratory.
View details for DOI 10.1038/gim.2016.88
View details for PubMedID 27441994
-
"Reverse Genomics" Predicts Function of Human Conserved Noncoding Elements
MOLECULAR BIOLOGY AND EVOLUTION
2016; 33 (5): 1358-1369
Abstract
Evolutionary changes in cis-regulatory elements are thought to play a key role in morphological and physiological diversity across animals. Many conserved noncoding elements (CNEs) function as cis-regulatory elements, controlling gene expression levels in different biological contexts. However, determining specific associations between CNEs and related phenotypes is a challenging task. Here, we present a computational "reverse genomics" approach that predicts the phenotypic functions of human CNEs. We identify thousands of human CNEs that were lost in at least two independent mammalian lineages (IL-CNEs), and match their evolutionary profiles against a diverse set of phenotypes recently annotated across multiple mammalian species. We identify 2,759 compelling associations between human CNEs and a diverse set of mammalian phenotypes. We discuss multiple CNEs, including a predicted ear element near BMP7, a pelvic CNE in FBN1, a brain morphology element in UBE4B, and an aquatic adaptation forelimb CNE near EGR2, and provide a full list of our predictions. As more genomes are sequenced and more traits are annotated across species, we expect our method to facilitate the interpretation of noncoding mutations in human disease and expedite the discovery of individual CNEs that play key roles in human evolution and development.
View details for DOI 10.1093/molbev/msw001
View details for Web of Science ID 000374834900019
View details for PubMedID 26744417
View details for PubMedCentralID PMC4909134
-
Erosion of Conserved Binding Sites in Personal Genomes Points to Medical Histories.
PLoS computational biology
2016; 12 (2)
Abstract
Although many human diseases have a genetic component involving many loci, the majority of studies are statistically underpowered to isolate the many contributing variants, raising the question of the existence of alternate processes to identify disease mutations. To address this question, we collect ancestral transcription factor binding sites disrupted by an individual's variants and then look for their most significant congregation next to a group of functionally related genes. Strikingly, when the method is applied to five different full human genomes, the top enriched function for each is invariably reflective of their very different medical histories. For example, our method implicates "abnormal cardiac output" for a patient with a longstanding family history of heart disease, "decreased circulating sodium level" for an individual with hypertension, and other biologically appealing links for medical histories spanning narcolepsy to axonal neuropathy. Our results suggest that erosion of gene regulation by mutation load significantly contributes to observed heritable phenotypes that manifest in the medical history. The test we developed exposes a hitherto hidden layer of personal variants that promise to shed new light on human disease penetrance, expressivity and the sensitivity with which we can detect them.
View details for DOI 10.1371/journal.pcbi.1004711
View details for PubMedID 26845687
View details for PubMedCentralID PMC4742230
-
Changes in the enhancer landscape during early placental development uncover a trophoblast invasion gene-enhancer network.
Placenta
2016; 37: 45-55
Abstract
Trophoblast invasion establishes adequate blood flow between mother and fetus in early placental development. However, little is known about the cis-regulatory mechanisms underlying this important process. We aimed to identify enhancer elements that are active during trophoblast invasion, and build a trophoblast invasion gene-enhancer network. We carried out ChIP-Seq for an enhancer-associated mark (H3K27ac) at two time points during early placental development in mouse: one time point when invasion is at its peak (e7.5) and another time point shortly afterwards (e9.5). We use computational analysis to identify putative enhancers, as well as the transcription factor binding sites within them, that are specific to the time point of trophoblast invasion. We compared read profiles at e7.5 and e9.5 to identify 1,977 e7.5-specific enhancers. Within a subset of e7.5-specific enhancers, we discovered a cell migration associated regulatory code, consisting of three transcription factor motifs: AP1, Ets, and Tcfap2. To validate differential expression of the transcription factors that bind these motifs, we performed RNA-Seq in the same context. Finally, we integrated these data with publicly available protein-protein interaction data and constructed a trophoblast invasion gene-enhancer network. The data we generated and analysis we carried out improve our understanding of the regulatory mechanisms of trophoblast invasion, by suggesting a transcriptional code exists in the enhancers of cell migration genes. Furthermore, the network we constructed highlights novel candidate genes that may be critical for trophoblast invasion.
View details for DOI 10.1016/j.placenta.2015.11.001
View details for PubMedID 26604129
View details for PubMedCentralID PMC4707081
-
Mx1 and Mx2 key antiviral proteins are surprisingly lost in toothed whales
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2015; 112 (26): 8036-8040
Abstract
Viral outbreaks in dolphins and other Delphinoidea family members warrant investigation into the integrity of the cetacean immune system. The dynamin-like GTPase genes Myxovirus 1 (Mx1) and Mx2 defend mammals against a broad range of viral infections. Loss of Mx1 function in human and mice enhances infectivity by multiple RNA and DNA viruses, including orthomyxoviruses (influenza A), paramyxoviruses (measles), and hepadnaviruses (hepatitis B), whereas loss of Mx2 function leads to decreased resistance to HIV-1 and other viruses. Here we show that both Mx1 and Mx2 have been rendered nonfunctional in Odontoceti cetaceans (toothed whales, including dolphins and orcas). We discovered multiple exon deletions, frameshift mutations, premature stop codons, and transcriptional evidence of decay in the coding sequence of both Mx1 and Mx2 in four species of Odontocetes. We trace the likely loss event for both proteins to soon after the divergence of Odontocetes and Mysticetes (baleen whales) ∼33-37 Mya. Our data raise intriguing questions as to what drove the loss of both Mx1 and Mx2 genes in the Odontoceti lineage, a double loss seen in none of 56 other mammalian genomes, and suggest a hitherto unappreciated fundamental genetic difference in the way these magnificent mammals respond to viral infections.
View details for DOI 10.1073/pnas.1501844112
View details for Web of Science ID 000357079400051
View details for PubMedID 26080416
View details for PubMedCentralID PMC4491785
-
Characterization of TCF21 Downstream Target Regions Identifies a Transcriptional Network Linking Multiple Independent Coronary Artery Disease Loci
PLOS GENETICS
2015; 11 (5)
Abstract
To functionally link coronary artery disease (CAD) causal genes identified by genome wide association studies (GWAS), and to investigate the cellular and molecular mechanisms of atherosclerosis, we have used chromatin immunoprecipitation sequencing (ChIP-Seq) with the CAD associated transcription factor TCF21 in human coronary artery smooth muscle cells (HCASMC). Analysis of identified TCF21 target genes for enrichment of molecular and cellular annotation terms identified processes relevant to CAD pathophysiology, including "growth factor binding," "matrix interaction," and "smooth muscle contraction." We characterized the canonical binding sequence for TCF21 as CAGCTG, identified AP-1 binding sites in TCF21 peaks, and by conducting ChIP-Seq for JUN and JUND in HCASMC confirmed that there is significant overlap between TCF21 and AP-1 binding loci in this cell type. Expression quantitative trait variation mapped to target genes of TCF21 was significantly enriched among variants with low P-values in the GWAS analyses, suggesting a possible functional interaction between TCF21 binding and causal variants in other CAD disease loci. Separate enrichment analyses found over-representation of TCF21 target genes among CAD associated genes, and linkage disequilibrium between TCF21 peak variation and that found in GWAS loci, consistent with the hypothesis that TCF21 may affect disease risk through interaction with other disease associated loci. Interestingly, enrichment for TCF21 target genes was also found among other genome wide association phenotypes, including height and inflammatory bowel disease, suggesting a functional profile important for basic cellular processes in non-vascular tissues. Thus, data and analyses presented here suggest that study of GWAS transcription factors may be a highly useful approach to identifying disease gene interactions and thus pathways that may be relevant to complex disease etiology.
View details for DOI 10.1371/journal.pgen.1005202
View details for Web of Science ID 000355305200022
View details for PubMedID 26020271
-
A family of transposable elements co-opted into developmental enhancers in the mouse neocortex
NATURE COMMUNICATIONS
2015; 6: 6644
Abstract
The neocortex is a mammalian-specific structure that is responsible for higher functions such as cognition, emotion and perception. To gain insight into its evolution and the gene regulatory codes that pattern it, we studied the overlap of its active developmental enhancers with transposable element (TE) families and compared this overlap to uniformly shuffled enhancers. Here we show a striking enrichment of the MER130 repeat family among active enhancers in the mouse dorsal cerebral wall, which gives rise to the neocortex, at embryonic day 14.5. We show that MER130 instances preserve a common code of transcriptional regulatory logic, function as enhancers and are adjacent to critical neocortical genes. MER130, a nonautonomous interspersed TE, originates in the tetrapod or possibly Sarcopterygii ancestor, which far predates the appearance of the neocortex. Our results show that MER130 elements were recruited, likely through their common regulatory logic, as neocortical enhancers.
View details for DOI 10.1038/ncomms7644
View details for Web of Science ID 000353040900001
View details for PubMedID 25806706
-
Microbiota modulate transcription in the intestinal epithelium without remodeling the accessible chromatin landscape.
Genome research
2014; 24 (9): 1504-1516
Abstract
Microbiota regulate intestinal physiology by modifying host gene expression along the length of the intestine, but the underlying regulatory mechanisms remain unresolved. Transcriptional specificity occurs through interactions between transcription factors (TFs) and cis-regulatory regions (CRRs) characterized by nucleosome-depleted accessible chromatin. We profiled transcriptome and accessible chromatin landscapes in intestinal epithelial cells (IECs) from mice reared in the presence or absence of microbiota. We show that regional differences in gene transcription along the intestinal tract were accompanied by major alterations in chromatin accessibility. Surprisingly, we discovered that microbiota modify host gene transcription in IECs without significantly impacting the accessible chromatin landscape. Instead, microbiota regulation of host gene transcription might be achieved by differential expression of specific TFs and enrichment of their binding sites in nucleosome-depleted CRRs near target genes. Our results suggest that the chromatin landscape in IECs is preprogrammed by the host in a region-specific manner to permit responses to microbiota through binding of open CRRs by specific TFs.
View details for DOI 10.1101/gr.165845.113
View details for PubMedID 24963153
-
Automated discovery of tissue-targeting enhancers and transcription factors from binding motif and gene function data.
PLoS computational biology
2014; 10 (1): e1003449
Abstract
Identifying enhancers regulating gene expression remains an important and challenging task. While recent sequencing-based methods provide epigenomic characteristics that correlate well with enhancer activity, it remains onerous to comprehensively identify all enhancers across development. Here we introduce a computational framework to identify tissue-specific enhancers evolving under purifying selection. First, we incorporate high-confidence binding site predictions with target gene functional enrichment analysis to identify transcription factors (TFs) likely functioning in a particular context. We then search the genome for clusters of binding sites for these TFs, overcoming previous constraints associated with biased manual curation of TFs or enhancers. Applying our method to the placenta, we find 33 known and implicate 17 novel TFs in placental function, and discover 2,216 putative placenta enhancers. Using luciferase reporter assays, 31/36 (86%) tested candidates drive activity in placental cells. Our predictions agree well with recent epigenomic data in human and mouse, yet over half our loci, including 7/8 (87%) tested regions, are novel. Finally, we establish that our method is generalizable by applying it to 5 additional tissues: heart, pancreas, blood vessel, bone marrow, and liver.
View details for DOI 10.1371/journal.pcbi.1003449
View details for PubMedID 24499934
View details for PubMedCentralID PMC3907286
-
Structure-aided prediction of mammalian transcription factor complexes in conserved non-coding elements.
Philosophical transactions of the Royal Society of London. Series B, Biological sciences
2013; 368 (1632): 20130029
Abstract
Mapping the DNA-binding preferences of transcription factor (TF) complexes is critical for deciphering the functions of cis-regulatory elements. Here, we developed a computational method that compares co-occurring motif spacings in conserved versus unconserved regions of the human genome to detect evolutionarily constrained binding sites of rigid TF complexes. Structural data were used to estimate TF complex physical plausibility, explore overlapping motif arrangements seldom tackled by non-structure-aware methods, and generate and analyse three-dimensional models of the predicted complexes bound to DNA. Using this approach, we predicted 422 physically realistic TF complex motifs at 18% false discovery rate, the majority of which (326, 77%) contain some sequence overlap between binding sites. The set of mostly novel complexes is enriched in known composite motifs, predictive of binding site configurations in TF-TF-DNA crystal structures, and supported by ChIP-seq datasets. Structural modelling revealed three cooperativity mechanisms: direct protein-protein interactions, potentially indirect interactions and 'through-DNA' interactions. Indeed, 38% of the predicted complexes were found to contain four or more bases in which TF pairs appear to synergize through overlapping binding to the same DNA base pairs in opposite grooves or strands. Our TF complex and associated binding site predictions are available as a web resource at http://bejerano.stanford.edu/complex.
View details for DOI 10.1098/rstb.2013.0029
View details for PubMedID 24218641
View details for PubMedCentralID PMC3826502
-
A Penile Spine/Vibrissa Enhancer Sequence Is Missing in Modern and Extinct Humans but Is Retained in Multiple Primates with Penile Spines and Sensory Vibrissae
PLOS ONE
2013; 8 (12)
Abstract
Previous studies show that humans have a large genomic deletion downstream of the Androgen Receptor gene that eliminates an ancestral mammalian regulatory enhancer that drives expression in developing penile spines and sensory vibrissae. Here we use a combination of large-scale sequence analysis and PCR amplification to demonstrate that the penile spine/vibrissa enhancer is missing in all humans surveyed and in the Neandertal and Denisovan genomes, but is present in DNA samples of chimpanzees and bonobos, as well as in multiple other great apes and primates that maintain some form of penile integumentary appendage and facial vibrissae. These results further strengthen the association between the presence of the penile spine/vibrissa enhancer and the presence of penile spines and macro- or micro- vibrissae in non-human primates as well as show that loss of the enhancer is both a distinctive and characteristic feature of the human lineage.
View details for DOI 10.1371/journal.pone.0084258
View details for Web of Science ID 000328741900040
View details for PubMedID 24367647
View details for PubMedCentralID PMC3868586
-
Computational methods to detect conserved non-genic elements in phylogenetically isolated genomes: application to zebrafish.
Nucleic acids research
2013; 41 (15)
Abstract
Many important model organisms for biomedical and evolutionary research have sequenced genomes, but occupy a phylogenetically isolated position, evolutionarily distant from other sequenced genomes. This phylogenetic isolation is exemplified for zebrafish, a vertebrate model for cis-regulation, development and human disease, whose evolutionary distance to all other currently sequenced fish exceeds the distance between human and chicken. Such large distances make it difficult to align genomes and use them for comparative analysis beyond gene-focused questions. In particular, detecting conserved non-genic elements (CNEs) as promising cis-regulatory elements with biological importance is challenging. Here, we develop a general comparative genomics framework to align isolated genomes and to comprehensively detect CNEs. Our approach integrates highly sensitive and quality-controlled local alignments and uses alignment transitivity and ancestral reconstruction to bridge large evolutionary distances. We apply our framework to zebrafish and demonstrate substantially improved CNE detection and quality compared with previous sets. Our zebrafish CNE set comprises 54 533 CNEs, of which 11 792 (22%) are conserved to human or mouse. Our zebrafish CNEs (http://zebrafish.stanford.edu) are highly enriched in known enhancers and extend existing experimental (ChIP-Seq) sets. The same framework can now be applied to the isolated genomes of frog, amphioxus, Caenorhabditis elegans and many others.
View details for DOI 10.1093/nar/gkt557
View details for PubMedID 23814184
-
The Enhancer Landscape during Early Neocortical Development Reveals Patterns of Dense Regulation and Co-option.
PLoS genetics
2013; 9 (8): e1003728
Abstract
Genetic studies have identified a core set of transcription factors and target genes that control the development of the neocortex, the region of the human brain responsible for higher cognition. The specific regulatory interactions between these factors, many key upstream and downstream genes, and the enhancers that mediate all these interactions remain mostly uncharacterized. We perform p300 ChIP-seq to identify over 6,600 candidate enhancers active in the dorsal cerebral wall of embryonic day 14.5 (E14.5) mice. Over 95% of the peaks we measure are conserved to human. Eight of ten (80%) candidates tested using mouse transgenesis drive activity in restricted laminar patterns within the neocortex. GREAT based computational analysis reveals highly significant correlation with genes expressed at E14.5 in key areas for neocortex development, and allows the grouping of enhancers by known biological functions and pathways for further studies. We find that multiple genes are flanked by dozens of candidate enhancers each, including well-known key neocortical genes as well as suspected and novel genes. Nearly a quarter of our candidate enhancers are conserved well beyond mammals. Human and zebrafish regions orthologous to our candidate enhancers are shown to most often function in other aspects of central nervous system development. Finally, we find strong evidence that specific interspersed repeat families have contributed potentially key developmental enhancers via co-option. Our analysis expands the methodologies available for extracting the richness of information found in genome-wide functional maps.
View details for DOI 10.1371/journal.pgen.1003728
View details for PubMedID 24009522
View details for PubMedCentralID PMC3757057
-
PRISM offers a comprehensive genomic approach to transcription factor function prediction.
Genome research
2013; 23 (5): 889-904
Abstract
The human genome encodes 1500-2000 different transcription factors (TFs). ChIP-seq is revealing the global binding profiles of a fraction of TFs in a fraction of their biological contexts. These data show that the majority of TFs bind directly next to a large number of context-relevant target genes, that most binding is distal, and that binding is context specific. Because of the effort and cost involved, ChIP-seq is seldom used in search of novel TF function. Such exploration is instead done using expression perturbation and genetic screens. Here we propose a comprehensive computational framework for transcription factor function prediction. We curate 332 high-quality nonredundant TF binding motifs that represent all major DNA binding domains, and improve cross-species conserved binding site prediction to obtain 3.3 million conserved, mostly distal, binding site predictions. We combine these with 2.4 million facts about all human and mouse gene functions, in a novel statistical framework, in search of enrichments of particular motifs next to groups of target genes of particular functions. Rigorous parameter tuning and a harsh null are used to minimize false positives. Our novel PRISM (predicting regulatory information from single motifs) approach obtains 2543 TF function predictions in a large variety of contexts, at a false discovery rate of 16%. The predictions are highly enriched for validated TF roles, and 45 of 67 (67%) tested binding site regions in five different contexts act as enhancers in functionally matched cells.
View details for DOI 10.1101/gr.139071.112
View details for PubMedID 23382538
View details for PubMedCentralID PMC3638144
-
Enhancers: five essential questions
NATURE REVIEWS GENETICS
2013; 14 (4): 288-295
Abstract
It is estimated that the human genome contains hundreds of thousands of enhancers, so understanding these gene-regulatory elements is a crucial goal. Several fundamental questions need to be addressed about enhancers, such as how do we identify them all, how do they work, and how do they contribute to disease and evolution? Five prominent researchers in this field look at how much we know already and what needs to be done to answer these questions.
View details for Web of Science ID 000316975300012
View details for PubMedID 23503198
-
Evolutionary Biology for the 21st Century
PLOS BIOLOGY
2013; 11 (1)
View details for DOI 10.1371/journal.pbio.1001466
View details for Web of Science ID 000314648700006
View details for PubMedID 23319892
View details for PubMedCentralID PMC3539946
-
PESNPdb: A comprehensive database of SNPs studied in association with pre-eclampsia
PLACENTA
2012; 33 (12): 1055-1057
Abstract
Pre-eclampsia is a pregnancy specific disorder that can be life threatening for mother and child. Multiple studies have been carried out in an attempt to identify SNPs that contribute to the genetic susceptibility of the disease. Here we describe PESNPdb (http://bejerano.stanford.edu/pesnpdb), a database aimed at centralizing SNP and study details investigated in association with pre-eclampsia. We also describe a Placenta Disorders ontology that utilizes information from PESNPdb. The main focus of PESNPdb is to help researchers study the genetic complexity of pre-eclampsia through a user-friendly interface that encourages community participation.
View details for DOI 10.1016/j.placenta.2012.09.016
View details for Web of Science ID 000312171900015
View details for PubMedID 23084601
-
Hundreds of conserved non-coding genomic regions are independently lost in mammals
NUCLEIC ACIDS RESEARCH
2012; 40 (22): 11463-11476
Abstract
Conserved non-protein-coding DNA elements (CNEs) often encode cis-regulatory elements and are rarely lost during evolution. However, CNE losses that do occur can be associated with phenotypic changes, exemplified by pelvic spine loss in sticklebacks. Using a computational strategy to detect complete loss of CNEs in mammalian genomes while strictly controlling for artifacts, we find >600 CNEs that are independently lost in at least two mammalian lineages, including a spinal cord enhancer near GDF11. We observed several genomic regions where multiple independent CNE loss events happened; the most extreme is the DIAPH2 locus. We show that CNE losses often involve deletions and that CNE loss frequencies are non-uniform. Similar to less pleiotropic enhancers, we find that independently lost CNEs are shorter, slightly less constrained and evolutionarily younger than CNEs without detected losses. This suggests that independently lost CNEs are less pleiotropic and that pleiotropic constraints contribute to non-uniform CNE loss frequencies. We also detected 35 CNEs that are independently lost in the human lineage and in other mammals. Our study uncovers an interesting aspect of the evolution of functional DNA in mammalian genomes. Experiments are necessary to test if these independently lost CNEs are associated with parallel phenotype changes in mammals.
View details for DOI 10.1093/nar/gks905
View details for Web of Science ID 000313414800031
View details for PubMedID 23042682
View details for PubMedCentralID PMC3526296
-
A "Forward Genomics" Approach Links Genotype to Phenotype using Independent Phenotypic Losses among Related Species
CELL REPORTS
2012; 2 (4): 817-823
Abstract
Genotype-phenotype mapping is hampered by countless genomic changes between species. We introduce a computational "forward genomics" strategy that, given only an independently lost phenotype and whole genomes, matches genomic and phenotypic loss patterns to associate specific genomic regions with this phenotype. We conducted genome-wide screens for two metabolic phenotypes. First, our approach correctly matches the inactivated Gulo gene exactly with the species that lost the ability to synthesize vitamin C. Second, we attribute naturally low biliary phospholipid levels in guinea pigs and horses to the inactivated phospholipid transporter Abcb4. Human ABCB4 mutations also result in low phospholipid levels but lead to severe liver disease, suggesting compensatory mechanisms in guinea pig and horse. Our simulation studies, counts of independent changes in existing phenotype surveys, and the forthcoming availability of many new genomes all suggest that forward genomics can be applied to many phenotypes, including those relevant for human evolution and disease.
View details for DOI 10.1016/j.celrep.2012.08.032
View details for Web of Science ID 000314455600014
View details for PubMedID 23022484
View details for PubMedCentralID PMC3572205
-
Human Developmental Enhancers Conserved between Deuterostomes and Protostomes
PLOS GENETICS
2012; 8 (8)
Abstract
The identification of homologies, whether morphological, molecular, or genetic, is fundamental to our understanding of common biological principles. Homologies bridging the great divide between deuterostomes and protostomes have served as the basis for current models of animal evolution and development. It is now appreciated that these two clades share a common developmental toolkit consisting of conserved transcription factors and signaling pathways. These patterning genes sometimes show common expression patterns and genetic interactions, suggesting the existence of similar or even conserved regulatory apparatus. However, previous studies have found no regulatory sequence conserved between deuterostomes and protostomes. Here we describe the first such enhancers, which we call bilaterian conserved regulatory elements (Bicores). Bicores show conservation of sequence and gene synteny. Sequence conservation of Bicores reflects conserved patterns of transcription factor binding sites. We predict that Bicores act as response elements to signaling pathways, and we show that Bicores are developmental enhancers that drive expression of transcriptional repressors in the vertebrate central nervous system. Although the small number of identified Bicores suggests extensive rewiring of cis-regulation between the protostome and deuterostome clades, additional Bicores may be revealed as our understanding of cis-regulatory logic and sample of bilaterian genomes continue to grow.
View details for DOI 10.1371/journal.pgen.1002852
View details for Web of Science ID 000308529300014
View details for PubMedID 22876195
View details for PubMedCentralID PMC3410860
-
A novel 13 base pair insertion in the sonic hedgehog ZRS limb enhancer (ZRS/LMBR1) causes preaxial polydactyly with triphalangeal thumb
HUMAN MUTATION
2012; 33 (7): 1063-1066
Abstract
Mutations in the Sonic hedgehog limb enhancer, the zone of polarizing activity regulatory sequence (located within the gene LMBR1 and commonly called the ZRS), cause limb malformations. In humans, three classes of mutations have been proposed based on the limb phenotype: single base changes throughout the region cause preaxial polydactyly (PPD), single base changes at one specific site cause Werner mesomelic syndrome, and large duplications cause polysyndactyly. This study presents a novel mutation: a small insertion. In a Swedish family with autosomal-dominant PPD, we found a 13 base pair insertion within the ZRS, NG_009240.1:g.106934_106935insTAAGGAAGTGATT (traditional nomenclature: ZRS603ins13). Computational transcription factor-binding site predictions suggest that this insertion creates new binding sites and a mouse enhancer assay shows that this insertion causes ectopic gene expression. This study is the first to discover a small insertion in an enhancer that causes a human limb malformation and suggests a potential mechanism that could explain the ectopic expression caused by this mutation.
View details for DOI 10.1002/humu.22097
View details for Web of Science ID 000304815100010
View details for PubMedID 22495965
-
Coding exons function as tissue-specific enhancers of nearby genes
GENOME RESEARCH
2012; 22 (6): 1059-1068
Abstract
Enhancers are essential gene regulatory elements whose alteration can lead to morphological differences between species, developmental abnormalities, and human disease. Current strategies to identify enhancers focus primarily on noncoding sequences and tend to exclude protein coding sequences. Here, we analyzed 25 available ChIP-seq data sets that identify enhancers in an unbiased manner (H3K4me1, H3K27ac, and EP300) for peaks that overlap exons. We find that, on average, 7% of all ChIP-seq peaks overlap coding exons (after excluding peaks that overlap first exons). By using mouse and zebrafish enhancer assays, we demonstrate that several of these exonic enhancer (eExon) candidates can function as enhancers of their neighboring genes and that the exonic sequence is necessary for enhancer activity. Using ChIP, 3C, and DNA FISH, we further show that one of these exonic limb enhancers, Dync1i1 exon 15, has active enhancer marks and physically interacts with Dlx5/6 promoter regions 900 kb away. In addition, its removal by chromosomal abnormalities in humans could cause split hand and foot malformation 1 (SHFM1), a disorder associated with DLX5/6. These results demonstrate that DNA sequences can have a dual function, operating as coding exons in one tissue and enhancers of nearby gene(s) in another tissue, suggesting that phenotypes resulting from coding mutations could be caused not only by protein alteration but also by disrupting the regulation of another gene.
View details for DOI 10.1101/gr.133546.111
View details for Web of Science ID 000304728100007
View details for PubMedID 22442009
View details for PubMedCentralID PMC3371700
-
Control of Pelvic Girdle Development by Genes of the Pbx Family and Emx2
DEVELOPMENTAL DYNAMICS
2011; 240 (5): 1173-1189
Abstract
Genes expressed in the somatopleuric mesoderm, the embryonic domain giving rise to the vertebrate pelvis, appear important for pelvic girdle formation. Among such genes, Pbx family members and Emx2 were found to genetically interact in hindlimb and pectoral girdle formation. Here, we generated compound mutant embryos carrying combinations of mutated alleles for Pbx1, Pbx2, and Pbx3, as well as Pbx1 and Emx2, to examine potential genetic interactions during pelvic development. Indeed, Pbx genes share overlapping functions and Pbx1 and Emx2 genetically interact in pelvic formation. We show that, in compound Pbx1;Pbx2 and Pbx1;Emx2 mutants, pelvic mesenchymal condensation is markedly perturbed, indicative of an upstream control by these homeoproteins. We establish that expression of Tbx15, Prrx1, and Pax1, among other genes involved in the specification and development of select pelvic structures, is altered in our compound mutants. Lastly, we identify potential Pbx1-Emx2-regulated enhancers for Tbx15, Prrx1, and Pax1, using bioinformatics analyses.
View details for DOI 10.1002/dvdy.22617
View details for Web of Science ID 000289942300023
View details for PubMedID 21455939
View details for PubMedCentralID PMC3081414
-
Human-specific loss of regulatory DNA and the evolution of human-specific traits
NATURE
2011; 471 (7337): 216-219
Abstract
Humans differ from other animals in many aspects of anatomy, physiology, and behaviour; however, the genotypic basis of most human-specific traits remains unknown. Recent whole-genome comparisons have made it possible to identify genes with elevated rates of amino acid change or divergent expression in humans, and non-coding sequences with accelerated base pair changes. Regulatory alterations may be particularly likely to produce phenotypic effects while preserving viability, and are known to underlie interesting evolutionary differences in other species. Here we identify molecular events particularly likely to produce significant regulatory changes in humans: complete deletion of sequences otherwise highly conserved between chimpanzees and other mammals. We confirm 510 such deletions in humans, which fall almost exclusively in non-coding regions and are enriched near genes involved in steroid hormone signalling and neural function. One deletion removes a sensory vibrissae and penile spine enhancer from the human androgen receptor (AR) gene, a molecular change correlated with anatomical loss of androgen-dependent sensory vibrissae and penile spines in the human lineage. Another deletion removes a forebrain subventricular zone enhancer near the tumour suppressor gene growth arrest and DNA-damage-inducible, gamma (GADD45G), a loss correlated with expansion of specific brain regions in humans. Deletions of tissue-specific enhancers may thus accompany both loss and gain traits in the human lineage, and provide specific examples of the kinds of regulatory alterations and inactivation events long proposed to have an important role in human evolutionary divergence.
View details for DOI 10.1038/nature09774
View details for Web of Science ID 000288170200037
View details for PubMedID 21390129
View details for PubMedCentralID PMC3071156
-
Noninvasive Monitoring of Placenta-Specific Transgene Expression by Bioluminescence Imaging
PLOS ONE
2011; 6 (1)
Abstract
Placental dysfunction underlies numerous complications of pregnancy. A major obstacle to understanding the roles of potential mediators of placental pathology has been the absence of suitable methods for tissue-specific gene manipulation and sensitive assays for studying gene functions in the placentas of intact animals. We describe a sensitive and noninvasive method of repetitively tracking placenta-specific gene expression throughout pregnancy using lentivirus-mediated transduction of optical reporter genes in mouse blastocysts. Zona-free blastocysts were incubated with lentivirus expressing firefly luciferase (Fluc) and Tomato fluorescent fusion protein for trophectoderm-specific infection and transplanted into day 3 pseudopregnant recipients (GD3). Animals were examined for Fluc expression by live bioluminescence imaging (BLI) at different points during pregnancy, and the placentas were examined for Tomato expression in different cell types on GD18. In another set of experiments, blastocysts with maximum photon fluxes in the range of 2.0E+4 to 6.0E+4 p/s/cm(2)/sr were transferred. Fluc expression was detectable in all surrogate dams by day 5 of pregnancy by live imaging, and the signal increased dramatically thereafter each day until GD12, reaching a peak at GD16 and maintaining that level through GD18. All of the placentas, but none of the fetuses, analyzed on GD18 by BLI showed different degrees of Fluc expression. However, only placentas of dams transferred with selected blastocysts showed uniform photon distribution with no significant variability of photon intensity among placentas of the same litter. Tomato expression in the placentas was limited to only trophoblast cell lineages. These results, for the first time, demonstrate the feasibility of selecting lentivirally-transduced blastocysts for uniform gene expression in all placentas of the same litter and early detection and quantitative analysis of gene expression throughout pregnancy by live BLI. This method may be useful for a wide range of applications involving trophoblast-specific gene manipulations in utero.
View details for DOI 10.1371/journal.pone.0016348
View details for PubMedID 21283713
-
Human-specific loss of an androgen receptor enhancer is associated with the loss of vibrissae and penile spines
80th Annual Meeting of the American-Association-of-Physical-Anthropologists
WILEY-BLACKWELL. 2011: 252–252
View details for Web of Science ID 000288034000703
-
Endangered Species Hold Clues to Human Evolution
JOURNAL OF HEREDITY
2010; 101 (4): 437-447
Abstract
We report that 18 conserved, and by extension functional, elements in the human genome are the result of retroposon insertions that are evolving under purifying selection in mammals. We show evidence that 1 of the 18 elements regulates the expression of ASXL3 during development by encoding an alternatively spliced exon that causes nonsense-mediated decay of the transcript. The retroposon that gave rise to these functional elements was quickly inactivated in the mammalian ancestor, and all traces of it have been lost due to neutral decay. However, the tuatara has maintained a near-ancestral version of this retroposon in its extant genome, which allows us to connect the 18 human elements to the evolutionary events that created them. We propose that conservation efforts over more than 100 years may not have only prevented the tuatara from going extinct but could have preserved our ability to understand the evolutionary history of functional elements in the human genome. Through simulations, we argue that species with historically low population sizes are more likely to harbor ancient mobile elements for long periods of time and in near-ancestral states, making these species indispensable in understanding the evolutionary origin of functional elements in the human genome.
View details for DOI 10.1093/jhered/esq016
View details for Web of Science ID 000279430300005
View details for PubMedID 20332163
-
GREAT improves functional interpretation of cis-regulatory regions
NATURE BIOTECHNOLOGY
2010; 28 (5): 495-U155
Abstract
We developed the Genomic Regions Enrichment of Annotations Tool (GREAT) to analyze the functional significance of cis-regulatory regions identified by localized measurements of DNA binding events across an entire genome. Whereas previous methods took into account only binding proximal to genes, GREAT is able to properly incorporate distal binding sites and control for false positives using a binomial test over the input genomic regions. GREAT incorporates annotations from 20 ontologies and is available as a web application. Applying GREAT to data sets from chromatin immunoprecipitation coupled with massively parallel sequencing (ChIP-seq) of multiple transcription-associated factors, including SRF, NRSF, GABP, Stat3 and p300 in different developmental contexts, we recover many functions of these factors that are missed by existing gene-based tools, and we generate testable hypotheses. The utility of GREAT is not limited to ChIP-seq, as it could also be applied to open chromatin, localized epigenomic markers and similar functional data sets, as well as comparative genomics sets.
View details for DOI 10.1038/nbt.1630
View details for Web of Science ID 000277452700030
View details for PubMedID 20436461
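The binomial test mentioned in the GREAT abstract can be illustrated with a minimal sketch. It assumes that an annotation's regulatory domains cover a fraction `p` of the genome and that `k` of `n` input regions fall inside them; the function name and parameters are hypothetical and are not taken from the GREAT software.

```python
from math import exp, lgamma, log


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    n: number of input genomic regions
    k: number of those regions that hit the annotation's domain
    p: assumed fraction of the genome covered by that domain
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    total = 0.0
    for i in range(k, n + 1):
        # Work in log space so large n does not overflow.
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1.0 - p)
        )
        total += exp(log_pmf)
    return min(total, 1.0)
```

A small tail probability suggests that the input regions land in the annotation's domain more often than its genomic coverage alone would predict.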
-
Dispensability of mammalian DNA
GENOME RESEARCH
2008; 18 (11): 1743-1751
Abstract
In the lab, the cis-regulatory network seems to exhibit great functional redundancy. Many experiments testing enhancer activity of neighboring cis-regulatory elements show largely overlapping expression domains. Of recent interest, mice in which cis-regulatory ultraconserved elements were knocked out showed no obvious phenotype, further suggesting functional redundancy. Here, we present a global evolutionary analysis of mammalian conserved nonexonic elements (CNEs), and find strong evidence to the contrary. Given a set of CNEs conserved between several mammals, we characterize functional dispensability as the propensity for the ancestral element to be lost in mammalian species internal to the spanned species tree. We show that ultraconserved-like elements are over 300-fold less likely than neutral DNA to have been lost during rodent evolution. In fact, many thousands of noncoding loci under purifying selection display near uniform indispensability during mammalian evolution, largely irrespective of nucleotide conservation level. These findings suggest that many genomic noncoding elements possess functions that contribute noticeably to organism fitness in naturally evolving populations.
View details for DOI 10.1101/gr.080184.108
View details for Web of Science ID 000260536100007
View details for PubMedID 18832441
View details for PubMedCentralID PMC2577864
-
Human genome ultraconserved elements are ultraselected
SCIENCE
2007; 317 (5840): 915-915
Abstract
Ultraconserved elements in the human genome are defined as stretches of at least 200 base pairs of DNA that match identically with corresponding regions in the mouse and rat genomes. Most ultraconserved elements are noncoding and have been evolutionarily conserved since mammal and bird ancestors diverged over 300 million years ago. The reason for this extreme conservation remains a mystery. It has been speculated that they are mutational cold spots or regions where every site is under weak but still detectable negative selection. However, analysis of the derived allele frequency spectrum shows that these regions are in fact under negative selection that is much stronger than that in protein coding genes.
View details for DOI 10.1126/science.1142430
View details for Web of Science ID 000248780200030
View details for PubMedID 17702936
- Thousands of human mobile element fragments undergo strong purifying selection near developmental genes Proc. Natl Acad. Sci. USA 2007; 104 (19): 8005-8010
-
Comparative genomic analysis using the UCSC genome browser.
Methods in molecular biology (Clifton, N.J.)
2007; 395: 17-34
Abstract
Comparative analysis of DNA sequence from multiple species can provide insights into the function and evolutionary processes that shape genomes. The University of California Santa Cruz (UCSC) Genome Bioinformatics group has developed several tools and methodologies in its study of comparative genomics, many of which have been incorporated into the UCSC Genome Browser (http://genome.ucsc.edu), an easy-to-use online tool for browsing genomic data and aligned annotation "tracks" in a single window. The comparative genomics annotations in the browser include pairwise alignments, which aid in the identification of orthologous regions between species, and conservation tracks that show measures of evolutionary conservation among sets of multiply aligned species, highlighting regions of the genome that may be functionally important. A related tool, the UCSC Table Browser, provides a simple interface for querying, analyzing, and downloading the data underlying the Genome Browser annotation tracks. Here, we describe a procedure for examining a genomic region of interest in the Genome Browser, analyzing characteristics of the region, filtering the data, and downloading data sets for further study.
View details for PubMedID 17993665
-
Branch and bound computation of exact p-values
BIOINFORMATICS
2006; 22 (17): 2158-2159
Abstract
P-value computation is often used in bioinformatics to quantify the surprise, or significance, associated with a given observation. An implementation is provided that computes the exact p-value associated with any observed sample, against a null multinomial distribution, using the likelihood-ratio statistic. The efficient branch and bound code, far exceeding the full enumeration implemented by commercial packages, is especially useful with small sample, sparse data and rare events, common scenarios in bioinformatics, where approximations are often inaccurate and inappropriate. This code base can also be adapted to compute exact p-values of other statistics in diverse sampling scenarios. Freely available at http://www.soe.ucsc.edu/~jill/src/.
View details for DOI 10.1093/bioinformatics/btl357
View details for Web of Science ID 000240433100015
View details for PubMedID 16895926
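To make the quantity in this abstract concrete, here is a minimal sketch of an exact multinomial p-value under the likelihood-ratio statistic. It uses plain full enumeration, the baseline the abstract contrasts with, and not the authors' branch and bound method; all function names are hypothetical, and every null probability is assumed to be positive.

```python
from math import exp, lgamma, log


def compositions(n, k):
    """Yield every k-tuple of non-negative integers that sums to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest


def log_prob(counts, null):
    """Log multinomial probability of `counts` under null probabilities."""
    out = lgamma(sum(counts) + 1)
    for c, q in zip(counts, null):
        out += c * log(q) - lgamma(c + 1)
    return out


def g_stat(counts, null):
    """Likelihood-ratio statistic G = 2 * sum c_i * ln(c_i / (n * q_i))."""
    n = sum(counts)
    return 2.0 * sum(c * log(c / (n * q)) for c, q in zip(counts, null) if c > 0)


def exact_p_value(observed, null, eps=1e-9):
    """Sum the null probability of all samples at least as extreme as observed."""
    n, k = sum(observed), len(observed)
    g_obs = g_stat(observed, null)
    return sum(
        exp(log_prob(y, null))
        for y in compositions(n, k)
        if g_stat(y, null) >= g_obs - eps  # eps absorbs floating-point ties
    )
```

The number of enumerated samples grows combinatorially with sample size and category count, which is the cost a branch and bound search is designed to avoid.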
-
Identification and classification of conserved RNA secondary structures in the human genome
PLOS COMPUTATIONAL BIOLOGY
2006; 2 (4): 251-262
Abstract
The discoveries of microRNAs and riboswitches, among others, have shown functional RNAs to be biologically more important and genomically more prevalent than previously anticipated. We have developed a general comparative genomics method based on phylogenetic stochastic context-free grammars for identifying functional RNAs encoded in the human genome and used it to survey an eight-way genome-wide alignment of the human, chimpanzee, mouse, rat, dog, chicken, zebra-fish, and puffer-fish genomes for deeply conserved functional RNAs. At a loose threshold for acceptance, this search resulted in a set of 48,479 candidate RNA structures. This screen finds a large number of known functional RNAs, including 195 miRNAs, 62 histone 3'UTR stem loops, and various types of known genetic recoding elements. Among the highest-scoring new predictions are 169 new miRNA candidates, as well as new candidate selenocysteine insertion sites, RNA editing hairpins, RNAs involved in transcript auto regulation, and many folds that form singletons or small functional RNA families of completely unknown function. While the rate of false positives in the overall set is difficult to estimate and is likely to be substantial, the results nevertheless provide evidence for many new human functional RNAs and present specific predictions to facilitate their further characterization.
View details for DOI 10.1371/journal.pcbi.0020033
View details for Web of Science ID 000239493800005
View details for PubMedID 16628248
-
The UCSC Genome Browser Database: update 2006
NUCLEIC ACIDS RESEARCH
2006; 34: D590-D598
Abstract
The University of California Santa Cruz Genome Browser Database (GBD) contains sequence and annotation data for the genomes of about a dozen vertebrate species and several major model organisms. Genome annotations typically include assembly data, sequence composition, genes and gene predictions, mRNA and expressed sequence tag evidence, comparative genomics, regulation, expression and variation data. The database is optimized to support fast interactive performance with web tools that provide powerful visualization and querying capabilities for mining the data. The Genome Browser displays a wide variety of annotations at all scales from single nucleotide level up to a full chromosome. The Table Browser provides direct access to the database tables and sequence data, enabling complex queries on genome-wide datasets. The Proteome Browser graphically displays protein properties. The Gene Sorter allows filtering and comparison of genes by several metrics including expression data and several gene properties. BLAT and In Silico PCR search for sequences in entire genomes in seconds. These tools are highly integrated and provide many hyperlinks to other databases and websites. The GBD, browsing tools, downloadable data files and links to documentation and other information can be found at http://genome.ucsc.edu/.
View details for DOI 10.1093/nar/gkj144
View details for Web of Science ID 000239307700126
View details for PubMedID 16381938
- A Distal Enhancer and an Ultraconserved Exon are Derived From a Novel Retroposon Nature 2006; 441 (7089): 87-90
- Forces Shaping the Fastest Evolving Regions in the Human Genome PLoS Genetics 2006; 2 (10): e168
-
Computational screening of conserved genomic DNA in search of functional noncoding elements
NATURE METHODS
2005; 2 (7): 535-545
View details for Web of Science ID 000230165700018
View details for PubMedID 16170870
-
Ultraconserved elements in insect genomes: A highly conserved intronic sequence implicated in the control of homothorax mRNA splicing
GENOME RESEARCH
2005; 15 (6): 800-808
Abstract
Recently, we identified a large number of ultraconserved (uc) sequences in noncoding regions of human, mouse, and rat genomes that appear to be essential for vertebrate and amniote ontogeny. Here, we used similar methods to identify ultraconserved genomic regions between the insect species Drosophila melanogaster and Drosophila pseudoobscura, as well as the more distantly related Anopheles gambiae. As with vertebrates, ultraconserved sequences in insects appear to occur primarily in intergenic and intronic sequences, and at intron-exon junctions. The sequences are significantly associated with genes encoding developmental regulators and transcription factors, but are less frequent and are smaller in size than in vertebrates. The longest identical, nongapped orthologous match between the three genomes was found within the homothorax (hth) gene. This sequence spans an internal exon-intron junction, with the majority located within the intron, and is predicted to form a highly stable stem-loop RNA structure. Real-time quantitative PCR analysis of different hth splice isoforms and Northern blotting showed that the conserved element is associated with a high incidence of intron retention in hth pre-mRNA, suggesting that the conserved intronic element is critically important in the post-transcriptional regulation of hth expression in Diptera.
View details for DOI 10.1101/gr.3545105
View details for Web of Science ID 000229623100005
View details for PubMedID 15899965
- Evolutionarily Conserved Elements in Vertebrate, Fly, Worm, and Yeast Genomes Genome Research 2005; 15 (8): 1034-1050
-
Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution
NATURE
2004; 432 (7018): 695-716
Abstract
We present here a draft genome sequence of the red jungle fowl, Gallus gallus. Because the chicken is a modern descendant of the dinosaurs and the first non-mammalian amniote to have its genome sequenced, the draft sequence of its genome, composed of approximately one billion base pairs of sequence and an estimated 20,000-23,000 genes, provides a new perspective on vertebrate genome evolution, while also improving the annotation of mammalian genomes. For example, the evolutionary distance between chicken and human provides high specificity in detecting functional elements, both non-coding and coding. Notably, many conserved non-coding sequences are far from genes and cannot be assigned to defined functional classes. In coding regions the evolutionary dynamics of protein domains and orthologous groups illustrate processes that distinguish the lineages leading to birds and mammals. The distinctive properties of avian microchromosomes, together with the inferred patterns of conserved synteny, provide additional insights into vertebrate chromosome architecture.
View details for DOI 10.1038/nature03154
View details for Web of Science ID 000225597200038
View details for PubMedID 15592404
-
Into the heart of darkness: large-scale clustering of human non-coding DNA
BIOINFORMATICS
2004; 20: i40-i48
Abstract
It is currently believed that the human genome contains about twice as much non-coding functional regions as it does protein-coding genes, yet our understanding of these regions is very limited. We examine the intersection between syntenically conserved sequences in the human, mouse and rat genomes, and sequence similarities within the human genome itself, in search of families of non-protein-coding elements. For this purpose we develop a graph theoretic clustering algorithm, akin to the highly successful methods used in elucidating protein sequence family relationships. The algorithm is applied to a highly filtered set of about 700 000 human-rodent evolutionarily conserved regions, not resembling any known coding sequence, which encompasses 3.7% of the human genome. From these, we obtain roughly 12 000 non-singleton clusters, dense in significant sequence similarities. Further analysis of genomic location, evidence of transcription and RNA secondary structure reveals many clusters to be significantly homogeneous in one or more characteristics. This subset of the highly conserved non-protein-coding elements in the human genome thus contains rich family-like structures, which merit in-depth analysis. Supplementary material to this work is available at http://www.soe.ucsc.edu/~jill/dark.html
View details for DOI 10.1093/bioinformatics/bth946
View details for Web of Science ID 000208392400006
View details for PubMedID 15262779
-
Ultraconserved elements in the human genome
SCIENCE
2004; 304 (5675): 1321-1325
Abstract
There are 481 segments longer than 200 base pairs (bp) that are absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes. Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95 and 99% identity, respectively. Many are also significantly conserved in fish. These ultraconserved elements of the human genome are most often located either overlapping exons in genes involved in RNA processing or in introns or nearby genes involved in the regulation of transcription and development. Along with more than 5000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals, these represent a class of genetic elements whose functions and evolutionary origins are yet to be determined, but which are more highly conserved between these species than are proteins and appear to be essential for the ontogeny of mammals and other vertebrates.
View details for DOI 10.1126/science.1098119
View details for Web of Science ID 000221669600054
View details for PubMedID 15131266
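The definition in this abstract (100% identity with no insertions or deletions across the aligned genomes) can be illustrated with a minimal sketch that scans a gapped multiple alignment for long runs of identical columns. The function name, the input representation (equal-length strings with `-` for gaps) and the use of `>= min_len` as the length cutoff are assumptions for illustration, not the authors' pipeline.

```python
def ultraconserved_runs(aligned, min_len=200):
    """Return half-open (start, end) column ranges that are identical and gap-free.

    aligned: list of equal-length aligned sequences, '-' marking a gap
    min_len: minimum run length to report (hypothetical default)
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    runs = []
    start = None
    # Iterate one column past the end so a run touching the end is flushed.
    for i in range(length + 1):
        conserved = (
            i < length
            and aligned[0][i] != "-"
            and all(seq[i] == aligned[0][i] for seq in aligned)
        )
        if conserved:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs
```

For example, `ultraconserved_runs(["ACGTACGT", "ACGTTCGT", "ACGTACGT"], min_len=3)` reports the runs on either side of the single mismatching column.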
-
Algorithms for variable length Markov chain modeling
BIOINFORMATICS
2004; 20 (5): 788-U729
Abstract
We present a general purpose implementation of variable length Markov models. Contrary to fixed order Markov models, these models are not restricted to a predefined uniform depth. Rather, by examining the training data, a model is constructed that fits higher order Markov dependencies where such contexts exist, while using lower order Markov dependencies elsewhere. As both theoretical and experimental results show, these models are capable of capturing rich signals from a modest amount of training data, without the use of hidden states. The source code is freely available at http://www.soe.ucsc.edu/~jill/src/
View details for DOI 10.1093/bioinformatics/btg489
View details for Web of Science ID 000220485300025
View details for PubMedID 14751999
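The idea of using long contexts where the data support them and shorter ones elsewhere can be sketched as follows. This is a simplified illustration, assuming a plain count threshold (`min_count`) decides which contexts are trusted; it is not the authors' implementation, it applies no smoothing, and the class and parameter names are hypothetical.

```python
from collections import Counter, defaultdict


class VariableOrderMarkov:
    """Toy variable-order Markov model backed by context counts."""

    def __init__(self, max_depth=3, min_count=2):
        self.max_depth = max_depth
        self.min_count = min_count
        self.counts = defaultdict(Counter)  # context string -> next-symbol counts

    def fit(self, seq):
        # Record each symbol under every preceding context of length 0..max_depth.
        for i, sym in enumerate(seq):
            for d in range(min(self.max_depth, i) + 1):
                self.counts[seq[i - d:i]][sym] += 1
        return self

    def context_for(self, history):
        # Back off from the longest suffix to the empty context.
        for d in range(min(self.max_depth, len(history)), -1, -1):
            ctx = history[len(history) - d:]
            seen = self.counts.get(ctx)
            if seen is not None and sum(seen.values()) >= self.min_count:
                return ctx
        return ""

    def prob(self, history, sym):
        # Unsmoothed estimate of P(sym | longest trusted suffix of history).
        seen = self.counts.get(self.context_for(history), Counter())
        total = sum(seen.values())
        return seen[sym] / total if total else 0.0
```

After `VariableOrderMarkov(max_depth=2).fit("abababab")`, a history ending in an unseen pair such as `"xa"` falls back to the shorter context `"a"`, which is the behavior the abstract contrasts with fixed-order models.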
- Efficient exact p-value computation for small sample, sparse and surprising categorical data J. Computational Biology 2004; 11 (5): 867-886
-
Extremely conserved non-coding sequences in vertebrate genomes
4th International Conference on Bioinformatics of Genome Regulation and Structure (BGRS 2004)
RUSSIAN ACAD SCI SIBERIAN BRANCH. 2004: 138–140
View details for Web of Science ID 000242399100034
- Extremely conserved non-coding sequences in the vertebrate genomes Proceedings of 4th International Conference on Bioinformatics of Genome Regulation and Structure 2004; BGRS
- Sequencing and comparative analysis of the chicken genome Nature 2004; 432 (7018): 695-716
-
Discriminative feature selection via multiclass variable memory Markov model
EURASIP JOURNAL ON APPLIED SIGNAL PROCESSING
2003; 2003 (2): 93-102
View details for Web of Science ID 000183504000002
- A system for computer music generation by learning and improvisation in a particular style IEEE Computer J. 2003; 36 (10): 73-80
- Efficient exact p-value computation and applications to biosequence analysis Proceedings of the 7th annual international conference on research in computational molecular biology 2003; RECOMB
- Discriminative feature selection via multiclass variable memory Markov models EURASIP J. Applied Signal Processing 2003; 2: 93-102
- Discriminative feature selection via multiclass variable memory Markov models Proceedings of 19th International Conference on Machine Learning 2002; ICML
-
Markovian domain fingerprinting: statistical segmentation of protein sequences
3rd Georgia-Tech-Emory International Conference on Bioinformatics in Silico Biology: Bioinformatics after the Human Genome
OXFORD UNIV PRESS. 2001: 927–34
Abstract
Characterization of a protein family by its distinct sequence domains is crucial for functional annotation and correct classification of newly discovered proteins. Conventional Multiple Sequence Alignment (MSA) based methods find difficulties when faced with heterogeneous groups of proteins. However, even many families of proteins that do share a common domain contain instances of several other domains, without any common underlying linear ordering. Ignoring this modularity may lead to poor or even false classification results. An automated method that can analyze a group of proteins into the sequence domains it contains is therefore highly desirable. We apply a novel method to the problem of protein domain detection. The method takes as input an unaligned group of protein sequences. It segments them and clusters the segments into groups sharing the same underlying statistics. A Variable Memory Markov (VMM) model is built using a Prediction Suffix Tree (PST) data structure for each group of segments. Refinement is achieved by letting the PSTs compete over the segments, and a deterministic annealing framework infers the number of underlying PST models while avoiding many inferior solutions. We show that regions of similar statistics correlate well with protein sequence domains, by matching a unique signature to each domain. This is done in a fully automated manner, and does not require or attempt an MSA. Several representative cases are analyzed. We identify a protein fusion event, refine an HMM superfamily classification into the underlying families the HMM cannot separate, and detect all 12 instances of a short domain in a group of 396 sequences. Contact: jill@cs.huji.ac.il; tishby@cs.huji.ac.il.
View details for Web of Science ID 000171690300009
View details for PubMedID 11673237
-
Novel small RNA-encoding genes in the intergenic regions of Escherichia coli
CURRENT BIOLOGY
2001; 11 (12): 941-950
Abstract
Small, untranslated RNA molecules were identified initially in bacteria, but examples can be found in all kingdoms of life. These RNAs carry out diverse functions, and many of them are regulators of gene expression. Genes encoding small, untranslated RNAs are difficult to detect experimentally or to predict by traditional sequence analysis approaches. Thus, in spite of the rising recognition that such RNAs may play key roles in bacterial physiology, many of the small RNAs known to date were discovered fortuitously. To search the Escherichia coli genome sequence for genes encoding small RNAs, we developed a computational strategy employing transcription signals and genomic features of the known small RNA-encoding genes. The search, for which we used rather restrictive criteria, has led to the prediction of 24 putative sRNA-encoding genes, of which 23 were tested experimentally. Here we report on the discovery of 14 genes encoding novel small RNAs in E. coli and their expression patterns under a variety of physiological conditions. Most of the newly discovered RNAs are abundant. Interestingly, the expression level of a significant number of these RNAs increases upon entry into stationary phase. Based on our results, we conclude that small RNAs are much more widespread than previously imagined and that these versatile molecules may play important roles in the fine-tuning of cell responses to changing environments.
View details for Web of Science ID 000169612900018
View details for PubMedID 11448770
-
Variations on probabilistic suffix trees: statistical modeling and prediction of protein families
BIOINFORMATICS
2001; 17 (1): 23-43
Abstract
We present a method for modeling protein families by means of probabilistic suffix trees (PSTs). The method is based on identifying significant patterns in a set of related protein sequences. The patterns can be of arbitrary length, and the input sequences do not need to be aligned, nor is delineation of domain boundaries required. The method is automatic, and can be applied, without assuming any preliminary biological information, with surprising success. Basic biological considerations such as amino acid background probabilities and amino acid substitution probabilities can be incorporated to improve performance. The PST can serve as a predictive tool for protein sequence classification, and for detecting conserved patterns (possibly functionally or structurally important) within protein sequences. The method was tested on the Pfam database of protein families with more than satisfactory performance. Exhaustive evaluations show that the PST model detects many more related sequences than pairwise methods such as Gapped-BLAST, and is almost as sensitive as a hidden Markov model that is trained from a multiple alignment of the input sequences, while being much faster.
View details for Web of Science ID 000167241500005
View details for PubMedID 11222260
- Unsupervised sequence segmentation by a mixture of switching variable memory Markov sources Proceedings of 18th International Conference on Machine Learning 2001; ICML
- A Simple Hyper-Geometric Approach for Discovering Putative Transcription Factor Binding Sites, 1st Workshop on Algorithms in Bioinformatics Lecture Notes in Computer Science 2001; WABI (2149): 278-293
-
PromEC: An updated database of Escherichia coli mRNA promoters with experimentally identified transcriptional start sites
NUCLEIC ACIDS RESEARCH
2001; 29 (1): 277-277
Abstract
PromEC is an updated compilation of Escherichia coli mRNA promoter sequences. It includes documentation on the location of experimentally identified mRNA transcriptional start sites on the E. coli chromosome, as well as the actual sequences in the promoter region. The database was updated as of July 2000 and includes 472 entries. PromEC is accessible at http://bioinfo.md.huji.ac.il/marg/promec
View details for Web of Science ID 000166360300075
View details for PubMedID 11125111
- Novel small RNA-encoding genes in Escherichia coli Current Biology 2001; 11 (12): 941-950
- Automated modeling of musical style Proceedings of the International Computer Music Conference 2001; ICMC
- A variable memory Markovian modeling approach to unsupervised sequence segmentation Proceedings of 33rd Symposium on the Interface of Computing Science and Statistics 2001; INTERFACE
- Optimal amnesic probabilistic automata, or, how to learn and classify proteins in linear time and space Proceedings of the 4th annual international conference on research in computational molecular biology 2000; RECOMB
- Optimal amnesic probabilistic automata, or, how to learn and classify proteins in linear time and space J. Computational Biology 2000; 7 (3-4): 381-393
- Modeling protein families using probabilistic suffix trees, Proceedings of the 3rd annual international conference on research in computational molecular biology RECOMB 1999