A Method for Localizing Non-Reference Sequences to the Human Genome.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing
2022; 27: 313-324
As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics' improvements in human health are distributed globally and equitably. An important step to ensuring health equity is to improve the human reference genome to capture global diversity by including a wide variety of alternative haplotypes, sequences that are not currently captured on the reference genome.We present a method that localizes 100 basepair (bp) long sequences extracted from short-read sequencing that can ultimately be used to identify what regions of the human genome non-reference sequences belong to.We extract reads that don't align to the reference genome, and compute the population's distribution of 100-mers found within the unmapped reads. We use genetic data from families to identify shared genetic material between siblings and match the distribution of unmapped k-mers to these inheritance patterns to determine the the most likely genomic region of a k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and less than 1Mb median resolution. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine's diversity crisis.
View details for PubMedID 34890159
Improved detection of disease-associated gut microbes using 16S sequence-based biomarkers.
2021; 22 (1): 509
BACKGROUND: Sequencing partial 16S rRNA genes is a cost effective method for quantifying the microbial composition of an environment, such as the human gut. However, downstream analysis relies on binning reads into microbial groups by either considering each unique sequence as a different microbe, querying a database to get taxonomic labels from sequences, or clustering similar sequences together. However, these approaches do not fully capture evolutionary relationships between microbes, limiting the ability to identify differentially abundant groups of microbes between a diseased and control cohort. We present sequence-based biomarkers (SBBs), an aggregation method that groups and aggregates microbes using single variants and combinations of variants within their 16S sequences. We compare SBBs against other existing aggregation methods (OTU clustering and Microphenoor DiTaxa features) in several benchmarking tasks: biomarker discovery via permutation test, biomarker discovery via linear discriminant analysis, and phenotype prediction power. We demonstrate the SBBs perform on-par or better than the state-of-the-art methods in biomarker discovery and phenotype prediction.RESULTS: On two independent datasets, SBBs identify differentially abundant groups of microbes with similar or higher statistical significance than existing methods in both a permutation-test-based analysis and using linear discriminant analysis effect size. . By grouping microbes by SBB, we can identify several differentially abundant microbial groups (FDR <.1) between children with autism and neurotypical controls in a set of 115 discordant siblings. Porphyromonadaceae, Ruminococcaceae, and an unnamed species of Blastocystis were significantly enriched in autism, while Veillonellaceae was significantly depleted. Likewise, aggregating microbes by SBB on a dataset of obese and lean twins, we find several significantly differentially abundant microbial groups (FDR<.1). We observed Megasphaera andSutterellaceae highly enriched in obesity, and Phocaeicola significantly depleted. SBBs also perform on bar with or better than existing aggregation methods as features in a phenotype prediction model, predicting the autism phenotype with an ROC-AUC score of .64 and the obesity phenotype with an ROC-AUC score of .84.CONCLUSIONS: SBBs provide a powerful method for aggregating microbes to perform differential abundance analysis as well as phenotype prediction. Our source code can be freely downloaded from http://github.com/briannachrisman/16s_biomarkers .
View details for DOI 10.1186/s12859-021-04427-7
View details for PubMedID 34666677
- Training Affective Computer Vision Models by Crowdsourcing Soft-Target Labels COGNITIVE COMPUTATION 2021
A maximum flow-based network approach for identification of stable noncoding biomarkers associated with the multigenic neurological condition, autism.
2021; 14 (1): 28
BACKGROUND: Machine learning approaches for predicting disease risk from high-dimensional whole genome sequence (WGS) data often result in unstable models that can be difficult to interpret, limiting the identification of putative sets of biomarkers. Here, we design and validate a graph-based methodology based on maximum flow, which leverages the presence of linkage disequilibrium (LD) to identify stable sets of variants associated with complex multigenic disorders.RESULTS: We apply our method to a previously published logistic regression model trained to identify variants in simple repeat sequences associated with autism spectrum disorder (ASD); this L1-regularized model exhibits high predictive accuracy yet demonstrates great variability in the features selected from over 230,000 possible variants. In order to improve model stability, we extract the variants assigned non-zero weights in each of 5 cross-validation folds and then assemble the five sets of features into a flow network subject to LD constraints. The maximum flow formulation allowed us to identify 55 variants, which we show to be more stable than the features identified by the original classifier.CONCLUSION: Our method allows for the creation of machine learning models that can identify predictive variants. Our results help pave the way towards biomarker-based diagnosis methods for complex genetic disorders.
View details for DOI 10.1186/s13040-021-00262-x
View details for PubMedID 33941233
Estimating sequencing error rates using families.
2021; 14 (1): 27
BACKGROUND: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since it allows us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample.RESULTS: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method's versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.CONCLUSION: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
View details for DOI 10.1186/s13040-021-00259-6
View details for PubMedID 33892748
Indels in SARS-CoV-2 occur at template-switching hotspots.
2021; 14 (1): 20
The evolutionary dynamics of SARS-CoV-2 have been carefully monitored since the COVID-19 pandemic began in December 2019. However, analysis has focused primarily on single nucleotide polymorphisms and largely ignored the role of insertions and deletions (indels) as well as recombination in SARS-CoV-2 evolution. Using sequences from the GISAID database, we catalogue over 100 insertions and deletions in the SARS-CoV-2 consensus sequences. We hypothesize that these indels are artifacts of recombination events between SARS-CoV-2 replicates whereby RNA-dependent RNA polymerase (RdRp) re-associates with a homologous template at a different loci ("imperfect homologous recombination"). We provide several independent pieces of evidence that suggest this. (1) The indels from the GISAID consensus sequences are clustered at specific regions of the genome. (2) These regions are also enriched for 5' and 3' breakpoints in the transcription regulatory site (TRS) independent transcriptome, presumably sites of RNA-dependent RNA polymerase (RdRp) template-switching. (3) Within raw reads, these indel hotspots have cases of both high intra-host heterogeneity and intra-host homogeneity, suggesting that these indels are both consequences of de novo recombination events within a host and artifacts of previous recombination. We briefly analyze the indels in the context of RNA secondary structure, noting that indels preferentially occur in "arms" and loop structures of the predicted folded RNA, suggesting that secondary structure may be a mechanism for TRS-independent template-switching in SARS-CoV-2 or other coronaviruses. These insights into the relationship between structural variation and recombination in SARS-CoV-2 can improve our reconstructions of the SARS-CoV-2 evolutionary history as well as our understanding of the process of RdRp template-switching in RNA viruses.
View details for DOI 10.1186/s13040-021-00251-0
View details for PubMedID 33743803
Game theoretic centrality: a novel approach to prioritize disease candidate genes by combining biological networks with the Shapley value.
2020; 21 (1): 356
BACKGROUND: Complex human health conditions with etiological heterogeneity like Autism Spectrum Disorder (ASD) often pose a challenge for traditional genome-wide association study approaches in defining a clear genotype to phenotype model. Coalitional game theory (CGT) is an exciting method that can consider the combinatorial effect of groups of variants working in concert to produce a phenotype. CGT has been applied to associate likely-gene-disrupting variants encoded from whole genome sequence data to ASD; however, this previous approach cannot take into account for prior biological knowledge. Here we extend CGT to incorporate a priori knowledge from biological networks through a game theoretic centrality measure based on Shapley value to rank genes by their relevance-the individual gene's synergistic influence in a gene-to-gene interaction network. Game theoretic centrality extends the notion of Shapley value to the evaluation of a gene's contribution to the overall connectivity of its corresponding node in a biological network.RESULTS: We implemented and applied game theoretic centrality to rank genes on whole genomes from 756 multiplex autism families. Top ranking genes with the highest game theoretic centrality in both the weighted and unweighted approaches were enriched for pathways previously associated with autism, including pathways of the immune system. Four of the selected genes HLA-A, HLA-B, HLA-G, and HLA-DRB1-have also been implicated in ASD and further support the link between ASD and the human leukocyte antigen complex.CONCLUSIONS: Game theoretic centrality can prioritize influential, disease-associated genes within biological networks, and assist in the decoding of polygenic associations to complex disorders like autism.
View details for DOI 10.1186/s12859-020-03693-1
View details for PubMedID 32787845
- Coalitional Game Theory Facilitates Identification of Non-Coding Variants Associated With Autism BIOMEDICAL INFORMATICS INSIGHTS 2019; 11
Outgroup Machine Learning Approach Identifies Single Nucleotide Variants in Noncoding DNA Associated with Autism Spectrum Disorder
WORLD SCIENTIFIC PUBL CO PTE LTD. 2019: 260–71
View details for Web of Science ID 000461866400024
Inherited and De Novo Genetic Risk for Autism Impacts Shared Networks.
2019; 178 (4): 850–66.e26
We performed a comprehensive assessment of rare inherited variation in autism spectrum disorder (ASD) by analyzing whole-genome sequences of 2,308 individuals from families with multiple affected children. We implicate 69 genes in ASD risk, including 24 passing genome-wide Bonferroni correction and 16 new ASD risk genes, most supported by rare inherited variants, a substantial extension of previous findings. Biological pathways enriched for genes harboring inherited variants represent cytoskeletal organization and ion transport, which are distinct from pathways implicated in previous studies. Nevertheless, the de novo and inherited genes contribute to a common protein-protein interaction network. We also identified structural variants (SVs) affecting non-coding regions, implicating recurrent deletions in the promoters of DLG2 and NR3C2. Loss of nr3c2 function in zebrafish disrupts sleep and social function, overlapping with human ASD-related phenotypes. These data support the utility of studying multiplex families in ASD and are available through the Hartwell Autism Research and Technology portal.
View details for DOI 10.1016/j.cell.2019.07.015
View details for PubMedID 31398340
Coalitional Game Theory Facilitates Identification of Non-Coding Variants Associated With Autism.
Biomedical informatics insights
2019; 11: 1178222619832859
Studies on autism spectrum disorder (ASD) have amassed substantial evidence for the role of genetics in the disease's phenotypic manifestation. A large number of coding and non-coding variants with low penetrance likely act in a combinatorial manner to explain the variable forms of ASD. However, many of these combined interactions, both additive and epistatic, remain undefined. Coalitional game theory (CGT) is an approach that seeks to identify players (individual genetic variants or genes) who tend to improve the performance-association to a disease phenotype of interest-of any coalition (subset of co-occurring genetic variants) they join. This method has been previously applied to boost biologically informative signal from gene expression data and exome sequencing data but remains to be explored in the context of cooperativity among non-coding genomic regions. We describe our extension of previous work, highlighting non-coding chromosomal regions relevant to ASD using CGT on alteration data of 4595 fully sequenced genomes from 756 multiplex families. Genomes were encoded into binary matrices for three types of non-coding regions previously implicated in ASD and separated into ASD (case) and unaffected (control) samples. A player metric, the Shapley value, enabled determination of individual variant contributions in both sets of cohorts. A total of 30 non-coding positions were found to have significantly elevated player scores and likely represent significant contributors to the genetic coordination underlying ASD. Cross-study analyses revealed that a subset of mutated non-coding regions (all of which are in human accelerated regions (HARs)) and related genes are involved in biological pathways or behavioral outcomes known to be affected in autism, suggesting the importance of single nucleotide polymorphisms (SNPs) within HARs in ASD. These findings support the use of CGT in identifying hidden yet influential non-coding players from large-scale genomic data, to better understand the precise underpinnings of complex neurodevelopmental disorders such as autism.
View details for PubMedID 30886520
Analysis of Sex and Recurrence Ratios in Simplex and Multiplex Autism Spectrum Disorder Implicates Sex-Specific Alleles as Inheritance Mechanism
IEEE. 2018: 1470–77
View details for Web of Science ID 000458654000258