Doctor of Philosophy, University of British Columbia (2019)
Bachelor of Engineering, University of Edinburgh (2009)
PGxMine: Text mining for curation of PharmGKB.
Pacific Symposium on Biocomputing
2020; 25: 611–22
Precision medicine tailors treatment to individuals' personal data, including differences in their genome. The Pharmacogenomics Knowledgebase (PharmGKB) provides highly curated information on the effect of genetic variation on drug response and side effects for a wide range of drugs. PharmGKB's scientific curators triage, review and annotate a large number of papers each year, but the task is challenging. We present the PGxMine resource, a text-mined resource of pharmacogenomic associations from all accessible published literature to assist in the curation of PharmGKB. We developed a supervised machine learning pipeline to extract associations between a variant (DNA and protein changes, star alleles and dbSNP identifiers) and a chemical. PGxMine covers 452 chemicals and 2,426 variants and contains 19,930 mentions of pharmacogenomic associations across 7,170 papers. An evaluation by PharmGKB curators found that 57 of the top 100 associations not found in PharmGKB led to 83 curatable papers and a further 24 associations would likely lead to curatable papers through citations. The results can be viewed at https://pgxmine.pharmgkb.org/ and code can be downloaded at https://github.com/jakelever/pgxmine.
View details for PubMedID 31797632
Extending TextAE for annotation of non-contiguous entities.
Genomics & informatics
2020; 18 (2): e15
Named entity recognition tools are used to identify mentions of biomedical entities in free text and are essential components of high-quality information retrieval and extraction systems. Without good entity recognition, methods will mislabel searched text and will miss important information or identify spurious text that will frustrate users. Most tools do not capture non-contiguous entities, which are separate spans of text that together refer to an entity, e.g., the entity "type 1 diabetes" in the phrase "type 1 and type 2 diabetes." This type is commonly found in biomedical texts, especially in lists, where multiple biomedical entities are named in shortened form to avoid repeating words. Most text annotation systems that enable users to view and edit entity annotations do not support non-contiguous entities. Therefore, experts cannot even visualize non-contiguous entities, let alone annotate them to build valuable datasets for machine learning methods. To combat this problem and as part of the BLAH6 hackathon, we extended the TextAE platform to allow visualization and annotation of non-contiguous entities. This enables users to add new subspans to existing entities by selecting additional text. We integrate this new functionality with TextAE's existing editing functionality to allow easy changes to entity annotation and editing of relation annotations involving non-contiguous entities, with importing and exporting to the PubAnnotation format. Finally, we roughly quantify the problem across the entire accessible biomedical literature to highlight that there are a substantial number of non-contiguous entities that appear in lists that would be missed by most text mining systems.
View details for DOI 10.5808/GI.2020.18.2.e15
View details for PubMedID 32634869
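To make the idea of a non-contiguous entity from the TextAE abstract above concrete, the following is a minimal sketch of one way to store an entity as several character spans. The `Entity` class, its field names and the `Disease` label are hypothetical illustrations; this is not the PubAnnotation format and not TextAE's internal data model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Entity:
    """Hypothetical annotation: one entity made of one or more text spans."""
    label: str
    # Each span is (start, end) in character offsets, end exclusive.
    spans: Tuple[Tuple[int, int], ...]

    def text(self, source: str) -> str:
        # Join the covered subspans in order to recover the full entity name.
        return " ".join(source[start:end] for start, end in self.spans)

    def is_contiguous(self) -> bool:
        return len(self.spans) == 1


sentence = "type 1 and type 2 diabetes"

# "type 1" and "diabetes" are separated by other words, so two spans are needed.
type1 = Entity("Disease", ((0, 6), (18, 26)))
# "type 2 diabetes" is a single contiguous span.
type2 = Entity("Disease", ((11, 26),))

print(type1.text(sentence), type1.is_contiguous())
print(type2.text(sentence), type2.is_contiguous())
```

A tool that only allows one span per entity could record the second entity but would have to either drop the first or mislabel "type 1" on its own.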
Text-mining clinically relevant cancer biomarkers for curation into the CIViC database.
2019; 11 (1): 78
Precision oncology involves analysis of individual cancer samples to understand the genes and pathways involved in the development and progression of a cancer. To improve patient care, knowledge of diagnostic, prognostic, predisposing, and drug response markers is essential. Several knowledgebases have been created by different groups to collate evidence for these associations. These include the open-access Clinical Interpretation of Variants in Cancer (CIViC) knowledgebase. These databases rely on time-consuming manual curation from skilled experts who read and interpret the relevant biomedical literature. To aid in this curation and provide the greatest coverage for these databases, particularly CIViC, we propose the use of text mining approaches to extract these clinically relevant biomarkers from all available published literature. To this end, a group of cancer genomics experts annotated sentences that discussed biomarkers with their clinical associations and achieved good inter-annotator agreement. We then used a supervised learning approach to construct the CIViCmine knowledgebase. We extracted 121,589 relevant sentences from PubMed abstracts and PubMed Central Open Access full-text papers. CIViCmine contains over 87,412 biomarkers associated with 8035 genes, 337 drugs, and 572 cancer types, representing 25,818 abstracts and 39,795 full-text publications. Through integration with CIViC, we provide a prioritized list of curatable clinically relevant cancer biomarkers as well as a resource that is valuable to other knowledgebases and precision cancer analysts in general. All data is publicly available and distributed with a Creative Commons Zero license. The CIViCmine knowledgebase is available at http://bionlp.bcgsc.ca/civicmine/.
View details for DOI 10.1186/s13073-019-0686-y
View details for PubMedID 31796060
Identification and Analyses of Extra-Cranial and Cranial Rhabdoid Tumor Molecular Subgroups Reveal Tumors with Cytotoxic T Cell Infiltration.
2019; 29 (8): 2338–54.e7
Extra-cranial malignant rhabdoid tumors (MRTs) and cranial atypical teratoid RTs (ATRTs) are heterogeneous pediatric cancers driven primarily by SMARCB1 loss. To understand the genome-wide molecular relationships between MRTs and ATRTs, we analyze multi-omics data from 140 MRTs and 161 ATRTs. We detect similarities between the MYC subgroup of ATRTs (ATRT-MYC) and extra-cranial MRTs, including global DNA hypomethylation and overexpression of HOX genes and genes involved in mesenchymal development, distinguishing them from other ATRT subgroups that express neural-like features. We identify five DNA methylation subgroups associated with anatomical sites and SMARCB1 mutation patterns. Groups 1, 3, and 4 exhibit cytotoxic T cell infiltration and expression of immune checkpoint regulators, consistent with a potential role for immunotherapy in rhabdoid tumor patients.
View details for DOI 10.1016/j.celrep.2019.10.013
View details for PubMedID 31708418
Comprehensive genomic profiling of glioblastoma tumors, BTICs, and xenografts reveals stability and adaptation to growth environments.
Proceedings of the National Academy of Sciences of the United States of America
2019; 116 (38): 19098–108
Glioblastoma multiforme (GBM) is the most deadly brain tumor, and currently lacks effective treatment options. Brain tumor-initiating cells (BTICs) and orthotopic xenografts are widely used in investigating GBM biology and new therapies for this aggressive disease. However, the genomic characteristics and molecular resemblance of these models to GBM tumors remain undetermined. We used massively parallel sequencing technology to decode the genomes and transcriptomes of BTICs and xenografts and their matched tumors in order to delineate the potential impacts of the distinct growth environments. Using data generated from whole-genome sequencing of 201 samples and RNA sequencing of 118 samples, we show that BTICs and xenografts resemble their parental tumor at the genomic level but differ at the mRNA expression and epigenomic levels, likely due to the different growth environment for each sample type. These findings suggest that a comprehensive genomic understanding of in vitro and in vivo GBM model systems is crucial for interpreting data from drug screens, and can help control for biases introduced by cell-culture conditions and the microenvironment in mouse models. We also found that lack of MGMT expression in pretreated GBM is linked to hypermutation, which in turn contributes to increased genomic heterogeneity and requires new strategies for GBM treatment.
View details for DOI 10.1073/pnas.1813495116
View details for PubMedID 31471491
View details for PubMedCentralID PMC6754609
CancerMine: a literature-mined resource for drivers, oncogenes and tumor suppressors in cancer.
2019; 16 (6): 505–7
Tumors from individuals with cancer are frequently genetically profiled for information about the driving forces behind the disease. We present the CancerMine resource, a text-mined and routinely updated database of drivers, oncogenes and tumor suppressors in different types of cancer. All data are available online ( http://bionlp.bcgsc.ca/cancermine ) and downloadable under a Creative Commons Zero license for ease of use.
View details for DOI 10.1038/s41592-019-0422-y
View details for PubMedID 31110280
LIONS: analysis suite for detecting and quantifying transposable element initiated transcription from RNA-seq.
Bioinformatics (Oxford, England)
2019; 35 (19): 3839–41
Transposable elements (TEs) influence the evolution of novel transcriptional networks yet the specific and meaningful interpretation of how TE-derived transcriptional initiation contributes to the transcriptome has been marred by computational and methodological deficiencies. We developed LIONS for the analysis of RNA-seq data to specifically detect and quantify TE-initiated transcripts. Source code, container, test data and instruction manual are freely available at www.github.com/ababaian/LIONS. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btz130
View details for PubMedID 30793157
Text-based phenotypic profiles incorporating biochemical phenotypes of inborn errors of metabolism improve phenomics-based diagnosis.
Journal of inherited metabolic disease
2018; 41 (3): 555–62
Phenomics is the comprehensive study of phenotypes at every level of biology: from metabolites to organisms. With high throughput technologies increasing the scope of biological discoveries, the field of phenomics has been developing rapid and precise methods to collect, catalog, and analyze phenotypes. Such methods have allowed phenotypic data to be widely used in medical applications, from assisting clinical diagnoses to prioritizing genomic diagnoses. To channel the benefits of phenomics into the field of inborn errors of metabolism (IEM), we have recently launched IEMbase, an expert-curated knowledgebase of IEM and their disease-characterizing phenotypes. While our efforts with IEMbase have realized benefits, taking full advantage of phenomics requires a comprehensive curation of IEM phenotypes in core phenomics projects, which is dependent upon contributions from the IEM clinical and research community. Here, we assess the inclusion of IEM biochemical phenotypes in a core phenomics project, the Human Phenotype Ontology. We then demonstrate the utility of biochemical phenotypes using a text-based phenomics method to predict gene-disease relationships, showing that the prediction of IEM genes is significantly better using biochemical rather than clinical profiles. The findings herein provide a motivating goal for the IEM community to expand the computationally accessible descriptions of biochemical phenotypes associated with IEM in phenomics resources.
View details for DOI 10.1007/s10545-017-0125-4
View details for PubMedID 29340838
View details for PubMedCentralID PMC5959948
A collaborative filtering-based approach to biomedical knowledge discovery.
Bioinformatics (Oxford, England)
2018; 34 (4): 652–59
The increase in publication rates makes it challenging for an individual researcher to stay abreast of all relevant research in order to find novel research hypotheses. Literature-based discovery methods make use of knowledge graphs built using text mining and can infer future associations between biomedical concepts that will likely occur in new publications. These predictions are a valuable resource for researchers to explore a research topic. Current methods for prediction are based on the local structure of the knowledge graph. A method that uses global knowledge from across the knowledge graph needs to be developed in order to make knowledge discovery a frequently used tool by researchers. We propose an approach based on the singular value decomposition (SVD) that is able to combine data from across the knowledge graph through a reduced representation. Using cooccurrence data extracted from published literature, we show that SVD performs better than the leading methods for scoring discoveries. We also show the diminishing predictive power of knowledge discovery as we compare our predictions with real associations that appear further into the future. Finally, we examine the strengths and weaknesses of the SVD approach against another well-performing system using several predicted associations. All code and results files for this analysis can be accessed at https://firstname.lastname@example.org. Supplementary data are available at Bioinformatics online.
View details for DOI 10.1093/bioinformatics/btx613
View details for PubMedID 29028901
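The SVD-based scoring described in the abstract above can be illustrated with a small sketch. It assumes a binary, symmetric cooccurrence matrix and scores unseen pairs by a low-rank reconstruction; the function name `svd_scores`, the toy matrix, the term names and the choice of rank 2 are all hypothetical and are not taken from the paper's code.

```python
import numpy as np


def svd_scores(cooccurrence, rank):
    """Reconstruct the matrix from its top `rank` singular vectors.

    Entries of the reconstruction act as scores for every concept pair,
    including pairs that never cooccur in the input.
    """
    u, s, vt = np.linalg.svd(cooccurrence, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


# Toy data (an assumption): 1 means the two terms cooccur in some publication.
terms = ["term_a", "term_b", "term_c", "term_d", "term_e"]
matrix = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

scores = svd_scores(matrix, rank=2)

# Rank pairs that do not yet cooccur; higher scores are stronger candidates.
candidates = sorted(
    ((scores[i, j], terms[i], terms[j])
     for i in range(len(terms))
     for j in range(i + 1, len(terms))
     if matrix[i, j] == 0),
    reverse=True,
)
for score, first, second in candidates:
    print(f"{first} - {second}: {score:.3f}")
```

Because the reduced representation is fitted to the whole matrix, a pair's score can reflect associations elsewhere in the graph and not only the immediate neighbours of the two terms.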
PubRunner: A light-weight framework for updating text mining results.
2017; 6: 612
Biomedical text mining promises to assist biologists in quickly navigating the combined knowledge in their domain. This would allow improved understanding of the complex interactions within biological systems and faster hypothesis generation. New biomedical research articles are published daily and text mining tools are only as good as the corpus from which they work. Many text mining tools are underused because their results are static and do not reflect the constantly expanding knowledge in the field. In order for biomedical text mining to become an indispensable tool used by researchers, this problem must be addressed. To this end, we present PubRunner, a framework for regularly running text mining tools on the latest publications. PubRunner is lightweight, simple to use, and can be integrated with an existing text mining tool. The workflow involves downloading the latest abstracts from PubMed, executing a user-defined tool, pushing the resulting data to a public FTP or Zenodo dataset, and publicizing the location of these results on the public PubRunner website. We illustrate the use of this tool by re-running the commonly used word2vec tool on the latest PubMed abstracts to generate up-to-date word vector representations for the biomedical domain. This shows a proof of concept that we hope will encourage text mining developers to build tools that truly will aid biologists in exploring the latest publications.
View details for DOI 10.12688/f1000research.11389.2
View details for PubMedID 29152221
View details for PubMedCentralID PMC5664974
Small molecule epigenetic screen identifies novel EZH2 and HDAC inhibitors that target glioblastoma brain tumor-initiating cells.
2016; 7 (37): 59360–76
Glioblastoma (GBM) is the most lethal and aggressive adult brain tumor, requiring the development of efficacious therapeutics. Towards this goal, we screened five genetically distinct patient-derived brain-tumor initiating cell lines (BTIC) with a unique collection of small molecule epigenetic modulators from the Structural Genomics Consortium (SGC). We identified multiple hits that inhibited the growth of BTICs in vitro, and further evaluated the therapeutic potential of EZH2 and HDAC inhibitors due to the high relevance of these targets for GBM. We found that the novel SAM-competitive EZH2 inhibitor UNC1999 exhibited low micromolar cytotoxicity in vitro on a diverse collection of BTIC lines, synergized with dexamethasone (DEX) and suppressed tumor growth in vivo in combination with DEX. In addition, a unique brain-penetrant class I HDAC inhibitor exhibited cytotoxicity in vitro on a panel of BTIC lines and extended survival in combination with TMZ in an orthotopic BTIC model in vivo. Finally, a combination of EZH2 and HDAC inhibitors demonstrated synergy in vitro by augmenting apoptosis and increasing DNA damage. Our findings identify key epigenetic modulators in GBM that regulate BTIC growth and survival and highlight promising combination therapies.
View details for DOI 10.18632/oncotarget.10661
View details for PubMedID 27449082
View details for PubMedCentralID PMC5312317
Comparative genomic and genetic analysis of glioblastoma-derived brain tumor-initiating cells and their parent tumors.
2016; 18 (3): 350–60
Glioblastoma (GBM) is a fatal cancer that has eluded major therapeutic advances. Failure to make progress may reflect the absence of a human GBM model that could be used to test compounds for anti-GBM activity. In this respect, the development of brain tumor-initiating cell (BTIC) cultures is a step forward because BTICs appear to capture the molecular diversity of GBM better than traditional glioma cell lines. Here, we perform a comparative genomic and genetic analysis of BTICs and their parent tumors as preliminary evaluation of the BTIC model. We assessed single nucleotide polymorphisms (SNPs), genome-wide copy number variations (CNVs), gene expression patterns, and molecular subtypes of 11 established BTIC lines and matched parent tumors. Although CNV differences were noted, BTICs retained the major genomic alterations characteristic of GBM. SNP patterns were similar between BTICs and tumors. Importantly, recurring SNP or CNV alterations specific to BTICs were not seen. Comparative gene expression analysis and molecular subtyping revealed differences between BTICs and GBMs. These differences formed the basis of a 63-gene expression signature that distinguished cells from tumors; differentially expressed genes primarily involved metabolic processes. We also derived a set of 73 similarly expressed genes; these genes were not associated with specific biological functions. Although not identical, established BTIC lines preserve the core molecular alterations seen in their parent tumors, as well as the genomic hallmarks of GBM, without acquiring recurring BTIC-specific changes.
View details for DOI 10.1093/neuonc/nov143
View details for PubMedID 26245525
View details for PubMedCentralID PMC4767234