HoJoon Lee
Sr Biomedical Data Scientist, Medicine - Med/Oncology
Current Role at Stanford
Senior Data Scientist
Honors & Awards
-
2nd place in Stanford HealthAI hackathon, Stanford University (2019 Jan)
-
Travel grants, AACR (2015 April)
-
Travel Fellowship, International Society for Computational Biology (ISMB) meeting (2015)
-
Best poster award in Oncology retreat, Stanford University (2014 October)
-
Travel fellowships of School of Life Sciences, Arizona State University (2009 September)
-
GPSA Conference Travel Grant, Arizona State University (2008)
-
Travel Fellowship, RECOM Computational Cancer Biology (2007)
-
Dr. John and Rose Maher Alumni Scholarship for students involved in cancer research, Arizona State University (2006)
Education & Certifications
-
PhD, Arizona State University, Molecular & Cellular Biology with Bioinformatics (2012)
-
Master of Science, Arizona State University, Computational Biology (2005)
-
Bachelor of Science, Yonsei University, Biology (2002)
Patents
-
HoJoon Lee (co-inventor). "United States Patent US PTO 62/200,904 High Resolution STR analysis using Next Generation Sequencing", Leland Stanford Junior University
Professional Interests
My primary research interest is “Cancer Treatment and Prevention through Precision Medicine”: analyzing the genomic data of cancer patients so that genomic information can guide clinical decisions. As an undergraduate biology major, I was captivated by the fact that DNA, a form of digital information, codes life. To gain computational skills, I joined a computational biology program for my master's degree. There I learned the principles and techniques of sequence analysis, such as sequence alignment models, molecular phylogenetics, and motif searching, while estimating the neutral mutation rate by comparing the human and mouse genomes. To have an impact on real life, I then applied this expertise to cancer sequencing data during my Ph.D. study, identifying neo-antigens for breast cancer vaccine development. I learned to analyze RNA-seq data with my own algorithm and to organize and manage the large datasets generated by the project using MySQL. As a post-doc at Stanford, I expanded my research to investigate the clinical implications of genomic features. I applied regularized regression (elastic net) to integrate data from multiple heterogeneous genomic assays in The Cancer Genome Atlas (TCGA) project and to identify known and novel candidate driver mutations that predict tumor stage and other clinical parameters. As a data scientist, my work includes a novel k-mer representation technique for genome sequence analysis without traditional alignment methods, which has been used to characterize the human pangenome reference. Currently, I coordinate interdisciplinary teams combining expertise in genomics, computer science, pathology, and immunology to advance a deep learning model that classifies cell types in hematoxylin and eosin (H&E)-stained images.
Work Experience
-
Senior Research Engineer, Stanford University (3/16/2019 - Present)
-Manage bioinformatics analysis pipeline for all genomics data
-Lead a team to develop bioinformatics analysis pipeline for single cell immunogenomics data
-Lead a team to develop a dynamic representation of human reference genomes, which enables population-level sequencing analysis
-Lead a team to develop new cancer immune therapy targets (under IP process)
Location
Stanford, CA
-
Project Leader, Stanford University (3/16/2017 - 3/15/2019)
-Developed bioinformatics pipeline to analyze single cell sequencing of T cell receptors (TCRs)
-Developed bioinformatics pipeline to identify personalized neo-antigens for clinical phase 1 trial of immune therapy by pLADD. (licensing to Aduro)
-Developed new algorithms to analyze sequencing reads for substitution, indels, gene fusions, copy number variation, and Cas9 mutagenesis
-Developed analysis pipeline for whole genome sequencing from clinical samples with Intermountain Healthcare
Location
Stanford, CA
-
Post-doctoral fellow, Stanford University (3/16/2012 - 3/15/2017)
-Set up bioinformatics tools on Amazon Web Services through Seven Bridges for immune-genomics analysis of The Cancer Genome Atlas (TCGA)
-Identified clinically relevant genomic/proteomic changes from >10,000 samples of > 32 cancers in TCGA by integrative analysis.
-Developed a web portal for the exploration of the clinical associations of the TCGA data; http://genomeportal.stanford.edu/pan-tcga
-Designed the optimal probes for targeted sequencing such as STR-OS seq and digital droplet PCR
Location
Stanford, CA
-
Research Associate, Arizona State University (8/2005 - 2/2012)
Worked on the cancer vaccine project funded by the Department of Defense (DoD) and a Keck grant.
-Developed an algorithm to identify tumor-specific frame-shifted mutations derived from gene fusions, alternative splicing, and insertions/deletions as neo-antigens that could be used in a vaccine.
-Validated these putative candidates with molecular biology methods such as RT-PCR and cloning. Validated candidates were tested in a mouse model.
-Constructed a MySQL database to organize all data with all available information, such as epitopes, MHC binders, gene expression, exon structure, and GO annotation, to evaluate candidates as vaccine antigens.
Location
Tempe, AZ
All Publications
-
Pan-conserved segment tags identify ultra-conserved sequences across assemblies in the human pangenome.
Cell reports methods
2023; 3 (8): 100543
Abstract
The human pangenome, a new reference sequence, addresses many limitations of the current GRCh38 reference. The first release is based on 94 high-quality haploid assemblies from individuals with diverse backgrounds. We employed a k-mer indexing strategy for comparative analysis across multiple assemblies, including the pangenome reference, GRCh38, and CHM13, a telomere-to-telomere reference assembly. Our k-mer indexing approach enabled us to identify a valuable collection of universally conserved sequences across all assemblies, referred to as "pan-conserved segment tags" (PSTs). By examining intervals between these segments, we discerned highly conserved genomic segments and those with structurally related polymorphisms. We found 60,764 polymorphic intervals with unique geo-ethnic features in the pangenome reference. In this study, we utilized ultra-conserved sequences (PSTs) to forge a link between human pangenome assemblies and reference genomes. This methodology enables the examination of any sequence of interest within the pangenome, using the reference genome as a comparative framework.
View details for DOI 10.1016/j.crmeth.2023.100543
View details for PubMedID 37671027
View details for PubMedCentralID PMC10475782
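The abstract above describes finding sequences conserved across every assembly with a k-mer index. As a rough illustration only, and not the authors' implementation, the core idea can be sketched as a set intersection over the k-mers of each assembly. The function names, the toy sequences, and the choice of k = 5 are all assumptions made for this example; real assemblies would need an on-disk index rather than in-memory sets.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def conserved_kmers(assemblies: dict[str, str], k: int) -> set[str]:
    """K-mers present in every assembly (a toy stand-in for conserved tags)."""
    sets = [kmers(seq, k) for seq in assemblies.values()]
    return set.intersection(*sets) if sets else set()


# Hypothetical toy "assemblies"; names and sequences are made up.
toy_assemblies = {
    "asm_a": "ACGTACGGTTCA",
    "asm_b": "TTACGTACGGAA",
    "asm_c": "GGACGTACGGTT",
}
shared = conserved_kmers(toy_assemblies, k=5)
print(sorted(shared))
```

In this toy case the shared k-mers all come from the common stretch ACGTACGG; the flanking sequence differs between assemblies and drops out of the intersection.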
-
Pangenome graph construction from genome alignments with Minigraph-Cactus
NATURE BIOTECHNOLOGY
2023
Abstract
Pangenome references address biases of reference genomes by storing a representative set of diverse haplotypes and their alignment, usually as a graph. Alternate alleles determined by variant callers can be used to construct pangenome graphs, but advances in long-read sequencing are leading to widely available, high-quality phased assemblies. Constructing a pangenome graph directly from assemblies, as opposed to variant calls, leverages the graph's ability to represent variation at different scales. Here we present the Minigraph-Cactus pangenome pipeline, which creates pangenomes directly from whole-genome alignments, and demonstrate its ability to scale to 90 human haplotypes from the Human Pangenome Reference Consortium. The method builds graphs containing all forms of genetic variation while allowing use of current mapping and genotyping tools. We measure the effect of the quality and completeness of reference genomes used for analysis within the pangenomes and show that using the CHM13 reference from the Telomere-to-Telomere Consortium improves the accuracy of our methods. We also demonstrate construction of a Drosophila melanogaster pangenome.
View details for DOI 10.1038/s41587-023-01793-w
View details for Web of Science ID 000992565300001
View details for PubMedID 37165083
View details for PubMedCentralID 8006571
-
Single-molecule methylation profiles of cell-free DNA in cancer with nanopore sequencing.
Genome medicine
2023; 15 (1): 33
Abstract
Epigenetic characterization of cell-free DNA (cfDNA) is an emerging approach for detecting and characterizing diseases such as cancer. We developed a strategy using nanopore-based single-molecule sequencing to measure cfDNA methylomes. This approach generated up to 200 million reads for a single cfDNA sample from cancer patients, an order of magnitude improvement over existing nanopore sequencing methods. We developed a single-molecule classifier to determine whether individual reads originated from a tumor or immune cells. Leveraging methylomes of matched tumors and immune cells, we characterized cfDNA methylomes of cancer patients for longitudinal monitoring during treatment.
View details for DOI 10.1186/s13073-023-01178-3
View details for PubMedID 37138315
View details for PubMedCentralID 1283450
-
A draft human pangenome reference.
Nature
2023; 617 (7960): 312-324
Abstract
Here the Human Pangenome Reference Consortium presents a first draft of the human pangenome reference. The pangenome contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals. These assemblies cover more than 99% of the expected sequence in each genome and are more than 99% accurate at the structural and base pair levels. Based on alignments of the assemblies, we generate a draft pangenome that captures known variants and haplotypes and reveals new alleles at structurally complex loci. We also add 119 million base pairs of euchromatic polymorphic sequences and 1,115 gene duplications relative to the existing reference GRCh38. Roughly 90 million of the additional base pairs are derived from structural variation. Using our draft pangenome to analyse short-read data reduced small variant discovery errors by 34% and increased the number of structural variants detected per haplotype by 104% compared with GRCh38-based workflows, which enabled the typing of the vast majority of structural variant alleles per sample.
View details for DOI 10.1038/s41586-023-05896-x
View details for PubMedID 37165242
View details for PubMedCentralID PMC10172123
-
The Gastric Cancer Registry Genome Explorer: A tool for genomic discovery.
LIPPINCOTT WILLIAMS & WILKINS. 2023: 434
View details for Web of Science ID 001093994600500
-
Tumor-associated microbiome features of metastatic colorectal cancer and clinical implications.
Frontiers in oncology
2023; 13: 1310054
Abstract
Colon microbiome composition contributes to the pathogenesis of colorectal cancer (CRC) and prognosis. We analyzed 16S rRNA sequencing data from tumor samples of patients with metastatic CRC and determined the clinical implications. We enrolled 133 patients with metastatic CRC at St. Vincent Hospital in Korea. The V3-V4 regions of the 16S rRNA gene from the tumor DNA were amplified, sequenced on an Illumina MiSeq, and analyzed using the DADA2 package. After excluding samples that retained <5% of the total reads after merging, 120 samples were analyzed. The median age of patients was 63 years (range, 34-82 years), and 76 patients (63.3%) were male. The primary cancer sites were the right colon (27.5%), left colon (30.8%), and rectum (41.7%). All subjects received 5-fluorouracil-based systemic chemotherapy. After removing genera with <1% of the total reads in each patient, 523 genera were identified. Rectal origin, high CEA level (≥10 ng/mL), and presence of lung metastasis showed higher richness. Survival analysis revealed that the presence of Prevotella (p = 0.052), Fusobacterium (p = 0.002), Selenomonas (p<0.001), Fretibacterium (p = 0.001), Porphyromonas (p = 0.007), Peptostreptococcus (p = 0.002), and Leptotrichia (p = 0.003) were associated with short overall survival (OS, <24 months), while the presence of Sphingomonas was associated with long OS (p = 0.070). From the multivariate analysis, the presence of Selenomonas (hazard ratio [HR], 6.35; 95% confidence interval [CI], 2.38-16.97; p<0.001) was associated with poor prognosis along with high CEA level. Tumor microbiome features may be useful prognostic biomarkers for metastatic CRC.
View details for DOI 10.3389/fonc.2023.1310054
View details for PubMedID 38304032
View details for PubMedCentralID PMC10833227
-
Colorectal cancer metastases in the liver establish immunosuppressive spatial networking between tumor associated SPP1+ macrophages and fibroblasts.
Clinical cancer research : an official journal of the American Association for Cancer Research
2022
Abstract
The liver is the most frequent metastatic site for colorectal cancer (CRC). Its microenvironment is modified to provide a niche that is conducive for CRC cell growth. This study focused on characterizing the cellular changes in the metastatic CRC (mCRC) liver tumor microenvironment (TME). We analyzed a series of microsatellite stable (MSS) mCRCs to the liver, paired normal liver tissue and peripheral blood mononuclear cells using single cell RNA-seq (scRNA-seq). We validated our findings using multiplexed spatial imaging and bulk gene expression with cell deconvolution. We identified TME-specific SPP1-expressing macrophages with altered metabolism features, foam cell characteristics and increased activity in extracellular matrix (ECM) organization. SPP1+ macrophages and fibroblasts expressed complementary ligand receptor pairs with the potential to mutually influence their gene expression programs. TME lacked dysfunctional CD8 T cells and contained regulatory T cells, indicative of immunosuppression. Spatial imaging validated these cell states in the TME. Moreover, TME macrophages and fibroblasts had close spatial proximity, which is a requirement for intercellular communication and networking. In an independent cohort of mCRCs in the liver, we confirmed the presence of SPP1+ macrophages and fibroblasts using gene expression data. An increased proportion of TME fibroblasts was associated with a worse prognosis in these patients. We demonstrated that mCRC in the liver is characterized by transcriptional alterations of macrophages in the TME. Intercellular networking between macrophages and fibroblasts supports CRC growth in the immunosuppressed metastatic niche in the liver. These features can be used to target immune checkpoint resistant MSS tumors.
View details for DOI 10.1158/1078-0432.CCR-22-2041
View details for PubMedID 36239989
-
Exploratory genomic analysis of high grade neuroendocrine neoplasms across diverse primary sites.
Endocrine-related cancer
2022
Abstract
High grade (grade 3) neuroendocrine neoplasms (G3 NENs) have poor survival outcomes. From a clinical standpoint, G3 NENs are usually grouped regardless of primary site and treated similarly. Little is known regarding the underlying genomics of these rare tumors, especially when compared across different primary sites. We performed whole transcriptome (n = 46), whole exome (n = 40) and gene copy number (n = 43) sequencing on G3 NEN FFPE samples from diverse organs (in total 17 were lung, 16 were gastroenteropancreatic, 13 other). G3 NENs despite arising from diverse primary sites did not have gene expression profiles that were easily segregated by organ of origin. Across all G3 NENs, TP53, APC, RB1 and CDKN2A were significantly mutated. The CDK4/6 cell cycling pathway was mutated in 95% of cases, with upregulation of oncogenes within this pathway. G3 NENs had high tumor mutation burden (mean 7.09 mutations/MB), with 20% having >10 mutations/MB. Two somatic copy number alterations were significantly associated with worse prognosis across tissue types: focal deletion 22q13.31 (HR, 7.82; p = 0.034) and arm amplification 19q (HR, 4.82; p = 0.032). This study is among the most diverse genomic studies of high-grade neuroendocrine neoplasms. We uncovered genomic features previously unrecognized for this rapidly fatal and rare cancer type that could have potential prognostic and therapeutic implications.
View details for DOI 10.1530/ERC-22-0015
View details for PubMedID 36165930
-
The Gastric Cancer Registry: A Genomic Translational Resource for Multidisciplinary Research in Gastric Cancer.
Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology
2022
Abstract
Gastric cancer (GC) is a leading cause of cancer morbidity and mortality. Developing information systems which integrate clinical and genomic data may accelerate discoveries to improve cancer prevention, detection, and treatment. To support translational research in GC, we developed the GC Registry (GCR), a North American repository of clinical and cancer genomics data. Participants self-enrolled online. Entry criteria into the GCR included the following: (1) diagnosis of GC, (2) history of GC in a first- or second-degree relative, or (3) known germline mutation in the gene CDH1. Participants provided demographic and clinical information through a detailed survey. Some participants provided specimens of saliva and tumor samples. Tumor samples underwent exome sequencing, whole genome sequencing and transcriptome sequencing. From 2011-2021, 567 individuals registered and returned the clinical questionnaire. For this cohort 65% had a personal history of GC, 36% reported a family history of GC and 14% had a germline CDH1 mutation. 89 GC patients provided tumor samples. For the initial study, 41 tumors were sequenced using next generation sequencing. The data was analyzed for cancer mutations, copy number variations, gene expression, microbiome, neoantigens, immune infiltrates, and other features. We developed a searchable, web-based interface (the GCR Genome Explorer) to enable researchers to access these datasets. The GCR is a unique, North American GC registry which integrates clinical and genomic annotation. Available for researchers through an open access, web-based explorer, the GCR Genome Explorer will accelerate collaborative GC research across the United States and world.
View details for DOI 10.1158/1055-9965.EPI-22-0308
View details for PubMedID 35771165
-
KmerKeys: a web resource for searching indexed genome assemblies and variants.
Nucleic acids research
2022
Abstract
K-mers are short DNA sequences that are used for genome sequence analysis. Applications that use k-mers include genome assembly and alignment. However, the wider bioinformatic use of these short sequences has challenges related to the massive scale of genomic sequence data. A single human genome assembly has billions of k-mers. As a result, the computational requirements for analyzing k-mer information are enormous, particularly when involving complete genome assemblies. To address these issues, we developed a new indexing data structure based on a hash table tuned for the lookup of short sequence keys. This web application, referred to as KmerKeys, provides performant, rapid query speeds for cloud computation on genome assemblies. We enable fuzzy as well as exact sequence searches of assemblies. To enable robust and speedy performance, the website implements cache-friendly hash tables, memory mapping and massive parallel processing. Our method employs a scalable and efficient data structure that can be used to jointly index and search a large collection of human genome assembly information. One can include variant databases and their associated metadata such as the gnomAD population variant catalogue. This feature enables the incorporation of future genomic information into sequencing analysis. KmerKeys is freely accessible at https://kmerkeys.dgi-stanford.org.
View details for DOI 10.1093/nar/gkac266
View details for PubMedID 35474383
-
A deep learning model for molecular label transfer that enables cancer cell identification from histopathology images.
NPJ precision oncology
2022; 6 (1): 14
Abstract
Deep-learning classification systems have the potential to improve cancer diagnosis. However, development of these computational approaches so far depends on prior pathological annotations and large training datasets. The manual annotation is low-resolution, time-consuming, highly variable and subject to observer variance. To address this issue, we developed a method, H&E Molecular neural network (HEMnet). HEMnet utilizes immunohistochemistry as an initial molecular label for cancer cells on a H&E image and trains a cancer classifier on the overlapping clinical histopathological images. Using this molecular transfer method, HEMnet successfully generated and labeled 21,939 tumor and 8782 normal tiles from ten whole-slide images for model training. After building the model, HEMnet accurately identified colorectal cancer regions, which achieved 0.84 and 0.73 of ROC AUC values compared to p53 staining and pathological annotations, respectively. Our validation study using histopathology images from TCGA samples accurately estimated tumor purity, which showed a significant correlation (regression coefficient of 0.8) with the estimation based on genomic sequencing data. Thus, HEMnet contributes to addressing two main challenges in cancer deep-learning analysis, namely the need to have a large number of images for training and the dependence on manual labeling by a pathologist. HEMnet also predicts cancer cells at a much higher resolution compared to manual histopathologic evaluation. Overall, our method provides a path towards a fully automated delineation of any type of tumor so long as there is a cancer-oriented molecular stain available for subsequent learning. Software, tutorials and interactive tools are available at: https://github.com/BiomedicalMachineLearning/HEMnet.
View details for DOI 10.1038/s41698-022-00252-0
View details for PubMedID 35236916
-
Analysis of 16S rRNA sequencing in advanced colorectal cancer tissue samples
LIPPINCOTT WILLIAMS & WILKINS. 2022
View details for DOI 10.1200/JCO.2022.40.4_suppl.163
View details for Web of Science ID 000770995900159
-
Characterization of the consensus mucosal microbiome of colorectal cancer.
NAR cancer
2021; 3 (4): zcab049
Abstract
Dysbiosis is an imbalance of an organ's microbiome and plays a role in colorectal cancer pathogenesis. Characterizing the bacteria in the microenvironment of a cancer through genome sequencing has advantages compared to culture-based profiling. However, there are notable technical and analytical challenges in characterizing universal features of tumor microbiomes. Colorectal tumors demonstrate microbiome variation among different studies and across individual patients. To address these issues, we conducted a computational study to determine a consensus microbiome for colorectal cancer, analyzing 924 tumors from eight independent RNA-Seq data sets. A standardized meta-transcriptomic analysis pipeline was established with quality control metrics. Microbiome profiles across different cohorts were compared and recurrently altered microbial shifts specific to colorectal cancer were determined. We identified a cancer-specific set of 114 microbial species associated with tumors that were found among all investigated studies. Firmicutes, Bacteroidetes, Proteobacteria and Actinobacteria were among the four most abundant phyla for the colorectal cancer microbiome. Member species of Clostridia were depleted and Fusobacterium nucleatum was one of the most enriched bacterial species in tumors. Associations between the consensus species and specific immune cell types were noted. Our results are available as a web data resource for other researchers to explore (https://crc-microbiome.stanford.edu).
View details for DOI 10.1093/narcan/zcab049
View details for PubMedID 34988460
-
Profiling diverse sequence tandem repeats in colorectal cancer reveals co-occurrence of microsatellite and chromosomal instability involving Chromosome 8.
Genome medicine
2021; 13 (1): 145
Abstract
We developed a sensitive sequencing approach that simultaneously profiles microsatellite instability, chromosomal instability, and subclonal structure in cancer. We assessed diverse repeat motifs across 225 microsatellites on colorectal carcinomas. Our study identified elevated alterations at both selected tetranucleotide and conventional mononucleotide repeats. Many colorectal carcinomas had a mix of genomic instability states that are normally considered exclusive. An MSH3 mutation may have contributed to the mixed states. Increased copy number of chromosome arm 8q was most prevalent among tumors with microsatellite instability, including a case of translocation involving 8q. Subclonal analysis identified co-occurring driver mutations previously known to be exclusive.
View details for DOI 10.1186/s13073-021-00958-z
View details for PubMedID 34488871
-
Profiling SARS-CoV-2 mutation fingerprints that range from the viral pangenome to individual infection quasispecies.
Genome medicine
2021; 13 (1): 62
Abstract
BACKGROUND: The genome of SARS-CoV-2 is susceptible to mutations during viral replication due to the errors generated by RNA-dependent RNA polymerases. These mutations enable the SARS-CoV-2 to evolve into new strains. Viral quasispecies emerge from de novo mutations that occur in individual patients. In combination, these sets of viral mutations provide distinct genetic fingerprints that reveal the patterns of transmission and have utility in contact tracing. METHODS: Leveraging thousands of sequenced SARS-CoV-2 genomes, we performed a viral pangenome analysis to identify conserved genomic sequences. We used a rapid and highly efficient computational approach that relies on k-mers, short tracts of sequence, instead of conventional sequence alignment. Using this method, we annotated viral mutation signatures that were associated with specific strains. Based on these highly conserved viral sequences, we developed a rapid and highly scalable targeted sequencing assay to identify mutations, detect quasispecies variants, and identify mutation signatures from patients. These results were compared to the pangenome genetic fingerprints. RESULTS: We built a k-mer index for thousands of SARS-CoV-2 genomes and identified conserved genomics regions and landscape of mutations across thousands of virus genomes. We delineated mutation profiles spanning common genetic fingerprints (the combination of mutations in a viral assembly) and a combination of mutations that appear in only a small number of patients. We developed a targeted sequencing assay by selecting primers from the conserved viral genome regions to flank frequent mutations. Using a cohort of 100 SARS-CoV-2 clinical samples, we identified genetic fingerprints consisting of strain-specific mutations seen across populations and de novo quasispecies mutations localized to individual infections. We compared the mutation profiles of viral samples undergoing analysis with the features of the pangenome. CONCLUSIONS: We conducted an analysis for viral mutation profiles that provide the basis of genetic fingerprints. Our study linked pangenome analysis with targeted deep sequenced SARS-CoV-2 clinical samples. We identified quasispecies mutations occurring within individual patients and determined their general prevalence when compared to over 70,000 other strains. Analysis of these genetic fingerprints may provide a way of conducting molecular contact tracing.
View details for DOI 10.1186/s13073-021-00882-2
View details for PubMedID 33875001
-
Single Cell Analysis Can Define Distinct Evolution of Tumor Sites in Follicular Lymphoma.
Blood
2021
Abstract
Tumor heterogeneity complicates biomarker development and fosters drug resistance in solid malignancies. In lymphoma, our knowledge of site-to-site heterogeneity and its clinical implications is still limited. Here, we profiled two nodal, synchronously-acquired tumor samples from ten follicular lymphoma patients using single cell RNA, B cell receptor (BCR) and T cell receptor sequencing, and flow cytometry. By following the rapidly mutating tumor immunoglobulin genes, we discovered that BCR subclones were shared between the two tumor sites in some patients, but in many patients the disease had evolved separately with limited tumor cell migration between the sites. Patients exhibiting divergent BCR evolution also exhibited divergent tumor gene expression and cell surface protein profiles. While the overall composition of the tumor microenvironment did not differ significantly between sites, we did detect a specific correlation between site-to-site tumor heterogeneity and T follicular helper (Tfh) cell abundance. We further observed enrichment of particular ligand-receptor pairs between tumor and Tfh cells, including CD40 and CD40LG, and a significant correlation between tumor CD40 expression and Tfh proliferation. Our study may explain discordant responses to systemic therapies, underscores the difficulty of capturing a patient's disease with a single biopsy, and furthers our understanding of tumor-immune networks in follicular lymphoma.
View details for DOI 10.1182/blood.2020009855
View details for PubMedID 33728464
-
Unique k-mer sequences for validating cancer-related substitution, insertion and deletion mutations.
NAR cancer
2020; 2 (4): zcaa034
Abstract
Cancer genome sequencing has led to important discoveries such as the identification of cancer genes. However, challenges remain in the analysis of cancer genome sequencing. One significant issue is that mutations identified by multiple variant callers are frequently discordant even when using the same genome sequencing data. For insertion and deletion mutations, oftentimes there is no agreement among different callers. Identifying somatic mutations involves read mapping and variant calling, a complicated process that uses many parameters and model tuning. To validate the identification of true mutations, we developed a method using k-mer sequences. First, we characterized the landscape of unique versus non-unique k-mers in the human genome. Second, we developed a software package, KmerVC, to validate the given somatic mutations from sequencing data. Our program validates the occurrence of a mutation based on a statistically significant difference in the frequency of k-mers with and without a mutation from matched normal and tumor sequences. Third, we tested our method on both simulated and cancer genome sequencing data. Counting k-mers involving mutations effectively validated true positive mutations including insertions and deletions across different individual samples in a reproducible manner. Thus, we demonstrated a straightforward approach for rapidly validating mutations from cancer genome sequencing data.
View details for DOI 10.1093/narcan/zcaa034
View details for PubMedID 33345188
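The validation idea in the abstract above, comparing how often k-mers with and without a mutation appear in matched tumor and normal reads, can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the KmerVC code: the one-sided Fisher exact test is a test chosen here for the example (the abstract does not name the statistic), and the k-mers, reads, and helper names are hypothetical.

```python
from math import comb


def reads_with(reads, kmer):
    """Count reads that contain the k-mer at least once (exact match)."""
    return sum(kmer in read for read in reads)


def fisher_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]:
    probability of a or more in the top-left cell with the margins fixed."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    return sum(comb(row1, x) * comb(row2, col1 - x)
               for x in range(a, upper + 1)) / total


# Hypothetical k-mers spanning one substitution (T>A at the fifth base).
ref_kmer = "ACGTTGCA"
mut_kmer = "ACGTAGCA"

# Made-up reads: the tumor carries the mutation in half of its reads.
tumor_reads = ["GG" + mut_kmer + "TT"] * 6 + ["GG" + ref_kmer + "TT"] * 6
normal_reads = ["GG" + ref_kmer + "TT"] * 12

a = reads_with(tumor_reads, mut_kmer)   # tumor reads supporting the mutation
b = reads_with(tumor_reads, ref_kmer)   # tumor reads supporting the reference
c = reads_with(normal_reads, mut_kmer)  # normal reads supporting the mutation
d = reads_with(normal_reads, ref_kmer)  # normal reads supporting the reference

p_value = fisher_greater(a, b, c, d)
print(a, b, c, d, p_value)
```

A small p-value here says the mutant k-mer is enriched in the tumor relative to the matched normal, which is the kind of evidence the abstract describes. No read mapping is involved, only exact k-mer matching.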
-
SPATIAL SINGLE-CELL ANALYSIS OF COLORECTAL CANCER TUMOUR USING MULTIPLEXED IMAGING MASS CYTOMETRY
BMJ PUBLISHING GROUP. 2020: A399
View details for DOI 10.1136/jitc-2020-SITC2020.0665
View details for Web of Science ID 000616665301184
-
CRISPRpic: fast and precise analysis for CRISPR-induced mutations via prefixed index counting.
NAR genomics and bioinformatics
2020; 2 (2): lqaa012
Abstract
Analysis of CRISPR-induced mutations at a targeted locus can be achieved by polymerase chain reaction amplification followed by massively parallel sequencing. We developed a novel algorithm, named CRISPRpic, to analyze the sequencing reads for CRISPR experiments via counting exact-matching and pattern-searching. Compared to other methods based on sequence alignment, CRISPRpic provides precise mutation calling and ultrafast analysis of the sequencing results. A Python script for CRISPRpic is available at https://github.com/compbio/CRISPRpic.
View details for DOI 10.1093/nargab/lqaa012
View details for PubMedID 32118203
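To make the alignment-free approach in the abstract above concrete, the following sketch classifies amplicon reads by exact matching of two flanking anchor sequences and then counts each outcome. It is an assumption-laden illustration rather than the CRISPRpic algorithm itself: the anchor sequences, the wild-type window, the outcome labels, and the function name are all invented for this example.

```python
from collections import Counter


def classify_read(read, left, right, wt_window):
    """Classify a read by the sequence found between two exact-match anchors."""
    i = read.find(left)
    if i == -1:
        return "unclassified"
    start = i + len(left)
    j = read.find(right, start)
    if j == -1:
        return "unclassified"
    window = read[start:j]
    if window == wt_window:
        return "wild_type"
    diff = len(window) - len(wt_window)
    if diff < 0:
        return f"deletion_{-diff}bp"
    if diff > 0:
        return f"insertion_{diff}bp"
    return "substitution"  # same length as wild type, different sequence


# Hypothetical anchors flanking the expected cut site, and the wild-type window.
LEFT, WT, RIGHT = "ACGTAC", "GGATCC", "TTGACA"

reads = (
    [LEFT + "GGATCC" + RIGHT] * 3    # unedited
    + [LEFT + "GGCC" + RIGHT] * 2    # 2 bp deleted
    + [LEFT + "GGATTCC" + RIGHT]     # 1 bp inserted
    + [LEFT + "GGTTCC" + RIGHT]      # 1 bp substituted
    + ["TTTTTTTT"]                   # anchors missing
)

counts = Counter(classify_read(r, LEFT, RIGHT, WT) for r in reads)
print(counts)
```

Because every step is a substring search, the cost per read is small and the call does not depend on alignment parameters, which is the trade-off the abstract highlights.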
-
Entire landscape of epitopes from all possible missense mutations in human coding sequences.
AMER ASSOC CANCER RESEARCH. 2020: 118–19
View details for Web of Science ID 000522837200195
-
Comprehensive genomic sequencing of high-grade neuroendocrine neoplasms
AMER SOC CLINICAL ONCOLOGY. 2020
View details for Web of Science ID 000530922700602
-
Gastric Cancer Registry: A comprehensive patient-reported resource for multidisciplinary and translational genomic approaches to gastric cancer
AMER SOC CLINICAL ONCOLOGY. 2020
View details for Web of Science ID 000530922700413
-
Whole genome analysis identifies the association of TP53 genomic deletions with lower survival in Stage III colorectal cancer.
Scientific reports
2020; 10 (1): 5009
Abstract
DNA copy number aberrations (CNA) are frequently observed in colorectal cancers (CRC). There is an urgent need for CNA-based biomarkers in clinics. For Stage III CRC, if combined with imaging or pathologic evidence, these markers promise more precise care. We conducted this Stage III specific biomarker discovery with a cohort of 134 CRCs, and with a newly developed high-efficiency CNA profiling protocol. Specifically, we developed the profiling protocol for tumor-normal matched tissue samples based on low-coverage clinical whole-genome sequencing (WGS). We demonstrated the protocol's accuracy and robustness by a systematic benchmark with microarray, high-coverage whole-exome and -genome approaches, where the low-coverage WGS-derived CNA segments were highly concordant (PCC >0.95) with those derived from microarray, and they were substantially less variable than exome-derived segments. A lasso-based model and multivariate Cox regression analysis identified a chromosome 17p loss, containing the TP53 tumor suppressor gene, that was significantly associated with reduced survival (P = 0.0139, HR = 1.688, 95% CI = [1.112-2.562]), which was validated by an independent cohort of 187 Stage III CRCs. In summary, this low-coverage WGS protocol has high sensitivity, high resolution and low cost, and the identified 17p loss is an effective poor prognosis marker for Stage III patients.
View details for DOI 10.1038/s41598-020-61643-6
View details for PubMedID 32193467
-
Profiling SARS-CoV-2 mutation fingerprints that range from the viral pangenome to individual infection quasispecies.
medRxiv : the preprint server for health sciences
2020
Abstract
The genome of SARS-CoV-2 is susceptible to mutations during viral replication due to the errors generated by RNA-dependent RNA polymerases. These mutations enable SARS-CoV-2 to evolve into new strains. Viral quasispecies emerge from de novo mutations that occur in individual patients. In combination, these sets of viral mutations provide distinct genetic fingerprints that reveal the patterns of transmission and have utility in contact tracing. Leveraging thousands of sequenced SARS-CoV-2 genomes, we performed a viral pangenome analysis to identify conserved genomic sequences. We used a rapid and highly efficient computational approach that relies on k-mers, short tracts of sequence, instead of conventional sequence alignment. Using this method, we annotated viral mutation signatures that were associated with specific strains. Based on these highly conserved viral sequences, we developed a rapid and highly scalable targeted sequencing assay to identify mutations, detect quasispecies and identify mutation signatures from patients. These results were compared to the pangenome genetic fingerprints. We built a k-mer index for thousands of SARS-CoV-2 genomes and identified conserved genomic regions and the landscape of mutations across thousands of virus genomes. We delineated mutation profiles spanning common genetic fingerprints (the combination of mutations in a viral assembly) and rare ones that occur in only a small fraction of patients. We developed a targeted sequencing assay by selecting primers from the conserved viral genome regions to flank frequent mutations. Using a cohort of SARS-CoV-2 clinical samples, we identified genetic fingerprints consisting of strain-specific mutations seen across populations and de novo quasispecies mutations localized to individual infections. 
We compared the mutation profiles of viral samples undergoing analysis with the features of the pangenome. We conducted an analysis for viral mutation profiles that provide the basis of genetic fingerprints. Our study linked pangenome analysis with targeted deep sequenced SARS-CoV-2 clinical samples. We identified quasispecies mutations occurring within individual patients, mutations demarcating dominant species and the prevalence of mutation signatures, of which a significant number were relatively unique. Analysis of these genetic fingerprints may provide a way of conducting molecular contact tracing.
View details for DOI 10.1101/2020.11.02.20224816
View details for PubMedID 33173909
View details for PubMedCentralID PMC7654905
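One way to picture the alignment-free search for conserved regions is to index which genomes contain each k-mer and keep the k-mers shared by (nearly) all of them. The sketch below does that on toy sequences; the function names, the `min_fraction` threshold and the example genomes are assumptions for illustration and do not reproduce the study's actual index.

```python
from collections import Counter


def kmer_set(seq, k):
    """Return the distinct k-mers present in one genome."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_prevalence(genomes, k):
    """Map each k-mer to the number of genomes that contain it."""
    prevalence = Counter()
    for genome in genomes:
        prevalence.update(kmer_set(genome, k))  # each genome counts at most once
    return prevalence


def conserved_kmers(genomes, k, min_fraction=1.0):
    """Return k-mers found in at least `min_fraction` of the genomes."""
    needed = min_fraction * len(genomes)
    return {km for km, n in kmer_prevalence(genomes, k).items() if n >= needed}


# Hypothetical toy "genomes": the second carries a single G>C substitution.
genomes = ["ATGGCGTACGTTAGC", "ATGGCGTACCTTAGC", "ATGGCGTACGTTAGC"]
prevalence = kmer_prevalence(genomes, k=4)
conserved = conserved_kmers(genomes, k=4)
print(sorted(conserved))
```

K-mers that drop out of the conserved set mark positions where at least one genome differs, which is how a single substitution becomes visible without any alignment; primers for a targeted assay would then be chosen from the conserved set.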
-
Author Correction: RNA Transcription and Splicing Errors as a Source of Cancer Frameshift Neoantigens for Vaccines.
Scientific reports
2020; 10 (1): 6251
Abstract
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
View details for DOI 10.1038/s41598-020-63114-4
View details for PubMedID 32253381
-
RNA Transcription and Splicing Errors as a Source of Cancer Frameshift Neoantigens for Vaccines.
Scientific reports
2019; 9 (1): 14184
Abstract
The success of checkpoint inhibitors in cancer therapy is largely attributed to activating the patient's immune response to their tumor's neoantigens arising from DNA mutations. This realization has motivated the interest in personal cancer vaccines based on sequencing the patient's tumor DNA to discover neoantigens. Here we propose an additional, unrecognized source of tumor neoantigens. We show that errors in transcription of microsatellites (MS) and mis-splicing of exons create highly immunogenic frameshift (FS) neoantigens in tumors. The sequences of these FS neoantigens are predictable, allowing creation of a peptide array representing all possible neoantigen FS peptides. This array can be used to detect the antibody response in a patient to the FS peptides. A survey of 5 types of cancers reveals peptides that are personally reactive for each patient. This source of neoantigens and the method to discover them may be useful in developing cancer vaccines.
View details for DOI 10.1038/s41598-019-50738-4
View details for PubMedID 31578439
-
Targeted short read sequencing and assembly of re-arrangements and candidate gene loci provide megabase diplotypes.
Nucleic acids research
2019
Abstract
The human genome is composed of two haplotypes, otherwise called diplotypes, which denote phased polymorphisms and structural variations (SVs) that are derived from both parents. Diplotypes place genetic variants in the context of cis-related variants from a diploid genome. As a result, they provide valuable information about hereditary transmission, context of SV, regulation of gene expression and other features which are informative for understanding human genetics. Successful diplotyping with short read whole genome sequencing generally requires either a large population or parent-child trio samples. To overcome these limitations, we developed a targeted sequencing method for generating megabase (Mb)-scale haplotypes with short reads. One selects specific 0.1-0.2 Mb high molecular weight DNA targets with custom-designed Cas9-guide RNA complexes followed by sequencing with barcoded linked reads. To test this approach, we designed three assays, targeting the BRCA1 gene, the entire 4-Mb major histocompatibility complex locus and 18 well-characterized SVs, respectively. Using an integrated alignment- and assembly-based approach, we generated comprehensive variant diplotypes spanning the entirety of the targeted loci and characterized SVs with exact breakpoints. Our results were comparable in quality to long read sequencing.
View details for DOI 10.1093/nar/gkz661
View details for PubMedID 31350896
-
Therapeutic Monitoring of Circulating DNA Mutations in Metastatic Cancer with Personalized Digital PCR.
The Journal of molecular diagnostics : JMD
2019
Abstract
As a high-performance solution for longitudinal monitoring of patients being treated for metastatic cancer, we developed a single-color digital PCR (dPCR) assay that detects and quantifies specific cancer mutations present in circulating tumor DNA (ctDNA). This customizable assay has a high sensitivity of detection. One can detect a mutation allelic fraction of 0.1%, equivalent to three mutation-bearing DNA molecules among 3,000 genome equivalents. The objective of this study was to validate the use of personalized dPCR mutation assays to monitor patients with metastatic cancer. We compared our digital PCR results to serum biomarkers indicating disease progression or response. Patients had metastatic colorectal, biliary, breast, lung and melanoma cancers. Mutations occurred in essential cancer drivers such as BRAF, KRAS and PIK3CA. We monitored patients over multiple cycles of treatment up to a year. All patients had detectable ctDNA mutations. Our results correlated with serum markers of metastatic cancer burden including CEA, CA-19-9, and CA-15-3, and corresponded qualitatively to imaging studies. We observed corresponding trends among these patients receiving active treatment with chemotherapy or targeted agents. For example, in one patient under active treatment, we detected increasing quantities of ctDNA molecules over time, indicating recurrence of tumor. Our study demonstrates that personalized digital PCR enables longitudinal monitoring of patients with metastatic cancer and may be a useful indicator for treatment response.
View details for DOI 10.1016/j.jmoldx.2019.10.008
View details for PubMedID 31837432
-
Author Correction: RNA Transcription and Splicing Errors as a Source of Cancer Frameshift Neoantigens for Vaccines.
Scientific reports
2019; 9 (1): 17815
Abstract
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
View details for DOI 10.1038/s41598-019-54300-0
View details for PubMedID 31767927
-
SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution.
GigaScience
2018
Abstract
Background: Simulating genome sequence data with variant features facilitates the development and benchmarking of structural variant analysis programs. However, there are only a few data simulators that provide structural variants in silico and even fewer that provide variants with different allelic fraction and haplotypes. Findings: We developed SVEngine, an open source tool to address this need. SVEngine simulates next generation sequencing data with embedded structural variations. As input, SVEngine takes template haploid sequences (FASTA) and an external variant file, a variant distribution file and/or a clonal phylogeny tree file (NEWICK). Subsequently, it simulates and outputs sequence contigs (FASTAs), sequence reads (FASTQs) and/or post-alignment files (BAMs). All of the files contain the desired variants, along with BED files containing the ground truth. SVEngine's flexible design process enables one to specify size, position, and allelic fraction for deletions, insertions, duplications, inversions and translocations. Finally, SVEngine simulates sequence data that replicates the characteristics of a sequencing library with mixed sizes of DNA insert molecules. To improve the compute speed, SVEngine is highly parallelized to reduce the simulation time. Conclusions: We demonstrated the versatile features of SVEngine and its improved runtime comparisons with other available simulators. SVEngine's features include the simulation of locus-specific variant frequency designed to mimic the phylogeny of cancer clonal evolution. We validated SVEngine's accuracy by simulating genome-wide structural variants of NA12878 and a heterogeneous cancer genome. Our evaluation included checking various sequencing mapping features such as coverage change, read clipping, insert size shift and neighbouring hanging read pairs for representative variant types. 
Structural variant callers Lumpy and Manta and tumor heterogeneity estimator THetA2 were able to perform realistically on the simulated data. SVEngine is implemented as a standard Python package and is freely available for academic use at: https://bitbucket.org/charade/svengine.
View details for PubMedID 29982625
-
Mapping the comprehensive landscape of missense-mutation neoantigens across the human genome
AMER ASSOC CANCER RESEARCH. 2018
View details for DOI 10.1158/1538-7445.AM2018-1298
View details for Web of Science ID 000468818903270
-
Improved detection and identification of microsatellite instability features in colorectal cancer: Implications for immunotherapy
AMER ASSOC CANCER RESEARCH. 2018
View details for DOI 10.1158/1538-7445.AM2018-421
View details for Web of Science ID 000468818901486
-
High-quality CNV segments from low-coverage whole genome sequencing from FFPE cancer biopsies based on an evaluation of multiple CNV tools
AMER ASSOC CANCER RESEARCH. 2018
View details for DOI 10.1158/1538-7445.AM2018-438
View details for Web of Science ID 000468818901502
-
Single-Color Digital PCR Provides High-Performance Detection of Cancer Mutations from Circulating DNA.
The Journal of molecular diagnostics : JMD
2017; 19 (5): 697-710
Abstract
We describe a single-color digital PCR assay that detects and quantifies cancer mutations directly from circulating DNA collected from the plasma of cancer patients. This approach relies on a double-stranded DNA intercalator dye and paired allele-specific DNA primer sets to determine an absolute count of both the mutation and wild-type-bearing DNA molecules present in the sample. The cell-free DNA assay uses an input of 1 ng of nonamplified DNA, approximately 300 genome equivalents, and has a molecular limit of detection of three mutation DNA genome-equivalent molecules per assay reaction. When using more genome equivalents as input, we demonstrated a sensitivity of 0.10% for detecting the BRAF V600E and KRAS G12D mutations. We developed several mutation assays specific to the cancer driver mutations of patients' tumors and detected these same mutations directly from the nonamplified, circulating cell-free DNA. This rapid and high-performance digital PCR assay can be configured to detect specific cancer mutations unique to an individual cancer, making it a potentially valuable method for patient-specific longitudinal monitoring.
View details for DOI 10.1016/j.jmoldx.2017.05.003
View details for PubMedID 28818432
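The input and sensitivity figures in this abstract follow from simple arithmetic, sketched below. The per-genome mass of about 3.3 pg is a commonly used approximation for a haploid human genome and is an assumption here, as are the function names; the three-molecule limit of detection comes from the abstract.

```python
# Assumption: approximate mass of one haploid human genome, in picograms.
PG_PER_HAPLOID_GENOME = 3.3


def genome_equivalents(input_ng):
    """Approximate number of haploid genome copies in `input_ng` of DNA."""
    return input_ng * 1000 / PG_PER_HAPLOID_GENOME


def min_allelic_fraction(input_ng, lod_molecules=3):
    """Smallest mutant fraction detectable given a molecular limit of detection."""
    return lod_molecules / genome_equivalents(input_ng)


for ng in (1.0, 10.0):
    ge = genome_equivalents(ng)
    print(f"{ng:>4} ng: about {ge:.0f} genome equivalents, "
          f"minimum allelic fraction about {min_allelic_fraction(ng):.2%}")
```

Under these assumptions, 1 ng of input corresponds to roughly 300 genome equivalents, and reaching a 0.1% allelic fraction with a three-molecule detection limit requires roughly 3,000 genome equivalents, or about 10 ng of input.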
-
CRISPR-Cas9-targeted fragmentation and selective sequencing enable massively parallel microsatellite analysis
NATURE COMMUNICATIONS
2017; 8
Abstract
Microsatellites are multi-allelic and composed of short tandem repeats (STRs) with individual motifs composed of mononucleotides, dinucleotides or higher including hexamers. Next-generation sequencing approaches and other STR assays rely on a limited number of PCR amplicons, typically in the tens. Here, we demonstrate STR-Seq, a next-generation sequencing technology that analyses over 2,000 STRs in parallel, and provides the accurate genotyping of microsatellites. STR-Seq employs in vitro CRISPR-Cas9-targeted fragmentation to produce specific DNA molecules covering the complete microsatellite sequence. Amplification-free library preparation provides single molecule sequences without unique molecular barcodes. STR-selective primers enable massively parallel, targeted sequencing of large STR sets. Overall, STR-Seq has higher throughput, improved accuracy and provides a greater number of informative haplotypes compared with other microsatellite analysis approaches. With these new features, STR-Seq can identify a 0.1% minor genome fraction in a DNA mixture composed of different, unrelated samples.
View details for DOI 10.1038/ncomms14291
View details for PubMedID 28169275
-
The Cancer Genome Atlas Clinical Explorer: a web and mobile interface for identifying clinical-genomic driver associations
GENOME MEDICINE
2015; 7
Abstract
The Cancer Genome Atlas (TCGA) project has generated genomic data sets covering over 20 malignancies. These data provide valuable insights into the underlying genetic and genomic basis of cancer. However, exploring the relationship among TCGA genomic results and clinical phenotype remains a challenge, particularly for individuals lacking formal bioinformatics training. Overcoming this hurdle is an important step toward the wider clinical translation of cancer genomic/proteomic data and implementation of precision cancer medicine. Several websites such as the cBio portal or University of California Santa Cruz genome browser make TCGA data accessible but lack interactive features for querying clinically relevant phenotypic associations with cancer drivers. To enable exploration of the clinical-genomic driver associations from TCGA data, we developed the Cancer Genome Atlas Clinical Explorer. The Cancer Genome Atlas Clinical Explorer interface provides a straightforward platform to query TCGA data using one of the following methods: (1) searching for clinically relevant genes, micro RNAs, and proteins by name, cancer types, or clinical parameters; (2) searching for genomic/proteomic profile changes by clinical parameters in a cancer type; or (3) testing two-hit hypotheses. SQL queries run in the background and results are displayed on our portal in an easy-to-navigate interface according to the user's input. To derive these associations, we relied on elastic-net estimates of optimal multiple linear regularized regression and clinical parameters in the space of multiple genomic/proteomic features provided by TCGA data. Moreover, we identified and ranked gene/micro RNA/protein predictors of each clinical parameter for each cancer. The robustness of the results was estimated by bootstrapping. 
Overall, we identify associations of potential clinical relevance among genes/micro RNAs/proteins using our statistical analysis from 25 cancer types and 18 clinical parameters that include clinical stage or smoking history. The Cancer Genome Atlas Clinical Explorer enables the cancer research community and others to explore clinically relevant associations inferred from TCGA data. With its accessible web and mobile interface, users can examine queries and test hypotheses regarding genomic/proteomic alterations across a broad spectrum of malignancies.
View details for DOI 10.1186/s13073-015-0226-3
View details for Web of Science ID 000363619100002
View details for PubMedID 26507825
View details for PubMedCentralID PMC4624593
-
Systematic genomic identification of colorectal cancer genes delineating advanced from early clinical stage and metastasis
BMC MEDICAL GENOMICS
2013; 6
Abstract
Colorectal cancer is the third leading cause of cancer deaths in the United States. The initial assessment of colorectal cancer involves clinical staging that takes into account the extent of primary tumor invasion, determining the number of lymph nodes with metastatic cancer and the identification of metastatic sites in other organs. Advanced clinical stage indicates metastatic cancer, either in regional lymph nodes or in distant organs. While the genomic and genetic basis of colorectal cancer has been elucidated to some degree, less is known about the identity of specific cancer genes that are associated with advanced clinical stage and metastasis. We compiled multiple genomic data types (mutations, copy number alterations, gene expression and methylation status) as well as clinical meta-data from The Cancer Genome Atlas (TCGA). We used an elastic-net regularized regression method on the combined genomic data to identify genetic aberrations and their associated cancer genes that are indicators of clinical stage. We ranked candidate genes by their regression coefficient and level of support from multiple assay modalities. A fit of the elastic-net regularized regression to 197 samples and integrated analysis of four genomic platforms identified the set of top gene predictors of advanced clinical stage, including: WRN, SYK, DDX5 and ADRA2C. These genetic features were identified robustly in bootstrap resampling analysis. We conducted an analysis integrating multiple genomic features including mutations, copy number alterations, gene expression and methylation. This integrated approach in which one considers all of these genomic features performs better than any individual genomic assay. We identified multiple genes that robustly delineate advanced clinical stage, suggesting their possible role in colorectal cancer metastatic progression.
View details for DOI 10.1186/1755-8794-6-54
View details for Web of Science ID 000328897400001
View details for PubMedID 24308539
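For readers unfamiliar with elastic-net regularized regression, the standard objective for a linear model is shown below. The notation is an assumption for illustration (the abstract does not give the exact parameterization used): $y$ is the clinical response for $n$ samples, $X$ the matrix of combined genomic features, $\lambda \ge 0$ the overall penalty strength and $\alpha \in [0, 1]$ the mix between the lasso and ridge penalties.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  \;+\; \lambda \left( \alpha\,\lVert \beta \rVert_1
  \;+\; \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2 \right)
```

The $\ell_1$ term drives most coefficients to exactly zero, which is what allows genes to be ranked by their surviving coefficients, while the $\ell_2$ term stabilizes the fit when features from different assays are correlated.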