My research focuses on big data in genomics and precision medicine.
I integrate and interpret multiple omics datasets (whole-genome, whole-exome, RNA-Seq, single-cell RNA-Seq, methylome, etc.) to understand the genetic and genomic basis of disease.
I am proficient in several programming languages (R, Python, Linux/Bash, and Perl), statistical analysis, machine learning methods, data visualization, Docker, and cloud computing.
Find out more about me on my website: https://littlebitofdata.com/
Doctor of Philosophy, The University of Hong Kong, Cancer Genomics (2016)
Whole slide images reflect DNA methylation patterns of human tumors.
NPJ genomic medicine
2020; 5: 11
DNA methylation is an important epigenetic mechanism regulating gene expression and its role in carcinogenesis has been extensively studied. High-throughput DNA methylation assays have been used broadly in cancer research. Histopathology images are commonly obtained in cancer treatment, given that tissue sampling remains the clinical gold-standard for diagnosis. In this work, we investigate the interaction between cancer histopathology images and DNA methylation profiles to provide a better understanding of tumor pathobiology at the epigenetic level. We demonstrate that classical machine learning algorithms can associate the DNA methylation profiles of cancer samples with morphometric features extracted from whole slide images. Furthermore, grouping the genes into methylation clusters greatly improves the performance of the models. The well-predicted genes are enriched in key pathways in carcinogenesis including hypoxia in glioma and angiogenesis in renal cell carcinoma. Our results provide new insights into the link between histopathological and molecular data.
View details for DOI 10.1038/s41525-020-0120-9
View details for PubMedID 32194984
View details for PubMedCentralID PMC7064513
Genomic data imputation with variational auto-encoders.
2020; 9 (8)
As missing values are frequently present in genomic data, practical methods to handle missing data are necessary for downstream analyses that require complete data sets. State-of-the-art imputation techniques, including methods based on singular value decomposition and K-nearest neighbors, can be computationally expensive for large data sets and it is difficult to modify these algorithms to handle certain cases not missing at random. In this work, we use a deep-learning framework based on the variational auto-encoder (VAE) for genomic missing value imputation and demonstrate its effectiveness in transcriptome and methylome data analysis. We show that in the vast majority of our testing scenarios, VAE achieves similar or better performances than the most widely used imputation standards, while having a computational advantage at evaluation time. When dealing with data missing not at random (e.g., few values are missing), we develop simple yet effective methodologies to leverage the prior knowledge about missing data. Furthermore, we investigate the effect of varying latent space regularization strength in VAE on the imputation performances and, in this context, show why VAE has a better imputation capacity compared to a regular deterministic auto-encoder. We describe a deep learning imputation framework for transcriptome and methylome data using a VAE and show that it can be a preferable alternative to traditional methods for data imputation, especially in the setting of large-scale data and certain missing-not-at-random scenarios.
View details for DOI 10.1093/gigascience/giaa082
View details for PubMedID 32761097
Benchmark of long non-coding RNA quantification for RNA sequencing of cancer samples.
2019; 8 (12)
Long non-coding RNAs (lncRNAs) are emerging as important regulators of various biological processes. While many studies have exploited public resources such as RNA sequencing (RNA-Seq) data in The Cancer Genome Atlas to study lncRNAs in cancer, it is crucial to choose the optimal method for accurate expression quantification. In this study, we compared the performance of pseudoalignment methods Kallisto and Salmon, alignment-based transcript quantification method RSEM, and alignment-based gene quantification methods HTSeq and featureCounts, in combination with read aligners STAR, Subread, and HISAT2, in lncRNA quantification, by applying them to both un-stranded and stranded RNA-Seq datasets. Full transcriptome annotation, including protein-coding and non-coding RNAs, greatly improves the specificity of lncRNA expression quantification. Pseudoalignment methods and RSEM outperform HTSeq and featureCounts for lncRNA quantification at both sample- and gene-level comparison, regardless of RNA-Seq protocol type, choice of aligners, and transcriptome annotation. Pseudoalignment methods and RSEM detect more lncRNAs and correlate highly with simulated ground truth. On the contrary, HTSeq and featureCounts often underestimate lncRNA expression. Antisense lncRNAs are poorly quantified by alignment-based gene quantification methods, which can be improved using stranded protocols and pseudoalignment methods. Considering the consistency with ground truth and computational resources, pseudoalignment methods Kallisto or Salmon in combination with full transcriptome annotation is our recommended strategy for RNA-Seq analysis for lncRNAs.
View details for DOI 10.1093/gigascience/giz145
View details for PubMedID 31808800
Establishment and characterization of new tumor xenografts and cancer cell lines from EBV-positive nasopharyngeal carcinoma.
2018; 9 (1): 4663
The lack of representative nasopharyngeal carcinoma (NPC) models has seriously hampered research on EBV carcinogenesis and preclinical studies in NPC. Here we report the successful growth of five NPC patient-derived xenografts (PDXs) from fifty-eight attempts of transplantation of NPC specimens into NOD/SCID mice. The take rates for primary and recurrent NPC are 4.9% and 17.6%, respectively. Successful establishment of a new EBV-positive NPC cell line, NPC43, is achieved directly from patient NPC tissues by including Rho-associated coiled-coil containing kinases inhibitor (Y-27632) in culture medium. Spontaneous lytic reactivation of EBV can be observed in NPC43 upon withdrawal of Y-27632. Whole-exome sequencing (WES) reveals a close similarity in mutational profiles of these NPC PDXs with their corresponding patient NPC. Whole-genome sequencing (WGS) further delineates the genomic landscape and sequences of EBV genomes in these newly established NPC models, which supports their potential use in future studies of NPC.
View details for PubMedID 30405107
A radiogenomic dataset of non-small cell lung cancer.
2018; 5: 180202
Medical image biomarkers of cancer promise improvements in patient care through advances in precision medicine. Compared to genomic biomarkers, image biomarkers provide the advantages of being non-invasive, and characterizing a heterogeneous tumor in its entirety, as opposed to limited tissue available via biopsy. We developed a unique radiogenomic dataset from a Non-Small Cell Lung Cancer (NSCLC) cohort of 211 subjects. The dataset comprises Computed Tomography (CT), Positron Emission Tomography (PET)/CT images, semantic annotations of the tumors as observed on the medical images using a controlled vocabulary, and segmentation maps of tumors in the CT scans. Imaging data are also paired with results of gene mutation analyses, gene expression microarrays and RNA sequencing data from samples of surgically excised tumor tissue, and clinical data, including survival outcomes. This dataset was created to facilitate the discovery of the underlying relationship between tumor molecular and medical image features, as well as the development and evaluation of prognostic medical image biomarkers.
View details for PubMedID 30325352
- A radiogenomic dataset of non-small cell lung cancer SCIENTIFIC DATA 2018; 5
- Benchmark of lncRNA quantification in RNA-Seq of cancer samples AMER ASSOC CANCER RESEARCH. 2018
Whole-exome sequencing reveals critical genes underlying metastasis in esophageal squamous cell carcinoma.
The Journal of pathology
Esophageal squamous cell carcinoma (ESCC) is one of the most lethal cancers due to a high frequency of metastasis. However, little is known about the genomic landscape of metastatic ESCC. To identify the genetic alterations that underlie ESCC metastasis, whole-exome sequencing (WES) was performed for 41 primary tumors and 15 lymph nodes (LNs) with metastatic ESCC. Eleven cases included matched primary tumors, synchronous LN metastases and non-neoplastic mucosa. Approximately 50-76% of the mutations identified in primary tumors appeared in the synchronous LN metastases. Metastatic ESCC harbor frequent mutations of TP53, KMT2D, ZNF750, and IRF5. Importantly, ZNF750 was recurrently mutated in metastatic ESCC. Combined analysis from current and previous genomic ESCC studies indicated more frequent ZNF750 mutation occurred in diagnosed cases with LN metastasis than those without metastasis (14% vs. 3.4%, n = 629, p = 1.78 × 10⁻⁵). The Cancer Genome Atlas (TCGA) data further showed that ZNF750 genetic alterations were associated with early disease relapse. Previous ESCC studies demonstrated that ZNF750 knockdown strongly promotes proliferation, migration and invasion. Collectively, these results suggest a role for ZNF750 as a metastasis suppressor. TP53 is highly mutated in ESCC and missense mutations are associated with poor overall survival, independent of pathological stage, suggesting these missense mutations have important functional impact for tumor progression, and are, thus, likely to be gain-of-function (GOF) mutations. Additionally, mutations of epigenetic regulators, including KMT2D, TET2 and KAT2A, and chromosomal 6p22 and 11q23 deletions of histone variants, important for nucleosome assembly, were detected in 80% of LN metastases. Our study highlights the important role of critical genetic events including ZNF750 mutations, TP53 putative GOF mutations and nucleosome disorganization caused by genetic lesions seen with ESCC metastasis.
View details for DOI 10.1002/path.4925
View details for PubMedID 28608921
Whole-exome sequencing identifies multiple loss-of-function mutations of NF-kappa B pathway regulators in nasopharyngeal carcinoma
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2016; 113 (40): 11283-11288
Nasopharyngeal carcinoma (NPC) is an epithelial malignancy with a unique geographical distribution. The genomic abnormalities leading to NPC pathogenesis remain unclear. In total, 135 NPC tumors were examined to characterize the mutational landscape using whole-exome sequencing and targeted resequencing. An APOBEC cytidine deaminase mutagenesis signature was revealed in the somatic mutations. Noticeably, multiple loss-of-function mutations were identified in several NF-κB signaling negative regulators NFKBIA, CYLD, and TNFAIP3. Functional studies confirmed that inhibition of NFKBIA had a significant impact on NF-κB activity and NPC cell growth. The identified loss-of-function mutations in NFKBIA leading to protein truncation contributed to the altered NF-κB activity, which is critical for NPC tumorigenesis. In addition, somatic mutations were found in several cancer-relevant pathways, including cell cycle-phase transition, cell death, EBV infection, and viral carcinogenesis. These data provide an enhanced road map for understanding the molecular basis underlying NPC.
View details for DOI 10.1073/pnas.1607606113
View details for Web of Science ID 000384528900070
View details for PubMedID 27647909
View details for PubMedCentralID PMC5056105
Genetic and epigenetic landscape of nasopharyngeal carcinoma.
Chinese clinical oncology
2016; 5 (2): 16-?
Nasopharyngeal carcinoma (NPC) is a unique epithelial malignancy that shows a remarkable geographical and ethnic distribution. Multiple factors including predisposing genetic factors, environmental carcinogens, and Epstein-Barr virus (EBV) infection contribute to the accumulation of genetic and epigenetic alterations leading to NPC development. Emerging technologies now allow us to characterize and understand cancer genomes in detail. Genome-wide studies show that typically NPC tumors are characterized as having comparatively low mutation rates, widespread hypermethylation, and frequent copy number alterations and chromosome abnormalities. In this review, we provide an updated overview of the genetic and epigenetic aberrations that likely drive nasopharyngeal tumor development and progression. We integrate the previous knowledge and novel findings from whole-exome sequencing (WES) and methylome studies in NPC, and further discuss the potential use of these findings to identify biomarkers for NPC diagnosis and prognosis.
View details for DOI 10.21037/cco.2016.03.06
View details for PubMedID 27121876
Whole-exome sequencing identifies MST1R as a genetic susceptibility gene in nasopharyngeal carcinoma
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
2016; 113 (12): 3317-3322
Multiple factors, including host genetics, environmental factors, and Epstein-Barr virus (EBV) infection, contribute to nasopharyngeal carcinoma (NPC) development. To identify genetic susceptibility genes for NPC, a whole-exome sequencing (WES) study was performed in 161 NPC cases and 895 controls of Southern Chinese descent. The gene-based burden test discovered an association between macrophage-stimulating 1 receptor (MST1R) and NPC. We identified 13 independent cases carrying the MST1R pathogenic heterozygous germ-line variants, and 53.8% of these cases were diagnosed with NPC aged at or even younger than 20 y, indicating that MST1R germline variants are relevant to disease early-age onset (EAO) (age of ≤20 y). In total, five MST1R missense variants were found in EAO cases but were rare in controls (EAO vs. control, 17.9% vs. 1.2%, P = 7.94 × 10⁻¹²). The validation study, including 2,160 cases and 2,433 controls, showed that the MST1R variant c.G917A:p.R306H is highly associated with NPC (odds ratio of 9.0). MST1R is predominantly expressed in the tissue-resident macrophages and is critical for innate immunity that protects organs from tissue damage and inflammation. Importantly, MST1R expression is detected in the ciliated epithelial cells in normal nasopharyngeal mucosa and plays a role in the cilia motility important for host defense. Although no somatic mutation of MST1R was identified in the sporadic NPC tumors, copy number alterations and promoter hypermethylation at MST1R were often observed. Our findings provide new insights into the pathogenesis of NPC by highlighting the involvement of the MST1R-mediated signaling pathways.
View details for DOI 10.1073/pnas.1523436113
View details for Web of Science ID 000372488200060
View details for PubMedID 26951679
View details for PubMedCentralID PMC4812767
Comparative methylome analysis in solid tumors reveals aberrant methylation at chromosome 6p in nasopharyngeal carcinoma
2015; 4 (7): 1079-1090
Altered patterns of DNA methylation are key features of cancer. Nasopharyngeal carcinoma (NPC) has the highest incidence in Southern China. Aberrant methylation at the promoter region of tumor suppressors is frequently reported in NPC; however, genome-wide methylation changes have not been comprehensively investigated. Therefore, we systematically analyzed methylome data in 25 primary NPC tumors and nontumor counterparts using a high-throughput approach with the Illumina HumanMethylation450 BeadChip. Comparatively, we examined the methylome data of 11 types of solid tumors collected by The Cancer Genome Atlas (TCGA). In NPC, the hypermethylation pattern was more dominant than hypomethylation and the majority of de novo methylated loci were within or close to CpG islands in tumors. The comparative methylome analysis reveals hypermethylation at chromosome 6p21.3 frequently occurred in NPC (false discovery rate; FDR = 1.33 × 10⁻⁹), but was less obvious in other types of solid tumors except for prostate and Epstein-Barr virus (EBV)-positive gastric cancer (FDR < 10⁻³). Bisulfite pyrosequencing results further confirmed the aberrant methylation at 6p in an additional patient cohort. Evident enrichment of the repressive mark H3K27me3 and active mark H3K4me3 derived from human embryonic stem cells was found at these regions, indicating both DNA methylation and histone modification function together, leading to epigenetic deregulation in NPC. Our study highlights the importance of epigenetic deregulation in NPC. Polycomb Complex 2 (PRC2), responsible for H3K27 trimethylation, is a promising therapeutic target. A key genomic region on 6p with aberrant methylation was identified. This region contains several important genes having potential use as biomarkers for NPC detection.
View details for DOI 10.1002/cam4.451
View details for Web of Science ID 000357899100013
View details for PubMedID 25924914
View details for PubMedCentralID PMC4529346
Viral-Inducible Argonaute18 Confers Broad-Spectrum Virus Resistance in Rice by Sequestering A Host MicroRNA
Viral pathogens are a major threat to rice production worldwide. Although RNA interference (RNAi) is known to mediate antiviral immunity in plant and animal models, the mechanism of antiviral RNAi in rice and other economically important crops is poorly understood. Here, we report that rice resistance to evolutionarily diverse viruses requires Argonaute18 (AGO18). Genetic studies reveal that the antiviral function of AGO18 depends on its activity to sequester microRNA168 (miR168) to alleviate repression of rice AGO1 essential for antiviral RNAi. Expression of miR168-resistant AGO1a in ago18 background rescues or increases rice antiviral activity. Notably, stable transgenic expression of AGO18 confers broad-spectrum virus resistance in rice. Our findings uncover a novel cooperative antiviral activity of two distinct AGO proteins and suggest a new strategy for the control of viral diseases in rice.
View details for DOI 10.7554/eLife.05733
View details for Web of Science ID 000349462700004
View details for PubMedID 25688565
View details for PubMedCentralID PMC4358150
RNA-dependent RNA polymerase 6 of rice (Oryza sativa) plays role in host defense against negative-strand RNA virus, Rice stripe virus
2012; 163 (2): 512-519
RNA-dependent RNA polymerases (RDRs) from fungi, plants and some invertebrate animals play fundamental roles in antiviral defense. Here, we investigated the role of RDR6 in the defense of economically important rice plants against a negative-strand RNA virus (Rice stripe virus, RSV) that causes enormous crop damage. In three independent transgenic lines (OsRDR6AS line A, B and C) in which OsRDR6 transcription levels were reduced by 70-80% through antisense silencing, the infection and disease symptoms of RSV were shown to be significantly enhanced. The hypersusceptibilities of the OsRDR6AS plants were attributed not to enhanced insect infestation but to enhanced virus infection. The rise in symptoms was associated with the increased accumulation of RSV genomic RNA in the OsRDR6AS plants. The deep sequencing data showed reduced RSV-derived siRNA accumulation in the OsRDR6AS plants compared with the wild type plants. This is the first report of the antiviral role of a RDR in a monocot crop plant in the defense against a negative-strand RNA virus and significantly expands upon the current knowledge of the antiviral roles of RDRs in the defense against different types of viral genomes in numerous groups of plants.
View details for DOI 10.1016/j.virusres.2011.11.016
View details for Web of Science ID 000301309400013
View details for PubMedID 22142475
Viral Infection Induces Expression of Novel Phased MicroRNAs from Conserved Cellular MicroRNA Precursors
2011; 7 (8)
RNA silencing, mediated by small RNAs including microRNAs (miRNAs) and small interfering RNAs (siRNAs), is a potent antiviral or antibacterial mechanism, besides regulating normal cellular gene expression critical for development and physiology. To gain insights into host small RNA metabolism under infections by different viruses, we used Solexa/Illumina deep sequencing to characterize the small RNA profiles of rice plants infected by two distinct viruses, Rice dwarf virus (RDV, dsRNA virus) and Rice stripe virus (RSV, a negative sense and ambisense RNA virus), respectively, as compared with those from non-infected plants. Our analyses showed that RSV infection enhanced the accumulation of some rice miRNA*s, but not their corresponding miRNAs, as well as accumulation of phased siRNAs from a particular precursor. Furthermore, RSV infection also induced the expression of novel miRNAs in a phased pattern from several conserved miRNA precursors. In comparison, no such changes in host small RNA expression were observed in RDV-infected rice plants. Significantly, RSV infection elevated the expression levels of selective OsDCLs and OsAGOs, whereas RDV infection only affected the expression of certain OsRDRs. Our results provide a comparative analysis, via deep sequencing, of changes in the small RNA profiles and in the genes of RNA silencing machinery induced by different viruses in a natural and economically important crop host plant. They uncover new mechanisms and complexity of virus-host interactions that may have important implications for further studies on the evolution of cellular small RNA biogenesis that impact pathogen infection, pathogenesis, as well as organismal development.
View details for DOI 10.1371/journal.ppat.1002176
View details for Web of Science ID 000294298100019
View details for PubMedID 21901091
View details for PubMedCentralID PMC3161970