I am a postdoctoral scholar in the Department of Genetics at Stanford University. Before moving to Stanford, I was a Simons postdoctoral fellow at the Joint Genome Institute (JGI). I earned my Ph.D. in the Computer Science Department at Stony Brook University. My dissertation, “Algorithms and applications in genome assembly using long read sequencing technology”, was advised by Prof. Michael Schatz of the Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory (CSHL). I received my Master’s degree from Carnegie Mellon University; my master’s thesis was published at the 70th IEEE Vehicular Technology Conference. I received my undergraduate degree cum laude from Seoul National University in Korea and worked for 4 years at AhnLab, Inc., where I programmed a Windows kernel-level file system filter driver for the V3 anti-virus program.
I am interested in developing methods (tools/software) that apply statistical machine learning and deep learning to clinical data and/or long reads produced by third-generation sequencing technologies such as PacBio, Moleculo and Oxford Nanopore. This includes de novo genome assembly, single-cell analysis, cancer data analysis, time series data analysis and population structure analysis, with the aim of analyzing big data, inferring critical factors, predicting future outcomes and discovering biological significance.
Honors & Awards
Simons postdoctoral fellowship, Lawrence Berkeley National Laboratory (2015-2016)
Fellowship, Stony Brook University (2009-2010)
Scholarship from CyLab, Carnegie Mellon University (2006-2007)
Boards, Advisory Committees, Professional Organizations
Program Committee, Pacific Symposium on Biocomputing (PSB) (2015)
Program Committee, Pacific Symposium on Biocomputing (PSB) (2014)
Doctor of Philosophy, S.U.N.Y. State University at Stony Brook (2015)
Master of Science, Carnegie Mellon University (2008)
Michael Snyder, Postdoctoral Faculty Sponsor
Piercing the dark matter: bioinformatics of long-range sequencing and mapping
NATURE REVIEWS GENETICS
2018; 19 (6): 329–46
Several new genomics technologies have become available that offer long-read sequencing or long-range mapping with higher throughput and higher resolution analysis than ever before. These long-range technologies are rapidly advancing the field with improved reference genomes, more comprehensive variant identification and more complete views of transcriptomes and epigenomes. However, they also require new bioinformatics approaches to take full advantage of their unique characteristics while overcoming their complex errors and modalities. Here, we discuss several of the most important applications of the new technologies, focusing on both the currently available bioinformatics tools and opportunities for future research.
View details for PubMedID 29599501
Hybrid assembly with long and short reads improves discovery of gene family expansions
BMC GENOMICS
2017; 18
The pineapple genome and the evolution of CAM photosynthesis
2015; 47 (12): 1435-+
Pineapple (Ananas comosus (L.) Merr.) is the most economically valuable crop possessing crassulacean acid metabolism (CAM), a photosynthetic carbon assimilation pathway with high water-use efficiency, and the second most important tropical fruit. We sequenced the genomes of pineapple varieties F153 and MD2 and a wild pineapple relative, Ananas bracteatus accession CB5. The pineapple genome has one fewer ancient whole-genome duplication event than sequenced grass genomes and a conserved karyotype with seven chromosomes from before the ρ duplication event. The pineapple lineage has transitioned from C3 photosynthesis to CAM, with CAM-related genes exhibiting a diel expression pattern in photosynthetic tissues. CAM pathway genes were enriched with cis-regulatory elements associated with the regulation of circadian clock genes, providing the first cis-regulatory link between CAM and circadian clock regulation. Pineapple CAM photosynthesis evolved by the reconfiguration of pathways in C3 plants, through the regulatory neofunctionalization of preexisting genes and not through the acquisition of neofunctionalized genes via whole-genome or tandem gene duplication.
View details for PubMedID 26523774
SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips
2014; 30 (24): 3476–83
Genomics is expanding from a single reference per species paradigm into a more comprehensive pan-genome approach that analyzes multiple individuals together. A compressed de Bruijn graph is a sophisticated data structure for representing the genomes of entire populations. It robustly encodes shared segments, simple single-nucleotide polymorphisms and complex structural variations far beyond what can be represented in a collection of linear sequences alone. We explore deep topological relationships between suffix trees and compressed de Bruijn graphs and introduce an algorithm, splitMEM, that directly constructs the compressed de Bruijn graph in time and space linear to the total number of genomes for a given maximum genome size. We introduce suffix skips to traverse several suffix links simultaneously and use them to efficiently decompose maximal exact matches into graph nodes. We demonstrate the utility of splitMEM by analyzing the nine-strain pan-genome of Bacillus anthracis and up to 62 strains of Escherichia coli, revealing their core-genome properties.
View details for DOI 10.1093/bioinformatics/btu756
View details for Web of Science ID 000346051000005
View details for PubMedID 25398610
View details for PubMedCentralID PMC4253837
Whole genome de novo assemblies of three divergent strains of rice, Oryza sativa, document novel gene space of aus and indica
2014; 15 (11): 506
The use of high throughput genome-sequencing technologies has uncovered a large extent of structural variation in eukaryotic genomes that makes important contributions to genomic diversity and phenotypic variation. When the genomes of different strains of a given organism are compared, whole genome resequencing data are typically aligned to an established reference sequence. However, when the reference differs in significant structural ways from the individuals under study, the analysis is often incomplete or inaccurate. Here, we use rice as a model to demonstrate how improvements in sequencing and assembly technology allow rapid and inexpensive de novo assembly of next generation sequence data into high-quality assemblies that can be directly compared using whole genome alignment to provide an unbiased assessment. Using this approach, we are able to accurately assess the "pan-genome" of three divergent rice varieties and document several megabases of each genome absent in the other two. Many of the genome-specific loci are annotated to contain genes, reflecting the potential for new biological properties that would be missed by standard reference-mapping approaches. We further provide a detailed analysis of several loci associated with agriculturally important traits, including the S5 hybrid sterility locus, the Sub1 submergence tolerance locus, the LRK gene cluster associated with improved yield, and the Pup1 cluster associated with phosphorus deficiency, illustrating the utility of our approach for biological discovery. All of the data and software are openly available to support further breeding and functional studies of rice and other species.
View details for DOI 10.1186/s13059-014-0506-z
View details for Web of Science ID 000346607300008
View details for PubMedID 25468217
View details for PubMedCentralID PMC4268812
Virmid: accurate detection of somatic mutations with sample impurity inference
2013; 14 (8): R90
Detection of somatic variation using sequence from disease-control matched data sets is a critical first step. In many cases including cancer, however, it is hard to isolate pure disease tissue, and the impurity hinders accurate mutation analysis by disrupting overall allele frequencies. Here, we propose a new method, Virmid, that explicitly determines the level of impurity in the sample, and uses it for improved detection of somatic variation. Extensive tests on simulated and real sequencing data from breast cancer and hemimegalencephaly demonstrate the power of our model. A software implementation of our method is available at http://sourceforge.net/projects/virmid/.
View details for DOI 10.1186/gb-2013-14-8-r90
View details for Web of Science ID 000328195400008
View details for PubMedID 23987214
View details for PubMedCentralID PMC4054681
Genomic dark matter: the reliability of short read mapping illustrated by the genome mappability score
2012; 28 (16): 2097–2105
Genome resequencing and short read mapping are two of the primary tools of genomics and are used for many important applications. The current state-of-the-art in mapping uses the quality values and mapping quality scores to evaluate the reliability of the mapping. These attributes, however, are assigned to individual reads and do not directly measure the problematic repeats across the genome. Here, we present the Genome Mappability Score (GMS) as a novel measure of the complexity of resequencing a genome. The GMS is a weighted probability that any read could be unambiguously mapped to a given position and thus measures the overall composition of the genome itself. We have developed the Genome Mappability Analyzer to compute the GMS of every position in a genome. It leverages the parallelism of cloud computing to analyze large genomes, and enabled us to identify the 5-14% of the human, mouse, fly and yeast genomes that are difficult to analyze with short reads. We examined the accuracy of the widely used BWA/SAMtools polymorphism discovery pipeline in the context of the GMS, and found discovery errors are dominated by false negatives, especially in regions with poor GMS. These errors are fundamental to the mapping process and cannot be overcome by increasing coverage. As such, the GMS should be considered in every resequencing project to pinpoint the 'dark matter' of the genome, including of known clinically relevant variations in these regions. The source code and profiles of several model organisms are available at http://gma-bio.sourceforge.net
View details for DOI 10.1093/bioinformatics/bts330
View details for Web of Science ID 000307501100002
View details for PubMedID 22668792
View details for PubMedCentralID PMC3413383
On the Security of Intra-Car Wireless Sensor Networks
IEEE. 2009: 663-+
View details for Web of Science ID 000280580400132