Bio


Iman Hajirasouliha is a Post Doctoral scholar at the computer science department, Stanford University, working with Serafim Batzoglou. His research focuses on computational genomics, large-scale sequence analysis, and analyzing intra-tumor heterogeneity and somatic structural variants in cancer.

Iman received his B.Sc. in Computer Engineering (software) from Sharif University and his M.Sc. in Computer Science from Simon Fraser University (SFU). In January 2008, he moved to Lab for Computational Biology at SFU under Cenk Sahinalp for his doctoral studies. He defended his Ph.D. thesis on August 1, 2012, with Exceptional Recognition from the external reviewer. Prior to joining Stanford, Iman held a Post Doctoral Research Associate of Computing Science and Center for Computational Molecular Biology at Brown University, working under Ben Raphael, for 18 months.

During his Ph.D. studies, Iman completed the PIMS-IGTC in Mathematical Biology program and he was a student collaborator at Canada's Michael Smith Genome Sciences Centre. He also worked at the Genome Sciences department, University of Washington under Evan Eichler and Can Alkan, as a visiting scholar for five months. He also spent two semesters at Bilkent and Sabanci universities in Turkey and University of Durham in England, when both his Ph.D. supervisors took sabbatical leave abroad.

Honors & Awards


  • Simons-Berkeley Research Fellowship (Algorithmic Challenges in Genomics), The Simons Institute for the Theory of Computing (2016)
  • Postdoctoral Fellowships, Natural Sciences and Engineering Research Council of Canada (2014-2016)
  • Michael Smith Foreign Study Supplements, Natural Sciences and Engineering Research Council of Canada (2012)
  • Best Paper Award, ISMB-HitSeq (2011)
  • Alexander Graham Bell Canada Graduate Scholarships, Natural Sciences and Engineering Research Council of Canada (2010-2012)

Boards, Advisory Committees, Professional Organizations


  • Participant, The 1000 Genomes Project (2009 - Present)
  • Member, International Society for Computational Biology (2008 - Present)

Professional Education


  • Doctor of Philosophy, Simon Fraser University (2012)

Stanford Advisors


Current Research and Scholarly Interests


Computational Genomics, Cancer Heterogeneity, Combinatorics, Machine Learning

All Publications


  • A combinatorial approach for analyzing intra-tumor heterogeneity from high-throughput sequencing data. Bioinformatics (Oxford, England) Hajirasouliha, I., Mahmoody, A., Raphael, B. J. 2014; 30 (12): i78-i86

    Abstract

    High-throughput sequencing of tumor samples has shown that most tumors exhibit extensive intra-tumor heterogeneity, with multiple subpopulations of tumor cells containing different somatic mutations. Recent studies have quantified this intra-tumor heterogeneity by clustering mutations into subpopulations according to the observed counts of DNA sequencing reads containing the variant allele. However, these clustering approaches do not consider that the population frequencies of different tumor subpopulations are correlated by their shared ancestry in the same population of cells.We introduce the binary tree partition (BTP), a novel combinatorial formulation of the problem of constructing the subpopulations of tumor cells from the variant allele frequencies of somatic mutations. We show that finding a BTP is an NP-complete problem; derive an approximation algorithm for an optimization version of the problem; and present a recursive algorithm to find a BTP with errors in the input. We show that the resulting algorithm outperforms existing clustering approaches on simulated and real sequencing data. Availability and implementation: Python and MATLAB implementations of our method are available at http://compbio.cs.brown.edu/software/braphael@cs.brown.edu Supplementary information: Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btu284

    View details for PubMedID 24932008

  • An integrated map of genetic variation from 1,092 human genomes NATURE Altshuler, D. M., Durbin, R. M., Abecasis, G. R., Bentley, D. R., Chakravarti, A., Clark, A. G., Donnelly, P., Eichler, E. E., Flicek, P., Gabriel, S. B., Gibbs, R. A., Green, E. D., Hurles, M. E., Knoppers, B. M., Korbel, J. O., Lander, E. S., Lee, C., Lehrach, H., Mardis, E. R., Marth, G. T., McVean, G. A., Nickerson, D. A., Schmidt, J. P., Sherry, S. T., Wang, J., Wilson, R. K., Gibbs, R. A., Dinh, H., Kovar, C., Lee, S., Lewis, L., Muzny, D., Reid, J., Wang, M., Wang, J., Fang, X., Guo, X., Jian, M., Jiang, H., Jin, X., Li, G., Li, J., Li, Y., Li, Z., Liu, X., Lu, Y., Ma, X., Su, Z., Tai, S., Tang, M., Wang, B., Wang, G., Wu, H., Wu, R., Yin, Y., Zhang, W., Zhao, J., Zhao, M., Zheng, X., Zhou, Y., Lander, E. S., Altshuler, D. M., Gabriel, S. B., Gupta, N., Flicek, P., Clarke, L., Leinonen, R., Smith, R. E., Zheng-Bradley, X., Bentley, D. R., Grocock, R., Humphray, S., James, T., Kingsbury, Z., Lehrach, H., Sudbrak, R., Albrecht, M. W., Amstislavskiy, V. S., Borodina, T. A., Lienhard, M., Mertes, F., Sultan, M., Timmermann, B., Yaspo, M., Sherry, S. T., McVean, G. A., Mardis, E. R., Wilson, R. K., Fulton, L., Fulton, R., Weinstock, G. M., Durbin, R. M., Balasubramaniam, S., Burton, J., Danecek, P., Keane, T. M., Kolb-Kokocinski, A., McCarthy, S., Stalker, J., Quail, M., Schmidt, J. P., Davies, C. J., Gollub, J., Webster, T., Wong, B., Zhan, Y., Auton, A., Gibbs, R. A., Yu, F., Bainbridge, M., Challis, D., Evani, U. S., Lu, J., Muzny, D., Nagaswamy, U., Reid, J., Sabo, A., Wang, Y., Yu, J., Wang, J., Coin, L. J., Fang, L., Guo, X., Jin, X., Li, G., Li, Q., Li, Y., Li, Z., Lin, H., Liu, B., Luo, R., Qin, N., Shao, H., Wang, B., Xie, Y., Ye, C., Yu, C., Zhang, F., Zheng, H., Zhu, H., Marth, G. T., Garrison, E. P., Kural, D., Lee, W., Leong, W. F., Ward, A. N., Wu, J., Zhang, M., Lee, C., Griffin, L., Hsieh, C., Mills, R. E., Shi, X., von Grotthuss, M., Zhang, C., Daly, M. J., DePristo, M. A., Altshuler, D. M., Banks, E., Bhatia, G., Carneiro, M. O., del Angel, G., Gabriel, S. B., Genovese, G., Gupta, N., Handsaker, R. E., Hartl, C., Lander, E. S., McCarroll, S. A., Nemesh, J. C., Poplin, R. E., Schaffner, S. F., Shakir, K., Yoon, S. C., Lihm, J., Makarov, V., Jin, H., Kim, W., Kim, K. C., Korbel, J. O., Rausch, T., Flicek, P., Beal, K., Clarke, L., Cunningham, F., Herrero, J., McLaren, W. M., Ritchie, G. R., Smith, R. E., Zheng-Bradley, X., Clark, A. G., Gottipati, S., Keinan, A., Rodriguez-Flores, J. L., Sabeti, P. C., Grossman, S. R., Tabrizi, S., Tariyal, R., Cooper, D. N., Ball, E. V., Stenson, P. D., Bentley, D. R., Barnes, B., Bauer, M., Cheetham, R. K., Cox, T., Eberle, M., Humphray, S., Kahn, S., Murray, L., Peden, J., Shaw, R., Ye, K., Batzer, M. A., Konkel, M. K., Walker, J. A., MacArthur, D. G., Lek, M., Sudbrak, R., Amstislavskiy, V. S., Herwig, R., Shriver, M. D., Bustamante, C. D., Byrnes, J. K., De La Vega, F. M., Gravel, S., Kenny, E. E., Kidd, J. M., Lacroute, P., Maples, B. K., Moreno-Estrada, A., Zakharia, F., Halperin, E., Baran, Y., Craig, D. W., Christoforides, A., Homer, N., Izatt, T., Kurdoglu, A. A., Sinari, S. A., Squire, K., Sherry, S. T., Xiao, C., Sebat, J., Bafna, V., Ye, K., Burchard, E. G., Hernandez, R. D., Gignoux, C. R., Haussler, D., Katzman, S. J., Kent, W. J., Howie, B., Ruiz-Linares, A., Dermitzakis, E. T., Lappalainen, T., Devine, S. E., Liu, X., Maroo, A., Tallon, L. J., Rosenfeld, J. A., Michelson, L. P., Abecasis, G. R., Kang, H. M., Anderson, P., Angius, A., Bigham, A., Blackwell, T., Busonero, F., Cucca, F., Fuchsberger, C., Jones, C., Jun, G., Li, Y., Lyons, R., Maschio, A., Porcu, E., Reinier, F., Sanna, S., Schlessinger, D., Sidore, C., Tan, A., Trost, M. K., Awadalla, P., Hodgkinson, A., Lunter, G., McVean, G. A., Marchini, J. L., Myers, S., Churchhouse, C., Delaneau, O., Gupta-Hinch, A., Iqbal, Z., Mathieson, I., Rimmer, A., Xifara, D. K., Oleksyk, T. K., Fu, Y., Liu, X., Xiong, M., Jorde, L., Witherspoon, D., Xing, J., Eichler, E. E., Browning, B. L., Alkan, C., Hajirasouliha, I., Hormozdiari, F., Ko, A., Sudmant, P. H., Mardis, E. R., Chen, K., Chinwalla, A., Ding, L., Dooling, D., Koboldt, D. C., McLellan, M. D., Wallis, J. W., Wendl, M. C., Zhang, Q., Durbin, R. M., Hurles, M. E., Tyler-Smith, C., Albers, C. A., Ayub, Q., Balasubramaniam, S., Chen, Y., Coffey, A. J., Colonna, V., Danecek, P., Huang, N., Jostins, L., Keane, T. M., Li, H., McCarthy, S., Scally, A., Stalker, J., Walter, K., Xue, Y., Zhang, Y., Gerstein, M. B., Abyzov, A., Balasubramanian, S., Chen, J., Clarke, D., Fu, Y., Habegger, L., Harmanci, A. O., Jin, M., Khurana, E., Mu, X. J., Sisu, C., Li, Y., Luo, R., Zhu, H., Lee, C., Griffin, L., Hsieh, C., Mills, R. E., Shi, X., von Grotthuss, M., Zhang, C., Marth, G. T., Garrison, E. P., Kural, D., Lee, W., Ward, A. N., Wu, J., Zhang, M., McCarroll, S. A., Altshuler, D. M., Banks, E., del Angel, G., Genovese, G., Handsaker, R. E., Hartl, C., Nemesh, J. C., Shakir, K., Yoon, S. C., Lihm, J., Makarov, V., Degenhardt, J., Flicek, P., Clarke, L., Smith, R. E., Zheng-Bradley, X., Korbel, J. O., Rausch, T., Stuetz, A. M., Bentley, D. R., Barnes, B., Cheetham, R. K., Eberle, M., Humphray, S., Kahn, S., Murray, L., Shaw, R., Ye, K., Batzer, M. A., Konkel, M. K., Walker, J. A., Lacroute, P., Craig, D. W., Homer, N., Church, D., Xiao, C., Sebat, J., Bafna, V., Michaelson, J. J., Ye, K., Devine, S. E., Liu, X., Maroo, A., Tallon, L. J., Lunter, G., McVean, G. A., Iqbal, Z., Witherspoon, D., Xing, J., Eichler, E. E., Alkan, C., Hajirasouliha, I., Hormozdiari, F., Ko, A., Sudmant, P. H., Chen, K., Chinwalla, A., Ding, L., McLellan, M. D., Wallis, J. W., Hurles, M. E., Ben Blackburne, Li, H., Lindsay, S. J., Ning, Z., Scally, A., Walter, K., Zhang, Y., Gerstein, M. B., Abyzov, A., Chen, J., Clarke, D., Khurana, E., Mu, X. J., Sisu, C., Gibbs, R. A., Yu, F., Bainbridge, M., Challis, D., Evani, U. S., Kovar, C., Lewis, L., Lu, J., Muzny, D., Nagaswamy, U., Reid, J., Sabo, A., Yu, J., Guo, X., Li, Y., Wu, R., Marth, G. T., Garrison, E. P., Leong, W. F., Ward, A. N., del Angel, G., DePristo, M. A., Gabriel, S. B., Gupta, N., Hartl, C., Poplin, R. E., Clark, A. G., Rodriguez-Flores, J. L., Flicek, P., Clarke, L., Smith, R. E., Zheng-Bradley, X., MacArthur, D. G., Bustamante, C. D., Gravel, S., Craig, D. W., Christoforides, A., Homer, N., Izatt, T., Sherry, S. T., Xiao, C., Dermitzakis, E. T., Abecasis, G. R., Kang, H. M., McVean, G. A., Mardis, E. R., Dooling, D., Fulton, L., Fulton, R., Koboldt, D. C., Durbin, R. M., Balasubramaniam, S., Keane, T. M., McCarthy, S., Stalker, J., Gerstein, M. B., Balasubramanian, S., Habegger, L., Garrison, E. P., Gibbs, R. A., Bainbridge, M., Muzny, D., Yu, F., Yu, J., del Angel, G., Handsaker, R. E., Makarov, V., Rodriguez-Flores, J. L., Jin, H., Kim, W., Kim, K. C., Flicek, P., Beal, K., Clarke, L., Cunningham, F., Herrero, J., McLaren, W. M., Ritchie, G. R., Zheng-Bradley, X., Tabrizi, S., MacArthur, D. G., Lek, M., Bustamante, C. D., De La Vega, F. M., Craig, D. W., Kurdoglu, A. A., Lappalainen, T., Rosenfeld, J. A., Michelson, L. P., Awadalla, P., Hodgkinson, A., McVean, G. A., Chen, K., Tyler-Smith, C., Chen, Y., Colonna, V., Frankish, A., Harrow, J., Xue, Y., Gerstein, M. B., Abyzov, A., Balasubramanian, S., Chen, J., Clarke, D., Fu, Y., Harmanci, A. O., Jin, M., Khurana, E., Mu, X. J., Sisu, C., Gibbs, R. A., Fowler, G., Hale, W., Kalra, D., Kovar, C., Muzny, D., Reid, J., Wang, J., Guo, X., Li, G., Li, Y., Zheng, X., Altshuler, D. M., Flicek, P., Clarke, L., Barker, J., Kelman, G., Kulesha, E., Leinonen, R., McLaren, W. M., Radhakrishnan, R., Roa, A., Smirnov, D., Smith, R. E., Streeter, I., Toneva, I., Vaughan, B., Zheng-Bradley, X., Bentley, D. R., Cox, T., Humphray, S., Kahn, S., Sudbrak, R., Albrecht, M. W., Lienhard, M., Craig, D. W., Izatt, T., Kurdoglu, A. A., Sherry, S. T., Ananiev, V., Belaia, Z., Beloslyudtsev, D., Bouk, N., Chen, C., Church, D., Cohen, R., Cook, C., Garner, J., Hefferon, T., Kimelman, M., Liu, C., Lopez, J., Meric, P., O'Sullivan, C., Ostapchuk, Y., Phan, L., Ponomarov, S., Schneider, V., Shekhtman, E., Sirotkin, K., Slotta, D., Xiao, C., Zhang, H., Haussler, D., Abecasis, G. R., McVean, G. A., Alkan, C., Ko, A., Dooling, D., Durbin, R. M., Balasubramaniam, S., Keane, T. M., McCarthy, S., Stalker, J., Chakravarti, A., Knoppers, B. M., Abecasis, G. R., Barnes, K. C., Beiswanger, C., Burchard, E. G., Bustamante, C. D., Cai, H., Cao, H., Durbin, R. M., Gharani, N., Gibbs, R. A., Gignoux, C. R., Gravel, S., Henn, B., Jones, D., Jorde, L., Kaye, J. S., Keinan, A., Kent, A., Kerasidou, A., Li, Y., Mathias, R., McVean, G. A., Moreno-Estrada, A., Ossorio, P. N., Parker, M., Reich, D., Rotimi, C. N., Royal, C. D., Sandoval, K., Su, Y., Sudbrak, R., Tian, Z., Timmermann, B., Tishkoff, S., Toji, L. H., Tyler-Smith, C., Via, M., Wang, Y., Yang, H., Yang, L., Zhu, J., Bodmer, W., Bedoya, G., Ruiz-Linares, A., Ming, C. Z., Yang, G., You, C. J., Peltonen, L., Garcia-Montero, A., Orfao, A., Dutil, J., Martinez-Cruzado, J. C., Oleksyk, T. K., Brooks, L. D., Felsenfeld, A. L., McEwen, J. E., Clemm, N. C., Duncanson, A., Dunn, M., Green, E. D., Guyer, M. S., Peterson, J. L., Abecasis, G. R., Auton, A., Brooks, L. D., DePristo, M. A., Durbin, R. M., Handsaker, R. E., Kang, H. M., Marth, G. T., McVean, G. A. 2012; 491 (7422): 56-65

    Abstract

    By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38?million single nucleotide polymorphisms, 1.4?million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

    View details for DOI 10.1038/nature11632

    View details for Web of Science ID 000310434500030

    View details for PubMedID 23128226

  • Mapping copy number variation by population-scale genome sequencing NATURE Mills, R. E., Walter, K., Stewart, C., Handsaker, R. E., Chen, K., Alkan, C., Abyzov, A., Yoon, S. C., Ye, K., Cheetham, R. K., Chinwalla, A., Conrad, D. F., Fu, Y., Grubert, F., Hajirasouliha, I., Hormozdiari, F., Iakoucheva, L. M., Iqbal, Z., Kang, S., Kidd, J. M., Konkel, M. K., Korn, J., Khurana, E., Kural, D., Lam, H. Y., Leng, J., Li, R., Li, Y., Lin, C., Luo, R., Mu, X. J., Nemesh, J., Peckham, H. E., Rausch, T., Scally, A., Shi, X., Stromberg, M. P., Stuetz, A. M., Urban, A. E., Walker, J. A., Wu, J., Zhang, Y., Zhang, Z. D., Batzer, M. A., Ding, L., Marth, G. T., McVean, G., Sebat, J., Snyder, M., Wang, J., Ye, K., Eichler, E. E., Gerstein, M. B., Hurles, M. E., Lee, C., McCarroll, S. A., Korbel, J. O. 2011; 470 (7332): 59-65

    Abstract

    Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serves as a resource for sequencing-based association studies.

    View details for DOI 10.1038/nature09708

    View details for Web of Science ID 000286886400033

    View details for PubMedID 21293372

  • Simultaneous structural variation discovery among multiple paired-end sequenced genomes. Genome research Hormozdiari, F., Hajirasouliha, I., McPherson, A., Eichler, E. E., Sahinalp, S. C. 2011; 21 (12): 2203-12

    Abstract

    With the increasing popularity of whole-genome shotgun sequencing (WGSS) via high-throughput sequencing technologies, it is becoming highly desirable to perform comparative studies involving multiple individuals (from a specific population, race, or a group sharing a particular phenotype). The conventional approach for a comparative genome variation study involves two key steps: (1) each paired-end high-throughput sequenced genome is compared with a reference genome and its (structural) differences are identified; (2) the lists of structural variants in each genome are compared against each other. In this study we propose to move away from this two-step approach to a novel one in which all genomes are compared with the reference genome simultaneously for obtaining much higher accuracy in structural variation detection. For this purpose, we introduce the maximum parsimony-based simultaneous structural variation discovery problem for a set of high-throughput sequenced genomes and provide efficient algorithms to solve it. We compare the proposed framework with the conventional framework, on the genomes of the Yoruban mother-father-child trio, as well as the CEU trio of European ancestry (both sequenced by Illumina platforms). We observed that the conventional framework predicts an unexpectedly high number of de novo variations in the child in comparison to the parents and misses some of the known variations. Our proposed framework, on the other hand, not only significantly reduces the number of incorrectly predicted de novo variations but also predicts more of the known (true) variations.

    View details for DOI 10.1101/gr.120501.111

    View details for PubMedID 22048523

  • A map of human genome variation from population-scale sequencing NATURE Altshuler, D., Durbin, R. M., Abecasis, G. R., Bentley, D. R., Chakravarti, A., Clark, A. G., Collins, F. S., De La Vega, F. M., Donnelly, P., Egholm, M., Flicek, P., Gabriel, S. B., Gibbs, R. A., Knoppers, B. M., Lander, E. S., Lehrach, H., Mardis, E. R., McVean, G. A., Nickerson, D., Peltonen, L., Schafer, A. J., Sherry, S. T., Wang, J., Wilson, R. K., Gibbs, R. A., Deiros, D., Metzker, M., Muzny, D., Reid, J., Wheeler, D., Wang, J., Li, J., Jian, M., Li, G., Li, R., Liang, H., Tian, G., Wang, B., Wang, J., Wang, W., Yang, H., Zhang, X., Zheng, H., Lander, E. S., Altshuler, D. L., Ambrogio, L., Bloom, T., Cibulskis, K., Fennell, T. J., Gabriel, S. B., Jaffe, D. B., Shefler, E., Sougnez, C. L., Bentley, D. R., Gormley, N., Humphray, S., Kingsbury, Z., Koko-Gonzales, P., Stone, J., McKernan, K. J., Costa, G. L., Ichikawa, J. K., Lee, C. C., Sudbrak, R., Lehrach, H., Borodina, T. A., Dahl, A., Davydov, A. N., Marquardt, P., Mertes, F., Nietfeld, W., Rosenstiel, P., Schreiber, S., Soldatov, A. V., Timmermann, B., Tolzmann, M., Egholm, M., Affourtit, J., Ashworth, D., Attiya, S., Bachorski, M., Buglione, E., Burke, A., Caprio, A., Celone, C., Clark, S., Conners, D., Desany, B., Gu, L., Guccione, L., Kao, K., Kebbel, A., Knowlton, J., Labrecque, M., McDade, L., Mealmaker, C., Minderman, M., Nawrocki, A., Niazi, F., Pareja, K., Ramenani, R., Riches, D., Song, W., Turcotte, C., Wang, S., Mardis, E. R., Dooling, D., Fulton, L., Fulton, R., Weinstock, G., Durbin, R. M., Burton, J., Carter, D. M., Churcher, C., Coffey, A., Cox, A., Palotie, A., Quail, M., Skelly, T., Stalker, J., Swerdlow, H. P., Turner, D., De Witte, A., Giles, S., Gibbs, R. A., Wheeler, D., Bainbridge, M., Challis, D., Sabo, A., Yu, F., Yu, J., Wang, J., Fang, X., Guo, X., Li, R., Li, Y., Luo, R., Tai, S., Wu, H., Zheng, H., Zheng, X., Zhou, Y., Yang, H., Marth, G. T., Garrison, E. P., Huang, W., Indap, A., Kural, D., Lee, W., Leong, W. F., Huang, W., Indap, A., Kural, D., Lee, W., Leong, W. F., Quinlan, A. R., Stewart, C., Stromberg, M. P., Ward, A. N., Wu, J., Lee, C., Mills, R. E., Shi, X., Daly, M. J., DePristo, M. A., Altshuler, D. L., Ball, A. D., Banks, E., Bloom, T., Browning, B. L., Cibulskis, K., Fennell, T. J., Garimella, K. V., Grossman, S. R., Handsaker, R. E., Hanna, M., Hartl, C., Jaffe, D. B., Kernytsky, A. M., Korn, J. M., Li, H., Maguire, J. R., McCarroll, S. A., McKenna, A., Nemesh, J. C., Philippakis, A. A., Poplin, R. E., Price, A., Rivas, M. A., Sabeti, P. C., Schaffner, S. F., Shefler, E., Shlyakhter, I. A., Cooper, D. N., Ball, E. V., Mort, M., Phillips, A. D., Stenson, P. D., Sebat, J., Makarov, V., Ye, K., Yoon, S. C., Bustamante, C. D., Clark, A. G., Boyko, A., Degenhardt, J., Gravel, S., Gutenkunst, R. N., Kaganovich, M., Keinan, A., Lacroute, P., Ma, X., Reynolds, A., Clarke, L., Flicek, P., Cunningham, F., Herrero, J., Keenen, S., Kulesha, E., Leinonen, R., McLaren, W., Radhakrishnan, R., Smith, R. E., Zalunin, V., Zheng-Bradley, X., Korbel, J. O., Stuetz, A. M., Humphray, S., Bauer, M., Cheetham, R. K., Cox, T., Eberle, M., James, T., Kahn, S., Murray, L., Ye, K., De La Vega, F. M., Fu, Y., Hyland, F. C., Manning, J. M., McLaughlin, S. F., Peckham, H. E., Sakarya, O., Sun, Y. A., Tsung, E. F., Batzer, M. A., Konkel, M. K., Walker, J. A., Sudbrak, R., Albrecht, M. W., Amstislavskiy, V. S., Herwig, R., Parkhomchuk, D. V., Sherry, S. T., Agarwala, R., Khouri, H., Morgulis, A. O., Paschall, J. E., Phan, L. D., Rotmistrovsky, K. E., Sanders, R. D., Shumway, M. F., Xiao, C., McVean, G. A., Auton, A., Iqbal, Z., Lunter, G., Marchini, J. L., Moutsianas, L., Myers, S., Tumian, A., Desany, B., Knight, J., Winer, R., Craig, D. W., Beckstrom-Sternberg, S. M., Christoforides, A., Kurdoglu, A. A., Pearson, J., Sinari, S. A., Tembe, W. D., Haussler, D., Hinrichs, A. S., Katzman, S. J., Kern, A., Kuhn, R. M., Przeworski, M., Hernandez, R. D., Howie, B., Kelley, J. L., Melton, S. C., Abecasis, G. R., Li, Y., Anderson, P., Blackwell, T., Chen, W., Cookson, W. O., Ding, J., Kang, H. M., Lathrop, M., Liang, L., Moffatt, M. F., Scheet, P., Sidore, C., Snyder, M., Zhan, X., Zoellner, S., Awadalla, P., Casals, F., Idaghdour, Y., Keebler, J., Stone, E. A., Zilversmit, M., Jorde, L., Xing, J., Eichler, E. E., Aksay, G., Alkan, C., Hajirasouliha, I., Hormozdiari, F., Kidd, J. M., Sahinalp, S. C., Sudmant, P. H., Mardis, E. R., Chen, K., Chinwalla, A., Ding, L., Koboldt, D. C., McLellan, M. D., Dooling, D., Weinstock, G., Wallis, J. W., Wendl, M. C., Zhang, Q., Durbin, R. M., Albers, C. A., Ayub, Q., Balasubramaniam, S., Barrett, J. C., Carter, D. M., Chen, Y., Conrad, D. F., Danecek, P., Dermitzakis, E. T., Hu, M., Huang, N., Hurles, M. E., Jin, H., Jostins, L., Keane, T. M., Keane, T. M., Le, S. Q., Lindsay, S., Long, Q., MacArthur, D. G., Montgomery, S. B., Parts, L., Stalker, J., Tyler-Smith, C., Walter, K., Zhang, Y., Gerstein, M. B., Snyder, M., Abyzov, A., Abyzov, A., Balasubramanian, S., Bjornson, R., Du, J., Grubert, F., Habegger, L., Haraksingh, R., Jee, J., Khurana, E., Lam, H. Y., Leng, J., Mu, X. J., Urban, A. E., Zhang, Z., Li, Y., Luo, R., Marth, G. T., Garrison, E. P., Kural, D., Quinlan, A. R., Stewart, C., Stromberg, M. P., Ward, A. N., Wu, J., Lee, C., Mills, R. E., Shi, X., McCarroll, S. A., Banks, E., DePristo, M. A., Handsaker, R. E., Hartl, C., Korn, J. M., Li, H., Nemesh, J. C., Sebat, J., Makarov, V., Ye, K., Yoon, S. C., Degenhardt, J., Kaganovich, M., Clarke, L., Smith, R. E., Zheng-Bradley, X., Korbel, J. O., Humphray, S., Cheetham, R. K., Eberle, M., Kahn, S., Murray, L., Ye, K., De La Vega, F. M., Fu, Y., Peckham, H. E., Sun, Y. A., Batzer, M. A., Konkel, M. K., Xiao, C., Iqbal, Z., Desany, B., Blackwell, T., Snyder, M., Xing, J., Eichler, E. E., Aksay, G., Alkan, C., Hajirasouliha, I., Hormozdiari, F., Kidd, J. M., Chen, K., Chinwalla, A., Ding, L., McLellan, M. D., Wallis, J. W., Hurles, M. E., Conrad, D. F., Walter, K., Zhang, Y., Gerstein, M. B., Snyder, M., Abyzov, A., Du, J., Grubert, F., Haraksingh, R., Jee, J., Khurana, E., Lam, H. Y., Leng, J., Mu, X. J., Urban, A. E., Zhang, Z., Gibbs, R. A., Bainbridge, M., Challis, D., Coafra, C., Dinh, H., Kovar, C., Lee, S., Muzny, D., Nazareth, L., Reid, J., Sabo, A., Yu, F., Yu, J., Marth, G. T., Garrison, E. P., Indap, A., Leong, W. F., Quinlan, A. R., Stewart, C., Ward, A. N., Wu, J., Cibulskis, K., Fennell, T. J., Gabriel, S. B., Garimella, K. V., Hartl, C., Shefler, E., Sougnez, C. L., Wilkinson, J., Clark, A. G., Gravel, S., Grubert, F., Clarke, L., Flicek, P., Smith, R. E., Zheng-Bradley, X., Sherry, S. T., Khouri, H. M., Paschall, J. E., Shumway, M. F., Xiao, C., McVean, G. A., Katzman, S. J., Abecasis, G. R., Blackwell, T., Mardis, E. R., Dooling, D., Fulton, L., Fulton, R., Koboldt, D. C., Durbin, R. M., Balasubramaniam, S., Coffey, A., Keane, T. M., MacArthur, D. G., Palotie, A., Scott, C., Stalker, J., Tyler-Smith, C., Gerstein, M. B., Balasubramanian, S., Chakravarti, A., Knoppers, B. M., Peltonen, L., Abecasis, G. R., Bustamante, C. D., Gharani, N., Gibbs, R. A., Jorde, L., Kaye, J. S., Kent, A., Li, T., McGuire, A. L., McVean, G. A., Ossorio, P. N., Rotimi, C. N., Su, Y., Toji, L. H., Tyler-Smith, C., Brooks, L. D., Felsenfeld, A. L., McEwen, J. E., Abdallah, A., Juenger, C. R., Clemm, N. C., Collins, F. S., Duncanson, A., Green, E. D., Guyer, M. S., Peterson, J. L., Schafer, A. J., Abecasis, G. R., Altshuler, D. L., Auton, A., Brooks, L. D., Durbin, R. M., Gibbs, R. A., Hurles, M. E., McVean, G. A. 2010; 467 (7319): 1061-1073

    Abstract

    The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10(-8) per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

    View details for DOI 10.1038/nature09534

    View details for Web of Science ID 000283548600039

    View details for PubMedID 20981092

  • Detection and characterization of novel sequence insertions using paired-end next-generation sequencing BIOINFORMATICS Hajirasouliha, I., Hormozdiari, F., Alkan, C., Kidd, J. M., Birol, I., Eichler, E. E., Sahinalp, S. C. 2010; 26 (10): 1277-1283

    Abstract

    In the past few years, human genome structural variation discovery has enjoyed increased attention from the genomics research community. Many studies were published to characterize short insertions, deletions, duplications and inversions, and associate copy number variants (CNVs) with disease. Detection of new sequence insertions requires sequence data, however, the 'detectable' sequence length with read-pair analysis is limited by the insert size. Thus, longer sequence insertions that contribute to our genetic makeup are not extensively researched.We present NovelSeq: a computational framework to discover the content and location of long novel sequence insertions using paired-end sequencing data generated by the next-generation sequencing platforms. Our framework can be built as part of a general sequence analysis pipeline to discover multiple types of genetic variation (SNPs, structural variation, etc.), thus it requires significantly less-computational resources than de novo sequence assembly. We apply our methods to detect novel sequence insertions in the genome of an anonymous donor and validate our results by comparing with the insertions discovered in the same genome using various sources of sequence data.The implementation of the NovelSeq pipeline is available at http://compbio.cs.sfu.ca/strvar.htmeee@gs.washington.edu; cenk@cs.sfu.ca

    View details for DOI 10.1093/bioinformatics/btq152

    View details for Web of Science ID 000277447500001

    View details for PubMedID 20385726

  • Fast and scalable inference of multi-sample cancer lineages. Genome biology Popic, V., Salari, R., Hajirasouliha, I., Kashef-Haghighi, D., West, R. B., Batzoglou, S. 2015; 16 (1): 91-?

    Abstract

    Somatic variants can be used as lineage markers for the phylogenetic reconstruction of cancer evolution. Since somatic phylogenetics is complicated by sample heterogeneity, novel specialized tree-building methods are required for cancer phylogeny reconstruction. We present LICHeE (Lineage Inference for Cancer Heterogeneity and Evolution), a novel method that automates the phylogenetic inference of cancer progression from multiple somatic samples. LICHeE uses variant allele frequencies of somatic single nucleotide variants obtained by deep sequencing to reconstruct multi-sample cell lineage trees and infer the subclonal composition of the samples. LICHeE is open source and available at http://viq854.github.io/lichee .

    View details for DOI 10.1186/s13059-015-0647-8

    View details for PubMedID 25944252

  • Fast and scalable inference of multi-sample cancer lineages Genome Biology Popic, V., et al 2015; 16 (1), 91
  • Characterization of structural variants with single molecule and hybrid sequencing approaches BIOINFORMATICS Ritz, A., Bashir, A., Sindi, S., Hsu, D., Hajirasouliha, I., Raphael, B. J. 2014; 30 (24): 3458-3466

    Abstract

    Structural variation is common in human and cancer genomes. High-throughput DNA sequencing has enabled genome-scale surveys of structural variation. However, the short reads produced by these technologies limit the study of complex variants, particularly those involving repetitive regions. Recent 'third-generation' sequencing technologies provide single-molecule templates and longer sequencing reads, but at the cost of higher per-nucleotide error rates.We present MultiBreak-SV, an algorithm to detect structural variants (SVs) from single molecule sequencing data, paired read sequencing data, or a combination of sequencing data from different platforms. We demonstrate that combining low-coverage third-generation data from Pacific Biosciences (PacBio) with high-coverage paired read data is advantageous on simulated chromosomes. We apply MultiBreak-SV to PacBio data from four human fosmids and show that it detects known SVs with high sensitivity and specificity. Finally, we perform a whole-genome analysis on PacBio data from a complete hydatidiform mole cell line and predict 1002 high-probability SVs, over half of which are confirmed by an Illumina-based assembly.

    View details for DOI 10.1093/bioinformatics/btu714

    View details for Web of Science ID 000346051000003

    View details for PubMedID 25355789

  • Detecting independent and recurrent copy number aberrations using interval graphs. Bioinformatics (Oxford, England) Wu, H. T., Hajirasouliha, I., Raphael, B. J. 2014; 30 (12): i195-i203

    Abstract

    Somatic copy number aberrations SCNAS: are frequent in cancer genomes, but many of these are random, passenger events. A common strategy to distinguish functional aberrations from passengers is to identify those aberrations that are recurrent across multiple samples. However, the extensive variability in the length and position of SCNA: s makes the problem of identifying recurrent aberrations notoriously difficult.We introduce a combinatorial approach to the problem of identifying independent and recurrent SCNA: s, focusing on the key challenging of separating the overlaps in aberrations across individuals into independent events. We derive independent and recurrent SCNA: s as maximal cliques in an interval graph constructed from overlaps between aberrations. We efficiently enumerate all such cliques, and derive a dynamic programming algorithm to find an optimal selection of non-overlapping cliques, resulting in a very fast algorithm, which we call RAIG (Recurrent Aberrations from Interval Graphs). We show that RAIG outperforms other methods on simulated data and also performs well on data from three cancer types from The Cancer Genome Atlas (TCGA). In contrast to existing approaches that employ various heuristics to select independent aberrations, RAIG optimizes a well-defined objective function. We show that this allows RAIG to identify rare aberrations that are likely functional, but are obscured by overlaps with larger passenger aberrations. Availability: http://compbio.cs.brown.edu/software.braphael@brown.edu Supplementary information: Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btu276

    View details for PubMedID 24931984

  • Reconstructing Mutational History in Multiply Sampled Tumors Using Perfect Phylogeny Mixtures ALGORITHMS IN BIOINFORMATICS Hajirasouliha, I., Raphael, B. J. 2014; 8701: 354-367
  • MATE-CLEVER: Mendelian-inheritance-aware discovery and genotyping of midsize and long indels. Bioinformatics (Oxford, England) Marschall, T., Hajirasouliha, I., Schönhuth, A. 2013; 29 (24): 3143-50

    Abstract

    Accurately predicting and genotyping indels longer than 30 bp has remained a central challenge in next-generation sequencing (NGS) studies. While indels of up to 30 bp are reliably processed by standard read aligners and the Genome Analysis Toolkit (GATK), longer indels have still resisted proper treatment. Also, discovering and genotyping longer indels has become particularly relevant owing to the increasing attention in globally concerted projects.We present MATE-CLEVER (Mendelian-inheritance-AtTEntive CLique-Enumerating Variant findER) as an approach that accurately discovers and genotypes indels longer than 30 bp from contemporary NGS reads with a special focus on family data. For enhanced quality of indel calls in family trios or quartets, MATE-CLEVER integrates statistics that reflect the laws of Mendelian inheritance. MATE-CLEVER's performance rates for indels longer than 30 bp are on a par with those of the GATK for indels shorter than 30 bp, achieving up to 90% precision overall, with >80% of calls correctly typed. In predicting de novo indels longer than 30 bp in family contexts, MATE-CLEVER even raises the standards of the GATK. MATE-CLEVER achieves precision and recall of ∼63% on indels of 30 bp and longer versus 55% in both categories for the GATK on indels of 10-29 bp. A special version of MATE-CLEVER has contributed to indel discovery, in particular for indels of 30-100 bp, the 'NGS twilight zone of indels', in the Genome of the Netherlands Project. http://clever-sv.googlecode.com/

    View details for DOI 10.1093/bioinformatics/btt556

    View details for PubMedID 24072733

  • Mirroring co-evolving trees in the light of their topologies. Bioinformatics (Oxford, England) Hajirasouliha, I., Schönhuth, A., de Juan, D., Valencia, A., Sahinalp, S. C. 2012; 28 (9): 1202-8

    Abstract

    Determining the interaction partners among protein/domain families poses hard computational problems, in particular in the presence of paralogous proteins. Available approaches aim to identify interaction partners among protein/domain families through maximizing the similarity between trimmed versions of their phylogenetic trees. Since maximization of any natural similarity score is computationally difficult, many approaches employ heuristics to evaluate the distance matrices corresponding to the tree topologies in question. In this article, we devise an efficient deterministic algorithm which directly maximizes the similarity between two leaf labeled trees with edge lengths, obtaining a score-optimal alignment of the two trees in question.Our algorithm is significantly faster than those methods based on distance matrix comparison: 1 min on a single processor versus 730 h on a supercomputer. Furthermore, we outperform the current state-of-the-art exhaustive search approach in terms of precision, while incurring acceptable losses in recall.A C implementation of the method demonstrated in this article is available at http://compbio.cs.sfu.ca/mirrort.htm

    View details for DOI 10.1093/bioinformatics/bts109

    View details for PubMedID 22399677

  • From sequence to molecular pathology, and a mechanism driving the neuroendocrine phenotype in prostate cancer. The Journal of pathology Lapuk, A. V., Wu, C., Wyatt, A. W., McPherson, A., McConeghy, B. J., Brahmbhatt, S., Mo, F., Zoubeidi, A., Anderson, S., Bell, R. H., Haegert, A., Shukin, R., Wang, Y., Fazli, L., Hurtado-Coll, A., Jones, E. C., Hach, F., Hormozdiari, F., Hajirasouliha, I., Boutros, P. C., Bristow, R. G., Zhao, Y., Marra, M. A., Fanjul, A., Maher, C. A., Chinnaiyan, A. M., Rubin, M. A., Beltran, H., Sahinalp, S. C., Gleave, M. E., Volik, S. V., Collins, C. C. 2012; 227 (3): 286-97

    Abstract

    The current paradigm of cancer care relies on predictive nomograms which integrate detailed histopathology with clinical data. However, when predictions fail, the consequences for patients are often catastrophic, especially in prostate cancer where nomograms influence the decision to therapeutically intervene. We hypothesized that the high dimensional data afforded by massively parallel sequencing (MPS) is not only capable of providing biological insights, but may aid molecular pathology of prostate tumours. We assembled a cohort of six patients with high-risk disease, and performed deep RNA and shallow DNA sequencing in primary tumours and matched metastases where available. Our analysis identified copy number abnormalities, accurately profiled gene expression levels, and detected both differential splicing and expressed fusion genes. We revealed occult and potentially dormant metastases, unambiguously supporting the patients' clinical history, and implicated the REST transcriptional complex in the development of neuroendocrine prostate cancer, validating this finding in a large independent cohort. We massively expand on the number of novel fusion genes described in prostate cancer; provide fresh evidence for the growing link between fusion gene aetiology and gene expression profiles; and show the utility of fusion genes for molecular pathology. Finally, we identified chromothripsis in a patient with chronic prostatitis. Our results provide a strong foundation for further development of MPS-based molecular pathology.

    View details for DOI 10.1002/path.4047

    View details for PubMedID 22553170

  • Integrated genome and transcriptome sequencing identifies a novel form of hybrid and aggressive prostate cancer. The Journal of pathology Wu, C., Wyatt, A. W., Lapuk, A. V., McPherson, A., McConeghy, B. J., Bell, R. H., Anderson, S., Haegert, A., Brahmbhatt, S., Shukin, R., Mo, F., Li, E., Fazli, L., Hurtado-Coll, A., Jones, E. C., Butterfield, Y. S., Hach, F., Hormozdiari, F., Hajirasouliha, I., Boutros, P. C., Bristow, R. G., Jones, S. J., Hirst, M., Marra, M. A., Maher, C. A., Chinnaiyan, A. M., Sahinalp, S. C., Gleave, M. E., Volik, S. V., Collins, C. C. 2012; 227 (1): 53-61

    Abstract

    Next-generation sequencing is making sequence-based molecular pathology and personalized oncology viable. We selected an individual initially diagnosed with conventional but aggressive prostate adenocarcinoma and sequenced the genome and transcriptome from primary and metastatic tissues collected prior to hormone therapy. The histology-pathology and copy number profiles were remarkably homogeneous, yet it was possible to propose the quadrant of the prostate tumour that likely seeded the metastatic diaspora. Despite a homogeneous cell type, our transcriptome analysis revealed signatures of both luminal and neuroendocrine cell types. Remarkably, the repertoire of expressed but apparently private gene fusions, including C15orf21:MYC, recapitulated this biology. We hypothesize that the amplification and over-expression of the stem cell gene MSI2 may have contributed to the stable hybrid cellular identity. This hybrid luminal-neuroendocrine tumour appears to represent a novel and highly aggressive case of prostate cancer with unique biological features and, conceivably, a propensity for rapid progression to castrate-resistance. Overall, this work highlights the importance of integrated analyses of genome, exome and transcriptome sequences for basic tumour biology, sequence-based molecular pathology and personalized oncology.

    View details for DOI 10.1002/path.3987

    View details for PubMedID 22294438

  • Alu repeat discovery and characterization within human genomes. Genome research Hormozdiari, F., Alkan, C., Ventura, M., Hajirasouliha, I., Malig, M., Hach, F., Yorukoglu, D., Dao, P., Bakhshi, M., Sahinalp, S. C., Eichler, E. E. 2011; 21 (6): 840-9

    Abstract

    Human genomes are now being rapidly sequenced, but not all forms of genetic variation are routinely characterized. In this study, we focus on Alu retrotransposition events and seek to characterize differences in the pattern of mobile insertion between individuals based on the analysis of eight human genomes sequenced using next-generation sequencing. Applying a rapid read-pair analysis algorithm, we discover 4342 Alu insertions not found in the human reference genome and show that 98% of a selected subset (63/64) experimentally validate. Of these new insertions, 89% correspond to AluY elements, suggesting that they arose by retrotransposition. Eighty percent of the Alu insertions have not been previously reported and more novel events were detected in Africans when compared with non-African samples (76% vs. 69%). Using these data, we develop an experimental and computational screen to identify ancestry informative Alu retrotransposition events among different human populations.

    View details for DOI 10.1101/gr.115956.110

    View details for PubMedID 21131385

  • Comrad: detection of expressed rearrangements by integrated analysis of RNA-Seq and low coverage genome sequence data. Bioinformatics (Oxford, England) McPherson, A., Wu, C., Hajirasouliha, I., Hormozdiari, F., Hach, F., Lapuk, A., Volik, S., Shah, S., Collins, C., Sahinalp, S. C. 2011; 27 (11): 1481-8

    Abstract

    Comrad is a novel algorithmic framework for the integrated analysis of RNA-Seq and whole genome shotgun sequencing (WGSS) data for the purposes of discovering genomic rearrangements and aberrant transcripts. The Comrad framework leverages the advantages of both RNA-Seq and WGSS data, providing accurate classification of rearrangements as expressed or not expressed and accurate classification of the genomic or non-genomic origin of aberrant transcripts. A major benefit of Comrad is its ability to accurately identify aberrant transcripts and associated rearrangements using low coverage genome data. As a result, a Comrad analysis can be performed at a cost comparable to that of two RNA-Seq experiments, significantly lower than an analysis requiring high coverage genome data.We have applied Comrad to the discovery of gene fusions and read-throughs in prostate cancer cell line C4-2, a derivative of the LNCaP cell line with androgen-independent characteristics. As a proof of concept, we have rediscovered in the C4-2 data 4 of the 6 fusions previously identified in LNCaP. We also identified six novel fusion transcripts and associated genomic breakpoints, and verified their existence in LNCaP, suggesting that Comrad may be more sensitive than previous methods that have been applied to fusion discovery in LNCaP. We show that many of the gene fusions discovered using Comrad would be difficult to identify using currently available techniques.A C++ and Perl implementation of the method demonstrated in this article is available at http://compbio.cs.sfu.ca/.

    View details for DOI 10.1093/bioinformatics/btr184

    View details for PubMedID 21478487

  • Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery. Bioinformatics (Oxford, England) Hormozdiari, F., Hajirasouliha, I., Dao, P., Hach, F., Yorukoglu, D., Alkan, C., Eichler, E. E., Sahinalp, S. C. 2010; 26 (12): i350-7

    Abstract

    Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association to human disease. The advent of next-generation sequencing technologies make it possible to extend the scope of structural variation studies to a point previously unimaginable as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a very important class of SVs to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both loci and classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we also present 'conflict resolution' improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted as VariationHunter-CR) outperforms current state of the art (such as original VariationHunter, BreakDancer and MoDIL) algorithms when tested on the genome of the Yoruba African individual (NA18507).The implementation of algorithm is available at http://compbio.cs.sfu.ca/strvar.htm.Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btq216

    View details for PubMedID 20529927

  • Quantifying Systemic Evolutionary Changes by Color Coding Confidence-Scored PPI Networks ALGORITHMS IN BIOINFORMATICS, PROCEEDINGS Dao, P., Schoenhuth, A., Hormozdiari, F., Hajirasouliha, I., Sahinalp, S. C., Ester, M. 2009; 5724: 37-48
  • smyRNA: a novel Ab initio ncRNA gene finder. PloS one Salari, R., Aksay, C., Karakoc, E., Unrau, P. J., Hajirasouliha, I., Sahinalp, S. C. 2009; 4 (5): e5433

    Abstract

    Non-coding RNAs (ncRNAs) have important functional roles in the cell: for example, they regulate gene expression by means of establishing stable joint structures with target mRNAs via complementary sequence motifs. Sequence motifs are also important determinants of the structure of ncRNAs. Although ncRNAs are abundant, discovering novel ncRNAs on genome sequences has proven to be a hard task; in particular past attempts for ab initio ncRNA search mostly failed with the exception of tools that can identify micro RNAs.We present a very general ab initio ncRNA gene finder that exploits differential distributions of sequence motifs between ncRNAs and background genome sequences.Our method, once trained on a set of ncRNAs from a given species, can be applied to a genome sequences of other organisms to find not only ncRNAs homologous to those in the training set but also others that potentially belong to novel (and perhaps unknown) ncRNA families.(http://compbio.cs.sfu.ca/taverna/smyrna).

    View details for DOI 10.1371/journal.pone.0005433

    View details for PubMedID 19415115

  • Biomolecular network motif counting and discovery by color coding. Bioinformatics (Oxford, England) Alon, N., Dao, P., Hajirasouliha, I., Hormozdiari, F., Sahinalp, S. C. 2008; 24 (13): i241-9

    Abstract

    Protein-protein interaction (PPI) networks of many organisms share global topological features such as degree distribution, k-hop reachability, betweenness and closeness. Yet, some of these networks can differ significantly from the others in terms of local structures: e.g. the number of specific network motifs can vary significantly among PPI networks. Counting the number of network motifs provides a major challenge to compare biomolecular networks. Recently developed algorithms have been able to count the number of induced occurrences of subgraphs with k < or = 7 vertices. Yet no practical algorithm exists for counting non-induced occurrences, or counting subgraphs with k > or = 8 vertices. Counting non-induced occurrences of network motifs is not only challenging but also quite desirable as available PPI networks include several false interactions and miss many others. In this article, we show how to apply the 'color coding' technique for counting non-induced occurrences of subgraph topologies in the form of trees and bounded treewidth subgraphs. Our algorithm can count all occurrences of motif G' with k vertices in a network G with n vertices in time polynomial with n, provided k = O(log n). We use our algorithm to obtain 'treelet' distributions for k < or = 10 of available PPI networks of unicellular organisms (Saccharomyces cerevisiae Escherichia coli and Helicobacter Pyloris), which are all quite similar, and a multicellular organism (Caenorhabditis elegans) which is significantly different. Furthermore, the treelet distribution of the unicellular organisms are similar to that obtained by the 'duplication model' but are quite different from that of the 'preferential attachment model'. The treelet distribution is robust w.r.t. sparsification with bait/edge coverage of 70% but differences can be observed when bait/edge coverage drops to 50%.

    View details for DOI 10.1093/bioinformatics/btn163

    View details for PubMedID 18586721

  • Optimal pooling for genome re-sequencing with ultra-high-throughput short-read technologies. Bioinformatics (Oxford, England) Hajirasouliha, I., Hormozdiari, F., Sahinalp, S. C., Birol, I. 2008; 24 (13): i32-40

    Abstract

    New generation sequencing technologies offer unique opportunities and challenges for re-sequencing studies. In this article, we focus on re-sequencing experiments using the Solexa technology, based on bacterial artificial chromosome (BAC) clones, and address an experimental design problem. In these specific experiments, approximate coordinates of the BACs on a reference genome are known, and fine-scale differences between the BAC sequences and the reference are of interest. The high-throughput characteristics of the sequencing technology makes it possible to multiplex BAC sequencing experiments by pooling BACs for a cost-effective operation. However, the way BACs are pooled in such re-sequencing experiments has an effect on the downstream analysis of the generated data, mostly due to subsequences common to multiple BACs. The experimental design strategy we develop in this article offers combinatorial solutions based on approximation algorithms for the well-known max n-cut problem and the related max n-section problem on hypergraphs. Our algorithms, when applied to a number of sample cases give more than a 2-fold performance improvement over random partitioning.

    View details for DOI 10.1093/bioinformatics/btn173

    View details for PubMedID 18586730

  • Convergence to equilibria in distributed, selfish reallocation processes with weighted tasks ALGORITHMS - ESA 2007, PROCEEDINGS Berenbrink, P., Friedetzky, T., Hajirasouliha, I., Hu, Z. 2007; 4698: 41-52
  • On completing Latin squares STACS 2007, PROCEEDINGS Hajirasouliha, I., Jowhari, H., Kumar, R., Sundaram, R. 2007; 4393: 524-535