Aziz Khan is a staff scientist at the Stanford Cancer Institute, where he develops reproducible pipelines and machine learning methods for integrative analysis of multi-omics data at bulk and single-cell resolution to understand tumor evolution and chromatin regulatory dynamics of tumor growth.

Aziz completed his PhD in Bioinformatics at Tsinghua University, China, in 2016, followed by three years of postdoctoral training at the University of Oslo, Norway. During his PhD and postdoc, his primary research emphasis was on regulatory genomics and epigenomics. He developed computational methods, tools, and resources to understand the (epi)genomic control of gene regulation in development and disease.

Apart from research, he advocates for open science, open source, preprints, and reproducibility in research. He is a contributor to Bioconda and has developed several open-source tools and resources, such as JASPAR. He is an ASAPbio Ambassador and eLife Community Ambassador, and he co-founded ECRcentral, a community initiative for early-career researchers.

Honors & Awards

  • CSC fully-funded PhD Scholarship, China Scholarship Council (2012 – 2016)
  • Erasmus+ mobility grant, European Commission (2019)
  • OBF Travel Fellowship, ISMB/ECCB 2019, Basel, Switzerland (May 2019)
  • Biocuration Travel Fellowship, Biocuration Conference 2019, Cambridge, UK (Apr 2019)
  • Biocuration Travel Fellowship, Biocuration Conference 2018, Shanghai, China (Apr 2018)
  • Research Travel Grant, Higher Education Commission (HEC), Pakistan (2016)
  • TWAS BIOVISION.Next Fellowship, BioVision conference, Lyon, France (2014)
  • TWAS BIOVISION.Next Fellowship, BioVision conference, Lyon, France (2013)
  • Research Travel Grant, Higher Education Commission (HEC), Pakistan (2012)

Education & Certifications

  • Teaching Certificate (Associate), CIRTL and Stanford University, Evidence-based STEM Teaching (2024)
  • PhD, Tsinghua University, China, Bioinformatics (2016)
  • Postdoctoral Fellow, NCMM, University of Oslo, Norway, Computational Biology (2019)

Professional Interests

gene regulation, cancer regulatory genomics and epigenomics, integrative analysis of multi-omics data, machine learning

Professional Affiliations and Activities

  • Community Ambassador, eLife (2018 - 2020)
  • Member, International Society for Computational Biology (ISCB) (2015 - Present)
  • Co-Chair, Web Committee, ISCB Student Council (2016 - 2019)
  • Ambassador, ASAPbio (2019 - Present)

All Publications

  • Germline-mediated immunoediting sculpts breast cancer subtypes and metastatic proclivity. Science (New York, N.Y.) Houlahan, K. E., Khan, A., Greenwald, N. F., Vivas, C. S., West, R. B., Angelo, M., Curtis, C. 2024; 384 (6699): eadh8697


    Tumors with the same diagnosis can have different molecular profiles and response to treatment. It remains unclear when and why these differences arise. Somatic genomic aberrations occur within the context of a highly variable germline genome. Interrogating 5870 breast cancer lesions, we demonstrated that germline-derived epitopes in recurrently amplified genes influence somatic evolution by mediating immunoediting. Individuals with a high germline-epitope burden in human epidermal growth factor receptor 2 (HER2/ERBB2) are less likely to develop HER2-positive breast cancer compared with other subtypes. The same holds true for recurrent amplicons defining three aggressive estrogen receptor (ER)-positive subgroups. Tumors that overcome such immune-mediated negative selection are more aggressive and demonstrate an "immune cold" phenotype. These data show that the germline genome plays a role in dictating somatic evolution.

    View details for DOI 10.1126/science.adh8697

    View details for PubMedID 38815010

  • JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic acids research Rauluseviciute, I., Riudavets-Puig, R., Blanc-Mathieu, R., Castro-Mondragon, J. A., Ferenc, K., Kumar, V., Lemma, R. B., Lucas, J., Cheneby, J., Baranasic, D., Khan, A., Fornes, O., Gundersen, S., Johansen, M., Hovig, E., Lenhard, B., Sandelin, A., Wasserman, W. W., Parcy, F., Mathelier, A. 2023


    JASPAR is a widely used open-access database presenting manually curated high-quality and non-redundant DNA-binding profiles for transcription factors (TFs) across taxa. In this 10th release and 20th-anniversary update, the CORE collection has expanded with 329 new profiles. We updated three existing profiles and provided orthogonal support for 72 profiles from the previous release's UNVALIDATED collection. Altogether, the JASPAR 2024 update provides a 20% increase in CORE profiles from the previous release. A trimming algorithm enhanced profiles by removing low information content flanking base pairs, which were likely uninformative (within the capacity of the PFM models) for TFBS predictions and modelling TF-DNA interactions. This release includes enhanced metadata, featuring a refined classification for plant TFs' structural DNA-binding domains. The new JASPAR collections prompt updates to the genomic tracks of predicted TF binding sites (TFBSs) in 8 organisms, with human and mouse tracks available as native tracks in the UCSC Genome Browser. All data are available through the JASPAR web interface and programmatically through its API and the updated Bioconductor and pyJASPAR packages. Finally, a new TFBS extraction tool enables users to retrieve predicted JASPAR TFBSs intersecting their genomic regions of interest.

    View details for DOI 10.1093/nar/gkad1059

    View details for PubMedID 37962376

  • Deterministic evolution and stringent selection during preneoplasia. Nature Karlsson, K., Przybilla, M. J., Kotler, E., Khan, A., Xu, H., Karagyozova, K., Sockell, A., Wong, W. H., Liu, K., Mah, A., Lo, Y. H., Lu, B., Houlahan, K. E., Ma, Z., Suarez, C. J., Barnes, C. P., Kuo, C. J., Curtis, C. 2023


    The earliest events during human tumour initiation, although poorly characterized, may hold clues to malignancy detection and prevention. Here we model occult preneoplasia by biallelic inactivation of TP53, a common early event in gastric cancer, in human gastric organoids. Causal relationships between this initiating genetic lesion and resulting phenotypes were established using experimental evolution in multiple clonally derived cultures over 2 years. TP53 loss elicited progressive aneuploidy, including copy number alterations and structural variants prevalent in gastric cancers, with evident preferred orders. Longitudinal single-cell sequencing of TP53-deficient gastric organoids similarly indicates progression towards malignant transcriptional programmes. Moreover, high-throughput lineage tracing with expressed cellular barcodes demonstrates reproducible dynamics whereby initially rare subclones with shared transcriptional programmes repeatedly attain clonal dominance. This powerful platform for experimental evolution exposes stringent selection, clonal interference and a marked degree of phenotypic convergence in premalignant epithelial organoids. These data imply predictability in the earliest stages of tumorigenesis and show evolutionary constraints and barriers to malignant transformation, with implications for earlier detection and interception of aggressive, genome-instable tumours.

    View details for DOI 10.1038/s41586-023-06102-8

    View details for PubMedID 37258665

    View details for PubMedCentralID 5656752

  • Germline-mediated immunoediting sculpts breast cancer subtypes and metastatic proclivity. bioRxiv : the preprint server for biology Houlahan, K. E., Khan, A., Greenwald, N. F., West, R. B., Angelo, M., Curtis, C. 2023


    Cancer represents a broad spectrum of molecularly and morphologically diverse diseases. Individuals with the same clinical diagnosis can have tumors with drastically different molecular profiles and clinical response to treatment. It remains unclear when these differences arise during disease course and why some tumors are addicted to one oncogenic pathway over another. Somatic genomic aberrations occur within the context of an individual's germline genome, which can vary across millions of polymorphic sites. An open question is whether germline differences influence somatic tumor evolution. Interrogating 3,855 breast cancer lesions, spanning pre-invasive to metastatic disease, we demonstrate that germline variants in highly expressed and amplified genes influence somatic evolution by modulating immunoediting at early stages of tumor development. Specifically, we show that the burden of germline-derived epitopes in recurrently amplified genes selects against somatic gene amplification in breast cancer. For example, individuals with a high burden of germline-derived epitopes in ERBB2, encoding human epidermal growth factor receptor 2 (HER2), are significantly less likely to develop HER2-positive breast cancer compared to other subtypes. The same holds true for recurrent amplicons that define four subgroups of ER-positive breast cancers at high risk of distant relapse. High epitope burden in these recurrently amplified regions is associated with decreased likelihood of developing high risk ER-positive cancer. Tumors that overcome such immune-mediated negative selection are more aggressive and demonstrate an "immune cold" phenotype. These data show the germline genome plays a previously unappreciated role in dictating somatic evolution. Exploiting germline-mediated immunoediting may inform the development of biomarkers that refine risk stratification within breast cancer subtypes.

    View details for DOI 10.1101/2023.03.15.532870

    View details for PubMedID 36993286

    View details for PubMedCentralID PMC10055121

  • Somatic variant detection from multi-sampled genomic sequencing data of tumor specimens using the ith.Variant pipeline. STAR protocols Maeser, N., Khan, A., Sun, R. 2022; 4 (1): 101927


    A common technique for uncovering intra-tumor genomic heterogeneity (ITH) is variant detection. However, it can be challenging to reliably characterize ITH given uneven sample quality (e.g., depth of coverage, tumor purity, and subclonality). We describe a protocol for calling point mutations and copy number alterations using sequencing of multiple related clinical patient samples across diverse tissue, optimizing for sensitivity with specificity. The ith.Variant pipeline can be run on single- or multi-region whole-genome and whole-exome sequencing. For complete details on the use and execution of this protocol, please refer to Sun et al. (2017).

    View details for DOI 10.1016/j.xpro.2022.101927

    View details for PubMedID 36586123

  • Molecular classification and biomarkers of clinical outcome in breast ductal carcinoma in situ: Analysis of TBCRC 038 and RAHBT cohorts. Cancer cell Strand, S. H., Rivero-Gutierrez, B., Houlahan, K. E., Seoane, J. A., King, L. M., Risom, T., Simpson, L. A., Vennam, S., Khan, A., Cisneros, L., Hardman, T., Harmon, B., Couch, F., Gallagher, K., Kilgore, M., Wei, S., DeMichele, A., King, T., McAuliffe, P. F., Nangia, J., Lee, J., Tseng, J., Storniolo, A. M., Thompson, A. M., Gupta, G. P., Burns, R., Veis, D. J., DeSchryver, K., Zhu, C., Matusiak, M., Wang, J., Zhu, S. X., Tappenden, J., Ding, D. Y., Zhang, D., Luo, J., Jiang, S., Varma, S., Anderson, L., Straub, C., Srivastava, S., Curtis, C., Tibshirani, R., Angelo, R. M., Hall, A., Owzar, K., Polyak, K., Maley, C., Marks, J. R., Colditz, G. A., Hwang, E. S., West, R. B. 2022


    Ductal carcinoma in situ (DCIS) is the most common precursor of invasive breast cancer (IBC), with variable propensity for progression. We perform multiscale, integrated molecular profiling of DCIS with clinical outcomes by analyzing 774 DCIS samples from 542 patients with 7.3 years median follow-up from the Translational Breast Cancer Research Consortium 038 study and the Resource of Archival Breast Tissue cohorts. We identify 812 genes associated with ipsilateral recurrence within 5 years from treatment and develop a classifier that predicts DCIS or IBC recurrence in both cohorts. Pathways associated with recurrence include proliferation, immune response, and metabolism. Distinct stromal expression patterns and immune cell compositions are identified. Our multiscale approach employed in situ methods to generate a spatially resolved atlas of breast precancers, where complementary modalities can be directly compared and correlated with conventional pathology findings, disease states, and clinical outcome.

    View details for DOI 10.1016/j.ccell.2022.10.021

    View details for PubMedID 36400020

  • MITI minimum information guidelines for highly multiplexed tissue images. Nature methods Schapiro, D., Yapp, C., Sokolov, A., Reynolds, S. M., Chen, Y., Sudar, D., Xie, Y., Muhlich, J., Arias-Camison, R., Arena, S., Taylor, A. J., Nikolov, M., Tyler, M., Lin, J., Burlingame, E. A., Human Tumor Atlas Network, Chang, Y. H., Farhi, S. L., Thorsson, V., Venkatamohan, N., Drewes, J. L., Pe'er, D., Gutman, D. A., Herrmann, M. D., Gehlenborg, N., Bankhead, P., Roland, J. T., Herndon, J. M., Snyder, M. P., Angelo, M., Nolan, G., Swedlow, J. R., Schultz, N., Merrick, D. T., Mazzili, S. A., Cerami, E., Rodig, S. J., Santagata, S., Sorger, P. K., Abravanel, D. L., Achilefu, S., Ademuyiwa, F. O., Adey, A. C., Aft, R., Ahn, K. J., Alikarami, F., Alon, S., Ashenberg, O., Baker, E., Baker, G. J., Bandyopadhyay, S., Bayguinov, P., Beane, J., Becker, W., Bernt, K., Betts, C. B., Bletz, J., Blosser, T., Boire, A., Boland, G. M., Boyden, E. S., Bucher, E., Bueno, R., Cai, Q., Cambuli, F., Campbell, J., Cao, S., Caravan, W., Chaligne, R., Chan, J. M., Chasnoff, S., Chatterjee, D., Chen, A. A., Chen, C., Chen, C., Chen, B., Chen, F., Chen, S., Chheda, M. G., Chin, K., Cho, H., Chun, J., Cisneros, L., Coffey, R. J., Cohen, O., Colditz, G. A., Cole, K. A., Collins, N., Cotter, D., Coussens, L. M., Coy, S., Creason, A. L., Cui, Y., Zhou, D. C., Curtis, C., Davies, S. R., Bruijn, I., Delorey, T. M., Demir, E., Denardo, D., Diep, D., Ding, L., DiPersio, J., Dubinett, S. M., Eberlein, T. J., Eddy, J. A., Esplin, E. D., Factor, R. E., Fatahalian, K., Feiler, H. S., Fernandez, J., Fields, A., Fields, R. C., Fitzpatrick, J. A., Ford, J. M., Franklin, J., Fulton, B., Gaglia, G., Galdieri, L., Ganesh, K., Gao, J., Gaudio, B. L., Getz, G., Gibbs, D. L., Gillanders, W. E., Goecks, J., Goodwin, D., Gray, J. W., Greenleaf, W., Grimm, L. J., Gu, Q., Guerriero, J. L., Guha, T., Guimaraes, A. R., Gutierrez, B., Hacohen, N., Hanson, C. R., Harris, C. R., Hawkins, W. G., Heiser, C. 
N., Hoffer, J., Hollmann, T. J., Hsieh, J. J., Huang, J., Hunger, S. P., Hwang, E., Iacobuzio-Donahue, C., Iglesia, M. D., Islam, M., Izar, B., Jacobson, C. A., Janes, S., Jayasinghe, R. G., Jeudi, T., Johnson, B. E., Johnson, B. E., Ju, T., Kadara, H., Karnoub, E., Karpova, A., Khan, A., Kibbe, W., Kim, A. H., King, L. M., Kozlowski, E., Krishnamoorthy, P., Krueger, R., Kundaje, A., Ladabaum, U., Laquindanum, R., Lau, C., Lau, K. S., LeBoeuf, N. R., Lee, H., Lenburg, M., Leshchiner, I., Levy, R., Li, Y., Lian, C. G., Liang, W., Lim, K., Lin, Y., Liu, D., Liu, Q., Liu, R., Lo, J., Lo, P., Longabaugh, W. J., Longacre, T., Luckett, K., Ma, C., Maher, C., Maier, A., Makowski, D., Maley, C., Maliga, Z., Manoj, P., Maris, J. M., Markham, N., Marks, J. R., Martinez, D., Mashl, J., Masilionis, I., Massague, J., Mazurowski, M. A., McKinley, E. T., McMichael, J., Meyerson, M., Mills, G. B., Mitri, Z. I., Moorman, A., Mudd, J., Murphy, G. F., Deen, N. N., Navin, N. E., Nawy, T., Ness, R. M., Nevins, S., Nirmal, A. J., Novikov, E., Oh, S. T., Oldridge, D. A., Owzar, K., Pant, S. M., Park, W., Patti, G. J., Paul, K., Pelletier, R., Persson, D., Petty, C., Pfister, H., Polyak, K., Puram, S. V., Qiu, Q., Villalonga, A. Q., Ramirez, M. A., Rashid, R., Reeb, A. N., Reid, M. E., Remsik, J., Riesterer, J. L., Risom, T., Ritch, C. C., Rolong, A., Rudin, C. M., Ryser, M. D., Sato, K., Sears, C. L., Semenov, Y. R., Shen, J., Shoghi, K. I., Shrubsole, M. J., Shyr, Y., Sibley, A. B., Simmons, A. J., Sinha, A., Sivagnanam, S., Song, S., Southar-Smith, A., Spira, A. E., Cyr, J. S., Stefankiewicz, S., Storrs, E. P., Stover, E. H., Strand, S. H., Straub, C., Street, C., Su, T., Surrey, L. F., Suver, C., Tan, K., Terekhanova, N. V., Ternes, L., Thadi, A., Thomas, G., Tibshirani, R., Umeda, S., Uzun, Y., Vallius, T., Van Allen, E. R., Vandekar, S., Vega, P. N., Veis, D. J., Vennam, S., Verma, A., Vigneau, S., Wagle, N., Wahl, R., Walle, T., Wang, L., Warchol, S., Washington, M. 
K., Watson, C., Weimer, A. K., Wendl, M. C., West, R. B., White, S., Windon, A. L., Wu, H., Wu, C., Wu, Y., Wyczalkowski, M. A., Xu, J., Yao, L., Yu, W., Zhang, K., Zhu, X. 2022; 19 (3): 262-267

    View details for DOI 10.1038/s41592-022-01415-4

    View details for PubMedID 35277708

  • UniBind: maps of high-confidence direct TF-DNA interactions across nine species. BMC genomics Puig, R. R., Boddie, P., Khan, A., Castro-Mondragon, J. A., Mathelier, A. 2021; 22 (1): 482


    BACKGROUND: Transcription factors (TFs) bind specifically to TF binding sites (TFBSs) at cis-regulatory regions to control transcription. It is critical to locate these TF-DNA interactions to understand transcriptional regulation. Efforts to predict bona fide TFBSs benefit from the availability of experimental data mapping DNA binding regions of TFs (chromatin immunoprecipitation followed by sequencing, ChIP-seq). RESULTS: In this study, we processed ~10,000 public ChIP-seq datasets from nine species to provide high-quality TFBS predictions. After quality control, it culminated with the prediction of ~56 million TFBSs with experimental and computational support for direct TF-DNA interactions for 644 TFs in >1000 cell lines and tissues. These TFBSs were used to predict >197,000 cis-regulatory modules representing clusters of binding events in the corresponding genomes. The high quality of the TFBSs was reinforced by their evolutionary conservation, enrichment at active cis-regulatory regions, and capacity to predict combinatorial binding of TFs. Further, we confirmed that the cell type and tissue specificity of enhancer activity was correlated with the number of TFs with binding sites predicted in these regions. All the data is provided to the community through the UniBind database that can be accessed through its web interface, a dedicated RESTful API, and as genomic tracks. Finally, we provide an enrichment tool, available as a web service and an R package, for users to find TFs with enriched TFBSs in a set of provided genomic regions. CONCLUSIONS: UniBind is the first resource of its kind, providing the largest collection of high-confidence direct TF-DNA interactions in nine species.

    View details for DOI 10.1186/s12864-021-07760-6

    View details for PubMedID 34174819

  • Pakistan: anger mounts over threat to higher education. Nature Khan, A. 2021; 592 (7856): 685

    View details for DOI 10.1038/d41586-021-01119-3

    View details for PubMedID 33907327

  • Changing scientific meetings for the better. Nature human behaviour Sarabipour, S., Khan, A., Seah, Y. F., Mwakilili, A. D., Mumoki, F. N., Sáez, P. J., Schwessinger, B., Debat, H. J., Mestrovic, T. 2021

    View details for DOI 10.1038/s41562-021-01067-y

    View details for PubMedID 33723404

  • JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic acids research Castro-Mondragon, J. A., Riudavets-Puig, R., Rauluseviciute, I., Berhanu Lemma, R., Turchi, L., Blanc-Mathieu, R., Lucas, J., Boddie, P., Khan, A., Manosalva Pérez, N., Fornes, O., Leung, T. Y., Aguirre, A., Hammal, F., Schmelter, D., Baranasic, D., Ballester, B., Sandelin, A., Lenhard, B., Vandepoele, K., Wasserman, W. W., Parcy, F., Mathelier, A. 2021


    JASPAR is an open-access database containing manually curated, non-redundant transcription factor (TF) binding profiles for TFs across six taxonomic groups. In this 9th release, we expanded the CORE collection with 341 new profiles (148 for plants, 101 for vertebrates, 85 for urochordates, and 7 for insects), which corresponds to a 19% expansion over the previous release. We added 298 new profiles to the Unvalidated collection when no orthogonal evidence was found in the literature. All the profiles were clustered to provide familial binding profiles for each taxonomic group. Moreover, we revised the structural classification of DNA binding domains to consider plant-specific TFs. This release introduces word clouds to represent the scientific knowledge associated with each TF. We updated the genome tracks of TFBSs predicted with JASPAR profiles in eight organisms; the human and mouse TFBS predictions can be visualized as native tracks in the UCSC Genome Browser. Finally, we provide a new tool to perform JASPAR TFBS enrichment analysis in user-provided genomic regions. All the data is accessible through the JASPAR website, its associated RESTful API, the R/Bioconductor data package, and a new Python package, pyJASPAR, that facilitates serverless access to the data.

    View details for DOI 10.1093/nar/gkab1113

    View details for PubMedID 34850907

  • A call to eradicate non-inclusive terms from the life sciences. eLife Khan, A. 2021; 10


    Since the Black Lives Matter movement rose to mainstream prominence, the academic enterprise has started recognizing the systematic racism present in science. However, there have been relatively few efforts to make sure that the language used to communicate science is inclusive. Here, I quantify the number of research articles published between 2000 and 2020 that contained non-inclusive terms with racial connotations, such as "blacklist" and "whitelist", or "master" and "slave". This reveals that non-inclusive language is being increasingly used in the life sciences literature, and I urge the global academic community to expunge these archaic terms to make science inclusive for everyone.

    View details for DOI 10.7554/eLife.65604

    View details for PubMedID 33556000

  • The Human Tumor Atlas Network: Charting Tumor Transitions across Space and Time at Single-Cell Resolution. Cell Rozenblatt-Rosen, O., Regev, A., Oberdoerffer, P., Nawy, T., Hupalowska, A., Rood, J. E., Ashenberg, O., Cerami, E., Coffey, R. J., Demir, E., Ding, L., Esplin, E. D., Ford, J. M., Goecks, J., Ghosh, S., Gray, J. W., Guinney, J., Hanlon, S. E., Hughes, S. K., Hwang, E. S., Iacobuzio-Donahue, C. A., Jane-Valbuena, J., Johnson, B. E., Lau, K. S., Lively, T., Mazzilli, S. A., Pe'er, D., Santagata, S., Shalek, A. K., Schapiro, D., Snyder, M. P., Sorger, P. K., Spira, A. E., Srivastava, S., Tan, K., West, R. B., Williams, E. H., Human Tumor Atlas Network, Aberle, D., Achilefu, S. I., Ademuyiwa, F. O., Adey, A. C., Aft, R. L., Agarwal, R., Aguilar, R. A., Alikarami, F., Allaj, V., Amos, C., Anders, R. A., Angelo, M. R., Anton, K., Ashenberg, O., Aster, J. C., Babur, O., Bahmani, A., Balsubramani, A., Barrett, D., Beane, J., Bender, D. E., Bernt, K., Berry, L., Betts, C. B., Bletz, J., Blise, K., Boire, A., Boland, G., Borowsky, A., Bosse, K., Bott, M., Boyden, E., Brooks, J., Bueno, R., Burlingame, E. A., Cai, Q., Campbell, J., Caravan, W., Cerami, E., Chaib, H., Chan, J. M., Chang, Y. H., Chatterjee, D., Chaudhary, O., Chen, A. A., Chen, B., Chen, C., Chen, C., Chen, F., Chen, Y., Chheda, M. G., Chin, K., Chiu, R., Chu, S., Chuaqui, R., Chun, J., Cisneros, L., Coffey, R. J., Colditz, G. A., Cole, K., Collins, N., Contrepois, K., Coussens, L. M., Creason, A. L., Crichton, D., Curtis, C., Davidsen, T., Davies, S. R., de Bruijn, I., Dellostritto, L., De Marzo, A., Demir, E., DeNardo, D. G., Diep, D., Ding, L., Diskin, S., Doan, X., Drewes, J., Dubinett, S., Dyer, M., Egger, J., Eng, J., Engelhardt, B., Erwin, G., Esplin, E. D., Esserman, L., Felmeister, A., Feiler, H. S., Fields, R. C., Fisher, S., Flaherty, K., Flournoy, J., Ford, J. M., Fortunato, A., Frangieh, A., Frye, J. L., Fulton, R. S., Galipeau, D., Gan, S., Gao, J., Gao, L., Gao, P., Gao, V. 
R., Geiger, T., George, A., Getz, G., Ghosh, S., Giannakis, M., Gibbs, D. L., Gillanders, W. E., Goecks, J., Goedegebuure, S. P., Gould, A., Gowers, K., Gray, J. W., Greenleaf, W., Gresham, J., Guerriero, J. L., Guha, T. K., Guimaraes, A. R., Guinney, J., Gutman, D., Hacohen, N., Hanlon, S., Hansen, C. R., Harismendy, O., Harris, K. A., Hata, A., Hayashi, A., Heiser, C., Helvie, K., Herndon, J. M., Hirst, G., Hodi, F., Hollmann, T., Horning, A., Hsieh, J. J., Hughes, S., Huh, W. J., Hunger, S., Hwang, S. E., Iacobuzio-Donahue, C. A., Ijaz, H., Izar, B., Jacobson, C. A., Janes, S., Jane-Valbuena, J., Jayasinghe, R. G., Jiang, L., Johnson, B. E., Johnson, B., Ju, T., Kadara, H., Kaestner, K., Kagan, J., Kalinke, L., Keith, R., Khan, A., Kibbe, W., Kim, A. H., Kim, E., Kim, J., Kolodzie, A., Kopytra, M., Kotler, E., Krueger, R., Krysan, K., Kundaje, A., Ladabaum, U., Lake, B. B., Lam, H., Laquindanum, R., Lau, K. S., Laughney, A. M., Lee, H., Lenburg, M., Leonard, C., Leshchiner, I., Levy, R., Li, J., Lian, C. G., Lim, K., Lin, J., Lin, Y., Liu, Q., Liu, R., Lively, T., Longabaugh, W. J., Longacre, T., Ma, C. X., Macedonia, M. C., Madison, T., Maher, C. A., Maitra, A., Makinen, N., Makowski, D., Maley, C., Maliga, Z., Mallo, D., Maris, J., Markham, N., Marks, J., Martinez, D., Mashl, R. J., Masilionais, I., Mason, J., Massague, J., Massion, P., Mattar, M., Mazurchuk, R., Mazutis, L., Mazzilli, S. A., McKinley, E. T., McMichael, J. F., Merrick, D., Meyerson, M., Miessner, J. R., Mills, G. B., Mills, M., Mondal, S. B., Mori, M., Mori, Y., Moses, E., Mosse, Y., Muhlich, J. L., Murphy, G. F., Navin, N. E., Nawy, T., Nederlof, M., Ness, R., Nevins, S., Nikolov, M., Nirmal, A. J., Nolan, G., Novikov, E., Oberdoerffer, P., O'Connell, B., Offin, M., Oh, S. T., Olson, A., Ooms, A., Ossandon, M., Owzar, K., Parmar, S., Patel, T., Patti, G. J., Pe'er, D., Pe'er, I., Peng, T., Persson, D., Petty, M., Pfister, H., Polyak, K., Pourfarhangi, K., Puram, S. 
V., Qiu, Q., Quintanal-Villalonga, A., Raj, A., Ramirez-Solano, M., Rashid, R., Reeb, A. N., Regev, A., Reid, M., Resnick, A., Reynolds, S. M., Riesterer, J. L., Rodig, S., Roland, J. T., Rosenfield, S., Rotem, A., Roy, S., Rozenblatt-Rosen, O., Rudin, C. M., Ryser, M. D., Santagata, S., Santi-Vicini, M., Sato, K., Schapiro, D., Schrag, D., Schultz, N., Sears, C. L., Sears, R. C., Sen, S., Sen, T., Shalek, A., Sheng, J., Sheng, Q., Shoghi, K. I., Shrubsole, M. J., Shyr, Y., Sibley, A. B., Siex, K., Simmons, A. J., Singer, D. S., Sivagnanam, S., Slyper, M., Snyder, M. P., Sokolov, A., Song, S., Sorger, P. K., Southard-Smith, A., Spira, A., Srivastava, S., Stein, J., Storm, P., Stover, E., Strand, S. H., Su, T., Sudar, D., Sullivan, R., Surrey, L., Suva, M., Tan, K., Terekhanova, N. V., Ternes, L., Thammavong, L., Thibault, G., Thomas, G. V., Thorsson, V., Todres, E., Tran, L., Tyler, M., Uzun, Y., Vachani, A., Van Allen, E., Vandekar, S., Veis, D. J., Vigneau, S., Vossough, A., Waanders, A., Wagle, N., Wang, L., Wendl, M. C., West, R., Williams, E. H., Wu, C., Wu, H., Wu, H., Wyczalkowski, M. A., Xie, Y., Yang, X., Yapp, C., Yu, W., Yuan, Y., Zhang, D., Zhang, K., Zhang, M., Zhang, N., Zhang, Y., Zhao, Y., Zhou, D. C., Zhou, Z., Zhu, H., Zhu, Q., Zhu, X., Zhu, Y., Zhuang, X. 2020; 181 (2): 236–49


    Crucial transitions in cancer (including tumor initiation, local expansion, metastasis, and therapeutic resistance) involve complex interactions between cells within the dynamic tumor ecosystem. Transformative single-cell genomics technologies and spatial multiplex in situ methods now provide an opportunity to interrogate this complexity at unprecedented resolution. The Human Tumor Atlas Network (HTAN), part of the National Cancer Institute (NCI) Cancer Moonshot Initiative, will establish a clinical, experimental, computational, and organizational framework to generate informative and accessible three-dimensional atlases of cancer transitions for a diverse set of tumor types. This effort complements both ongoing efforts to map healthy organs and previous large-scale cancer genomics approaches focused on bulk sequencing at a single point in time. Generating single-cell, multiparametric, longitudinal atlases and integrating them with clinical outcomes should help identify novel predictive biomarkers and features as well as therapeutically relevant cell types, cell states, and cellular interactions across transitions. The resulting tumor atlases should have a profound impact on our understanding of cancer biology and have the potential to improve cancer detection, prevention, and therapeutic discovery for better precision-medicine treatments of cancer patients and those at risk for cancer.

    View details for DOI 10.1016/j.cell.2020.03.053

    View details for PubMedID 32302568

  • COVID-19: students caught in Pakistan's digital divide. Nature Khan, A. 2020; 587 (7835): 548

    View details for DOI 10.1038/d41586-020-03291-4

    View details for PubMedID 33235363

  • BiasAway: command-line and web server to generate nucleotide composition-matched DNA background sequences. Bioinformatics (Oxford, England) Khan, A., Puig, R. R., Boddie, P., Mathelier, A. 2020


    Accurate motif enrichment analyses depend on the choice of background DNA sequences used, which should ideally match the sequence composition of the foreground sequences. It is important to avoid false positive enrichment due to sequence biases in the genome, such as GC-bias. Therefore, relying on an appropriate set of background sequences is crucial for enrichment analysis. We developed BiasAway, a command line tool and its dedicated easy-to-use web server to generate synthetic sequences matching any k-mer nucleotide composition or select genomic DNA sequences matching the mononucleotide composition of the foreground sequences through four different models. For genomic sequences, we provide precomputed partitions of genomes from nine species with five different bin sizes to generate appropriate genomic background sequences. BiasAway source code is freely available from Bitbucket and can be easily installed using bioconda or pip. A web server and detailed documentation are also available. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btaa928

    View details for PubMedID 33135764

  • JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic acids research Fornes, O., Castro-Mondragon, J. A., Khan, A., van der Lee, R., Zhang, X., Richmond, P. A., Modi, B. P., Correard, S., Gheorghe, M., Baranasic, D., Santana-Garcia, W., Tan, G., Cheneby, J., Ballester, B., Parcy, F., Sandelin, A., Lenhard, B., Wasserman, W. W., Mathelier, A. 2019


    JASPAR is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) for TFs across multiple species in six taxonomic groups. In this 8th release of JASPAR, the CORE collection has been expanded with 245 new PFMs (169 for vertebrates, 42 for plants, 17 for nematodes, 10 for insects, and 7 for fungi), and 156 PFMs were updated (125 for vertebrates, 28 for plants and 3 for insects). These new profiles represent an 18% expansion compared to the previous release. JASPAR 2020 comes with a novel collection of unvalidated TF-binding profiles for which our curators did not find orthogonal supporting evidence in the literature. This collection has a dedicated web form to engage the community in the curation of unvalidated TF-binding profiles. Moreover, we created a Q&A forum to ease the communication between the user community and JASPAR curators. Finally, we updated the genomic tracks, inference tool, and TF-binding profile similarity clusters. All the data is available through the JASPAR website, its associated RESTful API, and through the JASPAR2020 R/Bioconductor package.

    View details for DOI 10.1093/nar/gkz1001

    View details for PubMedID 31701148
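
    As a rough illustration of how a position frequency matrix (PFM) like those stored in JASPAR can be used, the Python sketch below converts per-position base counts into log2-odds weights and scores a candidate site. The toy matrix, the function names (pfm_to_pwm, score), the pseudocount scheme, and the uniform background are assumptions for illustration only; they are not taken from JASPAR or its tools.

```python
import math

BASES = "ACGT"


def pfm_to_pwm(pfm, pseudocount=1.0, background=None):
    """Convert a PFM ({base: [count per position]}) to log2-odds weights.

    The pseudocount is split across bases in proportion to the background,
    which avoids taking the logarithm of zero for unobserved bases.
    """
    background = background or {b: 0.25 for b in BASES}
    length = len(pfm["A"])
    pwm = {b: [] for b in BASES}
    for i in range(length):
        total = sum(pfm[b][i] for b in BASES) + pseudocount
        for b in BASES:
            prob = (pfm[b][i] + pseudocount * background[b]) / total
            pwm[b].append(math.log2(prob / background[b]))
    return pwm


def score(pwm, site):
    """Sum the position-specific weights of a site with the motif's length."""
    return sum(pwm[base][i] for i, base in enumerate(site.upper()))


# Toy two-position motif that strongly prefers "AG" (made-up counts).
toy_pfm = {"A": [10, 0], "C": [0, 0], "G": [0, 10], "T": [0, 0]}
toy_pwm = pfm_to_pwm(toy_pfm)
best = score(toy_pwm, "AG")
worst = score(toy_pwm, "CT")
```

    With these made-up counts, the site "AG" receives a positive score and "CT" a negative one, reflecting how often each base was observed at each position.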

  • Modeling RNA-Binding Protein Specificity In Vivo by Precisely Registering Protein-RNA Crosslink Sites MOLECULAR CELL Feng, H., Bao, S., Rahman, M., Weyn-Vanhentenryck, S. M., Khan, A., Wong, J., Shah, A., Flynn, E. D., Krainer, A. R., Zhang, C. 2019; 74 (6): 1189-+


    RNA-binding proteins (RBPs) regulate post-transcriptional gene expression by recognizing short and degenerate sequence motifs in their target transcripts, but precisely defining their binding specificity remains challenging. Crosslinking and immunoprecipitation (CLIP) allows for mapping of the exact protein-RNA crosslink sites, which frequently reside at specific positions in RBP motifs at single-nucleotide resolution. Here, we have developed a computational method, named mCross, to jointly model RBP binding specificity while precisely registering the crosslinking position in motif sites. We applied mCross to 112 RBPs using ENCODE eCLIP data and validated the reliability of the discovered motifs by genome-wide analysis of allelic binding sites. Our analyses revealed that the prototypical SR protein SRSF1 recognizes clusters of GGA half-sites in addition to its canonical GGAGGA motif. Therefore, SRSF1 regulates splicing of a much larger repertoire of transcripts than previously appreciated, including HNRNPD and HNRNPDL, which are involved in multivalent protein assemblies and phase separation.

    View details for DOI 10.1016/j.molcel.2019.02.002

    View details for Web of Science ID 000472231600010

    View details for PubMedID 31226278

    View details for PubMedCentralID PMC6676488

  • A map of direct TF-DNA interactions in the human genome NUCLEIC ACIDS RESEARCH Gheorghe, M., Sandve, G., Khan, A., Cheneby, J., Ballester, B., Mathelier, A. 2019; 47 (4): e21


    Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is the most popular assay to identify genomic regions, called ChIP-seq peaks, that are bound in vivo by transcription factors (TFs). These regions are derived from direct TF-DNA interactions, indirect binding of the TF to the DNA (through a co-binding partner), nonspecific binding to the DNA, and noise/bias/artifacts. Delineating the bona fide direct TF-DNA interactions within the ChIP-seq peaks remains challenging. We developed a dedicated software, ChIP-eat, that combines computational TF binding models and ChIP-seq peaks to automatically predict direct TF-DNA interactions. Our work culminated with predicted interactions covering >4% of the human genome, obtained by uniformly processing 1983 ChIP-seq peak data sets from the ReMap database for 232 unique TFs. The predictions were a posteriori assessed using protein binding microarray and ChIP-exo data, and were predominantly found in high quality ChIP-seq peaks. The set of predicted direct TF-DNA interactions suggested that high-occupancy target regions are likely not derived from direct binding of the TFs to the DNA. Our predictions derived co-binding TFs supported by protein-protein interaction data and defined cis-regulatory modules enriched for disease- and trait-associated SNPs. We provide this collection of direct TF-DNA interactions and cis-regulatory modules through the UniBind web-interface.

    View details for DOI 10.1093/nar/gky1210

    View details for Web of Science ID 000467961200003

    View details for PubMedID 30517703

    View details for PubMedCentralID PMC6393237

  • Integrative modeling reveals key chromatin and sequence signatures predicting super-enhancers SCIENTIFIC REPORTS Khan, A., Zhang, X. 2019; 9: 2877


    Super-enhancers (SEs) are clusters of transcriptional enhancers which control the expression of cell identity and disease-associated genes. Current studies demonstrated the role of multiple factors in SE formation; however, a systematic analysis to assess the relative predictive importance of chromatin and sequence features of SEs and their constituents is lacking. In addition, a predictive model that integrates various types of data to predict SEs has not been established. Here, we integrated diverse types of genomic and epigenomic datasets to identify key signatures of SEs and investigated their predictive importance. Through integrative modeling, we found Cdk8, Cdk9, and Smad3 as new features of SEs, which can define known and new SEs in mouse embryonic stem cells and pro-B cells. We compared six state-of-the-art machine learning models to predict SEs and showed that non-parametric ensemble models performed better as compared to parametric. We validated these models using cross-validation and also independent datasets in four human cell-types. Taken together, our systematic analysis and ranking of features can be used as a platform to define and understand the biology of SEs in other cell-types.

    View details for DOI 10.1038/s41598-019-38979-9

    View details for Web of Science ID 000459799800026

    View details for PubMedID 30814546

    View details for PubMedCentralID PMC6393462

  • High OGT activity is essential for MYC-driven proliferation of prostate cancer cells THERANOSTICS Itkonen, H. M., Urbanucci, A., Martin, S. S., Khan, A., Mathelier, A., Thiede, B., Walker, S., Mills, I. G. 2019; 9 (8): 2183–97


    O-GlcNAc transferase (OGT) is overexpressed in aggressive prostate cancer. OGT modifies intra-cellular proteins via single sugar conjugation (O-GlcNAcylation) to alter their activity. We recently discovered the first fast-acting OGT inhibitor OSMI-2. Here, we probe the stability and function of the chromatin O-GlcNAc and identify transcription factors that coordinate with OGT to promote proliferation of prostate cancer cells. Methods: Chromatin immunoprecipitation (ChIP) coupled to sequencing (seq), formaldehyde-assisted isolation of regulatory elements, RNA-seq and reverse-phase protein arrays (RPPA) were used to study the importance of OGT for chromatin structure and transcription. Mass spectrometry, western blot, RT-qPCR, cell cycle analysis and viability assays were used to establish the role of OGT for MYC-related processes. Prostate cancer patient data profiled for both mRNA and protein levels were used to validate findings. Results: We show for the first time that OGT inhibition leads to a rapid loss of O-GlcNAc chromatin mark. O-GlcNAc ChIP-seq regions overlap with super-enhancers (SE) and MYC binding sites. OGT inhibition leads to down-regulation of SE-dependent genes. We establish the first O-GlcNAc chromatin consensus motif, which we use as a bait for mass spectrometry. By combining the proteomic data from oligonucleotide enrichment with O-GlcNAc and MYC ChIP-mass spectrometry, we identify host cell factor 1 (HCF-1) as an interaction partner of MYC. Inhibition of OGT disrupts this interaction and compromises MYC's ability to confer androgen-independent proliferation to prostate cancer cells. We show that OGT is required for MYC-mediated stabilization of mitotic proteins, including Cyclin B1, and/or the increased translation of their coding transcripts. This implies that increased expression of mRNA is not always required to achieve increased protein expression and confer aggressive phenotype. Indeed, high expression of Cyclin B1 protein has strong predictive value in prostate cancer patients (p=0.000014) while mRNA does not. Conclusions: OGT promotes SE-dependent gene expression. OGT activity is required for the interaction between MYC and HCF-1 and expression of MYC-regulated mitotic proteins. These features render OGT essential for the androgen-independent, MYC-driven proliferation of prostate cancer cells. Androgen-independency is the major mechanism of prostate cancer progression, and our study identifies OGT as an essential mediator in this process.

    View details for DOI 10.7150/thno.30834

    View details for Web of Science ID 000464623500005

    View details for PubMedID 31149037

    View details for PubMedCentralID PMC6531294

  • Making genome browsers portable and personal GENOME BIOLOGY Khan, A., Zhang, X. 2018; 19: 93


    GIVE is a framework and library for creating portable and personalized genome browsers. It makes visualizing genomic data as easy as building a laboratory homepage.

    View details for DOI 10.1186/s13059-018-1470-9

    View details for Web of Science ID 000439134200002

    View details for PubMedID 30016986

    View details for PubMedCentralID PMC6050684

  • JASPAR RESTful API: accessing JASPAR data from any programming language BIOINFORMATICS Khan, A., Mathelier, A. 2018; 34 (9): 1612–14


    JASPAR is a widely used open-access database of curated, non-redundant transcription factor binding profiles. Currently, data from JASPAR can be retrieved as flat files or by using programming language-specific interfaces. Here, we present a programming language-independent application programming interface (API) to access JASPAR data using the Representational State Transfer (REST) architecture. The REST API enables programmatic access to JASPAR by most programming languages and returns data in eight widely used formats. Several endpoints are available to access the data and an endpoint is available to infer the TF binding profile(s) likely bound by a given DNA binding domain protein sequence. Additionally, it provides an interactive browsable interface for bioinformatics tool developers. This REST API is implemented in Python using the Django REST Framework. It is accessible online, and the source code is freely available under GPL v3. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btx804

    View details for Web of Science ID 000431509200033

    View details for PubMedID 29253085

  • Put science first and formatting later EMBO REPORTS Khan, A., Montenegro-Montero, A., Mathelier, A. 2018; 19 (5)

    View details for DOI 10.15252/embr.201845731

    View details for Web of Science ID 000431633800013

    View details for PubMedID 29650529

    View details for PubMedCentralID PMC5934772

  • JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework NUCLEIC ACIDS RESEARCH Khan, A., Fornes, O., Stigliani, A., Gheorghe, M., Castro-Mondragon, J. A., van der Lee, R., Bessy, A., Cheneby, J., Kulkarni, S. R., Tan, G., Baranasic, D., Arenillas, D. J., Sandelin, A., Vandepoele, K., Lenhard, B., Ballester, B., Wasserman, W. W., Parcy, F., Mathelier, A. 2018; 46 (D1): D260–D266


    JASPAR is an open-access database of curated, non-redundant transcription factor (TF)-binding profiles stored as position frequency matrices (PFMs) and TF flexible models (TFFMs) for TFs across multiple species in six taxonomic groups. In the 2018 release of JASPAR, the CORE collection has been expanded with 322 new PFMs (60 for vertebrates and 262 for plants) and 33 PFMs were updated (24 for vertebrates, 8 for plants and 1 for insects). These new profiles represent a 30% expansion compared to the 2016 release. In addition, we have introduced 316 TFFMs (95 for vertebrates, 218 for plants and 3 for insects). This release incorporates clusters of similar PFMs in each taxon and each TF class per taxon. The JASPAR 2018 CORE vertebrate collection of PFMs was used to predict TF-binding sites in the human genome. The predictions are made available to the scientific community through a UCSC Genome Browser track data hub. Finally, this update comes with a new web framework with an interactive and responsive user-interface, along with new features. All the underlying data can be retrieved programmatically using a RESTful API and through the JASPAR 2018 R/Bioconductor package.

    View details for DOI 10.1093/nar/gkx1126

    View details for Web of Science ID 000419550700040

    View details for PubMedID 29140473

    View details for PubMedCentralID PMC5753243

  • Super-enhancers are transcriptionally more active and cell type-specific than stretch enhancers EPIGENETICS Khan, A., Mathelier, A., Zhang, X. 2018; 13 (9): 910–22


    Super-enhancers and stretch enhancers represent classes of transcriptional enhancers that have been shown to control the expression of cell identity genes and carry disease- and trait-associated variants. Specifically, super-enhancers are clusters of enhancers defined based on the binding occupancy of master transcription factors, chromatin regulators, or chromatin marks, while stretch enhancers are large chromatin-defined regulatory regions of at least 3,000 base pairs. Several studies have characterized these regulatory regions in numerous cell types and tissues to decipher their functional importance. However, the differences and similarities between these regulatory regions have not been fully assessed. We integrated genomic, epigenomic, and transcriptomic data from ten human cell types to perform a comparative analysis of super and stretch enhancers with respect to their chromatin profiles, cell type-specificity, and ability to control gene expression. We found that stretch enhancers are more abundant, more distal to transcription start sites, cover twice as much of the genome, and are significantly less conserved than super-enhancers. In contrast, super-enhancers are significantly more enriched for active chromatin marks and cohesin complex, and more transcriptionally active than stretch enhancers. Importantly, a vast majority of super-enhancers (85%) overlap with only a small subset of stretch enhancers (13%), which are enriched for cell type-specific biological functions, and control cell identity genes. These results suggest that super-enhancers are transcriptionally more active and cell type-specific than stretch enhancers, and importantly, most of the stretch enhancers that are distinct from super-enhancers do not show an association with cell identity genes, are less active, and more likely to be poised enhancers.

    View details for DOI 10.1080/15592294.2018.1514231

    View details for Web of Science ID 000450445600002

    View details for PubMedID 30169995

    View details for PubMedCentralID PMC6284781

  • Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature methods Grüning, B., Dale, R., Sjödin, A., Chapman, B. A., Rowe, J., Tomkins-Tinch, C. H., Valieris, R., Köster, J. 2018; 15 (7): 475–76

    View details for DOI 10.1038/s41592-018-0046-7

    View details for PubMedID 29967506

  • Intervene: a tool for intersection and visualization of multiple gene or genomic region sets BMC BIOINFORMATICS Khan, A., Mathelier, A. 2017; 18: 287


    A common task for scientists relies on comparing lists of genes or genomic regions derived from high-throughput sequencing experiments. While several tools exist to intersect and visualize sets of genes, similar tools dedicated to the visualization of genomic region sets are currently limited. To address this gap, we have developed the Intervene tool, which provides an easy and automated interface for the effective intersection and visualization of genomic region or list sets, thus facilitating their analysis and interpretation. Intervene contains three modules: venn to generate Venn diagrams of up to six sets, upset to generate UpSet plots of multiple sets, and pairwise to compute and visualize intersections of multiple sets as clustered heat maps. Intervene, and its interactive web ShinyApp companion, generate publication-quality figures for the interpretation of genomic region and list sets. Intervene and its web application companion provide an easy command line and an interactive web interface to compute intersections of multiple genomic and list sets. They have the capacity to plot intersections using easy-to-interpret visual approaches. Intervene is developed and designed to meet the needs of both computer scientists and biologists. The source code and the web application are freely available.

    View details for DOI 10.1186/s12859-017-1708-7

    View details for Web of Science ID 000402839800002

    View details for PubMedID 28569135

    View details for PubMedCentralID PMC5452382
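
    The pairwise comparison of list sets described in this abstract can be sketched in a few lines of Python. This is only an illustration of the underlying set arithmetic, not Intervene's code or command-line interface: the function name pairwise_overlaps, the choice of the Jaccard index as the similarity measure, and the example gene lists are assumptions.

```python
from itertools import combinations


def pairwise_overlaps(named_lists):
    """Return {(name_a, name_b): (shared_count, jaccard)} for every pair.

    named_lists maps a label to any iterable of items (for example gene
    symbols); duplicates within a list are ignored.
    """
    sets = {name: set(items) for name, items in named_lists.items()}
    result = {}
    for (name_a, set_a), (name_b, set_b) in combinations(sets.items(), 2):
        shared = len(set_a & set_b)
        union = len(set_a | set_b)
        jaccard = shared / union if union else 0.0
        result[(name_a, name_b)] = (shared, jaccard)
    return result


# Made-up gene lists used purely as example input.
gene_lists = {
    "A": ["TP53", "MYC", "SOX2"],
    "B": ["MYC", "SOX2", "KLF4"],
    "C": ["GATA1"],
}
overlaps = pairwise_overlaps(gene_lists)
```

    For this example input, lists A and B share two of four distinct genes (Jaccard index 0.5), while C shares nothing with either. A matrix of such values is the kind of input a clustered heat map would display.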

  • dbSUPER: a database of super-enhancers in mouse and human genome. Nucleic acids research Khan, A., Zhang, X. 2016; 44 (D1): D164-71


    Super-enhancers are clusters of transcriptional enhancers that drive cell-type-specific gene expression and are crucial to cell identity. Many disease-associated sequence variations are enriched in super-enhancer regions of disease-relevant cell types. Thus, super-enhancers can be used as potential biomarkers for disease diagnosis and therapeutics. Current studies have identified super-enhancers in more than 100 cell types and demonstrated their functional importance. However, a centralized resource to integrate all these findings is not currently available. We developed dbSUPER, the first integrated and interactive database of super-enhancers, with the primary goal of providing a resource for assistance in further studies related to transcriptional control of cell identity and disease. dbSUPER provides a responsive and user-friendly web interface to facilitate efficient and comprehensive search and browsing. The data can be easily sent to Galaxy instances, GREAT and Cistrome web-servers for downstream analysis, and can also be visualized in the UCSC genome browser where custom tracks can be added automatically. The data can be downloaded and exported in a variety of formats. Furthermore, dbSUPER lists genes associated with super-enhancers and also links to external databases such as GeneCards, UniProt and Entrez. dbSUPER also provides an overlap analysis tool to annotate user-defined regions. We believe dbSUPER is a valuable resource for the biology and genetic research communities.

    View details for DOI 10.1093/nar/gkv1002

    View details for PubMedID 26438538

    View details for PubMedCentralID PMC4702767