Dr. Wall runs a lab in Pediatric Innovation focused on developing methods in biomedical informatics to disentangle complex conditions that originate in childhood and persist through the life course, including autism and related developmental delays. For over a decade, first on the faculty at Harvard and now at Stanford University, and as healthcare has shifted increasingly to the use of digital technologies for data capture and finer resolutions of genomic scale, Dr. Wall has innovated, adapted and deployed bioinformatic strategies to enable precise and personalized interpretation of high-resolution molecular and phenotypic data. Dr. Wall has pioneered the use of machine learning and artificial intelligence for fast, quantitative and mobile detection of neurodevelopmental disorders in children, as well as the use of machine learning systems on wearable devices, such as Google Glass, for real-time “exclinical” therapy. These same precision health approaches enable quantitative tracking of progress during treatment throughout an individual’s life, enabling big data generation of a type and scale never before possible, and have defined a new paradigm for behavioral detection and therapy that has won Dr. Wall several awards, including a spot in the top ten of the world’s top 30 autism researchers. Dr. Wall has acted as science advisor to several biotechnology and pharmaceutical companies, has created and advised on cutting-edge approaches to cloud computing, and has received numerous awards, including the Fred R. Cagle Award for Outstanding Achievement in Biology, the Vice Chancellor's Award for Research, three awards for excellence in teaching, the Harvard Medical School Leadership Award, and the Slifka/Ritvo Clinical Innovation in Autism Research Award for outstanding advancements in clinical translation.
He completed his PhD at the University of California, Berkeley and a National Science Foundation postdoctoral fellowship in Computational Genetics at Stanford University before joining the faculty at Harvard Medical School.

Professional Education

  • Fellow, Stanford University, Bioinformatics and Computational Genetics (2003)
  • Ph.D., University of California, Berkeley, Integrative Biology (2001)

Current Research and Scholarly Interests

Systems biology for design of clinical solutions that detect and treat disease

Clinical Trials

  • A Lead-in Study Evaluating Efficacy of GuessWhat Mobile App Therapy for Children With Autism (Not Recruiting)

    The following study aims to assess the efficacy, safety data, and best outcome measurements of the mobile game platform, GuessWhat, in delivering behavioral therapy to children with Autism Spectrum Disorder (ASD). GuessWhat is a mobile application (available for free for iOS and Android) which contains a suite of games: pro-social charades, emotion guessing, and quiz. Participant families will use their personal smartphones to download the app and play it with their child according to a predetermined regimen.

    Stanford is currently not accepting patients for this trial. For more information, please contact Kaiti Dunlap, MRes, 650-497-9214.

  • Examining the Efficacy of a Mobile Therapy for Children With Autism Spectrum Disorder (Not Recruiting)

    The purpose of this research is to study the effects of a novel artificial intelligence (AI) tool for automatic facial expression recognition that runs on Google Glass through an Android app to deliver social emotion cues to children with autism during social interactions. This novel device will use a camera, microphone, and head motion tracker to analyze the behavior of the subject during interactions with other people. The system is designed to give participants non-interruptive social cues in real time and will record social responses that can later be used to aid behavioral therapy. It is hypothesized that the system's ability to provide continuous behavioral therapy during social interactions will enable faster gains in social skills.

    Stanford is currently not accepting patients for this trial.

  • Investigation of Mechanisms of Action in Superpower Glass (Not Recruiting)

    The following study aims to understand the mechanism of action at work in a novel artificial intelligence (AI) tool that runs on Google Glass through an Android app to deliver social emotion cues to children with autism during social interactions. This study will examine 2 versions of software on the Google Glass-based wearable intervention system. Participants will receive 1 of the 2 versions of the software and use the device at home for 4 weeks. This novel device will use a camera, microphone, and head motion tracker to analyze the behavior of the subject during interactions with other people. The system is designed to give participants non-interruptive social cues in real time and will record social responses that can later be used to aid behavioral therapy. It is hypothesized that both mechanisms under investigation will contribute to social gains in children over the 4-week period of use.

    Stanford is currently not accepting patients for this trial. For more information, please contact Kaiti Dunlap, MRes, 650-497-9214.

  • Piloting a Mobile Game for Behavioral Therapy (Not Recruiting)

    The following study aims to understand the feasibility of using the mobile app and game, GuessWhat, to deliver behavioral therapy to children with autism. The GuessWhat app is a charades-style game that engages parent and child in fluid social interaction in which the parent must guess what the child is acting out based on the prompt shown on the phone screen. Participants will use their own personal phone to download the study app. The app will walk participants through a variety of charades-style games. The interactive games will be video recorded, and all data are transferred securely to the Wall Lab for analysis. This study is enrolling parents who are at least 18 years of age and have a child with ASD between 3 and 12 years old. Parents are asked to complete questionnaires before and after playing the GuessWhat game with their child 3-4 times per week for 4 weeks.

    Stanford is currently not accepting patients for this trial.

All Publications

  • Digitally Diagnosing Multiple Developmental Delays Using Crowdsourcing Fused With Machine Learning: Protocol for a Human-in-the-Loop Machine Learning Study. JMIR research protocols Jaiswal, A., Kruiper, R., Rasool, A., Nandkeolyar, A., Wall, D. P., Washington, P. 2024; 13: e52205


    A considerable number of minors in the United States are diagnosed with developmental or psychiatric conditions, potentially influenced by underdiagnosis factors such as cost, distance, and clinician availability. Despite the potential of digital phenotyping tools with machine learning (ML) approaches to expedite diagnoses and enhance diagnostic services for pediatric psychiatric conditions, existing methods face limitations because they use a limited set of social features for prediction tasks and focus on a single binary prediction, resulting in uncertain accuracies. This study aims to propose the development of a gamified web system for data collection, followed by a fusion of novel crowdsourcing algorithms with ML behavioral feature extraction approaches to simultaneously predict diagnoses of autism spectrum disorder and attention-deficit/hyperactivity disorder in a precise and specific manner. The proposed pipeline will consist of (1) gamified web applications to curate videos of social interactions adaptively based on the needs of the diagnostic system, (2) behavioral feature extraction techniques consisting of automated ML methods and novel crowdsourcing algorithms, and (3) the development of ML models that classify several conditions simultaneously and that adaptively request additional information based on uncertainties about the data. A preliminary version of the web interface has been implemented, and a prior feature selection method has highlighted a core set of behavioral features that can be targeted through the proposed gamified approach. The prospect for high reward stems from the possibility of creating the first artificial intelligence-powered tool that can identify complex social behaviors well enough to distinguish conditions with nuanced differentiators such as autism spectrum disorder and attention-deficit/hyperactivity disorder. International Registered Report Identifier (IRRID): PRR1-10.2196/52205.

    View details for DOI 10.2196/52205

    View details for PubMedID 38329783

  • Localizing unmapped sequences with families to validate the Telomere-to-Telomere assembly and identify new hotspots for genetic diversity. Genome research Chrisman, B., He, C., Jung, J. Y., Stockham, N., Paskov, K., Washington, P., Petereit, J., Wall, D. P. 2023


    Although it is ubiquitous in genomics, the current human reference genome (GRCh38) is incomplete: It is missing large sections of heterochromatic sequence, and as a singular, linear reference genome, it does not represent the full spectrum of human genetic diversity. To characterize gaps in GRCh38 and human genetic diversity, we developed an algorithm for sequence location approximation using nuclear families (ASLAN) to identify the region of origin of reads that do not align to GRCh38. Using unmapped reads and variant calls from whole-genome sequences (WGSs), ASLAN uses a maximum likelihood model to identify the most likely region of the genome that a subsequence belongs to given the distribution of the subsequence in the unmapped reads and phasings of families. Validating ASLAN on synthetic data and on reads from the alternative haplotypes in the decoy genome, ASLAN localizes >90% of 100-bp sequences with >92% accuracy and ∼1 Mb of resolution. We then ran ASLAN on 100-mers from unmapped reads from WGS from more than 700 families, and compared ASLAN localizations to alignment of the 100-mers to the recently released T2T-CHM13 assembly. We found that many unmapped reads in GRCh38 originate from telomeres and centromeres that are gaps in GRCh38. ASLAN localizations are in high concordance with T2T-CHM13 alignments, except in the centromeres of the acrocentric chromosomes. Comparing ASLAN localizations and T2T-CHM13 alignments, we identified sequences missing from T2T-CHM13 or sequences with high divergence from their aligned region in T2T-CHM13, highlighting new hotspots for genetic diversity.

    View details for DOI 10.1101/gr.277175.122

    View details for PubMedID 37879860

  • Identifying crossovers and shared genetic material in whole genome sequencing data from families. Genome research Paskov, K., Chrisman, B., Stockham, N., Washington, P. Y., Dunlap, K., Jung, J. Y., Wall, D. P. 2023


    Large, whole-genome sequencing (WGS) data sets containing families provide an important opportunity to identify crossovers and shared genetic material in siblings. However, the high variant calling error rates of WGS in some areas of the genome can result in spurious crossover calls, and the special inheritance status of the X Chromosome presents challenges. We have developed a hidden Markov model that addresses these issues by modeling the inheritance of variants in families in the presence of error-prone regions and inherited deletions. We call our method PhasingFamilies. We validate PhasingFamilies using the platinum genome family NA1281 (precision: 0.81; recall: 0.97), as well as simulated genomes with known crossover positions (precision: 0.93; recall: 0.92). Using 1925 quads from the Simons Simplex Collection, we found that PhasingFamilies resolves crossovers to a median resolution of 3527.5 bp. These crossovers recapitulate existing recombination rate maps, including for the X Chromosome; produce sibling pair IBD that matches expected distributions; and are validated by the haplotype estimation tool SHAPEIT. We provide an efficient, open-source implementation of PhasingFamilies that can be used to identify crossovers from family sequencing data.

    View details for DOI 10.1101/gr.277172.122

    View details for PubMedID 37879861

  • The contributions of rare inherited and polygenic risk to ASD in multiplex families. Proceedings of the National Academy of Sciences of the United States of America Cirnigliaro, M., Chang, T. S., Arteaga, S. A., Pérez-Cano, L., Ruzzo, E. K., Gordon, A., Bicks, L. K., Jung, J. Y., Lowe, J. K., Wall, D. P., Geschwind, D. H. 2023; 120 (31): e2215632120


    Autism spectrum disorder (ASD) has a complex genetic architecture involving contributions from both de novo and inherited variation. Few studies have been designed to address the role of rare inherited variation or its interaction with common polygenic risk in ASD. Here, we performed whole-genome sequencing of the largest cohort of multiplex families to date, consisting of 4,551 individuals in 1,004 families having two or more autistic children. Using this study design, we identify seven previously unrecognized ASD risk genes supported by a majority of rare inherited variants, finding support for a total of 74 genes in our cohort and a total of 152 genes after combined analysis with other studies. Autistic children from multiplex families demonstrate an increased burden of rare inherited protein-truncating variants in known ASD risk genes. We also find that ASD polygenic score (PGS) is overtransmitted from nonautistic parents to autistic children who also harbor rare inherited variants, consistent with combinatorial effects in the offspring, which may explain the reduced penetrance of these rare variants in parents. We also observe that in addition to social dysfunction, language delay is associated with ASD PGS overtransmission. These results are consistent with an additive complex genetic risk architecture of ASD involving rare and common variation and further suggest that language delay is a core biological feature of ASD.

    View details for DOI 10.1073/pnas.2215632120

    View details for PubMedID 37506195

  • Topic modeling for multi-omic integration in the human gut microbiome and implications for Autism. Scientific reports Tataru, C., Peras, M., Rutherford, E., Dunlap, K., Yin, X., Chrisman, B. S., DeSantis, T. Z., Wall, D. P., Iwai, S., David, M. M. 2023; 13 (1): 11353


    While healthy gut microbiomes are critical to human health, pertinent microbial processes remain largely undefined, partially due to differential bias among profiling techniques. By simultaneously integrating multiple profiling methods, multi-omic analysis can define generalizable microbial processes, and is especially useful in understanding complex conditions such as Autism. Challenges with integrating heterogeneous data produced by multiple profiling methods can be overcome using Latent Dirichlet Allocation (LDA), a promising natural language processing technique that identifies topics in heterogeneous documents. In this study, we apply LDA to multi-omic microbial data (16S rRNA amplicon, shotgun metagenomic, shotgun metatranscriptomic, and untargeted metabolomic profiling) from the stool of 81 children with and without Autism. We identify topics, or microbial processes, that summarize complex phenomena occurring within gut microbial communities. We then subset stool samples by topic distribution, and identify metabolites, specifically neurotransmitter precursors and fatty acid derivatives, that differ significantly between children with and without Autism. We identify clusters of topics, deemed "cross-omic topics", which we hypothesize are representative of generalizable microbial processes observable regardless of profiling method. Interpreting topics, we find each represents a particular diet, and we heuristically label each cross-omic topic as: healthy/general function, age-associated function, transcriptional regulation, and opportunistic pathogenesis.

    View details for DOI 10.1038/s41598-023-38228-0

    View details for PubMedID 37443184

  • Multi-level analysis of the gut-brain axis shows autism spectrum disorder-associated molecular and microbial profiles. Nature neuroscience Morton, J. T., Jin, D. M., Mills, R. H., Shao, Y., Rahman, G., McDonald, D., Zhu, Q., Balaban, M., Jiang, Y., Cantrell, K., Gonzalez, A., Carmel, J., Frankiensztajn, L. M., Martin-Brevet, S., Berding, K., Needham, B. D., Zurita, M. F., David, M., Averina, O. V., Kovtun, A. S., Noto, A., Mussap, M., Wang, M., Frank, D. N., Li, E., Zhou, W., Fanos, V., Danilenko, V. N., Wall, D. P., Cárdenas, P., Baldeón, M. E., Jacquemont, S., Koren, O., Elliott, E., Xavier, R. J., Mazmanian, S. K., Knight, R., Gilbert, J. A., Donovan, S. M., Lawley, T. D., Carpenter, B., Bonneau, R., Taroncher-Oldenburg, G. 2023


    Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by heterogeneous cognitive, behavioral and communication impairments. Disruption of the gut-brain axis (GBA) has been implicated in ASD although with limited reproducibility across studies. In this study, we developed a Bayesian differential ranking algorithm to identify ASD-associated molecular and taxa profiles across 10 cross-sectional microbiome datasets and 15 other datasets, including dietary patterns, metabolomics, cytokine profiles and human brain gene expression profiles. We found a functional architecture along the GBA that correlates with heterogeneity of ASD phenotypes, and it is characterized by ASD-associated amino acid, carbohydrate and lipid profiles predominantly encoded by microbial species in the genera Prevotella, Bifidobacterium, Desulfovibrio and Bacteroides and correlates with brain gene expression changes, restrictive dietary patterns and pro-inflammatory cytokine profiles. The functional architecture revealed in age-matched and sex-matched cohorts is not present in sibling-matched cohorts. We also show a strong association between temporal changes in microbiome composition and ASD phenotypes. In summary, we propose a framework to leverage multi-omic datasets from well-defined cohorts and investigate how the GBA influences ASD.

    View details for DOI 10.1038/s41593-023-01361-0

    View details for PubMedID 37365313

    View details for PubMedCentralID 8900942

  • A Review of and Roadmap for Data Science and Machine Learning for the Neuropsychiatric Phenotype of Autism. Annual review of biomedical data science Washington, P., Wall, D. P. 2023


    Autism spectrum disorder (autism) is a neurodevelopmental delay that affects at least 1 in 44 children. Like many neurological disorder phenotypes, the diagnostic features are observable, can be tracked over time, and can be managed or even eliminated through proper therapy and treatments. However, there are major bottlenecks in the diagnostic, therapeutic, and longitudinal tracking pipelines for autism and related neurodevelopmental delays, creating an opportunity for novel data science solutions to augment and transform existing workflows and provide increased access to services for affected families. Several efforts previously conducted by a multitude of research labs have spawned great progress toward improved digital diagnostics and digital therapies for children with autism. We review the literature on digital health methods for autism behavior quantification and beneficial therapies using data science. We describe both case-control studies and classification systems for digital phenotyping. We then discuss digital diagnostics and therapeutics that integrate machine learning models of autism-related behaviors, including the factors that must be addressed for translational use. Finally, we describe ongoing challenges and potential opportunities for the field of autism data science. Given the heterogeneous nature of autism and the complexities of the relevant behaviors, this review contains insights that are relevant to neurological behavior analysis and digital psychiatry more broadly. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 6 is August 2023.

    View details for DOI 10.1146/annurev-biodatasci-020722-125454

    View details for PubMedID 37137169

  • Racial and Ethnic Disparities in Geographic Access to Autism Resources Across the US. JAMA network open Liu, B. M., Paskov, K., Kent, J., McNealis, M., Sutaria, S., Dods, O., Harjadi, C., Stockham, N., Ostrovsky, A., Wall, D. P. 2023; 6 (1): e2251182


    Importance: While research has identified racial and ethnic disparities in access to autism services, the size, extent, and specific locations of these access gaps have not yet been characterized on a national scale. Mapping comprehensive national listings of autism health care services together with the prevalence of autistic children of various races and ethnicities and evaluating geographic regions defined by localized commuting patterns may help to identify areas within the US where families who belong to minoritized racial and ethnic groups have disproportionally lower access to services. Objective: To evaluate differences in access to autism health care services among autistic children of various races and ethnicities within precisely defined geographic regions encompassing all serviceable areas within the US. Design, Setting, and Participants: This population-based cross-sectional study was conducted from October 5, 2021, to June 3, 2022, and involved 530 965 autistic children in kindergarten through grade 12. Core-based statistical areas (CBSAs; defined as areas containing a city and its surrounding commuter region), the Civil Rights Data Collection (CRDC) data set, and 51 071 autism resources (collected from October 1, 2015, to December 18, 2022) geographically distributed into 912 CBSAs were combined and analyzed to understand variation in access to autism health care services among autistic children of different races and ethnicities. Six racial and ethnic categories (American Indian or Alaska Native, Asian, Black or African American, Hispanic or Latino, Native Hawaiian or other Pacific Islander, and White) assigned by the US Department of Education were included in the analysis. Main Outcomes and Measures: A regularized least-squares regression analysis was used to measure differences in nationwide resource allocation between racial and ethnic groups. The number of autism resources allocated per autistic child was estimated based on the child's racial and ethnic group. To evaluate how the CBSA population size may have altered the results, the least-squares regression analysis was run on CBSAs divided into metropolitan (>50 000 inhabitants) and micropolitan (10 000-50 000 inhabitants) groups. A Mann-Whitney U test was used to compare the model estimated ratio of autism resources to autistic children among specific racial and ethnic groups comprising the proportions of autistic children in each CBSA. Results: Among 530 965 autistic children aged 5 to 18 years, 83.9% were male and 16.1% were female; 0.7% of children were American Indian or Alaska Native, 5.9% were Asian, 14.3% were Black or African American, 22.9% were Hispanic or Latino, 0.2% were Native Hawaiian or other Pacific Islander, 51.7% were White, and 4.2% were of 2 or more races and/or ethnicities. At a national scale, American Indian or Alaska Native autistic children (beta=0; 95% CI, 0-0; P=.01) and Hispanic autistic children (beta=0.02; 95% CI, 0-0.06; P=.02) had significant disparities in access to autism resources in comparison with White autistic children. When evaluating the proportion of autistic children in each racial and ethnic group, areas in which Black autistic children (>50% of the population: beta=0.05; <50% of the population: beta=0.07; P=.002) or Hispanic autistic children (>50% of the population: beta=0.04; <50% of the population: beta=0.07; P<.001) comprised greater than 50% of the total population of autistic children had significantly fewer resources than areas in which Black or Hispanic autistic children comprised less than 50% of the total population. Comparing metropolitan vs micropolitan CBSAs revealed that in micropolitan CBSAs, Black autistic children (beta=0; 95% CI, 0-0; P<.001) and Hispanic autistic children (beta=0; 95% CI, 0-0.02; P<.001) had the greatest disparities in access to autism resources compared with White autistic children. In metropolitan CBSAs, American Indian or Alaska Native autistic children (beta=0; 95% CI, 0-0; P=.005) and Hispanic autistic children (beta=0.01; 95% CI, 0-0.06; P=.02) had the greatest disparities compared with White autistic children. Conclusions and Relevance: In this study, autistic children from several minoritized racial and ethnic groups, including Black and Hispanic autistic children, had access to significantly fewer autism resources than White autistic children in the US. This study pinpointed the specific geographic regions with the greatest disparities, where increases in the number and types of treatment options are warranted. These findings suggest that a prioritized response strategy to address these racial and ethnic disparities is needed.

    View details for DOI 10.1001/jamanetworkopen.2022.51182

    View details for PubMedID 36689227

  • TOWARDS ETHICAL BIOMEDICAL INFORMATICS: LEARNING FROM OLELO NOEAU, HAWAIIAN PROVERBS. Washington, P. Y., Puniwai, N., Kamaka, M., Gursoy, G., Tatonetti, N., Brenner, S. E., Wall, D. P., Altman, R. B., Hunter, L., Ritchie, M. D., Murray, T., Klein, T. E. WORLD SCIENTIFIC PUBL CO PTE LTD. 2023: 461-471

  • Transmission dynamics of human herpesvirus 6A, 6B and 7 from whole genome sequences of families. Virology journal Chrisman, B. S., He, C., Jung, J. Y., Stockham, N., Paskov, K., Wall, D. P. 2022; 19 (1): 225


    While hundreds of thousands of human whole genome sequences (WGS) have been collected in the effort to better understand genetic determinants of disease, these whole genome sequences have less frequently been used to study another major determinant of human health: the human virome. Using the unmapped reads from WGS of over 1000 families, we present insights into the human blood DNA virome, focusing particularly on human herpesvirus (HHV) 6A, 6B, and 7. In addition to extensively cataloguing the viruses detected in WGS of human whole blood and lymphoblastoid cell lines, we use the family structure of our dataset to show that household drives transmission of several viruses, and identify the Mendelian inheritance patterns characteristic of inherited chromosomally integrated human herpesvirus 6 (iciHHV-6). Consistent with prior studies, we find that 0.6% of our dataset's population has iciHHV, and we locate candidate integration sequences for these cases. We document genetic diversity within exogenous and integrated HHV species and within integration sites of HHV-6. Finally, in the first observation of its kind, we present evidence that suggests widespread de novo HHV-6B integration and HHV-7 integration and reactivation in lymphoblastoid cell lines. These findings show that the unmapped read space of WGS is a promising source of data for virology research.

    View details for DOI 10.1186/s12985-022-01941-9

    View details for PubMedID 36566197

    View details for PubMedCentralID PMC9789512

  • An Introduction to Artificial Intelligence in Developmental and Behavioral Pediatrics. Journal of developmental and behavioral pediatrics : JDBP Aylward, B. S., Abbas, H., Taraman, S., Salomon, C., Gal-Szabo, D., Kraft, C., Ehwerhemuepha, L., Chang, A., Wall, D. P. 2022


    Technological breakthroughs, together with the rapid growth of medical information and improved data connectivity, are creating dramatic shifts in the health care landscape, including the field of developmental and behavioral pediatrics. While medical information took an estimated 50 years to double in 1950, by 2020, it was projected to double every 73 days. Artificial intelligence (AI)-powered health technologies, once considered theoretical or research-exclusive concepts, are increasingly being granted regulatory approval and integrated into clinical care. In the United States, the Food and Drug Administration has cleared or approved over 160 health-related AI-based devices to date. These trends are only likely to accelerate as economic investment in AI health care outstrips investment in other sectors. The exponential increase in peer-reviewed AI-focused health care publications year over year highlights the speed of growth in this sector. As health care moves toward an era of intelligent technology powered by rich medical information, pediatricians will increasingly be asked to engage with tools and systems underpinned by AI. However, medical students and practicing clinicians receive insufficient training and lack preparedness for transitioning into a more AI-informed future. This article provides a brief primer on AI in health care. Underlying AI principles and key performance metrics are described, and the clinical potential of AI-driven technology together with potential pitfalls is explored within the developmental and behavioral pediatric health context.

    View details for DOI 10.1097/DBP.0000000000001149

    View details for PubMedID 36730317

  • Multi-angle meta-analysis of the gut microbiome in Autism Spectrum Disorder: a step toward understanding patient subgroups. Scientific reports West, K. A., Yin, X., Rutherford, E. M., Wee, B., Choi, J., Chrisman, B. S., Dunlap, K. L., Hannibal, R. L., Hartono, W., Lin, M., Raack, E., Sabino, K., Wu, Y., Wall, D. P., David, M. M., Dabbagh, K., DeSantis, T. Z., Iwai, S. 2022; 12 (1): 17034


    Observational studies have shown that the composition of the human gut microbiome in children diagnosed with Autism Spectrum Disorder (ASD) differs significantly from that of their neurotypical (NT) counterparts. Thus far, reported ASD-specific microbiome signatures have been inconsistent. To uncover reproducible signatures, we compiled 10 publicly available raw amplicon and metagenomic sequencing datasets alongside new data generated from an internal cohort (the largest ASD cohort to date), unified them with standardized pre-processing methods, and conducted a comprehensive meta-analysis of all taxa and variables detected across multiple studies. By screening metadata to test associations between the microbiome and 52 variables in multiple patient subsets and across multiple datasets, we determined that differentially abundant taxa in ASD versus NT children were dependent upon age, sex, and bowel function, thus marking these variables as potential confounders in case-control ASD studies. Several taxa, including the strains Bacteroides stercoris t__190463 and Clostridium M bolteae t__180407, and the species Granulicatella elegans and Massilioclostridium coli, exhibited differential abundance in ASD compared to NT children only after subjects with bowel dysfunction were removed. Adjusting for age, sex and bowel function resulted in adding or removing significantly differentially abundant taxa in ASD-diagnosed individuals, emphasizing the importance of collecting and controlling for these metadata. We have performed the largest (n=690) and most comprehensive systematic analysis of ASD gut microbiome data to date. Our study demonstrated the importance of accounting for confounding variables when designing statistical comparative analyses of ASD- and NT-associated gut bacterial profiles. 
    Mitigating these confounders identified robust microbial signatures across cohorts, signifying the importance of accounting for these factors in comparative analyses of ASD and NT-associated gut profiles. Such studies will advance the understanding of different patient groups to deliver appropriate therapeutics by identifying microbiome traits germane to the specific ASD phenotype.

    View details for DOI 10.1038/s41598-022-21327-9

    View details for PubMedID 36220843

  • INTRODUCING KIDSFIRST: A DIVERSE, LONGITUDINAL PHENOTYPIC DATABASE FOR FAMILIES WITH ASD McNealis, M., Kent, J., Dunlap, K., Abbeduto, L., Dimitropoulos, A., Dombrose, F., Hardan, A., Lane, J., Phillips, B., Rodriguez, N., Wall, D. P. ELSEVIER SCIENCE INC. 2022: S245-S246
  • Machine learning models using mobile game play accurately classify children with autism. Intelligence-based medicine Deveau, N., Washington, P., Leblanc, E., Husic, A., Dunlap, K., Penev, Y., Kline, A., Mutlu, O. C., Wall, D. P. 2022: 100057


    Digitally delivered healthcare is well suited to address current inequities in the delivery of care due to barriers to access to healthcare facilities. As the COVID-19 pandemic phases out, we have a unique opportunity to capitalize on the current familiarity with telemedicine approaches and continue to advocate for mainstream adoption of remote care delivery. In this paper, we specifically focus on the ability of GuessWhat?, a smartphone-based charades-style gamified therapeutic intervention for autism spectrum disorder (ASD), to generate a signal that distinguishes children with ASD from neurotypical (NT) children. We demonstrate the feasibility of using "in-the-wild", naturalistic gameplay data to distinguish between children with ASD and NT children by training a random forest classifier to discern the two classes (AU-ROC = 0.745, recall = 0.769). This performance demonstrates the potential for GuessWhat? to facilitate screening for ASD in historically difficult-to-reach communities. To further examine this potential, future work should expand the size of the training sample and interrogate differences in predictive ability by demographic.

    View details for DOI 10.1016/j.ibmed.2022.100057

    View details for PubMedID 36035501
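
    The two figures reported for the GuessWhat? classifier above, AU-ROC and recall, can be computed from held-out labels and classifier scores. The following is a minimal Python sketch, not the authors' evaluation code; the function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0);
    ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def recall(labels, scores, threshold=0.5):
    """Fraction of true positives whose score reaches the threshold."""
    positive_scores = [s for y, s in zip(labels, scores) if y == 1]
    if not positive_scores:
        raise ValueError("need at least one positive")
    detected = sum(1 for s in positive_scores if s >= threshold)
    return detected / len(positive_scores)


# Toy data (hypothetical): 1 = ASD, 0 = NT.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)    # 3 of 4 positive/negative pairs ordered correctly
toy_recall = recall(toy_labels, toy_scores)  # 1 of 2 positives at or above the threshold
```

    The pairwise form is quadratic in the number of samples, which is acceptable for small evaluation sets; a rank-based formulation would scale better.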

  • Training and Profiling a Pediatric Facial Expression Classifier for Children on Mobile Devices: Machine Learning Study. JMIR formative research Banerjee, A., Mutlu, O. C., Kline, A., Washington, P., Wall, D., Surabhi, S. 2022


    BACKGROUND: Implementing automated facial expression recognition on mobile devices could provide an accessible diagnostic and therapeutic tool for those who struggle to recognize facial expressions, including children with developmental behavioral conditions such as autism. Although recent advances have been made in building more accurate facial expression classifiers for children, existing models are too computationally expensive to be deployed on smartphones. OBJECTIVE: In this study, we explored the deployment of several state-of-the-art facial expression classifiers designed for usage on mobile devices. We use various post-training optimization techniques for both classification performance and efficiency on a Motorola Moto G6 phone. We additionally explore the importance of training our classifiers on children compared to adults and evaluate the performance of our models against different ethnic groups. METHODS: We collected images from twelve public datasets and used video frames crowdsourced from the GuessWhat app to train our classifiers. All images were annotated for 7 expressions: neutral, fear, happiness, sadness, surprise, anger, and disgust. We tested three copies for each of five different convolutional neural network architectures: MobileNetV3-Small 1.0x, MobileNetV2 1.0x, EfficientNetB0, MobileNetV3-Large 1.0x, and NASNetMobile. The first copy was trained on images of children, the second on images of adults, and the third on all datasets. We evaluated each model against the Child Affective Facial Expression (CAFE) set, both in its entirety and by ethnicity. 
We then performed weight pruning, weight clustering, and quantization-aware training when possible and profiled the performance of each model on the Moto G6. RESULTS: Our best model, a MobileNetV3-Large network pre-trained on ImageNet, achieved 65.78% balanced accuracy and 65.31% F1-score on CAFE while achieving a 90-millisecond inference latency on a Motorola Moto G6 phone when trained on all data. This balanced accuracy is only 1.12% lower than the current state of the art for CAFE, a model that has 13.91x more parameters and was unable to run on the Moto G6 due to its size, even when fully optimized. When trained solely on children, this model achieved 60.57% balanced accuracy and 60.29% F1-score, while when trained only on adults the model achieved 53.36% balanced accuracy and 53.10% F1-score. Although the MobileNetV3-Large trained on all datasets achieved nearly 60% F1-score across all ethnicities, South Asian and African American children scored as much as 11.56% lower in balanced accuracy and 11.25% lower in F1-score than other groups. CONCLUSIONS: This work demonstrates that with specialized design and optimization techniques, facial expression classifiers can become lightweight enough to run on mobile devices and still achieve state-of-the-art performance. This study also shows that there is potentially a "data shift" phenomenon between facial expressions of children compared to adults, with our classifiers performing much better when trained on children. In addition, we find that our models perform significantly worse for certain underrepresented ethnic groups, such as South Asian and African American children, than for groups such as European Caucasian children, despite having a similar quality of data. The models developed in this study can be integrated into mobile health therapies to help diagnose ASD and to provide targeted therapeutic treatment to children.

    View details for DOI 10.2196/39917

    View details for PubMedID 35962462
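
    Balanced accuracy, the headline metric in the abstract above, is the unweighted mean of per-class recall, so a classifier cannot score well by favoring the most common expression. The sketch below is a generic Python illustration under that standard definition, not the study's evaluation code; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Toy, imbalanced example (hypothetical labels): 4 "happy" frames, 2 "sad" frames.
toy_true = ["happy", "happy", "happy", "happy", "sad", "sad"]
toy_pred = ["happy", "happy", "happy", "sad", "sad", "happy"]

toy_balanced = balanced_accuracy(toy_true, toy_pred)  # (3/4 + 1/2) / 2
toy_plain = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)  # 4/6
```

    On the toy data, plain accuracy (4/6) exceeds balanced accuracy (0.625) because the majority class is recognized more reliably than the minority class.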

  • The human "contaminome": bacterial, viral, and computational contamination in whole genome sequences from 1000 families. Scientific reports Chrisman, B., He, C., Jung, J., Stockham, N., Paskov, K., Washington, P., Wall, D. P. 2022; 12 (1): 9863


    The unmapped readspace of whole genome sequencing data tends to be large but is often ignored. We posit that it contains valuable signals of both human infection and contamination. Using unmapped and poorly aligned reads from whole genome sequences (WGS) of over 1000 families and nearly 5000 individuals, we present insights into common viral, bacterial, and computational contamination that plagues whole genome sequencing studies. We present several notable results: (1) In addition to known contaminants such as Epstein-Barr virus and phiX, sequences from whole blood and lymphocyte cell lines contain many other contaminants, likely originating from storage, prep, and sequencing pipelines. (2) The sequencing plate and biological source of a sample strongly influence its contamination profile. (3) Y-chromosome fragments not on the human reference genome commonly mismap to bacterial reference genomes. Both experiment-derived and computational contamination are prominent in next-generation sequencing data. Such contamination can compromise results from WGS as well as metagenomics studies, and standard protocols for identifying and removing contamination should be developed to ensure the fidelity of sequencing-based studies.

    View details for DOI 10.1038/s41598-022-13269-z

    View details for PubMedID 35701436

  • Causal Modeling to Mitigate Selection Bias and Unmeasured Confounding in Internet-Based Epidemiology of COVID-19: Model Development and Validation. JMIR public health and surveillance Stockham, N., Washington, P., Chrisman, B., Paskov, K., Jung, J. Y., Wall, D. P. 2022


    Selection bias and unmeasured confounding are fundamental problems in epidemiology that threaten a study's internal and external validity. These phenomena are particularly dangerous in internet-based public health surveillance where traditional mitigation and adjustment methods are inapplicable, unavailable, or out of date. Recent theoretical advances in causal modeling can mitigate these threats, but these innovations have not been widely deployed in the epidemiological community. The purpose of our paper is to demonstrate the practical utility of causal modeling to both detect unmeasured confounding and selection bias and also guide model selection to minimize bias. We implement this approach in an applied epidemiological study of COVID-19 cumulative infection rate in the New York City Spring 2020 epidemic. We collected primary data from Qualtrics surveys of Amazon Mechanical Turk crowd workers residing in New Jersey and New York state across two sampling periods: April 11-14th and May 8-11th 2020. The surveys queried the subjects on household health status and demographic characteristics. We constructed a set of possible causal models of household infection and survey selection mechanisms and ranked them by compatibility with the collected survey data. The most compatible causal model was then used to estimate the cumulative infection rate in each survey period. There were 527 and 513 responses collected for each period. Response demographics were highly skewed toward younger age in both survey periods. 
Despite the extremely strong relationship between age and COVID-19 symptoms, we recovered minimally biased estimates of cumulative infection rate using only primary data and the most compatible causal model, with a relative bias of +3.8% and -1.9% from the reported cumulative infection rate for the first and second survey periods. We successfully recovered accurate estimates of cumulative infection rate from an internet-based crowd sourced sample despite considerable selection bias and unmeasured confounding in the primary data. This implementation demonstrates how simple applications of structural causal modeling can be effectively used to determine falsifiable model conditions, detect selection bias and confounding factors, and minimize estimate bias through model selection in a novel epidemiological context.

    View details for DOI 10.2196/31306

    View details for PubMedID 35605128

  • Evaluation of an artificial intelligence-based medical device for diagnosis of autism spectrum disorder. NPJ digital medicine Megerian, J. T., Dey, S., Melmed, R. D., Coury, D. L., Lerner, M., Nicholls, C. J., Sohl, K., Rouhbakhsh, R., Narasimhan, A., Romain, J., Golla, S., Shareef, S., Ostrovsky, A., Shannon, J., Kraft, C., Liu-Mayo, S., Abbas, H., Gal-Szabo, D. E., Wall, D. P., Taraman, S. 2022; 5 (1): 57


    Autism spectrum disorder (ASD) can be reliably diagnosed at 18 months, yet significant diagnostic delays persist in the United States. This double-blinded, multi-site, prospective, active comparator cohort study tested the accuracy of an artificial intelligence-based Software as a Medical Device designed to aid primary care healthcare providers (HCPs) in diagnosing ASD. The Device combines behavioral features from three distinct inputs (a caregiver questionnaire, analysis of two short home videos, and an HCP questionnaire) in a gradient boosted decision tree machine learning algorithm to produce either an ASD positive, ASD negative, or indeterminate output. This study compared Device outputs to diagnostic agreement by two or more independent specialists in a cohort of 18-72-month-olds with developmental delay concerns (425 study completers, 36% female, 29% ASD prevalence). Device output PPV for all study completers was 80.8% (95% confidence intervals (CI), 70.3%-88.8%) and NPV was 98.3% (90.6%-100%). For the 31.8% of participants who received a determinate output (ASD positive or negative) Device sensitivity was 98.4% (91.6%-100%) and specificity was 78.9% (67.6%-87.7%). The Device's indeterminate output acts as a risk control measure when inputs are insufficiently granular to make a determinate recommendation with confidence. If this risk control measure were removed, the sensitivity for all study completers would fall to 51.6% (63/122) (95% CI 42.4%, 60.8%), and specificity would fall to 18.5% (56/303) (95% CI 14.3%, 23.3%). Among participants for whom the Device abstained from providing a result, specialists identified that 91% had one or more complex neurodevelopmental disorders. No significant differences in Device performance were found across participants' sex, race/ethnicity, income, or education level. For nearly a third of this primary care sample, the Device enabled timely diagnostic evaluation with a high degree of accuracy. 
The Device shows promise to significantly increase the number of children able to be diagnosed with ASD in a primary care setting, potentially facilitating earlier intervention and more efficient use of specialist resources.

    View details for DOI 10.1038/s41746-022-00598-6

    View details for PubMedID 35513550
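
    The accuracy figures in the Device evaluation above (sensitivity, specificity, PPV, NPV) all derive from a 2x2 table of Device output against specialist diagnosis. The sketch below is a minimal Python illustration of those definitions; the class and function names and the example counts are hypothetical. Only the two fractions at the end (63/122 and 56/303) are taken from the abstract itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # Device positive, specialists positive
    fp: int  # Device positive, specialists negative
    tn: int  # Device negative, specialists negative
    fn: int  # Device negative, specialists positive


def diagnostic_metrics(c):
    """Standard diagnostic accuracy measures from a 2x2 table."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = diagnostic_metrics(ConfusionCounts(tp=40, fp=10, tn=45, fn=5))

# Fractions quoted in the abstract for the scenario in which the
# indeterminate output is removed as a risk control.
sensitivity_all = 63 / 122  # reported as 51.6%
specificity_all = 56 / 303  # reported as 18.5%
```

    PPV and NPV depend on prevalence in the study sample (29% here), so they would not transfer unchanged to a population with a different ASD base rate.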

  • Classifying Autism From Crowdsourced Semistructured Speech Recordings: Machine Learning Model Comparison Study. JMIR pediatrics and parenting Chi, N. A., Washington, P., Kline, A., Husic, A., Hou, C., He, C., Dunlap, K., Wall, D. P. 2022; 5 (2): e35406


    BACKGROUND: Autism spectrum disorder (ASD) is a neurodevelopmental disorder that results in altered behavior, social development, and communication patterns. In recent years, autism prevalence has tripled, with 1 in 44 children now affected. Given that traditional diagnosis is a lengthy, labor-intensive process that requires the work of trained physicians, significant attention has been given to developing systems that automatically detect autism. We work toward this goal by analyzing audio data, as prosody abnormalities are a signal of autism, with affected children displaying speech idiosyncrasies such as echolalia, monotonous intonation, atypical pitch, and irregular linguistic stress patterns. OBJECTIVE: We aimed to test the ability of machine learning approaches to aid in detection of autism in self-recorded speech audio captured from children with ASD and neurotypical (NT) children in their home environments. METHODS: We considered three methods to detect autism in child speech: (1) random forests trained on extracted audio features (including Mel-frequency cepstral coefficients); (2) convolutional neural networks trained on spectrograms; and (3) fine-tuned wav2vec 2.0, a state-of-the-art transformer-based speech recognition model. We trained our classifiers on our novel data set of cellphone-recorded child speech audio curated from the Guess What? mobile game, an app designed to crowdsource videos of children with ASD and NT children in a natural home environment. RESULTS: The random forest classifier achieved 70% accuracy, the fine-tuned wav2vec 2.0 model achieved 77% accuracy, and the convolutional neural network achieved 79% accuracy when classifying children's audio as either ASD or NT. 
We used 5-fold cross-validation to evaluate model performance. CONCLUSIONS: Our models were able to predict autism status when trained on a varied selection of home audio clips with inconsistent recording qualities, which may be more representative of real-world conditions. The results demonstrate that machine learning methods offer promise in detecting autism automatically from speech without specialized equipment.

    View details for DOI 10.2196/35406

    View details for PubMedID 35436234
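
    The speech study above reports that 5-fold cross-validation was used but does not describe how the folds were built. As a rough illustration of the general technique only, the Python sketch below partitions sample indices into k disjoint test folds; the function name, the shuffling seed, and the sample count are assumptions, and it does not group clips by child, which a real evaluation may need in order to avoid leakage between folds.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train_indices, test_indices) pairs covering every sample once as test."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(
            idx
            for fold_id, fold in enumerate(folds)
            if fold_id != held_out
            for idx in fold
        )
        splits.append((train, test))
    return splits


# Hypothetical data set of 23 audio clips split into 5 folds.
splits = k_fold_indices(23, k=5, seed=0)
fold_sizes = [len(test) for _, test in splits]
```

    Each model would be trained on the `train` indices and scored on the matching `test` indices, with the k scores averaged to give the reported accuracy.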

  • Improved Digital Therapy for Developmental Pediatrics Using Domain-Specific Artificial Intelligence: Machine Learning Study. JMIR pediatrics and parenting Washington, P., Kalantarian, H., Kent, J., Husic, A., Kline, A., Leblanc, E., Hou, C., Mutlu, O. C., Dunlap, K., Penev, Y., Varma, M., Stockham, N. T., Chrisman, B., Paskov, K., Sun, M. W., Jung, J. Y., Voss, C., Haber, N., Wall, D. P. 2022; 5 (2): e26760


    Automated emotion classification could aid those who struggle to recognize emotions, including children with developmental behavioral conditions such as autism. However, most computer vision emotion recognition models are trained on adult emotion and therefore underperform when applied to child faces. We designed a strategy to gamify the collection and labeling of child emotion-enriched images to boost the performance of automatic child emotion recognition models to a level closer to what will be needed for digital health care approaches. We leveraged our prototype therapeutic smartphone game, GuessWhat, which was designed in large part for children with developmental and behavioral conditions, to gamify the secure collection of video data of children expressing a variety of emotions prompted by the game. Independently, we created a secure web interface to gamify the human labeling effort, called HollywoodSquares, tailored for use by any qualified labeler. We gathered and labeled 2155 videos, 39,968 emotion frames, and 106,001 labels on all images. With this drastically expanded pediatric emotion-centric database (>30 times larger than existing public pediatric emotion data sets), we trained a convolutional neural network (CNN) computer vision classifier of happy, sad, surprised, fearful, angry, disgust, and neutral expressions evoked by children. The classifier achieved a 66.9% balanced accuracy and 67.4% F1-score on the entirety of the Child Affective Facial Expression (CAFE) data set as well as a 79.1% balanced accuracy and 78% F1-score on CAFE Subset A, a subset containing at least 60% human agreement on emotion labels. 
This performance is at least 10% higher than all previously developed classifiers evaluated against CAFE, the best of which reached a 56% balanced accuracy even when combining "anger" and "disgust" into a single class. This work validates that mobile games designed for pediatric therapies can generate high volumes of domain-relevant data sets to train state-of-the-art classifiers to perform tasks helpful to precision health efforts.

    View details for DOI 10.2196/26760

    View details for PubMedID 35394438

  • Identification of Social Engagement Indicators Associated With Autism Spectrum Disorder Using a Game-Based Mobile App: Comparative Study of Gaze Fixation and Visual Scanning Methods. Journal of medical Internet research Varma, M., Washington, P., Chrisman, B., Kline, A., Leblanc, E., Paskov, K., Stockham, N., Jung, J., Sun, M. W., Wall, D. P. 2022; 24 (2): e31830


    BACKGROUND: Autism spectrum disorder (ASD) is a widespread neurodevelopmental condition with a range of potential causes and symptoms. Standard diagnostic mechanisms for ASD, which involve lengthy parent questionnaires and clinical observation, often result in long waiting times for results. Recent advances in computer vision and mobile technology hold potential for speeding up the diagnostic process by enabling computational analysis of behavioral and social impairments from home videos. Such techniques can improve objectivity and contribute quantitatively to the diagnostic process. OBJECTIVE: In this work, we evaluate whether home videos collected from a game-based mobile app can be used to provide diagnostic insights into ASD. To the best of our knowledge, this is the first study attempting to identify potential social indicators of ASD from mobile phone videos without the use of eye-tracking hardware, manual annotations, and structured scenarios or clinical environments. METHODS: Here, we used a mobile health app to collect over 11 hours of video footage depicting 95 children engaged in gameplay in a natural home environment. We used automated data set annotations to analyze two social indicators that have previously been shown to differ between children with ASD and their neurotypical (NT) peers: (1) gaze fixation patterns, which represent regions of an individual's visual focus and (2) visual scanning methods, which refer to the ways in which individuals scan their surrounding environment. We compared the gaze fixation and visual scanning methods used by children during a 90-second gameplay video to identify statistically significant differences between the 2 cohorts; we then trained a long short-term memory (LSTM) neural network to determine if gaze indicators could be predictive of ASD. RESULTS: Our results show that gaze fixation patterns differ between the 2 cohorts; specifically, we could identify 1 statistically significant region of fixation (P<.001). 
In addition, we demonstrate that there are unique visual scanning patterns that exist for individuals with ASD when compared to NT children (P<.001). A deep learning model trained on coarse gaze fixation annotations demonstrates mild predictive power in identifying ASD. CONCLUSIONS: Ultimately, our study demonstrates that heterogeneous video data sets collected from mobile devices hold potential for quantifying visual patterns and providing insights into ASD. We show the importance of automated labeling techniques in generating large-scale data sets while simultaneously preserving the privacy of participants, and we demonstrate that specific social engagement indicators associated with ASD can be identified and characterized using such data.

    View details for DOI 10.2196/31830

    View details for PubMedID 35166683

  • A Method for Localizing Non-Reference Sequences to the Human Genome. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Chrisman, B. S., Paskov, K. M., He, C., Jung, J., Stockham, N., Washington, P. Y., Wall, D. P. 2022; 27: 313-324


    As the last decade of human genomics research begins to bear the fruit of advancements in precision medicine, it is important to ensure that genomics' improvements in human health are distributed globally and equitably. An important step to ensuring health equity is to improve the human reference genome to capture global diversity by including a wide variety of alternative haplotypes, sequences that are not currently captured on the reference genome. We present a method that localizes 100 basepair (bp) long sequences extracted from short-read sequencing that can ultimately be used to identify what regions of the human genome non-reference sequences belong to. We extract reads that don't align to the reference genome, and compute the population's distribution of 100-mers found within the unmapped reads. We use genetic data from families to identify shared genetic material between siblings and match the distribution of unmapped k-mers to these inheritance patterns to determine the most likely genomic region of a k-mer. We perform this localization with two highly interpretable methods of artificial intelligence: a computationally tractable Hidden Markov Model coupled to a Maximum Likelihood Estimator. Using a set of alternative haplotypes with known locations on the genome, we show that our algorithm is able to localize 96% of k-mers with over 90% accuracy and less than 1Mb median resolution. As the collection of sequenced human genomes grows larger and more diverse, we hope that this method can be used to improve the human reference genome, a critical step in addressing precision medicine's diversity crisis.

    View details for PubMedID 34890159
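
    The first step of the localization method above, computing the distribution of 100-mers found in unmapped reads, can be pictured as a sliding-window count. The Python sketch below illustrates that counting step only, under simplifying assumptions: reads are plain strings, reverse complements are not merged, and the function name and toy reads are hypothetical. It does not attempt the Hidden Markov Model or the inheritance matching.

```python
from collections import Counter


def count_kmers(reads, k=100):
    """Count every length-k substring across a collection of reads.

    Reads shorter than k contribute nothing.
    """
    if k < 1:
        raise ValueError("k must be positive")
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


# Toy example with k=3 so the output stays readable (the paper uses k=100).
toy_reads = ["ACGTAC", "CGTA", "AC"]
toy_counts = count_kmers(toy_reads, k=3)
```

    In the toy example, `CGT` and `GTA` each occur twice, `ACG` and `TAC` once, and the two-base read is skipped.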

  • Human Intrigue: Meta-analysis approaches for big questions with big data while shaking up the peer review process Bobak, C. A., Muse, M., Giffin, K. A., Williamson, D. A., Greene, C. S., Moore, J. H., Wall, D. P., Altman, R. B., Dunker, A. K., Hunter, L., Ritchie, M. D., Murray, T., Klein, T. E. WORLD SCIENTIFIC PUBL CO PTE LTD. 2022: 156-162


    Scientific innovation has long been heralded as the collaborative effort of many people, groups, and studies to drive forward research. However, the traditional peer review process relies on reviewers acting in a silo to critically judge research. As research becomes more cross-disciplinary, finding reviewers with appropriate expertise to provide feedback on an entire paper is increasingly difficult. We sought to pilot a crowd peer review process that allowed reviewers to interact with one another in the spirit of collaborative science. We focused this session on manuscripts using meta-analysis, to fully embrace the importance of collaborative and open scientific research in the field of biocomputing. Our pilot study found that researchers enjoy a more collaborative peer review process and felt that the process led to higher quality feedback for submitting authors than traditional review offers.

    View details for Web of Science ID 001235270700013

    View details for PubMedID 34890145

  • TikTok for good: Creating a diverse emotion expression database Surabhi, S., Shah, B., Washington, P., Mutlu, O., Leblanc, E., Mohite, P., Husic, A., Kline, A., Dunlap, K., McNealis, M., Liu, B., Deveaux, N., Sleiman, E., Wall, D. P., IEEE IEEE. 2022: 2495-2505
  • An Informatics Analysis to Identify Sex Disparities and Healthcare Needs for Autism across the United States. AMIA ... Annual Symposium proceedings. AMIA Symposium Stockham, N. T., Paskov, K. M., Tabatabaei, K., Sutaria, S., Liu, B., Kent, J., Wall, D. P. 2022; 2022: 456-465


    Autism is among the most common neurodevelopmental conditions. Timely diagnosis and access to therapeutic resources are essential for positive prognoses, yet long queues and unevenly dispersed resources leave many untreated. Without granular estimates of autism prevalence by geographic area, it is difficult to identify unmet needs and mechanisms to address them. Mining a dataset of 53M children using meaningful geographic regions, we computed autism prevalence across the country. We then performed comparative analysis against 50,000 resources to identify the type and extent of gaps in access to autism services. We find a steady increase in autism diagnoses from K-5, supporting delayed diagnosis of autism, and consistent under-diagnosis of females. We find a significant inverse relationship between prevalence and availability of resources (p < 0.001). While more work is needed to characterize additional trends including racial and ethnicity-based disparities, the identification of resource gaps can direct and prioritize new innovations.

    View details for PubMedID 35854759

  • Crowd annotations can approximate clinical autism impressions from short home videos with privacy protections. Intelligence-based medicine Washington, P., Chrisman, B., Leblanc, E., Dunlap, K., Kline, A., Mutlu, C., Stockham, N., Paskov, K., Wall, D. P. 2022; 6


    Artificial Intelligence (A.I.) solutions are increasingly considered for telemedicine. For these methods to serve children and their families in home settings, it is crucial to ensure the privacy of the child and parent or caregiver. To address this challenge, we explore the potential for global image transformations to provide privacy while preserving the quality of behavioral annotations. Crowd workers have previously been shown to reliably annotate behavioral features in unstructured home videos, allowing machine learning classifiers to detect autism using the annotations as input. We evaluate this method with videos altered via pixelation, dense optical flow, and Gaussian blurring. On a balanced test set of 30 videos of children with autism and 30 neurotypical controls, we find that the visual privacy alterations do not drastically alter any individual behavioral annotation at the item level. The AUROC on the evaluation set was 90.0% ±7.5% for unaltered videos, 85.0% ±9.0% for pixelation, 85.0% ±9.0% for optical flow, and 83.3% ±9.3% for blurring, demonstrating that an aggregation of small changes across behavioral questions can collectively result in increased misdiagnosis rates. We also compare crowd answers against clinicians who provided the same annotations for the same videos as crowd workers, and we find that clinicians have higher sensitivity in their recognition of autism-related symptoms. We also find that there is a linear correlation (r = 0.75, p < 0.0001) between the mean Clinical Global Impression (CGI) score provided by professional clinicians and the corresponding score emitted by a previously validated autism classifier with crowd inputs, indicating that the classifier's output probability is a reliable estimate of the clinical impression of autism. 
A significant correlation is maintained with privacy alterations, indicating that crowd annotations can approximate clinician-provided autism impression from home videos in a privacy-preserved manner.

    View details for DOI 10.1016/j.ibmed.2022.100056

    View details for PubMedID 35634270
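
    Pixelation, one of the three privacy alterations evaluated above, replaces each block of pixels with a single value so that fine facial detail is lost while coarse motion and posture remain. The abstract does not give the block size or implementation, so the Python sketch below is only an assumed block-averaging version on a grayscale frame stored as a list of rows; the function name and toy frame are hypothetical.

```python
def pixelate(image, block):
    """Replace each block x block tile of a grayscale image with its mean value.

    `image` is a non-empty list of equal-length rows of numbers.
    Tiles at the right and bottom edges may be smaller than `block`.
    """
    if block < 1:
        raise ValueError("block must be positive")
    height, width = len(image), len(image[0])
    output = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            cells = [
                (r, c)
                for r in range(top, min(top + block, height))
                for c in range(left, min(left + block, width))
            ]
            mean = sum(image[r][c] for r, c in cells) / len(cells)
            for r, c in cells:
                output[r][c] = mean
    return output


# Toy 2x4 frame (hypothetical intensities), pixelated with 2x2 blocks.
toy_frame = [
    [0, 2, 10, 10],
    [4, 6, 10, 10],
]
toy_pixelated = pixelate(toy_frame, block=2)
```

    Larger blocks remove more identifying detail but also more of the behavioral signal that annotators rely on, which is the trade-off the AUROC figures above quantify.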

  • HyperNetVec: Fast and Scalable Hierarchical Embedding for Hypergraphs Maleki, S., Saless, D., Wall, D. P., Pingali, K., Ribeiro, P., Silva, F., Mendes, J. F., Laureano, R. SPRINGER INTERNATIONAL PUBLISHING AG. 2022: 169-183
  • Longitudinal study of stool-associated microbial taxa in sibling pairs with and without autism spectrum disorder ISME COMMUNICATIONS Bernhard, A. Z., Beltz, J., Giblin, A. P., Roberts, B. M. 2021; 1 (1)
  • Longitudinal study of stool-associated microbial taxa in sibling pairs with and without autism spectrum disorder. ISME communications Tataru, C., Martin, A., Dunlap, K., Peras, M., Chrisman, B. S., Rutherford, E., Deitzler, G. E., Phillips, A., Yin, X., Sabino, K., Hannibal, R. L., Hartono, W., Lin, M., Raack, E., Wu, Y., DeSantis, T. Z., Iwai, S., Wall, D. P., David, M. M. 2021; 1 (1): 80


    Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder influenced by both genetic and environmental factors. Recently, gut dysbiosis has emerged as a powerful contributor to ASD symptoms. In this study, we recruited over 100 age-matched sibling pairs (between 2 and 8 years old) where one had an ASD diagnosis and the other was developing typically (TD) (432 samples total). We collected stool samples over four weeks, tracked over 100 lifestyle and dietary variables, and surveyed behavior measures related to ASD symptoms. We identified 117 amplicon sequencing variants (ASVs) that were significantly different in abundance between sibling pairs across all three timepoints, 11 of which were supported by at least two contrast methods. We additionally identified dietary and lifestyle variables that differ significantly between cohorts, and further linked those variables to the ASVs they statistically relate to. Overall, dietary and lifestyle features were explanatory of ASD phenotype using logistic regression; however, global compositional microbiome features were not. Leveraging our longitudinal behavior questionnaires, we additionally identified 11 ASVs associated with changes in reported anxiety over time within and across all individuals. Lastly, we find that overall microbiome composition (beta-diversity) is associated with specific ASD-related behavioral characteristics.

    View details for DOI 10.1038/s43705-021-00080-6

    View details for PubMedID 37938270

    View details for PubMedCentralID PMC9723651

  • Improved detection of disease-associated gut microbes using 16S sequence-based biomarkers. BMC bioinformatics Chrisman, B. S., Paskov, K. M., Stockham, N., Jung, J., Varma, M., Washington, P. Y., Tataru, C., Iwai, S., DeSantis, T. Z., David, M., Wall, D. P. 2021; 22 (1): 509


    BACKGROUND: Sequencing partial 16S rRNA genes is a cost-effective method for quantifying the microbial composition of an environment, such as the human gut. However, downstream analysis relies on binning reads into microbial groups by either considering each unique sequence as a different microbe, querying a database to get taxonomic labels from sequences, or clustering similar sequences together. These approaches do not fully capture evolutionary relationships between microbes, limiting the ability to identify differentially abundant groups of microbes between a diseased and control cohort. We present sequence-based biomarkers (SBBs), an aggregation method that groups and aggregates microbes using single variants and combinations of variants within their 16S sequences. We compare SBBs against other existing aggregation methods (OTU clustering and MicroPheno or DiTaxa features) in several benchmarking tasks: biomarker discovery via permutation test, biomarker discovery via linear discriminant analysis, and phenotype prediction power. We demonstrate that SBBs perform on par with or better than the state-of-the-art methods in biomarker discovery and phenotype prediction. RESULTS: On two independent datasets, SBBs identify differentially abundant groups of microbes with similar or higher statistical significance than existing methods in both a permutation-test-based analysis and using linear discriminant analysis effect size. By grouping microbes by SBB, we can identify several differentially abundant microbial groups (FDR < 0.1) between children with autism and neurotypical controls in a set of 115 discordant siblings. Porphyromonadaceae, Ruminococcaceae, and an unnamed species of Blastocystis were significantly enriched in autism, while Veillonellaceae was significantly depleted. Likewise, aggregating microbes by SBB on a dataset of obese and lean twins, we find several significantly differentially abundant microbial groups (FDR < 0.1). We observed Megasphaera and Sutterellaceae highly enriched in obesity, and Phocaeicola significantly depleted. SBBs also perform on par with or better than existing aggregation methods as features in a phenotype prediction model, predicting the autism phenotype with an ROC-AUC score of 0.64 and the obesity phenotype with an ROC-AUC score of 0.84. CONCLUSIONS: SBBs provide a powerful method for aggregating microbes to perform differential abundance analysis as well as phenotype prediction. Our source code can be freely downloaded from .
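
    The permutation-test benchmark named in this abstract can be illustrated with a minimal sketch. It is a generic two-sided permutation test on the difference in group means, not the authors' pipeline; the function name, the abundance values, and the add-one correction are assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_test(case, control, n_perm=2000, seed=0):
    """Two-sided permutation test on the absolute difference in group means.

    Returns an estimated p-value: the share of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(case) - mean(control))
    pooled = list(case) + list(control)
    n_case = len(case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of group labels
        diff = abs(mean(pooled[:n_case]) - mean(pooled[n_case:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical relative abundances of one microbial group in two cohorts.
case_abundance = [0.30, 0.28, 0.35, 0.31, 0.33, 0.29]
control_abundance = [0.10, 0.12, 0.08, 0.11, 0.09, 0.13]
p_value = permutation_test(case_abundance, control_abundance)
```

    In a real analysis one such p-value would be computed per microbial group and then corrected for multiple testing (the abstract reports an FDR threshold), a step this sketch leaves out.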

    View details for DOI 10.1186/s12859-021-04427-7

    View details for PubMedID 34666677

  • A Mobile Game Platform for Improving Social Communication in Children with Autism: A Feasibility Study. Applied clinical informatics Penev, Y., Dunlap, K., Husic, A., Hou, C., Washington, P., Leblanc, E., Kline, A., Kent, J., Ng-Thow-Hing, A., Liu, B., Harjadi, C., Tsou, M., Desai, M., Wall, D. P. 2021; 12 (5): 1030-1040


    BACKGROUND: Many children with autism cannot receive timely in-person diagnosis and therapy, especially in situations where access is limited by geography, socioeconomics, or global health concerns such as the current COVID-19 pandemic. Mobile solutions that work outside of traditional clinical environments can safeguard against gaps in access to quality care. OBJECTIVE: The aim of the study is to examine the engagement level and therapeutic feasibility of a mobile game platform for children with autism. METHODS: We designed a mobile application, GuessWhat, which, in its current form, delivers game-based therapy to children aged 3 to 12 in home settings through a smartphone. The phone, held by a caregiver on their forehead, displays one of a range of appropriate and therapeutically relevant prompts (e.g., a surprised face) that the child must recognize and mimic sufficiently to allow the caregiver to guess what is being imitated and proceed to the next prompt. Each game runs for 90 seconds to create a robust social exchange between the child and the caregiver. RESULTS: We examined the therapeutic feasibility of GuessWhat in 72 children (75% male, average age 8 years 2 months) with autism who were asked to play the game for three 90-second sessions per day, 3 days per week, for a total of 4 weeks. The group showed significant improvements in Social Responsiveness Score-2 (SRS-2) total (3.97, p < 0.001) and Vineland Adaptive Behavior Scales-II (VABS-II) socialization standard (5.27, p = 0.002) scores. CONCLUSION: The results support that the GuessWhat mobile game is a viable approach for efficacious treatment of autism and further support the possibility that the game can be used in natural settings to increase access to treatment when barriers to care exist.

    View details for DOI 10.1055/s-0041-1736626

    View details for PubMedID 34788890

  • Training Affective Computer Vision Models by Crowdsourcing Soft-Target Labels. Cognitive computation Washington, P., Kalantarian, H., Kent, J., Husic, A., Kline, A., Leblanc, E., Hou, C., Mutlu, C., Dunlap, K., Penev, Y., Stockham, N., Chrisman, B., Paskov, K., Jung, J. Y., Voss, C., Haber, N., Wall, D. P. 2021; 13 (5): 1363-1373


    Emotion detection classifiers traditionally predict discrete emotions. However, emotion expressions are often subjective, thus requiring a method to handle compound and ambiguous labels. We explore the feasibility of using crowdsourcing to acquire reliable soft-target labels and evaluate an emotion detection classifier trained with these labels. We hypothesize that training with labels that are representative of the diversity of human interpretation of an image will result in predictions that are similarly representative on a disjoint test set. We also hypothesize that crowdsourcing can generate distributions which mirror those generated in a lab setting. We center our study on the Child Affective Facial Expression (CAFE) dataset, a gold standard collection of images depicting pediatric facial expressions along with 100 human labels per image. To test the feasibility of crowdsourcing to generate these labels, we used Microworkers to acquire labels for 207 CAFE images. We evaluate both unfiltered workers as well as workers selected through a short crowd filtration process. We then train two versions of a ResNet-152 neural network on soft-target CAFE labels using the original 100 annotations provided with the dataset: (1) a classifier trained with traditional one-hot encoded labels, and (2) a classifier trained with vector labels representing the distribution of CAFE annotator responses. We compare the resulting softmax output distributions of the two classifiers with a 2-sample independent t-test of L1 distances between the classifier's output probability distribution and the distribution of human labels. While agreement with CAFE is weak for unfiltered crowd workers, the filtered crowd agree with the CAFE labels 100% of the time for happy, neutral, sad and "fear + surprise", and 88.8% for "anger + disgust". While the F1-score for a one-hot encoded classifier is much higher (94.33% vs. 78.68%) with respect to the ground truth CAFE labels, the output probability vector of the crowd-trained classifier more closely resembles the distribution of human labels (t=3.2827, p=0.0014). For many applications of affective computing, reporting an emotion probability distribution that accounts for the subjectivity of human interpretation can be more useful than an absolute label. Crowdsourcing, including a sufficient filtering mechanism for selecting reliable crowd workers, is a feasible solution for acquiring soft-target labels.
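
    The difference between one-hot and soft-target labels, and the L1 distance used to compare distributions, can be sketched as follows. This is an illustrative sketch only: the class list, the helper names, and the vote counts are assumptions and are not taken from the CAFE dataset or the authors' code.

```python
from collections import Counter

# Illustrative class list; the actual label set used in the study may differ.
EMOTIONS = ["happy", "neutral", "sad", "fear+surprise", "anger+disgust"]


def soft_target(annotations, classes=EMOTIONS):
    """Turn a list of human labels into a probability vector over classes."""
    counts = Counter(annotations)
    total = len(annotations)
    return [counts[c] / total for c in classes]


def one_hot(annotations, classes=EMOTIONS):
    """Collapse the same labels to a one-hot vector for the majority class."""
    majority = Counter(annotations).most_common(1)[0][0]
    return [1.0 if c == majority else 0.0 for c in classes]


def l1_distance(p, q):
    """Sum of absolute differences between two distributions."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical image labeled by 100 annotators.
votes = ["happy"] * 60 + ["neutral"] * 30 + ["fear+surprise"] * 10
soft = soft_target(votes)    # [0.6, 0.3, 0.0, 0.1, 0.0]
hard = one_hot(votes)        # [1.0, 0.0, 0.0, 0.0, 0.0]
gap = l1_distance(soft, hard)
```

    The gap between the two vectors is the annotator disagreement that a one-hot label discards and a soft-target label retains.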

    View details for DOI 10.1007/s12559-021-09936-4

    View details for PubMedID 35669554

    View details for PubMedCentralID PMC9165031

  • Performance of a Novel Software-Based Autism Spectrum Disorder Diagnostic Device for Use in a Primary Care Setting Megerian, J., Dey, S., Melmed, R. D., Coury, D. L., Lerner, M., Nicholls, C., Sohl, K., Rouhbakhsh, R., Narasimhan, A., Romain, J., Golla, S., Shareef, S., Ostrovsky, A., Shannon, J., Kraft, C., Liu-Mayo, S., Abbas, H., Gal-Szabo, D. E., Wall, D. P., Taraman, S. QUADRANT HEALTHCOM INC. 2021: 13
  • A maximum flow-based network approach for identification of stable noncoding biomarkers associated with the multigenic neurological condition, autism. BioData mining Varma, M., Paskov, K. M., Chrisman, B. S., Sun, M. W., Jung, J., Stockham, N. T., Washington, P. Y., Wall, D. P. 2021; 14 (1): 28


    BACKGROUND: Machine learning approaches for predicting disease risk from high-dimensional whole genome sequence (WGS) data often result in unstable models that can be difficult to interpret, limiting the identification of putative sets of biomarkers. Here, we design and validate a graph-based methodology based on maximum flow, which leverages the presence of linkage disequilibrium (LD) to identify stable sets of variants associated with complex multigenic disorders. RESULTS: We apply our method to a previously published logistic regression model trained to identify variants in simple repeat sequences associated with autism spectrum disorder (ASD); this L1-regularized model exhibits high predictive accuracy yet demonstrates great variability in the features selected from over 230,000 possible variants. In order to improve model stability, we extract the variants assigned non-zero weights in each of five cross-validation folds and then assemble the five sets of features into a flow network subject to LD constraints. The maximum flow formulation allowed us to identify 55 variants, which we show to be more stable than the features identified by the original classifier. CONCLUSION: Our method allows for the creation of machine learning models that can identify predictive variants. Our results help pave the way towards biomarker-based diagnosis methods for complex genetic disorders.

    View details for DOI 10.1186/s13040-021-00262-x

    View details for PubMedID 33941233

  • Estimating sequencing error rates using families. BioData mining Paskov, K., Jung, J., Chrisman, B., Stockham, N. T., Washington, P., Varma, M., Sun, M. W., Wall, D. P. 2021; 14 (1): 27


    BACKGROUND: As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. RESULTS: We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method's versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. CONCLUSION: Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
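
    The notion of a Mendelian error that this abstract builds on can be made concrete with a small sketch. It only flags genotype calls in a mother-father-child trio that violate Mendelian inheritance at a biallelic autosomal site; it does not reproduce the paper's estimator of precision and recall. The genotype encoding (count of alternate alleles: 0, 1, or 2) and all function names are assumptions for illustration.

```python
def transmittable(genotype):
    """Alternate-allele counts (0 or 1) a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_error(mother, father, child):
    """True if the child's genotype cannot arise from the parents' genotypes."""
    possible = {m + f
                for m in transmittable(mother)
                for f in transmittable(father)}
    return child not in possible


def mendelian_error_rate(trios):
    """Fraction of (mother, father, child) calls that are Mendelian errors."""
    errors = sum(is_mendelian_error(m, f, c) for m, f, c in trios)
    return errors / len(trios)


# Hypothetical calls at four sites for one trio.
calls = [(0, 0, 0), (0, 0, 1), (1, 1, 2), (2, 2, 1)]
rate = mendelian_error_rate(calls)  # two of the four calls are inconsistent
```

    Only some sequencing errors surface this way: a miscalled child genotype of 1 with two heterozygous parents, for example, remains consistent with inheritance, which is why the abstract speaks of observing a fraction of errors.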

    View details for DOI 10.1186/s13040-021-00259-6

    View details for PubMedID 33892748

  • Crowdsourced privacy-preserved feature tagging of short home videos for machine learning ASD detection. Scientific reports Washington, P., Tariq, Q., Leblanc, E., Chrisman, B., Dunlap, K., Kline, A., Kalantarian, H., Penev, Y., Paskov, K., Voss, C., Stockham, N., Varma, M., Husic, A., Kent, J., Haber, N., Winograd, T., Wall, D. P. 2021; 11 (1): 7620


    Standard medical diagnosis of mental health conditions requires licensed experts who are increasingly outnumbered by those at risk, limiting reach. We test the hypothesis that a trustworthy crowd of non-experts can efficiently annotate behavioral features needed for accurate machine learning detection of the common childhood developmental disorder Autism Spectrum Disorder (ASD) for children under 8 years old. We implement a novel process for identifying and certifying a trustworthy distributed workforce for video feature extraction, selecting a workforce of 102 workers from a pool of 1,107. Two previously validated ASD logistic regression classifiers, evaluated against parent-reported diagnoses, were used to assess the accuracy of the trusted crowd's ratings of unstructured home videos. A representative balanced sample of videos (N=50) was evaluated with and without face box and pitch shift privacy alterations, with AUROC and AUPRC scores > 0.98. With both privacy-preserving modifications, sensitivity is preserved (96.0%) while maintaining specificity (80.0%) and accuracy (88.0%) at levels comparable to prior classification methods without alterations. We find that machine learning classification from features extracted by a certified nonexpert crowd achieves high performance for ASD detection from natural home videos of the child at risk and maintains high sensitivity when privacy-preserving mechanisms are applied. These results suggest that privacy-safeguarded crowdsourced analysis of short home videos can help enable rapid and mobile machine-learning detection of developmental delays in children.
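
    The three reported rates are mutually consistent, as a short worked example shows. The confusion-matrix counts below are inferred, not reported: they assume that "balanced" means 25 ASD and 25 non-ASD videos, and the function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # share of ASD videos detected
    specificity = tn / (tn + fp)          # share of non-ASD videos cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Counts inferred under the assumed 25/25 split of the 50 videos.
sens, spec, acc = classification_metrics(tp=24, fn=1, tn=20, fp=5)
```

    Under that assumption, 24 of 25 and 20 of 25 correct calls give 96.0% sensitivity, 80.0% specificity, and 44 of 50 correct overall, or 88.0% accuracy.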

    View details for DOI 10.1038/s41598-021-87059-4

    View details for PubMedID 33828118

  • Children with Autism and Their Typically Developing Siblings Differ in Amplicon Sequence Variants and Predicted Functions of Stool-Associated Microbes. mSystems David, M. M., Tataru, C., Daniels, J., Schwartz, J., Keating, J., Hampton-Marcell, J., Gottel, N., Gilbert, J. A., Wall, D. P. 2021; 6 (2)


    The existence of a link between the gut microbiome and autism spectrum disorder (ASD) is well established in mice, but in human populations, efforts to identify microbial biomarkers have been limited due to a lack of appropriately matched controls, stratification of participants within the autism spectrum, and sample size. To overcome these limitations, we crowdsourced the recruitment of families with age-matched sibling pairs between 2 and 7 years old (within 2 years of each other), where one child had a diagnosis of ASD and the other did not. Parents collected stool samples, provided a home video of their ASD child's natural social behavior, and responded online to diet and behavioral questionnaires. 16S rRNA V4 amplicon sequencing of 117 samples (60 ASD and 57 controls) identified 21 amplicon sequence variants (ASVs) that differed significantly between the two cohorts: 11 were found to be enriched in neurotypical children (six ASVs belonging to the Lachnospiraceae family), while 10 were enriched in children with ASD (including Ruminococcaceae and Bacteroidaceae families). Summarizing the expected KEGG orthologs of each predicted genome, the taxonomic biomarkers associated with children with ASD can use amino acids as precursors for butyragenic pathways, potentially altering the availability of neurotransmitters like glutamate and gamma aminobutyric acid (GABA). IMPORTANCE: Autism spectrum disorder (ASD), which now affects 1 in 54 children in the United States, is known to have comorbidity with gut disorders of a variety of types; however, the link to the microbiome remains poorly characterized. Recent work has provided compelling evidence to link the gut microbiome to the autism phenotype in mouse models, but identification of specific taxa associated with autism has suffered replicability issues in humans. This has been due in part to a lack of sample sizes that sufficiently cover the spectrum of phenotypes known to autism (which range from subtle to severe) and a lack of appropriately matched controls. Our original study proposes to overcome these limitations by collecting stool-associated microbiome on 60 sibling pairs of children, one with autism and one neurotypically developing, both 2 to 7 years old and no more than 2 years apart in age. We use exact sequence variant analysis and both permutation and differential abundance procedures to identify 21 taxa with significant enrichment or depletion in the autism cohort compared to their matched sibling controls. Several of these 21 biomarkers have been identified in previous smaller studies; however, some are new to autism and known to be important in gut-brain interactions and/or are associated with specific fatty acid biosynthesis pathways.

    View details for DOI 10.1128/mSystems.00193-20

    View details for PubMedID 33824194

  • Indels in SARS-CoV-2 occur at template-switching hotspots. BioData mining Chrisman, B. S., Paskov, K., Stockham, N., Tabatabaei, K., Jung, J., Washington, P., Varma, M., Sun, M. W., Maleki, S., Wall, D. P. 2021; 14 (1): 20


    The evolutionary dynamics of SARS-CoV-2 have been carefully monitored since the COVID-19 pandemic began in December 2019. However, analysis has focused primarily on single nucleotide polymorphisms and largely ignored the role of insertions and deletions (indels) as well as recombination in SARS-CoV-2 evolution. Using sequences from the GISAID database, we catalogue over 100 insertions and deletions in the SARS-CoV-2 consensus sequences. We hypothesize that these indels are artifacts of recombination events between SARS-CoV-2 replicates whereby RNA-dependent RNA polymerase (RdRp) re-associates with a homologous template at a different locus ("imperfect homologous recombination"). We provide several independent pieces of evidence that suggest this. (1) The indels from the GISAID consensus sequences are clustered at specific regions of the genome. (2) These regions are also enriched for 5' and 3' breakpoints in the transcription regulatory site (TRS) independent transcriptome, presumably sites of RdRp template-switching. (3) Within raw reads, these indel hotspots have cases of both high intra-host heterogeneity and intra-host homogeneity, suggesting that these indels are both consequences of de novo recombination events within a host and artifacts of previous recombination. We briefly analyze the indels in the context of RNA secondary structure, noting that indels preferentially occur in "arms" and loop structures of the predicted folded RNA, suggesting that secondary structure may be a mechanism for TRS-independent template-switching in SARS-CoV-2 or other coronaviruses. These insights into the relationship between structural variation and recombination in SARS-CoV-2 can improve our reconstructions of the SARS-CoV-2 evolutionary history as well as our understanding of the process of RdRp template-switching in RNA viruses.

    View details for DOI 10.1186/s13040-021-00251-0

    View details for PubMedID 33743803

  • Achieving Trustworthy Biomedical Data Solutions. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Washington, P., Yeung, S., Percha, B., Tatonetti, N., Liphardt, J., Wall, D. P. 2021; 26: 1–13


    Privacy and trust in biomedical solutions that capture and share data are issues rising to the center of public attention and discourse. While large-scale academic, medical, and industrial research initiatives must collect increasing amounts of personal biomedical data from patient stakeholders, which is central to ensuring precision health becomes a reality, methods for providing sufficient privacy in biomedical databases and conveying a sense of trust to the user are equally crucial for the field of biocomputing to advance with the grace of those stakeholders. If the intended audience does not trust new precision health innovations, funding and support for these efforts will inevitably be limited. It is therefore crucial for the field to address these issues in a timely manner. Here we describe current research directions towards achieving trustworthy biomedical informatics solutions.

    View details for PubMedID 33690999

  • Activity Recognition with Moving Cameras and Few Training Examples: Applications for Detection of Autism-Related Headbanging Washington, P., Kline, A., Mutlu, O., Leblanc, E., Hou, C., Stockham, N., Paskov, K., Chrisman, B., Wall, D., ACM ASSOC COMPUTING MACHINERY. 2021
  • Selection of trustworthy crowd workers for telemedical diagnosis of pediatric autism spectrum disorder Washington, P., Leblanc, E., Dunlap, K., Penev, Y., Varma, M., Jung, J., Chrisman, B., Sun, M., Stockham, N., Paskov, K., Kalantarian, H., Voss, C., Haber, N., Wall, D. P., Altman, R. B., Dunker, A. K., Hunter, L., Ritchie, M. D., Murray, T., Klein, T. E. WORLD SCIENTIFIC PUBL CO PTE LTD. 2021: 14-25
  • Raising the stakeholders: Improving patient outcomes through interprofessional collaborations in AI for healthcare Bobak, C. A., Svoboda, M., Giffin, K. A., Wall, D. P., Moore, J., Altman, R. B., Dunker, A. K., Hunter, L., Ritchie, M. D., Murray, T., Klein, T. E. WORLD SCIENTIFIC PUBL CO PTE LTD. 2021: 351-355


    Research into AI implementations for healthcare continues to boom. However, successfully launching these implementations into healthcare clinics requires the co-operation and collaboration of multiple stakeholders in healthcare, including healthcare professionals, administrators, insurers, legislators, advocacy groups, and the patients themselves. The co-operation and collaboration of these interprofessional groups is necessary not just in the final stages of launching AI-based solutions in healthcare, but along each stage of the research design and analysis. In this workshop, we solicited talks from researchers who have embraced the idea of interprofessional collaboration across many different stakeholder groups at multiple stages of their research. We specifically focus on projects that included heavy collaborations from healthcare professionals and embraced the research subjects' communities as critical research partners, and we included researchers who are advocating for systemized changes to include interprofessional stakeholders as evaluators of AI research in healthcare.

    View details for Web of Science ID 000759784400035

    View details for PubMedID 33691033

  • Achieving Trustworthy Biomedical Data Solutions Washington, P., Yeung, S., Percha, B., Tatonetti, N., Liphardt, J., Wall, D. P., Altman, R. B., Dunker, A. K., Hunter, L., Ritchie, M. D., Murray, T., Klein, T. E. WORLD SCIENTIFIC PUBL CO PTE LTD. 2021: 1-13
  • Selection of trustworthy crowd workers for telemedical diagnosis of pediatric autism spectrum disorder. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Washington, P., Leblanc, E., Dunlap, K., Penev, Y., Varma, M., Jung, J., Chrisman, B., Sun, M. W., Stockham, N., Paskov, K. M., Kalantarian, H., Voss, C., Haber, N., Wall, D. P. 2021; 26: 14–25


    Crowd-powered telemedicine has the potential to revolutionize healthcare, especially during times that require remote access to care. However, sharing private health data with strangers from around the world is not compatible with data privacy standards, requiring a stringent filtration process to recruit reliable and trustworthy workers who can go through the proper training and security steps. The key challenge, then, is to identify capable, trustworthy, and reliable workers through high-fidelity evaluation tasks without exposing any sensitive patient data during the evaluation process. We contribute a set of experimentally validated metrics for assessing the trustworthiness and reliability of crowd workers tasked with providing behavioral feature tags to unstructured videos of children with autism and matched neurotypical controls. The workers are blinded to diagnosis and blinded to the goal of using the features to diagnose autism. These behavioral labels are fed as input to a previously validated binary logistic regression classifier for detecting autism cases using categorical feature vectors. While the metrics do not incorporate any ground truth labels of child diagnosis, linear regression using the 3 correlative metrics as input can predict the mean probability of the correct class of each worker with a mean average error of 7.51% for performance on the same set of videos and 10.93% for performance on a distinct balanced video set with different children. These results indicate that crowd workers can be recruited for performance based largely on behavioral metrics on a crowdsourced task, enabling an affordable way to filter crowd workforces into a trustworthy and reliable diagnostic workforce.

    View details for PubMedID 33691000

  • Feature replacement methods enable reliable home video analysis for machine learning detection of autism. Scientific reports Leblanc, E., Washington, P., Varma, M., Dunlap, K., Penev, Y., Kline, A., Wall, D. P. 2020; 10 (1): 21245


    Autism Spectrum Disorder is a neuropsychiatric condition affecting 53 million children worldwide and for which early diagnosis is critical to the outcome of behavior therapies. Machine learning applied to features manually extracted from readily accessible videos (e.g., from smartphones) has the potential to scale this diagnostic process. However, nearly unavoidable variability in video quality can lead to missing features that degrade algorithm performance. To manage this uncertainty, we evaluated the impact of missing values and feature imputation methods on two previously published autism detection classifiers, trained on standard-of-care instrument scoresheets and tested on ratings of 140 YouTube videos of children. We compare the baseline method of listwise deletion to classic univariate and multivariate techniques. We also introduce a feature replacement method that, based on a score, selects a feature from an expanded dataset to fill in the missing value. The replacement feature selected can be identical for all records (general) or automatically adjusted to the record considered (dynamic). Our results show that general and dynamic feature replacement methods achieve a higher performance than classic univariate and multivariate methods, supporting the hypothesis that algorithmic management can maintain the fidelity of video-based diagnostics in the face of missing values and variable video quality.

    View details for DOI 10.1038/s41598-020-76874-w

    View details for PubMedID 33277527

  • Precision Telemedicine through Crowdsourced Machine Learning: Testing Variability of Crowd Workers for Video-Based Autism Feature Recognition. Journal of personalized medicine Washington, P., Leblanc, E., Dunlap, K., Penev, Y., Kline, A., Paskov, K., Sun, M. W., Chrisman, B., Stockham, N., Varma, M., Voss, C., Haber, N., Wall, D. P. 2020; 10 (3)


    Mobilized telemedicine is becoming a key, and even necessary, facet of both precision health and precision medicine. In this study, we evaluate the capability and potential of a crowd of virtual workers (defined as vetted members of popular crowdsourcing platforms) to aid in the task of diagnosing autism. We evaluate workers when crowdsourcing the task of providing categorical ordinal behavioral ratings to unstructured public YouTube videos of children with autism and neurotypical controls. To evaluate emerging patterns that are consistent across independent crowds, we target workers from distinct geographic loci on two crowdsourcing platforms: an international group of workers on Amazon Mechanical Turk (MTurk) (N = 15) and Microworkers from Bangladesh (N = 56), Kenya (N = 23), and the Philippines (N = 25). We feed worker responses as input to a validated diagnostic machine learning classifier trained on clinician-filled electronic health records. We find that regardless of crowd platform or targeted country, workers vary in the average confidence of the correct diagnosis predicted by the classifier. The best worker responses produce a mean probability of the correct class above 80% and over one standard deviation above 50%, accuracy and variability on par with experts according to prior studies. There is a weak correlation between mean time spent on task and mean performance (r = 0.358, p = 0.005). These results demonstrate that while the crowd can produce accurate diagnoses, there are intrinsic differences in crowdworker ability to rate behavioral features. We propose a novel strategy for recruitment of crowdsourced workers to ensure high quality diagnostic evaluations of autism, and potentially many other pediatric behavioral health conditions. Our approach represents a viable step in the direction of crowd-based approaches for more scalable and affordable precision medicine.

    View details for DOI 10.3390/jpm10030086

    View details for PubMedID 32823538

  • Game theoretic centrality: a novel approach to prioritize disease candidate genes by combining biological networks with the Shapley value. BMC bioinformatics Sun, M. W., Moretti, S., Paskov, K. M., Stockham, N. T., Varma, M., Chrisman, B. S., Washington, P. Y., Jung, J., Wall, D. P. 2020; 21 (1): 356


    BACKGROUND: Complex human health conditions with etiological heterogeneity like Autism Spectrum Disorder (ASD) often pose a challenge for traditional genome-wide association study approaches in defining a clear genotype to phenotype model. Coalitional game theory (CGT) is an exciting method that can consider the combinatorial effect of groups of variants working in concert to produce a phenotype. CGT has been applied to associate likely-gene-disrupting variants encoded from whole genome sequence data to ASD; however, this previous approach cannot take into account prior biological knowledge. Here we extend CGT to incorporate a priori knowledge from biological networks through a game theoretic centrality measure based on Shapley value to rank genes by their relevance, that is, the individual gene's synergistic influence in a gene-to-gene interaction network. Game theoretic centrality extends the notion of Shapley value to the evaluation of a gene's contribution to the overall connectivity of its corresponding node in a biological network. RESULTS: We implemented and applied game theoretic centrality to rank genes on whole genomes from 756 multiplex autism families. Top ranking genes with the highest game theoretic centrality in both the weighted and unweighted approaches were enriched for pathways previously associated with autism, including pathways of the immune system. Four of the selected genes (HLA-A, HLA-B, HLA-G, and HLA-DRB1) have also been implicated in ASD and further support the link between ASD and the human leukocyte antigen complex. CONCLUSIONS: Game theoretic centrality can prioritize influential, disease-associated genes within biological networks, and assist in the decoding of polygenic associations to complex disorders like autism.
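
    The Shapley value underlying this centrality measure can be illustrated on a toy network. The sketch below computes exact Shapley values by averaging marginal contributions over all player orderings; the four-node network, the node names, and the choice of characteristic function (number of edges inside a coalition) are assumptions for illustration and are not the game defined in the paper.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a set of players to the worth of that coalition.
    Only practical for a handful of players (factorial cost).
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = set()
        for p in order:
            before = value(coalition)
            coalition.add(p)
            totals[p] += value(coalition) - before
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical interaction network with "B" as a hub.
EDGES = [("A", "B"), ("B", "C"), ("B", "D")]


def edges_within(coalition):
    """Toy characteristic function: edges with both endpoints in the coalition."""
    return sum(1 for u, v in EDGES if u in coalition and v in coalition)


ranking = shapley_values(["A", "B", "C", "D"], edges_within)
```

    For this particular game each node's Shapley value equals half its degree, so the hub "B" receives 1.5 and the other nodes 0.5 each, and the values sum to the worth of the full network (3 edges).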

    View details for DOI 10.1186/s12859-020-03693-1

    View details for PubMedID 32787845

  • Toward Continuous Social Phenotyping: Analyzing Gaze Patterns in an Emotion Recognition Task for Children With Autism Through Wearable Smart Glasses. Journal of medical Internet research Nag, A., Haber, N., Voss, C., Tamura, S., Daniels, J., Ma, J., Chiang, B., Ramachandran, S., Schwartz, J., Winograd, T., Feinstein, C., Wall, D. P. 2020; 22 (4): e13810


    BACKGROUND: Several studies have shown that facial attention differs in children with autism. Measuring eye gaze and emotion recognition in children with autism is challenging, as standard clinical assessments must be delivered in clinical settings by a trained clinician. Wearable technologies may be able to bring eye gaze and emotion recognition into natural social interactions and settings. OBJECTIVE: This study aimed to test: (1) the feasibility of tracking gaze using wearable smart glasses during a facial expression recognition task and (2) the ability of these gaze-tracking data, together with facial expression recognition responses, to distinguish children with autism from neurotypical controls (NCs). METHODS: We compared the eye gaze and emotion recognition patterns of 16 children with autism spectrum disorder (ASD) and 17 children without ASD via wearable smart glasses fitted with a custom eye tracker. Children identified static facial expressions of images presented on a computer screen along with nonsocial distractors while wearing Google Glass and the eye tracker. Faces were presented in three trials, during one of which children received feedback in the form of the correct classification. We employed hybrid human-labeling and computer vision-enabled methods for pupil tracking and world-gaze translation calibration. We analyzed the impact of gaze and emotion recognition features in a prediction task aiming to distinguish children with ASD from NC participants. RESULTS: Gaze and emotion recognition patterns enabled the training of a classifier that distinguished ASD and NC groups. However, it was unable to significantly outperform other classifiers that used only age and gender features, suggesting that further work is necessary to disentangle these effects. CONCLUSIONS: Although wearable smart glasses show promise in identifying subtle differences in gaze tracking and emotion recognition patterns in children with and without ASD, the present form factor and data do not allow for these differences to be reliably exploited by machine learning systems. Resolving these challenges will be an important step toward continuous tracking of the ASD phenotype.

    View details for DOI 10.2196/13810

    View details for PubMedID 32319961

  • The Performance of Emotion Classifiers for Children With Parent-Reported Autism: Quantitative Feasibility Study. JMIR mental health Kalantarian, H., Jedoui, K., Dunlap, K., Schwartz, J., Washington, P., Husic, A., Tariq, Q., Ning, M., Kline, A., Wall, D. P. 2020; 7 (4): e13174


    BACKGROUND: Autism spectrum disorder (ASD) is a developmental disorder characterized by deficits in social communication and interaction, and restricted and repetitive behaviors and interests. The incidence of ASD has increased in recent years; it is now estimated that approximately 1 in 40 children in the United States are affected. Due in part to increasing prevalence, access to treatment has become constrained. Hope lies in mobile solutions that provide therapy through artificial intelligence (AI) approaches, including facial and emotion detection AI models developed by mainstream cloud providers, available directly to consumers. However, these solutions may not be sufficiently trained for use in pediatric populations. OBJECTIVE: Emotion classifiers available off-the-shelf to the general public through Microsoft, Amazon, Google, and Sighthound are well-suited to the pediatric population, and could be used for developing mobile therapies targeting aspects of social communication and interaction, perhaps accelerating innovation in this space. This study aimed to test these classifiers directly with image data from children with parent-reported ASD recruited through crowdsourcing. METHODS: We used a mobile game called Guess What? that challenges a child to act out a series of prompts displayed on the screen of the smartphone held on the forehead of his or her care provider. The game is intended to be a fun and engaging way for the child and parent to interact socially, for example, the parent attempting to guess what emotion the child is acting out (eg, surprised, scared, or disgusted). During a 90-second game session, as many as 50 prompts are shown while the child acts, and the video records the actions and expressions of the child. Due in part to the fun nature of the game, it is a viable way to remotely engage pediatric populations, including the autism population through crowdsourcing. We recruited 21 children with ASD to play the game and gathered 2602 emotive frames following their game sessions. These data were used to evaluate the accuracy and performance of four state-of-the-art facial emotion classifiers to develop an understanding of the feasibility of these platforms for pediatric research. RESULTS: All classifiers performed poorly for every evaluated emotion except happy. None of the classifiers correctly labeled over 60.18% (1566/2602) of the evaluated frames. Moreover, none of the classifiers correctly identified more than 11% (6/51) of the angry frames and 14% (10/69) of the disgust frames. CONCLUSIONS: The findings suggest that commercial emotion classifiers may be insufficiently trained for use in digital approaches to autism treatment and treatment tracking. Secure, privacy-preserving methods to increase labeled training data are needed to boost the models' performance before they can be used in AI-enabled approaches to social therapy of the kind that is common in autism treatments.

    View details for DOI 10.2196/13174

    View details for PubMedID 32234701

  • Feature Selection and Dimension Reduction of Social Autism Data. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Washington, P., Paskov, K. M., Kalantarian, H., Stockham, N., Voss, C., Kline, A., Patnaik, R., Chrisman, B., Varma, M., Tariq, Q., Dunlap, K., Schwartz, J., Haber, N., Wall, D. P. 2020; 25: 707–18


    Autism Spectrum Disorder (ASD) is a complex neuropsychiatric condition with a highly heterogeneous phenotype. Following the work of Duda et al., which uses a reduced feature set from the Social Responsiveness Scale, Second Edition (SRS) to distinguish ASD from ADHD, we performed item-level question selection on answers to the SRS to determine whether ASD can be distinguished from non-ASD using a similarly small subset of questions. To explore feature redundancies between the SRS questions, we performed filter, wrapper, and embedded feature selection analyses. To explore the linearity of the SRS-related ASD phenotype, we then compressed the 65-question SRS into low-dimension representations using PCA, t-SNE, and a denoising autoencoder. We measured the performance of a multilayer perceptron (MLP) classifier with the top-ranking questions as input. Classification using only the top-rated question resulted in an AUC of over 92% for SRS-derived diagnoses and an AUC of over 83% for dataset-specific diagnoses. High redundancy of features has implications for replacing the social behaviors that are targeted in behavioral diagnostics and interventions, where digital quantification of certain features may be obfuscated due to privacy concerns. We similarly evaluated the performance of an MLP classifier trained on the low-dimension representations of the SRS, finding that the denoising autoencoder achieved slightly higher performance than the PCA and t-SNE representations.

    View details for PubMedID 31797640

  • Feature Selection and Dimension Reduction of Social Autism Data Washington, P., Paskov, K., Kalantarian, H., Stockham, N., Voss, C., Kline, A., Patnaik, R., Chrisman, B., Varma, M., Tariq, Q., Dunlap, K., Schwartz, J., Haber, N., Wall, D. P., Altman, R. B., Dunker, A. K., Hunter, L., Ritchie, M. D., Murray, T., Klein, T. E. WORLD SCIENTIFIC PUBL CO PTE LTD. 2020: 707-718
  • A Mobile Game for Automatic Emotion-Labeling of Images. IEEE transactions on games Kalantarian, H., Jedoui, K., Washington, P., Wall, D. P. 2020; 12 (2): 213–18


    In this paper, we describe challenges in the development of a mobile charades-style game for delivery of social training to children with Autism Spectrum Disorder (ASD). Providing real-time feedback and adapting game difficulty in response to the child's performance necessitates the integration of emotion classifiers into the system. Due to the limited performance of existing emotion recognition platforms for children with ASD, we propose a novel technique to automatically extract emotion-labeled frames from video acquired from game sessions, which we hypothesize can be used to train new emotion classifiers to overcome these limitations. Our technique, which uses probability scores from three different classifiers and meta information from game sessions, correctly identified 83% of frames compared to a baseline of 51.6% from the best emotion classification API evaluated in our work.

    View details for DOI 10.1109/tg.2018.2877325

    View details for PubMedID 32551410

    View details for PubMedCentralID PMC7301713

  • Multi-modular AI Approach to Streamline Autism Diagnosis in Young Children. Scientific reports Abbas, H., Garberson, F., Liu-Mayo, S., Glover, E., Wall, D. P. 2020; 10 (1): 5014


    Autism has become a pressing healthcare challenge. The instruments used to aid diagnosis are time and labor expensive and require trained clinicians to administer, leading to long wait times for at-risk children. We present a multi-modular, machine learning-based assessment of autism comprising three complementary modules for a unified outcome of diagnostic-grade reliability: A 4-minute, parent-report questionnaire delivered via a mobile app, a list of key behaviors identified from 2-minute, semi-structured home videos of children, and a 2-minute questionnaire presented to the clinician at the time of clinical assessment. We demonstrate the assessment reliability in a blinded, multi-site clinical study on children 18-72 months of age (n = 375) in the United States. It outperforms baseline screeners administered to children by 0.35 (90% CI: 0.26 to 0.43) in AUC and 0.69 (90% CI: 0.58 to 0.81) in specificity when operating at 90% sensitivity. Compared to the baseline screeners evaluated on children less than 48 months of age, our assessment outperforms the most accurate by 0.18 (90% CI: 0.08 to 0.29 at 90%) in AUC and 0.30 (90% CI: 0.11 to 0.50) in specificity when operating at 90% sensitivity.

    View details for DOI 10.1038/s41598-020-61213-w

    View details for PubMedID 32193406

  • The conserved microRNA miR-34 regulates synaptogenesis via coordination of distinct mechanisms in presynaptic and postsynaptic cells. Nature communications McNeill, E. M., Warinner, C., Alkins, S., Taylor, A., Heggeness, H., DeLuca, T. F., Fulga, T. A., Wall, D. P., Griffith, L. C., Van Vactor, D. 2020; 11 (1): 1092


    Micro(mi)RNA-based post-transcriptional regulatory mechanisms have been broadly implicated in the assembly and modulation of synaptic connections required to shape neural circuits; however, relatively few specific miRNAs have been identified that control synapse formation. Using a conditional transgenic toolkit for competitive inhibition of miRNA function in Drosophila, we performed an unbiased screen for novel regulators of synapse morphogenesis at the larval neuromuscular junction (NMJ). From a set of ten new validated regulators of NMJ growth, we discovered that miR-34 mutants display synaptic phenotypes and cell type-specific functions suggesting distinct downstream mechanisms in the presynaptic and postsynaptic compartments. A search for conserved downstream targets for miR-34 identified the junctional receptor CNTNAP4/Neurexin-IV (Nrx-IV) and the membrane cytoskeletal effector Adducin/Hu-li tai shao (Hts) as proteins whose synaptic expression is restricted by miR-34. Manipulation of miR-34, Nrx-IV or Hts-M function in motor neurons or muscle supports a model where presynaptic miR-34 inhibits Nrx-IV to influence active zone formation, whereas postsynaptic miR-34 inhibits Hts to regulate the initiation of bouton formation from presynaptic terminals.

    View details for DOI 10.1038/s41467-020-14761-8

    View details for PubMedID 32107390

  • Data-Driven Diagnostics and the Potential of Mobile Artificial Intelligence for Digital Therapeutic Phenotyping in Computational Psychiatry. Biological psychiatry. Cognitive neuroscience and neuroimaging Washington, P., Park, N., Srivastava, P., Voss, C., Kline, A., Varma, M., Tariq, Q., Kalantarian, H., Schwartz, J., Patnaik, R., Chrisman, B., Stockham, N., Paskov, K., Haber, N., Wall, D. P. 2019


    Data science and digital technologies have the potential to transform diagnostic classification. Digital technologies enable the collection of big data, and advances in machine learning and artificial intelligence enable scalable, rapid, and automated classification of medical conditions. In this review, we summarize and categorize various data-driven methods for diagnostic classification. In particular, we focus on autism as an example of a challenging disorder due to its highly heterogeneous nature. We begin by describing the frontier of data science methods for the neuropsychiatry of autism. We discuss early signs of autism as defined by existing pen-and-paper-based diagnostic instruments and describe data-driven feature selection techniques for determining the behaviors that are most salient for distinguishing children with autism from neurologically typical children. We then describe data-driven detection techniques, particularly computer vision and eye tracking, that provide a means of quantifying behavioral differences between cases and controls. We also describe methods of preserving the privacy of collected videos and prior efforts of incorporating humans in the diagnostic loop. Finally, we summarize existing digital therapeutic interventions that allow for data capture and longitudinal outcome tracking as the diagnosis moves along a positive trajectory. Digital phenotyping of autism is paving the way for quantitative psychiatry more broadly and will set the stage for more scalable, accessible, and precise diagnostic techniques in the field.

    View details for DOI 10.1016/j.bpsc.2019.11.015

    View details for PubMedID 32085921

  • Inherited and De Novo Genetic Risk for Autism Impacts Shared Biological Networks Ruzzo, E., Perez-Cano, L., Jung, J., Wang, L., Kashef-Haghighi, D., Hartl, C., Lowe, J., Prober, D., Wall, D., Geschwind, D. ELSEVIER. 2019: S35–S36
  • Superpower Glass. Mobile computing and communications review Kline, A., Voss, C., Washington, P., Haber, N., Schwartz, J., Tariq, Q., Winograd, T., Feinstein, C., Wall, D. P. 2019; 23 (2): 35–38
  • Validity of Online Screening for Autism: Crowdsourcing Study Comparing Paid and Unpaid Diagnostic Tasks. Journal of medical Internet research Washington, P., Kalantarian, H., Tariq, Q., Schwartz, J., Dunlap, K., Chrisman, B., Varma, M., Ning, M., Kline, A., Stockham, N., Paskov, K., Voss, C., Haber, N., Wall, D. P. 2019; 21 (5): e13668


    BACKGROUND: Obtaining a diagnosis of neuropsychiatric disorders such as autism requires long waiting times that can exceed a year and can be prohibitively expensive. Crowdsourcing approaches may provide a scalable alternative that can accelerate general access to care and permit underserved populations to obtain an accurate diagnosis. OBJECTIVE: We aimed to perform a series of studies to explore whether paid crowd workers on Amazon Mechanical Turk (AMT) and citizen crowd workers on a public website shared on social media can provide accurate online detection of autism, conducted via crowdsourced ratings of short home video clips. METHODS: Three online studies were performed: (1) a paid crowdsourcing task on AMT (N=54) where crowd workers were asked to classify 10 short video clips of children as "Autism" or "Not autism," (2) a more complex paid crowdsourcing task (N=27) with only those raters who correctly rated ≥8 of the 10 videos during the first study, and (3) a public unpaid study (N=115) identical to the first study. RESULTS: For Study 1, the mean score of the participants who completed all questions was 7.50/10 (SD 1.46). When only analyzing the workers who scored ≥8/10 (n=27/54), there was a weak negative correlation between the time spent rating the videos and the sensitivity (rho=-0.44, P=.02). For Study 2, the mean score of the participants rating new videos was 6.76/10 (SD 0.59). The average deviation between the crowdsourced answers and gold standard ratings provided by two expert clinical research coordinators was 0.56, with an SD of 0.51 (maximum possible SD is 3). All paid crowd workers who scored 8/10 in Study 1 either expressed enjoyment in performing the task in Study 2 or provided no negative comments. For Study 3, the mean score of the participants who completed all questions was 6.67/10 (SD 1.61). There were weak correlations between age and score (r=0.22, P=.014), age and sensitivity (r=-0.19, P=.04), number of family members with autism and sensitivity (r=-0.195, P=.04), and number of family members with autism and precision (r=-0.203, P=.03). A two-tailed t test between the scores of the paid workers in Study 1 and the unpaid workers in Study 3 showed a significant difference (P<.001). CONCLUSIONS: Many paid crowd workers on AMT enjoyed answering screening questions from videos, suggesting higher intrinsic motivation to make quality assessments. Paid crowdsourcing provides promising screening assessments of pediatric autism with an average deviation <20% from professional gold standard raters, which is potentially a clinically informative estimate for parents. Parents of children with autism likely overfit their intuition to their own affected child. This work provides preliminary demographic data on raters who may have higher ability to recognize and measure features of autism across its wide range of phenotypic manifestations.

    View details for DOI 10.2196/13668

    View details for PubMedID 31124463

  • Effect of Wearable Digital Intervention for Improving Socialization in Children With Autism Spectrum Disorder: A Randomized Clinical Trial. JAMA pediatrics Voss, C., Schwartz, J., Daniels, J., Kline, A., Haber, N., Washington, P., Tariq, Q., Robinson, T. N., Desai, M., Phillips, J. M., Feinstein, C., Winograd, T., Wall, D. P. 2019; 173 (5): 446–54
  • Detecting Developmental Delay and Autism Through Machine Learning Models Using Home Videos of Bangladeshi Children: Development and Validation Study. Journal of medical Internet research Tariq, Q., Fleming, S., Schwartz, J., Dunlap, K., Corbin, C., Washington, P., Kalantarian, H., Khan, N. Z., Darmstadt, G. L., Wall, D. 2019; 21 (4)

    View details for DOI 10.2196/13822

    View details for Web of Science ID 000465558900001

  • Effect of Wearable Digital Intervention for Improving Socialization in Children With Autism Spectrum Disorder: A Randomized Clinical Trial. JAMA pediatrics Voss, C., Schwartz, J., Daniels, J., Kline, A., Haber, N., Washington, P., Tariq, Q., Robinson, T. N., Desai, M., Phillips, J. M., Feinstein, C., Winograd, T., Wall, D. P. 2019


    Importance: Autism behavioral therapy is effective but expensive and difficult to access. While mobile technology-based therapy can alleviate wait-lists and scale for increasing demand, few clinical trials exist to support its use for autism spectrum disorder (ASD) care. Objective: To evaluate the efficacy of Superpower Glass, an artificial intelligence-driven wearable behavioral intervention for improving social outcomes of children with ASD. Design, Setting, and Participants: A randomized clinical trial in which participants received the Superpower Glass intervention plus standard of care applied behavioral analysis therapy and control participants received only applied behavioral analysis therapy. Assessments were completed at the Stanford University Medical School, and enrolled participants used the Superpower Glass intervention in their homes. Children aged 6 to 12 years with a formal ASD diagnosis who were currently receiving applied behavioral analysis therapy were included. Families were recruited between June 2016 and December 2017. The first participant was enrolled on November 1, 2016, and the last appointment was completed on April 11, 2018. Data analysis was conducted between April and October 2018. Interventions: The Superpower Glass intervention, deployed via Google Glass (worn by the child) and a smartphone app, promotes facial engagement and emotion recognition by detecting facial expressions and providing reinforcing social cues. Families were asked to conduct 20-minute sessions at home 4 times per week for 6 weeks. Main Outcomes and Measures: Four socialization measures were assessed using an intention-to-treat analysis with a Bonferroni test correction. Results: Overall, 71 children (63 boys [89%]; mean [SD] age, 8.38 [2.46] years) diagnosed with ASD were enrolled (40 [56.3%] were randomized to treatment, and 31 [43.7%] were randomized to control). Children receiving the intervention showed significant improvements on the Vineland Adaptive Behaviors Scale socialization subscale compared with treatment as usual controls (mean [SD] treatment impact, 4.58 [1.62]; P=.005). Positive mean treatment effects were also found for the other 3 primary measures but not to a significance threshold of P=.0125. Conclusions and Relevance: The observed 4.58-point average gain on the Vineland Adaptive Behaviors Scale socialization subscale is comparable with gains observed with standard of care therapy. To our knowledge, this is the first randomized clinical trial to demonstrate efficacy of a wearable digital intervention to improve social behavior of children with ASD. The intervention reinforces facial engagement and emotion recognition, suggesting either or both could be a mechanism of action driving the observed improvement. This study underscores the potential of digital home therapy to augment the standard of care. Trial Registration: ClinicalTrials.gov identifier: NCT03569176.

    View details for PubMedID 30907929
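
    The significance threshold of P=.0125 reported in the trial abstract above is consistent with a Bonferroni correction across the four socialization measures. A minimal sketch of that arithmetic follows, assuming a family-wise alpha of 0.05 (the abstract does not state it); the function name and dictionary are illustrative, not taken from the study.

```python
def bonferroni_threshold(family_alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return family_alpha / n_tests


# Assumption: family-wise alpha of 0.05; the abstract reports four measures.
threshold = bonferroni_threshold(0.05, 4)

# Only the Vineland socialization P value is reported in the abstract.
reported_p = {"Vineland socialization subscale": 0.005}
significant = {name: p < threshold for name, p in reported_p.items()}

print(f"per-test threshold: {threshold:.4f}")
print(significant)
```

    Under that assumption, the reported P=.005 falls below the per-test threshold, which matches the abstract's description of the Vineland result as significant.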

  • Coalitional Game Theory Facilitates Identification of Non-Coding Variants Associated With Autism. Biomedical informatics insights Sun, M., Gupta, A., Varma, M., Paskov, K. M., Jung, J., Stockham, N. T., Wall, D. P. 2019; 11
  • Outgroup Machine Learning Approach Identifies Single Nucleotide Variants in Noncoding DNA Associated with Autism Spectrum Disorder. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Varma, M., Paskov, K. M., Jung, J., Sierra Chrisman, B., Stockham, N. T., Washington, P. Y., Wall, D. P. 2019; 24: 260–71


    Autism spectrum disorder (ASD) is a heritable neurodevelopmental disorder affecting 1 in 59 children. While noncoding genetic variation has been shown to play a major role in many complex disorders, the contribution of these regions to ASD susceptibility remains unclear. Genetic analyses of ASD typically use unaffected family members as controls; however, we hypothesize that this method does not effectively elevate variant signal in the noncoding region due to family members having subclinical phenotypes arising from common genetic mechanisms. In this study, we use a separate, unrelated outgroup of individuals with progressive supranuclear palsy (PSP), a neurodegenerative condition with no known etiological overlap with ASD, as a control population. We use whole genome sequencing data from a large cohort of 2182 children with ASD and 379 controls with PSP, sequenced at the same facility with the same machines and variant calling pipeline, in order to investigate the role of noncoding variation in the ASD phenotype. We analyze seven major types of noncoding variants: microRNAs, human accelerated regions, hypersensitive sites, transcription factor binding sites, DNA repeat sequences, simple repeat sequences, and CpG islands. After identifying and removing batch effects between the two groups, we trained an ℓ1-regularized logistic regression classifier to predict ASD status from each set of variants. The classifier trained on simple repeat sequences performed well on a held-out test set (AUC-ROC = 0.960); this classifier was also able to differentiate ASD cases from controls when applied to a completely independent dataset (AUC-ROC = 0.960). This suggests that variation in simple repeat regions is predictive of the ASD phenotype and may contribute to ASD risk. Our results show the importance of the noncoding region and the utility of independent control groups in effectively linking genetic variation to disease phenotype for complex disorders.

    View details for PubMedID 30864328

  • Guess What?: Towards Understanding Autism from Structured Video Using Facial Affect. Journal of healthcare informatics research Kalantarian, H., Washington, P., Schwartz, J., Daniels, J., Haber, N., Wall, D. P. 2019; 3: 43–66


    Autism Spectrum Disorder (ASD) is a condition affecting an estimated 1 in 59 children in the United States. Due to delays in diagnosis and imbalances in coverage, it is necessary to develop new methods of care delivery that can appropriately empower children and caregivers by capitalizing on mobile tools and wearable devices for use outside of clinical settings. In this paper, we present a mobile charades-style game, Guess What?, used for the acquisition of structured video from children with ASD for behavioral disease research. We then apply face tracking and emotion recognition algorithms to videos acquired through Guess What? game play. By analyzing facial affect in response to various prompts, we demonstrate that engagement and facial affect can be quantified and measured using real-time image processing algorithms: an important first-step for future therapies, at-home screenings, and outcome measures based on home video. Our study of eight subjects demonstrates the efficacy of this system for deriving highly emotive structured video from children with ASD through an engaging gamified mobile platform, while revealing the most efficacious prompts and categories for producing diverse emotion in participants.

    View details for DOI 10.1007/s41666-018-0034-9

    View details for PubMedID 33313475

  • Labeling images with facial emotion and the potential for pediatric healthcare. Artificial intelligence in medicine Kalantarian, H., Jedoui, K., Washington, P., Tariq, Q., Dunlap, K., Schwartz, J., Wall, D. P. 2019; 98: 77–86


    Autism spectrum disorder (ASD) is a neurodevelopmental disorder characterized by repetitive behaviors, narrow interests, and deficits in social interaction and communication ability. An increasing emphasis is being placed on the development of innovative digital and mobile systems for their potential in therapeutic applications outside of clinical environments. Due to recent advances in the field of computer vision, various emotion classifiers have been developed, which have potential to play a significant role in mobile screening and therapy for developmental delays that impair emotion recognition and expression. However, these classifiers are trained on datasets of predominantly neurotypical adults and can sometimes fail to generalize to children with autism. The need to improve existing classifiers and develop new systems that overcome these limitations necessitates novel methods to crowdsource labeled emotion data from children. In this paper, we present a mobile charades-style game, Guess What?, from which we derive egocentric video with a high density of varied emotion from a 90-second game session. We then present a framework for semi-automatic labeled frame extraction from these videos using meta information from the game session coupled with classification confidence scores. Results show that 94%, 81%, 92%, and 56% of frames were automatically labeled correctly for categories disgust, neutral, surprise, and scared respectively, though performance for angry and happy did not improve significantly from the baseline.

    View details for DOI 10.1016/j.artmed.2019.06.004

    View details for PubMedID 31521254

  • The Potential for Machine Learning-Based Wearables to Improve Socialization in Teenagers and Adults With Autism Spectrum Disorder-Reply. JAMA pediatrics Voss, C., Haber, N., Wall, D. P. 2019

    View details for DOI 10.1001/jamapediatrics.2019.2969

    View details for PubMedID 31498377

  • Inherited and De Novo Genetic Risk for Autism Impacts Shared Networks. Cell Ruzzo, E. K., Pérez-Cano, L., Jung, J. Y., Wang, L. K., Kashef-Haghighi, D., Hartl, C., Singh, C., Xu, J., Hoekstra, J. N., Leventhal, O., Leppä, V. M., Gandal, M. J., Paskov, K., Stockham, N., Polioudakis, D., Lowe, J. K., Prober, D. A., Geschwind, D. H., Wall, D. P. 2019; 178 (4): 850–66.e26


    We performed a comprehensive assessment of rare inherited variation in autism spectrum disorder (ASD) by analyzing whole-genome sequences of 2,308 individuals from families with multiple affected children. We implicate 69 genes in ASD risk, including 24 passing genome-wide Bonferroni correction and 16 new ASD risk genes, most supported by rare inherited variants, a substantial extension of previous findings. Biological pathways enriched for genes harboring inherited variants represent cytoskeletal organization and ion transport, which are distinct from pathways implicated in previous studies. Nevertheless, the de novo and inherited genes contribute to a common protein-protein interaction network. We also identified structural variants (SVs) affecting non-coding regions, implicating recurrent deletions in the promoters of DLG2 and NR3C2. Loss of nr3c2 function in zebrafish disrupts sleep and social function, overlapping with human ASD-related phenotypes. These data support the utility of studying multiplex families in ASD and are available through the Hartwell Autism Research and Technology portal.

    View details for DOI 10.1016/j.cell.2019.07.015

    View details for PubMedID 31398340

  • Identification and Quantification of Gaps in Access to Autism Resources in the United States: An Infodemiological Study. Journal of medical Internet research Ning, M., Daniels, J., Schwartz, J., Dunlap, K., Washington, P., Kalantarian, H., Du, M., Wall, D. P. 2019; 21 (7): e13094


    Autism affects 1 in every 59 children in the United States, according to estimates from the Centers for Disease Control and Prevention's Autism and Developmental Disabilities Monitoring Network in 2018. Although similar rates of autism are reported in rural and urban areas, rural families report greater difficulty in accessing resources. An overwhelming number of families experience long waitlists for diagnostic and therapeutic services. The objective of this study was to accurately identify gaps in access to autism care using GapMap, a mobile platform that connects families with local resources while continuously collecting up-to-date autism resource epidemiological information. After being extracted from various databases, resources were deduplicated, validated, and allocated into 7 categories based on the keywords identified on the resource website. The average distance between the individuals from a simulated autism population and the nearest autism resource in our database was calculated for each US county. Resource load, an approximation of demand over supply for diagnostic resources, was calculated for each US county. There are approximately 28,000 US resources validated on the GapMap database, each allocated into 1 or more of the 7 categories. States with the greatest distances to autism resources included Alaska, Nevada, Wyoming, Montana, and Arizona. Of the 7 resource categories, diagnostic resources were the most underrepresented, comprising only 8.83% (2472/28,003) of all resources. Alarmingly, 83.86% (2635/3142) of all US counties lacked any diagnostic resources. States with the highest diagnostic resource load included West Virginia, Kentucky, Maine, Mississippi, and New Mexico. Results from this study demonstrate the sparsity and uneven distribution of diagnostic resources in the United States, which may contribute to the lengthy waitlists and travel distances, barriers that must be overcome to receive a diagnosis in specific regions. More data are needed on autism diagnosis demand to better quantify resource needs across the United States.

    View details for DOI 10.2196/13094

    View details for PubMedID 31293243
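
    The GapMap abstract above reports two shares and describes resource load as an approximation of demand over supply. The sketch below recomputes the two shares from the counts given in the abstract and illustrates one possible reading of resource load. The exact formula is not given in the abstract, so the ratio used here, the treatment of counties with no resources, and the toy county figures are all assumptions for illustration only.

```python
import math


def percent_share(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage rounded to 2 decimals."""
    return round(100 * part / whole, 2)


def resource_load(estimated_demand: float, n_resources: int) -> float:
    """Hypothetical reading of 'demand over supply' for one county.

    Assumption: a county with no diagnostic resources gets infinite load.
    """
    if n_resources == 0:
        return math.inf
    return estimated_demand / n_resources


# Counts taken from the abstract.
diagnostic_share = percent_share(2472, 28003)
counties_without_diagnostics = percent_share(2635, 3142)

# Toy, made-up counties: (estimated demand, number of diagnostic resources).
toy_counties = {"County A": (120.0, 3), "County B": (45.0, 0)}
loads = {name: resource_load(d, n) for name, (d, n) in toy_counties.items()}

print(diagnostic_share, counties_without_diagnostics)
print(loads)
```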

  • Addendum to the Acknowledgements: Validity of Online Screening for Autism: Crowdsourcing Study Comparing Paid and Unpaid Diagnostic Tasks. Journal of medical Internet research Washington, P., Kalantarian, H., Tariq, Q., Schwartz, J., Dunlap, K., Chrisman, B., Varma, M., Ning, M., Kline, A., Stockham, N., Paskov, K., Voss, C., Haber, N., Wall, D. P. 2019; 21 (6): e14950


    [This corrects the article DOI: 10.2196/13668.].

    View details for DOI 10.2196/14950

    View details for PubMedID 31250828

  • Detecting Developmental Delay and Autism Through Machine Learning Models Using Home Videos of Bangladeshi Children: Development and Validation Study. Journal of medical Internet research Tariq, Q., Fleming, S. L., Schwartz, J. N., Dunlap, K., Corbin, C., Washington, P., Kalantarian, H., Khan, N. Z., Darmstadt, G. L., Wall, D. P. 2019; 21 (4): e13822


    Autism spectrum disorder (ASD) is currently diagnosed using qualitative methods that measure between 20 and 100 behaviors, can span multiple appointments with trained clinicians, and take several hours to complete. In our previous work, we demonstrated the efficacy of machine learning classifiers to accelerate the process by collecting home videos of US-based children, identifying a reduced subset of behavioral features that are scored by untrained raters using a machine learning classifier to determine children's "risk scores" for autism. We achieved an accuracy of 92% (95% CI 88%-97%) on US videos using a classifier built on five features. Using videos of Bangladeshi children collected from Dhaka Shishu Children's Hospital, we aim to scale our pipeline to another culture and other developmental delays, including speech and language conditions. Although our previously published and validated pipeline and set of classifiers perform reasonably well on Bangladeshi videos (75% accuracy, 95% CI 71%-78%), this work improves on that accuracy through the development and application of a powerful new technique for adaptive aggregation of crowdsourced labels. We enhance both the utility and performance of our model by building two classification layers: The first layer distinguishes between typical and atypical behavior, and the second layer distinguishes between ASD and non-ASD. In each of the layers, we use a unique rater weighting scheme to aggregate classification scores from different raters based on their expertise. We also determine Shapley values for the most important features in the classifier to understand how the classifiers' process aligns with clinical intuition. Using these techniques, we achieved an accuracy (area under the curve [AUC]) of 76% (SD 3%) and sensitivity of 76% (SD 4%) for identifying atypical children from among developmentally delayed children, and an accuracy (AUC) of 85% (SD 5%) and sensitivity of 76% (SD 6%) for identifying children with ASD from those predicted to have other developmental delays. These results show promise for using a mobile video-based and machine learning-directed approach for early and remote detection of autism in Bangladeshi children. This strategy could provide important resources for developmental health in developing countries with few clinical resources for diagnosis, helping children get access to care at an early age. Future research aimed at extending the application of this approach to identify a range of other conditions and determine the population-level burden of developmental disabilities and impairments will be of high value.

    View details for PubMedID 31017583
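
    The rater-weighted aggregation and two-layer decision described in the abstract above can be illustrated with a minimal sketch. The abstract does not specify the weighting scheme, so everything here is an assumption made for illustration: each rater has a single reliability weight, scores are combined as a weighted mean, both layers use a 0.5 threshold, and all function and variable names are hypothetical. It is not the authors' implementation.

```python
def aggregate_scores(scores, weights):
    """Weighted mean of per-rater classifier scores.

    scores  -- dict mapping rater id to a score in [0, 1]
    weights -- dict mapping rater id to a non-negative reliability weight
               (hypothetical; the abstract does not give the actual scheme)
    """
    total = sum(weights[r] for r in scores)
    if total <= 0:
        raise ValueError("at least one rater must have a positive weight")
    return sum(scores[r] * weights[r] for r in scores) / total


def two_layer_label(atypical_score, asd_score, threshold=0.5):
    """Layer 1: typical vs. atypical. Layer 2 (atypical only): ASD vs. non-ASD."""
    if atypical_score < threshold:
        return "typical"
    return "ASD" if asd_score >= threshold else "atypical, non-ASD"


# Made-up scores from three raters for one video, one dict per layer.
weights = {"r1": 0.9, "r2": 0.6, "r3": 0.5}
layer1 = aggregate_scores({"r1": 0.9, "r2": 0.4, "r3": 0.7}, weights)
layer2 = aggregate_scores({"r1": 0.2, "r2": 0.6, "r3": 0.3}, weights)
label = two_layer_label(layer1, layer2)
```

    In this toy case the more heavily weighted rater pulls the first-layer score above the threshold, and the second layer is consulted only because the first layer flagged the video as atypical.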

  • Outgroup Machine Learning Approach Identifies Single Nucleotide Variants in Noncoding DNA Associated with Autism Spectrum Disorder Varma, M., Paskov, K., Jung, J., Chrisman, B., Stockham, N., Washington, P., Wall, D., Altman, R. B., Dunker, A. K., Hunter, L., Ritchie, M. D., Murray, T., Klein, T. E. WORLD SCIENTIFIC PUBL CO PTE LTD. 2019: 260–71
  • Coalitional Game Theory Facilitates Identification of Non-Coding Variants Associated With Autism. Biomedical informatics insights Sun, M. W., Gupta, A., Varma, M., Paskov, K. M., Jung, J., Stockham, N. T., Wall, D. P. 2019; 11: 1178222619832859


    Studies on autism spectrum disorder (ASD) have amassed substantial evidence for the role of genetics in the disease's phenotypic manifestation. A large number of coding and non-coding variants with low penetrance likely act in a combinatorial manner to explain the variable forms of ASD. However, many of these combined interactions, both additive and epistatic, remain undefined. Coalitional game theory (CGT) is an approach that seeks to identify players (individual genetic variants or genes) who tend to improve the performance (association to a disease phenotype of interest) of any coalition (subset of co-occurring genetic variants) they join. This method has been previously applied to boost biologically informative signal from gene expression data and exome sequencing data but remains to be explored in the context of cooperativity among non-coding genomic regions. We describe our extension of previous work, highlighting non-coding chromosomal regions relevant to ASD using CGT on alteration data of 4595 fully sequenced genomes from 756 multiplex families. Genomes were encoded into binary matrices for three types of non-coding regions previously implicated in ASD and separated into ASD (case) and unaffected (control) samples. A player metric, the Shapley value, enabled determination of individual variant contributions in both sets of cohorts. A total of 30 non-coding positions were found to have significantly elevated player scores and likely represent significant contributors to the genetic coordination underlying ASD. Cross-study analyses revealed that a subset of mutated non-coding regions (all of which are in human accelerated regions (HARs)) and related genes are involved in biological pathways or behavioral outcomes known to be affected in autism, suggesting the importance of single nucleotide polymorphisms (SNPs) within HARs in ASD. These findings support the use of CGT in identifying hidden yet influential non-coding players from large-scale genomic data, to better understand the precise underpinnings of complex neurodevelopmental disorders such as autism.

    View details for PubMedID 30886520
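
    The Shapley value mentioned in the abstract above assigns each player its average marginal contribution over all orders in which players can join a coalition. The following is a generic, exact computation on a three-player toy game, offered only as a sketch of the concept. The toy value function and the names `shapley_values` and `toy_value` are hypothetical, and the study's own game formulation on binary genome matrices is not reproduced here.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    players -- list of hashable player ids
    value   -- function mapping a frozenset of players to a number
    Enumerates every ordering, so it is only practical for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_value(coalition):
    """Hypothetical game: a coalition scores 1 only if variants A and B co-occur."""
    return 1.0 if {"A", "B"} <= coalition else 0.0


phi = shapley_values(["A", "B", "C"], toy_value)
```

    Here `A` and `B` split the credit equally and `C`, which never changes a coalition's value, receives zero; the values sum to the value of the full coalition, as Shapley values always do.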

  • Mobile detection of autism through machine learning on home video: A development and prospective validation study. PLoS medicine Tariq, Q., Daniels, J., Schwartz, J. N., Washington, P., Kalantarian, H., Wall, D. P. 2018; 15 (11): e1002705


    BACKGROUND: The standard approaches to diagnosing autism spectrum disorder (ASD) evaluate between 20 and 100 behaviors and take several hours to complete. This has in part contributed to long wait times for a diagnosis and subsequent delays in access to therapy. We hypothesize that the use of machine learning analysis on home video can speed the diagnosis without compromising accuracy. We have analyzed item-level records from 2 standard diagnostic instruments to construct machine learning classifiers optimized for sparsity, interpretability, and accuracy. In the present study, we prospectively test whether the features from these optimized models can be extracted by blinded nonexpert raters from 3-minute home videos of children with and without ASD to arrive at a rapid and accurate machine learning autism classification. METHODS AND FINDINGS: We created a mobile web portal for video raters to assess 30 behavioral features (e.g., eye contact, social smile) that are used by 8 independent machine learning models for identifying ASD, each with >94% accuracy in cross-validation testing and subsequent independent validation from previous work. We then collected 116 short home videos of children with autism (mean age = 4 years 10 months, SD = 2 years 3 months) and 46 videos of typically developing children (mean age = 2 years 11 months, SD = 1 year 2 months). Three raters blind to the diagnosis independently measured each of the 30 features from the 8 models, with a median time to completion of 4 minutes. Although several models (consisting of alternating decision trees, support vector machine [SVM], logistic regression (LR), radial kernel, and linear SVM) performed well, a sparse 5-feature LR classifier (LR5) yielded the highest accuracy (area under the curve [AUC]: 92% [95% CI 88%-97%]) across all ages tested. We used a prospectively collected independent validation set of 66 videos (33 ASD and 33 non-ASD) and 3 independent rater measurements to validate the outcome, achieving lower but comparable accuracy (AUC: 89% [95% CI 81%-95%]). Finally, we applied LR to the 162-video-feature matrix to construct an 8-feature model, which achieved 0.93 AUC (95% CI 0.90-0.97) on the held-out test set and 0.86 on the validation set of 66 videos. Validation on children with an existing diagnosis limited the ability to generalize the performance to undiagnosed populations. CONCLUSIONS: These results support the hypothesis that feature tagging of home videos for machine learning classification of autism can yield accurate outcomes in short time frames, using mobile devices. Further work will be needed to confirm that this approach can accelerate autism diagnosis at scale.

    View details for PubMedID 30481180
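
    The abstract above reports classifier performance as area under the curve (AUC). As a reading aid, AUC can be computed as the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as one half. The sketch below uses made-up scores and a hypothetical function name `auc`; it illustrates the metric only and does not reproduce the study's data or its confidence-interval procedure.

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up classifier scores: three ASD videos and two non-ASD videos.
asd_scores = [0.9, 0.8, 0.4]
non_asd_scores = [0.7, 0.3]
result = auc(asd_scores, non_asd_scores)  # 5 of the 6 pairs are ranked correctly
```

    A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.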

  • Exploratory study examining the at-home feasibility of a wearable tool for social-affective learning in children with autism. NPJ digital medicine Daniels, J., Schwartz, J. N., Voss, C., Haber, N., Fazel, A., Kline, A., Washington, P., Feinstein, C., Winograd, T., Wall, D. P. 2018; 1: 32


    Although standard behavioral interventions for autism spectrum disorder (ASD) are effective therapies for social deficits, they face criticism for being time-intensive and overdependent on specialists. Earlier starting age of therapy is a strong predictor of later success, but waitlists for therapies can be 18 months long. To address these complications, we developed Superpower Glass, a machine-learning-assisted software system that runs on Google Glass and an Android smartphone, designed for use during social interactions. This pilot exploratory study examines our prototype tool's potential for social-affective learning for children with autism. We sent our tool home with 14 families and assessed changes from intake to conclusion through the Social Responsiveness Scale (SRS-2), a facial affect recognition task (EGG), and qualitative parent reports. A repeated-measures one-way ANOVA demonstrated a decrease in SRS-2 total scores by an average 7.14 points (F(1,13) = 33.20, p < .001, higher scores indicate higher ASD severity). EGG scores also increased by an average 9.55 correct responses (F(1,10) = 11.89, p < .01). Parents reported increased eye contact and greater social acuity. This feasibility study supports using mobile technologies for potential therapeutic purposes.

    View details for DOI 10.1038/s41746-018-0035-3

    View details for PubMedID 31304314

    View details for PubMedCentralID PMC6550272

  • Machine learning approach for early detection of autism by combining questionnaire and home video screening JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Abbas, H., Garberson, F., Glover, E., Wall, D. P. 2018; 25 (8): 1000–1007


    Existing screening tools for early detection of autism are expensive, cumbersome, time-intensive, and sometimes fall short in predictive value. In this work, we sought to apply Machine Learning (ML) to gold standard clinical data obtained across thousands of children at-risk for autism spectrum disorder to create a low-cost, quick, and easy to apply autism screening tool. Two algorithms are trained to identify autism, one based on short, structured parent-reported questionnaires and the other on tagging key behaviors from short, semi-structured home videos of children. A combination algorithm is then used to combine the results into a single assessment of higher accuracy. To overcome the scarcity, sparsity, and imbalance of training data, we apply novel feature selection, feature engineering, and feature encoding techniques. We allow for inconclusive determination where appropriate in order to boost screening accuracy when conclusive. The performance is then validated in a controlled clinical study. A multi-center clinical study of n = 162 children is performed to ascertain the performance of these algorithms and their combination. We demonstrate a significant accuracy improvement over standard screening tools in measurements of AUC, sensitivity, and specificity. These findings suggest that a mobile, machine learning process is a reliable method for detection of autism outside of clinical settings. A variety of confounding factors in the clinical analysis are discussed along with the solutions engineered into the algorithms. Final results are statistically limited and will benefit from future clinical studies to extend the sample size.

    View details for PubMedID 29741630

  • Brain-specific functional relationship networks inform autism spectrum disorder gene prediction TRANSLATIONAL PSYCHIATRY Duda, M., Zhang, H., Li, H., Wall, D. P., Burmeister, M., Guan, Y. 2018; 8: 56


    Autism spectrum disorder (ASD) is a neuropsychiatric disorder with strong evidence of genetic contribution, and increased research efforts have resulted in an ever-growing list of ASD candidate genes. However, only a fraction of the hundreds of nominated ASD-related genes have identified de novo or transmitted loss of function (LOF) mutations that can be directly attributed to the disorder. For this reason, a means of prioritizing candidate genes for ASD would help filter out false-positive results and allow researchers to focus on genes that are more likely to be causative. Here we constructed a machine learning model by leveraging a brain-specific functional relationship network (FRN) of genes to produce a genome-wide ranking of ASD risk genes. We rigorously validated our gene ranking using results from two independent sequencing experiments, together representing over 5000 simplex and multiplex ASD families. Finally, through functional enrichment analysis on our highly prioritized candidate gene network, we identified a small number of pathways that are key in early neural development, providing further support for their potential role in ASD.

    View details for PubMedID 29507298

  • Feasibility Testing of a Wearable Behavioral Aid for Social Learning in Children with Autism APPLIED CLINICAL INFORMATICS Daniels, J., Haber, N., Voss, C., Schwartz, J., Tamura, S., Fazel, A., Kline, A., Washington, P., Phillips, J., Winograd, T., Feinstein, C., Wall, D. P. 2018; 9 (1): 129–40


    Recent advances in computer vision and wearable technology have created an opportunity to introduce mobile therapy systems for autism spectrum disorders (ASD) that can respond to the increasing demand for therapeutic interventions; however, feasibility questions must be answered first. We studied the feasibility of a prototype therapeutic tool for children with ASD using Google Glass, examining whether children with ASD would wear such a device, if providing the emotion classification will improve emotion recognition, and how emotion recognition differs between ASD participants and neurotypical controls (NC). We ran a controlled laboratory experiment with 43 children: 23 with ASD and 20 NC. Children identified static facial images on a computer screen with one of 7 emotions in 3 successive batches: the first with no information about emotion provided to the child, the second with the correct classification from the Glass labeling the emotion, and the third again without emotion information. We then trained a logistic regression classifier on the emotion confusion matrices generated by the two information-free batches to predict ASD versus NC. All 43 children were comfortable wearing the Glass. ASD and NC participants who completed the computer task with Glass providing audible emotion labeling (n = 33) showed increased accuracies in emotion labeling, and the logistic regression classifier achieved an accuracy of 72.7%. Further analysis suggests that the ability to recognize surprise, fear, and neutrality may distinguish ASD cases from NC. This feasibility study supports the utility of a wearable device for social affective learning in ASD children and demonstrates subtle differences in how ASD and NC children perform on an emotion recognition task.

    View details for DOI 10.1055/s-0038-1626727

    View details for Web of Science ID 000428690000006

    View details for PubMedID 29466819

    View details for PubMedCentralID PMC5821509

  • A Gamified Mobile System for Crowdsourcing Video for Autism Research Kalantarian, H., Washington, P., Schwartz, J., Daniels, J., Haber, N., Wall, D., IEEE Comp Soc IEEE COMPUTER SOC. 2018: 350-352
  • Coalitional game theory as a promising approach to identify candidate autism genes. Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing Gupta, A., Sun, M. W., Paskov, K. M., Stockham, N. T., Jung, J., Wall, D. P. 2018; 23: 436–47


    Despite mounting evidence for the strong role of genetics in the phenotypic manifestation of Autism Spectrum Disorder (ASD), the specific genes responsible for the variable forms of ASD remain undefined. ASD may be best explained by a combinatorial genetic model with varying epistatic interactions across many small effect mutations. Coalitional or cooperative game theory is a technique that studies the combined effects of groups of players, known as coalitions, seeking to identify players who tend to improve the performance (the relationship to a specific disease phenotype) of any coalition they join. This method has been previously shown to boost biologically informative signal in gene expression data but to date has not been applied to the search for cooperative mutations among putative ASD genes. We describe our approach to highlight genes relevant to ASD using coalitional game theory on alteration data of 1,965 fully sequenced genomes from 756 multiplex families. Alterations were encoded into binary matrices for ASD (case) and unaffected (control) samples, indicating likely gene-disrupting, inherited mutations in altered genes. To determine individual gene contributions given an ASD phenotype, a "player" metric, referred to as the Shapley value, was calculated for each gene in the case and control cohorts. Sixty-seven genes were found to have significantly elevated player scores and likely represent significant contributors to the genetic coordination underlying ASD. Using network and cross-study analysis, we found that these genes are involved in biological pathways known to be affected in the autism cases and that a subset directly interact with several genes known to have strong associations to autism. These findings suggest that coalitional game theory can be applied to large-scale genomic data to identify hidden yet influential players in complex polygenic disorders such as autism.

    View details for PubMedID 29218903

  • Analysis of Sex and Recurrence Ratios in Simplex and Multiplex Autism Spectrum Disorder Implicates Sex-Specific Alleles as Inheritance Mechanism Chrisman, B., Varma, M., Washington, P., Paskov, K., Stockham, N., Jung, J., Wall, D. P., Zheng, H., Callejas, Z., Griol, D., Wang, H., Hu, Schmidt, H., Baumbach, J., Dickerson, J., Zhang, L. IEEE. 2018: 1470–77
  • A Low Rank Model for Phenotype Imputation in Autism Spectrum Disorder. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science Paskov, K. M., Wall, D. P. 2018; 2017: 178–87


    Autism Spectrum Disorder is a highly heterogeneous condition currently diagnosed using behavioral symptoms. A better understanding of the phenotypic subtypes of autism is a necessary component of the larger goal of mapping autism genotype to phenotype. However, as with most clinical records describing human disease, the phenotypic data available for autism contains varying levels of noise and incompleteness that complicate analysis. Here we analyze behavioral data from 16,291 subjects using 250 items from three gold standard diagnostic instruments. We apply a low-rank model to impute missing entries and entire missing instruments with high fidelity, showing that we can complete clinical records for all subjects. Finally, we analyze the low-rank representation of our subjects to identify plausible subtypes of autism, setting the stage for genome-to-phenome prediction experiments. These procedures can be adapted and used with other similarly structured clinical records to enable a more complete mapping between genome and phenome.

    View details for PubMedID 29888068

  • Sparsifying machine learning models identify stable subsets of predictive features for behavioral detection of autism MOLECULAR AUTISM Levy, S., Duda, M., Haber, N., Wall, D. P. 2017; 8: 65


    Autism spectrum disorder (ASD) diagnosis can be delayed due in part to the time required for administration of standard exams, such as the Autism Diagnostic Observation Schedule (ADOS). Shorter and potentially mobilized approaches would help to alleviate bottlenecks in the healthcare system. Previous work using machine learning suggested that a subset of the behaviors measured by ADOS can achieve clinically acceptable levels of accuracy. Here we expand on this initial work to build sparse models that have higher potential to generalize to the clinical population. We assembled a collection of score sheets for two ADOS modules, one for children with phrased speech (Module 2; 1319 ASD cases, 70 controls) and the other for children with verbal fluency (Module 3; 2870 ASD cases, 273 controls). We used sparsity/parsimony enforcing regularization techniques in a nested cross validation grid search to select features for 17 unique supervised learning models, encoding missing values as additional indicator features. We augmented our feature sets with gender and age to train minimal and interpretable classifiers capable of robust detection of ASD from non-ASD. By applying 17 unique supervised learning methods across 5 classification families tuned for sparse use of features and to be within 1 standard error of the optimal model, we find reduced sets of 10 and 5 features used in a majority of models. We tested the performance of the most interpretable of these sparse models, including Logistic Regression with L2 regularization or Linear SVM with L1 regularization. We obtained an area under the ROC curve of 0.95 for ADOS Module 3 and 0.93 for ADOS Module 2 with less than or equal to 10 features. The resulting models provide improved stability over previous machine learning efforts to minimize the time complexity of autism detection due to regularization and a small parameter space. These robustness techniques yield classifiers that are sparse, interpretable and that have potential to generalize to alternative modes of autism screening, diagnosis and monitoring, possibly including analysis of short home videos.

    View details for PubMedID 29270283

  • The GapMap project: a mobile surveillance system to map diagnosed autism cases and gaps in autism services globally MOLECULAR AUTISM Daniels, J., Schwartz, J., Albert, N., Du, M., Wall, D. P. 2017; 8: 55


    Although the number of autism diagnoses is on the rise, we have no evidence-based tracking of size and severity of gaps in access to autism-related resources, nor do we have methods to geographically triangulate the locations of the widest gaps in either the US or elsewhere across the globe. To combat these related issues of (1) mapping diagnosed cases of autism and (2) quantifying gaps in access to key intervention services, we have constructed a crowd-based mobile platform called "GapMap" for real-time tracking of autism prevalence and autism-related resources that can be accessed from any mobile device with cellular or wireless connectivity. Now in beta, our aim is for this Android/iOS compatible mobile tool to simultaneously crowd-enroll the massive and growing community of families with autism to capture geographic, diagnostic, and resource usage information while automatically computing prevalence at granular geographical scales to yield a more complete and dynamic understanding of autism resource epidemiology.

    View details for PubMedID 29075431

  • Human Genome Sequencing at the Population Scale: A Primer on High-Throughput DNA Sequencing and Analysis AMERICAN JOURNAL OF EPIDEMIOLOGY Goldfeder, R. L., Wall, D. P., Khoury, M. J., Ioannidis, J. A., Ashley, E. A. 2017; 186 (8): 1000–1009


    Most human diseases have underlying genetic causes. To better understand the impact of genes on disease and its implications for medicine and public health, researchers have pursued methods for determining the sequences of individual genes, then all genes, and now complete human genomes. Massively parallel high-throughput sequencing technology, where DNA is sheared into smaller pieces, sequenced, and then computationally reordered and analyzed, enables fast and affordable sequencing of full human genomes. As the price of sequencing continues to decline, more and more individuals are having their genomes sequenced. This may facilitate better population-level disease subtyping and characterization, as well as individual-level diagnosis and personalized treatment and prevention plans. In this review, we describe several massively parallel high-throughput DNA sequencing technologies and their associated strengths, limitations, and error modes, with a focus on applications in epidemiologic research and precision medicine. We detail the methods used to computationally process and interpret sequence data to inform medical or preventative action.

    View details for DOI 10.1093/aje/kww224

    View details for Web of Science ID 000412798300013

    View details for PubMedID 29040395

  • DESIGN AND EFFICACY OF A WEARABLE DEVICE FOR SOCIAL AFFECTIVE LEARNING IN CHILDREN WITH AUTISM Daniels, J., Schwartz, J., Haber, N., Voss, C., Kline, A., Fazel, A., Washington, P., De, T., Feinstein, C., Winograd, T., Wall, D. ELSEVIER SCIENCE INC. 2017: S257
  • Crowdsourced validation of a machine-learning classification system for autism and ADHD. Translational psychiatry Duda, M., Haber, N., Daniels, J., Wall, D. P. 2017; 7 (5)


    Autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD) together affect >10% of the children in the United States, but considerable behavioral overlaps between the two disorders can often complicate differential diagnosis. Currently, there is no screening test designed to differentiate between the two disorders, and with waiting times from initial suspicion to diagnosis upwards of a year, methods to quickly and accurately assess risk for these and other developmental disorders are desperately needed. In a previous study, we found that four machine-learning algorithms were able to accurately (area under the curve (AUC)>0.96) distinguish ASD from ADHD using only a small subset of items from the Social Responsiveness Scale (SRS). Here, we expand upon our prior work by including a novel crowdsourced data set of responses to our predefined top 15 SRS-derived questions from parents of children with ASD (n=248) or ADHD (n=174) to improve our model's capability to generalize to new, 'real-world' data. By mixing these novel survey data with our initial archival sample (n=3417) and performing repeated cross-validation with subsampling, we created a classification algorithm that performs with AUC=0.89±0.01 using only 15 questions.

    View details for DOI 10.1038/tp.2017.86

    View details for PubMedID 28509905

  • Cross-disorder comparative analysis of comorbid conditions reveals novel autism candidate genes BMC GENOMICS Diaz-Beltran, L., Esteban, F. J., Varma, M., Ortuzk, A., David, M., Wall, D. P. 2017; 18


    Numerous studies have highlighted the elevated degree of comorbidity associated with autism spectrum disorder (ASD). These comorbid conditions may add further impairments to individuals with autism and are substantially more prevalent compared to neurotypical populations. These high rates of comorbidity are not surprising taking into account the overlap of symptoms that ASD shares with other pathologies. From a research perspective, this suggests common molecular mechanisms involved in these conditions. Therefore, identifying crucial genes in the overlap between ASD and these comorbid disorders may help unravel the common biological processes involved and, ultimately, shed some light on the understanding of autism etiology. In this work, we used a two-fold systems biology approach specially focused on biological processes and gene networks to conduct a comparative analysis of autism with 31 frequently comorbid disorders in order to define a multi-disorder subcomponent of ASD and predict new genes of potential relevance to ASD etiology. We validated our predictions by determining the significance of our candidate genes in high throughput transcriptome expression profiling studies. Using prior knowledge of disease-related biological processes and the interaction networks of the disorders related to autism, we identified a set of 19 genes not previously linked to ASD that were significantly differentially regulated in individuals with autism. In addition, these genes were of potential etiologic relevance to autism, given their enriched roles in neurological processes crucial for optimal brain development and function, learning and memory, cognition and social behavior. Taken together, our approach represents a novel perspective of autism from the point of view of related comorbid disorders and proposes a model by which prior knowledge of interaction networks may enlighten and focus the genome-wide search for autism candidate genes to better define the genetic heterogeneity of ASD.

    View details for DOI 10.1186/s12864-017-3667-9

    View details for PubMedID 28427329

  • Refining the role of de novo protein-truncating variants in neurodevelopmental disorders by using population reference samples. Nature genetics Kosmicki, J. A., Samocha, K. E., Howrigan, D. P., Sanders, S. J., Slowikowski, K., Lek, M., Karczewski, K. J., Cutler, D. J., Devlin, B., Roeder, K., Buxbaum, J. D., Neale, B. M., MacArthur, D. G., Wall, D. P., Robinson, E. B., Daly, M. J. 2017


    Recent research has uncovered an important role for de novo variation in neurodevelopmental disorders. Using aggregated data from 9,246 families with autism spectrum disorder, intellectual disability, or developmental delay, we found that ∼1/3 of de novo variants are independently present as standing variation in the Exome Aggregation Consortium's cohort of 60,706 adults, and these de novo variants do not contribute to neurodevelopmental risk. We further used a loss-of-function (LoF)-intolerance metric, pLI, to identify a subset of LoF-intolerant genes containing the observed signal of associated de novo protein-truncating variants (PTVs) in neurodevelopmental disorders. LoF-intolerant genes also carry a modest excess of inherited PTVs, although the strongest de novo-affected genes contribute little to this excess, thus suggesting that the excess of inherited risk resides in lower-penetrant genes. These findings illustrate the importance of population-based reference cohorts for the interpretation of candidate pathogenic variants, even for analyses of complex diseases and de novo variation.

    View details for DOI 10.1038/ng.3789

    View details for PubMedID 28191890

  • MC-GenomeKey: a multicloud system for the detection and annotation of genomic variants. BMC bioinformatics Elshazly, H., Souilmi, Y., Tonellato, P. J., Wall, D. P., Abouelhoda, M. 2017; 18 (1): 49


    Next Generation Genome sequencing techniques have become affordable for massive sequencing efforts devoted to clinical characterization of human diseases. However, the cost of providing cloud-based data analysis of the mounting datasets remains a concerning bottleneck for providing cost-effective clinical services. To address this computational problem, it is important to optimize the variant analysis workflow and the analysis tools used to reduce the overall computational processing time, and concomitantly reduce the processing cost. Furthermore, it is important to capitalize on recent developments in the cloud computing market, which has witnessed more providers competing in terms of products and prices. In this paper, we present a new package called MC-GenomeKey (Multi-Cloud GenomeKey) that efficiently executes the variant analysis workflow for detecting and annotating mutations using cloud resources from different commercial cloud providers. Our package supports Amazon, Google, and Azure clouds, as well as any other cloud platform based on OpenStack. Our package allows different scenarios of execution with different levels of sophistication, up to the one where a workflow can be executed using a cluster whose nodes come from different clouds. MC-GenomeKey also supports scenarios to exploit the spot instance model of Amazon in combination with the use of other cloud platforms to provide significant cost reduction. To the best of our knowledge, this is the first solution that optimizes the execution of the workflow using computational resources from different cloud providers. MC-GenomeKey provides an efficient multicloud based solution to detect and annotate mutations. The package can run in different commercial cloud platforms, which enables the user to seize the best offers. The package also provides a reliable means to make use of the low-cost spot instance model of Amazon, as it provides an efficient solution to the sudden termination of spot machines as a result of a sudden price increase. The package has a web-interface and it is available for free for academic use.

    View details for DOI 10.1186/s12859-016-1454-2

    View details for PubMedID 28107819

    View details for PubMedCentralID PMC5248509

  • Machine learning for early detection of autism (and other conditions) using a parental questionnaire and home video screening Abbas, H., Garberson, F., Glover, E., Wall, D. P., Nie, J. Y., Obradovic, Z., Suzumura, T., Ghosh, R., Nambiar, R., Wang, C., Zang, H., BaezaYates, R., Hu, Kepner, J., Cuzzocrea, A., Tang, J., Toyoda, M. IEEE. 2017: 3558–61
  • GapMap: Enabling Comprehensive Autism Resource Epidemiology. JMIR public health and surveillance Albert, N., Daniels, J., Schwartz, J., Du, M., Wall, D. P. 2017; 3 (2): e27


    For individuals with autism spectrum disorder (ASD), finding resources can be a lengthy and difficult process. The difficulty in obtaining global, fine-grained autism epidemiological data hinders researchers from quickly and efficiently studying large-scale correlations among ASD, environmental factors, and geographical and cultural factors. The objective of this study was to define resource load and resource availability for families affected by autism and subsequently create a platform to enable a more accurate representation of prevalence rates and resource epidemiology. We created a mobile application, GapMap, to collect locational, diagnostic, and resource use information from individuals with autism to compute accurate prevalence rates and better understand autism resource epidemiology. GapMap is hosted on AWS S3, running on a React and Redux front-end framework. The backend framework comprises an AWS API Gateway and Lambda Function setup, with secure and scalable endpoints for retrieving prevalence and resource data, and for submitting participant data. Measures of autism resource scarcity, including resource load, resource availability, and resource gaps, were defined and preliminarily computed using simulated or scraped data. The average distance from an individual in the United States to the nearest diagnostic center is approximately 182 km (50 miles), with a standard deviation of 235 km (146 miles). The average distance from an individual with ASD to the nearest diagnostic center, however, is only 32 km (20 miles), suggesting that individuals who live closer to diagnostic services are more likely to be diagnosed. This study confirmed that individuals closer to diagnostic services are more likely to be diagnosed and proposes GapMap, a means to measure and enable the alleviation of increasingly overburdened diagnostic centers and resource-poor areas where parents are unable to diagnose their children as quickly and easily as needed. GapMap will collect information that will provide more accurate data for computing resource loads and availability, uncovering the impact of resource epidemiology on age and likelihood of diagnosis, and gathering localized autism prevalence rates.

    View details for DOI 10.2196/publichealth.7150

    View details for PubMedID 28473303

    View details for PubMedCentralID PMC5438459
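
    The "average distance to the nearest diagnostic center" measure described in the GapMap abstract can be sketched as below. This is a minimal illustration, not the authors' implementation: the haversine great-circle formula, the function names, and the coordinates are all assumptions introduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_center_km(person, centers):
    """Distance from one (lat, lon) location to the closest center."""
    return min(haversine_km(*person, *center) for center in centers)


def mean_nearest_distance_km(people, centers):
    """Average, over all people, of the distance to their nearest center."""
    return sum(nearest_center_km(p, centers) for p in people) / len(people)


# Hypothetical coordinates, for illustration only.
centers = [(37.43, -122.17), (42.34, -71.10)]
people = [(37.77, -122.42), (40.71, -74.01)]
print(round(mean_nearest_distance_km(people, centers), 1))
```

    Comparing this average for the general population against the same average for diagnosed individuals is the kind of contrast the abstract reports (182 km versus 32 km).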

  • Can we accelerate autism discoveries through crowdsourcing? RESEARCH IN AUTISM SPECTRUM DISORDERS David, M. M., Babineau, B. A., Wall, D. P. 2016; 32: 80-83
  • Comorbid Analysis of Genes Associated with Autism Spectrum Disorders Reveals Differential Evolutionary Constraints PLOS ONE David, M. M., Enard, D., Ozturk, A., Daniels, J., Jung, J., Diaz-Beltran, L., Wall, D. P. 2016; 11 (7)


    The burden of comorbidity in Autism Spectrum Disorder (ASD) is substantial. The symptoms of autism overlap with many other human conditions, reflecting common molecular pathologies and suggesting that cross-disorder analysis will help prioritize autism gene candidates. Genes in the intersection between autism and related conditions may represent nonspecific indicators of dysregulation, while genes unique to autism may play a more causal role. Thorough literature review allowed us to extract 125 ICD-9 codes comorbid to ASD that we mapped to 30 specific human disorders. In the present work, we performed an automated extraction of genes associated with ASD and its comorbid disorders, and found 1031 genes involved in ASD, among which 262 are involved in ASD only, with the remaining 779 involved in ASD and at least one comorbid disorder. A pathway analysis revealed 13 pathways not involved in any other comorbid disorders and therefore unique to ASD, all associated with basal cellular functions. These pathways differ from the pathways associated with both ASD and its comorbid conditions, with the latter being more specific to neural function. To determine whether the sequences of these genes have been subjected to differential evolutionary constraints, we studied long-term constraints by looking into Genomic Evolutionary Rate Profiling, and showed that genes involved in several comorbid disorders seem to have undergone more purifying selection than the genes involved in ASD only. This result was corroborated by a higher dN/dS ratio for genes unique to ASD as compared to those that are shared between ASD and its comorbid disorders. Short-term evolutionary constraints showed the same trend, as the pN/pS ratio indicates that genes unique to ASD were under significantly less evolutionary constraint than the genes associated with all other disorders.

    View details for DOI 10.1371/journal.pone.0157937

    View details for PubMedID 27414027

  • Clinical Evaluation of a Novel and Mobile Autism Risk Assessment JOURNAL OF AUTISM AND DEVELOPMENTAL DISORDERS Duda, M., Daniels, J., Wall, D. P. 2016; 46 (6): 1953-1961


    The Mobile Autism Risk Assessment (MARA) is a new, electronically administered, 7-question autism spectrum disorder (ASD) screen to triage those at highest risk for ASD. Children 16 months-17 years (N = 222) were screened during their first visit in a developmental-behavioral pediatric clinic. MARA scores were compared to diagnosis from the clinical encounter. Participant median age was 5.8 years, 76.1 % were male, and most participants had an intelligence/developmental quotient score >85; 69 of the participants (31 %) received a clinical diagnosis of ASD. The sensitivity of the MARA in detecting ASD was 89.9 % [95 % CI = 82.7-97]; the specificity was 79.7 % [95 % CI = 73.4-86.1]. In a high-risk clinical setting, the MARA shows promise as a screen to distinguish ASD from other developmental/behavioral disorders.

    View details for DOI 10.1007/s10803-016-2718-4

    View details for Web of Science ID 000376100200007

    View details for PubMedID 26873142

    View details for PubMedCentralID PMC4860199
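
    The sensitivity and specificity figures in the MARA abstract can be reproduced with a short calculation. The sketch below rests on assumptions: the counts of 62 true positives (of 69 ASD cases) and 122 true negatives (of 153 non-ASD cases) are back-calculated here from the reported percentages and do not appear in the abstract, and the normal-approximation (Wald) interval is an assumed method that happens to match the reported bounds.

```python
import math


def proportion_with_wald_ci(successes, total, z=1.96):
    """Return (estimate, lower, upper) using a normal-approximation 95% CI."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width


n_total = 222            # from the abstract
n_asd = 69               # from the abstract
n_non_asd = n_total - n_asd

true_pos = 62            # assumed: inferred from 89.9% sensitivity
true_neg = 122           # assumed: inferred from 79.7% specificity

sens = proportion_with_wald_ci(true_pos, n_asd)
spec = proportion_with_wald_ci(true_neg, n_non_asd)

print("sensitivity: %.1f%% [%.1f, %.1f]" % tuple(100 * x for x in sens))
print("specificity: %.1f%% [%.1f, %.1f]" % tuple(100 * x for x in spec))
```

    Under these assumed counts the output agrees with the abstract's 89.9 % [82.7-97] and 79.7 % [73.4-86.1].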

  • Automated integration of continuous glucose monitor data in the electronic health record using consumer technology. Journal of the American Medical Informatics Association Kumar, R. B., Goren, N. D., Stark, D. E., Wall, D. P., Longhurst, C. A. 2016; 23 (3): 532-537


    The diabetes healthcare provider plays a key role in interpreting blood glucose trends, but few institutions have successfully integrated patient home glucose data in the electronic health record (EHR). Published implementations to date have required custom interfaces, which limit wide-scale replication. We piloted automated integration of continuous glucose monitor data in the EHR using widely available consumer technology for 10 pediatric patients with insulin-dependent diabetes. Establishment of a passive data communication bridge via a patient's/parent's smartphone enabled automated integration and analytics of patient device data within the EHR between scheduled clinic visits. It is feasible to utilize available consumer technology to assess and triage home diabetes device data within the EHR, and to engage patients/parents and improve healthcare provider workflow.

    View details for DOI 10.1093/jamia/ocv206

    View details for PubMedID 27018263

  • Characterisation of agricultural drainage ditch sediments along the phosphorus transfer continuum in two contrasting headwater catchments JOURNAL OF SOILS AND SEDIMENTS Shore, M., Jordan, P., Mellander, P., Kelly-Quinn, M., Daly, K., Sims, J. T., Wall, D. P., Melland, A. R. 2016; 16 (5): 1643-1654
  • A research roadmap for next-generation sequencing informatics SCIENCE TRANSLATIONAL MEDICINE Altman, R. B., Prabhu, S., Sidow, A., Zook, J. M., Goldfeder, R., Litwack, D., Ashley, E., Asimenos, G., Bustamante, C. D., Donigan, K., Giacomini, K. M., Johansen, E., Khuri, N., Lee, E., Liang, X. S., Salit, M., Serang, O., Tezak, Z., Wall, D. P., Mansfield, E., Kass-Hout, T. 2016; 8 (335)


    Next-generation sequencing technologies are fueling a wave of new diagnostic tests. Progress on a key set of nine research challenge areas will help generate the knowledge required to advance effectively these diagnostics to the clinic.

    View details for DOI 10.1126/scitranslmed.aaf7314

    View details for PubMedID 27099173

  • A Complex Systems Approach to Causal Discovery in Psychiatry PLOS ONE Saxe, G. N., Statnikov, A., Fenyo, D., Ren, J., Li, Z., Prasad, M., Wall, D., Bergman, N., Briggs, E. C., Aliferis, C. 2016; 11 (3)


    Conventional research methodologies and data analytic approaches in psychiatric research are unable to reliably infer causal relations without experimental designs, or to make inferences about the functional properties of the complex systems in which psychiatric disorders are embedded. This article describes a series of studies to validate a novel hybrid computational approach, the Complex Systems-Causal Network (CS-CN) method, designed to integrate causal discovery within a complex systems framework for psychiatric research. The CS-CN method was first applied to an existing dataset on psychopathology in 163 children hospitalized with injuries (validation study). Next, it was applied to a much larger dataset of traumatized children (replication study). Finally, the CS-CN method was applied in a controlled experiment using a 'gold standard' dataset for causal discovery and compared with other methods for accurately detecting causal variables (resimulation controlled experiment). The CS-CN method successfully detected a causal network of 111 variables and 167 bivariate relations in the initial validation study. This causal network had well-defined adaptive properties and a set of variables was found that disproportionally contributed to these properties. Modeling the removal of these variables resulted in significant loss of adaptive properties. The CS-CN method was successfully applied in the replication study and performed better than traditional statistical methods, and similarly to state-of-the-art causal discovery algorithms in the causal detection experiment. The CS-CN method was validated, replicated, and yielded both novel and previously validated findings related to risk factors and potential treatments of psychiatric disorders. The novel approach yields both fine-grain (micro) and high-level (macro) insights and thus represents a promising approach for complex systems-oriented research in psychiatry.

    View details for DOI 10.1371/journal.pone.0151174

    View details for Web of Science ID 000373116500019

    View details for PubMedID 27028297

    View details for PubMedCentralID PMC4814084

  • A common molecular signature in ASD gene expression: following Root 66 to autism TRANSLATIONAL PSYCHIATRY Diaz-Beltran, L., Esteban, F. J., Wall, D. P. 2016; 6


    Several gene expression experiments on autism spectrum disorders have been conducted using both blood and brain tissue. Individually, these studies have advanced our understanding of the molecular systems involved in the molecular pathology of autism and have formed the bases of ongoing work to build autism biomarkers. In this study, we conducted an integrated systems biology analysis of 9 independent gene expression experiments covering 657 autism, 9 mental retardation and developmental delay and 566 control samples to determine if a common signature exists and to test whether regulatory patterns in the brain relevant to autism can also be detected in blood. We constructed a matrix of differentially expressed genes from these experiments and used a Jaccard coefficient to create a gene-based phylogeny, validated by bootstrap. As expected, experiments and tissue types clustered together with high statistical confidence. However, we discovered a statistically significant subgrouping of 3 blood and 2 brain data sets from 3 different experiments rooted by a highly correlated regulatory pattern of 66 genes. This Root 66 appeared to be non-random and of potential etiologic relevance to autism, given their enriched roles in neurological processes key for normal brain growth and function, learning and memory, neurodegeneration, social behavior and cognition. Our results suggest that there is a detectable autism signature in the blood that may be a molecular echo of autism-related dysregulation in the brain.

    View details for DOI 10.1038/tp.2015.112

    View details for Web of Science ID 000368549500005

    View details for PubMedID 26731442
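
    The Root 66 abstract describes comparing experiments through a Jaccard coefficient over their differentially expressed genes. A minimal sketch of that comparison step follows; the experiment names and gene labels are hypothetical, and the tree building and bootstrap validation mentioned in the abstract are not shown.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections of gene names."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def jaccard_distance_matrix(gene_sets):
    """Pairwise distance (1 - Jaccard) between named gene sets."""
    names = sorted(gene_sets)
    return {(x, y): 1.0 - jaccard(gene_sets[x], gene_sets[y])
            for x in names for y in names}


# Hypothetical differentially expressed gene lists, for illustration only.
experiments = {
    "blood_1": {"GENE_A", "GENE_B", "GENE_C"},
    "brain_1": {"GENE_B", "GENE_C", "GENE_D"},
    "blood_2": {"GENE_E"},
}

dist = jaccard_distance_matrix(experiments)
print(dist[("blood_1", "brain_1")])  # 0.5 (2 shared genes out of 4 in total)
print(dist[("blood_1", "blood_2")])  # 1.0 (no shared genes)
```

    A distance matrix of this shape is what a standard clustering or phylogeny routine would take as input.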

  • Use of machine learning for behavioral distinction of autism and ADHD. Translational psychiatry Duda, M., Ma, R., Haber, N., Wall, D. P. 2016; 6


    Although autism spectrum disorder (ASD) and attention deficit hyperactivity disorder (ADHD) continue to rise in prevalence, together affecting >10% of today's pediatric population, the methods of diagnosis remain subjective, cumbersome and time intensive. With gaps upward of a year between initial suspicion and diagnosis, valuable time where treatments and behavioral interventions could be applied is lost as these disorders remain undetected. Methods to quickly and accurately assess risk for these, and other, developmental disorders are necessary to streamline the process of diagnosis and provide families access to much-needed therapies sooner. Using forward feature selection, as well as undersampling and 10-fold cross-validation, we trained and tested six machine learning models on complete 65-item Social Responsiveness Scale score sheets from 2925 individuals with either ASD (n=2775) or ADHD (n=150). We found that five of the 65 behaviors measured by this screening tool were sufficient to distinguish ASD from ADHD with high accuracy (area under the curve=0.965). These results support the hypotheses that (1) machine learning can be used to discern between autism and ADHD with high accuracy and (2) this distinction can be made using a small number of commonly measured behaviors. Our findings show promise for use as an electronically administered, caregiver-directed resource for preliminary risk evaluation and/or pre-clinical screening and triage that could help to speed the diagnosis of these disorders.

    View details for DOI 10.1038/tp.2015.221

    View details for PubMedID 26859815

    View details for PubMedCentralID PMC4872425
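
    The class imbalance in the ASD/ADHD abstract (2775 versus 150 score sheets) is why undersampling is paired with 10-fold cross-validation. The sketch below shows one plausible way to set up those two steps with the standard library only; the function names are hypothetical, the folds are not stratified, and forward feature selection and model training are omitted.

```python
import random


def undersample(labels, seed=0):
    """Return sorted row indices with every class cut to the minority size."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    n_min = min(len(rows) for rows in by_class.values())
    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, n_min))
    return sorted(kept)


def k_fold_indices(indices, k=10, seed=0):
    """Shuffle the indices and yield (train, test) lists for each of k folds."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


labels = ["ASD"] * 2775 + ["ADHD"] * 150   # class sizes from the abstract
balanced = undersample(labels)
print(len(balanced))  # 300: 150 rows per class

for train, test in k_fold_indices(balanced, k=10):
    pass  # a model would be fit on `train` and scored on `test` here
```

    Repeating the undersampling with different seeds would use more of the majority class across runs, which is a common refinement but not something the abstract specifies.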

  • Superpower Glass: Delivering Unobtrusive Real-time Social Cues in Wearable Systems Voss, C., Washington, P., Haber, N., Kline, A., Daniels, J., Fazel, A., De, T., McCarthy, B., Feinstein, C., Winograd, T., Wall, D. ASSOC COMPUTING MACHINERY. 2016: 1218–26
  • A Practical Approach to Real-Time Neutral Feature Subtraction for Facial Expression Recognition Haber, N., Voss, C., Fazel, A., Winograd, T., Wall, D. P. IEEE. 2016
  • De novo mutations in autism implicate the synaptic elimination network. Pacific Symposium on Biocomputing Ram Venkataraman, G., O'Connell, C., Egawa, F., Kashef-Haghighi, D., Wall, D. P. 2016; 22: 521-532


    Autism has been shown to have a major genetic risk component; the architecture of documented autism in families has been over and again shown to be passed down for generations. While inherited risk plays an important role in the autistic nature of children, de novo (germline) mutations have also been implicated in autism risk. Here we find that autism de novo variants verified and published in the literature are Bonferroni-significantly enriched in a gene set implicated in synaptic elimination. Additionally, several of the genes in this synaptic elimination set that were enriched in protein-protein interactions (CACNA1C, SHANK2, SYNGAP1, NLGN3, NRXN1, and PTEN) have been previously confirmed as genes that confer risk for the disorder. The results demonstrate that autism-associated de novos are linked to proper synaptic pruning and density, hinting at the etiology of autism and suggesting pathophysiology for downstream correction and treatment.

    View details for PubMedID 27897003
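
    "Bonferroni-significantly enriched" in the abstract above means an enrichment p-value that survives division of the significance threshold by the number of gene sets tested. The sketch below illustrates the idea with a hypergeometric over-representation test, which is an assumed stand-in since the abstract does not name the test used; all numbers are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes without replacement from
    `population` genes, of which `successes` belong to the gene set."""
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(comb(successes, i) * comb(population - successes, draws - i)
               for i in range(k, upper + 1))
    return hits / total


def bonferroni_significant(p_value, n_tests, alpha=0.05):
    """True if p_value passes the Bonferroni-corrected threshold alpha / n_tests."""
    return p_value < alpha / n_tests


# Hypothetical numbers: 20000 genes, a 100-gene set, 300 genes carrying
# de novo variants, 8 of which fall in the set, 1000 gene sets tested.
p = hypergeom_sf(k=8, population=20000, successes=100, draws=300)
print(p, bonferroni_significant(p, n_tests=1000))
```

    Exact integer arithmetic via `math.comb` keeps the tail probability accurate even when the individual binomial coefficients are very large.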

  • The Quantified Brain: A Framework for Mobile Device-Based Assessment of Behavior and Neurological Function. Applied clinical informatics Stark, D. E., Kumar, R. B., Longhurst, C. A., Wall, D. P. 2016; 7 (2): 290–98

    View details for PubMedID 27437041

  • Identification of Human Neuronal Protein Complexes Reveals Biochemical Activities and Convergent Mechanisms of Action in Autism Spectrum Disorders. Cell systems Li, J., Ma, Z., Shi, M., Malty, R. H., Aoki, H., Minic, Z., Phanse, S., Jin, K., Wall, D. P., Zhang, Z., Urban, A. E., Hallmayer, J., Babu, M., Snyder, M. 2015; 1 (5): 361-374


    The prevalence of autism spectrum disorders (ASDs) is rapidly growing, yet its molecular basis is poorly understood. We used a systems approach in which ASD candidate genes were mapped onto the ubiquitous human protein complexes and the resulting complexes were characterized. The studies revealed the role of histone deacetylases (HDAC1/2) in regulating the expression of ASD orthologs in the embryonic mouse brain. Proteome-wide screens for the co-complexed subunits with HDAC1 and six other key ASD proteins in neuronal cells revealed a protein interaction network, which displayed preferential expression in fetal brain development, exhibited increased deleterious mutations in ASD cases, and were strongly regulated by FMRP and MECP2 causal for Fragile X and Rett syndromes, respectively. Overall, our study reveals molecular components in ASD, suggests a shared mechanism between the syndromic and idiopathic forms of ASDs, and provides a systems framework for analyzing complex human diseases.

    View details for PubMedID 26949739

    View details for DOI 10.1016/j.cels.2015.11.002

    View details for Web of Science ID 000209926300009

    View details for PubMedCentralID PMC4776331

  • Scalable and cost-effective NGS genotyping in the cloud BMC MEDICAL GENOMICS Souilmi, Y., Lancaster, A. K., Jung, J., Rizzo, E., Hawkins, J. B., Powles, R., Amzazi, S., Ghazal, H., Tonellato, P. J., Wall, D. P. 2015; 8


    While next-generation sequencing (NGS) costs have plummeted in recent years, cost and complexity of computation remain substantial barriers to the use of NGS in routine clinical care. The clinical potential of NGS will not be realized until robust and routine whole genome sequencing data can be accurately rendered to medically actionable reports within a time window of hours and at scales of economy in the tens of dollars. We take a step towards addressing this challenge by using COSMOS, a cloud-enabled workflow management system, to develop GenomeKey, an NGS whole genome analysis workflow. COSMOS implements complex workflows making optimal use of high-performance compute clusters. Here we show that the Amazon Web Service (AWS) implementation of GenomeKey via COSMOS provides a fast, scalable, and cost-effective analysis of both public benchmarking and large-scale heterogeneous clinical NGS datasets. Our systematic benchmarking reveals important new insights and considerations to produce clinical turn-around of whole genome analysis optimization and workflow management including strategic batching of individual genomes and efficient cluster resource configuration.

    View details for DOI 10.1186/s12920-015-0134-9

    View details for Web of Science ID 000362868300001

    View details for PubMedID 26470712

    View details for PubMedCentralID PMC4608296

  • Rising interdisciplinary collaborations refine our understanding of autisms and give hope to more personalized solutions. Personalized medicine Duda, M., Wall, D. P. 2015; 12 (4): 359-369


    Autism is heterogeneous, complex and arguably a condition of many conditions. Both the number of researchers and the number of research collaborations in the field of autism have been growing at unprecedented rates. Interdisciplinary collaborations have increased more than eightfold since the year 2000. In fact, most - if not all - areas of autism research are starting to converge, and these convergences are leading not only to a richer research network but also to a causal network for autism. This network can, and likely will, decode the many forms of autism into its various subcomponents, enabling increasingly more personalized approaches for both the detection and treatment of those different forms of autism.

    View details for DOI 10.2217/pme.15.8

    View details for PubMedID 29771659

  • A transgenic resource for conditional competitive inhibition of conserved Drosophila microRNAs NATURE COMMUNICATIONS Fulga, T. A., McNeill, E. M., Binari, R., Yelick, J., Blanche, A., Booker, M., Steinkraus, B. R., Schnall-Levin, M., Zhao, Y., Deluca, T., Bejarano, F., Han, Z., Lai, E. C., Wall, D. P., Perrimon, N., Van Vactor, D. 2015; 6


    Although the impact of microRNAs (miRNAs) in development and disease is well established, understanding the function of individual miRNAs remains challenging. Development of competitive inhibitor molecules such as miRNA sponges has allowed the community to address individual miRNA function in vivo. However, the application of these loss-of-function strategies has been limited. Here we offer a comprehensive library of 141 conditional miRNA sponges targeting well-conserved miRNAs in Drosophila. Ubiquitous miRNA sponge delivery and consequent systemic miRNA inhibition uncovers a relatively small number of miRNA families underlying viability and gross morphogenesis, with false discovery rates in the 4-8% range. In contrast, tissue-specific silencing of muscle-enriched miRNAs reveals a surprisingly large number of novel miRNA contributions to the maintenance of adult indirect flight muscle structure and function. A strong correlation between miRNA abundance and physiological relevance is not observed, underscoring the importance of unbiased screens when assessing the contributions of miRNAs to complex biological processes.

    View details for DOI 10.1038/ncomms8279

    View details for Web of Science ID 000357170800006

    View details for PubMedID 26081261

    View details for PubMedCentralID PMC4471878

  • Searching for a minimal set of behaviors for autism detection through feature selection-based machine learning TRANSLATIONAL PSYCHIATRY Kosmicki, J. A., Sochat, V., Duda, M., Wall, D. P. 2015; 5


    Although the prevalence of autism spectrum disorder (ASD) has risen sharply in the last few years, reaching 1 in 68, the average age of diagnosis in the United States remains close to 4, well past the developmental window when early intervention has the largest gains. This emphasizes the importance of developing accurate methods to detect risk faster than the current standards of care. In the present study, we used machine learning to evaluate one of the best and most widely used instruments for clinical assessment of ASD, the Autism Diagnostic Observation Schedule (ADOS), to test whether only a subset of behaviors can differentiate between children on and off the autism spectrum. ADOS relies on behavioral observation in a clinical setting and consists of four modules, with module 2 reserved for individuals with some vocabulary and module 3 for higher levels of cognitive functioning. We ran eight machine learning algorithms using stepwise backward feature selection on score sheets from modules 2 and 3 from 4540 individuals. We found that 9 of the 28 behaviors captured by items from module 2, and 12 of the 28 behaviors captured by module 3, are sufficient to detect ASD risk with 98.27% and 97.66% accuracy, respectively. A greater than 55% reduction in the number of behaviors with negligible loss of accuracy across both modules suggests a role for computational and statistical methods to streamline ASD risk detection and screening. These results may help enable development of mobile and parent-directed methods for preliminary risk evaluation and/or clinical triage that reach a larger percentage of the population and help to lower the average age of detection and diagnosis.

    View details for DOI 10.1038/tp.2015.7

    View details for Web of Science ID 000367652200002
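
    Stepwise backward feature selection, as named in the abstract above, starts from all items and repeatedly drops the one whose removal costs the least. The sketch below is a generic greedy version under stated assumptions: the scoring function is a toy stand-in for cross-validated classifier accuracy, and the item names and weights are hypothetical.

```python
def backward_elimination(features, score, tolerance=0.0):
    """Greedy backward selection: drop the least useful feature each round,
    stopping when every possible removal lowers the score by more than
    `tolerance`. Returns (kept_features, final_score)."""
    selected = list(features)
    current = score(selected)
    while len(selected) > 1:
        candidates = [(score([f for f in selected if f != drop]), drop)
                      for drop in selected]
        best_score, drop = max(candidates)
        if best_score < current - tolerance:
            break
        selected.remove(drop)
        current = best_score
    return selected, current


# Toy scorer (hypothetical): only two items carry any signal.
INFORMATIVE = {"eye_contact": 0.5, "gestures": 0.25}


def toy_score(subset):
    return sum(INFORMATIVE.get(name, 0.0) for name in subset)


items = ["eye_contact", "gestures", "item_3", "item_4"]
kept, final = backward_elimination(items, toy_score)
print(kept, final)  # ['eye_contact', 'gestures'] 0.75

# Size of the reductions reported in the abstract (9 and 12 of 28 items kept).
print(round(100 * (1 - 9 / 28), 1), round(100 * (1 - 12 / 28), 1))  # 67.9 57.1
```

    The last line shows why the abstract can describe both modules as a greater than 55% reduction in the number of behaviors.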

  • COSMOS: cloud enabled NGS analysis Souilmi, Y., Jung, J., Lancaster, A., Gafni, E., Amzazi, S., Ghazal, H., Wall, D., Tonellato, P. BIOMED CENTRAL LTD. 2015
  • Searching for a minimal set of behaviors for autism detection through feature selection-based machine learning. Translational psychiatry Kosmicki, J. A., Sochat, V., Duda, M., Wall, D. P. 2015; 5


    View details for DOI 10.1038/tp.2015.7

    View details for PubMedID 25710120

  • Testing the accuracy of an observation-based classifier for rapid detection of autism risk. Translational psychiatry Duda, M., Kosmicki, J. A., Wall, D. P. 2015; 5

    View details for DOI 10.1038/tp.2015.51

    View details for PubMedID 25918993

  • Translational Meta-analytical Methods to Localize the Regulatory Patterns of Neurological Disorders in the Human Brain. AMIA ... Annual Symposium proceedings / AMIA Symposium. AMIA Symposium Sochat, V., David, M., Wall, D. P. 2015; 2015: 2073-2082


    The task of mapping neurological disorders in the human brain must be informed by multiple measurements of an individual's phenotype - neuroimaging, genomics, and behavior. We developed a novel meta-analytical approach to integrate disparate resources and generated transcriptional maps of neurological disorders in the human brain yielding a purely computational procedure to pinpoint the brain location of transcribed genes likely to be involved in either onset or maintenance of the neurological condition.

    View details for PubMedID 26958307

  • Rising interdisciplinary collaborations refine our understanding of autisms and give hope to more personalized solutions PERSONALIZED MEDICINE Duda, M., Wall, D. P. 2015; 12 (4): 359-369

    View details for DOI 10.2217/PME.15.8

    View details for Web of Science ID 000358945300006

  • COSMOS: Python library for massively parallel workflows BIOINFORMATICS Gafni, E., Luquette, L. J., Lancaster, A. K., Hawkins, J. B., Jung, J., Souilmi, Y., Wall, D. P., Tonellato, P. J. 2014; 30 (20): 2956-2958


    Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services. Source code is available for academic non-commercial research purposes. Links to code and documentation are provided in the published article. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btu385

    View details for Web of Science ID 000343083600015

    View details for PubMedID 24982428

    View details for PubMedCentralID PMC4184253
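
    The "formal description of pipelines and partitioning of jobs" in the COSMOS abstract amounts to treating a workflow as a dependency graph. The sketch below illustrates that idea with Python's standard `graphlib` module (Python 3.9+); it does not use or imitate the COSMOS API, and the stage names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages it depends on (hypothetical NGS stages).
pipeline = {
    "align": set(),
    "sort": {"align"},
    "call_variants": {"sort"},
    "annotate": {"call_variants"},
    "qc_report": {"align"},
}


def execution_batches(dependencies):
    """Group stages into ordered batches; stages within one batch have all
    their dependencies satisfied and could be submitted in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(execution_batches(pipeline))
# [['align'], ['qc_report', 'sort'], ['call_variants'], ['annotate']]
```

    A workflow manager adds job submission, retries and progress tracking on top of this ordering; those parts are outside the scope of the sketch.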

  • A framework for the interpretation of de novo mutation in human disease NATURE GENETICS Samocha, K. E., Robinson, E. B., Sanders, S. J., Stevens, C., Sabo, A., McGrath, L. M., Kosmicki, J. A., Rehnstrom, K., Mallick, S., Kirby, A., Wall, D. P., MacArthur, D. G., Gabriel, S. B., DePristo, M., Purcell, S. M., Palotie, A., Boerwinkle, E., Buxbaum, J. D., Cook, E. H., Gibbs, R. A., Schellenberg, G. D., Sutcliffe, J. S., Devlin, B., Roeder, K., Neale, B. M., Daly, M. J. 2014; 46 (9): 944-?


    Spontaneously arising (de novo) mutations have an important role in medical genetics. For diseases with extensive locus heterogeneity, such as autism spectrum disorders (ASDs), the signal from de novo mutations is distributed across many genes, making it difficult to distinguish disease-relevant mutations from background variation. Here we provide a statistical framework for the analysis of excesses in de novo mutation per gene and gene set by calibrating a model of de novo mutation. We applied this framework to de novo mutations collected from 1,078 ASD family trios, and, whereas we affirmed a significant role for loss-of-function mutations, we found no excess of de novo loss-of-function mutations in cases with IQ above 100, suggesting that the role of de novo mutations in ASDs might reside in fundamental neurodevelopmental processes. We also used our model to identify ∼1,000 genes that are significantly lacking in functional coding variation in non-ASD samples and are enriched for de novo loss-of-function mutations identified in ASD cases.

    View details for DOI 10.1038/ng.3050

    View details for Web of Science ID 000341579400007

    View details for PubMedID 25086666

  • Evaluating the critical source area concept of phosphorus loss from soils to water-bodies in agricultural catchments. The Science of the total environment Shore, M., Jordan, P., Mellander, P., Kelly-Quinn, M., Wall, D. P., Murphy, P. N., Melland, A. R. 2014; 490: 405-415


    Using data collected from six basins located across two hydrologically contrasting agricultural catchments, this study investigated whether transport metrics alone provide better estimates of storm phosphorus (P) loss from basins than critical source area (CSA) metrics which combine source factors as well. Concentrations and loads of P in quickflow (QF) were measured at basin outlets during four storm events and were compared with dynamic (QF magnitude) and static (extent of highly-connected, poorly-drained soils) transport metrics and a CSA metric (extent of highly-connected, poorly-drained soils with excess plant-available P). Pairwise comparisons between basins with similar CSA risks but contrasting QF magnitudes showed that QF flow-weighted mean TRP (total molybdate-reactive P) concentrations and loads were frequently (at least 11 of 14 comparisons) more than 40% higher in basins with the highest QF magnitudes. Furthermore, static transport metrics reliably discerned relative QF magnitudes between these basins. However, particulate P (PP) concentrations were often (6 of 14 comparisons) higher in basins with the lowest QF magnitudes, most likely due to soil-management activities (e.g. ploughing), in these predominantly arable basins at these times. Pairwise comparisons between basins with contrasting CSA risks and similar QF magnitudes showed that TRP and PP concentrations and loads did not reflect trends in CSA risk or QF magnitude. Static transport metrics did not discern relative QF magnitudes between these basins. In basins with contrasting transport risks, storm TRP concentrations and loads were well differentiated by dynamic or static transport metrics alone, regardless of differences in soil P. In basins with similar transport risks, dynamic transport metrics and P source information additional to soil P may be required to predict relative storm TRP concentrations and loads. 
Regardless of differences in transport risk, information on land use and management may be required to predict relative differences in storm PP concentrations between these agricultural basins.

    View details for DOI 10.1016/j.scitotenv.2014.04.122

    View details for PubMedID 24863139

  • A literature search tool for intelligent extraction of disease-associated genes JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION Jung, J., DeLuca, T. F., Nelson, T. H., Wall, D. P. 2014; 21 (3): 399-405


    To extract disorder-associated genes from the scientific literature in PubMed with greater sensitivity for literature-based support than existing methods. We developed a PubMed query to retrieve disorder-related, original research articles. Then we applied a rule-based text-mining algorithm with keyword matching to extract target disorders, genes with significant results, and the type of study described by the article. We compared our resulting candidate disorder genes and supporting references with existing databases. We demonstrated that our candidate gene set covers nearly all genes in manually curated databases, and that the references supporting the disorder-gene link are more extensive and accurate than other general purpose gene-to-disorder association databases. We implemented a novel publication search tool to find target articles, specifically focused on links between disorders and genotypes. Through comparison against gold-standard manually updated gene-disorder databases and comparison with automated databases of similar functionality we show that our tool can search through the entirety of PubMed to extract the main gene findings for human diseases rapidly and accurately.

    View details for DOI 10.1136/amiajnl-2012-001563

    View details for Web of Science ID 000334611600003

    View details for PubMedID 23999671

    View details for PubMedCentralID PMC3994846

  • The Potential of Accelerating Early Detection of Autism through Content Analysis of YouTube Videos. PloS one Fusaro, V. A., Daniels, J., Duda, M., DeLuca, T. F., D'Angelo, O., Tamburello, J., Maniscalco, J., Wall, D. P. 2014; 9 (4)


    Autism is on the rise, with 1 in 88 children receiving a diagnosis in the United States, yet the process for diagnosis remains cumbersome and time consuming. Research has shown that home videos of children can help increase the accuracy of diagnosis. However, the use of videos in the diagnostic process is uncommon. In the present study, we assessed the feasibility of applying a gold-standard diagnostic instrument to brief and unstructured home videos and tested whether video analysis can enable more rapid detection of the core features of autism outside of clinical environments. We collected 100 public videos from YouTube of children ages 1-15 with either a self-reported diagnosis of an ASD (N = 45) or not (N = 55). Four non-clinical raters independently scored all videos using one of the most widely adopted tools for behavioral diagnosis of autism, the Autism Diagnostic Observation Schedule-Generic (ADOS). The classification accuracy was 96.8%, with 94.1% sensitivity and 100% specificity, the inter-rater correlation for the behavioral domains on the ADOS was 0.88, and the diagnoses matched a trained clinician in all but 3 of 22 randomly selected video cases. Despite the diversity of videos and non-clinical raters, our results indicate that it is possible to achieve high classification accuracy, sensitivity, and specificity as well as clinically acceptable inter-rater reliability with nonclinical personnel. Our results also demonstrate the potential for video-based detection of autism in short, unstructured home videos and further suggest that at least a percentage of the effort associated with detection and monitoring of autism may be mobilized and moved outside of traditional clinical environments.

    View details for DOI 10.1371/journal.pone.0093533

    View details for PubMedID 24740236

    View details for PubMedCentralID PMC3989176

  • Testing the accuracy of an observation-based classifier for rapid detection of autism risk. Translational psychiatry Duda, M., Kosmicki, J. A., Wall, D. P. 2014; 4


    Current approaches for diagnosing autism have high diagnostic validity but are time consuming and can contribute to delays in arriving at an official diagnosis. In a pilot study, we used machine learning to derive a classifier that represented a 72% reduction in length from the gold-standard Autism Diagnostic Observation Schedule-Generic (ADOS-G), while retaining >97% statistical accuracy. The pilot study focused on a relatively small sample of children with and without autism. The present study sought to further test the accuracy of the classifier (termed the observation-based classifier (OBC)) on an independent sample of 2616 children scored using ADOS from five data repositories and including both spectrum (n=2333) and non-spectrum (n=283) individuals. We tested OBC outcomes against the outcomes provided by the original and current ADOS algorithms, the best estimate clinical diagnosis, and the comparison score severity metric associated with ADOS-2. The OBC was significantly correlated with the ADOS-G (r=-0.814) and ADOS-2 (r=-0.779) and exhibited >97% sensitivity and >77% specificity in comparison to both ADOS algorithm scores. The correspondence to the best estimate clinical diagnosis was also high (accuracy=96.8%), with sensitivity of 97.1% and specificity of 83.3%. The correlation between the OBC score and the comparison score was significant (r=-0.628), suggesting that the OBC provides both a classification and a measure of severity of the phenotype. These results further demonstrate the accuracy of the OBC and suggest that reductions in the process of detecting and monitoring autism are possible.

    View details for DOI 10.1038/tp.2014.65

    View details for PubMedID 25116834

    View details for PubMedCentralID PMC4150240
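
    The accuracy, sensitivity and specificity figures quoted in abstracts such as the one above follow from a 2x2 confusion matrix. The sketch below shows the arithmetic only; the function name and the counts in the usage lines are hypothetical toy values and are not data from any of the listed studies.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp: cases correctly classified as positive
    fn: cases incorrectly classified as negative
    tn: controls correctly classified as negative
    fp: controls incorrectly classified as positive
    """
    return {
        # Fraction of true cases the classifier detects.
        "sensitivity": tp / (tp + fn),
        # Fraction of true controls the classifier rejects.
        "specificity": tn / (tn + fp),
        # Fraction of all individuals classified correctly.
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


if __name__ == "__main__":
    # Toy counts chosen for illustration only.
    metrics = classification_metrics(tp=90, fn=10, tn=40, fp=10)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```

    When controls are scarce relative to cases, accuracy is dominated by sensitivity, which is one reason the abstracts report specificity separately.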

  • Responding to a Diagnosis of Localized Prostate Cancer: Men's Experiences of Normal Distress During the First 3 Postdiagnostic Months CANCER NURSING Wall, D. P., Kristjanson, L. J., Fisher, C., Boldy, D., Kendall, G. E. 2013; 36 (6): E44-E50


    Men experience localized prostate cancer (PCa) as aversive and distressing. Little research has studied the distress men experience as a normal response to PCa, or how they manage this distress during the early stages of the illness. The objective of this study was to explore the experience of men diagnosed with localized PCa during their first postdiagnostic year. This constructivist qualitative study interviewed 8 men between the ages of 44 and 77 years, in their homes, on 2 occasions during the first 3 postdiagnostic months. Individual, in-depth semistructured interviews were used to collect the data. After an initial feeling of shock, the men in this study worked diligently to camouflage their experience of distress through hiding and attenuating their feelings and minimizing the severity of PCa. Men silenced distress because they believed it was expected of them. Maintaining silence allowed men to protect their strong and stoic self-image. This stereotype, of the strong and stoic man, prevented men from expressing their feelings of distress and from seeking support from family and friends and health professionals. It is important for nurses to acknowledge and recognize the normal distress experienced by men as a result of a PCa diagnosis. Hence, nurses must learn to identify the ways in which men avoid expressing their distress and develop early supportive relationships that encourage them to express and subsequently manage it.

    View details for DOI 10.1097/NCC.0b013e3182747bef

    View details for Web of Science ID 000326532000006

    View details for PubMedID 23154517

  • Quantification of Phosphorus Transport from a Karstic Agricultural Watershed to Emerging Spring Water ENVIRONMENTAL SCIENCE & TECHNOLOGY Mellander, P., Jordan, P., Melland, A. R., Murphy, P. N., Wall, D. P., Mechan, S., Meehan, R., Kelly, C., Shine, O., Shortle, G. 2013; 47 (12): 6111-6119


    The degree to which waters in a given watershed will be affected by nutrient export can be defined as that watershed's nutrient vulnerability. This study applied concepts of specific phosphorus (P) vulnerability to develop intrinsic groundwater vulnerability risk assessments in a 32 km² karst watershed (spring zone of contribution) in a relatively intensive agricultural landscape. To explain why emergent spring water was below an ecological impairment threshold, concepts of P attenuation potential were investigated along the nutrient transfer continuum based on soil P buffering, depth to bedrock, and retention within the aquifer. Surface karst features, such as enclosed depressions, were reclassified based on P attenuation potential in soil at the base. New techniques of high temporal resolution monitoring of P loads in the emergent spring made it possible to estimate P transfer pathways and retention within the aquifer and indicated small-medium fissure flows to be the dominant pathway, delivering 52-90% of P loads during storm events. Annual total P delivery to the main emerging spring was 92.7 and 138.4 kg total P (and 52.4 and 91.3 kg as total reactive P) for two monitored years, respectively. A revised groundwater vulnerability assessment was used to produce a specific P vulnerability map that used the soil and hydrogeological P buffering potential of the watershed as key assumptions in moderating P export to the emergent spring. Using this map and soil P data, the definition of critical source areas in karst landscapes was demonstrated.

    View details for DOI 10.1021/es304909y

    View details for Web of Science ID 000320749000007

    View details for PubMedID 23672730

  • Systems biology as a comparative approach to understand complex gene expression in neurological diseases. Behavioral sciences (Basel, Switzerland) Diaz-Beltran, L., Cano, C., Wall, D. P., Esteban, F. J. 2013; 3 (2): 253-272


    Systems biology interdisciplinary approaches have become an essential analytical tool that may yield novel and powerful insights about the nature of human health and disease. Complex disorders are known to be caused by the combination of genetic, environmental, immunological or neurological factors. Thus, to understand such disorders, it becomes necessary to address the study of this complexity from a novel perspective. Here, we present a review of integrative approaches that help to understand the underlying biological processes involved in the etiopathogenesis of neurological diseases, for example, those related to autism and autism spectrum disorders (ASD) endophenotypes. Furthermore, we highlight the role of systems biology in the discovery of new biomarkers or therapeutic targets in complex disorders, a key step in the development of personalized medicine, and we demonstrate the role of systems approaches in the design of classifiers that can shorten the time for behavioral diagnosis of autism.

    View details for DOI 10.3390/bs3020253

    View details for PubMedID 25379238

    View details for PubMedCentralID PMC4217627

  • Systems Biology as a Comparative Approach to Understand Complex Gene Expression in Neurological Diseases BEHAVIORAL SCIENCES Diaz-Beltran, L., Cano, C., Wall, D. P., Esteban, F. J. 2013; 3 (2): 253–72

    View details for DOI 10.3390/bs3020253

    View details for Web of Science ID 000215416400007

  • Haplotype structure enables prioritization of common markers and candidate genes in autism spectrum disorder TRANSLATIONAL PSYCHIATRY Vardarajan, B. N., Eran, A., Jung, J., Kunkel, L. M., Wall, D. P. 2013; 3


    Autism spectrum disorder (ASD) is a neurodevelopmental condition that results in behavioral, social and communication impairments. ASD has a substantial genetic component, with 88-95% trait concordance among monozygotic twins. Efforts to elucidate the causes of ASD have uncovered hundreds of susceptibility loci and candidate genes. However, owing to its polygenic nature and clinical heterogeneity, only a few of these markers represent clear targets for further analyses. In the present study, we used the linkage structure associated with published genetic markers of ASD to simultaneously improve candidate gene detection while providing a means of prioritizing markers of common genetic variation in ASD. We first mined the literature for linkage and association studies of single-nucleotide polymorphisms, copy-number variations and multi-allelic markers in Autism Genetic Resource Exchange (AGRE) families. From markers that reached genome-wide significance, we calculated male-specific genetic distances, in light of the observed strong male bias in ASD. Four of 67 autism-implicated regions, 3p26.1, 3p26.3, 3q25-27 and 5p15, were enriched with differentially expressed genes in blood and brain from individuals with ASD. Of 30 genes differentially expressed across multiple expression data sets, 21 were within 10 cM of an autism-implicated locus. Among them, CNTN4, CADPS2, SUMF1, SLC9A9 and NTRK3 have been previously implicated in autism, whereas others have been implicated in neurological disorders comorbid with ASD. This work leverages the rich multimodal genomic information collected on AGRE families to present an efficient integrative strategy for prioritizing autism candidates and improving our understanding of the relationships among the vast collection of past genetic studies.

    View details for DOI 10.1038/tp.2013.38

    View details for Web of Science ID 000321184400008

    View details for PubMedID 23715297

    View details for PubMedCentralID PMC3669925

  • Genomics-Informed Pathology SCIENTIST Wall, D. P., Tonellato, P. J. 2013; 27 (1): 22-23
  • Autworks: a cross-disease analysis application for Autism and related disorders. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science Wall, D. 2013; 2013: 42–43

    View details for PubMedID 24303295

  • Genetic Networks of Complex Disorders: from a Novel Search Engine for PubMed Article Database. AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Science Jung, J., Wall, D. P. 2013; 2013: 99-?


    Finding genetic risk factors of complex disorders may involve reviewing hundreds of genes or thousands of research articles iteratively, but few tools have been available to facilitate this procedure. In this work, we built a novel publication search engine that can identify target-disorder specific, genetics-oriented research articles and extract the genes with significant results. Preliminary test results showed that the output of this engine has better coverage in terms of genes or publications than other existing applications. We consider it an essential tool for understanding genetic networks of complex disorders.

    View details for PubMedID 24303309

  • Streaming Support for Data Intensive Cloud-Based Sequence Analysis BIOMED RESEARCH INTERNATIONAL Issa, S. A., Kienzler, R., El-Kalioby, M., Tonellato, P. J., Wall, D., Bruggmann, R., Abouelhoda, M. 2013


    Cloud computing provides a promising solution to the genomics data deluge problem resulting from the advent of next-generation sequencing (NGS) technology. Based on the concepts of "resources-on-demand" and "pay-as-you-go", scientists with no or limited infrastructure can have access to scalable and cost-effective computational resources. However, the large size of NGS data causes a significant data transfer latency from the client's site to the cloud, which presents a bottleneck for using cloud computing services. In this paper, we provide a streaming-based scheme to overcome this problem, where the NGS data is processed while being transferred to the cloud. Our scheme targets the wide class of NGS data analysis tasks, where the NGS sequences can be processed independently from one another. We also provide the elastream package that supports the use of this scheme with individual analysis programs or with workflow systems. Experiments presented in this paper show that our solution mitigates the effect of data transfer latency and saves both time and cost of computation.

    View details for DOI 10.1155/2013/791051

    View details for Web of Science ID 000318725500001

    View details for PubMedID 23710461

    View details for PubMedCentralID PMC3655485
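
    The streaming idea described in the abstract above, processing records while the transfer is still in progress, can be sketched in a few lines. This is not the elastream package: the function names are hypothetical, the "transfer" is simulated by an iterable of text chunks, and the per-record work (a GC fraction) is only a stand-in for an analysis step that treats each sequence independently.

```python
def iter_records(chunks):
    """Yield complete newline-terminated records from a stream of text chunks.

    A record split across two chunks is held in a buffer until its
    terminating newline arrives.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        *complete, buffer = buffer.split("\n")
        for line in complete:
            if line:
                yield line
    if buffer:  # final record without a trailing newline
        yield buffer


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def process_stream(chunks):
    """Process each record as soon as it is complete, without waiting
    for the rest of the transfer."""
    for seq in iter_records(chunks):
        yield seq, gc_fraction(seq)


if __name__ == "__main__":
    # Simulated transfer: chunk boundaries do not align with records.
    incoming = ["ACGT\nGG", "CC\nAT", "AT\n"]
    for seq, gc in process_stream(incoming):
        print(seq, gc)
```

    Because both functions are generators, work on early records overlaps with the arrival of later chunks, which is the property the scheme depends on.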

  • Personalized cloud-based bioinformatics services for research and education: use cases and the elasticHPC package Asia Pacific Bioinformatics Network (APBioNet) 11th International Conference on Bioinformatics (InCoB) El-Kalioby, M., Abouelhoda, M., Krueger, J., Giegerich, R., Sczyrba, A., Wall, D. P., Tonellato, P. BIOMED CENTRAL LTD. 2012


    Bioinformatics services have been traditionally provided in the form of a web-server that is hosted at institutional infrastructure and serves multiple users. This model, however, is not flexible enough to cope with the increasing number of users, increasing data size, and new requirements in terms of speed and availability of service. The advent of cloud computing suggests a new service model that provides an efficient solution to these problems, based on the concepts of "resources-on-demand" and "pay-as-you-go". However, cloud computing has not yet been introduced within bioinformatics servers due to the lack of usage scenarios and software layers that address the requirements of the bioinformatics domain. In this paper, we provide different use case scenarios for providing cloud computing based services, considering both the technical and financial aspects of the cloud computing service model. These scenarios are for individual users seeking computational power as well as bioinformatics service providers aiming at provision of personalized bioinformatics services to their users. We also present elasticHPC, a software package and a library that facilitates the use of high performance cloud computing resources in general and the implementation of the suggested bioinformatics scenarios in particular. Concrete examples that demonstrate the suggested use case scenarios with whole bioinformatics servers and major sequence analysis tools like BLAST are presented. Experimental results with large datasets are also included to show the advantages of the cloud model. Our use case scenarios and the elasticHPC package are steps towards the provision of cloud based bioinformatics services, which would help in overcoming the data challenge of recent biological research. All resources related to elasticHPC and its web-interface are available at

    View details for DOI 10.1186/1471-2105-13-S17-S22

    View details for Web of Science ID 000317183600002

    View details for PubMedID 23281941

    View details for PubMedCentralID PMC3521398

  • Autworks: a cross-disease network biology application for Autism and related disorders BMC MEDICAL GENOMICS Nelson, T. H., Jung, J., DeLuca, T. F., Hinebaugh, B. K., St Gabriel, K. C., Wall, D. P. 2012; 5


    The genetic etiology of autism is heterogeneous. Multiple disorders share genotypic and phenotypic traits with autism. Network based cross-disorder analysis can aid in the understanding and characterization of the molecular pathology of autism, but there are few tools that enable us to conduct cross-disorder analysis and to visualize the results. We have designed Autworks as a web portal to bring together gene interaction and gene-disease association data on autism to enable network construction, visualization, and network comparisons with numerous other related neurological conditions and disorders. Users may examine the structure of gene interactions within a set of disorder-associated genes, compare networks of disorder/disease genes with those of other disorders/diseases, and upload their own sets for comparative analysis. Autworks is a web application that provides an easy-to-use resource for researchers of varied backgrounds to analyze the autism gene network structure within and between disorders.

    View details for DOI 10.1186/1755-8794-5-56

    View details for Web of Science ID 000313043800001

    View details for PubMedID 23190929

    View details for PubMedCentralID PMC3533944

  • Cross-pollination of research findings, although uncommon, may accelerate discovery of human disease genes BMC MEDICAL GENETICS Duda, M., Nelson, T., Wall, D. P. 2012; 13


    Technological leaps in genome sequencing have resulted in a surge in discovery of human disease genes. These discoveries have led to increased clarity on the molecular pathology of disease and have also demonstrated considerable overlap in the genetic roots of human diseases. In light of this large genetic overlap, we tested whether cross-disease research approaches lead to faster, more impactful discoveries. We leveraged several gene-disease association databases to calculate a Mutual Citation Score (MCS) for 10,853 pairs of genetically related diseases to measure the frequency of cross-citation between research fields. To assess the importance of cooperative research, we computed an Individual Disease Cooperation Score (ICS) and the average publication rate for each disease. For all disease pairs with one gene in common, we found that the degree of genetic overlap was a poor predictor of cooperation (r² = 0.3198) and that the vast majority of disease pairs (89.56%) never cited previous discoveries of the same gene in a different disease, irrespective of the level of genetic similarity between the diseases. A fraction (0.25%) of the pairs demonstrated cross-citation in greater than 5% of their published genetic discoveries and 0.037% cross-referenced discoveries more than 10% of the time. We found strong positive correlations between ICS and publication rate (r² = 0.7931), and an even stronger correlation between the publication rate and the number of cross-referenced diseases (r² = 0.8585). These results suggested that cross-disease research may have the potential to yield novel discoveries at a faster pace than singular disease research. Our findings suggest that the frequency of cross-disease study is low despite the high level of genetic similarity among many human diseases, and that collaborative methods may accelerate and increase the impact of new genetic discoveries. Until we have a better understanding of the taxonomy of human diseases, cross-disease research approaches should become the rule rather than the exception.

    View details for DOI 10.1186/1471-2350-13-114

    View details for Web of Science ID 000312866300001

    View details for PubMedID 23190421

    View details for PubMedCentralID PMC3532152

  • Use of Artificial Intelligence to Shorten the Behavioral Diagnosis of Autism PLOS ONE Wall, D. P., Dally, R., Luyster, R., Jung, J., DeLuca, T. F. 2012; 7 (8)


    The Autism Diagnostic Interview-Revised (ADI-R) is one of the most commonly used instruments for assisting in the behavioral diagnosis of autism. The exam consists of 93 questions that must be answered by a care provider within a focused session that often spans 2.5 hours. We used machine learning techniques to study the complete sets of answers to the ADI-R available at the Autism Genetic Resource Exchange (AGRE) for 891 individuals diagnosed with autism and 75 individuals who did not meet the criteria for an autism diagnosis. Our analysis showed that 7 of the 93 items contained in the ADI-R were sufficient to classify autism with 99.9% statistical accuracy. We further tested the accuracy of this 7-question classifier against complete sets of answers from two independent sources, a collection of 1654 individuals with autism from the Simons Foundation and a collection of 322 individuals with autism from the Boston Autism Consortium. In both cases, our classifier performed with nearly 100% statistical accuracy, properly categorizing all but one of the individuals from these two resources who previously had been diagnosed with autism through the standard ADI-R. Our ability to measure specificity was limited by the small numbers of non-spectrum cases in the research data used; however, both real and simulated data demonstrated a range in specificity from 99% to 93.8%. With incidence rates rising, the capacity to diagnose autism quickly and effectively requires careful design of behavioral assessment methods. Ours is an initial attempt to retrospectively analyze large data repositories to derive an accurate, but significantly abbreviated approach that may be used for rapid detection and clinical prioritization of individuals likely to have an autism spectrum disorder. Such a tool could assist in streamlining the clinical diagnostic process overall, leading to faster screening and earlier treatment of individuals with autism.

    View details for DOI 10.1371/journal.pone.0043855

    View details for Web of Science ID 000308044800067

    View details for PubMedID 22952789

    View details for PubMedCentralID PMC3428277

  • Delivery and impact bypass in a karst aquifer with high phosphorus source and pathway potential WATER RESEARCH Mellander, P., Jordan, P., Wall, D. P., Melland, A. R., Meehan, R., Kelly, C., Shortle, G. 2012; 46 (7): 2225-2236


    Conduit and other karstic flows to aquifers, connecting agricultural soils and farming activities, are considered to be the main hydrological mechanisms that transfer phosphorus from the land surface to the groundwater body of a karstified aquifer. In this study, soil source and pathway components of the phosphorus (P) transfer continuum were defined at a high spatial resolution; field-by-field soil P status and mapping of all surface karst features was undertaken in a >30 km² spring contributing zone. Additionally, P delivery and water discharge were monitored in the emergent spring on a sub-hourly basis for over 12 months. Despite moderate to intensive agriculture, varying soil P status with a high proportion of elevated soil P concentrations and a high karstic connectivity potential, background P concentrations in the emergent groundwater were low and indicative of being insufficient to increase the surface water P status of receiving surface waters. However, episodic P transfers via the conduit system increased the P concentrations in the spring during storm events (but not >0.035 mg total reactive P L⁻¹) and this process is similar to other catchments where the predominant transfer is via episodic, surface flow pathways; but with high buffering potential over karst due to delayed and attenuated runoff. These data suggest that the current definitions of risk and vulnerability for P delivery to receiving surface waters should be re-evaluated as high source risk need not necessarily result in a water quality impact. Also, inclusion of conduit flows from sparse water quality data in these systems may over-emphasise their influence on the overall status of the groundwater body.

    View details for DOI 10.1016/j.watres.2012.01.048

    View details for Web of Science ID 000302645300020

    View details for PubMedID 22377147

  • Deriving clinical action from whole-genome analysis PERSONALIZED MEDICINE Wall, D. P., Tonellato, P. J. 2012; 9 (3): 247–52

    View details for PubMedID 29758797

  • Systems analysis of inflammatory bowel disease based on comprehensive gene information BMC MEDICAL GENETICS Suzuki, S., Takai-Igarashi, T., Fukuoka, Y., Wall, D. P., Tanaka, H., Tonellato, P. J. 2012; 13


    The rise of systems biology and availability of highly curated gene and molecular information resources has promoted a comprehensive approach to study disease as the cumulative deleterious function of a collection of individual genes and networks of molecules acting in concert. These "human disease networks" (HDN) have revealed novel candidate genes and pharmaceutical targets for many diseases and identified fundamental HDN features conserved across diseases. A network-based analysis is particularly vital for a study on polygenic diseases where many interactions between molecules should be simultaneously examined and elucidated. We employ a new knowledge driven HDN gene and molecular database systems approach to analyze Inflammatory Bowel Disease (IBD), whose pathogenesis remains largely unknown. Based on drug indications for IBD, we determined sibling diseases of mild and severe states of IBD. Approximately 1,000 genes associated with the sibling diseases were retrieved from four databases. After ranking the genes by the frequency of records in the databases, we obtained 250 and 253 genes highly associated with the mild and severe IBD states, respectively. We then calculated functional similarities of these genes with known drug targets and examined and presented their interactions as PPI networks. The results demonstrate that this knowledge-based systems approach, predicated on functionally similar genes important to sibling diseases, is an effective method to identify important components of the IBD human disease network. Our approach elucidates a previously unknown biological distinction between mild and severe IBD states.

    View details for DOI 10.1186/1471-2350-13-25

    View details for Web of Science ID 000305184200001

    View details for PubMedID 22480395

    View details for PubMedCentralID PMC3368714

  • Use of machine learning to shorten observation-based screening and diagnosis of autism TRANSLATIONAL PSYCHIATRY Wall, D. P., Kosmicki, J., DeLuca, T. F., Harstad, E., Fusaro, V. A. 2012; 2


    The Autism Diagnostic Observation Schedule-Generic (ADOS) is one of the most widely used instruments for behavioral evaluation of autism spectrum disorders. It is composed of four modules, each tailored for a specific group of individuals based on their language and developmental level. On average, a module takes between 30 and 60 min to deliver. We used a series of machine-learning algorithms to study the complete set of scores from Module 1 of the ADOS available at the Autism Genetic Resource Exchange (AGRE) for 612 individuals with a classification of autism and 15 non-spectrum individuals from both AGRE and the Boston Autism Consortium (AC). Our analysis indicated that 8 of the 29 items contained in Module 1 of the ADOS were sufficient to classify autism with 100% accuracy. We further validated the accuracy of this eight-item classifier against complete sets of scores from two independent sources, a collection of 110 individuals with autism from AC and a collection of 336 individuals with autism from the Simons Foundation. In both cases, our classifier performed with nearly 100% sensitivity, correctly classifying all but two of the individuals from these two resources with a diagnosis of autism, and with 94% specificity on a collection of observed and simulated non-spectrum controls. The classifier contained several elements found in the ADOS algorithm, demonstrating high test validity, and also resulted in a quantitative score that measures classification confidence and extremeness of the phenotype. With incidence rates rising, the ability to classify autism effectively and quickly requires careful design of assessment and diagnostic tools. 
    Given the brevity, accuracy and quantitative nature of the classifier, results from this study may prove valuable in the development of mobile tools for preliminary evaluation and clinical prioritization, in particular those focused on assessment of short home videos of children, that speed the pace of initial evaluation and broaden the reach to a significantly larger percentage of the population at risk.
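
    The sensitivity and specificity figures quoted in this abstract follow the standard confusion-matrix definitions. The sketch below illustrates only those definitions; the function name and the counts are illustrative assumptions, not the study's data or code.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true cases correctly flagged
    specificity = tn / (tn + fp)  # fraction of non-cases correctly cleared
    return sensitivity, specificity

# Hypothetical counts, not taken from the paper.
sens, spec = sensitivity_specificity(tp=98, fn=2, tn=47, fp=3)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

    With so few non-spectrum individuals available, a specificity estimate rests on a small denominator, which is presumably why the study supplemented observed controls with simulated ones.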

    View details for DOI 10.1038/tp.2012.10

    View details for Web of Science ID 000306218400003

    View details for PubMedID 22832900

    View details for PubMedCentralID PMC3337074

  • Roundup 2.0: enabling comparative genomics for over 1800 genomes BIOINFORMATICS DeLuca, T. F., Cui, J., Jung, J., Gabriel, K. C., Wall, D. P. 2012; 28 (5): 715-716


    Roundup is an online database of gene orthologs for over 1800 genomes, including 226 Eukaryota, 1447 Bacteria, 113 Archaea and 21 Viruses. Orthologs are inferred using the Reciprocal Smallest Distance algorithm. Users may query Roundup for single-linkage clusters of orthologous genes based on any group of genomes. Annotated query results may be viewed in a variety of ways including as clusters of orthologs and as phylogenetic profiles. Genomic results may be downloaded in formats suitable for functional as well as phylogenetic analysis, including the recent OrthoXML standard. In addition, gene IDs can be retrieved using FASTA sequence search. All source code and orthologs are freely available.

    View details for DOI 10.1093/bioinformatics/bts006

    View details for Web of Science ID 000300986600017

    View details for PubMedID 22247275

    View details for PubMedCentralID PMC3289913

  • Cloud Computing for Comparative Genomics with Windows Azure Platform EVOLUTIONARY BIOINFORMATICS Kim, I., Jung, J., DeLuca, T. F., Nelson, T. H., Wall, D. P. 2012; 8: 527-534


    Cloud computing services have emerged as a cost-effective alternative for cluster systems as the number of genomes and required computation power to analyze them increased in recent years. Here we introduce the Microsoft Azure platform with detailed execution steps and a cost comparison with Amazon Web Services.

    View details for DOI 10.4137/EBO.S9946

    View details for Web of Science ID 000308500500001

    View details for PubMedID 23032609

    View details for PubMedCentralID PMC3433929

  • The future of genomics in pathology. F1000 medicine reports Wall, D. P., Tonellato, P. J. 2012; 4: 14-?


    The recent advances in technology and the promise of cheap and fast whole genomic data offer the possibility to revolutionise the discipline of pathology. This should allow pathologists in the near future to diagnose disease rapidly and early to change its course, and to tailor treatment programs to the individual. This review outlines some of these technical advances and the changes needed to make this revolution a reality.

    View details for DOI 10.3410/M4-14

    View details for PubMedID 22802873

  • Phylogenetically informed logic relationships improve detection of biological network organization BMC BIOINFORMATICS Cui, J., DeLuca, T. F., Jung, J., Wall, D. P. 2011; 12


    A "phylogenetic profile" refers to the presence or absence of a gene across a set of organisms, and it has proven valuable for understanding gene functional relationships and network organization. Despite this success, few studies have attempted to search beyond just pairwise relationships among genes. Here we search for logic relationships involving three genes, and explore their potential application in gene network analyses. Taking advantage of a phylogenetic matrix constructed from the large orthologs database Roundup, we invented a method to create balanced profiles for individual triplets of genes that guarantee equal weight on the different phylogenetic scenarios of coevolution between genes. When we applied this idea to LAPP, the method to search for logic triplets of genes, the balanced profiles resulted in significant performance improvement and the discovery of hundreds of thousands more putative triplets than unadjusted profiles. We found that logic triplets detected biological network organization and identified key proteins and their functions, ranging from neighbouring proteins in local pathways, to well separated proteins in the whole pathway, and to the interactions among different pathways at the system level. Finally, our case study suggested that the directionality in a logic relationship and the profile of a triplet could disclose the connectivity between the triplet and surrounding networks. Balanced profiles are superior to the raw profiles employed by traditional methods of phylogenetic profiling in searching for high order gene sets. Gene triplets can provide valuable information in detection of biological network organization and identification of key genes at different levels of cellular interaction.
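
    To make the idea of a logic relationship among three genes concrete, here is a deliberately simplified sketch. It checks which two-input logic functions of the profiles of genes A and B reproduce the profile of gene C exactly. The profiles are invented, and the exact-match test is an assumption made for illustration: it is not the LAPP scoring or the balanced-profile construction described in the abstract.

```python
import operator

# Toy presence/absence profiles across six hypothetical genomes (1 = gene present).
profile_a = [1, 1, 0, 0, 1, 0]
profile_b = [1, 0, 1, 0, 1, 0]
profile_c = [1, 0, 0, 0, 1, 0]  # present only where both A and B are present

LOGIC_FUNCTIONS = {
    "AND": operator.and_,
    "OR": operator.or_,
    "XOR": operator.xor,
}

def matching_logic(a, b, c):
    """Return names of logic functions f with f(a_i, b_i) == c_i in every genome."""
    return [
        name
        for name, f in LOGIC_FUNCTIONS.items()
        if all(f(x, y) == z for x, y, z in zip(a, b, c))
    ]

print(matching_logic(profile_a, profile_b, profile_c))
```

    In this toy case neither A nor B alone predicts C, but their conjunction does, which is the kind of relationship a pairwise comparison of profiles would miss.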

    View details for DOI 10.1186/1471-2105-12-476

    View details for Web of Science ID 000299824500001

    View details for PubMedID 22172058

    View details for PubMedCentralID PMC3402364

  • Identification of autoimmune gene signatures in autism TRANSLATIONAL PSYCHIATRY Jung, J., Kohane, I. S., Wall, D. P. 2011; 1


    The role of the immune system in neuropsychiatric diseases, including autism spectrum disorder (ASD), has long been hypothesized. This hypothesis has mainly been supported by family cohort studies and the immunological abnormalities found in ASD patients, but had limited findings in genetic association testing. Two cross-disorder genetic association tests were performed on the genome-wide data sets of ASD and six autoimmune disorders. In the polygenic score test, we examined whether ASD risk alleles with low effect sizes work collectively in specific autoimmune disorders and show significant association statistics. In the genetic variation score test, we tested whether allele-specific associations between ASD and autoimmune disorders can be found using nominally significant single-nucleotide polymorphisms. In both tests, we found that ASD is probabilistically linked to ankylosing spondylitis (AS) and multiple sclerosis (MS). Association coefficients showed that ASD and AS were positively associated, meaning that autism susceptibility alleles may have a similar collective effect in AS. The association coefficients were negative between ASD and MS. Significant associations between ASD and two autoimmune disorders were identified. This genetic association supports the idea that specific immunological abnormalities may underlie the etiology of autism, at least in a number of cases.
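
    For readers unfamiliar with polygenic scoring, the general idea is to sum many small allelic effects into one number per individual. The sketch below shows the common textbook form (log odds ratio times risk-allele count, summed over SNPs). The SNP identifiers and values are hypothetical, and the abstract does not state that the study used exactly this weighting.

```python
import math

# Hypothetical SNP odds ratios from a discovery study, and one individual's
# risk-allele counts (0, 1, or 2) in a target data set.
odds_ratios = {"rs_a": 1.10, "rs_b": 1.05, "rs_c": 0.95}
allele_counts = {"rs_a": 2, "rs_b": 0, "rs_c": 1}

def polygenic_score(odds_ratios, allele_counts):
    """Sum of log odds ratios weighted by the number of risk alleles carried."""
    return sum(math.log(odds_ratios[snp]) * allele_counts[snp] for snp in odds_ratios)

score = polygenic_score(odds_ratios, allele_counts)
print(round(score, 4))
```

    Scores computed this way in a second disorder's data set can then be tested for association with case status, which is the cross-disorder question the abstract describes.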

    View details for DOI 10.1038/tp.2011.62

    View details for Web of Science ID 000306217100007

    View details for PubMedID 22832355

    View details for PubMedCentralID PMC3309496

  • Detecting biological network organization and functional gene orthologs BIOINFORMATICS Cui, J., DeLuca, T. F., Jung, J., Wall, D. P. 2011; 27 (20): 2919-2920


    We developed a package TripletSearch to compute relationships within triplets of genes based on Roundup, an orthologous gene database containing >1500 genomes. These relationships, derived from the coevolution of genes, provide valuable information in the detection of biological network organization from the local to the system level, in the inference of protein functions and in the identification of functional orthologs. To run the computation, users need to provide the GI IDs of the genes of interest. Supplementary data are available at Bioinformatics online.

    View details for DOI 10.1093/bioinformatics/btr485

    View details for Web of Science ID 000295680600025

    View details for PubMedID 21856738

    View details for PubMedCentralID PMC3187654

  • Biomedical Cloud Computing With Amazon Web Services PLOS COMPUTATIONAL BIOLOGY Fusaro, V. A., Patil, P., Gafni, E., Wall, D. P., Tonellato, P. J. 2011; 7 (8)


    In this overview of biomedical computing in the cloud, we discussed two primary ways to use the cloud (a single instance or cluster), provided a detailed example using NGS mapping, and highlighted the associated costs. While many users new to the cloud may assume that entry is as straightforward as uploading an application and selecting an instance type and storage options, we illustrated that there is substantial up-front effort required before an application can make full use of the cloud's vast resources. Our intention was to provide a set of best practices and to illustrate how those apply to a typical application pipeline for biomedical informatics, but also general enough for extrapolation to other types of computational problems. Our mapping example was intended to illustrate how to develop a scalable project and not to compare and contrast alignment algorithms for read mapping and genome assembly. Indeed, with a newer aligner such as Bowtie, it is possible to map the entire African genome using one m2.2xlarge instance in 48 hours for a total cost of approximately $48 in computation time. In our example, we were not concerned with data transfer rates, which are heavily influenced by the amount of available bandwidth, connection latency, and network availability. When transferring large amounts of data to the cloud, bandwidth limitations can be a major bottleneck, and in some cases it is more efficient to simply mail a storage device containing the data to AWS. More information about cloud computing, detailed cost analysis, and security can be found in references.

    View details for DOI 10.1371/journal.pcbi.1002147

    View details for Web of Science ID 000294299700022

    View details for PubMedID 21901085

    View details for PubMedCentralID PMC3161908

  • Using game theory to detect genes involved in Autism Spectrum Disorder TOP Esteban, F. J., Wall, D. P. 2011; 19 (1): 121-129
  • The semantic organization of the animal category: evidence from semantic verbal fluency and network theory COGNITIVE PROCESSING Goni, J., Arrondo, G., Sepulcre, J., Martincorena, I., Velez de Mendizabal, N., Corominas-Murtra, B., Bejarano, B., Ardanza-Trevijano, S., Peraita, H., Wall, D. P., Villoslada, P. 2011; 12 (2): 183-196


    Semantic memory is the subsystem of human memory that stores knowledge of concepts or meanings, as opposed to life-specific experiences. How humans organize semantic information remains poorly understood. In an effort to better understand this issue, we conducted a verbal fluency experiment on 200 participants with the aim of inferring and representing the conceptual storage structure of the natural category of animals as a network. This was done by formulating a statistical framework for co-occurring concepts that aims to infer significant concept-concept associations and represent them as a graph. The resulting network was analyzed and enriched by means of a missing links recovery criterion based on modularity. Both network models were compared to a thresholded co-occurrence approach. They were evaluated using a random subset of verbal fluency tests and comparing the network outcomes (linked pairs are clustering transitions and disconnected pairs are switching transitions) to the outcomes of two expert human raters. Results show that the network models proposed in this study overcome a thresholded co-occurrence approach, and their outcomes are in high agreement with human evaluations. Finally, the interplay between conceptual structure and retrieval mechanisms is discussed.

    View details for DOI 10.1007/s10339-010-0372-x

    View details for Web of Science ID 000289685000005

    View details for PubMedID 20938799

  • Genotator: A disease-agnostic tool for genetic annotation of disease BMC MEDICAL GENOMICS Wall, D. P., Pivovarov, R., Tong, M., Jung, J., Fusaro, V. A., DeLuca, T. F., Tonellato, P. J. 2010; 3


    Disease-specific genetic information has been increasing at rapid rates as a consequence of recent improvements and massive cost reductions in sequencing technologies. Numerous systems designed to capture and organize this mounting sea of genetic data have emerged, but these resources differ dramatically in their disease coverage and genetic depth. With few exceptions, researchers must manually search a variety of sites to assemble a complete set of genetic evidence for a particular disease of interest, a process that is both time-consuming and error-prone. We designed a real-time aggregation tool that provides both comprehensive coverage and reliable gene-to-disease rankings for any disease. Our tool, called Genotator, automatically integrates data from 11 externally accessible clinical genetics resources and uses these data in a straightforward formula to rank genes in order of disease relevance. We tested the accuracy of coverage of Genotator in three separate diseases for which there exist specialty curated databases, Autism Spectrum Disorder, Parkinson's Disease, and Alzheimer Disease. Genotator is freely available. Genotator demonstrated that most of the 11 selected databases contain unique information about the genetic composition of disease, with 2514 genes found in only one of the 11 databases. These findings confirm that the integration of these databases provides a more complete picture than would be possible from any one database alone. Genotator successfully identified at least 75% of the top ranked genes for all three of our use cases, including a 90% concordance with the top 40 ranked candidates for Alzheimer Disease. As a meta-query engine, Genotator provides high coverage of both historical genetic research as well as recent advances in the genetic understanding of specific diseases. As such, Genotator provides a real-time aggregation of ranked data that remains current with the pace of research in the disease fields. Genotator's algorithm appropriately transforms query terms to match the input requirements of each targeted database and accurately resolves named synonyms to ensure full coverage of the genetic results with official nomenclature. Genotator generates an excel-style output that is consistent across disease queries and readily importable to other applications.

    View details for DOI 10.1186/1755-8794-3-50

    View details for Web of Science ID 000284541000001

    View details for PubMedID 21034472

    View details for PubMedCentralID PMC2990725

  • Cloud computing for comparative genomics BMC BIOINFORMATICS Wall, D. P., Kudtarkar, P., Fusaro, V. A., Pivovarov, R., Patil, P., Tonellato, P. J. 2010; 11


    Large comparative genomics studies and tools are becoming increasingly compute-expensive as the number of available genome sequences continues to rise. The capacity and cost of local computing infrastructures are likely to become prohibitive with the increase, especially as the breadth of questions continues to rise. Alternative computing architectures, in particular cloud computing environments, may help alleviate this increasing pressure and enable fast, large-scale, and cost-effective comparative genomics strategies going forward. To test this, we redesigned a typical comparative genomics algorithm, the reciprocal smallest distance algorithm (RSD), to run within Amazon's Elastic Compute Cloud (EC2). We then employed the RSD-cloud for ortholog calculations across a wide selection of fully sequenced genomes. We ran more than 300,000 RSD-cloud processes within the EC2. These jobs were farmed simultaneously to 100 high capacity compute nodes using the Amazon Web Service Elastic Map Reduce and included a wide mix of large and small genomes. The total computation time took just under 70 hours and cost a total of $6,302 USD. The effort to transform existing comparative genomics algorithms from local compute infrastructures is not trivial. However, the speed and flexibility of cloud computing environments provide a substantial boost with manageable cost. The procedure designed to transform the RSD algorithm into a cloud-ready application is readily adaptable to similar comparative genomics problems.

    View details for DOI 10.1186/1471-2105-11-259

    View details for Web of Science ID 000279730300001

    View details for PubMedID 20482786

    View details for PubMedCentralID PMC3098063

  • Cost-Effective Cloud Computing: A Case Study Using the Comparative Genomics Tool, Roundup EVOLUTIONARY BIOINFORMATICS Kudtarkar, P., DeLuca, T. F., Fusaro, V. A., Tonellato, P. J., Wall, D. P. 2010; 6: 197-203


    Comparative genomics resources, such as ortholog detection tools and repositories, are rapidly increasing in scale and complexity. Cloud computing is an emerging technological paradigm that enables researchers to dynamically build a dedicated virtual cluster and may represent a valuable alternative for large computational tools in bioinformatics. In the present manuscript, we optimize the computation of a large-scale comparative genomics resource, Roundup, using cloud computing, describe the proper operating principles required to achieve computational efficiency on the cloud, and detail important procedures for improving cost-effectiveness to ensure maximal computation at minimal costs. Utilizing the comparative genomics tool, Roundup, as a case study, we computed orthologs among 902 fully sequenced genomes on Amazon's Elastic Compute Cloud. For managing the ortholog processes, we designed a strategy to deploy the web service, Elastic MapReduce, and maximize the use of the cloud while simultaneously minimizing costs. Specifically, we created a model to estimate cloud runtime based on the size and complexity of the genomes being compared that determines in advance the optimal order of the jobs to be submitted. We computed orthologous relationships for 245,323 genome-to-genome comparisons on Amazon's computing cloud, a computation that required just over 200 hours and cost $8,000 USD, at least 40% less than expected under a strategy in which genome comparisons were submitted to the cloud randomly with respect to runtime. Our cost savings projections were based on a model that not only demonstrates the optimal strategy for deploying RSD to the cloud, but also finds the optimal cluster size to minimize waste and maximize usage. Our cost-reduction model is readily adaptable for other comparative genomics tools and potentially of significant benefit to labs seeking to take advantage of the cloud as an alternative to local computing infrastructure.
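
    Why job order matters on a fixed-size cluster can be shown with a small scheduling sketch. The abstract does not give the runtime model or the ordering rule, so the example below assumes a common heuristic (submit the longest estimated jobs first) and uses invented runtimes. It compares the finishing time of the whole batch under the two orders.

```python
import heapq

def makespan(runtimes, n_nodes):
    """Greedy scheduling: each job goes to the node that frees up first."""
    nodes = [0.0] * n_nodes          # min-heap of per-node finish times
    heapq.heapify(nodes)
    for t in runtimes:
        earliest = heapq.heappop(nodes)
        heapq.heappush(nodes, earliest + t)
    return max(nodes)

# Hypothetical estimated runtimes (hours) for genome-to-genome comparisons.
estimated = [1, 1, 1, 1, 2, 2, 6]

arbitrary_order = makespan(estimated, n_nodes=2)
longest_first = makespan(sorted(estimated, reverse=True), n_nodes=2)
print(arbitrary_order, longest_first)
```

    When a long job is submitted last, one node runs it alone while the others sit idle but are still billed. Submitting long jobs first lets the short ones fill the remaining gaps.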

    View details for DOI 10.4137/EBO.S6259

    View details for Web of Science ID 000288866900009

    View details for PubMedID 21258651

    View details for PubMedCentralID PMC3023304

  • Collaborative text-annotation resource for disease-centered relation extraction from biomedical text JOURNAL OF BIOMEDICAL INFORMATICS Cano, C., Monaghan, T., Blanco, A., Wall, D. P., Peshkin, L. 2009; 42 (5): 967-977


    Agglomerating results from studies of individual biological components has shown the potential to produce biomedical discovery and the promise of therapeutic development. Such knowledge integration could be tremendously facilitated by automated text mining for relation extraction in the biomedical literature. Relation extraction systems cannot be developed without substantial datasets annotated with ground truth for benchmarking and training. The creation of such datasets is hampered by the absence of a resource for launching a distributed annotation effort, as well as by the lack of a standardized annotation schema. We have developed an annotation schema and an annotation tool which can be widely adopted so that the resulting annotated corpora from a multitude of disease studies could be assembled into a unified benchmark dataset. The contribution of this paper is threefold. First, we provide an overview of available benchmark corpora and derive a simple annotation schema for specific binary relation extraction problems such as protein-protein and gene-disease relation extraction. Second, we present BioNotate: an open source annotation resource for the distributed creation of a large corpus. Third, we present and make available the results of a pilot annotation effort of the autism disease network.

    View details for DOI 10.1016/j.jbi.2009.02.001

    View details for Web of Science ID 000270870500021

    View details for PubMedID 19232400

    View details for PubMedCentralID PMC2757509

  • Reply to the "Letter to the Editors" by Steven Buyske NEUROGENETICS Abu-Elneel, K., Liu, T., Gazzaniga, F. S., Nishimura, Y., Wall, D. P., Geschwind, D. H., Lao, K., Kosik, K. S. 2009; 10 (2): 169–70
  • Comparative analysis of neurological disorders focuses genome-wide search for autism genes GENOMICS Wall, D. P., Esteban, F. J., DeLuca, T. F., Huyck, M., Monaghan, T., de Mendizabal, N. V., Goni, J., Kohane, I. S. 2009; 93 (2): 120-129


    The behaviors of autism overlap with a diverse array of other neurological disorders, suggesting common molecular mechanisms. We conducted a large comparative analysis of the network of genes linked to autism with those of 432 other neurological diseases to circumscribe a multi-disorder subcomponent of autism. We leveraged the biological process and interaction properties of these multi-disorder autism genes to overcome the across-the-board multiple hypothesis corrections that a purely data-driven approach requires. Using prior knowledge of biological process, we identified 154 genes not previously linked to autism of which 42% were significantly differentially expressed in autistic individuals. Then, using prior knowledge from interaction networks of disorders related to autism, we uncovered 334 new genes that interact with published autism genes, of which 87% were significantly differentially regulated in autistic individuals. Our analysis provided a novel picture of autism from the perspective of related neurological disorders and suggested a model by which prior knowledge of interaction networks can inform and focus genome-scale studies of complex neurological disorders.

    View details for DOI 10.1016/j.ygeno.2008.09.015

    View details for Web of Science ID 000263227600003

    View details for PubMedID 18950700

  • Heterogeneous dysregulation of microRNAs across the autism spectrum NEUROGENETICS Abu-Elneel, K., Liu, T., Gazzaniga, F. S., Nishimura, Y., Wall, D. P., Geschwind, D. H., Lao, K., Kosik, K. S. 2008; 9 (3): 153-161


    microRNAs (miRNAs) are approximately 21 nt transcripts capable of regulating the expression of many mRNAs and are abundant in the brain. miRNAs have a role in several complex diseases including cancer as well as some neurological diseases such as Tourette's syndrome and Fragile X syndrome. Because autism spectrum disorders (ASDs) are genetically complex, dysregulation of miRNA expression might be one of their features. Using multiplex quantitative polymerase chain reaction (PCR), we compared the expression of 466 human miRNAs from postmortem cerebellar cortex tissue of individuals with ASD (n = 13) and a control set of non-autistic cerebellar samples (n = 13). While most miRNA levels showed little variation across all samples, suggesting that autism does not induce global dysfunction of miRNA expression, some miRNAs among the autistic samples were expressed at significantly different levels compared to the mean control value. Twenty-eight miRNAs were expressed at significantly different levels compared to the non-autism control set in at least one of the autism samples. To validate the finding, we reversed the analysis and compared each non-autism control to a single mean value for each miRNA across all autism cases. In this analysis, the number of dysregulated miRNAs fell from 28 to 9 miRNAs. Among the predicted targets of dysregulated miRNAs are genes that are known genetic causes of autism such as Neurexin and SHANK3. This study finds that altered miRNA expression levels are observed in postmortem cerebellar cortex from autism patients, a finding which suggests that dysregulation of miRNAs may contribute to autism spectrum phenotype.

    View details for DOI 10.1007/s10048-008-0133-5

    View details for Web of Science ID 000257216200001

    View details for PubMedID 18563458

  • Testing the Accuracy of Eukaryotic Phylogenetic Profiles for Prediction of Biological Function EVOLUTIONARY BIOINFORMATICS Singh, S., Wall, D. P. 2008; 4: 217-223


    A phylogenetic profile captures the pattern of gene gain and loss throughout evolutionary time. Proteins that interact directly or indirectly within the cell to perform a biological function will often co-evolve, and this co-evolution should be well reflected within their phylogenetic profiles. Thus similar phylogenetic profiles are commonly used for grouping proteins into functional groups. However, it remains unclear how the size and content of the phylogenetic profile impacts the ability to predict function, particularly in Eukaryotes. Here we developed a straightforward approach to address this question by constructing a complete set of phylogenetic profiles for 31 fully sequenced Eukaryotes. Using Gene Ontology as our gold standard, we compared the accuracy of functional predictions made by a comprehensive array of permutations on the complete set of genomes. Our permutations showed that phylogenetic profiles containing between 25 and 31 Eukaryotic genomes performed equally well and significantly better than all other permuted genome sets, with one exception: we uncovered a core group of 18 genomes that achieved statistically identical accuracy. This core group contained genomes from each branch of the eukaryotic phylogeny, but also contained several groups of closely related organisms, suggesting that a balance between phylogenetic breadth and depth may improve our ability to use Eukaryotic specific phylogenetic profiles for functional annotations.
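
    The grouping of proteins by similar phylogenetic profiles can be illustrated with a short sketch. It represents each profile as a tuple of presence/absence calls and pairs up genes whose profiles differ in at most one genome. The gene names, the profiles, and the choice of Hamming distance with a threshold of one are assumptions for illustration, not the similarity measure used in the study.

```python
from itertools import combinations

# Toy profiles: 1 = ortholog present in that genome. Gene names are hypothetical.
profiles = {
    "geneA": (1, 1, 0, 1, 0),
    "geneB": (1, 1, 0, 1, 0),
    "geneC": (0, 0, 1, 0, 1),
    "geneD": (1, 1, 0, 0, 0),
}

def hamming(p, q):
    """Number of genomes in which the two profiles disagree."""
    return sum(x != y for x, y in zip(p, q))

def similar_pairs(profiles, max_distance=1):
    """Gene pairs whose profiles differ in at most max_distance genomes."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if hamming(profiles[g1], profiles[g2]) <= max_distance
    ]

print(similar_pairs(profiles))
```

    Dropping or adding genomes changes the length and content of each tuple, which is how the choice of genome set can shift which genes end up grouped together.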

    View details for Web of Science ID 000264677700019

    View details for PubMedID 19204819

  • Phylogeny of the Calymperaceae with a rank-free systematic treatment BRYOLOGIST Fisher, K. M., Wall, D. P., Yip, K., Mishler, B. D. 2007; 110 (1): 46–73
  • Ortholog detection using the reciprocal smallest distance algorithm. Methods in molecular biology (Clifton, N.J.) Wall, D. P., Deluca, T. 2007; 396: 95-110


    All protein coding genes have a phylogenetic history that when understood can lead to deep insights into the diversification or conservation of function, the evolution of developmental complexity, and the molecular basis of disease. One important part of reconstructing the relationships among genes in different organisms is an accurate method to find orthologs as well as an accurate measure of evolutionary diversification. The present chapter details such a method, called the reciprocal smallest distance algorithm (RSD). This approach improves upon the common procedure of taking reciprocal best Basic Local Alignment Search Tool hits (RBH) in the identification of orthologs by using global sequence alignment and maximum likelihood estimation of evolutionary distances to detect orthologs between two genomes. RSD finds many putative orthologs missed by RBH because it is less likely to be misled by the presence of close paralogs in genomes. The package offers a tremendous amount of flexibility in investigating parameter settings allowing the user to search for increasingly distant orthologs between highly divergent species, among other advantages. The flexibility of this tool makes it a unique and powerful addition to other available approaches for ortholog detection.
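
    The reciprocal step at the heart of RSD can be sketched on a toy example. The code below assumes the pairwise evolutionary distances are already known and keeps only pairs in which each gene is the other's smallest-distance partner. The gene names and distances are invented, and the BLAST search, global alignment, and maximum likelihood estimation that the real method uses to obtain distances are omitted.

```python
# Hypothetical pairwise evolutionary distances between genes of genome A (a*)
# and genome B (b*). Real RSD estimates these by maximum likelihood from
# global alignments of BLAST hits; here they are simply given.
distances = {
    ("a1", "b1"): 0.10, ("a1", "b2"): 0.45,
    ("a2", "b1"): 0.30, ("a2", "b2"): 0.20,
    ("a3", "b1"): 0.15, ("a3", "b2"): 0.60,
}

def closest(gene, side):
    """Partner with the smallest distance to `gene`; side 0 = genome A, 1 = genome B."""
    candidates = {pair[1 - side]: d for pair, d in distances.items() if pair[side] == gene}
    return min(candidates, key=candidates.get)

def reciprocal_smallest_distance():
    """Pairs (a, b) where each gene is the other's smallest-distance partner."""
    genes_a = sorted({a for a, _ in distances})
    return [(a, closest(a, 0)) for a in genes_a if closest(closest(a, 0), 1) == a]

print(reciprocal_smallest_distance())
```

    Here a3 is closest to b1, but b1 is closer to a1, so a3 is left unpaired. This is the situation a close paralog creates, and the reciprocal check is what filters it out.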

    View details for PubMedID 18025688

  • Roundup: a multi-genome repository of orthologs and evolutionary distances BIOINFORMATICS DeLuca, T. F., Wu, I., Pu, J., Monaghan, T., Peshkin, L., Singh, S., Wall, D. P. 2006; 22 (16): 2044-2046


    We have created a tool for ortholog and phylogenetic profile retrieval called Roundup. Roundup is backed by a massive repository of orthologs and associated evolutionary distances that was built using the reciprocal smallest distance algorithm, an approach that has been shown to improve upon alternative approaches of ortholog detection, such as reciprocal blast. Presently, the Roundup repository contains all possible pair-wise comparisons for over 250 genomes, including 32 Eukaryotes, more than doubling the coverage of any similar resource. The orthologs are accessible through an intuitive web interface that allows searches by genome or gene identifier, presenting results as phylogenetic profiles together with gene and molecular function annotations. Results may be downloaded as phylogenetic matrices for subsequent analysis, including the construction of whole-genome phylogenies based on gene-content data.

    View details for DOI 10.1093/bioinformatics/btl286

    View details for Web of Science ID 000239900200016

    View details for PubMedID 16777906

  • Heparan sulfate proteoglycans and the emergence of neuronal connectivity CURRENT OPINION IN NEUROBIOLOGY Van Vactor, D., Wall, D. P., Johnson, K. G. 2006; 16 (1): 40-51


    With the identification of the molecular determinants of neuronal connectivity, our understanding of the extracellular information that controls axon guidance and synapse formation has evolved from single factors towards the complexity that neurons face in a living organism. As we move in this direction - ready to see the forest for the trees - attention is returning to one of the most ancient regulators of cell-cell interaction: the extracellular matrix. Among many matrix components that influence neuronal connectivity, recent studies of the heparan sulfate proteoglycans suggest that these ancient molecules function as versatile extracellular scaffolds that both sculpt the landscape of extracellular cues and modulate the way that neurons perceive the world around them.

    View details for DOI 10.1016/j.conb.2006.01.011

    View details for Web of Science ID 000236136200007

    View details for PubMedID 16417999

  • The role of selection in the evolution of human mitochondrial genomes GENETICS Kivisild, T., Shen, P. D., Wall, D. P., Do, B., Sung, R., Davis, K., Passarino, G., Underhill, P. A., Scharfe, C., Torroni, A., Scozzari, R., Modiano, D., Coppa, A., de Knijff, P., Feldman, M., Cavalli-Sforza, L. L., Oefner, P. J. 2006; 172 (1): 373-387


    High mutation rate in mammalian mitochondrial DNA generates a highly divergent pool of alleles even within species that have dispersed and expanded in size recently. Phylogenetic analysis of 277 human mitochondrial genomes revealed a significant (P < 0.01) excess of rRNA and nonsynonymous base substitutions among hotspots of recurrent mutation. Most hotspots involved transitions from guanine to adenine that, with thymine-to-cytosine transitions, illustrate the asymmetric bias in codon usage at synonymous sites on the heavy-strand DNA. The mitochondrion-encoded tRNAThr varied significantly more than any other tRNA gene. Threonine and valine codons were involved in 259 of the 414 amino acid replacements observed. The ratio of nonsynonymous changes from and to threonine and valine differed significantly (P = 0.003) between populations with neutral (22/58) and populations with significantly negative Tajima's D values (70/76), independent of their geographic location. In contrast to a recent suggestion that the excess of nonsilent mutations is characteristic of Arctic populations, implying their role in cold adaptation, we demonstrate that the surplus of nonsynonymous mutations is a general feature of the young branches of the phylogenetic tree, affecting also those that are found only in Africa. We introduce a new calibration method of the mutation rate of synonymous transitions to estimate the coalescent times of mtDNA haplogroups.

    View details for DOI 10.1534/genetics.105.043901

    View details for Web of Science ID 000235197700033

    View details for PubMedID 16172508

    View details for PubMedCentralID PMC1456165

  • Converging on a general model of protein evolution TRENDS IN BIOTECHNOLOGY Herbeck, J. T., Wall, D. P. 2005; 23 (10): 485-487


    The availability of high-throughput genomic databases that establish protein dispensability, expression and interaction networks enables rigorous tests of competing models of protein evolution. Recent research utilizing these new data sets shows that protein evolution is more complex than was previously thought. Several variables, including protein dispensability, expression, functional density, and genetic modularity, appear to have independent effects on the evolutionary rate of proteins, suggesting that proteomes have evolved via an assembly of selectional regimes. These results indicate that a general model of protein evolution will emerge as more functional genomic data from a diversity of organisms accumulate.

    View details for DOI 10.1016/j.tibtech.2005.07.009

    View details for Web of Science ID 000232605900001

    View details for PubMedID 16054255

  • Origin and rapid diversification of a tropical moss EVOLUTION Wall, D. P. 2005; 59 (7): 1413-1424


    Molecular sequences rarely evolve at a constant rate. Yet, even in instances where a clock can be assumed or approximated for a particular set of sequences, fossils or clear patterns of vicariance are rarely available to calibrate the clock. Thus, obtaining absolute timing for diversification of natural lineages can prove difficult. Unfortunately, without absolute time we cannot develop a complete understanding of important evolutionary processes, including adaptive radiations and key innovations. In the present study, the coding sequence of the nuclear gene, glyceraldehyde 3-phosphate dehydrogenase (gpd), extracted from the paleotropical moss, Mitthyridium, was found to exhibit clocklike behavior and used to reconstruct the history of 80 distinct molecular lineages that cover the full geographic range of Mitthyridium. Two separate clades endemic to two geographically distinct oceanic archipelagos were revealed by this phylogenetic analysis. This allowed the use of island age (as derived from potassium-argon dating) as a maximum age of origin of each monophyletic group, providing two independent time anchors for the clock found in gpd, the final piece needed to study absolute time. Based on results from both maximum age calibrations, which separately yielded highly consistent estimates, the ancestor of this moss group arose approximately 8 million years ago, and then diversified at the rapid rate of 0.56 +/- 0.004 new lineages per million years. Such a rate is on par with the highest diversification rates reported in the literature including rapidly radiating insular groups like the Hawaiian silversword alliance, a classic example of an adaptive radiation. Using independent sources of data, it was found that neither the age nor diversification estimates were affected by the use of molecular lineages rather than species as the operational taxonomic units. 
    Identifying the cause for this rapid diversification requires further testing, but it appears to be related to a general shift in reproductive strategy from sexual to asexual, which may be a key innovation for this young group.

    View details for Web of Science ID 000230975600004

    View details for PubMedID 16153028

  • Conservation of the RB1 gene in human and primates (vol 25, pg 396, 2005) HUMAN MUTATION Sivakumaran, T. A., Shen, P. D., Wall, D. P., Do, B. H., Kucheria, K., Oefner, P. J. 2005; 25 (5): 501

    View details for DOI 10.1002/humu.20186

    View details for Web of Science ID 000228905000013

  • Functional genomic analysis of the rates of protein evolution PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Wall, D. P., Hirsh, A. E., Fraser, H. B., Kumm, J., Giaever, G., Eisen, M. B., Feldman, M. W. 2005; 102 (15): 5483-5488


    The evolutionary rates of proteins vary over several orders of magnitude. Recent work suggests that analysis of large data sets of evolutionary rates in conjunction with the results from high-throughput functional genomic experiments can identify the factors that cause proteins to evolve at such dramatically different rates. To this end, we estimated the evolutionary rates of >3,000 proteins in four species of the yeast genus Saccharomyces and investigated their relationship with levels of expression and protein dispensability. Each protein's dispensability was estimated by the growth rate of mutants deficient for the protein. Our analyses of these improved evolutionary and functional genomic data sets yield three main results. First, dispensability and expression have independent, significant effects on the rate of protein evolution. Second, measurements of expression levels in the laboratory can be used to filter data sets of dispensability estimates, removing variates that are unlikely to reflect real biological effects. Third, structural equation models show that although we may reasonably infer that dispensability and expression have significant effects on protein evolutionary rate, we cannot yet accurately estimate the relative strengths of these effects.

    View details for DOI 10.1073/pnas.0501761102

    View details for Web of Science ID 000228376600036

    View details for PubMedID 15800036

    View details for PubMedCentralID PMC555735

  • Adjusting for selection on synonymous sites in estimates of evolutionary distance MOLECULAR BIOLOGY AND EVOLUTION Hirsh, A. E., Fraser, H. B., Wall, D. P. 2005; 22 (1): 174-177


    Evolution at silent sites is often used to estimate the pace of selectively neutral processes or to infer differences in divergence times of genes. However, silent sites are subject to selection in favor of preferred codons, and the strength of such selection varies dramatically across genes. Here, we use the relationship between codon bias and synonymous divergence observed in four species of the genus Saccharomyces to provide a simple correction for selection on silent sites.

    View details for DOI 10.1093/molbev/msh265

    View details for Web of Science ID 000225730100018

    View details for PubMedID 15371530

  • Conservation of the RB1 gene in human and primates HUMAN MUTATION Sivakumaran, T. A., Shen, P. D., Wall, D. P., Do, B. H., Kucheria, K., Oefner, P. J. 2005; 25 (4): 396-409


    Mutations in the RB1 gene are associated with retinoblastoma, which has served as an important model for understanding hereditary predisposition to cancer. Despite the great scrutiny that RB1 has enjoyed as the prototypical tumor suppressor gene, it has never been the object of a comprehensive survey of sequence variation in diverse human populations and primates. Therefore, we analyzed the coding (2,787 bp) and adjacent intronic and untranslated (7,313 bp) sequences of RB1 in 137 individuals from a wide range of ethnicities, including 19 Asian Indian hereditary retinoblastoma cases, and five primate species. Aside from nine apparently disease-associated mutations, 52 variants were identified. They included six singleton, coding variants that comprised five amino acid replacements and one silent site. Nucleotide diversity of the coding region (pi=0.0763+/-1.35 x 10(-4)) was 52 times lower than that of the noncoding regions (pi=3.93+/-5.26 x 10(-4)), indicative of significant sequence conservation. The occurrence of purifying selection was corroborated by phylogeny-based maximum likelihood analysis of the RB1 sequences of human and five primates, which yielded an estimated ratio of replacement to silent substitutions (omega) of 0.095 across all lineages. RB1 displayed extensive linkage disequilibrium over 174 kb, and only four unique recombination events, two in Africa and one each in Europe and Southwest Asia, were observed. Using a parsimony approach, 15 haplotypes could be inferred. Ten were found in Africa, though only 12.4% of the 274 chromosomes screened were of African origin. In non-Africans, a single haplotype accounted for from 63 to 84% of all chromosomes, most likely the consequence of natural selection and a significant bottleneck in effective population size during the colonization of the non-African continents.

    View details for DOI 10.1002/humu.20154

    View details for Web of Science ID 000228099600009

    View details for PubMedID 15776430

  • Improved haematopoietic recovery following transplantation with ex vivo-expanded mobilized blood cells 45th Annual Meeting and Exhibition of the American Society of Hematology Prince, H. M., Simmons, P. J., Whitty, G., Wall, D. P., Barber, L., Toner, G. C., Seymour, J. F., Richardson, G., Mrongovius, R., Haylock, D. N. WILEY-BLACKWELL PUBLISHING, INC. 2004: 536–45


    Infusions of ex vivo-expanded (EXE) mobilized blood cells have been explored to enhance haematopoietic recovery following high dose chemotherapy (HDT). However, prior studies have not consistently demonstrated improvements in trilineage haematopoietic recovery. Three cohorts of three patients with breast cancer received three cycles of repetitive HDT supported by either unmanipulated (UM) and/or EXE cells. Efficacy was assessed by an internal comparison of each patient's consecutive HDT cycles, and to 106 historical UM infusions. Twenty-one cycles were supported by EXE cells and six by UM cells alone. Infusions of EXE cells resulted in fewer days with an absolute neutrophil count (ANC) <0.1 x 10(9)/l (median 2 vs. 4 d, P = 0.002) and 3 d faster ANC recovery to >0.1 x 10(9)/l (median 5 vs. 8 d, P = 0.0002). This resulted in a major reduction in the incidence of febrile neutropenia compared with UM cycles (0% vs. 83%; P = 0.008) and in 66% of historical UM cycles (P = 0.01) and a marked reduction in hospital re-admission. There were also fewer platelet transfusions required (43% vs. 100%; P = 0.009). We conclude that EXE cells enhance both neutrophil and platelet recovery and reduce febrile neutropenia, platelet transfusion and hospital re-admission.

    View details for DOI 10.1111/j.1365-2141.2004.05081.x

    View details for Web of Science ID 000223036300011

    View details for PubMedID 15287947

  • Coevolution of gene expression among interacting proteins PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA Fraser, H. B., Hirsh, A. E., Wall, D. P., Eisen, M. B. 2004; 101 (24): 9033-9038


    Physically interacting proteins or parts of proteins are expected to evolve in a coordinated manner that preserves proper interactions. Such coevolution at the amino acid-sequence level is well documented and has been used to predict interacting proteins, domains, and amino acids. Interacting proteins are also often precisely coexpressed with one another, presumably to maintain proper stoichiometry among interacting components. Here, we show that the expression levels of physically interacting proteins coevolve. We estimate average expression levels of genes from four closely related fungi of the genus Saccharomyces using the codon adaptation index and show that expression levels of interacting proteins exhibit coordinated changes in these different species. We find that this coevolution of expression is a more powerful predictor of physical interaction than is coevolution of amino acid sequence. These results demonstrate that gene expression levels can coevolve, adding another dimension to the study of the coevolution of interacting proteins and underscoring the importance of maintaining coexpression of interacting proteins over evolutionary time. Our results also suggest that expression coevolution can be used for computational prediction of protein-protein interactions.

    View details for DOI 10.1073/pnas.0402591101

    View details for Web of Science ID 000222104900038

    View details for PubMedID 15175431

    View details for PubMedCentralID PMC439012

  • Extended haplotype block structure and evidence for selection in a 900 kb region of the ATM Gene in human and chimpanzee. Thorstenson, Y. R., Shen, P., Wall, D. P., Wayne, T. L., Chou, Davis, R. W., Oefner, P. J. UNIV CHICAGO PRESS. 2003: 427
  • Detecting putative orthologs BIOINFORMATICS Wall, D. P., Fraser, H. B., Hirsh, A. E. 2003; 19 (13): 1710-1711


    We developed an algorithm that improves upon the common procedure of taking reciprocal best BLAST hits (rbh) in the identification of orthologs. The method, the reciprocal smallest distance algorithm (rsd), relies on global sequence alignment and maximum likelihood estimation of evolutionary distances to detect orthologs between two genomes. rsd finds many putative orthologs missed by rbh because it is less likely than rbh to be misled by the presence of a close paralog.

    View details for DOI 10.1093/bioinformatics/btg213

    View details for Web of Science ID 000185310600016

    View details for PubMedID 15593400

  • Gene expression level influences amino acid usage, but not codon usage, in the tsetse fly endosymbiont Wigglesworthia MICROBIOLOGY-SGM Herbeck, J. T., Wall, D. P., Wernegreen, J. J. 2003; 149: 2585-2596


    Wigglesworthia glossinidia brevipalpis, the obligate bacterial endosymbiont of the tsetse fly Glossina brevipalpis, is characterized by extreme genome reduction and AT nucleotide composition bias. Here, multivariate statistical analyses are used to test the hypothesis that mutational bias and genetic drift shape synonymous codon usage and amino acid usage of Wigglesworthia. The results show that synonymous codon usage patterns vary little across the genome and do not distinguish genes of putative high and low expression levels, thus indicating a lack of translational selection. Extreme AT composition bias across the genome also drives relative amino acid usage, but predicted high-expression genes (ribosomal proteins and chaperonins) use GC-rich amino acids more frequently than do low-expression genes. The levels and configuration of amino acid differences between Wigglesworthia and Escherichia coli were compared to test the hypothesis that the relatively GC-rich amino acid profiles of high-expression genes reflect greater amino acid conservation at these loci. This hypothesis is supported by reduced levels of protein divergence at predicted high-expression Wigglesworthia genes and similar configurations of amino acid changes across expression categories. Combined, the results suggest that codon and amino acid usage in the Wigglesworthia genome reflect a strong AT mutational bias and elevated levels of genetic drift, consistent with expected effects of an endosymbiotic lifestyle and repeated population bottlenecks. However, these impacts of mutation and drift are apparently attenuated by selection on amino acid composition at high-expression genes.

    View details for DOI 10.1099/mic.0.26381-0

    View details for Web of Science ID 000185342900027

    View details for PubMedID 12949182

  • Evolutionary patterns of codon usage in the chloroplast gene rbcL JOURNAL OF MOLECULAR EVOLUTION Wall, D. P., Herbeck, J. T. 2003; 56 (6): 673-688


    In this study we reconstruct the evolution of codon usage bias in the chloroplast gene rbcL using a phylogeny of 92 green-plant taxa. We employ a measure of codon usage bias that accounts for chloroplast genomic nucleotide content, as an attempt to limit plausible explanations for patterns of codon bias evolution to selection- or drift-based processes. This measure uses maximum likelihood-ratio tests to compare the performance of two models, one in which a single codon is overrepresented and one in which two codons are overrepresented. The measure allowed us to analyze both the extent of bias in each lineage and the evolution of codon choice across the phylogeny. Despite predictions based primarily on the low G + C content of the chloroplast and the high functional importance of rbcL, we found large differences in the extent of bias, suggesting differential molecular selection that is clade specific. The seed plants and simple leafy liverworts each independently derived a low level of bias in rbcL, perhaps indicating relaxed selectional constraint on molecular changes in the gene. Overrepresentation of a single codon was typically plesiomorphic, and transitions to overrepresentation of two codons occurred commonly across the phylogeny, possibly indicating biochemical selection. The total codon bias in each taxon, when regressed against the total bias of each amino acid, suggested that twofold amino acids play a strong role in inflating the level of codon usage bias in rbcL, despite the fact that twofolds compose a minority of residues in this gene. Those amino acids that contributed most to the total codon usage bias of each taxon are known through amino acid knockout and replacement to be of high functional importance. This suggests that codon usage bias may be constrained by particular amino acids and, thus, may serve as a good predictor of what residues are most important for protein fitness.

    View details for DOI 10.1007/s00239-002-2436-8

    View details for Web of Science ID 000183129100004

    View details for PubMedID 12911031

  • A simple dependence between protein evolution rate and the number of protein-protein interactions BMC EVOLUTIONARY BIOLOGY Fraser, H. B., Wall, D. P., Hirsh, A. E. 2003; 3


    It has been shown for an evolutionarily distant genomic comparison that the number of protein-protein interactions a protein has correlates negatively with their rates of evolution. However, the generality of this observation has recently been challenged. Here we examine the problem using protein-protein interaction data from the yeast Saccharomyces cerevisiae and genome sequences from two other yeast species. In contrast to a previous study that used an incomplete set of protein-protein interactions, we observed a highly significant correlation between number of interactions and evolutionary distance to either Candida albicans or Schizosaccharomyces pombe. This study differs from the previous one in that it includes all known protein interactions from S. cerevisiae, and a larger set of protein evolutionary rates. In both evolutionary comparisons, a simple monotonic relationship was found across the entire range of the number of protein-protein interactions. In agreement with our earlier findings, this relationship cannot be explained by the fact that proteins with many interactions tend to be important to yeast. The generality of these correlations in other kingdoms of life unfortunately cannot be addressed at this time, due to the incompleteness of protein-protein interaction data from organisms other than S. cerevisiae. Protein-protein interactions tend to slow the rate at which proteins evolve. This may be due to structural constraints that must be met to maintain interactions, but more work is needed to definitively establish the mechanism(s) behind the correlations we have observed.

    View details for Web of Science ID 000188122100011

    View details for PubMedID 12769820

  • Use of the nuclear gene glyceraldehyde 3-phosphate dehydrogenase for phylogeny reconstruction of recently diverged lineages in Mitthyridium (Musci: Calymperaceae) MOLECULAR PHYLOGENETICS AND EVOLUTION Wall, D. P. 2002; 25 (1): 10-26


    A portion of the nuclear gene glyceraldehyde 3-phosphate dehydrogenase (gpd) was sequenced in 26 representatives of the paleotropical moss, Mitthyridium, and a group of 20 outgroup taxa to assess its utility for phylogenetic reconstruction compared with the better understood chloroplast markers, rps4 and trnL. Primers based on plant and fungal sequences were designed to amplify gpd in plants universally with the exclusion of fungal contaminants. The piece amplified spanned 4 introns and 3 of 9 exons, based on comparisons with complete sequence from Arabidopsis. Size variation in gpd ranged from 891 to 1007 bp, in part attributable to 6 indels of variable length found within the introns. Intron 6 contributed most of the length variation and contained a variable purine-repeat motif of possible use as a microsatellite. Phylogenetic analyses of the full gpd amplicon yielded well-resolved trees that were in nearly full accord with the trees derived from the cpDNA partitions for analyses of both the ingroup and ingroup + outgroup taxon sets. Pairwise nucleotide substitution rates of gpd were as much as 2.2 times higher than those in rps4 and 2.8 times higher than in trnL. Excision of the introns left suitable numbers of parsimony informative characters and demonstrated that the full gpd amplicon could be compartmentalized to provide resolution for both shallow and deep phylogenetic branches. Exons of gpd were found to behave in a clock-like fashion for the 26 ingroup taxa and select outgroups. In general, gpd was found to hold great promise not only for improving resolution of chloroplast-derived phylogenies, but also for phylogenetic reconstruction of recent, diversifying lineages.

    View details for Web of Science ID 000179028400002

    View details for PubMedID 12383747

  • Phylogenetic relationships within the haplolepideous mosses BRYOLOGIST La Farge, C., Mishler, B. D., Wheeler, J. A., Wall, D. P., Johannes, K., Schaffer, S., Shaw, A. J. 2000; 103 (2): 257–76
  • Vegetation and elevational gradients within a bottomland hardwood forest of southeastern Louisiana AMERICAN MIDLAND NATURALIST Wall, D. P., Darwin, S. P. 1999; 142 (1): 17–30

    View details for Web of Science ID A1990EK67800024

    View details for PubMedID 2283290