Bio


Tong completed her Ph.D. in Biomedical Engineering at the University of Rochester. She also holds an M.S. in Biostatistics from Northwestern University and a B.S. in Medical Imaging from Sichuan University.
In her research, Tong has explored topics such as subcortical and cortical neural responses to naturalistic speech and music, neural mechanisms underlying musical perception, and the impact of visual cues on speech-in-noise comprehension.
Currently, Tong is involved in the Speaker-Listener projects, where she investigates brain activity related to natural communication. She is excited to deepen her understanding of the auditory processing of speech during communication and its implications for improving quality of life, particularly in clinical populations such as individuals with ASD or AD.
Outside of her research, Tong is a music producer, creating original songs and soundtracks for video games. She has a passion for exploring the intersection of art and technology.

Boards, Advisory Committees, Professional Organizations


  • Committee member, Association for Research in Otolaryngology (2023 - Present)

Professional Education


  • PhD, University of Rochester, Biomedical Engineering (2024)
  • MSc, Northwestern University, Biostatistics (2018)
  • BSc, Sichuan University, Medical Technology (Medical Imaging) (2016)

Stanford Advisors


All Publications


  • Chimeric Music Reveals an Interaction of Pitch and Time in Electrophysiological Signatures of Music Encoding. Journal of Neuroscience Shan, T., Lalor, E. C., Maddox, R. K. 2026; 46 (4)

    Abstract

    Pitch and time are the essential dimensions defining musical melody. Recent electrophysiological studies have explored the neural encoding of musical pitch and time by leveraging probabilistic models of their sequences, but few have studied how the features might interact. This study examines these interactions by introducing "chimeric music," which pairs two distinct melodies and exchanges their pitch contours and note onset times to create two new melodies, distorting musical patterns while maintaining the marginal statistics of the original pieces' pitch and temporal sequences. Through this manipulation, we aimed to dissect music processing and the interaction between pitch and time. Employing the temporal response function framework, we analyzed the neural encoding of melodic expectation and musical downbeats in participants with varying levels of musical training. Our findings from 27 participants of either sex revealed differences in the encoding of melodic expectation between original and chimeric stimuli in both dimensions, with a significant impact of musical experience. This suggests that decoupling the pitch and temporal structure affects expectation processing. In our analysis of downbeat encoding, we found an enhanced neural response when participants heard a note that aligned with the downbeat during music listening. In chimeric music, responses to downbeats were larger when the note was also a downbeat in the original music that provided the pitch sequence, indicating an effect of pitch structure on beat perception. This study advances our understanding of the neural underpinnings of music, emphasizing the significance of pitch-time interaction in the neural encoding of music.

    View details for DOI 10.1523/JNEUROSCI.2083-24.2025

    View details for Web of Science ID 001682089100007

    View details for PubMedID 41419334

    View details for PubMedCentralID PMC12853259

  • Comparing methods for deriving the auditory brainstem response to continuous speech in human listeners. Imaging neuroscience (Cambridge, Mass.) Shan, T., Maddox, R. K. 2025; 3

    Abstract

    Several methods have recently been developed to derive the auditory brainstem response (ABR) from continuous natural speech, facilitating investigation into subcortical encoding of speech. These tools rely on deconvolution to compute the temporal response function (TRF), which models the subcortical auditory pathway as a linear system, where a nonlinearly processed stimulus is taken as the input (i.e., regressor), the electroencephalogram (EEG) data as the output, and the ABR as the impulse response deconvolved from the recorded EEG and the regressor. In this study, we analyzed EEG recordings from subjects listening to both unaltered natural speech and synthesized "peaky speech." We compared the derived ABR TRFs using three regressors: the half-wave rectified stimulus (HWR) from Maddox and Lee (2018), the glottal pulse train (GP) from Polonenko and Maddox (2021), and the auditory nerve modeled response (ANM; Zilany et al., 2009, 2014) used in Shan et al. (2024). Our evaluation focused on the signal-to-noise ratio, prediction accuracy, efficiency, and practicality of applying each regressor in both unaltered and peaky speech. The results indicate that the ANM regressor with peaky speech provides the best performance, with the ANM for unaltered speech and the GP regressor for peaky speech close behind, whereas the HWR regressor demonstrated relatively poorer performance. There are, thus, multiple stimulus and analysis tools that can provide high-quality subcortical TRFs, with the choices for which to use dictated by experimental needs. The findings in this study will guide future research and clinical use in selecting the most appropriate paradigm for ABR derivation from continuous, naturalistic speech.

    View details for DOI 10.1162/IMAG.a.19

    View details for PubMedID 40800859

    View details for PubMedCentralID PMC12319856
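
    The deconvolution framework described in the abstract above can be sketched in a few lines of code. This is an illustrative reconstruction, not the authors' published analysis: the function name, the regularization constant, and the lag window are arbitrary choices for the example, and the regressor is assumed to be any stimulus-derived waveform (e.g., a rectified stimulus) sampled at the same rate as the EEG.

    ```python
    import numpy as np

    def derive_trf(regressor, eeg, fs, t_min=-0.01, t_max=0.03):
        """Estimate an impulse response (TRF) by regularized frequency-domain
        deconvolution of a recorded signal against a stimulus regressor.

        regressor : 1-D stimulus-derived input waveform
        eeg       : 1-D recorded EEG, same length and sampling rate
        fs        : sampling rate in Hz
        Returns lag times (s) and the deconvolved response in [t_min, t_max).
        """
        n = len(regressor)
        X = np.fft.rfft(regressor)
        Y = np.fft.rfft(eeg)
        # Regularized spectral division: H = conj(X) * Y / (|X|^2 + eps),
        # where eps guards against division by near-zero stimulus power.
        eps = 1e-3 * np.mean(np.abs(X) ** 2)
        h = np.fft.irfft(np.conj(X) * Y / (np.abs(X) ** 2 + eps), n=n)
        # Circular deconvolution wraps negative lags to the end of the array.
        lags = np.arange(int(t_min * fs), int(t_max * fs))
        return lags / fs, h[lags % n]
    ```

    In practice, toolkits for this kind of analysis fit the TRF with cross-validated ridge regression over many EEG channels rather than a single spectral division, but the linear-systems idea is the same.
    
    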

  • Subcortical responses to music and speech are alike while cortical responses diverge. Scientific reports Shan, T., Cappelloni, M. S., Maddox, R. K. 2024; 14 (1): 789

    Abstract

    Music and speech are encountered daily and are unique to human beings. Both are transformed by the auditory pathway from an initial acoustical encoding to higher level cognition. Studies of cortex have revealed distinct brain responses to music and speech, but differences may emerge in the cortex or may be inherited from different subcortical encoding. In the first part of this study, we derived the human auditory brainstem response (ABR), a measure of subcortical encoding, to recorded music and speech using two analysis methods. The first method, described previously and acoustically based, yielded very different ABRs between the two sound classes. The second method, however, developed here and based on a physiological model of the auditory periphery, gave highly correlated responses to music and speech. We determined the superiority of the second method through several metrics, suggesting there is no appreciable impact of stimulus class (i.e., music vs speech) on the way stimulus acoustics are encoded subcortically. In this study's second part, we considered the cortex. Our new analysis method resulted in cortical music and speech responses becoming more similar but with remaining differences. The subcortical and cortical results taken together suggest that there is evidence for stimulus-class dependent processing of music and speech at the cortical but not subcortical level.

    View details for DOI 10.1038/s41598-023-50438-0

    View details for PubMedID 38191488

    View details for PubMedCentralID PMC10774448

  • Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face. Trends in hearing Shan, T., Wenner, C. E., Xu, C., Duan, Z., Maddox, R. K. 2022; 26: 23312165221136934

    Abstract

    Listening in a noisy environment is challenging, but many previous studies have demonstrated that comprehension of speech can be substantially improved by looking at the talker's face. We recently developed a deep neural network (DNN) based system that generates movies of a talking face from speech audio and a single face image. In this study, we aimed to quantify the benefits that such a system can bring to speech comprehension, especially in noise. The target speech audio was masked with signal-to-noise ratios of -9, -6, -3, and 0 dB and was presented to subjects in three audio-visual (AV) stimulus conditions: (1) synthesized AV: audio with the synthesized talking face movie; (2) natural AV: audio with the original movie from the corpus; and (3) audio-only: audio with a static image of the talker. Subjects were asked to type the sentences they heard in each trial, and keyword recognition was quantified for each condition. Overall, performance in the synthesized AV condition fell approximately halfway between the other two conditions, showing a marked improvement over the audio-only control but still falling short of the natural AV condition. Every subject showed some benefit from the synthetic AV stimulus. The results of this study support the idea that a DNN-based model that generates a talking face from speech audio can meaningfully enhance comprehension in noisy environments, and has the potential to be used as a visual hearing aid.

    View details for DOI 10.1177/23312165221136934

    View details for PubMedID 36384325

    View details for PubMedCentralID PMC9677167

  • Abnormal development of hippocampal subfields and amygdalar subnuclei volumes in young adults with heavy cannabis use: A three-year longitudinal study. Progress in neuro-psychopharmacology & biological psychiatry Zhang, X., Chen, Z., Becker, B., Shan, T., Chen, T., Gong, Q. 2024; 136: 111156

    Abstract

    Differences in the volumes of the hippocampus and amygdala have consistently been observed in young adults with heavy cannabis use relative to their non-using counterparts. However, it remains unclear whether the subfields of these functionally and structurally heterogeneous regions exhibit similar patterns of change in young adults with long-term heavy cannabis use disorder (CUD). This study aims to investigate the effects of long-term heavy cannabis use in young adults on the subregional structures of the hippocampus and amygdala, as well as their longitudinal alterations. The study sample comprised 20 young adults with heavy cannabis use and 22 matched non-cannabis-using healthy volunteers. All participants completed the Cannabis Use Disorder Identification Test (CUDIT) and underwent two T1 structural magnetic resonance imaging (MRI) scans, one at baseline and another at follow-up 3 years later. The amygdala, hippocampus, and their subregions were segmented on T1-weighted anatomical MRI scans using a previously validated procedure. At baseline, young adults with heavy CUD exhibited significantly larger volumes in several hippocampal (bilateral presubiculum, subiculum, Cornu Ammonis (CA) regions CA1, CA2-CA3, and right CA4-Dentate Gyrus (DG)) and amygdala (bilateral paralaminar nuclei, right medial nucleus, and right lateral nucleus) subregions compared to healthy controls, but these differences were attenuated at follow-up. Longitudinal analysis revealed an accelerated volumetric decrease in these subregions in young adults with heavy CUD relative to controls. In particular, compared to healthy controls, significantly accelerated volume decreases were observed in the right hippocampal subfields of the parasubiculum, subiculum, and CA4-DG. In the amygdala, similar trends of accelerated volumetric decrease were observed in the left central nucleus, right paralaminar nucleus, right basal nucleus, and right accessory basal nucleus. The current findings suggest that long-term heavy cannabis use impacts the maturational processes of the amygdala and hippocampus, especially in subregions with high concentrations of cannabinoid type 1 receptors (CB1Rs) and involvement in adult neurogenesis.

    View details for DOI 10.1016/j.pnpbp.2024.111156

    View details for PubMedID 39353549

  • Long-term tract-specific white matter microstructural changes after acute stress. Brain imaging and behavior Meng, L., Shan, T., Li, K., Gong, Q. 2021; 15 (4): 1868-1875

    Abstract

    Acute stress has a substantial impact on the white matter microstructure of people exposed to trauma. Its long-term consequences, and how the brain changes after the stress, remain unclear. In this study, we addressed this issue via diffusion tensor imaging (DTI). Twenty-two trauma-exposed individuals who did not meet post-traumatic stress disorder (PTSD) diagnostic criteria were recruited from the most affected area of the Wenchuan earthquake and scanned twice (within twenty-five days and two years after the quake, respectively). Their emotional distress was evaluated with the Self-Rating Anxiety/Depression Scales (SAS/SDS) at both scans. Automatic fiber quantification was used to examine brain microstructure alterations. Correlation analyses were also conducted to investigate relationships between brain microstructure changes and symptom improvement. A group of demographically matched healthy controls (N = 22) from another project were scanned once before the quake using the same imaging protocols as used with the trauma-exposed non-PTSD (TENP) participants. Two years after the earthquake, TENP individuals exhibited significantly reduced fractional anisotropy (FA) in the parietal portion of the left superior longitudinal fasciculus and higher FA in the parietal portion of the left corticospinal tract. Over the follow-up, increased FA of the left uncinate fasciculus and the left corticospinal tract, alongside reductions in SAS and SDS scores, was observed in TENP participants. No significant association was found between brain microstructure changes and symptom improvement. These results indicate that changes in the white matter microstructural integrity of TENP brains parallel symptom improvement over time after acute stress; however, this change appears to be a long-term process in the absence of external intervention.

    View details for DOI 10.1007/s11682-020-00380-w

    View details for PubMedID 32918183

    View details for PubMedCentralID PMC8413208