Invited Talk
in
Affinity Workshop: 4th MusIML workshop at ICML’25
Gasser Elbanna (Harvard): A Model of Continuous Speech Recognition Reproduces Signatures of Human Speech Perception
Humans excel at transforming acoustic waveforms into meaningful linguistic representations, despite the inherent variability of speech signals. However, the mechanisms underlying such robust perception remain unclear. One bottleneck is the absence of models that replicate human performance and could be used to probe mechanistic hypotheses. To address this bottleneck, we developed PARROT, an artificial neural network model of continuous speech perception. PARROT maps acoustic input from a simulated cochlear front-end onto linguistic units. To evaluate human-model alignment, we designed a novel behavioral experiment in which participants transcribed spoken nonwords. This experiment allowed us to compute the first full phoneme confusion matrix in humans, enabling a systematic comparison of human and model phoneme confusions. We found that PARROT exhibited patterns of phoneme confusions, as well as patterns of phoneme accuracy, similar to those of humans. To study the role of contextual cues in human speech perception, we manipulated the model's access to surrounding context. Models with access to both past and future context aligned more closely with human phonemic judgments than models using past or future context alone. This result provides evidence that humans integrate information over a local time window extending into the future to disambiguate speech sounds. Overall, the results suggest that aspects of human-like speech perception emerge from optimizing for sub-word recognition from cochlear representations. Our work is a first step towards building biologically plausible models that explain human speech encoding.
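To make the confusion-matrix comparison concrete, here is a minimal Python sketch, not the authors' code, of how a phoneme confusion matrix could be built from paired target/response phonemes and how two such matrices (human vs. model) might be compared. The function names, the three-phoneme toy inventory, and the use of Pearson correlation as the alignment measure are illustrative assumptions, not details from the talk.

```python
import numpy as np

def confusion_matrix(pairs, phonemes):
    """Rows = presented phoneme, columns = reported phoneme.
    Each row is normalized to a probability distribution."""
    idx = {p: i for i, p in enumerate(phonemes)}
    m = np.zeros((len(phonemes), len(phonemes)))
    for target, response in pairs:
        m[idx[target], idx[response]] += 1
    row_sums = m.sum(axis=1, keepdims=True)
    # Avoid division by zero for phonemes that were never presented.
    return np.divide(m, row_sums, out=np.zeros_like(m), where=row_sums > 0)

def alignment(human_cm, model_cm):
    """One simple way to quantify human-model alignment:
    Pearson correlation between the flattened matrices."""
    return np.corrcoef(human_cm.ravel(), model_cm.ravel())[0, 1]

# Toy usage with a hypothetical three-phoneme inventory.
phonemes = ["p", "b", "t"]
human = confusion_matrix([("p", "p"), ("p", "b"), ("b", "b"), ("t", "t")], phonemes)
model = confusion_matrix([("p", "p"), ("p", "b"), ("b", "p"), ("t", "t")], phonemes)
print(alignment(human, model))
```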
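The context manipulation can likewise be sketched. The abstract does not specify PARROT's architecture, so the following assumes an attention-style model where access to context is controlled by a boolean mask; the three variants correspond to the past-only, future-only, and past-plus-future conditions described above.

```python
import numpy as np

def context_mask(n_frames, past=True, future=True):
    """Boolean mask: mask[i, j] is True if frame i may use frame j."""
    i = np.arange(n_frames)[:, None]
    j = np.arange(n_frames)[None, :]
    mask = (i == j)        # a frame can always see itself
    if past:
        mask |= (j < i)    # access to earlier frames
    if future:
        mask |= (j > i)    # access to later frames
    return mask

# The three context conditions: past-only, future-only, and both.
for past, future in [(True, False), (False, True), (True, True)]:
    print(f"past={past}, future={future}:")
    print(context_mask(4, past, future).astype(int))
```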