The Auditory Cognitive Neuroscience of Speech Perception in Context

  • First Online: 22 February 2022

  • Lori L. Holt
  • Jonathan E. Peelle

Part of the book series: Springer Handbook of Auditory Research (SHAR, volume 74)

Abstract

Speech is undeniably significant as a conspecific human communication signal, and it is also perhaps the most ubiquitous class of acoustic signals encountered by the human auditory system. However, historically there was little integration between speech research and the field of auditory neuroscience. Much of this divide can be traced back to the Motor Theory of speech perception, which framed speech not as an auditory process but as one grounded in motor gestures. Recent decades have seen a marked shift in perspective, with mutual interest from researchers in understanding both how neuroscientific principles can be used to study speech perception and, conversely, how speech as a complex acoustic stimulus can advance auditory neuroscience. This introductory chapter reviews this historical context for the modern field of auditory cognitive neuroscience before placing the remaining chapters of the book in context. A number of important themes emerge: methodological improvements, particularly in human brain imaging; the ability to study more natural speech (stories and conversations, rather than isolated stimuli); an appreciation for ways in which different listeners (e.g., of different ages or hearing levels) perceive speech; and incorporation of regions outside traditional auditory and language networks into our neuroanatomical frameworks for speech perception. Evolving techniques, theories, and approaches have provided unprecedented progress in understanding speech perception. These opportunities challenge researchers to ask new questions and to fully integrate speech perception into auditory neuroscience.

References

Cooper FS, Liberman AM, Borst JM (1951) The interconversion of audible and visible patterns as a basis for research in the perception of speech. Proc Natl Acad Sci U S A 37:318–325

Delattre PC, Liberman AM, Cooper FS (1955) Acoustic loci and transitional cues for consonants. J Acoust Soc Am 27:769–773

Diehl RL, Lotto AJ, Holt LL (2004) Speech perception. Annu Rev Psychol 55:149–179

Elman JL, McClelland JL (1988) Cognitive penetration of the mechanisms of perception: compensation for coarticulation of lexically restored phonemes. J Mem Lang 27:143–165

Evans S, McGettigan C (2017) Comprehending auditory speech: previous and potential contributions of functional MRI. Lang Cogn Neurosci 32:829–846

Fowler CA (2001) Obituary: Alvin M. Liberman (1917-2000). Am Psychol 56:1164–1165

Galantucci B, Fowler CA, Turvey MT (2006) The motor theory of speech perception reviewed. Psychon Bull Rev 13:361–377

Greenberg S, Ainsworth WA (2004) Speech processing in the auditory system: an overview. Springer, New York

Harnad S (1987) Categorical perception: The groundwork of cognition. Cambridge University Press, Cambridge

Heald S, Nusbaum HC (2014) Speech perception as an active cognitive process. Front Syst Neurosci 8:35

Hickok G (2009) Eight problems for the mirror neuron theory of action understanding in monkeys and humans. J Cogn Neurosci 21:1229–1243

Holt LL (2005) Temporally nonadjacent nonlinguistic sounds affect speech categorization. Psychol Sci 16:305–312

Holt LL, Lotto AJ (2010) Speech perception as categorization. Atten Percept Psychophys 72:1218–1227

Klemmer ET, Snyder FW (1972) Measurement of time spent communicating. J Commun 22:142–158

Kluender KR, Diehl RL, Killeen PR (1987) Japanese quail can learn phonetic categories. Science 237:1195–1197

Kraljic T, Samuel AG, Brennan SE (2008) First impressions and last resorts: how listeners adjust to speaker variability. Psychol Sci 19:332–338

Kraus MJ, Torrez B, Park JW, Ghayebi F (2019) Evidence for the reproduction of social class in brief speech. Proc Natl Acad Sci USA 116:22998–23003

Kronrod Y, Coppess E, Feldman NH (2016) A unified account of categorical effects in phonetic perception. Psychon Bull Rev 23:1681–1712

Kuhl PK, Miller JD (1978) Speech perception by the chinchilla: identification function for synthetic VOT stimuli. J Acoust Soc Am 63:905–917

Leonard MK, Chang EF (2016) Direct cortical neurophysiology of speech perception. In: Hickok G, Small SL (eds) Neurobiology of language. Academic Press, London, pp 479–489

Liberman AM (1957) Some results of research on speech perception. J Acoust Soc Am 29:117–123

Liberman AM, Mattingly IG (1985) The motor theory of speech perception revised. Cognition 21:1–36

Liberman AM, Harris KS, Hoffman HS, Griffith BC (1957) The discrimination of speech sounds within and across phoneme boundaries. J Exp Psychol 54:358–368

Liberman AM, Cooper FS, Shankweiler DP, Studdert-Kennedy M (1967) Perception of the speech code. Psychol Rev 74:431–461

Lotto AJ, Kluender KR (1998) General contrast effects in speech perception: effect of preceding liquid on stop consonant identification. Percept Psychophys 60:602–619

Lotto AJ, Kluender KR, Holt LL (1997) Perceptual compensation for coarticulation by Japanese quail (Coturnix coturnix japonica). J Acoust Soc Am 102:1135–1140

McClelland JL, Elman JL (1986) The TRACE model of speech perception. Cogn Psychol 18:1–86

Mehl MR, Vazire S, Ramírez-Esparza N et al (2007) Are women really more talkative than men? Science 317:82

Mirman D, Holt LL, McClelland JL (2004) Categorization and discrimination of nonspeech sounds: differences between steady-state and rapidly-changing acoustic cues. J Acoust Soc Am 116:1198–1207

Mirman D, McClelland JL, Holt LL, Magnuson JS (2008) Effects of attention on the strength of lexical influences on speech perception: behavioral experiments and computational mechanisms. Cogn Sci 32:398–417

Moineau S, Dronkers NF, Bates E (2005) Exploring the processing continuum of single-word comprehension in aphasia. J Speech Lang Hear Res 48:884–896

Norris D (1999) The merge model: speech perception is bottom-up. J Acoust Soc Am 106:2295–2295

Norris D, McQueen JM, Cutler A (2003) Perceptual learning in speech. Cogn Psychol 47:204–238

Palmer A, Shamma S (2004) Physiological representations of speech. In: Greenberg S, Ainsworth WA (eds) Speech processing in the auditory system: an overview. Springer, New York

Peelle JE (2017) Optical neuroimaging of spoken language. Lang Cogn Neurosci 32:847–854

Peelle JE (2018) Listening effort: how the cognitive consequences of acoustic challenge are reflected in brain and behavior. Ear Hear 39:204–214

Peterson GE, Barney HL (1952) Control methods used in a study of the vowels. J Acoust Soc Am 24:175–184

Pichora-Fuller MK, Kramer SE, Eckert MA, Edwards B, Hornsby BW, Humes LE, Lemke U, Lunner T, Matthen M, Mackersie CL, Naylor G, Phillips NA, Richter M, Rudner M, Sommers MS, Tremblay KL, Wingfield A (2016) Hearing impairment and cognitive energy: the framework for understanding effortful listening (FUEL). Ear Hear 37:5S–27S

Quam RM, Ramsier MA, Fay RR, Popper AN (2017) Primate hearing and communication. Springer, Cham

Samuel AG (2011) Speech perception. Annu Rev Psychol 62:49–72

Samuel AG (2020) Psycholinguists should resist the allure of linguistic units as perceptual units. J Mem Lang 111:104070

Wöstmann M, Fiedler L, Obleser J (2017) Tracking the signal, cracking the code: speech and speech comprehension in non-invasive human electrophysiology. Lang Cogn Neurosci 32:855–869

Acknowledgments

This work was supported in part by grants R01 DC014281, R21 DC016086, R21 DC015884, and R56 AG059265 from the US National Institutes of Health to JEP, as well as by grants R01 DC017734, R03 HD099382, and R21 DC019217 from the US National Institutes of Health and BCS 1950054 and BCS 1655126 from the US National Science Foundation to LLH.

Compliance with Ethics Requirements

Lori L. Holt declares that she has no conflict of interest.

Jonathan E. Peelle declares that he has no conflict of interest.

Author information

Authors and Affiliations

Department of Psychology, Carnegie Mellon University, Pittsburgh, PA, USA

Lori L. Holt

Department of Otolaryngology, Washington University in St. Louis, St. Louis, MO, USA

Jonathan E. Peelle

Corresponding author

Correspondence to Lori L. Holt.

Editor information

Editors and Affiliations

Department of Psychology, Carnegie Mellon University, Pittsburgh, PA, USA

Lori L. Holt

Department of Otolaryngology, Washington University in St. Louis, St. Louis, MO, USA

Jonathan E. Peelle

Integrative Physiology and Neuroscience, Washington State University, Vancouver, WA, USA

Allison B. Coffin

Department of Biology, University of Maryland, Silver Spring, MD, USA

Arthur N. Popper

Department of Psychology, Loyola University Chicago, Chicago, IL, USA

Richard R. Fay

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Holt, L.L., Peelle, J.E. (2022). The Auditory Cognitive Neuroscience of Speech Perception in Context. In: Holt, L.L., Peelle, J.E., Coffin, A.B., Popper, A.N., Fay, R.R. (eds) Speech Perception. Springer Handbook of Auditory Research, vol 74. Springer, Cham. https://doi.org/10.1007/978-3-030-81542-4_1

DOI: https://doi.org/10.1007/978-3-030-81542-4_1

Published: 22 February 2022

Publisher Name: Springer, Cham

Print ISBN: 978-3-030-81541-7

Online ISBN: 978-3-030-81542-4

eBook Packages: Biomedical and Life Sciences, Biomedical and Life Sciences (R0)

Share this chapter

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Publish with us

Policies and ethics

  • Find a journal
  • Track your research
  • Subject List
  • Take a Tour
  • For Authors
  • Subscriber Services
  • Publications
  • African American Studies
  • African Studies
  • American Literature
  • Anthropology
  • Architecture Planning and Preservation
  • Art History
  • Atlantic History
  • Biblical Studies
  • British and Irish Literature
  • Childhood Studies
  • Chinese Studies
  • Cinema and Media Studies
  • Communication
  • Criminology
  • Environmental Science
  • Evolutionary Biology
  • International Law
  • International Relations
  • Islamic Studies
  • Jewish Studies
  • Latin American Studies
  • Latino Studies

Linguistics

  • Literary and Critical Theory
  • Medieval Studies
  • Military History
  • Political Science
  • Public Health
  • Renaissance and Reformation
  • Social Work
  • Urban Studies
  • Victorian Literature
  • Browse All Subjects

How to Subscribe

  • Free Trials

In This Article: Speech Perception

  • Introduction
  • General Overviews
  • Edited Collections
  • Acoustic Variation
  • Theoretical Approaches
  • Categorical Perception and Category Structure
  • Perceptual Integration of Phonetic Cues
  • Lexical Influences
  • Vowel Perception
  • Experiential Influences on Adult Listeners
  • Developmental Speech Perception
  • Sound Change
  • Phonological Similarity and Contrast
  • Perception-Sociolinguistics Relation

Speech Perception
By Patrice Speeter Beddor
Last reviewed: 19 March 2013. Last modified: 19 March 2013. DOI: 10.1093/obo/9780199772810-0089

Speech perception as an experimental discipline has a roughly sixty-year history. In a very broad sense, much of the research in this field investigates how listeners map the input acoustic signal onto phonological units. Determining the nature of the mapping is an intriguing issue because the acoustic signal is highly variable, yet perception remains remarkably constant (and accurate) across many types of variation. Consequently, an overarching goal that unifies and motivates much of the work is to account for perceptual constancy, that is, to understand the perceptual mechanisms by which listeners arrive at stable percepts despite acoustic variation. Some theoretical approaches to speech perception postulate that invariant properties in the input signal underlie perceptual constancy, thereby defining a research program aimed at identifying the nature of the invariants. Other approaches do not assume invariants but either require principles that account for the necessarily more complex mapping between signal and phonological representation, or require more complex representations. As a result, theoretical approaches differ as well in their assumptions concerning the relevant phonological units (features, gestures, segments, syllables, words) and the structure of these units (e.g., abstract representations, categories consisting of traces of acoustic episodes). Within this overarching agenda, researchers also address many more specific questions. Is speech perception different from other types of auditory processing? How do listeners integrate multiple sources of information into a coherent percept? What initial perceptual capabilities do infants have? How does perception change with linguistic experience? What is the nature of perceptual influences on phonological structures? How do social categories and phonetic categories interact in perception? This bibliography is selective in several respects. “Speech perception” has traditionally referred to perception of phonetic and phonological information, distinct from recognition of spoken words. The division between these two perspectives on the listener’s task has long been a questionable one, and is in many respects an artificial one that does not reflect important current research questions and methods. Although ideally a bibliography would bridge these two approaches, the focus here is almost exclusively on speech perception. Moreover, within this focus, particular emphasis has been given to perceptual issues that are at the interface with other subdisciplines of linguistics—in particular, phonology, historical linguistics, and sociolinguistics. Another area, in addition to word recognition, that is underrepresented in this bibliography is perception of prosodic properties, although some of the edited collections cited here include reviews of both of these areas.

Several excellent overview articles by major figures in the field of speech perception have appeared in the past decade. Although all approach the main issues in the field from a perspective intended to be accessible to nonspecialists, they will likely prove challenging for undergraduates with little background in phonetics or psychology. Diehl, et al. 2004 focuses exclusively on speech perception. Cleary and Pisoni 2001, Jusczyk and Luce 2002, and Samuel 2011 consider issues in word recognition as well. Fowler 2003 summarizes and assesses both the speech perception and production literatures.

Cleary, M., and D. B. Pisoni. 2001. Speech perception and spoken word recognition: Research and theory. In Blackwell handbook of sensation and perception. Edited by E. B. Goldstein, 499–534. Malden, MA: Blackwell.

Comprehensive review of major issues and findings in speech perception; offers more condensed coverage of theoretical approaches and of spoken word recognition.

Diehl, R. L., A. J. Lotto, and L. L. Holt. 2004. Speech perception. Annual Review of Psychology 55:149–179.

DOI: 10.1146/annurev.psych.55.090902.142028

Detailed presentation of three theoretical approaches: motor theory, direct realism, and general auditory and learning approaches. Provides critical assessment of the strengths and weaknesses of these approaches in light of selected classic perceptual phenomena. Available online for purchase or by subscription.

Fowler, C. A. 2003. Speech production and perception. In Handbook of psychology. Vol. 4, Experimental psychology. Edited by A. F. Healy, R. W. Proctor, and I. B. Weiner, 237–266. Hoboken, NJ: Wiley.

Presents key arguments and findings for acoustic (auditory) and gestural theories of perception; also assesses the literature on the influences of experience and learning on perception. Linguists may especially appreciate that the review frames issues of perception and production within the context of the relation between phonetic and phonological forms.

Jusczyk, P. W., and P. A. Luce. 2002. Speech perception and spoken word recognition: Past and present. Ear and Hearing 23:2–40.

DOI: 10.1097/00003446-200202000-00002

Overview of major issues and findings, with particular attention to developmental speech perception. Theoretically, gives greater consideration to models of spoken word recognition than to theories of speech perception. An especially helpful aspect of this review is its focus on the historical context in which the major issues emerged. Available online for purchase or by subscription.

Samuel, A. G. 2011. Speech perception. Annual Review of Psychology 62:49–72.

DOI: 10.1146/annurev.psych.121208.131643

The most recent survey of the field. Pulls together issues, theories, and findings in speech perception and spoken word recognition, including work on statistical and perceptual learning of speech. Available online for purchase or by subscription.


The Oxford Handbook of Philosophy of Perception

25 Speech Perception

Casey O'Callaghan is professor of philosophy at Washington University in St. Louis. He is the author of Sounds (Oxford 2007), Beyond Vision (Oxford 2017), and A Multisensory Philosophy of Perception (Oxford 2019).

Published: 13 January 2014

Is speech special? This chapter evaluates the evidence that speech perception is distinctive when compared with non-linguistic auditory perception. It addresses the phenomenology, contents, objects, and mechanisms involved in the perception of spoken language. According to the account it proposes, the capacity to perceive speech in a manner that enables understanding is an acquired perceptual skill. It involves learning to hear language-specific types of ethologically significant sounds. According to this account, the contents of perceptual experience when listening to familiar speech are of a variety that is distinctive to hearing spoken utterances. However, perceiving speech involves neither novel perceptual objects nor a unique perceptual modality. Much of what makes speech special stems from our interest in it.

Philosophers have devoted tremendous effort to explicating what it takes to understand language. The answers focus on things such as possessing concepts, mastering grammar, and grasping meanings and truth conditions. The answers thereby focus on extra-perceptual cognition. Understanding spoken language, however, also involves perception—grasping a spoken utterance requires hearing or seeing it. Perception's role in understanding spoken language has received far less philosophical attention. According to a simple view, understanding speech is just a matter of assigning meaning to the sounds you hear or to the gestures you see. If so, what perception contributes to understanding spoken language is not distinctive to the case of spoken utterances. Against this, however, is the prospect that speech is special. In this chapter, I present and evaluate the evidence that speech perception differs from non-linguistic auditory perception. In particular, I discuss the phenomenology, contents, objects, and mechanisms of speech perception. I make proposals about the ways in which speech is and is not perceptually special. According to the account I offer, the capacity to perceive speech in a manner that enables understanding is an acquired perceptual skill. It involves learning to hear language-specific types of ethologically significant sounds. According to this account, while the contents of perceptual experience when listening to familiar speech are of a variety that is distinctive to hearing spoken utterances, perceiving speech involves neither novel perceptual objects nor a unique perceptual modality. Much of what makes speech special stems from our fierce interest in it.

1 Is Speech Perceptually Special?

There is a thriving debate about whether the human capacity to use and understand language is special (see, e.g., Hauser et al., 2002; Pinker and Jackendoff, 2005). A key part of this wider debate is whether the capacity to speak and understand speech is special (see, e.g., Liberman, 1996; Trout, 2001; Mole, 2009). My concern here is with speech perception. Is the human capacity to perceive spoken language special?

To be special requires a difference. However, the debate about whether speech is special is not just about whether speech perception in some respect differs from other forms of perception. It concerns whether speech perception should be distinguished as a distinctive or a unique perceptual capacity. Put in this way, the question relies on a comparison. The most common contrast is with general audition. The question thus is whether speech perception differs or is a distinct perceptual capacity when compared with non-linguistic auditory perception. A separate contrast is with the capacities of non-human animals. Is speech perception uniquely human? The contrast between human and non-human responses to spoken language is frequently used to illuminate the contrast between human speech perception and non-linguistic audition.

A difference is a difference in some respect, and being distinctive or unique is being distinctive or unique in some way, for some reason. In what respects is speech special? It is helpful to divide the candidates into four broad classes.

The first concerns the phenomenology of speech perception. Does what it is like to perceptually experience spoken utterances contrast with what it is like to perceptually experience non-linguistic sounds and events? One way to make progress on this question is to ask whether the perceptual experience of hearing speech in a language you know differs phenomenologically from that of hearing speech in an unfamiliar language.

The second concerns the contents of speech perception. Does the perceptual experience of speech involve contents absent from non-linguistic auditory experience? Does understanding a language affect which properties perceptual experiences represent spoken utterances to have?

The third concerns the objects of speech perception. Are the objects of speech perception distinct from the objects of non-linguistic audition? Does speech perception share objects with non-linguistic audition?

The fourth concerns the mechanisms of speech perception. Does perceiving speech involve perceptual processes that differ from those involved in perceiving non-linguistic sounds and events? Does speech perception involve a special perceptual module? Is speech perception the work of a distinct perceptual modality?

Answering the question, ‘Is speech special?’ thus means addressing a number of different questions. This essay focuses on the contrast between speech perception and human non-linguistic auditory perception. I distinguish the various respects in which speech might be special when compared with non-linguistic audition. I assess the evidence and advance proposals about the respects in which speech perception is special.

2 Phenomenology

Is perceiving speech phenomenologically special? Is what it’s like, for the subject, to perceptually experience speech different, distinctive, or unique when compared with non-linguistic audition?

It is natural to think that the perceptual experience of listening to spoken language differs phenomenologically from the perceptual experience of listening to non-linguistic sounds, simply because speech sounds and non-linguistic sounds differ acoustically. Hearing the sound of a drop of water differs phenomenologically from hearing the sound of the spoken word ‘drop’ because the sounds differ in their basic audible qualities.

However, the perceptual experience of spoken language may also involve distinctive phenomenological features that are absent from non-linguistic auditory experience. Start with the experiential contrast between listening to non-linguistic sounds and listening to spoken language. Begin with the case of a language you know. The experience of listening to speech in a language you know differs noticeably from the experience of listening to ordinary, non-linguistic environmental sounds, even once we eliminate acoustical differences. The phenomenological shifts associated with sinewave speech support this claim. Sinewave speech is an artificial signal in which an acoustically complex human voice is replaced by several sinewaves that vary in frequency and amplitude with the primary formants of the original speech signal, while removing acoustical energy at other frequencies (Remez et al., 1981). At first, it is difficult to recognize the sounds of sinewave speech as speech sounds. Instead, they just sound like computer-generated noises. However, after hearing the original human speech from which the sinewave speech is derived, it is easy to hear what the sinewave speech says. The same stimulus is first experienced as non-speech sounds, and then it is experienced as speech. And this change is accompanied by a dramatic phenomenological shift.
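
To make the construction concrete, here is a minimal sketch of sinewave-speech synthesis in the spirit of Remez et al. (1981), written in Python with NumPy. The function name, the 10-ms frame hop, and the two-formant glide at the end are illustrative assumptions rather than a reconstruction of the original stimuli; in practice the formant tracks would be measured from a natural utterance (e.g., by linear-predictive analysis).

    import numpy as np

    def synthesize_sinewave_speech(formant_freqs, formant_amps, sr=16000, hop=160):
        """Replace speech with one sinusoid per formant track.

        formant_freqs: (n_formants, n_frames) array of frequencies in Hz,
                       one value per 10-ms analysis frame (hop = 160 at 16 kHz)
        formant_amps:  (n_formants, n_frames) array of linear amplitudes
        """
        n_formants, n_frames = formant_freqs.shape
        t = np.arange(n_frames * hop)
        frame_times = np.arange(n_frames) * hop
        out = np.zeros(n_frames * hop)
        for k in range(n_formants):
            # Upsample the frame-rate tracks to the audio sample rate
            f = np.interp(t, frame_times, formant_freqs[k])
            a = np.interp(t, frame_times, formant_amps[k])
            # Integrate instantaneous frequency to get a smooth phase track,
            # so each sinewave glides with its formant instead of clicking
            phase = 2.0 * np.pi * np.cumsum(f) / sr
            out += a * np.sin(phase)
        peak = np.max(np.abs(out))
        return out / peak if peak > 0 else out

    # Hypothetical half-second, two-formant glide: F1 rises while F2 stays
    # nearly steady. All other spectral energy of a natural voice is simply
    # absent, as in sinewave speech.
    frames = 50
    freqs = np.stack([np.linspace(400.0, 700.0, frames),
                      np.linspace(1100.0, 1200.0, frames)])
    amps = np.ones_like(freqs)
    signal = synthesize_sinewave_speech(freqs, amps)

Because only a few frequency-modulated sinusoids remain, such a signal preserves the time-varying formant structure that carries phonetic information while discarding the rich harmonic spectrum that ordinarily makes a voice sound like a voice, which is why naive listeners first hear it as computer-generated noise.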

In the case just described, you come to comprehend the speech. Thus, understanding might suffice to explain the phenomenological difference when you are listening to speech in a language you know. You grasp meanings, so the experiential difference could in principle be explained in terms of cognitive, rather than perceptual, phenomenology. (This explanation is unavailable if you reject that extra-perceptual cognition has proprietary phenomenology.)

To control for any contribution from understanding, consider the experiential contrast between listening to non-speech sounds and listening to speech in a language you do not know. Is there any phenomenological difference? It is possible reliably to discriminate speech in a language you do not understand from ordinary environmental sounds. Neonates prefer speech sounds to non-speech sounds, though they do not understand language. In addition, sinewave speech in a language you do not know may appear first as non-speech sounds and then as speech. Thus, we have evidence that perceptually experiencing a stimulus as speech rather than as non-speech sounds makes a phenomenological difference that does not depend on understanding.

Understanding spoken utterances need not, however, contribute exclusively to the phenomenology of extra-perceptual cognition. Knowing a language may also impact the phenomenal character of perceptual experience. Consider the phenomenological contrast between the perceptual experience of listening to speech in a language you know and of listening to speech in an unfamiliar language. Of course, languages differ acoustically in ways that affect how they sound. For instance, whether or not you know Hindi, it sounds different from German. To control for acoustical differences that affect phenomenology, fix the language. Contrast the experience of a person who knows the language with that of a person who does not know the language when faced with the same spoken utterance. Or, consider a person’s experience prior to and after learning the language. Many philosophers agree that knowing the language affects the phenomenological character of perceptual experience, even while they disagree about the diagnosis (see O’Callaghan, 2011: 784–787).

What is the source of the difference? Speech in a language you know differs perceptually in several respects. Most obviously, your perceptual experience of its temporal characteristics differs. When you know the language, audible speech does not seem like an unbroken stream of sounds. It seems instead to include discernible gaps, pauses, and other boundaries between words, clauses, and sentences, and you are able perceptually to resolve qualitative features and contrasts at a much finer temporal grain. Familiar speech also appears in other respects to differ qualitatively from unfamiliar speech. For instance, when you have mastered a spoken language, you are able to detect subtle qualitative features and their contrasts, such as the difference between ‘s’ and ‘z’, or the dropped ‘g’ or ‘t’ of certain accents. The stimulus sounds different and more detailed when you recognize it as speech and you know the language.

The argument of the last paragraph, unlike the argument from sinewave speech, requires comparing phenomenology across subjects or across long stretches of time. Thus, it is more controversial. An alternative way to establish the point is to compare the shift that occurs with sinewave speech in a language you know with the shift that occurs with sinewave speech in a language you do not know. In each case, recognizing the sounds as speech leads to a shift in phenomenal character. The change, however, is far more dramatic when you know the language. The difference between the two phenomenological contrasts is the difference that accrues thanks to knowing the language.

These arguments indicate that one’s perceptual experiences may differ phenomenologically when listening to speech in a known language, when listening to speech in an unfamiliar language, and when listening to non-speech sounds. Moreover, such phenomenological differences can be evoked even when we have controlled for acoustical differences. This supports the following two claims: knowing a language impacts the phenomenal character of perceptual experience when listening to spoken utterances; and, speech perception has phenomenal features that are distinctive when compared with non-linguistic audition.

3 Content

Content concerns how things are represented to be. Content thus concerns the things that are perceptually experienced and the features they are perceptually experienced to have. One way to characterize the contents of perceptual experiences appeals to their accuracy or veridicality conditions. Some prefer to speak of what a given perceptual experience purports about the world, or of how things perceptually seem or appear. Some philosophers hold that perceptual experiences differ phenomenologically only if they differ in how they represent things as being. Some also hold that there is a variety of content such that perceptual experiences differ in content only if they differ phenomenologically. In either case, a difference in content may help to explain the sorts of phenomenological differences mentioned in Section 2. What we perceive when we perceive speech may, in this sense, differ from what we perceive when we perceive non-speech sounds. Speech perception may involve contents that are special or distinctive when compared with non-linguistic audition.

In what respects does the content of speech perception differ from that of non-linguistic audition? The characteristic sounds of human vocalization differ acoustically from the sounds of non-linguistic happenings such as blowing leaves, backfiring automobiles, and violins. The perceptual experience of speech reflects this. Such evident qualitative differences, which are underpinned by acoustical differences, are part of why sinewave speech at first sounds like meaningless computer noise, and why artificial speech often sounds inhuman. Perhaps, then, the perceptual experience of speech differs phenomenologically from the perceptual experience of non-linguistic sounds and happenings because its perceptually apparent features differ in a way that is recognizable and distinctive to spoken language.

This is compatible with an austere view of the types of features that one perceptually experiences when listening to speech or to non-speech sounds. The phenomenological difference between perceptually experiencing speech and non-speech may just stem from a difference in the patterns of low-level properties that the perceptual experiences represent. For instance, it may just stem from a difference in the apparent pattern of pitch, timbre, and loudness of a sound stream over time. Any further experiential differences may result from extra-perceptual cognition, such as thought or imagination.

This austere picture also suggests an attractive account of how perceptually experiencing speech in an unfamiliar language differs phenomenologically from perceptually experiencing speech in a language you know. As discussed in Section 2, the audibly apparent temporal and qualitative features of spoken utterances in a language you know generally differ from those of speech in a language that is unfamiliar to you. An unfamiliar language may sound like a continuous stream of indistinct babble, but familiar speech perceptually appears to be chunked into units that correspond to words and phrases and to include discernible gaps, pauses, and boundaries that distinguish such units from each other. Hearing familiar language also involves the capacity to perceptually experience sublexical features at a finer temporal grain, and to discern linguistically significant qualitative details and contrasts that you could not make out before. Conversely, it also involves failing to discern other qualitative contrasts that are linguistically irrelevant. Thus, in these ways, differences in the perceptually apparent pattern of individual sounds and low-level audible qualities such as pitch, timbre, and loudness over time may explain the phenomenological difference that knowing a language makes.

Nevertheless, such an austere account might not suffice. Some philosophers have claimed that grasping meanings or semantic properties contributes in a constitutive rather than merely causal manner to the phenomenal character of perceptual experience. They argue therefore that listening to spoken utterances when you know the language involves perceptually experiencing meanings or semantic properties (e.g., McDowell, 1998; Siegel, 2006; Bayne, 2009). According to such an account, perceptual experiences may represent or involve awareness not just as of low-level sensible features, such as pitch, timbre, loudness, and timing, but also as of high-level features, including semantic properties. Such an account supports a liberal view about what types of properties may be represented by episodes of perceptual experience (see, e.g., Siegel, 2006; Bayne, 2009).

The liberal view of speech perception’s contents faces an objection if it also must explain the phenomenological difference between the perceptual experience of listening to speech in a familiar language and of listening to speech in an unfamiliar language. The account requires that, for an utterance you understand, there is something distinctive it is like for you to perceptually experience its specific meaning. That is because nothing suggests you could not hear even foreign utterances as meaningful, if hearing meaningfulness does not require hearing specific meanings. Hearing meaningfulness, if not specific meanings, could, for instance, help to explain the phenomenological difference between hearing speech in an unfamiliar language and hearing non-linguistic sounds. Only perceptually experiencing specific meanings, then, could account for the difference between hearing familiar and unfamiliar speech. Suppose, therefore, that you perceptually experience specific meanings, rather than mere meaningfulness. Differences in apparent meaning then should impact the phenomenal character of perceptual experience for utterances in a known language. But consider homophonic utterances, which share pronunciation but not meaning. Homophonic utterances do not apparently cause perceptual experiences that differ in phenomenal character. For instance, even when they are embedded appropriately in meaningful linguistic contexts, the perceptual experience of hearing an utterance of ‘to’ does not clearly differ in phenomenal character from the perceptual experience of hearing an utterance of ‘too’ or ‘two’ (the same holds for homographic homophones). Complete sentences present a similar problem. Utterances of structurally ambiguous statements, such as, ‘Visiting relatives can be boring’, and those with scope ambiguities, such as, ‘Everyone chose someone’, may not, under their differing interpretations, lead to perceptual experiences that differ phenomenologically. The argument from homophones thus casts doubt on the claim that specific meanings make a distinctive difference to the phenomenal character of perceptual experience (O’Callaghan, 2011).

A moderate account denies that the perceptual experience of speech includes awareness as of meanings or high-level semantic properties. It nevertheless explains the phenomenological difference that accrues thanks to knowing a language using resources beyond the austere account’s low-level acoustical features. According to one such account, listening to speech in a familiar language involves the perceptual experience of language-specific but non-semantic properties of spoken utterances.

Phonological features, such as phones and phonemes, form the basis for recognizing and distinguishing spoken words. Phonological features in general are respects of discernible non-semantic similarity and difference among utterances that may make a semantic difference. Phonological features are like the basic perceptible vocabulary or ‘building blocks’ of spoken language. 1 To illustrate, consider utterances of ‘bad’, ‘imbue’, and ‘glob’. In one respect, these utterances are perceptibly similar. Each apparently shares with the others the ‘b’ sound—[b] in phonological notation. Next consider utterances of ‘lab’ and ‘lash’. They perceptibly match, except that the former contains the ‘b’ sound and the latter contains the ‘sh’ sound—[ʃ] in phonological notation. The phones [b] and [ʃ] are examples of features which may be shared among distinct spoken utterances, which may differ among otherwise indistinguishable utterances, and which may make a semantic difference. A phone thus is usefully understood as a type whose members make a common linguistic contribution to any given language; one phone is distinguished from another by some perceptually discernible difference that is, or may be, exploited by some spoken language to signal a semantically significant difference. Since phones are the minimal perceptible features that make a linguistic difference in some world language, they are in this sense the perceptible ‘building blocks’ of spoken language.

Specific spoken languages do not all make use of this basic stock of building blocks in the same manner. Some spoken languages, for instance, include clicks and buzzes, while others do not. Moreover, spoken languages may, even when they make use of the same basic stock, differ in which classes of utterances they treat as linguistically equivalent and in which classes of utterances they treat as distinct. For example, spoken English distinguishes [l] from [r], 2 but Japanese does not. Thus, the phones [l] and [r] correspond to distinct English phonemes, /l/ and /r/, but are allophones or linguistically equivalent variations of a single Japanese phoneme. Another example is that [p] and [pʰ] are allophones of the English phoneme, /p/, but Mandarin Chinese treats them as distinct phonemes, /p/ and /pʰ/. The difference between [p] and [pʰ] suffices to distinguish Chinese but not English words. So, some languages treat [p] and [pʰ] as allophones of a single phoneme, while others treat them as distinct phonemes that may suffice for a semantic difference.

Phonemes thus may usefully be understood in terms of language-specific classes whose members are treated as linguistically equivalent, or as allophonic , within the context of that spoken language, even if under certain conditions its members may be perceptually distinguishable. A language’s phonemes are distinguished from one another by perceptually discernible differences that are semantically significant. The lesson is that certain utterance pairs are treated as linguistically equivalent by some languages but as linguistically distinct by others. Thus, spoken languages yield differing families of equivalence classes of utterances that make a common semantic contribution. So, the way in which a basic stock of speech sounds, which have the potential to signal semantic difference, in fact is utilized by a particular language is specific to that language. A language’s stock of linguistically significant sound types is distinctive to that language.
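
The equivalence-class structure just described can be pictured with a toy data model. In the sketch below, the phone inventories and mappings are deliberately tiny, illustrative simplifications (real phonologies are far richer, and the key "ph" stands in for aspirated [pʰ]); the point is only the structure: the same pair of phones can fall under one phoneme in one language and under two phonemes in another.

```python
# Toy model: a language's phonemes as equivalence classes over phones.
# Inventories and mappings are illustrative simplifications only.
PHONEME_OF = {
    "English":  {"l": "/l/", "r": "/r/", "p": "/p/", "ph": "/p/"},
    "Japanese": {"l": "/r/", "r": "/r/"},    # [l] and [r] are allophones
    "Mandarin": {"p": "/p/", "ph": "/ph/"},  # aspiration is contrastive
}

def same_phoneme(language, phone_a, phone_b):
    """True if the language treats two phones as linguistically equivalent."""
    inventory = PHONEME_OF[language]
    return inventory[phone_a] == inventory[phone_b]

assert same_phoneme("English", "p", "ph")       # allophones of English /p/
assert not same_phoneme("Mandarin", "p", "ph")  # distinct Mandarin phonemes
assert same_phoneme("Japanese", "l", "r")       # one Japanese phoneme
assert not same_phoneme("English", "l", "r")    # distinct English phonemes
```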

Since phonemes differ across languages, discerning a language’s phonemes requires substantial exposure and learning. That such features may be perceptually experienced nonetheless helps to explain patterns of similarity and difference among utterances that are apparent to users of a given language. The capacity perceptually to discern such similarities and differences is critical to understanding spoken language. It is not, however, explained by the perceptual experience of low-level audible attributes alone.

What is noteworthy is that users of a given language commonly treat certain crucial pairs of sounds or utterances as perceptibly equivalent, while those who do not know that language treat them as perceptibly distinct. For example, auditory perceptual discrimination tasks in linguistic contexts reveal that the sounds corresponding to ‘t’ in utterances of ‘ton’ and ‘stun’ auditorily appear virtually the same to fluent monolingual English users, but appear noticeably to differ to fluent monolingual users of Chinese. Spoken utterances of ‘bed’ and ‘bad’ in linguistic contexts differ audibly to English speakers but not to Dutch speakers. Speakers of one language may discern a common linguistic sound across utterances that differ acoustically while speakers of another language do not. So, suppose we have two groups of language users. Suppose all are attentively listening, and that each is presented with two sounds uttered by the same talker in a linguistic context. Those in the first group do not notice a difference between the speech sounds. They judge that they are audibly equivalent, and they behave as if the sounds are equivalent. Those in the other group do notice a difference between the speech sounds. They judge that they audibly differ, and they behave as if the sounds are not audibly equivalent. In this case, for at least one of the speech sounds, it is plausible to say that the perceptual experience of a language listener from the first group differs phenomenologically from the perceptual experience of a listener from the second group. If so, then for a large class of linguistic sounds, the perceptual experience of someone who knows a given language may differ from the perceptual experience of someone who does not. If only those who know a spoken language perceptually experience its language-specific phonological attributes, such as its phonemes, then this provides an attractive explanation for the difference. For instance, having a perceptual experience that represents the English phoneme /l/, rather than /r/, may explain why hearing an utterance of ‘law’ differs phenomenally from hearing an utterance of ‘raw’. Having perceptual experiences as of a single English phoneme explains a monolingual English speaker’s failure to perceptually distinguish utterances of distinct Chinese words. A central part of the phenomenological difference that accrues thanks to knowing a language thus stems from the perceptual experience of attributes whose linguistic significance is specific to that language.

The perceptual experience of language-specific features explains apparent patterns of similarity and difference that to a noteworthy degree are independent from lower-level audible attributes, such as pitch, timbre, and loudness over time. For instance, the low-level audible qualities of an utterance of /p/ vary across phonological contexts, speakers, moods, and social contexts. The perceptual experience of a single phoneme explains this kind of perceptually apparent sameness in the face of differing lower-level audible qualities. On the other hand, the same acoustical signal may appear as a /p/ in some contexts and as a /b/ or /k/ in others. In different contexts, distinct apparent phonemes may accompany matching low-level audible qualities.

A moderate account of this sort finds converging support from three sources of evidence. First, developmental evidence shows that young infants discern a wide variety of phonetic differences that are linguistically significant in various languages. However, between five and twelve months, infants cease to discern phonetic differences that are not linguistically significant in the languages to which they have been exposed. Babies in Pittsburgh stop distinguishing utterances that differ with respect to [p] and [pʰ], and babies in Madrid stop distinguishing utterances that differ with respect to [s] and [z]. Such pruning requires regular exposure to the language, and it is part of learning to become perceptually responsive to the features that are distinctive to a spoken language. Children thus learn to hear the sounds of their language (see, e.g., Eimas et al., 1971; Jusczyk, 1997).

Second, adult perception of certain critical speech sounds, such as stop consonants, is categorical (see Chapter XX, this volume; Harnad, 1987 ). This means that, in critical cases, such as the perception of stop consonants, gradually varying the value of a diagnostic physical parameter leads to uneven perceptual variation. For example, suppose we start with a stimulus experienced as /ba/ and gradually increase its voice onset time. At first, this makes little difference. At some point, however, the stimulus abruptly appears to shift to a /pa/. In a dramatic case of categorical perception, the change seems perfectly abrupt. Thus, given a boundary that is diagnostic for a perceptual category, stimuli that differ by a certain physical magnitude may differ only slightly in perceptual appearance when each falls within that boundary; however, stimuli that differ by that same physical magnitude may differ greatly in perceptual appearance when one but not the other falls within the boundary.
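
One conventional way to summarize such identification data is with a logistic function over the voice onset time (VOT) continuum. The sketch below is a caricature under assumed parameter values (a boundary near 25 ms and an arbitrary steepness), not a model fitted to measurements; it shows equal physical steps having unequal perceptual consequences depending on whether they cross the category boundary.

```python
import math

def p_pa(vot_ms, boundary=25.0, steepness=1.5):
    """Probability of identifying a syllable as /pa/ rather than /ba/.

    A logistic identification function over voice onset time (VOT):
    nearly flat within each category, abrupt at the category boundary.
    Boundary and steepness are illustrative values, not fitted ones.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (vot_ms - boundary)))

for vot in range(0, 55, 5):
    print(f"VOT {vot:2d} ms -> P(/pa/) = {p_pa(vot):.3f}")

# Equal 10 ms physical steps, unequal perceptual consequences:
print(p_pa(10) - p_pa(0))    # within /ba/: negligible change
print(p_pa(30) - p_pa(20))   # straddling the boundary: large change
```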

Patterns of categorical perception in fact vary across languages. Adult categorical perception of speech sounds corresponds to language-specific phonological categories, generally those of the listener’s first language (though there is some flexibility). Perceptual awareness of phonological features thus helps to explain both perceptually apparent patterns of similarity and difference among utterances within a language and variation in patterns of apparent similarity and difference across speakers of different languages.

Third, evidence from aphasias, language-related disorders, suggests that the capacity to understand spoken language normally requires the capacity to perceive language-specific attributes of speech that are not meanings. Moreover, the latter capacity affects the phenomenal character of auditory perceptual experience. Individuals with transcortical sensory aphasia (TSA) have a severely impaired capacity to grasp and to understand linguistic meanings, but they retain the capacities to hear, to generate, and to repeat spoken utterances. They commonly are unaware of their disorder. In contrast, individuals with pure word deafness (PWD) have intact semantic capacities but lack the capacity to perceive spoken language as such. Individuals with PWD are unable to hear sounds or utterances as spoken words or linguistic units. Their deficit is limited to auditory language perception. They may learn to use sign language or even read lips. And their hearing otherwise remains normal. They can hear and recognize barking dogs, cars, and even the sounds of familiar voices. Individuals with PWD say, however, that words fail to ‘come up’ and describe the auditory experience of spoken language as like hearing garbled sound or foreign language (see especially Poeppel, 2001: 681). These descriptions of TSA and PWD suggest that there is an important phenomenological difference in perceptual experience that stems from being able to discern and to recognize language-specific features but that does not require the capacity to discern and to recognize the meanings of spoken utterances. Auditorily experiencing language-specific features other than meanings therefore plausibly captures this difference. Phonological and other structural features of spoken utterances are good candidates. 3

Appealing to the content of perceptual experience thus helps to explain what is distinctive about the perceptual experience of listening to speech. In particular, two sorts of features help to account for the difference between the perceptual experience of listening to unfamiliar speech and of listening to speech in a language you know. When you know a language, the patterns of determinate low-level audible attributes you perceptually experience differ from when you do not know the language. This difference concerns the specific arrangement of low-level qualitative and temporal attributes, each of which even non-linguistic sounds could, in principle, be perceptually experienced to bear. However, understanding speech also involves perceptually experiencing spoken utterances to bear language-specific attributes, including phonological properties such as phonemes. Developing the capacity to perceptually experience such language-specific features requires exposure and perceptual learning. Its exercise is part of any adequate explanation for the experiential difference that accrues thanks to knowing a language. While I have expressed doubt that meanings and high-level semantic properties are represented by perceptual experiences, I leave open whether and which additional language-specific features are among the contents of perceptual experience when listening to speech. For instance, you may perceptually experience morphemes, lexemes, or even grammatical properties when you listen to speech in a language you understand. Greater attention to the ways such features affect the phenomenal character of perceptual experience will inform broader debates about the richness of perceptual experience—that is, about the types of features awareness of which constitutively shapes the phenomenal character of perceptual experience. This, in turn, should impact how we understand the interface of perception with cognition.

4 Objects

The previous section argued that the perceptual experience of speech differs in content from non-linguistic audition. This section concerns whether the objects of speech perception differ from those of non-linguistic audition. There are two ways to understand the objects of perception. Construed broadly, the objects of perception simply are targets of perception, and may include particular individuals, their attributes, happenings, or states of affairs. In this broad sense, to be an object of perception is just to be perceived. According to some accounts, objects of perception in the broad sense are the components of content. In Section 3, I proposed that the perceptual experience of speech involves awareness as of language-specific features. So, in the broad sense, the objects of speech perception are special when compared with those of non-linguistic audition.

Construed more narrowly, however, the objects of perception are the individuals that bear perceptible attributes. In this narrow sense, vision’s objects might include ordinary material objects that look to have attributes such as shape and colour, and audition’s objects plausibly include individual sounds that have pitch, timbre, and loudness. Further philosophical debates concern the natures of the objects of perception, including whether they are public or private. The phenomenological differences between speech perception and non-linguistic audition, especially since they are dramatic, might be taken to suggest that the objects of speech perception in this sense differ from those of non-linguistic audition. This discussion concerns whether the objects of speech perception are special in the narrow sense that includes only individuals.

In one respect, it is trivial that the objects of speech perception differ from those of non-linguistic audition. One case involves perceiving speech, and the other involves perceiving non-speech. At the very least, perceiving speech involves perceiving sounds of a kind to which non-linguistic sounds do not belong, and vice versa. Speech sounds and non-linguistic sounds differ in their causes, their sources, and their effects, as well as in their semantic and other linguistic properties.

The claim that speech perception and general audition have different objects typically is not just the claim that they involve hearing different kinds of sounds or sounds with distinctive features. Speech perception researchers have claimed that the objects of speech perception are not sounds at all. This is a claim about the sorts of individuals perceived when one perceives speech. In particular, it is the claim that while the individuals you perceive in non-linguistic auditory perception are sounds, the individuals that you perceive when you listen to speech are not sounds. The objects of speech perception instead are individuals of a wholly different sort.

Three main sorts of argument are offered. The first type of argument appeals to the mismatch between salient features of the objects of speech perception and features of the acoustic signal. We can reconstruct the argument in the following way. The objects of non-linguistic audition are sounds. The perceptible features of sounds correspond to aspects of the acoustic signal. But, the perceptible features of speech do not correspond to aspects of the acoustical signal. The perceptible features of speech are, thus, not perceptible features of sounds. So, the objects of speech perception differ from those of non-linguistic audition.

This argument can be illustrated using the case of apparent phonological features, such as phones or phonemes . The acoustic attributes that correspond to a perceived phonological feature vary greatly depending upon setting and context. Not only do they vary in expected ways, with speaker, mood, and accent, but they also depend locally upon the surrounding linguistic context. For example, phonological features are not uttered in discrete, isolated units. Instead, they are articulated in a continuous stream that flows gradually from one to the next. This has two noteworthy consequences. First, information about one phoneme is blended with information about surrounding phonemes. Because distinct speech sounds are coarticulated , when I utter ‘imbue’, the fact that the /i/ is followed by /m/ shapes how I pronounce the /i/. This differs from how I pronounce the /i/ when it is followed by /d/, as in ‘idiom’. In fact, no clear invariant acoustic signature corresponds to an utterance of a given phoneme in all of its perceptible instances. And a given acoustical configuration might contribute to distinct apparent phonemes in different contexts. Second, some have been inclined to say that perceptible speech appears to be segmented into discrete phonemes. However, the acoustic information by which you discern the presence of a given phoneme is present during the utterance of surrounding phonemes. For instance, the acoustical information corresponding to /æ/ in an utterance of ‘dab’ is present during the articulation of both the /d/ and the /b/ (and vice versa). Thus, no clear acoustic boundaries correspond to any segmentation that is apparent between adjacent phonemes. Therefore, there exists no consistent, context-independent, homomorphic mapping between apparent phonemes and straightforward features of the acoustic signal (see, e.g., Appelbaum, 1999 ; Remez and Trout, 2009 ). 4 This point should be evident to anyone who has laboured with speech recognition software. It leads some philosophers to anti-realism about phonological features. Rey (2012) , for instance, holds that phonemes are intentional inexistents (see also Smith, 2009 ).
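
The shape of this mismatch can be put schematically. The toy sketch below encodes a hypothetical second-formant onset cue for a few phone-context pairs; the values are rough placeholders in the spirit of classic pattern-playback findings, not measurements. It illustrates one phoneme lacking a single invariant acoustic signature and one cue value shared across distinct phonemes.

```python
# Toy illustration of the many-to-many mapping between phonemes and
# acoustics. F2-onset values here are invented placeholders.
F2_ONSET_HZ = {
    ("d", "i"): 2300,  # /d/ before /i/: high F2 transition
    ("d", "u"): 1100,  # same phoneme /d/, very different acoustic cue
    ("b", "i"): 1100,  # different phoneme, same cue value as ("d", "u")
}

def cues_for(phoneme):
    """Collect one phoneme's acoustic realizations across contexts."""
    return {ctx: hz for (p, ctx), hz in F2_ONSET_HZ.items() if p == phoneme}

print(cues_for("d"))  # {'i': 2300, 'u': 1100}: no invariant signature
```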

In light of this, Liberman et al. (1967) and other early proponents of the Motor Theory famously proposed that the objects of speech perception are not sounds at all, but instead are something involved in the pronunciation of speech (see also the papers collected in Liberman, 1996 ). The core idea is that features of perceived speech do map in a homomorphic, invariant way onto types of gestures involved in the production of speech. For instance, pronouncing an instance of /d/ involves stopping airflow by placing the tongue at the front of the palate behind the teeth and then releasing it while activating the vocal folds. Pronouncing /b/ involves a voiced release of air from pursed lips. Such articulatory gestures , and the component configurations and movements they comprise, make the manner in which speech is perceptually experienced intelligible in a way that attention to the acoustic signal does not, since such gestures and their descriptions are less sensitive to context. 5 The claim was that the acoustical signal encodes information about articulatory gestures and their features. If articulatory gestures and their features rather than sounds and their attributes are the best candidates for what we perceive when we are perceptually aware of instances of phonemes, then articulatory gestures are the objects of speech perception. Thus, the objects of speech perception and of non-linguistic audition differ in kind. The former are articulatory gestures with phonological characteristics, and the latter are sounds with audible attributes.

These arguments do not establish that the bearers of phonological features are not bearers of non-linguistic audible attributes. Thus, they do not establish that the objects of speech perception include individuals of a wholly different kind from the objects of non-linguistic audition. On one hand, the mismatch argument relies on the presumption that ordinary auditory awareness does map in an invariant, homomorphic way onto features of the acoustic stimulus. However, even pitch, an apparently simple audible quality, has a complex relationship to frequency. In addition, context effects abound. For instance, varying the attack of a sound affects its timbre, and the apparent duration of a tone is affected by the duration of a tone presented earlier or even later. More generally, the apparent objects of auditory awareness in acoustically complex environments do not map clearly and in invariant ways onto straightforward features of the acoustic signal. Nothing obvious in an acoustical stream signals how to distinguish the sound of a guitar from the sound of a voice in a crowded bar. The central lesson of work on auditory scene analysis is that ordinary sounds are individuated—they are distinguished from each other at a time, and they are tracked and segmented over time—in the face of highly complex, interwoven acoustic information ( Bregman, 1990 ).

On the other hand, the argument also relies on the presumption that non-linguistic audition’s objects do not map in an illuminating way onto the events that produce acoustic information. However, audition’s vital function is to provide perceptual access to events in the environment. Accordingly, human audition carves up the acoustical scene in a way that is predicated upon an interest in identifying sound sources. In fact, the way in which sounds are individuated suggests that the objects of non-linguistic auditory perception include sound sources rather than mere acoustical events or sound streams. In the face of complex, entangled acoustical information, you distinguish the sound of the guitar from the sound of the voice because they have distinct sources. We attend to and identify sounds relative to sources, and this is reflected in our thought and talk about sounds, which concern, for instance, the sound of the car door , the sound of the dog , the sound of scratching . Many descriptive sound words are source oriented: rattle , bang , crack . So, just as articulatory gestures illuminate the manner in which the objects of speech perception are individuated and classified (see Matthen, 2005 ), considering the environmental happenings that make sounds illuminates the manner in which the objects of non-linguistic auditory perception are individuated and classified (see, e.g., Nudds, 2010 ). Audition’s objects thus fail to map in an invariant, homomorphic manner onto simple physical properties of an acoustic stimulus, and sound sources help to explain the manner in which audition’s objects are individuated and classified. In these respects, non-linguistic audition does not differ from speech perception. The mismatch argument fails.

The second type of argument is that cross-modal influences in the perception of speech reveal that the objects of speech perception differ in kind from the objects of non-linguistic audition (see, e.g., Trout, 2001, for discussion). The McGurk effect is one powerful example (McGurk and MacDonald, 1976). Subjects presented with audio of an utterance of the bilabial /ba/ along with video of a speaker uttering the velar /ga/ regularly report perceptually experiencing the alveolar /da/. Seeing the speaker impacts which phoneme perceptually appears to be uttered. In fact, visual information systematically affects which phoneme you perceptually experience, so both vision and audition provide information about the objects of speech perception. Moreover, Gick and Derrick (2009) demonstrate tactile influences on speech perception. The objects of speech perception are multi-modally accessible. Sounds, however, are neither visible nor multi-modally accessible. Therefore, since sounds are the objects of ordinary non-linguistic audition, the argument concludes that the objects of speech perception and non-linguistic audition must differ.

One objection stems from the reply to the first argument. If audition’s objects include sound sources, and sound sources are ordinary happenings like collisions and vibrations, then audition’s objects might include things that are visible. The other objection is that speech perception is not unique in being subject to influence from multiple senses. Cross-modal recalibrations and illusions are rampant. The ventriloquist illusion shows that vision impacts non-linguistic audition. The motion bounce effect and the sound-induced flash illusion show that non-linguistic audition alters visual experience. Visual capture and the rubber hand illusion show that vision affects touch and proprioception. And the touch-induced flash shows that touch alters vision. The examples multiply (for references and discussion, see, e.g., Spence and Driver, 2004 ; O’Callaghan, 2012 ; Bayne and Spence, Chapter 32 , this volume). In many such cases, the best explanation for some cross-modal effect is that perceptual modalities share common objects ( O’Callaghan, 2008 , 2012 ). Consider the sound-induced flash illusion. When presented with a single flash accompanied by two beeps, many subjects illusorily visually experience two flashes instead of one as a result of the two sounds. This illusion occurs because an apparent conflict between vision and audition is resolved in audition’s favour. Since even apparent conflict requires the assumption of a common subject matter, perceptual processes unfold as if a common environmental source produces both the visual and the auditory stimulation. Since, under such conditions, audition is more reliable for temporal features, the overall perceptual experience that results is as of two events rather than one. If, therefore, cross-modal effects support the claim that multimodal speech perception targets common objects of perception, cross-modal effects may support the claim that there are common objects of perception in multi-modal cases that do not involve speech. Such cross-modal effects thus offer additional support for the claim that non-linguistic audition reveals the sources of sounds, which also are visible. Multi-modality is not unique to speech.
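
The resolution-in-audition's-favour idea can be made explicit with a standard maximum-likelihood cue-combination model, in which each modality's estimate is weighted by its reliability. The sketch below is a generic illustration with invented means and variances, not a commitment of this chapter's argument; it simply formalizes why a temporally more reliable auditory estimate dominates the combined percept.

```python
def combine_cues(mu_v, var_v, mu_a, var_a):
    """Maximum-likelihood combination of two Gaussian cues.

    Each modality's estimate is weighted by its reliability (inverse
    variance), so the more reliable cue dominates the combined percept.
    """
    w_v = (1 / var_v) / (1 / var_v + 1 / var_a)
    w_a = 1 - w_v
    return w_v * mu_v + w_a * mu_a

# Illustrative numbers: vision reports 1 flash, audition reports 2 beeps.
# Audition is assumed far more reliable for temporal structure (smaller
# variance), so the combined estimate is pulled toward 2 events.
estimate = combine_cues(mu_v=1.0, var_v=1.0, mu_a=2.0, var_a=0.1)
print(estimate)  # ~1.91: the conflict is resolved largely in audition's favour
```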

The third type of argument stems from the received view that speech perception is categorical. Some have argued that the categorical nature of phoneme perception (see Section 3) shows that its objects are not ordinary sounds, since ordinary sounds need not be perceived categorically (for discussion, see, e.g., Trout, 2001; Pinker and Jackendoff, 2005; for a critical perspective, see, e.g., Diehl et al., 2004). It is true that some attributes of sounds, such as loudness or pitch height (cf. pitch chroma), are not perceived categorically. Nevertheless, there are several lines of response to the argument from categorical perception. First, categorical perception may be limited to certain types of phonemes, such as stop consonants, so not all phoneme perception is categorical. Second, non-linguistic audition may involve categorical perception if speech perception does. Third, non-linguistic creatures, such as quail and monkeys, perceive some speech sounds categorically (see, e.g., Diehl et al., 2004: 177). Finally, colour perception commonly is regarded as categorical, but this does not establish that the objects of colour vision differ from the objects of ordinary vision. Categorical perception for selected phonemes therefore does not show that the objects of speech perception and the objects of non-linguistic audition differ in kind.

Arguments from mismatch, cross-modal influence, and categorical perception thus do not show that the objects of speech perception differ in nature from the objects of ordinary audition. Sounds are among the objects of auditory perception. But to deny that the objects of speech perception include sounds would require denying that spoken utterances may perceptually appear to have pitch, timbre, and loudness. Nonetheless, the considerations discussed above do support the claim that the objects of speech perception include events or happenings beyond sounds, such as the articulatory gestures of speakers. However, I have maintained that environmental happenings that make or have sounds are also among the objects of non-linguistic auditory perception. For instance, while you hear the crashing sound, you also may hear the collision that makes it. Thus, in speech perception and in general audition, both sounds and sound sources plausibly are among the objects of perceptual awareness.

Suppose one held that phonological features of perceptible speech, such as phones and phonemes, themselves were the objects of speech perception. Since phonological features are not individual sounds, one might be tempted to hold that the objects of speech perception differ from the objects of non-linguistic audition.

This would be a mistake. It conflates the broad and the narrow ways to understand the objects of perception. I have been discussing the narrow understanding of the objects of perception as individuals that bear perceptible attributes. Phonological features as I have characterized them may be among the objects of perception in the broad sense, but they are not objects of perception in the narrow sense.

The account I have offered denies that phones and phonemes are novel perceptible objects, understood as items or individuals, wholly distinct from audible sounds and articulatory events. It maintains instead that phonological features, including specific phones and phonemes, are perceptible properties or attributes of audible and multi-modally perceptible objects, such as sounds and articulatory events. Thus, for instance, a stream of utterances may perceptually appear to have, to bear, or to instantiate phonological attributes, such as [d] or /d/. Such perceptible linguistic features may be complex properties, and they may have complex relationships to simple acoustical, physical, or physiological properties. They may be common sensibles. One important virtue of this account is that it allows us to abandon the troublesome ‘beads on a string’ model of perceptible phonemes and to accommodate coarticulation. It does so because continuous sound streams or gestural events may perceptually appear at certain moments to instantiate multiple phonological attributes. Rather than perceptually appearing as discrete perceptible items or individuals arranged in a neatly segmented series (like typed letters in a written word), phonological properties of continuously unfolding spoken utterances may instead appear to be instantiated in connected, blended, or overlapping sequences by a common perceptible individual.

The objects of speech perception thus need not be wholly distinct from the objects of non-linguistic audition. Each may include sounds and happenings in the environment that ordinarily are understood to be the sources of sounds. In the specific case of speech, the objects of perception may include sounds of speech and gestures used to articulate spoken language. In a broad sense, they also may include phonological features.

5 Processes

What are the implications for questions about how humans perceive speech—about the means or mechanisms involved in speech perception? Does the perception of speech involve special processes, a special module, or perhaps even a special perceptual modality?

There is evidence that perceiving speech sounds does involve distinctive perceptual processes beyond those involved in hearing non-linguistic sounds. Duplex perception for dichotic stimuli shows that a single stimulus presented to one ear can, in conjunction with information presented to the other ear, contribute simultaneously to the perceptual experience as of both a non-linguistic sound and an apparently distinct speech sound (Rand, 1974). The same acoustic cue is integrated into two distinct percepts. Duplex perception is thought by some to provide evidence for a special system or mode of listening for speech. That is because, under similar experimental conditions with only non-speech tones, masking rather than integration takes place. However, duplex perception does occur for complex non-linguistic sounds, such as slamming doors, so others have responded that speech perception does not involve dedicated perceptual processes distinct from general audition (Fowler and Rosenblum, 1990). Nevertheless, the capacity to perceive non-linguistic sounds does differ developmentally from the capacity to perceive speech. Notably, for instance, the timing of critical periods for the development of linguistic and non-linguistic perceptual capacities differs. In addition, functional neuroimaging establishes that the patterns of brain activity associated with the perception of speech sounds do not match those associated with the perception of non-linguistic sounds. Most tellingly, however, perceptual capacities and disorders related to speech may dissociate from those related to non-linguistic audition. The example of pure word deafness discussed above throws this into relief. Individuals with PWD have intact abilities to hear and to recognize ordinary sounds but are unable to hear and recognize speech sounds as such. In addition, auditory agnosia concerning environmental sounds may leave linguistic capacities intact (Saygin et al., 2010). This shows that one could auditorily perceive speech while lacking other commonplace auditory capacities. Thus, there is evidence to support the claim that there exist perceptual resources and processes devoted to the perception of speech.

Some have held on such grounds that, when compared with general, non-linguistic audition, speech perception is special in that it is modular (e.g., Fodor, 1983). Others even have claimed that it involves a special perceptual modality (Liberman, 1996). I am reluctant to accept the strong view that speech perception involves a dedicated perceptual modality that is distinct from general audition and vision. Audition and vision may treat speech sounds and spoken utterances in a manner that differs from non-linguistic sounds and events, but this does not show that speech perception is a novel perceptual modality. Vision, for instance, devotes special resources and deals in different ways with the perception of objects, colour, motion, and shape. Still, there is considerable debate concerning how to count and individuate perceptual modalities. We might identify modalities by their distinctive objects, stimuli, physiology, function, or phenomenology, or by some combination of these criteria. In the case of the classic sense modalities, at least, the criteria tend to align. Some have maintained that we should be pluralists when individuating and counting sense modalities (Macpherson, 2011). Maintaining that speech perception involves a novel perceptual modality nevertheless requires appealing to one or more of the criteria. None of these criteria, however, warrants positing a modality devoted to the perception of speech that is distinct from but on a par with the familiar examples of vision, hearing, smell, taste, and touch. For instance, speech perception does not involve awareness of novel perceptual objects, and it lacks proper sensibles inaccessible to other modalities. Speech perception lacks a distinguishing kind of proximal stimulus, and it lacks a dedicated sense organ and receptors. Its functional relations do not clearly mark it off as a wholly distinct way or manner of perceiving independent from audition or vision. And it is not apparent that its phenomenology has the type of proprietary, internally unified qualitative character that is distinctive to other perceptual modalities. For instance, while the phenomenology of other sensory modalities doubly dissociates, speech perception requires auditory or visual phenomenology and, thus, does not fully dissociate. Despite these indications, however, a more satisfactory theoretical understanding of the modalities of sensory perception will help to make progress on this question (see, e.g., Matthen).

The weaker claim is that speech perception is modular. But good reasons also exist to doubt that a devoted perceptual module is responsible for the perception of speech. Appelbaum (1998), for instance, argues forcefully, against Fodor, that domain-general, top-down influences impact the perception of speech sounds. If a process is modular only if it is informationally encapsulated, then speech perception is not modular.

Perhaps it is possible to make do with a minimal story about the sense in which the processes associated with speech perception are special without appealing to a perceptual modality or even a perceptual module devoted to the perception of spoken language. Such a story may be framed in terms of our perceptual treatment of speech and speech sounds. Humans do have a special or differential selectivity or sensitivity for the sounds of speech. The striking evidence is that even neonates distinguish and prefer speech to non-speech sounds ( Vouloumanos and Werker, 2007 ). The sounds of spoken utterances are of special interest to us, relative to other kinds of environmental sounds and events.

Humans are not, however, born able to perceive all of the attributes that are distinctive to specific languages. Infants must prune and cease to perceive audible differences that are not linguistically significant in their own languages. They also must learn perceptually to discern linguistic sameness in the face of variation across speakers, moods, and contexts. This is learning perceptually to ignore irrelevant differences and to attend to crucial similarities, and it alters the language-specific perceptual similarity space involving speech sounds. Understanding a language, as it is spoken in a variety of contexts, demands such learning. In coming to know a spoken language, we begin to perceive the relevant language-specific features of sounds and utterances. Humans thus have a propensity for learning perceptually to discern the appropriate language-specific types to which spoken utterances belong.

6 What Makes Speech Special?

Perceiving the attributes that are distinctive to the speech sounds of a given language, I have argued, requires experience and learning. Learning a language thus is not simply a matter of learning a sound–meaning mapping. It involves acquiring the capacity perceptually to discern language-specific attributes of spoken utterances. In this sense, you learn to hear the sounds of your language. Learning a language is partly a matter of acquiring a perceptual skill.

Humans have a special propensity to learn to perceive language-specific attributes of speech sounds from birth, but this capacity develops later than other familiar perceptual capacities. For instance, young infants perceive individual objects and events, persistence, and sensible qualities, including colour, pitch, and loudness, prior to perceptually discerning types of sounds that are specific to a particular language. Perceptual awareness of spoken language may therefore be more like perceptual awareness of clapping hands, barking dogs, or fingernails scratching a chalkboard, each of which involves acquired perceptual skills.

As with other auditory phenomena, the manner in which language-specific sounds are perceptually individuated and classified is illuminated by taking into account the environmental happenings that generate sounds. In particular, articulatory gestures and talking faces make sense of why users of a given language discern and treat various speech sounds as standing in relations of similarity and difference that do not stem in straightforward ways from acoustical characteristics. Considered as such, perceiving speech is a matter of detecting and discerning biologically significant kinds of sounds and happenings, rather than just detecting abstract features of an acoustic signal.

How does perceiving speech differ from perceiving other biologically significant kinds of environmental sounds? Consider a family of perceptual capacities attuned to varieties of animacy. For instance, humans may sometimes perceptually experience a pattern of moving dots as running, or seem to be aware of one dot chasing another dot around a display (Heider and Simmel, 1944; see Scholl and Tremoulet, 2000; Gao et al., 2009). Here we describe the perception of inanimate things and motion in terms applicable to animate creatures and activities. Since such effects require only very minimal cues, this suggests humans have a special propensity to perceive aspects of animate creatures and their activities. That is, we have differential sensitivity to certain kinds of activity that creatures engage in, in contrast to simple mechanical patterns of motion traced by inanimate things. Perceiving speech is similar to such perceptual capacities in that its concern is a type of animacy exhibited by living things to which we have special sensitivity. In the case of speech (as in the case of faces) this perceptual capacity is directed predominantly at members of our own species.

Speech perception belongs to an even more special subclass. Speech sounds are generated by communicative intentions of other humans. Like some facial expressions and non-linguistic vocalic sounds, the sounds of spoken utterances are caused by and thus have the potential to reveal the communicative intentions of their animate sources. Speech perception is among a class of ethologically significant perceptual phenomena that serve to disclose intentional activities involved in communication. Perceiving speech is detecting and discerning language-specific kinds of biologically significant events: those generated by communicative intentions of fellow human talkers. We hear people talking. We hear them as interlocutors.

Acknowledgements

I have learned a great deal about the philosophical issues raised by speech perception from Matthen (2005), Mole (2009), Remez and Trout (2009), Rey (2012), and Smith (2009). These works, and conversations with their authors, drew me from my more general concern with sounds, audition, and multi-modality to the philosophically and empirically rich subject matter whose focus is the perception of spoken language. I gratefully acknowledge their influence upon my approach to this topic. Thanks to Mohan Matthen for helpful comments on this chapter.

Appelbaum, I. (1998). ‘Fodor, modularity, and speech perception’. Philosophical Psychology, 11(3), 317–330.

Appelbaum, I. (1999). ‘The dogma of isomorphism: A case study from speech perception’. Philosophy of Science, 66, S250–S259.

Bayne, T. (2009). ‘Perception and the reach of phenomenal content’. The Philosophical Quarterly, 59(236), 385–404.

Bregman, A. S. (1990). Auditory Scene Analysis: The Perceptual Organization of Sound. Cambridge, MA: MIT Press.

Diehl, R. L., Lotto, A. J., and Holt, L. L. (2004). ‘Speech perception’. Annual Review of Psychology, 55, 149–179.

Eimas, P. D., Siqueland, E. R., Jusczyk, P., and Vigorito, J. (1971). ‘Speech perception in infants’. Science, 171(3968), 303–306.

Fodor, J. (1983). The Modularity of Mind. Cambridge, MA: MIT Press.

Fowler, C. A. (1986). ‘An event approach to the study of speech perception from a direct-realist perspective’. Journal of Phonetics, 14, 3–28.

Fowler, C. A. and Rosenblum, L. D. (1990). ‘Duplex perception: A comparison of monosyllables and slamming doors’. Journal of Experimental Psychology: Human Perception and Performance, 16(4), 742–754.

Gao, T., Newman, G. E., and Scholl, B. J. (2009). ‘The psychophysics of chasing: A case study in the perception of animacy’. Cognitive Psychology, 59, 154–179.

Gick, B. and Derrick, D. (2009). ‘Aero-tactile integration in speech perception’. Nature, 462(7272), 502–504.

Harnad, S. (1987). Categorical Perception: The Groundwork of Cognition. Cambridge: Cambridge University Press.

Hauser, M. D., Chomsky, N., and Fitch, W. T. (2002). ‘The faculty of language: What is it, who has it, and how did it evolve?’ Science, 298, 1569–1579.

Heider, F. and Simmel, M. (1944). ‘An experimental study of apparent behavior’. The American Journal of Psychology, 57(2), 243–259.

Jusczyk, P. W. (1997). The Discovery of Spoken Language. Cambridge, MA: MIT Press.

Liberman, A. M. (1996). Speech: A Special Code. Cambridge, MA: MIT Press.

Liberman, A. M. and Mattingly, I. G. (1985). ‘The motor theory of speech perception revised’. Cognition, 21, 1–36.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P., and Studdert-Kennedy, M. (1967). ‘Perception of the speech code’. Psychological Review, 74(6), 431–461.

McDowell, J. (1998). Meaning, Knowledge, and Reality. Cambridge, MA: Harvard University Press.

McGurk, H. and MacDonald, J. (1976). ‘Hearing lips and seeing voices’. Nature, 264, 746–748.

Macpherson, F. (2011). ‘Taxonomising the senses’. Philosophical Studies, 153(1), 123–142.

Matthen, M. (2005). Seeing, Doing, and Knowing: A Philosophical Theory of Sense Perception. Oxford: Oxford University Press.

Mole, C. (2009). ‘The Motor Theory of speech perception’. In M. Nudds and C. O’Callaghan (eds), Sounds and Perception: New Philosophical Essays (pp. 211–233). Oxford: Oxford University Press.

Nudds, M. (2010). ‘What are auditory objects?’ Review of Philosophy and Psychology, 1(1), 105–122.

O’Callaghan, C. (2008). ‘Seeing what you hear: Cross-modal illusions and perception’. Philosophical Issues: A Supplement to Noûs, 18, 316–338.

O’Callaghan, C. (2011). ‘Against hearing meanings’. The Philosophical Quarterly, 61, 783–807.

O’Callaghan, C. (2012). ‘Perception and multimodality’. In E. Margolis, R. Samuels, and S. Stich (eds), Oxford Handbook of Philosophy and Cognitive Science. Oxford: Oxford University Press.

Pinker, S. and Jackendoff, R. (2005). ‘The faculty of language: What’s special about it?’ Cognition, 95, 201–236.

Poeppel, D. (2001). ‘Pure word deafness and the bilateral processing of the speech code’. Cognitive Science, 25, 679–693.

Rand, T. C. (1974). ‘Dichotic release from masking for speech’. Journal of the Acoustical Society of America, 55, 678–680.

Remez, R. E. and Trout, J. D. (2009). ‘Philosophical messages in the medium of spoken language’. In M. Nudds and C. O’Callaghan (eds), Sounds and Perception: New Philosophical Essays (pp. 234–264). Oxford: Oxford University Press.

Remez, R. E., Rubin, P. E., Pisoni, D. B., and Carrell, T. D. (1981). ‘Speech perception without traditional speech cues’. Science, 212, 947–950.

Rey, G. (2012). ‘Externalism and inexistence in early content’. In R. Schantz (ed.), Prospects for Meaning (pp. 503–530). New York: de Gruyter.

Saygin, A. P., Leech, R., and Dick, F. (2010). ‘Nonverbal auditory agnosia with lesion to Wernicke’s area’. Neuropsychologia, 48, 107–113.

Scholl, B. and Tremoulet, P. (2000). ‘Perceptual causality and animacy’. Trends in Cognitive Sciences, 4(8), 299–309.

Siegel, S. (2006). ‘Which properties are represented in perception?’ In T. S. Gendler and J. Hawthorne (eds), Perceptual Experience (pp. 481–503). New York: Oxford University Press.

Smith, B. (2009). ‘Speech sounds and the direct meeting of minds’. In M. Nudds and C. O’Callaghan (eds), Sounds and Perception: New Philosophical Essays (pp. 183–210). Oxford: Oxford University Press.

Spence, C. and Driver, J. (eds) (2004). Crossmodal Space and Crossmodal Attention. Oxford: Oxford University Press.

Trout, J. D. (2001). ‘The biological basis of speech: What to infer from talking to the animals’. Psychological Review, 108(3), 523–549.

Vouloumanos, A. and Werker, J. F. (2007). ‘Listening to language at birth: Evidence for a bias for speech in neonates’. Developmental Science, 10(2), 159–164.

Here I am alluding to but not endorsing the notorious ‘beads on a string’ analogy. I do not accept that characterization of phonological attributes, because I believe neither that they are items or individuals nor that they occur in neat, discrete sequences. Instead, I believe they are properties whose instances overlap. Further discussion in Section 4.

For readability, I use the upright rather than inverted ‘r’ for the alveolar approximant. The upright ‘r’ standardly (in the International Phonetic Alphabet) is used for the trill.

Indeed, individuals with PWD (pure word deafness) perform poorly on tasks that require categorical perception for language-specific attributes. Thanks to Bob Slevc for discussion.

Early text-to-speech methods failed because they did not accommodate this context dependence. They assigned each letter a sound and played the assigned sounds in sequences that mirrored written texts. The results were unintelligible.

One complication is that due to coarticulation the gestures pronounced in normal speaking also exhibit some lack of invariance. Liberman and Mattingly (1985) revised the Motor Theory to claim that intended motor commands are the objects of speech perception. See Mole (2009) for a convincing critique of the revised account. Fowler’s (1986) Direct Realism maintains that articulatory gestures are the objects of speech perception but rejects that gestural events differ in kind from the objects of non-linguistic audition.



Auditory Perception

Auditory perception raises a variety of challenging philosophical questions. What do we hear? What are the objects of auditory awareness? What is the content of audition? Is hearing spatial? How does audition differ from vision and other sense modalities? How does the perception of sounds differ from that of colors and ordinary objects? This entry presents the main debates in this developing area and discusses promising avenues for future inquiry. It discusses the motivation for exploring non-visual modalities, how audition bears on theorizing about perception, and questions concerning the objects, contents, phenomenology, varieties, and bounds of auditory perception.

1. Other Modalities and the Philosophy of Perception
2. The Objects of Auditory Perception
  2.1 Sounds
    2.1.1 Private or Public
    2.1.2 Proximal or Distal
    2.1.3 Properties or Individuals
    2.1.4 Objects or Events
  2.2 Auditory Objects
    2.2.1 Object Perception in Audition
    2.2.2 What Is an Auditory Object?
  2.3 Sound Sources
    2.3.1 Do Humans Hear Sound Sources?
    2.3.2 The Mereology of Sounds and Sources
3. The Contents of Auditory Perception
  3.1 Spatial Hearing
    3.1.1 Skepticism about Spatial Hearing
    3.1.2 Strawson and the Purely Auditory Experience
    3.1.3 Does Audition Have Spatial Structure?
    3.1.4 How Spatial Audition Differs from Spatial Vision
  3.2 Audible Qualities
    3.2.1 Sounds and Colors
    3.2.2 Pitch, Timbre, and Loudness
    3.2.3 Is Pitch Physical?
    3.2.4 Disanalogies with Color
    3.2.5 Analogies with Color
  4.1.1 Acousmatic Experience
  4.1.2 Acousmatic Listening as Attention
  4.2.1 Is Speech Special?
  4.2.2 The Objects of Speech Perception
  4.2.3 The Contents of Speech Perception
  4.2.4 Is Speech Perception Auditory?
    Supplement on Speech Perception
  4.3.1 Crossmodal Illusions
  4.3.2 Causal or Constitutive
  4.3.3 Multimodality in Perception
5. Conclusion and Future Directions
Other Internet Resources
Related Entries

The philosophy of sounds and auditory perception is one area of the philosophy of perception that reaches beyond vision for insights about the nature, objects, contents, and varieties of perception. This entry characterizes central issues in the philosophy of auditory perception, many of which bear upon theorizing about perception more generally, and it mentions outstanding questions and promising future areas for inquiry in this developing literature. Before beginning the substantive discussion of audition itself, it is worthwhile to discuss the motivation and rationale for this kind of work.

Philosophical thinking about perception has focused predominantly on vision. The philosophical puzzle of perception and its proposed solutions have been shaped by a concern for visual experience and visual illusions. Questions and proposals about the nature of perceptual content have been framed and evaluated in visual terms, and detailed accounts of what we perceive frequently address just the visual case. Vision informs our understanding of perception’s epistemological role and of its role in guiding action. It is not a great exaggeration to say that much of the philosophy of perception translates roughly as philosophy of visual perception.

Recently, however, other perceptual modalities have attracted attention (see, e.g., Stokes et al. 2015, Matthen 2015). In addition to auditory perception and the experience of sound, touch and tactile awareness have generated philosophical interest concerning, for instance, the tactile and proprioceptive experience of space, the objects of touch, whether contact is required for touch, and whether distinct modalities detect pressure, heat, and pain (see, e.g., O’Shaughnessy 1989, Martin 1993, Scott 2001, Fulkerson 2013, 2016). The unique phenomenology of olfaction and smells has been used to argue that vision is atypical in supporting the transparency of perceptual experience (Lycan 2000, 282; cf. Batty 2010) and that perceptual objectivity does not require spatiality (Smith 2002, ch 5). Lycan (2000) even suggests that the philosophy of perception would have taken a different course had it focused upon olfaction instead of vision (see also Batty 2011). Some authors have appealed to taste and flavor to challenge traditional ways of dividing and counting senses (Smith 2015; cf. Richardson 2013).

This kind of work is philosophically interesting in its own right. But it is also worthwhile because theorizing about perception commonly aims to address general questions about perception, rather than concerns specific to vision. Hope for a comprehensive and general understanding of perception rests upon extending and testing claims, arguments, and theories beyond vision. One might view work on non-visual modalities as filling out the particulars required for a thoroughly detailed account of perceiving that applies not just to vision but across the modalities. At least three approaches might be adopted, with potential for increasingly revisionist outcomes.

First, one might take work on non-visual modalities as translating what we have learned from the visual case into terms that apply to other modalities. This approach is relatively conservative. It assumes that vision is representative or paradigmatic and that we have a good understanding of perception that is derived from the case of vision. One example of this kind of approach would be to develop an account of the representational content of auditory experience.

Second, considering other modalities might extend our vision-based understanding of perception. Non-visual cases might draw attention to new kinds of phenomena that are missing from or not salient in vision. If so, a vision-based account of perception is satisfactory as far as it goes, but it leaves out critical pieces. For example, speech perception, multimodal perception, and flavor perception might involve novel kinds of perceptual phenomena absent from the visual case.

Third, considering other modalities might challenge vision-based claims about perception. If falsifying evidence is discovered in non-visual cases, then theorizing beyond vision may force revision of general claims about perception that are supported by vision. For example, if olfactory experience is not diaphanous, but olfactory experience is perceptual, the transparency thesis for perceptual experience fails.

Finally, we might attempt to determine whether any unified account exists that applies generally to all of the perceptual modalities. We can ask this question at several levels: about quite specific claims, such as those concerning the objects of perception or the nature and structure of content; about the relationships among perceiving, believing, and acting; or about the general theory of necessary and sufficient conditions for perceiving. Some philosophers, impressed by findings concerning non-visual modalities, express skepticism about whether a unified theory exists (e.g., Martin 1992).

Whatever the approach, extending our knowledge about perception beyond the visual requires systematic attention to individual modalities as well as careful accounting in order to determine how the results bear on general questions about perception. Whatever the outcome, audition is a rich subject matter in its own right, and investigating this subject matter is crucial to our overall understanding of perception.

2. The Objects of Auditory Perception

What do we hear? One way to address this question concerns the objects of auditory perception.

In the first instance, typical human perceivers hear sounds. It is plausible that sounds are objects of auditory perception.

What are sounds? Sounds traditionally have been counted with colors, smells, and tastes as secondary, sensible, or sensory qualities (see, e.g., Locke 1689/1975, Pasnau 1999, 2000, Leddington 2019). However, recently it has been proposed that sounds are individuals to which sensible features are attributed. In particular, several philosophers have proposed that sounds are public, distally-located, event-like individuals (Casati and Dokic 1994, 2005, O’Callaghan 2007, Matthen 2010).

Four questions about audition’s objects define the debate and constrain theories of sound (see also the entry on sounds for extensive discussion).

Are sounds private or public? Maclachlan (1989) argues that the sounds we hear are sensations (rather than, for instance, the pressure waves that cause auditory experiences). Such sensations are internal and private, and we experience them directly, or without apparent mediation. On Maclachlan’s account, we hear the ordinary things and happenings that are the sources of sounds only indirectly, by means of inference from auditory data.

Maclachlan’s story is noteworthy partly because he uses hearing and sounds to motivate a general claim about perception. He claims that what seems perfectly intuitive and obvious in the case of sounds and hearing—that something other than material objects are the direct objects of hearing; that the direct objects of audition are internal; and that we indirectly hear things in the world by hearing their sounds—helps us to discover what is true of all perception. According to Maclachlan, for instance, seeing involves direct awareness of sensations of patterns of light, while surfaces and ordinary objects figure only indirectly and thanks to inference among the intentional objects of sight. The case of sounds and audition is important because it reveals that perceiving involves awareness of sensations in the first instance, and of the external world only indirectly.

Maclachlan’s description of sounds and auditory experience has some attractions. First, sounds are among the things we hear. And sounds are among the direct or immediate objects of audition in the relatively innocuous sense that hearing a sound does not seem to require hearing as of something else. Hearing a collision, on the other hand, may seem to require awareness as of a sound. Furthermore, sounds are unlike the ordinary material objects (e.g., bottles and staplers) we see. You cannot reach out and grab a sound, or determine its temperature. Instead, sounds may strike us as byproducts or effects of such ordinary things and their transactions. Sounds result from activities or interactions of material bodies and thus are experienced as distinct or independent from them (cf. Nudds 2001). Nevertheless, audition does afford some variety of awareness of the sources of sounds, or at least provides information about them.

However, the claim that sounds are sensations is unattractive. Good reasons suggest that sounds are public rather than private, even if sounds are not identical with ordinary objects and events such as clothespins and collisions. Suppose I am near the stage in a hall listening to some music, and that I have a headache. It is a confusion to think you could feel my headache, but I assume you hear the sounds I hear. Suppose I move to the back of the hall, and the headache then gets better. My experience of the sounds of the music differs once I am at the back of the room, and my experience of the headache differs. The sound of the music itself need not differ (the musicians could make the same sounds), but the headache itself changes. The sounds can continue once I leave the room, but if I stop experiencing the headache, it is gone. Moreover, the notion of an unfelt headache is puzzling, but it makes good sense to say that a tree makes a sound when it falls in the woods without being heard. Finally, tinnitus, or ringing of the ears, is an illusory or hallucinatory experience as of a sound, but received wisdom maintains that there are no illusory headaches.

This suggests that audition does not provide special reasons to believe that the objects of perception are private sensations. Sounds, construed as objects of auditory perception, plausibly inhabit the public world. (See the section 3.1 Spatial Hearing for further discussion.)

Are sounds proximal or distal? The customary science-based view holds that sounds are pressure waves that travel through a medium (see also Sorensen 2008). On this account, sounds are caused by objects and events such as collisions, and sounds cause auditory experiences. However, sounds are not auditorily experienced to travel through the surrounding medium as waves do. Thus, if sounds are waves, then the sounds we hear may be proximal, located at the ear of the hearer.

Alternatively, some have argued that audition presents sounds as being located in some direction at a distance (Pasnau 1999, O’Callaghan 2007, ch. 3, 2010). On such an account, sounds commonly appear auditorily to be in the neighborhood of their sources and thereby furnish useful information about the locations of those sources. The sound of the drumming across the street seems to come from across the street but does not seem audibly to travel. When sounds do appear to fill a room, sound seems located all around. Sounds that seem to “bounce” around a room appear intermittently at different locations rather than as traveling continuously from place to place. Experiencing a missile-like sound speeding towards your ears illustrates the contrast with ordinary hearing (O’Callaghan 2007, 35). Sounds, according to this conception, ordinarily appear to have distal locations and to remain stationary relative to their sources.

If sounds are not usually experienced to travel, then unless auditory experience is illusory with respect to the apparent locations of sounds, sounds themselves do not travel. Sounds thus are not identical with and do not supervene locally upon the waves, since waves travel (Pasnau 1999). Several philosophers have argued on these and related grounds that sounds are located distally, near their sources (Pasnau 1999, Casati and Dokic 2005, O’Callaghan 2007). On this view, pressure waves bear information about sounds and are the proximal causes of auditory experiences, but are not identical with sounds.

One might object by resisting the phenomenological claim that we experience sounds as distally located, for instance by suggesting that audition is aspatial, or that audition is spatial but sound sources rather than sounds are auditorily localized (see section 3.1 Spatial Hearing for further discussion). Or, one might accept some measure of illusion. Another possibility is that we experience only a small subset of the locations sounds occupy during their lifetimes (for instance, while at their sources), and simply fail to experience where they are at other times. This avoids ascribing illusion. Finally, Fowler (2013) argues against distal theories of sound indirectly, on the basis of echoes.

Are sounds properties or individuals? Among both proximal and distal theories, disagreement exists concerning the ontological category to which sounds belong. Philosophers traditionally have understood sounds as properties—either as sensible or secondary qualities, or as the categorical or physical properties that ground powers to affect subjects. Commonly, sounds are attributed to the medium that intervenes between sources and perceivers. More recently, however, some distal theorists have argued that sounds are properties of what we ordinarily understand as sound sources—bells and whistles have or possess rather than make or produce sounds. Pasnau (1999), for instance, claims that sounds are transient properties that are identical with or supervene upon vibrations of objects. Kulvicki (2008) argues against transience in an attempt to subsume sounds under the model of colors, and claims that sounds are persistent, stable dispositional properties of objects to vibrate in response to being “thwacked”. He distinguishes “having” a stable sound from “making” a sound on some occasion (manifesting the stable disposition). This account implies that objects sometimes make sounds they do not have, and that they have sounds when silent. One also might ask whether events such as collisions and strummings, rather than objects, bear sounds. Leddington (2019) recently has defended such an account.

A revisionist challenge comes from those who argue that sounds are individuals rather than properties. Several arguments support this understanding. First, empirical work on auditory scene analysis suggests that one primary task of audition is to carve up the acoustic scene into distinct sounds, each of which may possess its own pitch, timbre, and loudness (Bregman 1990). Multiple distinct sounds with different audible attributes can be heard simultaneously. An analog of Jackson’s (1977, see also Clark 2000) many properties problem thus arises for audition since feature awareness alone cannot explain the bundling or grouping of audible attributes into distinct sounds. Such bundling or grouping of audible features suggests that sounds are perceptible individuals to which these features are attributed.

Furthermore, the temporal characteristics of experienced sounds suggest that sounds are not simple qualities. Sounds audibly seem to persist through time and to survive change. A particular sound, such as that of an emergency siren, might begin high-pitched and loud and end low-pitched and soft. This suggests that sounds are individuals that bear different features at different times, rather than sensible qualities.

Several responses to these arguments are available (see Cohen 2009 for the most developed reply). One might argue that sounds are complex properties, such as pitch-timbre-loudness complexes, instantiated at a time. To account for feature binding, one might hold that such complex properties are ascribed to ordinary objects such as bells and whistles. Or, one might hold that they are particularized properties, such as tropes. To accommodate sounds that survive change through time, a property account could hold that sounds are yet more complex properties that have patterns of change built into their identity conditions. However, any such view differs a great deal from the familiar secondary or sensible quality view pioneered by Locke. Pitch, timbre, and loudness are better candidates for simple sensible features in audition (see section 3.2 Audible Qualities ).

If sounds are individuals, are they object-like or event-like individuals? Intuitively, the material objects we see are capable of existing wholly at any given moment, and all that is required to perceptually recognize such individuals is present at a moment. On the other hand, event-like individuals occupy time and need not exist wholly at any given moment. Their individuation and recognition frequently appeal to patterns of features over time. Event-like individuals intuitively comprise temporal parts, while object-like individuals intuitively do not. The issue here is not the truth of endurantism or perdurantism as an account of the persistence of objects or events. Instead, the issue concerns a difference in how we perceptually individuate, experience, and recognize individuals.

No contemporary philosopher has yet claimed that sounds are objects in the ordinary sense. Those who argue that sounds are individuals commonly point out that sounds not only persist and survive change (as do ordinary material objects), but also require time to occur or unfold. It is difficult to imagine an instantaneous sound, or one that lacks duration. Sounds are not commonly treated as existing wholly at a given moment during their duration. Indeed, the identities of many common sounds are tied to patterns of change in qualities through time. The sound of an ambulance siren differs from that of a police siren precisely because the two differ in patterns of qualitative change through time. The sound of the spoken word ‘team’ differs from that of ‘meat’ because each instantiates a common set of audible qualities in a different temporal pattern. These considerations support the view that sounds are event-like individuals (see Casati and Dokic 1994, 2005, Scruton 1997, O’Callaghan 2007, Matthen 2010).

This may bear on debates about persistence in the following way. Differences in the intuitive plausibility of endurantism and perdurantism may be grounded in facts about perception. In particular, vision may treat objects as persisting by enduring or being wholly present at each time at which they exist, while audition may treat its objects as persisting by perduring or having temporal parts. This may stem from differences in perceptual organization. For instance, exhibiting a visible property profile at a time may suffice for being a visual object of a given sort, while being an audible object of a given sort may require exhibiting an audible property profile over time.

2.2 Auditory Objects

Though most philosophers construe sounds either as properties or as event-like individuals (see section 2.1 Sounds ), psychologists commonly have discussed auditory objects (see, e.g., Kubovy and Van Valkenburg 2001, Griffiths and Warren 2004, Heald et al. 2017). The target of such discussion is not simply audition’s intentional objects or proper (specific to audition) objects. The intended analogy is with visual objects . Talk of auditory objects gestures at the visual processes involved in perceptually discriminating, attentively tracking, recognizing and categorizing ordinary material objects. What justifies talk of object perception in audition?

First of all, humans typically do not auditorily perceive three-dimensional, bounded material objects as such, though it is plausible to think we visually perceive them. Hearing does not resolve the edges, boundaries, and filled volumes in space that I see, and I do not hear audible items to complete spatially behind occluders as do visible surfaces of objects. If perceiving a three-dimensional object requires awareness of its edges, boundaries, and extension, perhaps in order to discriminate it from its surroundings, humans typically do not auditorily perceive such objects.

Nevertheless, striking and illuminating parallels do exist between the perceptual processes and experiences that take place in vision and audition. Such parallels may warrant talk of object perception in a more general sense that is common to both vision and audition (O’Callaghan 2008a).

Perceiving objects requires parsing a perceptual scene into distinct units that one can attend to and distinguish from each other and from a background. In vision, bounded, cohesive collections of surfaces that are extended in space and that persist through time play this role (see, e.g., Spelke 1990, Nakayama et al. 1995, Leslie et al. 1998, Scholl 2001, Matthen 2005). In audition, as in vision, multiple distinct perceptible individuals might exist simultaneously, and each might persist and survive change (see the discussion of auditory scene analysis in section 2.1 Sounds ). A critical difference, however, is that while vision’s objects are extended in space, and are individuated and recognized primarily in virtue of spatial characteristics, audible individuals are extended in time, and are perceptually individuated and recognized primarily in virtue of pitch and temporal characteristics (see, e.g., Bregman 1990, Kubovy and Van Valkenburg 2001). For instance, audible individuals have temporal edges and boundaries, and boundary elements can belong only to a single audible individual. They also are susceptible to figure-ground effects over time. One can, for instance, shift attention among continuous audible individuals that differ in pitch. Furthermore, they are susceptible to completion effects over time in much the same way that visible objects are perceptually completed in space. Seeing a single visible region to continue behind a barrier is analogous to hearing a sound stream to continue through masking noise, which may take place even when there is no corresponding signal (Bregman 1990, 28). Finally, multiple distinct, discrete audible individuals, such as the temporally bounded notes in a tune, can form audible streams that comprise a single perceptible unit. Such streams are subject to figure-ground shifts, and, like collections of surfaces, they can be attentively tracked through changes to their features and to one’s perspective. Though such complex audible individuals include sounds, they comprise temporally unified collections of sounds and silence that are analogous to spatially complex visible objects, such as tractors.

Such audible individuals are temporally extended and bounded, serve as the locus for auditory attention, prompt completion effects, and are subject to figure-ground distinctions in pitch space. For these reasons, the auditory processes involved in their perception parallel those involved in the visual perception of ordinary three-dimensional objects. The parallels suggest a shared sense in which vision and audition involve a more general form of object perception (see, e.g., Kubovy and Van Valkenburg 2001, Scholl 2001, Griffiths and Warren 2004, O’Callaghan 2008a, Matthen 2010).
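
The grouping of discrete tones into streams described above can be made concrete with a toy computation. The sketch below is purely illustrative; the proximity rule and the semitone threshold are assumptions of mine, not a model drawn from Bregman (1990). Each tone joins the stream whose most recent tone is nearest in pitch, and starts a new stream when no existing stream is close enough.

```python
# Toy illustration of auditory stream segregation by pitch proximity.
# Each tone joins the stream whose most recent tone is nearest in
# log-frequency, provided the distance is under a threshold; otherwise
# it starts a new stream. A deliberate simplification, not a model
# taken from the auditory scene analysis literature.
import math

def segregate(tones_hz, max_semitones=4.0):
    """Group a sequence of tone frequencies (Hz) into streams."""
    streams = []  # each stream: list of frequencies in presentation order
    for f in tones_hz:
        best, best_dist = None, float("inf")
        for stream in streams:
            # distance in semitones from the stream's most recent tone
            dist = abs(12.0 * math.log2(f / stream[-1]))
            if dist < best_dist:
                best, best_dist = stream, dist
        if best is not None and best_dist <= max_semitones:
            best.append(f)
        else:
            streams.append([f])
    return streams

# Alternating low and high tones, as in classic "galloping" demonstrations:
# the sequence splits into two streams, one per frequency region.
print(segregate([400, 1000, 400, 1000, 400, 1000]))
# [[400, 400, 400], [1000, 1000, 1000]]
```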

What is the shared sense in which both visible and audible individuals deserve to be called ‘objects’? Kubovy and Van Valkenburg (2001, 2003) define objecthood in terms of figure-ground segregation, which requires perceptual grouping. They propose the theory of indispensable attributes as an account of the necessary conditions on perceptual grouping (see also Kubovy 1981). Indispensable attributes for a modality are those without which perceptual numerosity is impossible. They claim that while space and time are indispensable attributes for vision (and color is not), pitch and time are indispensable attributes for auditory objects. Though they are more skeptical about whether audition parallels vision, Griffiths and Warren (2004) sympathize with a figure-ground characterization but suggest a working notion of an auditory object defined in terms of “an acoustic experience that produces a two-dimensional image with frequency and time dimensions” (Griffiths and Warren 2004, 891).
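
Griffiths and Warren’s working definition can be given a concrete signal-processing gloss: a “two-dimensional image with frequency and time dimensions” is, in effect, a spectrogram. The following minimal sketch computes such an image from a sampled pressure wave via the short-time Fourier transform; the window and hop sizes are illustrative choices of mine, not parameters from their paper.

```python
# Minimal sketch: build a frequency-by-time "image" (a magnitude
# spectrogram) from a sampled pressure wave with a short-time Fourier
# transform. Parameter values are illustrative.
import numpy as np

def spectrogram(signal, sample_rate, window_len=1024, hop=512):
    """Return (freqs_hz, image); image rows are frequency bins, columns are time frames."""
    window = np.hanning(window_len)
    frames = []
    for start in range(0, len(signal) - window_len + 1, hop):
        segment = signal[start:start + window_len] * window
        frames.append(np.abs(np.fft.rfft(segment)))
    freqs = np.fft.rfftfreq(window_len, d=1.0 / sample_rate)
    return freqs, np.array(frames).T

# Example: one second of a 440 Hz tone followed by one second at 880 Hz.
rate = 16000
t = np.arange(rate) / rate
sig = np.concatenate([np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 880 * t)])
freqs, image = spectrogram(sig, rate)
print(image.shape)  # (513, 61): frequency bins by time frames
```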

O’Callaghan (2008a) proposes that both visible and audible objects are mereologically complex individuals, though their mereology differs in noteworthy respects. While vision’s objects possess a spatial mereology and are individuated and tracked in terms of spatial features, audition’s objects have a temporal mereology and are individuated and tracked in terms of both pitch and temporal characteristics. Discussion of auditory objects thus draws attention to two roles that space plays in vision. First, there is the role of space in determining the structure internal to visible objects, which facilitates identifying and recognizing visible objects. Second, space serves as the external structure among visible objects, and is critical in distinguishing objects from each other. In audition, time plays a role similar to space in vision in determining the structure internal to auditory objects. Pitch, on the other hand, serves as an external structural framework, along with space, that helps to distinguish among audible individuals.

Why is it useful to perceive such individuals in audition? One promising account is that they provide useful information about the happenings that produce sounds. Carving the acoustic world into mereologically complex individuals informs us about what is going on in the extra-acoustic environment. It provides ecologically significant information about what the furniture is doing, rather than just how it is arranged. It is one thing to perceive a tree; it is another to hear that it is falling behind you.

Discussion of auditory objects and accounts of their nature and perception is relatively new among philosophers (see, e.g., O’Callaghan 2008a, and essays in Bullot and Egré 2010, including Matthen 2010, Nudds 2010). Such work has led to the development of general accounts of perceptual objects designed to avoid visuocentrism (see, e.g., O’Callaghan 2016, Green 2019). This area is ripe for philosophical contributions.

2.3 Sound Sources

Sounds are among the objects of audition. Plausibly, so are complex, temporally extended individuals composed of sounds. Do we hear anything else? Reflection suggests we hear things beyond sounds and sound complexes. In hearing sounds, one may seem to experience the backfiring of the car or the banging of the drum. One might hold that a primary part of audition’s function is to reveal sound sources , the things and happenings that make sounds.

If sounds were internal sensations or sense-data, then, as Maclachlan (1989) observes, we would hear sound sources only indirectly, in an epistemological sense, perhaps thanks to something akin to inference. Acquiring beliefs about the environment would require mediation by propositions connecting experienced internal sounds with environmental causes.

If, however, sounds are properties attributed either to ordinary objects, as Pasnau (1999) and Kulvicki (2008) hold, or to events, as Leddington (2019) holds, then hearing a tuba or the playing of a tuba might only require hearing its sounds. Perceptually ascribing such audible attributes to their sources might ground epistemically unmediated awareness of tubas or their playings.

However, the individuals to which audible attributes are perceptually attributed need not be identical with ordinary objects or events. Instead, audible attributes may belong in the first instance to sounds. Sounds plausibly are distinct from ordinary or extra-acoustic individuals (O’Callaghan 2007, 2011). Suppose, then, that one cannot hear an ordinary object or event unless an audible sound exists, and that sounds can mislead about their sources (it might sound like drumming but be hammering).

Given this, forming beliefs about ordinary things and happenings connected with sounds might seem to require inference, association, or some other cognitive process, and so awareness of a sound source might appear to always involve more than perceptual awareness. According to such an account, awareness of environmental things and happenings thanks to audition is epistemically mediated by awareness as of sounds and auditory objects, but does not itself constitute auditory perceptual awareness as of those things and happenings. You are inclined to think you hear the source because your representing or being aware of it co-occurs with, but is no more than a downstream consequence triggered by, your auditory experience.

Such an account is not wholly satisfactory. First, the phenomenology of audition suggests something stronger than indirect, epistemically mediated awareness of things such as collisions or guitar strummings or lions roaring. Reflection suggests auditory awareness as of collisions, strummings, and lions. Second, the capacity to refer demonstratively to such things and events on auditory grounds also suggests genuine perceptual awareness of them. Third, we commonly perceptually individuate sounds in terms of their apparent sources, and our taxonomy reflects this. “What did you hear?” “I heard paper ripping,” or, “The sound of a dripping faucet.” We distinguish two quite similar rattles once we hear one as of a muffler clamp and the other as of a loose fender. Furthermore, characterizing certain audible features and explaining perceptual constancy effects involving such features requires appeal to sound sources. Handel says of timbre: “At this point, no known acoustic invariants can be said to underlie timbre... The cues that determine timbre quality are interdependent because all are determined by the method of sound production and the physical construction of the instrument” (Handel 1995, 441). Explaining loudness constancy—why moving to the back of the room does not change how loudly the lecturer seems to speak—appeals to facts about the sources of sounds (Zahorik and Wightman 2001). Auditory processing proceeds in accordance with natural constraints concerning characteristics of sound sources, and information concerning sources shapes how auditory experiences are organized. This is to say that processes responsible for auditory experience proceed as if acoustic information is information about sound sources. Finally, audition-guided action supports the claim that we hear such things and events. Turning to look toward the source of a sound or ducking out of the way of something we hear to be approaching—behaviors guided by auditory experience—would make little sense if we heard only sounds. In the first place, these reasons ground a case for thinking that auditory perceptual experience does not strictly end with sounds and auditory objects. In particular, awareness as of a source, even if dependent upon awareness as of a sound, may be constitutive of one’s auditory perceptual experience.
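
The loudness-constancy point admits a back-of-the-envelope illustration. Under the idealized inverse-square law, received level falls roughly 6 dB per doubling of distance, so a listener sensitive to both received level and distance can in principle recover a stable source level. The sketch below is a simplification of mine with illustrative figures; it is not the model proposed by Zahorik and Wightman (2001), which also exploits reverberant energy.

```python
# Sketch of distance-compensated loudness under an inverse-square
# idealization: received level falls about 6 dB per doubling of distance,
# so adding back the attenuation implied by distance recovers a stable
# estimate of the source's level. Figures are illustrative; real loudness
# constancy also exploits reverberant energy and other cues.
import math

def source_level_db(received_db, distance_m, reference_m=1.0):
    """Estimate the level at the reference distance from the level received at distance_m."""
    return received_db + 20.0 * math.log10(distance_m / reference_m)

# The same talker heard from 2 m and from 8 m: received levels differ by
# 12 dB, yet the inferred source level agrees, mirroring constancy.
print(source_level_db(60.0, 2.0))  # ~66.0 dB re 1 m
print(source_level_db(48.0, 8.0))  # ~66.1 dB re 1 m
```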

The main barrier to an alternative is that the relation between sounds and ordinary things or happenings is commonly understood as causal (see, e.g., Nudds 2001). Awareness as of an effect does not itself typically furnish epistemically unmediated awareness of its cause. Seeing smoke is not seeing fire. The right sort of dependence between characteristics of the experience and the cause is not apparent, and awareness as of an effect does not by itself ground perceptual demonstratives that concern the cause. The metaphysical indirectness of the causal relation appears to block epistemic directness (see O’Callaghan 2011a for further discussion).

Is there another explanatory route? Suppose that instead of a causal relation, we understand the relationship between sounds and sources mereologically , or as one of part to whole (see O’Callaghan 2011a). Parthood frequently does ground perceptual awareness. For instance, seeing distinct parts of a surface interrupted by an occluder leads to perceptual experience as of a single surface (imagine seeing a dog behind a picket fence). Seeing the facing surfaces of a cube affords awareness as of a cube, and we can attentively track that same cube as it rotates and reveals different surfaces. Suppose, then, that a sound is an event-like individual (recall, property accounts escape the worry). This event is part of a more encompassing event, such as a collision or the playing of a trumpet, that occurs in the environment and that includes the sound. So, the typical horse race includes the sounds, and you might auditorily perceive the racing in hearing some of its proper parts: the sounds. More specifically, you may hear the galloping thanks to hearing the sounds it includes. You may fail to hear certain parts of the racing event, such as the jockey’s glance back after crossing the wire, but you also fail to see parts of the race, such as the misstep of the horse in second place. If the sounds are akin to the audible “profile” of the event, analogous to the visible surfaces of objects and visible parts of events, you might then enjoy auditory awareness as of the galloping of the horses in virtue of your awareness as of the sounds of the hooves. The sound is not identical with the galloping, and it is not just a property or a causal byproduct of the galloping. It is a part of a particular event of galloping. The metaphysical relation of part to whole, in contrast to that between effect and cause, might ground the sort of epistemically unmediated awareness of interest (cf. Nakayama et al. 1995, Bermúdez 2000, Noë 2004). Auditory perceptual awareness as of the whole may occur thanks to experiencing the part.

One objection is that this mereological account of the relation between sounds and sources cannot account for hearing ordinary objects by hearing their sounds. You could not strictly hear a tuba by hearing its sound because a tuba is not an event of which a sound is a part. However, the sound is part of the event of playing the tuba, and the tuba is a participant in that playing. So, though you are not aware as of a tuba, you are aware as of an event that involves a tuba. That perhaps is enough to explain talk of hearing tubas and to assuage the worry (for further discussion, see Young 2018).

Another more serious objection contends that the events we seem to hear are ones that do not constitutively involve sounds or that might have taken place without sounds. For instance, we hear the collision, but the collision is something that could have occurred in a vacuum and not made a sound. (Note that this differs from the claim discussed below according to which sounds are identical with source events and so inaudibly exist in vacuums.) If so, the collision and the sound differ and the collision does not strictly include a sound. The collision therefore must have made the sound as a causal byproduct. This suggests that, strictly speaking, you could not hear that very collision event (since it causes the sound). The best response is to bite the bullet and accept that events that do occur or that could occur in vacuums cannot be heard since they include no sounds. This is not so bad, since you could hear a different, more encompassing event that includes a sound along with a collision. Alternatively, one might say the very same event that occurs in a vacuum also could occur in air, but that it would have involved a sound had it occurred in air. In that case, one can only hear such events when they occur in air and include a sound. The choice depends upon one’s metaphysics of events. In either case, it seems reasonable that token events that do not include sounds are inaudible.

Casati et al. (2013) sidestep some of these concerns by rejecting the distinction between sounds and events that are typically understood as sound sources. They propose to “Ockhamize” the “event sources” of sounds by identifying sounds with events such as collisions and vibrations. The sound just is the collision or the vibrating, whether or not it occurs in air. This account implies that sounds could exist in vacuums.

What hinges on the debate about hearing sources? The first upshot is epistemological and concerns the nature of the justification for empirical beliefs grounded in perceptual experience. The evidential status of beliefs about what one perceptually experiences differs from that of beliefs about what is causally responsible for what one perceptually experiences. So, whether or not we hear sound sources impacts the epistemology of audition. The second upshot concerns the relation between audition and certain actions. If we hear only sounds and auditory objects, what appears to be effortless, auditorily guided action to avoid or orient toward sound sources requires another explanation (because sounds are invisible and usually do no harm). Finally, it affects how we understand the adaptive significance of audition. Did audition evolve so as to furnish awareness of sounds alone, while leaving their environmental significance to extra-perceptual cognition, or did it evolve so as to furnish perceptual responsiveness to the sources of sounds?

3. The Contents of Auditory Perception

Another way to address the question, “What do we hear?” concerns the contents of auditory perception. Two topics are especially noteworthy in the context of related debates about vision and its contents. The first concerns whether audition has spatial content. The second concerns the perception of audible qualities. Parallel questions can be raised without relying on the perceptual content framework, though important complications arise.

3.1 Spatial Hearing

One topic where the contrast between vision and audition has been thought to be particularly philosophically significant concerns space. Vision is a robustly spatial perceptual modality. Vision furnishes awareness of space and spatial features. Some claim vision has an inherently spatial structure, or, further, that vision’s spatial structure is a necessary condition on being visually aware of things as independent from oneself.

Hearing also provides information about space—humans learn about space on the basis of hearing. If audition represents space or spatial features, there is a natural account of being so informed. We might form beliefs about spatial features of environments on the basis of auditory perceptual experiences simply by accepting or endorsing what is apparent in having those experiences.

But learning about spatial features on the basis of audition and audition’s bearing information about space both are consistent with entirely aspatial auditory phenomenology. For instance, volume might bear information about distance, and differences in volume at the two ears might bear information about direction. In that case, audition bears information about space, and learning about space on the basis of audition is possible, but it does not follow that auditory experience is spatial or that audition represents space.
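
To see how direction information could be carried without any spatial phenomenology, consider a toy decision rule that maps an interaural level difference onto a coarse direction judgment. The threshold and the mapping below are illustrative assumptions of mine, not a psychoacoustic model.

```python
# Toy decision rule: a coarse left/right/center judgment from an
# interaural level difference (ILD). Threshold and mapping are
# illustrative assumptions, not a psychoacoustic model.

def direction_from_ild(left_db, right_db, threshold_db=1.0):
    """Classify source direction from levels (dB) at the two ears."""
    ild = right_db - left_db
    if ild > threshold_db:
        return "right"
    if ild < -threshold_db:
        return "left"
    return "center"

print(direction_from_ild(62.0, 55.0))  # "left": more energy at the left ear
print(direction_from_ild(50.0, 58.0))  # "right"
print(direction_from_ild(57.0, 57.5))  # "center"
```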

3.1.1 Skepticism about Spatial Audition

Notably, a tradition of skepticism about audition’s spatiality exists in philosophy. Certainly, our capacity to glean information about space is less acute in audition than in vision. Vision reveals fine-grained spatial details that audition cannot convey, such as patterns and textures. But philosophers who are skeptical about spatial audition are not just concerned about a difference in spatial acuity between audition and vision. Malpas says of the expression, ‘the location of sound’:

I do not mean by ‘location’ ‘locality’, but ‘the act of locating’, and by ‘the act of locating’ I do not mean ‘the act of establishing in a place’, but ‘the act of discovering the place of’. Even so ‘location’ is misleading, because it implies that there is such a thing as discovering the place of sounds. Since sounds do not have places there is no such act. (Malpas 1965, 131)

O’Shaughnessy states, “…We absolutely never immediately perceive sounds to be at any place. (Inference from auditory data being another thing)” (O’Shaughnessy 2002, 446). The claim is that, in contrast to visible objects, audible sounds are not experienced as having locations. Rather, we determine the places of sounds and sources from acoustic features, such as loudness and interaural differences, that bear information about distance and direction. We do not auditorily experience spatial features.

This debate, and the purported contrast between vision and audition, has consequences for perceptual theorizing. One route to the conclusion that hearing sounds involves auditory awareness of sensations involves denying that audition satisfies spatial prerequisites on experiencing sounds as objective or public. For instance, Maclachlan (1989) claims that audition’s phenomenology—in particular, its aspatial phenomenology—provides reasons to think sounds are sensations. Comparing sounds with pains, which we readily recognize as sensations, he says, “[A]lthough the sounds we hear are just as much effects produced in us as are the pains produced by pins and mosquitoes, there is no variety in the location of these effects [the sounds]. Because of the lack of contrast, we are not even aware that the sounds we hear are bodily sensations” (Maclachlan 1989, 31, my emphasis). Maclachlan means that, in contrast even to the case of pains, which are felt at different bodily locations, sounds are not experienced to be at differing locations, and so we are not even inclined to recognize that they are bodily sensations. Maclachlan then suggests that we associate sounds with things and happenings outside the body rather than appreciate that they are effects in us. Given the lack of spatial variation among experienced sounds, we projectively associate sounds with distal sources. This explanation assumes that experienced sounds exhibit no audibly apparent spatial variation: sounds seem located at the ears or lack apparent location altogether. Denying that auditory experiences present sounds at varying locations beyond the ears invites difficulty in finding a place for sounds in the world. If audition is wholly aspatial, this may encourage a retreat to the view that sounds lack locations outside the mind.

This kind of strategy has companions and precursors. Lycan’s suggestion that olfactory experiences are apparent as modifications of one’s own consciousness depends heavily on the aspatial phenomenology of olfactory experience (Lycan 2000, 278–82). Each recalls the Kantian claim that objectivity requires space, or that grasping something as independent from oneself requires the experience of space, a version of which is deployed by Strawson (1959, ch 2) in his famous discussion of sounds.

Two lines of response are open. The first appeals to the thriving empirical research program in “spatial hearing” (see, e.g., Blauert 1997). Scientists aim to discover the cues and perceptual mechanisms that ground spatial audition, such as interaural time and level differences, secondary and reverberant signals, and head-related transfer functions. Audition clearly cannot match vision’s singular acuity—vision’s resolution limit is nearly two orders of magnitude better than audition’s (Blauert 1997, 38–9). Nevertheless, this research strongly supports the claim that human subjects auditorily perceive such spatial characteristics as direction and distance.
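
As a concrete illustration of one such cue, an interaural time difference can be estimated by cross-correlating the two ear signals and selecting the lag of maximal correlation. The sketch below shows only this core idea, with assumed parameter values; actual models of spatial hearing of the sort Blauert surveys add band-pass filtering and head-related transfer functions.

```python
# Minimal sketch of interaural time difference (ITD) estimation:
# cross-correlate the left- and right-ear signals and take the lag with
# maximal correlation. Real spatial-hearing models add band-pass
# filtering and head-related transfer functions; this is only the core
# lag-estimation idea, with assumed parameter values.
import numpy as np

def estimate_itd(left, right, sample_rate, max_lag_s=0.0008):
    """Return the delay (s) of the right-ear signal relative to the left."""
    max_lag = int(max_lag_s * sample_rate)

    def corr(lag):
        # correlation of left[n] with right[n + lag] over the overlap
        if lag >= 0:
            return np.dot(left[:len(left) - lag], right[lag:])
        return np.dot(left[-lag:], right[:lag])

    best = max(range(-max_lag, max_lag + 1), key=corr)
    return best / sample_rate

# Example: a noise burst reaches the right ear 0.3 ms after the left,
# as from a source off to the listener's left.
rate = 48000
rng = np.random.default_rng(0)
src = rng.standard_normal(4800)
delay = int(0.0003 * rate)  # 0.3 ms in samples
left = src
right = np.concatenate([np.zeros(delay), src[:-delay]])
print(estimate_itd(left, right, rate))  # ~0.0003 (about 0.3 ms)
```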

Second, a number of philosophers have objected on phenomenological grounds. Audition, they argue, involves experiencing or perceptually representing such spatial characteristics as direction and distance (Pasnau 1999, Casati and Dokic 2005, Matthen 2005, O’Callaghan 2007, 2010). Introspection and performance support the claim that sounds or sound sources are in many ordinary cases perceptually experienced as located in the environment at a distance in some direction. We hear the sound of the knocking over near the door; we hear footsteps approaching from behind and to the left; hearing sound to “fill” a room is itself a form of spatial hearing. Though hearing is more error prone than vision, we frequently do not need to work out the locations of sounds or sources—we simply hear them.

A subtler form of skepticism about spatial audition aims just to block the requirements on objectivity. Strawson (1959) famously argues in Chapter 2 of Individuals that because auditory experience is not intrinsically spatial—spatial concepts have no intrinsic auditory significance—a purely auditory experience would be non-spatial. Thus, it would not satisfy the requirements on non-solipsistic consciousness. Others have endorsed versions of Strawson’s claim. “[T]he truth of a proposition to the effect that there is a sound at such-and-such a position must consist in this: if someone was to go to that position, he would have certain auditory experiences,” states Evans (1980, 274).

The claim that audition is not intrinsically spatial admits at least two readings. First, since Strawson suggests that audition might inherit spatial content from other sense modalities, such as vision or touch, it could mean that audition depends for its spatial content upon that of other modalities. If, unlike vision and touch, audition’s spatial capacities are parasitic upon those of other modalities, audition is spatial only thanks to its relations to other intrinsically spatial modalities. Second, it might be understood as a claim about the objects of audition. Strawson indicates that sounds themselves are not intrinsically spatial. He says that although sounds have pitch, timbre, and loudness, they lack “intrinsic spatial characteristics” (1959, 65). Since these interpretations are not clearly distinguished by Strawson, it is helpful to consider his master argument.

Strawson claims that a purely auditory experience would be non-spatial. By “purely auditory experience” Strawson means an exclusively auditory experience, or an auditory experience in the absence of experience associated with any other modality. However, if any modality in isolation ever could ground spatial experience, audition could. On one hand, given the mechanisms of spatial hearing, it is empirically implausible that a normal acoustic environment with rich spatial cues would fail to produce even a minimally spatial purely auditory experience. Even listening only to stereo headphones could produce a directional auditory experience. On the other hand, it does seem possible that there could be a non-spatial but impoverished exclusively auditory experience if no binaural or other spatial cues were present. But similarly impoverished, non-spatial experiences seem possible for other modalities. Consider visually experiencing a uniform gray ganzfeld, or floating weightlessly in a uniformly warm bath. Neither provides the materials for spatial concepts, so neither differs from audition in this respect. One might contend that we therefore lack a good reason to think that, in contrast to a purely visual or tactile experience, a purely auditory experience would be an entirely non-spatial experience (see O’Callaghan 2010).

Nudds (2001) suggests another way to understand the claim, and interprets Strawson as making an observation about the internal structure of audition:

When we see (or seem to see) something, we see it as occupying or as located within a region of space; when we hear (or appear to hear) a sound we simply hear the sound, and we don’t experience it as standing in any relation to the space it may in fact occupy. (Nudds 2001, 213–14)

Audition, unlike vision, lacks a spatial structure or field, claims Nudds. A purely auditory experience thus would not comprise a spatial field into which individuals independent from oneself might figure. Following an example from Martin (1992), Nudds argues that while vision involves awareness of unoccupied locations, audition does not involve awareness of regions of space as empty or unoccupied. Martin’s example is seeing the space in the center of a ring as empty. In audition, Nudds claims, one never experiences a space as empty or unoccupied.

In response, one might simply deny a difference between vision and audition in this respect. If one can attend to a location near the center of the visible ring as empty, one can attend to the location between the sounding alarm clock and the slamming door as a place where there is no audible sound—as acoustically empty space. Of course, auditory space generally is less replete than visual space, but this is contingent. Consider seeing just a few stars flickering on and off against a dark sky. Since such an experience may have spatial structure, and since it is analogous to audition, one might on these grounds defend the claim that audition has spatial structure (see also Young 2017).

What about the second way mentioned above to understand Strawson’s claim? Though audition’s status as intrinsically spatial may not differ from that of vision or touch, perhaps sounds are not intrinsically spatial. But without further argument, or a commitment to a theory of sounds, it is difficult to state confidently the intrinsic features of sounds and thus whether they include spatial features. If, for instance, wavelength is among a sound’s intrinsic features, sounds are intrinsically spatial.

Nonetheless, the claim might be that, as they are perceptually experienced to be, sounds lack apparent intrinsic or non-relational spatial features. Roughly, independent from spatial relations to other sounds, experienced sounds seem to lack internal spatial structure. That is why you cannot auditorily experience the empty space at the center of a sound or hear its edges. Interpreted as such—that sounds are not experienced or perceptually represented to have inherent spatial features—the claim is plausible (though consider diffuse or spread out sounds in contrast to focused or pinpoint sounds). It certainly marks an important difference from vision, whose objects frequently not only seem to have rich internal spatial structure, but also are individuated in terms of inherent spatial features.

This difference, however, does not ground an argument that any purely auditory experience is non-spatial or that sounds fail to satisfy the requirement on objectivity, since sounds’ being experienced to have internal, intrinsic, or inherent spatial characteristics is necessary neither for spatial auditory experience nor to experience sounds as objective. Since sounds phenomenologically seem to be located in space and to bear extrinsic spatial relations to each other, auditory experience satisfies the requirements for objectivity, which need only secure the materials for a conception of a place for sounds to exist when not experienced.

So, vision and audition differ with respect to space in two ways. First, vision’s spatial acuity surpasses that of audition. Second, vision’s objects are perceptually experienced to have rich internal spatial structure, and audition’s are not. However, given the spatial characteristics evident in audition, such as direction and distance, the spatial status of audition presents no barrier to understanding its objects as perceiver-independent. The spatial aspects of auditory phenomenology thus may fail to ground an argument to the conclusion that sounds are modifications of one’s consciousness. If that is the case, then audition provides no special intuitive support for accounts on which private entities are the direct objects of perception.

3.2 Audible Qualities

According to theories in which sounds are individuals, sounds are not secondary or sensible qualities. But, humans hear audible qualities, such as pitch, loudness, and timbre, that are analogous to colors, tastes, and scents. Thus, familiar accounts of colors and other sensible attributes or secondary qualities might apply to the audible qualities. For instance, pitches might be either dispositions to cause certain kinds of experiences in suitable subjects, the physical or categorical bases of such dispositions, sensations or projected features of auditory experiences, or simple primitive properties of (actual or edenic) sounds.

Tradition suggests that the form of a philosophical account of visible qualities, such as color, and their perception applies to other sensible qualities, such as pitch, flavor, and smell, and their perception. Thus, according to tradition, if dispositionalism, physicalism, projectivism, or primitivism about sensible qualities is true for features associated with one modality, it is true for features associated with others. Despite tradition, we should be wary to accept that a theory of sensible qualities translates plausibly across the senses.

Debates about sensible qualities and their perception begin with concerns about whether sensible features can be identified with or reduced to any objective physical features. What follows has two aims. The first is to give a sense of how such debates might go in the case of audible qualities. The focus is on pitch, since pitch is often compared to color, and the case of color is well known (for discussion of similar questions concerning timbre, see Isaac 2018). The second is to point out the most salient differences and similarities between the cases of color and pitch that impact the plausibility of arguments translated from one case to the other.

First, I consider two noteworthy arguments that are founded on aspects of color perception. Each aims to establish that the colors we perceive cannot be identified with objective physical features. Neither argument transposes neatly to the case of pitch. Thus, we should not assume arguments that are effective in the case of color have equal force when applied to other sensible qualities. Color perhaps is a uniquely difficult case.

Second, however, I discuss two respects in which pitch experience is similar to color experience. It is instructive that these aspects of pitch experience do raise difficulties for an objective physical account of pitch that are familiar from the case of color.

What are pitch, timbre, and loudness? Pitch is a dimension along which tones can be ordered according to apparent “height”. The pitch of fingernails scratching a blackboard generally is higher than that of thumping a washtub. Loudness can be glossed as the volume, intensity, or quantity of sound. A jet plane makes louder sounds than a model plane. Timbre is more difficult to describe. Timbre is a quality in which sounds that share pitch and loudness might differ. So, a violin, a cello, and a piano all playing the same note differ in timbre. Sometimes timbre is called “tone color”.

Physics and psychoacoustics show that properties including frequency, amplitude, and wave shape determine the audible qualities sounds (auditorily) appear to have. To simplify, take the case of pitch, since pitch often is compared to color. Not all sounds appear to have pitch. Some sounds appear to have pitch thanks to a simple, sinusoidal pattern of vibration at some frequency in an object or in the air. Some sounds appear pitched thanks to a complex pattern of vibration that can be decomposed into sinusoidal constituents at multiple frequencies, since any pattern of vibration can be analyzed as some combination of simple sinusoids. Sounds appear pitched, however, just when they have sinusoidal constituents, or partials, that all are integer multiples of a common fundamental frequency. Sounds with pitch thus correspond to regular or periodic patterns of vibration that differ in fundamental frequency and complexity. Simple sinusoids and complex waveforms match in pitch (though they typically differ in timbre) when they share fundamental frequency. This is true even when the complex tone lacks a sinusoidal constituent at the fundamental frequency, which is referred to as the phenomenon of the missing fundamental.
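
To make the missing fundamental concrete, here is a minimal sketch using NumPy (the 200 Hz fundamental and 44.1 kHz sample rate are illustrative values, not drawn from the text):

```python
import numpy as np

# One second of a complex tone whose partials are integer multiples of a
# 200 Hz fundamental, with no energy at 200 Hz itself.
sample_rate = 44100              # samples per second (assumed value)
f0 = 200.0                       # common fundamental in Hz (assumed value)
t = np.arange(sample_rate) / sample_rate

# Partials at 2*f0, 3*f0, and 4*f0 share the fundamental f0, so the
# waveform repeats every 1/f0 seconds and is heard with a 200 Hz pitch,
# despite containing no sinusoidal component at 200 Hz.
tone = sum(np.sin(2 * np.pi * k * f0 * t) for k in (2, 3, 4))
tone = tone / np.abs(tone).max()   # normalize amplitude to [-1, 1]
```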

A straightforward account identifies pitch with periodicity (perhaps within some range). Having pitch is being periodic (see O’Callaghan 2007, ch. 6). Periodicity can be expressed in terms of fundamental frequency, so individual pitches are fundamental frequencies. This has advantages as an account of pitch. It captures the linear ordering of pitches. It also explains the musical intervals, such as the octave, fifth, and fourth, for example, which are pitch relations that hold among periodic tones. Musical intervals correspond to whole-number ratios between fundamental frequencies. Sounds that differ by an octave have fundamental frequencies that stand in 1:2 ratios. Fifths involve a 2:3 relationship, fourths are 3:4, and so on. This also allows us to revise the linear pitch ordering to accommodate the auditory sense in which tones that differ by an octave nonetheless are the same pitch. If the pitch ordering is represented as a helix, upon which successive octave-related tones fall at a common angular position, each full rotation represents a doubling of frequency.
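
As a worked illustration (using the concert-pitch convention A4 = 440 Hz, an assumption not drawn from the text), the interval ratios can be made explicit:

$$
\frac{f_{A5}}{f_{A4}} = \frac{880\ \mathrm{Hz}}{440\ \mathrm{Hz}} = \frac{2}{1}\ \text{(octave)}, \qquad
\frac{f_{E5}}{f_{A4}} = \frac{660\ \mathrm{Hz}}{440\ \mathrm{Hz}} = \frac{3}{2}\ \text{(fifth)}.
$$

One natural way to formalize the helix is to assign a tone of frequency $f$ the angular position $\theta(f) = 2\pi \log_2(f/f_{\mathrm{ref}}) \bmod 2\pi$ for some reference frequency $f_{\mathrm{ref}}$, so that tones related by a frequency doubling (an octave) land at the same angle on successive turns of the helix.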

Is the periodicity theory of pitch plausible as an account of the audible features we perceive when hearing sounds? If so, then objective physicalism about at least some sensible qualities might succeed.

3.2.4 Disanalogies with Color

The periodicity theory of pitch fares better on two counts than theories that identify colors with objective physical properties.

First, consider the phenomenological distinction between unique and binary hues. Some colors appear to incorporate other colors, and some do not. Purple, for instance, appears both reddish and bluish; red just looks red. Some philosophers contend that the leading physical theories of color cannot explain the unique-binary distinction without essentially invoking the color experiences of subjects. How, for instance, do reflectance classes identified with unique hues differ from those associated with binary hues?

Consider a related issue concerning pitch. Some tones with pitch sound simple, while other pitched tones, such as sounds of musical instruments, auditorily appear to be complex and to have discernible components. However, the difference between audibly simple and audibly complex pitched tones is captured by the simplicity or complexity of a sound’s partials. Simple tones are sinusoids, and complex tones have multiple overtones. So, one response is to hold that the unique-binary color distinction and the simple-complex pitch distinction are disanalogous. Unlike the case of color, one might contend, no pitch that is essentially a mixture of other pitches solely occupies a distinctive place in pitch space.

Second, consider metamerism. Some surfaces with very different reflectance characteristics match in color. Metameric pairs share no obvious objective physical property. Some philosophers argue that unless color experience fails to distinguish distinct colors, metamers preclude identifying colors with natural physical properties of surfaces (see the entry on color).

Now consider the case of pitch. Are there pitch metamers? Some sounds with very different spectral frequency profiles match in pitch. A simple sinusoidal tone at a given frequency matches the pitch of each complex tone with that fundamental frequency (even those that lack a constituent at the fundamental). But, again, the case of pitch differs from the case of color. For each matching pitch, a single natural property does unify the class. The tones all share a fundamental frequency.

3.2.5 Analogies with Color

Two kinds of argument familiar from the case of color are equally pressing when applied to the case of pitch.

First, arguments from intersubjective variation transpose. Actual variations in frequency sensitivity exist among perceivers; for instance, subjects differ in which frequency they identify as middle C. If there is no principled way to legislate whose experience is veridical, pitch might be subjective or perceiver-relative. One response is that, in contrast to the case of unique red, there is an objective standard for middle C: fundamental frequency. But, whose pitch experience has the normative significance to settle the frequency of middle C?

Some might wonder whether there is a pitch analog of the trouble posed by the kind of variation associated with spectrum inversion in the case of color (see the entry on inverted qualia ). Spectral shift in pitch, sometimes dramatic, commonly occurs after cochlear implant surgery. This is not spectral inversion for pitch; but, a dramatic shift makes most of the same trouble as inversion. Not quite all the trouble, since cochlear implants preserve the pitch ordering and its direction. But, there could be a cochlear implant that switched the placement of electrodes sensitive to 100 hertz and 1000 hertz, respectively; and there could be one that reversed the entire electrode ordering. This goes some distance to grounding the conceivability of a pitch inversion that reverses the height ordering of tones.

Second, consider an argument that frequencies cannot capture the relational structure among the pitches. This is loosely analogous to the argument that physicalism about color fails to capture the relational structure of the hues—for instance, that red is more similar to orange than either is to green. In the case of pitch, psychoacoustics experiments show that perceived pitch does not map straightforwardly onto frequency. Though each unique pitch corresponds to a unique frequency (or small frequency range), the relations among apparent pitches do not match those among frequencies. In particular, equivalent pitch intervals do not correspond to equal frequency intervals. For example, the effect upon perceived pitch of a 100 hertz change in frequency varies dramatically across the frequency range. It is dramatic at low frequency and barely detectable at high frequency. Similarly, doubling frequency does not make for equivalent pitch intervals. A 1000 hertz tone must be tripled in frequency to produce the same increase in pitch as that produced by quadrupling the frequency of a 2000 hertz tone. Apparent pitch is a complex function of frequency; it is neither linear nor logarithmic (see, e.g., Hartmann 1997, ch. 12; Gelfand 2004, ch. 12; Zwicker and Fastl 2006, ch. 5). Pitch scales that capture the psychoacoustic data assign equal magnitudes, commonly measured in units called mels, to equal pitch intervals. The mel scale of pitch thus is an extensive or numerical pitch scale, in contrast to the intensive frequency scale for pitch. The former, but not the latter, preserves ratios among pitches.
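
The nonlinearity can be made concrete with a short sketch. The conversion below is one widely used modern approximation of the mel scale (the 2595/700 parameterization often attributed to O'Shaughnessy); the text does not commit to any particular formula, so the exact numbers are illustrative:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    # One common approximation of the mel scale (an assumption here;
    # the entry does not specify a formula).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The same 100 Hz frequency step corresponds to very different pitch
# steps at different points in the frequency range:
for low, high in ((100, 200), (1000, 1100), (8000, 8100)):
    print(f"{low}->{high} Hz: {hz_to_mel(high) - hz_to_mel(low):.1f} mels")
# 100->200 Hz:   ~132.7 mels
# 1000->1100 Hz:  ~64.4 mels
# 8000->8100 Hz:  ~12.9 mels
```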

S. S. Stevens famously argued on the basis of results drawn from psychoacoustic experiments that pitch is not frequency (see, e.g., Stevens et al. 1937, Stevens and Volkmann 1940). In light of similar results, contemporary psychoacoustics researchers commonly reject the identification of pitch with frequency or periodicity. The received scientific view thus holds that pitch is a subjective or psychological quality that is no more than correlated with objective frequency (see, e.g., Gelfand 2004, Houtsma 1995). Pitch, on this understanding, belongs only to experiences. The received view of pitch therefore implies an error theory according to which pitch experience involves a widespread projective illusion.

What is the argument against the periodicity theory of pitch? Compare an argument against reflectance physicalism about color. Reflectance physicalism identifies each hue with a class of reflectances. Periodicity physicalism identifies each pitch with a fundamental frequency. In both cases, each determinate sensible feature is identified with a determinate physical property. In the color case, it is objected that reflectance classes do not bear the relations to each other that the colors bear. In the pitch case, the frequencies do not bear the relations to each other that the pitches bear. Thus, if the relational features among a class of sensible qualities are essential to them, an account that does not accurately capture those relations fails. Frequencies, according to this line of argument, do not stand in the relations essential to pitch.

This, of course, is a quite general phenomenon among sensible qualities. Brightness and loudness vary logarithmically with simple physical quantities. Even if we identified candidate molecules for smells, nothing suggests physical similarities would mirror their olfactory similarities.

In the case of pitch and other sensible features that can be put in a linear ordering, one might respond that the relational order is essential while the magnitudes are not. In that case, if pitch is frequency, pitch experience has the right structure, but distorts magnitudes of difference in pitch. This retains the periodicity theory and explains away the results in terms of pitch experiences.

Nonetheless, Pautz (2014, 3.5) has replied that this partial error account cannot be reconciled with certain types of possible intersubjective difference. So, suppose instead we accept that the mel scale is well-founded and that it accurately captures essential relationships among pitches. This does not by itself imply a projective or subjective theory of pitch. Pitches might be dispositions to produce certain kinds of experiences, or they might be simple or primitive properties. It also is open to seek a more adequate physical candidate for pitch. For instance, pitches might be far more complex physical properties than frequencies. Such physical properties may be of no interest in developing the simplest, most complete natural physical theory, but they may be anthropocentrically interesting.

It is an important question whether a physical theory of sensible features should just provide a physical candidate for each determinate sensible feature, or whether the physical relationships among those physical candidates should capture the structural relations among sensible qualities (and, if so, which structural relations it should capture). This is an example of how considering in detail the nature and the experience of sensible qualities other than color promises insights into traditional debates concerning the sensible qualities. Pautz (2014) offers an empirically-grounded argument concerning a variety of sensible qualities, including audible qualities, that advances such discussion.

4. Varieties of Auditory Perception

4.1 Musical Listening

Musical listening is a topic that bears on questions about the relationship between hearing sounds and hearing sources. While the philosophy of music has its own vast literature (see the entry on the philosophy of music ), musical experience has not been used as extensively to explore general philosophical questions about auditory perception. This section discusses links that are relevant to advancing philosophical work on auditory perception.

An account of listening to pure or non-vocal music should capture the aesthetic significance of musical listening. Appreciating music is appreciating sounds and sequences, arrangements, or structures of sounds. Thus, the temporal aspects of auditory experiences are critical to appreciatively listening to music.

One might go further and hold that sounds are all that matters in music. In particular, some have argued that appreciatively listening to music demands listening in a way that abstracts from the environmental significance, and thus from the specific sources, of the sounds it includes (Scruton 1997, 2–3). Such acousmatic listening involves experiencing sounds in a way that is “detached from the circumstances of their production,” rather than “as having a certain worldly cause” (Hamilton 2007, 58; see also Hamilton 2009). Listening to music and being receptive to its aesthetically relevant features requires not listening to violins, horns, or brushes on snare drums. It requires hearing sounds and grasping them in a way removed from their common sources. Hearing a high fidelity recording thus furnishes an aesthetically identical musical experience despite having a speaker cone rather than a violin as source. “The acousmatic experience of sound is precisely what is exploited by the art of music” (Scruton 1997, 3).

This suggests an intuitive difference between music and visual arts such as painting and sculpture. As Kivy (1991) explains, it is difficult even with the most abstract paintings and sculptures to see them in a way that takes them to be entirely formal or abstract. That is, it is difficult to avoid seeing pictures and sculptures as representational. In contrast, it seems easier to listen attentively to the formal acoustical features of musical sounds, without being compelled to think of what makes them.

Musical listening thus may be thought to provide a prima facie argument against the claim that in hearing sounds one typically hears sound sources such as the strumming of guitars and bowing of violins. If such “interested” audition were the rule, musical listening would be far more challenging.

4.1.2 Acousmatic Listening as Attention to Sounds

Acousmatic experience, however, may be a matter of attention . Nothing prevents focusing one’s attention on the sounds and audible qualities without attending to the instruments, acts, and events that are their sources, even if each is auditorily available. That musical listening requires effort and training supports the idea that one can direct attention differently in auditory experience, depending on one’s interests. Caring for an infant and safely crossing the street require attending to sound sources, while listening with aesthetic appreciation to a symphony may require abstracting from the circumstances of its production, such as the finger movements of the oboist. This response holds that musical listening is a matter of auditorily attending in a certain way. It is attending to features of sounds themselves, but does not imply failing to hear sound sources.

The acousmatic thesis is a limited view about which aspects of the things one can auditorily experience are aesthetically significant. These include audible aspects of sounds themselves, but exclude, for example, other contents of auditory experience. However, room exists for debate over the aesthetically significant aspects of what you hear (see Hamilton 2007, 2009). For example, one might argue that live performances have aesthetic advantages over recordings because one hears the performance of the sounds and songs, rather than their reproduction by loudspeakers (cf. Mag Uidhir 2007). Circumstances of sound production, such as that skillful gestures generate a certain passage, or that a particularly rare wood accounts for a violin’s sounds, might be aesthetically relevant in a way that outstrips the sounds, and some such features may be audible in addition to sounds. For instance, hearing the spatial characteristics of a performance may hold aesthetic significance beyond the tones and structures admitted by traditional accounts of musical listening. Composers may even intend “spatial gestures” among aspects essential for the appreciation of a piece (see, e.g., Solomon 2007). To imagine auditorily experiencing the spatial characteristics of music in a way entirely divorced from the environmental significance of the sounds is difficult. Appreciating the relationship between experiences of sounds and of sources makes room for a view of the aesthetic value of musical listening that is more liberal than acousmatic experience allows.

4.2 Speech Perception

Speech perception presents uniquely difficult twists, and few philosophers have confronted it directly (Appelbaum 1999, Trout 2001a, Matthen 2005, ch 9, and Remez and Trout 2009 are recent exceptions). Something striking and qualitatively distinctive—perhaps uniquely human—seems to set the perception of speech apart from ordinary hearing. The main philosophical issues about speech perception concern versions of the question, Is speech special? (See O’Callaghan 2015 for a comprehensive review and discussion.)

How does perceiving speech differ from perceiving ordinary non-linguistic sounds? Listening to music and listening to speech each differ from listening to other environmental sounds in the following respect. In each case, one’s interest in listening is to some degree distanced from the specific environmental happenings involved in the production of sounds.

But this is true of listening to music and of listening to speech for different reasons. In music, it is plausible that one’s interest is in the sounds themselves, rather than in the sources of their production. However, speech is a vehicle for conventional linguistic meaning. In listening to speech, one’s main interest is in the meanings, rather than in the sources of sound. Ultimately, the information conveyed is what matters.

Nevertheless, according to the most common philosophical understanding, perceiving spoken utterances is just a matter of hearing sounds. The sounds of speech are complex audible sound structures. Listening to speech in a language you know typically involves grasping meanings, but grasping meanings requires first hearing the sounds of speech. According to this account, grasping meanings itself is a matter of extra-perceptual cognition.

The commonplace view—that perceiving speech is a variety of ordinary auditory perception that just involves hearing the sounds of speech—has been challenged in a number of ways. The challenges differ in respect of how speech perception is held to differ from non-linguistic audition.

First, consider the objects of speech perception. What are the objects of speech perception, and do they differ from those of ordinary or non-linguistic auditory perception? According to the commonplace understanding, hearing speech involves hearing sounds. Thus, hearing spoken language shares perceptual objects with ordinary audition. Alternatively, one might hold that the objects of speech perception are not ordinary sounds at all. Perhaps they are language-specific entities, such as phonemes or words. Perhaps, as some have argued, perceiving speech involves perceiving articulatory gestures or movements of the mouth and vocal organs (see the supplement on Speech Perception: Empirical and Theoretical Considerations). Note that if audition’s objects typically include distal events, speech in this respect is not special, since its objects do not belong to an entirely different kind from ordinary sounds.

Second, consider the contents of speech perception. Does the content of speech perception differ from that of ordinary audition? If it does, how does the experience of perceiving speech differ from that of hearing ordinary sounds? Perceiving speech might involve hearing ordinary sounds but auditorily ascribing distinctive features to them. These features might simply be, or comprise, finer grained qualitative and temporal acoustical details than non-linguistic sounds audibly possess. But perceiving speech also might involve perceiving sounds as belonging to language-specific types, such as phonemes, words, or other syntactic categories.

Furthermore, speech perception’s contents might differ in a more dramatic way from those of non-linguistic audition. Listening with understanding to speech involves grasping meanings. The commonplace view is conservative. It holds that grasping meanings is an act of the understanding rather than of audition. Thus, the difference between the experience of listening to speech in a language you know and the experience of listening to speech in a language you do not know is entirely cognitive.

But one might think that there also is a perceptual difference. A liberal account of this perceptual difference holds that perceiving speech in a language you know may involve hearing sounds as meaningful or auditorily representing them as having semantic properties (see, e.g., Siegel 2006, Bayne 2009, Azzouni 2013, Brogaard 2018; cf. O’Callaghan 2011b, Reiland 2015). Alternatively, a moderately liberal account holds that the perceptual experience of speech in a language you know involves perceptually experiencing language-specific but nevertheless non-semantic features. For instance, O’Callaghan (2011b) argues that listening to speech in a familiar language typically involves perceiving its phonological features.

4.2.4 Is Speech Perception Auditory?

Third, consider the processes responsible for speech perception. To what extent does perceiving speech implicate processes that are continuous with those of ordinary or general audition, and to what extent does perceiving speech involve separate, distinctive, or modular processes? While some defend general auditory accounts of speech perception (see, e.g., Holt and Lotto 2008), some argue that perceiving speech involves dedicated perceptual resources, or even an encapsulated perceptual system distinct from ordinary non-linguistic audition (see, e.g., Fodor 1983, Pinker 1994, Liberman 1996, Trout 2001b). These arguments typically are grounded in several types of phenomena, including the multimodality of speech perception—visual cues about the movements of the mouth and tongue impact the experience of speech, as demonstrated by the McGurk effect (see section 4.3 Crossmodal Influences); duplex perception—a particular stimulus sometimes contributes simultaneously both to the experience of an ordinary sound and to that of a speech sound (Rand 1974); and the top-down influence of linguistic knowledge upon the experience of speech. A reasonable challenge is that each of these characteristics—multimodality, duplex perception, and top-down influence—also is displayed in general audition.

See the supplement on Speech Perception: Empirical and Theoretical Considerations.

4.3 Crossmodal Influences

Auditory perception of speech is influenced by cues from vision and touch (see Gick et al. 2008). The McGurk effect in speech perception leads to an illusory auditory experience caused by a visual stimulus (McGurk and MacDonald 1976). Do such multimodal effects occur in ordinary audition? Visual and tactile cues commonly do shape auditory experience. The ventriloquist illusion is an illusory auditory experience of location that is produced by an apparent visible sound source (see, e.g., Bertelson 1999). Audition even impacts experience in other modalities. The sound-induced flash effect involves a visual illusion as of seeing two consecutive flashes that is produced when a single flash is accompanied by two consecutive beeps (Shams et al. 2000, 2002). Such crossmodal illusions demonstrate that auditory experience is impacted by other modalities and that audition influences other modalities. In general, experiences associated with one perceptual modality are influenced by stimulation to other sensory systems.

An important question is whether the impact is merely causal, or whether perception in one modality is somehow constitutively tied to other modalities. If, for instance, vision merely causally impacts your auditory experience of a given sound, then processes associated with audition might be proprietary and characterizable in terms that do not appeal to other modalities. Relying on information from vision or touch could simply improve the existing capacity to perceive space, time, or spoken language auditorily. On the other hand, coordination between audition and other senses could enable a new perceptual capacity. In that case, audition might rely constitutively on another sense.

A first step in resolving this question is recognizing that crossmodal illusions are not mere accidents. Instead, they are intelligible as the results of adaptive perceptual strategies. In ordinary circumstances, crossmodal processes serve to reduce or resolve apparent conflicts in information drawn from several senses. In doing so, they tend to make perception more reliable overall. Thus, crossmodal illusions differ from synaesthesia. Synaesthesia is just a kind of accident. It results from mere quirks of processing, and it always involves illusion (or else is accidentally veridical). Crossmodal recalibrations, in contrast, are best understood as attempts “to maintain a perceptual experience consonant with a unitary event” (Welch and Warren 1980, 638).

In the first place, the principled reconciliation of information drawn from different sensory sources suggests that audition is governed by extra-auditory perceptual constraints. Moreover, since conflict requires a common subject matter, such constraints must concern common sources of stimulation to multiple senses. If so, audition and vision share a perceptual concern for a common subject matter. And that concern is reflected in the organization of auditory experience. But this by itself does not establish constitutive dependence of audition on another sense.

However, the perceptual concern for a common subject matter could be reflected as such in certain forms of auditory experience. For instance, the commonality may be experientially evident in jointly perceiving shared spatio-temporal features, or in the perceptual experience of audio-visual intermodal feature binding. If so, some forms of auditory perceptual experience may share with vision a common multimodal or amodal content or character (see O’Callaghan 2008b, Clark 2011). More to the point, if coordination with another sense enables a new auditory capacity, then vision or touch could have a constitutive rather than merely causal impact upon corresponding auditory experiences.

What hangs on this? First, it bears on questions about audition’s content. If we cannot exhaustively characterize auditory experience in terms that are modality-specific or distinctive to audition, then we might hear as of things we can see or experience with other senses. This is related to one puzzling question about hearing sound sources: How could you hear as of something you could see? Rather than just a claim about audition’s content that requires further explanation, we now have a story about why things like sound sources figure in the content of auditory experience. Second, all of this may bear on how to delineate what counts as auditory perception, as opposed to visual or even amodal perception. If hearing is systematically impacted by visual processes, and if it shares content and phenomenology with other sense experiences, what are the boundaries of auditory perception? Multimodal perception may bear on the question of whether there are clear and significant distinctions among the sense modalities (cf. Nudds 2003). Finally, multimodal perceptual experiences, illusions, and explanatory strategies may illuminate the phenomenological unity of experiences in different modalities, or the sense in which, for instance, an auditory experience and a visual experience of some happening comprise a single encompassing experience (see the entry on the unity of consciousness ).

We can ask questions about the relationships among modalities in different areas of explanatory concern. Worthwhile areas for attention include the objects, contents, and phenomenology of perception, as well as perceptual processes and their architecture. Crossmodal and multimodal considerations might shed doubt on whether vision-based theorizing alone can deliver a complete understanding of perception and its contents. This approach constitutes an important methodological advance in the philosophical study of perception (for further discussion, see O’Callaghan 2012, 2019, Matthen 2015, Stokes et al. 2015).

Considering modalities other than vision enhances our understanding of perception. It is necessary for developing and vetting an adequate comprehensive and general account of perception and its roles. Auditory perception is a rich territory for philosophical exploration in its own right, but it also provides a useful contrast case in which to evaluate claims about perception proposed in the context of vision. One of the most promising directions for future work concerns the nature of the relationships among perceptual modalities, how these relationships shape experience across modalities, and how they may prove essential to understanding perception itself. Philosophical work on auditory perception thus is part of the advance beyond considering modalities in isolation from each other.

  • Appelbaum, I., 1996, “The lack of invariance problem and the goal of speech perception,” ICSLP-1996 , 3(435): 1541–1544.
  • –––, 1999, “The dogma of isomorphism: A case study from speech perception,” Philosophy of Science , 66 (Supplement. Proceedings of the 1998 Biennial Meetings of the Philosophy of Science Association. Part I: Contributed Papers): S250–S259.
  • Azzouni, J., 2013, Semantic Perception: How the Illusion of a Common Language Arises and Persists , Oxford: Oxford University Press.
  • Batty, C., 2010, “Scents and sensibilia,” American Philosophical Quarterly , 47: 103–118.
  • –––, 2011, “Smelling lessons,” Philosophical Studies , 153: 161–174.
  • Bayne, T., 2009, “Perception and the reach of phenomenal content,” Philosophical Quarterly , 59: 385–404.
  • Bermúdez, J. L., 2000, “Naturalized sense data,” Philosophy and Phenomenological Research , 61(2): 353–374.
  • Bertelson, P., 1999, “Ventriloquism: A case of cross-modal perceptual grouping,” in G. Aschersleben, T. Bachmann, and J. Müsseler (eds.), Cognitive Contributions to the Perception of Spatial and Temporal Events , Amsterdam: Elsevier, pp. 347–362.
  • Blauert, J., 1997, Spatial Hearing: The Psychophysics of Human Sound Localization , Cambridge, MA: MIT Press.
  • Bloomfield, L., 1933, Language , New York: Holt.
  • Blumstein, S. E. and K. N. Stevens, 1981, “Phonetic features and acoustic invariance in speech,” Cognition , 10: 25–32.
  • Bosch, L. and N. Sebastián-Gallés, 1997, “Native-language recognition abilities in 4-month-old infants from monolingual and bilingual environments,” Cognition , 65(1): 33–69.
  • Bregman, A. S., 1990, Auditory Scene Analysis: The Perceptual Organization of Sound , Cambridge, MA: MIT Press.
  • Brogaard, B., 2018, “In defense of hearing meanings,” Synthese , 195: 2967–2983.
  • Bullot, N. and P. Egré (eds.), 2010, Objects and Sound Perception , Review of Philosophy and Psychology , 1.
  • Casati, R. and J. Dokic, 1994, La Philosopie du Son , Nîmes: Chambon.
  • –––, 2005, “Sounds,” in The Stanford Encyclopedia of Philosophy (Spring 2009 Edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/spr2009/entries/sounds/ >.
  • Casati, R., E. Di Bona, and J. Dokic, 2013, “The Ockhamization of the event sources of sound,” Analysis , 73(3): 462–466.
  • Clark, A., 2000, A Theory of Sentience , New York: Oxford University Press.
  • –––, 2011, “Cross-modal cuing and selective attention,” in F. MacPherson (ed.), The Senses . Oxford: Oxford University Press.
  • Cohen, J., 2009, “Sounds and temporality,” Oxford Studies in Metaphysics , 5: 303–320.
  • Cooper, F. S., P. C. Delattre, A. M. Liberman, J. M. Borst, and L. J. Gerstman, 1952, “Some experiments on the perception of synthetic speech sounds,” Journal of the Acoustical Society of America , 24: 597–606.
  • Diehl, R. L., A. J. Lotto, and L. L. Holt, 2004, “Speech perception,” Annual Review of Psychology , 55: 149–179.
  • Evans, G., 1980, “Things without the mind—a commentary upon Chapter Two of Strawson’s Individuals ,” in Z. van Straaten (ed.), Philosophical Subjects: Essays Presented to P. F. Strawson , Oxford: Clarendon Press; reprinted in G. Evans, 1985, Collected Papers , Oxford: Clarendon Press.
  • Fodor, J. A., 1983, The Modularity of Mind , Cambridge, MA: MIT Press.
  • Fulkerson, M., 2013, The First Sense: A Philosophical Study of Human Touch , Cambridge, MA: MIT Press.
  • –––, 2016, “Touch,” in The Stanford Encyclopedia of Philosophy (Spring 2016 Edition), Edward N. Zalta (ed.), URL = < https://plato.stanford.edu/archives/spr2016/entries/touch/ >.
  • Fowler, C. A., 1986, “An event approach to the study of speech perception from a direct-realist perspective,” Journal of Phonetics , 14: 3–28.
  • Fowler, G., 2013, “Against the primary sound account of echoes,” Analysis , 73: 466–473.
  • Gelfand, S. A., 2004, Hearing: An Introduction to Psychological and Physiological Acoustics , 4th edition, New York: Marcel Dekker.
  • Gick, B., K. M. Jóhannsdóttir, D. Gibraiel, and J. Mühlbauer, 2008, “Tactile enhancement of auditory and visual speech perception in untrained perceivers,” Journal of the Acoustical Society of America , 123(4): EL72–76.
  • Green, E. J., 2019, “A theory of perceptual objects,” Philosophy and Phenomenological Research , 99(3): 663–693.
  • Griffiths, T. D. and J. D. Warren, 2004, “What is an auditory object?” Nature Reviews Neuroscience , 5: 887–892.
  • Hamilton, A., 2007, Aesthetics and Music . London: Continuum.
  • –––, 2009, “The sound of music,” in Nudds and O’Callaghan 2009, pp. 146–182.
  • Handel, S., 1995, “Timbre perception and auditory object identification,” in B. C. Moore (ed.), Hearing , San Diego, CA: Academic Press, pp. 425–461.
  • Hartmann, W. M., 1997, Signals, Sound, and Sensation , New York: Springer.
  • Heald, S. L. M., S. C. Van Hedger, and H. C. Nusbaum, 2017, “Perceptual plasticity for auditory object recognition,” Frontiers in Psychology , 8: 781.
  • Holt, L. L. and A. J. Lotto, 2008, “Speech perception within an auditory cognitive science framework,” Current Directions in Psychological Science , 17(1): 42–46.
  • Houtsma, A., 1995, “Pitch perception,” in B. C. J. Moore (ed.), Hearing , New York: Academic Press, pp. 267–291.
  • Isaac, A. M. C., 2018, “Prospects for timbre physicalism,” Philosophical Studies , 175(2): 503–529.
  • Jackson, F., 1977, Perception: A Representative Theory , Cambridge: Cambridge University Press.
  • Kivy, P., 1991, Music Alone , Ithaca, NY: Cornell University Press.
  • Kubovy, M., 1981, “Concurrent pitch-segregation and the theory of indispensable attributes,” in M. Kubovy and J. R. Pomerantz (eds.), Perceptual Organization , Hillsdale, NJ: Erlbaum, pp. 55–98.
  • Kubovy, M. and D. Van Valkenburg, 2001, “Auditory and visual objects,” Cognition , 80: 97–126.
  • Kuhl, P. K., 2000, “A new view of language acquisition,” Proceedings of the National Academy of Sciences , 97(22): 11850–11857.
  • Kulvicki, J., 2008, “The nature of noise,” Philosophers’ Imprint , 8(11): 1–16.
  • Leddington, J. P., 2019, “Sounds fully simplified,” Analysis , 79(4): 621–629.
  • Leslie, A. M., F. Xu, P. D. Tremoulet, and B. J. Scholl, 1998, “Indexing and the object concept: developing ‘what’ and ‘where’ systems,” Trends in Cognitive Sciences , 2(1): 10–18.
  • Liberman, A. M., 1970, “The grammars of speech and language,” Cognitive Psychology , 1(4): 301–323.
  • –––, 1996, Speech: A Special Code , Cambridge, MA: MIT Press.
  • Liberman, A. M., F. S. Cooper, D. P. Shankweiler, and M. Studdert-Kennedy, 1967, “Perception of the speech code,” Psychological Review , 74(6): 431–461.
  • Liberman, A. M. and I. G. Mattingly, 1985, “The motor theory of speech perception revised,” Cognition , 21: 1–36.
  • –––, 1989, “A specialization for speech perception,” Science , 243(4890): 489–494.
  • Locke, J., 1689/1975, An Essay Concerning Human Understanding , Oxford: Clarendon Press.
  • Lotto, A. J., K. R. Kluender, and L. L. Holt, 1997, “Animal models of speech perception phenomena,” in K. Singer, R. Eggert, and G. Anderson (eds.), Chicago Linguistic Society , 33, Chicago: Chicago Linguistic Society, pp. 357–367.
  • Lycan, W., 2000, “The slighting of smell,” in N. Bhushan and S. Rosenfeld (eds.), Of Minds and Molecules: New Philosophical Perspectives on Chemistry , Oxford: Oxford University Press, pp. 273–89.
  • Maclachlan, D. L. C., 1989, Philosophy of Perception , Englewood Cliffs, NJ: Prentice Hall.
  • Mag Uidhir, C., 2007, “Recordings as performances,” British Journal of Aesthetics , 47(3): 298–314.
  • Malpas, R. M. P., 1965, “The location of sound,” in R. J. Butler (ed.), Analytical Philosophy , Second Series, Oxford: Basil Blackwell, pp. 131–144.
  • Martin, M. G. F., 1992, “Sight and touch,” in T. Crane (ed.), The Contents of Experience , Cambridge: Cambridge University Press.
  • –––, 1993, “Sense modalities and spatial properties,” in N. Eilan, R. McCarthy, and B. Brewer (eds.), Spatial Representation: Problems in Philosophy and Psychology , Oxford: Blackwell.
  • Matthen, M., 2005, Seeing, Doing, and Knowing: A Philosophical Theory of Sense Perception , Oxford: Oxford University Press.
  • –––, 2010, “On the diversity of auditory objects,” Review of Philosophy and Psychology , 1: 63–89.
  • ––– (ed.), 2015, Oxford Handbook of Philosophy of Perception , Oxford: Oxford University Press.
  • McGurk, H. and J. MacDonald, 1976, “Hearing lips and seeing voices,” Nature , 264: 746–748.
  • Mehler, J., P. Jusczyk, G. Lambertz, N. Halsted, J. Bertoncini, and C. Amiel-Tison, 1988, “A precursor of language acquisition in young infants,” Cognition , 29: 143–178.
  • Mole, C., 2009, “The Motor Theory of speech perception,” in M. Nudds and C. O’Callaghan (eds.), Sounds and Perception: New Philosophical Essays , Oxford: Oxford University Press.
  • Nakayama, K., Z. J. He, and S. Shimojo, 1995, “Visual surface representation,” in S. M. Kosslyn and D. N. Osherson (eds.), Visual Cognition , Volume 2 of An Invitation to Cognitive Science , second edition, Cambridge, MA: MIT, pp. 1–70.
  • Noë, A., 2004, Action in Perception , Cambridge, MA: MIT Press.
  • Nudds, M., 2001, “Experiencing the production of sounds,” European Journal of Philosophy , 9: 210–229.
  • –––, 2003, “The significance of the senses,” Proceedings of the Aristotelian Society , 104(1): 31–51.
  • –––, 2010, “What are auditory objects?” Review of Philosophy and Psychology , 1: 105–122.
  • Nudds, M. and C. O’Callaghan, 2009, Sounds and Perception: New Philosophical Essays , Oxford: Oxford University Press.
  • O’Callaghan, C., 2007, Sounds: A Philosophical Theory , Oxford: Oxford University Press.
  • –––, 2008a, “Object perception: Vision and audition,” Philosophy Compass , 3: 803–829.
  • –––, 2008b, “Seeing what you hear: Cross-modal illusions and perception,” Philosophical Issues , 18: 316–338.
  • –––, 2010, “Perceiving the locations of sounds,” Review of Philosophy and Psychology , 1: 123–140.
  • –––, 2011a, “Hearing properties, effects or parts?” Proceedings of the Aristotelian Society , 111: 375–405.
  • –––, 2011b, “Against hearing meanings,” Philosophical Quarterly , 61: 783–807.
  • –––, 2012, “Perception and multimodality,” in E. Margolis, R. Samuels, and S. Stich (eds.), Oxford Handbook of Philosophy of Cognitive Science , Oxford: Oxford University Press, pp. 92–117.
  • –––, 2015, “Speech perception,” in M. Matthen (ed.), Oxford Handbook of Philosophy of Perception , Oxford: Oxford University Press, pp. 475–494.
  • –––, 2016, “Objects for multisensory perception,” Philosophical Studies , 173(5): 1269–1289.
  • –––, 2019, A Multisensory Philosophy of Perception , Oxford: Oxford University Press.
  • O’Shaughnessy, B., 1989, “The sense of touch,” Australasian Journal of Philosophy , 69: 37–58.
  • –––, 2002, Consciousness and the World , Oxford: Oxford University Press.
  • Pasnau, R., 1999, “What is sound?” Philosophical Quarterly , 49: 309–324.
  • –––, 2000, “Sensible qualities: The case of sound,” Journal of the History of Philosophy , 38: 27–40.
  • Pautz, A., 2014, “The real trouble for phenomenal externalists,” in R. Brown (ed.), Consciousness Inside and Out: Phenomenology, Neuroscience, and the Nature of Experience , New York: Springer, pp. 237–298.
  • –––, 2017, “Experiences are representations: An empirical argument,” in B. Nanay (ed.), Current Controversies in Philosophy of Perception , New York: Routledge, pp. 23–43.
  • Pinker, S., 1994, The Language Instinct , New York: William Morrow.
  • Rand, T. C., 1974, “Dichotic release from masking for speech,” Journal of the Acoustical Society of America , 55: 678–680.
  • Remez, R. E. and J. D. Trout, 2009, “Philosophical messages in the medium of spoken language,” in M. Nudds and C. O’Callaghan (eds.), Sounds and Perception: New Philosophical Essays , Oxford: Oxford University Press, pp. 234–263.
  • Rey, G., 2012, “Externalism and inexistence in early content,” in R. Schantz (ed.), Prospects for Meaning , New York: de Gruyter, pp. 503–530.
  • Richardson, L., 2013, “Flavour, taste and smell,” Mind and Language , 28(3): 322–341.
  • Rosenblum, L. D., 2004, “Perceiving articulatory events: Lessons for an ecological psychoacoustics,” in J. G. Neuhoff (ed.), Ecological Psychoacoustics , Chapter 8, San Diego, CA: Elsevier, pp. 220–248.
  • Scholl, B. J., 2001, “Objects and attention: the state of the art,” Cognition , 80: 1–46.
  • Scott, M., 2001, “Tactual perception,” Australasian Journal of Philosophy , 79(2): 149–160.
  • Scruton, R., 1997, The Aesthetics of Music , Oxford: Oxford University Press.
  • Shams, L., Y. Kamitani, and S. Shimojo, 2000, “What you see is what you hear,” Nature , 408: 788.
  • –––, 2002, “Visual illusion induced by sound,” Cognitive Brain Research , 14: 147–152.
  • Siegel, S., 2006, “Which properties are represented in perception?” in T. Gendler and J. Hawthorne (eds.), Perceptual Experience , New York: Oxford University Press, pp. 481–503.
  • Smith, A. D., 2002, The Problem of Perception , Cambridge, MA: Harvard University Press.
  • Smith, B. C., 2015, “The chemical senses,” in M. Matthen (ed.), Oxford Handbook of Philosophy of Perception , Oxford: Oxford University Press, pp. 314–352.
  • Solomon, J., 2007, Spatialization in Music: The Analysis and Interpretation of Spatial Gestures , Ph.D. thesis, Department of Music, University of Georgia, Athens, GA.
  • Sorensen, R., 2008, Seeing Dark Things , New York: Oxford University Press.
  • Soto-Faraco, S., J. Navarra, W. M. Weikum, A. Vouloumanos, N. Sebastián-Gallés, and J. F. Werker, 2007, “Discriminating languages by speech-reading,” Perception and Psychophysics , 69(2): 218.
  • Spelke, E. S., 1990, “Principles of object perception,” Cognitive Science , 14: 29–56.
  • Stevens, S. and J. Volkmann, 1940, “The relation of pitch to frequency: A revised scale,” American Journal of Psychology , 53: 329–353.
  • Stevens, S., J. Volkmann, and E. Newman, 1937, “A scale for the measurement of the psychological magnitude pitch,” Journal of the Acoustical Society of America , 8(3): 185–190.
  • Stokes, D., M. Matthen, and S. Biggs (eds.), 2015, Perception and Its Modalities , New York: Oxford University Press.
  • Strawson, P. F., 1959, Individuals , New York: Routledge.
  • Trout, J. D., 2001a, “Metaphysics, method, and the mouth: Philosophical lessons of speech perception,” Philosophical Psychology , 14(3): 261–291.
  • –––, 2001b, “The biological basis of speech: What to infer from talking to the animals,” Psychological Review , 108(3): 523–549.
  • Van Valkenburg, D. and M. Kubovy, 2003, “In defense of the theory of indispensable attributes,” Cognition , 87: 225–233.
  • Vouloumanos, A. and J. F. Werker, 2007, “Listening to language at birth: evidence for a bias for speech in neonates,” Developmental Science , 10(2): 159–164.
  • Weikum, W. M., A. Vouloumanos, J. Navarra, S. Soto-Faraco, N. Sebastián-Gallés, and J. F. Werker, 2007, “Visual language discrimination in infancy,” Science , 316(5828): 1159.
  • Welch, R. B. and D. H. Warren, 1980, “Immediate perceptual response to intersensory discrepancy,” Psychological Bulletin , 88(3): 638–667.
  • Werker, J., 1995, “Exploring developmental changes in cross-language speech perception,” in L. Gleitman and M. Liberman (eds.), Language: An Invitation to Cognitive Science , Volume 1, 2nd edition, Cambridge, MA: MIT Press, pp. 87–106.
  • Young, N., 2017, “Hearing spaces,” Australasian Journal of Philosophy , 95(2): 242–255.
  • –––, 2018, “Hearing objects and events,” Philosophical Studies , 175(11): 2931–2950.
  • Zahorik, P. and F. Wightman, 2001, “Loudness constancy with varying sound source distance,” Nature Neuroscience , 4: 78–83.
  • Zwicker, E. and H. Fastl, 2006, Psychoacoustics: Facts and Models , 3rd edition, New York: Springer.



Psychology Dictionary

SPEECH PERCEPTION

a psychological process whereby the listener converts the speech signal into a phonological representation.


Psycholinguistics/Models of Speech Perception

  • 1 Introduction
  • 2 Models of Speech Perception
  • 2.1 TRACE Model
  • 2.2 Motor Theory Model
  • 2.3 Categorical Perception
  • 2.4 Cohort Model
  • 2.5 Exemplar Theory
  • 2.6 Neurocomputational Model
  • 2.7 Dual Stream Model
  • 3 Problems with Speech Perception Models
  • 4 Conclusion
  • 5 Learning Activities
  • 5.1 Name the Model
  • 5.2 Short Answer
  • 5.3 Mini Quiz
  • 6 References
  • 7 Learning Activity Answers

Introduction

Speech perception is the process by which speech is interpreted. It involves three processes: hearing, interpreting, and comprehending the sounds produced by a speaker. A central function of speech perception is combining these elements into a sequence that resembles the speech of a given language. Speech perception draws not only on the phonology and phonetics of the speech to be perceived, but also on the syntax of the language and the semantics of the spoken message. Explaining speech perception therefore requires a model that unites the various components of speech and yields a comprehensive message. Various models have been developed to help understand how the different components of speech are perceived. Some models address only the production or only the perception of speech, while others combine the two. The earliest models date to about the mid-1900s, and new models continue to be developed today.

Models of Speech Perception

TRACE Model

The TRACE model was one of the first models of speech perception, and it remains one of the best known. Its primary function is to take the various sources of information found in speech and integrate them to identify single words. The TRACE model, proposed by McClelland and Elman (1986), is based on the principles of interactive activation [1] . All components of speech (features, phonemes, and words) have their own role in creating intelligible speech, and TRACE unites them into a complete stream of speech rather than a set of individual components. The model has two distinct implementations: TRACE I deals mainly with short segments of real speech, whereas TRACE II deals with identification of phonemes and words in speech. The model as a whole consists of a very large number of units organized into three separate levels, each comprising banks of detectors for the components of that level.

  • Feature level - At this level, there are several banks of feature detectors. Each feature has its own place in the temporal stream of speech, and the detectors are organized in successive order.
  • Phoneme level - At this level, there is a bank of detectors for each phoneme present in the speech sounds.
  • Word level - At this level, there is a bank of detectors for each individual word that is spoken by the speaker.

The TRACE model works in two directions. TRACE allows for either words or phonemes to be derived from a spoken message. By segmenting the individual sounds, phonemes can be determined from spoken words. By combining the phonemes, words can be created and perceived by the listener.
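
To make the interactive-activation idea concrete, here is a toy sketch (the two-word lexicon, weights, and iteration count are invented for illustration; the actual TRACE model uses many more units, lateral inhibition, and a time-sliced input "trace"):

```python
# Toy lexicon mapping words to their phoneme sequences (invented example).
lexicon = {"grape": ("g", "r", "ey", "p"), "gray": ("g", "r", "ey")}

phoneme_act = {"g": 1.0, "r": 1.0}        # bottom-up evidence heard so far
word_act = {word: 0.0 for word in lexicon}

for _ in range(3):                         # a few settling iterations
    # Bottom-up: each word gains activation from its active phonemes.
    for word, phonemes in lexicon.items():
        word_act[word] += 0.10 * sum(phoneme_act.get(p, 0.0) for p in phonemes)
    # Top-down: active words feed activation back to their phonemes,
    # which is what makes the model interactive rather than one-way.
    for word, phonemes in lexicon.items():
        for p in phonemes:
            phoneme_act[p] = phoneme_act.get(p, 0.0) + 0.05 * word_act[word]

print(sorted(word_act.items(), key=lambda kv: -kv[1]))
```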

Motor Theory Model

This model was developed in 1967 by Liberman and colleagues. Its basic principle is that speech perception is grounded in the production of speech sounds in the speaker's vocal tract. The Motor Theory proposes that a listener specifically perceives a speaker's phonetic gestures while they are speaking. A phonetic gesture, for this model, is a representation of the speaker's vocal tract constriction while producing a speech sound [2] . Each phonetic gesture is produced at a unique place in the vocal tract, and these different places of production permit the speaker to produce phonemes that are salient for listeners to perceive. The Motor Theory model functions by using separate embedded models within the main model; it is the interaction of these models that makes Motor Theory possible.

  • Trading Relations - This is the concept that not every phonetic gesture can be directly translated into acoustic terms, so an additional step is needed for interpreting vocal gestures. Some gestures can be cognitively exchanged with others to make interpretation simpler: if a produced gesture is similar enough to another gesture that already has a known articulatory cause, the perceived gesture can be traded with the known gesture and interpretation can be achieved.
  • Coarticulation - This is the idea that gesture production is variable: the place of articulation of a vocal gesture varies across speakers and contexts, and the same gesture may be produced in more than one place. The phonemes within the gestures are obtained and perceived through the ability to compensate for all the variation in speech that coarticulation makes possible.

Categorical Perception

Categorical Perception is the idea that phonemes can be sorted into categories once they are produced. The main dimensions along which speech sounds are categorized are place of articulation and voice onset time. Some vocal gestures can be produced at only a single place of articulation, while others admit a range of coarticulations; that is, a given sound may be produced at one place in the vocal tract or at several different places [3]. Determining where a sound is produced helps determine which sound has been produced. Vocal gestures also differ in the timing of voice onset: different gestures initiate voicing at different times depending on the sound being produced. For example, /b/ has a different voice onset time than /p/, yet the two are produced at the same place in the vocal tract [4]. Knowing the voice onset time of a sound helps in assessing which sound the speaker produced. Distinguishing place of articulation from voice onset time enables gestures to be grouped and defined by the way they are produced, as sketched below.
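
The voice onset time (VOT) side of this can be reduced to a one-line decision rule. The ~25 ms boundary below is only an approximate figure sometimes cited for English bilabial stops, assumed here purely for illustration:

```python
# Categorical labeling along a continuous VOT dimension. The 25 ms
# /b/-/p/ boundary is an approximation for English bilabial stops,
# assumed here only for illustration.

def classify_bilabial_stop(vot_ms, boundary_ms=25.0):
    """Tokens on either side of the boundary are heard categorically,
    even though VOT itself varies continuously."""
    return "/b/" if vot_ms < boundary_ms else "/p/"

for vot in (5, 15, 35, 60):
    print(f"{vot} ms -> {classify_bilabial_stop(vot)}")
```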

Cohort Model

Proposed by Marslen-Wilson and colleagues in the late 1970s, the Cohort Model is an account of lexical retrieval. An individual's lexicon is the mental dictionary of all the words he or she knows; by one estimate, the average individual has a lexicon of about 45,000 to 60,000 words [5]. The premise of the Cohort Model is that a listener maps incoming auditory information onto words that already exist in the lexicon in order to interpret a new word. An auditory utterance can be broken down into segments, and the listener maps these segments, one by one, onto pre-existing words. As more segments of the utterance are perceived, the listener eliminates candidate words that no longer match the pattern.

Example: Grape

1. The listener hears the /gr/ sound, activates the words in his or her lexicon that begin with /gr/, and rules out all of the others.

2. On hearing /gra/, all words matching this longer pattern remain active, and the rest are eliminated.

3. This pattern continues until the listener has run out of speech segments and is left with a single option: Grape.

The ideas behind the Cohort Model have also been applied to technology to make internet searches faster and more convenient. Search engines such as Google use a similar mechanism to make searching quicker and easier for users.

Example: beginning to type an item into the Google search bar.

As the first letter is typed into the search bar, Google begins "guessing" what the word is going to be. The guesses are generally based on the most common searches and on what makes sense syntactically. As more letters are typed, the options shown in the menu narrow to those matching the letters typed, as in the sketch below.
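
Both the Cohort Model's segment-by-segment narrowing and autocomplete can be caricatured as prefix filtering over a candidate set. The five-word lexicon here is invented for the example:

```python
# Cohort-style candidate elimination: as each new segment arrives,
# words that no longer match the heard prefix drop out of the cohort.
# The tiny lexicon is invented for illustration; autocomplete applies
# the same idea to typed letters instead of speech segments.

LEXICON = ["grape", "gray", "grade", "green", "bread"]

def shrink_cohort(segments, lexicon=LEXICON):
    prefix = ""
    cohort = list(lexicon)
    for seg in segments:
        prefix += seg
        cohort = [w for w in cohort if w.startswith(prefix)]
        print(f"after '{prefix}': {cohort}")
    return cohort

shrink_cohort(["g", "r", "a", "p"])  # narrows step by step to ['grape']
```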

Exemplar Theory

The main premise of Exemplar Theory is similar to that of the Cohort Model. Exemplar Theory is based on the connection between memory and previous experience with words, and it aims to account for the way a listener remembers acoustic episodes, where an acoustic episode is an experience with spoken words. Evidence shows that listeners remember details of specific auditory episodes when the episodes are familiar [6]. Listeners may be better at recognizing previously heard words when the words are repeated by the same speaker at the same speaking rate, making the episode familiar. On this theory, every word leaves a unique imprint on the listener's memory, and this imprint is what helps a listener remember words: when new words enter memory, their imprints are matched against previous ones to determine similarities [7] (a code sketch of this imprint matching follows the example below). Exemplar Theory holds that as more experience is gained with the lexicon (new words being learned or heard), the stability of the memory increases. With this lexical plasticity, the Ganong Effect comes into play: memory traces of real words are perceived much more readily than those of nonsense words [8].

Ganong Effect Example:

  • Soot, Boot, and Root will be easier to remember because they are similar to words already in the listener's memory.
  • Snoyb, Bnoyb, and Rnoyb, having no similar counterparts in the listener's memory, will be difficult to remember.
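
The imprint-matching idea mentioned above can be sketched as similarity scoring against stored episodes. The feature dimensions (standing in for speaker, speaking rate, and an acoustic value) and the similarity function are invented for illustration:

```python
# Minimal exemplar-matching sketch: stored episodes are feature vectors
# (all dimensions invented), and a new token is recognized by summed
# similarity to remembered imprints of each word.

import math

episodes = [            # (word, feature vector) pairs standing in for traces
    ("boot", (1.0, 0.9, 0.2)),
    ("boot", (1.1, 0.8, 0.3)),
    ("root", (0.2, 0.9, 0.8)),
]

def similarity(a, b):
    """Exponentially decaying similarity: higher when vectors are closer."""
    return math.exp(-math.dist(a, b))

def recognize(token):
    scores = {}
    for word, trace in episodes:
        scores[word] = scores.get(word, 0.0) + similarity(token, trace)
    return max(scores, key=scores.get)

print(recognize((1.05, 0.85, 0.25)))  # close to the stored "boot" imprints
```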

Neurocomputational Model

Kroger and colleagues (2009) developed a speech perception model based on neurophysiological and neuropsychological facts about speech [9]. The model simulates the neural pathways in the various brain areas that are involved when speech is produced and perceived. In this model, speech knowledge is acquired by training neural networks that correspond to cortical and subcortical regions of the brain. Through their research, Kroger and colleagues determined that the neurocomputational model can embed in these brain areas the features of speech production and perception that are important for achieving comprehension of speech [10].

This model differs from the previously discussed models in its view of speech perception. The authors developed the model to demonstrate that speech perception involves not only the perception of spoken language but also, and heavily, the production of language [11]. In this, the model echoes the findings of Liberman and associates in their work on the Motor Theory: both demonstrate that speech perception is a product of both producing and receiving speech. Work by Huang and associates shows that very similar brain areas are activated for the production and the perception of language [12]. This neurocomputational model is one of the few that adequately maps the pathways of both speech functions in the brain.

Dual Stream Model

The Dual Stream Model, proposed by Hickok and Poeppel (2007), posits two functionally distinct neural networks that process speech and language information [13]. One network deals primarily with sensory and phonological information in the service of conceptual and semantic processing; the other maps sensory and phonological information onto motor and articulatory systems. In this sense, the Dual Stream Model encompasses both key aspects of speech, production and perception. The model also revises earlier assumptions about the lateralization of the human brain. The left hemisphere was previously thought to deal only with fast temporal information, but as Hickok and Poeppel (2007) demonstrate, this might not be the case: the left hemisphere appears capable of representing acoustic information as readily as the right hemisphere [14]. Beyond changing how the brain was thought to handle incoming information, the basic claim of the Dual Stream Model is that acoustic information must interface with conceptual and motor information for the entire message to be perceived [15]. This combining of roles is what makes the Dual Stream Model distinctive and plausible as a model of speech perception.

Problems with Speech Perception Models

One of the main issues in designing a speech perception model is deciding which method of perception the model will adopt. Speech perception can proceed in one of two ways: top-down processing or bottom-up processing. In top-down processing, listeners perceive the entire word and break it down into its components to determine its meaning; in bottom-up processing, listeners perceive the individual segments of a word first and assemble them to determine meaning. When designing a speech perception model, both of these processes need to be taken into account, and the processing direction a model takes depends on how its designers believe speech perception occurs.

The TRACE model and the Dual Stream model each employ both top-down and bottom-up processing. These models can explain not only how words are built up from phonemes but also how phonemes can be derived from complete words. TRACE and Dual Stream are exceptions, however, as most speech perception models have perception occurring in only one direction. The Cohort Model, for instance, uses strictly bottom-up processing: building up segments of a word until the entire word is assembled is bottom-up by definition. Processing information in only one direction is a limitation for a speech perception model.

Exemplar Theory and Motor Theory each pose a different kind of problem for speech perception. Both involve operationally defining aspects of speech that make the model work, and it is in these definitions that errors may arise. In Exemplar Theory, how can the similarity of words be defined adequately when the level of similarity will differ for each individual [16]? The same goes for defining an episode: it is difficult to ensure that listening to someone speak will be the same experience for more than one person, or even the same experience the second time around. With Motor Theory, how can the gestures made by speakers be defined properly if each speaker has a unique vocal tract and way of producing sound?

Each model is unique and functions as it was designed to. Every model has limitations, and there is no perfect model of speech perception. Even so, the models can be used independently of one another or in combination: if a perception problem cannot be solved using one model, there is a strong chance another model will work.

Conclusion

In conclusion, as shown above, there are many different models that can be used to explain how speech is perceived, each with its own method of working and its own uses. Which model to select depends on the aspect of speech, or the purpose, in question. The TRACE and Dual Stream models can both be used whether speech is processed from the phonemes up to words or broken down from words into phonemes; they are the only two models discussed here capable of operating in both directions. The other models in this chapter each have their own significant purpose and are best used in the circumstances that suit them. Speech perception models were designed to help detect and interpret speech for many reasons, notably to help explain how utterances are understood when they are difficult to distinguish. They have also informed efforts to help computers, microphones, and other electronic devices receive and translate human voices into intelligible messages. Speech perception models should be used with the understanding that each one differs from the others; selecting the appropriate model will solve a speech perception problem far more easily than applying an ill-suited one.

Learning Activity

Upon reviewing the above material on Models of Speech Perception, answer the following questions to test your knowledge about the models of speech perception. The answers to each of the questions are posted below this section for you to review once you've answered.

Name the Model

For this section, provide the speech perception model that may be described by the following words. There may be one or more answers for some of the questions.

  • lexicon, mapping, segmentation = __________.
  • gestures, coarticulation, trading relations = _________.
  • conceptual/semantics, motor/articulatory, networks = __________.
  • acoustic episode, imprint, Ganong Effect = __________.
  • word, feature, phoneme = __________.
  • places of articulation, voice onset, gestures = _________.
  • neurophysiological, neuropsychological, pathways = __________.

Short Answer

For this section please provide a written answer in the form of a paragraph. Make note of which speech perception model is being utilized, and unless otherwise specified, state what the model(s) is/are.

  • Sally, an undergraduate student, is preparing an argument about speech perception models. Her argument must include models that work best at processing words or phonemes in both a top-down and a bottom-up fashion. Which model(s) should Sally use in her argument, and why?
  • You are a tutor for a grade 4 student. The student has been learning spellings, working especially on vowel sounds. You have noticed that the student can easily remember how to pronounce and spell the words read, heal, and veal, however he struggles greatly with the words kead, peaf, and feam. What could be causing a problem for this learner, and which speech perception model does this problem fall under?
  • Joe is a neuroscience student who is very familiar with neural pathways. His professor has asked him to prepare a presentation on speech perception models. The professor is also interested in the pathways involved in speech and would like to see this incorporated into the presentation. Which model could Joe use for this assignment to make his professor happy, and interest himself?

Mini Quiz

Fill in the blanks using the information found above.

  • Coarticulation and __________ are used as two models within the main model for __________.
  • Different vocal gestures have a different __________ for the different sounds they are making.
  • The premise of __________ is that a listener maps novel words onto pre-existing words in his __________ to interpret the new word.
  • The TRACE model is based on the principles of __________ activation.
  • One stream of the Dual Stream model deals with _________ and the other stream deals with __________.
  • The Neurocomputational Model is based on _________ and __________ facts about speech.
  • Each phonetic gesture is produced __________ in the vocal tract.
  • The main categories speech can be divided into are __________ and __________.

References

  • McClelland, J., & Elman, J. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.
  • Liberman, A. M., et al. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
  • Goldstone, R. L. (1994). Influences of categorization on perceptual discrimination. Journal of Experimental Psychology: General, 123, 178-200.
  • Truckenbrodt, H. (2007). Spectrogram readings and other acoustics. Introduction to Phonetics and Phonology. May 27, 2007.
  • Aitchison, J. (1987). Words in the Mind. Oxford: Basil Blackwell.
  • Goldinger, S. (1996). Words and voices: episodic traces in spoken word identification and recognition memory. Journal of Experimental Psychology: Learning, Memory and Cognition, 22(5), 1166-1183.
  • Goldinger, S. (1998). Echoes of echoes? An episodic theory of lexical access. Psychological Review, 105(2), 251-279.
  • Goldinger, S. (1998).
  • Kroger et al. (2009). Towards a neurocomputational model of speech production and perception. Speech Communication, 51, 793-809.
  • Kroger et al. (2009).
  • Hickok, G., & Poeppel, D. (2000). Towards a functional neuroanatomy of speech perception. Trends in Cognitive Sciences, 4, 131-138.
  • Huang et al. (2001). Comparing cortical activations for silent and overt speech using event-related fMRI. Human Brain Mapping, 15, 39-53.
  • Hickok, G., & Poeppel, D. (2007). The cortical organization of speech processing. Nature Reviews Neuroscience, 8(5), 393-402.
  • Luce, P., & Pisoni, D. (1998). Recognizing spoken words: the neighborhood activation model. Ear and Hearing, 19, 1-36.
  • Milner, A., & Goodale, M. (1995). The Visual Brain in Action. Oxford: Oxford University Press.
  • Johnson et al. (1999). Auditory-visual integration of talker gender in vowel perception. Journal of Phonetics, 27, 359-384.

Learning Activity Answers

1. Cohort Model

2. Motor Theory

3. Dual Stream

4. Exemplar Theory

5. TRACE Model

6. Categorical Perception

7. Neurocomputational model

1. TRACE and Dual Stream

The TRACE model and the Dual Stream model both use top-down and bottom-up processing as they function. Each can either segment entire words down into their phonemes (top-down) or build words up from their individual phonemes (bottom-up). These are the only two speech perception models discussed here with this capability, which equips them to handle many speech perception needs. For her argument, Sally should discuss both of these models.

2. The 4th grader is struggling with what is known as the Ganong Effect. The first group of words is easy for him to remember because the words are familiar and he has probably heard them used in conversation before. The second group is hard to remember because the words are unfamiliar, and he has nothing stored in his mental lexicon to map them onto. The Ganong Effect comes into play within Exemplar Theory, because Exemplar Theory deals with memory of and experience with words: words used in everyday life, or otherwise familiar, are easier to recall and learn than words that are not.

3. Neurocomputational Model

Joe should use the Neurocomputational Model for his presentation. This model of speech perception draws on information about where in the various brain areas speech is produced and perceived, and it works by training neural networks not only to detect speech but also to produce it. If Joe is interested in and knowledgeable about the brain and its neural pathways, this model would be ideal for him to present.

1. trading relations, Motor Theory

2. voice onset time

3. Cohort Model, lexicon

4. interactive

5. conceptual and semantics, motor and articulatory

6. neurophysiological, neuropsychological

7. uniquely

8. place of articulation, voice onset time


Speech perception as an active cognitive process

Shannon L. M. Heald

Department of Psychology, The University of Chicago, Chicago, IL, USA

Howard C. Nusbaum

One view of speech perception is that acoustic signals are transformed into representations for pattern matching to determine linguistic structure. This process can be taken as a statistical pattern-matching problem, assuming relatively stable linguistic categories are characterized by neural representations related to auditory properties of speech that can be compared to speech input. This kind of pattern matching can be termed a passive process, one that implies rigidity of processing and makes few demands on cognitive resources. An alternative view is that speech recognition, even in early stages, is an active process in which speech analysis is attentionally guided. Note that this does not mean consciously guided, but that information-contingent changes in early auditory encoding can occur as a function of context and experience. Active processing assumes that attention, plasticity, and listening goals are important in considering how listeners cope with adverse circumstances that impair hearing, such as masking noise in the environment or hearing loss. Although theories of speech perception have begun to incorporate some active processing, they seldom treat early speech encoding as plastic and attentionally guided. Recent research has suggested that speech perception is the product of both feedforward and feedback interactions between a number of brain regions, including descending projections perhaps as far downstream as the cochlea. It is important to understand how the ambiguity of the speech signal and constraints of context dynamically determine the cognitive resources recruited during perception, including focused attention, learning, and working memory. Theories of speech perception need to go beyond the current corticocentric approach in order to account for the intrinsic dynamics of the auditory encoding of speech. In doing so, they may provide new insights into ways in which hearing disorders and loss may be treated, whether through augmentation or therapy.

In order to achieve flexibility and generativity, spoken language understanding depends on active cognitive processing (Nusbaum and Schwab, 1986; Nusbaum and Magnuson, 1997). Active cognitive processing is contrasted with passive processing in terms of the control processes that organize the nature and sequence of cognitive operations (Nusbaum and Schwab, 1986). A passive process is one in which inputs map directly to outputs with no hypothesis testing or information-contingent operations. Automatized cognitive systems (Shiffrin and Schneider, 1977) behave as though passive, in that stimuli are mandatorily mapped onto responses without demand on cognitive resources. However, it is important to note that cognitive automatization does not have strong implications for the nature of the mediating control system, and various different mechanisms have been proposed to account for automatic processing (e.g., Logan, 1988). By comparison, active cognitive systems have a control structure that permits "information-contingent processing," the ability to change the sequence or nature of processing in the context of new information or uncertainty. In principle, active systems can generate hypotheses to be tested as new information arrives or is derived (Nusbaum and Schwab, 1986) and thus provide substantial cognitive flexibility to respond to novel situations and demands.

Active and passive processes

The distinction between active and passive processes comes from control theory and reflects the degree to which a sequence of operations, in this case neural population responses, is contingent on processing outcomes (see Nusbaum and Schwab, 1986). A passive process is an open-loop sequence of transformations that are fixed, such that there is an invariant mapping from input to output (MacKay, 1951, 1956). Figure 1A illustrates a passive process in which a pattern of inputs (e.g., basilar membrane responses) is transmitted directly over the eighth nerve to the next population of neurons (e.g., in the auditory brainstem) and upward to cortex. This is the fundamental assumption of a number of theories of auditory processing in which a fixed cascade of neural population responses are transmitted from one part of the brain to the other (e.g., Barlow, 1961). This type of system operates the way reflexes are assumed to operate, in which neural responses are transmitted and presumably transformed but in a fixed and immutable way (outside the context of longer term reshaping of responses). Considered in this way, such passive processing networks should process in a time frame that is simply the sum of the neural response times, and should not be influenced by processing outside this network, functioning something like a module (Fodor, 1983). In this respect then, such passive networks should operate "automatically" and not place any demands on cognitive resources. Some purely auditory theories seem to have this kind of organization (e.g., Fant, 1962; Diehl et al., 2004) and some more classical neural models (e.g., Broca, 1865; Wernicke, 1874/1977; Lichtheim, 1885; Geschwind, 1970) appear to be organized this way. In these cases, auditory processes project to perceptual interpretations with no clearly specified role for feedback to modify or guide processing.

Figure 1. Schematic representation of passive and active processes. The top panel (A) represents a passive process. A stimulus presented to sensory receptors is transformed through a series of processes (Ti) into a sequence of pattern representations until a final perceptual representation is the result. This could be thought of as a pattern of hair cell stimulation being transformed up to a phonological representation in cortex. The middle panel (B) represents a top-down active process. Sensory stimulation is compared as a pattern to hypothesized patterns derived from some knowledge source, either from context or from expectations. Error signals from the comparison interact with the hypothesized patterns until constrained to a single interpretation. The generation of hypothesized patterns may be in parallel or accomplished sequentially. The bottom panel (C) represents a bottom-up active process in which sensory stimulation is transformed into an initial pattern, which can be transformed into some representation. If this representation is sensitive to the unfolding of context or immediate perceptual experience, it could generate a pattern from the immediate input and context that is different than the initial pattern. Feedback from the context-based pattern in comparison with the initial pattern can generate an error signal to the representation, changing how context is integrated to produce a new pattern for comparison purposes.

By contrast, active processes are variable in nature, as network processing is adjusted by an error-correcting mechanism or feedback loop. As such, outcomes may differ in different contexts. These feedback loops provide information to correct or modify processing in real time, rather than retrospectively. Nusbaum and Schwab (1986) describe two different ways an active, feedback-based system may be achieved. In one form, as illustrated in Figure 1B, expectations (derived from context) provide a hypothesis about a stimulus pattern that is being processed. In this case, sensory patterns (e.g., basilar membrane responses) are transmitted in much the same way as in a passive process (e.g., to the auditory brainstem). However, descending projections may modify the nature of neural population responses in various ways as a consequence of neural responses in cortical systems. For example, top-down effects of knowledge or expectations have been shown to alter low-level processing in the auditory brainstem (e.g., Galbraith and Arroyo, 1993) or in the cochlea (e.g., Giard et al., 1994). Active systems may occur in another form, as illustrated in Figure 1C. In this case, there may be a strong bottom-up processing path as in a passive system, but feedback signals from higher cortical levels can change processing in real time at lower levels (e.g., brainstem). An example of this would be the kind of observation made by Spinelli and Pribram (1966) in showing that electrical stimulation of the inferotemporal cortex changed the receptive field structure for lateral geniculate neurons, or Moran and Desimone's (1985) demonstration that spatial attentional cueing changes effective receptive fields in striate and extrastriate cortex. In either case, active processing places demands on the system's limited cognitive resources in order to achieve cognitive and perceptual flexibility. In this sense, active and passive processes differ in the cognitive and perceptual demands they place on the system.
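
The control-theoretic contrast can be caricatured in a few lines of code. This is a schematic sketch only, with invented transform and interpretation functions and a single gain term standing in for whatever processing the feedback actually adjusts:

```python
# Passive vs. active processing as control structures (all functions
# and parameters invented for illustration).

def passive_process(signal, transforms):
    """Open loop: a fixed chain of transformations, input to output."""
    for transform in transforms:
        signal = transform(signal)
    return signal

def active_process(signal, interpret, expected, tolerance=0.05, max_iters=20):
    """Closed loop: compare the interpretation against an expectation and
    feed the error back to adjust earlier processing (here, one gain term)."""
    gain = 1.0
    percept = interpret(signal * gain)
    for _ in range(max_iters):
        error = expected - percept
        if abs(error) < tolerance:
            break
        gain += 0.5 * error            # feedback modifies processing itself
        percept = interpret(signal * gain)
    return percept

print(passive_process(0.4, [lambda x: 2 * x, lambda x: x + 0.1]))  # always 0.9
print(active_process(0.4, interpret=lambda x: x, expected=1.0))    # converges near 1.0
```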

Although the distinction between active and passive processes seems sufficiently simple, examination of computational models of spoken word recognition makes the distinction less clear. For a very simple example of this potential issue, consider the original Cohort theory (Marslen-Wilson and Welsh, 1978). Activation of a set of lexical candidates was presumed to occur automatically from the initial sounds in a word. This can be designated as a passive process, since there is a direct invariant mapping from initial sounds to activation of a lexical candidate set, i.e., a cohort of words. Each subsequent sound in the input then deactivates members of this candidate set, giving the appearance of a recurrent hypothesis-testing mechanism in which the sequence of input sounds deactivates cohort members. One might consider this an active system overall with a passive first stage, since the initial cohort set constitutes a set of lexical hypotheses that are tested by the use of context. However, it is important to note that the original Cohort theory did not include any active processing at the phonemic level, as hypothesis testing is carried out in the context of word recognition. Similarly, the architecture of the Distributed Cohort Model (Gaskell and Marslen-Wilson, 1997) asserts that activation of phonetic features is accomplished by a passive system, whereas context interacts (through a hidden layer) with the mapping of phonetic features onto higher-order linguistic units (phonemes and words), representing an interaction of context with passively derived phonetic features. In neither case is the activation of the features or sound input to linguistic categorization treated as hypothesis testing in the context of other sounds or linguistic information. Thus, while the Cohort models can be thought of as an active system for the recognition of words (and sometimes phonemes), they treat phonetic features as passively derived and not influenced by context or expectations.

This is often the case in a number of word recognition models. The Shortlist models (Shortlist: Norris, 1994; Shortlist B: Norris and McQueen, 2008) assume that phoneme perception is a largely passive process (at least this can be inferred from the lack of any specification of an alternative). While Shortlist B uses phoneme confusion data (probability functions as input) and could in principle adjust the confusion data based on experience (through hypothesis testing and feedback), the nature of the derivation of the phoneme confusions is not specified, in essence assuming the problem of phoneme perception is solved. This appears to be common to models (e.g., NAM, Luce and Pisoni, 1998) in which the primary goal is to account for word perception rather than phoneme perception. Similarly, the second Trace model (McClelland and Elman, 1986) assumed phoneme perception was passively achieved, albeit with competition (not feedback to the input level). It is interesting that the first Trace model (Elman and McClelland, 1986) did allow for feedback from phonemes to adjust activation patterns from acoustic-phonetic input, thus providing an active mechanism. However, this was not carried over into the revised version; the earlier model had been developed to account for some aspects of phoneme perception unaccounted for in the second. It is interesting to note that the Hebb-Trace model (Mirman et al., 2006a), while seeking to account for aspects of lexical influence on phoneme perception and speaker generalization, did not incorporate active processing of the input patterns; only the classification of those inputs was actively governed.

This can be understood in the context of the schema diagrammed in Figure 1. Any process that maps inputs onto representations in an invariant manner, or that would be classified as a finite-state deterministic system, can be considered passive. A process that changes the classification of inputs contingent on context or goals or hypotheses can be considered an active system. Although word recognition models may treat the recognition of words or even phonemes as an active process, this active processing is not typically extended down to lower levels of auditory processing. These systems tend to operate as though there is a fixed set of input features (e.g., phonetic features) and the classification of such features takes place in a passive, automatized fashion.

By contrast, Elman and McClelland (1986) did describe a version of Trace in which patterns of phoneme activation actively change processing at the feature input level. Similarly, McClelland et al. (2006) described a version of their model in which lexical information can modify input patterns at the subphonemic level. Both of these models represent active systems for speech processing at the sublexical level. However, it is important to point out that such theoretical propositions remain controversial. McQueen et al. (2006) have argued that there are no data to argue for lexical influences over sublexical processing, although Mirman et al. (2006b) have countered this with empirical arguments. However, the question of whether there are top-down effects on speech perception is not the same as asking if there are active processes governing speech perception. Top-down effects assume higher-level knowledge constrains interpretations, but as indicated in Figure 1C, there can be bottom-up active processing whereby antecedent auditory context constrains subsequent perception. This could be carried out in a number of ways. As an example, Ladefoged and Broadbent (1957) demonstrated that hearing a context sentence produced by one vocal tract could shift the perception of subsequent isolated vowels such that they would be consistent with the vowel space of the putative speaker. Some have accounted for this result by asserting there is an automatic auditory tuning process that shifts perception of the subsequent vowels (Huang and Holt, 2012; Laing et al., 2012). While the behavioral data could possibly be accounted for by such a simple passive mechanism, it might also be the case that the auditory pattern input produces constraints on the possible vowel space or auditory mappings that might be expected. In this sense, the question of whether early auditory processing of speech is an active or passive process is still a point of open investigation and discussion.

It is important to make three additional points in order to clarify the distinction between active and passive processes. First, a Bayesian mechanism is not on its own merits necessarily active or passive. Bayes' rule describes the way different statistics can be used to estimate the probability of a diagnosis or classification of an event or input. But this is essentially a computation-theoretic description, much in the same way Fourier's theorem is independent of any implementation of the theorem to actually decompose a signal into its spectrum (cf. Marr, 1982). The calculation and derivation of relevant statistics for a Bayesian inference can be carried out passively or actively. Second, the presence of learning within a system does not on its own merits confer active processing status on a system. Learning can occur by a number of algorithms (e.g., Hebbian learning) that can be implemented passively. However, to the extent that a system's inputs are plastic during processing, this would suggest that an active system is at work. Finally, it is important to point out that active processing describes the architecture of a system (the ability to modify processing on the fly based on the processing itself) but not its behavior at any particular point in time. Given a fixed context and inputs, any active system can and likely would mimic passive behavior. The detection of an active process therefore depends on testing behavior under contextual variability or resource limitations to observe changes in processing as a consequence of variation in the hypothesized alternatives for interpretation (e.g., slower responses, higher error rate or confusions, increase in working memory load).
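
To make the first point concrete, here is Bayes' rule applied to an ambiguous vowel as pure computation; all the numbers are invented for illustration. Nothing in the arithmetic itself dictates whether the likelihoods and priors are retrieved passively or generated and revised actively:

```python
# Bayes' rule as a bare computation, neutral between active and passive
# implementations. Likelihoods and priors are invented for illustration:
# an ambiguous formant pattern under two vowel categories, with
# context-dependent priors.

likelihood = {"IH": 0.6, "EH": 0.4}  # P(acoustic pattern | vowel), assumed
prior = {"IH": 0.3, "EH": 0.7}       # P(vowel | context), assumed

evidence = sum(likelihood[v] * prior[v] for v in likelihood)
posterior = {v: likelihood[v] * prior[v] / evidence for v in likelihood}
print(posterior)  # {'IH': ~0.39, 'EH': ~0.61}: context outweighs the signal
```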

Computational need for active control systems in speech perception

Understanding how and why active cognitive processes are involved in speech perception is fundamental to the development of a theory of speech perception. Moreover, the theoretical problems that challenge most explanations of speech perception are structurally similar to some of the theoretical issues in language comprehension considered more broadly. In addition, to the extent that such mechanisms play a critical role in spoken language processing, understanding their operation may be important to understanding both the effect of hearing loss on speech perception and ways of remediating hearing loss. If one takes an overly simplified view of hearing (and thus of damage to hearing resulting in loss) as an acoustic-to-neural signal transduction mechanism comparable to a microphone-amplifier system, the simplifying assumptions may be very misleading. The notion of the peripheral auditory system as a passive acoustic transducer leads to theories that postulate passive conversion of acoustic energy to neural signals, and this may underestimate both the complexity and the potential of the human auditory system for processing speech. At the very least, early auditory encoding in the brain (reflected by the auditory brainstem response) is conditioned by experience (Skoe and Kraus, 2012), and so the distribution of auditory experiences shapes the basic neural patterns extracted from acoustic signals. Moreover, it appears that this auditory encoding is shaped from the top down by active and adaptive processing of higher-level knowledge and attention (e.g., Nusbaum and Schwab, 1986; Strait et al., 2010).

This conceptualization of speech perception as an active process has large repercussions for understanding the nature of hearing loss in older adults. Rabbitt ( 1991 ) has argued, as have others, that older adults, compared with younger adults, must employ additional perceptual and cognitive processing to offset sensory deficits in frequency and temporal resolution as well as in frequency range (Murphy et al., 2000 ; Pichora-Fuller and Souza, 2003 ; McCoy et al., 2005 ; Wingfield et al., 2005 ; Surprenant, 2007 ). Wingfield et al. ( 2005 ) have further argued that the use of this extra processing at the sensory level is costly and may affect the availability of cognitive resources that could be needed for other kinds of processing. While these researchers consider the cognitive consequences that may be encountered more generally given the demands on cognitive resources, such as the deficits found in the encoding of speech content in memory, there is less consideration of the way these demands may impact speech processing itself. If speech perception itself is mediated by active processes, which require cognitive resources, then the increasing demands on additional cognitive and perceptual processing for older adults becomes more problematic. The competition for cognitive resources may shortchange aspects of speech perception. Additionally, the difference between a passive system that simply involves the transduction, filtering, and simple pattern recognition (computing a distance between stored representations and input patterns and selecting the closest fit) and an active system that uses context dependent pattern recognition and signal-contingent adaptive processing has implications for the nature of augmentative hearing aids and programs of therapy for remediating aspects of hearing loss. It is well known that simple amplification systems are not sufficient remediation for hearing loss because they amplify noise as well as signal. Understanding how active processing operates and interacts with signal properties and cognitive processing might lead to changes in the way hearing aids operate, perhaps through cueing changes in attention, or by modifying the signal structure to affect the population coding of frequency information or attentional segregation of relevant signals. Training to use such hearing aids might be more effective by simple feedback or by systematically changing the level and nature of environmental sound challenges presented to listeners.

Furthermore, understanding speech perception as an active process has implications for explaining some of the findings of the interaction of hearing loss with cognitive processes (e.g., Wingfield et al., 2005 ). One explanation of the demands on cognitive mechanisms through hearing loss is a compensatory model as noted above (e.g., Rabbitt, 1991 ). This suggests that when sensory information is reduced, cognitive processes operate inferentially to supplement or replace the missing information. In many respects this is a kind of postperceptual explanation that might be like a response bias. It suggests that mechanisms outside of normal speech perception can be called on when sensory information is degraded. However an alternative view of the same situation is that it reflects the normal operation of speech recognition processing rather than an extra postperceptual inference system. Hearing loss may specifically exacerbate the fundamental problem of lack of invariance in acoustic-phonetic relationships.

The fundamental problem faced by all theories of speech perception derives from the lack of invariance in the relationship between the acoustic patterns of speech and the linguistic interpretation of those patterns. Although the many-to-many mapping between acoustic patterns of speech and perceptual interpretations is a longstanding well-known problem (e.g., Liberman et al., 1967 ), the core computational problem only truly emerges when a particular pattern has many different interpretations or can be classified in many different ways. It is widely established that individuals are adept in understanding the constituents of a given category, for traditional categories (Rosch et al., 1976 ) or ad hoc categories developed in response to the demands of a situation (Barsalou, 1983 ). In this sense, a many-to-one mapping does not pose a substantial computational challenge. As Nusbaum and Magnuson ( 1997 ) argue, a many-to-one mapping can be understood with a simple class of deterministic computational mechanisms. In essence, a deterministic system establishes one-to-one mappings between inputs and outputs and thus can be computed by passive mechanisms such as feature detectors. It is important to note that a many-to-one mapping (e.g., rising formant transitions signaling a labial stop and diffuse consonant release spectrum signaling a labial stop) can be instantiated as a collection of one-to-one mappings.

However, when a particular sensory pattern must be classified as a particular linguistic category and there are multiple possible interpretations, this constitutes a computational problem for recognition. In this case (e.g., a formant pattern that could signal either the vowel in BIT or BET) there is ambiguity about the interpretation of the input without additional information. One solution is that additional context or information could eliminate some alternative interpretations as in talker normalization (Nusbaum and Magnuson, 1997 ). But this leaves the problem of determining the nature of the constraining information and processing it, which is contingent on the ambiguity itself. This suggests that there is no automatic or passive means of identifying and using the constraining information. Thus an active mechanism, which tests hypotheses about interpretations and tentatively identifies sources of constraining information (Nusbaum and Schwab, 1986 ), may be needed.
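
The computational asymmetry between the two mappings can be shown in miniature. The cue names and contexts below are invented stand-ins; the point is only that the many-to-one case is a plain lookup, while the one-to-many case cannot be resolved without extra information:

```python
# Many-to-one vs. one-to-many mappings (cue and context names invented).

many_to_one = {                 # distinct cues, same category: no ambiguity
    "rising_formant_transition": "/b/",
    "diffuse_release_spectrum": "/b/",
}

one_to_many = {                 # one pattern, several readings: lookup fails
    "ambiguous_F1_pattern": {
        "talker_A_vowel_space": "BIT",
        "talker_B_vowel_space": "BET",
    },
}

def interpret(cue, context=None):
    if cue in many_to_one:
        return many_to_one[cue]         # deterministic, passive-style mapping
    return one_to_many[cue][context]    # requires context to select a reading

print(interpret("rising_formant_transition"))
print(interpret("ambiguous_F1_pattern", context="talker_A_vowel_space"))
```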

Given that there are multiple alternative interpretations for a particular segment of speech signal, the nature of the information needed to constrain the selection depends on the source of variability that produced the one-to-many non-determinism. Variations in speaking rate, or talker, or linguistic context or other signal modifications are all potential sources of variability that are regularly encountered by listeners. Whether the system uses articulatory or linguistic information as a constraint, the perceptual system needs to flexibly use context as a guide in determining the relevant properties needed for recognition (Nusbaum and Schwab, 1986 ). The process of eliminating or weighing potential interpretations could well involve demands on working memory. Additionally, there may be changes in attention, towards more diagnostic patterns of information. Further, the system may be required to adapt to new sources of lawful variability in order to understand the context (cf. Elman and McClelland, 1986 ).

Generally speaking, these same kinds of mechanisms could be implicated in higher levels of linguistic processing in spoken language comprehension, although the neural implementation of such mechanisms might well differ. The many-to-many mapping problem extends to all levels of linguistic analysis in language comprehension and can be observed between patterns at the syllabic, lexical, prosodic, and sentential levels in speech and the interpretations of those patterns as linguistic messages. This is because, across linguistic contexts, speaker differences (idiolect, dialect, etc.), and other contextual variations, there are no patterns (acoustic, phonetic, syllabic, prosodic, lexical, etc.) in speech that have an invariant relationship to the interpretation of those patterns. For this reason, it could be beneficial to consider how the phenomena of acoustic perception, phonetic perception, syllabic perception, prosodic perception, lexical perception, and so on are related computationally to one another, and to understand the computational similarities among the mechanisms that may subserve them (Marr, 1982). That such a mechanism needs to respond flexibly to changes in context (and to different kinds of context: word, sentence, talker, or speaking rate) and to constrain linguistic interpretations in context suggests that the mechanism for speech understanding needs to be plastic. In other words, speech recognition should inherently demonstrate learning.

Learning mechanisms in speech

While on its face this seems uncontroversial, theories of speech perception have not traditionally incorporated learning although some have evolved over time to do so (e.g., Shortlist-B, Hebb-Trace). Indeed, there remains some disagreement about the plasticity of speech processing in adults. One issue is how the long-term memory structures that guide speech processing are modified to allow for this plasticity while at the same time maintaining and protecting previously learned information from being expunged. This is especially important as often newly acquired information may represent irrelevant information to the system in a long-term sense (Carpenter and Grossberg, 1988 ; Born and Wilhelm, 2012 ).

To overcome this problem, researchers have proposed various mechanistic accounts, and while there is no consensus amongst them, a hallmark characteristic of these accounts is that learning occurs in two stages. In the first stage, the memory system is able to use fast learning temporary storage to achieve adaptability, and in a subsequent stage, during an offline period such as sleep, this information is consolidated into long-term memory structures if the information is found to be germane (Marr, 1971 ; McClelland et al., 1995 ; Ashby et al., 2007 ). While this is a general cognitive approach to the formation of categories for recognition, this kind of mechanism does not figure into general thinking about speech recognition theories. The focus of these theories is less on the formation of category representations and the need for plasticity during recognition, than it is on the stability and structure of the categories (e.g., phonemes) to be recognized. Theories of speech perception often avoid the plasticity-stability trade off problem by proposing that the basic categories of speech are established early in life, tuned by exposure, and subsequently only operate as a passive detection system (e.g., Abbs and Sussman, 1971 ; Fodor, 1983 ; McClelland and Elman, 1986 ; although see Mirman et al., 2006b ). According to these kinds of theories, early exposure to a system of speech input has important effects on speech processing.

Given the importance of early exposure for establishing the phonological system, there is no controversy regarding the significance of linguistic experience in shaping an individual’s ability to discriminate and identify speech sounds (Lisker and Abramson, 1964 ; Strange and Jenkins, 1978 ; Werker and Tees, 1984 ; Werker and Polka, 1993 ). An often-used example of this is found in how infants’ perceptual abilities change via exposure to their native language. At birth, infants are able to discriminate a wide range of speech sounds whether present or not in their native language (Werker and Tees, 1984 ). However, as a result of early linguistic exposure and experience, infants gain sensitivity to phonetic contrasts to which they are exposed and eventually lose sensitivity for phonetic contrasts that are not experienced (Werker and Tees, 1983 ). Additionally, older children continue to show developmental changes in perceptual sensitivity to acoustic-phonetic patterns (e.g., Nittrouer and Miller, 1997 ; Nittrouer and Lowenstein, 2007 ) suggesting that learning a phonology is not simply a matter of acquiring a simple set of mappings between the acoustic patterns of speech and the sound categories of language. Further, this perceptual learning does not end with childhood as it is quite clear that even adult listeners are capable of learning new phonetic distinctions not present in their native language (Werker and Logan, 1985 ; Pisoni et al., 1994 ; Francis and Nusbaum, 2002 ; Lim and Holt, 2011 ).

A large body of research has now established that adult listeners can learn a variety of new phonetic contrasts from outside their native language. Adults are able to learn to split a single native phonological category into two functional categories, as when native English speakers learn Thai pre-voicing (Pisoni et al., 1982), and to learn completely novel categories, such as Zulu clicks for English speakers (Best et al., 1988). Moreover, adults can completely change the way they attend to cues; for example, Japanese speakers are able to learn the English /r/-/l/ distinction, a contrast not present in their native language (e.g., Logan et al., 1991; Yamada and Tohkura, 1992; Lively et al., 1993). While learning is limited, Francis and Nusbaum (2002) demonstrated that, given appropriate feedback, listeners can learn to direct perceptual attention to acoustic cues that were not previously used to form phonetic distinctions in their native language. In their study, learning new categories was manifest as a change in the structure of the acoustic-phonetic space wherein individuals shifted from the use of one perceptual dimension (e.g., voicing) to a complex of two perceptual dimensions, enabling native English speakers to correctly perceive Korean stops after training. How can we describe this change? What is the mechanism by which this change in perceptual processing occurs?

From one perspective this change in perceptual processing can be described as a shift in attention (Nusbaum and Schwab, 1986 ). Auditory receptive fields may be tuned (e.g., Cruikshank and Weinberger, 1996 ; Weinberger, 1998 ; Wehr and Zador, 2003 ; Znamenskiy and Zador, 2013 ) or reshaped as a function of appropriate feedback (cf. Moran and Desimone, 1985 ) or context (Asari and Zador, 2009 ). This is consistent with theories of category learning (e.g., Schyns et al., 1998 ) in which category structures are related to corresponding sensory patterns (Francis et al., 2007 , 2008 ). From another perspective this adaptation process could be described as the same kind of cue weighting observed in the development of phonetic categories (e.g., Nittrouer and Miller, 1997 ; Nittrouer and Lowenstein, 2007 ). Yamada and Tohkura ( 1992 ) describe native Japanese listeners as typically directing attention to acoustic properties of /r/-/l/ stimuli that are not the dimensions used by English speakers, and as such are not able to discriminate between these categories. This misdirection of attention occurs because these patterns are not differentiated functionally in Japanese as they are in English. For this reason, Japanese and English listeners distribute attention in the acoustic pattern space for /r/ and /l/ differently as determined by the phonological function of this space in their respective languages. Perceptual learning of these categories by Japanese listeners suggests a shift of attention to the English phonetically relevant cues.

This idea of shifting attention among possible cues to categories is part and parcel of a number of theories of categorization that are not at all specific to speech perception (e.g., Gibson, 1969 ; Nosofsky, 1986 ; Goldstone, 1998 ; Goldstone and Kersten, 2003 ) but have been incorporated into some theories of speech perception (e.g., Jusczyk, 1993 ). Recently, McMurray and Jongman ( 2011 ) proposed the C-Cure model of phoneme classification in which the relative importance of cues varies with context, although the model does not specify a mechanism by which such plasticity is implemented neurally.

One issue to consider in examining the paradigm of training non-native phonetic contrasts is that adult listeners bring an intact and complete native phonological system to bear on any new phonetic category-learning problem. This pre-existing phonological knowledge about the sound structure of a native language operates as a critical mass of an acoustic-phonetic system with which a new category likely does not mesh (Nusbaum and Lee, 1992 ). New contrasts can re-parse the acoustic cue space into categories that are at odds with the native system, can be based on cues that are entirely outside the system (e.g., clicks), or can completely remap native acoustic properties into new categories (see Best et al., 2001 ). In all these cases however listeners need to not only learn the pattern information that corresponds to these categories, but additionally learn the categories themselves. In most studies participants do not actually learn a completely new phonological system that exhibits an internal structure capable of supporting the acquisition of new categories, but instead learn isolated contrasts that are not part of their native system. Thus, learning non-native phonological contrasts requires individuals to learn both new category structures, as well as how to direct attention to the acoustic cues that define those categories without colliding with extant categories.

How do listeners accommodate the signal changes encountered on a daily basis in listening to speech? Echo and reverberation can distort speech. Talkers speak while eating. Accents can change the acoustic to percept mappings based on the articulatory phonetics of a native language. While some of the distortions in signals can probably be handled by some simple filtering in the auditory system, more complex signal changes that are systematic cannot be handled in this way. The use of filtering as a solution for speech signal distortion assumes a model of speech perception whereby a set of acoustic-phonetic representations (whether talker-specific or not) are obscured by some distortion and that some simple acoustic transform (like amplification or time-dilation) is used to restore the signal.

An alternative to this view was proposed by Elman and McClelland (1986). They suggested that the listener can use systematicity in distortions of acoustic patterns as information about the sources of variability that affected the signal under the conditions in which the speech was produced. This idea, that systematic variability in the acoustic patterns of phonetic categories provides information about the intended phonetic message, suggests that even without learning new phonetic categories or contrasts, learning the sources and structure of acoustic-phonetic variability may be a fundamental aspect of speech perception. Nygaard et al. (1994) and Nygaard and Pisoni (1998) demonstrated that listeners who learn the speech of talkers using the same phonetic categories as their own show significant improvements in speech recognition. Additionally, Dorman et al. (1977) elegantly demonstrated that different talkers speaking the same language can use different acoustic cues to make the same phonetic contrasts. In these situations, listeners must learn to direct attention to the specific cues used by a particular talker in order to improve speech recognition. In essence, this suggests that learning may be an intrinsic part of speech perception rather than something added on. Phonetic categories must remain plastic even in adults in order to respond flexibly to the changing demands of the lack-of-invariance problem across talkers and contexts of speaking.

One way of investigating those aspects of learning that are specific to directing attention to appropriate and meaningful acoustic cues, without additionally having individuals learn new phonetic categories or a new phonological system, is to examine how listeners adapt to synthetic speech that uses their own native phonological categories. Synthetic speech generated by rule is “defective” in relation to natural speech in that it oversimplifies the acoustic pattern structure (e.g., fewer cues, less cue covariation) and some cues may actually be misleading (Nusbaum and Pisoni, 1985). Learning synthetic speech requires listeners to learn how acoustic information, produced by a particular talker, is used to define the speech categories the listener already possesses. In order to do this, listeners need to make use of degraded, sparse, and often misleading acoustic information, which contributes to the poor intelligibility of synthesized speech. Given that such cues are not available to awareness, and that most such learning is presumed to occur early in life, it is difficult to understand how adult listeners could do this at all. In fact, it is this ability to rapidly learn synthetic speech that led Nusbaum and Schwab (1986) to conclude that speech perception must be guided by active control processes.

Generalization Learning

In a study reported by Schwab et al. (1985), listeners were trained on synthetic speech for 8 days with feedback and were tested before and after training. Before training, recognition was about 20% correct, but it improved to about 70% correct after training. More impressively, this learning occurred even though listeners were never trained or tested on the same words twice, meaning that individuals had not simply learned the specific items they were trained on but instead gained generalized knowledge about the synthetic speech. Additionally, Schwab et al. (1985) demonstrated that listeners substantially retain this generalized knowledge without any additional exposure to the synthesizer, as listeners showed similar performance 6 months later. This suggests that even without hearing the same words over and over again, listeners were able to change the way they used acoustic cues at a sublexical level. In turn, listeners used this sublexical information to drive recognition in completely novel lexical contexts. This is far different from simply memorizing the specific and complete acoustic patterns of particular words; instead it could reflect a kind of procedural knowledge of how to direct attention to the speech of the synthetic talker.

This initial study demonstrated clear generalization beyond the specific patterns heard during training. However, on its own it gives little insight into the way such generalization emerges. In a subsequent study, Greenspan et al. (1988) expanded on this by examining how acoustic-phonetic variability in training affects the generalization of speech learning in adult listeners. Listeners were trained on either repeated words or novel words. When listeners memorize the specific acoustic patterns of spoken words, recognition performance for those words is very good, but this does not afford the same level of perceptual generalization that is produced by highly variable training experiences. This is akin to the benefits of training variability seen in motor learning when generalization of a motor behavior is desired (e.g., Lametti and Ostry, 2010; Mattar and Ostry, 2010; Coelho et al., 2012). Given that training set variability modulates the type of learning, adult perceptual learning of spoken words cannot be seen as simply a rote process. Moreover, even a small amount of repeated and focused rote training produces some reliable generalization, indicating that listeners can use even restricted variability in learning to go beyond the training examples (Greenspan et al., 1988). Listeners may infer this generalized information from the training stimuli, or they might develop a more abstract representation of sound patterns based on variability in experience and apply this knowledge to novel speech patterns in novel contexts.

Synthetic speech produced by rule, as learned in those studies, represents a complete model of speech production, from orthographic to phonetic to acoustic generation. The speech that is produced is recognizable, but it is artificial. Learning this kind of speech is thus tantamount to learning a strange idiolect that contains acoustic-phonetic errors, is missing acoustic cues, and does not possess correct cue covariation. However, if listeners learn this speech by gleaning the new acoustic-phonetic properties of this kind of talker, it makes sense that listeners should be able to learn other kinds of speech as well. This is particularly true if learning is accomplished by changing the way listeners attend to the acoustic properties of speech, focusing on those that are most phonetically diagnostic. Indeed, beyond being able to learn synthesized speech in this fashion, adults have been shown to adapt quickly to a variety of other forms of distorted speech in which the distortion initially reduces intelligibility, such as simulated cochlear implant speech (Shannon et al., 1995), spectrally shifted speech (Rosen et al., 1999), and foreign-accented speech (Weil, 2001; Clarke and Garrett, 2004; Bradlow and Bent, 2008; Sidaras et al., 2009). In these studies, listeners learn speech that has been produced naturally, with coarticulation and the full range of acoustic-phonetic structure; however, the speech signal deviates from listener expectations because of a transform of some kind, introduced either through signal processing or through phonological changes in speaking. Different signal transforms may distort or mask certain cues, and phonological changes may alter the structure of cue complexes. Unlike the distortions of synthetic speech, however, these transforms tend to be uniform across the phonological inventory. This provides listeners with a kind of lawful variability (as described by Elman and McClelland, 1986) that can be exploited as an aid to recognition. Given that in all these speech distortions listeners showed a robust ability to apply what they learned during training to novel words and contexts, learning does not appear to be simply a matter of understanding what specific acoustic cues mean, but rather of understanding which acoustic cues are most relevant for a given source and how to attend to them (Nusbaum and Lee, 1992; Nygaard et al., 1994; Francis and Nusbaum, 2002).
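
For readers unfamiliar with these manipulations, the sketch below shows a minimal noise vocoder in the spirit of the cochlear implant simulations of Shannon et al. (1995): the signal is split into bands, each band's amplitude envelope is extracted, and the envelopes modulate band-limited noise, preserving temporal structure while degrading spectral detail. The band edges, filter order, and envelope method are arbitrary illustrative choices, not the parameters used in the studies above.

```python
# A minimal noise-vocoder sketch (illustrative parameters, not those of
# Shannon et al., 1995): envelope-modulated noise in a few analysis bands.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, band_edges_hz=(100, 500, 1500, 4000)):
    rng = np.random.default_rng(0)
    out = np.zeros_like(x, dtype=float)
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)                 # analysis band of the input
        envelope = np.abs(hilbert(band))           # amplitude envelope of the band
        carrier = sosfiltfilt(sos, rng.standard_normal(len(x)))
        out += envelope * carrier                  # envelope-modulated noise
    return out / (np.max(np.abs(out)) + 1e-12)    # normalize peak amplitude

fs = 16000
t = np.arange(fs) / fs
toy_signal = np.sin(2 * np.pi * 200 * t) * (1 + 0.5 * np.sin(2 * np.pi * 3 * t))
vocoded = noise_vocode(toy_signal, fs)   # envelopes kept, fine structure lost
```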

How do individuals come to learn which acoustic cues are most diagnostic for a given source? One possibility is that acoustic cues are mapped to their perceptual counterparts in an unguided fashion, that is, without regard for the systematicity of native acoustic-phonetic experience. Conversely, individuals may rely on their native phonological system to guide the learning process. To examine whether perceptual learning is influenced by an individual’s native phonological experience, Davis et al. (2005) examined whether perceptual learning was more robust when individuals were trained on words rather than non-words. Their rationale was that if training on words led to better perceptual learning than training on non-words, then one could conclude that the acoustic-to-phonetic remapping process is guided or structured by information at the lexical level. Indeed, Davis et al. (2005) showed that training was more effective when the stimuli consisted of words, indicating that lexical information allows individuals to use their knowledge of how sounds are related in their native phonological system to guide the perceptual learning process. The idea that perceptual learning in speech is driven to some extent by lexical knowledge is consistent with both autonomous (e.g., Shortlist: Norris, 1994; Merge: Norris et al., 2000; Shortlist B: Norris and McQueen, 2008) and interactive (e.g., TRACE: McClelland and Elman, 1986; Hebb-TRACE: Mirman et al., 2006a) models of speech perception (although whether learning can successfully operate in these models is a different question altogether). A subsequent study by Dahan and Mead (2010) examined the structure of the learning process further by asking how more localized or recent experience, such as the specific contrasts present during training, may organize and determine subsequent learning. To do this, Dahan and Mead (2010) systematically controlled the relationship between training and test stimuli as individuals learned to understand noise-vocoded speech. Their logic was that if localized or recent experience organizes learning, the phonemic contrasts present during training should provide such a structure. Their results confirmed that learning was directly related to the local phonetic context of training: consonants were recognized better at test if they had been heard in a similar syllable position or vocalic context during training than if they had been heard in a dissimilar context.

This is unsurprising, as the acoustic realization of a given consonant can differ dramatically depending on the position of the consonant within a syllable (Sproat and Fujimura, 1993; Browman and Goldstein, 1995). Further, coarticulation means that the acoustic characteristics of a consonant are heavily modified by the phonetic context in which it occurs (Liberman et al., 1954; Warren and Marslen-Wilson, 1987; Whalen, 1991). In this sense, the acoustic properties of speech are not dissociable beads on a string; the linguistic context of a phoneme is very much a part of its acoustic definition. While experience during training does appear to be the major factor underlying learning, individuals also show transfer of learning to phonemes that were not presented during training, provided they were perceptually similar to the phonemes that were present. This is consistent with a substantial body of speech research using perceptual contrast procedures showing that there are representations for speech sounds both at the level of allophonic or acoustic-phonetic specification and at a more abstract phonological level (e.g., Sawusch and Jusczyk, 1981; Sawusch and Nusbaum, 1983; Hasson et al., 2007). Taken together, the Dahan and Mead (2010) and Davis et al. (2005) studies provide clear evidence that prior experience, such as knowledge of one’s native phonological system, as well as more localized experience relating to the occurrence of specific contrasts in a training set, helps to guide the perceptual learning process.

What is the nature of the mechanism underlying the perceptual learning process that leads to better recognition after training? To examine whether training shifts attention to phonetically meaningful cues and away from misleading cues, Francis et al. (2000) trained listeners on CV syllables containing /b/, /d/, or /g/ cued by a chimeric acoustic structure containing either consistent or conflicting properties. The CV syllables were constructed such that the place of articulation was specified by the spectrum of the burst (Blumstein and Stevens, 1980) as well as by the formant transitions from the consonant to the vowel (e.g., Liberman et al., 1967). For some chimeric CVs, however, the spectrum of the burst indicated a different place of articulation than the transition cue. Walley and Carrell (1983) had previously demonstrated that listeners tend to identify place of articulation based on transition information rather than the spectrum of the burst when these cues conflict. And of course listeners never consciously hear either of these as separate signals; they simply hear a consonant at a particular place of articulation. Given that listeners cannot consciously identify the acoustic cues that define the place of articulation and experience only the categorical identity of the consonant itself, it seems hard to understand how attention can be directed toward these cues.

Francis et al. (2000) trained listeners to recognize the chimeric speech in their experiment by providing feedback about consonant identity that was consistent with either the burst cues or the transition cues, depending on the training group. When listeners heard a CV and identified it as B, D, or G, they would receive feedback following identification. For a chimeric consonant cued with a labial burst and an alveolar transition pattern, regardless of whether a burst-trained listener identified the consonant as B (correct for that group) or as another place, after identification they would hear the CV again and see printed feedback identifying the consonant as B. In other words, burst-trained listeners received feedback consistent with the spectrum of the burst, whereas transition-trained listeners received feedback consistent with the pattern of the transitions. The results showed that cue-based feedback shifted identification performance over training trials: listeners learned to use the specific cue (either the transitions or the spectral burst) that was consistent with the feedback, and this learning generalized to novel stimuli. This kind of learning research (also Francis and Nusbaum, 2002; Francis et al., 2007) suggests that shifting attention may serve to restructure perceptual space as a result of appropriate feedback.
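
The logic of such feedback-driven reweighting can be captured in a toy simulation. The following sketch is our own illustration under strong simplifying assumptions (binary cues, a delta-rule update, made-up parameters), not the model or stimuli of Francis et al. (2000); it shows only how feedback correlated with one cue pulls that cue's weight upward.

```python
# Toy sketch of feedback-driven cue reweighting (illustrative assumptions;
# not the design or model of Francis et al., 2000).
import random

def train(feedback_cue, n_trials=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    w = {"burst": 0.5, "transition": 0.5}            # attention weights per cue
    for _ in range(n_trials):
        # Chimeric stimulus: each cue independently signals /b/ or /d/.
        votes = {cue: rng.choice(["b", "d"]) for cue in w}
        correct = votes[feedback_cue]                # feedback tracks one cue only
        for cue in w:                                # credit cues that agreed
            target = 1.0 if votes[cue] == correct else 0.0
            w[cue] += lr * (target - w[cue])
    return w

print(train("burst"))        # burst weight -> ~1.0; transition stays near 0.5
print(train("transition"))   # transition weight -> ~1.0; burst stays near 0.5
```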

Although the standard view of speech perception does not explicitly incorporate learning mechanisms, this is in part because of a very static view of speech recognition whereby stimulus patterns are simply mapped onto phonological categories during recognition, with learning occurring, if it does, afterwards. These theories never directly solve the lack-of-invariance problem, given a fundamentally deterministic computational process in which input states (whether acoustic or articulatory) must correspond uniquely to perceptual states (phonological categories). An alternative is to consider speech perception as an active process in which alternative phonetic interpretations are activated, each corresponding to a particular input pattern from speech (Nusbaum and Schwab, 1986). These alternatives must then be reduced to the recognized form, possibly by testing them as hypotheses, shifting attention among different aspects of context, knowledge, or cues to find the best constraints. This view suggests that when there is a one-to-many mapping, whether due to speech-rate variability (Francis and Nusbaum, 1996) or talker variability (Nusbaum and Morin, 1992), cognitive load on the listener should increase until a shift of attention to more diagnostic information occurs. Variation in talker, speaking rate, or distortion can change the way attention is directed at a particular source of speech, shifting attention toward the most diagnostic cues and away from misleading ones. This suggests a direct link between attention and learning, with the load on working memory reflecting the uncertainty of recognition given a one-to-many mapping of acoustic cues to phonemes.
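
A toy example may make the link between one-to-many mapping and load concrete. In the sketch below, all values and talker labels are invented for illustration: the same cue value is consistent with several phonetic hypotheses across talkers, and the set of active hypotheses, a stand-in for working memory load, shrinks once talker context is supplied.

```python
# Toy illustration of a one-to-many cue-to-phoneme mapping; all values invented.
TALKER_VOWEL_F1 = {                  # hypothetical talker-specific F1 targets (Hz)
    "talker_low":  {"i": 300, "e": 450, "a": 700},
    "talker_high": {"i": 450, "e": 600, "a": 850},
}

def active_hypotheses(cue_hz, talker=None, tol=60):
    """Vowel hypotheses whose target F1 lies within `tol` Hz of the cue."""
    talkers = [talker] if talker else list(TALKER_VOWEL_F1)
    return {v for t in talkers
              for v, f1 in TALKER_VOWEL_F1[t].items()
              if abs(f1 - cue_hz) <= tol}

print(active_hypotheses(450.0))                  # {'i', 'e'}: ambiguous, higher load
print(active_hypotheses(450.0, "talker_low"))    # {'e'}: context resolves the mapping
```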

If a one-to-many mapping increases the load on working memory because of active alternative phonetic hypotheses, and learning shifts attention to more phonetically diagnostic cues, then learning to perceive synthetic speech should reduce the load on working memory: focusing attention on the diagnostic cues should reduce the number of phonetic hypotheses. Moreover, this should not simply be a result of improved intelligibility, as increasing speech intelligibility without training should not have the same effect. To investigate this, Francis and Nusbaum (2009) used a speeded spoken-target monitoring procedure and manipulated memory load to see whether the effect of such a manipulation would change as a function of learning synthetic speech. The logic of the study was that explicitly varying a working memory load should affect recognition speed if working memory plays a role in recognition. Before training, working memory should carry a higher load than after training, so working memory load and training should interact in recognition time (cf. Navon, 1984). When the working memory load extrinsic to the speech task is high, less working memory should be available for recognition; when the extrinsic load is low, more should be available. Training should therefore interact with working memory load, showing a larger improvement in recognition time in the low-load case than in the high-load case. Of course, if speech is directly mapped from acoustic cues to phonetic categories, there is no reason to predict a working memory load effect, and certainly no interaction with training. The results, however, demonstrated a clear interaction of working memory load and training, as predicted by the use of working memory and attention (Francis and Nusbaum, 2009). These results support the view that training reorganizes perception, shifting attention to more informative cues and allowing working memory to be used more efficiently and effectively. This has implications for older adults who suffer from hearing loss: if individuals recruit additional cognitive and perceptual resources to ameliorate sensory deficits, they will lack the resources needed to cope with situations in which talker or speaking-rate variability increases. In fact, Peelle and Wingfield (2005) report that while older adults can adapt to time-compressed speech, they are unable to transfer learning from one speech rate to a second speech rate.

Mechanisms of Memory

Changes in the allocation of attention and the demands on working memory are likely related to substantial modifications of category structures in long-term memory (Nosofsky, 1986; Ashby and Maddox, 2005). Effects of training on synthetic speech have been shown to be retained for 6 months, suggesting that the categorization structures in long-term memory that guide perception have been altered (Schwab et al., 1985). How are these category structures (Schyns et al., 1998) modified? McClelland and Rumelhart (1985) and McClelland et al. (1995) have proposed a neural cognitive model that explains how individuals are able to adapt to new information in their environment. According to their model, specific memory traces are initially encoded during learning via a fast-learning, hippocampally based memory system. Then, via a process of repeated reactivation or rehearsal, memory traces are strengthened and ultimately represented solely in the neocortical memory system. One of the main benefits of McClelland’s model is that it explains how previously learned information is protected against newly acquired information that may be irrelevant for long-term use. In their model, the hippocampal memory system acts as temporary storage where fast learning occurs, while the neocortical memory system, which houses the long-term memory categories that guide perception, is modified later, presumably offline when there are no encoding demands on the system. This allows the representational system to remain adaptive without losing representational stability, as only memory traces that are significant to the system will be strengthened and rehearsed. This kind of two-stage model of memory is consistent with a large body of memory data, although the role of the hippocampus outlined in this model is somewhat different from that in other theories of memory (e.g., Eichenbaum et al., 1992; Wood et al., 1999, 2000).
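
The division of labor in this two-stage account can be sketched schematically. In the sketch below, the class names, learning rate, and replay count are our illustrative assumptions, not parameters of McClelland et al. (1995); it shows only the qualitative contrast between one-shot trace storage and slow, replay-driven cortical change.

```python
# Schematic two-stage memory: fast trace storage plus slow, replay-driven
# consolidation (illustrative assumptions; not the CLS implementation).
import numpy as np

class TwoStageMemory:
    def __init__(self, dim, slow_lr=0.01):
        self.traces = []                       # fast "hippocampal" episodic store
        self.cortical_w = np.zeros(dim)        # slow "neocortical" representation
        self.slow_lr = slow_lr

    def encode(self, pattern):
        """Fast learning: store the specific trace in one shot."""
        self.traces.append(np.asarray(pattern, dtype=float))

    def consolidate(self, n_replays=200, seed=0):
        """Offline replay: many small, interleaved cortical updates."""
        rng = np.random.default_rng(seed)
        for _ in range(n_replays):
            trace = self.traces[rng.integers(len(self.traces))]
            self.cortical_w += self.slow_lr * (trace - self.cortical_w)

mem = TwoStageMemory(dim=3)
for p in ([1.0, 0.0, 1.0], [0.9, 0.1, 1.1], [1.1, -0.1, 0.9]):
    mem.encode(p)                 # rapid, trace-specific encoding
mem.consolidate()
print(mem.cortical_w)             # drifts toward the traces' central tendency
```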

Ashby et al. (2007) have also posited a two-stage model of category learning, but they implement the basis for the two stages, as well as their function in category formation, very differently. They suggest that the basal ganglia and the thalamus, rather than the hippocampus, together mediate the development of more permanent neocortical memory structures. In their model, the striatum, globus pallidus, and thalamus comprise the fast-learning temporary memory system. This subcortical circuit has greater adaptability owing to the dopamine-mediated learning that can occur in the basal ganglia, while representations in the neocortical circuit are much slower to change because they rely solely on Hebbian learning to be amended.

McClelland’s neural model relies on the hippocampal memory system as a substrate to support the development of long-term memory structures in neocortex. Hippocampal memories thus consist of recent specific experiences, or rote memory traces, that are encoded during training. In this sense, the hippocampal memory circuit supports the longer-term reorganization or consolidation of declarative memories. In contrast, in the basal-ganglia-based model of learning put forth by Ashby, a striatum-to-thalamus circuit provides the foundation for the development of consolidation in cortical circuits. This is seen as a progression from a slow, hypothesis-testing system to a faster-processing, implicit memory system; the striatum-to-thalamus circuit therefore mediates the reorganization or consolidation of procedural memories. As evidence for this, Ashby et al. (2007) use information-integration categorization tasks, in which the rules that govern the categories to be learned are not easily verbalized. In these tasks, the learner is required to integrate information from two or more dimensions at some pre-decisional stage. The logic is that information-integration tasks use the dopamine-mediated reward signals afforded by the basal ganglia. In contrast, in rule-based categorization tasks the categories to be learned are explicitly verbally definable, and thus rely on conscious hypothesis generation and testing. Such explicit category learning is thought (Ashby et al., 2007) to be mediated by the anterior cingulate and the prefrontal cortex. For this reason, demands on working memory and executive attention are hypothesized to affect only the learning of explicit, rule-based categories and not implicit procedural categories, as working memory and executive attention are processes largely governed by the prefrontal cortex (Kane and Engle, 2000).
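
The difference between the two task types is easy to see in miniature. In the sketch below, the stimulus dimensions and boundaries are invented for illustration: the rule-based category has a verbalizable single-dimension rule, while the information-integration category depends on combining dimensions before the decision.

```python
# Illustrative rule-based vs. information-integration category structures
# (made-up dimensions and boundaries, in the spirit of Ashby et al., 2007).
import numpy as np

rng = np.random.default_rng(1)
stimuli = rng.uniform(0, 1, size=(6, 2))       # e.g., (orientation, spatial frequency)

rule_based = stimuli[:, 0] > 0.5               # verbalizable: "A if dimension 1 is high"
info_integration = stimuli.sum(axis=1) > 1.0   # requires combining both dimensions

for s, rb, ii in zip(stimuli, rule_based, info_integration):
    print(np.round(s, 2), "rule-based:", "A" if rb else "B",
          "| info-integration:", "A" if ii else "B")
```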

The differences between McClelland’s and Ashby’s models appear to be related in part to the distinction between declarative and procedural learning. While it is certainly reasonable to divide memory in this way, it is unarguable that both types of memory involve encoding and consolidation. And while it may be that declarative and procedural memories operate through different systems, this seems unlikely given data implicating the hippocampus in procedural learning (Chun and Phelps, 1999) even when learning is not a verbalizable, explicit, rule-based process. Elements of the theoretical assumptions of both models seem open to criticism in one way or another. But both models make explicit a process by which rapidly learned, short-term memories can be consolidated into more stable forms. It is therefore important to consider such models in trying to understand the process by which stable memories are formed as the foundation of phonological knowledge in speech perception.

As noted previously, speech appears to have separate representations for specific acoustic patterns as well as for more abstract phonological categories (e.g., Sawusch and Jusczyk, 1981; Sawusch and Nusbaum, 1983; Hasson et al., 2007). Learning appears to occur at both levels as well (Greenspan et al., 1988), underscoring the importance of memory theories that differentiate short-term from long-term representations and stimulus-specific traces from more abstract representations. It is widely accepted that any experience may be represented across various levels of abstraction. For example, while only specific memory traces are encoded in many connectionist models (e.g., McClelland and Rumelhart’s 1985 model), various levels of abstraction can be achieved in the retrieval process depending on the goals of the task. This is in fact the foundation of Goldinger’s (1998) echoic trace model, based on Hintzman’s (1984) MINERVA 2 model: specific auditory representations of the acoustic pattern of a spoken word are encoded into memory, and abstractions are derived during the retrieval process using working memory.
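
The retrieval computation at the heart of MINERVA 2 is simple enough to sketch directly. Following Hintzman (1984), every specific trace is stored; a probe activates each trace in proportion to the cube of its similarity, and the activation-weighted blend of traces (the "echo") is where abstraction appears. The feature vectors below and the simplified similarity normalization are illustrative assumptions.

```python
# Sketch of MINERVA 2-style retrieval (after Hintzman, 1984); feature
# vectors and the similarity normalization are simplified for illustration.
import numpy as np

def echo(probe, traces):
    probe = np.asarray(probe, dtype=float)
    traces = np.asarray(traces, dtype=float)
    sims = traces @ probe / probe.size     # similarity of probe to each trace
    activations = sims ** 3                # cubing sharpens retrieval
    return activations @ traces            # activation-weighted blend of traces

# Three stored episodes of the "same word" with idiosyncratic variation.
traces = [[1, 1, -1, 1],
          [1, 1, -1, -1],
          [1, -1, -1, 1]]
print(echo([1, 1, -1, 0], traces))
# The echo emphasizes structure shared across traces; the feature on which the
# strongly activated traces disagree (the 4th) is nearly cancelled. Abstraction
# is thus derived at retrieval rather than stored.
```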

In contrast to these trace-abstraction models is another possibility wherein stimulus-specific and abstracted information are both stored in memory. For example, an acoustic pattern description of speech and a phonological category description are represented separately in memory in the TRACE model (McClelland and Elman, 1986; Mirman et al., 2006a). In this respect, the acoustic patterns of speech, as particular representations of a specific perceptual experience, are very much like the echoic traces of Goldinger’s model. However, where Goldinger argued against forming and storing abstract representations, others have suggested that such abstractions may in fact be formed and stored in the lexicon (see Luce et al., 2003; Ju and Luce, 2006). Indeed, Hasson et al. (2007) demonstrated repetition suppression effects specific to the abstract phonological representation of speech sounds, given that the effect held between an illusory syllable /ta/ and a physical syllable /ta/ and was based on a network spanning sensory and motor cortex. Such abstractions are unlikely to be simply an assemblage of prior sensory traces, given that the brain areas involved are not those typically activated in recognizing those traces. In this way, memory can be theoretically divided into rote representational structures consisting of specific experienced items and more generalized structures consisting of abstracted information. Rote memories are advantageous for precise recall of already-experienced stimuli, whereas generalized memory favors performance across a larger span of stimuli in novel contexts.

This distinction between rote and generalized representations cuts across the distinction between procedural and declarative memory. Both declarative and procedural memories may be encoded as either rote or generalized representational structures. For example, an individual may be trained to press a specific sequence of keys on a keyboard. This would lead to the development of a rote representational memory structure, allowing the individual to improve his or her performance on that specific sequence. Alternatively, the individual may be trained to press several different sequences of keys. This difference in training would lead to the development of a more generalized memory structure, resulting in better performance on both experienced and novel key sequences. Similarly, declarative memories may be encoded as either rote or generalized structures, as a given declarative memory structure may consist either of the specific experienced instances of a particular stimulus, as in a typical episodic memory experiment, or of the “gist” of the experienced instances, as in the formation of semantic memories or possibly illusory memories based on associations (see Gallo, 2006).

The distinction between rote and generalized or abstracted memory representations becomes important when considering the way in which memories become stabilized through consolidation. For perceptual learning of speech, two aspects are critical. First, given the generativity of language and the context-sensitive nature of acoustic-phonetics, listeners are not going to hear the same utterances again and again; further, the acoustic pattern variation in repeated utterances, even if they occurred, would be immense due to changes in linguistic context, speaking rate, and talkers. This makes rote memorization of acoustic patterns untenable as a speech recognition system. Listeners either have to be able to generalize in real time from prior auditory experiences (as suggested by Goldinger, 1998) or there must be more abstract representations that go beyond the specific sensory patterns of any particular utterance (as suggested by Hasson et al., 2007). The second consideration is that any generalizations in speech perception must be made quickly and must remain stable in order to be useful. As demonstrated by Greenspan et al. (1988), even learning a small number of spoken words from a particular speech synthesizer will produce some generalization to novel utterances, although increasing the variability of experience increases the amount of generalization.

The separation between rote and generalization learning is further demonstrated by the effects of sleep consolidation on the stability of memories. In the original synthetic speech learning study by Schwab et al. (1985), listeners demonstrated significant learning despite never hearing the same words twice. Moreover, this generalization learning lasted for roughly 6 months without subsequent training. This demonstrates that high variability in training examples, with appropriate feedback, can produce large improvements in generalized performance that remain robust and stable for a long time. Fenn et al. (2003) demonstrated that this stability is a consequence of sleep consolidation of learning. In addition, when some forgetting takes place over the course of the day following learning, sleep restores the forgotten memories. This may well be because sleep separately consolidates both the initial learning and any interference that occurs following learning (Brawn et al., 2013). Furthermore, Fenn and Hambrick (2012) have demonstrated that the effectiveness of sleep consolidation is related to individual differences in working memory, such that higher working memory performance is associated with better consolidation. This links the effectiveness of sleep consolidation to a mechanism closely tied to active processing in speech perception. Most recently, Fenn et al. (2013) found that sleep operates differently on rote and generalized learning.

These findings have several implications for therapy with listeners with hearing loss. First, training and testing should be separated by a period of sleep in order to measure the amount of learning that is stable. Second, although variability in training experiences seems to produce slower rates of learning, it produces greater generalization. Third, measurements of working memory can give a rough guide to the relative effectiveness of sleep consolidation, thereby indicating how vulnerable learning may be to interference and suggesting that training may need to be more prolonged for people with lower working memory capacity.

Theories of speech perception have often conceptualized the earliest stages of auditory processing of speech as independent of higher-level linguistic and cognitive processing. In many respects this kind of approach (e.g., in Shortlist B) treats the phonetic processing of auditory inputs as a passive system in which acoustic patterns are directly mapped onto phonetic features or categories, albeit with some distribution of performance. Such theories treat the distributions of input phonetic properties as relatively immutable. However, our argument is that even early auditory processes are subject to descending attentional control and active processing. Just as echolocation in the bat is explained by a corticofugal system in which cortical and subcortical structures process contemporaneously and interactively (Suga, 2008), descending projections from cortex to thalamus and to the cochlea provide a neural substrate for cortical tuning of auditory inputs. Descending projections from the lateral olivary complex to the inner hair cells and from the medial olivary complex to the outer hair cells provide a potential basis for changing auditory encoding in real time as a result of shifts of attention. This kind of mechanism could support effects such as the increased fidelity of the auditory brainstem response to acoustic input following training (Strait et al., 2010).

Understanding speech perception as an active process suggests that learning or plasticity is not simply a higher-level process grafted on top of word recognition. Rather, the kinds of mechanisms involved in shifting attention to relevant acoustic cues for phoneme perception (e.g., Francis et al., 2000, 2007) are needed for tuning speech perception to the specific vocal characteristics of a new speaker, or for coping with distortion of speech or noise in the environment. Given that such plasticity is linked to attention and working memory, we argue that speech perception is inherently a cognitive process, even at the level of sensory encoding. This has implications for remediation of hearing loss, whether with augmentative aids or therapy. First, understanding a listener’s cognitive abilities (e.g., working memory capacity, attentional control) may guide the design of a training program, for example by providing different kinds of correlated sensory cues or by reducing the cognitive demands of training. Second, increasing sensory variability, within the limits of individual tolerance, should be part of a therapeutic program. Third, understanding participants’ sleep practices, through sleep logs and records of drug and alcohol consumption and exercise, is important because these factors bear on the consolidation of learning. If speech perception is continuously plastic, but with limitations based on prior experience and cognitive capacities, this shapes the basic nature of hearing loss remediation in a number of different ways.

Finally, we would note that there is a dissociation among the three classes of models relevant to understanding speech perception as an active process. Although cognitive models of spoken word processing (e.g., Cohort, TRACE, and Shortlist) have been developed to include some plasticity and to account for different patterns of the influence of lexical knowledge, even the most recent versions (e.g., Distributed Cohort, Hebb-TRACE, and Shortlist B) do not specifically account for active processing of auditory input. It is true that some models have attempted to account for active processing below the level of phonemes (e.g., TRACE I: Elman and McClelland, 1986; McClelland et al., 2006), but these models have not been related or compared systematically to the kinds of models emerging from neuroscience research. For example, Friederici (2012), Rauschecker and Scott (2009), and Hickok and Poeppel (2007) have all proposed neurally plausible models built largely around the idea of dorsal and ventral processing streams. Although these models differ in their details, in principle the models proposed by Friederici (2012) and by Rauschecker and Scott (2009) have more extensive feedback mechanisms to support active processing of sensory input. These models are constructed in a neuroanatomical vernacular rather than the cognitive vernacular of the others (even Hebb-TRACE is still largely a cognitive model). But both sets of models are notable for two important omissions.

First, while the cognitive models mention learning and even model it, and the neural models refer to some aspects of learning, these models do not make contact with two-process learning models such as complementary learning systems (CLS; McClelland et al., 1995) or the category-learning framework of Ashby and colleagues (Ashby and Maddox, 2005; Ashby et al., 2007). Although CLS focuses on episodic memory and Ashby et al. (2007) focus on category learning, two-process models, involving the hippocampus, basal ganglia, or cerebellum as a fast associator and cortico-cortical connections as a slower, more robust learning system, have garnered substantial interest and research support. Yet learning in models of speech recognition has yet to seriously address the neural bases of learning and memory except descriptively.

This points to a second important omission. All of the speech recognition models are cortical models. There is no serious consideration of the role of the thalamus, amygdala, hippocampus, cerebellum, or other structures in these models. In taking a corticocentric view (see Parvizi, 2009), these models exhibit an unrealistic myopia about neural explanations of speech perception. Research by Kraus and colleagues (Wong et al., 2007; Song et al., 2008) demonstrates that there are measurable effects of training and experience on speech processing in the auditory brainstem. This is consistent with an active model of speech perception in which attention and experience shape the earliest levels of sensory encoding of speech. Although current data do not exist to support online changes in this kind of processing, this is exactly the kind of prediction an active model of speech perception would make, and it is entirely unexpected from any of the other current models of speech perception.

Author contributions

Shannon L. M. Heald prepared the first draft, Howard C. Nusbaum revised it, and both refined the manuscript to final form.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

Preparation of this manuscript was supported in part by ONR grant DoD/ONR N00014-12-1-0850 and in part by the Division of Social Sciences at the University of Chicago.

  • Abbs J. H., Sussman H. M. (1971). Neurophysiological feature detectors and speech perception: a discussion of theoretical implications. J. Speech Hear. Res. 14, 23–36.
  • Asari H., Zador A. M. (2009). Long-lasting context dependence constrains neural encoding models in rodent auditory cortex. J. Neurophysiol. 102, 2638–2656. doi: 10.1152/jn.00577.2009
  • Ashby F. G., Ennis J. M., Spiering B. J. (2007). A neurobiological theory of automaticity in perceptual categorization. Psychol. Rev. 114, 632–656. doi: 10.1037/0033-295x.114.3.632
  • Ashby F. G., Maddox W. T. (2005). Human category learning. Annu. Rev. Psychol. 56, 149–178. doi: 10.1146/annurev.psych.56.091103.070217
  • Barlow H. B. (1961). “Possible principles underlying the transformations of sensory messages,” in Sensory Communication, ed Rosenblith W. (Cambridge, MA: MIT Press), 217–234.
  • Barsalou L. W. (1983). Ad hoc categories. Mem. Cognit. 11, 211–227. doi: 10.3758/bf03196968
  • Best C. T., McRoberts G. W., Goodell E. (2001). Discrimination of non-native consonant contrasts varying in perceptual assimilation to the listener’s native phonological system. J. Acoust. Soc. Am. 109, 775–794. doi: 10.1121/1.1332378
  • Best C. T., McRoberts G. W., Sithole N. M. (1988). Examination of perceptual reorganization for nonnative speech contrasts: Zulu click discrimination by English-speaking adults and infants. J. Exp. Psychol. Hum. Percept. Perform. 14, 345–360. doi: 10.1037//0096-1523.14.3.345
  • Blumstein S. E., Stevens K. N. (1980). Perceptual invariance and onset spectra for stop consonants in different vowel environments. J. Acoust. Soc. Am. 67, 648–662. doi: 10.1121/1.383890
  • Born J., Wilhelm I. (2012). System consolidation of memory during sleep. Psychol. Res. 76, 192–203. doi: 10.1007/s00426-011-0335-6
  • Bradlow A. R., Bent T. (2008). Perceptual adaptation to non-native speech. Cognition 106, 707–729. doi: 10.1016/j.cognition.2007.04.005
  • Brawn T., Nusbaum H. C., Margoliash D. (2013). Sleep consolidation of interfering auditory memories in starlings. Psychol. Sci. 24, 439–447. doi: 10.1177/0956797612457391
  • Broca P. (1865). Sur le siège de la faculté du langage articulé. Bull. Soc. Anthropol. 6, 377–393. doi: 10.3406/bmsap.1865.9495
  • Browman C. P., Goldstein L. (1995). “Gestural syllable position effects in American English,” in Producing Speech: Contemporary Issues. For Katherine Safford Harris, eds Bell-Berti F., Raphael L. J. (Woodbury, NY: American Institute of Physics), 19–34.
  • Carpenter G. A., Grossberg S. (1988). The ART of adaptive pattern recognition by a self-organizing neural network. Computer 21, 77–88. doi: 10.1109/2.33
  • Chun M. M., Phelps E. A. (1999). Memory deficits for implicit contextual information in amnesic subjects with hippocampal damage. Nat. Neurosci. 2, 844–847. doi: 10.1038/12222
  • Clarke C. M., Garrett M. F. (2004). Rapid adaptation to foreign-accented English. J. Acoust. Soc. Am. 116, 3647–3658. doi: 10.1121/1.1815131
  • Coelho C., Rosenbaum D., Nusbaum H. C., Fenn K. M. (2012). Imagined actions aren’t just weak actions: task variability promotes skill learning in physical but not in mental practice. J. Exp. Psychol. Learn. Mem. Cogn. 38, 1759–1764. doi: 10.1037/a0028065
  • Cruikshank S. J., Weinberger N. M. (1996). Receptive-field plasticity in the adult auditory cortex induced by Hebbian covariance. J. Neurosci. 16, 861–875.
  • Dahan D., Mead R. L. (2010). Context-conditioned generalization in adaptation to distorted speech. J. Exp. Psychol. Hum. Percept. Perform. 36, 704–728. doi: 10.1037/a0017449
  • Davis M. H., Johnsrude I. S., Hervais-Adelman A., Taylor K., McGettigan C. (2005). Lexical information drives perceptual learning of distorted speech: evidence from the comprehension of noise-vocoded sentences. J. Exp. Psychol. Gen. 134, 222–241. doi: 10.1037/0096-3445.134.2.222
  • Diehl R. L., Lotto A. J., Holt L. L. (2004). Speech perception. Annu. Rev. Psychol. 55, 149–179. doi: 10.1146/annurev.psych.55.090902.142028
  • Dorman M. F., Studdert-Kennedy M., Raphael L. J. (1977). Stop-consonant recognition: release bursts and formant transitions as functionally equivalent, context-dependent cues. Percept. Psychophys. 22, 109–122. doi: 10.3758/bf03198744
  • Eichenbaum H., Otto T., Cohen N. J. (1992). The hippocampus: what does it do? Behav. Neural Biol. 57, 2–36. doi: 10.1016/0163-1047(92)90724-I
  • Elman J. L., McClelland J. L. (1986). “Exploiting the lawful variability in the speech wave,” in Invariance and Variability in Speech Processes, eds Perkell J. S., Klatt D. H. (Hillsdale, NJ: Erlbaum), 360–385.
  • Fant C. G. (1962). Descriptive analysis of the acoustic aspects of speech. Logos 5, 3–17.
  • Fenn K. M., Hambrick D. Z. (2012). Individual differences in working memory capacity predict sleep-dependent memory consolidation. J. Exp. Psychol. Gen. 141, 404–410. doi: 10.1037/a0025268
  • Fenn K. M., Margoliash D., Nusbaum H. C. (2013). Sleep restores loss of generalized but not rote learning of synthetic speech. Cognition 128, 280–286. doi: 10.1016/j.cognition.2013.04.007
  • Fenn K. M., Nusbaum H. C., Margoliash D. (2003). Consolidation during sleep of perceptual learning of spoken language. Nature 425, 614–616. doi: 10.1038/nature01951
  • Fodor J. A. (1983). Modularity of Mind: An Essay on Faculty Psychology. Cambridge, MA: MIT Press.
  • Francis A. L., Baldwin K., Nusbaum H. C. (2000). Effects of training on attention to acoustic cues. Percept. Psychophys. 62, 1668–1680. doi: 10.3758/bf03212164
  • Francis A. L., Nusbaum H. C. (2009). Effects of intelligibility on working memory demand for speech perception. Atten. Percept. Psychophys. 71, 1360–1374. doi: 10.3758/APP.71.6.1360
  • Francis A. L., Nusbaum H. C. (2002). Selective attention and the acquisition of new phonetic categories. J. Exp. Psychol. Hum. Percept. Perform. 28, 349–366. doi: 10.1037/0096-1523.28.2.349
  • Francis A. L., Kaganovich N., Driscoll-Huber C. J. (2008). Cue-specific effects of categorization training on the relative weighting of acoustic cues to consonant voicing in English. J. Acoust. Soc. Am. 124, 1234–1251. doi: 10.1121/1.2945161
  • Francis A. L., Nusbaum H. C., Fenn K. (2007). Effects of training on the acoustic phonetic representation of synthetic speech. J. Speech Lang. Hear. Res. 50, 1445–1465. doi: 10.1044/1092-4388(2007/100)
  • Francis A. L., Nusbaum H. C. (1996). Paying attention to speaking rate. ICSLP 96: Proceedings of the Fourth International Conference on Spoken Language Processing 3, 1537–1540.
  • Friederici A. D. (2012). The cortical language circuit: from auditory perception to sentence comprehension. Trends Cogn. Sci. 16, 262–268. doi: 10.1016/j.tics.2012.04.001
  • Galbraith G. C., Arroyo C. (1993). Selective attention and brainstem frequency-following responses. Biol. Psychol. 37, 3–22. doi: 10.1016/0301-0511(93)90024-3
  • Gallo D. A. (2006). Associative Illusions of Memory. New York: Psychology Press.
  • Gaskell M. G., Marslen-Wilson W. D. (1997). Integrating form and meaning: a distributed model of speech perception. Lang. Cogn. Process. 12, 613–656. doi: 10.1080/016909697386646
  • Geschwind N. (1970). The organization of language and the brain. Science 170, 940–944.
  • Giard M. H., Collet L., Bouchet P., Pernier J. (1994). Auditory selective attention in the human cochlea. Brain Res. 633, 353–356. doi: 10.1016/0006-8993(94)91561-x
  • Gibson E. J. (1969). Principles of Perceptual Learning and Development. New York: Appleton-Century-Crofts.
  • Goldinger S. D. (1998). Echoes of echoes? An episodic theory of lexical access. Psychol. Rev. 105, 251–279. doi: 10.1037/0033-295x.105.2.251
  • Goldstone R. L. (1998). Perceptual learning. Annu. Rev. Psychol. 49, 585–612. doi: 10.1146/annurev.psych.49.1.585
  • Goldstone R. L., Kersten A. (2003). “Concepts and categories,” in Comprehensive Handbook of Psychology, Experimental Psychology (Vol. 4), eds Healy A. F., Proctor R. W. (New York: Wiley), 591–621.
  • Greenspan S. L., Nusbaum H. C., Pisoni D. B. (1988). Perceptual learning of synthetic speech produced by rule. J. Exp. Psychol. Learn. Mem. Cogn. 14, 421–433. doi: 10.1037/0278-7393.14.3.421
  • Hasson U., Skipper J. I., Nusbaum H. C., Small S. L. (2007). Abstract coding of audiovisual speech: beyond sensory representation. Neuron 56, 1116–1126. doi: 10.1016/j.neuron.2007.09.037
  • Hickok G., Poeppel D. (2007). The cortical organization of speech processing. Nat. Rev. Neurosci. 8, 393–402. doi: 10.1038/nrn2113
  • Hintzman D. L. (1984). MINERVA 2: a simulation model of human memory. Behav. Res. Methods Instrum. Comput. 16, 96–101. doi: 10.3758/bf03202365
  • Huang J., Holt L. L. (2012). Listening for the norm: adaptive coding in speech categorization. Front. Psychol. 3:10. doi: 10.3389/fpsyg.2012.00010
  • Ju M., Luce P. A. (2006). Representational specificity of within-category phonetic variation in the long-term mental lexicon. J. Exp. Psychol. Hum. Percept. Perform. 32, 120–138. doi: 10.1037/0096-1523.32.1.120
  • Jusczyk P. W. (1993). From general to language-specific capacities: the WRAPSA model of how speech perception develops. J. Phon. 21, 3–28.
  • Kane M. J., Engle R. W. (2000). Working memory capacity, proactive interference and divided attention: limits on long-term memory retrieval. J. Exp. Psychol. Learn. Mem. Cogn. 26, 336–358. doi: 10.1037/0278-7393.26.2.336
  • Ladefoged P., Broadbent D. E. (1957). Information conveyed by vowels. J. Acoust. Soc. Am. 29, 98–104. doi: 10.1121/1.1908694
  • Laing E. J. C., Liu R., Lotto A. J., Holt L. L. (2012). Tuned with a tune: talker normalization via general auditory processes. Front. Psychol. 3:203. doi: 10.3389/fpsyg.2012.00203
  • Lametti D. R., Ostry D. J. (2010). Postural constraint on movement variability. J. Neurophysiol. 104, 1061–1067. doi: 10.1152/jn.00306.2010
  • Liberman A. M., Cooper F. S., Shankweiler D. P., Studdert-Kennedy M. (1967). Perception of the speech code. Psychol. Rev. 74, 431–461. doi: 10.1037/h0020279
  • Liberman A. M., Delattre P. C., Cooper F. S., Gerstman L. J. (1954). The role of consonant-vowel transitions in the perception of the stop and nasal consonants. Psychol. Monogr. Gen. Appl. 68, 1–13. doi: 10.1037/h0093673
  • Lichtheim L. (1885). On aphasia. Brain 7, 433–484.
  • Lim S. J., Holt L. L. (2011). Learning foreign sounds in an alien world: videogame training improves non-native speech categorization. Cogn. Sci. 35, 1390–1405. doi: 10.1111/j.1551-6709.2011.01192.x
  • Lisker L., Abramson A. S. (1964). A cross-language study of voicing in initial stops: acoustical measurements. Word 20, 384–422.
  • Lively S. E., Logan J. S., Pisoni D. B. (1993). Training Japanese listeners to identify English /r/ and /l/. II: the role of phonetic environment and talker variability in learning new perceptual categories. J. Acoust. Soc. Am. 94, 1242–1255. doi: 10.1121/1.408177
  • Logan G. D. (1988). Toward an instance theory of automatization. Psychol. Rev. 95, 492–527. doi: 10.1037/0033-295x.95.4.492
  • Logan J. S., Lively S. E., Pisoni D. B. (1991). Training Japanese listeners to identify English /r/ and /l/: a first report. J. Acoust. Soc. Am. 89, 874–886. doi: 10.1121/1.1894649
  • Luce P. A., Pisoni D. B. (1998). Recognizing spoken words: the neighborhood activation model. Ear Hear. 19, 1–36. doi: 10.1097/00003446-199802000-00001
  • Luce P. A., McLennan C., Charles-Luce J. (2003). “Abstractness and specificity in spoken word recognition: indexical and allophonic variability in long-term repetition priming,” in Rethinking Implicit Memory, eds Bowers J., Marsolek C. (Oxford: Oxford University Press), 197–214.
  • MacKay D. M. (1951). Mindlike behavior in artefacts. Br. J. Philos. Sci. 2, 105–121. doi: 10.1093/bjps/ii.6.105
  • MacKay D. M. (1956). “The epistemological problem for automata,” in Automata Studies, eds Shannon C. E., McCarthy J. (Princeton: Princeton University Press).
  • Marr D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. San Francisco: Freeman.
  • Marr D. (1971). Simple memory: a theory for archicortex. Philos. Trans. R. Soc. Lond. B Biol. Sci. 262, 23–81.
  • Marslen-Wilson W., Welsh A. (1978). Processing interactions and lexical access during word recognition in continuous speech. Cogn. Psychol. 10, 29–63. doi: 10.1016/0010-0285(78)90018-x
  • Mattar A. A. G., Ostry D. J. (2010). Generalization of dynamics learning across changes in movement amplitude. J. Neurophysiol. 104, 426–438. doi: 10.1152/jn.00886.2009
  • McClelland J. L., Elman J. L. (1986). The TRACE model of speech perception. Cogn. Psychol. 18, 1–86.
  • McClelland J. L., Rumelhart D. E. (1985). Distributed memory and the representation of general and specific information. J. Exp. Psychol. Gen. 114, 159–197.
  • McClelland J. L., McNaughton B. L., O’Reilly R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419–457. doi: 10.1037//0033-295x.102.3.419
  • McClelland J. L., Mirman D., Holt L. L. (2006). Are there interactive processes in speech perception? Trends Cogn. Sci. 10, 363–369. doi: 10.1016/j.tics.2006.06.007
  • McCoy S. L., Tun P. A., Cox L. C., Colangelo M., Stewart R. A., Wingfield A. (2005). Hearing loss and perceptual effort: downstream effects on older adults’ memory for speech. Q. J. Exp. Psychol. A 58, 22–33. doi: 10.1080/02724980443000151
  • McMurray B., Jongman A. (2011). What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations. Psychol. Rev. 118, 219–246. doi: 10.1037/a0022325
  • McQueen J. M., Norris D., Cutler A. (2006). Are there really interactive speech processes in speech perception? Trends Cogn. Sci. 10:533.
  • Mirman D., McClelland J. L., Holt L. L. (2006a). An interactive Hebbian account of lexically guided tuning of speech perception. Psychon. Bull. Rev. 13, 958–965. doi: 10.3758/bf03213909
  • Mirman D., McClelland J. L., Holt L. L. (2006b). Theoretical and empirical arguments support interactive processing. Trends Cogn. Sci. 10, 534. doi: 10.1016/j.tics.2006.10.003
  • Moran J., Desimone R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science 229, 782–784. doi: 10.1126/science.4023713
  • Murphy D. R., Craik F. I., Li K. Z., Schneider B. A. (2000). Comparing the effects of aging and background noise on short-term memory performance. Psychol. Aging 15, 323–334. doi: 10.1037/0882-7974.15.2.323
  • Navon D. (1984). Resources—a theoretical soup stone? Psychol. Rev. 91, 216–234. doi: 10.1037/0033-295x.91.2.216
  • Nittrouer S., Miller M. E. (1997). Predicting developmental shifts in perceptual weighting schemes. J. Acoust. Soc. Am. 101, 2253–2266. doi: 10.1121/1.418207
  • Nittrouer S., Lowenstein J. H. (2007). Children’s weighting strategies for word-final stop voicing are not explained by auditory capacities. J. Speech Lang. Hear. Res. 50, 58–73. doi: 10.1044/1092-4388(2007/005)
  • Norris D. (1994). Shortlist: a connectionist model of continuous speech recognition. Cognition 52, 189–234. doi: 10.1016/0010-0277(94)90043-4
  • Norris D., McQueen J. M. (2008). Shortlist B: a Bayesian model of continuous speech recognition. Psychol. Rev. 115, 357–395. doi: 10.1037/0033-295x.115.2.357
  • Norris D., McQueen J. M., Cutler A. (2000). Merging information in speech recognition: feedback is never necessary. Behav. Brain Sci. 23, 299–325. doi: 10.1017/s0140525x00003241
  • Nosofsky R. M. (1986). Attention, similarity, and the identification-categorization relationship. J. Exp. Psychol. Gen. 115, 39–57. doi: 10.1037/0096-3445.115.1.39
  • Nusbaum H. C., Lee L. (1992). “Learning to hear phonetic information,” in Speech Perception, Production, and Linguistic Structure, eds Tohkura Y., Vatikiotis-Bateson E., Sagisaka Y. (Tokyo: OHM Publishing Company), 265–273.
  • Nusbaum H. C., Magnuson J. (1997). “Talker normalization: phonetic constancy as a cognitive process,” in Talker Variability in Speech Processing, eds Johnson K., Mullennix J. W. (San Diego: Academic Press), 109–129.
  • Nusbaum H. C., Morin T. M. (1992). “Paying attention to differences among talkers,” in Speech Perception, Production, and Linguistic Structure, eds Tohkura Y., Vatikiotis-Bateson E., Sagisaka Y. (Tokyo: OHM Publishing Company), 113–134.
  • Nusbaum H. C., Pisoni D. B. (1985). Constraints on the perception of synthetic speech generated by rule. Behav. Res. Methods Instrum. Comput. 17, 235–242. doi: 10.3758/bf03214389
  • Nusbaum H. C., Schwab E. C. (1986). “The role of attention and active processing in speech perception,” in Pattern Recognition by Humans and Machines: Speech Perception (Vol. 1), eds Schwab E. C., Nusbaum H. C. (San Diego: Academic Press), 113–157.
  • Nygaard L. C., Pisoni D. B. (1998). Talker-specific perceptual learning in spoken word recognition. Percept. Psychophys. 60, 355–376.
  • Nygaard L. C., Sommers M., Pisoni D. B. (1994). Speech perception as a talker-contingent process. Psychol. Sci. 5, 42–46. doi: 10.1111/j.1467-9280.1994.tb00612.x
  • Parvizi J. (2009). Corticocentric myopia: old bias in new cognitive sciences . Trends Cogn. Sci. 13 , 354–359 10.1016/j.tics.2009.04.008 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Peelle J. E., Wingfield A. (2005). Dissociable components of perceptual learning revealed by adult age differences in adaptation to time-compressed speech . J. Exp. Psychol. Hum. Percept. Perform. 31 , 1315–1330 10.1037/0096-1523.31.6.1315 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Peterson G. E., Barney H. L. (1952). Control methods used in a study of the vowels . J. Acoust. Soc. Am. 24 , 175–184 10.1121/1.1906875 [ CrossRef ] [ Google Scholar ]
  • Pichora-Fuller M. K., Souza P. E. (2003). Effects of aging on auditory processing of speech . Int. J. Audiol. 42 , 11–16 10.3109/14992020309074638 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Pisoni D. B., Aslin R. N., Perey A. J., Hennessy B. L. (1982). Some effects of laboratory training on identification and discrimination of voicing contrasts in stop consonants . J. Exp. Psychol. Hum. Percept. Perform. 8 , 297–314 10.1037//0096-1523.8.2.297 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Pisoni D. B., Lively S. E., Logan J. S. (1994). “ Perceptual learning of non-native speech contrasts: implications for theories of speech perception ,” in Development of Speech Perception: The Transition from Speech Sounds to Spoken Words , eds Goodman J., Nusbaum H. C. (Cambridge, MA: MIT Press; ), 121–166 [ Google Scholar ]
  • Rabbitt P. (1991). Mild hearing loss can cause apparent memory failures which increase with age and reduce with IQ . Acta Otolaryngol. Suppl. 111 , 167–176 10.3109/00016489109127274 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Rauschecker J. P., Scott S. K. (2009). Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing . Nat. Neurosci. 12 , 718–724 10.1038/nn.2331 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Rosch E., Mervis C. B., Gray W., Johnson D., Boyes-Braem P. (1976). Basic objects in natural categories . Cogn. Psychol. 8 , 382–439 10.1016/0010-0285(76)90013-x [ CrossRef ] [ Google Scholar ]
  • Rosen S., Faulkner A., Wilkinson L. (1999). Perceptual adaptation by normal listeners to upward shifts of spectral information in speech and its relevance for users of cochlear implants . J. Acoust. Soc. Am. 106 , 3629–3636 10.1121/1.428215 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sawusch J. R., Nusbaum H. C. (1983). Auditory and phonetic processes in place perception for stops . Percept. Psychophys. 34 , 560–568 10.3758/bf03205911 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sawusch J. R., Jusczyk P. W. (1981). Adaptation and contrast in the perception of voicing . J. Exp. Psychol. Hum. Percept. Perform. 7 , 408–421 10.1037/0096-1523.7.2.408 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Schwab E. C., Nusbaum H. C., Pisoni D. B. (1985). Some effects of training on the perception of synthetic speech . Hum. Factors 27 , 395–408 [ PMC free article ] [ PubMed ] [ Google Scholar ]
  • Schyns P. G., Goldstone R. L., Thibaut J. P. (1998). The development of features in object concepts . Behav. Brain Sci. 21 , 1–17 10.1017/s0140525x98000107 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Shannon R. V., Zeng F.-G., Kamath V., Wygonski J., Ekelid M. (1995). Speech recognition with primarily temporal cues . Science 270 , 303–304 10.1126/science.270.5234.303 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Shiffrin R. M., Schneider W. (1977). Controlled and automatic human information processing: II. Perceptual learning, automatic attending and a general theory . Psychol. Rev. 84 , 127–190 10.1037//0033-295x.84.2.127 [ CrossRef ] [ Google Scholar ]
  • Sidaras S. K., Alexander J. E., Nygaard L. C. (2009). Perceptual learning of systematic variation in Spanish-accented speech . J. Acoust. Soc. Am. 125 , 3306–3316 10.1121/1.3101452 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Skoe E., Kraus N. (2012). A little goes a long way: how the adult brain is shaped by musical training in childhood . J. Neurosci. 32 , 11507–11510 10.1523/JNEUROSCI.1949-12.2012 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Song J. H., Skoe E., Wong P. C. M., Kraus N. (2008). Plasticity in the adult human auditory brainstem following short-term linguistic training . J. Cogn. Neurosci. 20 , 1892–1902 10.1162/jocn.2008.20131 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Spinelli D. N., Pribram K. H. (1966). Changes in visual recovery functions produced by temporal lobe stimulation in monkeys . Electroencephalogr. Clin. Neurophysiol. 20 , 44–49 10.1016/0013-4694(66)90139-8 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Sproat R., Fujimura O. (1993). Allophonic variation in English /l/ and its implications for phonetic implementation . J. Phon. 21 , 291–311 [ Google Scholar ]
  • Strait D. L., Kraus N., Parbery-Clark A., Ashley R. (2010). Musical experience shapes top-down auditory mechanisms: evidence from masking and auditory attention performance . Hear. Res. 261 , 22–29 10.1016/j.heares.2009.12.021 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Strange W., Jenkins J. J. (1978). “ Role of linguistic experience in the perception of speech ,” in Perception and Experience , eds Walk R. D., Pick H. L. (New York: Plenum Press; ), 125–169 [ Google Scholar ]
  • Suga N. (2008). Role of corticofugal feedback in hearing . J. Comp. Physiol. A Neuroethol. Sens. Neural Behav. Physiol. 194 , 169–183 10.1007/s00359-007-0274-2 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Surprenant A. M. (2007). Effects of noise on identification and serial recall of nonsense syllables in older and younger adults . Neuropsychol. Dev. Cogn. B Aging Neuropsychol. Cogn. 14 , 126–143 10.1080/13825580701217710 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Walley A. C., Carrell T. D. (1983). Onset spectra and formant transitions in the adult’s and child’s perception of place of articulation in stop consonants . J. Acoust. Soc. Am. 73 , 1011–1022 10.1121/1.389149 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Warren P., Marslen-Wilson W. (1987). Continuous uptake of acoustic cues in spoken word recognition . Percept. Psychophys. 41 , 262–275 10.3758/bf03208224 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wehr M., Zador A. M. (2003). Balanced inhibition underlies tuning and sharpens spike timing in auditory cortex . Nature 426 , 442–446 10.1038/nature02116 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Weil S. A. (2001). Foreign Accented Speech: Adaptation and Generalization. The Ohio State University: Doctoral Dissertation [ Google Scholar ]
  • Weinberger N. M. (1998). Tuning the brain by learning and by stimulation of the nucleus basalis . Trends Cogn. Sci. 2 , 271–273 10.1016/s1364-6613(98)01200-5 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Werker J. F., Logan J. S. (1985). Cross-language evidence for three factors in speech perception . Percept. Psychophys. 37 , 35–44 10.3758/bf03207136 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Werker J. F., Polka L. (1993). Developmental changes in speech perception: new challenges and new directions . J. Phon. 83 , 101 [ Google Scholar ]
  • Werker J. F., Tees R. C. (1983). Developmental changes across childhood in the perception of non-native speech sounds . Can. J. Psychol. 37 , 278–286 10.1037/h0080725 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Werker J. F., Tees R. C. (1984). Cross-language speech perception: evidence for perceptual reorganization during the first year of life . Infant. Behav. Dev. 7 , 49–63 10.1016/s0163-6383(84)80022-3 [ CrossRef ] [ Google Scholar ]
  • Wernicke C. (1874/1977). “ Der aphasische symptomencomplex: eine psychologische studie auf anatomischer basis ,” in Wernicke’s Works on Aphasia: A Sourcebook and Review , ed Eggert G. H. (The Hague: Mouton; ), 91–145 [ Google Scholar ]
  • Whalen D. H. (1991). Subcategorical phonetic mismatches and lexical access . Percept. Psychophys. 50 , 351–360 10.3758/bf03212227 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wingfield A., Tun P. A., McCoy S. L. (2005). Hearing loss in older adulthood. What it is and how it interacts with cognitive performance . Curr. Dir. Psychol. Sci. 14 , 144–148 10.1111/j.0963-7214.2005.00356.x [ CrossRef ] [ Google Scholar ]
  • Wong P. C. M., Skoe E., Russo N. M., Dees T., Kraus N. (2007). Musical experience shapes human brainstem encoding of linguistic pitch patterns . Nat. Neurosci. 10 , 420–422 10.1038/nn1872 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wood E. R., Dudchenko P. A., Eichenbaum H. (1999). The global record of memory in hippocampal neuronal activity . Nature 397 , 613–616 10.1038/17605 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Wood E. R., Dudchenko P. A., Robitsek R. J., Eichenbaum H. (2000). Hippocampal neurons encode information about different types of memory episodes occurring in the same location . Neuron 27 , 623–633 10.1016/s0896-6273(00)00071-4 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Yamada R. A., Tohkura Y. (1992). The effects of experimental variables on the perception of American English /r/ and /l/ by Japanese listeners . Percept. Psychophys. 52 , 376–392 10.3758/bf03206698 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Znamenskiy P., Zador A. M. (2013). Corticostriatal neurons in auditory cortex drive decisions during auditory discrimination . Nature 497 , 482–486 10.1038/nature12077 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]


COMMENTS

  1. Speech perception

    Speech perception is the process by which the sounds of language are heard, interpreted, and understood. The study of speech perception is closely linked to the fields of phonology and phonetics in linguistics and cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize ...

  2. Speech Perception

    Speech Perception. Andrew J. Lotto, Lori L. Holt, in Neurobiology of Language, 2016 16.1 Introduction. For much of the past 50 years, the main theoretical debate in the scientific study of speech perception has focused on whether the processing of speech sounds relies on neural mechanisms that are specific to speech and language or whether general perceptual/cognitive processes can account for ...

  3. Speech Perception

    Speech perception is conventionally defined as the perceptual and cognitive processes leading to the discrimination, identification, and interpretation of speech sounds. However, to gain a broader understanding of the concept, such processes must be investigated relative to their interaction with long-term knowledge—lexical information in ...

  4. Speech Perception

    Speech Perception. Speech perception refers to the suite of (neural, computational, cognitive) operations that transform auditory input signals into representations that can make contact with internally stored information: the words in a listener's mental lexicon. ... A classic definition of categorization is that it permits treating ...

  5. Introduction. The perception of speech: from sound to meaning

    Infants' speech perception skills show two types of changes towards the end of the first year of life. First, the ability to perceive phonetic distinctions in a non-native language declines. Second, skills at making phonetic distinctions in the child's own language improve. The paper presents new data showing that both native and non-native ...

  6. The Auditory Cognitive Neuroscience of Speech Perception in Context

    The field of speech perception has radically shifted to embrace these reciprocal benefits, with a methodological toolbox equipped to support the endeavor. ... At the same time, almost by definition, these natural stimuli are not well controlled for various acoustic or linguistic features of interest. Thus, the strongest claims will likely need ...

  7. Speech Perception: Empirical and Theoretical Considerations

    This does not imply that speech perception involves entirely different kinds of objects or processes from ordinary non-linguistic audition, nor does it imply that speech perception is a uniquely human capacity. Nevertheless, speech clearly is special for humans, in that we have special sensitivity for speech sounds. Speech perception promises ...

  8. Speech Perception

    D.W. Massaro, in International Encyclopedia of the Social & Behavioral Sciences (2001): The study of speech perception is an interdisciplinary endeavor, which involves a varied set of experimental and theoretical approaches. It includes the fundamental psychophysical question of what properties of spoken language are perceptually meaningful and how these properties signal the ...

  9. Speech Perception

    Speech perception as an experimental discipline has a roughly sixty-year history. In a very broad sense, much of the research in this field investigates how listeners map the input acoustic signal onto phonological units. Determining the nature of the mapping is an intriguing issue because the acoustic signal is highly variable, yet perception ...

  10. The Handbook of Speech Perception

    The Handbook of Speech Perception, Second Edition, is a comprehensive and up-to-date survey of technical and theoretical developments in perceptual research on human speech. Offering a variety of perspectives on the perception of spoken language, this volume provides original essays by leading researchers on the major issues and most recent ...

  11. Speech Perception

    This chapter evaluates the evidence that speech perception is distinctive when compared with non-linguistic auditory perception. It addresses the phenomenology, contents, objects, and mechanisms involved in the perception of spoken language. According to the account it proposes, the capacity to perceive speech in a manner that enables ...

  12. Speech perception and production

    Abstract. Until recently, research in speech perception and speech production has largely focused on the search for psychological and phonetic evidence of discrete, abstract, context-free symbolic units corresponding to phonological segments or phonemes. Despite this common conceptual goal and intimately related objects of study, however ...

  13. Speech perception: Some new directions in research and theory

    The perception of speech is one of the most fascinating attributes of human behavior; both the auditory periphery and higher centers help define the parameters of sound perception. In this paper some of the fundamental perceptual problems facing speech sciences are described. The paper focuses on several of the new directions speech perception ...

  14. The motor theory of speech perception reviewed

    Abstract. More than 50 years after the appearance of the motor theory of speech perception, it is timely to evaluate its three main claims that (1) speech processing is special, (2) perceiving speech is perceiving gestures, and (3) the motor system is recruited for perceiving speech. We argue that to the extent that it can be evaluated, the ...

  15. Auditory Perception

    Furthermore, speech perception's contents might differ in a more dramatic way from those of non-linguistic audition. Listening with understanding to speech involves grasping meanings. The commonplace view is conservative. It holds that grasping meanings is an act of the understanding rather than of audition.

  16. Chapter 1

    From Section 1, Speech Perception. Published online by Cambridge University Press, 05 November 2012. By Carol A. Fowler and James S. Magnuson; edited by Michael Spivey, Ken McRae, and Marc Joanisse.

  17. L105/205 (PDF)

    Theories of speech perception differ with respect to their views of what is perceived and how. Auditory: listeners identify acoustic patterns or features by matching them to stored acoustic representations. Bottom-up: perception is built from information in the physical signal. Active: cognitive/intellectual work is involved in perception. (A toy sketch of the auditory matching view appears after this list.)

  18. Motor theory of speech perception

    The motor theory of speech perception is the hypothesis that people perceive spoken words by identifying the vocal tract gestures with which they are pronounced rather than by identifying the sound patterns that speech generates. It originally claimed that speech perception is done through a specialized module that is innate and human-specific. Though the idea of a module has been qualified in ...

  19. What is SPEECH PERCEPTION? Definition of SPEECH PERCEPTION (Psychology Dictionary)

    Psychology definition of SPEECH PERCEPTION: a psychological process whereby the listener processes the speech into a phonological representation.

  20. Introduction. The perception of speech: from sound to meaning

    Spoken language communication is arguably the most important activity that distinguishes humans from non-human species. This paper provides an overview of the review papers that make up this theme issue on the processes underlying speech communication.

  21. Perception

    Perception (from Latin perceptio 'gathering, receiving') is the organization, identification, and interpretation of sensory information in order to represent and understand the presented information or environment. All perception involves signals that go through the nervous system, which in turn result from physical or chemical stimulation ...

  22. Psycholinguistics/Models of Speech Perception

    Speech perception is the process by which speech is interpreted. It involves three processes: hearing, interpreting, and comprehending all of the sounds produced by a speaker. Combining these features into an order that resembles the speech of a given language is a main function of speech perception.

  23. Masked speech recognition by 6-13-year-olds with early childhood otitis media

    When the target speech and speech interferers are spatially separated, normal-hearing listeners show spatial release from masking (SRM), calculated as the difference between the SRTs measured with and without spatial separation. More recently, a couple of studies investigated the long-term effects of early-childhood otitis media (OM) on SRM. (A worked example of the SRM computation appears after this list.)

  24. Speech perception as an active cognitive process

    If speech perception itself is mediated by active processes, which require cognitive resources, then the increasing demands on additional cognitive and perceptual processing for older adults become more problematic. ... the linguistic context of a phoneme is very much a part of the acoustic definition of a phoneme. While experience during ...


  25. Speech perception as an active cognitive process (PDF)

    Furthermore, understanding speech perception as an active process has implications for explaining some of the findings of the interaction of hearing loss with cognitive processes (e.g., Wingfield et al., 2005). One explanation of the demands on cognitive mechanisms through hearing loss is a compensatory model as noted above (e.g., Rabbitt, 1991).
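
A toy sketch referenced from entry 17 above: under the auditory view, a listener identifies a speech sound by matching its acoustic pattern against stored acoustic representations. This is a minimal, hypothetical illustration; the two-formant feature representation, the template values, and the function name are assumptions made for the example, not drawn from any of the works listed.

    # Minimal sketch of template matching for the "auditory" view:
    # classify an incoming acoustic pattern by its nearest stored template.
    # All values are hypothetical, textbook-style approximations.
    import numpy as np

    # Hypothetical stored representations: average (F1, F2) formant
    # frequencies in Hz for three vowel categories.
    TEMPLATES = {
        "i": np.array([270.0, 2290.0]),
        "a": np.array([730.0, 1090.0]),
        "u": np.array([300.0, 870.0]),
    }

    def match_to_template(formants: np.ndarray) -> str:
        """Return the vowel whose stored template is nearest
        (Euclidean distance) to the measured formant pattern."""
        return min(TEMPLATES, key=lambda v: np.linalg.norm(formants - TEMPLATES[v]))

    # A token with F1 = 310 Hz and F2 = 2200 Hz is matched to /i/.
    print(match_to_template(np.array([310.0, 2200.0])))  # -> i

Real auditory accounts differ over exactly what is stored and matched; the point of the sketch is only the bottom-up matching step, not a commitment to formants as the perceptual representation.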

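A worked example referenced from entry 23 above: spatial release from masking (SRM) is the difference between the speech reception thresholds (SRTs) measured with and without spatial separation of target and masker. The function name and the threshold values below are hypothetical; only the subtraction itself comes from the definition quoted in the entry.

    def spatial_release_from_masking(srt_colocated_db: float,
                                     srt_separated_db: float) -> float:
        """SRM (dB) = SRT with target and masker co-located minus SRT
        with them spatially separated. Lower (more negative) SRTs mean
        better performance, so a positive SRM indicates that spatial
        separation helped."""
        return srt_colocated_db - srt_separated_db

    # Example: an SRT of -2 dB SNR co-located and -8 dB SNR separated
    # gives 6 dB of spatial release from masking.
    print(spatial_release_from_masking(-2.0, -8.0))  # -> 6.0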