Audiograms and Functional Auditory Testing to Assess Hearing Speech in Noise: A Review of the Clinical Evidence [Internet]. Ottawa (ON): Canadian Agency for Drugs and Technologies in Health; 2015 Aug 17.


Appendix 1: Functional Tests to Assess Speech in Noise

  • The Hearing in Noise Test (HINT)

The HINT is an adaptive test composed of 250 sentences divided into 25 lists.8 The test is adaptive in that the signal-to-noise ratio is adjusted based on the performance of the participant.8 Only the speech level is adjusted, whereas the noise level remains constant.8 For each sentence the participant identifies correctly, the following sentence is presented at a lower speech level, and for each sentence identified incorrectly, the following sentence is presented at a higher speech level.8 The intention of the adaptive test is to ensure each participant approaches the 50% correct response rate.8 The speech level typically starts at 55 dB SPL, and the noise remains constant at 65 dB SPL.8 The test is designed to be administered with the participant sitting 1 meter from eight speakers.8 The speaker directly in front of the individual presents the sentences, whereas the other seven speakers surrounding the participant play the background noise.8
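
The adaptive rule described above can be sketched in a few lines. The sketch below is only an illustration of the logic (correct response: next sentence softer; incorrect response: next sentence louder) with an assumed step size and a simple averaging rule for summarizing the track; the published HINT protocol defines its own step sizes, scoring, and list handling.

```python
# Minimal sketch of the adaptive speech-level logic described above.
# The step size, averaging rule, and listener model are illustrative
# assumptions; the published HINT protocol defines its own rules.

NOISE_LEVEL_DB = 65          # noise held constant (dB SPL)
START_SPEECH_DB = 55         # typical starting speech level (dB SPL)
STEP_DB = 4                  # assumed step size for illustration

def run_adaptive_track(sentence_correct, n_sentences=20):
    """sentence_correct(level_db) -> True if the listener repeats the
    sentence correctly at that speech level (stand-in for a real trial)."""
    speech_db = START_SPEECH_DB
    presented_levels = []
    for _ in range(n_sentences):
        presented_levels.append(speech_db)
        if sentence_correct(speech_db):
            speech_db -= STEP_DB   # correct -> next sentence is softer
        else:
            speech_db += STEP_DB   # incorrect -> next sentence is louder
    # The track converges near the 50%-correct speech level; here it is
    # summarized as the mean presentation level after the first few trials.
    srt_in_noise = sum(presented_levels[4:]) / len(presented_levels[4:])
    return srt_in_noise, srt_in_noise - NOISE_LEVEL_DB   # level and SNR

# Example with a simulated listener who repeats sentences at 52 dB SPL or above
print(run_adaptive_track(lambda level_db: level_db >= 52))   # -> (53.0, -12.0)
```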

  • The Speech Recognition in Noise Test (SPRINT)

The SPRINT was designed by the United States Army to identify hearing loss in active duty soldiers.9 The test includes 200 monosyllabic words that are pre-recorded with multitalker babble and delivered to both ears simultaneously through earphones.9 The speech-to-babble ratio is 9 dB; at this ratio, active duty soldiers with normal hearing are expected to hear at least 95% of the words correctly.9 The SPRINT should be administered by an audiologist or by a technician under audiologist supervision.9
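
Because the SPRINT is presented at a fixed speech-to-babble ratio, scoring reduces to a percent-correct calculation against the expectation stated above (at least 95% correct for normal-hearing soldiers at +9 dB). The sketch below is illustrative only; the function names and example score are made up, and the criterion comes from the text.

```python
# Illustrative check of the SPRINT expectation described above:
# at a +9 dB speech-to-babble ratio, >= 95% of the 200 monosyllabic
# words is the expected performance for normal-hearing listeners.

def sprint_percent_correct(words_correct: int, words_presented: int = 200) -> float:
    return 100.0 * words_correct / words_presented

def below_normal_expectation(words_correct: int) -> bool:
    """True when the score falls below the 95% expectation stated in the text."""
    return sprint_percent_correct(words_correct) < 95.0

print(below_normal_expectation(186))  # 93% -> True (below expectation)
```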

  • The Words in Noise Test (WIN)

The WIN consists of 70 monosyllabic words divided into two 35-word lists that are pre-recorded with a noisy background.10 The test is adaptive in that the level of the speech varies during the test while the multitalker babble level remains constant.10 The test is administered using earphones and is conducted in each ear separately.10 The 50% correct point, measured as a dB signal-to-babble ratio, is then calculated for the participant.10
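
The text states that the 50% correct point is calculated in dB signal-to-babble ratio. One simple way to illustrate that calculation is linear interpolation across the measured points of the psychometric function, as sketched below; the SNR values in the example and the interpolation itself are assumptions for illustration, not the published WIN scoring formula.

```python
# Illustrative estimate of the 50%-correct point from per-SNR scores.
# The SNR values and interpolation approach are assumptions for the
# sketch; the published WIN scoring procedure defines its own formula.

def estimate_50pct_point(results):
    """results: list of (snr_db, proportion_correct).
    Returns the dB signal-to-babble ratio where performance crosses 0.5,
    by linear interpolation between the bracketing points."""
    results = sorted(results, key=lambda r: r[0], reverse=True)
    for (snr_hi, p_hi), (snr_lo, p_lo) in zip(results, results[1:]):
        if p_hi >= 0.5 >= p_lo:
            if p_hi == p_lo:
                return (snr_hi + snr_lo) / 2
            frac = (p_hi - 0.5) / (p_hi - p_lo)
            return snr_hi + frac * (snr_lo - snr_hi)
    return None  # 50% point not bracketed by the measured SNRs

# Example: proportion correct at each signal-to-babble ratio, babble held constant
scores = [(24, 1.0), (20, 0.8), (16, 0.6), (12, 0.4), (8, 0.2), (4, 0.0), (0, 0.0)]
print(estimate_50pct_point(scores))  # -> 14.0 dB signal-to-babble ratio
```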

Copyright: This report contains CADTH copyright material and may contain material in which a third party owns copyright. This report may be used for the purposes of research or private study only. It may not be copied, posted on a web site, redistributed by email or stored on an electronic system without the prior written permission of CADTH or applicable copyright owner.

Links: This report may contain links to other information available on the websites of third parties on the Internet. CADTH does not have control over the content of such sites. Use of third party sites is governed by the owners’ own terms and conditions.

Except where otherwise noted, this work is distributed under the terms of a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International licence (CC BY-NC-ND), a copy of which is available at http://creativecommons.org/licenses/by-nc-nd/4.0/


  • Back to Basics: Speech Audiometry

Janet R. Schoepflin, PhD

Editor's Note: This is a transcript of an AudiologyOnline live seminar. Please download the supplemental course materials.

Speech is the auditory stimulus through which we communicate. The recognition of speech is therefore of great interest to all of us in the fields of speech and hearing. Speech audiometry developed originally out of the work conducted at Bell Labs in the 1920s and 1930s, where researchers were looking into the efficiency of communication systems, and it really gained momentum after World War II as returning veterans presented with hearing loss. The methods and materials for testing speech intelligibility were of interest then and are still of interest today. This ongoing interest shows up in the questions students ask during classes, in the questions new audiologists raise as they begin their practice, and in the comments and questions we see on various audiology listservs about the most efficient and effective ways to test speech in the clinical setting, and it is why AudiologyOnline proposed this webinar as part of its Back to Basics series. I am delighted to participate. I am presenting a review of the array of speech tests that we use in clinical evaluation, with a summary of some of the old and new research that supports the recommended practices. The topics I will address today are an overview of speech threshold testing, suprathreshold speech recognition testing, most comfortable listening level, uncomfortable listening level, and a brief mention of some new directions that speech testing is taking. In the context of testing speech, I will assume that the environment in which you are testing meets the ANSI permissible noise criteria and that the audiometer transducers being used to perform speech testing are calibrated to the ANSI standards for speech. I will not be talking about those standards, but it is of course important to keep them in mind.

Speech threshold testing involves several considerations. They include the purposes of the test, or the reasons for performing it; the materials that should be used in testing; and the method or procedure for testing.

Purposes of Speech Threshold Testing

A number of purposes have been given for speech threshold testing. In the past, speech thresholds were used as a means to cross-check the validity of pure tone thresholds. This purpose lacks some validity today because we have other physiologic and electrophysiologic procedures, like OAEs and immittance test results, to help us in that cross-check. However, the speech threshold measure is a test of hearing, and it is not entirely invalid as a cross-check for pure tone hearing. I think sometimes we are anxious to get rid of things because we feel we have a better handle from other tests, but in this case it may not be the wisest thing to toss out. Also in past years, speech thresholds were used to determine the level for suprathreshold speech recognition testing. That also lacks validity, because the level at which suprathreshold testing is conducted depends on the reason you are doing the test itself. It is necessary to test speech thresholds if you are going to bill 92557. Aside from that, the current purpose for speech threshold testing is in the evaluation of pediatric and difficult-to-test patients. Clinical practice surveys tell us that the majority of clinicians do test speech thresholds for all their patients, whether it is for billing purposes or not. It is always important that testing is done in the recommended, standardized manner.

The accepted measures for speech thresholds are the Speech Recognition Threshold (SRT) and the Speech Detection Threshold (SDT). Those terms are used because they specify the material or stimulus, i.e., speech, as well as the task that the listener is required to do, which is recognition or identification in the case of the SRT, and detection (noticing presence versus absence of the stimulus) in the case of the SDT. The terms also specify the criterion for performance, which is threshold, or generally 50%. The SDT is most commonly performed on those individuals who have been unable to complete an SRT, such as very young children. Because recognition is not required in the speech detection task, it is expected that the SDT will be about 5 to 10 dB better than the SRT, which requires recognition of the material.

Materials for Speech Threshold Testing

The materials used in speech threshold testing are spondees, which are familiar two-syllable words that have a fairly steep psychometric function. Cold running speech or connected discourse is an alternative for speech detection testing, since recognition is not required in that task. Whatever material is used, it should be noted on the audiogram. It is important to make notations on the audiogram about the protocols and the materials we are using, although in common practice many of us are lax in doing so.

Methods for Speech Threshold Testing

The methods consideration in speech threshold testing is how we are going to do the test. This includes whether we use monitored live voice or recorded materials, whether we familiarize the patient with the materials, and the technique we use to elicit threshold. Monitored live voice and recorded speech can both be used in SRT testing. However, recorded presentation is recommended because recorded materials standardize the test procedure.
With live voice presentation, monitoring each syllable of each spondee so that it peaks at 0 on the VU meter can be fairly difficult, and the consistency of the presentation is then lost. Using recorded materials is recommended, but it is less important in speech threshold testing than it is in suprathreshold speech testing. As with the materials that are used, it is important to note on the audiogram what method of presentation has been used. As far as familiarization goes, we have known for about 50 years, since Tillman and Jerger (1959) identified familiarity as a factor in speech thresholds, that familiarization of the patient with the test words should be included as part of every test. Several clinical practice surveys suggest that familiarization is often not done with patients. This is not a good practice, because familiarization does influence thresholds and should be part of the procedure. The last consideration under methods is the technique that is going to be used. Several different techniques have been proposed for the determination of the SRT. Clinical practice surveys suggest the most commonly used method is a bracketing procedure. The typical down 10 dB, up 5 dB approach is often used, with two to four words presented at each level, and the threshold is then defined as the lowest level at which 50% or more of the words are correctly repeated. This is not the procedure that is recommended by ASHA (1988). The ASHA-recommended procedure is a descending technique where two spondees are presented at each decrement from the starting level. Other modifications have been proposed, but they are not widely used.
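
A minimal sketch of the bracketing logic just described (down 10 dB after a level is passed, up 5 dB after a failure, threshold taken as the lowest level at which at least 50% of the spondees are repeated correctly) might look like the following. The starting level, words per level, stopping rule, and bounds are illustrative assumptions, and this is not the ASHA (1988) descending procedure.

```python
# Minimal sketch of the down-10/up-5 bracketing logic described above.
# Starting level, words per level, and stopping rule are assumptions.

def bracketing_srt(listener, start_db=50, n_words=4, floor_db=-10, ceiling_db=110):
    """listener(level_db) -> True/False for one spondee at that level."""
    level = start_db
    passed_levels = []
    while floor_db <= level <= ceiling_db:
        correct = sum(listener(level) for _ in range(n_words))
        if correct >= n_words / 2:      # at least 50% of the spondees correct
            passed_levels.append(level)
            level -= 10                 # passed this level: drop 10 dB
        elif passed_levels:
            break                       # failed below a passed level: stop
        else:
            level += 5                  # failed before any pass: come up 5 dB
    return min(passed_levels) if passed_levels else None   # SRT in dB HL

# Example with a simulated listener who repeats spondees at or above 28 dB HL
print(bracketing_srt(lambda level_db: level_db >= 28))   # -> 30
```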

Suprathreshold speech testing involves considerations as well. They are similar to those we mentioned for threshold testing, but they are more complicated. They include the purposes of the testing; the materials that should be used; whether the material should be delivered via monitored live voice or recordings; the level or levels at which testing should be conducted; whether a full list, half list, or abbreviated word list should be used; and whether the test should be given in quiet or in noise.

Purposes of Suprathreshold Testing

There are several reasons to conduct suprathreshold tests: estimating the communicative ability of the individual at a normal conversational level; determining whether a more thorough diagnostic assessment will be conducted; hearing aid considerations; and analysis of the error patterns in speech recognition. When the purpose of testing is to estimate communicative ability at a normal conversational level, the test should be given at a level around 50 to 60 dB HL, since that is representative of a normal conversational level at a communicating distance of about 1 meter. While monosyllabic words in quiet do not give a complete picture of communicative ability in daily situations, it is a procedure that people like to use to give some broad sense of overall communicative ability. If the purpose of the testing is diagnostic assessment, then a psychometric or performance-intensity function should be obtained. If the reason for the testing is hearing aid considerations, then the test is often given using words or sentences, either in quiet or in a background of noise. Another purpose is the analysis of error patterns in speech recognition, and in that situation a test other than an open-set monosyllabic word test would be appropriate.

Materials for Suprathreshold Testing

The choice of materials for testing depends on the purpose of the test and on the age and abilities of the patient. The issues in materials include the response set and the test items themselves.

Closed set vs. Open set. The first consideration is whether a closed set or an open set is appropriate. Closed set tests limit the number of response alternatives to a fairly small set, usually between 4 and 10 depending on the procedure; the number of alternatives influences the guess rate, which is a consideration as well. The Word Intelligibility by Picture Identification (WIPI) test is a commonly used closed set test for children, as it requires only a picture-pointing response and a receptive vocabulary age as low as about 5 years. It is very useful in pediatric evaluations, as is another closed set test, the Northwestern University Children's Perception of Speech test (NU-CHIPS).

In contrast, an open set protocol provides an unlimited number of response alternatives, so open set tests are more difficult. The available clinical practice surveys suggest that monosyllabic word lists are the most widely used materials in suprathreshold speech recognition testing for routine evaluations, but sentences in noise are gaining popularity for hearing aid purposes.

CID W-22 vs. NU-6. The most common materials for speech recognition testing are the monosyllabic word lists, the Central Institute for the Deaf W-22 and the Northwestern University No. 6 (NU-6). These are the most common open set materials, and there has been some discussion among audiologists concerning the differences between them. From a historical perspective, the CID W-22 lists came from the original Harvard PAL-PB50 words, and the W-22s are a group of the more familiar of those words. They were developed into four 50-word lists and are still commonly used by audiologists today. The NU-6 lists were developed later, and instead of looking for phonetic balance, they aimed at a more phonemic balance. The articulation function for both, using recorded materials, is about the same: 4% per dB. The NU-6 lists are considered somewhat more difficult than the W-22s. Clinical surveys show that both materials are used by practicing audiologists, with usage of the NU-6 lists beginning to surpass usage of the W-22s.

Nonsense materials. There are other materials that are available for suprathreshold speech testing. There are other monosyllabic word lists like the Gardner high frequency word list (Gardner, 1971) that could be useful for special applications or special populations. There are also nonsense syllabic tasks which were used in early research in communication. An advantage of the nonsense syllables is that the effects of word familiarity and lexical constraints are reduced as compared to using actual words as test materials. A few that are available are the City University of New York Nonsense Syllable test, the Nonsense Syllable test, and others.

Sentence materials. Sentence materials are gaining popularity, particularly in hearing aid applications. This is because speech that contains contextual cues and is presented in a noise background is expected to have better predictive validity than words in quiet. The two sentence procedures that are popular are the Hearing In Noise Test (HINT) (Nilsson, Soli, & Sullivan, 1994) and the QuickSIN (Killion, Niquette, Gudmundsen, Revit, & Banerjee, 2004). Other sentence tests with particular applications are the Synthetic Sentence Identification test (SSI), the Speech Perception in Noise test (SPIN), and the Connected Speech Test.

Monitored Live Voice vs. Recorded. As with speech threshold testing, the use of recorded materials for suprathreshold speech testing standardizes the test administration. The recorded version of the test is actually the test in my opinion. This goes back to a study in 1969 where the findings said the test is not just the written word list, but rather it is a recorded version of those words.

Inter-speaker and intra-speaker variability makes using recorded materials the method of choice in almost all cases for suprathreshold testing. Monitored live voice (MLV) is not recommended. In years gone by, recorded materials were difficult to manipulate, but the ease and flexibility that is afforded us by CDs and digital recordings makes recorded materials the only way to go for testing suprathreshold speech recognition. Another issue to consider is the use of the carrier phrase. Since the carrier phrase is included on recordings and recorded materials are the recommended procedure, that issue is settled. However, I do know that monitored live voice is necessary in certain situations and if monitored live voice is used in testing, then the carrier phrase should precede the test word. In monitored live voice, the carrier phrase is intended to allow the test word to have its own natural inflection and its own natural power. The VU meter should peak at 0 for the carrier phrase and the test word then is delivered at its own natural or normal level for that word in the phrase.  

Levels. The level at which testing is done is another consideration. The psychometric or performance-intensity function plots speech performance in percent correct on the Y-axis, as a function of the level of the speech signal on the X-axis. This is important because testing at only one level, which is fairly common, gives us insufficient information about the patient's optimal performance or what we commonly call the PB-max. It also does not allow us to know anything about any possible deterioration in performance if the level is increased. As a reminder, normal hearers show a function that reaches its maximum around 25 to 40 dB SL (re: SRT) and that is the reason why suprathreshold testing is often conducted at that level. For normals, the performance remains at that level, 100% or so, as the level increases. People with conductive hearing loss also show a similar function. Individuals with sensorineural hearing loss, however, show a performance function that reaches its maximum at generally less than 100%. They can either show performance that stays at that level as intensity increases, or they can show a curve that reaches its maximum and then decreases in performance as intensity increases. This is known as roll-over. A single level is not the best way to go as we cannot anticipate which patients may have rollover during testing, unless we test at a level higher than where the maximum score was obtained. I recognize that there are often time constraints in everyday practice, but two levels are recommended so that the performance-intensity function can be observed for an individual patient at least in an abbreviated way.
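
A performance-intensity function obtained at two or more levels can be summarized programmatically, as in the sketch below: PB-max is the best score obtained, and a clearly lower score at a level above the PB-max level suggests rollover. The 10-percentage-point drop used to flag rollover here is an assumed criterion for illustration, not a published cutoff.

```python
# Sketch of reading a performance-intensity (PI) function: percent-correct
# scores at two or more presentation levels. PB-max is the best score
# obtained; a drop at a higher level suggests rollover. The 10-point drop
# used here is an illustrative criterion only.

def summarize_pi_function(pi_points):
    """pi_points: list of (presentation_level_db_hl, percent_correct)."""
    pi_points = sorted(pi_points)                       # order by level
    pb_max_level, pb_max = max(pi_points, key=lambda p: p[1])
    drops = [score for level, score in pi_points
             if level > pb_max_level and pb_max - score >= 10]
    return {"PB-max": pb_max,
            "PB-max level (dB HL)": pb_max_level,
            "possible rollover": bool(drops)}

# Example: testing at two levels, as recommended in the text
print(summarize_pi_function([(40, 88), (60, 64)]))
# {'PB-max': 88, 'PB-max level (dB HL)': 40, 'possible rollover': True}
```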

Recently, Guthrie and Mackersie (2009) published a paper that compared several different presentation levels to ascertain which level would result in maximum word recognition in individuals with different hearing loss configurations. They looked at presentation levels ranging from 10 dB above the SRT to 5 dB below the UCL (uncomfortable listening level). Their results indicated that individuals with mild to moderate losses and those with more steeply sloping losses reached their best scores at UCL -5 dB. That was also true for patients with moderately severe to severe losses. The best phoneme recognition scores for their populations were achieved at a level of UCL -5 dB. As a reminder about speech recognition testing, masking is frequently needed because the test is being presented at a level above threshold, in many cases well above threshold. Masking will always be needed for suprathreshold testing when the presentation level in the test ear is 40 dB or greater above the best bone conduction threshold in the non-test ear, if supra-aural phones are used.
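
The masking rule in the last sentence translates directly into a simple check, sketched below for supra-aural earphones: mask whenever the speech presentation level in the test ear is 40 dB or more above the best bone conduction threshold in the non-test ear.

```python
# Direct translation of the masking rule stated above for supra-aural
# earphones: masking is needed when the speech presentation level in the
# test ear is 40 dB or more above the best bone-conduction threshold in
# the non-test ear.

def speech_masking_needed(presentation_db_hl, nontest_bc_thresholds_db_hl,
                          interaural_attenuation_db=40):
    best_bc = min(nontest_bc_thresholds_db_hl)   # best (lowest) BC threshold
    return presentation_db_hl - best_bc >= interaural_attenuation_db

# Example: presenting at 75 dB HL with non-test-ear BC thresholds of
# 30, 25, and 35 dB HL -> 75 - 25 = 50 dB >= 40, so mask.
print(speech_masking_needed(75, [30, 25, 35]))   # True
```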

Full lists vs. half-lists. Another consideration is whether a full list or a half-list should be administered. Original lists were composed of 50 words and those 50 words were created for phonetic balance and for simplicity in scoring. It made it easy for the test to be scored if 50 words were administered and each word was worth 2%. Because 50-word lists take a long time, people often use half-lists or even shorter lists for the purpose of suprathreshold speech recognition testing. Let's look into this practice a little further.

An early study was done by Thornton and Raffin (1978) using the binomial distribution model. They investigated the critical differences between a test score and a retest score that would be necessary for those scores to be considered statistically different. Their findings showed that with an increasing set size, variability decreased, so it would seem that more items are better. More recently, Hurley and Sells (2003) conducted a study aimed at developing a test methodology that would identify those patients requiring a full 50-item suprathreshold test and allow abbreviated testing of patients who do not need a full 50-item list. They used Auditec recordings and developed 10-word and 25-word screening tests. They found that the 10-word and 25-word screening versions of the four NU-6 lists were able to differentiate listeners with impaired word recognition, who needed the full 50-word list, from those with unimpaired word recognition ability, for whom the 10-word or 25-word list was sufficient. If abbreviated testing is important, then it would seem that this is the protocol to follow. These screening lists are available in a recorded version, and the findings were based on that recorded version. Once again, it is important to use recorded materials whether you are using a full list or an abbreviated list.
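
The Thornton and Raffin point, that score variability shrinks as the number of items grows, can be illustrated with the binomial standard error. The sketch below computes a rough 95% interval around a score using the normal approximation; it is not a reproduction of the published critical-difference tables.

```python
# Illustration of the binomial point above: the variability of a word
# recognition score shrinks as the number of test items grows. This uses
# the normal approximation to the binomial for a rough 95% interval; it is
# not a reproduction of the Thornton and Raffin (1978) tables.

import math

def approx_95ci(percent_correct, n_items):
    p = percent_correct / 100.0
    se = math.sqrt(p * (1 - p) / n_items)          # binomial standard error
    half_width = 1.96 * se * 100                   # in percentage points
    return (max(0.0, percent_correct - half_width),
            min(100.0, percent_correct + half_width))

print(approx_95ci(80, 25))   # 25-word list: roughly (64.3, 95.7)
print(approx_95ci(80, 50))   # 50-word list: roughly (68.9, 91.1)
```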

Quiet vs. Noise. Another consideration in suprathreshold speech recognition testing is whether to test in quiet or in noise. The effects of sensorineural hearing loss beyond the threshold loss, such as impaired frequency resolution or impaired temporal resolution, make speech recognition performance in quiet a poor predictor of how those individuals will perform in noise. Speech recognition in noise is being promoted by a number of experts because adding noise improves the sensitivity and the validity of the test. Giving the test at several levels provides better separation between people who have hearing loss and those who have normal hearing. We know that individuals with hearing loss have a lot more difficulty with speech recognition in noise than those with normal hearing, and that those with sensorineural hearing loss often require a much greater signal-to-noise ratio (SNR), 10 to 15 dB better, than normal hearers.

Monosyllabic words in noise have not been widely used in clinical evaluation; however, several word lists are available. One of them is the Words in Noise (WIN) test, which presents NU-6 words in multi-talker babble. The words are presented at several different SNRs with the babble remaining at a constant level. One of the advantages of these kinds of tests is that they are adaptive: they can be administered in a shorter period of time and they do not run into the same problems with ceiling and floor effects. As I mentioned earlier, sentence tests in noise have become increasingly popular in hearing aid applications. Testing speech in noise is one way to look at amplification pre- and post-fitting. The Hearing in Noise Test (HINT) and the QuickSIN have gained popularity in those applications. The HINT was developed by Nilsson and colleagues in 1994 and later modified. It is scored as the signal-to-noise ratio in dB that is necessary to obtain 50% correct performance on the sentences. The sentences are the BKB (Bamford-Kowal-Bench) sentences. They are presented in sets of 10, and the listener must repeat the entire sentence correctly in order to get credit. In the HINT, the speech-spectrum noise stays constant and the signal level is varied to obtain that 50% point. The QuickSIN was developed by Killion and colleagues (2004) and uses the IEEE sentences. It has six sentences per list, with five key words that are the scoring words in each sentence. All of them are presented in multi-talker babble. The sentences are presented one at a time in 5 dB decrements, from a high positive SNR down to 0 dB SNR. Again, the test is scored as the 50% point in terms of dB signal-to-noise ratio. The guide proposed by Killion for interpreting the SNR loss is: around 0 to 3 dB is considered normal, 3 to 7 dB is a mild SNR loss, 7 to 15 dB is a moderate SNR loss, and greater than 15 dB is a severe SNR loss.
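
Killion's interpretation guide quoted above maps an SNR loss in dB onto a category, as sketched below. How scores that land exactly on a boundary (3, 7, or 15 dB) should be categorized is not specified in the text, so this sketch assigns them to the milder category.

```python
# Mapping of the QuickSIN interpretation guide quoted above. Exact
# boundary values (3, 7, 15 dB) are assigned to the milder category,
# an assumption since the text does not specify boundary handling.

def quicksin_category(snr_loss_db):
    if snr_loss_db <= 3:
        return "normal"
    if snr_loss_db <= 7:
        return "mild SNR loss"
    if snr_loss_db <= 15:
        return "moderate SNR loss"
    return "severe SNR loss"

print(quicksin_category(5.5))   # 'mild SNR loss'
```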

Scoring. Scoring is another issue in suprathreshold speech recognition testing. It is generally done on a whole word basis. However phoneme scoring is another option. If phoneme scoring is used, it is a way of increasing the set size and you have more items to score without adding to the time of the test. If whole word scoring is used, the words have to be exactly correct. In this situation, being close does not count. The word must be absolutely correct in order to be judged as being correct. Over time, different scoring categorizations have been proposed, although the percentages that are attributed to those categories vary among the different proposals.

The traditional categorizations include excellent, good, fair, poor, and very poor. These categories are defined as follows (a small scoring lookup appears after the list):

  • Excellent or within normal limits = 90 - 100% on whole word scoring
  • Good or slight difficulty = 78 - 88%
  • Fair or moderate difficulty = 66 - 76%
  • Poor or great difficulty = 54 - 64%
  • Very poor = < 52%
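
Here is the scoring lookup referred to above, a direct translation of the listed ranges. Scores that fall in the gaps between ranges (most of which cannot occur with whole-word scoring of a 50-word list, where each word is worth 2%) are assigned to the lower category in this sketch.

```python
# Lookup for the traditional word recognition categories listed above.
# Scores falling in the gaps between the listed ranges are assigned to
# the lower category here, an assumption made for the sketch.

def wrs_category(percent_correct):
    if percent_correct >= 90:
        return "excellent / within normal limits"
    if percent_correct >= 78:
        return "good / slight difficulty"
    if percent_correct >= 66:
        return "fair / moderate difficulty"
    if percent_correct >= 54:
        return "poor / great difficulty"
    return "very poor"

print(wrs_category(84))   # 'good / slight difficulty'
```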

A very useful test, routinely administered to those being considered for hearing aids, is determining the level at which a listener finds listening most comfortable. The materials used for this are usually cold running speech or connected discourse. The listener is asked to indicate the level at which listening is most comfortable. Several trials are usually completed, because most comfortable listening is typically a range, not a single value; people sometimes want sounds a little louder or a little softer, so "range" is a more appropriate term than "most comfortable level." Whatever is obtained, whether it is a most comfortable level or a most comfortable range, should be recorded on the audiogram, and again, the material used should also be noted. As I mentioned earlier, the most comfortable level (MCL) is often not the level at which a listener achieves maximum intelligibility. Using the MCL to determine the level for the suprathreshold speech recognition measure is therefore not a good reason for this test; MCL is useful, but not for determining where maximum intelligibility will be. The study I mentioned earlier showed that maximum intelligibility was reached for most people with hearing loss at UCL -5 dB. MCL is useful, however, in determining the acceptable noise level (ANL).

The uncomfortable listening level (UCL) is also measured with cold running speech. The instructions for this test can certainly influence the outcome, since "uncomfortable" or "uncomfortably loud" for some individuals may not really be their UCL but rather a preference for listening at a softer level. It is important to define for the patient what you mean by uncomfortably loud. The utility of the UCL is in providing an estimate of the dynamic range for speech, which is the difference between the UCL and the SRT. In normal hearers, this range is usually 100 dB or more, but it is reduced, often dramatically, in ears with sensorineural hearing loss. By measuring the UCL, you can get an estimate of the individual's dynamic range for speech.

Acceptable noise level (ANL) is the amount of background noise that a listener is willing to accept while listening to speech (Nabelek, Tucker, & Letowski, 1991). It is a test of noise tolerance, and it has been shown to be related to successful use of hearing aids and to potential benefit from hearing aids (Nabelek, Freyaldenhoven, Tampas, & Muenchen, 2006). It uses the MCL and a measure known as the background noise level (BNL). To conduct the test, a recorded speech passage is presented to the listener in the sound field to establish the MCL; again, note the use of recorded materials. The noise is then introduced and raised to the highest level that the person is able to accept, or "put up with," while listening to and following the story in the speech passage. The ANL is the difference between the MCL and the BNL. Individuals who have very low ANLs are considered successful hearing aid users or good candidates for hearing aids; those with very high ANLs are considered unsuccessful users or poor hearing aid candidates. Obviously there are a number of other applications for speech in audiologic practice, not the least of which is the assessment of auditory processing; many seminars could be conducted on that topic alone. Another application, or future direction, for speech audiometry is to more realistically assess hearing aid performance in "real world" environments. This is an area where research is currently underway.
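
The ANL arithmetic described above is simply the MCL minus the BNL. The sketch below computes it and mirrors the low-versus-high interpretation from the text; because no numeric cutoffs are given here, the cutoffs are left as caller-supplied parameters rather than assumed values.

```python
# The ANL computation described above: the difference between the most
# comfortable level (MCL) for speech and the highest background noise
# level (BNL) the listener will accept. The text gives no numeric
# cutoffs, so the low/high thresholds are supplied by the caller.

def acceptable_noise_level(mcl_db_hl, bnl_db_hl):
    return mcl_db_hl - bnl_db_hl

def interpret_anl(anl_db, low_cutoff, high_cutoff):
    """Low ANLs are associated with successful hearing aid use and high
    ANLs with unsuccessful use (per the text)."""
    if anl_db <= low_cutoff:
        return "low ANL: likely successful hearing aid user"
    if anl_db >= high_cutoff:
        return "high ANL: likely unsuccessful hearing aid user"
    return "intermediate ANL"

anl = acceptable_noise_level(mcl_db_hl=55, bnl_db_hl=48)
print(anl)                         # 7 dB
print(interpret_anl(anl, 7, 13))   # cutoffs here are illustrative only
```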

Question: Are there any more specific instructions for the UCL measurement?

Answer: Instructions are very important. We need to make it clear to the patient exactly what we expect them to do. I personally do not like things loud; if I am asked to indicate what is uncomfortably loud, I will respond well below what is really my UCL. You have to be very direct in instructing your patients that you are not looking for "a little uncomfortable," but for the point at which they just do not want to hear it or cannot take it.

Question: Can you sum up what the best methods are to test hearing aid performance? I assume this means with speech signals.

Answer: I think the use of the HINT or the QuickSIN would be the most useful as a behavioral test. We have other ways of looking at performance that are not behavioral.

Question: What about dialects? In my area, some of the local dialects have clipped words during speech testing. I am not sure if I should count those as correct or incorrect.

Answer: It all depends on your situation. If a patient's production is really reflective of the dialect of that region and they are saying the word as everyone else in that area would say it, then I would say they do have the word correct. If you are really unclear, you can always ask the patient to spell the word or write it down. This extra time can be inconvenient, but that is the best way to be sure that they have correctly identified the word.

Question: Is there a reference for the bracketing method?

Answer: The bracketing method is based on the old modified Hughson-Westlake procedure that many people use for pure tone threshold testing. It is very similar to that traditional down 10 dB, up 5 dB approach. I am sure there are more references, but Hughson-Westlake is what bracketing is based on.

Question: Once you get an SRT result, if you want to compare it to the thresholds to validate your pure tones, how do you compare it to the audiogram?

Answer: If it is a flat hearing loss, then you can compare it to the three-frequency pure tone average (PTA). If there is a high frequency loss, where audibility at perhaps 2000 Hz is greatly reduced, then it is better to use just the average of 500 Hz and 1000 Hz as your comparison. If it is a steeply sloping loss, then you look for agreement with the best threshold, which would probably be the 500 Hz threshold. The reverse is also true for patients who have rising configurations. Compare the SRT to the best two frequencies of the PTA if the loss has either a steep slope or a steep rise, or to the best frequency in the PTA if it is a really precipitous change in configuration.

Question: Where can I find speech lists in Russian or other languages?

Answer: Auditec has some material available in languages other than English; it would be best to contact them directly. You can also view their catalog at www.auditec.com

Carolyn Smaka: This raises a question I have. If an audiologist is not fluent in a particular language, such as Spanish, is it okay to obtain a word list or recording in that language and conduct speech testing?

Janet Schoepflin: I do not think that is a good practice. If you are not fluent in a language, you do not know all the subtleties of that language and the allophonic variations. People want to get an estimation of suprathreshold speech recognition, and this would be an attempt to do that, but this goes along with dialect.
Whether you are using a recording or doing your best to say the words exactly as they are supposed to be said, if your patient is fluent in a language and says the word back to you, it is possible that you will score the word incorrectly, since you are not familiar with all the variations in that language. You may think a response is correct when it is actually incorrect, or think it is incorrect when it is correct based on the dialect or variation of that language.

Question: In school we were instructed to use the full 50-word list for any word discrimination testing at suprathreshold levels, but that if we are pressed for time, a half list would be okay. However, my professor warned us that we absolutely must go in order on the word list. Can you clarify this?

Answer: I am not sure why that might have been said. I was trained in the model of using the 50-word list, because the phonetic balance that was proposed for those words was based on all 50 words; if you only used 25 words, you were not getting the phonetic balance. I think the more current findings from Hurley and Sells show us that it is possible to use a shorter list developed specifically for this purpose. It should be the recorded version of those words. These lists are available through Auditec.

Question: On the NU-6 list, the words 'tough' and 'puff' are next to each other. 'Tough' is often mistaken for 'puff,' so then when we read 'puff,' the person looks confused. Is it okay to mix up the order on the word list?

Answer: I think in that case it is perfectly fine to move that one word down.

Question: When do you recommend conducting speech testing, before or after pure tone testing?

Answer: I have always been a person who likes to interact with my patients. My own procedure is to do an SRT first. Frequently for the SRT I do use live voice; I do not use monitored live voice for suprathreshold testing. It gives me a time to interact with the patient. People feel comfortable with speech; it is a communicative act. Then I do pure tone testing. Personally, I would not do suprathreshold testing until I finished pure tone testing, so my sequence is often SRT, pure tones, and then suprathreshold. If that is not a good protocol for you based on time, then I would conduct pure tone testing, SRT, and then suprathreshold.

Question: Some of the spondee words are outdated, such as inkwell and whitewash. Is it okay to substitute other words that we know are spondee words but may not be on the list? Or if we familiarize people, does it matter?

Answer: The words that are on the list were put there for their so-called familiarity, but also because they were somewhat homogeneous and equal in intelligibility. I agree that inkwell, drawbridge, and whitewash are outdated. If you follow a protocol where you are using a representative sample of the words and you are familiarizing, I think it is perfectly fine to eliminate the words you do not want to use. You just do not want to end up using only five or six words, as that will limit the test set.

Question: At what age is it appropriate to expect a child to perform suprathreshold speech recognition testing?

Answer: If the child has a receptive language age of around 4 or 5 years, maybe even 3 years, it is possible to use the NU-CHIPS as a measure. It really depends on language more than anything else, and on the fact that the child can sit still for a period of time to do the test.
Question: Regarding masking, when you are going 40 dB above the bone conduction threshold in the non-test ear, what frequency are you looking at? Are you comparing speech presented at 40 dB above a pure tone average of the bone conduction thresholds?

Answer: The best bone conduction threshold in the non-test ear is what really should be used.

Question: When seeing a patient in follow-up after an ENT prescribes steroid therapy for hydrops, do you recommend using the same word list to compare their suprathreshold speech recognition?

Answer: I think it is better to use a different list, personally. Word familiarity, as we said, can influence even threshold, and it certainly can affect suprathreshold performance. I think it is best to use a different word list.

Carolyn Smaka: Thanks to everyone for their questions. Dr. Schoepflin has provided her email address with the handout. If your question was not answered or if you have further thoughts after the presentation, please feel free to follow up directly with her via email.

Janet Schoepflin: Thank you so much. It was my pleasure, and I hope everyone found the presentation worthwhile.
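
The SRT-to-audiogram comparison described in the answer above can be written as a small rule, sketched below: use the 500/1000/2000 Hz three-frequency PTA for flat losses, the best two of those frequencies for sloping or rising losses, and the single best frequency for precipitous configurations. The agreement tolerance is not stated in the answer, so it is left as a parameter.

```python
# Sketch of the SRT/audiogram cross-check described in the answer above.
# The agreement tolerance is a caller-supplied parameter, not a value
# taken from the text.

def pta_for_comparison(thresholds, configuration):
    """thresholds: dict of dB HL at 500, 1000, and 2000 Hz."""
    values = sorted(thresholds[f] for f in (500, 1000, 2000))
    if configuration == "flat":
        return sum(values) / 3                 # three-frequency PTA
    if configuration in ("sloping", "rising"):
        return sum(values[:2]) / 2             # best two frequencies
    return values[0]                           # precipitous: best frequency

def srt_agrees(srt_db_hl, thresholds, configuration, tolerance_db):
    return abs(srt_db_hl - pta_for_comparison(thresholds, configuration)) <= tolerance_db

print(srt_agrees(35, {500: 30, 1000: 40, 2000: 65}, "sloping", tolerance_db=10))  # True
```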

American Speech-Language-Hearing Association. (1988). Determining threshold level for speech [Guidelines]. Available from www.asha.org/policy

Gardner, H. (1971). Application of a high-frequency consonant discrimination word list in hearing-aid evaluation. Journal of Speech and Hearing Disorders, 36, 354-355.

Guthrie, L., & Mackersie, C. (2009). A comparison of presentation levels to maximize word recognition scores. Journal of the American Academy of Audiology, 20(6), 381-390.

Hurley, R., & Sells, J. (2003). An abbreviated word recognition protocol based on item difficulty. Ear & Hearing, 24(2), 111-118.

Killion, M., Niquette, P., Gudmundsen, G., Revit, L., & Banerjee, S. (2004). Development of a quick speech-in-noise test for measuring signal-to-noise ratio loss in normal-hearing and hearing-impaired listeners. Journal of the Acoustical Society of America, 116(4 Pt 1), 2395-2405.

Nabelek, A., Freyaldenhoven, M., Tampas, J., Burchfield, S., & Muenchen, R. (2006). Acceptable noise level as a predictor of hearing aid use. Journal of the American Academy of Audiology, 17, 626-639.

Nabelek, A., Tucker, F., & Letowski, T. (1991). Toleration of background noises: Relationship with patterns of hearing aid use by elderly persons. Journal of Speech and Hearing Research, 34, 679-685.

Nilsson, M., Soli, S., & Sullivan, J. (1994). Development of the hearing in noise test for the measurement of speech reception thresholds in quiet and in noise. Journal of the Acoustical Society of America, 95(2), 1085-1099.

Thornton, A., & Raffin, M. (1978). Speech-discrimination scores modeled as a binomial variable. Journal of Speech and Hearing Research, 21, 507-518.

Tillman, T., & Jerger, J. (1959). Some factors affecting the spondee threshold in normal-hearing subjects. Journal of Speech and Hearing Research, 2, 141-146.


Chair, Communication Sciences and Disorders, Adelphi University

Janet Schoepflin is an Associate Professor and Chair of the Department of Communication Sciences and Disorders at Adelphi University and a member of the faculty of the Long Island AuD Consortium. Her areas of research interest include speech perception in children and adults, particularly those with hearing loss, and the effects of noise on audition and speech recognition performance.

American Academy of Audiology

FIGURE 1. Subject and speaker diagram where X marks the placement of the calibration microphone.

To perform free-field calibration, the microphone location must estimate the center of the subject’s head position (marked by an X in Figure 1). All measurements should be made with the microphone in this static position. Set the sound level meter (SLM) to SPL mode (see FIGURE 2).

  • Speakers at 0 degrees (right channel) and 180 degrees (left channel).
  • MedRx free-field calibrations (completed by certified technician).
  • Complete full pure-tone, free-field calibration using warble tones.
  • MedRx equipment must have white noise, speech babble (A-weighted), and the speech tone calibrated (1000 Hz calibration tone).
  • Save calibration.

FIGURE 2. Avant advanced audiometry settings: set free field to SPL.

Special Thanks

The authors offer thanks and appreciation to Andy Vermiglio, AuD, and Caleb Sparkman, AuD, for their review and valuable input regarding the preparation of this manuscript. 


Douglas L. Beck

Douglas L. Beck, AuD, is the executive director of Academic Sciences at Oticon, Inc, in Somerset, New Jersey.

Lauren Benitez

Lauren Benitez, AuD, is the hearing clinic director at MedRx, in Largo, Florida.


Speech Audiometry: An Introduction


Speech audiometry is an umbrella term used to describe a collection of audiometric tests using speech as the stimulus. You can perform speech audiometry by presenting speech to the subject in both quiet and in the presence of noise (e.g. speech babble or speech noise). The latter is speech-in-noise testing and is beyond the scope of this article.

Speech audiometry is a core test in the audiologist’s test battery because pure tone audiometry (the primary test of hearing sensitivity) is a limited predictor of a person’s ability to recognize speech. Improving an individual’s access to speech sounds is often the main motivation for fitting them with a hearing aid. Therefore, it is important to understand how a person with hearing loss recognizes or discriminates speech before fitting them with amplification, and speech audiometry provides a method of doing this.

A decrease in hearing sensitivity, as measured by pure tone audiometry, results in greater difficulty understanding speech. However, the literature also shows that two individuals of the same age with similar audiograms can have quite different speech recognition scores. Therefore, by performing speech audiometry, an audiologist can determine how well a person can access speech information.

Acquiring this information is key in the diagnostic process. For instance, it can assist in differentiating between different types of hearing loss. You can also use information from speech audiometry in the (re)habilitation process. For example, the results can guide you toward the appropriate amplification technology, such as directional microphones or remote microphone devices. Speech audiometry can also provide the audiologist with a prediction of how well a subject will hear with their new hearing aids. You can use this information to set realistic expectations and help with other aspects of the counseling process.

Below are some more examples of how you can use the results obtained from speech testing.

Identify need for further testing

Based on the results from speech recognition testing, it may be appropriate to perform further testing to get more information on the nature of the hearing loss. An example could be to perform a TEN test to detect a dead region or to perform the Audible Contrast Threshold (ACT™) test.

Inform amplification decisions

You can use the results from speech audiometry to determine whether binaural amplification is the most appropriate fitting approach or if you should consider alternatives such as CROS aids.

You can use the results obtained through speech audiometry to discuss and manage the amplification expectations of patients and their communication partners.

Unexpected asymmetric speech discrimination, significant roll-over, or particularly poor speech discrimination may warrant further investigation by a medical professional.

Non-organic hearing loss

You can use speech testing to cross-check the results from pure tone audiometry for suspected non‑organic hearing loss.

Contraindications and considerations when performing speech audiometry

Before speech audiometry, it is important that you perform pure tone audiometry and otoscopy. Results from these procedures can reveal contraindications to performing speech audiometry.

Otoscopic findings

Speech testing using headphones or inserts is generally contraindicated when the ear canal is occluded with:

  • A foreign body
  • Infective otitis externa

In these situations, you can perform bone conduction speech testing or sound field testing.

Audiometric findings

Speech audiometry can be challenging to perform in subjects with severe-to-profound hearing losses as well as asymmetrical hearing losses, where the level of stimulation and/or masking noise required is beyond the limits of the audiometer or the patient's uncomfortable loudness levels (ULLs).

Subject variables

Depending on the age or language ability of the subject, complex words may not be suitable. This is particularly true for young children and adults with learning disabilities or other complex presentations such as dementia and reduced cognitive function.

You should also perform speech audiometry in a language which is native to your patient. Speech recognition testing may not be suitable for patients with expressive speech difficulties. However, in these situations, speech detection testing should be possible.

Before we discuss speech audiometry in more detail, let’s briefly consider the instrumentation to deliver the speech stimuli. As speech audiometry plays a significant role in diagnostic audiometry, many audiometers include – or have the option to include – speech testing capabilities.

Table 1 outlines which audiometers from Interacoustics can perform speech audiometry.

Table 1: Audiometers from Interacoustics that can perform speech audiometry.

Because speech audiometry uses speech as the stimulus and languages differ across the globe, the way in which speech audiometry is implemented varies depending on the country where the test is being performed. For the purposes of this article, we will start by addressing how to measure speech in quiet, using the International Organization for Standardization standard ISO 8253-3:2022 as the reference for the terminology and processes encompassing speech audiometry. We will describe two tests: speech detection testing and speech recognition testing.

Speech detection testing

In speech detection testing, you ask the subject to identify when they hear speech (not necessarily understand). It is the most basic form of speech testing because understanding is not required. However, it is not commonly performed. In this test, words are normally presented to the ear(s) through headphones (monaural or binaural testing) or through a loudspeaker (binaural testing).

Speech detection threshold (SDT)

Here, the tester will present speech at varying intensity levels and the patient identifies when they can detect speech. The goal is to identify the level at which the patient detects speech in 50% of the trials. This is the speech detection threshold. It is important not to confuse this with the speech discrimination threshold. The speech discrimination threshold looks at a person’s ability to recognize speech and we will explain it later in this article.

The speech detection threshold has been found to correlate well with the pure tone average, which is calculated from pure tone audiometry. Because of this, the main application of speech detection testing in the clinical setting is confirmation of the audiogram.

Speech recognition testing

In speech recognition testing, also known as speech discrimination testing, the subject must not only detect the speech, but also correctly recognize the word or words presented. This is the most popular form of speech testing and provides insights into how a person with hearing loss can discriminate speech in ideal conditions.

Across the globe, the methods of obtaining this information are different and this often leads to confusion about speech recognition testing. Despite there being differences in the way speech recognition testing is performed, there are some core calculations and test parameters which are used globally.

Speech recognition testing: Calculations

There are two main calculations in speech recognition testing.

1. Speech recognition threshold (SRT)

This is the level in dB HL at which the patient recognizes 50% of the test material correctly. This level will differ depending on the test material used. Some references describe the SRT as the speech discrimination threshold or SDT. This can be confusing because the acronym SDT belongs to the speech detection threshold. For this reason, we will not use the term discrimination but instead continue with the term speech recognition threshold.

2. Word recognition score (WRS)

In word recognition testing, you present a list of phonetically balanced words to the subject at a single intensity and ask them to repeat the words they hear. You score whether the patient repeats each word correctly or incorrectly. This score, expressed as a percentage of correct words, is calculated by dividing the number of words correctly identified by the total number of words presented.

In some countries, multiple word recognition scores are recorded at various intensities and plotted on a graph. In other countries, a single word recognition score is performed using a level based on the SRT (usually presented 20 to 40 dB louder than the SRT).
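
The word recognition score calculation described above, together with the common practice of presenting the list 20 to 40 dB above the SRT, can be sketched as follows; the 30 dB default offset is just an example within that range, not a prescribed value.

```python
# The word recognition score (WRS) calculation described above, plus the
# common choice of presenting the word list 20 to 40 dB above the SRT.
# The 30 dB default offset is only an example within that range.

def word_recognition_score(words_correct: int, words_presented: int) -> float:
    return 100.0 * words_correct / words_presented

def wrs_presentation_level(srt_db_hl: float, offset_db: float = 30.0) -> float:
    return srt_db_hl + offset_db

print(word_recognition_score(22, 25))    # 88.0 (%)
print(wrs_presentation_level(30))        # 60.0 dB HL
```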

Speech recognition testing: Parameters

Before completing a speech recognition test, there are several parameters to consider.

1. Test transducer

You can perform speech recognition testing using air conduction, bone conduction, and speakers in a sound-field setup.

2. Types of words

Speech recognition testing can be performed using a variety of different words or sentences. Some countries use monosyllabic words such as ‘boat’ or ‘cat’ whereas other countries prefer to use spondee words such as ‘baseball’ or ‘cowboy’. These words are then combined with other words to create a phonetically balanced list of words called a word list.

3. Number of words

The number of words in a word list can impact the score. If there are too few words in the list, then there is a risk that not enough data points are acquired to accurately calculate the word recognition score. However, too many words may lead to increased test times and patient fatigue. Word lists often consist of 10 to 25 words.

4. Scoring method

You can score words either as whole words or by the number of phonemes they contain.

An example of scoring can be illustrated by the word ‘boat’. When scoring using whole words, anything other than the word ‘boat’ would result in an incorrect score.

However, in phoneme scoring, the word ‘boat’ is broken down into its individual phonemes: /b/, /oa/, and /t/. Each phoneme is then scored as a point, meaning that the word boat has a maximum score of 3. An example could be that a patient mishears the word ‘boat’ and reports the word to be ‘float’. With phoneme scoring, 2 points would be awarded for this answer whereas in word scoring, the word float would be marked as incorrect.
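To make the two scoring schemes concrete, here is a minimal sketch in Python; the word list, phoneme breakdowns, and responses are hypothetical, and in practice the scoring judgement is made by ear rather than by string matching.

```python
# Minimal sketch of whole-word vs. phoneme scoring for word recognition testing.
# The word list, phoneme breakdowns, and responses below are hypothetical.

PHONEMES = {
    "boat":  ["b", "oa", "t"],
    "float": ["f", "l", "oa", "t"],
    "cat":   ["c", "a", "t"],
}

def whole_word_score(targets, responses):
    """Percent of words repeated exactly as presented."""
    correct = sum(t == r for t, r in zip(targets, responses))
    return 100.0 * correct / len(targets)

def phoneme_score(targets, responses):
    """Percent of target phonemes heard correctly.

    Simplified rule: a target phoneme earns a point if it also occurs in the
    response word; in the clinic this judgement is made by ear, per phoneme.
    """
    total = sum(len(PHONEMES[t]) for t in targets)
    earned = sum(
        sum(p in PHONEMES.get(r, []) for p in PHONEMES[t])
        for t, r in zip(targets, responses)
    )
    return 100.0 * earned / total

targets = ["boat", "cat"]
responses = ["float", "cat"]                 # 'boat' misheard as 'float'

print(whole_word_score(targets, responses))  # 50.0 -> 'float' is simply wrong
print(phoneme_score(targets, responses))     # ~83.3 -> /oa/ and /t/ still earn credit
```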

5. Delivery of material

Modern audiometers have the functionality of storing word lists digitally onto the hardware of the device so that you can deliver a calibrated speech signal the same way each time you test a patient. This is different from the older methods of testing using live voice or a CD recording of the speech material. Using digitally stored and calibrated speech material in .wav files provides the most reliable and repeatable results as the delivery of the speech is not influenced by the tester.

6. Aided or unaided

You can perform speech recognition testing either aided or unaided. When performing aided measurements, the stimulus is usually played through a loudspeaker and the test is recorded binaurally.

Global examples of how speech recognition testing is performed and reported

Below are examples of how speech recognition testing is performed in the US and the UK. This will show how speech testing varies across the globe.

Speech recognition testing in the US: Speech tables

In the US, the SRT and WRS are usually performed as two separate tests using different word lists for each test. The results are displayed in tables called speech tables.

The SRT is the first speech test performed and typically uses spondee words (words with two equally stressed syllables, such as ‘hotdog’) as the stimulus. During this test, you present spondee words to the patient at different intensities, and a bracketing technique establishes the threshold at which the patient correctly identifies 50% of the words.


Below, you can see a table showing the results from an SRT test (Figure 1). Here, we can see that the SRT has been measured in each ear. The table shows the intensity at which the SRT was found as well as the transducer, word list, and the level at which masking noise was presented (if applicable). Here we see an unaided SRT of 30 dB HL in both the left and right ears.

For both ears, the transducer type is phone and the masking level is 15 dB HL. The word list for the right ear is Spondee A, while the word list for the left ear is Spondee B.

Once you have established the intensity of the SRT in dB HL, you can use it to calculate the intensity to present the next list of words to measure the WRS. In WRS testing, it is common to start at an intensity of between 20 dB and 40 dB louder than the speech recognition threshold and to use a different word list from the SRT. The word lists most commonly used in the US for WRS are the NU-6 and CID-W22 word lists.

In word recognition score testing, you present an entire word list to the test subject at a single intensity and score each word based on whether the subject can correctly repeat it or not. The results are reported as a percentage.


Below is an image of a speech table showing the word recognition score in the left ear using the NU‑6 word list at an intensity of 55 dB HL (Figure 2). Here we can see that the patient in this example scored 90%, indicating good speech recognition at moderate intensities.


Speech recognition testing in the UK: Speech audiogram

In the UK, speech recognition testing is performed with the goal of obtaining a speech audiogram. A speech audiogram is a graphical representation of how well an individual can discriminate speech across a variety of intensities (Figure 3).


In the UK, the most common method of recording a speech audiogram is to present several different word lists to the subject at varying intensities and calculate multiple word recognition scores. The AB (Arthur Boothroyd) word lists are the most commonly used. The initial list is presented at around 20 to 30 dB sensation level, with subsequent lists performed at quieter intensities before finally increasing the sensation level to determine how well the patient can recognize words at louder intensities.

The speech audiogram is constructed by plotting the WRS at each intensity on a graph displaying the word recognition score in % as a function of intensity in dB HL.

Below is an image of a completed speech audiogram (Figure 4). There are several components.

Point A on the graph shows the intensity in dB HL where the person identified 50% of the speech material correctly. This is the speech recognition threshold or SRT.

Point B on the graph shows the maximum speech recognition score which informs the clinician of the maximum score the subject obtained.

Point C on the graph shows the reference speech recognition curve; this is specific to the test material used (e.g., AB words) and method of presentation (e.g., headphones), and shows a curve which describes the median speech recognition scores at multiple intensities for a group of normal hearing individuals.

Point A is at about 45 dB HL. Point B is at about 70 dB HL.

Having this displayed on a single graph can provide a quick and easy way to determine and analyze the ability of the person to hear speech and compare their results to a normative group. Lastly, you can use the speech audiogram to identify roll-over. Roll-over occurs when the speech recognition deteriorates at loud intensities and can be a sign of retro-cochlear hearing loss. We will discuss this further in the interpretation section.

Masking in speech recognition testing

Just like in audiometry, cross hearing can also occur in speech audiometry. Therefore, it is important to mask the non-test ear when testing monaurally. Masking is important because word recognition testing is usually performed at supra-threshold levels. Speech encompasses a wide spectrum of frequencies, so the use of narrowband noise as a masking stimulus is not appropriate, and you need to modify the masking noise for speech audiometry. In speech audiometry, speech noise is typically used to mask the non-test ear.

There are several approaches to calculating the required masking noise level. An equation by Coles and Priede (1975) suggests one approach that applies to all types of hearing loss (sensorineural, conductive, and mixed):

  • Masking level = DS + max ABGNT − 40 + EM

It considers the following factors.

1. Dial setting

DS is the level of the dial setting in dB HL for presentation of speech to the test ear.

2. Air-bone gap

Max ABGNT is the maximum air-bone gap between 250 and 4000 Hz in the non-test ear.

3. Interaural attenuation

The value of 40 comes from the minimum interaural attenuation for masking in audiometry using supra-aural headphones (for insert earphones, this would be 55 dB).

4. Effective masking

EM is effective masking. Modern audiometers are calibrated in EM, so you don’t need to include this in the calculation. However, if you are using an old audiometer calibrated to an older calibration standard, then you should calculate the EM.

You can calculate it by measuring the difference between the speech dial setting that yields a score of 95% in quiet for normal listeners and the noise dial setting, presented to the same ear, that reduces the score to less than 10%.
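To illustrate the arithmetic, here is a minimal sketch of the calculation above; the threshold values are hypothetical and the function simply mirrors the Coles and Priede formula.

```python
# Minimal sketch of the Coles and Priede (1975) masking calculation above.
# All input values here are hypothetical; 40 dB is the interaural attenuation
# assumed for supra-aural headphones (use ~55 dB for insert earphones).

def speech_masking_level(dial_setting_db, nontest_air_bone_gaps_db,
                         interaural_attenuation_db=40, effective_masking_db=0):
    """Masking level = DS + max ABG (non-test ear, 250-4000 Hz) - IA + EM.

    effective_masking_db is only needed for older audiometers that are not
    calibrated in effective masking; modern devices can leave it at 0.
    """
    max_abg = max(nontest_air_bone_gaps_db)
    return (dial_setting_db + max_abg
            - interaural_attenuation_db + effective_masking_db)

# Speech at 70 dB HL; non-test-ear air-bone gaps at 250-4000 Hz in dB.
print(speech_masking_level(70, [10, 15, 20, 15, 10]))  # 50 dB of speech noise
```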

You can use the results from speech audiometry for many purposes. The sections below describe these applications.

1. Cross-check against pure tone audiometry results

The cross-check principle in audiology states that no auditory test result should be accepted and used in the diagnosis of hearing loss until you confirm or cross-check it by one or more independent measures (Hall J. W., 3rd, 2016). Speech-in-quiet testing serves this purpose for the pure tone audiogram.

The following scores and their descriptions identify how well the speech detection threshold and the pure tone average correlate (Table 2).

Table 2: Correlation between speech detection threshold and pure tone average.

If there is a poor correlation between the speech detection threshold and the pure tone average, it warrants further investigation to determine the underlying cause or to identify if there was a technical error in the recordings of one of the tests.
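As a simple illustration of this cross-check, the sketch below compares a speech detection threshold against a three-frequency pure tone average; the ±10 dB agreement limit is an assumption chosen for illustration, since acceptable ranges vary between protocols and test materials.

```python
# Sketch of a cross-check between the speech detection threshold (SDT) and the
# pure tone average (PTA). The +/-10 dB agreement limit is an illustrative
# assumption; acceptable ranges vary with protocol and test material.

def pure_tone_average(thresholds_db_hl):
    """Conventional PTA: mean of the 500, 1000 and 2000 Hz thresholds."""
    return sum(thresholds_db_hl[f] for f in (500, 1000, 2000)) / 3.0

def crosscheck(sdt_db_hl, thresholds_db_hl, limit_db=10):
    pta = pure_tone_average(thresholds_db_hl)
    difference = sdt_db_hl - pta
    return pta, difference, abs(difference) <= limit_db

thresholds = {500: 30, 1000: 35, 2000: 40}     # hypothetical audiogram, dB HL
pta, diff, agrees = crosscheck(30, thresholds)
print(pta, diff, agrees)  # 35.0 -5.0 True -> no obvious technical error suspected
```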

2. Detect asymmetries between ears

Another core use of speech audiometry in quiet is to determine the symmetry between the two ears and whether it is appropriate to fit binaural amplification. Significant differences between ears can occur when there are two different etiologies causing hearing loss.

An example of this could be a patient with sensorineural hearing loss who then also contracts unilateral Meniere’s disease. In this example, it would be important to understand whether there are significant differences in the word recognition scores between the two ears. If there are, it may not be appropriate to fit binaural amplification, and other forms of amplification such as contralateral routing of sound (CROS) devices may be more appropriate.

3. Identify if further testing is required

The results from speech audiometry in quiet can identify whether further testing is required. This could be highlighted in several ways.

One example could be a severe difference in the SRT and the pure tone average. Another example could be significant asymmetries between the two ears. Lastly, very poor speech recognition scores in quiet might also be a red flag for further testing.

In these examples, the clinician might decide to perform a test to detect the presence of cochlear dead regions such as the TEN test or an ACT test to get more information.

4. Detect retro-cochlear hearing loss

In subjects with retro-cochlear causes of hearing loss, speech recognition can begin to deteriorate as sounds are made louder. This is called ‘roll-over’ and is calculated by the following equation:

  • Roll-over index = (maximum score − minimum score) / maximum score

If roll-over is detected at a certain value (the value is dependent on the word list chosen for testing but is commonly larger than 0.4), then it is considered to be a sign of retro-cochlear pathology. This could then have an influence on the fitting strategy for patients exhibiting these results.
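To make the calculation concrete, here is a small sketch using hypothetical word recognition scores obtained at increasing levels; taking the minimum from levels at or above the level of the best score is a common convention and is assumed here.

```python
# Sketch of the roll-over index using hypothetical word recognition scores
# measured at increasing presentation levels. Taking the minimum score from
# levels at or above the level of the best score is a common convention,
# assumed here.

def rollover_index(scores_by_level):
    """(maximum score - minimum score at or above that level) / maximum score."""
    levels = sorted(scores_by_level)
    scores = [scores_by_level[lvl] for lvl in levels]
    best = max(scores)
    later = scores[scores.index(best):]       # scores at or above the best level
    return (best - min(later)) / best

scores = {50: 70, 60: 88, 70: 92, 80: 72, 90: 48}   # % correct at each dB HL level
print(round(rollover_index(scores), 2))  # 0.48 -> above the ~0.4 often quoted
```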

It is important to note, however, that as the cross-check principle states, you should interpret any roll-over with caution and perform additional tests such as acoustic reflexes, the reflex decay test, or auditory brainstem response measurements to confirm the presence of a retro-cochlear lesion.

5. Predict success with amplification

The maximum speech recognition score is a useful measure that you can use to predict whether a person will benefit from hearing aids. More recent and advanced tests, such as the ACT test combined with the Acceptable Noise Level (ANL) test, offer good alternatives for predicting success with amplification.

Just like in pure tone audiometry, the stimuli presented during speech audiometry require annual calibration by a specialized technician. The transducers of the audiometer should also be checked daily to determine whether the speech stimulus contains any distortions or level abnormalities. This process replicates the daily checks a clinician would perform for pure tone audiometry. If speech is being presented using a sound field setup, then you can use a sound level meter to check that the material is being presented at the correct level.

The next level of calibration depends on how the speech material is delivered to the audiometer. Speech material can be presented in many ways including live voice, CD, or installed WAV files on the audiometer. Speech being presented as live voice cannot be calibrated but instead requires the clinician to use the VU meter on the audiometer (which indicates the level of the signal being presented) to determine if they are speaking at the correct intensity. Speech material on a CD requires daily checks and is also performed using the VU meter on the audiometer. Here, a speech calibration tone track on the CD is used, and the VU meter is adjusted accordingly to the desired level as determined by the manufacturer of the speech material.

The most reliable way to deliver a speech stimulus is through a WAV file. By presenting through a WAV file, you can skip the daily tone-based calibration as this method allows you to calibrate the speech material as part of the annual calibration process. This saves the clinician time and ensures the stimulus is calibrated to the same standard as the pure tones in their audiometer. To calibrate the WAV file stimulus, the speech material is calibrated against a speech calibration tone. This is stored on the audiometer. Typically, a 1000 Hz speech tone is used for the calibration and the calibration process is the same as for a 1000 Hz pure tone calibration.

Lastly, if the speech is being presented through the sound field, a calibration professional should perform an annual sound field speaker calibration using an external free field microphone aimed directly at the speaker from the position of the patient’s head.

Coles, R. R., & Priede, V. M. (1975). Masking of the non-test ear in speech audiometry .  The Journal of laryngology and otology ,  89 (3), 217–226.

Graham, J., & Baguley, D. (2009). Ballantyne's Deafness, 7th Edition. Wiley Blackwell.

Hall J. W., 3rd (2016). Crosscheck Principle in Pediatric Audiology Today: A 40-Year Perspective .  Journal of audiology & otology ,  20 (2), 59–67.

Katz, J. (2009). Handbook of Clinical Audiology. Wolters Kluwer.

Killion, M. C., Niquette, P. A., Gudmundsen, G. I., Revit, L. J., & Banerjee, S. (2004).  Development of a quick speech-in-noise test for measuring signal-to-noise ratio loss in normal-hearing and hearing-impaired listeners . The Journal of the Acoustical Society of America , 116 (4), 2395–2405.

Stach, B.A (1998). Clinical Audiology: An Introduction, Cengage Learning.



Speech Audiometry

Introduction

Speech audiometry is an important component of a comprehensive hearing evaluation. There are several kinds of speech audiometry, but the most common uses are to 1) verify the pure tone thresholds 2) determine speech understanding and 3) determine most comfortable and uncomfortable listening levels. The results are used with the other tests to develop a diagnosis and treatment plan.

SDT = Speech Detection Threshold, SAT = Speech Awareness Threshold. These terms are interchangeable and they describe the lowest level at which a patient can hear the presence of speech 50% of the time. They specifically refer to the speech being AUDIBLE, not INTELLIGIBLE.

This test is performed by presenting spondee (two-syllable) words such as baseball, ice cream, or hotdog, and the patient is to respond when they hear the speech. This is often used with non-verbal patients such as infants or other difficult-to-test populations. The thresholds should correspond to the PTA and are used to verify the pure tone threshold testing.

How to Test:      

Instruct the patient that he or she will be hearing words that have two parts, such as “mushroom” or “baseball.” The patient should repeat the words and if not sure, he or she should not be afraid to guess.

Using either live voice or recorded speech, present the spondee word lists testing the better ear first. Start 20 dB above the 1000 Hz pure tone threshold level. Present one word on the list and, if the response is correct, lower the level by 5 dB. Continue until the patient has difficulty with the words. When this occurs, present more words for each 5 dB step.
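To show the bracketing logic in code, here is a simplified simulation of the descending search described above; the listener model, step sizes, and stopping rule are illustrative assumptions rather than a clinical protocol.

```python
# Simplified simulation of the descending search described above: start 20 dB
# above the 1000 Hz threshold, drop 5 dB after each correct repetition, and
# present several words per level once the patient starts missing them.
# The listener model and stopping rule are illustrative assumptions only.
import random

TRUE_SRT_DB = 30  # hypothetical listener whose true SRT is 30 dB HL

def repeats_correctly(level_db):
    """Crude listener: mostly correct above the true SRT, mostly wrong below it."""
    return random.random() < (0.95 if level_db >= TRUE_SRT_DB else 0.10)

def descending_srt(threshold_1khz_db, words_per_level=4, floor_db=-10):
    level = threshold_1khz_db + 20
    while level > floor_db:
        if repeats_correctly(level):
            level -= 5                       # one correct spondee -> step down 5 dB
            continue
        # The patient is struggling: present several words at this level.
        correct = sum(repeats_correctly(level) for _ in range(words_per_level))
        if 2 * correct >= words_per_level:   # still at least 50% correct
            level -= 5
        else:
            return level + 5                 # last level where >=50% was still met
    return floor_db

random.seed(1)
print(descending_srt(threshold_1khz_db=25), "dB HL")
```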

Speech Reception Threshold (SRT)

SRT, or speech reception threshold, is a fast way to help verify that the pure tone thresholds are valid. Common compound words, or spondee words, are presented at varying degrees of loudness until it is too soft for the patient to hear. SRT scores are compared to the pure tone average as part of the cross-check principle. When these two values agree, the reliability of testing is improved.

Word Recognition

Instruct the patient that he or she is to repeat the words presented. Using either live voice or recorded speech, present the standardized PB word list of your choice. Present the words at a level comfortable to the patient; at least 30 dB and generally 35 to 50 dB above the 1000 Hz pure tone threshold. Using the scorer buttons on the front panel, press the “Correct” button each time the right response is given and the “Incorrect” button each time a wrong response is given.


The Discrimination Score is the percentage of words repeated correctly: Discrimination % at HL = 100 × (Number of Correct Responses / Number of Trials).

WRS = Word Recognition Score, SRS = Speech Reception Score, Speech Discrimination Score. These terms are interchangeable and describe the patient's capability to correctly repeat a list of phonetically balanced (PB) words at a comfortable level. The score is a percentage of correct responses and indicates the patient's ability to understand speech.

Word Recognition Score (WRS)

WRS, or word recognition score, is a type of speech audiometry that is designed to measure speech understanding. Sometimes it is called word discrimination. The words used are common and phonetically balanced and typically presented at a level that is comfortable for the patient. The results of WRS can be used to help set realistic expectations and formulate a treatment plan.

Speech In Noise Test

Speech in noise testing is a critical component to a comprehensive hearing evaluation. When you test a patient's ability to understand speech in a "real world setting" like background noise, the results influence the diagnosis, the recommendations, and the patient's understanding of their own hearing loss.

Auditory Processing

Sometimes, a patient's brain has trouble making sense of auditory information. This is called an auditory processing disorder. It's not always clear that this lack of understanding is a hearing issue, so it requires a very specialized battery of speech tests to identify what kind of processing disorder exists and develop recommendations to improve the listening and understanding for the patient.

QuickSIN is a quick sentence-in-noise test that quantifies how a patient hears in noise. The patient repeats sentences that are embedded in different levels of restaurant noise, and the result is an SNR loss, or signal-to-noise ratio loss. Taking a few additional minutes to measure the SNR loss of every patient seen in your clinic provides valuable insights on the overall status of the patient's auditory system and allows you to counsel more effectively about communication in real-world situations. Using the QuickSIN to make important decisions about hearing loss treatment and rehabilitation is a key differentiator for clinicians who strive to provide patient-centered care.


BKB-SIN is a sentence-in-noise test that quantifies how patients hear in noise. The patient repeats sentences that are embedded in different levels of restaurant noise, and the result is an SNR loss, or signal-to-noise ratio loss. This test is designed to evaluate patients of many ages and has normative corrections for children and adults. Taking a few additional minutes to measure the SNR loss of every patient seen in your clinic is a key differentiator for clinicians who strive to provide patient-centered care.



Research Article

Speech Recognition in Natural Background Noise


Affiliation Linguistics Division, Museu Paraense Emilio Goeldi (MPEG), Belém, Brazil

Affiliation Universidade Federal do Para (UFPA), Belém, Brazil

Affiliation Laboratoire sur le Langage, le Cerveau et la Cognition (L2C2), CNRS, UMR5304, Institut des Sciences Cognitives, Lyon, France

Julien Meyer, Laure Dentel, Fanny Meunier


  • Published: November 19, 2013
  • https://doi.org/10.1371/journal.pone.0079279

7 Jan 2014: Meyer J, Dentel L, Meunier F (2014) Correction: Speech Recognition in Natural Background Noise. PLOS ONE 9(1): 10.1371/annotation/012d9419-8135-40ab-8c81-ce46e8e708d0. https://doi.org/10.1371/annotation/012d9419-8135-40ab-8c81-ce46e8e708d0 View correction


In the real world, human speech recognition nearly always involves listening in background noise. The impact of such noise on speech signals and on intelligibility performance increases with the separation of the listener from the speaker. The present behavioral experiment provides an overview of the effects of such acoustic disturbances on speech perception in conditions approaching ecologically valid contexts. We analysed the intelligibility loss in spoken word lists with increasing listener-to-speaker distance in a typical low-level natural background noise. The noise was combined with the simple spherical amplitude attenuation due to distance, basically changing the signal-to-noise ratio (SNR). Therefore, our study draws attention to some of the most basic environmental constraints that have pervaded spoken communication throughout human history. We evaluated the ability of native French participants to recognize French monosyllabic words (spoken at 65.3 dB(A), reference at 1 meter) at distances between 11 and 33 meters, which corresponded to the SNRs most revealing of the progressive effect of the selected natural noise (−8.8 dB to −18.4 dB). Our results showed that in such conditions, the identity of vowels is mostly preserved, with the striking peculiarity of the absence of confusion in vowels. The results also confirmed the functional role of consonants during lexical identification. The extensive analysis of recognition scores, confusion patterns and associated acoustic cues revealed that sonorant, sibilant and burst properties were the most important parameters influencing phoneme recognition. Altogether these analyses allowed us to extract a resistance scale from consonant recognition scores. We also identified specific perceptual consonant confusion groups depending on the place in the words (onset vs. coda). Finally, our data suggested that listeners may access some acoustic cues of the CV transition, opening interesting perspectives for future studies.

Citation: Meyer J, Dentel L, Meunier F (2013) Speech Recognition in Natural Background Noise. PLoS ONE 8(11): e79279. https://doi.org/10.1371/journal.pone.0079279

Editor: Manuel S. Malmierca, University of Salamanca- Institute for Neuroscience of Castille and Leon and Medical School, Spain

Received: March 25, 2012; Accepted: September 29, 2013; Published: November 19, 2013

Copyright: © 2013 Meyer et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This research was financially supported by the Fyssen Foundation (Post Doc Grant), Conselho Nacional de Desenvolvimento Científico e Tecnológico (Cnpq) (Pos Doutorado Junior Grant n°159747/2010-4) and the European Research Council (starting-grant SpiN Project, ERC n°209234). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Speech-in-noise research has revealed that speech signals incorporate several acoustic properties that contribute to compensating for signal distortions and noisy interferences. For example, they include enhanced spectral peaks for vowels, rapid spectral changes for consonants, amplitude modulation patterns to highlight informative portions such as stress or vowel-consonant alternations, or periodicity of the waveform perceived from any harmonic of the signal (see Assman and Summerfield [1] for a review). In parallel, our cognitive system was found to be adapted to overcome speech degradations and is able, to some extent, to overcome distortions and fragmentations of the signal. This means that our perceptual and cognitive systems perform highly sophisticated mechanisms of informational shielding [2] , [3] . As a result, speech recognition remains possible even after large amounts of the signal have been removed, such as via gating in the time domain (e.g., [4] ), drastically filtering the frequency domain (e.g., [5] , [6] ), or significantly altering the spectro-temporal coherence [7] . Even when the spectral details and periodicity of voiced speech are eliminated, intelligibility remains high if the temporal modulation structure is preserved in a small number of frequency bands [8] .

The emergence and the evolution of these adaptive listening abilities in humans have developed in rural environments which are the dominant setting for the vast majority of human evolution. Hence, the acoustic constraints present in these environments are interesting to take into account in order to study human speech recognition in conditions approaching ecologically valid contexts. So far, there is no systematic study dealing with the impact of natural acoustic backgrounds on spoken recognition. One difficulty explaining this situation is that rural background noise is known to be rather variable even when it does not include mechanical sources of noise. It depends on the geographical situation, the terrain, the vegetation, meteorological circumstances, but also bio-noises such as animal calls and hydro-noise such as rivers or sea rumble. However, natural background noises have common underlying basic properties that are different from the ones used in most speech-in-noise experiments (such as periodic tones, random noises, artificial broadband noises, artificial continuous and fluctuating noises, or speech-shaped noises [1] ). One reasonable solution would therefore be to focus primarily on the most regular and frequent acoustic constraints encountered outdoors. Such basic constraints are characterized by a non uniform distribution of noisy frequencies. A first important characteristic is that it emphasizes low frequency content and therefore resembles the frequency distribution of pink noise at this level. A second important aspect to note is that the power levels decrease more rapidly than the ones of pink noise as a function of increasing frequencies. Moreover, at higher frequencies the power level distribution rather resembles speech-shaped noises (see Figure 1 and ‘ materials and methods ’).

Figure 1: Power spectral density (PSD) as a function of frequency. https://doi.org/10.1371/journal.pone.0079279.g001

Until now, the studies concerning the influence of outdoor natural environments on speech have focused on three main domains: first, on the human ability to tacitly adjust vocal output to compensate for intensity losses due to sound propagation over distance, known as the Lombard effect [9] , [10] ; secondly, on the human ability to estimate the distance of the speaker [11] , [12] ; and finally, on the natural adaptation of the speech signal into shouted speech forms or into other alternative acoustic media, such as whistled forms of languages that enable dialogue over long distances [13] . Interestingly, distance has been an essential ecological parameter implied in all these studies. Namely, this parameter is suitable to reveal the impact of ambient noise which progressively merges with speech during the spoken signal transmission. Listening to distant speech is a rather common task in daily life, both in urban or rural contexts. However, the great majority of speech-in-noise recognition studies available in the literature thus far have concerned close listening conditions. There are few papers testing speech recognition with distance and they are mostly targeted at testing indoor environments such as classrooms [14] , halls [15] , or even tunnels [16] . Speech recognition with distance has recently sparked new scientific interest in the fast developing domain of indoor human-machine environments, where a whole set of sound capture techniques and algorithms of signal treatment have been developed for automatic speech recognition [17] . In such cases, the speech signal is not only affected by the ambient noise but is also degraded during its in-air transmission between the speaker and the listener.

In the present study, we took a complementary approach to former experiments, by evaluating the ability of normal-hearing individuals to recognize words and their constituent phonemes at variable distances in a very basic model of an outdoor environment. For a first study of this type we decided to test relatively stationary and long-term natural acoustic effects of such environments and, as noted above, we decided to explore their most frequent and regular acoustic constraints. The speech signal was masked by a natural background noise recorded in a flat open field characterized by very low - assumed to be negligible - reverberation indices. Therefore, distance was simulated in the simplest way by amplitude attenuation only. This method had the advantage of corresponding to variations in signal-to-noise ratio as in most speech-in-noise studies existing in the literature. For each participant several listening distances were tested, as lists of target isolated French words were played between the virtual distances of 11 m and 33 m from the participants. Word recognition performance was the first measured parameter. An intelligibility function was derived from these results. Next, we analysed recognition performance on vowels, consonants and syllable structure, underlining the differences between different classes of phonemes and ranking them as a function of recognition performance. Our results extend some findings of other speech perception studies to the specific conditions of the present study. This was the case for the central role played by vowels in word detection as well as for the strong relationship between consonant recognition and the identification of the lexical meaning of isolated words. We also found peculiarities in natural background noise, particularly concerning confusion patterns of phonemes. We explained them by combining perceptual and acoustic analyses as a function of parameters such as distance ( = SNR levels), position of the consonant in the word, or phonetic features such as place, manner or voicing.

Materials and Methods

Participants.

The 36 participants were 18 to 30 year-old French native speakers. Their normal hearing thresholds were tested by audiogram. The present study was conducted in accordance with the Declaration of Helsinki. It was approved by the ethics committee of the SPIN research group (CNRS) and each participant gave written consent.

In total, 19 lists were recorded in a sound-proof box by a masculine speaker trained for this task in the DDL-CNRS laboratory (mean level of words was 65.3 dB(A) at one meter from the speaker, with a standard deviation of 3.3 dB). Each list contained 17 French isolated words. The French language is characterized by a certain balance between vowels and consonants that contributes to avoiding drastic numerical asymmetry that might favor computations of consonants over vowels [18] . The selected words were nouns regularly used in current French vocabulary. They were mostly monosyllabic words, and a few - less than 5% - were words of CVV and VVC syllabic structure. For all lists, all participants and all simulated distances, the distribution of the played word structures was as follows: 82.1% for CVC, 12.7% for CCV, 4.1% for CVV, 0.8% for VVC, and 0.3% for VCC.

Moreover, all the lists were balanced in terms of:

  • Frequency of word occurrence in the French language: the average word frequency per list was between 3.79 and 3.91 according to the evaluation method of New et al. [19] .
  • Number of phonological neighbors for each word. This number was on average between 19.59 and 20.1 for each list.
  • Number of phonemes per word. The average for each list was between 4.5 and 4.6 letters.
  • Duration of pronunciation of each word. The average duration of the words in each list was between 547 and 553 ms.
  • Alternation between vowels and consonants. Each list contained on average the same number of possible CVC alternations.
  • Gender of the nouns. There was approximately the same number of masculine and feminine nouns in each list.

Each list was organized on a single audio track where each word was separated from the following by 3 seconds of silence. All these tracks were calibrated with the same root mean square energy level. From these original audio tracks we built new audio files by applying the masking effect of the background noise and the amplitude attenuation simulating distance.

Design and Procedure

Background noise.

The natural background noise interfering with the speech signal was recorded in a flat open field (near Vilanova i la Geltru, Spain). The recording precautions enabled us to capture a relatively stationary background noise (standard deviation of 1.2 dB) in low level conditions (mean value of 41.6 dB(A), measured with a sound level meter BK 2240). This ambient noise was chosen because it was representative of diurnal background noises typically found in rural isolated geographic areas, with quiet weather and no noisy animal near the recorder. Acoustically, such noises are characterized by high energy levels at lower frequencies of the voice spectrum (below 300 Hz) and a strong decline towards higher frequencies (see Figure 1 ). Their frequency-dependent distribution of acoustic energy levels is explained by the fact that absorption increases with frequency in natural environments due to the terrain, the vegetation, and the micro-climates that noisy signals traverse [20] .

Distance simulation.

In a natural environment, the impact of background noise on speech recognition is revealed by distance. As mentioned in the introduction, we chose the simplest method to simulate distance: amplitude attenuation, applying the inverse square law for outdoor spherical propagation. Word lists were presented at different levels corresponding to the attenuation simulated for each distance, and they were masked by the selected background noise ( Table 1 ) in accordance with the reference levels measured during the recording sessions of the speaker and of the noise. Therefore the variable distances resulted in variations of the SNR according to distance. It was a relatively realistic option as background noise recordings were made in quasi-stationary meteorological conditions (wind speed <1 m/s throughout the session, degree of humidity between 57% and 65%, temperature between 26°C and 28°C, measured on a portable meteorological station Geos Skywatch), one meter above the ground, in an open field consisting of a plain dirt track, which guarantees low reverberation indices.
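As a rough illustration of this kind of distance simulation (not the authors' actual processing chain), the sketch below applies the inverse square law attenuation to a speech waveform and then adds the fixed-level background noise.

```python
# Minimal sketch of this kind of distance simulation: pure spherical spreading
# (a 1/r amplitude attenuation relative to the 1 m reference) applied to the
# clean speech, with the fixed-level background noise added afterwards.
# 'speech' and 'noise' are assumed to be NumPy arrays on a common scale; this
# is an illustration of the principle, not the authors' processing chain.
import numpy as np

def attenuation_db(distance_m, reference_m=1.0):
    """Level drop from spherical spreading: 20*log10(d/d_ref), ~6 dB per doubling."""
    return 20.0 * np.log10(distance_m / reference_m)

def simulate_distance(speech, noise, distance_m):
    """Attenuate the speech for the given distance and mix in the background noise."""
    gain = 10.0 ** (-attenuation_db(distance_m) / 20.0)
    return gain * speech + noise

print(round(attenuation_db(11), 1), round(attenuation_db(33), 1))  # ~20.8 and ~30.4 dB
```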

Table 1: https://doi.org/10.1371/journal.pone.0079279.t001

Signal-to-Noise Ratio.

The SNR levels were estimated by calculating the sound power levels of all lists played at each distance (we concatenated words without silent pauses between them and applied the Welch's method [21] , [22] ) and by subtracting from these values the sound power levels of the long-term frequency spectrum of the selected noise (cf. Table 1 ). The calibration of the listening equipment in the laboratory was made with a 2 kHz reference sinewave recorded on the field in the natural background noise and for which we measured sound power level in dB(A) at 1 meter from the source. We also checked the values of two reference words recorded and measured on the field in the same conditions.
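For illustration, the sketch below estimates a broadband SNR from Welch power spectral densities in the spirit of this description; the signals and sampling rate are placeholders rather than the study's recordings.

```python
# Sketch of a broadband SNR estimate from Welch power spectral densities, in the
# spirit of the calculation described above. The 'speech' and 'noise' arrays and
# the sampling rate are placeholders, not the study's recordings.
import numpy as np
from scipy.signal import welch

def broadband_level_db(signal, fs):
    """Total power obtained by integrating the Welch PSD, in dB (arbitrary ref)."""
    freqs, psd = welch(signal, fs=fs, nperseg=4096)
    power = np.sum(psd) * (freqs[1] - freqs[0])    # rectangle-rule integration
    return 10.0 * np.log10(power)

def snr_db(speech, noise, fs):
    return broadband_level_db(speech, fs) - broadband_level_db(noise, fs)

fs = 44100
rng = np.random.default_rng(0)
speech = 0.1 * rng.standard_normal(2 * fs)   # placeholder "speech"
noise = 0.3 * rng.standard_normal(2 * fs)    # placeholder background noise
print(round(snr_db(speech, noise, fs), 1))   # about -9.5 dB for these placeholders
```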

Procedure of the experiments.

Each participant sat in front of a computer in the experimental studio of the Institut des Sciences de l'Homme of Lyon (CNRS, University of Lyon) and was asked to perform the test, which was played on a specialized software interface and delivered diotically via headphones (Beyerdynamic DT 48, 200 X, with free field equalizer, according to [23]). All computers were identical with identical sound cards and had been calibrated for the experiment according to the reference measures of the original recordings. The participants had the simple task of listening to each stimulus and trying to recognize the isolated target word, in an open response format. They were asked to type the perceived sounds, even if they did not correspond to a French word, into the software interface through the computer keyboard and then validate their answer in order to move on to the following word (the experiment can be tested through the interface given in Supporting Information, see Protocol S1). The participants did not receive any feedback on their performance before the end of the test. After a training phase of 5 words to ensure that they had understood the task, the test phase began with a list of 17 words. For each participant, a different list was presented at each of the 12 distances tested. We either increased the distance progressively from 11 meters to 33 meters (with a two-meter step), or decreased it progressively from 33 meters to 11 meters. The reason why we chose these two presentation options instead of randomizing distance was that we wanted to verify whether there was a differential impact on performance between a progressive distance increase and a progressive distance decrease. However, such a distinction was not found (F(1, 11) = .94; n.s.).

Results

First, we will present the general results of word intelligibility. Then, we will provide a detailed analysis of recognition performance for various phonological properties, such as syllabic structure, phoneme type and consonant position in the word. This large set of data will enable us to detect some specificities of the effect of natural background noise on speech recognition. Most of our results are based on recognition percentage scores. In order to stabilize variance and normalize proportional data, an arcsine transformation [24] was applied to such scores before running inferential analyses.

General word recognition performance

The mean word recognition performance was 54.6% of correct answers for all the participants at all distances. The performances showed a general decrease in the average proportion of correct answers, from 77.8% at 11 meters to 35.9% at 33 meters, with consistent inter-individual variability at each distance ( Figure 2 ). There was a strong quasi-linear correlation between distance and the intelligibility loss (R 2  = .95). The SNR categories associated with each distance enabled us to derive the values of the speech reception threshold (SRT) and the corresponding slope of the empirical intelligibility function (across lists and participants). The SRT value was found at −15.2 dB of SNR, and the slope at 3.2%/dB (values calculated on more than 600 words in each category of SNR levels that straddled the 50% correct answer point). The intelligibility values also showed that the experiment mostly concerned the central quasi-linear part of the psychometric function, which corresponded to the SNRs most revealing of the progressive effect of the selected noise.
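As an illustration of how such an SRT and slope can be obtained, the sketch below fits a logistic psychometric function to word scores at several SNRs; the data points and the logistic form are assumptions for demonstration, not the study's actual fitting procedure.

```python
# Sketch of deriving an SRT (50% point) and slope by fitting a logistic
# psychometric function to word scores per SNR. The data points and the choice
# of a logistic fit are illustrative assumptions, not the study's procedure.
import numpy as np
from scipy.optimize import curve_fit

def logistic(snr_db, srt_db, slope_at_srt):
    """Proportion correct; slope_at_srt is the slope (proportion/dB) at the 50% point."""
    return 1.0 / (1.0 + np.exp(-4.0 * slope_at_srt * (snr_db - srt_db)))

snr = np.array([-18.4, -16.8, -15.2, -13.6, -12.0, -10.4, -8.8])
prop_correct = np.array([0.37, 0.44, 0.50, 0.56, 0.62, 0.70, 0.77])   # illustrative

(srt, slope), _ = curve_fit(logistic, snr, prop_correct, p0=[-15.0, 0.03])
print(round(srt, 1), "dB SNR;", round(100 * slope, 1), "%/dB at the SRT")
```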

Figure 2: https://doi.org/10.1371/journal.pone.0079279.g002

Word structure recognition, phoneme insertion or deletion

The mean recognition performance for the syllabic word structures of the corpus was 76.5% of correct answers, over all distances. The two most frequent structures were CVC, which was recognized at 80.2%, and CCV, which was recognized at 55.7%. The progressive degradation in recognition performance with regards to distance confirms CVC as a very resistant syllabic structure ( Figure 3 ). In general, structural errors were of two main types: phoneme deletion(s) or phoneme insertion(s). Deletions and insertions could occur at the same time. Deletions were much more frequent than insertions. Some deletions were due to an absence of response for the entire word (20.6% for all structures, 23.6% for CVC words, and 12.9% for CCV words). When looking only at the errors which were not due to an absence of response for the entire word, deletions remained prevalent. Indeed, in this case, 73.6% of all non recognized structures were errors involving deletions (73.3% for CVC and 78.2% for CCV), whereas the errors with insertions reached 36.8% (34.7% for CVC and 37.9% for CCV). The proportion of words involving deletions increased monotonically with distance from 11 m (8.7%) to 33 m (27.5%), except between 19 and 21 meters where a greater step occurred. Moreover, the proportion of words involving insertions increased quasi-linearly from 4% at 11 m to 12.9% at 33 m. Both deletions and insertions were more frequent with consonants than with vowels: consonants were involved in 85.5% of the insertions and in 98.7% of the deletions (involving respectively 7.4% and 17.3% of all words). Another aspect we checked was the impact of the position of the consonant in the word. The most interesting case was for deletions in CVC words, which occurred a little more on onsets than on codas (9% vs. 7% of all words). In fact, codas were more accurately perceived at any distance (see the lower section of Figure 3 ).

Figure 3: CVC were much more frequent and therefore the evolution with distance is more regular than for CCV data. https://doi.org/10.1371/journal.pone.0079279.g003

Vowel recognition and confusion

Figure 4: Data is presented either as a function of individual phonemes (up) or as a function of distance (bottom). https://doi.org/10.1371/journal.pone.0079279.g004

Figure 5: The correct answers are presented as a function of distance. https://doi.org/10.1371/journal.pone.0079279.g005

Consonant recognition

General results.

Consonant recognition as a function of distance and SNR

Figure 6: https://doi.org/10.1371/journal.pone.0079279.g006

Consonant confusion

With the intention of clarifying consonant confusion patterns, we analysed them in detail by taking into account three principal factors: consonant position, phonetic features they carry and distance corresponding to SNR variations.

Confusion matrix.

Figure 7: https://doi.org/10.1371/journal.pone.0079279.g007

Influence of consonant position

A more detailed observation of the instances of confusion revealed that some consonants were more likely to be confused when they were at the beginning of the word (such as [t, k, f, s, z]), whereas others were more confused when they were at the end of the word (such as [b, d, g, v]) ( Figure 8 ). One of the consonants, the approximant [j] appeared only at the end of words and was therefore excluded from some of our further analyzes separating onsets and codas. Moreover, the position of the consonant in the word often influenced the nature of the confusion: [t] was mistaken for a [d] only in onsets, whereas [d] was mistaken for a [t] only in codas. A similar phenomenon occurred in the [k]-[g] pair where the unvoiced [k] was mistaken for the voiced [g] only in onsets and [g] was mistaken for a [k] mostly in codas. This unvoiced-voiced relationship was not systematic as the [p]-[b] confusions almost always occurred in onsets (92.6%), showing that the labial locus of articulation is less subject to asymmetry in recognition. In fact, of the three voiced stops, [b] is known to be the most compatible with sustained voicing [27] , [28] .

Figure 8: Displayed values show the % of played consonant for each type. https://doi.org/10.1371/journal.pone.0079279.g008

Figure 9: https://doi.org/10.1371/journal.pone.0079279.g009

Manner, place and voicing

Guided by the preceding results, we computed the information associated with some of the constituent phonetic features of consonants, using information transmission as the recognition metric, as defined by Miller and Nicely [30] . In order to compute the amount of information transmitted, the sixteen consonants of the recognition set were partitioned into three (overlapping) groups on the basis of voicing, articulatory manner and place of articulation (as illustrated in Table 2 ). We derived confusion matrices for each phonetic-feature dimension from the original confusion matrix by summing the results for each feature group. In essence, each phonetic-feature dimension was treated as an independent information channel.
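A sketch of this type of computation for a single feature dimension is shown below; the four-consonant confusion matrix and the voicing grouping are made-up examples, and the function returns the transmitted information in bits (which can also be normalized by the stimulus entropy to give a proportion).

```python
# Sketch of a Miller & Nicely style transmitted-information computation for one
# phonetic feature: collapse a consonant confusion matrix into feature groups
# (here, voiced vs. unvoiced) and compute the mutual information of the
# resulting stimulus/response table. The confusion counts are made-up examples.
import numpy as np

def transmitted_information(confusions):
    """Mutual information (bits) of a stimulus (rows) x response (cols) count matrix."""
    joint = confusions / confusions.sum()
    p_stim = joint.sum(axis=1, keepdims=True)
    p_resp = joint.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = joint * np.log2(joint / (p_stim * p_resp))
    return np.nansum(terms)

def collapse(confusions, groups):
    """Sum a phoneme confusion matrix into feature-group cells (list of index lists)."""
    k = len(groups)
    out = np.zeros((k, k))
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            out[i, j] = confusions[np.ix_(gi, gj)].sum()
    return out

# Hypothetical 4-consonant confusion matrix, order [p, t, b, d]; rows = presented.
conf = np.array([[40,  8,  5,  2],
                 [10, 38,  3,  4],
                 [ 6,  2, 42,  9],
                 [ 3,  5, 11, 36]])
voicing_groups = [[0, 1], [2, 3]]        # unvoiced {p, t} vs. voiced {b, d}
print(round(transmitted_information(collapse(conf, voicing_groups)), 3), "bits")
```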

Table 2: https://doi.org/10.1371/journal.pone.0079279.t002

For all distances pooled together, voicing was the best recognized with 80.6% of correct answers, followed by place (78.1%) and manner (77.4%). We also measured the recognition performance for the feature categories and subcategories as a function of distance. First, we found that manner, place and voicing were recognized at equivalent levels until 21 meters (−14.4 dB of SNR). Above 21 meters, voicing was better recognized than place and manner (manner being slightly less well recognized, see Figure 10 ). Additional measures showed that this was the case both in Ci and Cf, with the strongest effect in Cf.

Figure 10: https://doi.org/10.1371/journal.pone.0079279.g010

Figure 11: https://doi.org/10.1371/journal.pone.0079279.g011

Phoneme Power Spectral Density (PSD) analysis

Figure 12: https://doi.org/10.1371/journal.pone.0079279.g012

Figure 13: https://doi.org/10.1371/journal.pone.0079279.g013

In general, at every distance, the phonemes that had high recognition scores were the ones showing large differences from the noise and therefore high amplitude levels of some of their characteristic acoustic cues. For plosives, we noted that the stronger the differences between the burst and the noise, the higher the recognition scores (for example, see the onset-coda differences of [g] and [k] in Figure 13). For fricatives, such a difference occurs for example between [s] and [z] (see Figure 14). Moreover, for stop consonants, when the bursts became unreliable because they were covered by noise, performances dropped around or below 50% (this was the case for [t] onsets at 27 to 33 m, whereas recognition scores for [t] codas remained above 65% because their bursts did not merge with the noise until 33 m). Other parameters that we didn't check systematically here may have also influenced the results for fricatives and plosives, such as the range of frequencies above the noise and the rapidity of the CV spectral changes.

Figure 14: https://doi.org/10.1371/journal.pone.0079279.g014

Discussion

This study focused on the impact of slight natural background noise on speech recognition. We examined how such a noise interfered with word, phoneme and feature recognition as distance increased and we undertook an analysis of the cues that were used by the listeners for vowel and consonant recognition. Our simulation of distance was simple and resulted in changing the SNR ( Table 1 ). The specificity of our study stems mostly from the non uniform frequency spectrum characterizing natural background noise and from the multiple ways in which we analysed its confrontation with the also non uniform distribution of energy in spoken voice frequencies.

General word recognition

The intelligibility function classified the words as a function of distance and equivalent SNRs ( Figure 2 , Table 1 ). The SRT value derived from word recognition performance was quite low (−15.7 dB) because the noise didn't uniformly mask the speech signal. The slope corresponded to the average of other intelligibility tests with a paradigm based on isolated words (3.2%/dB) [36] , [37] . The word recognition scores were higher than those found in the literature for non uniform pink noise at equivalent SNRs [38] . Intelligibility tests on sentences usually show steeper slopes (between 15 and 25%/dB according to [39] ) indicating that the psychometric function is more precise. However, here, the advantage of choosing a paradigm with words in isolation was that it allowed for the assessment of confusion matrices and transmission of information: two aspects that we wanted to quantify, particularly for vowels and consonants.

Syllabic structure effect

The recognition scores analysis on syllabic structure confirms CVC as a very resistant syllabic structure and revealed that errors on CCV and CVC were mostly due to deletions. Also, the perception of the final segments of words was improved and CV onsets were less well recognized than VC codas ( Figure 3 , bottom part): for example, more onset deletions occurred for 9 of the 16 consonants ( Figure 8 ). This result is due to the lexical status effect that is characteristic of the choice of word stimuli [40] , [41] , whereas the tests based on artificial nonsense words (logatomes or pseudowords) systematically have better recognition performance on CV onsets (e.g., [40] , [42] ) and show better resistance of onsets to deletion.

Differences between vowels and consonants


Acoustic cues and perception

The phoneme recognition and error data of this experiment was deployed as a function of different parameters such as distance, phonetic features or the position of the consonant in the word. Such data showed that formants, as well as sibilant and burst properties of phonemes were the principal parameters influencing performance of phoneme identification in natural background noise. The special roles of formant patterns and frication were first suggested by the consonant rankings (1) and (2) established on the basis of recognition performance. They were confirmed by the PSD analysis, which also highlighted the important role of bursts in stop consonants.

There are, in the literature, phoneme scales based on purely phonetic data that are in accordance with the perceptual ranking we found: for example scales ranking phonemes as a function of sonority (Clements [26]) but also scales ranking phonemes as a function of phonetic power [49]. Indeed, the presentation made in (2) strongly suggests that a scale based on perceived resonance (typically Semivowels > Liquids > Nasals > Obstruents, [26]) also influences the recognition scores in our study. Clements had explained that vowels stand at the top of such a sonority scale as they are characterized by a powerful, well-defined formant pattern and that sonorant consonants (that is, semivowels, liquids and nasals) are next on this scale as they are also characterized by a formant pattern but with a decreasing degree of definition. Moreover, the composite measure of the relative phonetic power of English sounds produced by Fletcher ([49], pp 82–86) fits with a sonority-based ranking, but also fits with the high rank of voiceless sibilants and with the low rank of low power fricatives found in (2).

Figure 15: https://doi.org/10.1371/journal.pone.0079279.g015

Consonant confusion groups

The clearest perceptual consonant confusion groups found in our study were ({b-v, f-p, f-b, m-b} {s-z} {p-k-t, k-g} {m-n}) in onsets and ({p-b, p-t-k, k-g} {m-n}) in codas. This was partly in accordance with the groups found by Benki [42] (which provided an onset/coda distinction for pseudowords in speech-dependent and flat-spectrum noise) or by Phatak and Allen [50] (which dealt with words in speech-shaped noise, without the onset/coda distinction that we made here). Similarities with Phatak and Allen [50] concerned mostly ({b-v},{s-z},{m-n}), whereas most of the preponderant confusions we found matched with those of Benki [42] except for the low level of confusion found by Benki between voiced and unvoiced plosives in codas. The perceptual consonant confusion groups that we found are characterized by an important occurrence of inter-labial confusions in onsets and of inter-stop confusions both in codas and onsets (these are for example clearly visible on the perceptual maps of Figure 9). The rapid merging with distance of the 2nd formants of voiced consonants [b, g, m] partly explains these confusions. Moreover, for plosives, only bursts stood out of the noise past 19–21 meters. In these conditions, the close frequency values of [k, g, t] bursts must have contributed to their confusion.

Conclusion

This paper shows how environmental noise and distance (from 11 to 33 meters) interfere with spoken word recognition. Our results showed that in such conditions the identity of vowels is mostly preserved, and they confirmed the special functional role of consonants during lexical identification. While vowels played a central role in the word-detection step that precedes the word recognition step in adverse listening conditions, consonants seem mainly used to identify the lexical item. Also, we confirmed that CVC is a particularly resistant syllabic structure; we observed an asymmetry between the ending and the beginning of words, with CV onsets being less resistant than VC codas. We also identified different perceptual consonant confusion groups depending on their place in the words: ({b-v, f-p, f-b, m-b} {s-z} {p-k-t, k-g} {m-n}) in onsets and ({p-b, p-t-k, k-g} {m-n}) in codas. Moreover, our data allowed us to propose a resistance scale where high energy sibilants are the most resistant, followed by semivowels and obstruents, and where low energy fricatives are the least resistant. These results clearly underlined the particular role of coherent strong energy structures like formants and sibilants in noisy situations. Complementary acoustic measures showed the important role of bursts in plosives. Finally, besides the acoustic cues of consonants and vowels that were identified as important for speech recognition, the study highlighted that listeners may have access to some acoustic cues of the CV transition that we didn't specifically track and that should be more completely investigated in the future in a complementary study using the same kind of noise.

Supporting Information

Protocol S1.

General interface of the experiment. The experimenter made the calibration with the calibration button on the left and then chose one of the two upper options on the right (increasing or decreasing distances). Next, each participant had the simple task of listening to each stimulus and trying to recognize the isolated target word (including in a preliminary training phase).

https://doi.org/10.1371/journal.pone.0079279.s001

Acknowledgments

The authors would like to thank C. Grataloup for participating in building the lists of words, V. Monatte and the participants for their collaboration during testing, Michel Hoen and Joshua Birchall for discussions and corrections on the paper, and the team of the Musikit shop in Lyon for their advice in the choice of the experimental equipment. Finally, the members of the Linguistic Division of the Museu Goeldi (MPEG/MCT, Belém, Brazil) are also thanked for their constant support.

Author Contributions

Conceived and designed the experiments: JM LD FM. Performed the experiments: FM JM LD. Analyzed the data: JM LD FM. Contributed reagents/materials/analysis tools: LD JM FM. Wrote the paper: JM LD FM. Designed the software used in analysis: LD JM. Designed the Interface of Test: LD. Designed the lists of words: FM.

References

  • 1. Assmann PF, Summerfield Q (2004) Perception of speech under adverse conditions. In: Greenberg S, et al., eds. Speech Processing in the Auditory System, 1st ed. Springer, pp 231–308.
  • 3. Palmer AR, Shamma SA (2004) Physiological representation of speech. In: Greenberg S, et al., eds. Speech Processing in the Auditory System, 1st ed. Springer, pp 163–230.
  • 17. Woelfel M, McDonough J (2005) Distant Speech Recognition. Wiley.
  • 20. Wiley RH (2009) Signal transmission in natural environments. Encyclopedia of Neuroscience, vol. 8, pp 827–832.
  • 23. Zwicker E, Fastl H (1999) Psychoacoustics: Facts and Models. Berlin: Springer.
  • 25. Calliope (1989) La parole et son traitement automatique. Paris: Masson.
  • 26. Clements N (2005) Does sonority have a phonetic basis? Comments on the chapter by Vaux. In: Raimy E, Cairns C, eds. Contemporary Views on Architecture and Representations in Phonological Theory. Cambridge, MA: MIT Press.
  • 28. Ohala JJ, Riordan CJ (1979) Passive vocal tract enlargement during voiced stops. In: Wolf JJ, Klatt DH, eds. Speech Communication Papers. New York: Acoustical Society of America, pp 89–92.
  • 29. Shepard RN (1972) Psychological representation of speech sounds. In: David EE, Denes PB, eds. Human Communication: A Unified View. New York: McGraw-Hill, pp 67–113.
  • 31. Fujimura O, Erickson D (1997) Acoustic phonetics. In: Hardcastle WJ, Laver J, eds. The Handbook of Phonetic Sciences. Oxford: Blackwell, pp 65–115.
  • 33. Krakow R (1989) The articulatory organization of syllables: a kinematic analysis of labial and velar gestures. PhD dissertation, Yale University, New Haven.
  • 34. Krakow R (1993) Nonsegmental influences on velum movement patterns: syllables, sentences, stress, and speaking rate. In: Huffman, Krakow, eds. Nasals, Nasalization, and the Velum (Phonetics and Phonology V). San Diego: Academic Press.
  • 38. Scharine A, Henry P, Rao M, Dreyer J (2007) A Model for Predicting Intelligibility of Binaurally Perceived Speech. Army Research Laboratory report ARL-TR-4075, Aberdeen Proving Ground.
  • 39. Kollmeier B, Brand T, Meyer B (2008) Perception of speech and sound. In: Benesty J, Sondhi MM, Huang Y, eds. Handbook of Speech Processing, pp 61–82.
  • 44. Varnet L, Meyer J, Hoen M, Meunier F (2012) Phoneme resistance during speech-in-speech comprehension. Proceedings of Interspeech 2012, Portland, USA.
  • 45. Fletcher H (1929) Speech and Hearing. New York: Van Nostrand.
  • 48. Ladefoged P (2006) A Course in Phonetics, 5th ed. Boston: Wadsworth.
  • 49. Fletcher H (1972) Speech and Hearing in Communication. Huntington: Robert E. Krieger.
  • 57. Stevens KN (1997) Articulatory-acoustic-auditory relationships. In: Hardcastle WJ, Laver J, eds. The Handbook of Phonetic Sciences. Oxford: Blackwell, pp 462–506.
  • 58. Kent RD, Read C (1992) The Acoustic Analysis of Speech. San Diego: Singular Publishing.

Acoustical Society of America


The digits-in-noise test: Assessing auditory speech recognition abilities in noise



Cas Smits , S. Theo Goverts , Joost M. Festen; The digits-in-noise test: Assessing auditory speech recognition abilities in noise. J. Acoust. Soc. Am. 1 March 2013; 133 (3): 1693–1706. https://doi.org/10.1121/1.4789933


A speech-in-noise test which uses digit triplets in steady-state speech noise was developed. The test measures primarily the auditory, or bottom-up, speech recognition abilities in noise. Digit triplets were formed by concatenating single digits spoken by a male speaker. Level corrections were made to individual digits to create a set of homogeneous digit triplets with steep speech recognition functions. The test measures the speech reception threshold (SRT) in long-term average speech-spectrum noise via a 1-up, 1-down adaptive procedure with a measurement error of 0.7 dB. One training list is needed for naive listeners. No further learning effects were observed in 24 subsequent SRT measurements. The test was validated by comparing results on the test with results on the standard sentences-in-noise test. To avoid the confounding of hearing loss, age, and linguistic skills, these measurements were performed in normal-hearing subjects with simulated hearing loss. The signals were spectrally smeared and/or low-pass filtered at varying cutoff frequencies. After correction for measurement error the correlation coefficient between SRTs measured with both tests equaled 0.96. Finally, the feasibility of the test was approved in a study where reference SRT values were gathered in a representative set of 1386 listeners over 60 years of age.
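To make the adaptive procedure concrete, a minimal 1-up, 1-down track can be sketched as follows. This is an illustrative Python sketch, not the published test software; the presentation callable, step size, and SRT estimate are assumptions.

```python
import random

def measure_srt(present_triplet, n_trials=24, start_snr_db=0.0, step_db=2.0):
    """Minimal 1-up, 1-down adaptive track converging on the 50%-correct SNR.

    present_triplet(snr_db) is a hypothetical callable that plays one digit
    triplet in steady-state noise at the given SNR and returns True when the
    listener repeats all three digits correctly.
    """
    snr = start_snr_db
    track = []
    for _ in range(n_trials):
        correct = present_triplet(snr)
        track.append(snr)
        snr += -step_db if correct else step_db  # harder after a correct answer
    # Average the SNRs of the later, converged trials as a simple SRT estimate.
    return sum(track[4:]) / len(track[4:])

# Toy usage with a simulated listener whose true SRT is around -8 dB SNR.
simulated = lambda snr: random.random() < 1 / (1 + 10 ** (-(snr + 8) / 2))
print(round(measure_srt(simulated), 1))
```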


Audiology and Neurotology

Predicting Matrix test effectiveness for evaluating auditory performance in noise using pure-tone audiometry and speech recognition in quiet in cochlear implant recipients


Jonathan Flament , Daniele De Seta , Francesca Yoshie Russo , Julie Bestel , Olivier Sterkers , Evelyne Ferrary , Yann Nguyen , Isabelle Mosnier , Renato Torres; Predicting Matrix test effectiveness for evaluating auditory performance in noise using pure-tone audiometry and speech recognition in quiet in cochlear implant recipients. Audiol Neurotol 2024; https://doi.org/10.1159/000535622


Introduction: Auditory performance in noise of cochlear implant recipients can be assessed with the adaptive Matrix test (MT); however, when the speech-to-noise ratio (SNR) exceeds 15 dB, the background noise has hardly any negative impact on speech recognition. Here, we aim to evaluate the predictive power of aided pure-tone audiometry and speech recognition in quiet, and to establish cut-off values for both tests that indicate whether auditory performance in noise can be assessed using the Matrix sentence test in a diffuse noise environment. Methods: We assessed the power of pure-tone audiometry and speech recognition in quiet to predict the response to the MT. Ninety-eight cochlear implant recipients were assessed using different sound processors from Advanced Bionics (n = 56) and Cochlear (n = 42). Auditory tests were performed at least 1 year after cochlear implantation or after upgrading the sound processor, to ensure the best benefit of the implant. Auditory assessment of the implanted ear in free-field conditions included: pure-tone average (PTA), speech discrimination score (SDS) in quiet at 65 dB, and the speech recognition threshold (SRT) in noise, that is, the SNR at which the patient can correctly recognize 50% of the words using the MT in a diffuse sound field. Results: The SRT in noise was determined in sixty patients (61%) and undetermined in 38 (39%) using the MT. When cut-off values of PTA < 36 dB and SDS > 41% were used separately, they were able to predict a positive response to the MT in 83% of recipients; using both cut-off values together, the predictive value reached 92%. Discussion/Conclusion: As pure-tone audiometry is standardized universally and speech recognition in quiet can vary depending on the language used, we propose that the MT should be performed in recipients with PTA < 36 dB; in recipients with PTA > 36 dB, a list of Matrix sentences at a fixed SNR should be presented to determine the percentage of words understood. This approach should enable clinicians to obtain information about auditory performance in noise whenever possible.
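Taken at face value, the proposed cut-off rule can be written in a few lines. The function below simply restates the thresholds quoted in the abstract and is illustrative only, not a clinical tool.

```python
def matrix_test_recommended(pta_db, sds_percent):
    """Return True when the adaptive Matrix test is expected to yield an SRT.

    Cut-off values (PTA < 36 dB, SDS > 41%) are those reported in the
    abstract; combining both reached a predictive value of 92%.
    """
    return pta_db < 36 and sds_percent > 41

# Example: a recipient with an aided PTA of 30 dB and 60% SDS in quiet.
print(matrix_test_recommended(30, 60))  # True -> run the adaptive Matrix test
```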


Prosem Lecture: Speech Recognition in Noisy Environments in Children: Simulating Cochlear Implant Listening and Testing the Importance of the Predictability of Speech

Didulani Dantanarayana, M.Sc. Audiology, PhD Student, Communication Sciences & Disorders, Binaural Hearing and Speech Lab, University of Wisconsin–Madison

62 Goodnight Hall, April 15, 2024, 12:00 pm – 1:00 pm

Children show significant variability in outcome measures, including speech understanding in quiet and in noise. Children with hearing loss show even greater variability; numerous factors can contribute to such variability, including auditory experience prior to the onset of deafness and implantation, and the downstream effects of deafness, including neurocognitive abilities, neural health, and the integrity of the auditory system.

Much of the research to date on speech understanding in children with and without hearing loss has used standardized tests consisting of words and sentences with highly predictable content. However, when navigating realistic listening situations, children are likely to be presented with information that is complex and unpredictable; thus, the sentence materials used here had either low or high predictability. To investigate the extent to which children benefit from spatial separation of target speech from background noise, spatial release from masking (SRM) was measured.

Critically, as we prepare to study these effects in children with cochlear implants (CIs), we first investigated SRM in typically hearing children, who listened to speech that was spectrally degraded via vocoding to approximate aspects of CI processing. Because typically hearing children are not accustomed to listening to vocoded speech, they were first exposed to vocoded sentences and then tested using vocoded low- and high-predictability sentences. This talk will focus on preliminary findings to date and discuss future directions and hypothesized outcomes in children with CIs. In particular, we are investigating associations with cognitive skills (e.g., attention, cognitive flexibility, and the ability to monitor and suppress information) that might aid in selectively attending to and understanding speech in noisy listening environments.

This work was funded by NIH-NIDCD grants to R.Y. Litovsky (R01DC019511, R01DC020355) and in part by a core grant to the Waisman Center from the National Institute of Child Health and Human Development (P50HD105353).

Learn more about Prosem.

SpeechRecognition 3.10.3

pip install SpeechRecognition

Released: Mar 30, 2024

Library for performing speech recognition, with support for several engines and APIs, online and offline.


License: BSD License (BSD)

Author: Anthony Zhang (Uberi)

Tags: speech, recognition, voice, sphinx, google, wit, bing, api, houndify, ibm, snowboy

Requires: Python >=3.8

Maintainers: Anthony Zhang (Uberi)

Classifiers

  • 5 - Production/Stable
  • OSI Approved :: BSD License
  • MacOS :: MacOS X
  • Microsoft :: Windows
  • POSIX :: Linux
  • Python :: 3
  • Python :: 3.8
  • Python :: 3.9
  • Python :: 3.10
  • Python :: 3.11
  • Multimedia :: Sound/Audio :: Speech
  • Software Development :: Libraries :: Python Modules

Project description


UPDATE 2022-02-09 : Hey everyone! This project started as a tech demo, but these days it needs more time than I have to keep up with all the PRs and issues. Therefore, I’d like to put out an open invite for collaborators - just reach out at me @ anthonyz . ca if you’re interested!

Speech recognition engine/API support:

Quickstart: pip install SpeechRecognition . See the “Installing” section for more details.

To quickly try it out, run python -m speech_recognition after installing.


Library Reference

The library reference documents every publicly accessible object in the library. This document is also included under reference/library-reference.rst .

See Notes on using PocketSphinx for information about installing languages, compiling PocketSphinx, and building language packs from online resources. This document is also included under reference/pocketsphinx.rst .

You have to install Vosk models to use Vosk. The models are available for download; you have to place them in the models folder of your project, like “your-project-folder/models/your-vosk-model”.

See the examples/ directory in the repository root for usage examples:

First, make sure you have all the requirements listed in the “Requirements” section.

The easiest way to install this is using pip install SpeechRecognition .

Otherwise, download the source distribution from PyPI , and extract the archive.

In the folder, run python setup.py install .

Requirements

To use all of the functionality of the library, you should have:

The following requirements are optional, but can improve or extend functionality in some situations:

The following sections go over the details of each requirement.

The first software requirement is Python 3.8+ . This is required to use the library.

PyAudio (for microphone users)

PyAudio is required if and only if you want to use microphone input ( Microphone ). PyAudio version 0.2.11+ is required, as earlier versions have known memory management bugs when recording from microphones in certain situations.

If not installed, everything in the library will still work, except attempting to instantiate a Microphone object will raise an AttributeError .

The installation instructions on the PyAudio website are quite good - for convenience, they are summarized below:

PyAudio wheel packages for common 64-bit Python versions on Windows and Linux are included for convenience, under the third-party/ directory in the repository root. To install, simply run pip install wheel followed by pip install ./third-party/WHEEL_FILENAME (replace pip with pip3 if using Python 3) in the repository root directory .

PocketSphinx-Python (for Sphinx users)

PocketSphinx-Python is required if and only if you want to use the Sphinx recognizer ( recognizer_instance.recognize_sphinx ).

PocketSphinx-Python wheel packages for 64-bit Python 3.4 and 3.5 on Windows are included for convenience, under the third-party/ directory. To install, simply run pip install wheel followed by pip install ./third-party/WHEEL_FILENAME (replace pip with pip3 if using Python 3) in the SpeechRecognition folder.

On Linux and other POSIX systems (such as OS X), follow the instructions under “Building PocketSphinx-Python from source” in Notes on using PocketSphinx for installation instructions.

Note that the versions available in most package repositories are outdated and will not work with the bundled language data. Using the bundled wheel packages or building from source is recommended.

Vosk (for Vosk users)

Vosk API is required if and only if you want to use Vosk recognizer ( recognizer_instance.recognize_vosk ).

You can install it with python3 -m pip install vosk .

You also have to install Vosk Models:

The models are available for download; you have to place them in the models folder of your project, like “your-project-folder/models/your-vosk-model”.

Google Cloud Speech Library for Python (for Google Cloud Speech API users)

Google Cloud Speech library for Python is required if and only if you want to use the Google Cloud Speech API ( recognizer_instance.recognize_google_cloud ).

If not installed, everything in the library will still work, except calling recognizer_instance.recognize_google_cloud will raise a RequestError.

According to the official installation instructions , the recommended way to install this is using Pip : execute pip install google-cloud-speech (replace pip with pip3 if using Python 3).

FLAC (for some systems)

A FLAC encoder is required to encode the audio data to send to the API. If using Windows (x86 or x86-64), OS X (Intel Macs only, OS X 10.6 or higher), or Linux (x86 or x86-64), this is already bundled with this library - you do not need to install anything .

Otherwise, ensure that you have the flac command line tool, which is often available through the system package manager. For example, this would usually be sudo apt-get install flac on Debian-derivatives, or brew install flac on OS X with Homebrew.

Whisper (for Whisper users)

Whisper is required if and only if you want to use whisper ( recognizer_instance.recognize_whisper ).

You can install it with python3 -m pip install SpeechRecognition[whisper-local] .

Whisper API (for Whisper API users)

The library openai is required if and only if you want to use Whisper API ( recognizer_instance.recognize_whisper_api ).

If not installed, everything in the library will still work, except calling recognizer_instance.recognize_whisper_api will raise a RequestError.

You can install it with python3 -m pip install SpeechRecognition[whisper-api] .

Troubleshooting

The recognizer tries to recognize speech even when I'm not speaking, or after I'm done speaking.

Try increasing the recognizer_instance.energy_threshold property. This is basically how sensitive the recognizer is to when recognition should start. Higher values mean that it will be less sensitive, which is useful if you are in a loud room.

This value depends entirely on your microphone or audio data. There is no one-size-fits-all value, but good values typically range from 50 to 4000.

Also, check on your microphone volume settings. If it is too sensitive, the microphone may be picking up a lot of ambient noise. If it is too insensitive, the microphone may be rejecting speech as just noise.
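For reference, the property can be set directly on a Recognizer instance; the value 4000 below is just one point in the range quoted above.

```python
import speech_recognition as sr

r = sr.Recognizer()
r.energy_threshold = 4000  # less sensitive, suitable for a loud room
```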

The recognizer can’t recognize speech right after it starts listening for the first time.

The recognizer_instance.energy_threshold property is probably set to a value that is too high to start off with, and then being adjusted lower automatically by dynamic energy threshold adjustment. Before it is at a good level, the energy threshold is so high that speech is just considered ambient noise.

The solution is to decrease this threshold, or call recognizer_instance.adjust_for_ambient_noise beforehand, which will set the threshold to a good value automatically.
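A typical calibration, assuming PyAudio is installed so that Microphone is available, looks like this:

```python
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    r.adjust_for_ambient_noise(source)  # calibrate energy_threshold to the room
    audio = r.listen(source)
```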

The recognizer doesn’t understand my particular language/dialect.

Try setting the recognition language to your language/dialect. To do this, see the documentation for recognizer_instance.recognize_sphinx , recognizer_instance.recognize_google , recognizer_instance.recognize_wit , recognizer_instance.recognize_bing , recognizer_instance.recognize_api , recognizer_instance.recognize_houndify , and recognizer_instance.recognize_ibm .

For example, if your language/dialect is British English, it is better to use "en-GB" as the language rather than "en-US" .
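For example, requesting British English from the Google recognizer might look like the following; the audio file name is a placeholder.

```python
import speech_recognition as sr

r = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:   # "sample.wav" is a placeholder file
    audio = r.record(source)
print(r.recognize_google(audio, language="en-GB"))  # request the British English model
```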

The recognizer hangs on recognizer_instance.listen ; specifically, when it’s calling Microphone.MicrophoneStream.read .

This usually happens when you’re using a Raspberry Pi board, which doesn’t have audio input capabilities by itself. This causes the default microphone used by PyAudio to simply block when we try to read it. If you happen to be using a Raspberry Pi, you’ll need a USB sound card (or USB microphone).

Once you do this, change all instances of Microphone() to Microphone(device_index=MICROPHONE_INDEX) , where MICROPHONE_INDEX is the hardware-specific index of the microphone.

To figure out what the value of MICROPHONE_INDEX should be, run the following code:
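The code block referenced here did not survive extraction; an equivalent snippet using the library's Microphone.list_microphone_names() helper would be:

```python
import speech_recognition as sr

for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(f'Microphone with name "{name}" found for Microphone(device_index={index})')
```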

This will print out something like the following:

Now, to use the Snowball microphone, you would change Microphone() to Microphone(device_index=3) .

Calling Microphone() gives the error IOError: No Default Input Device Available .

As the error says, the program doesn’t know which microphone to use.

To proceed, either use Microphone(device_index=MICROPHONE_INDEX, ...) instead of Microphone(...) , or set a default microphone in your OS. You can obtain possible values of MICROPHONE_INDEX using the code in the troubleshooting entry right above this one.

The program doesn’t run when compiled with PyInstaller .

As of PyInstaller version 3.0, SpeechRecognition is supported out of the box. If you’re getting weird issues when compiling your program using PyInstaller, simply update PyInstaller.

You can easily do this by running pip install --upgrade pyinstaller .

On Ubuntu/Debian, I get annoying output in the terminal saying things like “bt_audio_service_open: […] Connection refused” and various others.

The “bt_audio_service_open” error means that you have a Bluetooth audio device, but as a physical device is not currently connected, we can’t actually use it - if you’re not using a Bluetooth microphone, then this can be safely ignored. If you are, and audio isn’t working, then double check to make sure your microphone is actually connected. There does not seem to be a simple way to disable these messages.

For errors of the form “ALSA lib […] Unknown PCM”, see this StackOverflow answer . Basically, to get rid of an error of the form “Unknown PCM cards.pcm.rear”, simply comment out pcm.rear cards.pcm.rear in /usr/share/alsa/alsa.conf , ~/.asoundrc , and /etc/asound.conf .

For “jack server is not running or cannot be started” or “connect(2) call to /dev/shm/jack-1000/default/jack_0 failed (err=No such file or directory)” or “attempt to connect to server failed”, these are caused by ALSA trying to connect to JACK, and can be safely ignored. I’m not aware of any simple way to turn those messages off at this time, besides entirely disabling printing while starting the microphone .

On OS X, I get a ChildProcessError saying that it couldn’t find the system FLAC converter, even though it’s installed.

Installing FLAC for OS X directly from the source code will not work, since it doesn’t correctly add the executables to the search path.

Installing FLAC using Homebrew ensures that the search path is correctly updated. First, ensure you have Homebrew, then run brew install flac to install the necessary files.

To hack on this library, first make sure you have all the requirements listed in the “Requirements” section.

To install/reinstall the library locally, run python setup.py install in the project root directory .

Before a release, the version number is bumped in README.rst and speech_recognition/__init__.py . Version tags are then created using git config gpg.program gpg2 && git config user.signingkey DB45F6C431DE7C2DCD99FF7904882258A4063489 && git tag -s VERSION_GOES_HERE -m "Version VERSION_GOES_HERE" .

Releases are done by running make-release.sh VERSION_GOES_HERE to build the Python source packages, sign them, and upload them to PyPI.

To run all the tests:

Testing is also done automatically by TravisCI, upon every push. To set up the environment for offline/local Travis-like testing on a Debian-like system:

FLAC Executables

The included flac-win32 executable is the official FLAC 1.3.2 32-bit Windows binary .

The included flac-linux-x86 and flac-linux-x86_64 executables are built from the FLAC 1.3.2 source code with Manylinux to ensure that it’s compatible with a wide variety of distributions.

The built FLAC executables should be bit-for-bit reproducible. To rebuild them, run the following inside the project directory on a Debian-like system:

The included flac-mac executable is extracted from xACT 2.39 , which is a frontend for FLAC 1.3.2 that conveniently includes binaries for all of its encoders. Specifically, it is a copy of xACT 2.39/xACT.app/Contents/Resources/flac in xACT2.39.zip .

Please report bugs and suggestions at the issue tracker !

How to cite this library (APA style):

Zhang, A. (2017). Speech Recognition (Version 3.8) [Software]. Available from https://github.com/Uberi/speech_recognition#readme .

How to cite this library (Chicago style):

Zhang, Anthony. 2017. Speech Recognition (version 3.8).

Also check out the Python Baidu Yuyin API , which is based on an older version of this project, and adds support for Baidu Yuyin . Note that Baidu Yuyin is only available inside China.

Copyright 2014-2017 Anthony Zhang (Uberi) . The source code for this library is available online at GitHub .

SpeechRecognition is made available under the 3-clause BSD license. See LICENSE.txt in the project’s root directory for more information.

For convenience, all the official distributions of SpeechRecognition already include a copy of the necessary copyright notices and licenses. In your project, you can simply say that licensing information for SpeechRecognition can be found within the SpeechRecognition README, and make sure SpeechRecognition is visible to users if they wish to see it .

SpeechRecognition distributes source code, binaries, and language files from CMU Sphinx . These files are BSD-licensed and redistributable as long as copyright notices are correctly retained. See speech_recognition/pocketsphinx-data/*/LICENSE*.txt and third-party/LICENSE-Sphinx.txt for license details for individual parts.

SpeechRecognition distributes source code and binaries from PyAudio . These files are MIT-licensed and redistributable as long as copyright notices are correctly retained. See third-party/LICENSE-PyAudio.txt for license details.

SpeechRecognition distributes binaries from FLAC - speech_recognition/flac-win32.exe , speech_recognition/flac-linux-x86 , and speech_recognition/flac-mac . These files are GPLv2-licensed and redistributable, as long as the terms of the GPL are satisfied. The FLAC binaries are an aggregate of separate programs , so these GPL restrictions do not apply to the library or your programs that use the library, only to FLAC itself. See LICENSE-FLAC.txt for license details.


Large Language Models are Efficient Learners of Noise-Robust Speech Recognition

Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which leverages the rich linguistic knowledge and powerful reasoning ability of LLMs to improve recognition results. The latest work proposes a GER benchmark with the "HyPoradise" dataset to learn the mapping from ASR N-best hypotheses to the ground-truth transcription by efficient LLM finetuning, which shows great effectiveness but lacks specificity for noise-robust ASR. In this work, we extend the benchmark to noisy conditions and investigate whether we can teach LLMs to perform denoising for GER just as robust ASR does, where one solution is introducing noise information as a conditioner into the LLM. However, directly incorporating noise embeddings from the audio encoder could harm LLM tuning due to the cross-modality gap. To this end, we propose to extract a language-space noise embedding from the N-best list to represent the noise conditions of the source speech, which can promote the denoising process in GER. Furthermore, in order to enhance its ability to represent audio noise, we design a knowledge distillation (KD) approach via mutual information estimation to distill the real noise information in audio embeddings into our language embedding. Experiments on various recent LLMs demonstrate that our approach achieves a new breakthrough, with up to 53.9% correction improvement in terms of word error rate while using limited training data. Analysis shows that our language-space noise embedding can well represent the noise conditions of source speech, under which off-the-shelf LLMs show a strong ability for language-space denoising.
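To make the GER setup concrete, a toy prompt builder over an N-best list might look like the sketch below. The wording and function are purely illustrative and are not taken from the paper.

```python
def build_ger_prompt(nbest_hypotheses):
    """Illustrative-only prompt construction for LLM-based generative error
    correction (GER): the model sees the ASR N-best list and is asked for the
    corrected transcript. The template is an assumption, not the paper's."""
    listing = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest_hypotheses))
    return ("Below are N-best hypotheses from a noisy ASR system:\n"
            f"{listing}\n"
            "Report the single most likely true transcription.")

print(build_ger_prompt(["turn of the lights", "turn off the lights", "turn off the light"]))
```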


Open access | Published: 08 April 2024

A neural speech decoding framework leveraging deep learning and speech synthesis

Xupeng Chen, Ran Wang, Amirhossein Khalilian-Gourtani, Leyao Yu, Patricia Dugan, Daniel Friedman, Werner Doyle, Orrin Devinsky, Yao Wang & Adeen Flinker

Nature Machine Intelligence (2024)

  • Neural decoding

A preprint version of the article is available at bioRxiv.

Decoding human speech from neural signals is essential for brain–computer interface (BCI) technologies that aim to restore speech in populations with neurological deficits. However, it remains a highly challenging task, compounded by the scarce availability of neural signals with corresponding speech, data complexity and high dimensionality. Here we present a novel deep learning-based neural speech decoding framework that includes an ECoG decoder that translates electrocorticographic (ECoG) signals from the cortex into interpretable speech parameters and a novel differentiable speech synthesizer that maps speech parameters to spectrograms. We have developed a companion speech-to-speech auto-encoder consisting of a speech encoder and the same speech synthesizer to generate reference speech parameters to facilitate the ECoG decoder training. This framework generates natural-sounding speech and is highly reproducible across a cohort of 48 participants. Our experimental results show that our models can decode speech with high correlation, even when limited to only causal operations, which is necessary for adoption by real-time neural prostheses. Finally, we successfully decode speech in participants with either left or right hemisphere coverage, which could lead to speech prostheses in patients with deficits resulting from left hemisphere damage.


Speech loss due to neurological deficits is a severe disability that limits both work life and social life. Advances in machine learning and brain–computer interface (BCI) systems have pushed the envelope in the development of neural speech prostheses to enable people with speech loss to communicate 1 , 2 , 3 , 4 , 5 . An effective modality for acquiring data to develop such decoders involves electrocorticographic (ECoG) recordings obtained in patients undergoing epilepsy surgery 4 , 5 , 6 , 7 , 8 , 9 , 10 . Implanted electrodes in patients with epilepsy provide a rare opportunity to collect cortical data during speech with high spatial and temporal resolution, and such approaches have produced promising results in speech decoding 4 , 5 , 8 , 9 , 10 , 11 .

Two challenges are inherent to successfully carrying out speech decoding from neural signals. First, the data to train personalized neural-to-speech decoding models are limited in duration, and deep learning models require extensive training data. Second, speech production varies in rate, intonation, pitch and so on, even within a single speaker producing the same word, complicating the underlying model representation 12 , 13 . These challenges have led to diverse speech decoding approaches with a range of model architectures. Currently, public code to test and replicate findings across research groups is limited in availability.

Earlier approaches to decoding and synthesizing speech spectrograms from neural signals focused on linear models. These approaches achieved a Pearson correlation coefficient (PCC) of ~0.6 or lower, but with simple model architectures that are easy to interpret and do not require large training datasets 14 , 15 , 16 . Recent research has focused on deep neural networks leveraging convolutional 8 , 9 and recurrent 5 , 10 , 17 network architectures. These approaches vary across two major dimensions: the intermediate latent representation used to model speech and the speech quality produced after synthesis. For example, cortical activity has been decoded into an articulatory movement space, which is then transformed into speech, providing robust decoding performance but with a non-natural synthetic voice reconstruction 17 . Conversely, some approaches have produced naturalistic reconstruction leveraging wavenet vocoders 8 , generative adversarial networks (GAN) 11 and unit selection 18 , but achieve limited accuracy. A recent study in one implanted patient 19 provided both robust accuracies and a naturalistic speech waveform by leveraging quantized HuBERT features 20 as an intermediate representation space and a pretrained speech synthesizer that converts the HuBERT features into speech. However, HuBERT features do not carry speaker-dependent acoustic information and can only be used to generate a generic speaker’s voice, so they require a separate model to translate the generic voice to a specific patient’s voice. Furthermore, this study and most previous approaches have employed non-causal architectures, which may limit real-time applications, which typically require causal operations.

To address these issues, in this Article we present a novel ECoG-to-speech framework with a low-dimensional intermediate representation guided by subject-specific pre-training using speech signal only (Fig. 1 ). Our framework consists of an ECoG decoder that maps the ECoG signals to interpretable acoustic speech parameters (for example, pitch, voicing and formant frequencies), as well as a speech synthesizer that translates the speech parameters to a spectrogram. The speech synthesizer is differentiable, enabling us to minimize the spectrogram reconstruction error during training of the ECoG decoder. The low-dimensional latent space, together with guidance on the latent representation generated by a pre-trained speech encoder, overcomes data scarcity issues. Our publicly available framework produces naturalistic speech that highly resembles the speaker’s own voice, and the ECoG decoder can be realized with different deep learning model architectures and using different causality directions. We report this framework with multiple deep architectures (convolutional, recurrent and transformer) as the ECoG decoder, and apply it to 48 neurosurgical patients. Our framework performs with high accuracy across the models, with the best performance obtained by the convolutional (ResNet) architecture (PCC of 0.806 between the original and decoded spectrograms). Our framework can achieve high accuracy using only causal processing and relatively low spatial sampling on the cortex. We also show comparable speech decoding from grid implants on the left and right hemispheres, providing a proof of concept for neural prosthetics in patients suffering from expressive aphasia (with damage limited to the left hemisphere), although such an approach must be tested in patients with damage to the left hemisphere. Finally, we provide a publicly available neural decoding pipeline ( https://github.com/flinkerlab/neural_speech_decoding ) that offers flexibility in ECoG decoding architectures to push forward research across the speech science and prostheses communities.

Figure 1

The upper part shows the ECoG-to-speech decoding pipeline. The ECoG decoder generates time-varying speech parameters from ECoG signals. The speech synthesizer generates spectrograms from the speech parameters. A separate spectrogram inversion algorithm converts the spectrograms to speech waveforms. The lower part shows the speech-to-speech auto-encoder, which generates the guidance for the speech parameters to be produced by the ECoG decoder during its training. The speech encoder maps an input spectrogram to the speech parameters, which are then fed to the same speech synthesizer to reproduce the spectrogram. The speech encoder and a few learnable subject-specific parameters in the speech synthesizer are pre-trained using speech signals only. Only the upper part is needed to decode the speech from ECoG signals once the pipeline is trained.

ECoG-to-speech decoding framework

Our ECoG-to-speech framework consists of an ECoG decoder and a speech synthesizer (shown in the upper part of Fig. 1 ). The neural signals are fed into an ECoG decoder, which generates speech parameters, followed by a speech synthesizer, which translates the parameters into spectrograms (which are then converted to a waveform by the Griffin–Lim algorithm 21 ). The training of our framework comprises two steps. We first use semi-supervised learning on the speech signals alone. An auto-encoder, shown in the lower part of Fig. 1 , is trained so that the speech encoder derives speech parameters from a given spectrogram, while the speech synthesizer (used here as the decoder) reproduces the spectrogram from the speech parameters. Our speech synthesizer is fully differentiable and generates speech through a weighted combination of voiced and unvoiced speech components generated from input time series of speech parameters, including pitch, formant frequencies, loudness and so on. The speech synthesizer has only a few subject-specific parameters, which are learned as part of the auto-encoder training (more details are provided in the Methods Speech synthesizer section). Currently, our speech encoder and speech synthesizer are subject-specific and can be trained using any speech signal of a participant, not just those with corresponding ECoG signals.

In the next step, we train the ECoG decoder in a supervised manner based on ground-truth spectrograms (using measures of spectrogram difference and short-time objective intelligibility, STOI 8 , 22 ), as well as guidance for the speech parameters generated by the pre-trained speech encoder (that is, reference loss between speech parameters). By limiting the number of speech parameters (18 at each time step; Methods section Summary of speech parameters ) and using the reference loss, the ECoG decoder can be trained with limited corresponding ECoG and speech data. Furthermore, because our speech synthesizer is differentiable, we can back-propagate the spectral loss (differences between the original and decoded spectrograms) to update the ECoG decoder. We provide multiple ECoG decoder architectures to choose from, including 3D ResNet 23 , 3D Swin Transformer 24 and LSTM 25 . Importantly, unlike many methods in the literature, we employ ECoG decoders that can operate in a causal manner, which is necessary for real-time speech generation from neural signals. Note that, once the ECoG decoder and speech synthesizer are trained, they can be used for ECoG-to-speech decoding without using the speech encoder.
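For intuition, the combination of reference and spectral supervision described here could look something like the following PyTorch sketch. The weights, loss choices, and tensor shapes are assumptions, the STOI-based term is omitted, and this is not the authors' released training code.

```python
import torch
import torch.nn.functional as F

def ecog_decoder_loss(pred_params, ref_params, pred_spec, true_spec,
                      w_ref=1.0, w_spec=1.0):
    """Sketch of two supervision signals: a reference loss between decoded
    speech parameters and the pre-trained speech encoder's parameters, plus a
    spectral loss between decoded and ground-truth spectrograms."""
    reference_loss = F.l1_loss(pred_params, ref_params)
    spectral_loss = F.l1_loss(pred_spec, true_spec)
    return w_ref * reference_loss + w_spec * spectral_loss

# Toy shapes: (batch, time, 18 speech parameters) and (batch, freq, time) spectrograms.
loss = ecog_decoder_loss(torch.rand(2, 100, 18), torch.rand(2, 100, 18),
                         torch.rand(2, 128, 100), torch.rand(2, 128, 100))
```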

Data collection

We employed our speech decoding framework across N  = 48 participants who consented to complete a series of speech tasks (Methods section Experiments design). These participants, as part of their clinical care, were undergoing treatment for refractory epilepsy with implanted electrodes. During the hospital stay, we acquired synchronized neural and acoustic speech data. ECoG data were obtained from five participants with hybrid-density (HB) sampling (clinical-research grid) and 43 participants with low-density (LD) sampling (standard clinical grid), who took part in five speech tasks: auditory repetition (AR), auditory naming (AN), sentence completion (SC), word reading (WR) and picture naming (PN). These tasks were designed to elicit the same set of spoken words across tasks while varying the stimulus modality. We provided 50 repeated unique words (400 total trials per participant), all of which were analysed locked to the onset of speech production. We trained a model for each participant using 80% of available data for that participant and evaluated the model on the remaining 20% of data (with the exception of the more stringent word-level cross-validation).

Speech decoding performance and causality

We first aimed to directly compare the decoding performance across different architectures, including those that have been employed in the neural speech decoding literature (recurrent and convolutional) and transformer-based models. Although any decoder architecture could be used for the ECoG decoder in our framework, employing the same speech encoder guidance and speech synthesizer, we focused on three representative models for convolution (ResNet), recurrent (LSTM) and transformer (Swin) architectures. Note that any of these models can be configured to use temporally non-causal or causal operations. Our results show that ResNet outperformed the other models, providing the highest PCC across N  = 48 participants (mean PCC = 0.806 and 0.797 for non-causal and causal, respectively), closely followed by Swin (mean PCC = 0.792 and 0.798 for non-causal and causal, respectively) (Fig. 2a ). We found the same when evaluating the three models using STOI+ (ref. 26 ), as shown in Supplementary Fig. 1a . The causality of machine learning models for speech production has important implications for BCI applications. A causal model only uses past and current neural signals to generate speech, whereas non-causal models use past, present and future neural signals. Previous reports have typically employed non-causal models 5 , 8 , 10 , 17 , which can use neural signals related to the auditory and speech feedback that is unavailable in real-time applications. Optimally, only the causal direction should be employed. We thus compared the performance of the same models with non-causal and causal temporal operations. Figure 2a compares the decoding results of causal and non-causal versions of our models. The causal ResNet model (PCC = 0.797) achieved a performance comparable to that of the non-causal model (PCC = 0.806), with no significant differences between the two (Wilcoxon two-sided signed-rank test P  = 0.093). The same was true for the causal Swin model (PCC = 0.798) and its non-causal (PCC = 0.792) counterpart (Wilcoxon two-sided signed-rank test P  = 0.196). In contrast, the performance of the causal LSTM model (PCC = 0.712) was significantly inferior to that of its non-causal (PCC = 0.745) version (Wilcoxon two-sided signed-rank test P  = 0.009). Furthermore, the LSTM model showed consistently lower performance than ResNet and Swin. However, we did not find significant differences between the causal ResNet and causal Swin performances (Wilcoxon two-sided signed-rank test P  = 0.587). Because the ResNet and Swin models had the highest performance and were on par with each other and their causal counterparts, we chose to focus further analyses on these causal models, which we believe are best suited for prosthetic applications.
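As a concrete illustration of how such comparisons can be computed, the sketch below pairs per-trial PCC averaging with a paired Wilcoxon signed-rank test. Function and variable names are ours, not the authors' released evaluation code.

```python
import numpy as np
from scipy.stats import pearsonr, wilcoxon

def mean_pcc(decoded, originals):
    """Average PCC over a participant's test trials (spectrograms flattened)."""
    return np.mean([pearsonr(d.ravel(), o.ravel())[0]
                    for d, o in zip(decoded, originals)])

def compare_variants(pcc_model_a, pcc_model_b):
    """Paired, two-sided Wilcoxon signed-rank test on per-participant PCCs,
    mirroring the causal vs. non-causal comparisons reported above."""
    statistic, p_value = wilcoxon(pcc_model_a, pcc_model_b)  # two-sided by default
    return p_value
```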

Figure 2

a , Performances of ResNet, Swin and LSTM models with non-causal and causal operations. The PCC between the original and decoded spectrograms is evaluated on the held-out testing set and shown for each participant. Each data point corresponds to a participant’s average PCC across testing trials. b , A stringent cross-validation showing the performance of the causal ResNet model on unseen words during training from five folds; we ensured that the training and validation sets in each fold did not overlap in unique words. The performance across all five validation folds was comparable to our trial-based validation, denoted for comparison as ResNet (identical to the ResNet causal model in a ). c – f , Examples of decoded spectrograms and speech parameters from the causal ResNet model for eight words (from two participants) and the PCC values for the decoded and reference speech parameters across all participants. Spectrograms of the original ( c ) and decoded ( d ) speech are shown, with orange curves overlaid representing the reference voice weight learned by the speech encoder ( c ) and the decoded voice weight from the ECoG decoder ( d ). The PCC between the decoded and reference voice weights is shown on the right across all participants. e , Decoded and reference loudness parameters for the eight words, and the PCC values of the decoded loudness parameters across participants (boxplot on the right). f , Decoded (dashed) and reference (solid) parameters for pitch ( f 0 ) and the first two formants ( f 1 and f 2 ) are shown for the eight words, as well as the PCC values across participants (box plots to the right). All box plots depict the median (horizontal line inside the box), 25th and 75th percentiles (box) and 25th or 75th percentiles ± 1.5 × interquartile range (whiskers) across all participants ( N  = 48). Yellow error bars denote the mean ± s.e.m. across participants.


To ensure our framework can generalize well to unseen words, we added a more stringent word-level cross-validation in which random (ten unique) words were entirely held out during training (including both pre-training of the speech encoder and speech synthesizer and training of the ECoG decoder). This ensured that different trials from the same word could not appear in both the training and testing sets. The results shown in Fig. 2b demonstrate that performance on the held-out words is comparable to our standard trial-based held-out approach (Fig. 2a , ‘ResNet’). It is encouraging that the model can decode unseen validation words well, regardless of which words were held out during training.
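A word-level split of this kind can be expressed with scikit-learn's GroupKFold, as in the generic sketch below; the word labels are made up and this is not the study's actual partitioning code.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Word-level cross-validation sketch: trials that share a word label never
# appear in both the training and validation sets.
words = np.array(["cat", "cat", "dog", "dog", "sun", "sun", "map", "map"])
trial_indices = np.arange(len(words))

for train_idx, val_idx in GroupKFold(n_splits=2).split(trial_indices, groups=words):
    assert set(words[train_idx]).isdisjoint(words[val_idx])
```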

Next, we show the performance of the ResNet causal decoder on the level of single words across two representative participants (LD grids). The decoded spectrograms accurately preserve the spectro-temporal structure of the original speech (Fig. 2c,d ). We also compare the decoded speech parameters with the reference parameters. For each parameter, we calculated the PCC between the decoded time series and the reference sequence, showing average PCC values of 0.781 (voice weight, Fig. 2d ), 0.571 (loudness, Fig. 2e ), 0.889 (pitch f 0 , Fig. 2f ), 0.812 (first formant f 1 , Fig. 2f ) and 0.883 (second formant f 2 , Fig. 2f ). Accurate reconstruction of the speech parameters, especially the pitch, voice weight and first two formants, is essential for accurate speech decoding and naturalistic reconstruction that mimics a participant’s voice. We also provide a non-causal version of Fig. 2 in Supplementary Fig. 2 . The fact that both non-causal and causal models can yield reasonable decoding results is encouraging.

Left-hemisphere versus right-hemisphere decoding

Most speech decoding studies have focused on the language- and speech-dominant left hemisphere 27 . However, little is known about decoding speech representations from the right hemisphere. To this end, we compared left- versus right-hemisphere decoding performance across our participants to establish the feasibility of a right-hemisphere speech prosthetic. For both our ResNet and Swin decoders, we found robust speech decoding from the right hemisphere (ResNet PCC = 0.790, Swin PCC = 0.798) that was not significantly different from that of the left (Fig. 3a , ResNet independent t -test, P  = 0.623; Swin independent t -test, P  = 0.968). A similar conclusion held when evaluating STOI+ (Supplementary Fig. 1b , ResNet independent t -test, P  = 0.166; Swin independent t -test, P  = 0.114). Although these results suggest that it may be feasible to use neural signals in the right hemisphere to decode speech for patients who suffer damage to the left hemisphere and are unable to speak 28 , it remains unknown whether intact left-hemisphere cortex is necessary to allow for speech decoding from the right hemisphere until tested in such patients.

Figure 3

a, Comparison between left- and right-hemisphere participants using causal models. No statistically significant differences in PCC values exist between left- (N = 32) and right- (N = 16) hemisphere participants (ResNet independent t-test, P = 0.623; Swin independent t-test, P = 0.968). b, An example hybrid-density ECoG array with a total of 128 electrodes. The 64 electrodes marked in red correspond to a LD placement. The remaining 64 green electrodes, combined with the red electrodes, reflect the HB placement. c, Comparison between the causal ResNet and causal Swin models across participants with HB (N = 5) or LD (N = 43) ECoG grids. The two models show similar decoding performance for the HB and LD grids. d, Decoding PCC values across 50 test trials by the ResNet model for HB (N = 5) participants when all electrodes are used versus when only the LD-in-HB electrodes are considered. There are no statistically significant differences for four out of five participants (Wilcoxon two-sided signed-rank test, P = 0.114, 0.003, 0.0773, 0.472 and 0.605, respectively). All box plots depict the median (horizontal line inside box), 25th and 75th percentiles (box) and 25th or 75th percentiles ± 1.5 × interquartile range (whiskers). Yellow error bars denote mean ± s.e.m. Distributions were compared with each other as indicated, using the Wilcoxon two-sided signed-rank test and independent t-test. **P < 0.01; NS, not significant.

Effect of electrode density

Next, we assessed the impact of electrode sampling density on speech decoding, as many previous reports use higher-density grids (0.4 mm) with more closely spaced contacts than typical clinical grids (1 cm). Five participants consented to hybrid grids (Fig. 3b , HB), which typically had LD electrode sampling but with additional electrodes interleaved. The HB grids provided a decoding performance similar to clinical LD grids in terms of PCC values (Fig. 3c ), with a slight advantage in STOI+, as shown in Supplementary Fig. 3b . To ascertain whether the additional spatial sampling indeed provides improved speech decoding, we compared models that decode speech based on all the hybrid electrodes versus only the LD electrodes in participants with HB grids (comparable to our other LD participants). Our findings (Fig. 3d ) suggest that the decoding results were not significantly different from each other (with the exception of participant 2) in terms of PCC and STOI+ (Supplementary Fig. 3c ). Together, these results suggest that our models can learn speech representations well from both high and low spatial sampling of the cortex, with the exciting finding of robust speech decoding from the right hemisphere.

Contribution analysis

Finally, we investigated which cortical regions contribute to decoding, to provide insight for the targeted implantation of future prosthetics, especially in the right hemisphere, which has not yet been investigated. We used an occlusion approach to quantify the contributions of different cortical sites to speech decoding. If a region is involved in decoding, occluding the neural signal in the corresponding electrode (that is, setting the signal to zero) will reduce the accuracy (PCC) of the speech reconstructed on testing data (Methods section Contribution analysis). We thus measured each region's contribution by the reduction in the PCC when the corresponding electrode was occluded. We analysed all electrodes and participants with causal and non-causal versions of the ResNet and Swin decoders. The results in Fig. 4 show similar contributions for the ResNet and Swin models (Supplementary Figs. 8 and 9 describe the noise-level contribution). The non-causal models show enhanced auditory cortex contributions compared with the causal models, implicating auditory feedback in decoding and underlining the importance of employing only causal models during speech decoding, because neural feedback signals are not available for real-time decoding applications. Furthermore, across the causal models, both the right and left hemispheres show similar contributions across the sensorimotor cortex, especially its ventral portion, suggesting the potential feasibility of right-hemisphere neural prosthetics.
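The occlusion logic described here is straightforward to sketch. The following is a generic illustration only; the decoder, scoring function, and array layout are assumptions rather than the released pipeline.

```python
import numpy as np

def electrode_contributions(decode, ecog, target_spec, score):
    """Occlusion sketch: zero out one electrode at a time and record the drop
    in decoding accuracy (e.g. PCC) relative to the intact signal. `decode`
    and `score` are placeholder callables; ECoG is assumed (time, electrodes)."""
    baseline = score(decode(ecog), target_spec)
    drops = []
    for e in range(ecog.shape[-1]):
        occluded = ecog.copy()
        occluded[..., e] = 0.0                  # occlude this electrode
        drops.append(baseline - score(decode(occluded), target_spec))
    return np.array(drops)                      # larger drop = larger contribution
```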

Figure 4

Visualization of the contribution of each cortical location to the decoding result achieved by both causal and non-causal decoding models through an occlusion analysis. The contribution of each electrode region in each participant is projected onto the standardized Montreal Neurological Institute (MNI) brain anatomical map and then averaged over all participants. Each subplot shows the causal or non-causal contribution of different cortical locations (red indicates a higher contribution; yellow indicates a lower contribution). For visualization purposes, we normalized the contribution of each electrode location by the local grid density, because there were multiple participants with non-uniform density.

Our novel pipeline can decode speech from neural signals by leveraging interchangeable architectures for the ECoG decoder and a novel differentiable speech synthesizer (Fig. 5 ). Our training process relies on estimating guidance speech parameters from the participants’ speech using a pre-trained speech encoder (Fig. 6a ). This strategy enabled us to train ECoG decoders with limited corresponding speech and neural data, which can produce natural-sounding speech when paired with our speech synthesizer. Our approach was highly reproducible across participants ( N  = 48), providing evidence for successful causal decoding with convolutional (ResNet; Fig. 6c ) and transformer (Swin; Fig. 6d ) architectures, both of which outperformed the recurrent architecture (LSTM; Fig. 6e ). Our framework can successfully decode from both high and low spatial sampling with high levels of decoding performance. Finally, we provide potential evidence for robust speech decoding from the right hemisphere as well as the spatial contribution of cortical structures to decoding across the hemispheres.

figure 5

Our speech synthesizer generates the spectrogram at time t by combining a voiced component and an unvoiced component based on a set of speech parameters at t . The upper part represents the voice pathway, which generates the voiced component by passing a harmonic excitation with fundamental frequency \({f}_{0}^{\;t}\) through a voice filter (which is the sum of six formant filters, each specified by formant frequency \({f}_{i}^{\;t}\) and amplitude \({a}_{i}^{t}\) ). The lower part describes the noise pathway, which synthesizes the unvoiced sound by passing white noise through an unvoice filter (consisting of a broadband filter defined by centre frequency \({f}_{\hat{u}}^{\;t}\) , bandwidth \({b}_{\hat{u}}^{t}\) and amplitude \({a}_{\hat{u}}^{t}\) , and the same six formant filters used for the voice filter). The two components are next mixed with voice weight α t and unvoice weight 1 −  α t , respectively, and then amplified by loudness L t . A background noise (defined by a stationary spectrogram B ( f )) is finally added to generate the output spectrogram. There are a total of 18 speech parameters at any time t , indicated in purple boxes.

figure 6

a , The speech encoder architecture. We input a spectrogram into a network of temporal convolution layers and channel MLPs that produce speech parameters. b , c , The ECoG decoder ( c ) using the 3D ResNet architecture. We first use several temporal and spatial convolutional layers with residual connections and spatiotemporal pooling to generate downsampled latent features, and then use corresponding transposed temporal convolutional layers to upsample the features to the original temporal dimension. We then apply temporal convolution layers and channel MLPs to map the features to speech parameters, as shown in b . The non-causal version uses non-causal temporal convolution in each layer, whereas the causal version uses causal convolution. d , The ECoG decoder using the 3D Swin architecture. We use three or four stages of 3D Swin blocks with spatial-temporal attention (three blocks for LD and four blocks for HB) to extract the features from the ECoG signal. We then use the transposed versions of temporal convolution layers as in c to upsample the features. The resulting features are mapped to the speech parameters using the same structure as shown in b . Non-causal versions apply temporal attention to past, present and future tokens, whereas the causal version applies temporal attention only to past and present tokens. e , The ECoG decoder using LSTM layers. We use three LSTM layers and one layer of channel MLP to generate features. We then reuse the prediction layers in b to generate the corresponding speech parameters. The non-causal version employs bidirectional LSTM in each layer, whereas the causal version uses unidirectional LSTM.

Our decoding pipeline showed robust speech decoding across participants, leading to PCC values within the range 0.62–0.92 (Fig. 2a ; causal ResNet mean 0.797, median 0.805) between the decoded and ground-truth speech across several architectures. We attribute our stable training and accurate decoding to the carefully designed components of our pipeline (for example, the speech synthesizer and speech parameter guidance) and the multiple improvements ( Methods sections Speech synthesizer , ECoG decoder and Model training ) over our previous approach on the subset of participants with hybrid-density grids 29 . Previous reports have investigated speech- or text-decoding using linear models 14 , 15 , 30 , transitional probability 4 , 31 , recurrent neural networks 5 , 10 , 17 , 19 , convolutional neural networks 8 , 29 and other hybrid or selection approaches 9 , 16 , 18 , 32 , 33 . Overall, our results are similar to (or better than) many previous reports (54% of our participants showed higher than 0.8 for the decoding PCC; Fig. 3c ). However, a direct comparison is complicated by multiple factors. Previous reports vary in terms of the reported performance metrics, as well as the stimuli decoded (for example, continuous speech versus single words) and the cortical sampling (that is, high versus low density, depth electrodes compared with surface grids). Our publicly available pipeline, which can be used across multiple neural network architectures and tested on various performance metrics, can help the research community conduct more direct comparisons while maintaining a high accuracy of speech decoding.

The temporal causality of decoding operations, critical for real-time BCI applications, has not been considered by most previous studies. Many of these non-causal models relied on auditory (and somatosensory) feedback signals. Our analyses show that non-causal models rely on a robust contribution from the superior temporal gyrus (STG), which is mostly eliminated using a causal model (Fig. 4 ). We believe that non-causal models would show limited generalizability to real-time BCI applications due to their over-reliance on feedback signals, which may be absent (if no delay is allowed) or incorrect (if a short latency is allowed during real-time decoding). Some approaches used imagined speech, which avoids feedback during training 16 , or showed generalizability to mimed production lacking auditory feedback 17 , 19 . However, most reports still employ non-causal models, which cannot rule out feedback during training and inference. Indeed, our contribution maps show robust auditory cortex recruitment for the non-causal ResNet and Swin models (Fig. 4 ), in contrast to their causal counterparts, which decode based on more frontal regions. Furthermore, the recurrent neural networks that are widely used in the literature 5 , 19 are typically bidirectional, producing non-causal behaviours and longer latencies for prediction during real-time applications; unidirectional causal results are typically not reported. The recurrent network we tested performed the worst when trained with one direction (Fig. 2a , causal LSTM). Although our current focus was not real-time decoding, we were able to synthesize speech from neural signals with a delay of under 50 ms (Supplementary Table 1 ), which provides minimal auditory delay interference and allows for normal speech production 34 , 35 . Our data suggest that causal convolutional and transformer models can perform on par with their non-causal counterparts and recruit more relevant cortical structures for real-time decoding.

In our study we have leveraged an intermediate speech parameter space together with a novel differentiable speech synthesizer to decode subject-specific naturalistic speech (Fig. 1 ). Previous reports used varying approaches to model speech, including an intermediate kinematic space 17 , an acoustically relevant intermediate space using HuBERT features 19 derived from a self-supervised speech masked prediction task 20 , an intermediate random vector (that is, GAN) 11 or direct spectrogram representations 8 , 17 , 36 , 37 . Our choice of speech parameters as the intermediate representation allowed us to decode subject-specific acoustics. Our intermediate acoustic representation led to significantly more accurate speech decoding than directly mapping ECoG to the speech spectrogram 38 , or than mapping ECoG to a random vector that is then fed to a GAN-based speech synthesizer 11 (Supplementary Fig. 10 ). Unlike the kinematic representation, our acoustic intermediate representation using speech parameters and the associated speech synthesizer enables our decoding pipeline to produce natural-sounding speech that preserves subject-specific characteristics, which would be lost with the kinematic representation.

Our speech synthesizer is motivated by classical vocoder models for speech production (generating speech by passing an excitation source, harmonic or noise, through a filter 39 , 40 ) and is fully differentiable, facilitating the training of the ECoG decoder using spectral losses through backpropagation. Furthermore, the guidance speech parameters needed for training the ECoG decoder can be obtained using a speech encoder that can be pre-trained without requiring neural data. Thus, for patients without the ability to speak, the speech encoder could be trained using older speech recordings or speech from a proxy speaker chosen by the patient. Training the ECoG decoder using such guidance, however, would require us to revise our current training strategy to overcome the challenge of misalignment between neural signals and speech signals, which we leave for future work. Additionally, the low-dimensional acoustic space and the speech encoder pre-trained on speech signals alone (for generating the guidance) alleviate the limited-data challenge in training the ECoG-to-speech decoder and provide a highly interpretable latent space. Finally, our decoding pipeline is generalizable to unseen words (Fig. 2b ). This provides an advantage compared with pattern-matching approaches 18 that produce subject-specific utterances but with limited generalizability.

Many earlier studies employed high-density electrode coverage over the cortex, providing many distinct neural signals 5 , 10 , 17 , 30 , 37 . One question we directly addressed was whether higher-density coverage improves decoding. Surprisingly, we found a high decoding performance in terms of spectrogram PCC with both low-density and higher (hybrid) density grid coverages (Fig. 3c ). Furthermore, comparing the decoding performance obtained using all electrodes in our hybrid-density participants versus using only the low-density electrodes in the same participants revealed that the decoding did not differ significantly (except for one participant; Fig. 3d ). We attribute these results to the ability of our ECoG decoder to extract speech parameters from neural signals as long as there is sufficient perisylvian coverage, even in low-density participants.

A striking result was the robust decoding from right hemisphere cortical structures as well as the clear contribution of the right perisylvian cortex. Our results are consistent with the idea that syllable-level speech information is represented bilaterally 41 . Moreover, our findings suggest that speech information is well represented in the right hemisphere. Our decoding results could directly lead to speech prostheses for patients who suffer from expressive aphasia or apraxia of speech. Some previous studies have shown limited right-hemisphere decoding of vowels 42 and sentences 43 ; however, the results were mostly mixed with left-hemisphere signals. Although our decoding results provide evidence for a robust representation of speech in the right hemisphere, it is important to note that these regions are likely not critical for speech, as evidenced by the few studies that have probed both hemispheres using electrical stimulation mapping 44 , 45 . Furthermore, it is unclear whether the right hemisphere would contain sufficient information for speech decoding if the left hemisphere were damaged. It would be necessary to collect right-hemisphere neural data from left-hemisphere-damaged patients to verify that acceptable speech decoding can still be achieved. However, we believe that right-hemisphere decoding is still an exciting avenue as a clinical target for patients who are unable to speak due to left-hemisphere cortical damage.

There are several limitations in our study. First, our decoding pipeline requires speech training data paired with ECoG recordings, which may not exist for paralysed patients. This could be mitigated by using neural recordings during imagined or mimed speech together with older speech recordings of the patient or speech by a proxy speaker chosen by the patient. As discussed earlier, we would need to revise our training strategy to overcome the temporal misalignment between the neural signal and the speech signal. Second, our ECoG decoder models (3D ResNet and 3D Swin) assume a grid-based electrode sampling, which may not always be the case. Future work should develop model architectures that are capable of handling non-grid data, such as strips and depth electrodes (stereo intracranial electroencephalogram (sEEG)). Importantly, such decoders could replace our current grid-based ECoG decoders while still being trained using our overall pipeline. Finally, our focus in this study was on word-level decoding limited to a vocabulary of 50 words, which may not be directly comparable to sentence-level decoding. Specifically, two recent studies have provided robust speech decoding in a few chronic patients implanted with intracranial ECoG 19 or a Utah array 46 , leveraging the large amount of data available from one patient in each study. It is noteworthy that these studies use a range of approaches to constrain their neural predictions. Metzger and colleagues employed a pre-trained large transformer model leveraging directional attention to provide the guidance HuBERT features for their ECoG decoder. In contrast, Willett and colleagues decoded at the level of phonemes and used transition probability models at both phoneme and word levels to constrain decoding. Our study is much more limited in terms of data. However, we were able to achieve good decoding results across a large cohort of patients through the use of a compact acoustic representation (rather than learnt contextual information). We expect that our approach can help improve generalizability for chronically implanted patients.

To summarize, our neural decoding approach, capable of decoding natural-sounding speech from 48 participants, provides the following major contributions. First, our proposed intermediate representation uses explicit speech parameters and a novel differentiable speech synthesizer, which enables interpretable and acoustically accurate speech decoding. Second, we directly consider the causality of the ECoG decoder, providing strong support for causal decoding, which is essential for real-time BCI applications. Third, our promising decoding results using low sampling density and right-hemisphere electrodes shed light on future neural prosthetic devices using low-density grids and in patients with damage to the left hemisphere. Last, but not least, we have made our decoding framework open to the community with documentation ( https://github.com/flinkerlab/neural_speech_decoding ), and we trust that this open platform will help propel the field forward, supporting reproducible science.

Experiments design

We collected neural data from 48 native English-speaking participants (26 female, 22 male) with refractory epilepsy who had ECoG subdural electrode grids implanted at NYU Langone Hospital. Five participants underwent HB sampling, and 43 LD sampling. The ECoG array was implanted on the left hemisphere for 32 participants and on the right for 16. The Institutional Review Board of NYU Grossman School of Medicine approved all experimental procedures. After consulting with the clinical-care provider, a research team member obtained written and oral consent from each participant. Each participant performed five tasks 47 to produce target words in response to auditory or visual stimuli. The tasks were auditory repetition (AR, repeating auditory words), auditory naming (AN, naming a word based on an auditory definition), sentence completion (SC, completing the last word of an auditory sentence), visual reading (VR, reading aloud written words) and picture naming (PN, naming a word based on a colour drawing).

For each task, we used the same 50 target words, varying only the stimulus modality (auditory, visual and so on). Each word appeared once in the AN and SC tasks and twice in the others. Across the five tasks, each participant thus completed 400 trials with corresponding word production and ECoG recording. The average duration of the produced speech in each trial was 500 ms.

Data collection and preprocessing

The study recorded ECoG signals from the perisylvian cortex (including STG, inferior frontal gyrus (IFG), pre-central and postcentral gyri) of 48 participants while they performed five speech tasks. A microphone recorded the subjects’ speech and was synchronized to the clinical Neuroworks Quantum Amplifier (Natus Biomedical), which captured ECoG signals. The ECoG array consisted of a standard 8 × 8 grid of 64 macro contacts (10-mm spacing) for the 43 participants with low-density sampling. For the five participants with hybrid-density sampling, the ECoG array also included 64 additional interspersed smaller electrodes (1 mm) between the macro contacts (providing 10-mm centre-to-centre spacing between macro contacts and 5-mm centre-to-centre spacing between micro/macro contacts; PMT Corporation) (Fig. 3b ). This Food and Drug Administration (FDA)-approved array was manufactured for this study. During consent, a research team member informed participants that the additional contacts were for research purposes. Electrode placement was determined solely by clinical care (32 left hemispheres; 16 right hemispheres). The decoding models were trained separately for each participant using all trials except ten randomly selected ones from each task, leading to 350 trials for training and 50 for testing. The reported results are for testing data only.

We sampled ECoG signals from each electrode at 2,048 Hz and downsampled them to 512 Hz before processing. Electrodes with artefacts (for example, line noise, poor contact with the cortex, high-amplitude shifts) were rejected. Electrodes with interictal or epileptiform activity were also excluded from the analysis. The mean of a common average reference (across all remaining valid electrodes and time) was subtracted from each individual electrode. After the subtraction, a Hilbert transform extracted the envelope of the high gamma (70–150 Hz) component from the signal, which was then downsampled to 125 Hz. A reference signal was obtained by extracting a silent period of 250 ms before each trial’s stimulus period within the training set and averaging the signals over these silent periods. Each electrode’s signal was normalized to the reference mean and variance (that is, z -score). The data-preprocessing pipeline was coded in MATLAB and Python. For participants with noisy speech recordings, we applied spectral gating to remove stationary noise from the speech using an open-source tool 48 . We ruled out the possibility that our neural data suffer from recently reported acoustic contamination (Supplementary Fig. 5 ) by following published approaches 49 .
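
To make the envelope-extraction steps concrete, the following is a minimal sketch of this preprocessing in Python with NumPy and SciPy. The filter order, the integer decimation factor and all variable names are illustrative assumptions rather than the exact implementation used in the study.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, decimate

def high_gamma_envelope(ecog, fs_in=512, band=(70, 150), fs_out=125):
    """Extract the 70-150 Hz analytic-amplitude envelope from ECoG.

    ecog: array of shape (n_electrodes, n_samples) sampled at fs_in,
          assumed already common-average referenced.
    Returns the envelope resampled towards fs_out.
    """
    # Band-pass to the high-gamma range (filter order is an assumption).
    b, a = butter(4, [band[0] / (fs_in / 2), band[1] / (fs_in / 2)], btype="band")
    filtered = filtfilt(b, a, ecog, axis=-1)
    # Hilbert transform gives the analytic signal; its magnitude is the envelope.
    envelope = np.abs(hilbert(filtered, axis=-1))
    # Downsample from 512 Hz towards 125 Hz (integer decimation used here
    # purely for illustration; 512/4 = 128 Hz, close to the target rate).
    return decimate(envelope, int(round(fs_in / fs_out)), axis=-1)

def zscore_to_baseline(envelope, baseline_mean, baseline_std):
    """Normalize each electrode to its pre-stimulus silent-period statistics."""
    return (envelope - baseline_mean[:, None]) / baseline_std[:, None]
```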

To pre-train the auto-encoder, including the speech encoder and speech synthesizer, unlike our previous work in ref. 29 , which relied completely on unsupervised training, we provided supervision for some speech parameters to further improve their estimation accuracy. Specifically, we used the Praat method 50 to estimate the pitch and four formant frequencies ( \({f}_{ {{{\rm{i}}}} = {1}\,{{{\rm{to}}}}\,4}^{t}\) , in hertz) from the speech waveform. The estimated pitch and formant frequencies were resampled to 125 Hz, the same as the ECoG signal and spectrogram sampling frequency. The mean square error between these speech parameters generated by the speech encoder and those estimated by the Praat method was used as a supervised reference loss, in addition to the unsupervised spectrogram reconstruction and STOI losses, making the training of the auto-encoder semi-supervised.

Speech synthesizer

Our speech synthesizer was inspired by the traditional speech vocoder, which generates speech by switching between voiced and unvoiced content, each generated by filtering a specific excitation signal. Instead of switching between the two components, we use a soft mix of the two components, making the speech synthesizer differentiable. This enables us to train the ECoG decoder and the speech encoder end-to-end by minimizing the spectrogram reconstruction loss with backpropagation. Our speech synthesizer can generate a spectrogram from a compact set of speech parameters, enabling training of the ECoG decoder with limited data. As shown in Fig. 5 , the synthesizer takes dynamic speech parameters as input and contains two pathways. The voice pathway applies a set of formant filters (each specified by the centre frequency \({f}_{i}^{\;t}\) , bandwidth \({b}_{i}^{t}\) that is dependent on \({f}_{i}^{\;t}\) , and amplitude \({a}_{i}^{t}\) ) to the harmonic excitation (with pitch frequency f 0 ) and generates the voiced component, V t ( f ), for each time step t and frequency f . The noise pathway filters the input white noise with an unvoice filter (consisting of a broadband filter defined by centre frequency \({f}_{\hat{u}}^{\;t}\) , bandwidth \({b}_{\hat{u}}^{t}\) and amplitude \({a}_{\hat{u}}^{t}\) and the same six formant filters used for the voice filter) and produces the unvoiced content, U t ( f ). The synthesizer combines the two components with a voice weight α t   ∈  [0, 1] to obtain the combined spectrogram \({\widetilde{S}}^{t}{(\;f\;)}\) as

$${\widetilde{S}}^{t}(\,f\,)={\alpha }^{t}\,{V}^{t}(\,f\,)+\left(1-{\alpha }^{t}\right){U}^{t}(\,f\,)$$

Factor α t acts as a soft switch for the gradient to flow back through the synthesizer. The final speech spectrogram is given by

$${\widehat{S}}^{t}(\,f\,)={L}^{t}\,{\widetilde{S}}^{t}(\,f\,)+B(\,f\,)$$

where L t is the loudness modulation and B ( f ) the background noise. We describe the various components in more detail in the following.
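
As a minimal illustration of the mixing step described above (not the actual implementation), the combined and final spectrograms can be computed directly from the two pathway outputs; the array names and shapes below are assumptions.

```python
import numpy as np

def mix_spectrogram(voiced, unvoiced, alpha, loudness, background):
    """Combine voiced/unvoiced components into the output spectrogram.

    voiced, unvoiced: arrays of shape (T, F) from the two pathways.
    alpha:            voice weight per frame, shape (T, 1), values in [0, 1].
    loudness:         per-frame loudness L_t, shape (T, 1).
    background:       stationary background-noise spectrum B(f), shape (F,).
    """
    combined = alpha * voiced + (1.0 - alpha) * unvoiced   # soft mix of the two pathways
    return loudness * combined + background                # amplify, then add the noise floor
```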

Formant filters in the voice pathway

We use multiple formant filters in the voice pathway to model formants that represent vowels and nasal information. The formant filters capture the resonance in the vocal tract, which can help recover a speaker’s timbre characteristics and generate natural-sounding speech. We assume the filter for each formant is time-varying and can be derived from a prototype filter G i ( f ), which achieves maximum at a centre frequency \({f}_{i}^{{{\;{\rm{proto}}}}}\) and has a half-power bandwidth \({b}_{i}^{{{{\rm{proto}}}}}\) . The prototype filters have learnable parameters and will be discussed later. The actual formant filter at any time is written as a shifted and scaled version of G i ( f ). Specifically, at time t , given an amplitude \({\left({a}_{i}^{t}\right)}\) , centre frequency \({\left(\;{f}_{i}^{\;t}\right)}\) and bandwidth \({\left({b}_{i}^{t}\right)}\) , the frequency-domain representation of the i th formant filter is

$${F}_{i}^{\;t}(\,f\,)={a}_{i}^{t}\,{G}_{i}\left({f}_{i}^{\;{{{\rm{proto}}}}}+\left(\,f-{f}_{i}^{\;t}\right)\frac{{b}_{i}^{{{{\rm{proto}}}}}}{{b}_{i}^{t}}\right),\quad f\in \left[0,\,{f}_{\max }\right]\qquad (1)$$

where f max is half of the speech sampling frequency, which in our case is 8,000 Hz.

Rather than letting the bandwidth parameters \({b}_{i}^{t}\) be independent variables, based on the empirically observed relationships between \({b}_{i}^{t}\) and the centre frequencies \({f}_{i}^{\;t}\) , we set

$${b}_{i}^{t}={b}_{0}+a\,\max \left(0,\,{f}_{i}^{\;t}-{f}_{\theta }\right)$$

The threshold frequency f θ , slope a and baseline bandwidth b 0 are three parameters that are learned during the auto-encoder training, shared among all six formant filters. This parameterization helps to reduce the number of speech parameters to be estimated at every time sample, making the representation space more compact.

Finally the filter for the voice pathway with N formant filters is given by \({F}_{{{{\rm{v}}}}}^{\;t}{(\;f\;)}={\mathop{\sum }\nolimits_{i = 1}^{N}{F}_{i}^{\;t}(\;f\;)}\) . Previous studies have shown that two formants ( N  = 2) are enough for intelligible reconstruction 51 , but we use N  = 6 for more accurate synthesis in our experiments.
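
The following sketch illustrates one way to evaluate a single time-varying formant filter by shifting and scaling the prototype, together with the bandwidth-versus-centre-frequency parameterization; the functional forms and names here are our reading of the description above, not the published code.

```python
import numpy as np

def formant_filter(freqs, proto_gain, f_proto, b_proto, f_c, b_w, amp):
    """Evaluate one time-varying formant filter F_i^t(f).

    freqs:      frequency grid, shape (F,).
    proto_gain: callable returning the learned prototype G_i at any frequency
                (for example, an interpolator over the g_i[m] samples).
    f_proto, b_proto: peak frequency and half-power bandwidth of the prototype.
    f_c, b_w, amp:    centre frequency, bandwidth and amplitude at time t.
    """
    # Shift the prototype so its peak lands at f_c and rescale its width to b_w.
    shifted = f_proto + (freqs - f_c) * (b_proto / b_w)
    return amp * proto_gain(shifted)

def formant_bandwidth(f_c, b0, slope, f_theta):
    """Bandwidth tied to centre frequency: constant below the threshold f_theta,
    growing linearly with the given slope above it (one plausible reading)."""
    return b0 + slope * np.maximum(0.0, f_c - f_theta)
```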

Unvoice filters

We construct the unvoice filter by adding a single broadband filter \({F}_{\hat{u}}^{\;t}{(\;f\;)}\) to the formant filters for each time step t . The broadband filter \({F}_{\hat{u}}^{\;t}{(\;f\;)}\) has the same form as equation ( 1 ) but has its own learned prototype filter \({G}_{\hat{u}}{(f)}\) . The speech parameters corresponding to the broadband filter include \({\left({\alpha }_{\hat{u}}^{t},\,{f}_{\hat{u}}^{\;t},\,{b}_{\hat{u}}^{t}\right)}\) . We do not impose a relationship between the centre frequency \({f}_{\hat{u}}^{\;t}\) and the bandwidth \({b}_{\hat{u}}^{t}\) . This allows more flexibility in shaping the broadband unvoice filter. However, we constrain \({b}_{\hat{u}}^{t}\) to be larger than 2,000 Hz to capture the wide spectral range of obstruent phonemes. Instead of using only the broadband filter, we also retain the N formant filters in the voice pathway \({F}_{i}^{\;t}\) for the noise pathway. This is based on the observation that humans perceive consonants such as /p/ and /d/ not only by their initial bursts but also by their subsequent formant transitions until the next vowel 52 . We use identical formant filter parameters to encode these transitions. The overall unvoice filter is \({F}_{{{{\rm{u}}}}}^{\;t}{(\;f\;)}={F}_{\hat{u}}^{\;t}(\;f\;)+\mathop{\sum }\nolimits_{i = 1}^{N}{F}_{i}^{\;t}{(\;f\;)}\) .

Voice excitation

We use the voice filter in the voice pathway to modulate the harmonic excitation. Following ref. 53 , we define the harmonic excitation as \({h}^{t}={\mathop{\sum }\nolimits_{k = 1}^{K}{h}_{k}^{t}}\) , where K  = 80 is the number of harmonics.

The value of the k th resonance at time step t is \({h}_{k}^{t}={\sin (2\uppi k{\phi }^{t})}\) with \({\phi }^{t}={\mathop{\sum }\nolimits_{\tau = 0}^{t}{f}_{0}^{\;\tau }}\) , where \({f}_{0}^{\;\tau }\) is the fundamental frequency at time τ . The spectrogram of h t forms the harmonic excitation in the frequency domain H t ( f ), and the voice excitation is \({V}^{\;t}{(\;f\;)}={F}_{{{{\rm{v}}}}}^{t}{(\;f\;)}{H}^{\;t}{(\;f\;)}\) .
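
A minimal sketch of the harmonic excitation follows; the explicit division of the pitch track by the sample rate (so that the accumulated phase is in cycles) is an assumption added to make the example runnable.

```python
import numpy as np

def harmonic_excitation(f0, sr=8000, n_harmonics=80):
    """Sum of K sinusoidal harmonics driven by a time-varying pitch track f0 (Hz).

    f0: array of per-sample fundamental-frequency values, shape (T,).
    The instantaneous phase is the running sum of f0, divided by the sample
    rate here so that it is expressed in cycles per sample.
    """
    phase = np.cumsum(f0 / sr)                       # phi^t
    k = np.arange(1, n_harmonics + 1)[:, None]       # harmonic indices 1..K
    harmonics = np.sin(2.0 * np.pi * k * phase[None, :])
    return harmonics.sum(axis=0)                     # h^t = sum over k of h_k^t
```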

Noise excitation

The noise pathway models consonant sounds (plosives and fricatives). The unvoiced component is generated by passing a stationary Gaussian white noise excitation through the unvoice filter. We first generate the noise signal n ( t ) in the time domain by sampling from the Gaussian process \({{{\mathcal{N}}}}{(0,\,1)}\) and then obtain its spectrogram N t ( f ). The spectrogram of the unvoiced component is \({U}^{\;t}{(\;f\;)}={F}_{u}^{\;t}{(\;f\;)}{N}^{\;t}{(\;f\;)}\) .

Summary of speech parameters

The synthesizer generates the voiced component at time t by driving a harmonic excitation with pitch frequency \({f}_{0}^{\;t}\) through N formant filters in the voice pathway, each described by two parameters ( \({f}_{ i}^{\;t},\,{a}_{ i}^{t}\) ). The unvoiced component is generated by filtering a white noise through the unvoice filter consisting of an additional broadband filter with three parameters ( \({f}_{\hat{u}}^{\;t},\,{b}_{\hat{u}}^{t},\,{a}_{\hat{u}}^{t}\) ). The two components are mixed based on the voice weight α t and further amplified by the loudness value L t . In total, the synthesizer input includes 18 speech parameters at each time step.

Unlike the differentiable digital signal processing (DDSP) in ref. 53 , we do not directly assign amplitudes to the K harmonics. Instead, the amplitude in our model depends on the formant filters, which has two benefits:

The representation space is more compact. DDSP requires 80 amplitude parameters \({a}_{k}^{t}\) , one for each of the 80 harmonic components \({f}_{k}^{\;t}\) ( k  = 1, 2, …, 80), at each time step. In contrast, our synthesizer only needs a total of 18 parameters.

The representation is more disentangled. For human speech, the vocal tract shape (affecting the formant filters) is largely independent of the vocal cord tension (which determines the pitch). Modelling these two separately leads to a disentangled representation.

In contrast, DDSP specifies the amplitude for each harmonic component directly, resulting in entanglement and redundancy between these amplitudes. Furthermore, it remains uncertain whether the amplitudes \({a}_{k}^{t}\) could be effectively controlled and encoded by the brain. In our approach, we explicitly model the formant filters and fundamental frequency, which possess clear physical interpretations and are likely to be directly controlled by the brain. Our representation also enables a more robust and direct estimation of the pitch.

Speaker-specific synthesizer parameters

Prototype filters.

Instead of using a predetermined prototype formant filter shape, for example, a standard Gaussian function, we learn a speaker-dependent prototype filter for each formant to allow more expressive and flexible formant filter shapes. We define the prototype filter G i ( f ) of the i th formant as a piecewise linear function, linearly interpolated from g i [ m ], m  = 1, …,  M , where the g i [ m ] are the amplitudes of the filter at M uniformly sampled frequencies in the range [0,  f max ]. We constrain g i [ m ] to increase and then decrease monotonically so that G i ( f ) is unimodal and has a single peak value of 1. Given g i [ m ], m  = 1, …,  M , we can determine the peak frequency \({f}_{i}^{\;{{{\rm{proto}}}}}\) and the half-power bandwidth \({b}_{i}^{{{{\rm{proto}}}}}\) of G i ( f ).

The prototype parameters g i [ m ], m  = 1, …,  M of each formant filter are time-invariant and are determined during the auto-encoder training. Compared with ref. 29 , we increase M from 20 to 80 to enable more expressive formant filters, essential for synthesizing male speakers’ voices.

We similarly learn a prototype filter for the broadband filter G û ( f ) for the unvoiced component, which is specified by M parameters g û ( m ).
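
The sketch below shows one possible way to build a unimodal, piecewise-linear prototype filter with a peak value of 1; parameterizing the monotone rise and fall through squared increments is an assumption, and all names are illustrative.

```python
import numpy as np

def prototype_filter(g_up, g_down, freqs, f_max=8000.0):
    """Build a unimodal piecewise-linear prototype filter G(f) with peak value 1.

    g_up, g_down: unconstrained learnable vectors; squaring makes the increments
                  non-negative so the sampled filter rises and then falls
                  monotonically (one way to enforce the stated constraint).
    freqs:        frequencies at which to evaluate the interpolated filter.
    """
    rise = np.cumsum(g_up ** 2)                      # monotonically increasing part
    fall = rise[-1] - np.cumsum(g_down ** 2)         # monotonically decreasing part
    samples = np.concatenate([rise, fall])
    samples = np.clip(samples / samples.max(), 0.0, None)   # peak normalized to 1
    grid = np.linspace(0.0, f_max, samples.size)             # M uniform frequencies
    return np.interp(freqs, grid, samples)
```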

Background noise

The recorded sound typically contains background noise. We assume that the background noise is stationary and has a specific frequency distribution, depending on the speech recording environment. This frequency distribution B ( f ) is described by K parameters, where K is the number of frequency bins ( K  = 256 for females and 512 for males). The K parameters are also learned during auto-encoder training. The background noise is added to the mixed speech components to generate the final speech spectrogram.

To summarize, our speech synthesizer has the following learnable parameters: the M  = 80 prototype filter parameters for each of the N  = 6 formant filters and the broadband filters (totalling M ( N  + 1) = 560), the three parameters f θ , a and b 0 relating the centre frequency and bandwidth for the formant filters (totalling 18), and K parameters for the background noise (256 for female and 512 for male). The total number of parameters for female speakers is 834, and that for male speakers is 1,090. Note that these parameters are speaker-dependent but time-independent, and they can be learned together with the speech encoder during the training of the speech-to-speech auto-encoder, using the speaker’s speech only.

Speech encoder

The speech encoder extracts a set of (18) speech parameters at each time point from a given spectrogram, which are then fed to the speech synthesizer to reproduce the spectrogram.

We use a simple network architecture for the speech encoder, with temporal convolutional layers and multilayer perceptron (MLP) across channels at the same time point, as shown in Fig. 6a . We encode pitch \({f}_{0}^{\;t}\) by combining features generated from linear and Mel-scale spectrograms. The other 17 speech parameters are derived by applying temporal convolutional layers and channel MLP to the linear-scale spectrogram. To generate formant filter centre frequencies \({f}_{i = 1\,{{{\rm{to}}}}\,6}^{\;t}\) , broadband unvoice filter frequency \({f}_{\hat{u}}^{\;t}\) and pitch \({f}_{0}^{\;t}\) , we use sigmoid activation at the end of the corresponding channel MLP to map the output to [0, 1], and then de-normalize it to real values by scaling [0, 1] to predefined [ f min ,  f max ]. The [ f min ,  f max ] values for each frequency parameter are chosen based on previous studies 54 , 55 , 56 , 57 . Our compact speech parameter space facilitates stable and easy training of our speech encoder. Models were coded using PyTorch version 1.21.1 in Python.
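
The sigmoid-based de-normalization of the frequency heads can be sketched as follows; the example pitch range is a placeholder rather than the value used in the study.

```python
import torch

def denormalize_frequency(raw, f_min, f_max):
    """Map an unbounded network output to a frequency in [f_min, f_max] Hz
    via a sigmoid, as used for the pitch and formant-frequency heads."""
    return f_min + torch.sigmoid(raw) * (f_max - f_min)

# Example: constrain a decoded pitch value to an assumed range of 50-300 Hz.
pitch_hz = denormalize_frequency(torch.tensor(0.3), f_min=50.0, f_max=300.0)
```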

ECoG decoder

In this section we present the design details of three ECoG decoders: the 3D ResNet ECoG decoder, the 3D Swin transformer ECoG decoder and the LSTM ECoG decoder. The models were coded using PyTorch version 1.21.1 in Python.

3D ResNet ECoG decoder

This decoder adopts the ResNet architecture 23 for the feature extraction backbone of the decoder. Figure 6c illustrates the feature extraction part. The model views the ECoG input as 3D tensors with spatiotemporal dimensions. In the first layer, we apply only temporal convolution to the signal from each electrode, because the ECoG signal exhibits more temporal than spatial correlations. In the subsequent parts of the decoder, we have four residual blocks that extract spatiotemporal features using 3D convolution. After downsampling the electrode dimension to 1 × 1 and the temporal dimension to T /16, we use several transposed Conv layers to upsample the features to the original temporal size T . Figure 6b shows how to generate the different speech parameters from the resulting features using different temporal convolution and channel MLP layers. The temporal convolution operation can be causal (that is, using only past and current samples as input) or non-causal (that is, using past, current and future samples), leading to causal and non-causal models.
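
A minimal PyTorch sketch of causal versus non-causal temporal convolution is shown below; padding only on the left is one standard way to obtain causality and is an assumption about the implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalConv1d(nn.Module):
    """Temporal convolution that is either causal (pads only on the left, so the
    output at time t sees samples <= t) or non-causal (symmetric padding)."""

    def __init__(self, channels_in, channels_out, kernel_size=3, causal=True):
        super().__init__()
        self.causal = causal
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(channels_in, channels_out, kernel_size)

    def forward(self, x):            # x: (batch, channels, time)
        if self.causal:
            x = F.pad(x, (self.kernel_size - 1, 0))            # left padding only
        else:
            half = (self.kernel_size - 1) // 2
            x = F.pad(x, (half, self.kernel_size - 1 - half))  # symmetric padding
        return self.conv(x)

# Example: both variants preserve the temporal length of the input.
y = TemporalConv1d(64, 128, kernel_size=3, causal=True)(torch.randn(1, 64, 100))
```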

3D Swin Transformer ECoG decoder

The Swin Transformer 24 employs window and shifted-window mechanisms to enable self-attention among small patches within each window. This reduces the computational complexity and introduces the inductive bias of locality. Because our ECoG input data have three dimensions, we extend the Swin Transformer to three dimensions to enable local self-attention in both temporal and spatial dimensions among 3D patches. The local attention within each window gradually becomes global attention as the model merges neighbouring patches in deeper transformer stages.

Figure 6d illustrates the overall architecture of the proposed 3D Swin Transformer. The input ECoG signal has a size of T  ×  H  ×  W , where T is the number of frames and H  ×  W is the number of electrodes at each frame. We treat each 3D patch of size 2 × 2 × 2 as a token in the 3D Swin Transformer. The 3D patch partitioning layer produces \({\frac{T}{2}\times \frac{H}{2}\times \frac{W}{2}}\) 3D tokens, each with a 48-dimensional feature. A linear embedding layer then projects the features of each token to a higher dimension C (=128).

The 3D Swin Transformer comprises three stages with two, two and six layers, respectively, for LD participants and four stages with two, two, six and two layers for HB participants. It performs 2 × 2 × 2 spatial and temporal downsampling in the patch-merging layer of each stage. The patch-merging layer concatenates the features of each group of 2 × 2 × 2 temporally and spatially adjacent tokens and applies a linear layer to project the concatenated features to one-quarter of their original dimension after merging. In the 3D Swin Transformer block, we replace the multi-head self-attention (MSA) module in the original Swin Transformer with a 3D shifted-window multi-head self-attention module and adapt the other components to 3D operations as well. A Swin Transformer block consists of a 3D shifted window-based MSA module followed by a feedforward network (FFN), implemented as a two-layer MLP. Layer normalization is applied before each MSA module and FFN, and a residual connection is applied after each module.

Consider a stage with T  ×  H  ×  W input tokens. If the 3D window size is P  ×  M  ×  M , we partition the input into \({\lceil \frac{T}{P}\rceil \times \lceil \frac{H}{M}\rceil \times \lceil \frac{W}{M}\rceil}\) non-overlapping 3D windows evenly. We choose P  = 16, M  = 2. We perform the multi-head self-attention within each 3D window. However, this design lacks connection across adjacent windows, which may limit the representation power of the architecture. Therefore, we extend the shifted 2D window mechanism of the Swin Transformer to shifted 3D windows. In the second layer of the stage, we shift the window by \(\left({\frac{P}{2},\,\frac{M}{2},\,\frac{M}{2}}\right)\) tokens along the temporal, height and width axes from the previous layer. This creates cross-window connections for the self-attention module. This shifted 3D window design enables the interaction of electrodes with longer spatial and temporal distances by connecting neighbouring tokens in non-overlapping 3D windows in the previous layer.

The temporal attention in the self-attention operation can be constrained to be causal (that is, each token only attends to tokens temporally before it) or non-causal (that is, each token can attend to tokens temporally before or after it), leading to the causal and non-causal models, respectively.
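
The causal constraint on temporal attention can be illustrated with a simple mask, as in the generic sketch below; this is not the 3D Swin implementation itself, only the masking idea.

```python
import torch

def causal_temporal_mask(num_tokens_t):
    """Boolean mask where entry (i, j) is True if the token at time j may be
    attended to from time i, that is j <= i (past and present only)."""
    idx = torch.arange(num_tokens_t)
    return idx[None, :] <= idx[:, None]

# Example: apply the mask to scaled dot-product attention scores.
scores = torch.randn(4, 4)                                   # (query time, key time)
scores = scores.masked_fill(~causal_temporal_mask(4), float("-inf"))
attn = torch.softmax(scores, dim=-1)                         # future tokens get zero weight
```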

LSTM decoder

The decoder uses the LSTM architecture 25 for the feature extraction in Fig. 6e . Each LSTM cell is composed of a set of gates that control the flow of information: the input gate, the forget gate and the output gate. The input gate regulates the entry of new data into the cell state, the forget gate decides what information is discarded from the cell state, and the output gate determines what information is transferred to the next hidden state and can be output from the cell.

In the LSTM architecture, the ECoG input is processed through these cells sequentially. For each time step t , the LSTM takes the current input x t and the previous hidden state h t  − 1 and produces a new hidden state h t and output y t . This process allows the LSTM to maintain information over time and is particularly useful for tasks such as speech and neural signal processing, where temporal dependencies are critical. Here we use three layers of LSTM and one linear layer to generate features to map to speech parameters. Unlike 3D ResNet and 3D Swin, we keep the temporal dimension unchanged across all layers.
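
A minimal PyTorch sketch of the unidirectional (causal) versus bidirectional (non-causal) LSTM feature extractor follows; the hidden size and the linear prediction head are placeholders.

```python
import torch
import torch.nn as nn

class LSTMFeatureExtractor(nn.Module):
    """Three stacked LSTM layers followed by a linear projection, keeping the
    temporal dimension unchanged. Sizes here are illustrative only."""

    def __init__(self, n_inputs, hidden=256, n_params=18, causal=True):
        super().__init__()
        self.lstm = nn.LSTM(n_inputs, hidden, num_layers=3,
                            batch_first=True, bidirectional=not causal)
        out_dim = hidden * (1 if causal else 2)
        self.head = nn.Linear(out_dim, n_params)

    def forward(self, x):                 # x: (batch, time, electrodes)
        features, _ = self.lstm(x)
        return self.head(features)        # (batch, time, n_params)

# Example: decode 18 speech parameters per frame from 64 flattened electrodes.
params = LSTMFeatureExtractor(64, causal=True)(torch.randn(2, 100, 64))
```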

Model training

Training of the speech encoder and speech synthesizer.

As described earlier, we pre-train the speech encoder and the learnable parameters in the speech synthesizer to perform a speech-to-speech auto-encoding task. We use multiple loss terms for the training. The modified multi-scale spectral (MSS) loss is inspired by ref. 53 and is defined as

Here, S t ( f ) denotes the ground-truth spectrogram and \({\widehat{S}}^{t}{(\;f\;)}\) the reconstructed spectrogram in the linear scale, \({S}_{{{{\rm{mel}}}}}^{t}{(\;f\;)}\) and \({\widehat{S}}_{{{{\rm{mel}}}}}^{t}{(\;f\;)}\) are the corresponding spectrograms in the Mel-frequency scale. We sample the frequency range [0, 8,000 Hz] with K  = 256 bins for female participants. For male participants, we set K  = 512 because they have lower f 0 , and it is better to have a higher resolution in frequency.

To improve the intelligibility of the reconstructed speech, we also introduce the STOI loss by implementing the STOI+ metric 26 , which is a variation of the original STOI metric 8 , 22 . STOI+ 26 discards the normalization and clipping step in STOI and has been shown to perform best among intelligibility evaluation metrics. First, a one-third octave band analysis 22 is performed by grouping Discrete Fourier transform (DFT) bins into 15 one-third octave bands with the lowest centre frequency set equal to 150 Hz and the highest centre frequency equal to ~4.3 kHz. Let \({\hat{x}(k,\,m)}\) denote the k th DFT bin of the m th frame of the ground-truth speech. The norm of the j th one-third octave band, referred to as a time-frequency (TF) unit, is then defined as

$${X}_{j}(m)=\sqrt{\mathop{\sum }\limits_{k={k}_{1}(\,j)}^{{k}_{2}(\,j)-1}{\left|\hat{x}(k,\,m)\right|}^{2}}$$

where k 1 ( j ) and k 2 ( j ) denote the one-third octave band edges rounded to the nearest DFT bin. The TF representation of the processed speech \({\hat{y}}\) is obtained similarly and denoted by Y j ( m ). We then extract the short-time temporal envelopes in each band and frame, denoted X j ,  m and Y j ,  m , where \({X}_{j,\,m}={\left[{X}_{j}{(m-N+1)},\,{X}_{j}{(m-N+2)},\,\ldots ,\,{X}_{j}{(m)}\right]}^{\rm{T}}\) , with N  = 30. The STOI+ metric is the average of the PCC d j ,  m between X j ,  m and Y j ,  m , over all j and m (ref. 26 ):

$$d=\frac{1}{JM}\mathop{\sum }\limits_{j=1}^{J}\mathop{\sum }\limits_{m=1}^{M}{d}_{j,\,m}$$

We use the negative of the STOI+ metric as the STOI loss:

$${L}_{{{{\rm{STOI}}}}}=-\frac{1}{JM}\mathop{\sum }\limits_{j=1}^{J}\mathop{\sum }\limits_{m=1}^{M}{d}_{j,\,m}$$

where J and M are the total numbers of frequency bins ( J  = 15) and frames, respectively. Note that L STOI is differentiable with respect to \({\widehat{S}}^{t}{(\;f\;)}\) , and thus can be used to update the model parameters generating the predicted spectrogram \({\widehat{S}}^{t}{(\;f\;)}\) .
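
A simplified, differentiable sketch of the envelope-correlation computation underlying this loss is given below; it omits the one-third octave band analysis and other STOI+ details, so it illustrates the principle rather than reproducing the reference implementation.

```python
import torch

def envelope_correlation_loss(X, Y, N=30, eps=1e-8):
    """Negative mean Pearson correlation between short-time band envelopes.

    X, Y: band-envelope tensors of shape (J, M) for ground-truth and decoded
          speech (J bands, M frames, with M >= N assumed). Each correlation
          d_{j,m} is computed over a window of N frames ending at frame m.
    """
    J, M = X.shape
    corrs = []
    for m in range(N - 1, M):
        x = X[:, m - N + 1:m + 1]                      # (J, N) segment X_{j,m}
        y = Y[:, m - N + 1:m + 1]
        xc = x - x.mean(dim=1, keepdim=True)
        yc = y - y.mean(dim=1, keepdim=True)
        d = (xc * yc).sum(dim=1) / (xc.norm(dim=1) * yc.norm(dim=1) + eps)
        corrs.append(d)
    return -torch.stack(corrs).mean()                  # L_STOI-like: negative mean d_{j,m}
```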

To further improve the accuracy for estimating the pitch \({\widetilde{f}}_{0}^{\;t}\) and formant frequencies \({\widetilde{f}}_{{{{\rm{i}}}} = {1}\,{{{\rm{to}}}}\,4}^{\;t}\) , we add supervision to them using the pitch and formant frequencies extracted by the Praat method 50 . The supervision loss is defined as

where the weights β i are chosen to be β 1  = 0.1, β 2  = 0.06, β 3  = 0.03 and β 4  = 0.02, based on empirical trials. The overall training loss is defined as

where the weighting parameters λ i are empirically optimized to be λ 1  = 1.2 and λ 2  = 0.1 through testing the performances on three hybrid-density participants with different parameter choices.

Training of the ECoG decoder

With the reference speech parameters generated by the speech encoder and the target speech spectrograms as ground truth, the ECoG decoder is trained to match these targets. Let us denote the decoded speech parameters as \({\widetilde{C}}_{j}^{\;t}\) , and their references as \({C}_{j}^{\;t}\) , where j enumerates all speech parameters fed to the speech synthesizer. We define the reference loss as

where weighting parameters λ j are chosen as follows: voice weight λ α  = 1.8, loudness λ L  = 1.5, pitch \({\lambda }_{{f}_{0}}={0.4}\) , formant frequencies \({\lambda }_{{f}_{1}}={3},\,{\lambda }_{{f}_{2}}={1.8},\,{\lambda }_{{f}_{3}}={1.2},\,{\lambda }_{{f}_{4}}={0.9},\,{\lambda }_{{f}_{5}}={0.6},\,{\lambda }_{{f}_{6}}={0.3}\) , formant amplitudes \({\lambda }_{{a}_{1}}={4},\,{\lambda }_{{a}_{2}}={2.4},\,{\lambda }_{{a}_{3}}={1.2},\,{\lambda }_{{a}_{4}}={0.9},\,{\lambda }_{{a}_{5}}={0.6},\,{\lambda }_{{a}_{6}}={0.3}\) , broadband filter frequency \({\lambda }_{{f}_{\hat{u}}}={10}\) , amplitude \({\lambda }_{{a}_{\hat{u}}}={4}\) , bandwidth \({\lambda }_{{b}_{\hat{u}}}={4}\) . Similar to speech-to-speech auto-encoding, we add supervision loss for pitch and formant frequencies derived by the Praat method and use the MSS and STOI loss to measure the difference between the reconstructed spectrograms and the ground-truth spectrogram. The overall training loss for the ECoG decoder is

where weighting parameters λ i are empirically optimized to be λ 1  = 1.2, λ 2  = 0.1 and λ 3  = 1, through the same parameter search process as described for training the speech encoder.

We use the Adam optimizer 58 with hyper-parameters lr  = 10 −3 , β 1  = 0.9 and β 2  = 0.999 to train both the auto-encoder (including the speech encoder and speech synthesizer) and the ECoG decoder. We train a separate set of models for each participant. As mentioned earlier, we randomly selected 50 out of 400 trials per participant as the test data and used the rest for training.
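
The optimizer settings translate directly into PyTorch, as in the sketch below; the placeholder module and loss stand in for the actual models.

```python
import torch

# Hyper-parameters reported above: lr = 1e-3, beta1 = 0.9, beta2 = 0.999.
# `model` is a placeholder standing in for the auto-encoder or an ECoG decoder.
model = torch.nn.Linear(18, 18)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

loss = model(torch.randn(4, 18)).pow(2).mean()   # placeholder loss for illustration
optimizer.zero_grad()
loss.backward()
optimizer.step()
```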

Evaluation metrics

In this Article, we use the PCC between the decoded spectrogram and the actual speech spectrogram to evaluate the objective quality of the decoded speech, similar to refs. 8 , 18 , 59 .
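
A minimal sketch of the spectrogram PCC computation is shown below; flattening the time-frequency bins before computing the correlation is an assumption about how the metric is aggregated.

```python
import numpy as np

def spectrogram_pcc(decoded, reference):
    """Pearson correlation coefficient between a decoded and a ground-truth
    spectrogram (both of shape (T, F)), computed over all time-frequency bins."""
    return np.corrcoef(decoded.ravel(), reference.ravel())[0, 1]

# Example with random arrays standing in for the two spectrograms.
pcc = spectrogram_pcc(np.random.rand(100, 256), np.random.rand(100, 256))
```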

We also use STOI+ 26 , as described in Methods section Training of the ECoG decoder to measure the intelligibility of the decoded speech. The STOI+ value ranges from −1 to 1 and has been reported to have a monotonic relationship with speech intelligibility.

Contribution analysis with the occlusion method

To measure the contribution of the cortex region under each electrode to the decoding performance, we adopted an occlusion-based method that calculates the change in the PCC between the decoded and the ground-truth spectrograms when an electrode signal is occluded (that is, set to zeros), as in ref. 29 . This method enables us to reveal the critical brain regions for speech production. We used the following notations: S t ( f ), the ground-truth spectrogram; \({\hat{{{{{S}}}}}}^{t}{(\;f\;)}\) , the decoded spectrogram with ‘intact’ input (that is, all ECoG signals are used); \({\hat{{{{{S}}}}}}_{i}^{t}{(\;f\;)}\) , the decoded spectrogram with the i th ECoG electrode signal occluded; r ( ⋅ ,  ⋅ ), correlation coefficient between two signals. The contribution of the i th electrode for a particular participant is defined as

$${C}_{i}={{{\rm{Mean}}}}\left\{r\left({\hat{S}}^{t}(\,f\,),\,{S}^{t}(\,f\,)\right)-r\left({\hat{S}}_{i}^{t}(\,f\,),\,{S}^{t}(\,f\,)\right)\right\}$$

where Mean{ ⋅ } denotes averaging across all testing trials of the participant.
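
The occlusion procedure can be sketched as follows; the decoder callable and data containers are placeholders, and only the PCC drop described above is computed.

```python
import numpy as np

def electrode_contributions(decode_fn, ecog_trials, target_spectrograms):
    """Occlusion analysis: contribution of electrode i = drop in PCC when its
    signal is zeroed out, averaged over the test trials.

    decode_fn: callable mapping an ECoG array (electrodes, time) to a decoded
               spectrogram (T, F); stands in for a trained ECoG decoder.
    """
    def pcc(a, b):
        return np.corrcoef(a.ravel(), b.ravel())[0, 1]

    n_electrodes = ecog_trials[0].shape[0]
    drops = np.zeros(n_electrodes)
    for ecog, target in zip(ecog_trials, target_spectrograms):
        intact = pcc(decode_fn(ecog), target)
        for i in range(n_electrodes):
            occluded = ecog.copy()
            occluded[i] = 0.0                          # occlude electrode i
            drops[i] += intact - pcc(decode_fn(occluded), target)
    return drops / len(ecog_trials)                    # Mean over testing trials
```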

We generate the contribution map on the standardized Montreal Neurological Institute (MNI) brain anatomical map by diffusing the contribution of each electrode of each participant (with a corresponding location in the MNI coordinate) into the adjacent area within the same anatomical region using a Gaussian kernel and then averaging the resulting map from all participants. To account for the non-uniform density of the electrodes in different regions and across the participants, we normalize the sum of the diffused contribution from all the electrodes at each brain location by the total number of electrodes in the region across all participants.

We estimate the noise level for the contribution map to assess the significance of our contribution analysis. To derive the noise level, we train a shuffled model for each participant by randomly pairing the mismatched speech segment and ECoG segment in the training set. We derive the average contribution map from the shuffled models for all participants using the same occlusion analysis as described earlier. The resulting contribution map is used as the noise level. Contribution levels below the noise levels at corresponding cortex locations are assigned a value of 0 (white) in Fig. 4 .

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this Article.

Data availability

The data of one participant who consented to the release of the neural and audio data are publicly available through Mendeley Data at https://data.mendeley.com/datasets/fp4bv9gtwk/2 (ref. 60 ). Although all participants consented to share their data for research purposes, not all participants agreed to share their audio publicly. Given the sensitive nature of audio speech data, we will share data with researchers who directly contact the corresponding author and provide documentation that the data will be used strictly for research purposes and will comply with the terms of our study IRB. Source data are provided with this paper.

Code availability

The code is available at https://github.com/flinkerlab/neural_speech_decoding ( https://doi.org/10.5281/zenodo.10719428 ) 61 .

Schultz, T. et al. Biosignal-based spoken communication: a survey. IEEE / ACM Trans. Audio Speech Lang. Process. 25 , 2257–2271 (2017).


Miller, K. J., Hermes, D. & Staff, N. P. The current state of electrocorticography-based brain-computer interfaces. Neurosurg. Focus 49 , E2 (2020).


Luo, S., Rabbani, Q. & Crone, N. E. Brain-computer interface: applications to speech decoding and synthesis to augment communication. Neurotherapeutics 19 , 263–273 (2022).

Moses, D. A., Leonard, M. K., Makin, J. G. & Chang, E. F. Real-time decoding of question-and-answer speech dialogue using human cortical activity. Nat. Commun. 10 , 3096 (2019).

Moses, D. A. et al. Neuroprosthesis for decoding speech in a paralyzed person with anarthria. N. Engl. J. Med. 385 , 217–227 (2021).

Herff, C. & Schultz, T. Automatic speech recognition from neural signals: a focused review. Front. Neurosci. 10 , 429 (2016).

Rabbani, Q., Milsap, G. & Crone, N. E. The potential for a speech brain-computer interface using chronic electrocorticography. Neurotherapeutics 16 , 144–165 (2019).

Angrick, M. et al. Speech synthesis from ECoG using densely connected 3D convolutional neural networks. J. Neural Eng. 16 , 036019 (2019).

Sun, P., Anumanchipalli, G. K. & Chang, E. F. Brain2Char: a deep architecture for decoding text from brain recordings. J. Neural Eng. 17 , 066015 (2020).

Makin, J. G., Moses, D. A. & Chang, E. F. Machine translation of cortical activity to text with an encoder–decoder framework. Nat. Neurosci. 23 , 575–582 (2020).

Wang, R. et al. Stimulus speech decoding from human cortex with generative adversarial network transfer learning. In Proc. 2020 IEEE 17th International Symposium on Biomedical Imaging ( ISBI ) (ed. Amini, A.) 390–394 (IEEE, 2020).

Zelinka, P., Sigmund, M. & Schimmel, J. Impact of vocal effort variability on automatic speech recognition. Speech Commun. 54 , 732–742 (2012).

Benzeghiba, M. et al. Automatic speech recognition and speech variability: a review. Speech Commun. 49 , 763–786 (2007).

Martin, S. et al. Decoding spectrotemporal features of overt and covert speech from the human cortex. Front. Neuroeng. 7 , 14 (2014).

Herff, C. et al. Towards direct speech synthesis from ECoG: a pilot study. In Proc. 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society ( EMBC ) (ed. Patton, J.) 1540–1543 (IEEE, 2016).

Angrick, M. et al. Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity. Commun. Biol 4 , 1055 (2021).

Anumanchipalli, G. K., Chartier, J. & Chang, E. F. Speech synthesis from neural decoding of spoken sentences. Nature 568 , 493–498 (2019).

Herff, C. et al. Generating natural, intelligible speech from brain activity in motor, premotor and inferior frontal cortices. Front. Neurosci. 13 , 1267 (2019).

Metzger, S. L. et al. A high-performance neuroprosthesis for speech decoding and avatar control. Nature 620 , 1037–1046 (2023).

Hsu, W.-N. et al. Hubert: self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 29 , 3451–3460 (2021).

Griffin, D. & Lim, J. Signal estimation from modified short-time fourier transform. IEEE Trans. Acoustics Speech Signal Process. 32 , 236–243 (1984).

Taal, C. H., Hendriks, R. C., Heusdens, R. & Jensen, J. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In Proc. 2010 IEEE International Conference on Acoustics, Speech and Signal Processing (ed. Douglas, S.) 4214–4217 (IEEE, 2010).

He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. 2016 IEEE Conference on Computer Vision and Pattern Recognition ( CVPR ) (ed. Bajcsy, R.) 770–778 (IEEE, 2016).

Liu, Z. et al. Swin Transformer: hierarchical vision transformer using shifted windows. In Proc. 2021 IEEE / CVF International Conference on Computer Vision ( ICCV ) (ed. Dickinson, S.) 9992–10002 (IEEE, 2021).

Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9 , 1735–1780 (1997).

Graetzer, S. & Hopkins, C. Intelligibility prediction for speech mixed with white Gaussian noise at low signal-to-noise ratios. J. Acoust. Soc. Am. 149 , 1346–1362 (2021).

Hickok, G. & Poeppel, D. The cortical organization of speech processing. Nat. Rev. Neurosci. 8 , 393–402 (2007).

Trupe, L. A. et al. Chronic apraxia of speech and Broca’s area. Stroke 44 , 740–744 (2013).

Wang, R. et al. Distributed feedforward and feedback cortical processing supports human speech production. Proc. Natl Acad. Sci. USA 120 , e2300255120 (2023).

Mugler, E. M. et al. Differential representation of articulatory gestures and phonemes in precentral and inferior frontal gyri. J. Neurosci. 38 , 9803–9813 (2018).

Herff, C. et al. Brain-to-text: decoding spoken phrases from phone representations in the brain. Front. Neurosci. 9 , 217 (2015).

Kohler, J. et al. Synthesizing speech from intracranial depth electrodes using an encoder-decoder framework. Neurons Behav. Data Anal. Theory https://doi.org/10.51628/001c.57524 (2022).

Angrick, M. et al. Towards closed-loop speech synthesis from stereotactic EEG: a unit selection approach. In Proc. 2022 IEEE International Conference on Acoustics , Speech and Signal Processing ( ICASSP ) (ed. Li, H.) 1296–1300 (IEEE, 2022).

Ozker, M., Doyle, W., Devinsky, O. & Flinker, A. A cortical network processes auditory error signals during human speech production to maintain fluency. PLoS Biol. 20 , e3001493 (2022).

Stuart, A., Kalinowski, J., Rastatter, M. P. & Lynch, K. Effect of delayed auditory feedback on normal speakers at two speech rates. J. Acoust. Soc. Am. 111 , 2237–2241 (2002).

Verwoert, M. et al. Dataset of speech production in intracranial electroencephalography. Sci. Data 9 , 434 (2022).

Berezutskaya, J. et al. Direct speech reconstruction from sensorimotor brain activity with optimized deep learning models. J. Neural Eng. 20 , 056010 (2023).

Wang, R., Wang, Y. & Flinker, A. Reconstructing speech stimuli from human auditory cortex activity using a WaveNet approach. In Proc. 2018 IEEE Signal Processing in Medicine and Biology Symposium ( SPMB ) (ed. Picone, J.) 1–6 (IEEE, 2018).

Flanagan, J. L. Speech Analysis Synthesis and Perception Vol. 3 (Springer, 2013).

Serra, X. & Smith, J. Spectral modeling synthesis: a sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Comput. Music J. 14 , 12–24 (1990).

Cogan, G. B. et al. Sensory–motor transformations for speech occur bilaterally. Nature 507 , 94–98 (2014).

Ibayashi, K. et al. Decoding speech with integrated hybrid signals recorded from the human ventral motor cortex. Front. Neurosci. 12 , 221 (2018).

Soroush, P. Z. et al. The nested hierarchy of overt, mouthed and imagined speech activity evident in intracranial recordings. NeuroImage 269 , 119913 (2023).

Tate, M. C., Herbet, G., Moritz-Gasser, S., Tate, J. E. & Duffau, H. Probabilistic map of critical functional regions of the human cerebral cortex: Broca’s area revisited. Brain 137 , 2773–2782 (2014).

Long, M. A. et al. Functional segregation of cortical regions underlying speech timing and articulation. Neuron 89 , 1187–1193 (2016).

Willett, F. R. et al. A high-performance speech neuroprosthesis. Nature 620 , 1031–1036 (2023).

Shum, J. et al. Neural correlates of sign language production revealed by electrocorticography. Neurology 95 , e2880–e2889 (2020).

Sainburg, T., Thielk, M. & Gentner, T. Q. Finding, visualizing and quantifying latent structure across diverse animal vocal repertoires. PLoS Comput. Biol. 16 , e1008228 (2020).

Roussel, P. et al. Observation and assessment of acoustic contamination of electrophysiological brain signals during speech production and sound perception. J. Neural Eng. 17 , 056028 (2020).

Boersma, P. & Van Heuven, V. Speak and unSpeak with PRAAT. Glot Int. 5 , 341–347 (2001).

Chang, E. F., Raygor, K. P. & Berger, M. S. Contemporary model of language organization: an overview for neurosurgeons. J. Neurosurgery 122 , 250–261 (2015).

Jiang, J., Chen, M. & Alwan, A. On the perception of voicing in syllable-initial plosives in noise. J. Acoust. Soc. Am. 119 , 1092–1105 (2006).

Engel, J., Hantrakul, L., Gu, C. & Roberts, A. DDSP: differentiable digital signal processing. In Proc. 8th International Conference on Learning Representations https://openreview.net/forum?id=B1x1ma4tDr (OpenReview.net, 2020).

Flanagan, J. L. A difference limen for vowel formant frequency. J. Acoust. Soc. Am. 27 , 613–617 (1955).

Schafer, R. W. & Rabiner, L. R. System for automatic formant analysis of voiced speech. J. Acoust. Soc. Am. 47 , 634–648 (1970).

Fitch, J. L. & Holbrook, A. Modal vocal fundamental frequency of young adults. Arch. Otolaryngol. 92 , 379–382 (1970).

Stevens, S. S. & Volkmann, J. The relation of pitch to frequency: a revised scale. Am. J. Psychol. 53 , 329–353 (1940).

Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proc. 3rd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) http://arxiv.org/abs/1412.6980 (arXiv, 2015).

Angrick, M. et al. Interpretation of convolutional neural networks for speech spectrogram regression from intracranial recordings. Neurocomputing 342 , 145–151 (2019).

Chen, X. ECoG_HB_02. Mendeley data, V2 (Mendeley, 2024); https://doi.org/10.17632/fp4bv9gtwk.2

Chen, X. & Wang, R. Neural speech decoding 1.0 (Zenodo, 2024); https://doi.org/10.5281/zenodo.10719428


Acknowledgements

This work was supported by the National Science Foundation under grants IIS-1912286 and 2309057 (Y.W. and A.F.) and National Institutes of Health grants R01NS109367, R01NS115929 and R01DC018805 (A.F.).

Author information

These authors contributed equally: Xupeng Chen, Ran Wang.

These authors jointly supervised this work: Yao Wang, Adeen Flinker.

Authors and Affiliations

Electrical and Computer Engineering Department, New York University, Brooklyn, NY, USA

Xupeng Chen, Ran Wang & Yao Wang

Neurology Department, New York University, Manhattan, NY, USA

Amirhossein Khalilian-Gourtani, Leyao Yu, Patricia Dugan, Daniel Friedman, Orrin Devinsky & Adeen Flinker

Biomedical Engineering Department, New York University, Brooklyn, NY, USA

Leyao Yu, Yao Wang & Adeen Flinker

Neurosurgery Department, New York University, Manhattan, NY, USA

Werner Doyle


Contributions

Y.W. and A.F. supervised the research. X.C., R.W., Y.W. and A.F. conceived research. X.C., R.W., A.K.-G., L.Y., P.D., D.F., W.D., O.D. and A.F. performed research. X.C., R.W., Y.W. and A.F. contributed new reagents/analytic tools. X.C., R.W., A.K.-G., L.Y. and A.F. analysed data. P.D. and D.F. provided clinical care. W.D. provided neurosurgical clinical care. O.D. assisted with patient care and consent. X.C., Y.W. and A.F. wrote the paper.

Corresponding author

Correspondence to Adeen Flinker.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–10, Table 1 and audio files list.

Reporting Summary

Supplementary Audio 1

Example original and decoded audio for eight words.

Supplementary Audio 2

Example original and decoded words from low-density participants.

Supplementary Audio 3

Example original and decoded words from hybrid-density participants.

Supplementary Audio 4

Example original and decoded words from left-hemisphere low-density participants.

Supplementary Audio 5

Example original and decoded words from right-hemisphere low-density participants.

Source Data Fig. 2

Data for Fig. 2a,b,d,e,f.

Source Data Fig. 3

Data for Fig. 3a,c,d.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article

Chen, X., Wang, R., Khalilian-Gourtani, A. et al. A neural speech decoding framework leveraging deep learning and speech synthesis. Nat Mach Intell (2024). https://doi.org/10.1038/s42256-024-00824-8


Received: 29 July 2023

Accepted: 08 March 2024

Published: 08 April 2024

DOI: https://doi.org/10.1038/s42256-024-00824-8



Untangling Disinformation

Using AI to detect AI-generated deepfakes can work for audio — but not always

Huo Jingnan



Artificial intelligence is supercharging audio deepfakes, with alarm bells ringing in areas from politics to financial fraud.

The federal government has banned robocalls using voices generated by AI and is offering a cash prize for solutions to mitigate harms from voice cloning frauds. At the same time, researchers and the private sector are racing to develop software to detect voice clones, with companies often marketing them as fraud-detection tools.

The stakes are high. When detection software gets it wrong, the consequences can be serious.


"If we label a real audio as fake, let's say, in a political context, what does that mean for the world? We lose trust in everything," says Sarah Barrington, an AI and forensics researcher at the University of California, Berkeley.

"And if we label fake audios as real, then the same thing applies. We can get anyone to do or say anything and completely distort the discourse of what the truth is."

As deepfake generation technology improves and leaves ever-fewer telltale signs that humans can rely on, computational methods for detection are becoming the norm.

But an NPR experiment indicated that technological solutions are no silver bullet for the problem of detecting AI-generated voices.

Probably yes? Probably not

NPR identified three deepfake audio detection providers — Pindrop Security, AI or Not and AI Voice Detector. Most claim their tools are over 90% accurate at differentiating between real audio and AI-generated audio. Pindrop only works with businesses, while the others are available for individuals to use.


NPR submitted 84 clips of five to eight seconds to each provider. About half of the clips were snippets of real radio stories from three NPR reporters. The rest were cloned voices of the same reporters saying the same words as in the authentic clips.

The voice clones were generated by technology company PlayHT. To clone each voice, NPR submitted four 30-second clips of audio — one snippet of a previously aired radio story of each reporter and one recording made for this purpose.

Our experiment revealed that the detection software often failed to identify AI-generated clips, or misidentified real voices as AI-generated, or both. While Pindrop Security's tool got all but three samples correct, AI or Not's tool got about half wrong, failing to catch most of the AI-generated clips.
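To make tallies like "all but three correct" or "about half wrong" transparent, here is a minimal sketch of how such counts are produced once each clip's ground truth and the detector's verdict are known. The clip entries below are invented placeholders, not NPR's data.

```python
# Hypothetical tally of detector verdicts against ground truth, in the
# spirit of NPR's 84-clip experiment. Only the counting logic is real;
# the entries in `results` are made up.
from collections import Counter

# One (is_ai_generated, detector_said_ai) pair per submitted clip.
results = [
    (True, True), (True, False), (False, False), (False, True),
    # ... one entry per clip
]

tally = Counter()
for is_ai, said_ai in results:
    if is_ai and said_ai:
        tally["caught AI clip"] += 1
    elif is_ai and not said_ai:
        tally["missed AI clip"] += 1
    elif not is_ai and said_ai:
        tally["real voice flagged as AI"] += 1
    else:
        tally["real voice passed"] += 1

wrong = tally["missed AI clip"] + tally["real voice flagged as AI"]
print(tally, f"-> {wrong} of {len(results)} clips wrong")
```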

The verdicts these companies provide aren't just a binary yes or no. They give their results in the form of probabilities between 0% and 100%, indicating how likely it is that the audio was generated by AI.


AI Voice Detector's CEO, Abdellah Azzouzi, told NPR in an interview that if the model predicts that a clip was 60% or more likely to be generated by AI, then it considers the clip AI-generated. Under this definition, the tool wrongly identified 20 out of the 84 samples NPR submitted.

AI Voice Detector updated its website after the interview. While the probability percentages for most previously tested clips remained the same, they now include an additional note laying out a new way of interpreting those results. Clips flagged as 80% or more are now deemed "highly likely to be generated by AI." Those scoring between 20% and 80% are "inconclusive." Clips rated less than 20% are "highly likely to be real."
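As a concrete reading of those bands, the sketch below maps a reported probability onto the three verdicts described above. The function name and the exact handling of the boundary values are assumptions for illustration; the cut-offs themselves are only those quoted for AI Voice Detector's updated guidance, and other vendors use different thresholds.

```python
# Map a detector's reported probability (0-100) that a clip is AI-generated
# onto the verdict bands quoted above. Illustrative only; boundary handling
# at exactly 20 and 80 is an assumption.
def interpret_score(prob_ai: float) -> str:
    if prob_ai >= 80:
        return "highly likely to be generated by AI"
    if prob_ai < 20:
        return "highly likely to be real"
    return "inconclusive"

# A clip scored at 60% would have been called AI-generated under the old
# single 60% threshold, but falls in the inconclusive band here.
print(interpret_score(60.0))  # -> inconclusive
```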


In an email, the company did not respond to NPR's questions about why the thresholds changed, saying only that it is "always updating our services to offer the best to those who trust us." The company also removed the claim from its website that the tool was more than 90% accurate.

Under these revised definitions, AI Voice Detector's tool got five of the clips NPR submitted wrong and returned inconclusive results for 32 clips.

The other providers also report their results as probabilities, but they did not mark any results as inconclusive.

Using AI to catch AI

While NPR's anecdotal experiment is not a formal test or academic study, it highlights some challenges in the tricky business of deepfake detection.


Detection technologies often involve training machine learning models. Because machine learning is the core technology behind today's AI systems, people also call this approach "using AI to detect AI."

Barrington has both tested various detection methods and developed one with her team. Researchers curate a dataset of real audio and fake audio, transforming each into a series of numbers that are fed into the computer to analyze. The computer then finds the patterns humans cannot see to distinguish the two.

"Things like in the frequency domain, or very sort of small differences between audio signals and the noise, and things that we can't hear but to a computer are actually quite obvious," says Barrington.

Amit Gupta, head of product at Pindrop Security, says one of the things their algorithm does when evaluating a piece of audio is to reverse-engineer the vocal tract — the actual physical properties of a person's body — that would be needed to produce the sound. They called one fraudster's voice that they caught "Giraffe Man."


"When you hear the sequence of sound from that fraudster, it is only possible for a vocal tract where a human had a 7-foot-long neck," Gupta says. "Machines don't have a vocal tract. ... And that's where they make mistakes."

Anatoly Kvitnitsky, CEO of AI or Not, says his company trains its machine learning model based on clients' specific use cases. As a result, he said, the general-use model the public has access to is not as accurate.

"The format is a little bit different depending on if it's a phone call ... if it's a YouTube video. If it's a Spotify song, or TikTok video. All of those formats leave a different kind of trace."


While often better at detecting fake audio than people, machine learning models can easily be stumped in the wild. Accuracy can drop if the audio is degraded or contains background noise. Model makers need to train their detectors on every new AI audio generator on the market to detect the subtle differences between them and real people. With new deepfake models being released frequently and open source models becoming available for everyone to tweak and use, it's a game of whack-a-mole.

After NPR told AI or Not which provider it used to generate the deepfake audio clips, the company released an updated detection model that returned better results. It caught most of the AI clips but also misidentified more real voices as AI. The tool could not process some of the other clips and returned error messages.

What's more, all of these accuracy rates only pertain to English-language audio. Machine learning models need to analyze real and fake audio samples from each language to tell the difference between them.


While there seems to be an arms race between deepfake voice generators and deepfake voice detectors, Barrington says it's important for the two sides to work together to make detection better.

ElevenLabs, whose technology was used to create the audio for the deepfake Biden robocall, has a publicly available tool that detects its own product. Previously, the website claimed that the tool also detects audio generated by other providers, but independent research has shown poor results. PlayHT says a tool to detect AI voices — including its own — is still under development.

Detection at scale isn't there yet

Tech giants including major social media companies such as Meta, TikTok and X have expressed their interest in "developing technology to watermark, detect and label realistic content that's been created with AI." Most platforms' efforts seem to focus more on video, and it's unclear whether that would include audio, says Katie Harbath, chief global affairs officer at Duco Experts, a consultancy on trust and safety.


In March, YouTube announced that it would require content creators to self-label some videos made with generative AI before uploading them. This follows similar steps from TikTok. Meta says it will also roll out labeling on Facebook and Instagram, using watermarks from companies that produce generative AI content.

Barrington says specific algorithms could detect deepfakes of world leaders whose voices are well known and documented, such as President Biden. That won't be the case for people who are less well known.

"What people should be very careful about is the potential for deepfake audio in down-ballot races," Harbath says. With less local journalism and with fact-checkers at capacity, deepfakes could cause disruption.


As for scam calls impersonating loved ones, there's no high-tech detection that flags them. You and your family can agree in advance on questions a scammer wouldn't know the answers to, and the FTC recommends calling back to make sure the call was not spoofed.

"Anyone who says 'here's an algorithm,' just, you know, a web browser plug-in, it will tell you yes or no — I think that's hugely misleading," Barrington says.

