Goals and Objectives - ICSI


What are the Essential Cues for Understanding Spoken Language?
Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704
http://www.icsi.berkeley.edu/~steveng
[email protected]

No Scientist is an Island
IMPORTANT COLLEAGUES
ACOUSTIC BASIS OF SPEECH INTELLIGIBILITY: Takayuki Arai, Joy Hollenback, Rosaria Silipo
AUDITORY-VISUAL INTEGRATION FOR SPEECH PROCESSING: Ken Grant
AUTOMATIC SPEECH RECOGNITION AND FEATURE CLASSIFICATION: Shawn Chang, Lokendra Shastri, Mirjam Wester
STATISTICAL ANALYSIS OF PRONUNCIATION VARIATION: Eric Fosler, Leah Hitchcock, Joy Hollenback

Germane Publications

STATISTICAL PROPERTIES OF SPOKEN LANGUAGE AND PRONUNCIATION MODELING
Fosler-Lussier, E., Greenberg, S. and Morgan, N. (1999) Incorporating contextual phonetics into automatic speech recognition. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Greenberg, S. (1997) On the origins of speech intelligibility in the real world. Proceedings of the ESCA Workshop on Robust Speech Recognition for Unknown Communication Channels, Pont-a-Mousson, France, pp. 23-32.
Greenberg, S. (1999) Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation. Speech Communication, 29, 159-176.
Greenberg, S. and Fosler-Lussier, E. (2000) The uninvited guest: Information's role in guiding the production of spontaneous speech. Proceedings of the CREST Workshop on Models of Speech Production: Motor Planning and Articulatory Modelling, Kloster Seeon, Germany.
Greenberg, S., Hollenback, J. and Ellis, D. (1996) Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proceedings of the International Conference on Spoken Language Processing (ICSLP), Philadelphia, pp. S24-27.

AUTOMATIC PHONETIC TRANSCRIPTION AND ACOUSTIC FEATURE CLASSIFICATION
Chang, S., Greenberg, S. and Wester, M. (2001) An elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Chang, S., Shastri, L. and Greenberg, S. (2000) Automatic phonetic transcription of spontaneous speech (American English). Proceedings of the International Conference on Spoken Language Processing, Beijing.
Shastri, L., Chang, S. and Greenberg, S. (1999) Syllable segmentation using temporal flow model neural networks. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Wester, M., Greenberg, S. and Chang, S. (2001) A Dutch treatment of an elitist approach to articulatory-acoustic feature classification. 7th European Conference on Speech Communication and Technology (Eurospeech-2001).

PERCEPTUAL BASES OF SPEECH INTELLIGIBILITY
Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony. IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.
Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.
Greenberg, S. and Arai, T. (2001) The relation between speech intelligibility and the complex modulation spectrum. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. Proceedings of Eurospeech, Budapest.

AUDITORY-VISUAL SPEECH PROCESSING
Grant, K. and Greenberg, S. (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. Submitted to the ISCA Workshop on Audio-Visual Speech Processing (AVSP-2001).

PROSODIC STRESS ACCENT: AUTOMATIC CLASSIFICATION AND CHARACTERIZATION
Hitchcock, L. and Greenberg, S. (2001) Vowel height is intimately associated with stress accent in spontaneous American English discourse. Submitted to the 7th European Conference on Speech Communication and Technology (Eurospeech-2001).
Silipo, R. and Greenberg, S. (1999) Automatic transcription of prosodic stress for spontaneous English discourse. Proceedings of the 14th International Congress of Phonetic Sciences, San Francisco.
Silipo, R. and Greenberg, S. (2000) Prosodic stress revisited: Reassessing the role of fundamental frequency. Proceedings of the NIST Speech Transcription Workshop, College Park, MD.
Silipo, R. and Greenberg, S. (2000) Automatic detection of prosodic stress in American English.

PROLOGUE: The Central Challenge for Models of Speech Recognition

Language - The Traditional Perspective
The classical view of spoken language posits a quasi-arbitrary relation between the lower and higher tiers of linguistic organization.

The Serial Frame Perspective on Speech
Traditional models of speech recognition assume that the identity of a phonetic segment depends on the detailed spectral profile of the acoustic signal

for a given (usually 25-ms) frame of speech.

Language - A Syllable-Centric Perspective
A more empirical perspective on spoken language focuses on the syllable as the interface between sound and meaning. Within this framework the relationship between the syllable and the higher and lower tiers is non-arbitrary and statistically systematic.

Lines of Evidence

Take Home Messages
Segmentation is crucial for understanding spoken language - at the level of the phrase, the word, the syllable and the phonetic segment. But this linguistic segmentation is inherently fuzzy, as is the spectral information associated with each linguistic tier. The low-frequency (3-25 Hz) modulation spectrum is a crucial acoustic (and possibly visual) parameter associated with intelligibility; it provides segmentation information that unites the phonetic segment with the syllable (and possibly the word and beyond). Many properties of spontaneous spoken language differ from those of laboratory and citation speech. There are systematic patterns in real speech that potentially reveal underlying principles of linguistic organization.
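As a rough illustration of how the low-frequency modulation spectrum invoked above can be computed, here is a minimal sketch (not the analysis pipeline actually used in this work; the sampling rate, band edges and filter design are illustrative assumptions):

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def modulation_spectrum(signal, fs, band=(300.0, 3400.0), mod_max_hz=25.0):
    """Estimate the low-frequency modulation spectrum of a speech signal.

    1. Band-pass the signal to a region of interest (assumed band edges).
    2. Extract the amplitude envelope via the Hilbert transform.
    3. Take the envelope's power spectrum and keep only the 0-25 Hz range.
    """
    signal = np.asarray(signal, dtype=float)
    # 1. Band-pass filter (4th-order Butterworth, an assumed design choice)
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    narrow = filtfilt(b, a, signal)

    # 2. Amplitude envelope
    env = np.abs(hilbert(narrow))
    env -= env.mean()                    # remove DC so 0 Hz doesn't dominate

    # 3. Power spectrum of the envelope, restricted to the modulation range
    spec = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    keep = freqs <= mod_max_hz
    return freqs[keep], spec[keep]

# Synthetic example: a 4 Hz amplitude-modulated tone shows a modulation peak
# near 4 Hz, roughly where the syllable rate of natural speech lies.
fs = 8000
t = np.arange(0, 2.0, 1.0 / fs)
tone = np.sin(2 * np.pi * 1000 * t) * (1.0 + 0.8 * np.sin(2 * np.pi * 4 * t))
freqs, spec = modulation_spectrum(tone, fs)
print("peak modulation frequency: %.1f Hz" % freqs[np.argmax(spec)])
```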

The Central Importance of the Modulation Spectrum and the Syllable for Understanding Spoken Language

Effects of Reverberation on the Speech Signal
Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions.

Effects of Reverberation on the Speech Signal
Reflections from walls and other surfaces routinely modify the temporal and modulation spectral properties of the speech signal. The modulation spectrum's peak is attenuated and shifted down to ca. 2 Hz. [Based on an illustration by Hynek Hermansky]

Modulation Spectrum Computation

The Modulation Spectrum Reflects Syllables
The peak in the distribution of syllable duration is close to the mean - 200 ms. The syllable-duration distribution is very close to that of the modulation spectrum, suggesting that the modulation spectrum reflects syllables.

The Ability to Understand Speech Under Reverberant Conditions (Spectral Asynchrony)

Spectral Asynchrony - Method
The output of quarter-octave frequency bands is quasi-randomly time-shifted relative to a common reference. The maximum shift interval ranged between 40 and 240 ms (in 20-ms steps); the mean shift interval is half of the maximum interval. Adjacent channels are separated by a minimum of one-quarter of the maximum shift range.

Stimuli
40 TIMIT sentences, e.g. "She washed his dark suit in greasy dish water all year."

Spectral Asynchrony - Paradigm
The magnitude of energy in the 3-6 Hz region of the modulation spectrum is computed for each (4- or 7-channel) sub-band as a function of spectral asynchrony. The modulation spectrum magnitude is relatively unaffected by asynchronies of 80 ms or less (open symbols), but is appreciably diminished for asynchronies of 160 ms or more. Is intelligibility correlated with the reduction in the 3-6 Hz modulation spectrum?

Intelligibility and Spectral Asynchrony
Speech intelligibility does appear to be roughly correlated with the energy in the modulation spectrum between 3 and 6 Hz. The correlation varies depending on the sub-band and the degree of spectral asynchrony.

Spectral Asynchrony - Summary
Speech is capable of withstanding a high degree of temporal asynchrony across frequency channels. This form of cross-spectral asynchrony is similar to the effects of many common forms of acoustic reverberation. Speech intelligibility remains high (>75%) until the (maximum) asynchrony exceeds 140 ms. The magnitude of the low-frequency (3-6 Hz) modulation spectrum is highly correlated with speech intelligibility.
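A minimal sketch of the desynchronization manipulation described above (the filter-bank design and shift statistics are simplified assumptions; the actual stimuli used quarter-octave bands and constrained adjacent channels to differ by at least one-quarter of the maximum shift, which this sketch omits):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def desynchronize(signal, fs, max_shift_ms=160.0, n_bands=16,
                  f_lo=300.0, f_hi=3400.0, seed=0):
    """Quasi-randomly time-shift frequency channels relative to one another.

    The spectrum between f_lo and f_hi is split into n_bands logarithmically
    spaced band-pass channels; each channel is delayed by a random amount
    drawn from [0, max_shift_ms], then the channels are summed back together.
    """
    signal = np.asarray(signal, dtype=float)
    rng = np.random.default_rng(seed)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # log-spaced band edges
    out = np.zeros_like(signal)
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = filtfilt(b, a, signal)
        shift = int(rng.uniform(0, max_shift_ms) * fs / 1000.0)
        out[shift:] += band[:len(band) - shift]      # delay this channel
    return out
```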

Understanding Spoken Language Under Very Sparse Spectral Conditions

A Flaw in the Spectral Asynchrony Study
Of the 448 possible combinations of four slits across the spectrum (where one slit is present in each of the 4 sub-bands), ca. 10% (i.e., 45) exhibit a coefficient of variation less than 10%. Thus, the seeming temporal tolerance of the auditory system may be illusory (if listeners can decode the speech signal using information from only a small number of channels distributed across the spectrum). [Figures: intelligibility of spectrally desynchronized speech; distribution of channel asynchrony]

Spectral Slit Paradigm
Can listeners decode spoken sentences using just four narrow (1/3-octave) channels ("slits") distributed across the spectrum? The edge of each slit was separated from its nearest neighbor by an octave. The modulation pattern for each slit differs from that of the others. The four-slit compound waveform looks very similar to the full-band signal.

Word Intelligibility - Single Slits
The intelligibility associated with any single slit is only 2 to 9%. The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits.
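A sketch of how such a four-slit compound stimulus might be constructed (the centre frequencies are illustrative assumptions, chosen only so that the slits are spectrally isolated; they are not the values used in the study):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def third_octave_slit(signal, fs, fc):
    """Extract a 1/3-octave wide band ("slit") centred on fc."""
    lo, hi = fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)   # 1/3-octave band edges
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, np.asarray(signal, dtype=float))

def four_slit_compound(signal, fs, centers=(375, 750, 1500, 3000)):
    """Sum four spectrally isolated slits into a sparse compound waveform.
    Centre frequencies are assumed values for illustration only."""
    return sum(third_octave_slit(signal, fs, fc) for fc in centers)
```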

Word Intelligibility - Road Map
1. Intelligibility as a function of the number of slits (from one to four). [Slides: word intelligibility for 1, 2, 3 and 4 slits]
2. Intelligibility for different combinations of two-slit compounds. The two center slits yield the highest intelligibility. [Slides: the two-slit combinations]
3. Intelligibility for different combinations of three-slit compounds. Combinations with one or two center slits yield the highest intelligibility. [Slides: the three-slit combinations]
4. Four slits yield nearly (but not quite) perfect intelligibility of ca. 90%. This maximum level of intelligibility makes it possible to deduce the specific contribution of each slit by itself and in combination with others.

Spectral Slits - Summary

A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language. An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility.

Modulation Spectrum Across Frequency
The modulation spectrum varies in magnitude across frequency. The shape of the modulation spectrum is similar for the three lowest slits, but the highest-frequency slit differs from the rest in exhibiting a far greater amount of energy in the mid modulation frequencies.

Word Intelligibility - Single Slits
The intelligibility associated with any single slit ranges between 2 and 9%, suggesting that the shape and magnitude of the modulation spectrum per se is NOT the controlling variable for intelligibility.

Spectral Slits - Summary

A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language. An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility. The magnitude component of the modulation spectrum does not appear to be the controlling variable for intelligibility.

The Effect of Desynchronizing Sparse Spectral Information on Speech Intelligibility

Modulation Spectrum Across Frequency
Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility.

Spectral Slits - Summary
Even small amounts of asynchrony (>25 ms) imposed on spectral slits can result in significant degradation of intelligibility. Asynchrony greater than 50 ms has a profound impact on intelligibility.

Intelligibility and Slit Asynchrony

Spectral Slits - Summary
A detailed spectro-temporal analysis of the speech signal is not required to understand spoken language. An exceedingly sparse spectral representation can, under certain circumstances, yield nearly perfect intelligibility. The magnitude component of the modulation spectrum does not appear to be the controlling variable for intelligibility.

Spectral Slits - Summary

Small amounts of asynchrony (>25 ms) imposed on spectral slits can result in significant degradation of intelligibility. Asynchrony greater than 50 ms has a profound impact on intelligibility. Intelligibility progressively declines with greater amounts of asynchrony up to an asymptote of ca. 250 ms. Beyond asynchronies of 250 ms intelligibility IMPROVES, but the amount of improvement depends on individual factors. Such results are NOT inconsistent with the high intelligibility of desynchronized full-spectrum speech, but rather imply that the auditory system is capable of extracting phonetically important information from a relatively small proportion of spectral channels. BOTH the amplitude and phase components of the modulation spectrum are extremely important for speech intelligibility. The modulation phase is of particular importance for cross-spectral integration of phonetic information.

Speech Intelligibility Derived from Asynchronous Presentation of Auditory and Visual Information

Auditory-Visual Integration of Speech
Video of spoken (Harvard/IEEE) sentences, presented in tandem with a sparse spectral representation (low- and high-frequency slits).

Auditory-Visual Integration - Mean Intelligibility (9 subjects)
When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals. When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms.

Auditory-Visual Integration - by Individual Subjects
A lagging video signal is often better than synchronous presentation; there is variation across subjects.

Audio-Video Integration Summary
Sparse audio and speech-reading information provide minimal intelligibility when presented alone, but can provide good intelligibility when combined. When the audio signal leads the video, intelligibility falls off rapidly as a function of onset asynchrony. When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms. The dynamics of the video appear to be combined with the dynamics associated with the audio to provide good intelligibility. The dynamics associated with the video signal are probably most closely associated with place-of-articulation information. The implication is that place information has a long time constant of ca. 200 ms and appears linked to the syllable.

Perceptual Evidence for the Spectral Origin of Articulatory-Acoustic Features

Spectral Slit Paradigm
Signals were CV and VC nonsense syllables (from CUNY).

Consonant Recognition - Single Slits
[Slides: consonant recognition results for 1, 2, 3, 4 and 5 slits, and for the various two-, three-, four- and five-slit combinations]

Articulatory - Feature Analysis

The consonant recognition results can be scored in terms of articulatory features correct. When the accuracy of the features is scored relative to the accuracy of consonant recognition, an interesting pattern emerges: certain features (place and manner) appear to be highly correlated with consonant recognition performance, while the voicing and rounding features are less highly correlated.

Correlation - AFs/Consonant Recognition
Consonant recognition is almost perfectly correlated with place-of-articulation performance. This correlation suggests that the place feature is based on cues distributed across the entire speech spectrum, in contrast to features such as voicing and rounding, which appear to be extracted from a narrower band of the spectrum. Manner is also highly correlated with consonant recognition, implying that this feature is extracted from a fairly broad portion of the spectrum.
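The feature-based rescoring just described can be illustrated with a small sketch (the feature table below is a truncated, assumed fragment for a few consonants, not the inventory used in the study):

```python
# Minimal articulatory-feature table (assumed values, illustration only).
FEATURES = {
    "p": dict(place="bilabial", manner="stop",      voicing="voiceless"),
    "b": dict(place="bilabial", manner="stop",      voicing="voiced"),
    "t": dict(place="alveolar", manner="stop",      voicing="voiceless"),
    "s": dict(place="alveolar", manner="fricative", voicing="voiceless"),
    "m": dict(place="bilabial", manner="nasal",     voicing="voiced"),
}

def feature_accuracy(pairs):
    """Score (presented, responded) consonant pairs per articulatory feature.

    A response can be wrong as a segment yet correct on individual features:
    hearing /p/ as /b/ preserves place and manner, missing only voicing.
    """
    totals = {f: 0 for f in ("place", "manner", "voicing")}
    for presented, responded in pairs:
        for feat in totals:
            totals[feat] += FEATURES[presented][feat] == FEATURES[responded][feat]
    n = len(pairs)
    return {feat: 100.0 * c / n for feat, c in totals.items()}

# Toy confusions (invented): /p/->/b/ and /t/->/s/ errors, /m/ correct.
print(feature_accuracy([("p", "b"), ("t", "s"), ("m", "m")]))
# -> place 100%, manner ~66.7%, voicing ~66.7%
```

The toy confusions show the point: a listener who hears /p/ as /b/ is wrong on the segment but right on place and manner, so feature scores can exceed segment scores.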

Phonetic Transcription of Spontaneous (American) English

Phonetic Transcription of Spontaneous English
Telephone dialogues of 5-10 minutes duration - SWITCHBOARD.
Amount of material manually transcribed: 4 hours labeled at the phone level and segmented at the syllabic level (this material was later phonetically segmented by automatic methods); 1 hour labeled and segmented at the phonetic-segment level.
Diversity of material transcribed: speech of both genders (ca. 50/50%), reflecting a wide range of American dialectal variation (6 regions + army brat), speaking rate and voice quality.
Transcribed by whom? 11 undergraduates and 1 graduate student, all enrolled at UC Berkeley. Most of the corpus was transcribed by four individuals out of the twelve.

Supervised by Steven Greenberg and John Ohala.
Transcription system: a variant of Arpabet, with phonetic diacritics such as _gl, _cr, _fr, _n, _vl, _vd.
How long does transcription take? (Don't ask!) 388 times real time for labeling and segmentation at the phonetic-segment level; 150 times real time for labeling phonetic segments and segmenting syllables.
How was labeling and segmentation performed? Using a display of the signal waveform, spectrogram, word transcription and forced alignments (estimates of phones and boundaries) + audio (listening at multiple time scales - phone, word, utterance) on Sun workstations.
Data available at http://www.icsi.berkeley.edu/real/stp

Phonetic Transcription
[Screen shot: what the speech material typically looks like to a transcriber]

A Brief Tour of Pronunciation Variation in Spontaneous American English

How Many Pronunciations of "and"?
In the transcribed material, "and" was observed with 87 distinct pronunciations. The ten most frequent variants (with token counts) are:

  N   Pronunciation
  82  ae n
  63  eh n
  45  ix n
  35  ax n
  34  en
  30  n
  20  ae n dcl
  17  ih n
  17  q ae n
  11  ae n d

The remaining variants (e.g. [q eh n], [ah n], [eh nx], [uh n], [ix nx]) each occur seven times or fewer, with a long tail of pronunciations observed only once or twice.

How Many Different Pronunciations? (Ranks 1-20)
(N = token count; #Pron = number of distinct pronunciations; MCP = most common pronunciation; %Total = percentage of the word's tokens receiving the MCP)

  Rank  Word   N    #Pron  MCP             MCP %Total
    1   I      649    53   ay                  53
    2   and    521    87   ae n                16
    3   the    475    76   dh ax               27
    4   you    406    68   y ix                20
    5   that   328   117   dh ae               11
    6   a      319    28   ax                  64
    7   to     288    66   tcl t uw            14
    8   know   249    34   n ow                56
    9   of     242    44   ax v                21
   10   it     240    49   ih                  22
   11   yeah   203    48   y ae                43
   12   in     178    22   ih n                45
   13   they   152    28   dh ey               60
   14   do     131    30   dcl d uw            54
   15   so     130    14   s ow                74
   16   but    123    45   bcl b ah tcl t      12
   17   is     120    24   ih z                50
   18   like   119    19   l ay kcl k          46
   19   have   116    22   hh ae v             54
   20   was    111    24   w ah z              23

The 20 most frequent words account for 35% of the tokens.
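Statistics of this kind are straightforward to derive from a transcribed corpus; a minimal sketch (the (word, pronunciation) pair format is an assumption about how the transcripts are stored):

```python
from collections import Counter, defaultdict

def pronunciation_stats(tokens):
    """Tally pronunciation variants per word.

    `tokens` is assumed to be an iterable of (word, pronunciation) pairs,
    where the pronunciation is a string of Arpabet-style phone labels,
    e.g. ("and", "ae n"). Returns, per word: token count, number of
    distinct variants, the most common pronunciation (MCP), and the
    percentage of that word's tokens the MCP accounts for.
    """
    variants = defaultdict(Counter)
    for word, pron in tokens:
        variants[word][pron] += 1

    stats = {}
    for word, counts in variants.items():
        n = sum(counts.values())
        mcp, mcp_n = counts.most_common(1)[0]
        stats[word] = dict(n=n, n_pron=len(counts), mcp=mcp,
                           mcp_pct=100.0 * mcp_n / n)
    return stats

# Toy usage with invented tokens (not corpus data):
demo = [("and", "ae n"), ("and", "eh n"), ("and", "ae n"), ("the", "dh ax")]
for word, s in pronunciation_stats(demo).items():
    print(word, s)
```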

How Many Different Pronunciations? (Ranks 21-40)

  Rank  Word    N    #Pron  MCP             MCP %Total
   21   we      108    13   w iy                83
   22   it's    101    14   ih tcl s            20
   23   just    101    34   jh ix s             17
   24   on       98    18   aa n                49
   25   or       94    23   er                  36
   26   not      92    24   n aa q              24
   27   think    92    23   th ih ng kcl k      32
   28   for      87    19   f er                46
   29   well     84    49   w eh l              23
   30   what     82    40   w ah dx             14
   31   about    77    46   ax bcl b aw         12
   32   all      74    27   ao l                24
   33   that's   74    19   dh eh s             16
   34   oh       74    17   ow                  61
   35   really   71    25   r ih l iy           45
   36   one      69     8   w ah n              78
   37   are      68    19   er                  42
   38   I'm      67     9   q aa m              26
   39   right    61    21   r ay                28
   40   uh       60    16   ah                  41

The 40 most frequent words account for 45% of the tokens.

How Many Different Pronunciations? (Ranks 41-60)

  Rank  Word      N    #Pron  MCP               MCP %Total
   41   them      60    18   ax m                  23
   42   at        59    36   ae dx                  8
   43   there     58    28   dh eh r               22
   44   my        58     9   m ay                  66
   45   mean      56    10   m iy n                58
   46   don't     56    21   dx ow                 14
   47   no        55     8   n ow                  77
   48   with      55    20   w ih th               35
   49   if        55    18   ih f                  41
   50   when      54    18   w eh n                31
   51   can       54    28   kcl k ae n            15
   52   then      51    19   dh eh n               38
   53   be        50    11   bcl b iy              76
   54   as        49    16   ae z                  18
   55   out       47    19   ae dx                 22
   56   kind      47    17   kcl k ax nx           21
   57   because   46    31   kcl k ax z            15
   58   people    45    21   pcl p iy pcl l el     44
   59   go        45     5   gcl g ow              83
   60   got       45    32   gcl g aa              15

The 60 most frequent words account for 55% of the tokens.

How Many Different Pronunciations? (Ranks 61-80)

  Rank  Word     N    #Pron  MCP           MCP %Total
   61   this     44    11   dh ih s           47
   62   some     43     4   s ah m            48
   63   would    41    16   w ih dcl          29
   64   things   41    15   th ih ng z        52
   65   now      39    11   n aw              69
   66   lot      39     9   l aa dx           47
   67   had      39    19   hh ae dcl         24
   68   how      39    11   hh aw             53
   69   good     38    13   gcl g uh dcl      27
   70   get      38    20   gcl g eh dx       13
   71   see      37     6   s iy              80
   72   from     36    10   f r ah m          28
   73   he       36     7   iy                39
   74   me       35     5   m iy              87
   75   don't    35    21   dx ow             14
   76   their    33    19   dh eh r           25
   77   more     32    11   m ao r            56
   78   it's     31    14   ih tcl s          20
   79   that's   31    20   dh eh s           16
   80   too      31     6   tcl t uw          60

The 80 most frequent words account for 62% of the tokens.

How Many Different Pronunciations? (Ranks 81-100)

  Rank  Word    N    #Pron  MCP             MCP %Total
   81   okay    31    17   ow kcl k ey         45
   82   very    30    11   v eh r iy           36
   83   up      30    11   ah pcl p            34
   84   been    30    11   bcl b ih n          51
   85   guess   29     8   gcl g eh s          42
   86   time    29     8   tcl t ay m          62
   87   going   29    21   gcl g ow ih ng      13
   88   into    28    20   ih n tcl t uw       14
   89   those   27    12   dh ow z             42
   90   here    27    11   hh iy er            25
   91   did     27    13   dcl d ih dx         23
   92   work    25     8   w er kcl k          66
   93   other   25    14   ah dh er            26
   94   an      25    12   ax n                28
   95   I've    25     7   ay v                46
   96   thing   24     9   th ih ng            52
   97   even    24     7   iy v ix n           40
   98   our     23     9   aa r                33
   99   any     23    11   ix n iy             23
  100   we're   23     8   w ey r              25

The 100 most frequent words account for 67% of the tokens.

English Syllable Structure is (sort of) Like Japanese
Most syllables are simple in form (no consonant clusters): 87% of the pronunciations are simple syllabic forms, and 84% of the canonical corpus is composed of

simple syllabic forms. [Bar chart: percent (0-50%) of syllable tokens by syllable type (CV, CVC, VC, V), comparing Corpus (canonical representation) against Pronunciation (actual pronunciation). C = consonant, V = vowel; examples: CV "go", CVC "cat", VC "of", V "a". Coda consonants tend to drop. n = 103,054.]
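A sketch of how syllable tokens can be mapped to these structural types (the vowel inventory is simplified and the handling of stop-closure labels such as "kcl" is omitted; both are assumptions):

```python
# Simplified/assumed nucleus inventory; real Arpabet has more symbols, and
# syllabic consonants (en, em, el) are treated here as nuclei.
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh", "er", "ey",
          "ih", "ix", "iy", "ow", "oy", "uh", "uw", "en", "em", "el"}

def syllable_shape(phones):
    """Map a syllable's phone string to its structural type, e.g.
    'k ae t' -> 'CVC', 's t ow' -> 'CCV'. Stop-closure labels like 'kcl'
    would need to be merged with their release for real STP transcripts."""
    return "".join("V" if p in VOWELS else "C" for p in phones.split())

def is_simple(shape):
    """Simple forms have no consonant clusters: CV, CVC, VC, V."""
    return shape in {"CV", "CVC", "VC", "V"}

print(syllable_shape("k ae t"), is_simple(syllable_shape("k ae t")))  # CVC True
print(syllable_shape("s t ow"), is_simple(syllable_shape("s t ow")))  # CCV False
```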

Complex Syllables ARE Important (Though)
There are many complex syllable forms (consonant clusters), but all occur relatively infrequently. Thus, despite English's reputation for complex syllabic forms, only ca. 15% of the syllable tokens are actually complex. Complex syllables tend to be part of noun phrases (nouns or adjectives). [Bar chart of complex syllable types. C = consonant, V = vowel; examples: CVCC "fifth", VCC "ounce", CCV "stow", CCVC "stoop", CCVCC "stops", CCCVCC "strength". Coda consonants tend to drop. n = 17,760.]

Syllable-Centric Pronunciation Patterns
Codas tend to be pronounced

canonically more frequently in formal speech than in spontaneous dialogues. Onsets are pronounced canonically far more often than nuclei or codas. [Bar chart: percent canonically pronounced (50-100%) by syllable position (onset, nucleus, coda) for TIMIT (read sentences) and STP (spontaneous speech); example: "cat" [k ae t], where [k] = onset, [ae] = nucleus, [t] = coda. n = 120,814.]

Complex Onsets are Highly Canonical
Complex onsets are pronounced more canonically than simple onsets, despite the greater potential for deviation from the standard pronunciation.

COMPLEX onsets contain TWO or MORE consonants. [Bar chart: percent canonically pronounced (70-100%) by syllable onset type (simple (C) vs. complex (CC(C))) for TIMIT (read sentences) and STP (spontaneous speech).]

Speaking Style Affects Syllable Codas
COMPLEX codas contain TWO or MORE consonants. Codas are much more likely to be realized canonically in formal than in spontaneous speech. [Bar chart: percent canonically pronounced (50-85%) by syllable coda type (all, simple, complex) for TIMIT (read sentences) and STP (spontaneous phone dialogues).]

Complex Onsets (but not Codas) Affect Nuclei
The presence of a syllable onset has a substantial impact on the realization of the nucleus. [Bar chart: percent canonically pronounced (50-70%) for nuclei (all, with onset, without onset, with coda, without coda) for TIMIT (read sentences) and STP (spontaneous phone dialogues).]

Syllable-Centric Articulatory Feature Analysis
Place of articulation deviates most in nucleus position. Manner of articulation deviates most in onset and coda position. Voicing deviates most in coda position. (Phonetic deviation along a SINGLE feature.) Place is VERY unstable in nucleus position. Place deviates very little from canonical form in the onset and coda; it is a STABLE articulatory feature in these positions.

Articulatory PLACE Feature Analysis
Place of articulation is a dominant feature in nucleus position only; it drives the feature deviation in the nucleus for manner and rounding. (Phonetic deviation across SEVERAL features.) Place carries manner and rounding in the nucleus.

Articulatory MANNER Feature Analysis
Manner of articulation is a dominant feature in onset and coda position; it drives the feature deviation in onsets and codas for place and voicing. (Phonetic deviation across SEVERAL features.) Manner is less stable in the coda than in the onset. Manner drives place and voicing deviations in the onset and coda.

Articulatory VOICING Feature Analysis
Voicing is a subordinate feature in all syllable positions; its deviation pattern is controlled by manner in onset and coda positions. (Phonetic deviation across SEVERAL features.) Voicing is unstable in coda position and is dominated by manner.

The Intimate Relation Between Stress Accent and Vocalic Identity (especially height)

What is (usually) Meant by Prosodic Stress?
Prosody is supposed to pertain to extra-phonetic cues in the acoustic signal - the pattern of variation over a sequence of SYLLABLES pertaining to syllabic DURATION, AMPLITUDE and PITCH (f0) variation over time (but the plot thickens, as we shall see).

OGI Stories - Pitch Doesn't Cut the Mustard
Although pitch range is the most important of the f0-related cues, it is not as good a predictor of stress as DURATION. [Bar chart comparing predictors: amplitude, pitch range, duration, average pitch.]

Total Energy is the Best Predictor of Stress
Duration x Amplitude is superior to all other combinations of acoustic parameter pairs. Pitch appears redundant with duration. [Bar chart comparing: Duration x Amplitude, Dur x Pitch Range, Pitch Range x Av Pitch, Dur x Av Pitch, Av Pitch x Amp, Pitch Range x Amp, Duration.]
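The comparison of predictors can be illustrated schematically (invented numbers, for shape only; a point-biserial correlation stands in for whatever statistic was actually used in the study):

```python
import numpy as np

def predictor_score(feature, labels):
    """Point-biserial correlation between one acoustic feature and
    binary stress labels (1 = stressed, 0 = unstressed)."""
    return abs(np.corrcoef(feature, labels)[0, 1])

# Hypothetical per-syllable measurements (invented, for illustration only)
duration  = np.array([0.25, 0.08, 0.22, 0.10, 0.30, 0.07])   # seconds
amplitude = np.array([0.9,  0.4,  0.8,  0.5,  1.0,  0.3])    # normalized
stressed  = np.array([1,    0,    1,    0,    1,    0])

for name, feat in [("duration", duration),
                   ("amplitude", amplitude),
                   ("duration x amplitude (total energy)", duration * amplitude)]:
    print("%-38s r = %.2f" % (name, predictor_score(feat, stressed)))
```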

The Nitty Gritty (a.k.a. the Corpus Material)
SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS
Switchboard contains informal telephone dialogues. 54 minutes of material had previously been phonetically transcribed (by highly trained phonetics students from UC Berkeley).

45.5 minutes of pure speech (filled pauses and junctures filtered out), consisting of 9,991 words, 13,446 syllables and 33,370 phonetic segments. All of this material had been hand-segmented at either the phonetic-segment or syllabic level by the transcribers. The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material; this automatic segmentation was manually verified.

Manual Transcription of Stress Accent
2 UC Berkeley Linguistics students each transcribed the full 45 minutes of material (i.e., there is 100% overlap between the two). Three levels of stress accent were marked for each syllabic nucleus:

Fully stressed (78% concordance between transcribers); completely unstressed (85% interlabeler agreement); an intermediate level of accent (neither fully stressed nor completely unstressed; ca. 60% concordance). Hence, 95% concordance in terms of some level of stress. The labels of the two transcribers were averaged. In those instances where there was disagreement, the magnitude of disparity was almost always (ca. 90%) one step; usually, disagreement signaled a genuine ambiguity in stress accent. The illustrations in this presentation are based solely on those data in which both transcribers concurred (i.e., fully stressed or completely unstressed).

A Brief Primer on Vocalic Acoustics
Vowel quality is generally thought to be a function primarily of two articulatory properties, both related to the motion of the tongue. The front-back plane is most closely associated with the second formant frequency (or, more precisely, F2 - F1) and the volume of the

front-cavity resonance. The height parameter is closely linked to the frequency of F1. In the classic vowel triangle, segments are positioned in terms of the tongue positions associated with their production, as follows: [vowel triangle diagram]

Durational Differences - Stressed/Unstressed
There is a large dynamic range in duration between stressed and unstressed nuclei. Diphthongs and tense, low monophthongs tend to have a larger range than the lax monophthongs.

Spatial Patterning of Duration and Amplitude
Let's return to the vowel triangle and see if it can shed light on certain patterns in the vocalic data. The duration, amplitude and their product (integrated energy) will be plotted on a 2-D grid, where the x-axis will always be in terms of hypothetical front-back tongue position (and hence remain constant throughout the plots to follow). The y-axis will serve as the dependent measure, sometimes expressed in terms of duration, or amplitude, or their product.
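A crude sketch of the F1-height link mentioned in the primer (the threshold values are illustrative assumptions, not measured norms):

```python
def vowel_height_from_f1(f1_hz):
    """Crude mapping from first-formant frequency to vowel height.
    Thresholds are invented for illustration: high vowels have a low F1,
    low vowels a high F1."""
    if f1_hz < 400:
        return "high"
    elif f1_hz < 600:
        return "mid"
    return "low"

print(vowel_height_from_f1(300))   # 'high' (e.g. a vowel like [iy])
print(vowel_height_from_f1(750))   # 'low'  (e.g. a vowel like [aa])
```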

Duration - Monophthongs vs. Diphthongs [plots: all nuclei; diphthongs vs. monophthongs]
Duration - Monophthongs vs. Diphthongs [plots: stressed vs. unstressed; diphthongs vs. monophthongs]
Proportion of Stress Accent and Vowel Height [plot]

Take Home Messages
The vowel system of English (and perhaps other languages as well) needs to be re-thought in light of the intimate relationship between vocalic identity, nucleic duration and stress accent. Stressed syllables tend to have significantly longer nuclei than their unstressed counterparts, consistent with the findings reported by Silipo and Greenberg at previous years' meetings regarding the OGI Stories corpus (telephone monologues). Certain vocalic classes exhibit a far greater dynamic range in duration than others. Diphthongs tend to be longer than monophthongs, BUT the low monophthongs ([ae], [aa], [ay], [aw], [ao]) exhibit patterns of duration and dynamic range under stress (accent) similar to diphthongs. The statistical patterns are consistent with the hypothesis that duration serves under many conditions as either a primary or secondary cue for vowel height (normally associated with the frequency of the first formant).

Take Home Messages
Moreover, the stress-accent system in spontaneous (American) English appears to be closely associated with vocalic identity. Low vowels are far more likely to be fully stressed than high vowels (with the mid vowels exhibiting an intermediate probability of being stressed). Thus, the identity of a vowel cannot be considered independently of stress accent; the two parameters are likely to be flip sides of the same coin. Although English is not generally considered to be a vowel-quantity language (as is Finnish), given the close relationship between stress accent and duration, and between duration and vowel quality, there is some sense in which English (and perhaps other stress-accent languages) manifests certain properties of a quantity system.

Automatic Methods for Articulatory Feature Extraction and Phonetic Transcription

Manner Feature Classification/Segmentation
Automatic methods (neural networks) can accurately label MANNER of articulation features for spontaneous material (Switchboard corpus). Implication: MANNER information may be relatively co-terminous with phonetic segments and evade co-articulation effects.

Label Accuracy per Frame
Central frames are labeled more accurately than those close to the segmental boundaries (frame step interval = 10 ms). Implication: some frames are created more equal than others. (OGI Numbers corpus.)

MANNER Classification - Elitist Approach
Confident (usually central) frames are classified more accurately. (NTIMIT telephone corpus.)

Manner-Specific Place Classification
Knowing the manner improves place classification for consonants. (NTIMIT telephone corpus.)

Manner-Specific Place Classification

Knowing the manner improves place classification for vowels as well. (NTIMIT telephone corpus.)

Manner-Specific Place Classification - Dutch
Knowing the manner improves place classification for consonants and vowels in DUTCH as well as in English. (VIOS telephone corpus.)

Manner-Specific Place Classification - Dutch
Knowing the manner improves place classification for the approximant segments in DUTCH; approximants are classified as vocalic rather than as consonantal. (VIOS telephone corpus.)

Take Home Messages
Automatic recognition systems can be used to test specific hypotheses about the acoustic properties of articulatory features, segments and syllables. Manner information appears to be well classified and segmented, suggesting that manner features may be the key articulatory feature dimension for segmentation within the syllable. Place information is not as well classified as manner information; the improvement of place with manner-specific classification suggests that place recognition does depend to a certain degree on manner classification. Voicing information appears to be relatively robust under many conditions and therefore is likely to emanate from a variety of spectral regions. The time constant for voicing information is also likely to be less than or coterminous with the segment.
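The "elitist" frame-selection idea mentioned above can be sketched as follows (the threshold and array shapes are assumptions; the actual system operated on neural-network posteriors over manner classes):

```python
import numpy as np

def elitist_frames(posteriors, threshold=0.7):
    """Keep only frames whose winning class posterior exceeds a threshold.

    posteriors: (n_frames, n_classes) array of per-frame class probabilities.
    Returns the indices of 'confident' frames and their winning labels;
    low-confidence frames (typically near segment boundaries) are dropped.
    """
    winners = posteriors.argmax(axis=1)
    confidence = posteriors.max(axis=1)
    keep = confidence >= threshold
    return np.flatnonzero(keep), winners[keep]

# Toy example: 4 frames, 3 manner classes (invented numbers)
post = np.array([[0.9, 0.05, 0.05],   # confident          -> kept
                 [0.5, 0.3,  0.2 ],   # near a boundary    -> dropped
                 [0.1, 0.85, 0.05],   # confident          -> kept
                 [0.4, 0.35, 0.25]])  # low confidence     -> dropped
idx, labels = elitist_frames(post)
print(idx, labels)    # [0 2] [0 1]
```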

Sample Transcription from the ALPS System
The ALPS (automatic labeling of phonetic segments) system performs very similarly to manual transcription in terms of both labels and segmentation: 11 ms average concordance in segmentation; 83% concordance with respect to phonetic labels. (OGI Numbers telephone corpus.)

ALPS Output Can Be Superior to Alignments
[Display: speech waveform, spectrogram, word transcript, forced-alignment segments, ALPS manner information, ALPS segmentation. Switchboard telephone corpus.]

Grand Summary and Conclusions
The controlling parameters for understanding spoken language appear to be based on low-frequency modulation patterns in the acoustic signal associated with the syllable.

Both the magnitude and phase of the modulation patterns are important. Encoding information in terms of low-frequency modulations provides a certain degree of robustness to the speech signal that enables it to be decoded under a wide range of acoustic and speaking conditions. Manner information appears to be the key to understanding segmentation internal to the syllable. Place features appear to be dominant and most stable at syllable onset and coda. Manner is the stable feature dimension for the syllabic nucleus.

Voicing and rounding appear to be auxiliary features linked to manner and place feature information. Real speech can be useful in delineating underlying patterns of linguistic organization.

That's All, Folks - Many Thanks for Your Time and Attention
