Acoustics of the vowel

Introduction

According to the theory of speech production, the vocal fold movements produce a uniform sound, which is transformed by the resonances of the vocal tract. Moreover, it is generally assumed that phonation and articulation are independent processes, and that articulation, i.e. the vocal tract configuration, is phoneme-specific. Thus, with regard to the acoustics of the vowel, the radiated sound wave is expected to exhibit phoneme-specific patterns of spectral envelope peaks, known as "formants". [1]

Experimentally, however, no simple correlation has been found between perceived vowel identities and formant patterns. Above all, two kinds of formant variations are extensively documented in the literature: i) for vowel sounds with quasi-static spectral properties, the formant frequencies of a particular vowel vary according to the speaker and the speaker group; ii) for vowel sounds exhibiting dynamic spectral changes, in particular for vowels in a CVC context, formant frequencies vary depending on coarticulation. Attempts have therefore been made to establish a correlation between phoneme identity and the physical properties of the signal within two different theoretical concepts. The first approach has been mainly concerned with the problem of normalising the different formant patterns found for a particular vowel identity [2]. The second approach has been mainly concerned with the dynamic description of the sounds [3].

Starting from the assumption that the phoneme identity of a vowel relates to specific patterns of spectral envelope peaks, we have, in our investigations, turned our attention to a third kind of formant variation: the variation in the formant patterns and the spectral envelope shape in relation to the fundamental frequency F0.

Two ideas motivated us to study vowel sounds uttered by single speakers at different F0:

  • If articulation is phoneme-specific, and if, in general, the radiated sound wave of a vowel mirrors the resonances of the vocal tract, the sound waves should exhibit phoneme-specific patterns of spectral envelope peaks at least for vocalisations by single speakers. Moreover, these patterns should not substantially change with rising F0, especially in the case of comparisons for which prominent harmonics can theoretically be at similar frequency values.
  • On the contrary, if the spectral envelope is in principle dependent on F0, no phoneme-specific formant pattern can be expected, and the formant pattern itself cannot be considered as the primary acoustic characteristic of the vowel sound. The implications of this for the understanding of speech production and perception would be important.

The literature contains many indications from studies of vowel synthesis that the spectral envelope and its peak frequencies are dependent on F0. However, no systematic investigations concerning vowels uttered by single speakers (non-singers) at different F0 are reported. We have therefore investigated vowel sounds of this kind.

However, our investigations were not confined to the relationship between formant patterns and vowel identities. In particular, we were also confronted with the methodological problem of formant estimation, with different numbers of apparent spectral envelope peaks relevant for the vowel identity, and with observations concerning the vowel spectrum apart from resonance patterns.

Within this Internet presentation, some general information is first given about the vowel sample investigated, the recording procedure, the acoustic analysis applied, the general procedure for examination of the sounds, the use of the terms "formant" and "formant pattern" and the scheme of presentation. Subsequently, our main findings are presented and illustrated by model vowel series, which include Fourier and LPC spectra, estimated formant patterns and original sounds.


References

[1]

According to Potter and Steinberg (1950) and Fant (1970), "formant" means a frequency region of energy concentration in the spectrum of a vowel sound, i.e. a spectral peak of the sound spectrum; conceptually, it should be kept apart from the term "resonance", which means a frequency region of relatively effective transmission through the vocal tract.

Fant, G. (1970): Acoustic Theory of Speech Production. Mouton, 's-Gravenhage/The Netherlands.

Potter, R.K., and Steinberg, J.C. (1950): "Toward the specification of speech," Journal of the Acoustical Society of America, 22, 807-820.

[2]

For a detailed review of the "formant ratio theory of vowel quality", see Miller, J.D. (1989): "Auditory-perceptual interpretation of the vowel," Journal of the Acoustical Society of America, 85, 2114-2134.

[3]

For a detailed review of the "dynamic specification models", see Strange, W. (1989): "Evolving theories of vowel perception," Journal of the Acoustical Society of America, 85, 2081-2087.

 

Method

The vowel sample

The Swiss German vowels /u,o,a,ä,ö,e,ü,i/ (region of Zurich; for phonetic symbols, see the section on "Symbols") were investigated. Approximately 18,700 recordings were made of 35 men, 44 women and 20 children. The vowels were produced in isolation as well as in CVC context. Whispered vowels were also included. Isolated vowels were produced at different levels of F0 (entire frequency range investigated was F0 = 70 – 870 Hz), and at different intensities. Within this Internet presentation, we discuss only the vowels produced in isolation and on a monotone.

Recordings

Sounds were recorded digitally with a sampling frequency of 22.05 kHz and with 16 bits of amplitude resolution.

Acoustic analysis

Fourier and LPC analyses were applied to the recorded sounds, with the following parameter settings: Hamming window of 1,024 sampling points, pre-emphasis = 98, filter order = 28 (LPC), maximum bandwidth = 500 Hz (LPC). The fundamental frequency was determined from the frequency of the first harmonic of the Fourier spectrum. (Please note that the analysis was applied to the entire frequency range from 0 to 11.025 kHz of the recorded sound. However, for presentation, the frequency range shown is reduced to 0 – 5.5 kHz.)
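As a rough illustration of the Fourier step described above, the following Python sketch computes the bin spacing implied by a 1,024-point window at 22.05 kHz and reads F0 off the lowest prominent peak of a Hamming-windowed magnitude spectrum. The function names, the 10% prominence threshold and the 1,000 Hz search limit are assumptions made for this example and are not settings reported for the study; pre-emphasis is omitted, and the test tone is synthetic.

```python
import cmath
import math

SAMPLE_RATE = 22050   # Hz, as stated under "Recordings"
WINDOW_LEN = 1024     # samples, Hamming window length stated above

def hamming(n):
    """Symmetric Hamming window with n points."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def low_bin_magnitudes(frame, num_bins):
    """Magnitudes of DFT bins 0 .. num_bins-1 of the Hamming-windowed frame."""
    n = len(frame)
    windowed = [s * w for s, w in zip(frame, hamming(n))]
    # Table of the n complex roots of unity, reused for every bin.
    twiddle = [cmath.exp(-2j * math.pi * m / n) for m in range(n)]
    return [abs(sum(x * twiddle[(k * t) % n] for t, x in enumerate(windowed)))
            for k in range(num_bins)]

def first_harmonic_hz(frame, sample_rate=SAMPLE_RATE,
                      rel_threshold=0.1, max_hz=1000.0):
    """Frequency of the lowest prominent spectral peak, or None if none is found.

    rel_threshold and max_hz are illustrative assumptions.
    """
    n = len(frame)
    k_max = int(max_hz * n / sample_rate)
    mags = low_bin_magnitudes(frame, k_max + 2)
    floor = rel_threshold * max(mags)
    for k in range(1, k_max + 1):
        # Local maximum that is strong enough to count as a harmonic.
        if mags[k] >= floor and mags[k - 1] < mags[k] >= mags[k + 1]:
            return k * sample_rate / n
    return None

# Synthetic test tone: three harmonics of an F0 placed exactly on bin 10.
f0 = 10 * SAMPLE_RATE / WINDOW_LEN
tone = [sum(a * math.sin(2.0 * math.pi * h * f0 * t / SAMPLE_RATE)
            for h, a in ((1, 1.0), (2, 0.8), (3, 0.5)))
        for t in range(WINDOW_LEN)]
estimated_f0 = first_harmonic_hz(tone)
bin_spacing_hz = SAMPLE_RATE / WINDOW_LEN
```

With these settings the bin spacing is about 21.5 Hz, so an F0 value read directly from the peak bin can only be as fine as that spacing unless some form of interpolation is added.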

General procedure for investigation of the vowel sounds

We followed a purely acoustic (psychophysical) approach, that is, we investigated solely the relationship between the perceived vowel identity and the physical characteristics of the radiated sound wave. In general, the procedure was as follows. First, sound fragments were extracted from a recording, representing a particular vowel sound. Second, spectral constancy of the sound fragment was visually inspected against the background of Fourier and LPC analysis. Third, perceptual identification tests were performed. Fourth, vowel sounds with a high perceptual identification rate and exhibiting high spectral constancy were selected. Finally, the spectral envelopes of such sounds were examined in terms of their relationship with phoneme identity.

The expressions "formant" and "formant pattern", and the abbreviations F0, F1, F2, F3

As mentioned above, the experiments and the results presented as well as their discussion are restricted to the acoustic (or psychophysical) aspect of vowels, that is, the parallelism between a perceived phoneme identity and the physical correlate of the sound wave. Because of this, the term "formant" is used only in the sense of apparent spectral envelope peaks of the sound wave. Thus, the term "formant pattern" refers to a series of frequencies of apparent spectral envelope peaks, which are considered relevant for vowel identity.

It is beyond the scope of this Internet presentation to discuss in detail whether or not the observed spectral envelope peaks directly relate to resonances of the vocal tract. The acoustic question investigated was what variations in the formants could be observed when analysing the physical properties of the signals. The question was thus not how the vowels were produced, nor what the perceptual process of identification was. Since these different perspectives are often discussed together, it is important to note the purely descriptive approach of our studies. [1]

The fundamental frequency of a sound is expressed as F0, and the first, second and third formant frequencies as F1, F2 and F3.

Presentation scheme

For presentation, Fourier and LPC spectra as calculated for the centre of a sound are given. The frequency range shown is reduced to 0 – 5.5 kHz. Formant frequencies are estimated on the basis of LPC resonance values of the centre of a sound. Vowel identities are indicated by normal characters. For the corresponding phonetic symbols, see section "Symbols".
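The idea of reading formant frequencies from an LPC spectrum can be sketched as follows. This is a minimal Python example under stated assumptions: it uses the autocorrelation method with the Levinson-Durbin recursion (the presentation does not say which LPC variant was used), it takes local maxima of the LPC envelope on a 5 Hz grid instead of applying the 500 Hz bandwidth criterion, and it leaves windowing and pre-emphasis to the caller. All function names are hypothetical, and the demonstration signal is a synthetic two-resonance impulse response analysed with order 4, not a recorded vowel analysed with order 28.

```python
import cmath
import math

def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion; returns [1, a1, ..., a_order] of A(z)."""
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a

def lpc_envelope_peaks(frame, sample_rate=22050, order=28, step_hz=5.0):
    """Frequencies (Hz) of local maxima of the LPC envelope 1/|A(e^{jw})|."""
    a = lpc_coefficients(frame, order)
    num_points = int((sample_rate / 2.0) / step_hz) + 1
    freqs = [i * step_hz for i in range(num_points)]
    env = []
    for f in freqs:
        w = 2.0 * math.pi * f / sample_rate
        denom = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
        env.append(1.0 / abs(denom))
    return [freqs[i] for i in range(1, len(env) - 1)
            if env[i - 1] < env[i] >= env[i + 1]]

def resonator(freq_hz, bw_hz, sample_rate):
    """Denominator [1, b1, b2] of a two-pole resonator."""
    radius = math.exp(-math.pi * bw_hz / sample_rate)
    theta = 2.0 * math.pi * freq_hz / sample_rate
    return [1.0, -2.0 * radius * math.cos(theta), radius * radius]

def impulse_response(denominator, length):
    """Impulse response of the all-pole filter 1/A(z)."""
    y = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, len(denominator)):
            if n - k >= 0:
                acc -= denominator[k] * y[n - k]
        y.append(acc)
    return y

# Synthetic signal with resonances at 500 Hz and 1500 Hz (100 Hz bandwidth each).
r1 = resonator(500.0, 100.0, 22050)
r2 = resonator(1500.0, 100.0, 22050)
denominator = [sum(r1[i] * r2[k - i] for i in range(3) if 0 <= k - i < 3)
               for k in range(5)]
signal = impulse_response(denominator, 1024)
peaks = lpc_envelope_peaks(signal, order=4)
```

For this synthetic signal the two envelope peaks fall close to the two resonance frequencies that were built in. For recorded vowels, the number and position of the peaks depend on the filter order and the other settings, which is the sensitivity discussed under "Methodological problem of formant estimation".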

References

All the findings described in this Internet presentation are based on the new and large vowel sample mentioned above. With regard to these findings, preliminary presentations were made [2], one manuscript has been submitted [3], and one article has been published [4]. Moreover, some of the results had already been obtained in our earlier studies, which used a previous and smaller vowel sample, and others were suggested by those studies. These studies have been published, and the references are given at the end of each paragraph (see also the list of publications in the section "General Information").

Please note

Use HiFi quality speakers to play back the vowel sounds in order to avoid changes in the perceived vowel quality.


References

[1]

Concerning the method of formant frequency estimation in the sense of the calculation of spectral envelope peaks, there remains an ambiguity in the use of the term "formant". Current methods (LPC analysis, cepstral analysis, analysis by synthesis) are all theoretically aimed at detecting resonances of the vocal tract, although they are used as methods to calculate spectral envelope peaks. Only the formula of Potter and Steinberg (1950) used in the first statistical investigation of American English vowels (Peterson and Barney, 1952) represents a method to calculate spectral envelope peaks directly.

With regard to our studies, we could not avoid this methodological ambiguity. However, our investigations relate to sounds for which, in most cases, a direct estimation of spectral envelope peak frequencies using the formula of Potter and Steinberg produces values which differ only marginally from those of LPC analysis.

Peterson, G.E., and Barney, H.L. (1952): "Control methods used in a study of the vowels," Journal of the Acoustical Society of America, 24, 175-184.

Potter, R.K., and Steinberg, J.C. (1950): "Toward the specification of speech," Journal of the Acoustical Society of America, 22, 807-820.

[2]

Klinkert, A. and Maurer, D. (1997): "Fourier spectra and formant patterns of German vowels produced at F0 of 70 - 850 Hz," Journal of the Acoustical Society of America, 101, 3112.

Maurer, D. (1997): "Arguments against formants - the descriptional problem of acoustic phonetics," Proceedings of the Journées d'Etudes Linguistiques 1997, University of Nantes (FR), 106-111.

Maurer, D. and Klinkert, A. (1998): "Vokale und ihre physikalischen Merkmale. I. Lautidentität und Formantmuster," Fortschritte der Akustik - DAGA 98, 386-387.

Maurer, D. and Klinkert, A. (1998): "Vokale und ihre physikalischen Merkmale. II. Lautidentität und Partialtonspektrum," Fortschritte der Akustik - DAGA 98, 388-389.

[3]

Maurer, D. (1999): " 'Anomalous' vowel spectra of Swiss German /a/," manuscript (submitted).

[4]

Maurer, D., and Landis, T. (2000): "Formant pattern ambiguity of vowel sounds," International Journal of Neuroscience, 100, 39-76.

 

Methodological problem of formant estimation

Up to now, no completely objective method of formant estimation has been established, and it is well known that the methodological problem increases with ascending F0. Consequently, for many sounds at higher fundamental frequencies, formant estimation becomes highly problematical, although the perceived vowel quality remains clear. In our vowel sample, the methodological problem proved to be substantial for sounds with F0 > 200 Hz, and severe for sounds with F0 > 300 Hz. However, in some cases, the estimation of formant patterns at lower F0 was also problematical. Thus, there is no methodological basis on which to verify experimentally the hypothesis of a relationship between formant patterns and vowel identities including all clearly identifiable sounds. Most importantly, a statistical approach to relating fundamental frequencies, formant patterns and vowel identities proved to be impossible.
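One commonly given reason for the growing difficulty at higher F0 is that a voiced sound carries energy only at whole-number multiples of F0, so the spectral envelope is outlined by fewer and fewer harmonics as F0 rises. The short Python sketch below counts the harmonics lying below 2 kHz for a few F0 values; the function name, the 2 kHz limit and the F0 values are chosen for illustration only.

```python
def harmonics_below(f0_hz, limit_hz=2000.0):
    """Frequencies of the harmonics k * f0_hz (k = 1, 2, ...) lying below limit_hz."""
    freqs = []
    k = 1
    while k * f0_hz < limit_hz:
        freqs.append(k * f0_hz)
        k += 1
    return freqs

# Number of harmonics available to outline the envelope below 2 kHz.
counts = {f0: len(harmonics_below(f0)) for f0 in (100, 200, 300, 500, 850)}
# counts == {100: 19, 200: 9, 300: 6, 500: 3, 850: 2}
```

At F0 = 100 Hz, nineteen harmonics fall below 2 kHz, whereas at F0 = 500 Hz only three do, so any envelope peak estimated in that range rests on very few spectral components.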

Illustration. The following vowel series illustrates different types of vowel spectra for which the formant pattern cannot be estimated in full.

Vowel series of /a/, /ö/, /ü/ and /i/ illustrating the methodological problem of formant frequency estimation.

For corresponding examples within this presentation, see also

  • "Correlation between F0 and the lower formants I", /u/ at F0 = 394 Hz and /i/ at F0 = 405 Hz;
  • "Correlation between F0 and the lower formants II", all three sounds /u/, /a/ and /i/.


Because of this methodological problem, for the investigation of formant patterns we followed a pragmatic approach using specific series of sounds: to investigate a particular question, we selected those vowel sounds for which the related spectra allowed for clear statements. In many cases, the conditions for such sounds were as follows: the peak structure of the spectrum should be clear; no substantial variations in the results should be brought about by reasonable changes in the parameter settings for LPC analysis; as regards the spectra being compared, the frequency values of relevant harmonics should coincide, in particular the frequency values of harmonics exhibiting a relative energy maximum.

Moreover, even the application of any reasonable method of peak frequency calculation other than the one used in the investigation should in principle produce similar results.


References

Previous studies:

For extensive demonstration of the methodological problem on an earlier vowel sample, see Maurer, D. (1994): Über den Vokal. 2 Volumes. Hartung-Gorre, Konstanz/Germany.

 

Correlation between F0 and the lower formants I: general phenomenon

As mentioned above, many studies of synthesised vowel sounds have indicated a relationship between F0 and the lower formants < 2 kHz [1]. However, the numerical size of these shifts is controversial in the literature, and it is not clear whether the shift of the formants with rising F0 is necessarily associated with a change in the perceived vocal effort or speaker group (men, women and children). Moreover, natural vowels uttered by untrained speakers have barely been investigated.

Our investigation of vowels produced by single speakers (non-singers) at different F0 showed that the formant pattern is directly related to F0. More precisely,

  • the lower formant frequencies < 1.5 - 2 kHz were found to shift with ascending F0;
  • the shift in F1 in Hz was often very pronounced and sometimes greatly exceeded the shift in F0;
  • the relationship F0-F2 could not be estimated unambiguously, owing to large variations in F2 and the methodological problem of formant estimation; however, pronounced shifts in lower F2 were often apparent;
  • the shifts in the lower formants far exceeded the formant variations usually attributed to gender- and age-differences;
  • the shifts were found to be non-linear, i.e. they were barely noticeable for F0 < 175 Hz, but sometimes drastic for F0 > 250 Hz;
  • the shifts were found to be non-monotonic, i.e. different sizes were found for different vowels;
  • we could not determine a ratio between F0 and the lower formants, irrespective of the frequency scale used;
  • the perceptual identification rate did not in principle decrease with increasing F0 and lower formant frequencies;
  • the correlation between the lower formants and F0 is not necessarily associated with a perceived change in the vocal effort or with a change in the perceived speaker group.

Thus, the spectral envelope shape, and with it the formant pattern, is in principle dependent on F0. Moreover, and most interestingly, the within-speaker variations in the lower formants for different F0 can far exceed both the between-speaker variations at F0 of speech (gender- and age-differences) and the variations due to coarticulation.

Despite the methodological problem mentioned above, the most convincing evidence for this finding is obtained by comparing sounds at different F0 for which the relevant harmonics (e.g. prominent harmonics forming a spectral envelope peak) coincide in frequency, above all when the F0 values are whole-number multiples of one another. In such comparisons, for utterances of a particular vowel, the prominent harmonics could theoretically be found at similar frequency values, but are experimentally found at different frequencies.
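The arithmetic behind such comparisons can be sketched in a few lines of Python. When one F0 is a whole-number multiple of another, every harmonic of the higher-pitched sound coincides with a harmonic of the lower-pitched one, so a prominent harmonic could in principle sit at the same frequency in both spectra. The function names, the 2 kHz limit, the 1 Hz tolerance and the F0 values are assumptions made for this example.

```python
def harmonic_series(f0_hz, limit_hz):
    """Harmonic frequencies k * f0_hz up to and including limit_hz."""
    return [k * f0_hz for k in range(1, int(limit_hz // f0_hz) + 1)]

def shared_harmonics(f0_a, f0_b, limit_hz=2000.0, tol_hz=1.0):
    """Harmonics of f0_b lying within tol_hz of some harmonic of f0_a."""
    series_a = harmonic_series(f0_a, limit_hz)
    return [fb for fb in harmonic_series(f0_b, limit_hz)
            if any(abs(fb - fa) <= tol_hz for fa in series_a)]

# F0 values one octave apart: all harmonics of the higher sound coincide.
octave_pair = shared_harmonics(200.0, 400.0)
# F0 values without a whole-number relation: no coinciding harmonics below 2 kHz.
unrelated_pair = shared_harmonics(200.0, 270.0)
```

For 200 Hz and 400 Hz the shared frequencies are 400, 800, 1200, 1600 and 2000 Hz, while 200 Hz and 270 Hz share none below 2 kHz at this tolerance.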

Illustration. The following vowel series illustrate two aspects of the correlation between the lower formants and F0: very pronounced shifts in F1 with ascending F0 for F0 > 250 Hz, and the non-monotonic nature of the correlation. For each vowel, three vocalisations at different F0 were produced by single speakers.

Vowel series of /u/ and /i/ showing a parallel shift of F0 and F1 for F0 > 250 Hz

Vowel series of /o/ and /e/ showing a shift of F1 far exceeding the shift of F0

Vowel series of /a/ and /ä/ showing no or irregular shift of F1 with ascending F0

For corresponding examples within this presentation, see also

  • "Correlation between F0 and the lower formants III", all three vowel series demonstrating the disappearance of gender- and age-differences;
  • "Formant number alteration", the vowel series demonstrating one-formant back vowels at different F0.


References

[1]

Ainsworth, W.A. (1971): "Perception of synthesized isolated vowels and h-d words as a function of fundamental frequency," Journal of the Acoustical Society of America, 49, 1323-1324.

Carlson, R., Granström, B., and Fant, G. (1970): "Some studies concerning perception of isolated vowels," Speech Transmission Laboratory Quarterly Progress and Status Report, 2-3/1970, 19-35, Royal Institute of Technology, Stockholm.

Carlson, R., Fant, G., and Granström, B. (1975): "Two-formant models, pitch and vowel perception," in Auditory Analysis and Perception of Speech, edited by G. Fant and M. A. A. Tatham (Academic, London), pp. 55-82.

Fujisaki, H., and Kawashima, T. (1968): "The roles of pitch and higher formants in the perception of vowels," IEEE Transactions on Audio and Electroacoustics AU-16, No. 1, 73-77.

Hirahara, T., and Kato, H. (1992): "The effect of F0 on vowel identification," in Speech Perception, Production and Linguistic Structure, edited by Tohkura, Y., Vatikiotis-Bateson, E., and Sagisaka, Y. (Ohmsha, Tokyo), pp. 89-112.

Hoemke, K.A., and Diehl, R.L. (1994): "Perception of vowel height: The role of F1-F0 distance," Journal of the Acoustical Society of America, 96, 661-674.

Miller, R.L. (1953): "Auditory tests with synthetic vowels," Journal of the Acoustical Society of America, 25, 114-121.

Nearey, T.M. (1989): "Static, dynamic, and relational properties in vowel perception," Journal of the Acoustical Society of America, 85, 2088-2113.

Slawson, A.W. (1968): "Vowel quality and musical timbre as functions of spectrum envelope and fundamental frequency," Journal of the Acoustical Society of America, 43, 87-101.

Traunmüller, H. (1981): "Perceptual dimension of openness in vowels," Journal of the Acoustical Society of America, 69, 1465-1475.

Traunmüller, H. (1985): "The role of the fundamental and the higher formants in the perception of speaker size, vocal effort, and vowel openness," in PERILUS (Institute of Linguistics, University of Stockholm), 4/1985, pp. 92-102.

Previous studies:

Maurer, D., and Landis, T. (1995): "F0-dependence, number alteration, and non-systematic behaviour of the formants in German vowels," International Journal of Neuroscience, 83, 25-44.

See also:

Maurer, D., Cook, N., Landis, T., and d'Heureuse, C. (1992): "Are measured differences between the formants of men, women and children due to F0 differences?" Journal of the International Phonetic Association, 21, 66-79.

Maurer, D. (1994): Über den Vokal. 2 Volumes. Hartung-Gorre, Konstanz/Germany, pp. 92-129.

Maurer, D., and Landis, T. (2000): "Formant pattern ambiguity of vowel sounds," International Journal of Neuroscience, 100, 39-76.

 

Correlation between F0 and the lower formants II: high-pitched vowels

If a subject produces a vowel sound at a fundamental frequency above F1 of its normal speech (i.e., above F1 found at the "habitual pitch"), and if the related spectrum is regarded as exhibiting a peak structure, the lower formants obviously have to shift upwards. Thus, the examination of high-pitched vowels can provide additional evidence of a correlation between lower formants and F0.

The main experimental question is whether the perceived vowel identity remains unambiguous for such sounds. When investigating the upper limit for F0 of identifiable natural vowels of untrained subjects, we found an 80% identification rate for all long Swiss German vowels up to F0 = 730 Hz, and a 100% differentiation of /u-a-i/ up to F0 = 850 Hz. F0 of high-pitched vowels /u,ü,i/ far exceeded F1 of the vowels at normal speech level. In many cases, the same held true for vocalisations of /o,ö,e/. Thus, intelligible high-pitched vowels represent an extreme case of formant variation, and they clearly confirm the relationship between F0 and the lower formants. Moreover, the upper limits of F0 of identifiable sounds found in untrained speakers exceeded what was usually found in singers, and the assumption (often made in the literature) that vowel identity decreases, as a matter of course, with increasing F0 was disproved.

Illustration. The following vowel series shows three vocalisations of high-pitched vowels /u/, /a/ and /i/, which scored a 100% identification rate in two different identification tests.

Vocalisations of /u/, /a/ and /i/ at F0 > 800 Hz.

For corresponding examples within this presentation, see also

  • "Methodological problem of formant estimation", /a/ at F0 = 500 Hz and /i/ at F0 = 512 Hz;
  • "Formant number alteration", /a/ at F0 = 497 Hz;
  • "Formant pattern ambiguity", /i/ at F0 = 713 Hz (first vowel series) and /ü/ at F0 = 497 Hz (second vowel series).


References

Previous studies:

Maurer, D., and Landis, T. (1996): "Intelligibility and spectral differences in high pitched vowels," Folia phoniatrica et logopaedica, 48, 1-10.

See also:

Maurer, D. (1994): Über den Vokal. 2 Volumes. Hartung-Gorre, Konstanz/Germany, pp. 130-132.

Maurer, D., and Landis, T. (1995): "F0-dependence, number alteration, and non-systematic behaviour of the formants in German vowels," International Journal of Neuroscience, 83, 25-44.

Maurer, D., and Landis, T. (2000): "Formant pattern ambiguity of vowel sounds," International Journal of Neuroscience, 100, 39-76.

 

Correlation between F0 and the lower formants III: disappearance of gender- and age-differences

The discovery of a correlation between F0 and the lower formants in part contradicts the assumption that the statistical differences in the formant frequencies found for men, women and children [1] are a consequence of differences in vocal tract size. Since F0 of speech also differs for the three speaker groups, at least for the lower formants, the formant variations mentioned could relate to F0 differences.

The main experimental question is whether or not gender- and age-differences in the lower formant frequencies appear if F0 of the vocalisations is similar for speakers in all speaker groups.

We compared formant values for men, women and children, with F0 variation for the men and the women. When the vowels were produced at F0 near that of normal speech (i.e., F0 differed for the different speaker groups), the formant frequencies differed according to other formant statistics given in the literature. But when all the speakers produced the vowels at similar F0, no differences in the lower formant frequencies were found between men and women, and most of the differences between adults and children disappeared. Thus, the influence of F0 on the lower formants is much more significant than that of the speaker group, and the size of the vocal tract does not directly affect all formants.

If, in addition, one considers gender- and age-specific vowel colour, it can be hypothesised that there is no general gender- and age-difference in the lower formant frequencies.

Illustration. The following vowel series illustrate this finding: gender- and age-differences in F1 (all vowels presented) and F2 (vowel /o/) for different F0, and the disappearance of these differences at similar F0. Each vowel was uttered by a single speaker. (For the vowel /i/, the LPC values for the child are less convincing than for the other vocalisations; however, direct comparison of the spectra confirms F1 near F0 for all sounds at F0 > 200 Hz.)

Vowel series of /o/ showing vocalisations by a man, a woman and a child

Vowel series of /e/ showing vocalisations by a man, a woman and a child

Vowel series of /i/ showing vocalisations by a man, a woman and a child

For a corresponding vowel series of /o/ of a child within this presentation, see also "Correlation between F0 and the lower formants I" (second vowel series).


References

[1]

Fant, G. (1959): "Acoustic analysis and synthesis of speech with applications to Swedish," Ericsson Technics, 1, 3-108.

Hillenbrand, J., Getty, L.A., Clark, M.J., and Wheeler, K. (1995): "Acoustic characteristics of American English vowels," Journal of the Acoustical Society of America, 97, 3099-3111.

Peterson, G.E., and Barney, H.L. (1952): "Control methods used in a study of the vowels," Journal of the Acoustical Society of America, 24, 175-184.

Previous studies:

Maurer, D., Cook, N., Landis, T., and d'Heureuse, C. (1992): "Are measured differences between the formants of men, women and children due to F0 differences?" Journal of the International Phonetic Association, 21, 66-79.

See also:

Maurer, D. (1994): Über den Vokal. 2 Volumes. Hartung-Gorre, Konstanz/Germany, pp. 141-151.

 

Formant number alteration

In addition to rising lower formant frequencies with ascending F0, we observed a second formant pattern variation which is rarely discussed in relation to natural vowels: the appearance of different numbers of formants relevant for vowel identity.

Early studies of synthesised vowels revealed that "one-formant back vowels" and "two-formant front vowels" could be clearly identified [1]. However, this phenomenon has not been examined in detail on the basis of natural vowels, although, in speech laboratories, it is often observed that expected formants are "missing". In formant statistics, such a lack of formants has often been interpreted as two formants "merging" into one, and the sounds concerned have often been excluded from further analysis.

In the vowel sample investigated, we found many examples of natural "one-formant back vowels" (only one apparent spectral envelope peak < 2 kHz). In our opinion, the interpretation of the related spectra as exhibiting formant merging is, to say the least, doubtful, especially in the case of the sounds /u/ and /o/: for each of these two vowels, F1 of the vocalisations showing only one lower formant corresponded with F1 for the vocalisations exhibiting two lower formants. Moreover, as was found for synthesised vowels, we were unable to determine a general hypothesis of spectral integration with respect to back vowels, including the sounds of /a/.

We also found instances which can be interpreted as cases of natural "two-formant front vowels" (only two apparent envelope peaks < 4 kHz). However, this finding depended heavily on the method of analysis and on the parameter settings chosen, and is therefore not discussed in detail.

Thus, the number of formants relevant for vowel identity is not constant, and no systematic rule for a spectral integration of two formants into one could be established.

Illustration. Vocalisations of /u/, /o/ and /a/ exhibiting only one spectral envelope peak < 2 kHz.

For a corresponding example within this presentation, see also "Formant pattern ambiguity", /o/ at F0 = 355 Hz (first vowel series).


References

[1]

Carlson, R., Granström, B., and Fant, G. (1970): "Some studies concerning perception of isolated vowels," Speech Transmission Laboratory Quarterly Progress and Status Report, 2-3/1970, 19-35, Royal Institute of Technology, Stockholm.

Carlson, R., Fant, G., and Granström, B. (1975): "Two-formant models, pitch and vowel perception," in Auditory Analysis and Perception of Speech, edited by G. Fant and M. A. A. Tatham (Academic, London), pp. 55-82.

Cohen, A., Slis, I.H., and Hart, J. (1963): "Perceptual tolerances of isolated Dutch vowels," Phonetica, 9, 65-78.

Delattre, P., Liberman, A.M., Cooper, F.S., and Gerstman, L.J. (1952): "An experimental study of the acoustic determinants of vowel color; observations on one- and two-formant vowels synthesized from spectrographic patterns," Word, 8, 195-210.

Miller, R.L. (1953): "Auditory tests with synthetic vowels," Journal of the Acoustical Society of America, 25, 114-121.

Previous studies:

Maurer, D., and Landis, T. (1995): "F0-dependence, number alteration, and non-systematic behaviour of the formants in German vowels," International Journal of Neuroscience, 83, 25-44.

See also:

Maurer, D. (1994): Über den Vokal. 2 Volumes. Hartung-Gorre, Konstanz/Germany, pp. 92-129.

Maurer, D., and Landis, T. (2000): "Formant pattern ambiguity of vowel sounds," International Journal of Neuroscience, 100, 39-76.

 

Formant pattern ambiguity

Formants are related to F0, and the number of them relevant to vowel identity can alter. Thus, there arises the question of whether the formant pattern itself might be "ambiguous" in principle [1] - not in the well-known sense of overlapping F1-F2 boundary values for adjacent vowels, but in the sense of similar formant patterns for adjacent as well as non-adjacent vowels, in particular similar F1-F2 when comparing only back vowels, and similar F1-F2-F3 for all other comparisons (for back vowels in German, higher formants > 2 kHz are not relevant for vowel identity [2]).

When testing for similarities between the formant patterns of different vowels, we have found ambiguities of this kind [3]. Most interestingly, similar formant patterns were found for vowels separated by a large phonetic distance. Moreover, there were many cases in which one pattern related to more than two different vowels, i.e. the ambiguity proved to be "multiple".

The demonstration of formant pattern ambiguity constitutes a classic refutation of the assumption that resonance patterns are by themselves the primary acoustic features of vowel quality: with respect to the physical properties of the radiated sound wave, there is neither an exact nor an approximate relationship between formant frequencies and perceived vowel identity; on the contrary, formant patterns are ambiguous in principle.

Illustration. The following vowel series illustrate the two main aspects of the formant pattern ambiguity: similar formant patterns for adjacent as well as non-adjacent vowels, and "multiple" ambiguity.

Vowel pairs /o/-/a/, /o/-/i/ and /e/-/i/ with similar formant patterns

Vowel series of /ä/, /ö/ and /ü/ with similar formant patterns


References

[1]

We are aware of only one study in the literature which reports a formant pattern ambiguity F1-F2-F3 for the two adjacent vowels /e/ and /ø/:

Fant, G., Carlson, R., and Granström, B. (1974): "The [e] - [ø] ambiguity," in Proceedings of the Speech Communication Seminar, Stockholm, April 1-3, 3/1974, pp. 117-121.

[2]

With regard to German back vowels, three main findings indicate that the higher formants > 2 kHz are not relevant for vowel identity. First, in numerous cases, F3 cannot be measured (Jørgensen, 1969; Maurer et al., 1992). Second, back vowels can be synthesised using only two formant frequencies < 2 kHz, according to F1 and F2 of natural sounds. Conversely, if front vowels are synthesised using only two formant frequencies, F2' and F2 often fail to coincide (see, e.g., Delattre 1948, Delattre et al., 1952, Carlson et al., 1970). Third, different synthesised front vowels can display similar F1' and F2', with only F3' differing (Bladon and Ladefoged, 1982; Bladon, 1983), but no such similarity has been reported for back vowels.

Bladon, A., and Ladefoged, P. (1982): "A further test of a two-formant model," Journal of the Acoustical Society of America, 71, S104 (A).

Bladon, A. (1983): "Two-formant models of vowel perception: shortcomings and enhancements," Speech Communication, 2, 305-313.

Carlson, R., Granström, B., and Fant, G. (1970): "Some studies concerning perception of isolated vowels," Speech Transmission Laboratory Quarterly Progress and Status Report, 2-3/1970, 19-35, Royal Institute of Technology, Stockholm.

Delattre, P. (1948): "Un triangle acoustique des voyelles orales du français," French Review, XXI, 6.

Delattre, P., Liberman, A.M., Cooper, F.S., and Gerstman, L.J. (1952): "An experimental study of the acoustic determinants of vowel color; observations on one- and two-formant vowels synthesized from spectrographic patterns," Word, 8, 195-210.

Jørgensen, H.P. (1969): "Die gespannten und ungespannten Vokale in der norddeutschen Hochsprache mit einer spezifischen Untersuchung der Struktur ihrer Formantfrequenzen," Phonetica, 19, 217-245.

Maurer, D., Cook, N., Landis, T., and d'Heureuse, C. (1992): "Are measured differences between the formants of men, women and children due to F0 differences?" Journal of the International Phonetic Association, 21, 66-79.

[3]

Maurer, D., and Landis, T. (2000): "Formant pattern ambiguity of vowel sounds," International Journal of Neuroscience, 100, 39-76.

See also:

Maurer, D. (1994): Über den Vokal. 2 Volumes. Hartung-Gorre, Konstanz/Germany, pp 152-177.

 

"Anomalous" vowel spectra

In addition to one-formant back and two-formant front vowels, we have observed that some vocalisations exhibit spectra which are "unexpected", in the sense of being flat spectra with no clear resonance structure, or of having an unexpectedly large number of envelope peaks. We recently investigated such "anomalies" on the basis of both natural and synthesised sounds. The results showed that the perceptual identification rate for natural vocalisations exhibiting an "anomalous" spectrum did not decrease, and no special perceptual feature (e.g. unusually large or small vocal effort, or nasalisation) could be detected. Moreover, in many cases, estimation of formant frequencies proved to be highly problematical or even impossible to justify. The investigation of synthesised vowel sounds confirmed these findings [1].

Thus, the problem of alteration in the number of formants is compounded, and vowels prove to be intelligible even when the related spectra exhibit no pronounced resonance structure.
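The two kinds of "anomaly" can be illustrated with a crude sketch that counts peaks in a list of harmonic levels. It assumes that the spectrum is given as one level in dB per harmonic and that a harmonic counts as a peak if it exceeds both of its neighbours by a threshold. The function name `envelope_peaks`, the 3 dB threshold and the level values are hypothetical; this is a stand-in for illustration, not a formant estimation method and not the procedure used in the investigation.

```python
def envelope_peaks(harmonic_levels_db, min_prominence_db=3.0):
    """Return indices of harmonics that exceed both immediate
    neighbours by at least min_prominence_db."""
    peaks = []
    for i in range(1, len(harmonic_levels_db) - 1):
        higher_neighbour = max(harmonic_levels_db[i - 1],
                               harmonic_levels_db[i + 1])
        if harmonic_levels_db[i] - higher_neighbour >= min_prominence_db:
            peaks.append(i)
    return peaks


# Placeholder levels in dB, invented for illustration (not measurements).
flat_spectrum = [60.0, 59.5, 60.0, 59.0, 60.0]
peaky_spectrum = [50.0, 60.0, 50.0, 62.0, 51.0, 63.0, 50.0]

print(envelope_peaks(flat_spectrum))   # no harmonic stands out
print(envelope_peaks(peaky_spectrum))  # three separate peaks
```

Under this simple criterion, a flat spectrum yields no peaks at all, while a spectrum with many alternating levels yields more peaks than a conventional formant description would expect.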

Illustration. The following vowel series illustrates the two aspects of "anomalous" vowel spectra in relation to synthesised sounds of open /o/ and /a/: flat spectra with no resonance structure, and an unexpectedly large number of envelope peaks.

Vowel series of synthesised open /o/ and /a/ exhibiting "anomalous" spectra.

For corresponding examples of natural vowels within this presentation, see "Methodological problem of formant estimation".


References

[1]

Maurer, D. (1999): "Anomalous vowel spectra of Swiss German /a/," manuscript (submitted).

See also:

Maurer, D. (1994): Über den Vokal. 2 Volumes. Hartung-Gorre, Konstanz/Germany, pp 178-191.

 

Conclusions and implications for future research

Our investigations proceeded from a purely acoustic (psychophysical) perspective, that is, only the relationship between the vowel spectrum and the perceived vowel quality was considered. Therefore, the conclusions primarily concern the acoustics of the vowel. (However, there are important implications for vowel production and perception which will be touched on briefly in the following section.)

With regard to isolated vowel sounds which exhibit no substantial dynamic transitions in their signal, the findings referred to can be summarised in the form of four main statements:

  • there is no sure method for systematically investigating the spectral envelope and its peak frequencies for all perceptually identifiable vowel sounds;
  • in comparisons of vowel sounds for which the estimation of formant patterns is justifiable, no systematic behaviour of the formant pattern is found: the correlation between the lower formant frequencies and F0 proves to be non-linear and non-monotonic, the number of formants relevant for phoneme identity alters, and no general rule governing spectral integration of two formants into one could be experimentally confirmed;
  • in comparisons of vowel sounds for which the estimation of formant patterns is justifiable, the patterns prove not to be phoneme-specific; on the contrary, they prove to be ambiguous;
  • finally, it is possible to perceive vowel sounds for which the related spectra do not exhibit a clear resonance structure in the sense of defined spectral envelope peaks.

It has to be emphasised that this holds true even for vocalisations by a single speaker. Thus, the formant variation due to F0 alteration often far exceeds the between-speaker variations as well as the variations due to coarticulation. Moreover, the formant variation due to F0 alteration contrasts with the assumption that the size of the vocal tract influences the entire formant pattern.

The findings mentioned are particularly relevant for a frequency range of F0 = 200-400 Hz, which is often covered by the speech of women and children. For this frequency range (one octave), the shifts in the lower formants in relation to F0 are very pronounced, and alterations in the number of formants relevant for vowel identity often occur.
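As background to this frequency range, and not as part of the findings themselves, the following sketch lists the harmonics of a periodic sound, which lie at integer multiples of F0. The function name `harmonics_up_to` and the 1000 Hz limit are assumptions chosen for illustration only.

```python
def harmonics_up_to(f0_hz, limit_hz):
    """Return the harmonic frequencies k * f0_hz that do not exceed limit_hz."""
    return [k * f0_hz for k in range(1, limit_hz // f0_hz + 1)]


# The two ends of the octave F0 = 200-400 Hz, with an assumed 1000 Hz limit.
for f0_hz in (200, 400):
    print(f0_hz, harmonics_up_to(f0_hz, 1000))
```

At F0 = 200 Hz, five harmonics lie at or below 1000 Hz; at F0 = 400 Hz, only two do. Within this octave, the spectral envelope in the region of the lower formants is therefore represented by very few harmonics.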

We conclude that, against the background of existing theories in phonetics, acoustic research is confronted with two major problems: the vowel spectrum displays "unexpected" behaviour, and the existing methods do not allow for systematic investigation.

Concerning the acoustics of the vowel, there are two possible ways of proceeding further in the research: one could either attempt to develop a highly detailed normalisation hypothesis or seek a new approach other than the formant theory.

We have doubts as to whether a normalisation of the formant pattern will ever make it possible to predict the vowel spectrum reliably. Most importantly, there is no way of experimentally verifying any normalisation hypothesis on natural vowels within the existing methodology of formant estimation. Moreover, vowel sounds exhibiting no spectral envelope peak structures cannot be integrated into such a hypothesis.

Apart from these methodological problems, as mentioned, any normalisation hypothesis would have to be highly detailed: it would have to account for the non-linear and non-monotonic relationship between formant frequencies and F0 as well as for the alteration in the number of the formants; moreover, the influence of relative formant amplitudes and of formant bandwidth on perceived vowel identity (not discussed in this presentation) would also have to be integrated into the hypothesis. Obviously, no simple relationship between F0 and F1-F2-F3 as characterising a particular vowel fulfils these conditions, irrespective of the frequency scale used.
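To make the term "non-monotonic" concrete, the following sketch checks whether a formant frequency only rises or only falls as F0 increases. The function name `is_monotonic_in_f0` and the numerical values are hypothetical and invented for illustration; they are not measurements from our vowel sample.

```python
def is_monotonic_in_f0(points):
    """points: iterable of (f0_hz, formant_hz) pairs.
    Return True if the formant never decreases, or never increases,
    when the pairs are ordered by ascending F0."""
    ordered = [formant for _, formant in sorted(points)]
    steps = list(zip(ordered, ordered[1:]))
    rising = all(a <= b for a, b in steps)
    falling = all(a >= b for a, b in steps)
    return rising or falling


# Placeholder values in Hz, invented for illustration (not measurements).
f1_by_f0 = [(200, 400), (250, 450), (300, 430), (350, 520), (400, 500)]
print(is_monotonic_in_f0(f1_by_f0))
```

For these placeholder values the check fails, because the formant rises, falls and rises again with ascending F0. A normalisation hypothesis based on a single rising or falling function of F0 could not reproduce such a course.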

Because of these doubts, we have abandoned the formant approach as regards the acoustics of the vowel and have turned our attention to the development of a new descriptive hypothesis. Our experience with the vowel spectrum has led us to suggest that such a development should entail two steps: first, observations of the behaviour of the vowel spectrum should be made as objective as possible; second, classification schemes should be established and experimentally tested.

On the basis of our large vowel sample, we first formulated the rules according to which the spectra differ in relation to vowel identity [1]. However, the development of classification hypotheses requires very thorough investigation of the mathematical and physical aspects of the vowel sound, which we are not in a position to undertake within our research group. One of the aims of this Internet presentation is to seek to establish collaboration within which research on these lines can become promising.


References

[1]

For first and preliminary indications, see:

Maurer, D., and Klinkert, A. (1997): "The spectral difference of different vowels - towards a new acoustical hypothesis," Journal of the Acoustical Society of America, 101, 3112.

Maurer, D., and Klinkert, A. (1998): "Vokale und ihre physikalischen Merkmale. II. Lautidentität und Partialton-spektrum," Fortschritte der Akustik - DAGA 98, 388-389.

 

Implications for vowel production and perception

As we have said, our investigations proceeded from a purely acoustic (psychophysical) perspective. Aspects of speech production and perception were not considered in detail. However, in our opinion, there are interesting implications of the acoustic findings, particularly for the understanding of speech production.

Given a very close relationship between the formant pattern exhibited by a signal and the positions of the articulators (i.e. the configuration of the vocal tract), the finding of a correlation between the lower formant frequencies and F0 in conjunction with the finding of formant pattern ambiguity could lead to the assumptions that

  • different vocal tract configurations are associated with the same vowel, even for sounds uttered by a single speaker;
  • the vocal tract configurations of a vowel depend on the F0 of the vocalisation;
  • a particular vocal tract configuration can (or must) represent different vowels.

If so, however, phonation and articulation could not be considered independent processes, and articulation on its own could not be considered phoneme-specific. Both consequences are difficult to accept in the context of existing phonetic knowledge. We ourselves are, as a matter of principle, sceptical about these assumptions. However, further investigations of phonation and articulation must account for the acoustic findings mentioned above.

In our own attempts to investigate the relationship between articulation and vowels being produced, for example, we were unable to find confirmation that there is any close relationship between the resonance pattern exhibited by a signal and the positions of the articulators [1]. What is more, we could not find clear indications of an alteration in the vocal tract configuration for utterances of a particular vowel exhibiting rising lower formants with ascending F0.

In our opinion, one aspect of the acoustic findings mentioned above is of primary importance. A speaker could theoretically produce a particular vowel by transforming a source sound by means of one phoneme-specific resonance pattern of his vocal tract. Such a hypothesis is actually very plausible. Yet the acoustic findings show that a speaker does not directly make use of this possibility offered by his anatomy and physiology, but produces voiced speech in a much more complex manner. This observation could be a key to the understanding of voiced speech sounds.

Concerning speech perception, no statement of substance can be made until the acoustic findings, i.e. the determination of the relationship between the physical characteristics of a sound and perceived vowel identity, are reliable. However, it can be speculated that the role of perception is fundamental to the process of speech production, which would explain why a vowel sound does not simply mirror a phoneme-specific resonance pattern of the vocal tract.


References

[1]

Maurer, D., Gröne, B., Landis, T., Hoch, G., and Schönle, P.W. (1993): "Re-examination of the relation between the vocal tract and the vowel sound with EMA in vocalizations," Clinical Linguistics & Phonetics, 7, 129-143.

Concerning the relationship between the vowel and phonation, see also:

Maurer, D., Hess, M., and Gross, M. (1996): "High-speed imaged vocal fold vibrations and larynx movements within vocalizations of different vowels," Annals of Otology, Rhinology and Laryngology, 105, 975-981.

 
