Haskins Laboratories

The Science of the Spoken and Written Word

Articulatory Synthesis

An important part of our research program is a computational model of the vocal tract, begun at Bell Laboratories (Mermelstein, 1973) and subsequently refined by Rubin, Baer and Mermelstein (1981) for use in studies of production and perception (e.g., Abramson et al., 1981; Raphael et al., 1979; Browman et al., 1984; Browman and Goldstein, 1986). The most recent software implementation (ASY) provides a kinematic description of speech articulation in terms of the moment-by-moment positions of six major structures: the jaw, velum, tongue body, tongue tip, lips and hyoid bone, all presented graphically for viewing in the midsagittal plane. The positions of the articulators can be controlled manually or by means of a table of specifications over time; manual control produces steady-state utterances, and table control produces dynamic ones. Tables of parameters can also be used to control the amplitude of glottal excitation, its fundamental frequency and its mode of representation (i.e., in the time or frequency domain). The amplitude and tract point-of-insertion for fricative excitation can also be specified.
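The idea of driving articulators from a table of specifications over time can be illustrated with a minimal sketch. It is not ASY's actual table format: the parameter names, units, key times and the use of linear interpolation between table rows are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical table of specifications: key times (ms) and, for each
# articulator parameter, its target value at those times. The names and
# values are illustrative only and do not come from ASY.
TABLE_TIMES_MS = np.array([0.0, 100.0, 250.0])
TABLE_VALUES = {
    "jaw": np.array([0.0, 1.0, 0.5]),
    "lips": np.array([1.0, 0.0, 1.0]),
}


def articulator_frame(t_ms):
    """Return every parameter's value at time t_ms.

    Values between table rows are linearly interpolated (an assumption);
    times outside the table are clamped to the first or last row.
    """
    return {
        name: float(np.interp(t_ms, TABLE_TIMES_MS, values))
        for name, values in TABLE_VALUES.items()
    }


if __name__ == "__main__":
    # Sample the trajectory every 50 ms to obtain a dynamic production.
    for t in np.arange(0.0, 251.0, 50.0):
        print(t, articulator_frame(t))
```

Under this scheme, a steady-state utterance is the special case in which every row of the table holds the same values.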

Production of a synthetic utterance begins with drawing the first tract configuration on the graphics screen and superimposing a grid structure. The intersections of the grid lines with the tract walls yield the sagittal dimensions, the center line and the length of the tract. Then, using formulae based on a variety of vocal tract measurements (Heinz & Stevens, 1964; Ladefoged, Anthony & Riley, 1971; Mermelstein, Maeda & Fujimura, 1971), the sagittal cross-sections are converted to a smoothed area function approximated by a sequence of uniform tubes, each 0.875 cm in length. This simplification of the vocal tract shape permits a rapid calculation of the vocal tract transfer function. Speech output is then generated, at a sampling rate of 20 kHz, by feeding the glottal waveform through the digital filter representation of the transfer function, which, for voiced sounds, accounts for both the oral and nasal branches of the tract.
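To illustrate why a sequence of uniform tubes makes the transfer function easy to compute, the sketch below estimates the resonances of such a tube sequence with standard lossless chain (transmission) matrices. It is not the ASY implementation: it ignores losses and the nasal branch, treats the lips as an ideal open end, and the sound speed (350 m/s), the air density and all function names are assumptions. Under that assumed sound speed, a 0.875 cm section has a one-way travel time of 25 microseconds, which is half of the 50 microsecond sampling period at 20 kHz.

```python
import numpy as np

SOUND_SPEED = 35000.0    # cm/s (assumed value)
AIR_DENSITY = 1.14e-3    # g/cm^3 (assumed; it cancels out of the result)
SECTION_LENGTH = 0.875   # cm, the tube length given in the text


def chain_matrix_d(areas, freqs):
    """D element of the glottis-to-lips chain matrix at each frequency.

    For lossless tubes D is real, and with an ideal open end at the lips
    the volume-velocity transfer function is 1 / D.
    """
    freqs = np.asarray(freqs, dtype=float)
    kl = 2.0 * np.pi * freqs / SOUND_SPEED * SECTION_LENGTH
    cos_kl, sin_kl = np.cos(kl), np.sin(kl)
    # One 2x2 identity matrix per frequency, shape (n_freqs, 2, 2).
    total = np.tile(np.eye(2, dtype=complex), (freqs.size, 1, 1))
    for area in areas:  # ordered from glottis to lips
        z0 = AIR_DENSITY * SOUND_SPEED / area  # characteristic impedance
        section = np.empty((freqs.size, 2, 2), dtype=complex)
        section[:, 0, 0] = cos_kl
        section[:, 0, 1] = 1j * z0 * sin_kl
        section[:, 1, 0] = 1j * sin_kl / z0
        section[:, 1, 1] = cos_kl
        total = total @ section
    return total[:, 1, 1].real


def formants(areas, f_max=5000.0, step=1.0):
    """Estimate resonance frequencies (Hz) as the zero crossings of D."""
    freqs = np.arange(step, f_max, step)
    d = chain_matrix_d(areas, freqs)
    positive = d > 0
    crossings = np.nonzero(positive[:-1] != positive[1:])[0]
    # Linear interpolation between the two samples that bracket each zero.
    return [freqs[i] + step * d[i] / (d[i] - d[i + 1]) for i in crossings]


if __name__ == "__main__":
    # A uniform tract of 20 sections (17.5 cm) behaves as a quarter-wave
    # resonator, so its resonances fall at odd multiples of c / (4 L).
    print(formants(np.full(20, 3.0)))
```

Because each section has the same length, a new tract shape changes only the per-section areas, so recomputing the response for the next configuration amounts to another pass over the same short loop.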

An interactive demonstration of the original ASY model is available.