The most famous talking computer of all time, the HAL 9000 from Arthur C. Clarke and Stanley Kubrick's 1968 film 2001: A Space Odyssey, spoke perfect English. And when it was lobotomized for being a bit too human, it went out singing “Daisy Bell.” Clarke's inspiration for that scene was the first song performed with computer-synthesized speech, realized by Max Mathews at Bell Labs in 1961.
Speech synthesis has come a long way since then; your car, computer, and cell phone routinely speak with synthetic tongue. But, as those examples suggest, speech synthesis does not yet approach the fluency of the HAL 9000. In this article, I'll look at the most common approaches to speech synthesis and suggest a few tools for making your computer speak and sing in your own words.
THE PAST TENSE
While speech synthesis may seem very 21st century, the first attempts at synthesized speech predate the computer and even the use of electricity. In the late 18th century, attempts were underway to build mechanical speaking machines. The idea was to model the human vocal tract with devices (bellows for air, reeds for vocal cords, and a mouth cavity molded out of rubber with holes for nostrils) that could be manipulated to produce words and short sentences. The modern equivalent, known as physical modeling, employs mathematical emulations of the vocal tract in much the same way.
The first mechanical speaking machine was built by Wolfgang von Kempelen in 1791. However, it wasn't until the early 20th century that new approaches evolved: the telephonic transmission of speech spurred research into ways to reduce bandwidth while maintaining intelligibility. The result was Homer Dudley's Vocoder (VOice CODER), which analyzed incoming speech using bandpass filters and used the resulting time-variant band-level information to filter a synthetic sound source (in this case, a pulse-wave oscillator) with a matching bank of bandpass filters. The Vocoder has, of course, had a significant impact on modern electronic music. Dudley used similar technology to build a keyboard-controlled speech synthesizer, the Voder, which, though nearly impossible to play, was a huge hit at the 1939 World's Fair.
A completely different electro-mechanical approach to speech synthesis, called Pattern Playback, was developed in the 1950s by Frank Cooper at Haskins Labs. Light was passed through a spectrogram (more on that later) to control the intensity of 50 sine-wave partials. He used spectrograms of recorded speech as well as hand-painted ones to produce monotonic, but very intelligible, speech. The software application MetaSynth (Mac), from U&I Software, allows you to implement a similar process on your desktop.
WORD FOR WORD
All modern speech-synthesis research and implementation is, naturally, done with the aid of computers. For a compendious view of the history of the field, visit the Web site of the Smithsonian Speech Synthesis History Project (see the sidebar “Speech Synthesis Research”). While most research is still carried out at commercial and academic institutions, the results are readily available to and have many applications for desktop musicians.
Probably the first idea that comes to mind when you think about how to make your computer talk is to record a bunch of words as audio samples and string them together into sentences. This is not a particularly satisfactory approach, because a sentence is much more complex than a sequence of words: the whole is more than the sum of the parts. You can quickly convince yourself of that by trying it in either direction; record some words and try to make a sentence, or record a sentence and try to cut it up into words. Words in sentences tend to be shorter and to blend together. Furthermore, elements that evolve over the course of a sentence such as rhythm, pitch, loudness (emphasis), and syllable length — features which, taken together, are referred to as prosody — are key ingredients of natural-sounding speech. As it turns out, words are the wrong building blocks.
Current linguistic theory holds that about 41 discrete sounds, called phonemes, cover all the sounds used in ordinary spoken English. Linguists typically divide phonemes into categories, as vowels (17), consonants (7), fricatives (9), plosives (6), and affricates (2). Notice that the number of phonemes in the vowel and consonant categories does not correspond to the written alphabet, in which a, e, i, o, u, and sometimes y and w are called vowels and everything else is called a consonant. Phonetically, there are many more vowel sounds (i as in bit versus i as in bite, for example) and the sounds not classified as vowels are categorized according to how they are produced (for example, m, s, p, and j are categorized as consonant, fricative, plosive, and affricate, respectively).
In practice, text-to-speech systems use elements called diphones (the end of one phoneme spliced to the beginning of another), triphones (diphones with a phoneme in the middle), and allophones (slight variations of a single phoneme) instead of simple phonemes. That greatly enlarges the database of basic sounds in the interest of producing more natural-sounding speech. But in the end, it's the art of designing and programming the rules that counts. For example, consider the different soundings of the word record in the sentence “Let's record a record.” The MP3 example Record (see Web Clip 1) pushes that sentence through four online text-to-speech converters, from the University of Twente, Netherlands; the Center for Spoken Language Understanding; Bell Labs/Lucent (whose converter is no longer available online); and AT&T (see the sidebar “Online and on Your Desktop”).
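To get a feel for why diphones enlarge the database, here is a minimal Python sketch that turns a phoneme sequence into the diphone units a concatenative system would have to look up. The phoneme spellings are informal stand-ins chosen for this example, the `_` silence marker and the function name are hypothetical, and nothing here reflects the data format of any of the converters mentioned above.

```python
def to_diphones(phonemes, boundary="_"):
    """Pair each phoneme with its successor, padding with silence at both ends."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]

# Illustrative (assumed) pronunciations of the two readings of "record".
RECORD_VERB = ["r", "ih", "k", "ao", "r", "d"]
RECORD_NOUN = ["r", "eh", "k", "er", "d"]

if __name__ == "__main__":
    print(to_diphones(RECORD_VERB))
    print(to_diphones(RECORD_NOUN))
```

With 41 phonemes there could be as many as 41 × 41 = 1,681 such pairs, though not every pair occurs in English. Choosing between the two pronunciations of record is a separate job that falls to the rules.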
Synthesizing speech by means of rules for concatenating (stringing together) basic elements has many practical uses, but the results remain unnatural sounding and have limited musical application. Analyzing, processing, and resynthesizing real speech is far more effective, but the most sophisticated methods and tools are not for the fainthearted. Still, there are many ways for desktop musicians to adapt the methods of speech synthesis to music making, and that's what I'll look at next. For an excellent overview of the field, see Computer Music, 2nd ed., by Charles Dodge and Thomas A. Jerse (Schirmer, 1997).
YOUR SOUND PALATE
From a synthesist's viewpoint, the voice is the world's oldest subtractive synth. It has one oscillator (the vocal cords), which has one waveform that sounds something like a sawtooth or narrow pulse wave. There is a noise generator (breath), a multiband filter (the oral cavity), and an advanced automation system that allows for independent control of pitch, loudness, and filter contour. It's a one-voice, monophonic instrument.
Consider the variety of sounds you can produce from a pulse wave and noise: clearly all the action is in the filter, much of which is determined by the tongue and lips. You won't produce sophisticated speech using an ordinary subtractive synth, but you can still get interesting, speechlike sounds. To see what's really going on, let's start by analyzing a speech sample.
At the top of Fig. 1 is a visual representation of a sound file of the spoken word electronic. Time progresses from left to right, and the green trace indicates sample level over time. You can clearly see the syllables, but the graphic tells you nothing about frequency. The display at the bottom of Fig. 1 shows the same sound file displayed in a form often used in speech analysis, the spectrogram (sometimes called a sonogram). As with the waveform display on top, time is measured on the horizontal axis, but frequency, rather than level, is measured on the vertical axis. Level is indicated by intensity — from dark red to white. The light blue lines in Fig. 1 indicate octaves; the scale in hertz is shown on the left. Believe it or not, some people actually become proficient at reading speech spectrograms.
Notice in the spectrogram that the bright areas are concentrated in wiggly bands, with dark regions in between. Those bands indicate changing resonances in the vocal tract that characterize “voiced” sounds (sounds made with the vocal cords). When the vocal tract is relaxed, those resonances (called formants) are roughly 1,000 Hz apart starting at 500 Hz. Movements of the tongue, lips, and jaw change the shape of the oral cavity and, as a consequence, move the formants around. Fig. 2 shows the formants for the common vowels, a, e, i, o, and u. The first three formants are the most important for speech intelligibility, while the fourth and fifth are important for voice identification. Formants alone are not sufficient to produce intelligible speech, but they are excellent for imparting a speechlike feel to many sound sources.
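The 500 Hz starting point and 1,000 Hz spacing of the relaxed vocal tract can be roughly reproduced with a standard textbook idealization: a uniform tube closed at one end (the vocal cords) and open at the other (the lips), which resonates at odd multiples of a quarter wavelength. The tube model, the 17.5 cm tract length, and the 350 m/s speed of sound in the sketch below are assumptions for illustration and are not taken from the figures.

```python
def neutral_formants(tract_length_m=0.175, speed_of_sound=350.0, count=5):
    """Resonances of a uniform tube closed at one end (odd quarter wavelengths)."""
    return [(2 * n - 1) * speed_of_sound / (4.0 * tract_length_m)
            for n in range(1, count + 1)]

if __name__ == "__main__":
    print([round(f) for f in neutral_formants()])
```

A shorter tube raises every resonance in proportion, which is one reason smaller vocal tracts sound brighter.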
Fig. 3 is a block diagram for a simple setup to synthesize vowels that can be implemented in any reasonably endowed subtractive synth. The MP3 example synVowels (see Web Clip 2) is a recording of 20 vowel phonemes using such a synth created in Native Instruments Reaktor; Fig. 4 shows a spectrogram of those sounds. For the first ten vowels, the oscillator pitch was 125 Hz (a typical male voice pitch), and for the next ten, it was 250 Hz (a typical female voice pitch). The formants used, which are shown in the box at the top, were taken from Dodge and Jerse's Computer Music.
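For readers who would rather try this in code than in a modular synth, here is a minimal Python sketch of the same idea: a narrow pulse source feeding three bandpass bands in parallel. It is not the Reaktor patch used for the Web Clip. The formant values are approximate, commonly cited averages for an adult male voice and may differ from those in the figure, and the bandwidths, sample rate, and function names are assumptions made for this example.

```python
import math

# Approximate, commonly cited average formants (Hz) for an adult male voice.
# Assumed for illustration; they may differ from the values shown in Fig. 4.
VOWELS = {
    "ee": (270, 2290, 3010),
    "ah": (730, 1090, 2440),
    "oo": (300, 870, 2240),
}
BANDWIDTHS = (80, 90, 120)  # assumed filter bandwidths in Hz

def pulse_train(pitch_hz, seconds, sr):
    """One-sample pulses, a crude stand-in for a narrow pulse-wave oscillator."""
    period = round(sr / pitch_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(int(seconds * sr))]

def resonator(signal, center_hz, bandwidth_hz, sr):
    """Two-pole resonant filter acting as a single bandpass (formant) band."""
    r = math.exp(-math.pi * bandwidth_hz / sr)
    a1 = -2.0 * r * math.cos(2.0 * math.pi * center_hz / sr)
    a2 = r * r
    gain = 1.0 - r  # rough level normalization
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = gain * x - a1 * y1 - a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synth_vowel(vowel, pitch_hz=125.0, seconds=0.5, sr=16000):
    """Sum three parallel formant bands driven by the same pulse train."""
    source = pulse_train(pitch_hz, seconds, sr)
    bands = [resonator(source, f, bw, sr)
             for f, bw in zip(VOWELS[vowel], BANDWIDTHS)]
    return [sum(samples) for samples in zip(*bands)]
```

Calling `synth_vowel("ah", pitch_hz=250.0)` raises the pitch an octave while leaving the three bands where they are, which is the same oscillator change described above for the second group of vowels.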
Beyond analysis, spectrograms can be used to resynthesize speech in two ways: additively and subtractively. Used additively, each horizontal line represents a sine-wave oscillator; used subtractively, each horizontal line represents a filter band. In the subtractive case, a harmonically rich source is required for filtering, and in both cases, the brightness of the spectrogram controls level.
The MP3 file ElectronicMix (see Web Clip 3) is a recording of the spoken word electronic followed by nine resyntheses from its spectrogram. The first three are synthesized additively with the spectrogram untransposed, transposed up a tritone, and transposed down a tritone. The next three are synthesized subtractively using the same spectrogram, but transposing the narrow pulse-wave source. The final three, which are doubled in length, are also subtractive and use a varying pitch, a chord, and white noise as the source. Notice that changing the pitch with additive resynthesis also changes the formants, producing the familiar Munchkin effect, whereas with subtractive resynthesis, the pitch of the source changes while the formants remain unchanged. The examples were done on a Mac using MetaSynth.
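The additive case can be sketched in a few lines of Python. The toy spectrogram, its row frequencies, and the function name below are hypothetical, and this is only an outline of the principle, not a description of how MetaSynth works internally. Each row drives one sine oscillator, and the brightness value for each frame sets that oscillator's level.

```python
import math

TRITONE = 2 ** (6 / 12)  # frequency ratio of six semitones

def additive_resynth(spectrogram, row_freqs, frame_len, sr=8000, transpose=1.0):
    """Treat each spectrogram row as a sine oscillator whose level follows
    the row's brightness (0.0 to 1.0), one value per frame."""
    n_frames = len(spectrogram[0])
    out = [0.0] * (n_frames * frame_len)
    for row, freq in zip(spectrogram, row_freqs):
        step = 2.0 * math.pi * freq * transpose / sr
        for n in range(len(out)):
            out[n] += row[n // frame_len] * math.sin(step * n)
    return out

# A toy "spectrogram": three rows (frequency bands) by four frames (time).
TOY_ROWS_HZ = [500.0, 1500.0, 2500.0]
TOY_SPECTROGRAM = [
    [1.0, 0.8, 0.4, 0.0],
    [0.0, 0.5, 0.9, 0.3],
    [0.2, 0.2, 0.0, 0.0],
]

if __name__ == "__main__":
    plain = additive_resynth(TOY_SPECTROGRAM, TOY_ROWS_HZ, frame_len=400)
    shifted = additive_resynth(TOY_SPECTROGRAM, TOY_ROWS_HZ, frame_len=400,
                               transpose=TRITONE)
    print(len(plain), len(shifted))
```

Because `transpose` multiplies every row frequency, the bright bands (the formants) move along with the pitch, which is where the Munchkin effect comes from. In a subtractive version the rows would stay put as filter bands and only the source would be transposed.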
Spectrograms are one example of a general method of analyzing speech called formant tracking. Whatever the final form, the process involves breaking the sound file into small segments called frames (as in the frames in a movie), then computing the frequency spectrum of each frame to extract the formant information. The frame data can then be manipulated graphically or numerically, depending on the software used, and resynthesized. That allows independent time stretching as well as formant and pitch shifting.
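The framing step is easy to sketch in Python. The example below slices a signal into overlapping frames, computes the magnitude spectrum of each, and reports the loudest frequency per frame. Picking the single strongest bin is a crude stand-in for real formant extraction, which typically looks for several peaks in a smoothed spectral envelope. The frame length, hop size, Hann window, and function names are assumptions made for this example.

```python
import cmath
import math

def frames(signal, frame_len, hop):
    """Slice the signal into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def magnitude_spectrum(frame):
    """Hann-window the frame and return DFT magnitudes up to Nyquist.
    A direct DFT is slow but keeps the sketch dependency-free."""
    n = len(frame)
    windowed = [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]

def strongest_frequency(frame, sr):
    """Frequency of the loudest bin: a crude stand-in for formant picking."""
    spectrum = magnitude_spectrum(frame)
    peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
    return peak_bin * sr / len(frame)
```

Stacking the `magnitude_spectrum` results side by side, one column per frame, gives exactly the kind of data a spectrogram display paints as brightness.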
A completely different method of analyzing sound files commonly used in speech synthesis, called linear predictive coding (LPC), also uses frames, but does not attempt to extract their frequency spectra. Instead, it calculates 20 or so parameters (coefficients of a linear equation — hence the L in LPC) for calculating future sample values from prior ones, with minimal error. Though the details are beyond the scope of this article, the important point is that new coefficients are calculated for each frame, and they make up the data of the analysis. LPC, which remains strictly in the time domain, turns out to be a better method of speech synthesis for musical purposes, but because there is no direct correlation between the data and what you hear (as there is with frequency-domain information), it is more difficult to control and manipulate. The primary tool available to the desktop musician for LPC is Csound, but a similar process, called resonator/exciter synthesis, is available in Kyma.
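For the curious, the heart of an LPC analysis fits in a short Python sketch. It uses the autocorrelation method with the Levinson-Durbin recursion, one common way to find the coefficients. That choice and the function names are my assumptions, and this is not a description of how Csound or Kyma implements the analysis. Given one frame, it returns the coefficients that best predict each sample from the ones before it.

```python
def autocorrelation(frame, max_lag):
    """Correlation of the frame with delayed copies of itself."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]

def lpc(frame, order):
    """Return (coefficients, residual_energy) for one frame, where
    x[n] is predicted as sum(coefficients[k - 1] * x[n - k]) for k = 1..order."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:  # silent or perfectly predicted frame
            break
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / error
        updated = a[:]
        updated[i] = reflection
        for k in range(1, i):
            updated[k] = a[k] - reflection * a[i - k]
        a = updated
        error *= 1.0 - reflection * reflection
    return a[1:], error
```

Running `lpc(frame, 20)` on successive frames of a recording would yield the per-frame coefficient sets described above. Resynthesis then drives the resulting all-pole filter with a pulse train or noise.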
Granular synthesis is now widely used in speech synthesis in two very different ways: to generate speech sounds, as in LPC or formant tracking, and as a tool for dissecting and processing sampled speech. To generate speech, the grains are short bursts (typically between 5 and 50 ms) that are equally spaced. To avoid audible clicks, each grain has a fade-in/fade-out envelope. In such a system, pitch is controlled by the grain spacing, and the spectrum of the wave used in the grain acts something like a formant filter. FOF (fonctions d'onde formantique), developed by Xavier Rodet at IRCAM, and VOSIM (VOice SIMulation), developed by Werner Kaegi at the University of Utrecht, are different implementations of that technique. The adventurous can experiment with these methods in Csound, Cycling '74 Max/MSP, IRCAM's jMax, and other DIY applications.
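Here is a minimal Python sketch of the grain-train idea. It is a simplification, not an implementation of FOF or VOSIM: the grain is a plain sine burst with a symmetric fade-in/fade-out envelope, and the sample rate, grain length, and function names are assumptions made for this example. The spacing between grains sets the pitch, and the frequency of the sine inside each grain sets where the formant peak falls.

```python
import math

def make_grain(formant_hz, grain_len, sr):
    """A short sine burst with a fade-in/fade-out (Hann) envelope."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (grain_len - 1)))
            * math.sin(2.0 * math.pi * formant_hz * n / sr)
            for n in range(grain_len)]

def grain_train(pitch_hz, formant_hz, seconds, sr=16000, grain_ms=10.0):
    """Overlap-add identical grains at equal spacing; spacing sets the pitch."""
    spacing = round(sr / pitch_hz)
    grain = make_grain(formant_hz, int(sr * grain_ms / 1000.0), sr)
    out = [0.0] * int(seconds * sr)
    for start in range(0, len(out) - len(grain) + 1, spacing):
        for n, value in enumerate(grain):
            out[start + n] += value
    return out
```

Summing several calls with the same `pitch_hz` but different `formant_hz` values (one per formant) gives a rough vowel, much as the parallel bandpass filters do in a subtractive setup.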
To come full circle, physical modeling, the mechanical version of which dates from the late 1700s, is an active area of computer-generated speech. An example is Perry Cook's Singing Physical Articulatory Synthesis Model (SPASM) developed at Stanford's CCRMA lab, which uses waveguide physical modeling to synthesize singing. Developed on NeXT computers, the system appears not to have migrated to other platforms, but the singing examples on his Web page (listed in the sidebar “Speech Synthesis Research”) are worth a listen.
TRY THIS AT HOME
You don't have to go very far to play with speech synthesis. The sidebar “Online and on Your Desktop” contains links to sources of speech-synthesis software as well as Web sites that will convert text to speech.
If you're a Mac user, you already have such a system on your desktop. Open SimpleText (TextEdit in OS X), type something, and select Speak All from the Sound menu. All recent Mac operating systems contain a speech synthesizer controlled by the Speech Manager Control Panel.
Although true speech synthesis may be beyond the limits of your studio and patience, you can make use of the techniques described here to create speechlike sounds and add an organic flavor to your music. The most readily available technique is formant filtering, which can be accomplished in a variety of ways. If you have a synth with a built-in formant filter, you can simply use that. If you have a synth with enough modularity to allow you to apply three bandpass filters in parallel, you can use that, though it takes a little more effort to morph between vowels. If neither of those alternatives is available to you, but you do have a multiband EQ among your DSP effects, you can use that to process a synth's output or a prerecorded audio clip. As a last resort, you can use three bandpass-filter DSP plug-ins on separate effects buses.
You can use vowel-formant filtering to add a speechlike quality to any harmonically rich source, but a narrow pulse wave or sawtooth oscillator is a good starting point for setting up the band frequencies. The trick is to automate the morphing between vowel formants. If in your setup you can assign MIDI controllers to the filter frequencies and use the same controller with different amounts and polarity, a MIDI Mod Wheel makes a good source for real-time morphing. That doesn't let you move between specific vowel formants as shown in Fig. 2 and Fig. 4, but it will add a vocal-like quality.
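The one-controller mapping can be sketched in Python as follows. Each filter gets its own signed modulation amount, so a single wheel movement pushes the first band up while pulling the other two down. The base frequencies and amounts are assumed values that sweep roughly from an “ee” toward an “ah” using approximate, commonly cited formant averages. They are not taken from the figures, and the function name is hypothetical.

```python
# Assumed endpoint formants (Hz): wheel down is roughly "ee", wheel up roughly "ah".
BASE_HZ = (270.0, 2290.0, 3010.0)
AMOUNT_HZ = (460.0, -1200.0, -570.0)  # signed modulation depth per filter

def wheel_to_formants(cc_value):
    """Map one MIDI controller value (0-127) to three filter frequencies."""
    position = max(0, min(127, cc_value)) / 127.0
    return tuple(base + amount * position
                 for base, amount in zip(BASE_HZ, AMOUNT_HZ))

if __name__ == "__main__":
    for cc in (0, 64, 127):
        print(cc, [round(f) for f in wheel_to_formants(cc)])
```

Only the two endpoints correspond to vowels chosen in advance. Everything in between is a straight-line blend, which is why this approach gives a vocal flavor more than specific vowels.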
Another option is to use a multirow step sequencer — using one row for each bandpass frequency — which allows you to move between specific vowel formants. If you can set up your step sequencer to trigger one pass of the sequence when a MIDI note is played and to select steps at random, try those alternatives. The MP3 file Duet (see Web Clip 4) is an example of that technique. If you're using a multiband EQ or individual bandpass filter plug-ins to process recorded audio clips, use your audio sequencer's automation for formant morphing.
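A multirow step sequencer of that kind amounts to a small table: one row per bandpass filter and one column per step. The Python sketch below is a hypothetical stand-in for whatever sequencer you use. The step names and frequencies are assumed values based on approximate, commonly cited formant averages, and the optional seed is there only to make the random choice repeatable.

```python
import random

# One row per bandpass filter, one column per step. Values (Hz) are assumed.
STEP_NAMES = ["ee", "ah", "oo"]
ROWS = [
    [270.0, 730.0, 300.0],     # row 1: first formant
    [2290.0, 1090.0, 870.0],   # row 2: second formant
    [3010.0, 2440.0, 2240.0],  # row 3: third formant
]

def step_formants(step):
    """Read one column: the three filter frequencies for that step."""
    return tuple(row[step % len(row)] for row in ROWS)

def one_pass(randomize=False, seed=None):
    """Return one pass of steps, in order or chosen at random,
    as if triggered by a single MIDI note."""
    count = len(STEP_NAMES)
    if randomize:
        rng = random.Random(seed)
        order = [rng.randrange(count) for _ in range(count)]
    else:
        order = list(range(count))
    return [(STEP_NAMES[i], step_formants(i)) for i in order]
```

Unlike a single controller sweep, every step here lands on a complete vowel, and adding a vowel is just a matter of adding a column.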
You can also use a vocoder in nonstandard ways to add a vocal quality to your synth patches or audio clips. Instead of using speech as the control source for the vocoder's filter banks, use a morphing vowel audio clip or control the vocoder bands with sequenced automation or MIDI controllers.
If you have a synth or DSP effect that features granular processing, individual vowel sounds make good source material for granulation. Modulating grain parameters such as grain size, pitch, and distribution provides a broad range of vocal-like sounds.
Though a computer as sophisticated as the HAL 9000 is still over the horizon, many of the techniques used in modern speech synthesis are available for desktop musicians today. These techniques are good for more than simply novelty effects; they can significantly expand your musical palette.
Len Sasso can be contacted through his Web site at www.swiftkick.com. Thanks to Dennis Miller for help in researching this article.
SPEECH SYNTHESIS RESEARCH
A graphical description of the vocal tract.
An overview of the state of the art in text-to-speech (TTS) synthesis by Thierry Dutoit.
The Institut de Recherche et Coordination Acoustique/Musique (IRCAM) is a primary source of software, musical examples, and research in acoustic and electronic music.
History of speech synthesis courtesy of Stockholm University.
Smithsonian Speech Synthesis History Project.
Audio examples from Perry Cook's waveguide physical modeling system Singing Physical Articulatory Synthesis Model (SPASM).
History of Speech Synthesis up to 1987 by Dennis Klatt. Includes a large collection of audio examples.
ONLINE AND ON YOUR DESKTOP
AT&T's Interactive Multi-Lingual Demo (www.research.att.com/projects/tts/demo.html) is an interactive online text-to-speech translator.
The Audio Demonstrations (http://cslu.cse.ogi.edu/tts/demos/index.html) page of the Center for Spoken Language Understanding offers a variety of other interactive online text-to-speech translators.
Csounds.com (Mac/Win/Linux; www.csounds.com) is a source for Csound-specific links.
Delay Lama (Mac/Win; www.audionerdz.com/index2.htm) is a donationware vowel-synthesis VSTi plug-in.
Dictionaraoke (www.dictionaraoke.org), the “singing dictionary,” offers popular songs with the lyrics “sung” by speech synthesizers.
Flinger (MS-DOS/Linux; www.cslu.cse.ogi.edu/tts) is a MIDI-to-singing-voice synthesizer for the PC. This site also contains many audio files of Flinger compositions.
The FruityLoops (Win; www.fruityloops.com/English/frames.html) soft-synth workstation includes a speech synthesizer.
HLSyn (Win; www.sens.com/hlsyn_overview.htm) is high-end physical modeling text-to-speech software.
Joe's Reaktor Creations (www.geocities.com/electropop) features an excellent Reaktor Ensemble for synthesizing and manipulating vowel formants, by Joe Orgren.
Kyma (Mac/Win; www.symbolicsound.com) is a sound-design workstation that requires additional hardware.
MacYack (Mac; www.lowtek.com/macyack) is a collection of utilities to enhance the Macintosh Speech Synthesizer.
Max/MSP (Mac/Win; www.cycling74.com) is a graphical music-programming environment.
MetaSynth (Mac; www.uisoftware.com/PAGES/index.html) is a graphic sound-design application.
The University of Delaware offers ModelTalker (Win; www.asel.udel.edu/speech/ModelTalker.html) text-to-speech software.
Reaktor (Mac/Win; www.native-instruments.com) is a software synthesizer and sampler.
SuperCollider (Mac; www.audiosynth.com) is a real-time sound-synthesis programming language.
Say … (wwwtios.cs.utwente.nl/say/form), another interactive online text-to-speech translator, is from the University of Twente, Netherlands.
VocalWriter (Mac; http://kaelabs.com/download.htm) is a shareware application that adds singing text accompaniment to MIDI files.
Yamaha Vocaloid (http://www.global.yamaha.com/news/20030304b.html) is singing-synthesis software currently in development.
These recordings offer excellent examples of the musical uses of speech synthesis.
Charles Dodge, Any Resemblance Is Purely Coincidental (New Albion, 1994)
Paul Lansky, Fantasies & Tableaux (CRI, 1994)
Various Artists, CDCM Computer Music Series, Vol. 5 (Centaur, 1993)