Look Who's Talking


Illustration: Mike Cruz

Call it a singing robot or a talking synthesizer — there's something irresistible about crossing human speech with otherworldly timbres. A vocoder (see Fig. 1) does just that, mapping tonal characteristics of one sound (typically speech) onto another sound (typically a synthesizer). From Wendy Carlos's synthesized Beethoven choir in A Clockwork Orange to recent hits by BT and Madonna, the vocoder has stood the test of time.

Although it may sometimes seem otherwise, nothing says that we can't use a vocoder for more than mapping speech characteristics onto a synth tone. Using a drum pattern to shape the output of a string pad is just one of many interesting and effective variations. Once you understand how a vocoder works its magic, your imagination will doubtless conjure many more uses for it.

After a bit of history, I'll take a look at the inner workings of a vocoder and see what makes it tick. Then I'll discuss what makes one vocoder sound different from another. Be sure to check the EM Web site (www.emusician.com) for audio examples of vocoding in action, as well as for a do-it-yourself guide to vocoding.

FIG. 1: This figure shows part of the vocoder collection at the Audio Playground Synthesizer Museum, the world's largest such museum (www.keyboardmuseum.org), assembled for research on this article. On top sits the classic Korg VC-10, dating from 1978. At the top of the rack is the MAM VF-11 (current), and below that is an MAM line mixer. The red unit in the middle is the Electrix Warp Factory (1999), and below that is another Korg, the DVP-1 (1986).


The vocoder and its fraternal twin the voder were hatched in the late 1930s by a Bell Labs engineer named Homer Dudley. The word vocoder is derived from “voice coder” (voder is short for Voice Operating Demonstrator), and the vocoder was designed to help reduce the bandwidth required for speech transmission. Interestingly, although we might think of vocoding only in musical terms, many of us also use it daily in nonmusical communications. In fact, Dudley's work is the basis for cellular and Internet telephony. Vocoding was also adapted by the U.S. military to encrypt voice transmissions during World War II.

Dudley realized that the human mouth, throat, and nasal cavity together constitute a complex time-varying acoustic filter that shapes the timbre of the basic tone produced by the vocal folds. To demonstrate this remarkably unromantic observation, try the following test. First, check to be sure you're alone (trust me on this). Now repeat slowly after me: “Waouwayweewohwooo.” Notice that the tonal variations are produced by changes in the position of your lips and tongue. Your vocal folds can change the pitch of your newfound mantra, but the timbre is all in your face.

Based upon this straightforward truth, Dudley constructed the vocoder to analyze the varying timbre of the speech input and apply those variations to a synthesized tone instead of the tone created by the vocal folds. The voder, by contrast, generated speechlike output from oscillators and a filter bank controlled by an operator at a specially designed console. The voder was a hit at the 1939 World's Fair, but the vocoder turned out to have the staying power.


Let's get away from the “talking-machine” paradigm and use the technical name modulator for the speech input, because its timbral variations are used to modulate the sound of the synthesizer. The synthesizer tone is called the carrier, because it carries the tonal imprint of the modulator to the final output.

FIG. 2: The modulator is divided by a bank of bandpass filters (F), and then each band is analyzed by an envelope follower (EF). The carrier passes through a similar bank of filters. The control voltage (CV) of each EF controls the output level of the VCA on its corresponding carrier band. The modulated carrier bands are then recombined at the output.

The modulator signal is split into its component frequencies by a bank of bandpass filters much like a graphic equalizer (see Fig. 2). The output of each band is analyzed by an envelope follower, a device that creates a control voltage (CV) corresponding to the signal level present at its input. In this way, as the relative strengths of the various frequency bands vary in the modulator, a set of CVs is created whose variations track the modulator's changes. The modulator has effectively been reduced to control data.

The carrier wave is simultaneously passed through a bank of bandpass filters, ordinarily set to the same frequency bands as those used on the modulator. Instead of being sent to envelope followers, however, the carrier's component bands are sent to voltage-controlled amplifiers (VCAs). These VCAs are controlled by the CVs from corresponding frequency bands of the modulator and map the amplitude fluctuations measured at each modulator band to its corresponding carrier band. The various bands of the carrier are recombined at the output. Thus, the timbre of the carrier tracks that of the modulator while the carrier provides the pitch information.
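For readers who think better in code than in block diagrams, here is a minimal software sketch of the signal flow in Fig. 2. It is only an illustration, not the design of any particular vocoder: the function names, the simple second-order bandpass filter, the filter Q, and the 50 Hz smoothing setting are all my own assumptions, and hardware units typically use steeper filters.

```python
import math

def bandpass(signal, center_hz, q, sample_rate):
    """One band of the filter bank: a second-order band-pass filter.

    The coefficients follow the widely used "audio EQ cookbook" band-pass
    form with 0 dB gain at the center frequency (an assumed filter choice).
    """
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    b0, b2 = alpha, -alpha                      # b1 is zero for this filter
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = (b0 * x + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def envelope_follower(signal, smoothing_hz, sample_rate):
    """Rectify the band, then smooth it; the result plays the role of the CV."""
    coeff = math.exp(-2.0 * math.pi * smoothing_hz / sample_rate)
    level = 0.0
    out = []
    for x in signal:
        level = coeff * level + (1.0 - coeff) * abs(x)
        out.append(level)
    return out

def vocode(modulator, carrier, centers_hz, sample_rate, q=5.0, smoothing_hz=50.0):
    """Impose the modulator's band-by-band levels on the carrier."""
    n = min(len(modulator), len(carrier))
    output = [0.0] * n
    for center in centers_hz:
        mod_band = bandpass(modulator[:n], center, q, sample_rate)
        cv = envelope_follower(mod_band, smoothing_hz, sample_rate)
        car_band = bandpass(carrier[:n], center, q, sample_rate)
        for i in range(n):
            output[i] += cv[i] * car_band[i]    # the "VCA": CV scales the band
    return output
```

Feed it a list of speech samples as the modulator and a sawtooth as the carrier, with one center frequency per band, and the output goes silent whenever the modulator does, because every CV falls to zero.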

Vocoding, then, is an example of subtractive synthesis. No additional frequency components are added to the carrier wave by the process — they are merely altered according to the analysis of the modulator. That is an effective way to synthesize speechlike sounds, because your voice also uses subtractive techniques, as you can demonstrate for yourself. Once again, look over your shoulder to be sure you're alone. Now stick out your tongue until it touches your chin, open your mouth as wide as you can, and sing a tone. Not pretty, is it? That is the brightest timbre your voice can produce — the raw unfiltered waveform of your vocal folds. (Okay, you can stop singing now!) All other sounds that your voice makes are created by filtering this raw tone with your oral and nasal cavities.


I like to think of vocoding as a kind of timbral sculpture. A sculptor never adds anything to a block of marble; rather, he or she only chips away at it to reveal the “form within.” That requires starting with a block that's the right size and shape to accommodate the sculptor's vision. Because we can't add any harmonics in vocoding, we need to start with a carrier that has plenty of harmonics to “chip away.”

Favorite carrier timbres include pulse (square) waves, sawtooth waves, string samples, brass samples, and white noise, the same sort of harmonically rich tones on which filter sweeps are effective (see Fig. 3). A sine wave is the worst possible carrier for vocoding, because it has no harmonics to filter. No matter how many frequency bands are tracking the modulator, only one has any energy to be effective, thereby reducing the vocoder to a single-band envelope follower.
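A quick bit of arithmetic shows why the sine wave fares so badly. The sketch below simply counts how many harmonics of a carrier land in each band of a filter bank. The 110 Hz fundamental, the octave-wide band edges, and the cutoff of 29 sawtooth harmonics are arbitrary values I chose for illustration.

```python
def harmonics_per_band(harmonics_hz, band_edges_hz):
    """Count the harmonics that fall inside each band [low, high)."""
    counts = [0] * (len(band_edges_hz) - 1)
    for freq in harmonics_hz:
        for i in range(len(counts)):
            if band_edges_hz[i] <= freq < band_edges_hz[i + 1]:
                counts[i] += 1
                break
    return counts

fundamental_hz = 110.0                                   # assumed carrier pitch
band_edges_hz = [100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0]

sine = [fundamental_hz]                                  # fundamental only
sawtooth = [fundamental_hz * n for n in range(1, 30)]    # every harmonic up to 3,190 Hz

print(harmonics_per_band(sine, band_edges_hz))      # [1, 0, 0, 0, 0]
print(harmonics_per_band(sawtooth, band_edges_hz))  # [1, 2, 4, 7, 15]
```

With the sine wave, four of the five bands have nothing to work on. The sawtooth puts energy in every band, so each CV has something to shape.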

By the same logic, a modulator should be either harmonically or rhythmically active or there's little point in vocoding it. A static modulator timbre, for example, takes the envelope followers out of the equation and reduces the vocoder to an equalizer. In addition to speech or song, effective modulators include drums, arpeggiated synths, or even a sound like a trumpet with a wah-wah mute. I know an engineer who used his young nephew's violin to modulate a string patch, thus creating a more natural-sounding attack despite the fact that the violin's pitches were unrelated to the song (to put it politely).

FIG. 3: The Orange Vocoder from Prosoniq features both sampled and analog-style carrier waveforms. With two oscillators, EQ, and reverb, it offers lots of tools for shaping sounds.


The number of frequency bands has a profound effect on the character of a vocoder. In general, the more bands of analysis and modulation that are used, the more accurately the modulator is represented at the output. For clear speech vocoding, then, having more bands is usually better. Working with a 2-band vocoder, to take the opposite extreme, is like sculpting with a pickax: only gross variations in timbre would be represented. Most vocoders offer ten or more bands, with some software vocoders offering hundreds of bands.

Of course, how these bands are distributed is important, as well. There's nothing that says the bands must be evenly spaced; in fact, concentrating the bands in the midrange frequencies where speech is most interesting can help to achieve clarity.
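To make the point about distribution concrete, here is a small sketch that compares evenly spaced band edges with logarithmically spaced ones. Logarithmic spacing is just one common uneven layout, one that packs more bands into the lows and mids. The 100 Hz to 8 kHz range, the ten bands, and the 3,400 Hz cutoff are illustrative assumptions of mine, not figures from any particular vocoder.

```python
def linear_edges(low_hz, high_hz, bands):
    """Band edges spaced an equal number of hertz apart."""
    step = (high_hz - low_hz) / bands
    return [low_hz + step * i for i in range(bands + 1)]

def log_edges(low_hz, high_hz, bands):
    """Band edges spaced by an equal frequency ratio (equal musical interval)."""
    ratio = (high_hz / low_hz) ** (1.0 / bands)
    return [low_hz * ratio ** i for i in range(bands + 1)]

def bands_below(edges_hz, limit_hz):
    """Count the bands that sit entirely below limit_hz."""
    return sum(1 for upper in edges_hz[1:] if upper <= limit_hz)

linear = linear_edges(100.0, 8000.0, 10)
logarithmic = log_edges(100.0, 8000.0, 10)

print(bands_below(linear, 3400.0))       # 4
print(bands_below(logarithmic, 3400.0))  # 8
```

With the same ten bands, the logarithmic layout devotes twice as many of them to the region below 3,400 Hz.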

Basic vocoder design as I've described so far has one major flaw in its ability to reproduce intelligible speech. Although it's quite good at reproducing vowel sounds and sustained consonants such as m or n, it doesn't do a good job with percussive or sibilant consonants such as t or s. Those are called unvoiced sounds, and their waveforms are not periodic, making most of our favorite carriers poor choices to represent them.

One fix for that is to mix some of the modulator input with the modulated carrier so unvoiced sounds are heard in their original form. Alternatively, blending some white noise with the carrier wave before modulation adds some frequency components that help make unvoiced sounds clearer. A more elegant solution is to use a detector circuit capable of distinguishing unvoiced sounds by their frequency characteristics. When an unvoiced sound is detected in the modulator, the vocoder can momentarily switch from its primary carrier wave to a secondary carrier that is better suited to representing percussive or sibilant sounds.
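As a rough illustration of the detector idea, the sketch below uses the zero-crossing rate, a simple and common heuristic: noisy, sibilant sounds cross zero far more often than pitched vowels do. This is my own stand-in, not the circuit used in any specific vocoder, and the function names and the 0.25 threshold are assumptions.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a < 0.0) != (b < 0.0)
    )
    return crossings / (len(frame) - 1)

def pick_carrier(frame, threshold=0.25):
    """Choose a carrier for one short frame of the modulator.

    Returns "noise" when the frame looks unvoiced (many zero crossings)
    and "tone" otherwise. The threshold is an assumed, untuned value.
    """
    return "noise" if zero_crossing_rate(frame) > threshold else "tone"
```

In use, the modulator would be chopped into short frames of a few milliseconds each, with `pick_carrier` called on every frame to decide which carrier the VCAs should be shaping at that moment.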

Clarity isn't everything, though. Sometimes your goal will be to make the most ear-catching sound, rather than the most articulate. For that reason, some vocoders feature adjustable resonance on the carrier's filter bank, allowing the user to narrow the bands for a more biting sound.

Gender-bending effects can be produced by remapping the CV signals to the VCAs of noncorresponding frequency bands. Shift the CVs to higher frequency bands, for example, and the timbre of the output takes on a higher character, even becoming chipmunklike in sound when taken to an extreme.
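In software terms, this remapping amounts to sliding the list of CVs along the list of carrier bands. Here is a minimal sketch; the function name and the choice to silence bands left without a CV are my own assumptions.

```python
def shift_cvs(cvs, shift):
    """Route each band's CV to the carrier band `shift` positions away.

    A positive shift moves the CVs to higher bands, a negative shift to
    lower ones. CVs pushed past either end are dropped, and bands left
    without a CV stay silent (0.0).
    """
    bands = len(cvs)
    shifted = [0.0] * bands
    for i, cv in enumerate(cvs):
        target = i + shift
        if 0 <= target < bands:
            shifted[target] = cv
    return shifted

# CVs from five modulator bands, lowest band first
print(shift_cvs([0.9, 0.5, 0.2, 0.1, 0.0], 2))  # [0.0, 0.0, 0.9, 0.5, 0.2]
```

The energy that the modulator had in its lowest band now drives the third carrier band, which is what pushes the output toward that brighter, smaller-sounding character.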

Other variables found in some vocoders include attack and release times for the VCAs, independent output level for each frequency band, chorus, built-in carrier waves, and more. Because vocoders are often built in to effects devices or synthesizers, the list of sound-processing variables is virtually endless.
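Attack and release controls are easy to picture in code. The sketch below smooths a band's CV before it reaches the VCA, rising at one rate and falling at another. It assumes the controls act on the CV itself, which is one plausible arrangement rather than a description of any specific unit, and the names are hypothetical.

```python
import math

def smooth_cv(levels, attack_s, release_s, sample_rate):
    """Smooth a stream of CV values with separate attack and release times.

    Rising input is tracked with the attack time constant, falling input
    with the release time constant (both in seconds).
    """
    attack = math.exp(-1.0 / (attack_s * sample_rate))
    release = math.exp(-1.0 / (release_s * sample_rate))
    value = 0.0
    out = []
    for level in levels:
        coeff = attack if level > value else release
        value = coeff * value + (1.0 - coeff) * level
        out.append(value)
    return out
```

A short attack with a long release lets each band snap open on a consonant and then ring on after the modulator has stopped, which smears the speech. Short times on both keep the articulation crisp.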


Any number of alternative techniques exist that imitate the vocoder's ability to create the illusion of speech. The simplest device is the talk box, which is nothing more than a speaker with a tube attached. The tube carries the sound of the carrier from the speaker to the performer's mouth, where the carrier is shaped by that time-varying acoustic filter we discussed earlier. The modulated sound is then picked up by a nearby microphone. The talk box is a favorite with guitarists and has been used famously by Peter Frampton, Joe Walsh, and many others.

Phase vocoding is a digital technique that uses the Fast Fourier Transform (FFT) to analyze the modulator, resulting in a detailed description of its sonic architecture. Once the modulator has been reduced to a set of instructions, it can be reconstructed, either in its original form or in a modified form. Phase vocoding is therefore a resynthesis technique, and it is a particularly smooth way to separate the modulator's pitch and time components for time stretching and pitch shifting. Phase vocoding is found in such products as Csound, U&I Software's MetaSynth, Tom Erbe's freeware SoundHack, and the Composers Desktop Project (CDP), to name but a few. The Kyma System from Symbolic Sound also features a huge array of phase-vocoding capabilities.
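The "phase" in the name refers to a neat trick: by comparing the phase of one analysis bin in two overlapping frames, you can work out the frequency of a partial far more precisely than the spacing of the bins would suggest. The sketch below shows that single step for one bin, using a slow direct Fourier sum in place of a real FFT. The function names, the Hann window, and the frame and hop sizes in the example are my own assumptions; a working phase vocoder repeats this for every bin of every frame and then resynthesizes the result.

```python
import cmath
import math

def dft_bin(frame, k):
    """Complex value of bin k of a frame (a direct sum, not a fast transform)."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))

def hann(n):
    """Hann window, used to reduce leakage between bins."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def estimate_frequency(signal, k, frame_size, hop, sample_rate):
    """Estimate the frequency of the partial nearest bin k from two frames."""
    window = hann(frame_size)
    first = [s * w for s, w in zip(signal[:frame_size], window)]
    second = [s * w for s, w in zip(signal[hop:hop + frame_size], window)]
    phase1 = cmath.phase(dft_bin(first, k))
    phase2 = cmath.phase(dft_bin(second, k))
    # Phase advance expected if the partial sat exactly on the bin center
    expected = 2.0 * math.pi * k * hop / frame_size
    # How far the measured advance departs from that, wrapped to [-pi, pi)
    deviation = phase2 - phase1 - expected
    deviation = (deviation + math.pi) % (2.0 * math.pi) - math.pi
    return (expected + deviation) * sample_rate / (2.0 * math.pi * hop)

sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * i / sample_rate) for i in range(320)]
# Bin 14 of a 256-point frame is centered on 437.5 Hz, yet the estimate
# should come out very close to the tone's actual 440 Hz.
print(estimate_frequency(tone, 14, 256, 64, sample_rate))
```

Once each bin's true frequency and level are known, the frames can be played back at a different spacing to stretch time, or the frequencies can be scaled to shift pitch.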

Digidesign offers a pair of TDM tools called Bruno and Reso that present interesting variations on the vocoder sound. Bruno uses a method of time slicing to extract the timbre of the input signal. These time slices are then crossfaded back together at pitches determined by MIDI input. Reso adds harmonic overtones to the input signal by using a resonance generator, and pitch is also controllable by MIDI input. Notice that Bruno and Reso act directly on the input signal, so there isn't really a modulator/carrier relationship. For that reason, it's useful to think of them as modifying the pitch information of an input signal, as opposed to a traditional vocoder, which modifies the timbral characteristics of the carrier signal. (You say “tomato” …) While they can make vocoder-like talking effects, they are even more interesting to use for adding pitches to drum parts and other nonpitched sounds.


Aside from the sheer giggle factor of sounding like the Cylons from Battlestar Galactica, the best thing about vocoders is that they are a textbook example of the power of modulating one signal with another. If you understand the principles of vocoding, using an LFO to create vibrato on a synthesizer and Velocity switching on a sampler both seem like child's play.

For a do-it-yourself adventure in vocoding, go to www.emusician.com. There are also audio examples to give you an idea of ways you might use a vocoder in your own music. Just remember that anything goes and that creative vocoding involves more than just using speech as a modulator.

Be warned, though: it's almost impossible not to get silly when you're playing with a vocoder. While researching this article I kept looking over my shoulder to see when I would be carted away in a straitjacket for making banjo sounds, reciting limericks, barking into the microphone, and other strange behaviors, all strictly in the name of science. Oh, and if you can manage to use a vocoder and not sing the part you're playing, you have more discipline than I do!

Brian Smithers is Course Director of Audio Workstations at Full Sail Real World Education in Winter Park, Florida. Thanks to Joseph Rivers of the Audio Playground Synthesizer Museum (www.keyboardmuseum.org) for his generous assistance and access to the museum's collection of classic and contemporary vocoders.