A Stitch in Time


As every musician knows, pitch and time go hand in hand — if you speed up the playback of a tape machine, the song's tempo will increase, and its pitch will rise. But though nobody has been able to repeal the laws of physics, it is now possible to manipulate a sound's timing and pitch independently: if we can adequately describe a sound's behavior over time, we can reproduce that behavior over a different time scale realistically. This process of analysis and resynthesis is fundamental to the time-compression/expansion, pitch-shifting, and automatic-tuning processes that have inundated music production.

A common technique used to transpose samples is to play the sample back at a different sampling rate. For example, if you play a 44.1 kHz sample back at 22.05 kHz, it will sound an octave lower and for twice its original duration. If it's a rhythmic sample, such as a drum loop, it will play back at half its original tempo. That is exactly analogous to changing the playback speed of a tape machine, and it suffers from the same timbral shifts, creating mud at lower sampling rates and cartoonish effects at higher ones.
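
The arithmetic behind that tape-speed behavior is easy to check. The sketch below is only an illustration: the function name `varispeed` and the 2-second, 120 bpm loop are made-up assumptions, not part of any particular sampler.

```python
import math

def varispeed(original_rate_hz, playback_rate_hz, duration_s, tempo_bpm):
    """Predict what plain resampled playback does to pitch, length, and tempo."""
    ratio = playback_rate_hz / original_rate_hz
    semitones = 12 * math.log2(ratio)    # pitch change in equal-tempered semitones
    new_duration_s = duration_s / ratio  # slower playback lasts longer
    new_tempo_bpm = tempo_bpm * ratio    # and drags the tempo with it
    return semitones, new_duration_s, new_tempo_bpm

# Hypothetical 2-second, 120 bpm loop sampled at 44.1 kHz, played at 22.05 kHz
semitones, new_duration_s, new_tempo_bpm = varispeed(44100, 22050, 2.0, 120.0)
print(semitones, new_duration_s, new_tempo_bpm)  # -12.0 4.0 60.0
```

Pitch, duration, and tempo all hang on the same ratio, which is exactly why this method cannot change one without the others.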

Phase One

To do the job properly, you need to separate frequency information (on which pitch and timbre are based) from timing information (on which articulation and rhythm are based). The phase vocoder is equipped to do just that. A phase vocoder — not the more common channel vocoder that makes synthesizers seem to talk — picks apart a sound's harmonic structure, allowing the sound to be manipulated from the inside out and then recombined.


FIG. 1: Preserving formants after phase-vocoder pitch shifting is important. When sound A is transposed its formant is transposed (B). Applying the spectral envelope of A to the pitch-shifted version restores the formant to its characteristic location (C).

The phase vocoder extracts the harmonic content of a signal by performing a Fourier transform. The signal is first sliced into small chunks of time called windows. Each window is then analyzed for its frequency content: a large number of frequency bands called bins (typically as many as 1,024) are examined for the amplitude and phase of any harmonic components they contain. The exact pitch of a harmonic is determined by comparing the phase in the same bin at two adjacent windows. The more the measured phase shift departs from the shift expected of a tone sitting exactly at the bin's center frequency, the farther the harmonic is from that center, so the phase shift can be used to identify the frequency precisely.
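
Here is a minimal sketch of that phase trick, assuming a Hann window and a single bin computed by brute force rather than a full FFT. The function names, the 8 kHz sampling rate, and the 1,010 Hz test tone are illustrative assumptions only.

```python
import cmath
import math

def bin_phase(signal, start, size, k):
    """Phase of DFT bin k for a Hann-windowed frame beginning at `start`."""
    acc = 0j
    for n in range(size):
        w = 0.5 - 0.5 * math.cos(2 * math.pi * n / size)  # Hann window
        acc += w * signal[start + n] * cmath.exp(-2j * math.pi * k * n / size)
    return cmath.phase(acc)

def estimate_frequency(signal, sample_rate, size, hop, k):
    """Refine the frequency in bin k from the phase advance between two windows."""
    phase_a = bin_phase(signal, 0, size, k)
    phase_b = bin_phase(signal, hop, size, k)
    expected = 2 * math.pi * k * hop / size  # advance of a tone exactly at bin center
    deviation = phase_b - phase_a - expected
    deviation = (deviation + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    true_bin = k + deviation * size / (2 * math.pi * hop)
    return true_bin * sample_rate / size

sample_rate, size, hop = 8000, 256, 64
tone = [math.sin(2 * math.pi * 1010 * n / sample_rate) for n in range(size + hop)]
k = round(1010 * size / sample_rate)  # nearest bin: 32, centered on 1,000 Hz
estimate = estimate_frequency(tone, sample_rate, size, hop, k)
print(round(estimate, 1))
```

The bins here are 31.25 Hz apart, yet the phase comparison should land very close to the tone's true 1,010 Hz. The hop must be short enough that the phase deviation stays within half a turn, or the estimate wraps around.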

The phase vocoder is like a filter bank that measures the time-varying energy of the signal passing through each bandpass filter. This interpretation reveals the family resemblance to the channel vocoder, which directly maps this information to envelope generators on corresponding bands of a carrier wave. The phase vocoder, however, can use this information to synthesize an output signal directly. A perfect phase vocoder would be able to analyze a signal and then resynthesize it from scratch accurately.

Note that we have just successfully separated frequency from time. By remapping the envelope of any filter to another filter we can change frequency content without affecting timing. By changing the rate at which the envelope information is re-created, we can change timing without changing pitch.
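
To see why changing the re-creation rate leaves pitch alone, consider this stripped-down sketch of resynthesis with a bank of sine oscillators. It assumes the analysis has already been done and boiled down to per-window (amplitude, frequency) pairs; the names `resynthesize`, `hop_out`, and `zero_crossing_rate` are hypothetical, and a real phase vocoder would also interpolate between windows.

```python
import math

def resynthesize(frames, sample_rate, hop_out):
    """Oscillator-bank resynthesis from per-window band envelopes.

    frames:  list of analysis windows, each a list of (amplitude, frequency_hz)
             pairs, one pair per band.
    hop_out: output samples generated per window. A hop different from the
             analysis hop changes duration but not pitch.
    """
    phases = [0.0] * len(frames[0])
    out = []
    for frame in frames:
        for _ in range(hop_out):
            sample = 0.0
            for band, (amp, freq) in enumerate(frame):
                phases[band] += 2 * math.pi * freq / sample_rate
                sample += amp * math.sin(phases[band])
            out.append(sample)
    return out

def zero_crossing_rate(samples):
    """Sign changes per sample, a rough stand-in for pitch on a single sine."""
    changes = sum((a < 0) != (b < 0) for a, b in zip(samples, samples[1:]))
    return changes / len(samples)

frames = [[(1.0, 440.0)]] * 100  # a steady 440 Hz band, 100 windows long
normal = resynthesize(frames, 8000, 64)
stretched = resynthesize(frames, 8000, 128)
print(len(normal), len(stretched))
print(zero_crossing_rate(normal), zero_crossing_rate(stretched))
```

The stretched version is twice as long, but its oscillators still run at 440 Hz. Remapping a band's envelope to an oscillator at a different frequency would do the opposite: change the pitch content and leave the timing alone.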

Smiling Phases

If we shift the entire harmonic spectrum of a sound equally when changing pitch using phase-vocoding techniques, we get the same timbral degradation that variable-speed playback exhibits. That is due to the shifting of the formant along with the fundamental. A formant is a characteristic emphasis or peak in a sound's frequency spectrum that does not change along with the fundamental. It is caused by the natural resonance of an instrument or vocal tract, and it is a primary means by which our ears distinguish one voice or instrument from another. It is, in fact, the shifting of the formant that causes the chipmunk effect.

Fixing the formant problem is simple in principle. After the pitch has been shifted, the spectral envelope of the original signal is mapped onto the new sound, restoring the formant (see Fig. 1). Of course, once we know how to fix a formant, we know how to manipulate it for creative effect, such as mapping the formant of a female voice onto a male voice.
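
The envelope mapping can be sketched on a toy magnitude spectrum. The moving-average envelope below is a deliberately crude assumption (real products are more likely to use cepstral or linear-prediction envelopes), and the function names and the 256-bin spectrum are invented for the example.

```python
import math

def smooth_envelope(mags, width):
    """Crude spectral envelope: moving average over `width` bins."""
    half = width // 2
    env = []
    for k in range(len(mags)):
        lo = max(0, k - half)
        hi = min(len(mags), k - half + width)
        env.append(sum(mags[lo:hi]) / width)
    return env

def restore_formant(original, shifted, width):
    """Divide out the shifted envelope, then impose the original one."""
    env_a = smooth_envelope(original, width)
    env_b = smooth_envelope(shifted, width)
    return [s * a / (b + 1e-12) for s, a, b in zip(shifted, env_a, env_b)]

bins = 256
# Sound A: harmonics every 8 bins under a formant peak centered on bin 80
original = [0.0] * bins
for k in range(8, bins, 8):
    original[k] = math.exp(-((k - 80) ** 2) / 800)

# Sound B: everything moved up a fifth (ratio 3:2), formant included
shifted = [0.0] * bins
for k in range(8, bins, 8):
    j = k * 3 // 2
    if j < bins:
        shifted[j] = original[k]

# Sound C: the harmonics of B under the envelope of A
restored = restore_formant(original, shifted, 24)
peak_b = shifted.index(max(shifted))
peak_c = restored.index(max(restored))
print(peak_b, peak_c)
```

In B the strongest harmonic has migrated to bin 120 along with the formant. In C the harmonics stay at their new spacing of 12 bins, but the strongest one is again in the neighborhood of bin 80.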

Another common failing of the phase vocoder is its tendency to smear transients. Since the Fourier transform is performed in windows of fixed size, it is impractical to describe events such as transients, whose durations are shorter than a window.
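
A little arithmetic shows how coarse the windows are compared with a drum hit. This sketch assumes a 44.1 kHz sampling rate; the window sizes and the function name are arbitrary choices for illustration.

```python
def window_tradeoff(window_size, sample_rate):
    """Return (window duration in ms, spacing between bins in Hz)."""
    duration_ms = 1000 * window_size / sample_rate
    bin_spacing_hz = sample_rate / window_size
    return duration_ms, bin_spacing_hz

for window_size in (256, 1024, 4096):
    duration_ms, bin_spacing_hz = window_tradeoff(window_size, 44100)
    print(window_size, round(duration_ms, 2), round(bin_spacing_hz, 2))
```

Longer windows give finer frequency resolution but blur anything shorter than the window, and a stick attack can be over in a few milliseconds. Shorter windows catch the attack but can no longer tell neighboring low harmonics apart.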

Time Out

Given the limitations of phase-vocoder techniques, let's dream up the ideal method for stretching or shrinking the timing of a musical phrase. Each note has a waveform, a pattern that repeats until the next note is sounded. Shortening a note should therefore be a simple matter of deleting a couple of iterations of the waveform, and lengthening a note would involve repeating the waveform a couple of times.

That is essentially the strategy used by Time Domain Harmonic Scaling (TDHS). First the fundamental pitch is determined, and from that a single cycle of the waveform is extracted. This cycle can then be stitched together with overlapping copies of itself to extend a sound (see Fig. 2). To shorten the duration of a sound, some cycles are discarded. To change pitch, cycles are squeezed closer together or spaced farther apart, resulting in more or fewer cycles per second.
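
The following is a bare-bones sketch of the idea, not any product's algorithm. It assumes a steady monophonic note, finds the period with a plain autocorrelation, and overlap-adds two-cycle grains under a Hann window; the function names and the test tone are invented for the example.

```python
import math

def estimate_period(signal, min_lag, max_lag):
    """Pick the lag with the highest autocorrelation as the fundamental period."""
    best_lag, best_score = min_lag, float("-inf")
    span = len(signal) - max_lag
    for lag in range(min_lag, max_lag + 1):
        score = sum(signal[n] * signal[n + lag] for n in range(span))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def tdhs_stretch(signal, period, stretch):
    """Repeat (stretch > 1) or skip (stretch < 1) cycles, crossfading the joins."""
    grain_len = 2 * period
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / grain_len)
              for n in range(grain_len)]
    src_cycles = len(signal) // period - 1  # two-cycle grains that fit the source
    out_cycles = int(src_cycles * stretch)
    out = [0.0] * ((out_cycles + 1) * period)
    for j in range(out_cycles):
        src = min(int(j / stretch), src_cycles - 1)  # source cycle to reuse
        for n in range(grain_len):
            out[j * period + n] += window[n] * signal[src * period + n]
    return out

# Test note: a fundamental plus one overtone, 50 samples per cycle
note = [math.sin(2 * math.pi * n / 50) + 0.5 * math.sin(4 * math.pi * n / 50)
        for n in range(2000)]
period = estimate_period(note, 20, 80)
longer = tdhs_stretch(note, period, 2.0)
print(period, len(note), len(longer))
```

Because the grains are laid down one period apart, the cycles per second, and therefore the pitch, stay put while the note roughly doubles in length. Placing the grains closer together or farther apart than one period is what changes the pitch. Note that the lag range passed to `estimate_period` matters: widen it too far and the search can lock onto two cycles instead of one.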


FIG. 2: Time Domain Harmonic Scaling is a common technique for time stretching. The sound is dissected at the level of a cycle of its fundamental. These building blocks are then overlapped and recombined at a new time base. Here, the original timing of the cycle T has been increased to an interval T'' that is adjusted by a small variable V to allow for overlapping at points where the waveforms are similar.

If you've ever snipped a bit out of the middle of a waveform to shorten a note or copied and crossfaded a segment to extend a note, you've done the same thing; you doubtless chose the spot at which you stitched things together carefully, looking for similar characteristics before and after the crossfade to ensure a smooth transition. TDHS does the same thing, adjusting the splice point between adjacent cycles to find an overlap of similar signals.
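
That hunt for a good splice point can be sketched as a small search. The version below scores candidates with a plain, unnormalized cross-correlation, which is an assumption made for brevity; the function name and the numbers in the example are hypothetical.

```python
import math

def best_splice_offset(signal, reference_start, ideal_start, length, search):
    """Find the offset V in [-search, search] at which the segment starting at
    ideal_start + V best resembles the segment starting at reference_start."""
    best_offset, best_score = 0, float("-inf")
    for offset in range(-search, search + 1):
        start = ideal_start + offset
        score = sum(signal[reference_start + n] * signal[start + n]
                    for n in range(length))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset

# A tone with 50 samples per cycle. The time base asks for a splice at
# sample 130, which falls in mid-cycle relative to the reference at sample 0.
tone = [math.sin(2 * math.pi * n / 50) for n in range(400)]
v = best_splice_offset(tone, 0, 130, 50, 25)
print(v, 130 + v)  # 20 150
```

The search nudges the splice from sample 130 to sample 150, the nearest point where the waveform lines up with the reference. That nudge is the small adjustment V in Fig. 2, and it is also the source of the timing irregularities this method can introduce.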

A chief advantage of this technique is that it uses the actual transients of notes, rather than resynthesizing them as the phase vocoder does, so there is less smearing. The primary challenge is to accurately determine the fundamental at any given point. In fact, TDHS is less suited to dealing with polyphonic material because of the difficulty of finding an appropriate cycle to copy. Another issue that can arise with TDHS is anisochrony, or irregular timing. The price of fooling Mother Nature is that optimizing sound quality compromises rhythmic precision. Some plug-ins that use TDHS offer a choice of algorithms optimized for sound quality or timing to allow the user to choose the trade-off.

By combining these two approaches, you can get good transient response and deal well with polyphonic material. First, the signal is decomposed into transient elements and a residual signal. The transients are processed with TDHS, and the residual is processed with phase vocoding. The two parts are then recombined. Naturally, the decomposition/recomposition process presents its own challenges — most typically, phase issues.
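
As a rough illustration of the decomposition step only, the sketch below flags any block whose energy jumps sharply over the previous block as a transient and leaves the rest as residual. The hard, block-sized gating, the threshold, and the function name are all simplifying assumptions; a practical splitter would crossfade the boundaries, which is where the phase issues come in.

```python
import math

def split_transients(signal, block, threshold):
    """Return (transient, residual) with transient[n] + residual[n] == signal[n]."""
    transient = [0.0] * len(signal)
    residual = list(signal)
    previous_energy = None
    for start in range(0, len(signal) - block + 1, block):
        energy = sum(x * x for x in signal[start:start + block])
        jumped = (previous_energy is not None
                  and energy > threshold * (previous_energy + 1e-9))
        if jumped:
            for n in range(start, start + block):
                transient[n], residual[n] = signal[n], 0.0
        previous_energy = energy
    return transient, residual

# A quiet sustained tone with a short click starting at sample 256
sound = [0.1 * math.sin(2 * math.pi * n / 32) for n in range(512)]
for n in range(256, 264):
    sound[n] += 1.0

transient, residual = split_transients(sound, 64, 4.0)
flagged = [n for n, x in enumerate(transient) if x != 0.0]
print(flagged[0], flagged[-1])
```

In a hybrid stretcher, `transient` would then go to the TDHS-style process and `residual` to the phase vocoder before the two are summed again.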

Pitch a Fit

It's difficult to tell exactly what methods a particular program or plug-in uses, especially when its maker hides behind marketing buzzwords suggesting that the problem has been solved by modeling the human ear. (See the sidebar “A Few Favorites” for a list of the software I think is particularly effective.) But regardless of technique, understanding the fundamental principles gives you some clues about using time stretching and pitch-shifting successfully.

For example, if you can process mono tracks instead of an entire mix, you'll get better results. And get to know the options available in your program. If it offers different algorithms, test them on various program material and at different amounts of pitch- or time shift. Check out the presets or “intelligent” parameters, but experiment with them to see if you can do better.

To minimize the timing anomalies during a major pitch shift, chop a phrase into small segments. I used this technique to prevent cumulative timing problems when turning a tenor sax part into a baritone sax part. Chopping up the phrase allowed me to keep the timing tight, even though I used the plug-in's timbre-optimized algorithm.

The laws of physics may bend, but they don't break. Newer, more sophisticated algorithms will make our deceptive practices easier to hide, but Nature will continue to fight us. Knowing the pitfalls is the first step in learning to avoid them. Keep the compromises in mind, and always respect your Mother!

Brian Smithers is Course Director of Advanced Audio Workstations at Full Sail Real World Education in Winter Park, Florida.


A Few Favorites

Pitch- and time manipulation is now a standard feature of most major DAWs, and for making only small changes, any of the built-in effects is useful. Some programs, however, go above and beyond in their capabilities. Digital Performer's Spectral Effects is a standout for its imaginative yet intuitive interface. Pro Tools' TCE Trim tool is easily the most convenient implementation I've used, and the fact that it can use third-party TCE plug-ins is especially nice. For better results at more extreme shifts, SoundToys' Speed is excellent. Although it's out of reach for most project studios, the pitch algorithm in TC Electronic's System 6000 is utterly transparent.

Serato's Pitch ’n Time features a graphic mode that just begs for creative manipulation. It allows you to manipulate different segments of an audio region independently, so you can slow the initial kick of a drum loop down, changing its character without altering the tempo of the overall loop. Celemony's Melodyne features startlingly good pitch- and time scaling, and has an intuitive display of an audio region's pitch and rhythm characteristics, letting you edit them easily and independently. Finally, Sony's Acid 5 has 19 new time-stretching modes, one of which will surely fit your needs.