HOW TO: Editing Spoken-Word Content

Techniques for more natural-sounding audio

There’s more spoken-word content being posted online than ever these days, from podcasts to video soundtracks, and it’s useful to know how to edit this material effectively. Spoken-word editing brings with it a unique set of issues and techniques compared to editing music. I’ll take you through some of the more important considerations.

Editing spoken-word material usually involves removing or minimizing glitches and momentary background noises. Unlike in music recordings, where such anomalies are often covered up by instruments, these events are quite obvious to the listener of a spoken-word recording, and will likely need mitigation. What’s more, when people are recorded speaking extemporaneously, they often inject “umms,” “ahhs,” “you knows,” and other interruptions (a.k.a. “disfluencies”) into their speech. You’ll want to remove them to make the program material more intelligible and smooth.

Many of the techniques mentioned here apply to both audio-only and video-soundtrack editing situations, but there is one important difference. If your speaker is on camera, you can’t remove words without creating either a jump cut—if you’re editing the video and audio together—or a disconcerting audio dropout, if you’re just editing the audio. Timing is also affected: In an audio-only situation, you generally have the leeway to completely cut out unnecessary words or noises from the program, slightly shortening the duration. On a video, you have to be careful not to do anything that moves the subsequent audio out of sync.
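To make the timing difference concrete, here is a minimal Python sketch that treats audio as a plain list of samples. The function names `cut_span` and `overwrite_span` are hypothetical, and filling the gap with looped room tone is an assumption made for illustration, not something a particular editor does for you. Cutting shortens the program and pulls everything after it earlier; overwriting keeps the length, so later audio stays where the picture expects it.

```python
def cut_span(audio, start, end):
    """Remove samples [start, end). The result is shorter, so all later audio moves earlier."""
    audio = list(audio)
    return audio[:start] + audio[end:]


def overwrite_span(audio, start, end, room_tone):
    """Replace samples [start, end) with room tone of the same length.

    The overall duration is unchanged, so audio after the span stays in sync.
    Assumption: the room tone is simply looped if it is shorter than the span.
    """
    audio = list(audio)
    if not room_tone:
        raise ValueError("room_tone must contain at least one sample")
    length = end - start
    fill = [room_tone[i % len(room_tone)] for i in range(length)]
    return audio[:start] + fill + audio[end:]


if __name__ == "__main__":
    program = [float(i) for i in range(10)]
    print(len(cut_span(program, 3, 6)))                      # shorter than the original
    print(len(overwrite_span(program, 3, 6, [0.5, -0.5])))   # same length as the original
```

In practice a looped fill can sound repetitive on long gaps, so a longer stretch of room tone is preferable when one is available.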


When editing spoken-word audio, your goal is to make your edits seamless, so nobody will know they’re there. Whether you’re getting rid of disfluencies or cutting out entire sections for editorial reasons, be careful to keep the speaker’s rhythms natural at the point of the edit. You’ll quickly discover that editing some words too close together creates the audio equivalent of a jump cut. Typically, the best way to fix that problem is to undo the edit and adjust its boundaries so that there’s a little more space between the words. Working at a sufficiently zoomed-in level is key to finding the best edit points, and a detailed, magnified view reveals edit possibilities you would miss when zoomed out.

If you’re unable to extend the pause enough by adjusting your edit, you can use room tone to add space at the edit point. Find a spot in your material where there’s a pause in speaking and only room tone is audible. Copy it and paste it at the edit point. When editing in a DAW, it’s always good practice to crossfade at the edit boundaries to avoid introducing clicks. As with any edit, make sure you’re zoomed in enough to set those boundaries accurately.
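The room-tone insert described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: audio is a plain list of float samples, the function names `crossfade_join` and `insert_room_tone` are hypothetical, and the crossfade is a simple linear one (a DAW typically offers several fade shapes).

```python
def crossfade_join(a, b, fade_len):
    """Join two clips, overlapping `fade_len` samples with a linear crossfade."""
    a, b = list(a), list(b)
    fade_len = max(0, min(fade_len, len(a), len(b)))
    head = a[:len(a) - fade_len]          # part of `a` before the overlap
    tail = b[fade_len:]                   # part of `b` after the overlap
    mixed = []
    for i in range(fade_len):
        g_in = (i + 1) / (fade_len + 1)   # incoming clip fades up
        g_out = 1.0 - g_in                # outgoing clip fades down
        mixed.append(a[len(a) - fade_len + i] * g_out + b[i] * g_in)
    return head + mixed + tail


def insert_room_tone(audio, edit_point, room_tone, fade_len):
    """Paste `room_tone` at `edit_point`, crossfading at both boundaries."""
    audio = list(audio)
    before, after = audio[:edit_point], audio[edit_point:]
    joined = crossfade_join(before, room_tone, fade_len)
    return crossfade_join(joined, after, fade_len)
```

Each crossfade overlaps the two clips, so the added pause is the room-tone length minus two fade lengths. Keep that in mind when deciding how much room tone to copy.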


Fig. 1. The selected section of this word is the plosive “P” at the beginning, before it’s reduced.

Microphones, especially when used without pop screens, are vulnerable to picking up plosives, those popped “P,” “B,” and other consonant sounds (see Fig. 1). Plosives are usually found at the beginning of words or syllables, and you can see their waveforms pretty easily, as their squiggly shape stands out from the rest of the word’s waveform. You generally don’t want to remove them completely, because you’ll lose the consonant sound altogether, but you do want to reduce their level.


Zoom in on the plosive, select it (but not any part of the rest of the word), and lower its level using either the gain feature of your DAW or audio editor, or volume automation. I usually start with a reduction of about 8 to 10 dB and adjust from there. If you’re using automation, it sometimes helps to slightly angle the drop at the beginning of the word and the rise in volume at the end of the plosive to smooth the transitions. (If you’re using iZotope RX Advanced, it has an amazing De-Plosive module that takes care of plosives in one click.)
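For readers who want to see the arithmetic, here is a minimal Python sketch of the same move. It assumes audio as a plain list of float samples. The function name `reduce_plosive` and its parameters are hypothetical, and the angled edges are modeled as linear ramps in amplitude, which is a simplification of what automation curves in a DAW do.

```python
def reduce_plosive(samples, start, end, reduction_db=9.0, ramp=0):
    """Lower the level of samples [start, end) by `reduction_db` decibels.

    `ramp` is the number of samples at each edge of the selection over which
    the gain slopes between full level and the reduced level, mimicking the
    angled automation lines described in the text.
    """
    if not 0 <= start <= end <= len(samples):
        raise ValueError("selection must lie inside the audio")
    gain = 10 ** (-reduction_db / 20)     # decibels to linear amplitude factor
    out = list(samples)
    for i in range(start, end):
        edge = min(i - start, end - 1 - i)   # distance to nearest selection edge
        if ramp > 0 and edge < ramp:
            frac = (edge + 1) / (ramp + 1)   # 0..1 position along the ramp
            g = 1.0 + (gain - 1.0) * frac
        else:
            g = gain
        out[i] *= g
    return out
```

A 9 dB reduction, the middle of the 8 to 10 dB starting range, multiplies the amplitude by about 0.35. Adjust `reduction_db` by ear from there.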


Fig. 2. An “S” sound like the one selected here is easy to graft onto another word to repair it.

Sometimes, you’ll run into a situation in which a speaker mispronounced a letter sound at the beginning or end of a word, or some sort of click or pop occurred at the same time, and the word sounds bad as a consequence. While you might think you have no choice but to leave that pop or click in, you can often either replace just that letter sound with another from the same recording, or replace the entire word. In the case of the former, find the same letter sound in another spot (“s” sounds are the easiest to find and to work with), copy it, and paste it in to replace the problematic letter sound (see Fig. 2). You might have to adjust its level to match it more closely.
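The level-matching step can be illustrated with a short Python sketch. It assumes audio as a plain list of float samples and uses RMS as the measure of level. The names `replace_sound` and `target_rms` are hypothetical. The target level is passed in by the caller, for example measured from a clean sound nearby, because the flawed sound itself may contain a click that would skew the measurement.

```python
import math


def rms(samples):
    """Root-mean-square level of a clip (0.0 for an empty clip)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def replace_sound(audio, start, end, donor, target_rms):
    """Replace samples [start, end) with `donor`, scaled to `target_rms`."""
    audio = list(audio)
    source = rms(donor)
    gain = target_rms / source if source > 0 else 1.0   # leave silence untouched
    matched = [x * gain for x in donor]
    return audio[:start] + matched + audio[end:]
```

If the donor sound is longer or shorter than the one it replaces, the overall length changes slightly. As with any edit, crossfading at the boundaries helps avoid clicks.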


Replacing an entire word can be trickier, because people don’t always use the same inflection when they say a word in a different sentence. You might copy and paste another instance of the word, only to discover that the pitch of the speaker’s voice was different in the sentence you took it from. If that’s the case, it will sound unnatural and out of place. You may have to try a few different versions of the word (if you can find them), and hope that one sounds similar to the problematic word you’re replacing.


Background noise, whether it results from the mic being too far from the subject, a noisy fan, an air conditioner, or some other source, can mar a spoken-word recording. To help mitigate noise, most audio editors have some sort of broadband noise-reduction feature built in. (Some offer deep control and some are quite basic.) A number of noise-reduction plug-ins are available, as well.

Broadband noise reducers “learn” the background noise from a sample that you select (some do it automatically as the audio plays), and use that to figure out which part of the signal needs reduction applied to it and which part should be left alone. The problem is that the process creates artifacts, and can make your audio sound unnatural and harsh. The trick is to apply as little as you can. Often you have to compromise, lowering but not eliminating the noise, in order to maintain decent audio quality.
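The learn-then-reduce idea and the compromise of lowering rather than eliminating the noise can be shown with a deliberately simplified Python sketch. This is not how a broadband noise reducer works internally: those operate on the frequency content of the signal. It is a much simpler time-domain stand-in (a downward expander) that learns a noise level from a noise-only sample and turns quiet passages down by a fixed amount. All names and default values here are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a clip (0.0 for an empty clip)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def learn_noise_floor(noise_sample):
    """'Learn' the noise level from a stretch that contains only background noise."""
    return rms(noise_sample)


def reduce_noise(audio, noise_floor, reduction_db=6.0,
                 block_size=256, margin=2.0, smooth=0.05):
    """Turn down blocks whose level is near the learned noise floor.

    Blocks at or below `noise_floor * margin` are attenuated by `reduction_db`;
    louder blocks (assumed to be speech) pass at full level. The gain is
    smoothed sample by sample so that it never jumps abruptly.
    """
    floor_gain = 10 ** (-reduction_db / 20)
    threshold = noise_floor * margin
    out = []
    gain = 1.0
    for start in range(0, len(audio), block_size):
        block = audio[start:start + block_size]
        target = floor_gain if rms(block) <= threshold else 1.0
        for x in block:
            gain += (target - gain) * smooth   # glide toward the target gain
            out.append(x * gain)
    return out
```

Raising `reduction_db` pushes the noise further down but makes the level change between speech and pauses more audible, which mirrors the advice above to apply as little reduction as you can.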

Mike Levine is a musician and producer in the New York area, and edits the weekly podcast of New Jersey Monthly magazine.