We often focus on the best techniques for capturing the voice in recording, and then processing it for the best sound in the mix, but for many projects, more time may be spent editing vocal recordings than anything else. From dialog/voiceover tracks to musical performances, the technical (and often creative) techniques employed to whip a voice recording into shape are a big part of what goes into that seamless podcast or musical arrangement. Here are tips on many of the techniques for slicing and dicing vocal recordings to perfection.
All voice editing requires the ability to make good, clean cuts. Percussive sounds can be easy to edit—you can usually see the sharp transient edges of each note, and cutting at those transients may help to mask any clicks that might result from slicing up the wave. But vocals—spoken and sung—can be more continuous, and cutting the wave where needed can often produce a digital pop or click from the discontinuity the cut creates in the wave.
There are two main ways to avoid this. One method is to add very short fades at the beginning and end of each cut, or short crossfades when two edited regions are joined. These should be around 10 milliseconds or so—they won’t be audible, but they will smooth out any sudden jumps in the wave from the edits that would otherwise result in clicks on playback. A more exacting approach (sometimes warranted on more troublesome edits) is to cut at “Zero Crossings”. These are points where the wave crosses the center line—momentary silence. Cuts and transitions at these points won’t make any sound—it’s the best approach, though obviously more time-consuming (some DAWs offer the option to force all cuts to Zero-Crossings, but that can’t always be used, especially when musical timing comes into play).
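For readers who script their own audio tools, the zero-crossing idea is easy to sketch: look for consecutive samples whose signs differ, and snap the requested cut point to the nearest one. This is a minimal numpy illustration (the 100 Hz test tone and 48 kHz rate are just for the example; the function name is my own):

```python
import numpy as np

def nearest_zero_crossing(wave, target):
    """Return the sample index nearest `target` where the wave crosses zero."""
    signs = np.sign(wave)
    # indices where the sign flips between consecutive samples
    crossings = np.where(np.diff(signs) != 0)[0]
    if len(crossings) == 0:
        return target  # no crossing found; fall back to the requested cut point
    return int(crossings[np.argmin(np.abs(crossings - target))])

# A 100 Hz sine at 48 kHz crosses zero every 240 samples,
# so a cut requested at sample 1000 snaps back to sample 960.
sr = 48000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 100 * t)
cut = nearest_zero_crossing(wave, 1000)
```

A DAW's "snap to zero crossing" option is doing essentially this search on every edit.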
With podcasts, talking books, and the like firmly established, more people may find themselves editing the spoken word. This can bring its own issues—since the voice is isolated, the editing concerns can be more critical than for a vocal half-buried in a musical arrangement. One such concern is background sound. A dialog recording needs to be clean—there’s no accompaniment to cover over unwanted leakage of background sound like ambient noise, AC hum, rustling, buzzing, or other potential annoyances. Again, there are two traditional approaches to eliminating the distraction of leakage in a dialog recording.
A processing approach would be to use a Noise Gate—a real-time processor that removes audio below a user-set level (Threshold). Most leakage is lower in level than the voice itself, and will be covered up when the person is speaking, so it only needs to be removed when it becomes audible, between words and phrases. But you have to be careful using a Gate—even with the best settings, it can often unintentionally remove little bits of the vocal itself (quiet beginnings/endings of words). You may want to check from beginning to end for any unacceptable artifacts—the same setting may simply not work through an entire recording.
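The gate's core logic (mute anything below the Threshold) can be sketched in a few lines. This is a deliberately crude numpy version, just to show the idea: the 0.05 threshold and 10 ms window are illustrative values, and a real gate adds attack, hold, and release smoothing so the muting isn't abrupt:

```python
import numpy as np

def noise_gate(wave, sr, threshold=0.05, window_ms=10.0):
    """Crude noise gate: mute any window whose peak falls below threshold.
    Real gates smooth the transitions with attack/release times."""
    gated = wave.copy()
    win = max(1, int(sr * window_ms / 1000))
    for start in range(0, len(wave), win):
        block = wave[start:start + win]
        if np.max(np.abs(block)) < threshold:
            gated[start:start + win] = 0.0
    return gated

# A loud "word" followed by quiet "leakage": the word passes, the leakage is muted.
sr = 48000
voice = 0.5 * np.ones(sr // 10)   # 100 ms of voice-level signal
noise = 0.01 * np.ones(sr // 10)  # 100 ms of low-level hiss
out = noise_gate(np.concatenate([voice, noise]), sr)
```

The artifact risk described above comes from exactly this mechanism: a quiet word ending can dip below the threshold just like leakage does.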
An editing alternative would be the Strip-Silence feature included with most DAWs. This does the same thing as a Gate, but by actually cutting the Regions/Clips into pieces. The disadvantage is now having all those little bits to contend with; the advantage is that it’ll be easier to fix individual regions with further (manual) tweaks to the stripped Regions’ boundaries.
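Strip-Silence differs from a gate in its output: instead of muting samples, it returns region boundaries that you can then tweak by hand. A rough sketch of that detection pass, under the same illustrative threshold/window assumptions as above:

```python
import numpy as np

def strip_silence(wave, sr, threshold=0.02, window_ms=10.0):
    """Return (start, end) sample ranges of the regions that remain after
    removing below-threshold audio - a rough sketch of a DAW's Strip Silence."""
    win = max(1, int(sr * window_ms / 1000))
    regions, start = [], None
    for i in range(0, len(wave), win):
        loud = np.max(np.abs(wave[i:i + win])) >= threshold
        if loud and start is None:
            start = i                      # region begins
        elif not loud and start is not None:
            regions.append((start, i))     # region ends
            start = None
    if start is not None:
        regions.append((start, len(wave)))
    return regions

# Silence, a phrase, silence, another phrase -> two regions.
sr = 1000
clip = np.concatenate([np.zeros(100), 0.5 * np.ones(100),
                       np.zeros(100), 0.5 * np.ones(100)])
regions = strip_silence(clip, sr)
```

Each returned range corresponds to one of the "little bits" the article mentions; the manual fix-up step is simply nudging those start/end values.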
When a dialog recording has been Gated/Stripped, the dead silence between regions may be distracting—most voice tracks have some subliminally audible ambient sound—“Room Tone”—under the voice, and if this is gating in & out, it calls attention to what would otherwise never be noticed. Dialog editors need to insert regions of (matched) Room Tone (extracted from the same recording) between words and phrases, to maintain continuity, and keep the background ambience from calling attention to itself by its occasional absence.
Dialog editing often involves more creative decision-making. You may need to edit out “uhs”, stutters, distracting pauses, vocal stumbles, and sometimes certain words or phrases deemed unnecessary, or unsuitable for broadcast. Done well, these kinds of edits can make a tentative, halting speaker sound smooth and focused, but done poorly, it could potentially sound unnatural. There are a number of considerations that come into play—here are a few of the main ones.
• Cadence: most speakers have their own natural rhythm, and when editing a particular person—especially if more extensive cuts are made for running time and content—it’s important to listen before you cut! You need to get a sense of the natural rhythm and patterns of that individual’s speech (his/her cadence), and when you make the edits, you’ll want to preserve that quality, so the edited part actually sounds like the same person. If they’re normally a fast talker, then your edits shouldn’t slow down only certain sections of the recording; if they’re a more deliberate speaker, the edits shouldn’t make them sound like they’re on a coffee rush; if they tend to phrase things a certain way, then any longer edits that create a new phrase should maintain that same speaking style, integrating edited sections seamlessly into the whole.
• Inflection: Sometimes you have to substitute a phrase, a word, or an individual syllable (or even a letter, like an “s”!). When you do, you need to make sure the inflection of the new bit fits in properly with the overall phrase. Substituting a word with a higher, or rising pitch can accidentally turn a statement into a question (!), and poor matches can be distracting in any number of ways.
• Timing and Phrasing: Sometimes—especially in Video Post Production—you may fly in an alternate take, or a looped (re-recorded) dub, of a line or phrase that sounds better than the original (location) recording, but doesn’t quite match up to the timing of the lips onscreen, resulting in poor lip sync. The tedious cutting/moving of the old days can now be done much more easily with time-based editing features like Elastic Audio & Flex Edit. For the specific situation I described, the granddaddy of this is Synchro Arts’ VocAlign (and its more powerful successor, Revoice Pro), which automatically matches the timing of one phrase to another—an indispensable tool for Post editors who have to make tons of these edits in a day’s work.
In both dialog and sung vocal recordings, breath noise can be an issue. You may want to edit out distracting breaths between phrases (often brought out more by compression), but the trick is to not remove them entirely—which can subliminally make the performance sound unnatural—but to reduce them to an appropriately subtle level. This can be accomplished with Automation (drawing in level drops for the breaths), or in real time with an Expander (plug-in), which is a Gate set to a less-extreme setting—instead of removing the breaths entirely, it would be set to lower their levels enough to make them stop calling attention to themselves.
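The Expander approach described above (duck, don't delete) can be sketched in numpy. This is the bare idea only: the threshold and ratio are illustrative, and a real expander works on a smoothed level envelope with attack/release times rather than on raw samples:

```python
import numpy as np

def downward_expander(wave, threshold=0.1, ratio=0.25):
    """Downward expander sketch: samples below threshold are scaled toward
    silence by `ratio` rather than removed outright, so breaths are ducked,
    not deleted."""
    out = wave.copy()
    quiet = np.abs(out) < threshold
    out[quiet] *= ratio
    return out

# A phrase at 0.5 followed by a breath at 0.08: the breath drops to 0.02,
# quieter but still present.
signal = np.concatenate([0.5 * np.ones(100), 0.08 * np.ones(50)])
ducked = downward_expander(signal)
```

Setting `ratio` to 0 would turn this back into a hard gate, which is exactly the "unnaturally silent" result the article warns against.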
The blasts of air that naturally occur on some syllables—“P” and “B”, mostly—can shake the mic’s diaphragm, causing loud low-frequency thumps whenever those letters come around—these are called “P-pops”, or, more technically, Plosives. They’re best avoided at the source, but if you’re stuck with them in a recording, an editing solution is to cut right at the point of each pop—they should be clearly visible if you zoom in a little on the wave—and add a short Fade-in through the Pop, softening it. The Fade length can be adjusted to make sure the consonant still comes through clearly.
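The fade-in fix for a plosive is just a gain ramp applied from the start of the pop. A minimal numpy sketch (function name and the 20 ms fade length are my own illustrative choices; in practice you'd tune the length by ear so the consonant survives):

```python
import numpy as np

def fade_in_at(wave, pop_start, fade_ms, sr):
    """Apply a linear fade-in starting at the pop, softening the
    low-frequency thump while letting the consonant's tail through."""
    out = wave.copy()
    n = int(sr * fade_ms / 1000)
    end = min(pop_start + n, len(out))
    ramp = np.linspace(0.0, 1.0, end - pop_start)  # 0 -> 1 gain ramp
    out[pop_start:end] *= ramp
    return out

# Everything before the pop is untouched; the pop region ramps up from silence.
sr = 1000
clip = np.ones(100)
fixed = fade_in_at(clip, pop_start=10, fade_ms=20.0, sr=sr)
```

Many DAWs let you do the same thing non-destructively by splitting the clip at the pop and dragging a fade handle.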
To make life easier, some editing software offers dedicated automatic processing for both Breath noise and Plosives—check out iZotope’s RX audio repair suite.
For music vocal tracks, one of the most widespread vocal editing techniques is Comping. While many singers are capable of sustaining a good performance from start to finish of a take, the repeated listening a recording will be subjected to demands what may often be an unattainable level of musical perfection. So the singer lays down several takes of a part, and the editor goes through them, selecting the best bits of each take, and combining these into a single take—the “Comp”, or Composite finished vocal track.
There are different schools of thought on how to approach this. Some editors like to assemble the Comp from tiny bits, grabbing individual words and short phrases into the final version. Others (and I’m one of those others) feel you get a more musical result by selecting the best overall take, or at least long sections of a couple of takes, and only comping those shorter bits that are needed to fix clams or less-than-ideal phrasing here and there. This approach may yield a more natural musical arc over the course of the entire song (though I must admit that for some, shall we say, “less-experienced” singers, I have used the more piecemeal approach successfully). Many DAWs make Comping easier by grouping and displaying alternate Takes, and allowing the editor to simply swipe over the desired sections to create the final Comp (at the top), which can then be rendered (Bounced) when it’s done, for convenience in further editing and mixing.
The other most common editing technique for vocals is probably doubling: combining two (or more) tracks of the same vocal part, for the richer, thicker tone that comes from voices singing together. There are many ways to go about this, and not all involve editing. Doubling can be accomplished at the outset, in recording, by simply having the singer record (at least) two complete takes. And it can be done at the mix stage, by duplicating a vocal track, and delaying the copy by about 15–20 milliseconds, to create an artificial doubled part (a.k.a. ADT—Artificial Double-Tracking).
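The ADT trick is simple enough to sketch directly: sum the track with a copy delayed by roughly 15–20 ms. This numpy version is a static sketch under my own parameter choices (18 ms delay, 0.7 mix); classic tape ADT also wobbled the delay slightly to avoid a fixed comb-filter sound, and that modulation is omitted here:

```python
import numpy as np

def adt_double(wave, sr, delay_ms=18.0, mix=0.7):
    """Artificial Double-Tracking sketch: sum the track with a copy
    delayed by ~15-20 ms at a reduced level."""
    delay = int(sr * delay_ms / 1000)
    doubled = np.zeros(len(wave) + delay)
    doubled[:len(wave)] += wave        # the original take
    doubled[delay:] += mix * wave      # the delayed "double"
    return doubled

# An impulse shows the effect clearly: the original at sample 0,
# the double 864 samples (18 ms at 48 kHz) later.
sr = 48000
take = np.zeros(sr // 10)
take[0] = 1.0
out = adt_double(take, sr)
```

The fixed delay is exactly why ADT sounds "unnaturally tight" compared to a real second performance, as the next paragraph discusses.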
But if you don’t have any extra takes, and if you prefer the looser, more natural doubling of two different performances to the unnaturally tight quality of the ADT effect, then there’s an editing technique that can help, at least with some vocal parts. This will work on sections of a song that repeat verbatim, like Choruses—you can rearrange different sections of the same recording. So, you could grab the vocal part from Chorus 2 and fly it in as a doubling under the same track’s vocal in Chorus 1 (again, assuming the lyric & timing are a good-enough match). With Choruses that are made up of a repeating line, I’ve even taken several lines from the same chorus and mixed them up, for a nice, natural, human doubled feel.
These are some of the most common editing considerations and techniques that any editor is likely to find him/herself dealing with in all kinds of vocal sessions. While modern software may often (at least try to) offer turnkey solutions, these are all still essential editing skills that any engineer will need when dealing with that most important production element—the voice.