Modeling the human voice is a daunting challenge.
Ever since Yamaha introduced the VL1 in 1994, the importance of physical modeling as a synthesis and signal-processing paradigm has continued to grow. Re-creating the sound of most acoustic instruments this way demands considerable computational power, yet it's relatively simple compared with modeling the human voice. Apart from the changing length of a vibrating string or air column, wind and string instruments have fixed dimensions and constant acoustic properties. By contrast, the human vocal tract changes size, shape, and stiffness continuously and dynamically, making an accurate mathematical description almost impossible.
A more manageable approach is to analyze the acoustic properties of vocal sound and use this information to modify certain aspects of that sound. TC Helicon (www.tc-helicon.com), a joint venture of IVL Technologies and TC Electronic, is using this method to develop tools for people who work with the singing and speaking voice.
The first step is to understand the characteristics of various types of singers and styles of music. According to Brian Gibson, chief technical officer at IVL, "We look at things that happen in the resonant structures of the vocal tract as people sing higher and lower. We also look at different effects in the glottis [see Fig. 1], such as vibrato and trilling." In addition, the team does perceptual studies to determine which acoustic features of various vocal sounds make an identifiable difference to the average listener.
Even though TC Helicon is not trying to model the actual physics of the singing voice, the task is daunting enough. For example, most people think of vibrato as simple pitch modulation, but vocal vibrato is much more complex: it involves modulation of pitch, amplitude, resonance, phase, and other aspects of the sound. These patterns are very difficult to simulate, but without them it's immediately apparent that the vocal sound is not natural.
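To get a feel for why coupled modulation matters, here is a crude Python sketch (my own illustration, not TC Helicon's algorithm, with rates and depths chosen arbitrarily) contrasting naive pitch-only vibrato with a version in which amplitude and brightness move along with the pitch:

```python
import numpy as np

SR = 44100          # sample rate in Hz
DUR = 1.0           # duration in seconds
F0 = 220.0          # base pitch in Hz
VIB_RATE = 5.5      # vibrato rate in Hz, typical of trained singers

t = np.arange(int(SR * DUR)) / SR
lfo = np.sin(2 * np.pi * VIB_RATE * t)

# Naive vibrato: pitch modulation alone, which tends to sound synthetic.
phase_naive = 2 * np.pi * np.cumsum(F0 * (1 + 0.01 * lfo)) / SR
naive = np.sin(phase_naive)

# Coupled vibrato: pitch, amplitude, and brightness (here, the level of a
# second harmonic) all track the same LFO, loosely mimicking the linked
# modulations described in the article.
phase = 2 * np.pi * np.cumsum(F0 * (1 + 0.01 * lfo)) / SR
amp = 1 + 0.15 * lfo                 # amplitude rides the pitch swing
bright = 0.3 * (1 + 0.5 * lfo)      # harmonic content varies in sympathy
coupled = amp * (np.sin(phase) + bright * np.sin(2 * phase))
coupled /= np.max(np.abs(coupled))  # normalize to full scale
```

Even this toy version hints at the problem: each extra coupled dimension is another pattern that must be measured from real singers, because guessed depths and phases are exactly what the ear flags as artificial.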
The result of all this analysis is a set of more than 100 parameters applied to an input signal in near real time. The signal is dissected into various elements, such as voiced (periodic) and unvoiced (aperiodic noise) components. These elements are then modified according to the model parameters and resynthesized, all within 15 or 20 ms.
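The voiced/unvoiced split can be illustrated with a very simple frame classifier. This sketch is my own simplification, not the TC Helicon method; it labels 20 ms frames (matching the latency budget mentioned above) using two cheap cues, frame energy and zero-crossing rate:

```python
import numpy as np

SR = 16000                 # sample rate in Hz
FRAME = int(0.020 * SR)    # 20 ms analysis frames

def classify_frames(signal):
    """Label each 20 ms frame 'voiced' or 'unvoiced'.

    Periodic (voiced) sound concentrates energy at low frequencies, so it
    shows high energy and few zero crossings; aperiodic noise crosses zero
    far more often. Real analyzers use many more cues than these two.
    """
    labels = []
    for start in range(0, len(signal) - FRAME + 1, FRAME):
        frame = signal[start:start + FRAME]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
        labels.append("voiced" if energy > 1e-4 and zcr < 0.15
                      else "unvoiced")
    return labels

# Synthetic check: one second of a 200 Hz tone (voiced-like) followed by
# one second of low-level noise (unvoiced-like).
t = np.arange(SR) / SR
tone = 0.5 * np.sin(2 * np.pi * 200 * t)
noise = 0.1 * np.random.default_rng(0).standard_normal(SR)
labels = classify_frames(np.concatenate([tone, noise]))
```

In a full processor, each stream would then be transformed separately, with the model parameters steering the voiced component's pitch and resonance and the unvoiced component's noise character, before the two are summed back together.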
The first commercial product to use this research is the VoiceCraft plug-in card for the VoicePrism pitch and formant processor. The card, which should be shipping in March 2001, is a monophonic lead-vocal processor that uses a Motorola DSP56362 chip running at 100 MIPS. (The rest of the unit provides four conventional harmony voices and various effects.) The preset models let you select, among other things, the type of sound, musical style, and gender of the output. For example, you can give your voice that popular raspy quality without the physical damage caused by actually singing that way. The VoiceCraft can also subtly enhance your vocal quality, giving it more resonance, say, or more body in the upper range. Using the model to correct such deficiencies goes beyond static equalization because the precise effect depends on the frequency and other vocal parameters.
Of course, nothing beats improving your technique, but in the meantime this technology can really help. It also lets you create timbral variations you might not be able to achieve naturally.
The technology applies to speech as well as singing; one individual could, for example, provide many different voices for commercial voice-overs. Its potential is as vast as the range of activities that rely on the human voice. I look forward to trying it out and following its development.