Humanoid or Vocaloid?

Many acoustic instruments can be simulated convincingly with various synthesis techniques, such as sampling and physical modeling. But one instrument has resisted most simulation attempts: the singing voice. That is because singing exhibits an unusually wide range of timbres, articulations, and transitions between sounds. In addition, singing usually communicates lyrics as well as melody, which results in a double layer of meaning not found in other instruments. Finally, the human ear is so attuned to the voice that the subtlest tonal shifts, errors, or anomalies are immediately apparent.

At the 2003 Musikmesse in Germany and the Audio Engineering Society convention in the Netherlands this past March, Yamaha demonstrated a new vocal-synthesis technology called Vocaloid, which achieves a new level of sophistication in this area. Using Visual C++ on a Windows computer, a team at the Yamaha Advanced System Development Center in Japan has written software that mimics the singing voice with surprising accuracy.

The team starts with recordings of professional male and female vocalists singing specially constructed phrases of nonsense words with all possible transitions between syllables. The transitions are slightly different depending on the combination of speech sounds called phonemes. Those differences are a big part of how we understand words and why a vocal track sounds natural or artificial. For example, the phoneme p sounds slightly different at the beginning of a word than it does at the end, and it affects the vowels next to it differently than, say, the phoneme t.

The recorded phrases are converted to the frequency domain using a Fast Fourier Transform (FFT) and divided into separate phonetic transitions. Those elements are then stored in a phonetic database for use with the synthesis engine. Expressive elements such as vibrato, pitch bend, and attack are also extracted and stored in a separate database.
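As a rough illustration of what converting a recording to the frequency domain involves, the following Python sketch runs a textbook radix-2 FFT on one short frame of audio and reports the strongest frequency in it. It is not Yamaha's code, which was written in Visual C++; the function names, the 8 kHz sample rate, and the 256-sample frame are assumptions chosen for the example.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT.

    The number of samples must be a power of two.
    """
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])  # transform of the even-indexed samples
    odd = fft(samples[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def peak_frequency(samples, sample_rate):
    """Return the center frequency (Hz) of the strongest bin below Nyquist."""
    spectrum = fft(samples)
    half = len(samples) // 2
    magnitudes = [abs(value) for value in spectrum[:half]]
    peak_bin = max(range(half), key=magnitudes.__getitem__)
    return peak_bin * sample_rate / len(samples)


# Assumed test signal: a 250 Hz sine, which falls exactly on bin 8
# because the bin spacing is 8000 / 256 = 31.25 Hz.
SAMPLE_RATE = 8000
FRAME_SIZE = 256
frame = [math.sin(2 * math.pi * 250 * n / SAMPLE_RATE) for n in range(FRAME_SIZE)]

print(peak_frequency(frame, SAMPLE_RATE))  # 250.0
```

A real analysis stage would process many overlapping, windowed frames per phrase rather than a single one; the single frame here only shows the time-to-frequency step itself.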

To create a vocal track, you enter music and lyrics into the score editor (see Fig. 1). The music can be entered manually or imported from a Standard MIDI File; the lyrics must be entered manually. Expressive elements can be imported from a MIDI File as Control Change messages or entered from a graphic palette.

The data from the score file is sent to the synthesis engine, which draws on the phonetic and expression databases to create the track. To sing the word "part," for example, the software combines four elements from the phonetic database: p (as it sounds at the beginning of a word), p-ar (the transition from p to ar), ar-t (the transition from ar to t), and t (as it sounds at the end of a word). The two ar elements are blended together, and the resulting vowel a is lengthened to accommodate the melodic line.
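The lookup described above can be sketched in a few lines of Python. The function name and the "(initial)" and "(final)" labels are hypothetical, invented here for illustration; the article does not say how Vocaloid actually names or indexes its database entries.

```python
def units_for_word(phonemes):
    """List the database elements needed to sing one word.

    Hypothetical naming scheme: a word-initial form of the first phoneme,
    one transition for each adjacent pair, and a word-final form of the
    last phoneme.
    """
    if not phonemes:
        return []
    units = [f"{phonemes[0]} (initial)"]
    # One transition element per adjacent pair of phonemes.
    units += [f"{a}-{b}" for a, b in zip(phonemes, phonemes[1:])]
    units.append(f"{phonemes[-1]} (final)")
    return units


print(units_for_word(["p", "ar", "t"]))
# ['p (initial)', 'p-ar', 'ar-t', 't (final)']
```

The sketch covers only the selection of elements. Blending the two ar elements and stretching the vowel to fit the note length would happen afterward, in the frequency domain.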

Different pitches are derived by shifting the fundamental and overtones while leaving the vowel formants relatively untouched. The database elements were originally sung at different pitches, limiting the amount of shifting the engine must do. A 2 GHz Pentium 4 computer renders the track and converts it back into the time domain in less than one-third of real time. For example, a 1-minute track can be rendered in less than 20 seconds.
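To see why moving the overtones while holding the formants still changes the pitch without changing the vowel, consider this simplified Python model. It is only a sketch: the three formant frequencies, bandwidths, and gains are assumed values for an "ah"-like vowel, and the resonance formula is a convenient stand-in, not anything taken from Vocaloid.

```python
# Assumed formants for an "ah"-like vowel: (center Hz, bandwidth Hz, gain).
FORMANTS = [(700.0, 130.0, 1.0), (1220.0, 70.0, 0.5), (2600.0, 160.0, 0.25)]


def envelope(freq_hz):
    """Fixed spectral envelope: a sum of simple resonance peaks.

    The envelope depends only on frequency, never on the sung pitch.
    """
    return sum(
        gain / (1.0 + ((freq_hz - center) / width) ** 2)
        for center, width, gain in FORMANTS
    )


def harmonics(f0_hz, max_hz=4000.0):
    """Return (frequency, amplitude) pairs for the overtones of f0_hz."""
    count = int(max_hz // f0_hz)
    return [(k * f0_hz, envelope(k * f0_hz)) for k in range(1, count + 1)]


def strongest(partials):
    """Frequency of the loudest partial."""
    return max(partials, key=lambda partial: partial[1])[0]


low_note = harmonics(110.0)   # overtones at 110, 220, 330, ... Hz
high_note = harmonics(150.0)  # overtones at 150, 300, 450, ... Hz

print(strongest(low_note), strongest(high_note))
```

In both notes, the loudest overtone lands close to the assumed 700 Hz first formant, even though the overtones themselves sit at different frequencies. Shifting the envelope along with the overtones would instead move that peak and alter the vowel color.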

Yamaha intends to commercialize Vocaloid by licensing it to producers of vocal libraries and software marketing firms. The obvious applications include background vocals and rough sketches of arrangements. However, the potential for this technology is virtually unlimited.