Zero-G's Leon and Lola are the first two commercial programs to use Yamaha's Vocaloid “singing-synthesis” technology. Several components make up each package, including the Vocaloid Editor, a sequencerlike interface for entering and editing data; a Vocaloid voice database, which contains the phonemes (speech building blocks) that are used for the synthesis; and a library of expression elements that can be applied to the singing voice. There's also a VSTi plug-in for running the virtual singers under a VST host.
Vocaloid is part sequencer, part soft synth, and part notation program. But above all else, it is a compositional and “voice-designer” environment in which you enter notes and lyrics, then apply any of a vast number of expression parameters, and the program will sing your input. It offers limited access to other programs: although you can import a MIDI sequence, it must be monophonic (no notes can overlap), and its own data cannot be output as a standard MIDI file. You can, however, output an audio file (WAV only) of the singing, and you can sync the program with a ReWire host.
Leon and Lola are the first two of a planned series of voices that will use Vocaloid (another female voice, named Miriam, will be appearing by the time you read this). Each has its own unique library of phoneme and phonetic articulations and transitions, which was built from recordings of a professional singer, and all share the same user interface, which I will describe in a moment.
Speech synthesis has been the goal of researchers since at least the 18th century. Late-19th-century experiments in the laboratories of Frenchman Etienne-Jules Marey led to mechanical devices, shaped like human mouths, capable of producing all five vowels (see the article “Voices from the Machine” in the February 2004 issue of EM for more on speech synthesis). A number of approaches evolved during the 20th century, although few, if any, provided a successful method for the synthesis of any arbitrary text, much less realistic singing. (The IRCAM program Chant is one exception, but it can't sing words.)
One of the problems associated with producing convincing synthetic speech and singing is the huge number of possible combinations of syllables. Moreover, the transition between any two syllables is a particularly difficult challenge for researchers. Vocaloid addresses these issues by providing a large library of phonetic transitions to interpret your lyrics and find the best match for any given configuration of syllables.
The library, however, is not simply a collection of audio samples. Instead, the actual recordings that were used to create the library have been subjected to spectral analysis (see the Square One spectral analysis article “Look Through Any Window” in the July 2004 issue of EM) to create a library of data that the program uses to resynthesize the voice using your input. This allows your vocal parts to be transposed or time-shifted in nearly unlimited ways. Although the software has default associations between specific phonemes and the syllables of your lyrics, those mappings can be edited by the user. For example, you could make up your own words — or even your own language — and train the software to speak them.
UP AND RUNNING
Leon and Lola ship in separate packages and use a somewhat untraditional installation scheme. Rather than a dongle or a challenge code, activation requires that a network device be installed on the host computer. (You don't have to be connected to the Internet to run the Vocaloid editor.) This won't be a problem for most users, but you cannot run the software without a device installed. Zero-G allows up to three activations, which could be handy if you have multiple computers in your studio. If you reformat your hard drive or upgrade your operating system, you don't use up an activation, as the code is tied to your system hardware.
A full install requires around 1.3 GB of drive space, and although you can install it on any of your system's drives, the drive containing your Temp directory must also have the same amount of space free. (You can change the location of your Temp directory by editing the Environment Variable entry found in the Control Panel/System/Advanced dialog.) If you purchase more than one singer, you need to install only the requisite files for the second singer, but there's no break on the price.
I tested the Leon and Lola programs using a Pentium 4/3.02 GHz processor with 2 GB of RAM running Windows XP SP1. The audio hardware used was an E-mu Emulator 1820m system. Right before I finished this review, a beta of version 1.05 was released. Yamaha claims the new version will offer significant performance enhancements to the synthesis engine, but my initial tests of the beta revealed mixed results. Hence, this review is based on the current official version, 1.02.
Unlike some soft synths, Vocaloid doesn't offer much in the way of audio setup beyond picking a sampling rate. However, there are numerous areas of the program that you can customize, including some that affect the display of data and others that more directly impact performance. I'll discuss some of these later in this review.
On the surface, the interface for Leon and Lola resembles a slightly aging sequencer (see Fig. 1). Notes are entered on a piano-roll-style timeline, and the program supports up to 16 tracks. Tempo (from 20 to 300 bpm) and time signature (which the editor calls beat) are controlled by entering values into a grid area that appears above the note display. Every tempo value is discrete — there's no way to add an accelerando or ritardando.
As you enter notes, you can quantize their starting position and length by adjusting the value for those two parameters, but there is no traditional quantize function for notes that are already entered (with one exception, which I will discuss in a moment). Notes are entered either by clicking at the appropriate position on the timeline and dragging to set the note's duration or by using one of the seven default note values (64th note to whole note, with dotted and triplet for each).
The Vocaloid interface cannot record MIDI events in real time. In fact, there is no option of any kind for note entry from an external controller. You also won't find even the most common editing features, such as transpose, reverse, or “fit time.” So, although you can copy a track and paste it to another track (to create a harmony part, for example), you must then highlight all the notes in the track and drag them up or down by hand.
After some head-scratching about the absence of these and other features, I quickly realized that it's best to forget about using Vocaloid for the vast majority of traditional sequencer functions — it is not intended to replace your sequencer, and its strengths clearly lie elsewhere. Think of it as a stop in your production workflow: you'll enter notes by hand, output a WAV file, then process the audio (for example, pitch-shift or add reverb and chorus) in another program. You can open a preexisting MIDI file, though you must either tweak the data before loading it into Vocaloid or use the Normalize Objects command to ensure that no notes overlap.
Even with the MIDI file option, you'll quickly run into limitations. For example, you can't insert data into an existing project. Instead, when you use the Open command to load a MIDI file, the current file (if any) closes, and the data ends up in what's called the Premeasure area. Premeasures are negative-number measures (-1, -2, and so on) that don't play back. (In fact, you'll get an error message if you accidentally put a note there.) You'll need to copy and move the data to a “legal” point in the file. All in all, it's not a very elegant process.
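For readers who prefer to prepare a MIDI part before importing it, the no-overlap requirement can be illustrated with a short Python sketch. It is not Vocaloid's Normalize Objects command, whose exact behavior the program doesn't document here; the `Note` type, the tick-based timing, and the rule of keeping the highest pitch when two notes start together are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Note:
    start: int   # start time in ticks (hypothetical unit)
    length: int  # duration in ticks
    pitch: int   # MIDI note number


def make_monophonic(notes):
    """Return a copy of the part in which no two notes overlap.

    Assumed policy: when notes start together, keep the highest pitch;
    otherwise trim the earlier note so it ends where the next begins.
    """
    # Sort by start time, highest pitch first among simultaneous notes.
    ordered = sorted(notes, key=lambda n: (n.start, -n.pitch))
    result = []
    for note in ordered:
        if result:
            prev = result[-1]
            if note.start == prev.start:
                # A higher note already occupies this start time; drop this one.
                continue
            if prev.start + prev.length > note.start:
                # Shorten the previous note to remove the overlap.
                result[-1] = replace(prev, length=note.start - prev.start)
        result.append(note)
    return result


part = [Note(0, 480, 60), Note(240, 480, 62), Note(240, 240, 64), Note(960, 240, 65)]
clean = make_monophonic(part)
```

In this example the first note is trimmed from 480 to 240 ticks, and the lower of the two notes that start at tick 240 is discarded, which leaves a strictly monophonic line.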
Once you look around the interface a bit, you'll find the parameters that distinguish Vocaloid and give you access to its real power. These parameters fall into two main categories: a collection of editable expression functions that appear in the Icon Palette, and a number of vocal control parameters that are entered in a work area at the bottom of the screen called the Control Track. These two groups of controls can turn a robotic voice into a more realistic singer, although there is a lot of time and effort involved in making that transformation.
When you first enter notes or import a MIDI sequence into Vocaloid, the program assigns a default phoneme (ooh) to each note. Every note also uses the default voice parameters, which you can edit in the Singer Editor (see Fig. 2). If you click on the Play button, you'll hear the notes interpreted with only the defaults, which is useful for checking pitches, but not for much more.
To hear the syllables you've entered, click on the Phoneme Transformation button so Vocaloid can match your lyrics with its database of phonemes. If Vocaloid finds your word in its internal dictionary, it will correctly translate your text into the proper phonemes. If not, it will substitute ooh, in which case you'll have to enter the phonemes by hand to build the word. Thankfully, this is a one-time operation, as any such user-created words can be saved in the User Dictionary, and Vocaloid will learn them for the next time they appear.
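The lookup behavior described above can be pictured as a dictionary search with a fallback. The following Python sketch is only a model of that idea, not Vocaloid's code: the function name, the rule that user entries take priority over built-in ones, and the phoneme spellings (which are placeholders, not Vocaloid's own notation) are assumptions.

```python
DEFAULT_PHONEME = "ooh"  # the fallback the program sings for unknown words


def to_phonemes(word, built_in, user):
    """Look up a lyric word, preferring user entries, else fall back to ooh."""
    key = word.lower()
    if key in user:
        return user[key]
    return built_in.get(key, [DEFAULT_PHONEME])


# Placeholder dictionaries; the phoneme strings are illustrative only.
built_in = {"hello": ["h", "e", "l", "o"]}
user = {}

known = to_phonemes("Hello", built_in, user)    # found in the built-in dictionary
unknown = to_phonemes("yeah", built_in, user)   # not found, so the default is used

# Entering the phonemes by hand once and saving them to the user dictionary
# means the word is recognized the next time it appears.
user["yeah"] = ["y", "e", "a"]
learned = to_phonemes("yeah", built_in, user)
```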
On average, Vocaloid correctly mapped about half of my text. At times, however, even common syllables such as yeah, bo, and ka were not mapped correctly.
You can add expression to the notes by using the preset expression markings. Open the Icon Palette and you'll see expression controls for Attack, Vibrato, Dynamics, Crescendo, and Diminuendo. To assign a control to a note, just drag the icon from the Palette and drop it on the target. Once a control is assigned, you can double-click its icon to edit it, and you can save edited versions of the controls and reuse them in other projects.
A close look at one of the icons gives you an idea of the program's depth. Double-click on the Vibrato effect, for example, and you will find a dialog box in which you can specify the starting point of the vibrato and its duration (see Fig. 3). Click on the Depth Setup area, and you'll see a two-dimensional grid — representing value over time — in which you can design the exact vibrato you want in great detail. There are three modes for entering the vibrato shape (Dot, Free, and Line), and you can also type in a specific value (between 0 and 127) for any point by selecting the point and using the Value Bar. You can also use the Interval Bar to set the distance between control points by hand. With so much control, you can easily create exactly the effect that suits the music you're writing.
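One way to think about the depth grid is as a list of control points, each a time and a value from 0 to 127, with the curve between them filled in by the program. The Python sketch below assumes that Line mode joins points with straight segments and that the curve holds its first and last values outside the drawn range; both are guesses about the behavior, and the function name and tick-based time unit are hypothetical.

```python
def depth_at(points, t):
    """Estimate vibrato depth at time t from (time, value) control points.

    Assumes strictly increasing times, values in 0..127, and straight-line
    interpolation between neighboring points.
    """
    if not points:
        raise ValueError("at least one control point is required")
    for (t0, _), (t1, _) in zip(points, points[1:]):
        if t1 <= t0:
            raise ValueError("control point times must be strictly increasing")
    if any(not 0 <= v <= 127 for _, v in points):
        raise ValueError("values must lie between 0 and 127")

    # Hold the end values outside the drawn range.
    if t <= points[0][0]:
        return points[0][1]
    if t >= points[-1][0]:
        return points[-1][1]

    # Find the segment containing t and interpolate along it.
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            frac = (t - t0) / (t1 - t0)
            return round(v0 + frac * (v1 - v0))


# A vibrato that fades in over the first half of a note, then deepens further.
curve = [(0, 0), (100, 64), (200, 127)]
quarter = depth_at(curve, 25)   # 16
middle = depth_at(curve, 100)   # 64
```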
Attack settings, logically, are always attached to the beginning of a note. Several presets give a strong accent to the start of a note, others give a note an upward bend, and two add a short trill (either a whole tone or a semitone) at the start of a note. As with other settings, you can use only one expression mark per note, so you couldn't, for example, use a sharp accent and a trill on the same pitch. This limitation makes sense in some cases (using only one dynamic marking or crescendo type on a note), but in the Attack category, more than one expression might be appropriate.
Crescendo, Diminuendo, and Dynamics are mostly self-explanatory, but it's worth noting that the Dynamics icons are far more flexible than MIDI Velocity (which appears in the Control Track), because they can be used to make an immediate change in volume (even in the middle of a note), and they can apply to more than a single note. In effect, they act like MIDI Volume control changes.
The second main area for tweaking is the Control Track, which is found at the bottom of the screen (see Fig. 1, bottom). This area provides access to a large number of parameters, such as Velocity and Pitch Bend, but also to other, less-traditional note-shaping functions. The most familiar of these is a set of four resonance filters, each with a frequency, bandwidth, and gain control. As with other parameters, frequency values are in relative increments (0 to 127), with no indication of actual Hertz values. You enter values using the same three data-entry tools found in the Vibrato dialog (Dot, Free, and Line), and if you've ever drawn a controller curve for notes in a sequencer, you'll get the process immediately.
The most distinctive aspect of the Control Track is a set of five controls that you won't find in any sequencer. These are Harmonics, Noise, Brightness, Clearness, and Gender Factor. Each has a different impact on the sound of your singer, but as the manual implies, it takes some trial and error to nail down the exact effect of each.
For example, Noise controls the relative amount of the nonsinusoidal component of the sound (the part produced by the mouth and windpipe rather than by the vocal cords). With a single sustained note on the syllable me, even broad sweeps of changing values produced only a subtle effect in the voice. With a syllable such as hush, however, Noise can make a huge difference to the intelligibility and presence of the word in your mix. Gender Factor can also produce a dramatic change in the sound. Brightness and Clearness are very similar in their impact, and Harmonics adds upper harmonics to a sound, which makes it both brighter and louder. Combined, these five parameters give you a vast range of time-varying control over both the individual notes and the musical phrases.
UNDER THE HOOD
In addition to the top-level controls described above, Vocaloid offers even more ways to manipulate its output. The most important of these are accessed using the Phoneme Editor, and deeper still, in the Phoneme Property and Phoneme Parameter Setup dialog boxes. Here is where budding linguists can get to the heart of the matter and make fine adjustments, such as changing the length of a consonant or modifying the phoneme used to synthesize a syllable.
From the main screen, double-click on a note, and the Phoneme Editor will open (see Fig. 4). On the right there will be a chart listing the default associations for the syllables in your lyrics, complete with a sample word that shows the phoneme usage. If you don't know a schwa from a high-front unrounded tense vowel, then you had better grab a phonetic alphabet and catch up on your diphthongs: changing phonemes without a clear idea of the impact this will have might produce something useful, but it is typically not a productive endeavor.
The manual isn't much help in this area, but a short tutorial produced by Zero-G contains a few pointers that are general enough to be useful. Yet even here, you aren't provided with any information regarding how to change a phoneme mapping for a syllable that is making no sound or rules about when you might want to lengthen a consonant. The power is there and the results are undeniable; it's just that you are very much on your own in the undertaking. (The tutorial, by the way, guides you through the creation of a short phrase of seven notes. It took me nearly 15 minutes to complete it.)
On my Pentium 4 running Vocaloid version 1.02, I was not able to play a single, fairly complex track with the Play with Synthesis feature enabled. (Play with Synthesis is supposed to render the voice as it plays.) With Play with Synthesis disabled, the lag between pressing playback and hearing a sound increased to a few seconds for a single track, but the track played perfectly. I then tested Vocaloid with up to 13 tracks, and though the lag time increased proportionally, reaching upwards of 30 seconds, actual playback was also perfect.
Low notes take considerably longer to render: a single ooh pitched at C-2 using the maximum note length of 8 measures, with an extreme vibrato and an upward bend, took more than nine minutes to render. (Forget about transcribing those Russian opera bass parts.) The same note at C2 took six seconds. As is the case with a soft synth, the faster the machine the better, and I would consider Zero-G's minimum system requirements a bare minimum.
The included printed manual is good for getting started and even takes you to an advanced beginner or intermediate level of usage. But it falls short in more detailed application areas, such as exactly how to produce singing whispers or how to tackle the Phoneme Editor. There is also a video tutorial on the CD, but it's very short and does not offer much that isn't already in the manual. Trial and error is expected with a program of this type, but a detailed guide would be very useful in making Vocaloid more accessible to many people.
I can think of dozens of enhancements to the interface that would make working with Vocaloid much easier: the ability to enter a sequence of the same syllable (la, la, la, for example), or the ability to see not just the notes of one track when working on another, but the noncurrent track's expression data as well. It would also be great to have keyboard shortcuts for more features, such as changing from data-entry to select mode. And why couldn't the cursor return to the starting point when you stop playback? The maximum note length of just eight measures seems odd — though a real singer couldn't produce a tone much longer, that shouldn't be a problem for a virtual singer.
It's not so much that Vocaloid is difficult to learn. In fact, the main parameters that you will be working with are mostly intuitive and straightforward. But creating a realistic voice is time-consuming and requires adjustments to quite a few settings. Perhaps help is on the way: Jasmine, maker of the Onyx Arranger, is working on a program called YV Enhancer that can generate Vocaloid parameters automatically using its Performance Modeling technology. The demo songs available at the company's Web site (www.jasminemusic.com) definitely show promise.
Yamaha has also announced a major upgrade, which is planned for this summer. With any luck, Vocaloid will get a big boost in usability. That's not to say that you'll instantly become an expert voice designer, but for the type of work that most musicians will need Vocaloid for, the upgrade should make the program more efficient.
VOICES FROM THE MACHINE
So who needs Vocaloid? If you simply want a sketch pad or something to use to hear how a melody might sound with a male or female voice, you are good to go right out of the box. The same is true if you are the experimental type and are after vocal-like sonic textures or other sound-design elements. You can find several examples of Vocaloid used in more experimental ways at the EM Web site (see Web Clips 1, 2, and 3).
On the other hand, if you use Vocaloid's output in a demo, perhaps as backing tracks, you will have a fair amount of work to do. To even consider using a track in a commercial recording, you should plan on a considerable investment in time to master the program (and find a good source of information on phonetics). You can hear the program's potential in Zero-G's demos (check out Miriam, the newest voice in Zero-G's line-up, singing “Scarborough Fair” at the company's Web site). There is no doubt that the software is capable of amazing things.
With those caveats in mind, Vocaloid Leon and Lola are two singers that you will probably enjoy having on hand for your next session. And if you get in the game now, you will be well prepared as the technology continues to mature.
EM Associate Editor Dennis Miller is happy to let Vocaloid sing his parts in his band. In fact, he is thinking of using it on his audition tape for American Idol.
Minimum System Requirements
Vocaloid 1.02 Leon and Lola
Pentium III/1 GHz; 512 MB RAM; Windows XP/2000; Network device
Vocaloid 1.02 Leon and Lola voice-modeling synthesizer
$369.95 (each voice)
FEATURES: 4.0
EASE OF USE: 2.5
QUALITY OF SOUND: 4.5
VALUE: 4.0
RATING PRODUCTS FROM 1 TO 5
PROS: Potential for highly realistic-sounding vocal parts. Up to 16 individual parts.
CONS: Vocaloid Editor not easy to use. No discount on purchase of more than one voice. Insufficient tutorials and documentation.