Portable music players are a wonder to behold. Each new compression format sounds better than the last, and the file sizes keep getting smaller. What's the secret behind this magic? It's a process called psychoacoustic sub-band coding. The compression schemes (or codecs, for encode/decode) used by music players — AAC, MP3, MP4, and so on — are based on an analysis of the acoustic material and a model of what actually reaches our brain, and when. Material that, in theory, we wouldn't hear anyway can then be removed. Discarding what we don't need makes file sizes smaller, with negligible loss in audio quality.
In this article, I'll give some basic facts about how the human auditory system works and how it sends information to the brain. Then I'll discuss how the brain processes signals from the auditory system, which is the realm of psychoacoustics. Psychoacoustics underlies everything we hear. Understanding how we make sense of our world through sound is the basis of music codecs. You'll see that it can also be the basis of a successful music mix.
THE EARIE CANAL
When acoustic energy reaches our ears, there's a straightforward transfer of energy. Air-pressure changes cause the eardrum to move back and forth in response. The eardrum is connected to the three tiny bones of the middle ear, the ossicles (the hammer, the anvil, and the stirrup), which amplify the motion. The amplified movement is then transferred to the cochlea, a small snail-shell-shaped tube in the inner ear.
FIG. 1: The cochlea, shown here “unrolled,” is a vital part of our complex auditory system and helps relay signals from the outer ear to the brain.
The stirrup bone makes contact with the cochlea at a soft membrane called the oval window. The cochlea is a hollow tube filled with a jellylike fluid called perilymph. The cochlea, easiest to visualize if it is “unrolled” (see Fig. 1), is 35 mm long and is coiled two and a half times. The tube is divided into two levels by the springy basilar membrane, which doesn't quite split the tube. There is a small opening at the far end from the oval window. When the stirrup bone pushes against the oval window, it creates a ripple along the basilar membrane. The moving wave travels across, down, and around the chamber, and is equalized by a corresponding opposite movement of the round window below. When the stirrup is pulled away from the cochlea, it pulls the oval window outward with it, and the round window moves inward to compensate.
The wave shakes the basilar membrane, the way you might shake out a dusty rug. The basilar membrane is embedded with some 30,000 hair cells. Depending on the frequency of the vibrations originating from the stirrup bone, there are different points of maximum displacement (largest ripple) along the basilar membrane. Low frequencies, with longer wavelengths, create a large ripple down toward the opposite end of the cochlea from the oval window. High frequencies, with their shorter wavelengths, cause the maximum displacement to be closer to the oval window. Wherever the point of maximum displacement happens to be, the hair cell there fires an electrical impulse. These hair cells are connected to nerve fibers that lie along the outside of the cochlea like a horse's mane, interwoven with its swirls. The impulse from the hair cells excites a group of nerve fibers, which then send the signals to the brain. Thus, different nerve fibers respond to different frequencies, meaning that the auditory nerves send a spectrum of a sound event to the brain.
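For readers who like to see numbers, here is a rough sketch of that place-to-frequency layout. It uses Greenwood's function, a published approximation that is not part of this article; the constants (165.4, 2.1, and 0.88) are the values commonly quoted for the human cochlea, and the function name is my own invention. Treat the output as a ballpark, not a measurement.

```python
def greenwood_frequency(position):
    """Approximate best frequency (Hz) at a point on the basilar membrane.

    position is a fraction of the membrane's length: 0.0 is the far end
    (the apex) and 1.0 is the end next to the oval window (the base).
    The constants are commonly quoted human values (an assumption here).
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0.0 and 1.0")
    return 165.4 * (10 ** (2.1 * position) - 0.88)


for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{fraction:.2f} of the way from the far end: "
          f"{greenwood_frequency(fraction):8.0f} Hz")
```

The low frequencies land at the far end and the high frequencies land near the oval window, which is the arrangement described above.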
LET'S ALL BAND TOGETHER
When a sound excites the basilar membrane, a small group of cells at the point of maximum deviation fires with both barrels. The neighboring cells on either side of this point are also disturbed, but to a lesser degree. (They also fire impulses, but not as strongly.) Each point of the basilar membrane is the point of maximum excitation for some frequency, but will also join in the firing squad when a different frequency excites one of its neighbors. So a sound at a given frequency excites nerves belonging to a range of frequencies.
The amount of basilar-membrane real estate that jumps into the excitement is called the critical band. The frequency range spanned by this section of real estate is called the critical bandwidth. It's important to understand the difference between the two, because they do not maintain a constant relationship. A frequency of 350 Hz stimulates a band of cells having a bandwidth of roughly 100 Hz (300 to 400 Hz). But a frequency of 4 kHz excites a band of cells having a bandwidth of about 700 Hz (3.7 to 4.4 kHz). The critical bandwidth is much wider for higher frequencies than it is for lower frequencies.
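If you want to estimate the critical bandwidth at other frequencies, one option is the approximation usually credited to Zwicker and Terhardt. It does not come from this article, and the function name is hypothetical, but it gives figures close to the rough ones quoted above: a little over 100 Hz near 350 Hz, and a little under 700 Hz near 4 kHz.

```python
def critical_bandwidth_hz(center_hz):
    """Approximate critical bandwidth (Hz) around a center frequency.

    Zwicker/Terhardt-style formula; an outside approximation, used here
    only as an illustration.
    """
    khz = center_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * khz ** 2) ** 0.69


for freq in (100, 350, 1000, 4000, 10000):
    print(f"{freq:6d} Hz -> critical bandwidth about "
          f"{critical_bandwidth_hz(freq):6.0f} Hz")
```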
FIG. 2: A masking sound, in this case narrowband noise, raises the audibility threshold of neighboring frequencies.
Precisely where along the membrane this point of excitement occurs is another important aspect of the auditory system. As frequencies are doubled, the point of excitement moves in equal increments — an equal length of basilar membrane is traversed to reach the points excited by 500 Hz, 1 kHz, 2 kHz, 4 kHz, 8 kHz, and so on. This corresponds to our perception of pitch: we hear doubling of frequencies as a change of an octave. Other musical intervals are also based on the ratio of any two given frequencies, not the absolute distance between them. For example, a perfect fifth above some note is always a ratio of 3/2: the E above A 440 is 660, an increase of 220 Hz and a ratio of 3 (660) to 2 (440). An additional perfect fifth above 660 Hz is not, however, at 880 Hz (660 + 220), but rather it's at 990 Hz (660 * 3/2). We say that such a relationship, based on multiplication rather than addition, is logarithmic rather than linear.
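The arithmetic is easy to try for yourself. This small sketch (the function and constant names are mine) stacks intervals by multiplying, which is why two perfect fifths above A 440 land on 990 Hz rather than 880 Hz.

```python
from fractions import Fraction

PERFECT_FIFTH = Fraction(3, 2)
OCTAVE = Fraction(2, 1)


def stack_interval(start_hz, ratio, count):
    """Return the frequencies reached by applying `ratio` `count` times."""
    freqs = [Fraction(start_hz)]
    for _ in range(count):
        freqs.append(freqs[-1] * ratio)  # multiply, never add
    return [float(f) for f in freqs]


print(stack_interval(440, PERFECT_FIFTH, 2))  # [440.0, 660.0, 990.0]
print(stack_interval(500, OCTAVE, 4))         # [500.0, 1000.0, 2000.0, 4000.0, 8000.0]
```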
From the example above, you can see that the logarithmic relationship of the basilar membrane to the spectrum applies to our perception of pitch. A logarithmic relationship is also behind the changes in our sensitivity to frequency differences over the frequency spectrum/basilar membrane. When two tones are played consecutively, the minimum frequency difference they must have in order for listeners to notice that difference is called the just noticeable difference (JND). The JND depends on a variety of factors, including frequency range, suddenness of the change, and level of musical training of the listener. Generally speaking, however, the JND below 1 kHz is about 3 Hz. The difference between 50 and 53 Hz is about a semitone, whereas the difference between 997 and 1,000 Hz is only about one-twentieth of a semitone. From 1 kHz to 4 kHz, the JND remains at about 0.5 percent of any frequency (about one-twelfth of a semitone). The JND becomes indistinct, but noticeably larger, above 5 kHz. Sine-wave melodies transposed into that range tend to melt into a bunch of screaming, high beeps. (Critical bands play a role with not only sequential frequencies but also simultaneous frequencies. That discussion, however, will have to wait for another article.)
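To check interval sizes like these yourself, convert a pair of frequencies to cents (hundredths of an equal-tempered semitone). The helper name below is my own; the formula is the standard one.

```python
import math


def cents_between(f1_hz, f2_hz):
    """Size of the interval from f1_hz up to f2_hz, in cents.

    100 cents make one equal-tempered semitone; 1,200 make an octave.
    """
    return 1200.0 * math.log2(f2_hz / f1_hz)


for low, high in ((50, 53), (997, 1000), (2000, 2010)):
    print(f"{low} Hz to {high} Hz: {cents_between(low, high):6.1f} cents")
```

The same 3 Hz step shrinks from roughly a semitone at 50 Hz to a few cents at 1 kHz, while a 0.5 percent step stays the same musical size wherever you put it.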
TAKE OFF YOUR MASK!
The auditory system is not completely egalitarian. The minimum power level required for a sound to be audible is called the threshold of audibility, and different frequencies have different ones. Lower frequencies must be played at much greater power levels than higher frequencies in order for them to be heard at equal volumes, if at all. The threshold is lowest for frequencies around 1.3 kHz, the range of the spoken voice (see the equal-loudness-contours graph in the article “Loud, Louder, Loudest!” in the August 2003 issue of EM). But crank up those lows, and you'll find they can drown out lower-level high frequencies (maybe not on your monitors, but it can happen in your ears).
High frequencies excite the basilar membrane at points near the oval window, leaving more distant points relatively undisturbed. But low frequencies, which excite the membrane at points more distant from the oval window, create waves in the membrane that have to travel past those closer points excited by higher frequencies. So when high and low frequencies are heard together, the lows can, in some circumstances, interfere with the highs.
FIG. 3: Like many portable music players, Apple's iPod uses psychoacoustic encoding to compress music files. The user can determine how much compression is applied.
Furthermore, any sound at a high level centered at a given frequency raises the hearing-level threshold for frequencies in its critical band. This phenomenon of some sounds rendering other sounds inaudible is termed masking. Systematic studies have shown how tones or narrowband noise at a given frequency raises the audibility threshold of tones at neighboring frequencies (see Fig. 2). A broadband noisy sound, like a sweeper or a loud ventilation system, can be relied upon to raise the hearing threshold of just about everything.
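Here is a deliberately crude sketch of how a masked threshold might be modeled. It measures the distance between two frequencies in critical bands (the Bark scale, using a Zwicker-style approximation) and lets the masker's influence fall off in a triangle shape. The offset and the two slopes are illustrative assumptions of mine, not measured data, and all of the function names are hypothetical.

```python
import math


def hz_to_bark(freq_hz):
    """Zwicker-style approximation of the critical-band (Bark) scale."""
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))


def masked_threshold_db(probe_hz, masker_hz, masker_db,
                        offset_db=10.0, lower_slope=25.0, upper_slope=10.0):
    """Rough level (dB) below which a probe tone is hidden by the masker.

    offset_db, lower_slope, and upper_slope (dB per Bark) are assumed,
    illustrative values. The shallower upper slope means masking spreads
    further toward higher frequencies than toward lower ones.
    """
    distance = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    slope = upper_slope if distance >= 0 else lower_slope
    return masker_db - offset_db - slope * abs(distance)


def is_masked(probe_hz, probe_db, masker_hz, masker_db):
    """True if the probe falls below the masker's estimated threshold."""
    return probe_db < masked_threshold_db(probe_hz, masker_hz, masker_db)


# An 80 dB tone at 1 kHz against 40 dB probes at various frequencies.
for probe in (900, 1100, 4000):
    print(probe, "Hz masked?", is_masked(probe, 40.0, 1000.0, 80.0))
```

In this toy model, a quiet tone right next to the loud one disappears, while one a couple of octaves away survives.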
Masking can also be a factor in recording music. Imagine you're mixing something with a kick drum on one track and a bass guitar on another. It might seem that you'd want as much synchronization as possible. The low frequencies of these instruments tend, however, to differ by no more than a few hertz, meaning that these two bottom dwellers could have a number of neighboring frequencies that mask each other. A precise synchronization results in a sound that's more like a dull thud than a blend of two interesting sonic personalities.
There are several ways that you can prevent the kick and bass from stepping on each other. Introducing a delay of a few milliseconds on one can break them up and maintain their individual personalities without producing any audible echo. Alternatively, you can pan them separately in the mix. If you can do a polarity flip on one of these panned channels, so much the better.
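In sample terms, both tricks are simple. The sketch below assumes a track stored as a plain list of sample values and a 44.1 kHz sampling rate; the function names are hypothetical, and a real workstation would do this for you.

```python
def delay_in_samples(delay_ms, sample_rate_hz=44100):
    """Convert a delay in milliseconds to a whole number of samples."""
    return int(round(delay_ms * sample_rate_hz / 1000.0))


def delay_track(samples, delay_ms, sample_rate_hz=44100):
    """Delay a track by padding its start with silence."""
    pad = delay_in_samples(delay_ms, sample_rate_hz)
    return [0.0] * pad + list(samples)


def flip_polarity(samples):
    """Invert the polarity of a track (every sample changes sign)."""
    return [-s for s in samples]


bass = [0.5, 0.25, -0.25, -0.5]
print(delay_in_samples(10))         # 441 samples at 44.1 kHz
print(len(delay_track(bass, 10)))   # 445
print(flip_polarity(bass))          # [-0.5, -0.25, 0.25, 0.5]
```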
ENTER THE CODEC
Uncompressed pulse-code modulation, as we find on CDs, tries to create a perfect representation of acoustic information. Codecs, like the ones used by portable music players, pare down the signal by separating the wheat from the chaff (see Fig. 3). Psychoacoustic sub-band coding refers to the process of sending a block of samples through a bank of bandpass filters, breaking it into sub-bands, and performing an analysis based on the activity within each spectral region (the output of each filter, both alone and in comparison to the output of the other filters). The results are compared with a psychoacoustic model that emulates the auditory system, so that the codec can remove what the auditory system would not pass on to the brain anyway.
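To make the sub-band idea concrete, here is a toy analysis stage. Note the substitution: real codecs use carefully designed filter banks, whereas this sketch simply takes a discrete Fourier transform of the block and groups the bins into equal-width bands. The block size, sampling rate, and names are all assumptions chosen to keep the example small.

```python
import cmath
import math


def dft_magnitudes(block):
    """Magnitudes of the first N/2 DFT bins of a block of samples."""
    n = len(block)
    mags = []
    for k in range(n // 2):
        total = sum(block[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total))
    return mags


def subband_energies(block, num_bands):
    """Sum squared bin magnitudes into num_bands equal-width sub-bands."""
    mags = dft_magnitudes(block)
    per_band = len(mags) // num_bands
    return [sum(m * m for m in mags[b * per_band:(b + 1) * per_band])
            for b in range(num_bands)]


# A loud 1 kHz tone plus a quiet 6 kHz tone: 64 samples at 16 kHz.
SAMPLE_RATE = 16000
block = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
         + 0.01 * math.sin(2 * math.pi * 6000 * t / SAMPLE_RATE)
         for t in range(64)]
energies = subband_energies(block, num_bands=8)  # each band is 1 kHz wide

for band, energy in enumerate(energies):
    level_db = 10 * math.log10(max(energy, 1e-12))
    print(f"band {band} ({band}-{band + 1} kHz): {level_db:7.1f} dB")
```

The per-band levels are the "activity within each spectral region" that a psychoacoustic model would then examine.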
For instance, some low-level frequency regions might be masked by other regions. You could resynthesize a sound containing such a region and reduce its size by eliminating those frequencies altogether. Alternatively, the lower-level regions could be resynthesized with less precision than the more salient frequencies, with the resulting distortion masked. Another thing a codec might do is to watch for transients, such as sudden attacks when new notes are played. A transient can also mask softer tones that occur immediately following it or even immediately prior to it. Removing material that would be masked by transients is another way to resynthesize material with greater economy.
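One way to picture the economy is as a bit budget. In the sketch below, each sub-band has a signal level and a masked threshold (both invented for illustration). A band that sits below its threshold gets no bits at all, and the others get only enough bits to keep the quantization noise under the threshold, using the rule of thumb that each bit is worth about 6 dB. This is a simplified illustration, not the allocation scheme of any particular codec.

```python
import math

DB_PER_BIT = 6.02  # rule of thumb: each added bit lowers quantization noise ~6 dB


def allocate_bits(band_levels_db, mask_thresholds_db, max_bits=16):
    """Give each band just enough bits to keep its noise under the mask."""
    bits = []
    for level, threshold in zip(band_levels_db, mask_thresholds_db):
        headroom = level - threshold
        if headroom <= 0:
            bits.append(0)  # the whole band is masked, so drop it
        else:
            bits.append(min(max_bits, math.ceil(headroom / DB_PER_BIT)))
    return bits


# Hypothetical levels and thresholds for three sub-bands, in dB.
levels = [80.0, 40.0, 65.0]
thresholds = [20.0, 45.0, 59.0]
print(allocate_bits(levels, thresholds))  # [10, 0, 1]
```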
This article presents just a brief introduction to the field of psychoacoustics and its relevance to musicians. Type “Albert Bregman” or “Diana Deutsch” into your Web search engine, and you will meet two researchers who have produced a wealth of information about how we hear. Whether you're composing, mixing, or philosophizing, understanding how your brain perceives sound can enrich your work in ways that will probably surprise you.
Mark Ballora teaches music technology at Penn State University. Special thanks to Curtis Craig for his assistance with this piece.