Interpretation of Data in a WAV File
NUMBER OF BYTES | FILE DATA (IN HEX) | INTERPRETATION | VALUE
Illustration: Tami Needham
Everyone who works with digital audio soon encounters a wide variety of sound-file formats: WAV, AIFF, SND, Sound Designer I and II, and MP3, to name just a few. In most cases, the different formats present few problems. Software simply opens, plays, edits, and saves the audio files, sparing you from knowing the details of exactly how each format is constructed. But how does a program know what type of data the file contains? And what exactly is in an audio file?
A digital sound file is basically a long list of numbers representing the momentary values of an analog waveform measured (sampled) at a periodic rate. A file containing just those numbers is called a raw-data sound file. Usually, a lot more information must be embedded in a file for it to be read and played back properly. Aside from the sampling rate, the necessary information includes the resolution (which is the number of binary digits, or bits, that represent each sample). Other information indicates whether the file is monaural or stereo and whether the file creator has included looping information and cue points, a title, the name of the engineer or composer, a copyright notice, or other similar text.
That kind of information is included in a header, typically found at the beginning of a file. Different types of files have headers that are configured in distinct ways. (A raw-data sound file is also known as a headerless file.) For an application to read or write sound in a particular format, it must understand how the data is organized in that format.
PLAYING DIGITAL RIFFS
As an example, here's a close look at the familiar WAV sound-file format. A WAV file is a type of Resource Interchange File Format (RIFF) file, a format developed by Microsoft and IBM for multimedia files. (The familiar AVI video format is another type of RIFF file.) WAV files have been in use since Windows 3.1 and so are very widespread.
A WAV file is divided into sections, or chunks, that contain certain prescribed information. It's a more flexible arrangement than having just a single header. At the beginning of the file, a RIFF chunk defines the data as a WAV file and also reports its total length. Embedded within the RIFF chunk are two other chunks: a format chunk with information about sampling rate, resolution, number of channels, type of coding, and so on; and a data chunk, in which the actual sample values are stored.
FIG. 1: This is a binary-file view of a simple WAV file containing a single cycle of a sine wave. The color coding indicates the different chunks of data in the file. All of the numbers on the left are in hexadecimal, and each 8-bit byte of data is represented as two hexadecimal digits. On the far right, the same data is reproduced in ASCII code, so you can see any text embedded in the file. When the data is not text, you just see gibberish in that column.
To examine the format, I created a simple WAV file with my audio editor. The file contains a single cycle of a mono 2,205 Hz sine wave, synthesized at a 44,100 Hz sampling rate with 16-bit resolution. After saving the file, I displayed the data in the standard binary file-viewing format shown in Fig. 1. (There are many binary file-viewing utilities, including one called debug that is part of DOS. For the display in Fig. 1, I used Helios Software Solutions' TextPad, which lets you view files in many formats, including binary.) Color coding is added to differentiate each chunk.
GOOD TO THE LAST BYTE
Notice in Fig. 1 that the RIFF-chunk header information is found in the first 12 bytes of the file (highlighted in pink). Every pair of hexadecimal digits represents one byte; in the table “Interpretation of Data in a WAV File,” a space between each byte shows how the data is organized. You can see the meaning of each group of bytes in the table. Note also that the file data is in hexadecimal (hex) format. (If you're not familiar with hex, see the sidebar, “All About Numbering Systems.”)
The first four bytes in the file (the hexadecimal numbers 52, 49, 46, and 46) represent the ASCII characters for the acronym “RIFF,” which denotes the format type. (ASCII characters are numbers that represent letters of the alphabet. See the Value column at the far right of the table.) The next four bytes (4E 00 00 00) indicate the total number of bytes of data in the file after the first eight bytes of the header. This four-byte integer is in a format called little-endian, which means that the least significant byte comes first when the computer lists the number byte by byte. That takes some getting used to, because the string of bytes appears in the opposite order from what you'd expect. In other words, the four bytes, 4E 00 00 00, signify the hexadecimal number 0x0000004E, which can be shortened to 4E or 0x4E. (The 0x prefix is often used to indicate that the number is in hexadecimal format.)
4 | 52 49 46 46 | ASCII characters identifying file as a RIFF file | “RIFF”
4 | 4E 00 00 00 | total size of file minus 8 bytes of header | 0x4E (hex) or 78 (decimal)
4 | 57 41 56 45 | ASCII characters identifying file as WAV file | “WAVE”
(The term little-endian in computer lingo is taken from Jonathan Swift's Gulliver's Travels. At one point in the story, the Lilliputians are divided into two warring political camps: the Little-Endians, who believe you should first crack a soft-boiled egg on the little end; and the Big-Endians, who believe the opposite. The computer term big-endian, as you would guess, means numbers are listed with the most significant bytes first.)
The next four bytes (57 41 56 45) are the ASCII characters “WAVE”; they tell any application reading the file that this is WAV-audio format and not one of the other possible RIFF multimedia file types.
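If you'd like to check the RIFF header arithmetic yourself, the following is a minimal Python sketch that decodes the 12 bytes described above. The variable names are illustrative, and the byte string is simply copied from the text; a real program would read those bytes from the file instead.

```python
import struct

# First 12 bytes of the example file, copied from the text.
riff_header = bytes.fromhex("52 49 46 46 4E 00 00 00 57 41 56 45")

# "<" selects little-endian; "4s" is a 4-byte string, "I" an unsigned 32-bit integer.
chunk_id, riff_size, form_type = struct.unpack("<4sI4s", riff_header)

print(chunk_id.decode("ascii"))   # RIFF
print(riff_size)                  # 78 (bytes remaining after the first 8)
print(form_type.decode("ascii"))  # WAVE
print(riff_size + 8)              # 86 (total file length in bytes)
```

The "<" in the format string is what tells Python to treat 4E 00 00 00 as 0x0000004E rather than 0x4E000000.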
The next 24 bytes of data in Fig. 1 (shown in blue) represent the format chunk, where several of the file's important characteristics are coded. This segment begins with the bytes 66 6D 74 20. The first three bytes of this string are the ASCII symbols for “fmt,” and the “20” indicates a space, which just fills out this segment so that it takes up a full four bytes. The next four bytes (10 00 00 00) indicate the length of the format chunk. That value is hex 0x10, or decimal 16. The table “Format-Chunk Data” shows how the format data is arranged. As before, the second column shows the exact sequence of numbers in hex as they appear in the file.
Notice that the Type of Coding is PCM (Pulse Code Modulation). PCM is a common uncompressed-audio data format. Other possibilities for coding include µ-law (pronounced mu-law) and A-law. Those two are registered under IBM's codes 0x0101 and 0x0102, respectively, although Microsoft's own codes (0x0007 for µ-law and 0x0006 for A-law) also exist and are widely used. Both are methods of scaling the sample data to try to minimize the audible quantization noise. (Quantization noise is a rounding error that occurs when you translate analog audio information into the more limited realm of digital numbers. If you use large enough digital data words, the quantization noise can be made so small that it does not cause audible problems.)
Another coding technique you may encounter is ADPCM (Adaptive Differential Pulse Code Modulation), which IBM registered as 0x0103 (Microsoft's own ADPCM variant uses 0x0002). Interested programmers can find the exact formulas for those coding techniques on the Internet. (I'll deal only with uncoded PCM data here.)
Format-Chunk Data
NUMBER OF BYTES | FILE DATA (IN HEX) | INTERPRETATION | VALUE
[Table: Amplitude Values for a Sine Wave]
The format chunk also indicates that the number of channels is 1 (monaural). The sampling rate is coded as 44 AC 00 00, or 0xAC44 (44,100 in decimal). The next four bytes (88 58 01 00, or 0x15888 in hex and 88,200 in decimal) define the number of bytes per second. This value is two times the sampling rate (44,100), because the file has a single channel and each sample contains two bytes of data. Next is the number of bytes per sample: 02 00, or 0x2 (2 in decimal). Finally, 10 00, or 0x10 (16 in decimal), shows the number of bits in each sample.
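The format-chunk fields just described can be unpacked in one step. This is a small sketch, assuming uncompressed PCM and using field names of my own choosing (they are not part of the file); the 16 bytes are the ones listed in the “Format-Chunk Data” table.

```python
import struct

# The 16 bytes of format data that follow "fmt " and the chunk length.
fmt_data = bytes.fromhex("01 00 01 00 44 AC 00 00 88 58 01 00 02 00 10 00")

# H = unsigned 16-bit integer, I = unsigned 32-bit integer, "<" = little-endian.
(coding, channels, sample_rate,
 byte_rate, bytes_per_sample, bits_per_sample) = struct.unpack("<HHIIHH", fmt_data)

print(coding, channels, sample_rate, byte_rate, bytes_per_sample, bits_per_sample)
# 1 1 44100 88200 2 16

# Cross-checks that hold for uncompressed PCM data.
assert bytes_per_sample == channels * bits_per_sample // 8
assert byte_rate == sample_rate * bytes_per_sample
```

The two assertions at the end express the relationships stated in the text: two bytes per sample, and a byte rate of twice the sampling rate for a mono 16-bit file.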
The data chunk (highlighted in yellow) follows the format chunk and is announced in Fig. 1 by the ASCII characters for the word data (64 61 74 61). The next four bytes, 2A 00 00 00, or 0x2A, indicate that the chunk length is 42 bytes. The table “Amplitude Values for a Sine Wave” lists the 21 16-bit integers that follow (stored in little-endian format) and their decimal values. These are the bytes that represent the actual amplitude values for the sine wave.
FIG. 2: This figure is a graphical representation of the amplitude values found in the data chunk of the sound file in Fig. 1. The graph plots the digitized sine wave encoded by the Pulse Code Modulation data in the sound file.
The decimal numbers in the “Amplitude Values for a Sine Wave” table are two's complement numbers in the range -32,768 to 32,767. Two's complement is a way of representing both positive and negative numbers. If you look at the graph in Fig. 2, you can easily see how these numbers trace the shape of the sine wave.
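To see how the three chunks fit together, here is a hedged Python sketch that builds a file with the same layout as the example: 21 samples of a 2,205 Hz sine wave at 44,100 Hz, 16-bit mono. The peak amplitude of 32,767 and the rounding method are my assumptions; an audio editor may scale or round differently, so the sample values are not guaranteed to match the table byte for byte.

```python
import math
import struct

SAMPLE_RATE = 44100
FREQ = 2205
N_SAMPLES = 21       # the 42-byte data chunk holds 21 two-byte samples
AMPLITUDE = 32767    # assumed full-scale peak, not taken from the article

samples = [round(AMPLITUDE * math.sin(2 * math.pi * FREQ * n / SAMPLE_RATE))
           for n in range(N_SAMPLES)]

# "h" packs each sample as a signed 16-bit two's complement integer.
data = struct.pack("<%dh" % N_SAMPLES, *samples)
# Coding 1 (PCM), 1 channel, sample rate, byte rate, bytes per sample, bits per sample.
fmt = struct.pack("<HHIIHH", 1, 1, SAMPLE_RATE, SAMPLE_RATE * 2, 2, 16)

body = (b"WAVE"
        + b"fmt " + struct.pack("<I", len(fmt)) + fmt
        + b"data" + struct.pack("<I", len(data)) + data)
wav = b"RIFF" + struct.pack("<I", len(body)) + body

print(len(wav))          # 86 bytes in total
print(wav[4:8].hex())    # 4e000000, the RIFF size field
print(wav[40:44].hex())  # 2a000000, the data-chunk length
print(samples[1])        # 10126
```

Writing `wav` to disk with `open("sine.wav", "wb").write(wav)` (the file name is just an example) should give a file that a binary viewer displays in the same three-chunk arrangement as Fig. 1.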
Several additional chunks are optional in a WAV file. They are a fact chunk, used to store information about the file contents; a cue chunk, for indicating cue or marker points; a playlist chunk, to establish play order of cue points and looping information; and an associated-data-list chunk for attaching annotations to parts of the sample data. The RIFF format also supports an info chunk where you can place a title, copyright notice, creation date, and other similar text information.
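Because every chunk starts with a four-character name followed by a length, a program can step through a file and skip any optional chunk it doesn't understand. The sketch below illustrates that idea. The function name `iter_chunks` and the chunk name “demo” are made up for the illustration, and the one-byte padding after odd-length chunks comes from the RIFF specification rather than from this article.

```python
import struct

def iter_chunks(riff_bytes):
    """Yield (chunk_name, payload) for each chunk inside a RIFF file."""
    pos = 12  # skip "RIFF", the size field, and the form type
    while pos + 8 <= len(riff_bytes):
        name, size = struct.unpack_from("<4sI", riff_bytes, pos)
        yield name.decode("ascii"), riff_bytes[pos + 8 : pos + 8 + size]
        # Chunks begin on even offsets, so an odd-length payload has one pad byte.
        pos += 8 + size + (size & 1)

# Hypothetical miniature file: a format chunk, a made-up 3-byte chunk, a data chunk.
fmt = bytes.fromhex("01 00 01 00 44 AC 00 00 88 58 01 00 02 00 10 00")
body = (b"WAVE"
        + b"fmt " + struct.pack("<I", len(fmt)) + fmt
        + b"demo" + struct.pack("<I", 3) + b"hi!" + b"\x00"
        + b"data" + struct.pack("<I", 4) + bytes.fromhex("00 00 8E 27"))
example_file = b"RIFF" + struct.pack("<I", len(body)) + body

for name, payload in iter_chunks(example_file):
    print(repr(name), len(payload))
```

A reader that only cares about sound would act on “fmt ” and “data” and ignore everything else, which is what makes the chunk arrangement more flexible than a single fixed header.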
Another common sound-file format is AIF or AIFF (Audio Interchange File Format), developed for the Macintosh platform. AIFF files are similar to WAV files and have chunks with similar functions. The chunks are identified by different names, however. For example, “FORM” is used to identify the sound-file format, “COMM” is the chunk with the format information, and “SSND” is the chunk that contains the sound data. Like WAV files, AIFF files require that these three chunks be present in all files, and again as in the WAV format, these words are followed by the length of the chunk. However, AIFF files use big-endian format for multibyte data. So unlike WAV data, a sequence of bytes for a 16-bit integer such as AC 04 would be the number 0xAC04 (as opposed to 0x04AC, the little-endian interpretation). There are other differences (the layout of the format chunk is not exactly identical, for example), but the basic idea is the same.
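The difference between the two byte orders is easy to demonstrate. This short Python sketch reads the same two bytes, AC 04, both ways; the variable names are only for illustration.

```python
pair = bytes.fromhex("AC 04")

big = int.from_bytes(pair, byteorder="big")        # AIFF-style reading
little = int.from_bytes(pair, byteorder="little")  # WAV-style reading

print(hex(big), big)        # 0xac04 44036
print(hex(little), little)  # 0x4ac 1196
```

The bytes on disk are identical in both cases; only the rule for assembling them into a number changes.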
4 | 66 6D 74 20 | ASCII characters identifying the format chunk | ASCII characters “fmt” plus a space
4 | 10 00 00 00 | length of format chunk (in bytes) after this point | 0x10 (hex) or 16 (decimal)
2 | 01 00 | Type of Coding | PCM
2 | 01 00 | Number of Channels | 1 = mono
4 | 44 AC 00 00 | Sampling Rate (per second) | 0xAC44 (hex) or 44,100 (decimal)
4 | 88 58 01 00 | Byte Rate (per second) | 0x15888 (hex) or 88,200 (decimal)
2 | 02 00 | Bytes per Sample | 0x02 (hex) or 2 (decimal)
2 | 10 00 | Bits per Sample | 0x10 (hex) or 16 (decimal)
AIFF also allows optional chunks such as a marker chunk, where cue points can be stored; an instrument chunk, containing playback data (such as looping information) for sampling keyboards; and a MIDI-data chunk, for MIDI System Exclusive messages or any other type of MIDI data. It also permits an audio-recording chunk that has information used by some audio-recording devices; an application-specific chunk that could be used by a specific application for any desired use; a comments chunk; and additional text chunks for name, author, copyright, and annotation.
Other sound-file formats have different specifics, but you will recognize similar principles. The Sound Designer I and II formats were created for Digidesign's Sound Tools and Pro Tools software. The AU and SND formats are associated with Sun computers, the NeXT, and UNIX machines. If you need more information about these or other formats, a good place to start is Chris Bagwell's Audio File Format FAQ (http://home.sprynet.com/~cbagwell/audio.html).
So what happens if you need to use a sound-file format that is not supported by the hardware or software you are using? If you use Windows, you should consider a sound-file format-conversion program such as Lance Norskog's SoX (short for Sound Exchange; available at www.spies.com/Sox/). SoX is a DOS command-line program and has a simple command syntax. You should also look at FMJ-Software's Awave (www.fmjsoft.com), which can read more than 60 formats and convert them into nearly 30 flavors.
Mac users should check out Tom Erbe's Soundhack (www.soundhack.com). It can read almost any sound-file format you'll encounter on a Mac and save it in many other formats. Soundhack also includes a wondrous array of powerful processing options.
It's not often that you'll have to dig into an audio file, but knowing what's inside can often be a help when problems arise. The information these files contain helps ensure that your digital sound files sound the way they're supposed to and allows you to go about your business of making music.
Peter Hamlin is a composer who teaches at St. Olaf College. He is also a member of the live electronic-music improv band Data Stream. He is currently developing a set of electronic pieces based on the paintings of George Todd.
ALL ABOUT NUMBERING SYSTEMS
The numbering system people are most familiar with is called base-10, or decimal. No doubt it developed because you have ten fingers to count with. A decimal number has a ones column, a tens column, a hundreds column, and so on, depending on the size of the number. The decimal number 158 really means this:
(1 × 100) + (5 × 10) + (8 × 1) = 158
But base-10 isn't the only possibility. When you tell time with minutes and seconds, you use a base-60 system. That is, 1 hour, 5 minutes, 30 seconds, which might be represented as 1:05:30, really means this:
(1 × 60 × 60) + (5 × 60) + (30 × 1) = 3,930 seconds
Computers use base-2, or binary, numbers because at the most basic electrical level, computers have two states: off (0) and on (1). A one-digit binary (or 1-bit) number has two values, 0 and 1. A 2-bit number can have four values: 00, 01, 10, or 11 (decimal values 0 to 3). An 8-bit number (called a byte) can have 256 values (0 to 255), and a 16-bit number can have 65,536 values. When referring to audio files, the number of bits in each sample is called the resolution, and the more bits there are, the more accurately a sound can be represented.
Binary numbers are difficult to read, so they are typically represented with a base-16 (hexadecimal) system. Hexadecimal numbers use the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F (0 to 15 in decimal), and each hexadecimal digit can represent four binary bits. That means an 8-bit byte of data such as 1001 1110 can easily be represented by two hexadecimal digits, which, in this case, are 9E. (To be clear, the prefix 0x is often added to specify that the number is in hexadecimal format; for example, 0x9E.)
Hexadecimal numbers, like decimal numbers, are written in columns. But in place of the hundreds, tens, and ones columns, a hexadecimal number has columns based on the number 16. In the number 9E, for example, the left column is for multiples of 16¹, so the number 9 is multiplied by 16 to give a decimal value of 144. The right column is for multiples of 16⁰ (that is, 1), and there you find the letter E, which is equivalent to the decimal number 14. So in total, 9E in hex is the same as 158 in decimal:
(9 × 16¹ = 9 × 16 = 144) + (hex E × 16⁰ = 14 × 1 = 14) = 158
If you had a larger hexadecimal number, you would have more columns, each representing a further power of 16 (16², or 16 × 16; 16³, or 16 × 16 × 16; and so forth) depending on the size of the number.
In summary, the decimal number 158 can be represented in hexadecimal notation as 9E or in binary notation as 1001 1110. It's the same value in each case, but the system of showing that value is different.
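Most programming languages can do these conversions for you. As a quick illustration in Python (one convenient choice among many), the same value can be printed in each notation and converted back:

```python
value = 158

print(format(value, "X"))    # 9E (hexadecimal)
print(format(value, "08b"))  # 10011110 (binary, padded to eight digits)

# Converting back from each notation gives the same number.
assert int("9E", 16) == value
assert int("10011110", 2) == value

# The column-by-column sum from the sidebar.
assert 9 * 16**1 + 14 * 16**0 == value
```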
A common way to represent negative numbers in binary arithmetic is with two's complement numbers. You get the two's complement by changing all 1s to 0s and all 0s to 1s (a process called complementing) and adding 1. The number 10,126 in the table “Amplitude Values for a Sine Wave” is shown in hexadecimal format as 0x278E. In binary, 0x278E is 0010 0111 1000 1110. The complement of that number is 1101 1000 0111 0001, and when you add 1, you get 1101 1000 0111 0010. This number can be represented in hexadecimal numbers as 0xD872, and in the table, the value 0xD872 is used to represent -10,126 when the waveform is negative.
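The complement-and-add-one recipe can be followed step by step in code. This is a sketch for 16-bit values only, with variable names chosen for the illustration; the mask 0xFFFF keeps the result within 16 bits.

```python
import struct

positive = 0x278E                             # 10,126
complemented = positive ^ 0xFFFF              # flip all 16 bits
negative_bits = (complemented + 1) & 0xFFFF   # add 1, stay within 16 bits

print(hex(complemented))    # 0xd871
print(hex(negative_bits))   # 0xd872

# Store the bits little-endian, as a WAV file would, then read them as signed.
stored = negative_bits.to_bytes(2, "little")  # bytes 72 D8
(signed_value,) = struct.unpack("<h", stored)
print(signed_value)         # -10126
```

In the file itself, then, the sample -10,126 appears as the byte pair 72 D8.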
Computers understand only numbers, so when you want to use letters, you need to represent them in numeric code. American Standard Code for Information Interchange (ASCII) is a standard code in which the lowercase letters a through z are represented by the numbers 0x61 through 0x7A, and A through Z (the capital letters) are represented by 0x41 through 0x5A. Commas, periods, spaces, and other commonly used language symbols are given their own number codes, as well.
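The chunk names in a WAV file are a handy place to see ASCII at work. The following Python lines are a small sketch that converts in both directions, using byte values quoted in the article.

```python
# Hex bytes to text: the first four bytes of the example file.
print(bytes.fromhex("52 49 46 46").decode("ascii"))   # RIFF

# Text to hex bytes: the format-chunk name, including its trailing space.
fmt_hex = " ".join(f"{b:02X}" for b in "fmt ".encode("ascii"))
print(fmt_hex)                                         # 66 6D 74 20

# The letter ranges quoted in the sidebar.
assert (ord("a"), ord("z")) == (0x61, 0x7A)
assert (ord("A"), ord("Z")) == (0x41, 0x5A)
```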