De-compression
(from Hi-Fi World, August 2006 issue)
Love it or loathe it, compressed audio is everywhere these days, from DAB to MP3. Steven Green explains why it’s necessary, how it works, and why it’s not quite as bad as you think…
Compressed digital audio has acquired a very poor reputation for audio quality, which is deserved in some cases, but undeserved in others. The basic reflex reaction of any audiophile is that it’s automatically the enemy of music, as it’s ‘data reduction’ - and that goes against all our highest principles, doesn’t it?
Yet there’s no denying that, by doing what it does, it has brought more music to millions of people sooner than would otherwise have happened (imagine an iPod Nano with room for just 15 songs!). It also makes DAB possible, at least in its present format. Without compressed audio, Digital Radio might not have happened until much later, and with far fewer stations available.
How then does it work? Well, the ‘bit rate’ (or data rate) of a digital audio stream determines the bandwidth (i.e. frequency range) the stream requires on broadcasting systems, and also the amount of memory required to store audio on a PC or MP3 player. Unfortunately, ‘uncompressed’ audio, such as the PCM (pulse-code modulation) format used by CD, uses a bit rate of 1,411kbps (thousand bits per second). Such a high bit rate rules out uncompressed audio for some important digital audio applications. For example, even if the highest capacity possible on a DAB multiplex were used, only one uncompressed radio station could be carried per multiplex! And as UK radio listeners can typically receive four DAB multiplexes, this would mean they could only receive four radio stations in total...
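The arithmetic behind those figures is easy to check. The short Python sketch below uses the standard CD parameters; the DAB multiplex figure of roughly 1,700kbps is an assumed maximum for the usable capacity with the weakest error protection, so treat it as illustrative rather than gospel:

sample_rate = 44_100              # CD samples per second
bits_per_sample = 16              # CD word length
channels = 2                      # stereo

cd_bit_rate = sample_rate * bits_per_sample * channels
print(cd_bit_rate)                # 1,411,200 bits per second, i.e. ~1,411kbps

multiplex_capacity = 1_700_000    # assumed maximum usable DAB multiplex capacity, bits/s
print(multiplex_capacity // cd_bit_rate)   # 1: just one uncompressed station would fit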
Also, applications such as storing record collections on PC hard drives, portable MP3 players, legal music downloads and file-sharing networks simply wouldn’t have been feasible until very recently if only uncompressed audio could be used – ‘iTunes’ would have been ‘myTune’! Clearly, a step-change reduction in the bit rate used for digital audio was needed. Early forms of compression were not very successful, however, only managing to reduce the bit rate to around 1,000kbps, still far too high. These early schemes used ‘lossless’ compression, where the datastream at the output of the decoder is identical to the original audio datastream. The major breakthrough came from taking a radically different approach: abandoning lossless compression in favour of ‘lossy’ compression, where the original and final datastreams do not have to be identical, and taking into account the way the human hearing system actually works, which is the field of psychoacoustics.
IN THE MIND Psychoacoustics is the study of human auditory perception, ranging from the biological design of the ear to the brain’s interpretation of aural information. One of the concepts central to psychoacoustics is the minimum threshold of hearing (MTH). This is, as the name suggests, the sound pressure level below which a tone (a sinewave at a certain frequency) is inaudible.
The minimum threshold of hearing can be seen in Figure 1 below as the vaguely U-shaped curve that runs from 20Hz to 20kHz, the bandwidth of human hearing. One thing the curve shows is that human hearing is very sensitive to sound at frequencies between around 1kHz and 6kHz (where the curve is lowest) and insensitive at very high and very low frequencies. For example, a tone at 20Hz would need an amplitude approximately 65dB higher (around 1,800 times the amplitude, or over three million times the power) than a tone at 1kHz for both to be barely audible. The figure also shows a 40Hz tone (dashed arrow on the left side of the figure) whose amplitude is below the minimum threshold of hearing, and which is therefore inaudible.
Figure 1. Minimum threshold of hearing and masking curves
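For the curious, the threshold-in-quiet curve can even be written down. A commonly quoted approximation is Terhardt’s formula, sketched below in Python; the values it prints approximate the general shape of the curve in Figure 1 rather than reproducing its exact readings:

import math

def threshold_in_quiet_db(f_hz):
    # Terhardt's approximation to the minimum threshold of hearing, in dB SPL.
    f = f_hz / 1000.0
    return 3.64 * f ** -0.8 - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f ** 4

for f in (20, 40, 100, 1_000, 4_000, 16_000):
    print(f"{f:6d}Hz: {threshold_in_quiet_db(f):6.1f} dB SPL")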
The other concept central to psychoacoustics is ‘amplitude masking’, or just ‘masking’ for short. Masking is the process whereby a stronger tone at one frequency makes a weaker tone at a nearby frequency inaudible.
Researchers in the field of psychoacoustics painstakingly measured the level at which a tone at each frequency became just audible while a strong tone was also being played. This led to the concept of the ‘masking curve’, which allows researchers to predict whether tones at other frequencies will be audible, depending on whether their amplitude is above or below the curve. An example of a masking curve can be seen in Figure 1, where the high amplitude tone at 400Hz (solid black arrow), referred to as the ‘masker’, has generated the triangular-shaped masking curve surrounding it. The tone at 600Hz has an amplitude that is above the minimum threshold of hearing but below the masking curve, which means that the 400Hz tone has made the 600Hz tone inaudible.
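To make the audibility test concrete, here is a toy version of it in Python. The tone levels and the slope of the triangular curve are invented purely for illustration; real psychoacoustic models use empirically measured spreading functions rather than a single straight-sided triangle:

import math

def masking_threshold_db(masker_freq_hz, masker_level_db, probe_freq_hz,
                         slope_db_per_octave=12.0, offset_db=10.0):
    # Simple triangular masking curve: the threshold falls away from the
    # masker by a fixed number of dB per octave (illustrative values only).
    octaves = abs(math.log2(probe_freq_hz / masker_freq_hz))
    return masker_level_db - offset_db - slope_db_per_octave * octaves

masker_freq, masker_level = 400.0, 70.0   # assumed 400Hz masker at 70dB SPL
probe_freq, probe_level = 600.0, 45.0     # assumed 600Hz tone at 45dB SPL

threshold = masking_threshold_db(masker_freq, masker_level, probe_freq)
print("inaudible (masked)" if probe_level < threshold else "audible")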
Masking also occurs over time as well as over frequency. For example, if you sound two tuning forks in quick succession that are both tuned to the same frequency, and you sound the first louder than the second, you will only be able to hear the first tuning fork.
Real audio signals are of course far more complex than the combination of a handful of tones, but what researchers found was that by applying these masking curves to real audio signals, whole sections of the audio spectrum were being ‘hidden’ by strong masking tones, and the information in the inaudible parts of the audio spectrum could be discarded without significantly affecting the perceived audio quality of the signal. Most important of all, this method of discarding inaudible information allowed the bit rates to be reduced to around 256 - 384kbps (between 3.5 and 5.5 times lower than the bit rate of uncompressed audio) while maintaining good audio quality.
PERCEPTUAL AUDIO CODING This new method of audio compression was called perceptual audio coding, because only the information that humans can actually perceive is kept, and it forms the basis of all reduced bit rate audio codecs (codec stands for COder/DECoder) that have emerged since it was first developed in the mid 1980s.
As Figure 2 shows, perceptual audio encoders work by first sending the input audio signal to a ‘psychoacoustic model’ that tries to mimic the way the human hearing system works by generating masking curves for all of the tones present in the input signal. Following analysis of all of the masking curves, the psychoacoustic model informs the compression algorithm which parts of the audio spectrum are audible, and only these frequency bands will be encoded – the data representing the inaudible parts is discarded. The compression algorithm’s job is then to represent the audible information as accurately as possible given the number of bits available.
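The Python fragment below is a deliberately crude sketch of that structure, assuming NumPy is available. Its ‘psychoacoustic model’ simply keeps spectral bins within a fixed number of dB of the strongest one, which is nothing like a real MP2 or AAC model, but it shows the overall shape of the process: analyse the spectrum, decide what counts as audible, and encode only that:

import numpy as np

def toy_perceptual_encode(frame, keep_within_db=40.0):
    # Analyse the spectrum of one frame of audio.
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    level_db = 20 * np.log10(np.abs(spectrum) + 1e-12)
    # Crude stand-in for the psychoacoustic model: keep only bins within
    # keep_within_db of the strongest bin, discard the rest.
    audible = level_db > (level_db.max() - keep_within_db)
    coded = np.where(audible, spectrum, 0)
    return coded, audible

fs = 48_000
t = np.arange(2_048) / fs
frame = np.sin(2 * np.pi * 400 * t) + 0.001 * np.sin(2 * np.pi * 9_000 * t)
coded, audible = toy_perceptual_encode(frame)
print(f"{audible.sum()} of {audible.size} bins kept")   # most bins are discarded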
Figure 2. Simplified block diagram of a perceptual audio encoder
TWO TRIBES There are two types of perceptual audio codec: subband codecs and transform codecs. The fundamental difference between the two is that subband codecs compress the audio waveform data (the amplitude versus time signal), whereas transform codecs compress the data that makes up the spectrum representation (the amplitude versus frequency representation) of the audio. This difference has a major influence on their performance, as a result of differences in their ‘frequency resolution’.
Subband codecs, such as MP2, split the input signal into only 32 equal-bandwidth subbands. This means that for an input audio signal with a sampling frequency of 48kHz each subband is 750Hz wide, which is the frequency resolution of MP2 (the audio bandwidth is half the sampling frequency, because of Nyquist’s Sampling Theorem).
Transform codecs take their name from the fact that the input audio is transformed into the spectrum representation, and a positive by-product of changing to the spectral view is that the frequency resolution is hugely improved. Using the example of the AAC audio codec, the long transform operates on 2,048 samples at a time and represents the signal with 1,024 ‘spectral lines’, which for a sampling frequency of 48kHz equates to a frequency resolution of around 23Hz. The advantage of having a very fine frequency resolution is that inaudible information can be discarded on a far finer scale, which means that significantly more inaudible information can be discarded overall.
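The frequency-resolution figures quoted above follow directly from the numbers. Worked through in Python for a 48kHz sampling rate (the 1,024-line figure assumes AAC’s long, 2,048-sample transform):

sampling_rate = 48_000
audio_bandwidth = sampling_rate / 2            # Nyquist: 24,000Hz of audio bandwidth

mp2_subbands = 32
print(audio_bandwidth / mp2_subbands)          # 750.0 Hz per subband (MP2)

aac_spectral_lines = 1_024                     # from a 2,048-sample long transform
print(audio_bandwidth / aac_spectral_lines)    # ~23.4 Hz per spectral line (AAC)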
Looking at it from the opposite perspective, the more inaudible information that is discarded the less information there is left to encode, so for a fixed bit rate level, the remaining information can be encoded more accurately, because the accuracy depends on the number of bits used to represent the value. And the higher the accuracy at which the information is represented the higher the audio quality will be, which is why transform codecs provide higher audio quality than subband codecs when they’re used at the same bit rate level.
Another way of looking at this is in terms of ‘compression efficiency’, which is the bit rate required to provide a set level of audio quality. Transform codecs require a lower bit rate than subband codecs to provide the same level of audio quality, so transform codecs are generally much more efficient than subband codecs. For example, AAC, which is a transform codec, provides the same level of audio quality at 96kbps as MP2 (a subband codec) provides at 192kbps.
SOUND ADVICE Of the commonly used codecs, only MP2 (which is used on DAB) is of the subband variety, and MP3, AAC (Advanced Audio Coding), HE AAC (High Efficiency AAC, also known as AAC+), Ogg Vorbis and Windows Media Audio (WMA) are all transform codecs. With the exception of HE AAC, which is only used at very low bit rates, all of the above transform codecs can provide near CD-quality at bit rate levels of around 128kbps, whereas MP2 needs to use a bit rate level of 224kbps to provide near CD-quality.
However, it is universally true that, for a given audio codec, the higher the bit rate used, the higher the audio quality will be, so bit rates of 192kbps and above are recommended with the transform codecs listed above when compressing your own audio at home or for an MP3 player. AAC provides the best audio quality, but apart from Apple’s iPod it isn’t supported by many hardware devices such as portable players and car stereos, whereas MP3 has universal support on such devices.
The importance of using a good quality encoder also cannot be overstated, and the best AAC/AAC+ and MP3 encoders currently available are the Nero and LAME encoders, respectively, which are free and can be downloaded from http://tinyurl.com/pps6d and http://tinyurl.com/d9mds. However, these are ‘command-line encoders’, so they’re not recommended for beginners, and they’re best used in conjunction with a ‘front-end’ application such as Exact Audio Copy. If ease of use is more important to you, try the iTunes and RazorLame applications for AAC and MP3, respectively.
It is also recommended that you use VBR (variable bit rate) mode when you are compressing audio at home. This is because audio is inherently variable in nature. For example, when a single instrument is being played there may only be a relatively small number of frequencies that are audible, so these frequencies can be encoded with relatively few bits. If more instruments begin playing, more frequencies become audible and these also need to be encoded, so more bits need to be used to maintain good audio quality. VBR mode allocates the appropriate – variable – number of bits in order to provide a constant level of audio quality, chosen by the user in the encoder options.
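The CBR-versus-VBR idea can be caricatured in a few lines of Python. The numbers below are entirely made up; the point is simply that a VBR encoder spends bits in proportion to how much audible information each frame contains, while CBR spends the same budget regardless:

frame_complexity = [40, 40, 200, 600, 180, 45]   # hypothetical audible bands per frame
cbr_budget = 3_000                               # bits per frame, fixed (assumed)
vbr_bits_per_band = 12                           # bits per audible band (assumed)

for bands in frame_complexity:
    vbr_budget = bands * vbr_bits_per_band
    print(f"{bands:4d} audible bands -> CBR {cbr_budget} bits, VBR {vbr_budget} bits")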
It should also be said that when AAC, MP3, WMA9 or Ogg Vorbis are used at high bit rates, they’re capable of providing audio quality very close to that of a CD. If you played a high bit rate compressed audio file through a good quality DAC into a decent hi-fi system, I think many of the sceptics who write off audio compression as being of generally poor quality would be very surprised at how good it can be. Comparing a compressed audio file played back through a low quality DAC (a computer sound card or an MP3 player, for example) with the quality produced by an expensive CD player is an unfair comparison, but this is often overlooked.
DAB = BAD As I’ve just discussed, when audio codecs are used as intended they are capable of good results. DAB in the UK, however, is an example of how not to use an audio codec! The overriding problem is that the broadcasters are using grossly insufficient bit rate levels for the inefficient MP2 audio codec that is used on DAB – 98% of all stereo stations on DAB in the UK are using 128kbps, when the MP2 codec needs to be used at a bit rate of 224kbps to provide near CD-quality.
Audio on DAB uses the constant bit rate (CBR) mode rather than the more efficient VBR mode, and this combination of CBR and an insufficient bit rate results in the quality degrading the more complex the audio is to encode. This means that while audio that is very simple to encode can sound good, some smooth jazz and speech being examples, anything more demanding will sound poor or dreadful.
The best example of dreadful sounding audio on DAB is rock music, especially during parts of tracks when the electric guitars and drums are being played simultaneously and loudly. The problem with such material is that it has a wideband and high amplitude audio spectrum, so very little inaudible information can be discarded. This results in the available bits being spread so thinly that the accuracy at which the waveform is represented is very low, which drastically reduces the quality and definition of the music. Also, the lower the accuracy at which the signal is represented the higher the level of ‘coding noise’ will be – coding noise is the difference between the uncompressed and compressed signals, and at high bit rates it is inaudible. So you have a ‘catch 22’ situation, where both the low accuracy and the coding noise conspire against the audio, providing a very ill-defined, dull and messy sound.
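Coding noise is easy to demonstrate with the simplest possible ‘codec’, a plain uniform quantiser. The Python sketch below (the bit depths are arbitrary, and a real codec shapes its noise far more cleverly) measures the noise as the difference between the original and the decoded signal and reports its level relative to the signal; fewer bits means more noise:

import numpy as np

def coding_noise_db(signal, bits):
    levels = 2 ** bits
    decoded = np.round(signal * (levels / 2)) / (levels / 2)   # crude uniform quantiser
    noise = signal - decoded                                   # the coding noise itself
    return 10 * np.log10(np.mean(noise ** 2) / np.mean(signal ** 2))

t = np.arange(48_000) / 48_000
signal = np.sin(2 * np.pi * 1_000 * t)
print(coding_noise_db(signal, 16))   # roughly -98dB: inaudible
print(coding_noise_db(signal, 4))    # roughly -26dB: clearly audible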
Another problem with the MP2 codec is that it is limited to a ‘joint stereo’ coding technique called ‘intensity stereo’. Joint stereo coding is a method that audio codecs use to save bits by encoding the left and right channels together rather than separately. All the modern audio codecs can use the ‘mid/side’ joint stereo technique, which is lossless, meaning that no information is destroyed. Intensity stereo, on the other hand, destroys the phase information between the left and right channels, which frequently leads either to the total collapse of the stereo image or to the stereo image being variable and unstable – this is why many pop and rock stations appear to be virtually mono. As well as being a very inefficient codec, MP2 is also particularly bad at representing high frequencies, and this situation is exacerbated when an insufficient bit rate is used, resulting in a metallic splish-splosh kind of sound.
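For the record, this is all mid/side coding does; the sample values below are arbitrary. Because the original channels can be recovered exactly, nothing is lost, whereas intensity stereo keeps only a single combined signal plus level information, so the phase relationship between left and right is thrown away:

def ms_encode(left, right):
    mid = [(l + r) / 2 for l, r in zip(left, right)]    # 'mid' channel
    side = [(l - r) / 2 for l, r in zip(left, right)]   # 'side' channel
    return mid, side

def ms_decode(mid, side):
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

left, right = [0.5, -0.25, 0.125], [0.25, -0.75, 0.5]
assert ms_decode(*ms_encode(left, right)) == (left, right)   # reconstructed exactly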
Overall then, high quality modern codecs used intelligently (i.e. at decent bitrates) really aren’t that bad – it’s just the old, obsolete and inefficient ones like MP2 in Digital Audio Broadcasting that let the side down. Digital audio compression can be a great ‘enabling technology’, or an audiophile’s bad joke – depending on which system you use.
LINKS
Nero encoder download: http://tinyurl.com/pps6d
LAME encoder download: http://tinyurl.com/d9mds