How Libvorbis Implements Psychoacoustic Masking
This article explains how the open-source libvorbis
audio codec implements psychoacoustic masking to achieve high-quality,
lossy audio compression. It breaks down the mathematical and biological
principles the encoder uses to analyze sound, including frequency and
temporal masking, and details how these models guide the quantization
process to discard inaudible data without sacrificing perceived sound
quality.
The Role of Psychoacoustics in Vorbis
The core goal of lossy audio compression is to remove data that the
human ear cannot perceive. libvorbis achieves this by
running input audio through a psychoacoustic model. This
model estimates the masking threshold at each moment in the
audio stream: the level below which a sound is either too quiet to be
heard at all or is masked by other, more dominant sounds.
By identifying these inaudible components, the encoder can allocate fewer bits (or zero bits) to them, saving bandwidth for the parts of the audio spectrum that the human ear actually registers.
Time-to-Frequency Domain Conversion
Before applying psychoacoustic rules, libvorbis converts
the incoming time-domain audio signal into the frequency domain. It does
this using the Modified Discrete Cosine Transform
(MDCT).
The MDCT analyzes the audio in overlapping blocks (windows).
libvorbis dynamically switches between long blocks
(typically 2048 samples) for stationary, steady signals to get high
frequency resolution, and short blocks (typically 256 samples) during
fast transients (like a drum hit) to prevent pre-echo artifacts.
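The transform step can be illustrated with a small sketch. It uses the Vorbis window shape, sin(π/2 · sin²(π(n + 0.5)/N)), together with a naive O(N²) MDCT and inverse MDCT, and shows that overlap-adding windowed blocks cancels the time-domain aliasing. The function names are hypothetical, the 2/N inverse scaling is one common convention, and libvorbis itself uses a fast, optimized transform and special window shapes at long/short transitions, which are not modeled here.

```python
import math

def vorbis_window(block_size):
    """Vorbis power-complementary window for a block of block_size samples."""
    return [
        math.sin(0.5 * math.pi * math.sin(math.pi * (n + 0.5) / block_size) ** 2)
        for n in range(block_size)
    ]

def mdct(block):
    """Naive MDCT: 2N windowed samples in, N coefficients out."""
    n_coeffs = len(block) // 2
    return [
        sum(
            block[n] * math.cos(math.pi / n_coeffs * (n + 0.5 + n_coeffs / 2) * (k + 0.5))
            for n in range(2 * n_coeffs)
        )
        for k in range(n_coeffs)
    ]

def imdct(coeffs):
    """Naive inverse MDCT: N coefficients in, 2N time-aliased samples out."""
    n_coeffs = len(coeffs)
    return [
        (2.0 / n_coeffs) * sum(
            coeffs[k] * math.cos(math.pi / n_coeffs * (n + 0.5 + n_coeffs / 2) * (k + 0.5))
            for k in range(n_coeffs)
        )
        for n in range(2 * n_coeffs)
    ]

def analyze_and_resynthesize(signal, block_size):
    """Window, transform, invert, window again, and overlap-add at 50% overlap."""
    hop = block_size // 2
    window = vorbis_window(block_size)
    output = [0.0] * len(signal)
    for start in range(0, len(signal) - block_size + 1, hop):
        block = [signal[start + n] * window[n] for n in range(block_size)]
        rebuilt = imdct(mdct(block))
        for n in range(block_size):
            output[start + n] += rebuilt[n] * window[n]
    return output
```

Without quantization, every sample covered by two overlapping blocks is reconstructed exactly (up to floating-point error). Only the first and last half-blocks, which lack an overlapping neighbor, remain aliased.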
Implementing Simultaneous (Frequency) Masking
Simultaneous masking occurs when a loud sound (the masker) makes a
quieter, nearby frequency (the masked signal) inaudible.
libvorbis models this using several key steps:
1. Tone and Noise Detection
The human ear tolerates more noise around noise-like signals than
around pure tones. libvorbis analyzes the MDCT spectrum to
differentiate between tonal (sinusoidal) components and noise-like
components. It calculates a “tonality” metric for different frequency
bands because tonal maskers hide less quantization noise than noise-like
maskers, so tonal bands need much stricter (lower) masking
thresholds than noisy bands.
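One textbook way to express a tonality metric is the spectral flatness measure: the ratio of the geometric mean to the arithmetic mean of the power in a band. The sketch below uses that measure as a stand-in. It is not libvorbis's own detector, and the function names and the 18 dB and 6 dB offsets are illustrative assumptions only.

```python
import math

def spectral_flatness(power_band):
    """Geometric mean / arithmetic mean of band power.

    Values near 1 indicate a noise-like band; values near 0 indicate a tonal one.
    """
    eps = 1e-12  # avoids log(0) and division by zero on silent bands
    n = len(power_band)
    log_mean = sum(math.log(p + eps) for p in power_band) / n
    arithmetic_mean = sum(power_band) / n + eps
    return math.exp(log_mean) / arithmetic_mean

def masking_offset_db(flatness, tone_offset_db=18.0, noise_offset_db=6.0):
    """Blend how far below the masker the threshold sits (illustrative offsets).

    A tonal band (flatness 0) gets the larger offset, i.e. a stricter threshold.
    """
    flatness = min(max(flatness, 0.0), 1.0)
    return tone_offset_db + flatness * (noise_offset_db - tone_offset_db)
```

A band with equal power in every bin yields a flatness close to 1 and the smaller offset, while a band dominated by one bin yields a flatness close to 0 and the larger offset.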
2. Excitation Curves and Spreading Functions
A masker does not just mask its own frequency bin; it
projects a “masking curve” (or excitation curve) onto neighboring
frequencies. libvorbis calculates these curves using a
spreading function. This function mimics the basilar membrane in the
human inner ear, where masking spills over more heavily into higher
frequencies than into lower frequencies.
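The asymmetry can be sketched with a simple triangular spreading function on the Bark scale. The Bark conversion below is the widely used Zwicker-style approximation. The slopes (27 dB per Bark toward lower frequencies, 10 dB per Bark toward higher ones) and the 10 dB masker offset are illustrative textbook-style values, not the curves that libvorbis ships, and the function names are hypothetical.

```python
import math

def hz_to_bark(freq_hz):
    """Zwicker-style approximation of the Bark critical-band scale."""
    return 13.0 * math.atan(0.00076 * freq_hz) + 3.5 * math.atan((freq_hz / 7500.0) ** 2)

def spread_db(masker_bark, probe_bark, lower_slope=27.0, upper_slope=10.0):
    """Drop in dB relative to the masker, as a function of Bark distance.

    The curve falls steeply below the masker and gently above it.
    """
    distance = probe_bark - masker_bark
    if distance < 0:
        return lower_slope * distance
    return -upper_slope * distance

def masking_curve_db(masker_hz, masker_level_db, probe_freqs_hz, offset_db=10.0):
    """Masked threshold in dB that one masker projects onto each probe frequency."""
    masker_bark = hz_to_bark(masker_hz)
    return [
        masker_level_db - offset_db + spread_db(masker_bark, hz_to_bark(f))
        for f in probe_freqs_hz
    ]
```

For a 1 kHz masker, the threshold projected onto 2 kHz comes out far higher than the one projected onto 500 Hz, which mirrors the upward spread of masking described above.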
3. Absolute Threshold of Hearing
The encoder also factors in the absolute threshold of hearing (ATH), the minimum sound level that is audible in a perfectly quiet room, which varies with frequency. A component that falls below the ATH can be discarded whether or not other sounds are masking it.
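The shape of the ATH can be sketched with Terhardt's well-known closed-form approximation, which gives the threshold in dB SPL as a function of frequency. This formula is a stand-in chosen for illustration; libvorbis uses its own tabulated curve, and mapping dB SPL onto digital sample levels requires an assumed playback level that is not modeled here. The function names are hypothetical.

```python
import math

def ath_db_spl(freq_hz):
    """Terhardt's approximation of the absolute threshold of hearing (dB SPL)."""
    khz = max(freq_hz, 20.0) / 1000.0  # clamp to avoid blowing up near 0 Hz
    return (
        3.64 * khz ** -0.8
        - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
        + 1e-3 * khz ** 4
    )

def global_threshold_db(masking_db, freq_hz):
    """Combine a masking threshold with the ATH: the higher of the two wins."""
    return max(masking_db, ath_db_spl(freq_hz))
```

The curve is lowest around 3 to 4 kHz, where hearing is most sensitive, and rises sharply toward both ends of the audible range, so very low and very high frequencies need much more energy before they matter.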
Implementing Temporal Masking
Temporal masking occurs when sounds closely preceding or following a loud sound are obscured.
* Pre-masking: A quiet sound immediately before a loud sound is masked (lasts about 10–20 milliseconds).
* Post-masking: A quiet sound after a loud sound is masked (lasts up to 100–200 milliseconds).
libvorbis handles temporal masking primarily through its
block-size switching mechanism. When a transient is
detected, the encoder switches to short blocks. This confines
the quantization noise generated by the transient to a very short window
of time, so that it does not spread backward in time
(pre-echo) into a quiet passage where the ear would easily detect it.
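A minimal sketch of a block-size decision is shown below. It compares the energy of consecutive short sub-blocks and picks short blocks when the energy jumps sharply. This is a deliberately simplified stand-in: the function name, the single broadband energy measure, and the ratio threshold of 8 are assumptions for illustration, and the transient detector in libvorbis is more elaborate.

```python
def choose_block_size(frame, long_size=2048, short_size=256, ratio_threshold=8.0):
    """Return short_size if any sub-block is much louder than the one before it."""
    # Energy of each consecutive short_size chunk; the small constant keeps
    # the ratio finite when a chunk is digital silence.
    energies = [
        sum(s * s for s in frame[i:i + short_size]) + 1e-12
        for i in range(0, len(frame) - short_size + 1, short_size)
    ]
    for previous, current in zip(energies, energies[1:]):
        if current / previous > ratio_threshold:
            return short_size  # sharp attack: keep quantization noise localized
    return long_size  # steady signal: favor frequency resolution
```

A steady frame keeps the long block, while a frame that goes from near silence to a loud burst selects the short block, so the noise introduced by quantizing the burst cannot smear far ahead of the attack.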
The Floor and Residue Representation
Once libvorbis calculates the final masking threshold
across the frequency spectrum, it encodes each block's spectrum in two
parts: the Floor, which follows the threshold, and the
Residue, which carries the remaining detail.
The Floor (Spectral Envelope)
The “Floor” is a highly compressed approximation of the
psychoacoustic masking threshold curve. libvorbis primarily
uses Floor 1, which represents the curve as a
series of line segments (a piecewise linear representation) on a
logarithmic (dB) amplitude scale and a linear frequency scale.
The encoder computes this envelope to match the calculated psychoacoustic masking threshold. During decoding, the player reconstructs this Floor first.
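The piecewise linear idea can be sketched as follows: a handful of (bin, dB) posts are stored, and the full curve is rebuilt by interpolating between them. This sketch uses floating-point interpolation for clarity. The Vorbis format itself specifies integer line rendering and a lookup table from floor values to linear amplitudes, which are not reproduced here, and the function names are hypothetical.

```python
def render_floor_db(posts, n_bins):
    """Linearly interpolate (bin, dB) posts across n_bins spectral bins.

    posts must contain at least two points and should span bins 0..n_bins.
    """
    posts = sorted(posts)
    curve = []
    segment = 0
    for b in range(n_bins):
        # Advance to the line segment that contains bin b.
        while segment < len(posts) - 2 and b >= posts[segment + 1][0]:
            segment += 1
        (x0, y0), (x1, y1) = posts[segment], posts[segment + 1]
        curve.append(y0 + (y1 - y0) * (b - x0) / (x1 - x0))
    return curve

def db_to_linear(db):
    """Convert a dB amplitude to a linear scale factor."""
    return 10.0 ** (db / 20.0)
```

Three posts are enough to describe an eight-bin curve in this toy case, which is where the compression comes from: the decoder rebuilds every bin from a few stored points.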
The Residue
The “Residue” is the spectral detail that remains after the Floor (the masking curve) is removed from the original MDCT spectrum. Because the Floor is expressed on a logarithmic amplitude scale, removing it is a subtraction in the log domain, which corresponds to a division in the linear domain.
During encoding, libvorbis divides the MDCT coefficients
by the Floor values. Residue values that are small relative to the
Floor, and therefore lie below the masking threshold, quantize to zero.
The remaining audible residue is partitioned, grouped into vectors, and
compressed using vector quantization with Huffman-coded codebook entries.
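The division step can be sketched with plain scalar rounding. Dividing by the floor normalizes each coefficient so that one quantizer step equals the local floor value, which means the rounding error in each bin scales with the floor. The real encoder replaces the rounding shown here with multi-stage vector quantization, so this is only a simplified illustration, and the function names are hypothetical.

```python
def split_floor_residue(coeffs, floor):
    """Divide each MDCT coefficient by its linear floor value and round."""
    return [int(round(c / f)) for c, f in zip(coeffs, floor)]

def reconstruct(residue, floor):
    """Decoder side: multiply the residue back by the floor."""
    return [r * f for r, f in zip(residue, floor)]
```

With a floor of 0.1 in every bin, coefficients smaller than half the floor round to zero and cost almost nothing to code, and the reconstruction error in every bin stays within half the floor value.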
By dynamically adjusting the Floor for each block according to its
psychoacoustic calculations, libvorbis aims to keep quantization noise
hidden beneath the masking threshold, delivering perceptually transparent
audio at comparatively low bitrates.