Hubbry Logo
search
logo

Computational auditory scene analysis

logo
Community Hub0 Subscribers
Write something...
Be the first to start a discussion here.
Be the first to start a discussion here.
See all
Computational auditory scene analysis

Computational auditory scene analysis (CASA) is the study of auditory scene analysis by computational means. In essence, CASA systems are "machine listening" systems that aim to separate mixtures of sound sources in the same way that human listeners do. CASA differs from the field of blind signal separation in that it is (at least to some extent) based on the mechanisms of the human auditory system, and thus uses no more than two microphone recordings of an acoustic environment. It is related to the cocktail party problem.

Since CASA serves to model functionality parts of the auditory system, it is necessary to view parts of the biological auditory system in terms of known physical models. Consisting of three areas, the outer, middle and inner ear, the auditory periphery acts as a complex transducer that converts sound vibrations into action potentials in the auditory nerve. The outer ear consists of the external ear, ear canal and the ear drum. The outer ear, like an acoustic funnel, helps locating the sound source. The ear canal acts as a resonant tube (like an organ pipe) to amplify frequencies between 2–5.5 kHz with a maximum amplification of about 11 dB occurring around 4 kHz. As the organ of hearing, the cochlea consists of two membranes, Reissner’s and the basilar membrane. The basilar membrane moves to audio stimuli through the specific stimulus frequency matches the resonant frequency of a particular region of the basilar membrane. The movement the basilar membrane displaces the inner hair cells in one direction, which encodes a half-wave rectified signal of action potentials in the spiral ganglion cells. The axons of these cells make up the auditory nerve, encoding the rectified stimulus. The auditory nerve responses select certain frequencies, similar to the basilar membrane. For lower frequencies, the fibers exhibit "phase locking". Neurons in higher auditory pathway centers are tuned to specific stimuli features, such as periodicity, sound intensity, amplitude and frequency modulation. There are also neuroanatomical associations of ASA through the posterior cortical areas, including the posterior superior temporal lobes and the posterior cingulate. Studies have found that impairments in ASA and segregation and grouping operations are affected in patients with Alzheimer's disease.

As the first stage of CASA processing, the cochleagram creates a time-frequency representation of the input signal. By mimicking the components of the outer and middle ear, the signal is broken up into different frequencies that are naturally selected by the cochlea and hair cells. Because of the frequency selectivity of the basilar membrane, a filter bank is used to model the membrane, with each filter associated with a specific point on the basilar membrane.

Since the hair cells produce spike patterns, each filter of the model should also produce a similar spike in the impulse response. The use of a gammatone filter provides an impulse response as the product of a gamma function and a tone. The output of the gammatone filter can be regarded as a measurement of the basilar membrane displacement. Most CASA systems represent the firing rate in the auditory nerve rather than a spike-based. To obtain this, the filter bank outputs are half-wave rectified followed by a square root. (Other models, such as automatic gain controllers have been implemented). The half-rectified wave is similar to the displacement model of the hair cells. Additional models of the hair cells include the Meddis hair cell model which pairs with the gammatone filter bank, by modeling the hair cell transduction. Based on the assumption that there are three reservoirs of transmitter substance within each hair cell, and the transmitters are released in proportion to the degree of displacement to the basilar membrane, the release is equated with the probability of a spike generated in the nerve fiber. This model replicates many of the nerve responses in the CASA systems such as rectification, compression, spontaneous firing, and adaptation.

Important model of pitch perception by unifying 2 schools of pitch theory:

The correlogram is generally computed in the time domain by autocorrelating the simulated auditory nerve firing activity to the output of each filter channel. By pooling the autocorrelation across frequency, the position of peaks in the summary correlogram corresponds to the perceived pitch.

Because the ears receive audio signals at different times, the sound source can be determined by using the delays retrieved from the two ears. By cross-correlating the delays from the left and right channels (of the model), the coincided peaks can be categorized as the same localized sound, despite their temporal location in the input signal. The use of interaural cross-correlation mechanism has been supported through physiological studies, paralleling the arrangement of neurons in the auditory midbrain.

To segregate the sound source, CASA systems mask the cochleagram. This mask, sometimes a Wiener filter, weighs the target source regions and suppresses the rest. The physiological motivation behind the mask results from the auditory perception where sound is rendered inaudible by a louder sound.

See all
User Avatar
No comments yet.