One Microphone Source Separation
Part of Advances in Neural Information Processing Systems 13 (NIPS 2000)
Sam Roweis
Abstract

Source separation, or computational auditory scene analysis, attempts to extract individual acoustic objects from input which contains a mixture of sounds from different sources, altered by the acoustic environment. Unmixing algorithms such as ICA and its extensions recover sources by reweighting multiple observation sequences, and thus cannot operate when only a single observation signal is available. I present a technique called refiltering which recovers sources by a nonstationary reweighting ("masking") of frequency sub-bands from a single recording, and argue for the application of statistical algorithms to learning this masking function. I present results of a simple factorial HMM system which learns on recordings of single speakers and can then separate mixtures using only one observation signal by computing the masking function and then refiltering.
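To make the refiltering operation concrete, here is a minimal sketch, not the paper's implementation: it uses a short-time Fourier transform as one possible sub-band decomposition, and it assumes the time-varying masking function (`mask` below, a hypothetical input) is already given. Learning that mask from data is exactly the problem the rest of the paper addresses.

```python
# Minimal refiltering sketch (illustrative only, not the paper's code).
# Assumes: `m` is a 1-D numpy array holding the single-microphone
# mixture, sampled at `fs`, and `mask` is a time-frequency array of
# weights in [0, 1] with the same shape as the STFT of `m`.
import numpy as np
from scipy.signal import stft, istft

def refilter(m, mask, fs=8000, nperseg=256):
    """Recover a source by nonstationary reweighting of sub-bands."""
    f, t, M = stft(m, fs=fs, nperseg=nperseg)  # sub-band decomposition
    S = mask * M                               # reweight each band over time
    _, s = istft(S, fs=fs, nperseg=nperseg)    # resynthesize a waveform
    return s

# Example with a crude stand-in for a learned mask: keep only
# time-frequency cells where the mixture energy is above its median.
# f, t, M = stft(m, fs=8000, nperseg=256)
# mask = (np.abs(M) > np.median(np.abs(M))).astype(float)
# s = refilter(m, mask)
```

Note that a mask fixed at all ones reduces refiltering to the identity; all of the separation power comes from how the mask reweights sub-bands over time.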
1 Learning from data in computational auditory scene analysis

Imagine listening to many pianos being played simultaneously. If each pianist were striking keys randomly it would be very difficult to tell which note came from which piano. But if each were playing a coherent song, separation would be much easier because of the structure of music. Now imagine teaching a computer to do the separation by showing it many musical scores as "training data". Typical auditory perceptual input contains a mixture of sounds from different sources, altered by the acoustic environment. Any biological or artificial hearing system must extract individual acoustic objects or streams in order to do successful localization, denoising and recognition. Bregman [1] called this process auditory scene analysis in analogy to vision. Source separation, or computational auditory scene analysis (CASA), is the practical realization of this problem via computer analysis of microphone recordings and is very similar to the musical task described above.

It has been investigated by research groups with different emphases. The CASA community has focused on both multiple and single microphone source separation problems under highly realistic acoustic conditions, but has used almost exclusively hand-designed systems which include substantial knowledge of the human auditory system and its psychophysical characteristics (e.g. [2,3]). Unfortunately, it is difficult to incorporate large amounts of detailed statistical knowledge about the problem into such an approach. On the other hand, machine learning researchers, especially those working on independent components analysis (ICA) and related algorithms, have focused on the case of multiple microphones in simplified mixing environments and have used powerful "blind" statistical techniques. These "unmixing" algorithms (even those which attempt to recover more sources than signals) cannot operate on single recordings. Furthermore, since they often depend only on the joint amplitude histogram of the observations, they can be very sensitive to the details of filtering and reverberation in the environment.

The goal of this paper is to bring together the robust representations of CASA and methods which learn from data to solve a restricted version of the source separation problem: isolating acoustic objects from only a single microphone recording.
2 Refiltering vs. unmixing

Unmixing algorithms reweight multiple simultaneous recordings $m_k(t)$ (generically called microphones) to form a new source object $s(t)$:

$$s(t) = \sum_k \alpha_k \, m_k(t)$$

where the weights $\alpha_k$ are constant over time.
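As a minimal sketch of this operation (illustrative only, with the weights assumed already known), unmixing is just a fixed linear combination of the microphone channels; ICA-style algorithms spend their effort estimating these weights blindly from the joint statistics of the observations.

```python
# Minimal unmixing sketch (illustrative): a source estimate is a fixed
# linear reweighting of K simultaneous microphone recordings. The
# weights `alpha` are assumed given here; in practice an ICA-style
# algorithm would estimate them from the data.
import numpy as np

def unmix(mics, alpha):
    """mics: (K, T) array of K recordings m_k(t); alpha: (K,) weights."""
    return alpha @ mics  # s(t) = sum_k alpha_k * m_k(t), constant in t

# Example with two hypothetical recordings m1, m2 of length T:
# s = unmix(np.stack([m1, m2]), np.array([0.7, -0.3]))
```

The contrast with the refiltering sketch above is that here the weights do not vary with time or frequency, which is why such algorithms need multiple simultaneous observation signals to work with.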