Listening to Sounds of Silence for Speech Denoising

Supplementary Material

Overview

This offline webpage organizes our denoising audio results. We first show comparison results on the AudioSet dataset, the DEMAND dataset, and real-world audio recordings. For the real-world recordings, we also include scenarios in which audiovisual denoising fails due to the lack of frontal faces in the video footage, as well as scenarios in which multiple people speak. Both are common in daily life. We then present denoising results on other languages, all produced by our model trained solely on English speech. Lastly, we show two clips from the song "The Sound of Silence", as an echo of our title, to further demonstrate our model's ability to reduce non-stationary noise such as music.

Please click each item below to see and hear individual audio results.

Synthetic data:  AVSPEECH + Audioset

Here we show denoising results on synthetic input signals. The input signals are generated using audio clips from AVSPEECH as foreground speech and clips from AudioSet as background noise. Under seven different input SNRs (from -10dB to 10dB), we compare the denoising results of our model with those of other methods.

Please note that the Ours-GTSI method uses ground-truth silent interval labels. It is by no means a practical approach; rather, it is meant to show the best possible (upper-bound) denoising quality when the silent intervals are perfectly predicted.
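For readers who wish to reproduce similar test mixtures, the following is a minimal sketch of how a foreground speech clip can be mixed with background noise at a target SNR. The function name `mix_at_snr` and the use of in-memory NumPy arrays are illustrative assumptions, not the paper's actual data pipeline.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix 1-D speech and noise arrays (same sample rate) at a target SNR in dB.

    This is an illustrative sketch: the noise is tiled or truncated to the
    speech length, then rescaled so 10*log10(P_speech / P_noise) == snr_db.
    """
    # Match lengths: repeat the noise if it is shorter than the speech.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(speech)]

    # Average powers of the two signals.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)

    # Scale the noise so the mixture attains the requested SNR.
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    noise_scaled = noise * np.sqrt(target_p_noise / p_noise)

    return speech + noise_scaled
```

Sweeping `snr_db` over {-10, ..., 10} dB yields a family of mixtures of increasing difficulty, analogous to the seven input-SNR conditions evaluated here.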

Synthetic data:  AVSPEECH + DEMAND

Here we provide comparison results similar to those in the previous section. Instead of drawing background noise from AudioSet, we use DEMAND, another dataset used in previous denoising work, as the noise source. All other settings are the same as in the previous section.

Real world data

Here we provide denoising results on real-world recordings, in comparison to other methods. These examples are video recordings of a single person facing the camera. Only the (mono-channel) audio signals from the recordings are provided to the denoising methods, except for VSE, which requires audiovisual input (i.e., the video footage is also provided).

Real world data:  Audiovisual failure cases

Here we provide additional comparison results on real-world examples. VSE (the audiovisual method) results are excluded since it cannot successfully extract mouth information from these examples.

Real world data:  multi-person

Here we provide additional comparison results on real-world examples in which more than one person is talking. The denoised result should preserve every speaker's voice.

Other languages

Here we provide our denoising results on audio examples in other languages. Our silent interval detection results are also shown.

"The Sound of Silence"

Lastly, listen to "The Sound of Silence" denoised by our method, which listens to the sounds of silence!