Collaborative Audio Enhancement
— Crowdsource Your Recording


In the spirit of crowdsourcing, this project aims to improve the quality of recordings of an audio scene, e.g. a music concert, talk, or lecture, by separating out only the sources of interest from multiple low-quality user-created recordings. This can be seen as a challenging microphone array setting where the channels are not synchronized, are each degraded in their own way, and may have different sampling rates.


We achieve the separation with an extended probabilistic topic model that shares some topics (sources) across the recordings. In other words, we run the usual matrix factorization on each recording, but constrain some of the sources to be identical (with different global weights) across the simultaneous factorizations of all recordings.
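As a rough illustration of the shared-source idea (a minimal NumPy sketch, not the exact PLCS algorithm, which is based on probabilistic latent component analysis with priors), one could tie a subset of basis vectors across all recordings during joint NMF; the function name, component counts, and Euclidean updates below are our own illustrative choices:

```python
import numpy as np

def shared_basis_nmf(X_list, n_shared=10, n_private=5, n_iter=200, eps=1e-9):
    """Jointly factor several magnitude spectrograms, tying a subset of
    basis vectors (the shared source) across all recordings.

    X_list : list of nonnegative arrays, each of shape (n_freq, n_frames_r).
    """
    n_freq = X_list[0].shape[0]
    rng = np.random.default_rng(0)
    W_shared = rng.random((n_freq, n_shared))
    W_priv = [rng.random((n_freq, n_private)) for _ in X_list]
    H = [rng.random((n_shared + n_private, X.shape[1])) for X in X_list]

    for _ in range(n_iter):
        num_s = np.zeros_like(W_shared)
        den_s = np.zeros_like(W_shared)
        for r, X in enumerate(X_list):
            W = np.hstack([W_shared, W_priv[r]])   # full dictionary for recording r
            V = W @ H[r] + eps                     # current reconstruction
            # update activations (Euclidean multiplicative rule)
            H[r] *= (W.T @ X) / (W.T @ V + eps)
            V = W @ H[r] + eps
            # private bases model recording-specific interference/artifacts
            Hp = H[r][n_shared:]
            W_priv[r] *= (X @ Hp.T) / (V @ Hp.T + eps)
            # shared bases accumulate statistics over *all* recordings
            V = np.hstack([W_shared, W_priv[r]]) @ H[r] + eps
            Hs = H[r][:n_shared]
            num_s += X @ Hs.T
            den_s += V @ Hs.T
        W_shared *= num_s / (den_s + eps)
    return W_shared, W_priv, H
```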



On synthetic concert recordings, we obtained better separation than oracle matrix factorization with ideal bases pre-learned from the ground-truth clean recording. We plan to accelerate the algorithm so that it can cope with very large audio datasets.

Check out our award-winning paper about this project: "Collaborative Audio Enhancement Using Probabilistic Latent Component Sharing (ICASSP 2013)"
And some audio clips:
  • Input#1: low-pass filtered recording (8kHz) with a speech interference (wav)
  • Input#2: high-pass filtered recording (500Hz) with another speech interference (wav)
  • Input#3: low-pass filtered (11.5 kHz) and high-pass filtered (500 Hz) recording with clipping artifacts (wav)
  • Enhanced audio using PLCS plus both priors (wav)

※ This material is based upon work supported by the National Science Foundation under Grant: III: Small: MicSynth: Enhancing and Reconstructing Sound Scenes from Crowdsourced Recordings. Award #:1319708
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.




Irregular Matrix Factorization

Do you want to apply Nonnegative Matrix Factorization (NMF) to data that does not come as a matrix? Here is an efficient way to do the job. We sometimes observe irregular data structures, e.g. reassigned spectra or very sparse landmarks, that cannot be efficiently represented by ordinary grid-structured matrices. Still, we may want to decompose such observations to discover underlying patterns, just as we do with regular matrix factorization techniques.



The main idea is to represent the input in a sparse form, i.e. pairs of indices and values, and to reformulate the original NMF problem for this vectorized input. To expedite learning, we borrow the concept of non-parametric density estimation, so that each data point is influenced only by the density of its closest observations. We apply this to difficult music transcription tasks, such as a piece where a bass guitar and drums play at the same time. This case is particularly difficult because it demands very high resolution in both time and frequency, while the usual short-time Fourier transform can provide it only along one of the two axes. The proposed method yields a clean decomposition of this non-matrix form of data.
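As a simplified illustration of running NMF directly on (index, value) pairs (this sketch assumes integer coordinates and omits the density-smoothing step described above), the updates can be written so that the data is never densified:

```python
import numpy as np

def sparse_coo_nmf(rows, cols, vals, shape, n_comp=8, n_iter=200, eps=1e-9):
    """KL-divergence NMF on data given as (row, col, value) triplets.

    Only the observed entries enter the updates, which is handy when the
    "matrix" is really a cloud of sparse landmarks.
    rows, cols : integer index arrays; vals : nonnegative values.
    """
    n_rows, n_cols = shape
    rng = np.random.default_rng(0)
    W = rng.random((n_rows, n_comp))
    H = rng.random((n_comp, n_cols))

    for _ in range(n_iter):
        # reconstruction evaluated only at the observed coordinates
        pred = np.einsum('ik,ki->i', W[rows], H[:, cols]) + eps
        ratio = vals / pred
        # masked multiplicative update for W
        numW = np.zeros_like(W)
        denW = np.zeros_like(W)
        np.add.at(numW, rows, ratio[:, None] * H[:, cols].T)
        np.add.at(denW, rows, H[:, cols].T)
        W *= numW / (denW + eps)
        # recompute predictions with the new W, then update H
        pred = np.einsum('ik,ki->i', W[rows], H[:, cols]) + eps
        ratio = vals / pred
        numH = np.zeros((n_cols, n_comp))
        denH = np.zeros((n_cols, n_comp))
        np.add.at(numH, cols, ratio[:, None] * W[rows])
        np.add.at(denH, cols, W[rows])
        H *= numH.T / (denH.T + eps)
    return W, H
```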



See our paper, "Non-Negative Matrix Factorization for Irregularly-Spaced Transforms (WASPAA 2013)," for more detail.


Manifold Preserving Source Separation

Usual topic models or NMF-like factorizations result in a wrapper, i.e. a convex hull or cone, that compactly contains the input spectra. For separation, we assume that several such pre-defined wrappers are available, one per training source, and hope that they do not overlap; in practice they often do.



The wrapper is a lossy representation of the full training spectra that works like a dictionary of templates. What it sacrifices are the minute details of the data manifold, which are sometimes critical for recovering high-quality audio signals. We are working on probabilistic topic models with sparsity constraints that learn hyper topics, each of which represents only its local neighbors. These hyper topics yield a manifold-preserving quantization of the training signals instead of wrappers,



and guide the recovered source spectra to lie on the original data manifold.



See our paper for more detail: "Manifold Preserving Hierarchical Topic Models for Quantization and Approximation (ICML 2013)"


Inter-channel Phase Difference Modeling

One famous approach to blind multi-channel source separation is the Degenerate Unmixing Estimation Technique (DUET). This straightforward approach clusters Inter-channel Phase Difference (IPD) features extracted from a multi-channel recording. Each microphone in an array records a mixture of the sound sources present. Assuming there are no geometric ambiguities in the array set-up, each microphone sees a uniquely time-delayed and attenuated version of each source. We compute the STFT of each channel and then the IPD between corresponding Time-Frequency (TF) bins. Bins dominated by one of the sources will share that source's IPD signature. Here is an example of the magnitude spectrograms corresponding to two speech signals:
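For concreteness, a minimal sketch of IPD extraction from a two-channel mixture might look as follows (the function name, FFT size, and energy mask are our own illustrative choices, not part of DUET itself):

```python
import numpy as np
from scipy.signal import stft

def ipd_features(x1, x2, fs=16000, nfft=1024):
    """Extract Inter-channel Phase Difference (IPD) features from a
    two-channel mixture.

    Returns the frequency axis and an IPD matrix (n_freq x n_frames) with
    entries in [-pi, pi].  Bins dominated by a single source share that
    source's frequency-dependent IPD signature.
    """
    f, t, X1 = stft(x1, fs=fs, nperseg=nfft)
    _, _, X2 = stft(x2, fs=fs, nperseg=nfft)
    # phase difference per TF bin, wrapped to [-pi, pi]
    ipd = np.angle(X2 * np.conj(X1))
    # keep only bins with enough energy for the cue to be reliable
    energy = np.abs(X1) * np.abs(X2)
    mask = energy > (0.01 * energy.max())
    return f, np.where(mask, ipd, np.nan)
```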



And here are two IPD heat maps plotted on the same TF plane for a mixture of those two sources:



These figures correspond to a recording with closely-spaced and widely-spaced pairs of microphones. It's clear that there are two sources that can be clustered according to the IPD cues we extracted. Thanks to the delay property of the Fourier transform, we would expect each source's IPDs to vary linearly with frequency, hence the gradual change in color across frequency. However, phase is a wrapped quantity and the IPD lines undergo a modulo-2 pi operation when the microphones are more than 1 cm apart (at a sampling rate of 16 kHz), as shown in the IPD map on the right. Further non-linearities are introduced when reverberation and channel mismatch are present (these lead to characteristic bends in the IPD function). Thus, to be robust to these real-world factors, we should generalize the basic DUET model. The IPD models we have explored include the Mean-Locked Mixture of Wrapped Gaussians (ML-MoWG) and the Wrapped Regression Spline (WRS).

The ML-MoWG is a probabilistic model for circular-linear data that has one linear component (i.e. frequency) and one circular component (i.e. IPD). It assumes that the observed data in each frequency is generated from a Mixture of Wrapped Gaussians (MoWG) and that the mean parameter for each source is a linear function of frequency. This captures the wrapped-line pattern of the data. Fitting this model amounts to an EM algorithm that alternates between estimating cluster assignments (E step) and updating the wrapped line slopes (M step). The following figure illustrates the learned model with several wrapped Gaussian distributions overlaid and IPD features colored according to source assignment probability:
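A simplified EM sketch in this spirit is shown below; it fixes the variance, assumes zero intercepts, and picks a single best phase wrap per point in the M step, so it is only an approximation of the ML-MoWG fitting procedure described in the paper:

```python
import numpy as np

def fit_wrapped_line_mixture(freqs, ipds, n_src=2, sigma=0.5,
                             n_iter=50, n_wraps=3):
    """EM sketch for a mixture of wrapped Gaussians whose means are linear
    in frequency (one delay slope per source).

    freqs : frequencies of the TF bins (1-D array).
    ipds  : observed IPDs in [-pi, pi] (same length).
    """
    rng = np.random.default_rng(0)
    slopes = rng.uniform(-np.pi, np.pi, n_src)          # initial delay slopes
    weights = np.full(n_src, 1.0 / n_src)
    wraps = 2 * np.pi * np.arange(-n_wraps, n_wraps + 1)

    for _ in range(n_iter):
        # E step: residuals over all candidate wraps, shape (n_pts, n_src, n_wraps)
        resid = (ipds[:, None, None] + wraps[None, None, :]
                 - slopes[None, :, None] * freqs[:, None, None])
        lik = np.exp(-0.5 * (resid / sigma) ** 2).sum(axis=2) * weights[None, :]
        resp = lik / (lik.sum(axis=1, keepdims=True) + 1e-12)

        # M step: unwrap each point toward its nearest wrap, then do a
        # responsibility-weighted least-squares fit of each slope
        best = np.abs(resid).argmin(axis=2)              # (n_pts, n_src)
        unwrapped = ipds[:, None] + wraps[best]
        num = (resp * freqs[:, None] * unwrapped).sum(axis=0)
        den = (resp * freqs[:, None] ** 2).sum(axis=0) + 1e-12
        slopes = num / den
        weights = resp.mean(axis=0)
    return slopes, weights, resp
```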



The WRS goes further to model the data from each source as having been sampled from a circular-linear distribution whose mean is parameterized by a cubic regression spline. One additional twist is that the spline values are wrapped to the interval [-pi,pi]. Fitting a mixture of WRSs corresponds to an iterative EM algorithm. The following figure illustrates the difference between the wrapped line and spline models:



Current work involves extending these techniques further to incorporate a dictionary model of magnitude information in the TF plane. IPDs rely exclusively on phase information while, for example, NMF models deal exclusively with magnitude. Combining these into a joint framework has proven more powerful than either one in isolation.

See our papers for more detail:
"Blind Multichannel Source Separation by Circular-Linear Statistical Modeling of Phase Differences (ICASSP 2013)"
"Multichannel Source Separation and Tracking with RANSAC and Directional Statistics (TASLP 2014)"
"Robust Interchannel Phase Difference Modeling with Wrapped Regression Splines (SAM 2014)"

Phase and Level Difference Fusion

Inter-channel Phase Difference (IPD) and Inter-channel Level Difference (ILD) features have been used extensively as cues for clustering TF bins and performing blind multichannel audio separation. However, there is a dichotomy between these features: IPDs are more useful when derived from closely-spaced microphones, while ILDs work best with wide spacings. This is because IPDs wrap in the interval [-pi,pi] for large spacings, while the ILDs of two sources are difficult to distinguish for small spacings. Thus, we want an approach that can take advantage of both regimes. The following figures illustrate the two features for small (red) and large (blue) inter-microphone spacings for a 2-source mixture:



To exploit both regimes, we generalized a previously developed method that uses Random Sample Consensus (RANSAC) to quickly and robustly fit wrapped-line models to IPD data. RANSAC is a quick guess-and-check approach that has a high probability of finding an accurate model: data points are selected at random from the dataset, a candidate model (a wrapped line) is fit to each one, and the model that agrees with the most points in the rest of the dataset is selected as the solution. If multiple lines are to be fit, the inliers of the chosen line are removed and the process is repeated. This is especially useful when performing regression on a heavily contaminated dataset, since outliers are ignored with high probability. Extending this to include ILD features involves tuning the inlier thresholds of the RANSAC procedure for both feature types simultaneously. We found that this leads to significant improvements in robustness to microphone spacing, as measured with several separation-quality metrics:
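The guess-and-check idea can be sketched as follows for the IPD-only case (a simplified illustration with a one-point minimal sample and hand-picked thresholds, not the exact fusion procedure from the papers):

```python
import numpy as np

def wrap(x):
    """Wrap angles to the interval [-pi, pi)."""
    return (x + np.pi) % (2 * np.pi) - np.pi

def ransac_wrapped_lines(freqs, ipds, n_lines=2, n_trials=500,
                         thresh=0.3, seed=0):
    """RANSAC sketch for fitting wrapped lines, IPD ~ wrap(slope * freq),
    through the origin; one line per source, inliers removed between fits."""
    rng = np.random.default_rng(seed)
    remaining = np.ones(len(freqs), dtype=bool)
    slopes = []
    for _ in range(n_lines):
        idx = np.flatnonzero(remaining)
        best_slope, best_inliers = None, None
        for _ in range(n_trials):
            n = rng.choice(idx)                  # random minimal sample (1 point)
            if freqs[n] < 1e-6:
                continue
            # a single point yields several slope hypotheses, one per wrap
            for m in range(-2, 3):
                slope = (ipds[n] + 2 * np.pi * m) / freqs[n]
                resid = np.abs(wrap(ipds[idx] - slope * freqs[idx]))
                inliers = resid < thresh
                if best_inliers is None or inliers.sum() > best_inliers.sum():
                    best_slope, best_inliers = slope, inliers
        slopes.append(best_slope)
        remaining[idx[best_inliers]] = False     # peel off this source's bins
    return slopes
```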



See our papers for more detail:
"Blind Multichannel Source Separation by Circular-Linear Statistical Modeling of Phase Differences (ICASSP 2013)"
"Phase and Level Difference Fusion for Robust Multichannel Source Separation (ICASSP 2014)"

Tracking on Directional Manifolds

Bayesian filtering has been widely applied to solve the problem of tracking one or more targets in a dynamical system. One of the fundamental time-series models in this setting is the Linear Dynamical System (LDS), for which the optimal sequential inference algorithm is the Kalman Filter (KF). However, when tracking sound sources with a compact microphone array, the state space is a circle (for 2D tracking by azimuth angle alone) or a sphere (for 3D tracking).

We found that greater accuracy could be achieved by tracking the source orientations directly on these directional manifolds. To achieve this, we defined new dynamical systems that explicitly treat the state variables as directional. This involves replacing Gaussian distributions in the LDS with wrapped Gaussian (for 2D tracking) or von Mises-Fisher (for 3D tracking) distributions. Sequential inference can be performed efficiently with deterministic approximations.
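A heavily simplified, single-source sketch of the idea is a scalar Kalman filter whose innovation is wrapped onto the circle; note that the full Wrapped Kalman Filter instead propagates a wrapped Gaussian, i.e. a mixture over 2-pi shifts, so this is only an approximation for illustration:

```python
import numpy as np

def wrap(x):
    """Wrap an angle to [-pi, pi)."""
    return (x + np.pi) % (2 * np.pi) - np.pi

def wrapped_kalman_filter(measurements, q=0.01, r=0.1):
    """Single-source azimuth tracker: a scalar Kalman filter with a
    wrapped innovation, so that +pi and -pi are treated as neighbors.

    measurements : noisy azimuth observations in radians.
    q, r         : process and observation noise variances (assumed values).
    """
    theta, P = measurements[0], 1.0          # initial state and variance
    track = []
    for z in measurements:
        # predict: random-walk dynamics on the circle
        P = P + q
        # update: wrap the innovation before the standard Kalman correction
        innov = wrap(z - theta)
        K = P / (P + r)
        theta = wrap(theta + K * innov)
        P = (1 - K) * P
        track.append(theta)
    return np.array(track)
```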

We extended this approach to the multi-source case using Probabilistic Data Association (PDA) techniques. The following figures show results with the Factorial Wrapped Kalman Filter (FWKF) and the Factorial von Mises-Fisher Filter (FvMFF):



See our papers for more detail:
"A Wrapped Kalman Filter for Azimuthal Speaker Tracking (SPL 2013)"
"Multiple Speaker Tracking with the Factorial von Mises-Fisher Filter (MLSP 2014)"

Directional NMF

Two mathematical frameworks that have shown great promise for audio enhancement are beamforming and Nonnegative Matrix Factorization (NMF). Beamforming makes use of physics models for how sound waves propagate from sources to sensors while NMF attempts to represent observed nonnegative matrices as a sum of appropriately superimposed spectral templates. In the Directional NMF (DNMF) model, we leverage the strong localization cues developed in the beamforming literature to extract features for an NMF-like learning procedure that simultaneously localizes and separates sound sources in the vicinity of a microphone array.

In DNMF, a feature matrix (L) is decomposed into three terms that capture the localization (W), source presence (A), and TF mask information (H). The following figure demonstrates one result for a 3-source mixture:
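As a schematic of the three-term structure (using a diagonal source-presence term and generic Euclidean multiplicative updates; this is not the exact DNMF feature set or update rules from the paper), one might write:

```python
import numpy as np

def directional_nmf_sketch(L, n_src=3, n_iter=300, eps=1e-9):
    """Schematic three-factor decomposition L ~= W diag(a) H.

    L : nonnegative feature matrix (n_directions x n_tf_bins)
    W : localization patterns (n_directions x n_src)
    a : per-source presence weights (n_src,)
    H : TF masks / activations (n_src x n_tf_bins)
    """
    n_dir, n_tf = L.shape
    rng = np.random.default_rng(0)
    W = rng.random((n_dir, n_src))
    a = rng.random(n_src)
    H = rng.random((n_src, n_tf))

    for _ in range(n_iter):
        V = W @ (a[:, None] * H) + eps                      # reconstruction
        W *= (L @ (a[:, None] * H).T) / (V @ (a[:, None] * H).T + eps)
        V = W @ (a[:, None] * H) + eps
        H *= ((W * a).T @ L) / ((W * a).T @ V + eps)
        V = W @ (a[:, None] * H) + eps
        # presence weights gather agreement between each W column and H row
        a *= (np.einsum('ik,ij,kj->k', W, L, H)
              / (np.einsum('ik,ij,kj->k', W, V, H) + eps))
    return W, a, H
```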



See our paper for more detail:
"Directional NMF for Joint Source Localization and Separation (WASPAA 2015)"

Spectral Learning for Time Series Models

TBD

Smooth Source Separation

TBD

Speech Denoiser: An Ultimate Adaptation

TBD

(Deep) Neural Networks for Audio

TBD

Big Audio Data Analysis

TBD

Paris' previous projects