{"title": "Optimal spectral transportation with application to music transcription", "book": "Advances in Neural Information Processing Systems", "page_first": 703, "page_last": 711, "abstract": "Many spectral unmixing methods rely on the non-negative decomposition of spectral data onto a dictionary of spectral templates. In particular, state-of-the-art music transcription systems decompose the spectrogram of the input signal onto a dictionary of representative note spectra. The typical measures of fit used to quantify the adequacy of the decomposition compare the data and template entries frequency-wise. As such, small displacements of energy from a frequency bin to another as well as variations of timber can disproportionally harm the fit. We address these issues by means of optimal transportation and propose a new measure of fit that treats the frequency distributions of energy holistically as opposed to frequency-wise. Building on the harmonic nature of sound, the new measure is invariant to shifts of energy to harmonically-related frequencies, as well as to small and local displacements of energy. Equipped with this new measure of fit, the dictionary of note templates can be considerably simplified to a set of Dirac vectors located at the target fundamental frequencies (musical pitch values). 
This in turn gives rise to a very fast and simple decomposition algorithm that achieves state-of-the-art performance on real musical data.", "full_text": "Optimal spectral transportation with application to music transcription\n\nRémi Flamary\nUniversité Côte d'Azur, CNRS, OCA\nremi.flamary@unice.fr\n\nNicolas Courty\nUniversité de Bretagne Sud, CNRS, IRISA\ncourty@univ-ubs.fr\n\nCédric Févotte\nCNRS, IRIT, Toulouse\ncedric.fevotte@irit.fr\n\nValentin Emiya\nAix-Marseille Université, CNRS, LIF\nvalentin.emiya@lif.univ-mrs.fr\n\nAbstract\n\nMany spectral unmixing methods rely on the non-negative decomposition of spectral data onto a dictionary of spectral templates. In particular, state-of-the-art music transcription systems decompose the spectrogram of the input signal onto a dictionary of representative note spectra. The typical measures of fit used to quantify the adequacy of the decomposition compare the data and template entries frequency-wise. As such, small displacements of energy from one frequency bin to another, as well as variations of timbre, can disproportionately harm the fit. We address these issues by means of optimal transportation and propose a new measure of fit that treats the frequency distributions of energy holistically, as opposed to frequency-wise. Building on the harmonic nature of sound, the new measure is invariant to shifts of energy to harmonically-related frequencies, as well as to small and local displacements of energy. Equipped with this new measure of fit, the dictionary of note templates can be considerably simplified to a set of Dirac vectors located at the target fundamental frequencies (musical pitch values). 
This in turn gives rise to a very fast and simple decomposition algorithm that achieves state-of-the-art performance on real musical data.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n1 Context\n\nMany of today's spectral unmixing techniques rely on non-negative matrix decompositions. This concerns for example hyperspectral remote sensing (with applications in Earth observation, astronomy, chemistry, etc.) or audio signal processing. The spectral sample v_n (the spectrum of light observed at a given pixel n, or the audio spectrum in a given time frame n) is decomposed onto a dictionary W of elementary spectral templates, characteristic of pure materials or sound objects, such that v_n ≈ Wh_n. The composition of sample n can be inferred from the non-negative expansion coefficients h_n. This paradigm has led to state-of-the-art results for various tasks (recognition, classification, denoising, separation) in the aforementioned areas, and in particular in music transcription, the central application of this paper.\n\nIn state-of-the-art music transcription systems, the spectrogram V (with columns v_n) of a musical signal is decomposed onto a dictionary of pure notes (in so-called multi-pitch estimation) or chords. V typically consists of (power-)magnitude values of a regular short-time Fourier transform (Smaragdis and Brown, 2003). It may also consist of an audio-specific spectral transform such as the Mel-frequency transform, as in (Vincent et al., 2010), or the constant-Q transform, as in (Oudre et al., 2011). The success of the transcription system depends of course on the adequacy of the time-frequency transform and the dictionary to represent the data V. In particular, the matrix W must be able to accurately represent a diversity of real notes. 
It may be trained with individual notes using annotated data (Boulanger-Lewandowski et al., 2012), have a parametric form (Rigaud et al., 2013) or be learnt from the data itself using a harmonic subspace constraint (Vincent et al., 2010).\n\nOne important challenge of such methods lies in their ability to cope with the variability of real notes. A simplistic dictionary model will assume that one note characterised by fundamental frequency ν_0 (e.g., ν_0 = 440 Hz for note A4) is represented by a spectral template with non-zero coefficients placed at ν_0 and at its multiples (the harmonic frequencies). In reality, many instruments, such as the piano, produce musical notes with either slight frequency misalignments (so-called inharmonicities) with respect to the theoretical values of the fundamental and harmonic frequencies, or amplitude variations at the harmonic frequencies that depend on recording conditions or on the instrument played (variations of timbre). Handling these variabilities by enlarging the dictionary with more templates is typically unrealistic, and adaptive dictionaries have instead been considered in (Vincent et al., 2010; Rigaud et al., 2013). In these papers, the spectral shape of the columns of W is adjusted to the data at hand, using specific time-invariant semi-parametric models. However, the note realisations may vary in time, something which is not handled by these approaches. This work presents a new spectral unmixing method based on optimal transportation (OT) that is fully flexible and remedies these difficulties. Note that Typke et al. (2004) have previously applied OT to notated music (e.g., score sheets) for search-by-query in databases, while we address here music transcription from audio spectral data.\n\n2 A relevant baseline: PLCA\n\nBefore presenting our contributions, we start by introducing the PLCA method of Smaragdis et al. (2006), which is heavily used in audio signal processing. 
It is based on the Probabilistic Latent Semantic Analysis (PLSA) of Hofmann (2001) (used in text retrieval) and is a particular form of non-negative matrix factorisation (NMF). Simplifying a bit, in PLCA the columns of V are normalised to sum to one. Each vector v_n is then treated as a discrete probability distribution of “frequency quanta” and is approximated as V ≈ WH. The matrices W and H are of size M × K and K × N, respectively, and their columns are constrained to sum to one. As a result, the columns of the approximation V̂ = WH sum to one as well, and each distribution vector v_n is as such approximated by the counterpart distribution v̂_n in V̂. Under the assumption that W is known, the approximation is found by solving the optimisation problem defined by\n\nmin_{H≥0} D_KL(V|WH)  s.t.  ∀n, ‖h_n‖_1 = 1,   (1)\n\nwhere D_KL(v|v̂) = ∑_i v_i log(v_i/v̂_i) is the KL divergence between discrete distributions, and by extension D_KL(V|V̂) = ∑_n D_KL(v_n|v̂_n).\n\nAn important characteristic of the KL divergence is its separability with respect to the entries of its arguments. It operates a frequency-wise comparison in the sense that, at every frame n, the spectral coefficient v_in at frequency i is compared to its counterpart v̂_in, and the results of the comparisons are summed over i. In particular, a small displacement in the frequency support of one observation may disproportionately harm the divergence value. For example, if v_n is a pure note with fundamental frequency ν_0, a small inharmonicity that shifts energy from ν_0 to an adjacent frequency bin will unreasonably increase the divergence value when v_n is compared with a purely harmonic spectral template with fundamental frequency ν_0. As explained in Section 1, such local displacements of frequency energy are very common when dealing with real data. 
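To make the separability issue concrete, here is a minimal numerical sketch (illustrative code, not from the paper; the template, the bin indices and the smoothing constant `eps` are arbitrary choices): a Dirac spectrum shifted by a single bin yields a huge KL divergence against the unshifted template.

```python
import numpy as np

def kl_div(v, vhat, eps=1e-12):
    """Separable (frequency-wise) KL divergence between discrete distributions."""
    v = np.asarray(v, float)
    vhat = np.asarray(vhat, float)
    return float(np.sum(v * np.log((v + eps) / (vhat + eps))))

M = 50
template = np.zeros(M); template[10] = 1.0   # Dirac template at bin 10
shifted = np.zeros(M); shifted[11] = 1.0     # same energy, shifted by one bin

print(kl_div(template, template))  # 0.0: identical distributions
print(kl_div(shifted, template))   # huge: pure support mismatch dominates
```

The divergence explodes even though the two spectra are perceptually almost identical, which is exactly the failure mode discussed above.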
A measure of fit invariant to small perturbations of the frequency support would be desirable in such a setting, and this is precisely what OT can bring.\n\n3 Elements of optimal transportation\n\nGiven a discrete probability distribution v (a non-negative real-valued column vector of dimension M summing to one) and a target distribution v̂ (with the same properties), OT computes a transportation matrix T belonging to the set Θ ≝ {T ∈ R_+^{M×M} | ∀i, j = 1, ..., M, ∑_{j=1}^M t_ij = v_i, ∑_{i=1}^M t_ij = v̂_j}. T establishes a bi-partite graph connecting the two distributions. In simple words, an amount (or, in typical OT parlance, a “mass”) of every coefficient of vector v is transported to an entry of v̂. The sum of the amounts transported to the jth entry of v̂ must equal v̂_j. The value of t_ij is the amount transported from the ith entry of v to the jth entry of v̂. In our particular setting, the vector v is a distribution of spectral energies v_1, ..., v_M at sampling frequencies f_1, ..., f_M.\n\nWithout additional constraints, the problem of finding a non-negative matrix T ∈ Θ has an infinite number of solutions. As such, OT takes into account the cost of transporting an amount from the ith entry of v to the jth entry of v̂, denoted c_ij (a non-negative real-valued number). Endowed with this cost function, OT involves solving the optimisation problem defined by\n\nmin_T J(T|v, v̂, C) = ∑_ij c_ij t_ij  s.t.  T ∈ Θ,   (2)\n\nwhere C is the non-negative square matrix of size M with elements c_ij. Eq. (2) defines a convex linear program. The value of the function J(T|v, v̂, C) at its minimum is denoted D_C(v|v̂). 
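Problem (2) is a small linear program, so D_C(v|v̂) can be sketched with a generic LP solver (an illustrative implementation assuming `scipy` is available; this is not the authors' code, and a dedicated OT solver would be much faster than a general-purpose LP):

```python
import numpy as np
from scipy.optimize import linprog

def ot_divergence(v, vhat, C):
    """Solve problem (2): min <T, C> s.t. T 1 = v, T^T 1 = vhat, T >= 0.
    Returns the optimal value D_C(v | vhat)."""
    M = len(v)
    # Equality constraints on the vectorised (row-major) transport plan t = T.ravel()
    A_eq = np.zeros((2 * M, M * M))
    for i in range(M):
        A_eq[i, i * M:(i + 1) * M] = 1.0   # row sums of T equal v
        A_eq[M + i, i::M] = 1.0            # column sums of T equal vhat
    b_eq = np.concatenate([v, vhat])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

f = np.arange(4, dtype=float)              # toy frequency grid
C2 = (f[:, None] - f[None, :]) ** 2        # quadratic transport cost
v = np.array([1.0, 0.0, 0.0, 0.0])
vhat = np.array([0.0, 1.0, 0.0, 0.0])
print(ot_divergence(v, vhat, C2))          # mass moves one bin: cost 1.0
```

Note how the one-bin displacement now costs only (f_1 − f_0)², in contrast with the KL behaviour of Section 2.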
When C is a symmetric matrix such that c_ij = ‖f_i − f_j‖_p^p, where we recall that f_i and f_j are the frequencies in Hertz indexed by i and j, D_C(v|v̂) defines a metric (i.e., a symmetric divergence that satisfies the triangle inequality) coined Wasserstein distance or earth mover's distance (Rubner et al., 1998; Villani, 2009). In other cases, in particular when the matrix C is not even symmetric, as in the next section, D_C(v|v̂) is not a metric in general, but is still a valid measure of fit. For generality, we will refer to it as the “OT divergence”.\n\nBy construction, the OT divergence can explicitly embed a form of invariance to displacements of support, as defined by the transportation cost matrix C. For example, in the spectral decomposition setting, the matrix with entries of the form c_ij = (f_i − f_j)² will increasingly penalise frequency displacements as the distance between frequency bins increases. This precisely remedies the limitation of the separable KL divergence presented in Section 2. As such, the next section addresses variants of spectral unmixing based on the Wasserstein distance.\n\n4 Optimal spectral transportation (OST)\n\nUnmixing with OT. In light of the above discussion, a direct solution to the sensitivity of PLCA to small frequency displacements consists in replacing the KL divergence with the OT divergence. This amounts to solving the optimisation problem given by\n\nmin_{H≥0} D_C(V|WH)  s.t.  ∀n, ‖h_n‖_1 = 1,   (3)\n\nwhere D_C(V|V̂) = ∑_n D_C(v_n|v̂_n), W is fixed and populated with pure note spectra, and C penalises large displacements of frequency support. This approach is a particular case of NMF with the Wasserstein distance, which has been considered in a face recognition setting by Sandler and Lindenbaum (2011), with subsequent developments by Zen et al. (2014) and Rolet et al. 
(2016). This approach is relevant to our spectral unmixing scenario but, as will be discussed in Section 5, is on the downside computationally intensive. It also requires the columns of W to be set to realistic note templates, which is still constraining. The next two sections describe a computationally friendlier approach which additionally removes the difficulty of choosing W appropriately.\n\nHarmonic-invariant transportation cost. In the approach above, the harmonic modelling is conveyed by the dictionary W (consisting of comb-like pure note spectra) and the invariance to small frequency displacements is introduced via the matrix C. In this section we propose to model both harmonicity and local invariance through the transportation cost matrix C. Loosely speaking, we want to define a class of equivalence between musical spectra that takes into account their inherent harmonic nature. As such, we essentially impose that a harmonic frequency (i.e., a close multiple of its fundamental) can be considered equivalent to its fundamental, the only target of multi-pitch estimation. We thus assume that a mass at one frequency can be transported to a divisor frequency with no cost. In other words, a mass at frequency f_i can be transported with no cost to f_i/2, f_i/3, f_i/4, and so on until sampling resolution. One possible cost matrix that embeds this property is\n\nc_ij = min_{q=1,...,q_max} (f_i − q f_j)² + ε δ_{q≠1},   (4)\n\nwhere q_max is the ceiling of f_i/f_j and ε is a small value. The term ε δ_{q≠1} favours the discrimination of octaves. Indeed, it penalises the transportation of a note of fundamental frequency 2ν_0 or ν_0/2 to the spectral template with fundamental frequency ν_0, which would be costless without this additive term. Let us denote by C_h the transportation cost matrix defined by Eq. (4). Fig. 
1 compares C_h to the more standard quadratic cost C_2 defined by c_ij = (f_i − f_j)². With the quadratic cost, only local displacements are permissible. In contrast, the harmonic-invariant cost additionally permits larger displacements to divisor frequencies, improving robustness to variations of timbre in addition to inharmonicities.\n\nFigure 1: Comparison of transportation cost matrices C_2 and C_h (full matrices and selected columns).\n\nFigure 2: Three example spectra v_n compared to a given template v̂ (left) and computed divergences (right). The template is a mere Dirac vector placed at a particular frequency ν_0. D_ℓ2 denotes the standard quadratic error ‖x − y‖_2². By construction of D_Ch, sample v_3, which is harmonically related to the template, returns a very good fit with the latter OT divergence. Note that it does not make sense to compare output values of different divergences; only the relative comparison of output values of the same divergence for different input samples is meaningful.\n\nMeasure of fit | D_ℓ2 | D_KL | D_C2 | D_Ch\nD(v_1|v̂) | 1.13 | 72.92 | 145.00 | 134.32\nD(v_2|v̂) | 1.13 | 5.42 | 10.00 | 10.00\nD(v_3|v̂) | 0.91 | 2.02 | 1042.67 | 1.00\n\nDictionary of Dirac vectors. Having designed an OT divergence that encodes inherent properties of musical signals, we still need to choose a dictionary W that will encode the fundamental frequencies of the notes to identify. Typically, these will consist of the physical frequencies of the 12 notes of the chromatic scale (from note A to note G, including half-tones), over several octaves. As mentioned in Section 1, one possible strategy is to populate W with spectral note templates. 
However, as also discussed, the performance of the resulting unmixing method will be capped by the representativeness of the chosen set of templates.\n\nA most welcome consequence of using the OT divergence built on the harmonic-invariant cost matrix C_h is that we may use for W a mere set of Dirac vectors placed at the fundamental frequencies ν_1, ..., ν_K of the notes to identify and separate. Indeed, under the proposed setting, a real note spectrum (composed of one fundamental and multiple harmonic frequencies) can be transported with no cost to its fundamental. Similarly, a spectral sample composed of several notes can be transported to a mixture of Dirac vectors placed at their fundamental frequencies. This simply eliminates the problem of choosing a representative dictionary! This very appealing property is illustrated in Fig. 2. Furthermore, the particularly simple structure of the dictionary leads to a very efficient unmixing algorithm, as explained in the next section. In the following, the unmixing method consisting of the combined use of the harmonic-invariant cost matrix C_h and of the dictionary of Dirac vectors will be coined “optimal spectral transportation” (OST).\n\nAt this level, we assume for simplicity that the set of K fundamental frequencies {ν_1, ..., ν_K} is contained in the set of sampled frequencies {f_1, ..., f_M}. This means that w_k (the kth column of W) is zero everywhere except at some entry i such that f_i = ν_k, where w_ik = 1. This is typically not the case in practice, where the sampled frequencies are fixed by the sampling rate, of the form f_i = 0.5(i/T)f_s, and where the fundamental frequencies ν_k are fixed by music theory. 
Our approach can actually deal with such a discrepancy, and this will be explained later in Section 5.\n\n5 Optimisation\n\nOT unmixing with linear programming. We start by describing optimisation for the state-of-the-art OT unmixing problem described by Eq. (3) and proposed by Sandler and Lindenbaum (2011). First, since the objective function is separable with respect to samples, the optimisation problem decouples with respect to the activation columns h_n. Dropping the sample index n and combining Eqs. (2) and (3), optimisation thus reduces to solving for every sample a problem of the form\n\nmin_{h≥0, T≥0} ⟨T, C⟩ = ∑_ij t_ij c_ij  s.t.  T1_M = v,  T⊤1_M = Wh,   (5)\n\nwhere 1_M is a vector of dimension M containing only ones and ⟨·,·⟩ is the Frobenius inner product. Vectorising the variables T and h into a single vector of dimension M² + K, problem (5) can be turned into a canonical linear program. Because of the large dimension of the variable (typically in the order of 10⁵), resolution can however be very demanding, as will be shown in experiments.\n\nOptimisation for OST. We now assume that W is a set of Dirac vectors, as explained at the end of Section 4. We also assume that K < M, which is the usual scenario. Indeed, K is typically in the order of a few tens, while M is in the order of a few hundreds. In such a setting, v̂ = Wh contains by design at most K non-zero coefficients, located at the entries such that f_i = ν_k. We denote this set of frequency indices by S. Hence, for j ∉ S, we have v̂_j = 0 and thus ∑_i t_ij = 0, by the second constraint of Eq. (5). Additionally, by the non-negativity of T, this also implies that T has only K non-zero columns, indexed by j ∈ S. Denoting by T̃ this subset of columns, and by C̃ the corresponding subset of columns of C, problem (5) reduces to\n\nmin_{h≥0, T̃≥0} ⟨T̃, C̃⟩  s.t.  T̃1_K = v,  T̃⊤1_M = h.   (6)\n\nThis is an optimisation problem of significantly reduced dimension (M + 1)K. Even more appealing, the problem has a simple closed-form solution. Indeed, the variable h has a virtual role in problem (6). It only appears in the second constraint, which de facto becomes a free constraint. Thus problem (6) can be solved with respect to T̃ regardless of h, and h is then simply obtained by summing the columns of T̃⊤ at the solution. Now, the problem\n\nmin_{T̃≥0} ⟨T̃, C̃⟩  s.t.  T̃1_K = v   (7)\n\ndecouples with respect to the rows t̃_i of T̃ and becomes, ∀i = 1, ..., M,\n\nmin_{t̃_i≥0} ∑_k t̃_ik c̃_ik  s.t.  ∑_k t̃_ik = v_i.   (8)\n\nThe solution is simply given by t̃_ik⋆_i = v_i for k⋆_i = arg min_k {c̃_ik}, and t̃_ik = 0 for k ≠ k⋆_i. Introducing the labelling matrix L, which is everywhere zero except for the indices (i, k⋆_i) where it is equal to 1, the solution to OST is trivially given by ĥ = L⊤v. Thus, under the specific assumption that W is a set of Dirac vectors, the challenging problem (5) has been reduced to an effortless assignment problem to solve for T and a simple sum to solve for h. Note that the algorithm is independent of the particular structure of C. 
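The whole pipeline, i.e., the harmonic cost of Eq. (4) with Dirac targets together with the closed-form assignment ĥ = L⊤v, can be sketched in a few lines (illustrative code; the frequency grid, the fundamentals `nu` and the value of ε are toy choices, not the paper's experimental settings):

```python
import numpy as np

def harmonic_cost(f, nu, eps=1e-2):
    """Cost matrix with entries c_ik = min_q (f_i - q*nu_k)^2 + eps*[q != 1],
    in the spirit of Eq. (4), with targets taken directly at the fundamentals nu_k."""
    C = np.empty((len(f), len(nu)))
    for k, nuk in enumerate(nu):
        qmax = int(np.ceil(np.max(f) / nuk))
        costs = [(f - q * nuk) ** 2 + (eps if q != 1 else 0.0)
                 for q in range(1, qmax + 1)]
        C[:, k] = np.min(costs, axis=0)
    return C

def ost_unmix(v, C):
    """Closed-form OST: each v_i is transported entirely to its cheapest note,
    i.e. h = L^T v with L the 0/1 labelling matrix."""
    kstar = np.argmin(C, axis=1)        # cost-minimising note for every bin
    h = np.zeros(C.shape[1])
    np.add.at(h, kstar, v)              # h_k = sum of the v_i assigned to note k
    return h

f = np.linspace(0, 1000, 201)           # toy frequency grid (Hz)
nu = np.array([147.0, 220.0, 330.0])    # toy fundamentals (Hz)
C = harmonic_cost(f, nu)
v = np.zeros(len(f))
v[44] = 0.5; v[88] = 0.5                # energy at 220 Hz and at its harmonic 440 Hz
h = ost_unmix(v, C)
print(h)                                # all mass assigned to the 220 Hz note
```

The harmonic at 440 Hz is transported (at cost ε) to the 220 Hz fundamental, illustrating the invariance that C_h builds in.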
In the end, the complexity per frame of OST reduces to O(M), which starkly contrasts with the complexity of PLCA, in the order of O(KM) per iteration.\n\nIn Section 4, we assumed for simplicity that the set of fundamental frequencies {ν_k}_k was contained in the set of sampled frequencies {f_i}_i. As a matter of fact, this assumption can be trivially lifted in the proposed setting of OST. Indeed, we may construct the cost matrix C̃ (of dimensions M × K) by replacing the target frequencies f_j in Eq. (4) by the theoretical fundamental frequencies ν_k. Namely, we may simply set the coefficients of C̃ to be c̃_ik = min_q (f_i − qν_k)² + ε δ_{q≠1} in the implementation. Then, the matrix T̃ indicates how each sample v is transported to the Dirac vectors placed at the fundamental frequencies {ν_k}_k, without the need for the actual Dirac vectors themselves, which elegantly solves the frequency sampling problem.\n\nOST with entropic regularisation (OSTe). The procedure described above leads to a winner-takes-all transportation of all of v_i to its cost-minimum target entry k⋆_i. We found it useful in practice to relax this hard assignment and distribute energies more evenly by using the entropic regularisation of Cuturi (2013). It consists of penalising the fit ⟨T̃, C̃⟩ in Eq. (6) with an additional term Ω_e(T̃) = ∑_ik t̃_ik log(t̃_ik), weighted by the hyper-parameter λ_e. The negentropic term Ω_e(T̃) promotes the transportation of v_i to several entries, leading to a smoother estimate of T̃. 
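The effect of this relaxation can be sketched with the softmax reweighting it induces (a hedged illustration: the 2 × 2 cost matrix and the λ_e values are purely illustrative, and the closed form itself is the one established via the Bregman-projection argument in the paper's supplementary material):

```python
import numpy as np

def oste_unmix(v, C, lam_e):
    """Entropic-regularisation sketch: each v_i is spread over the K note targets
    with softmax weights proportional to exp(-c_ik / lam_e), instead of a hard argmin."""
    Z = np.exp(-(C - C.min(axis=1, keepdims=True)) / lam_e)  # numerically stabilised
    L = Z / Z.sum(axis=1, keepdims=True)                     # rows of L sum to one
    return L.T @ np.asarray(v, float)

# Toy cost: bin 0 slightly prefers note 0, bin 1 strongly prefers note 1.
C = np.array([[0.0, 1.0],
              [4.0, 0.0]])
v = np.array([0.6, 0.4])
print(oste_unmix(v, C, lam_e=1e-6))  # ~[0.6, 0.4]: recovers the hard OST assignment
print(oste_unmix(v, C, lam_e=1e6))   # ~[0.5, 0.5]: maximum-entropy estimate h_k = 1/K
```

The two limiting cases reproduce the behaviour stated below for λ_e = 0 and λ_e = ∞.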
As explained in the supplementary material, one can show that the negentropy-regularised problem is a Bregman projection (Benamou et al., 2015) and has again a closed-form solution ĥ = L_e⊤v, where L_e is the M × K matrix with coefficients l_ik = exp(−c̃_ik/λ_e) / ∑_p exp(−c̃_ip/λ_e). The limiting cases λ_e = 0 and λ_e = ∞ return the unregularised OST estimate and the maximum-entropy estimate h_k = 1/K, respectively. Because L_e becomes a full matrix, the complexity per frame of OSTe becomes O(KM).\n\nOST with group regularisation (OSTg). We have explained above that the transportation matrix T has a strong group structure, in the sense that it contains by construction M − K null columns, and that only the subset T̃ needs to be considered. Because a small number of the K possible notes will be played at every time frame, the matrix T̃ will additionally have a significant number of null columns. This heavily suggests using group-sparse regularisation in the estimation of T̃. As such, we also consider problem (6) penalised by the additional term Ω_g(T̃) = ∑_k √(‖t̃_k‖_1), which promotes group-sparsity at the column level (Huang et al., 2009). Unlike OST or OSTe, OSTg does not offer a closed-form solution. Following Courty et al. (2014), a majorisation-minimisation procedure based on the local linearisation of Ω_g(T̃) can be employed, and the details are given in the supplementary material. The resulting algorithm consists in iteratively applying unregularised OST, as of Eq. (6), with the iteration-dependent transportation cost matrix C̃^(iter) = C̃ + R̃^(iter), where R̃^(iter) is the M × K matrix with coefficients r̃^(iter)_ik = (1/2) ‖t̃^(iter)_k‖_1^(−1/2). Note that the proposed group-regularisation of T̃ corresponds to a sparse regularisation of h. This is because h_k = ‖t̃_k‖_1 and thus Ω_g(T̃) = ∑_k √(h_k). Finally, note that OSTe and OSTg can be implemented simultaneously, leading to OSTe+g, by considering the optimisation of the doubly-penalised objective function ⟨T̃, C̃⟩ + λ_e Ω_e(T̃) + λ_g Ω_g(T̃), addressed in the supplementary material.\n\n6 Experiments\n\nToy experiments with simulated data. In this section we illustrate the robustness, the flexibility and the efficiency of OST on two simulated examples. The top plots of Fig. 3 display a synthetic dictionary of 8 harmonic spectral templates, referred to as the “harmonic dictionary”. They have been generated as Gaussian kernels placed at a fundamental frequency and its multiples, using exponential dampening of the amplitudes. As everywhere in the paper, the spectral templates are normalised to sum to one. Note that the 8th template is the upper octave of the first one. We compare the unmixing performance of five methods in two different scenarios. The five methods are as follows. PLCA is the method described in Section 2, where the dictionary W is the harmonic dictionary. Convergence is stopped when the relative difference of the objective function between two iterations falls below 10⁻⁵ or the number of iterations (per frame) exceeds 1000. OTh is the unmixing method with the OT divergence, as in the first paragraph of Section 4, using the harmonic transportation cost matrix C_h and the harmonic dictionary. OST is like OTh, but using a dictionary of Dirac vectors (placed at the 8 fundamental frequencies characterising the harmonic dictionary). 
OSTe, OSTg and OSTe+g are the regularised variants of OST, described at the end of Section 5. The iterative procedure in the group-regularised variants is run for 10 iterations (per frame).\n\nIn the first experimental scenario, reported in Fig. 3 (a), the data sample is generated by mixing the 1st and 4th elements of the harmonic dictionary, but introducing a small shift of the true fundamental frequencies (with the shift being propagated to the harmonic frequencies). This mimics the effect of possible inharmonicities or of an ill-tuned instrument. The middle plot of Fig. 3 (a) displays the generated sample together with the “theoretical sample”, i.e., without the frequency shift. This shows how a slight shift of the fundamental frequencies can greatly impact the overall spectral distribution. The bottom plot displays the true activation vector and the estimates returned by the five methods. The table reports the value of the (arbitrary) error measure ‖ĥ − h_true‖_1 together with the run time (on an average desktop PC using a MATLAB implementation) for every method. The results show that the group-regularised variants of OST lead to the best performance with a very light computational burden, and without using the true harmonic dictionary.\n\n(a) Unmixing with shifted fundamental frequencies:\n\nMethod | PLCA | OTh | OST | OSTg | OSTe | OSTe+g\nℓ1 error | 0.340 | 0.900 | 0.534 | 0.015 | 0.660 | 0.021\nTime (s) | 0.057 | 6.541 | 0.006 | 0.013 | 0.007 | 0.007\n\n(b) Unmixing with wrong harmonic amplitudes:\n\nMethod | PLCA | OTh | OST | OSTg | OSTe | OSTe+g\nℓ1 error | 0.430 | 0.791 | 0.971 | 0.048 | 0.911 | 0.045\nTime (s) | 0.019 | 6.529 | 0.006 | 0.010 | 0.005 | 0.006\n\nFigure 3: Unmixing under model misspecification. See text for details.\n\nIn the second experimental scenario, reported in Fig. 
3 (b), the data sample is generated by mixing the 1st and 6th elements of the harmonic dictionary, with the right fundamental and harmonic frequencies, but where the spectral amplitudes at the latter do not follow the exponential dampening of the template dictionary (variation of timbre). Here again, the group-regularised variants of OST outperform the state-of-the-art approaches, both in accuracy and run time.\n\nTranscription of real musical data. We consider in this section the transcription of a selection of real piano recordings, obtained from the MAPS dataset (Emiya et al., 2010). The data comes with a ground-truth binary “piano-roll” which indicates the active notes at every time. The note fundamental frequencies are given in MIDI, a standard musical integer-valued frequency scale that matches the keys of a piano, with 12 half-tones (i.e., piano keys) per octave. The spectrogram of each recording is computed with a Hann window of size 93 ms and 50% overlap (f_s = 44.1 kHz). The columns (time frames) are then normalised to produce V. Each recording is decomposed with PLCA, OST and OSTe, with K = 60 notes (5 octaves). Half of the recording is used for validation of the hyper-parameters and the other half is used as test data. For PLCA, we validated 4 and 3 values of the width and amplitude dampening of the Gaussian kernels used to synthesise the dictionary. For OST, we set ε = qε_0 in Eq. (4), which was found to satisfactorily improve the discrimination of octaves increasingly with frequency, and validated 5 orders of magnitude of ε_0. For OSTe, we additionally validated 4 orders of magnitude of λ_e. Each of the three methods returns an estimate of H. The estimate is turned into a 0/1 piano-roll by only retaining the support of its P_n maximum entries at every frame n, where P_n is the ground-truth number of notes played in frame n. 
The estimated piano-roll is then numerically compared to its ground truth using the F-measure, a global recognition measure which accounts both for precision and recall and which is bounded between 0 (critically wrong) and 1 (perfect recognition). Our evaluation framework follows standard practice in music transcription evaluation; see for example (Daniel et al., 2008). As detailed in the supplementary material, it can be shown that OSTg and OSTe+g do not change the location of the maximum entries in the estimates of H returned by OST and OSTe, respectively, but only their amplitude. As such, they lead to the same F-measures as OST and OSTe, and we did not include them in the experiments of this section.\n\nWe first illustrate the complexity of real-data spectra in Fig. 4, where the amplitudes of the first six partials (the components corresponding to the harmonic frequencies) of a single piano note are represented along time. Depending on the partial order q, the amplitude evolves with asynchronous beats and with various slopes. This behaviour is characteristic of piano sounds, in which each note comes from the vibration of up to three coupled strings. As a consequence, the spectral envelope of such notes cannot be well modelled by a fixed amplitude pattern. Fig. 
4 shows that, thanks to its flexibility, OSTe can perfectly recover the true fundamental frequency (MIDI 50) while PLCA is prone to octave errors (confusions between MIDI 50 and MIDI 62).

Figure 4: First 6 partials and transcription of a single piano note (note D3, ν0 = 147 Hz, MIDI 50). (a) Thresholded OSTe transcription; (b) thresholded PLCA transcription.

Table 1: Recognition performance (F-measure values) and average computational unmixing times.

MAPS dataset file IDs    PLCA    PLCA+noise  OST     OST+noise  OSTe    OSTe+noise
chpn_op25_e4_ENSTDkAm    0.679   0.671       0.566   0.564      0.695   0.695
mond_2_SptkBGAm          0.616   0.713       0.470   0.534      0.610   0.607
mond_2_SptkBGCl          0.645   0.687       0.583   0.676      0.695   0.730
muss_1_ENSTDkAm          0.613   0.478       0.513   0.550      0.671   0.667
muss_2_AkPnCGdD          0.587   0.574       0.531   0.611      0.667   0.675
mz_311_1_ENSTDkCl        0.561   0.593       0.580   0.628      0.625   0.665
mz_311_1_StbgTGd2        0.663   0.617       0.701   0.718      0.747   0.747
Average                  0.624   0.619       0.563   0.612      0.673   0.684
Time (s)                 14.861  15.420      0.004   0.005      0.210   0.202

Then, Table 1 reports the F-measures returned by the three competing approaches on seven 15-s extracts of pieces from Chopin, Beethoven, Mussorgski and Mozart. For each of the three methods, we have also included a variant that incorporates a flat component in the dictionary that can account for noise or non-harmonic components. In PLCA, this simply amounts to adding a constant vector wf(K+1) = 1/M to W. In OST or OSTe, it amounts to adding a constant column to C̃, whose amplitude has also been validated over 3 orders of magnitude. OST performs comparably to, or slightly worse than, PLCA but with an impressive gain in computational time (∼3000× speedup).
Best overall performance is obtained with OSTe+noise, with an average ∼10% performance gain over PLCA and a ∼750× speedup. A Python implementation of OST and a real-time demonstrator are available at https://github.com/rflamary/OST.

7 Conclusions

In this paper we have introduced a new paradigm for spectral dictionary-based music transcription. As compared to state-of-the-art approaches, we have proposed a holistic measure of fit which is robust to local and harmonically-related displacements of frequency energies. It is based on a new form of transportation cost matrix that takes into account the inherent harmonic structure of musical signals. The proposed transportation cost matrix in turn allows the use of a very simple dictionary composed of Dirac vectors placed at the target fundamental frequencies, eliminating the problem of choosing a meaningful dictionary. Experimental results have shown the robustness and accuracy of the proposed approach, which strikingly does not come at the price of computational efficiency. Instead, the particular structure of the dictionary allows for a simple algorithm that is considerably faster than state-of-the-art NMF-like approaches. The proposed approach offers new foundations, with promising results and room for improvement. In particular, we believe exciting avenues of research concern the learning of Ch from examples and extensions to other areas, such as remote sensing, using application-specific forms of C.

Acknowledgments. This work is supported in part by the European Research Council (ERC) under the European Union's Horizon 2020 research & innovation programme (project FACTORY) and by the French ANR JCJC program MAD (ANR-14-CE27-0002). Many thanks to Antony Schutz for generating & providing some of the musical data.

References

J.-D. Benamou, G. Carlier, M. Cuturi, L.
Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.

N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Discriminative non-negative matrix factorization for multiple pitch estimation. In Proc. International Society for Music Information Retrieval Conference (ISMIR), 2012.

N. Courty, R. Flamary, and D. Tuia. Domain adaptation with regularized optimal transport. In Proc. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2014.

M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transportation. In Advances in Neural Information Processing Systems (NIPS), 2013.

A. Daniel, V. Emiya, and B. David. Perceptually-based evaluation of the errors usually made when automatically transcribing music. In Proc. International Society for Music Information Retrieval Conference (ISMIR), 2008.

V. Emiya, R. Badeau, and B. David. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Trans. Audio, Speech, and Language Processing, 18(6):1643–1654, 2010.

T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1):177–196, 2001.

J. Huang, S. Ma, H. Xie, and C.-H. Zhang. A group bridge approach for variable selection. Biometrika, 96(2):339–355, 2009.

L. Oudre, Y. Grenier, and C. Févotte. Chord recognition by fitting rescaled chroma vectors to chord templates. IEEE Trans. Audio, Speech and Language Processing, 19(7):2222–2233, 2011.

F. Rigaud, B. David, and L. Daudet. A parametric model and estimation techniques for the inharmonicity and tuning of the piano. The Journal of the Acoustical Society of America, 133(5):3107–3118, 2013.

A. Rolet, M. Cuturi, and G. Peyré.
Fast dictionary learning with a smoothed Wasserstein loss. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

Y. Rubner, C. Tomasi, and L. Guibas. A metric for distributions with applications to image databases. In Proc. International Conference on Computer Vision (ICCV), 1998.

R. Sandler and M. Lindenbaum. Nonnegative matrix factorization with earth mover's distance metric for image analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 33(8):1590–1602, 2011.

P. Smaragdis and J. C. Brown. Non-negative matrix factorization for polyphonic music transcription. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2003.

P. Smaragdis, B. Raj, and M. V. Shashanka. A probabilistic latent variable model for acoustic modeling. In Proc. NIPS Workshop on Advances in Models for Acoustic Processing, 2006.

R. Typke, R. C. Veltkamp, and F. Wiering. Searching notated polyphonic music using transportation distances. In Proc. ACM International Conference on Multimedia, 2004.

C. Villani. Optimal Transport: Old and New. Springer, 2009.

E. Vincent, N. Bertin, and R. Badeau. Adaptive harmonic spectral decomposition for multiple pitch estimation. IEEE Trans. Audio, Speech and Language Processing, 18:528–537, 2010.

G. Zen, E. Ricci, and N. Sebe. Simultaneous ground metric learning and matrix factorization with earth mover's distance. In Proc.
International Conference on Pattern Recognition (ICPR), 2014.