2006:Audio Onset Detection



Proposer

Original proposers of the 2005 contest:

Paul Brossier (Queen Mary) <paul.brossier at elec dot qmul dot ac dot uk>

Pierre Leveau (Laboratoire d'Acoustique Musicale, GET-ENST (Télécom Paris)) <leveau at lam dot jussieu dot fr>

Title

Onset Detection Contest

Description

The onset detection contest is a continuation of the 2005 Onset Detection contest. The main motivation for repeating the evaluation is that in 2005 there was not enough time to run the algorithms with different parameters, so the initial goal of creating and comparing ROC curves could not be achieved. With the basic framework now established, this year's goal is to allow participants to submit their algorithms with a number of different parameter sets, so that the ROC curves of the algorithms can be computed and compared.

1) Input data

The input data are essentially the same as last year.

Audio format:

The data are monophonic sound files, with the associated onset times and data about the annotation robustness.

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono)
  • file length between 2 and 36 seconds (total time: 14 minutes)
  • file names: <AudioFileName>.wav (see the nomenclature below)

Audio content:

The dataset is subdivided into classes, because onset detection is sometimes performed in applications dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes). The performance of each algorithm will be assessed on the whole dataset and also on each class separately.

The dataset contains 85 files, divided into the following classes and annotated as follows:

  • 30 solo drum excerpts cross-annotated by 3 people
  • 30 solo monophonic pitched instruments excerpts cross-annotated by 3 people
  • 10 solo polyphonic pitched instruments excerpts cross-annotated by 3 people
  • 15 complex mixes cross-annotated by 5 people

Moreover, the monophonic pitched instruments class is divided into 6 sub-classes: brass (2 excerpts), winds (4), sustained strings (6), plucked strings (9), bars and bells (4), singing voice (5). The audio files follow the nomenclature <AudioFileName>.wav.

2) Output data

The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>.output.


Onset file Format

<onset time(in seconds)>\n

where \n denotes the end of line. The < and > characters are not included.
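
For illustration, a submission wrapper could write its detections in exactly this format; below is a minimal Python sketch, where detect_onsets is a hypothetical placeholder for the participant's own algorithm and not part of the contest framework.

import sys

def detect_onsets(wav_path):
    # Hypothetical placeholder for the participant's own detection algorithm;
    # it should return a list of onset times in seconds.
    raise NotImplementedError("plug in your onset detector here")

def write_onsets(onset_times, output_path):
    # One onset time in seconds per line, each line terminated by \n,
    # as specified by the onset file format above.
    with open(output_path, "w") as f:
        for t in onset_times:
            f.write("%.6f\n" % t)

if __name__ == "__main__":
    input_wav, output_txt = sys.argv[1], sys.argv[2]
    write_onsets(detect_onsets(input_wav), output_txt)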

3) README file

A README file accompanying each submission should contain explicit instructions on how to run the program. In particular, each command line to be run should be specified, using %input% for the input sound file and %output% for the resulting text file.

For instance, to test the program foobar with different values for parameters param1 and param2, the README file would look like:

foobar -param1 .1 -param2 1 -i %input% -o %output%
foobar -param1 .1 -param2 2 -i %input% -o %output%
foobar -param1 .2 -param2 1 -i %input% -o %output%
foobar -param1 .2 -param2 2 -i %input% -o %output%
foobar -param1 .3 -param2 1 -i %input% -o %output%
...

For a submission using MATLAB, the README file could look like:

matlab -r "foobar(.1,1,'%input%','%output%');quit;"
matlab -r "foobar(.1,2,'%input%','%output%');quit;"
matlab -r "foobar(.2,1,'%input%','%output%');quit;"
matlab -r "foobar(.2,2,'%input%','%output%');quit;"
matlab -r "foobar(.3,1,'%input%','%output%');quit;"
...

The command lines used to evaluate the performance of each parameter set over the whole database will be generated automatically from every line in the README file that contains both the '%input%' and '%output%' strings.
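
As an illustration of this substitution (a Python sketch only, not the actual MIREX/D2K tooling; the directory layout is assumed, and output files are named <AudioFileName>.output as described above):

import os

def expand_readme(readme_path, audio_dir, results_dir):
    # Collect every README line that contains both placeholders;
    # each such line corresponds to one parameter set.
    with open(readme_path) as f:
        templates = [line.strip() for line in f
                     if "%input%" in line and "%output%" in line]
    commands = []
    for wav in sorted(os.listdir(audio_dir)):
        if not wav.endswith(".wav"):
            continue
        out = os.path.join(results_dir, wav[:-4] + ".output")
        for t in templates:
            commands.append(t.replace("%input%", os.path.join(audio_dir, wav))
                             .replace("%output%", out))
    return commands

In practice the results directory would presumably differ for each parameter set, so that the outputs of different runs are kept apart.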

Participants

  • Axel Roebel (IRCAM), <roebel at ircam dot fr>
  • Alexandre Lacoste and Douglas Eck (University of Montreal), <lacostea at sympatico dot ca>, <eckdoug at iro dot umontreal dot ca>
  • Paul Brossier (Queen Mary, University of London), <paul.brossier at elec dot qmul dot ac dot uk>
  • Leslie Smith (Computing Science and Mathematics, University of Stirling, UK) <lss at cs dot stir dot ac dot uk>

Other Potential Participants

  • Kris West (University of East Anglia), <kw at cmp dot uea dot ac dot uk>
  • Nick Collins (University of Cambridge), <nc272 at cam dot ac dot uk>
  • Antonio Pertusa, José M. Iñesta (University of Alicante) and Anssi Klapuri (Tampere University of Technology), <pertusa at dlsi dot ua dot es>, <inesta at dlsi dot ua dot es>, <klap at cs dot tut dot fi>
  • Julien Ricard and Gilles Peterschmitt (no affiliation, algorithm previously developed at Pompeu Fabra University), <julien.ricard at gmail dot com>, <gpeter at iua dot upf dot es>
  • Balaji Thoshkahna (Indian Institute of Science,Bangalore), <balajitn at ee dot iisc dot ernet dot in>
  • Tristan Jehan (MIT Media Lab), <tristan at medialab dot mit dot edu>
  • Pierre Leveau and Laurent Daudet (LAM, France), <leveau at lam dot jussieu dot fr>, <daudet at lam dot jussieu dot fr>
  • Xavier Rodet and Geoffroy Peeters (IRCAM, France), <rod at ircam dot fr>, <peeters at ircam dot fr>

  • Koen Tanghe (Ghent University), <koen.tanghe at ugent dot be>
  • Yunfeng Du (Institute of Acoustics, CAS), <ydu at hccl dot ioa dot ac dot cn>

Evaluation Procedures

This text has been copied from the 2005 Onset detection page

The detected onset times will be compared with the ground-truth ones. For a given ground-truth onset time, if there is a detection within a tolerance time-window around it, it is counted as a correct detection (CD). If not, it is a false negative (FN). Detections outside all tolerance windows are counted as false positives (FP). Doubled onsets (two detections for one ground-truth onset) and merged onsets (one detection for two ground-truth onsets) will be taken into account in the evaluation: doubled onsets are a subset of the FP onsets, and merged onsets are a subset of the FN onsets.
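
A minimal sketch of this matching and counting step, assuming a symmetric tolerance window and greedy matching (the actual evaluation scripts may resolve ambiguous cases differently):

def match_onsets(ground_truth, detections, tolerance=0.05):
    # Count correct detections (CD), false positives (FP), false negatives (FN),
    # doubled and merged onsets for one annotation, with a +/- tolerance in seconds.
    detections = sorted(detections)
    matched = set()                      # indices of detections already used
    cd = fn = merged = 0
    for gt in sorted(ground_truth):
        hits = [i for i, d in enumerate(detections) if abs(d - gt) <= tolerance]
        free = [i for i in hits if i not in matched]
        if free:
            matched.add(free[0])
            cd += 1
        elif hits:                       # detection already used: merged onset (subset of FN)
            merged += 1
            fn += 1
        else:
            fn += 1
    fp = len(detections) - len(matched)
    # Doubled onsets: unmatched detections that still fall inside some
    # tolerance window (subset of FP).
    doubled = sum(1 for i, d in enumerate(detections)
                  if i not in matched
                  and any(abs(d - gt) <= tolerance for gt in ground_truth))
    return cd, fp, fn, doubled, merged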


We define:

Precision

P = Ocd / (Ocd + Ofp)


Recall

R = Ocd / (Ocd + Ofn)


and the F-measure:

F = 2*P*R/(P+R)


with these notations:

Ocd: number of correctly detected onsets (CD)

Ofn: number of missed onsets (FN)

Om: number of merged onsets

Ofp: number of false positive onsets (FP)

Od: number of double onsets


Other indicative measurements:

FP rate:

FP = 100 * Ofp / (Ocd + Ofp)

Doubled Onset rate in FP

D = 100 * Od / Ofp

Merged Onset rate in FN

M = 100 * Om / Ofn
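
For reference, the measures above can be computed from the counts Ocd, Ofp, Ofn, Od and Om as in the following sketch (an illustration only, not the official evaluation code; degenerate cases with empty denominators are mapped to zero):

def onset_metrics(cd, fp, fn, doubled, merged):
    # Precision, recall, F-measure and the indicative rates defined above.
    precision = cd / float(cd + fp) if (cd + fp) else 0.0
    recall = cd / float(cd + fn) if (cd + fn) else 0.0
    f = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    fp_rate = 100.0 * fp / (cd + fp) if (cd + fp) else 0.0    # FP rate
    doubled_rate = 100.0 * doubled / fp if fp else 0.0        # D: doubled onsets in FP
    merged_rate = 100.0 * merged / fn if fn else 0.0          # M: merged onsets in FN
    return precision, recall, f, fp_rate, doubled_rate, merged_rate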


Because files are cross-annotated, the mean Precision and Recall rates are defined by averaging Precision and Recall rates computed for each annotation.

To establish a ranking (and indicate a winner...), we will use the F-measure, widely used in string comparisons. This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of an error of each type (false positive or false negative) depends on the application following this task.


Evaluation measures:

  • percentage of correct detections / false positives (can also be expressed as precision/recall)
  • time precision (tolerance from +/- 50 ms down to smaller windows). For certain files, we cannot be much more accurate than 50 ms because of the limited annotation precision; this must be taken into account.
  • separate scoring for different instrument types (percussive, strings, winds, etc.)

More detailed data:

  • percentage of doubled detections
  • speed measurements of the algorithms
  • scalability to large files
  • robustness to noise, loudness

Relevant Test Collections

The test data are the same as those used for the 2005 contest. A description of the data is given below.

Audio data are commercial CD recordings, recordings made by the MTG at UPF Barcelona, and excerpts from the RWC database. Annotations were conducted by the Centre for Digital Music at Queen Mary, University of London (62% of annotations), the Musical Acoustics Lab at Paris 6 University (25%), the MTG at UPF Barcelona (11%) and the Analysis/Synthesis Group at IRCAM, Paris (2%). MATLAB annotation software by Pierre Leveau (http://www.lam.jussieu.fr/src/Membres/Leveau/SOL/SOL.htm) was used for this purpose. Annotators were given an approximate aim (catching all onsets corresponding to music notes, including pitched onsets and not only percussive ones), but no further supervision of the annotation was performed.

The defined ground truth can be critical for the evaluation, so precise instructions on which events to annotate must be given to the annotators. Some sounds are easy to annotate: isolated notes, percussive instruments, quantized music (techno). This also means that the annotations from several annotators are very close, because the visualizations (signal plot, spectrogram) are clear enough. Other sounds are almost impossible to annotate precisely: legato bowed string phrases, and even more so with added reverb. Slightly broken chords also introduce ambiguities about the number of onsets to mark. In these cases the annotations can be spread out, and the annotation precision must be taken into account in the evaluation.

Article about annotation by Pierre Leveau et al.: http://www.lam.jussieu.fr/src/Membres/Leveau/ressources/Leveau_ISMIR04.pdf


Comments of participants

Leslie Smith

I am interested in audio onset detection as a competition in 2006. Looking at the results from last year, it's clear that there's room for improvement...

Are there others interested in this as well?

Leslie Smith (see http://www.cs.stir.ac.uk/~lss/lsspapers.html for my papers, some of which are on onset detection, generally applied to speech...).

Axel Roebel

Evaluation: What is still needed here is a procedure to combine different results into ROC curves. Does such a procedure exist in the D2K framework, and if so, how can we specify which parameters should be used to produce the ROC curve?

Probably each algorithm that supports parameters should accept command-line arguments that can be used to control it. The remaining questions are how the set of parameters can be used within the D2K framework to generate the command lines, and how the set of results can be evaluated in D2K. Up to now, the OnsetDetection evaluation module only reads and compares a single parameter set.
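
Independently of D2K, one simple way to combine the results of a parameter sweep into an ROC-style curve is to compute one (recall, precision) point per parameter set and sort the points; the sketch below is an illustration only, not an existing module:

def pr_curve(results_per_parameter_set):
    # results_per_parameter_set: list of (cd, fp, fn) totals, one per parameter set.
    # Returns (recall, precision) points sorted by recall, suitable for plotting.
    points = []
    for cd, fp, fn in results_per_parameter_set:
        p = cd / float(cd + fp) if (cd + fp) else 0.0
        r = cd / float(cd + fn) if (cd + fn) else 0.0
        points.append((r, p))
    return sorted(points)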

Test set:

Note that, due to the small amount of available labeled data, no test set was published for the 2005 contest. This obviously limits the validity of comparisons between the algorithms. Still, rerunning updated versions of the algorithms is certainly valuable for participants. Moreover, the fact that this year's contest allows a larger set of parameters to be submitted reduces the problem of the missing test set: at least the concern that the single selected parameter set may have been tuned on a test collection with different labeling strategies becomes less important.