2007:Audio Onset Detection

From MIREX Wiki

Latest revision as of 10:19, 27 July 2007

Proposers

Originally proposed (2005) by Paul Brossier and Pierre Leveau [1]. Has run in 2005 and 2006.

Participants

  • Dan Stowell (Queen Mary): http://www.elec.qmul.ac.uk/department/staff/research/dans.htm
  • Alexandre Lacoste (Montréal): http://www-etud.iro.umontreal.ca/~lacostea/
  • Axel Roebel (Paris): http://recherche.ircam.fr/equipes/analyse-synthese/roebel
  • Ruohua Zhou (Queen Mary / EPFL)
  • Wanchi Lee (Los Angeles): http://viola.usc.edu/People/people.jsp?uid=wanchilee

Description

The text of this section is largely copied from the 2006 page (https://www.music-ir.org/mirex2006/index.php/Audio_Onset_Detection)

The onset detection contest is a continuation of the 2005/2006 Onset Detection contest.

Input data

Essentially the same as in 2005/2006.

Audio format

The data are monophonic sound files, with the associated onset times and data about the annotation robustness.

  • CD-quality (PCM, 16-bit, 44100 Hz)
  • single channel (mono)
  • file length between 2 and 36 seconds (total time: 14 minutes)
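
As a rough illustration, the stated format can be checked with Python's standard `wave` module. This is only a sketch: the function name `check_audio_format` and the constants are hypothetical and are not part of the evaluation framework.

```python
import wave

# Format stated above for the dataset: PCM, 16-bit, 44100 Hz, mono.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16 bits
EXPECTED_SAMPLE_RATE_HZ = 44100


def check_audio_format(path):
    """Return (problems, duration_seconds) for the WAV file at `path`.

    `problems` is an empty list when the file matches the stated format.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append("expected mono, got %d channels" % wav.getnchannels())
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append("expected 16-bit samples, got %d-bit" % (8 * wav.getsampwidth()))
        if wav.getframerate() != EXPECTED_SAMPLE_RATE_HZ:
            problems.append("expected 44100 Hz, got %d Hz" % wav.getframerate())
        duration = wav.getnframes() / wav.getframerate()
    return problems, duration
```

The duration is returned rather than validated, since the 2 to 36 second range above describes the dataset and is not stated as a hard requirement.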

Audio content

The dataset is subdivided into classes, because onset detection is sometimes performed in applications dedicated to a single type of signal (e.g. segmentation of a single track in a mix, drum transcription, segmentation of databases of complex mixes). The performance of each algorithm will be assessed on the whole dataset but also on each class separately.

The dataset contains 85 files from 4 classes annotated as follows:

  • 30 solo drum excerpts cross-annotated by 3 people
  • 30 solo monophonic pitched instruments excerpts cross-annotated by 3 people
  • 10 solo polyphonic pitched instruments excerpts cross-annotated by 3 people
  • 15 complex mixes cross-annotated by 5 people

Moreover, the monophonic pitched instruments class is divided into 6 sub-classes: brass (2 excerpts), winds (4), sustained strings (6), plucked strings (9), bars and bells (4), singing voice (5).

Nomenclature

<AudioFileName>.wav for the audio file

Output data

The onset detection algorithms will return onset times in a text file: <Results of evaluated Algo path>/<AudioFileName>.output.

Onset file Format

<onset time(in seconds)>\n

where \n denotes the end of line. The < and > characters are not included.
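
A minimal sketch of writing and reading a file in this format is given below. The helper names `write_onsets` and `read_onsets` are hypothetical, and the six-decimal precision is an assumption, since the format above does not fix the number of digits.

```python
def write_onsets(path, onset_times):
    """Write one onset time (in seconds) per line, in ascending order."""
    with open(path, "w", newline="\n") as f:
        for t in sorted(onset_times):
            # Six decimals is an assumed precision, not part of the specification.
            f.write("%.6f\n" % t)


def read_onsets(path):
    """Read onset times (in seconds) back from a file, skipping blank lines."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]
```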

README file

A README file accompanying each submission should contain explicit instructions on how to run the program. In particular, each command line to run should be specified, using %input% for the input sound file and %output% for the resulting text file.

For instance, to test the program foobar with different values for parameters param1 and param2, the README file would look like:

foobar -param1 .1 -param2 1 -i %input% -o %output%
foobar -param1 .1 -param2 2 -i %input% -o %output%
foobar -param1 .2 -param2 1 -i %input% -o %output%
foobar -param1 .2 -param2 2 -i %input% -o %output%
foobar -param1 .3 -param2 1 -i %input% -o %output%
...

For a submission using MATLAB, the README file could look like:

matlab -r "foobar(.1,1,'%input%','%output%');quit;"
matlab -r "foobar(.1,2,'%input%','%output%');quit;"
matlab -r "foobar(.2,1,'%input%','%output%');quit;" 
matlab -r "foobar(.2,2,'%input%','%output%');quit;"
matlab -r "foobar(.3,1,'%input%','%output%');quit;"
...

The different command lines to evaluate the performance of each parameter set over the whole database will be generated automatically from each line in the README file containing both '%input%' and '%output%' strings.
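
The expansion described above could look roughly like the following sketch. It is not the script used by the organisers: the function name `expand_readme` and the per-parameter-set directory names (`set0`, `set1`, ...) are assumptions made for illustration.

```python
import posixpath


def expand_readme(readme_text, audio_files, results_dir):
    """Build one command line per (parameter set, audio file) pair.

    Only README lines containing both %input% and %output% are treated
    as command templates; every other line is ignored.
    """
    templates = [
        line.strip()
        for line in readme_text.splitlines()
        if "%input%" in line and "%output%" in line
    ]
    commands = []
    for set_index, template in enumerate(templates):
        for audio in audio_files:
            # <AudioFileName>.wav -> <AudioFileName>.output
            stem = posixpath.splitext(posixpath.basename(audio))[0]
            # One directory per parameter set (hypothetical layout).
            output = posixpath.join(results_dir, "set%d" % set_index, stem + ".output")
            command = template.replace("%input%", audio).replace("%output%", output)
            commands.append(command)
    return commands
```

With a README holding two template lines and a dataset of two files, this yields four command lines, one per combination.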

Evaluation procedures

This text has been copied from the 2006 Onset detection page (https://www.music-ir.org/mirex2006/index.php/Audio_Onset_Detection)

The detected onset times will be compared with the ground-truth ones. For a given ground-truth onset time, if there is a detection in a tolerance time-window around it, it is considered as a correct detection (CD). If not, there is a false negative (FN). The detections outside all the tolerance windows are counted as false positives (FP). Doubled onsets (two detections for one ground-truth onset) and merged onsets (one detection for two ground-truth onsets) will be taken into account in the evaluation. Doubled onsets are a subset of the FP onsets, and merged onsets a subset of FN onsets.
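
The counting rules above can be sketched as follows. This is an illustration under stated assumptions, not the official evaluation code: it matches each ground-truth onset, in time order, to the nearest unused detection inside its window (a simple greedy strategy), and the function name `match_onsets` and the default tolerance of 0.05 s are chosen here for the example.

```python
def match_onsets(detected, ground_truth, tolerance=0.05):
    """Count CD, FN, FP, doubled and merged onsets for one annotation."""
    detected = sorted(detected)
    ground_truth = sorted(ground_truth)
    used = [False] * len(detected)          # detection already matched?
    matched = [False] * len(ground_truth)   # ground-truth onset detected?

    for gi, g in enumerate(ground_truth):
        best = None
        for di, d in enumerate(detected):
            if used[di] or abs(d - g) > tolerance:
                continue
            if best is None or abs(d - g) < abs(detected[best] - g):
                best = di
        if best is not None:
            used[best] = True
            matched[gi] = True

    cd = sum(matched)
    fn = len(ground_truth) - cd
    fp = len(detected) - cd
    # Doubled: an unmatched detection that still falls inside some window.
    doubled = sum(
        1 for di, d in enumerate(detected)
        if not used[di] and any(abs(d - g) <= tolerance for g in ground_truth)
    )
    # Merged: a missed onset whose window contains a detection that was
    # already assigned to another ground-truth onset.
    merged = sum(
        1 for gi, g in enumerate(ground_truth)
        if not matched[gi] and any(abs(d - g) <= tolerance for d in detected)
    )
    return {"cd": cd, "fn": fn, "fp": fp, "doubled": doubled, "merged": merged}
```

Greedy matching is not guaranteed to maximise the number of correct detections when windows overlap; an optimal assignment would need a proper bipartite matching step.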

We define:

Precision
P = Ocd / (Ocd + Ofp)
Recall
R = Ocd / (Ocd + Ofn)
and the F-measure
F = 2*P*R/(P+R)

with these notations:

Ocd
number of correctly detected onsets (CD)
Ofn
number of missed onsets (FN)
Om
number of merged onsets
Ofp
number of false positive onsets (FP)
Od
number of doubled onsets

Other indicative measurements:

FP rate
FP = 100 * Ofp / (Ocd + Ofp)
Doubled Onset rate in FP
D = 100 * Od / Ofp
Merged Onset rate in FN
M = 100 * Om / Ofn
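
The formulas above translate directly into code. In the sketch below, the function name `onset_scores` is hypothetical, and returning 0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def onset_scores(cd, fn, fp, doubled=0, merged=0):
    """Compute P, R, F and the indicative rates from the onset counts."""

    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(cd, cd + fp)
    recall = ratio(cd, cd + fn)
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {
        "P": precision,
        "R": recall,
        "F": f_measure,
        "FP": 100 * ratio(fp, cd + fp),   # FP rate, in percent
        "D": 100 * ratio(doubled, fp),    # doubled onset rate in FP, in percent
        "M": 100 * ratio(merged, fn),     # merged onset rate in FN, in percent
    }
```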

Because files are cross-annotated, the mean Precision and Recall rates are defined by averaging Precision and Recall rates computed for each annotation.
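
A sketch of this averaging for a single file is shown below. The names `count_correct` and `mean_precision_recall`, the simple two-pointer matching, and the 0.05 s default tolerance are assumptions for illustration; the text does not say how the F-measure is derived from the averaged rates, so it is left out.

```python
def count_correct(detected, truth, tolerance):
    """Count detections matched one-to-one with annotated onsets."""
    detected, truth = sorted(detected), sorted(truth)
    i = j = correct = 0
    while i < len(detected) and j < len(truth):
        if abs(detected[i] - truth[j]) <= tolerance:
            correct += 1
            i += 1
            j += 1
        elif detected[i] < truth[j]:
            i += 1   # detection too early for this onset: false positive
        else:
            j += 1   # onset has no detection in its window: false negative
    return correct


def mean_precision_recall(detected, annotations, tolerance=0.05):
    """Average precision and recall over several annotations of one file."""
    precisions, recalls = [], []
    for truth in annotations:
        correct = count_correct(detected, truth, tolerance)
        precisions.append(correct / len(detected) if detected else 0.0)
        recalls.append(correct / len(truth) if truth else 0.0)
    n = len(annotations)
    return sum(precisions) / n, sum(recalls) / n
```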

To establish a ranking (and indicate a winner...), we will use the F-measure, widely used in string comparisons. This criterion is arbitrary, but gives an indication of performance. It must be remembered that onset detection is a preprocessing step, so the real cost of an error of each type (false positive or false negative) depends on the application following this task.

Evaluation measures:

  • percentage of correct detections / false positives (can also be expressed as precision/recall)
  • time precision (tolerance from +/- 50 ms to less). For certain files, we can't be much more accurate than 50 ms because of the weak annotation precision. This must be taken into account.
  • separate scoring for different instrument types (percussive, strings, winds, etc)

More detailed data:

  • percentage of doubled detections
  • speed measurements of the algorithms
  • scalability to large files
  • robustness to noise, loudness

Comments from participants

Dataset(s)

I (Dan) am happy to use the dataset as used in 2005/2006 - any comments/agreement/disagreement re that? Those approaches that use machine learning should presumably be trained on other data.

Note: I found some problems with the dataset - a couple of the files are faulty (e.g. they're annotations of the wrong audio). At Queen Mary's we've been replacing those faulty files with new annotations, and we'd be happy to share the "fixed" dataset. I'd suggest that it's better to use accurate annotations, even though that sacrifices an element of comparability against the 05/06 results. (Still, it's only a small fraction of files that were at fault, so the results will be largely comparable.) --Danstowell 09:39, 23 February 2007 (CST)



Axel writing:

Hi, Dan. Have you not been part of the previous evaluations? I wonder because in the previous experiments nobody had access to the original data. I don't have much insight behind the scenes, but the idea was that nobody would ever have the possibility to train on the real data. (Does this exclude people from Queen Mary who took part in the construction of the data tests??)

Now, obviously we have a problem if some of the data is wrong. So I guess you should contact the people that actually run the tests to send your corrected data. As I understand they took only part of the available datasets, so they should check whether they use these wrong annotations.


From Andreas

All of the data did come from Queen Mary, and we are aware of the faulty ground truths. There were only a few that were mislabeled, and I don't think it should have a large impact, as all files were actually validated against 3-5 annotations apiece. Some groups having source data is a reality we do have to live with though, as we depend on contributions from the community, especially in tasks like this where annotation is so laborious. Melody extraction's dataset, for instance, came exclusively from one participant as well. So we have to allow a certain amount of trust and integrity in such things. I think the fact that we do the parameter sweeping will hopefully even things, and any possible advantages, out in the end.