The Utrecht Agreement on Chord Evaluation

== Initial text by Matthias Mauch 08.20.2010 ==

During the 2010 ISMIR conference in Utrecht, an interest group of 10 people concerned with modelling of harmony and chords followed my invitation to discuss chord evaluation within MIREX and its implications for chord transcription research. The meeting was harmonious (no wonder!) and successful because everyone agreed that change is needed for better evaluation and comparison between methods.

The people present were:

* Mert Bay
* Juan Pablo Bello
* John Ashley Burgoyne
* Elaine Chew
* Andreas F Ehmann
* Maksim Khadkevich
* Matthias Mauch
* Matt McVicar
* Johan Pauwels
* Thomas Rocher

All agreed on the following points:

=== Evaluation ===

* The evaluation used in the MIREX task is a good start, but we need more advanced evaluation in order to
** provide a more musically precise and relevant evaluation
** encourage future research to produce more musically relevant transcriptions
* The metrics used should include
** segmentation
** more detailed frame-based evaluation (see the sketch after this list)
*** several levels of chord detail (e.g. root, triad, first extended note, second extended note)
*** bass note
* There should be one metric, a musically sensible mix of the above, which is "the" MIREX 2011 metric for ranking.
** this is also important outside MIREX, since a single agreed metric makes chord recognition papers easier to compare
** of course, MIREX should also offer all metrics separately for detailed inspection
* The new metric should be introduced from next year, even if it leads to a sudden drop in MIREX scores.
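
As a concrete illustration of what frame-based evaluation at several levels of chord detail could look like, here is a minimal Python sketch. The (root, triad, bass) chord representation, the function name, and the level definitions are assumptions for illustration only, not an agreed MIREX format; segmentation measures would be defined separately.

<pre>
# Minimal sketch of frame-based chord evaluation at several levels of
# detail (root / triad / bass). The (root, triad, bass) representation
# and the level names are illustrative assumptions only.

def frame_accuracy(reference, estimate, level="triad"):
    """Fraction of frames on which reference and estimate agree.

    Both arguments are equal-length lists with one chord per analysis
    frame; each chord is a (root, triad, bass) tuple with root and bass
    as pitch classes (0 = C, ..., 11 = B) and triad as a type string.
    """
    assert len(reference) == len(estimate)
    matches = 0
    for ref, est in zip(reference, estimate):
        if level == "root":
            matches += ref[0] == est[0]    # root pitch class only
        elif level == "triad":
            matches += ref[:2] == est[:2]  # root and triad type
        elif level == "bass":
            matches += ref == est          # root, triad and bass note
    return matches / len(reference)

# Three frames; the estimate gets only the last bass note wrong
# (F major over A instead of F major in root position).
ref = [(0, "maj", 0), (9, "min", 9), (5, "maj", 5)]
est = [(0, "maj", 0), (9, "min", 9), (5, "maj", 9)]
print(frame_accuracy(ref, est, "root"))   # 1.0
print(frame_accuracy(ref, est, "bass"))   # 0.666...
</pre>

A single ranking metric could then be a weighted mix of such scores (together with a segmentation score), with the weights still to be agreed.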

=== Ground Truth ===

* We applaud the efforts of Ashley Burgoyne et al. and Juan Pablo Bello et al. to produce new ground truth data.
* There is a need to define standard train-test sets, to make research results more easily comparable.

=== Thinking Ahead ===

* Matthias Mauch will start a conversation on the music-ir list and an accompanying MIREX Wiki page for further discussion on
** the mix of metrics for *the* MIREX 2011 metric
** its implementation details
** the definition and acquisition of ground truth train-test sets
* Ashley Burgoyne will provide a detailed instruction manual for chord transcription as used in their current project, so that other people interested will have a basis for future annotation work.

== Implementation of Evaluation Metrics ==

== Ground Truth Annotations ==

== General Discussion ==

So what do people think? Should we begin talking about evaluation methods? My personal preference would be something based on the Hamming distance between chords, although perhaps something more musically meaningful would be better. Perhaps even the [http://en.wikipedia.org/wiki/Jaccard_index Jaccard index], which measures the intersection of notes divided by the union of notes. This would need modifying to account for the bass/inversion; a rough sketch of such a measure follows below.
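
To make the proposal concrete, here is a small Python sketch of a Jaccard-style comparison between two chords represented as pitch-class sets, with one possible way of folding in the bass note. The function name, the bass_weight parameter, and the blending scheme are hypothetical assumptions, not an agreed definition.

<pre>
# Hypothetical sketch of the Jaccard-style chord comparison proposed
# above. Chords are pitch-class sets (0 = C, ..., 11 = B); the bass
# handling is one possible modification, not an agreed definition.

def jaccard_chord_score(ref_notes, est_notes,
                        ref_bass=None, est_bass=None, bass_weight=0.0):
    """Intersection over union of two chord note sets, optionally
    blended with a separate 0/1 score for the bass note."""
    ref, est = set(ref_notes), set(est_notes)
    note_score = len(ref & est) / len(ref | est)
    if bass_weight == 0.0 or ref_bass is None or est_bass is None:
        return note_score
    bass_score = float(ref_bass == est_bass)
    return (1.0 - bass_weight) * note_score + bass_weight * bass_score

# C major {C, E, G} vs. C major seventh {C, E, G, B}: overlap 3/4.
print(jaccard_chord_score({0, 4, 7}, {0, 4, 7, 11}))           # 0.75
# Identical notes but wrong inversion (C/E vs. C), bass weighted 25%.
print(jaccard_chord_score({0, 4, 7}, {0, 4, 7}, 4, 0, 0.25))   # 0.75
</pre>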

Any other suggestions?