2009:Audio Chord Detection
== Description ==
The text of this section is copied from the 2008 page. This task was first run in 2008. Please add your comments and discussions for 2009.
  
 
For many applications in music information retrieval, extracting the harmonic structure is very desirable, for example for segmenting pieces into characteristic segments, for finding similar pieces, or for semantic analysis of music.
 
 
Regarding this, we suggest introducing the new evaluation task ''Audio Chord Detection''.
 
The deadline for this task is September 14th.
== Data ==
  
Christopher Harte's Beatles dataset was used for the evaluation last year; it consists of 12 Beatles albums [6]. An approach for text annotation of musical chords is presented in [6]. This year an additional dataset was donated by Matthias Mauch, consisting of 38 songs by Queen and Zweieck. The data will be provided as 44.1 kHz 16-bit mono WAV. The ground truth looks like this:

 41.2631021 44.2456460 B
 44.2456460 45.7201130 E
 45.7201130 47.2061900 E:7/3
 47.2061900 48.6922670 A
 48.6922670 50.1551240 A:min/b3
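For reference, annotation files in this "start end label" format are straightforward to parse. The following is a minimal sketch; the function name is illustrative and not part of any official MIREX tooling:

```python
def parse_annotation(text):
    """Parse ground-truth lines of the form 'start end label' into
    a list of (start_sec, end_sec, label) tuples."""
    segments = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        start, end, label = line.split()
        segments.append((float(start), float(end), label))
    return segments

# Example using the ground-truth excerpt shown above.
example = """41.2631021 44.2456460 B
44.2456460 45.7201130 E
45.7201130 47.2061900 E:7/3"""

segments = parse_annotation(example)
```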
== I/O Format ==
  
This year the I/O format needs to be changed to evaluate on all triads and quads. We are planning to use the format suggested by Christopher Harte [6].
The chord root is given as a natural (A|B|C|D|E|F|G) followed by optional sharp or flat modifiers (#|b). For the evaluation process we may assume enharmonic equivalence for chord roots. For a given chord type on root X, the chord labels can be given as a list of intervals or as a shorthand notation as shown in the following table:
  
{| border="1" cellpadding="5" cellspacing="0" align="center"
|-
! NAME !! INTERVALS !! SHORTHAND
|-
| '''Triads:''' || ||
|-
| major || X:(1,3,5) || X or X:maj
|-
| minor || X:(1,b3,5) || X:min
|-
| diminished || X:(1,b3,b5) || X:dim
|-
| augmented || X:(1,3,#5) || X:aug
|-
| suspended4 || X:(1,4,5) || X:sus4
|-
| possible 6th triad: || ||
|-
| suspended2 || X:(1,2,5) || X:sus2
|-
| '''Quads:''' || ||
|-
| major-major7 || X:(1,3,5,7) || X:maj7
|-
| major-minor7 || X:(1,3,5,b7) || X:7
|-
| major-add9 || X:(1,3,5,9) || X:maj(9)
|-
| major-major7-#5 || X:(1,3,#5,7) || X:aug(7)
|-
| minor-major7 || X:(1,b3,5,7) || X:min(7)
|-
| minor-minor7 || X:(1,b3,5,b7) || X:min7
|-
| minor-add9 || X:(1,b3,5,9) || X:min(9)
|-
| minor 7/b5 (ambiguous - could be either of the following) || ||
|-
| minor-major7-b5 || X:(1,b3,b5,7) || X:dim(7)
|-
| minor-minor7-b5 (a half-diminished 7th) || X:(1,b3,b5,b7) || X:hdim7
|-
| sus4-major7 || X:(1,4,5,7) || X:sus4(7)
|-
| sus4-minor7 || X:(1,4,5,b7) || X:sus4(b7)
|-
| omitted from list on wiki: || ||
|-
| diminished7 || X:(1,b3,b5,bb7) || X:dim7
|-
| No Chord || N ||
|}
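The table above maps directly onto a small lookup structure. The sketch below (illustrative names, not an official implementation) encodes a few of the shorthands as interval lists and converts a root plus shorthand into pitch classes, assuming enharmonic equivalence of roots as described above:

```python
# Semitone offsets for the interval symbols used in the table.
INTERVAL_SEMITONES = {'1': 0, '2': 2, 'b3': 3, '3': 4, '4': 5, 'b5': 6,
                      '5': 7, '#5': 8, 'bb7': 9, 'b7': 10, '7': 11, '9': 14}

# A subset of the shorthand-to-interval-list mapping from the table.
SHORTHAND_INTERVALS = {
    'maj':   ['1', '3', '5'],
    'min':   ['1', 'b3', '5'],
    'dim':   ['1', 'b3', 'b5'],
    'aug':   ['1', '3', '#5'],
    'sus4':  ['1', '4', '5'],
    'sus2':  ['1', '2', '5'],
    'maj7':  ['1', '3', '5', '7'],
    '7':     ['1', '3', '5', 'b7'],
    'min7':  ['1', 'b3', '5', 'b7'],
    'hdim7': ['1', 'b3', 'b5', 'b7'],
    'dim7':  ['1', 'b3', 'b5', 'bb7'],
}

ROOT_PC = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}

def pitch_classes(root, shorthand):
    """Ordered pitch-class list of a chord, e.g. ('C#', 'min') -> [1, 4, 8]."""
    pc = ROOT_PC[root[0]]
    for mod in root[1:]:           # apply optional '#'/'b' modifiers
        pc += 1 if mod == '#' else -1
    return [(pc + INTERVAL_SEMITONES[i]) % 12
            for i in SHORTHAND_INTERVALS[shorthand]]
```

Because the output is pitch classes modulo 12, enharmonically equivalent roots (e.g. C# and Db) yield identical lists, matching the evaluation assumption above.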
  
However, we still accept participants who would only like to be evaluated on major/minor chords and want to use last year's format: an integer chord id in the range 0-24, where values 0-11 denote C major, C# major, ..., B major; values 12-23 denote C minor, C# minor, ..., B minor; and 24 denotes silence or no-chord segments.
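A minimal sketch of that integer id convention (the function name is illustrative):

```python
# Roots in semitone order, so chord_id % 12 indexes the root.
ROOTS = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def chord_id_to_label(chord_id):
    """Map last year's 0-24 chord id to a label:
    0-11 -> major, 12-23 -> minor, 24 -> no chord ('N')."""
    if chord_id == 24:
        return 'N'
    root = ROOTS[chord_id % 12]
    return root if chord_id < 12 else root + ':min'
```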
  
== Submission Format ==
  
Submissions have to conform to the specified format below:
  
''extractFeaturesAndTrain  "/path/to/trainFileList.txt"  "/path/to/scratch/dir" ''
  
 +
where trainFileList.txt contains the path to each WAV file. The features extracted at this stage can be stored under "/path/to/scratch/dir". The ground truth files for the supervised learning will be in the same path, with a ".txt" extension appended: for example, for "/path/to/trainFile1.wav" there will be a corresponding ground truth file called "/path/to/trainFile1.wav.txt".
  
For testing:
  
''doChordID.sh "/path/to/testFileList.txt"  "/path/to/scratch/dir" "/path/to/results/dir" ''
If there is no training, you can ignore the second argument. In the results directory there should be one file for each test file, with the same name as the test file plus ".txt".

Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.
== Evaluation ==
Algorithms should output text files with a similar format to that used in the ground truth transcriptions. That is to say, they should be flat text files with chord segment labels and times arranged thus:
  
 start_time end_time chord_label
  
with elements separated by white space, times given in seconds, chord labels corresponding to the syntax described in [6], and one chord segment per line.
  
Please note that two things have changed in the syntax since it was originally described in [6]. The first change is that the root is no longer implied as a voiced element of a chord so a C major chord (notes C, E and G) should be written C:(1,3,5) instead of just C:(3,5) if using the interval list representation. As before, the labels C and C:maj are equivalent to C:(1,3,5). The second change is that the shorthand label "sus2" (intervals 1,2,5) has been added to the available shorthand list.--Chrish 17:05, 9 September 2009 (UTC)
  
=== Segmentation Score ===
  
The segmentation score will be calculated using the directional Hamming distance as described in [8]. An over-segmentation value (f) and an under-segmentation value (m) will be calculated, and the final segmentation score is taken from the worst case of the two, i.e.:
  
segmentation score = 1 - max(m,f)
  
m and f are not independent of each other so combining them this way ensures that a good score in one does not hide a bad score in the other. The combined segmentation score can take values between 0 and 1 with 0 being the worst and 1 being the best result.-- Chrish 17:05, 9 September 2009 (UTC)
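A minimal sketch of how this score could be computed, assuming each transcription is reduced to a list of (start, end) segment boundaries over the same time span. The f/m naming follows the convention of [8] (f for fragmentation/over-segmentation, m for under-segmentation); the function names are illustrative:

```python
def directional_hamming(seg_a, seg_b):
    """Sum over segments of seg_a of (duration minus the largest overlap
    with any single segment of seg_b). Segments are (start, end) pairs."""
    dist = 0.0
    for a0, a1 in seg_a:
        best = max(min(a1, b1) - max(a0, b0) for b0, b1 in seg_b)
        dist += (a1 - a0) - max(best, 0.0)
    return dist

def segmentation_score(ground_truth, estimate):
    """1 - max(m, f), with both distances normalised by total duration."""
    total = ground_truth[-1][1] - ground_truth[0][0]
    # f: how much the estimate fragments ground-truth segments.
    f = directional_hamming(ground_truth, estimate) / total
    # m: how much the estimate merges (under-segments) ground-truth segments.
    m = directional_hamming(estimate, ground_truth) / total
    return 1.0 - max(m, f)
```

Note that an estimate consisting of a single segment for the whole piece gets f = 0 (no fragmentation) but a large m, so taking the worst case penalises it, as intended.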
  
=== Frame-based recall ===
  
For recall evaluation, we may define a different chord dictionary for each level of evaluation (dyads, triads, tetrads etc). Each dictionary is a text file containing chord shorthands / interval lists of the chords that will be considered in that evaluation. The following dictionaries are proposed:
  
For dyad comparison of major/minor chords only:
  
N<br>
X:maj<br>
X:min<br>
  
For comparison of standard triad chords:
N<br>
X:maj<br>
X:min<br>
X:aug<br>
X:dim<br>
X:sus2<br>
X:sus4<br>
  
For comparison of tetrad (quad) chords:
  
N<br>
X:maj<br>
X:min<br>
X:aug<br>
X:dim<br>
X:sus2<br>
X:sus4<br>
X:maj7<br>
X:7<br>
X:maj(9)<br>
X:aug(7)<br>
X:min(7)<br>
X:min7<br>
X:min(9)<br>
X:dim(7)<br>
X:hdim7<br>
X:sus4(7)<br>
X:sus4(b7)<br>
X:dim7<br>
  
  
For each evaluation level, the ground truth annotation is compared against the dictionary. Any chord label not belonging to the current dictionary will be replaced with an "X" in a local copy of the annotation and will not be included in the recall calculation.
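The filtering step described above can be sketched as follows (illustrative names; bass notes and inversions are ignored for simplicity, and a rootless label is treated as plain major):

```python
def filter_annotation(segments, dictionary):
    """Replace labels not in the dictionary with 'X' in a local copy.
    segments: (start, end, label) tuples.
    dictionary: allowed entries such as 'N' or 'X:maj', where 'X:' stands
    for any root."""
    filtered = []
    for start, end, label in segments:
        if label == 'N':
            ok = 'N' in dictionary
        else:
            # 'C' -> shorthand 'maj'; 'A:min' -> shorthand 'min'.
            shorthand = label.split(':', 1)[1] if ':' in label else 'maj'
            ok = 'X:' + shorthand in dictionary
        filtered.append((start, end, label if ok else 'X'))
    return filtered

# With the dyad (major/minor only) dictionary, the E:aug segment
# becomes 'X' and is excluded from the recall calculation.
dyad_dict = {'N', 'X:maj', 'X:min'}
ann = [(0.0, 1.5, 'C'), (1.5, 3.0, 'A:min'), (3.0, 4.0, 'E:aug')]
filtered = filter_annotation(ann, dyad_dict)
```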
  
Note that the level of comparison in terms of intervals can be varied. For example, in a triad evaluation we can consider the first three component intervals in the chord, so that a major (1,3,5) and a major7 (1,3,5,7) will be considered the same chord. For a tetrad (quad) evaluation, we would consider the first four intervals, so major and major7 would then be considered different chords.

For the maj/min evaluation (using the first example dictionary), using an interval comparison of 2 (dyad) will compare only the first two intervals of each chord label. This would map augmented and diminished chords to major and minor respectively (and any other symbols that had a major 3rd or minor 3rd as their first interval). Using an interval comparison of 3 with the same dictionary would keep only those chords that have major and minor triads as their first three intervals, so augmented and diminished chords would be removed from the evaluation.

After the annotation has been "filtered" using a given dictionary, it can be compared against the machine-generated estimates output by the algorithm under test. The chord sequences described in the annotation and estimate text files are sampled at a given frame rate (in this case 10 ms per frame) to give two sequences of chord frames which may be compared directly with each other. For calculating a hit or a miss, the chord labels from the current frame in each sequence will be compared. Chord comparison is done by converting each chord label into an ordered list of pitch classes, then comparing the two lists element by element. If the lists match to the required number of intervals then a hit is recorded; otherwise the estimate is considered a miss. It should be noted that, by converting to pitch classes in the comparison, this evaluation ignores enharmonic pitch and interval spellings, so the following chords (a slightly silly example, just for illustration) will all evaluate as identical:

C:maj = Dbb:maj = C#:(b1,b3,#4)

Basic recall calculation algorithm:

1) filter annotated transcription using chord dictionary for a defined number of intervals

2) sample annotated transcription and machine-estimated transcription at 10 ms intervals to create a sequence of annotation frames and estimate frames

3) start at the first frame

4) get chord label for current annotation frame and estimate frame

5) check annotation label:<br>
IF symbol is 'X' (i.e. non-dictionary)<br>
THEN ignore frame (record number of ignored frames)<br>
ELSE compare annotated/estimated chords for the predefined number of intervals<br>
and increment hit count if the chords match<br>
ENDIF

6) increment frame count

7) go back to 4 until the final chord frame

--Chrish 17:05, 9 September 2009 (UTC)

== Potential Participants ==

* H. Papadopoulos (papadopo@ircam.fr)
* Jan Weil (weil@nue.tu-berlin.de), Jean-Louis Durrieu (durrieu@enst.fr)
* Markus Mehnert, Gabriel Gatzsche (markus.mehnert@tu-ilmenau.de, gze@idmt.fraunhofer.de)
* Yuki Uchiyama (uchiyama@hil.t.u-tokyo.ac.jp)
* Matti Ryynänen and Anssi Klapuri (Tampere University of Technology), matti.ryynanen <at> tut.fi, anssi.klapuri <at> tut.fi
* Xinglin Zhang and Colan Lash (University of Regina, zhang46x@uregina.ca, Lash111c@uregina.ca)
* Alexey Egorov (alexey@cbmsnetworks.com)
* Dan Ellis (dpwe@ee.columbia.edu)
* Maksim Khadkevich (khadkevich <_at_> fbk.eu)
* Juan P. Bello (jpbello@nyu.edu)
* Kyogu Lee (klee@gracenote.com)
* Johan Pauwels (Ghent University, Belgium) johan.pauwels<sp@m>elis.ugent.be

== Bibliography ==

1. Harte, C.A. and Sandler, M.B. (2005). '''Automatic chord identification using a quantised chromagram.''' Proceedings of the 118th Audio Engineering Society Convention.

2. Sailer, C. and Rosenbauer, K. (2006). '''A bottom-up approach to chord detection.''' Proceedings of the International Computer Music Conference 2006.

3. Shenoy, A. and Wang, Y. (2005). '''Key, chord, and rhythm tracking of popular music recordings.''' Computer Music Journal 29(3), 75-86.

4. Sheh, A. and Ellis, D.P.W. (2003). '''Chord segmentation and recognition using EM-trained hidden Markov models.''' Proceedings of the 4th International Conference on Music Information Retrieval.

5. Yoshioka, T. et al. (2004). '''Automatic chord transcription with concurrent recognition of chord symbols and boundaries.''' Proceedings of the 5th International Conference on Music Information Retrieval.

6. Harte, C., Sandler, M., Abdallah, S. and Gómez, E. (2005). '''Symbolic representation of musical chords: a proposed syntax for text annotations.''' Proceedings of the 6th International Conference on Music Information Retrieval.

7. Papadopoulos, H. and Peeters, G. (2007). '''Large-scale study of chord estimation algorithms based on chroma representation and HMM.''' Proceedings of the 5th International Conference on Content-Based Multimedia Indexing.

8. Abdallah, S., Noland, K., Sandler, M., Casey, M. and Rhodes, C. (2005). '''Theory and evaluation of a Bayesian music structure extractor.''' Proceedings of the 6th International Conference on Music Information Retrieval.
+
== Discussions ==
  
Points to discuss:
  
* The '''Precision''' measure was not used last year, and I believe it should not be, because (unlike in beat extraction) we can assume a contiguous sequence of chords, i.e. all time units should feature a chord label. --[[User:Matthiasmauch|Matthias]] 11:04, 27 June 2009 (UTC)
  
* I would like to disagree on Matthias' previous point: I think we cannot assume that there is a chord present in every frame, one can think for instance of a drum solo, an acapella break, ethnic music or simply the beginning and ending of a file. In melody extraction or beat detection, there also isn't a continuity assumption. I must say that at the moment, our system isn't able to generate a no-chord either, so it is not in my personal interest to add this to the evaluation, but I feel this should be part of a general chord extraction system. I've also learned from some premature experiments that with the current frame-based evaluation, it is actually not even beneficial to include such a no-chord generator, because of the inequality of prior chances between a chord and a no-chord (14% for a N.C. in our little dataset, I suspect it to be even less for the Beatles set). The consequence is that the chord/no-chord distinction must be very accurate in order to increase the performance. A related, minor topic is the naming of this task. Why isn't it "audio chord extraction" just like "melody extraction". For me "chord detection" is making the distinction between chords and no-chords and "chord extraction" is naming detected chords. Anyway, just nitpicking on that one. --Johan 15:43, 16 July 2009 (CET)
* I think we can assume a contiguous sequence of chords if we treat "no chord" as a chord. --[[User:Matthiasmauch|Matthias]] 16:01, 7 August 2009 (UTC)
  
* I believe we should move forward in two ways to get a more meaningful evaluation:
*# evaluate separate recall measures for several chord classes, my proposal is ''major'', ''minor'', ''diminished'', ''augmented'', ''dominant'' (meaning major chords with a minor seventh). A final recall score can then be calculated as a (weighted) average of the recall on different chords. --[[User:Matthiasmauch|Matthias]] 11:04, 27 June 2009 (UTC)
*# We use just triads ''major'', ''minor'', ''diminished'' & ''augmented'' which I think is a more sensible distinction. Once you start using quads, why limit yourself to ''dominant 7'' and not use ''minor 7'', ''major 7'', ''full diminished'', etc. So I'm more in favour of just triads (maybe add sus too) or more quads. -- Johan 15:48, 16 July 2009 (CET)
*# Segmentation should be considered. For example, a chord extraction algorithm that has reasonable recall may still be heavily fragmented thus producing an output difficult to read for humans. One measure to check for similarity in segmentation is ''directional Hamming distance'' (or ''divergence''). --[[User:Matthiasmauch|Matthias]] 11:04, 27 June 2009 (UTC)
*# Agree, while the frame-based evaluation is certainly easy, it is not the most musically sensible. An evaluation on note-level or chord-segment basis might be a little too complicated for now, but this is a start. -- Johan 15:51, 16 July 2009 (CET)
*# Do you think we could consider several evaluation cases for chord detection with various chord dictionaries? (For instance one with major/minor triads, one with major/minor/diminished/augmented/dominant etc.) so that each participant can choose a case that can be handled by his/her algorithm? --Helene (IRCAM)
*# to Helene: Several (maybe two) evaluation cases /could/ be good, but I think everyone's algorithm should be tested on every task, I think choosing the one you want would mean you can't compare it to other people's method.
*# to Johan: in your chord list, did you mean to give the same list as I did? Anyway, I like to add "dominant" because it is used often, and musically (Jazz harmony theory) there's a functional difference between "dominant" and "major" (not between "minor" and "minor 7", and not between "major" and "major 7" or "major 6"). --[[User:Matthiasmauch|Matthias]] 16:01, 7 August 2009 (UTC)
*# @Matthias: no I intended to list just the triads (the ''dominant'' shouldn't have been copy-pasted, corrected it now) -- Johan 10:49, 25 Aug 2009 (CET)
*# to Matthias: I think that using the label "dominant" as you suggest here is not the correct use of that musical term in this context. In music theory (both in classical and jazz harmony) it is true that a dominant-seventh chord is always the major triad + minor 7th chord shape i.e. (1,3,5,b7). However, a (1,3,5,b7) chord does not always function as a dominant. Whether such a chord can be labelled dominant or not is entirely based on the chord's position in a given chord sequence relative to other chords which define its function in the progression. It is precisely for that reason that I argue against the use of the term 'dominant' for context-free chord labelling in the ISMIR05 chord labels paper [6]. -- Chrish 17:00, 10 August 2009 (UTC)
*# It seems counter-intuitive to have an evaluation that includes some quads but not all. It is fair to evaluate across the triad shapes major, minor, augmented, diminished and suspended (although sus2 and sus4 are inversions of each other) as these are the naturally occurring triad shapes in western harmony. It would seem sensible to include all the other quads that are labelled "7th" chords if including the (1,3,5,b7) shape in an evaluation. Given this problem, one possible way to compare algorithms that detect different sets of chord labels would be to split the results between triad recognition and quad/quint etc recognition. All algorithms can be tested on the triad evaluation - any algorithm that can detect quads can be compared directly against an algorithm that deals only with triads simply by taking the equivalent first three intervals of each chord label in the transcription as a triad. Only those algorithms that can recognise quads need to be evaluated in results that include quad chords. To make this process easy, it might be sensible for the labels that chord recognition algorithms produce to be given in terms of the intervals in the chords themselves rather than a chord name - i.e. a C major could be "C:(1,3,5)" and C major seventh "C:(1,3,5,7)" - both would evaluate as C major in a triad evaluation but the second one could also be evaluated in a test that looked at quads as well. This would also take away problems with possible labelling ambiguities that chord name labels could introduce. A list of the triads and quads that are acceptable in each level of evaluation would need to be drawn up but there is a list something like that at the top of this page already. -- Chrish 17:00, 10 August 2009 (UTC)
*# I think keeping the MIREX2008 evaluation procedure would make sense. I like the idea of a "merged maj/min" score in order to evaluate the root precision, and then, maybe another score taking only into account the root and mode(maj/min). Then, we could progressively extend the chord dictionary, by adding triads and quads (as suggested by Chris and Johan) and calculating new scores based on these extension.-- Thomas 12:45, 12 August 2009 (UTC)
*# I like the idea of using a segmentation measure based on directional hamming distance - We must make sure that the measure captures both over-segmentation (fragmentation) and under-segmentation though. The fragmentation measure (1-f) based on the inverse directional hamming distance described by Abdallah et al [8] only measures fragmentation by measuring the distance d_MG between the estimated sequence (M) and the ground-truth annotation (G). They also propose an under-segmentation measure (1-m) which uses the forward directional hamming distance d_GM. Using the fragmentation measure alone would mean that a chord recognition algorithm that output one chord label for the entire piece would score 100% correct for fragmentation. Both of these measures give a value between 0 and 1 so we could combine them to give an overall "chord segmentation" measure: 1-max(f,m) (f and m are not independent so it's probably best to just use the worst one rather than combine them geometrically). I think this measure could complement the frame-based recall quite well. -- Chrish 10:00, 11 August 2009 (UTC)
*# anyone know how to make the LaTeX maths work on the wiki? -- Chrish 17:01, 10 August 2009 (UTC)
* Something to consider when broadening the scope of used chords is the inequality in prior chance of different chords (much like the problem with chords/no-chords I mentioned above). When looking for augmented and diminished triads in the Beatles set in addition to major and minor, I'm quite positive the (or at least my) overall performance will decrease. Some processing/selecting could level the priors, but just limiting the data set to the duration of the least frequent chord won't leave us with much data, I'm afraid. The thing is of course that the inequality is also there in reality, so I'm not really convinced myself that this should be done. Another option is not changing the data, but letting the evaluation take it into account. -- Johan 16:23, 16 July 2009 (CET)
  
* Should chord data be expressed in absolute (aka "F major-minor 7") or relative (aka "C: IV major-minor 7") terms?
Absolute is better for the moment - if an algorithm can generate relative information then it can be converted easily to absolute. Going the other way is not easy if the algorithm doesn't estimate the key before trying to recognise chords... -- Chrish 17:04, 9 September 2009 (UTC)
  
* Should different inversions of chords be considered in the evaluation process?
The evaluation procedure I've outlined above ignores inversion at the moment. -- Chrish 17:04, 9 September 2009 (UTC)
  
* What temporal resolution should be used for ground truth and results?
  
If all algorithms output a text file with chord start/end times and labels then the underlying resolutions of different systems can be different. In the evaluation stage I suggested sampling the transcriptions in 10ms frames because that's the limit of what we can perceive in terms of separate onsets anyway.-- Chrish 17:04, 9 September 2009 (UTC)
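Sampling a segment-based transcription into 10 ms frames, as suggested above, can be sketched like this (illustrative names, assuming a contiguous segment list; frame counts are computed by rounding boundary times to the frame grid to avoid floating-point drift):

```python
def to_frames(segments, hop=0.01):
    """Expand (start, end, label) segments into a list of frame labels,
    one label per hop-sized frame (10 ms by default)."""
    frames = []
    for start, end, label in segments:
        n = round(end / hop) - round(start / hop)  # frames in this segment
        frames.extend([label] * n)
    return frames

# 50 ms of no-chord followed by 50 ms of C major:
# 5 'N' frames then 5 'C' frames.
frames = to_frames([(0.0, 0.05, 'N'), (0.05, 0.1, 'C')])
```

Two transcriptions sampled this way can then be compared frame by frame regardless of the internal resolutions of the systems that produced them.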
  
* How should enharmonic and other confusions of chords be handled?
  
Although the chord symbol syntax allows enharmonic spellings to be differentiated, the evaluation process converts the symbols into lists of pitch classes. As long as the symbol is a valid one according to the syntax in [6] (with the two caveats I outlined above) then the system will have no trouble accepting any chord symbol.-- Chrish 17:04, 9 September 2009 (UTC)
  
Matti -- we use exactly the time-based precision metrics you outlined above (time during which the correct chord is on, including silence (24), divided by the song's total time). We have a free Winamp plugin that does chord detection and even mixes the detected triads into the original song in real time (good for ear testing).
* How will Ground Truth be determined? 
*# One thing we should consider doing is using the combined results to find possible errors or inconsistencies in the data. I have found that some exist in Harte's labels (this was, after all, extremely difficult). If there are sections that are systematically error-prone, then maybe it is best to re-evaluate those labels for future runs. Also, we need to start finding other databases. Since we are only using Beatles data each year, we have no measure of generalization. Maybe we should require that participation include labelling 2-3 songs? These could be distributed to a few people for verification as well. Just a thought. -- [[User:Jeremy|Jeremy]] 14:45, 25 August 2009 (UTC)
*# If anyone finds mistakes in the Beatles data please let me know about them so I can fix them before I release the new versions (this will be in time for ISMIR 2009) -- Chrish 17:04, 9 September 2009 (UTC)
  
'''Dan's comments''' (August 13, 2008)
* What degree of chordal/tonal complexity will the music contain?
  
I'm interested in separating the effects of algorithm/representation and training data.  I think it's interesting to see what systems can do "the best", but to do that fairly you need to evaluate on data which hasn't been used in developing the system, which excludes Chris Harte's data. But having that high-quality data available raises the possibility of defining a task using that data (e.g. train/test folds) and having everyone run on the same, open data.
This can be variable using different chord dictionaries.-- Chrish 17:04, 9 September 2009 (UTC)
  
I've put together a simple baseline system that uses 4 fold cross validation on the Harte data (i.e. 9 albums train/ 3 albums test in 4 cuts).  I've also generated beat-synchronous chroma features for the entire 180 track set (as well as 32 kbps MP3 audio).  I'm thinking about releasing this, but it seems too late to be useful.
* Will we include any atonal or polytonal music in the Ground Truth dataset?
  
I see two different paradigms: one is a system that takes training data, learns models for the labeled chords, then takes test data and labels it into the same classes. The second is a system that is submitted including pre-learned models and only processes the test data. I guess the second system will be easier for more people to provide, but it confounds the training data used with the algorithmic approach, which limits how much we learn from the evaluation.
Are there any recognisable chords in these types of music as such? -- Chrish 17:04, 9 September 2009 (UTC)
  
'''Mert's Comments''' (August 13, 2008)
* What is the maximal acceptable onset deviation between ground truth and result?
  
Hi Matti, the I/O format you suggested is fine; we can use it if there are no objections. We should take a vote on Dan's suggestions about training.
If using a frame based recall and a segmentation measure based on directional hamming distance, we do not need to specify an allowable onset deviation. -- Chrish 17:04, 9 September 2009 (UTC)
  
'''Matti's comments''' (August 14, 2008)
* What file format should be used for ground truth and output?
  
Hi all, I agree with Dan that N-fold training/testing will make the evaluation more informative. However, my submission has already been fixed in a C++ implementation (yes, the models have been trained with the Beatles data) and I'm not sure if I have enough time to prepare the training code as well. Two subtasks, pre-trained and N-fold testing, would sound good to me.
Flat text - effectively the same as the .lab files used for the Beatles transcriptions.-- Chrish 17:04, 9 September 2009 (UTC)
  
'''Xinglin Zhang's comments''' (August 14, 2008)
=== Comments by MB 08.12 ===
First I want to make clear that we are using [http://ismir2005.ismir.net/proceedings/1080.pdf Christopher Harte's Beatles dataset], which includes quad chords.
Last year the evaluations were based only on major, minor and no-chords. This year, if there are enough participants, we can extend the evaluations to the rest of the triads (diminished, augmented, suspended) and to quads.
Please vote:
  
I am trying a deterministic approach which does not need training. Training and learning should be a better way, because even if we know nothing about the data the algorithm can learn it after training (assuming the distribution of the features matches the actual distribution, if we use parametric learning methods). But as long as we know the details of the data (chord composition theory for this problem), we can set up some rules according to music theory and filter the features based on those rules. I have not finished this implementation yet. I am not sure whether I will be able to have it done before the 18th (is this the deadline?), not to mention writing the training code. Everyone seems to be hoping the arrangements fit his or her implementation well. Ha~ I am here with the two-subtasks suggestion (I made a wrong poll below^_^). BTW, is everyone with Matt's suggestion on the vocabulary that contains only major and minor triads?
<poll>
How would you like the evaluations to be performed?
Same as last year: evaluate on major, minor and no-chord
All triads: major, minor, diminished, augmented, suspended + non chord
All triads + quads + non chord
</poll>
  
'''Juan's comments''' (August 14, 2008)
If we decide to extend this task, we have to change the I/O format, e.g. C(1,3,5,7), so that we can easily evaluate the same results against triads or quads.
In terms of evaluation, we performed a really simple one last year. This year we welcome evaluation scripts written by the community.
  
Matti's suggestion for the output file format seems good to me (should output files simply be called audiofilename.txt?). May I suggest using an audio list rather than a filename as input? That way everyone gets to code their own batch processing function. Also, where should we write the output files? How about creating an "output" subfolder in our algorithm's folder?
A possible simplification of the data output could be to use chromatic numeric notation for the intervals. For example, C(1,3,5,7) would be C(0,4,7,11) or to be a bit more pure, something like 0(0,4,7,11) is cool but a bit redundant, leading us to 0,4,7,11. Dr. Downie prefers the chromatic numeric notation as it instantaneously gets rid of the enharmonic spelling problem.
I support including both train/test and test-only evaluations. My system does not include a supervised training stage in any case. However, I must point out that its parameters have been optimized using the Beatles data, which I guess is common to many chord ID systems out there and an unavoidable shortcoming of this first version of the chord ID MIREX task.
 
Focusing on major and minor triads and non-chords seems OK to me (this is all my system can do anyway). However, if there is enough quorum for other chords (augmented, diminished and quads), I think we should include a separate evaluation for those. Finally, I support assessing errors according to chord confusions (e.g. relative major/minor, parallel major/minor, dominant/subdominant), but I think the main evaluation metric should be the true-positive rate, as originally suggested by Matti.
 
Thanks everyone for putting this together!
 
  
'''Mert's comment''' (August 15, 2008)
 
  
I have just updated the submission format; please follow it in your submissions. We are going to have 2 subtasks: one with 3-fold test/train (2/3 of the data for training, 1/3 for testing, 3 times) and another with pretrained systems. In the README file of your submission, please state clearly which of your programs runs for which task. A system submitted for the test/train subtask should not have any pre-tuning or training on the dataset.
 
I would like to get an idea of how fast the systems are compared to real time. Please email me this information along with the type of machine on which you tested the speed. There are around 180 songs. If performance is an issue, we might select a subset of the data, at least for the test/train subtask.
 
  
'''Xinglin Zhang's comment''' (August 18, 2008)
 
  
I'd like to know, for the 3-fold test/train subtask, whether you will provide labeled ground truth for supervised training. And will the format of the training files follow Matti's suggestion?
Looking at last year's results for Chord Detection:
https://www.music-ir.org/mirex/2008/index.php/Audio_Chord_Detection_Results
  
'''Dan Ellis: Clarifying file names''' (August 19, 2008)
the performance increase from training for chord detection seems to be insignificant.

Would you consider dropping the test/train part of the task this year?
  
Following Xinglin's comment, can I assume that if the training file list includes /some/path/file1.wav, I will also be able to find /some/path/file1.txt as the training labels, also in the 0-24 integer label format?
== Potential Participants ==
  
Also, when we write an output file, is it to /path/to/output/file1.txt (not /path/to/output/file1.wav.txt)?
* Johan Pauwels/Ghent University, Belgium (firstname.lastname@elis.ugent.be) (still interested)
* Matthias Mauch, Centre for Digital Music, Queen Mary, University of London --[[User:Matthiasmauch|Matthias]] 10:33, 27 June 2009 (UTC)
* Laurent Oudre, TELECOM ParisTech, France (firstname.lastname@telecom-paristech.fr) (still interested, probably 2 algorithms)
* Maksim Khadkevich, Fondazione Bruno Kessler, Italy (lastname_at_fbk_dot_eu) (still interested, 1 algorithm)
* Thomas Rocher, LaBRI Université Bordeaux 1, France (firstname.lastname@labri.fr)
* Yushi Ueda, The University of Tokyo, Japan (lastname@hil.t.u-tokyo.ac.jp)
* Christopher Harte, Centre for Digital Music, Queen Mary, University of London (firstname_dot_lastname_at_elec_dot_qmul_dot_ac_dot_uk)
* Helene Papadopoulos, IRCAM (firstname_dot_lastname_at_ircam.fr)
* Adrian Weller, Daniel Ellis and Tony Jebara, Columbia University, NY, USA (aw2506@columbia.edu)
* Your name here
  
'''Mert's Comments August 20, 2008'''
== Bibliography ==
  
Thanks for noting this. Let's keep everything simple. The ground truth text files for supervised learning will be kept in the same directory. So if there is a file in the train list called "/some/path/file1.wav", there will be a corresponding ground truth file called "/some/path/file1.wav.txt" in the 0-24 integer label format.
1. Harte, C.A. and Sandler, M.B. (2005). '''Automatic chord identification using a quantised chromagram.''' Proceedings of the 118th Audio Engineering Society Convention.
The output files should also be called "/path/to/output/file1.wav.txt".
 
  
'''Johan`s comments (August 21, 2008)'''
2. Sailer, C. and Rosenbauer, K. (2006). '''A bottom-up approach to chord detection.''' Proceedings of the International Computer Music Conference 2006.
Hi all, I'm a first-year PhD student at Ghent University and I'm (but mostly will be) focusing on chord and key detection. The contest comes a couple of months too early for me, as I didn't plan to have a completely integrated chord extraction program working at this stage, but I'm busy hacking together some old and new code from our group so that I can jump in at short notice. I have one remark: our program doesn't need any training (since I didn't have time yet to implement it, but it's on the to-do list); it just uses parameters derived from music theory, so no Beatles data were used. I'll submit it to the pre-trained category, since it doesn't need training, but it's really more "not trained". To be complete, just the feature extraction step has some optimized parameters (for the pitch tracking) which we set according to our own test set (not containing Beatles), but these weren't evaluated in the context of chord detection and are supposed to be general parameters anyway (based mostly on common sense). Also, because of time constraints I won't be able to compile for Linux, although it should be perfectly portable C++; I'll try to use static linking as much as possible (unfortunately, libsndfile's licence, for instance, prohibits this) and will send a Win32 console app. Lastly, the abstract will probably be ready August 29th at the latest. I read in a mail that the hard deadline is only 5 days before ISMIR, so that should be OK, but can somebody confirm this?
 
  
Additional question:
3. Shenoy, A. and Wang, Y. (2005). '''Key, chord, and rhythm tracking of popular music recordings.''' Computer Music Journal 29(3), 75-86.
It's obvious that all seventh chords, ninths and higher will be reduced to their triad, but will dim, aug, sus4 and so on all be represented as "24" in the evaluation reference?
 
  
'''Xinglin's Comments (August 27, 2008)'''
4. Sheh, A. and Ellis, D.P.W. (2003). '''Chord segmentation and recognition using EM-trained hidden Markov models.''' Proceedings of the 4th International Conference on Music Information Retrieval.
  
I fixed some bugs in my program just now and updated the submission. The submission page allowed me to do so, but I'm not sure whether that's OK with the committee.
5. Yoshioka, T. et al. (2004). '''Automatic chord transcription with concurrent recognition of chord symbols and boundaries.''' Proceedings of the 5th International Conference on Music Information Retrieval.
  
== Training Poll ==
6. Harte, C., Sandler, M., Abdallah, S. and Gómez, E. (2005). '''Symbolic representation of musical chords: a proposed syntax for text annotations.''' Proceedings of the 6th International Conference on Music Information Retrieval.
  
7. Papadopoulos, H. and Peeters, G. (2007). '''Large-scale study of chord estimation algorithms based on chroma representation and HMM.''' Proceedings of the 5th International Conference on Content-Based Multimedia Indexing.
  
<poll>
8. Abdallah, S., Noland, K., Sandler, M., Casey, M. and Rhodes, C. (2005). '''Theory and evaluation of a Bayesian music structure extractor.''' Proceedings of the 6th International Conference on Music Information Retrieval, pp. 420-425.
How would you like the systems to perform training?
 
Pre-trained
 
N-fold train test setup
 
Both. One subtask with pre-trained systems, one with N-fold train/test setup.
 
No need training
 
</poll>
 

Latest revision as of 00:44, 15 December 2011

== Description ==

The text of this section is copied from the 2008 page. This task was first run in 2008. Please add your comments and discussions for 2009.

For many applications in music information retrieval, extracting the harmonic structure is very desirable, for example for segmenting pieces into characteristic segments, for finding similar pieces, or for semantic analysis of music.

The extraction of the harmonic structure requires the detection of as many chords as possible in a piece. That includes the characterisation of chords with a key and type as well as a chronological order with onset and duration of the chords.

Although some publications are available on this topic [1,2,3,4,5], comparing the results is difficult because different measures are used to assess performance. To overcome this problem, an accurately defined methodology is needed. This includes a repertoire of the detectable chords, a defined test set along with ground truth, and unambiguous calculation rules to measure performance.

In view of this, we suggest introducing the new evaluation task Audio Chord Detection.

The deadline for this task is September 14th.

== Data ==

Christopher Harte's Beatles dataset was used for last year's evaluations. This dataset consists of 12 Beatles albums [6]. An approach for text annotation of musical chords is presented in [6]. This year an additional dataset has been donated by Matthias Mauch, consisting of 38 songs from Queen and Zweieck. The data will be provided as 44.1 kHz, 16-bit mono WAV files. The ground truth looks like this:

41.2631021 44.2456460 B

44.2456460 45.7201130 E

45.7201130 47.2061900 E:7/3

47.2061900 48.6922670 A

48.6922670 50.1551240 A:min/b3
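A minimal reader for this three-column format might look as follows; this is a sketch with a hypothetical helper name, not an implied official parser.

```python
# Sketch of a reader for the flat-text ground truth above:
# "start_time end_time chord_label", whitespace-separated, one per line.
def read_lab(lines):
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            start, end, label = parts
            segments.append((float(start), float(end), label))
    return segments

example = """41.2631021 44.2456460 B
44.2456460 45.7201130 E
45.7201130 47.2061900 E:7/3"""
segments = read_lab(example.splitlines())
```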

== I/O Format ==

This year the I/O format needs to be changed to allow evaluation on all triads and quads. We are planning to use the format suggested by Christopher Harte [6]. The chord root is given as a natural (A|B|C|D|E|F|G) followed by optional sharp or flat modifiers (#|b). For the evaluation process we may assume enharmonic equivalence for chord roots. For a given chord type on root X, chord labels can be given as a list of intervals or as a shorthand notation, as shown in the following table:

{| class="wikitable"
! NAME !! INTERVALS !! SHORTHAND
|-
| major || X:(1,3,5) || X or X:maj
|-
| minor || X:(1,b3,5) || X:min
|-
| diminished || X:(1,b3,b5) || X:dim
|-
| augmented || X:(1,3,#5) || X:aug
|-
| suspended4 || X:(1,4,5) || X:sus4
|-
| colspan="3" | ''possible 6th triad:''
|-
| suspended2 || X:(1,2,5) || X:sus2
|-
| colspan="3" | ''Quads:''
|-
| major-major7 || X:(1,3,5,7) || X:maj7
|-
| major-minor7 || X:(1,3,5,b7) || X:7
|-
| major-add9 || X:(1,3,5,9) || X:maj(9)
|-
| major-major7-#5 || X:(1,3,#5,7) || X:aug(7)
|-
| minor-major7 || X:(1,b3,5,7) || X:min(7)
|-
| minor-minor7 || X:(1,b3,5,b7) || X:min7
|-
| minor-add9 || X:(1,b3,5,9) || X:min(9)
|-
| colspan="3" | ''minor 7/b5 (ambiguous - could be either of the following):''
|-
| minor-major7-b5 || X:(1,b3,b5,7) || X:dim(7)
|-
| minor-minor7-b5 (a half-diminished 7th) || X:(1,b3,b5,b7) || X:hdim7
|-
| sus4-major7 || X:(1,4,5,7) || X:sus4(7)
|-
| sus4-minor7 || X:(1,4,5,b7) || X:sus4(b7)
|-
| colspan="3" | ''omitted from list on wiki:''
|-
| diminished7 || X:(1,b3,b5,bb7) || X:dim7
|-
| No Chord || || N
|}

However, we still accept participants who would like to be evaluated only on major/minor and want to use last year's format, which is an integer chord id in the range 0-24: values 0-11 denote C major, C# major, ..., B major; values 12-23 denote C minor, C# minor, ..., B minor; and 24 denotes silence or no-chord segments.
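The integer format can be sketched as a pair of conversion helpers; the function names are hypothetical and sharps-only spelling is assumed for the root names.

```python
# Hypothetical helpers for last year's integer format: ids 0-11 are the
# major chords C..B, 12-23 the minor chords C..B, and 24 is no-chord.
PITCH_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def id_to_label(chord_id):
    if chord_id == 24:
        return 'N'  # silence / no-chord
    root = PITCH_NAMES[chord_id % 12]
    return root if chord_id < 12 else root + ':min'

def label_to_id(label):
    if label == 'N':
        return 24
    root, _, quality = label.partition(':')
    return PITCH_NAMES.index(root) + (12 if quality == 'min' else 0)
```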

== Submission Format ==

Submissions have to conform to the specified format below:

extractFeaturesAndTrain  "/path/to/trainFileList.txt"  "/path/to/scratch/dir"  

where trainFileList.txt contains the path to each WAV file. The features extracted at this stage can be stored under "/path/to/scratch/dir". The ground truth files for supervised learning will be at the same path with a ".txt" extension appended. For example, for "/path/to/trainFile1.wav" there will be a corresponding ground truth file called "/path/to/trainFile1.wav.txt".

For testing:

doChordID.sh "/path/to/testFileList.txt"  "/path/to/scratch/dir" "/path/to/results/dir"  

If there is no training, you can ignore the second argument here. In the results directory, there should be one file for each test file, with the same name as the test file plus ".txt".

Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.

== Evaluation ==

Algorithms should output text files with a similar format to that used in the ground truth transcriptions. That is to say, they should be flat text files with chord segment labels and times arranged thus:

start_time end_time chord_label

with elements separated by white spaces, times given in seconds, chord labels corresponding to the syntax described in [6] and one chord segment per line.

Please note that two things have changed in the syntax since it was originally described in [6]. The first change is that the root is no longer implied as a voiced element of a chord so a C major chord (notes C, E and G) should be written C:(1,3,5) instead of just C:(3,5) if using the interval list representation. As before, the labels C and C:maj are equivalent to C:(1,3,5). The second change is that the shorthand label "sus2" (intervals 1,2,5) has been added to the available shorthand list.--Chrish 17:05, 9 September 2009 (UTC)


=== Segmentation Score ===

The segmentation score will be calculated using the directional hamming distance as described in [8]. An over-segmentation value (f) and an under-segmentation value (m) will be calculated, and the final segmentation score will be the worst case of the two, i.e.:

segmentation score = 1 - max(m,f)

m and f are not independent of each other so combining them this way ensures that a good score in one does not hide a bad score in the other. The combined segmentation score can take values between 0 and 1 with 0 being the worst and 1 being the best result.-- Chrish 17:05, 9 September 2009 (UTC)
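A minimal sketch of this combined score follows; the function names are hypothetical, the assignment of f to fragmentation and m to under-segmentation follows the description of [8], and both transcriptions are assumed to span the same total duration.

```python
# Sketch (not the official script) of the combined segmentation score.
# Segments are (start_sec, end_sec) pairs.
def directional_hamming(seg_a, seg_b):
    """For each segment of seg_a, the duration NOT covered by its
    maximally-overlapping segment in seg_b, summed over seg_a."""
    dist = 0.0
    for s, e in seg_a:
        best = max(min(e, e2) - max(s, s2) for s2, e2 in seg_b)
        dist += (e - s) - max(0.0, best)
    return dist

def segmentation_score(ground_truth, estimate):
    total = ground_truth[-1][1] - ground_truth[0][0]
    f = directional_hamming(ground_truth, estimate) / total  # fragmentation
    m = directional_hamming(estimate, ground_truth) / total  # under-segmentation
    return 1.0 - max(m, f)
```

An estimate that puts one label on the whole piece scores perfectly on fragmentation but badly on under-segmentation; taking the max keeps the good score from hiding the bad one, as noted above.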

=== Frame-based recall ===

For recall evaluation, we may define a different chord dictionary for each level of evaluation (dyads, triads, tetrads etc). Each dictionary is a text file containing chord shorthands / interval lists of the chords that will be considered in that evaluation. The following dictionaries are proposed:

For dyad comparison of major/minor chords only:

N
X:maj
X:min

For comparison of standard triad chords:

N
X:maj
X:min
X:aug
X:dim
X:sus2
X:sus4

For comparison of tetrad (quad) chords:

N
X:maj
X:min
X:aug
X:dim
X:sus2
X:sus4
X:maj7
X:7
X:maj(9)
X:aug(7)
X:min(7)
X:min7
X:min(9)
X:dim(7)
X:hdim7
X:sus4(7)
X:sus4(b7)
X:dim7


For each evaluation level, the ground truth annotation is compared against the dictionary. Any chord label not belonging to the current dictionary will be replaced with an "X" in a local copy of the annotation and will not be included in the recall calculation.

Note that the level of comparison in terms of intervals can be varied. For example, in a triad evaluation we can consider the first three component intervals in the chord so that a major (1,3,5) and a major7 (1,3,5,7) will be considered the same chord. For a tetrad (quad) evaluation, we would consider the first 4 intervals so major and major7 would then be considered to be different chords.

For the maj/min evaluation (using the first example dictionary), using an interval comparison of 2 (dyad) will compare only the first two intervals of each chord label. This would map augmented and diminished chords to major and minor respectively (and any other symbols that had a major 3rd or minor 3rd as their first interval). Using an interval comparison of 3 with the same dictionary would keep only those chords that have major and minor triads as their first 3 intervals so augmented and diminished chords would be removed from the evaluation.

After the annotation has been "filtered" using a given dictionary, it can be compared against the machine generated estimates output by the algorithm under test. The chord sequences described in the annotation and estimate text files are sampled at a given frame rate (in this case 10ms per frame) to give two sequences of chord frames which may be compared directly with each other. For calculating a hit or a miss, the chord labels from the current frame in each sequence will be compared. Chord comparison is done by converting each chord label into an ordered list of pitch classes then comparing the two lists element by element. If the lists match to the required number of intervals then a hit is recorded, otherwise the estimate is considered a miss. It should be noted that, by converting to pitch classes in the comparison, this evaluation ignores enharmonic pitch and interval spellings so the following chords (slightly silly example just for illustration) will all evaluate as identical:

C:maj = Dbb:maj = C#:(b1,b3,#4)
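The conversion to ordered pitch-class lists can be sketched as below. This is a hedged illustration, not the official evaluation code: it covers the shorthands from the table above plus explicit interval lists, supports the "shorthand(added degrees)" form, and ignores the bass part of an inversion; all names are hypothetical.

```python
# Sketch of converting a chord symbol in the syntax of [6] into an ordered
# list of pitch classes, ignoring inversions (the "/bass" part).
NATURALS = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}
DEGREES = {'1': 0, '2': 2, '3': 4, '4': 5, '5': 7, '6': 9, '7': 11, '9': 14}
SHORTHANDS = {
    'maj':  ['1', '3', '5'],        'min':   ['1', 'b3', '5'],
    'dim':  ['1', 'b3', 'b5'],      'aug':   ['1', '3', '#5'],
    'sus2': ['1', '2', '5'],        'sus4':  ['1', '4', '5'],
    'maj7': ['1', '3', '5', '7'],   '7':     ['1', '3', '5', 'b7'],
    'min7': ['1', 'b3', '5', 'b7'], 'hdim7': ['1', 'b3', 'b5', 'b7'],
    'dim7': ['1', 'b3', 'b5', 'bb7'],
}

def _split(token):
    """Separate a root/degree token from its sharp/flat offset."""
    offset = sum(+1 if c == '#' else -1 for c in token if c in '#b')
    core = ''.join(c for c in token if c not in '#b')
    return core, offset

def chord_pitch_classes(label):
    """Ordered pitch-class list for a chord symbol; 'N' gives an empty list."""
    if label == 'N':
        return []
    root_token, _, rest = label.partition(':')
    rest = rest.split('/')[0]          # drop the bass/inversion part
    core, offset = _split(root_token)
    root = (NATURALS[core] + offset) % 12
    if not rest:                       # bare root, e.g. "C" means C:maj
        degrees = SHORTHANDS['maj']
    elif rest.startswith('('):         # explicit interval list, e.g. "(1,b3,5)"
        degrees = rest.strip('()').split(',')
    elif '(' in rest:                  # shorthand plus added degrees, e.g. "maj(9)"
        base, extra = rest.split('(', 1)
        degrees = SHORTHANDS[base] + extra.rstrip(')').split(',')
    else:
        degrees = SHORTHANDS[rest]
    pcs = []
    for token in degrees:
        core, offset = _split(token)
        pcs.append((root + DEGREES[core] + offset) % 12)
    return pcs
```

With this, the "slightly silly" example above evaluates as intended: all three spellings give the pitch classes of C major.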


Basic recall calculation algorithm:

1) filter annotated transcription using chord dictionary for a defined number of intervals

2) sample annotated transcription and machine estimated transcription at 10ms intervals to create a sequence of annotation frames and estimate frames

3) start at the first frame

4) get chord label for current annotation frame and estimate frame

5) check annotation label:

IF symbol is 'X' (i.e. non-dictionary)

THEN ignore frame (record number of ignored frames)

ELSE compare annotated/estimated chords for the predefined number of intervals
increment hit count if chords match

ENDIF

6) increment frame count

7) go back to 4 until final chord frame --Chrish 17:05, 9 September 2009 (UTC)
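Steps 1-7 above can be sketched as follows (hypothetical names, not the official script), assuming the transcriptions have already been converted to segments whose labels are ordered pitch-class tuples, with the string 'X' marking segments filtered out by the chord dictionary:

```python
# Sketch of the frame-based recall described in steps 1-7 above.
# Transcriptions are (start_sec, end_sec, label) segments; labels are
# ordered tuples of pitch classes, or 'X' for non-dictionary segments.
FRAME = 0.01  # step 2: sample at 10 ms per frame

def sample_frames(segments, n_frames):
    frames = [None] * n_frames
    for start, end, label in segments:
        first, last = int(round(start / FRAME)), int(round(end / FRAME))
        for i in range(first, min(last, n_frames)):
            frames[i] = label
    return frames

def frame_recall(annotation, estimate, duration, n_intervals=3):
    n_frames = int(round(duration / FRAME))
    ann = sample_frames(annotation, n_frames)
    est = sample_frames(estimate, n_frames)
    hits = scored = 0
    for a, e in zip(ann, est):
        if a is None or a == 'X':      # step 5: ignore non-dictionary frames
            continue
        scored += 1
        if e not in (None, 'X') and a[:n_intervals] == e[:n_intervals]:
            hits += 1                  # chords match to the required depth
    return hits / scored if scored else 0.0
```

Setting `n_intervals=3` gives a triad-level comparison in which, for example, a major and a major7 on the same root count as the same chord, as described above.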

== Discussions ==

Points to discuss:

  • The Precision measure was not used last year, and I believe it should not be, because (unlike in beat extraction) we can assume a contiguous sequence of chords, i.e. all time units should feature a chord label. --Matthias 11:04, 27 June 2009 (UTC)
  • I would like to disagree on Matthias' previous point: I think we cannot assume that there is a chord present in every frame; one can think for instance of a drum solo, an a cappella break, ethnic music or simply the beginning and ending of a file. In melody extraction or beat detection, there also isn't a continuity assumption. I must say that at the moment, our system isn't able to generate a no-chord either, so it is not in my personal interest to add this to the evaluation, but I feel this should be part of a general chord extraction system. I've also learned from some premature experiments that with the current frame-based evaluation, it is actually not even beneficial to include such a no-chord generator, because of the inequality of prior chances between a chord and a no-chord (14% for a N.C. in our little dataset; I suspect it to be even less for the Beatles set). The consequence is that the chord/no-chord distinction must be very accurate in order to increase the performance. A related, minor topic is the naming of this task. Why isn't it "audio chord extraction", just like "melody extraction"? For me "chord detection" is making the distinction between chords and no-chords and "chord extraction" is naming detected chords. Anyway, just nitpicking on that one. --Johan 15:43, 16 July 2009 (CET)
  • I think we can assume a contiguous sequence of chords if we treat "no chord" as a chord. --Matthias 16:01, 7 August 2009 (UTC)
  • I believe we should move forward in two ways to get a more meaningful evaluation:
    1. evaluate separate recall measures for several chord classes, my proposal is major, minor, diminished, augmented, dominant (meaning major chords with a minor seventh). A final recall score can then be calculated as a (weighted) average of the recall on different chords. --Matthias 11:04, 27 June 2009 (UTC)
    2. We use just triads major, minor, diminished & augmented which I think is a more sensible distinction. Once you start using quads, why limit yourself to dominant 7 and not use minor 7, major 7, full diminished, etc. So I'm more in favour of just triads (maybe add sus too) or more quads. -- Johan 15:48, 16 July 2009 (CET)
    3. Segmentation should be considered. For example, a chord extraction algorithm that has reasonable recall may still be heavily fragmented thus producing an output difficult to read for humans. One measure to check for similarity in segmentation is directional Hamming distance (or divergence). --Matthias 11:04, 27 June 2009 (UTC)
    4. Agree, while the frame-based evaluation is certainly easy, it is not the most musically sensible. An evaluation on note-level or chord-segment basis might be a little too complicated for now, but this is a start. -- Johan 15:51, 16 July 2009 (CET)
    5. Do you think we could consider several evaluation cases for chord detection with various chord dictionaries (for instance one with major/minor triads, one with major/minor/diminished/augmented/dominant, etc.), so that each participant can choose a case that can be handled by his/her algorithm? --Helene (IRCAM)
    6. to Helene: Several (maybe two) evaluation cases /could/ be good, but I think everyone's algorithm should be tested on every task; I think choosing the one you want would mean you can't compare it to other people's methods.
    7. to Johan: in your chord list, did you mean to give the same list as I did? Anyway, I like to add "dominant" because it is used often, and musically (Jazz harmony theory) there's a functional difference between "dominant" and "major" (not between "minor" and "minor 7", and not between "major" and "major 7" or "major 6"). --Matthias 16:01, 7 August 2009 (UTC)
    8. @Matthias: no I intended to list just the triads (the dominant shouldn't have been copy-pasted, corrected it now) -- Johan 10:49, 25 Aug 2009 (CET)
    9. to Matthias: I think that using the label "dominant" as you suggest here is not the correct use of that musical term in this context. In music theory (both in classical and jazz harmony) it is true that a dominant-seventh chord is always the major triad + minor 7th chord shape i.e. (1,3,5,b7). However, a (1,3,5,b7) chord does not always function as a dominant. Whether such a chord can be labelled dominant or not is entirely based on the chord's position in a given chord sequence relative to other chords which define its function in the progression. It is precisely for that reason that I argue against the use of the term 'dominant' for context-free chord labelling in the ISMIR05 chord labels paper [6]. -- Chrish 17:00, 10 August 2009 (UTC)
    10. It seems counter-intuitive to have an evaluation that includes some quads but not all. It is fair to evaluate across the triad shapes major, minor, augmented, diminished and suspended (although sus2 and sus4 are inversions of each other) as these are the naturally occurring triad shapes in western harmony. It would seem sensible to include all the other quads that are labelled "7th" chords if including the (1,3,5,b7) shape in an evaluation. Given this problem, one possible way to compare algorithms that detect different sets of chord labels would be to split the results between triad recognition and quad/quint etc recognition. All algorithms can be tested on the triad evaluation - any algorithm that can detect quads can be compared directly against an algorithm that deals only with triads simply by taking the equivalent first three intervals of each chord label in the transcription as a triad. Only those algorithms that can recognise quads need to be evaluated in results that include quad chords. To make this process easy, it might be sensible for the labels that chord recognition algorithms produce to be given in terms of the intervals in the chords themselves rather than a chord name - i.e. a C major could be "C:(1,3,5)" and C major seventh "C:(1,3,5,7)" - both would evaluate as C major in a triad evaluation but the second one could also be evaluated in a test that looked at quads as well. This would also take away problems with possible labelling ambiguities that chord name labels could introduce. A list of the triads and quads that are acceptable in each level of evaluation would need to be drawn up but there is a list something like that at the top of this page already. -- Chrish 17:00, 10 August 2009 (UTC)
    11. I think keeping the MIREX2008 evaluation procedure would make sense. I like the idea of a "merged maj/min" score in order to evaluate the root precision, and then maybe another score taking into account only the root and mode (maj/min). Then we could progressively extend the chord dictionary by adding triads and quads (as suggested by Chris and Johan) and calculating new scores based on these extensions. -- Thomas 12:45, 12 August 2009 (UTC)
    12. I like the idea of using a segmentation measure based on directional hamming distance - We must make sure that the measure captures both over-segmentation (fragmentation) and under-segmentation though. The fragmentation measure (1-f) based on the inverse directional hamming distance described by Abdallah et al [8] only measures fragmentation by measuring the distance d_MG between the estimated sequence (M) and the ground-truth annotation (G). They also propose an under-segmentation measure (1-m) which uses the forward directional hamming distance d_GM. Using the fragmentation measure alone would mean that a chord recognition algorithm that output one chord label for the entire piece would score 100% correct for fragmentation. Both of these measures give a value between 0 and 1 so we could combine them to give an overall "chord segmentation" measure: 1-max(f,m) (f and m are not independent so it's probably best to just use the worst one rather than combine them geometrically). I think this measure could complement the frame-based recall quite well. -- Chrish 10:00, 11 August 2009 (UTC)
    13. anyone know how to make the LaTeX maths work on the wiki? -- Chrish 17:01, 10 August 2009 (UTC)
  • Something to consider when broadening the set of chords used is the inequality in prior probability of different chords (much like the chord/no-chord problem I mentioned above). When looking for augmented and diminished triads in the Beatles set in addition to major and minor, I'm quite positive the (or at least my) overall performance will decrease. Some processing/selecting could level the priors, but just limiting the data set to the duration of the least frequent chord won't leave us with much data, I'm afraid. The thing is, of course, that the inequality is also there in reality, so I'm not really convinced myself that this should be done. Another option is not changing the data, but letting the evaluation take it into account. -- Johan 16:23, 16 July 2009 (CET)
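Chris's suggestion in point 10 above can be illustrated with a minimal Python sketch (the helper name `to_triad` is hypothetical) that reduces an interval-list chord label to its triad equivalent by keeping the first three intervals, so a quad-capable transcription can be scored in a triads-only evaluation:

```python
def to_triad(label):
    """Keep only the first three intervals of an interval-list chord
    label, e.g. 'C:(1,3,5,7)' -> 'C:(1,3,5)'."""
    root, intervals = label.split(':')
    degrees = [d.strip() for d in intervals.strip('()').split(',')]
    return root + ':(' + ','.join(degrees[:3]) + ')'
```

A C major seventh then evaluates as C major at the triad level, while labels that are already triads pass through unchanged.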

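The combined segmentation measure 1-max(f,m) proposed in point 12 could look like the following illustrative Python sketch (not an official script; segmentations are assumed to be lists of (start, end) pairs in seconds, and the function names are hypothetical):

```python
def directional_hamming(seg_from, seg_to):
    """Directional hamming distance: for each segment of seg_from, the
    part of its duration not covered by its single best-overlapping
    segment in seg_to.  Segments are (start, end) pairs in seconds."""
    total = 0.0
    for s1, e1 in seg_from:
        best = max(min(e1, e2) - max(s1, s2) for s2, e2 in seg_to)
        total += (e1 - s1) - max(best, 0.0)
    return total

def segmentation_score(est, gt):
    """Combined measure 1 - max(f, m): f penalises fragmentation
    (over-segmentation), m penalises under-segmentation."""
    duration = gt[-1][1] - gt[0][0]
    f = directional_hamming(gt, est) / duration  # estimate fragments ground truth
    m = directional_hamming(est, gt) / duration  # estimate merges ground truth
    return 1.0 - max(f, m)
```

As intended, a transcription that outputs a single label for the whole piece scores zero on fragmentation but is penalised by the under-segmentation term.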

  • Should chord data be expressed in absolute (e.g. "F major-minor 7") or relative (e.g. "C: IV major-minor 7") terms?

Absolute is better for the moment - if an algorithm can generate relative information then it can be converted easily to absolute. Going the other way is not easy if the algorithm doesn't estimate the key before trying to recognise chords... -- Chrish 17:04, 9 September 2009 (UTC)

  • Should different inversions of chords be considered in the evaluation process?

The evaluation procedure I've outlined above ignores inversion at the moment. -- Chrish 17:04, 9 September 2009 (UTC)

  • What temporal resolution should be used for ground truth and results?

If all algorithms output a text file with chord start/end times and labels then the underlying resolutions of different systems can be different. In the evaluation stage I suggested sampling the transcriptions in 10ms frames because that's the limit of what we can perceive in terms of separate onsets anyway.-- Chrish 17:04, 9 September 2009 (UTC)
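The 10ms frame sampling suggested here could be sketched as follows (an illustrative Python sketch; `sample_frames` and `frame_recall` are hypothetical names, and transcriptions are assumed to be lists of (start, end, label) tuples in seconds):

```python
def sample_frames(segments, hop=0.01):
    """Sample a transcription, given as (start, end, label) tuples,
    into a list of labels on a hop-second (default 10 ms) grid."""
    n_frames = int(round(segments[-1][1] / hop))
    labels = []
    idx = 0
    for i in range(n_frames):
        t = i * hop
        # advance to the segment containing time t
        while idx < len(segments) - 1 and t >= segments[idx][1]:
            idx += 1
        labels.append(segments[idx][2])
    return labels

def frame_recall(est, gt, hop=0.01):
    """Fraction of frames where the estimated label matches the ground truth."""
    e, g = sample_frames(est, hop), sample_frames(gt, hop)
    n = min(len(e), len(g))
    return sum(a == b for a, b in zip(e, g)) / n
```

Because each system only has to emit start/end times and labels, its internal resolution is irrelevant; everything is compared on the common 10ms grid.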

  • How should enharmonic and other confusions of chords be handled?

Although the chord symbol syntax allows enharmonic spellings to be differentiated, the evaluation process converts the symbols into lists of pitch classes. As long as the symbol is a valid one according to the syntax in [6] (with the two caveats I outlined above) then the system will have no trouble accepting any chord symbol.-- Chrish 17:04, 9 September 2009 (UTC)
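A sketch of the symbol-to-pitch-class conversion described here (hypothetical helper names; the degree-to-semitone mapping follows the major-scale convention of the syntax in [6]), showing how enharmonic spellings collapse to the same pitch-class set:

```python
NATURAL = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}
# semitone offsets of scale degrees 1..7 relative to the root (major scale)
DEGREE = {'1': 0, '2': 2, '3': 4, '4': 5, '5': 7, '6': 9, '7': 11}

def pitch_class(note):
    """'C', 'F#', 'Bb', 'Gbb', ... -> 0-11; enharmonic spellings collapse."""
    pc = NATURAL[note[0]]
    for mod in note[1:]:
        pc += 1 if mod == '#' else -1
    return pc % 12

def chord_pitch_classes(symbol):
    """'Db:(1,3,5)' -> {1, 5, 8}; 'C#:(1,3,5)' gives the same set."""
    root, intervals = symbol.split(':')
    rpc = pitch_class(root)
    pcs = set()
    for iv in intervals.strip('()').split(','):
        iv = iv.strip()
        shift = 0
        while iv[0] in '#b':  # flattened/sharpened degrees, e.g. 'b3', 'b7'
            shift += 1 if iv[0] == '#' else -1
            iv = iv[1:]
        pcs.add((rpc + DEGREE[iv] + shift) % 12)
    return pcs
```

Once symbols are reduced to pitch-class sets like this, Db major and C# major compare as equal and the enharmonic question disappears from the evaluation.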

  • How will Ground Truth be determined?
    1. One thing we should consider doing is using the combined results to find possible errors or inconsistencies in the data. I have found that some exist in Harte's labels (this was, after all, extremely difficult). If there are sections that are systematically error-prone, then maybe it is best to re-evaluate these labels for future runs. Also, we need to start finding other databases. Since we are only using Beatles data each year, we have no measure of generalization. Maybe we should require that participation include labelling 2-3 songs? These could also be distributed to a few people for verification. Just a thought. -- Jeremy 14:45, 25 August 2009 (UTC)
    2. If anyone finds mistakes in the Beatles data please let me know about them so I can fix them before I release the new versions (this will be in time for ISMIR 2009) -- Chrish 17:04, 9 September 2009 (UTC)
  • What degree of chordal/tonal complexity will the music contain?

This can be variable using different chord dictionaries.-- Chrish 17:04, 9 September 2009 (UTC)

  • Will we include any atonal or polytonal music in the Ground Truth dataset?

Are there any recognisable chords in these types of music as such? -- Chrish 17:04, 9 September 2009 (UTC)

  • What is the maximal acceptable onset deviation between ground truth and result?

If using a frame based recall and a segmentation measure based on directional hamming distance, we do not need to specify an allowable onset deviation. -- Chrish 17:04, 9 September 2009 (UTC)

  • What file format should be used for ground truth and output?

Flat text - effectively the same as the .lab files used for the Beatles transcriptions.-- Chrish 17:04, 9 September 2009 (UTC)

Comments by MB 08.12

First I want to make clear that we are using Christopher Harte's Beatles dataset, which includes quad chords. Last year the evaluations were based only on major, minor and non chords. This year, if there are enough participants, we can extend the evaluations to the rest of the triads (diminished, augmented, suspended) and to quads. Please vote:

<poll>
How would you like the evaluations to be performed?
Same as last year: evaluate on major, minor and non chord
All triads: major, minor, diminished, augmented, suspended + non chord
All triads + quads + non chord
</poll>

If we decide to extend this task, we have to change the I/O format, e.g. C:(1,3,5,7), so that we can easily evaluate the same results against triads or quads. In terms of evaluation, we performed a really simple one last year. This year we welcome evaluation scripts written by the community.

A possible simplification of the data output could be to use chromatic numeric notation for the intervals. For example, C(1,3,5,7) would be C(0,4,7,11) or to be a bit more pure, something like 0(0,4,7,11) is cool but a bit redundant, leading us to 0,4,7,11. Dr. Downie prefers the chromatic numeric notation as it instantaneously gets rid of the enharmonic spelling problem.



Looking at last year's results for Chord Detection: https://www.music-ir.org/mirex/2008/index.php/Audio_Chord_Detection_Results

the performance increase from training for chord detection seems to be insignificant. Would you consider dropping the train/test part of the task this year?

Potential Participants

  • Johan Pauwels/Ghent University, Belgium (firstname.lastname@elis.ugent.be) (still interested)
  • Matthias Mauch, Centre for Digital Music, Queen Mary, University of London --Matthias 10:33, 27 June 2009 (UTC)
  • Laurent Oudre, TELECOM ParisTech, France (firstname.lastname@telecom-paristech.fr) (still interested, probably 2 algorithms)
  • Maksim Khadkevich, Fondazione Bruno Kessler, Italy (lastname_at_fbk_dot_eu) (still interested, 1 algorithm)
  • Thomas Rocher, LaBRI, Université Bordeaux 1, France (firstname.lastname@labri.fr)
  • Yushi Ueda, The University of Tokyo, Japan (lastname@hil.t.u-tokyo.ac.jp)
  • Christopher Harte, Centre for Digital Music, Queen Mary, University of London (firstname_dot_lastname_at_elec_dot_qmul_dot_ac_dot_uk)
  • Helene Papadopoulos, IRCAM (firstname_dot_lastname_at_ircam.fr)
  • Adrian Weller, Daniel Ellis and Tony Jebara, Columbia University, NY, USA (aw2506@columbia.edu)
  • Your name here

Bibliography

1. Harte, C.A. and Sandler, M.B. (2005). Automatic chord identification using a quantised chromagram. Proceedings of the 118th Audio Engineering Society Convention.

2. Sailer, C. and Rosenbauer, K. (2006). A bottom-up approach to chord detection. Proceedings of the International Computer Music Conference 2006.

3. Shenoy, A. and Wang, Y. (2005). Key, chord, and rhythm tracking of popular music recordings. Computer Music Journal 29(3), 75-86.

4. Sheh, A. and Ellis, D.P.W. (2003). Chord segmentation and recognition using EM-trained hidden Markov models. Proceedings of the 4th International Conference on Music Information Retrieval.

5. Yoshioka, T. et al. (2004). Automatic chord transcription with concurrent recognition of chord symbols and boundaries. Proceedings of the 5th International Conference on Music Information Retrieval.

6. Harte, C., Sandler, M., Abdallah, S. and Gómez, E. (2005). Symbolic representation of musical chords: a proposed syntax for text annotations. Proceedings of the 6th International Conference on Music Information Retrieval.

7. Papadopoulos, H. and Peeters, G. (2007). Large-scale study of chord estimation algorithms based on chroma representation and HMM. Proceedings of the 5th International Conference on Content-Based Multimedia Indexing.

8. Abdallah, S., Noland, K., Sandler, M., Casey, M. and Rhodes, C. (2005). Theory and evaluation of a Bayesian music structure extractor. Proceedings of the 6th International Conference on Music Information Retrieval, pp. 420-425.