2010:Audio Chord Estimation
Description
The text of this section is copied from the 2009 page. This task was first run in 2008. Please add your comments and discussions for 2010.
For many applications in music information retrieval, extracting the harmonic structure is very desirable, for example for segmenting pieces into characteristic segments, for finding similar pieces, or for semantic analysis of music.
The extraction of the harmonic structure requires the detection of as many chords as possible in a piece. That includes the characterisation of chords with a key and type as well as a chronological order with onset and duration of the chords.
Although some publications are available on this topic [1,2,3,4,5], comparing their results is difficult because different measures are used to assess performance. Overcoming this problem requires a precisely defined methodology: a vocabulary of detectable chords, a defined test set with ground truth, and unambiguous calculation rules to measure performance.
To this end, we suggest introducing the new evaluation task Audio Chord Detection.
Data
Christopher Harte's Beatles dataset was used for the evaluations last year. This dataset consists of 12 Beatles albums [6]. An approach for text annotation of musical chords is presented in [6]. This year an extra dataset, consisting of 38 songs by Queen and Zweieck, was donated by Matthias Mauch. The data will be provided as 44.1 kHz, 16-bit mono WAV files. The ground truth looks like this:
41.2631021 44.2456460 B
44.2456460 45.7201130 E
45.7201130 47.2061900 E:7/3
47.2061900 48.6922670 A
48.6922670 50.1551240 A:min/b3
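As a rough illustration of how such a file might be read, here is a minimal Python sketch. The names `ChordSegment` and `parse_annotation` are hypothetical and not part of any MIREX tooling; the sketch only assumes three whitespace-separated fields per line, as in the excerpt above.

```python
from typing import List, NamedTuple


class ChordSegment(NamedTuple):
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    label: str    # chord label, e.g. "E:7/3" or "N"


def parse_annotation(text: str) -> List[ChordSegment]:
    """Parse 'start_time end_time chord_label' lines into segments."""
    segments = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError(f"line {line_number}: expected 3 fields, got {len(fields)}")
        start, end, label = fields
        segments.append(ChordSegment(float(start), float(end), label))
    return segments


example = """41.2631021 44.2456460 B
44.2456460 45.7201130 E
45.7201130 47.2061900 E:7/3
"""
segments = parse_annotation(example)
print(segments[2].label)
```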
I/O Format
This year the I/O format needs to change so that all triads and quads can be evaluated. We are planning to use the format suggested by Christopher Harte [6]. The chord root is given as a natural (A|B|C|D|E|F|G) followed by optional sharp or flat modifiers (#|b). For the evaluation process we may assume enharmonic equivalence for chord roots. For a given chord type on root X, the chord labels can be given as a list of intervals or as a shorthand notation, as shown in the following table:
NAME | INTERVALS | SHORTHAND |
---|---|---|
major | X:(1,3,5) | X or X:maj |
minor | X:(1,b3,5) | X:min |
diminished | X:(1,b3,b5) | X:dim |
augmented | X:(1,3,#5) | X:aug |
suspended4 | X:(1,4,5) | X:sus4 |
possible 6th triad: | ||
suspended2 | X:(1,2,5) | X:sus2 |
*Quads: | ||
major-major7 | X:(1,3,5,7) | X:maj7 |
major-minor7 | X:(1,3,5,b7) | X:7 |
major-add9 | X:(1,3,5,9) | X:maj(9) |
major-major7-#5 | X:(1,3,#5,7) | X:aug(7) |
minor-major7 | X:(1,b3,5,7) | X:min(7) |
minor-minor7 | X:(1,b3,5,b7) | X:min7 |
minor-add9 | X:(1,b3,5,9) | X:min(9) |
minor 7/b5 (ambiguous - could be either of the following) | ||
minor-major7-b5 | X:(1,b3,b5,7) | X:dim(7) |
minor-minor7-b5 (a half diminished-7th) | X:(1,b3,b5,b7) | X:hdim7 |
sus4-major7 | X:(1,4,5,7) | X:sus4(7) |
sus4-minor7 | X:(1,4,5,b7) | X:sus4(b7) |
omitted from list on wiki: | ||
diminished7 | X:(1,b3,b5,bb7) | X:dim7 |
No Chord | | N |
However, we still accept participants who would only like to be evaluated on major/minor and want to use last year's format, which is an integer chord id in the range 0-24, where values 0-11 denote C major, C# major, ..., B major, values 12-23 denote C minor, C# minor, ..., B minor, and 24 denotes silence or no-chord segments.
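For participants using the integer format, the mapping to the new labels could look roughly like the following Python sketch. The function name `legacy_id_to_label` is hypothetical, and using sharp spellings for all black-key roots is an assumption based on the "C, C#, ..., B" ordering above.

```python
# Sharp spellings are an assumption; the text above only lists "C, C#, ..., B".
PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def legacy_id_to_label(chord_id: int) -> str:
    """Map an integer chord id (0-24) to a label in the new syntax."""
    if not 0 <= chord_id <= 24:
        raise ValueError(f"chord id out of range 0-24: {chord_id}")
    if chord_id == 24:
        return "N"  # silence or no-chord
    root = PITCH_NAMES[chord_id % 12]
    quality = "maj" if chord_id < 12 else "min"
    return f"{root}:{quality}"


print(legacy_id_to_label(13))  # C#:min
```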
Evaluation
Algorithms should output text files with a similar format to that used in the ground truth transcriptions. That is to say, they should be flat text files with chord segment labels and times arranged thus:
start_time end_time chord_label
with elements separated by whitespace, times given in seconds, chord labels corresponding to the syntax described in [6] and one chord segment per line.
Please note that two things have changed in the syntax since it was originally described in [6]. The first change is that the root is no longer implied as a voiced element of a chord so a C major chord (notes C, E and G) should be written C:(1,3,5) instead of just C:(3,5) if using the interval list representation. As before, the labels C and C:maj are equivalent to C:(1,3,5). The second change is that the shorthand label "sus2" (intervals 1,2,5) has been added to the available shorthand list.--Chrish 17:05, 9 September 2009 (UTC)
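To illustrate the first change, the Python sketch below rewrites an interval list that relies on the old implied root into the explicit form. The helper name `add_explicit_root` is hypothetical. Note the assumption: a list without a "1" is always treated as old-style, so a label that deliberately omits the root under the new syntax would be altered by this helper.

```python
def add_explicit_root(label: str) -> str:
    """Rewrite an old-style interval list such as 'C:(3,5)' as 'C:(1,3,5)'.

    Shorthand labels ('C', 'C:maj', 'C:min7', ...) are returned unchanged.
    """
    root, sep, rest = label.partition(":")
    if not sep or not rest.startswith("("):
        return label  # bare root or shorthand: nothing to do
    close = rest.index(")")
    intervals = [i.strip() for i in rest[1:close].split(",") if i.strip()]
    if "1" not in intervals:
        intervals.insert(0, "1")  # make the root explicit
    # Keep anything after the closing bracket (e.g. a bass note) as it is.
    return f"{root}:({','.join(intervals)}){rest[close + 1:]}"


print(add_explicit_root("C:(3,5)"))  # C:(1,3,5)
```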
Segmentation Score
The segmentation score will be calculated using the directional Hamming distance as described in [8]. An over-segmentation value (m) and an under-segmentation value (f) will be calculated, and the final segmentation score will be calculated using the worst case of these two, i.e.:
segmentation score = 1 - max(m,f)
m and f are not independent of each other so combining them this way ensures that a good score in one does not hide a bad score in the other. The combined segmentation score can take values between 0 and 1 with 0 being the worst and 1 being the best result.--Chrish 17:05, 9 September 2009 (UTC)
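A minimal Python sketch of this calculation follows. It uses one common formulation of the directional Hamming distance (for each segment in one segmentation, the duration not covered by its best-overlapping segment in the other); see [8] for the authoritative definition. The function names are hypothetical, and the sketch assumes both segmentations cover the same time span and normalises by the annotation duration. Because the final score takes the maximum of the two directions, it does not depend on which direction is labelled m and which f.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds


def directional_hamming(reference: List[Segment], other: List[Segment]) -> float:
    """Sum, over segments in `reference`, of the duration not covered by
    the single best-overlapping segment in `other`."""
    total = 0.0
    for r_start, r_end in reference:
        best_overlap = 0.0
        for o_start, o_end in other:
            overlap = min(r_end, o_end) - max(r_start, o_start)
            best_overlap = max(best_overlap, overlap)
        total += (r_end - r_start) - best_overlap
    return total


def segmentation_score(annotation: List[Segment], estimate: List[Segment]) -> float:
    """Return 1 - max of the two normalised directional distances."""
    duration = sum(end - start for start, end in annotation)
    if duration <= 0:
        raise ValueError("annotation has no duration")
    one_way = directional_hamming(annotation, estimate) / duration
    other_way = directional_hamming(estimate, annotation) / duration
    return 1.0 - max(one_way, other_way)


# Two annotated segments merged into one estimated segment.
print(segmentation_score([(0.0, 2.0), (2.0, 4.0)], [(0.0, 4.0)]))  # 0.5
```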
Frame-based recall
For recall evaluation, we may define a different chord dictionary for each level of evaluation (dyads, triads, tetrads etc). Each dictionary is a text file containing chord shorthands / interval lists of the chords that will be considered in that evaluation. The following dictionaries are proposed:
For dyad comparison of major/minor chords only:
N
X:maj
X:min
For comparison of standard triad chords:
N
X:maj
X:min
X:aug
X:dim
X:sus2
X:sus4
For comparison of tetrad (quad) chords:
N
X:maj
X:min
X:aug
X:dim
X:sus2
X:sus4
X:maj7
X:7
X:maj(9)
X:aug(7)
X:min(7)
X:min7
X:min(9)
X:dim(7)
X:hdim7
X:sus4(7)
X:sus4(b7)
X:dim7
For each evaluation level, the ground truth annotation is compared against the dictionary. Any chord label not belonging to the current dictionary will be replaced with an "X" in a local copy of the annotation and will not be included in the recall calculation.
Note that the level of comparison in terms of intervals can be varied. For example, in a triad evaluation we can consider the first three component intervals in the chord so that a major (1,3,5) and a major7 (1,3,5,7) will be considered the same chord. For a tetrad (quad) evaluation, we would consider the first 4 intervals so major and major7 would then be considered to be different chords.
For the maj/min evaluation (using the first example dictionary), using an interval comparison of 2 (dyad) will compare only the first two intervals of each chord label. This would map augmented and diminished chords to major and minor respectively (and any other symbols that had a major 3rd or minor 3rd as their first interval). Using an interval comparison of 3 with the same dictionary would keep only those chords that have major and minor triads as their first 3 intervals so augmented and diminished chords would be removed from the evaluation.
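The effect of the interval comparison depth can be sketched in Python as follows. This is only an illustration: it covers a small subset of the shorthand table, works on chord types without roots, and the names `same_chord_type` and `filter_label` are hypothetical.

```python
# Subset of the shorthand table above, as interval tuples.
SHORTHAND_INTERVALS = {
    "maj": ("1", "3", "5"),
    "min": ("1", "b3", "5"),
    "aug": ("1", "3", "#5"),
    "dim": ("1", "b3", "b5"),
    "maj7": ("1", "3", "5", "7"),
    "min7": ("1", "b3", "5", "b7"),
}


def same_chord_type(a: str, b: str, n_intervals: int) -> bool:
    """Compare two chord types on their first `n_intervals` intervals."""
    return SHORTHAND_INTERVALS[a][:n_intervals] == SHORTHAND_INTERVALS[b][:n_intervals]


def filter_label(shorthand: str, dictionary: list, n_intervals: int) -> str:
    """Return the dictionary entry the chord type maps to, or 'X' if none."""
    for entry in dictionary:
        if same_chord_type(shorthand, entry, n_intervals):
            return entry
    return "X"


maj_min = ["maj", "min"]
print(filter_label("aug", maj_min, 2))   # maj (dyad comparison)
print(filter_label("aug", maj_min, 3))   # X (removed in triad comparison)
print(filter_label("maj7", maj_min, 3))  # maj
```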
After the annotation has been "filtered" using a given dictionary, it can be compared against the machine generated estimates output by the algorithm under test. The chord sequences described in the annotation and estimate text files are sampled at a given frame rate (in this case 10ms per frame) to give two sequences of chord frames which may be compared directly with each other. For calculating a hit or a miss, the chord labels from the current frame in each sequence will be compared. Chord comparison is done by converting each chord label into an ordered list of pitch classes then comparing the two lists element by element. If the lists match to the required number of intervals then a hit is recorded, otherwise the estimate is considered a miss. It should be noted that, by converting to pitch classes in the comparison, this evaluation ignores enharmonic pitch and interval spellings so the following chords (slightly silly example just for illustration) will all evaluate as identical:
C:maj = Dbb:maj = C#:(b1,b3,#4)
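The equivalence above can be checked with a short Python sketch that converts a label to an ordered list of pitch classes. This is a simplified illustration, not the evaluation code: it handles only roots, explicit interval lists and four triad shorthands, and it ignores bass notes and added or omitted intervals. All function names are hypothetical.

```python
import re

NATURAL_PITCH_CLASS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
MAJOR_SCALE_SEMITONES = [0, 2, 4, 5, 7, 9, 11]  # degrees 1..7
SHORTHANDS = {"maj": "1,3,5", "min": "1,b3,5", "dim": "1,b3,b5", "aug": "1,3,#5"}


def accidental_shift(modifiers: str) -> int:
    """Each '#' raises by one semitone, each 'b' lowers by one."""
    return modifiers.count("#") - modifiers.count("b")


def root_pitch_class(root: str) -> int:
    match = re.fullmatch(r"([A-G])([#b]*)", root)
    if match is None:
        raise ValueError(f"bad root: {root!r}")
    return (NATURAL_PITCH_CLASS[match.group(1)] + accidental_shift(match.group(2))) % 12


def interval_semitones(interval: str) -> int:
    match = re.fullmatch(r"([#b]*)(\d+)", interval)
    if match is None:
        raise ValueError(f"bad interval: {interval!r}")
    octaves, step = divmod(int(match.group(2)) - 1, 7)
    return 12 * octaves + MAJOR_SCALE_SEMITONES[step] + accidental_shift(match.group(1))


def pitch_classes(label: str) -> list:
    """Convert e.g. 'C:maj' or 'C#:(b1,b3,#4)' to ordered pitch classes."""
    root, _, quality = label.partition(":")
    quality = quality or "maj"  # a bare root means a major chord
    if quality.startswith("("):
        intervals = quality.strip("()")
    else:
        intervals = SHORTHANDS[quality]
    base = root_pitch_class(root)
    return [(base + interval_semitones(i)) % 12 for i in intervals.split(",")]


for label in ("C:maj", "Dbb:maj", "C#:(b1,b3,#4)"):
    print(label, pitch_classes(label))  # each prints [0, 4, 7]
```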
Basic recall calculation algorithm:
1) filter annotated transcription using chord dictionary for a defined number of intervals
2) sample annotated transcription and machine estimated transcription at 10ms intervals to create a sequence of annotation frames and estimate frames
3) start at the first frame
4) get chord label for current annotation frame and estimate frame
5) check annotation label:
   IF symbol is 'X' (i.e. non-dictionary)
   THEN ignore frame (record number of ignored frames)
   ELSE compare annotated/estimated chords for the predefined number of intervals
        increment hit count if chords match
   ENDIF
6) increment frame count
7) go back to 4 until final chord frame --Chrish 17:05, 9 September 2009 (UTC)
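The steps above can be sketched in Python as follows. This is a simplified illustration rather than the evaluation code: it assumes the annotation has already been filtered (step 1), samples each frame at its start time, treats uncovered time as "N", and by default compares labels as plain strings instead of pitch-class lists. Computing recall as hits divided by the non-ignored frames is an assumption consistent with ignored frames being excluded from the calculation. All names are hypothetical.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float, str]  # (start_time, end_time, chord_label)


def label_at(segments: List[Segment], t: float, default: str = "N") -> str:
    """Return the label of the segment containing time t (seconds)."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return default


def count_frames(
    annotation: List[Segment],
    estimate: List[Segment],
    frame_ms: int = 10,
    chords_match: Callable[[str, str], bool] = lambda a, e: a == e,
) -> Tuple[int, int, int]:
    """Return (hits, ignored, frames) over the annotated duration."""
    duration_ms = round(annotation[-1][1] * 1000)
    hits = ignored = frames = 0
    for frame_start_ms in range(0, duration_ms, frame_ms):
        t = frame_start_ms / 1000.0
        annotated = label_at(annotation, t)
        estimated = label_at(estimate, t)
        if annotated == "X":
            ignored += 1  # non-dictionary chord: excluded from recall
        elif chords_match(annotated, estimated):
            hits += 1
        frames += 1
    return hits, ignored, frames


def recall(hits: int, ignored: int, frames: int) -> float:
    counted = frames - ignored
    return hits / counted if counted else 0.0


annotation = [(0.0, 1.0, "C:maj"), (1.0, 2.0, "X"), (2.0, 3.0, "G:maj")]
estimate = [(0.0, 1.5, "C:maj"), (1.5, 3.0, "A:min")]
print(recall(*count_frames(annotation, estimate)))  # 0.5
```

A pitch-class comparison limited to a given number of intervals could be passed in through the `chords_match` parameter.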
Submission Format
Submissions have to conform to the specified format below:
extractFeaturesAndTrain "/path/to/trainFileList.txt" "/path/to/scratch/dir"
where trainFileList.txt lists the path to each WAV file. Features extracted at this stage can be stored under "/path/to/scratch/dir". The ground truth files for supervised learning will be in the same path, with a ".txt" extension appended. For example, for "/path/to/trainFile1.wav" there will be a corresponding ground truth file called "/path/to/trainFile1.wav.txt".
For testing:
doChordID.sh "/path/to/testFileList.txt" "/path/to/scratch/dir" "/path/to/results/dir"
If there is no training, you can ignore the second argument here. The results directory should contain one file for each test file, with the same name as the test file plus ".txt".
Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.
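As a small illustration of the naming rule, the Python sketch below derives the result file path for a test file. The helper name `result_path` is hypothetical, and keeping the ".wav" part of the name before ".txt" is an assumption made by analogy with the ground truth file naming.

```python
import posixpath


def result_path(test_file: str, results_dir: str) -> str:
    """Return results_dir/<test file name>.txt for a given test file."""
    return posixpath.join(results_dir, posixpath.basename(test_file) + ".txt")


print(result_path("/path/to/testFile1.wav", "/path/to/results/dir"))
```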
Discussions for 2010
Discussions from 2009
https://www.music-ir.org/mirex/2009/index.php/Audio_Chord_Detection#Discussions
Potential Participants
Your name here
Bibliography
1. Harte, C.A. and Sandler, M.B. (2005). Automatic chord identification using a quantised chromagram. Proceedings of 118th Audio Engineering Society's Convention.
2. Sailer, C. and Rosenbauer, K. (2006). A bottom-up approach to chord detection. Proceedings of International Computer Music Conference 2006.
3. Shenoy, A. and Wang, Y. (2005). Key, chord, and rhythm tracking of popular music recordings. Computer Music Journal 29(3), 75-86.
4. Sheh, A. and Ellis, D.P.W. (2003). Chord segmentation and recognition using EM-trained hidden Markov models. Proceedings of 4th International Conference on Music Information Retrieval.
5. Yoshioka, T. et al. (2004). Automatic chord transcription with concurrent recognition of chord symbols and boundaries. Proceedings of 5th International Conference on Music Information Retrieval.
6. Harte, C., Sandler, M., Abdallah, S. and Gómez, E. (2005). Symbolic representation of musical chords: a proposed syntax for text annotations. Proceedings of 6th International Conference on Music Information Retrieval.
7. Papadopoulos, H. and Peeters, G. (2007). Large-scale study of chord estimation algorithms based on chroma representation and HMM. Proceedings of 5th International Conference on Content-Based Multimedia Indexing.
8. Abdallah, S., Noland, K., Sandler, M., Casey, M. and Rhodes, C. (2005). Theory and evaluation of a Bayesian music structure extractor. Proceedings of 6th International Conference on Music Information Retrieval, pp. 420-425.