2009:Audio Chord Detection
Introduction
For many applications in music information retrieval, extracting the harmonic structure is highly desirable, for example for segmenting pieces into characteristic sections, for finding similar pieces, or for the semantic analysis of music.
The extraction of the harmonic structure requires the detection of as many chords as possible in a piece. This includes the characterisation of each chord by its tonic and type as well as its temporal position, given by onset and duration.
Although some publications are available on this topic [1,2,3,4,5], comparing the results is difficult because different measures are used to assess the performance. To overcome this problem, an accurately defined methodology is needed. This includes a repertoire of detectable chords, a defined test set along with ground truth, and unambiguous calculation rules to measure the performance.
For these reasons we suggest introducing the new evaluation task Audio Chord Detection.
The deadline for this task is August 22nd.
Data
As this task is intended for music information retrieval, the analysis should be performed on real-world audio, not resynthesized MIDI or special renditions of single chords. We suggest that the test bed consist of WAV files in CD quality (sampling rate of 44.1 kHz, resolution of 16 bits). A representative test bed should consist of more than 50 songs from different genres such as pop, rock, jazz and so on.
For each song in the test bed, a ground truth is needed. This should comprise all detectable chords in the piece with their tonic, type and temporal position (onset and duration) in a machine-readable format that is still to be specified.
To define the ground truth, a set of detectable chords has to be identified. We propose to use the following set of chords built upon each of the twelve semitones.
Triads: major, minor, diminished, augmented, suspended4
Quads: major-major 7, major-minor 7, major add9, major maj7/#5, minor-major 7, minor-minor 7, minor add9, minor 7/b5, maj7/sus4, 7/sus4
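For illustration, a minimal Python sketch that enumerates this vocabulary; the label spellings below are our own assumptions, not a fixed standard:

# Sketch: enumerate the proposed chord vocabulary over all twelve semitones.
# Root and type spellings are illustrative assumptions, not a fixed standard.
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
TRIADS = ["maj", "min", "dim", "aug", "sus4"]
QUADS = ["maj7", "7", "maj(add9)", "maj7/#5",
         "min(maj7)", "min7", "min(add9)", "min7/b5",
         "maj7/sus4", "7/sus4"]

VOCABULARY = [f"{root}:{quality}"
              for root in ROOTS for quality in TRIADS + QUADS]
print(len(VOCABULARY))  # 12 roots x 15 types = 180 chord labels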
An approach for text annotation of musical chords is presented in [6].
We could contribute excerpts of approximately 30 pop and rock songs including a ground truth.
Evaluation
Two common measures from the field of information retrieval are recall and precision. They can be used to evaluate a chord detection system.
Recall: number of time units where the chords have been correctly identified by the algorithm divided by the number of time units which contain detectable chords in the ground truth.
Precision: number of time units where the chords have been correctly identified by the algorithm divided by the total number of time units where the algorithm detected a chord event.
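As a minimal sketch of these two measures in Python, assuming chord labels sampled on a fixed grid of time units of equal length, with None marking units without a chord (all names here are our own):

def recall_precision(reference, detected):
    """Time-unit recall and precision for chord detection.

    reference: ground-truth chord label per time unit (None = no detectable chord)
    detected:  detected chord label per time unit (None = no chord event reported)
    Both lists are assumed to cover the same time grid.
    """
    correct = sum(1 for ref, det in zip(reference, detected)
                  if ref is not None and ref == det)
    relevant = sum(1 for ref in reference if ref is not None)  # detectable chords
    reported = sum(1 for det in detected if det is not None)   # detected chord events
    recall = correct / relevant if relevant else 0.0
    precision = correct / reported if reported else 0.0
    return recall, precision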
Points to discuss:
- Are the measures mentioned above sufficient to evaluate the algorithms? In particular: Can an algorithm which achieves high precision and recall on many time units, but has an otherwise "jagged" output (i.e. is wrong often, but for a short time) be considered as good as a smoother one with equal precision and recall?
- Should chord data be expressed in absolute (e.g. "F major-minor 7") or relative (e.g. "C: IV major-minor 7") terms?
- Should different inversions of chords be considered in the evaluation process?
- What temporal resolution should be used for ground truth and results?
- How should enharmonic and other confusions of chords be handled?
- How will Ground Truth be determined?
- What degree of chordal/tonal complexity will the music contain?
- Will we include any atonal or polytonal music in the Ground Truth dataset?
- What is the maximal acceptable onset deviation between ground truth and result?
- What file format should be used for ground truth and output?
Submission Format
Submissions have to conform to the format specified below:
extractFeaturesAndTrain "/path/to/trainFileList.txt" "/path/to/scratch/dir"
Here trainFileList.txt contains the path to each WAV file, one per line. The features extracted at this stage can be stored under "/path/to/scratch/dir". The ground truth files for the supervised learning will be in the same path, with a ".txt" extension appended. For example, for "/path/to/trainFile1.wav" there will be a corresponding ground truth file called "/path/to/trainFile1.wav.txt".
For testing:
doChordID.sh "/path/to/testFileList.txt" "/path/to/scratch/dir" "/path/to/results/dir"
If there is no training, you can ignore the second argument here. In the results directory, there should be one file for each test file, with the same name as the test file plus ".txt". The results files should be structured as described by Matti below.
Programs can use their working directory if they need to keep temporary cache files or internal debugging info. Stdout and stderr will be logged.
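To make the calling convention concrete, here is a hypothetical Python skeleton of a testing run; the chord labeller is a placeholder, and only the I/O layout follows the convention above and the file naming agreed in the discussion below:

#!/usr/bin/env python
# Hypothetical skeleton of the doChordID calling convention described above.
# label_chords is a placeholder; a real system returns (onset seconds, chord id)
# pairs in the output format proposed by Matti in the discussion below.
import os
import sys

def label_chords(wav_path):
    # Placeholder: always reports a single no-chord segment (id 24).
    return [(0.0, 24)]

def main(test_list, scratch_dir, results_dir):
    with open(test_list) as f:
        wav_paths = [line.strip() for line in f if line.strip()]
    for wav_path in wav_paths:
        segments = label_chords(wav_path)
        # One output file per test file: <testfile>.wav.txt in the results dir.
        out_path = os.path.join(results_dir, os.path.basename(wav_path) + ".txt")
        with open(out_path, "w") as out:
            for onset, chord_id in segments:
                out.write(f"{onset:.3f} {chord_id}\n")

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], sys.argv[3])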
Potential Participants
- H.Papadopoulos (papadopo@ircam.fr)
- Jan Weil (weil@nue.tu-berlin.de), Jean-Louis Durrieu (durrieu@enst.fr)
- Markus Mehnert, Gabriel Gatzsche (markus.mehnert@tu-ilmenau.de, gze@idmt.fraunhofer.de)
- Yuki Uchiyama(uchiyama@hil.t.u-tokyo.ac.jp)
- Matti Ryynänen and Anssi Klapuri (Tampere University of Technology), matti.ryynanen <at> tut.fi, anssi.klapuri <at> tut.fi
- Xinglin Zhang and Colan Lash (University of Regina, zhang46x@uregina.ca, Lash111c@uregina.ca)
- Alexey Egorov (alexey@cbmsnetworks.com)
- Dan Ellis (dpwe@ee.columbia.edu)
- Maksim Khadkevich (khadkevich <_at_> fbk.eu)
- Juan P. Bello (jpbello@nyu.edu)
- Kyogu Lee (klee@gracenote.com)
- Johan Pauwels (Ghent University, Belgium) johan.pauwels<sp@m>elis.ugent.be
Bibliography
1. Harte, C. A. and Sandler, M. B. (2005). Automatic chord identification using a quantised chromagram. Proceedings of the 118th Audio Engineering Society Convention.
2. Sailer, C. and Rosenbauer, K. (2006). A bottom-up approach to chord detection. Proceedings of the International Computer Music Conference 2006.
3. Shenoy, A. and Wang, Y. (2005). Key, chord, and rhythm tracking of popular music recordings. Computer Music Journal 29(3), 75-86.
4. Sheh, A. and Ellis, D. P. W. (2003). Chord segmentation and recognition using EM-trained hidden Markov models. Proceedings of the 4th International Conference on Music Information Retrieval.
5. Yoshioka, T. et al. (2004). Automatic chord transcription with concurrent recognition of chord symbols and boundaries. Proceedings of the 5th International Conference on Music Information Retrieval.
6. Harte, C., Sandler, M., Abdallah, S. and Gómez, E. (2005). Symbolic representation of musical chords: a proposed syntax for text annotations. Proceedings of the 6th International Conference on Music Information Retrieval.
7. Papadopoulos, H. and Peeters, G. (2007). Large-scale study of chord estimation algorithms based on chroma representation and HMM. Proceedings of the 5th International Conference on Content-Based Multimedia Indexing.
Comments/Discussion
Matti's comments (August 6, 2008)
Hi all, I'm glad to see several potential participants for this new task. Should we start to decide the output format for the submissions?
Although it would be nice to have a large set of possible chords in the evaluation, I suggest performing the evaluation based on triads alone, perhaps even only on major and minor triads if the evaluation data is suitable for this. In this case, the output format could simply use integer chord identifiers in the range 0-23 (the twelve major and the twelve minor triads).
Kyogu's comments (August 7, 2008)
I'm with Matti on reducing the types of chords to triads only, or even to major/minor triads. Since this is going to be the first chord detection task (if it ever happens), we can start with a simple task and later move on to more difficult tasks after analyzing the results. As to the output format, we had better stick to absolute times in (milli)seconds rather than frame numbers, since frames will vary from algorithm to algorithm.
Xinglin Zhang's comments (August 7, 2008)
Matti's suggestion sounds good. But don't forget that some parts of a song do not have a chord (for example, at the beginning of the song where there is nothing, or where there are only drums going on). Thus 0-23 is not enough. We can introduce a 24 for no-chord, or we can use the representation given by reference [6] on this page, if you have read it. Let's determine the format as soon as possible. BTW, I sent Kyogu an email a couple of weeks ago on some problems regarding chord recognition, but I received no response. I am not sure whether you got that email or not. Thanks.
Matti's comments (August 8, 2008)
Hi all, thanks for your comments. Based on the above discussion, I suggest the following submission format for this task. Input: audio filename. Output: filename of a text file where the analysis results are written in the following format:
<onset time 1><whitespace><chord id 1>
<onset time 2><whitespace><chord id 2>
...
<onset time n><whitespace><chord id n>

where chord id is an integer in the range 0-24: values 0-11 denote C major, C# major, ..., B major; values 12-23 denote C minor, C# minor, ..., B minor; and 24 denotes silence or no-chord segments. Onset times are given in seconds. Output example for the chord sequence no-chord (silence), C major, D minor, G major:

0.000 24
2.416 0
6.347 14
9.123 7
I guess, however, that the participants also have methods which produce a chord label for each analysis frame without onset detection. Such submissions can use this format by labeling each frame:
0.000 24
0.010 24
...
2.410 24
2.420 0
2.430 0
...
If the onset detection is not considered this year, then the evaluation should give similar results for both formats.
An obvious evaluation metric would simply measure the time proportion of correct overlapping chord ids over the audio file. For example, given the reference (correct) chords of 10 seconds of audio:
0.000 0
5.000 7
and an analysis result file of
0.000 0
4.000 7
8.000 20
This would give a result of 70% (C major correct for 4 s, G major correct for 3 s, altogether 7 seconds correct out of 10 seconds of audio). Error analysis could also be carried out, for example by measuring the distance between the reference and the labeled chord on the circle of fifths.
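A minimal Python sketch of this overlap metric, reading files in the format above; the function and variable names are illustrative, and the total duration of the audio is assumed to be known:

def read_segments(path, total_duration):
    # Read "<onset> <chord id>" lines into (start, end, chord_id) segments;
    # each segment ends where the next begins, the last at total_duration.
    onsets, ids = [], []
    with open(path) as f:
        for line in f:
            onset, chord_id = line.split()
            onsets.append(float(onset))
            ids.append(int(chord_id))
    return list(zip(onsets, onsets[1:] + [total_duration], ids))

def overlap_score(reference, result):
    # Proportion of the total time where reference and result carry
    # the same chord id.
    total = max(end for _, end, _ in reference)
    correct = 0.0
    for r_start, r_end, r_id in reference:
        for e_start, e_end, e_id in result:
            if r_id == e_id:
                correct += max(0.0, min(r_end, e_end) - max(r_start, e_start))
    return correct / total

# The worked example above, with segments as produced by read_segments:
reference = [(0.0, 5.0, 0), (5.0, 10.0, 7)]               # C major, G major
result = [(0.0, 4.0, 0), (4.0, 8.0, 7), (8.0, 10.0, 20)]  # C, G, G# minor
print(overlap_score(reference, result))  # 0.7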
Any suggestions or corrections to this?
Kyogu's comments (August 10, 2008)
Hi Matti, thanks much for your suggestions. Everything sounds fine, but I'd like to suggest a "don't care" label for which the evaluation is not performed. This corresponds to an "N" (non-chord) in Harte and Sandler's annotations. That way, we don't need to have a silence or percussion-only detector. This doesn't require the output format to be changed; it only matters when evaluating. I'm also fine with giving partial scores for related chords. Thanks.
Gene's comment (August 12, 2008)
Matti -- we use exactly the time-based precision metrics you outlined above (time during which the correct chord is on, including silence (24), divided by the song's total time). We have a free Winamp plugin that does chord detection and even mixes the detected triads into the original song in real time (good for ear testing).
Dan's comments (August 13, 2008)
I'm interested in separating the effects of algorithm/representation and training data. I think it's interesting to see what systems can do "the best", but to do that fairly you need to evaluate on data which hasn't been used in developing the system, which excludes Chris Harte's data. But having that high-quality data available raises the possibility of defining a task using that data (e.g. train/test folds) and having everyone run on the same, open data.
I've put together a simple baseline system that uses 4-fold cross-validation on the Harte data (i.e. 9 albums train / 3 albums test in 4 cuts). I've also generated beat-synchronous chroma features for the entire 180 track set (as well as 32 kbps MP3 audio). I'm thinking about releasing this, but it seems too late to be useful.
I see two different paradigms: one is a system that takes training data, learns models for the labeled chords, then takes test data and labels it with the same classes. The second is a system that is submitted including pre-learned models and only processes the test data. I guess the second kind will be easier for more people to provide, but it confounds the training data used with the algorithmic approach, which limits how much we learn from the evaluation.
Mert's comments (August 13, 2008)
Hi Matti, the I/O format you suggested is fine; we can use it if there are no objections. We should take a vote on Dan's suggestions about training.
Matti's comments (August 14, 2008)
Hi all, I agree with Dan that N-fold training/testing will make the evaluation more informative. However, my submission is already fixed in a C++ implementation (yes, the models have been trained with the Beatles data) and I'm not sure if I have enough time to prepare the learning code as well. Two subtasks, pre-trained and N-fold testing, would sound good to me.
Xinglin Zhang's comments (August 14, 2008)
I am trying a deterministic approach which does not need training. Training and learning should be a better way, because even if we know nothing about the data, the algorithm can learn it from training (assuming, for parametric learning methods, that our assumed feature distribution accords with the actual distribution). But as long as we know the details of the data (chord composition theory, for this problem), we can set up some rules according to music theory and filter the features based on those rules. I have not finished this implementation yet. I am not sure whether I will be able to have it done before the 18th (is this the deadline?), not to mention writing the code that uses training. Everyone seems to be hoping the arrangements fit his or her implementation well. Ha~ I am with the two-subtasks suggestion (I made a wrong poll entry below ^_^). BTW, is everyone with Matti's suggestion on the vocabulary that contains only major and minor triads?
Juan's comments (August 14, 2008)
Matti's suggestion for the output file format seems good to me (should output files simply be called audiofilename.txt?). May I suggest using an audio list rather than a single filename as input? In this way everyone gets to code their own batch processing function. Also, where should we write the output files to? How about creating an "output" subfolder in our algorithm's folder? I am supportive of including both train/test and test-only evaluations. My system does not include a supervised training stage in any case. However, I must point out that system parameters have been optimized using the Beatles data, which I guess is common to many chord ID systems out there and an unavoidable shortcoming of this first version of the chord ID MIREX task. Focusing on major and minor triads and non-chords seems OK to me (this is all my system can do anyway). However, if there is enough quorum for other chords (augmented, diminished and quads), I think we should include a separate evaluation for those. Finally, I am in support of assessing errors according to chord confusions (e.g. relative major/minor, parallel major/minor, dominant/sub-dominant), but I think the main evaluation metric should be the true positive rate, as originally suggested by Matti. Thanks everyone for putting this together!
Mert's comment (August 15, 2008)
I have just updated the submission format. Please follow it in your submissions. We are going to have 2 subtasks: one with 3-fold train/test (2/3 of the data for training, 1/3 for testing, 3 times) and another with pretrained systems. In the README file of your submission, please state clearly which of your programs is running for which task. A system submitted for the train/test subtask should not have any pre-tuning or training on the dataset. I would also like to get an idea of how fast the systems are compared to real time; please email me this info along with the type of machine on which you tested the speed. There are around 180 songs. If performance is an issue, we might select a subset of the data, at least for the train/test subtask.
Xinglin Zhang's comment (August 18, 2008)
I'd like to know, for the 3-fold train/test subtask, will you provide labeled ground truth for supervised training? And will the format of the training files follow Matti's suggestion?
Dan Ellis: Clarifying file names (August 19, 2008)
Following Xinglin's comment, can I assume that if the training file list includes /some/path/file1.wav, that I will also be able to find /some/path/file1.txt as the training labels, also in the 0-24 integer label format?
Also, when we write an output file, is it to /path/to/output/file1.txt (not /path/to/output/file1.wav.txt)?
Mert's comments (August 20, 2008)
Thanks for noting this. Let's keep everything simple. The ground truth text files for supervised learning will be kept in the same directory, so if there is a file in the train list called "/some/path/file1.wav", there will be a corresponding ground truth file called "/some/path/file1.wav.txt" in the 0-24 integer label format. Also, the output files should be called "/path/to/output/file1.wav.txt".
Johan's comments (August 21, 2008)
Hi all, I'm a first-year PhD student at Ghent University and I'm (but mostly will be) focusing on chord and key detection. The contest comes a couple of months too early for me, as I didn't plan to have a completely integrated chord extraction program working at this stage, but I'm busy hacking together some old and new code of our group so that I can jump in at short notice.
I have one remark: our program doesn't need any training (since I didn't have time yet to implement it, but it's on the to-do list); it just uses parameters derived from music theory, so no Beatles data were used. I'll submit it to the pre-trained category, since it doesn't need training, but it's really more "not trained". To be complete, only the feature extraction step has some optimized parameters (for the pitch tracking), which we set according to our own test set (not containing Beatles), but these weren't evaluated in the context of chord detection and are supposed to be general parameters anyway (based mostly on common sense).
Also, because of time constraints I won't be able to compile for Linux, although it should be perfectly portable C++. I'll try to use static linking as much as possible (unfortunately, libsndfile's licence, for instance, prohibits this) and will send a Win32 console app. Lastly, the abstract will probably be ready by August 29th at the latest. I read in a mail that the hard deadline is only 5 days before ISMIR, so that should be OK, but can somebody confirm this?
Additional question: it's obvious that all seventh chords, ninths and higher will be reduced to their triad, but will dim, aug, sus4 and so on all be represented as "24" in the evaluation reference?
Xinglin's Comments (August 27, 2008)
I fixed some bugs in my program just now and updated the submission. The submission page allowed me to do so, but I am not sure whether that is OK with the committee?
Training Poll
<poll>
How would you like the systems to perform training?
Pre-trained
N-fold train/test setup
Both. One subtask with pre-trained systems, one with N-fold train/test setup.
No training needed
</poll>