2005:Audio and Symbolic Key
*Training Data Set*
Download (Windows) :
Download (MacOS/Linux) :
Arpi Mardirossian, Ching-Hua Chuan and Elaine Chew (University of Southern California) firstname.lastname@example.org
Evaluation of Key Finding Algorithms
Determination of the key is a prerequisite for any analysis of tonal music. As a result, extensive work has been done in the area of automatic key detection. However, among this plethora of key finding algorithms, what seems to be lacking is a formal and extensive evaluation process. We propose this first step in the evaluation of key-finding algorithms at the 2005 MIREX.
There are significant contributions in the area of key finding for both audio and symbolic representation. This evaluation process should consider algorithms in both areas. Algorithms that determine the key from audio should be robust enough to handle frequency interferences and harmonic effects caused by the use of multiple instruments.
- Steffen Pauws (Philips Eindhoven), email@example.com
- Yongwei Zhu (Institute for Infocomm Research(A*STAR)), firstname.lastname@example.org
- Ching-Hua Chuan and Elaine Chew (University of Southern California), email@example.com, firstname.lastname@example.org
- Emilia Gómez (University Pompeu Fabra), email@example.com
- Ozgur Izmirli (Connecticut College), firstname.lastname@example.org
- David Temperley (University of Rochester), email@example.com
- David Rizo (University of Alicante), firstname.lastname@example.org
- Arpi Mardirossian and Elaine Chew (University of Southern California), email@example.com, firstname.lastname@example.org
- Yongwei Zhu (Institute for Infocomm Research(A*STAR)), email@example.com
Other Potential Participants
- Tuomas Eerola (firstname.lastname@example.org) and Petri Toiviainen (email@example.com) [high]
- Ming Li (firstname.lastname@example.org) and Ronan Sleep (email@example.com) [high]
- Olli Yli-Harja (firstname.lastname@example.org), Ilya Schmulevich (email@example.com), and Kjell Lemström (firstname.lastname@example.org) [high]
- Craig Sapp (email@example.com) [moderate]
Input: Call to individual .wav or .mid files, or an ASCII file list of all files (with full paths).
Ground-truth: One ground-truth file per .wav file, in ASCII tab delimited format:
<pitch (e.g. Ab, A, A#, Bb, B, …, G#)>\t<major or minor>\n where the < and > characters are not included, \t denotes a tab, and \n denotes a new line.
Note: The framework is aware of the equivalence of certain notes and will handle the mapping internally.
Output: One output file per .wav file, in ASCII tab delimited format:
<pitch (e.g. Ab, A, A#, Bb, B, …, G#)>\t<major or minor>\n
Audio: PCM, 16-bit, 44100 Hz, single channel (mono); excerpts synthesized from MIDI
MIDI: Excerpts of MIDI files
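The ground-truth and output line layout described above can be read and written with a few lines of code. The following is a minimal sketch, not the evaluation framework's own parser: the function names are hypothetical, and the only things taken from the text are the tab-delimited layout and the idea that enharmonically equivalent spellings (for example Ab and G#) map to the same pitch class.

```python
from typing import Tuple

# Pitch classes of the natural notes, with C = 0.
NATURAL_PITCH_CLASSES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def pitch_class(name: str) -> int:
    """Map a note name such as 'Ab' or 'C#' to a pitch class in 0..11."""
    name = name.strip()
    if not name or name[0].upper() not in NATURAL_PITCH_CLASSES:
        raise ValueError(f"unrecognized pitch name {name!r}")
    pc = NATURAL_PITCH_CLASSES[name[0].upper()]
    for symbol in name[1:]:
        if symbol == "#":
            pc += 1
        elif symbol == "b":
            pc -= 1
        else:
            raise ValueError(f"unexpected accidental {symbol!r} in {name!r}")
    return pc % 12


def parse_key_line(line: str) -> Tuple[int, str]:
    """Parse one '<pitch>\\t<major or minor>\\n' line into (pitch class, mode)."""
    pitch, mode = line.rstrip("\n").split("\t")
    mode = mode.strip().lower()
    if mode not in ("major", "minor"):
        raise ValueError(f"mode must be 'major' or 'minor', got {mode!r}")
    return pitch_class(pitch), mode


def format_key_line(pitch: str, mode: str) -> str:
    """Build one output line in the same tab-delimited layout."""
    return f"{pitch}\t{mode}\n"
```

Comparing the parsed tuples rather than the raw strings means that "Ab\tmajor" and "G#\tmajor" are treated as the same answer.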
Test Set: The test set we propose to use will consist of pieces for which the keys are known. For example, symphonies and concertos by well-known composers often have the keys stated in the title of the piece. The excerpts will typically be the beginnings of the pieces, since the opening is the one part of a piece that can be relied on to establish the global, known key. Different excerpt durations will be considered: 30 seconds, 20 seconds and 10 seconds.
Input/Output: The input to the system should be some musical excerpt (either audio or MIDI) and the output should be a key name, for example C major or E flat minor. Only pitch class numbers will be taken into account during evaluation; for instance, C sharp major and D flat major will be considered equivalent.
System Calibration: The test set will be randomly split into training and test data. Training data will be provided to the participants so that they can determine the optimal settings for the parameters of their algorithms.
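A random split of this kind could be sketched as follows. The function name, the training fraction, and the seed are all assumptions made for illustration; the proposal does not state the proportions or how the split is generated.

```python
import random
from typing import List, Sequence, Tuple


def split_collection(
    items: Sequence[str], train_fraction: float, seed: int
) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the collection and cut it into (training, test)."""
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must lie between 0 and 1")
    shuffled = list(items)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names and split ratio, for illustration only.
files = [f"piece_{i:03d}.mid" for i in range(10)]
training, test = split_collection(files, train_fraction=0.2, seed=2005)
```

Because both parts are drawn from the same shuffled collection, the training pieces come from the same population as the test pieces.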
Evaluation: The error analysis will center on comparing the key identified by the algorithm to the actual key of the piece. The key of the piece is the one defined by the composer in the title of the piece. We will then determine how ‘close’ each identified key is to the corresponding correct key. Keys will be considered ‘close’ if they have one of the following relationships: distance of a perfect fifth, relative major and minor, and parallel major and minor. A correct key assignment will be given a full point, and incorrect assignments will be allocated fractions of a point according to the following table:
|Relation to correct key||Points|
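The relationships named above can be classified mechanically from pitch classes. The sketch below is one possible reading, under stated assumptions: a perfect fifth is accepted in either direction and only between keys of the same mode, and any other incorrect key scores zero. The point values for the close relationships are not reproduced on this page, so they are passed in as a parameter; the numbers in `PLACEHOLDER_POINTS` are placeholders, not the contest's values.

```python
from typing import Dict, Tuple

Key = Tuple[int, str]  # (tonic pitch class 0..11, "major" or "minor")


def relation(estimated: Key, correct: Key) -> str:
    """Name the relationship of the estimated key to the correct key."""
    est_pc, est_mode = estimated
    ref_pc, ref_mode = correct
    interval = (est_pc - ref_pc) % 12  # semitones above the correct tonic
    if est_mode == ref_mode:
        if interval == 0:
            return "same"
        # Assumption: a fifth above (7) or below (5) both count.
        if interval in (5, 7):
            return "perfect fifth"
        return "other"
    if interval == 0:
        return "parallel"
    # The relative minor lies 3 semitones below its relative major.
    if ref_mode == "major" and interval == 9:
        return "relative"
    if ref_mode == "minor" and interval == 3:
        return "relative"
    return "other"


def score(estimated: Key, correct: Key, points: Dict[str, float]) -> float:
    """Look up the points for the relationship; unlisted relations score 0."""
    return points.get(relation(estimated, correct), 0.0)


# Placeholder values for illustration only; substitute the table's values.
PLACEHOLDER_POINTS = {
    "same": 1.0,
    "perfect fifth": 0.5,
    "relative": 0.5,
    "parallel": 0.5,
}
```

For example, with C major as the correct key, `relation((9, "minor"), (0, "major"))` classifies A minor as the relative key, and `relation((7, "major"), (0, "major"))` classifies G major as a perfect fifth away.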
Comments: Many excellent suggestions were made in the review process. Some of the ideas included: using actual audio files from recordings for the audio portion of the contest, employing other metrics used in information retrieval literature, using test data from a wider variety of genres, and considering the detection of key modulations.
As this is a first attempt at evaluating key-finding across different systems employing a variety of algorithm combinations, we have opted to keep the evaluation procedure as simple and streamlined as possible. The results of this contest will lay the groundwork from which we can expand the techniques for key-finding evaluation.
Relevant Test Collections
Symbolic Data: The dataset contains 500 classical music MIDI files selected from the Classical Music Archives (http://www.classicalarchives.com) and labelled with the key stated in their title.
Examples of pieces include, but are not limited to, the following:
Pieces from the Baroque period: Bach (http://www.classicalarchives.com/bach.html) – Keyboard Works, Chamber Works, and Orchestral Works. Vivaldi (http://www.classicalarchives.com/vivaldi.html) – Concerti and Chamber Works.
Pieces from the Classical period: Handel (http://www.classicalarchives.com/handel.html) – Orchestral Works, Keyboard Works, and Chamber Works. Haydn (http://www.classicalarchives.com/haydn.html) – Keyboard Works, Chamber Works, and Orchestral Works. Mozart (http://www.classicalarchives.com/mozart.html) – Keyboard Works, Symphonies and Concertos, and Chamber Works. Early Beethoven (http://www.classicalarchives.com/beethovn.html) – Piano Works, Symphonies, Concertos, and Chamber Works.
Pieces from the Romantic period: Late Beethoven (http://www.classicalarchives.com/beethovn.html) – Piano Works, Symphonies, Concertos, and Chamber Works. Brahms (http://www.classicalarchives.com/brahms.html) – Keyboard Works, Chamber Works, Concertos and Orchestral Works. Chopin (http://www.classicalarchives.com/chopin.html) – Piano Works.
Audio Data: The dataset contains the same pieces synthesized from MIDI to CD-quality (16-bit, 44100 Hz, mono) WAV files using various software MIDI synthesizers (Winamp, Cakewalk, etc.). The synthesizer for each piece was selected randomly.
By using the same data for both the symbolic and audio key-finding methods, we will be able to evaluate and compare both approaches. It should be noted that even though synthesized MIDI is a simple alternative to actual audio, it is an appropriate approach for an evaluation where we are considering both audio and symbolic algorithms. Also, this controlled method eliminates possible tuning issues that are sometimes present in recorded audio.
The proposals contemplate two different evaluations for key estimation: one for MIDI and another for audio data. Maybe these two proposals could be merged into a single one. At least part of the data could be shared between them, for example by having a test collection that includes audio data and its MIDI representation, or a MIDI representation and the audio generated from it by a MIDI synthesizer. This way, we could evaluate and compare approaches dealing with MIDI and audio.
[Arpi 02.08.05]: We agree with this and believe that the best approach would be to synthesize audio data from MIDI.
Regarding the key estimation contest from audio data, it seems that only classical music is considered. Would it be possible to generalize to other styles, for instance popular music whose key is known?
[Arpi 02.08.05]: Having test data from a variety of genres would be ideal. The advantage of classical music is that many pieces are labeled with the key name. We welcome suggestions on finding labeled music in other genres.
[Hendrik 02.26.05]: Key finding only makes sense for music of major/minor tonality. Some music is very clear in its tonal reference, e.g., Mozart or most of the songs in the charts; other music is at the edge of tonality, e.g. Gesualdo, some Wagner, Debussy, Hindemith, Berg, and Modern Jazz. Other music has tonal centers but no major/minor tonality, e.g. Raga or Gamelan. So it could be useful to specify the realm of the challenge, the composers, epochs, or genres, e.g. from Telemann to Beethoven (or Brahms, or Mahler?), Top 40 Hits 1950-2005, and New Orleans to Bebop.
Regarding evaluation measures for audio data, it is said that "Keys will be considered as 'close' if they have one of the following relationships: distance of perfect fifth, relative major and minor, and parallel major and minor".
[Chinghua 02.10.05]: Keys in those relationships are close to the main key, but they are still not the main key. If the algorithm gives one of those answers, it should still earn some points. So I suggest giving multiple levels of scores to the different answers. For example, the main key gets the whole points (maybe 5), the perfect fifth gets 75% or 80% of the whole points (maybe 3), and so on.
What about tuning errors? In the case of audio, there are different tuning systems that can be used. The detection algorithm should be able to estimate where the key is "tuned" (A 440 or 442,...). Keys should be also considered as 'close' if they have a relationship of "1 semitone", to consider this difference between real key (according to its tuning) & labelled key (A major). In the case of MIDI, this problem does not appear.
[Chinghua 02.10.05]: Since we will use a MIDI synthesizer to generate the audio, tuning won't be a serious problem. The detection algorithm should be able to regard both 440 and 442 Hz as pitch A. If the original piece is written in A major but the MIDI arrangement is shifted a half step down to Ab major, then the algorithm (both the MIDI and the audio part) should detect it as Ab major instead of A major.
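To illustrate why a reference of 440 or 442 Hz makes no difference at the pitch-class level, here is a small sketch that rounds a frequency to the nearest equal-tempered semitone. It assumes twelve-tone equal temperament and the usual convention that A4 is MIDI note 69; the function name is hypothetical and this is not part of any submitted algorithm.

```python
import math

PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def nearest_pitch_name(frequency_hz: float, reference_a4_hz: float = 440.0) -> str:
    """Round a frequency to the nearest equal-tempered pitch class name."""
    if frequency_hz <= 0 or reference_a4_hz <= 0:
        raise ValueError("frequencies must be positive")
    # Signed distance from A4 in semitones (12 semitones per octave).
    semitones_from_a4 = 12.0 * math.log2(frequency_hz / reference_a4_hz)
    midi_note = 69 + round(semitones_from_a4)  # A4 is MIDI note 69
    return PITCH_NAMES[midi_note % 12]
```

A tone at 442 Hz is less than a tenth of a semitone above 440 Hz, so both round to A, whereas a tone near 415.3 Hz is a full semitone lower and is reported as G# (the same pitch class as Ab).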
Will there be some training data, so that participants can try their algorithms?
[Arpi 02.08.05]: Great idea!
[Chinghua 02.10.05]: Some data will be provided for participants to verify their algorithms, but it may be just a few pieces. Since different systems may need different amounts of data for training, the participants need to find a good training data set for their own systems. Participants can use the provided data to train their systems, but the quantity and quality of the data are not guaranteed to be adequate for training purposes.
[ Perfe 02/24/05: I think that training data are a must. Training data should be a subset of the whole test set originally gathered. If train and test come from different populations then the estimations that we may get with the test will not be reliable; the goal of the train set is that of providing a reliable estimation of the expected performance with the test data].
[Hendrik 02.26.05]: Assuming the data would be partitioned into training, (validation ?), and test set, how could a true test set be provided that consists of valid representatives of the same population as the training set but is not known to the participants, that is, e.g., an 'unknown' Bach piece is to be found that is generally accepted to be Bach's...
I cannot tell whether the suggested participants are willing to participate. Another potential candidate could be Hendrik Purwins.
[Arpi 02.08.05]: Good addition. We have added him to the list of possible participants.
General comments:
Title: "Evaluation of Key Finding Algorithms Using Audio Data" or "Evaluation of Key Finding Algorithms Part 1"
Description paragraph: Par. 2, line 2: sentence requires correction.
[Arpi 02.08.05]: Thank you. This has been corrected.
The problem is well defined and the mentioned possible participants seem likely to participate.
Regarding the evaluation procedures, length of input excerpt would have to be determined (15 to 30 seconds - any studies on the ideal length?)
[Arpi 02.08.05]: We would like to receive further input on this. We are open to using the entire piece or an excerpt (e.g. 15 or 30 seconds).
Assumption of closeness:
- Perfect 5th: Is this generally accepted as an almost similar key?
[Arpi 02.08.05]: Yes it is. Please refer to http://www-rcf.usc.edu/~echew/papers/CiM2003 for further details.
[EC 02.08.05]: Keys a perfect fifth apart share all but one pitch (with the differing pitches being only one half step apart). The above paper describes three models for tonality (by Krumhansl, Lerdahl and Chew) with similar relative distances between keys which are consistent with that mentioned in our proposal.
- Parallel major or minor: Not too certain if this needs to be clarified (Ignore this comment if this is generally understood by the majority working in this field)
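The remark above that keys a perfect fifth apart share all but one pitch can be checked directly on pitch-class sets. The sketch below does this for two major keys only, C major and G major, with C = 0; the helper names are hypothetical.

```python
from typing import List, Tuple

# Semitone offsets of the major scale degrees above the tonic.
MAJOR_STEPS = (0, 2, 4, 5, 7, 9, 11)


def major_scale(tonic_pc: int) -> frozenset:
    """Return the set of pitch classes in the major scale on the given tonic."""
    return frozenset((tonic_pc + step) % 12 for step in MAJOR_STEPS)


def compare_scales(tonic_a: int, tonic_b: int) -> Tuple[List[int], List[int], List[int]]:
    """Return (shared pitches, pitches only in A, pitches only in B)."""
    a, b = major_scale(tonic_a), major_scale(tonic_b)
    return sorted(a & b), sorted(a - b), sorted(b - a)


# C major (tonic 0) against G major (tonic 7, a perfect fifth above).
shared, only_in_c, only_in_g = compare_scales(0, 7)
```

Here six of the seven pitch classes are shared, and the two that differ are F (5) in C major and F# (6) in G major, which lie one half step apart.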
Based on the error analysis approach outlined, would the algorithm that performs best with the new parameter settings be considered superior?
[Arpi 02.08.05]: Key finding and its evaluation is a complex matter. This is a good question to which there is no straightforward answer. We would like to explore the definition of algorithm superiority further. Input from participants would be valuable.
The test data are relevant. Are there any alternative data sets if the Naxos collection does not become available?
[Arpi 02.08.05]: The Naxos collection only contains audio data. We propose using MIDI data and audio synthesized from MIDI. Please refer to comments made in Review 1.
1. Am intrigued and heartened by the fact that both an audio and a symbolic version of the task have been proposed.
2. The modality question does arise and like Review #2, I would like to understand better the gradations of "failure" (i.e., the Perfect 5th issue), etc.
3. I would very much like to see a direct tie in with symbolic and audio data (i.e., a one-to-one match of score with audio), if possible.
4. Wonder if we could frame this for evaluation purposes as a more traditional IR task? For example, Find all pieces in Key X...find all pieces in a minor mode.....and the kicker...find all pieces transposed from their original keys!
[Arpi 02.08.05]: This is a great idea. This approach will certainly give us new metrics. We can further explore this if time permits.
I was the one to decide that the original proposal on key finding should be split into two proposals on audio key finding and symbolic key finding. Indeed the audio and symbolic parts involve completely separate data and separate participants. From the committee point of view, this needs as much annotation and testing work as two independent proposals. I did not ask the authors about it, so it's not their fault.
I am strongly in favor of merging the two proposals into a single one again. But then the symbolic and audio data need to correspond to the same titles as much as possible, so that the performances can be compared. Can the RWC database or another database be used for this? Also, the participants need to submit algorithms for both tasks if possible. I suppose it won't be too hard for audio key finding algorithms to work on symbolic data as well, since audio data may be easily synthesized from symbolic data using a conventional MIDI synthesizer.
[ Perfe 02/24/05: See my comment above. Rendering midi into audio will create files that have less "acoustic complexity" than truly recorded music; results on them will not be totally extrapolable to audio-based music]
As Emmanuel stated, we submitted a single proposal for audio and symbolic key-finding. We have now re-combined the two proposals. Please refer to Emmanuel's comments for further details.
Hello, my name is Emilia Gómez, from Universitat Pompeu Fabra, Barcelona. First of all, thank you for organizing this evaluation! I was involved in the organization of last year's contests and I know it is a lot of work. I will try to participate in the evaluation of key estimation from audio recordings. I agree with some of the reviewers on several issues, which I would like to comment on:
1.- I think it is important to provide some training data so that participants can evaluate their algorithms against the evaluation material: genres, audio format, etc. I think this can also be useful to test that the algorithm is working within the evaluation environment. If participants provide the output of their algorithm on this training data, it can serve as a way to check that the algorithm performs correctly on the evaluation platform, giving the same results. This was one of the problems we found last year. It avoids some problems when running algorithms on different systems, platforms, languages, and so on.
2.- It is important to establish some kind of rules for submission: binaries, MATLAB code, Java? Is it possible to submit different versions of the algorithm for the same participant?
[Hendrik 02.26.05]: MATLAB would be very convenient.
3.- I think that the use of audio synthesized from MIDI would be a simplistic solution, not representative of the complexity of the problem. Maybe we could try to find MIDI plus real performances, or synthesize only part of the evaluation material from MIDI. I also agree with reviewer 2 that keys differing only by a tuning error should be considered close tonalities.
4.- I also think it is important to use a representation of different musical genres. I think you can find some annotated material from known artists (for instance, from The Beatles). Then, I refer again to the need of having some training data.
5.- I would propose contacting Marc Leman and his group. They have done a lot of work on perception-based music analysis and may be interested in participating: Marc.Leman@UGent.be. They also have a lot of experience in manual annotation.
Best regards and thanks,